Entity Matching with Similarity Maps and Deep Learning¶
The example concisely introduces the basic concepts and functionality of the package. Its goals are (i) to bring the reader up to speed with how to use the package to fit entity-matching models and (ii) to familiarize the reader with the basic concepts that are used in the documentation and naming conventions of the package. For simplicity, we use a deep learning matching model for this example. Examples of neural-symbolic matching models are given in the Neural-symbolic Entity Matching and Reasoning vignettes.
Prerequisites¶
Load the libraries we will use and set the seed for reproducibility.
from neer_match.matching_model import DLMatchingModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
Preprocessing¶
The preprocessing stage aims to bring the data in a form compatible with
the neermatch
package’s calling conventions. The package expects the
data to be in the form of two data frames, left
and right
, that
contain the entities to be matched. The package also expects a third
data frame, matches
, that contains the indices of the matching
entities in the left
and right
data frames. We illustrate the
calling convention by constructing a toy example.
Left Data Set¶
The data used in this example is constructed from a subset of the
game_reviews
dataset shipped with the R
version of the package (see
the game_review
documentation).
The subset is created by selecting all the records in game_reviews
having titles starting with either "Metal Slug"
or "Metal Gear"
. The
selection results are hard-coded here to keep the example code
self-contained. Following the package’s naming convention, we name the
resulting data frame left.
left = pd.DataFrame(
{
"title": [
# fmt: off
"Metal Gear Solid 4: Guns of the Patriots", "Metal Gear Solid",
"Metal Gear Solid 3: Snake Eater", "Metal Gear Solid V: The Phantom Pain",
"Metal Gear Solid Mobile", "Metal Gear Solid: Portable Ops",
"Metal Gear Solid 2: Substance", "Metal Gear Solid: The Twin Snakes",
"Metal Gear Solid", "Metal Gear Rising: Revengeance",
"Metal Gear Solid Integral", "Metal Gear Solid: Peace Walker HD Edition",
"Metal Slug 4 Mobile", "Metal Gear Rising: Revengeance",
"Metal Gear Online", "Metal Gear Acid",
"Metal Slug Anthology", "Metal Gear Solid V: Ground Zeroes",
"Metal Slug Mobile", "Metal Gear Rising: Revengeance - Jetstream",
"Metal Slug 2", "Metal Gear Solid V: Ground Zeroes",
"Metal Slug 7", "Metal Gear Solid: VR Missions",
"Metal Slug 4 & 5", "Metal Gear Rising: Revengeance - Blade Wolf",
"Metal Gear Acid 2", "Metal Slug XX",
"Metal Slug Defense", "Metal Slug 1",
"Metal Gear Solid V: Metal Gear Online", "Metal Slug Touch",
"Metal Slug X", "Metal Gear Rising: Revengeance - Jetstream",
"Metal Gear Survive", "Metal Gear",
# fmt: on
],
"platform": [
# fmt: off
"PS3", "PS", "PS2", "XONE", "MOBI", "PSP", "XBOX", "GC", "PC", "PC", "PS", "PS3",
"MOBI", "X360", "PS3", "PSP", "PS4", "PS4", "MOBI", "PS3", "IOS", "XONE", "DS", "PS",
"XBOX", "X360", "MOBI", "PSP", "IOS", "IOS", "PS4", "IOS", "IOS", "X360", "PS4", "MOBI",
# fmt: on
],
"year": [
# fmt: off
2008, 1998, 2004, 2015, 2008, 2006, 2002, 2004, 2000, 2014, 1999, 2012, 2008, 2013,
2008, 2005, 2016, 2014, 2004, 2013, 2013, 2014, 2008, 1999, 2005, 2013, 2009, 2010,
2014, 2012, 2015, 2009, 2013, 2013, 2018, 2004
# fmt: on
],
"scores": [
# fmt: off
93.53, 93.24, 91.77, 90.38, 92.50, 86.95, 86.66, 85.58, 84.22, 83.55, 90.00, 90.00,
90.00, 82.56, 80.37, 76.70, 80.00, 75.23, 75.00, 75.00, 77.00, 72.50, 72.11, 70.64,
70.47, 70.00, 70.00, 68.97, 70.00, 68.33, 67.50, 60.00, 64.00, 60.00, 61.09, 50.00
# fmt: on
],
"reviews": [
# fmt: off
85, 29, 86, 12, 2, 60, 56, 69, 18, 11, 1, 1, 1, 31, 8, 52, 1, 48, 4, 2, 5, 6, 42, 18,
33, 2, 1, 18, 6, 3, 2, 1, 5, 1, 33, 1
# fmt: on
],
"developer": [
# fmt: off
"Kojima Productions/Konami", "KCEJ/Konami", "KCEJ/Konami",
"Kojima Productions/Konami", "Ideaworks3D/Konami", "Kojima Productions/Konami",
"KCEJ/Konami", "Silicon Knights/Konami", "Digital Dialect/Konami",
"PlatinumGames/Konami", "Konami", "Kojima Productions/Konami", "SNK Playmore/I-Play",
"PlatinumGames/Konami", "Kojima Productions/Konami", "Konami",
"Terminal Reality/SNK Playmore", "Kojima Productions/Konami", "SNK Playmore/I-Play",
"PlatinumGames/Konami", "SNK Playmore", "Kojima Productions/Konami",
"SNK Playmore/Ignition Entertainment", "KCEJ/Konami", "BrezzaSoft/SNK Playmore",
"PlatinumGames/Konami", "Konami/Glu Mobile", "SNK Playmore", "SNK Playmore",
"SNK Playmore", "Kojima Productions/Konami", "SNK Playmore", "SNK Playmore",
"PlatinumGames/Konami", "Konami", "Konami Mobile & Online, Inc./Konami"
# fmt: on
],
}
)
Right Data Set¶
What is the right
data frame in this example? We construct the right
data frame by copying the left
data frame and introducing noise in the
title
and developer
columns. Up to three characters are randomly
removed from the title
column, and up to two characters are randomly
removed from the developer
column.
In addition, we create three duplicate matching records on the right
data frame to illustrate that the matching models of the package can be
used to link datasets with records/entities having either one-to-many or
many-to-many relations.
By construction, the left
and right
data frames in our examples have
the same columns. In practice, this is not always the case. For such
cases, the neermatch
package supports specifying different columns of
the left
and right
data frames. To illustrate how the package can be
used in such cases, we rename the developer
column in the right
data
frame to dev
.
right = left.copy()
no_duplicates = 3
right = pd.concat([right, right.sample(no_duplicates)])
tpos = right.columns.get_loc("title")
dpos = right.columns.get_loc("developer")
for r in range(right.shape[0]):
no_chars = random.randint(1, 3)
title = right.iloc[r, tpos]
right.iloc[r, tpos] = "".join(
[
l
for i, l in enumerate(title)
if i not in random.sample(range(len(title)), no_chars)
]
)
no_chars = random.randint(1, 2)
developer = right.iloc[r, dpos]
right.iloc[r, dpos] = "".join(
[
l
for i, l in enumerate(developer)
if i not in random.sample(range(len(developer)), no_chars)
]
)
right.rename(columns={"developer": "dev"}, inplace=True)
Matches Data Set¶
The matching examples are passed to the matching models as pairs of
indices. We do not need to provide non-matching examples. The models
automatically pick non-matching examples from the Cartesian product of
the left
and right
data frames as long as their indices are not in
the matches
data frame. For this application, the matches are
constructed by the rows of the left
and right
data frames with the
same index and the 3 duplicate matches we created.
matches = pd.DataFrame(
{
"left": left.index.append(right.iloc[-no_duplicates:,].index),
"right": range(len(right)),
}
)
Matching Model Setup¶
A matching model initialization requires instructions on constructing
the similarity map between the left
and right
datasets. The
instructions are passed to the model as a named list that specifies
which fields are used from each dataset and how they are associated with each other, and
how to encode them, i.e., how to calculate the similarity values.
If the column names of the left
and right
are the same, it suffices
to supply the common name to the similarity map instruction list (see,
e.g., the title
, platform
, and year
items below). If two columns
have different names, the instruction should have the form
left_name~right_name
, as it is, for instance, the case for the
developing studio in this application (see item developer~dev
). Not
all columns of the datasets need to be used in the instructions. We do
not use the scores
and reviews
columns in this example.
The model can be instructed to calculate one or more similarities for
each column association. For instance, the instructions of this example
specify two similarity calculations for the platform
and year
associations and one for the title
and developer~dev
. The
neermatch
provides a set of predefined similarity functions that can
be used in similarity maps. The string similarities and ratios are
calculated using RapidFuzz.
The complete set of predefined functions can be retrieved by calling the
available_similarities()
function.
instructions = {
"title": ["jaro_winkler"],
"platform": ["levenshtein", "discrete"],
"year": ["euclidean", "discrete"],
"developer~dev": ["jaro"]
}
similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
('title', 'title', 'jaro_winkler')
('platform', 'platform', 'levenshtein')
('platform', 'platform', 'discrete')
('year', 'year', 'euclidean')
('year', 'year', 'discrete')
('developer', 'dev', 'jaro')]
A matching model object is constructed by passing the similarity map instructions. The model prepares encoding operations based on the passed instructions and uses them whenever the model is fitted or evaluated. We skip the construction details of the model in this example and refer the interested reader to the documentation entries of the DLMatchingModel, RecordPairNetwork, and FieldPairNetwork classes.
model = DLMatchingModel(similarity_map)
The model is compiled in the usual (keras) way. The compile function wraps the tensorflow.keras.Model.compile function, so all the functionality and options in the latter can be used here.
model.compile(
loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)
)
Matching Model Fit and Evaluation¶
The model is fitted using the fit
function (see the fit
documentation
for details). The fit
function extends the functionality of
tensorflow.keras.Model.fit
to accommodate the entity matching problem requirements. Firstly,
instead of passing features and label arguments, fit
expects the
left
, right
, and matches
data frames.
During training, counterexamples of matching records are automatically
selected from the cross-product of left and right data frames based on
the mismatch_share
parameter. Including all the non-matching examples
can lead to a highly unbalanced matching dataset. For each record in the
left
data set, appearing in k examples in the matches
data set,
there are up to n - k
counterexamples, where n
is the number of
records in the right
data set. The mismatch_share
parameter controls
the ratio of counterexamples to matches. For instance, if
mismatch_share = 0.5
, the encode
function selects 50% of the
possible counterexamples for each match. The counterexamples are
selected randomly from the right
dataset.
For instance, the left
and right
data sets in this example have 36
and 39 records, respectively, and the matches
data set has 39 records.
This allows us to construct 1365 non-matching examples. We set
mismatch_share = 0.2
, which means that for each matching example
provided in matches
, we get 7 non-matching examples (the integer part,
i.e., the floor, of the number of counterexamples used).
The remaining arguments are similar to tensorflow.keras.Model.fit.
model.fit(
left,
right,
matches,
epochs=100,
batch_size=32,
verbose=0
)
The evaluate
function overloads the
tensorflow.keras.Model.evaluate
function (see also the evaluate
documentation).
Similar to fitting a matching model, the evaluate
call expects left
,
right
, and matches
data sets.
model.evaluate(left, right, matches, verbose = 0)
0.044671885669231415
Predictions and Suggestions¶
Matching predictions can be obtained in two ways from the fitted model.
Either by calling predict
or by calling suggest
. The predict
function returns a vector of prediction probabilities for each
combination of left
and right
records. The prediction probabilities
are stored in row-major order. First, the matching probabilities of the
first row of left
with all the rows of right
are given. Then, the
probabilities of the second row of left
with all the rows of right
are given, and so on. In total, the predict
function returns a vector
with rows equal to the product of the number of rows in the left
and
right
data sets.
predictions = model.predict(left, right, verbose = 0)
fig, ax = plt.subplots()
counts, bins= np.histogram(predictions, bins = 100)
cdf = np.cumsum(counts)/np.sum(counts)
ax.plot(bins[1:], cdf)
ax.set_xlabel("Matching Prediction")
ax.set_ylabel("Cumulative Density")
plt.show()
The suggest
function returns the best matching predictions of the
model for each row of the left
dataset. The prediction probabilities
of predict
are grouped by the indices of the left
dataset and sorted
in descending order. The caller can choose the number of returned
suggestions by setting the count
argument of suggest
.
suggestions = model.suggest(left, right, count = 3, verbose = 0)
suggestions["true_match"] = suggestions.loc[:, ["left", "right"]].apply(
lambda x: any((x.left == matches.left) & (x.right==matches.right)), axis=1
)
suggestions = suggestions.join(
matches.iloc[-no_duplicates:,:].assign(duplicate=True).set_index(['left', 'right']),
on = ["left", "right"],
how = "left"
).fillna(False)
suggestions
left | right | prediction | true_match | duplicate | |
---|---|---|---|---|---|
0 | 0 | 0 | 9.778342e-01 | True | False |
14 | 0 | 14 | 9.169472e-01 | False | False |
4 | 0 | 4 | 9.598946e-04 | False | False |
40 | 1 | 1 | 9.718049e-01 | True | False |
62 | 1 | 23 | 8.463337e-05 | False | False |
49 | 1 | 10 | 7.474281e-05 | False | False |
80 | 2 | 2 | 9.787740e-01 | True | False |
85 | 2 | 7 | 1.013026e-03 | False | False |
113 | 2 | 35 | 6.155510e-04 | False | False |
120 | 3 | 3 | 9.766330e-01 | True | False |
147 | 3 | 30 | 1.074756e-03 | False | False |
138 | 3 | 21 | 5.062067e-05 | False | False |
160 | 4 | 4 | 9.787740e-01 | True | False |
168 | 4 | 12 | 9.383137e-01 | False | False |
156 | 4 | 0 | 9.753937e-04 | False | False |
200 | 5 | 5 | 9.777580e-01 | True | False |
210 | 5 | 15 | 7.423223e-05 | False | False |
222 | 5 | 27 | 3.159808e-05 | False | False |
240 | 6 | 6 | 9.773185e-01 | True | False |
258 | 6 | 24 | 2.507875e-05 | False | False |
251 | 6 | 17 | 7.486212e-08 | False | False |
280 | 7 | 7 | 9.781616e-01 | True | False |
275 | 7 | 2 | 1.006788e-03 | False | False |
308 | 7 | 35 | 6.105449e-04 | False | False |
320 | 8 | 8 | 9.772826e-01 | True | False |
321 | 8 | 9 | 3.877738e-05 | False | False |
335 | 8 | 23 | 1.128767e-07 | False | False |
360 | 9 | 9 | 9.716737e-01 | True | False |
368 | 9 | 17 | 5.324303e-04 | False | False |
379 | 9 | 28 | 4.801422e-04 | False | False |
400 | 10 | 10 | 9.733325e-01 | True | False |
413 | 10 | 23 | 9.662850e-01 | False | False |
391 | 10 | 1 | 7.031505e-05 | False | False |
440 | 11 | 11 | 9.782392e-01 | True | False |
458 | 11 | 29 | 4.065759e-04 | False | False |
448 | 11 | 19 | 5.362284e-05 | False | False |
480 | 12 | 12 | 9.742600e-01 | True | False |
472 | 12 | 4 | 9.637859e-01 | False | False |
490 | 12 | 22 | 1.027538e-03 | False | False |
520 | 13 | 13 | 9.783336e-01 | True | False |
544 | 13 | 37 | 9.752246e-01 | True | True |
540 | 13 | 33 | 9.751084e-01 | False | False |
560 | 14 | 14 | 9.712535e-01 | True | False |
546 | 14 | 0 | 9.653772e-01 | False | False |
550 | 14 | 4 | 1.020161e-03 | False | False |
600 | 15 | 15 | 9.767061e-01 | True | False |
609 | 15 | 24 | 3.229330e-04 | False | False |
590 | 15 | 5 | 7.657281e-05 | False | False |
640 | 16 | 16 | 9.648299e-01 | True | False |
654 | 16 | 30 | 4.539440e-05 | False | False |
658 | 16 | 34 | 3.271291e-05 | False | False |
680 | 17 | 17 | 9.758328e-01 | True | False |
684 | 17 | 21 | 1.238351e-03 | False | False |
672 | 17 | 9 | 5.116409e-04 | False | False |
720 | 18 | 18 | 9.786937e-01 | True | False |
737 | 18 | 35 | 9.030289e-01 | False | False |
738 | 18 | 36 | 8.958049e-01 | False | False |
760 | 19 | 19 | 9.778247e-01 | True | False |
774 | 19 | 33 | 1.464835e-03 | False | False |
754 | 19 | 13 | 1.276108e-03 | False | False |
800 | 20 | 20 | 9.783897e-01 | True | False |
812 | 20 | 32 | 9.765700e-01 | False | False |
817 | 20 | 37 | 3.983492e-04 | False | False |
840 | 21 | 21 | 9.739489e-01 | True | False |
836 | 21 | 17 | 1.318118e-03 | False | False |
828 | 21 | 9 | 4.996893e-04 | False | False |
880 | 22 | 22 | 9.768478e-01 | True | False |
870 | 22 | 12 | 7.845204e-04 | False | False |
862 | 22 | 4 | 5.201011e-04 | False | False |
920 | 23 | 23 | 9.778675e-01 | True | False |
907 | 23 | 10 | 9.233682e-01 | False | False |
898 | 23 | 1 | 7.060236e-05 | False | False |
960 | 24 | 24 | 9.670053e-01 | True | False |
951 | 24 | 15 | 3.808707e-04 | False | False |
942 | 24 | 6 | 3.039734e-05 | False | False |
1000 | 25 | 25 | 9.785444e-01 | True | False |
988 | 25 | 13 | 9.746334e-01 | False | False |
1008 | 25 | 33 | 9.737168e-01 | False | False |
1040 | 26 | 26 | 9.766273e-01 | True | False |
1052 | 26 | 38 | 9.702485e-01 | True | True |
1045 | 26 | 31 | 4.269315e-04 | False | False |
1080 | 27 | 27 | 9.779201e-01 | True | False |
1068 | 27 | 15 | 3.216346e-05 | False | False |
1058 | 27 | 5 | 3.040044e-05 | False | False |
1120 | 28 | 28 | 9.773872e-01 | True | False |
1109 | 28 | 17 | 5.012505e-04 | False | False |
1113 | 28 | 21 | 5.000369e-04 | False | False |
1160 | 29 | 29 | 9.787740e-01 | True | False |
1142 | 29 | 11 | 4.129025e-04 | False | False |
1163 | 29 | 32 | 8.954816e-05 | False | False |
1200 | 30 | 30 | 9.761566e-01 | True | False |
1173 | 30 | 3 | 1.078105e-03 | False | False |
1187 | 30 | 17 | 7.555061e-05 | False | False |
1240 | 31 | 31 | 9.777137e-01 | True | False |
1235 | 31 | 26 | 5.508639e-04 | False | False |
1247 | 31 | 38 | 3.956477e-04 | False | False |
1280 | 32 | 32 | 9.785187e-01 | True | False |
1268 | 32 | 20 | 9.764295e-01 | False | False |
1285 | 32 | 37 | 3.983492e-04 | False | False |
1320 | 33 | 33 | 9.786212e-01 | True | False |
1300 | 33 | 13 | 9.748630e-01 | False | False |
1312 | 33 | 25 | 9.736227e-01 | False | False |
1360 | 34 | 34 | 9.765700e-01 | True | False |
1356 | 34 | 30 | 4.525233e-05 | False | False |
1343 | 34 | 17 | 4.510958e-05 | False | False |
1401 | 35 | 36 | 9.575554e-01 | True | True |
1400 | 35 | 35 | 9.503515e-01 | True | False |
1383 | 35 | 18 | 9.235189e-01 | False | False |