Entity Matching with Similarity Maps and Deep Learning

The example concisely introduces the basic concepts and functionality of the package. Its goals are (i) to bring the reader up to speed with how to use the package to fit entity-matching models and (ii) to familiarize the reader with the basic concepts that are used in the documentation and naming conventions of the package. For simplicity, we use a deep learning matching model for this example. Examples of neural-symbolic matching models are given in the Neural-symbolic Entity Matching and Reasoning vignettes.

Prerequisites

Load the libraries we will use and set the seed for reproducibility.

from neer_match.matching_model import DLMatchingModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

Preprocessing

The preprocessing stage aims to bring the data in a form compatible with the neermatch package’s calling conventions. The package expects the data to be in the form of two data frames, left and right, that contain the entities to be matched. The package also expects a third data frame, matches, that contains the indices of the matching entities in the left and right data frames. We illustrate the calling convention by constructing a toy example.
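Before constructing the full example below, the expected three-frame layout can be sketched with a minimal toy input. The frames and column names here are made up for illustration; only the left/right/matches structure is what the package expects.

```python
import pandas as pd

# Purely illustrative toy frames; the column names are made up.
toy_left = pd.DataFrame({"name": ["acme corp", "globex"]})
toy_right = pd.DataFrame({"name": ["Acme Corporation", "Globex Inc", "Initech"]})

# matches pairs row indices: left row 0 <-> right row 0, left row 1 <-> right row 1.
toy_matches = pd.DataFrame({"left": [0, 1], "right": [0, 1]})

# Every index in matches must point at an existing row on each side.
assert toy_matches["left"].isin(toy_left.index).all()
assert toy_matches["right"].isin(toy_right.index).all()

# All remaining index pairs are implicitly non-matching.
print(len(toy_left) * len(toy_right) - len(toy_matches))  # → 4
```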

Left Data Set

The data used in this example is constructed from a subset of the game_reviews dataset shipped with the R version of the package (see the game_reviews documentation). The subset is created by selecting all the records in game_reviews having titles starting with either "Metal Slug" or "Metal Gear". The selection results are hard-coded here to keep the example code self-contained. Following the package’s naming convention, we name the resulting data frame left.

left = pd.DataFrame(
    {
        "title": [
            # fmt: off
            "Metal Gear Solid 4: Guns of the Patriots", "Metal Gear Solid",
             "Metal Gear Solid 3: Snake Eater", "Metal Gear Solid V: The Phantom Pain",
             "Metal Gear Solid Mobile", "Metal Gear Solid: Portable Ops",
             "Metal Gear Solid 2: Substance", "Metal Gear Solid: The Twin Snakes",
             "Metal Gear Solid", "Metal Gear Rising: Revengeance",
             "Metal Gear Solid Integral", "Metal Gear Solid: Peace Walker HD Edition",
             "Metal Slug 4 Mobile", "Metal Gear Rising: Revengeance",
             "Metal Gear Online", "Metal Gear Acid",
             "Metal Slug Anthology", "Metal Gear Solid V: Ground Zeroes",
             "Metal Slug Mobile", "Metal Gear Rising: Revengeance - Jetstream",
             "Metal Slug 2", "Metal Gear Solid V: Ground Zeroes",
             "Metal Slug 7", "Metal Gear Solid: VR Missions",
             "Metal Slug 4 & 5", "Metal Gear Rising: Revengeance - Blade Wolf",
             "Metal Gear Acid 2", "Metal Slug XX",
             "Metal Slug Defense", "Metal Slug 1",
             "Metal Gear Solid V: Metal Gear Online", "Metal Slug Touch",
             "Metal Slug X", "Metal Gear Rising: Revengeance - Jetstream",
             "Metal Gear Survive", "Metal Gear",
             # fmt: on
        ],
        "platform": [
            # fmt: off
            "PS3", "PS", "PS2", "XONE", "MOBI", "PSP", "XBOX", "GC", "PC", "PC", "PS", "PS3",
             "MOBI", "X360", "PS3", "PSP", "PS4", "PS4", "MOBI", "PS3", "IOS", "XONE", "DS", "PS",
             "XBOX", "X360", "MOBI", "PSP", "IOS", "IOS", "PS4", "IOS", "IOS", "X360", "PS4", "MOBI",
             # fmt: on
        ],
        "year": [
            # fmt: off
            2008, 1998, 2004, 2015, 2008, 2006, 2002, 2004, 2000, 2014, 1999, 2012, 2008, 2013,
            2008, 2005, 2016, 2014, 2004, 2013, 2013, 2014, 2008, 1999, 2005, 2013, 2009, 2010,
            2014, 2012, 2015, 2009, 2013, 2013, 2018, 2004
            # fmt: on
        ],
        "scores": [
            # fmt: off
            93.53, 93.24, 91.77, 90.38, 92.50, 86.95, 86.66, 85.58, 84.22, 83.55, 90.00, 90.00,
            90.00, 82.56, 80.37, 76.70, 80.00, 75.23, 75.00, 75.00, 77.00, 72.50, 72.11, 70.64,
            70.47, 70.00, 70.00, 68.97, 70.00, 68.33, 67.50, 60.00, 64.00, 60.00, 61.09, 50.00
            # fmt: on
        ],
        "reviews": [
            # fmt: off
            85, 29, 86, 12, 2, 60, 56, 69, 18, 11, 1, 1, 1, 31, 8, 52, 1, 48, 4, 2, 5, 6, 42, 18,
            33, 2, 1, 18, 6, 3, 2, 1, 5, 1, 33, 1
            # fmt: on
        ],
        "developer": [
            # fmt: off
            "Kojima Productions/Konami", "KCEJ/Konami", "KCEJ/Konami", 
            "Kojima Productions/Konami", "Ideaworks3D/Konami", "Kojima Productions/Konami", 
            "KCEJ/Konami", "Silicon Knights/Konami", "Digital Dialect/Konami",
            "PlatinumGames/Konami", "Konami", "Kojima Productions/Konami", "SNK Playmore/I-Play",
            "PlatinumGames/Konami", "Kojima Productions/Konami", "Konami",
            "Terminal Reality/SNK Playmore", "Kojima Productions/Konami", "SNK Playmore/I-Play",
            "PlatinumGames/Konami", "SNK Playmore", "Kojima Productions/Konami",
            "SNK Playmore/Ignition Entertainment", "KCEJ/Konami", "BrezzaSoft/SNK Playmore",
            "PlatinumGames/Konami", "Konami/Glu Mobile", "SNK Playmore", "SNK Playmore",
            "SNK Playmore", "Kojima Productions/Konami", "SNK Playmore", "SNK Playmore",
            "PlatinumGames/Konami", "Konami", "Konami Mobile & Online, Inc./Konami"
            # fmt: on
        ],
    }
)

Right Data Set

The right data frame is constructed by copying the left data frame and introducing noise in the title and developer columns. Up to three characters are randomly removed from each title, and up to two characters are randomly removed from each developer entry.

In addition, we append three duplicate matching records to the right data frame to illustrate that the package’s matching models can be used to link datasets whose records/entities have one-to-many or many-to-many relations.

By construction, the left and right data frames in our examples have the same columns. In practice, this is not always the case. For such cases, the neermatch package supports specifying different columns of the left and right data frames. To illustrate how the package can be used in such cases, we rename the developer column in the right data frame to dev.

right = left.copy()
no_duplicates = 3
right = pd.concat([right, right.sample(no_duplicates)])

tpos = right.columns.get_loc("title")
dpos = right.columns.get_loc("developer")

for r in range(right.shape[0]):
    # Randomly drop between one and three distinct characters from the title.
    no_chars = random.randint(1, 3)
    title = right.iloc[r, tpos]
    drop = set(random.sample(range(len(title)), no_chars))
    right.iloc[r, tpos] = "".join(
        ch for i, ch in enumerate(title) if i not in drop
    )
    # Randomly drop one or two distinct characters from the developer.
    no_chars = random.randint(1, 2)
    developer = right.iloc[r, dpos]
    drop = set(random.sample(range(len(developer)), no_chars))
    right.iloc[r, dpos] = "".join(
        ch for i, ch in enumerate(developer) if i not in drop
    )

right.rename(columns={"developer": "dev"}, inplace=True)

Matches Data Set

The matching examples are passed to the matching models as pairs of indices. We do not need to provide non-matching examples; the models automatically pick them from the Cartesian product of the left and right data frames, as long as their index pairs are not in the matches data frame. For this application, the matches are constructed from the rows of the left and right data frames with the same index, plus the three duplicate matches we created.

matches = pd.DataFrame(
    {
        "left": left.index.append(right.iloc[-no_duplicates:,].index),
        "right": range(len(right)),
    }
)

Matching Model Setup

A matching model initialization requires instructions on constructing the similarity map between the left and right datasets. The instructions are passed to the model as a dictionary that specifies

  1. which fields are used from each dataset and how they are associated with each other, and

  2. how to encode them, i.e., how to calculate the similarity values.

If the column names of the left and right data frames are the same, it suffices to supply the common name to the similarity map instructions (see, e.g., the title, platform, and year items below). If two columns have different names, the instruction should have the form left_name~right_name, as is the case for the developing studio in this application (see the developer~dev item). Not all columns of the datasets need to be used in the instructions; we do not use the scores and reviews columns in this example.
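The left_name~right_name convention decomposes mechanically, which a few lines of plain Python can illustrate. This is only a sketch of how such keys split into column pairs, not the package's internal parsing.

```python
def split_association(key):
    """Split an instruction key into (left_column, right_column)."""
    lcol, sep, rcol = key.partition("~")
    # Without a "~", the same column name is used on both sides.
    return (lcol, rcol) if sep else (lcol, lcol)

print(split_association("title"))          # → ('title', 'title')
print(split_association("developer~dev"))  # → ('developer', 'dev')
```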

The model can be instructed to calculate one or more similarities for each column association. For instance, the instructions of this example specify two similarity calculations for the platform and year associations and one each for the title and developer~dev associations. The neermatch package provides a set of predefined similarity functions that can be used in similarity maps. The string similarities and ratios are calculated using RapidFuzz. The complete set of predefined functions can be retrieved by calling the available_similarities() function.

instructions = {
    "title": ["jaro_winkler"],
    "platform": ["levenshtein", "discrete"],
    "year": ["euclidean", "discrete"],
    "developer~dev": ["jaro"]
}

similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
  ('title', 'title', 'jaro_winkler')
  ('platform', 'platform', 'levenshtein')
  ('platform', 'platform', 'discrete')
  ('year', 'year', 'euclidean')
  ('year', 'year', 'discrete')
  ('developer', 'dev', 'jaro')]

A matching model object is constructed by passing the similarity map instructions. The model prepares encoding operations based on the passed instructions and uses them whenever the model is fitted or evaluated. We skip the construction details of the model in this example and refer the interested reader to the documentation entries of the DLMatchingModel, RecordPairNetwork, and FieldPairNetwork classes.

model = DLMatchingModel(similarity_map)

The model is compiled in the usual (keras) way. The compile function wraps the tensorflow.keras.Model.compile function, so all the functionality and options in the latter can be used here.

model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)
)

Matching Model Fit and Evaluation

The model is fitted using the fit function (see the fit documentation for details). The fit function extends the functionality of tensorflow.keras.Model.fit to accommodate the entity matching problem requirements. Firstly, instead of passing features and label arguments, fit expects the left, right, and matches data frames.

During training, counterexamples of matching records are automatically selected from the Cartesian product of the left and right data frames based on the mismatch_share parameter. Including all the non-matching examples can lead to a highly unbalanced matching dataset. For each record in the left data set that appears in k examples in the matches data set, there are up to n - k counterexamples, where n is the number of records in the right data set. The mismatch_share parameter controls the ratio of counterexamples to matches. For instance, if mismatch_share = 0.5, the fitting procedure selects 50% of the possible counterexamples for each match. The counterexamples are selected randomly from the right dataset.

For instance, the left and right data sets in this example have 36 and 39 records, respectively, and the matches data set has 39 records. This allows us to construct 36 × 39 − 39 = 1365 non-matching examples. We set mismatch_share = 0.2, which means that for each matching example provided in matches we get 7 non-matching examples (the floor of 0.2 times the number of available counterexamples).
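The counterexample arithmetic above is easy to verify by hand; the counts 36 and 39 and the share 0.2 are taken from this example.

```python
import math

n_left, n_right, n_matches = 36, 39, 39
mismatch_share = 0.2

# Total candidate pairs minus the known matches.
non_matches = n_left * n_right - n_matches
print(non_matches)  # → 1365

# A left record with a single match has n_right - 1 counterexamples;
# a 20% share of those, rounded down, gives 7 per match.
per_match = math.floor(mismatch_share * (n_right - 1))
print(per_match)  # → 7
```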

The remaining arguments are similar to tensorflow.keras.Model.fit.

model.fit(
    left,
    right,
    matches,
    epochs=100,
    batch_size=32,
    mismatch_share=0.2,
    verbose=0
)

The evaluate function overloads the tensorflow.keras.Model.evaluate function (see also the evaluate documentation). Similar to fitting a matching model, the evaluate call expects left, right, and matches data sets.

model.evaluate(left, right, matches, verbose = 0)
0.044671885669231415

Predictions and Suggestions

Matching predictions can be obtained from the fitted model in two ways: by calling predict or by calling suggest. The predict function returns a vector of prediction probabilities for each combination of left and right records. The prediction probabilities are stored in row-major order: first the matching probabilities of the first row of left with all the rows of right, then the probabilities of the second row of left with all the rows of right, and so on. In total, the predict function returns a vector whose length equals the product of the numbers of rows in the left and right data sets.
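With this row-major layout, a flat position in the predictions vector maps back to a (left, right) row pair via integer division. The sketch below is independent of the package; the counts 36 and 39 are those of this example.

```python
import numpy as np

n_left, n_right = 36, 39
flat = np.arange(n_left * n_right)  # stand-in for the predictions vector

# Position p corresponds to left row p // n_right and right row p % n_right.
left_row, right_row = divmod(1401, n_right)
print(left_row, right_row)  # → 35 36
```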

predictions = model.predict(left, right, verbose = 0)

fig, ax = plt.subplots()
counts, bins = np.histogram(predictions, bins = 100)
cdf = np.cumsum(counts) / np.sum(counts)
ax.plot(bins[1:], cdf)
ax.set_xlabel("Matching Prediction")
ax.set_ylabel("Cumulative Density")
plt.show()

The suggest function returns the best matching predictions of the model for each row of the left dataset. The prediction probabilities of predict are grouped by the indices of the left dataset and sorted in descending order. The caller can choose the number of returned suggestions by setting the count argument of suggest.
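The grouping and sorting that suggest performs can be emulated on a plain predictions vector with pandas. This is only a sketch under stand-in data; the real suggest returns the package's own data frame layout.

```python
import numpy as np
import pandas as pd

n_left, n_right, count = 3, 4, 2
rng = np.random.default_rng(0)
probs = rng.random(n_left * n_right)  # stand-in predictions, row-major

df = pd.DataFrame({
    "left": np.repeat(np.arange(n_left), n_right),
    "right": np.tile(np.arange(n_right), n_left),
    "prediction": probs,
})

# Keep the `count` highest predictions for each left record.
top = (
    df.sort_values(["left", "prediction"], ascending=[True, False])
    .groupby("left")
    .head(count)
)
print(top)
```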

suggestions = model.suggest(left, right, count = 3, verbose = 0)
suggestions["true_match"] = suggestions.loc[:, ["left", "right"]].apply(
    lambda x: any((x.left == matches.left) & (x.right == matches.right)), axis=1
)
suggestions = suggestions.join(
    matches.iloc[-no_duplicates:,:].assign(duplicate=True).set_index(['left', 'right']),
    on = ["left", "right"],
    how = "left"
).fillna(False)

suggestions
left right prediction true_match duplicate
0 0 0 9.778342e-01 True False
14 0 14 9.169472e-01 False False
4 0 4 9.598946e-04 False False
40 1 1 9.718049e-01 True False
62 1 23 8.463337e-05 False False
49 1 10 7.474281e-05 False False
80 2 2 9.787740e-01 True False
85 2 7 1.013026e-03 False False
113 2 35 6.155510e-04 False False
120 3 3 9.766330e-01 True False
147 3 30 1.074756e-03 False False
138 3 21 5.062067e-05 False False
160 4 4 9.787740e-01 True False
168 4 12 9.383137e-01 False False
156 4 0 9.753937e-04 False False
200 5 5 9.777580e-01 True False
210 5 15 7.423223e-05 False False
222 5 27 3.159808e-05 False False
240 6 6 9.773185e-01 True False
258 6 24 2.507875e-05 False False
251 6 17 7.486212e-08 False False
280 7 7 9.781616e-01 True False
275 7 2 1.006788e-03 False False
308 7 35 6.105449e-04 False False
320 8 8 9.772826e-01 True False
321 8 9 3.877738e-05 False False
335 8 23 1.128767e-07 False False
360 9 9 9.716737e-01 True False
368 9 17 5.324303e-04 False False
379 9 28 4.801422e-04 False False
400 10 10 9.733325e-01 True False
413 10 23 9.662850e-01 False False
391 10 1 7.031505e-05 False False
440 11 11 9.782392e-01 True False
458 11 29 4.065759e-04 False False
448 11 19 5.362284e-05 False False
480 12 12 9.742600e-01 True False
472 12 4 9.637859e-01 False False
490 12 22 1.027538e-03 False False
520 13 13 9.783336e-01 True False
544 13 37 9.752246e-01 True True
540 13 33 9.751084e-01 False False
560 14 14 9.712535e-01 True False
546 14 0 9.653772e-01 False False
550 14 4 1.020161e-03 False False
600 15 15 9.767061e-01 True False
609 15 24 3.229330e-04 False False
590 15 5 7.657281e-05 False False
640 16 16 9.648299e-01 True False
654 16 30 4.539440e-05 False False
658 16 34 3.271291e-05 False False
680 17 17 9.758328e-01 True False
684 17 21 1.238351e-03 False False
672 17 9 5.116409e-04 False False
720 18 18 9.786937e-01 True False
737 18 35 9.030289e-01 False False
738 18 36 8.958049e-01 False False
760 19 19 9.778247e-01 True False
774 19 33 1.464835e-03 False False
754 19 13 1.276108e-03 False False
800 20 20 9.783897e-01 True False
812 20 32 9.765700e-01 False False
817 20 37 3.983492e-04 False False
840 21 21 9.739489e-01 True False
836 21 17 1.318118e-03 False False
828 21 9 4.996893e-04 False False
880 22 22 9.768478e-01 True False
870 22 12 7.845204e-04 False False
862 22 4 5.201011e-04 False False
920 23 23 9.778675e-01 True False
907 23 10 9.233682e-01 False False
898 23 1 7.060236e-05 False False
960 24 24 9.670053e-01 True False
951 24 15 3.808707e-04 False False
942 24 6 3.039734e-05 False False
1000 25 25 9.785444e-01 True False
988 25 13 9.746334e-01 False False
1008 25 33 9.737168e-01 False False
1040 26 26 9.766273e-01 True False
1052 26 38 9.702485e-01 True True
1045 26 31 4.269315e-04 False False
1080 27 27 9.779201e-01 True False
1068 27 15 3.216346e-05 False False
1058 27 5 3.040044e-05 False False
1120 28 28 9.773872e-01 True False
1109 28 17 5.012505e-04 False False
1113 28 21 5.000369e-04 False False
1160 29 29 9.787740e-01 True False
1142 29 11 4.129025e-04 False False
1163 29 32 8.954816e-05 False False
1200 30 30 9.761566e-01 True False
1173 30 3 1.078105e-03 False False
1187 30 17 7.555061e-05 False False
1240 31 31 9.777137e-01 True False
1235 31 26 5.508639e-04 False False
1247 31 38 3.956477e-04 False False
1280 32 32 9.785187e-01 True False
1268 32 20 9.764295e-01 False False
1285 32 37 3.983492e-04 False False
1320 33 33 9.786212e-01 True False
1300 33 13 9.748630e-01 False False
1312 33 25 9.736227e-01 False False
1360 34 34 9.765700e-01 True False
1356 34 30 4.525233e-05 False False
1343 34 17 4.510958e-05 False False
1401 35 36 9.575554e-01 True True
1400 35 35 9.503515e-01 True False
1383 35 18 9.235189e-01 False False