Neural-symbolic Entity Matching

The example introduces the package’s entity matching functionality using neural-symbolic learning. The goals of the example are (i) to introduce the available options, (ii) to discuss some aspects of the underlying model logic, and (iii) to contrast the neural-symbolic functionality with that of purely deep learning entity matching. For an introduction to the basic concepts and conventions of the package, one may consult the Entity Matching with Similarity Maps and Deep Learning example (henceforth Example 1). For the reasoning capabilities of the package, see the Reasoning example.

Prerequisites

Load the libraries we will use and set the seed for reproducibility.

from neer_match.matching_model import NSMatchingModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

Preprocessing

We use the gaming data of Example 1. Similar to the deep learning case, the neural-symbolic functionality leverages similarity maps and expects the same structure of input datasets. The concepts and requirements are detailed in Example 1; we abstain from repeating them here for brevity. In summary, our preprocessing stage constructs three datasets, left, right, and matches, where left has 36 records, right has 39 records, and matches holds the indices of the matching records in left and right.

# prepare_data encapsulates the preprocessing steps of Example 1.
left, right, matches = prepare_data()

Matching Model Setup

For simplicity, we employ the similarity map we used in Example 1.

instructions = {
    "title": ["jaro_winkler"],
    "platform": ["levenshtein", "discrete"],
    "year": ["euclidean", "discrete"],
    "developer~dev": ["jaro"]
}

similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
  ('title', 'title', 'jaro_winkler')
  ('platform', 'platform', 'levenshtein')
  ('platform', 'platform', 'discrete')
  ('year', 'year', 'euclidean')
  ('year', 'year', 'discrete')
  ('developer', 'dev', 'jaro')]

Matching models using neural-symbolic learning are constructed using the NSMatchingModel class. The constructor expects a similarity map object, exactly as in the pure deep learning case.

model = NSMatchingModel(similarity_map)

Although the exposed interface is similar to the deep learning model, the underlying class designs are different. The DLMatchingModel class inherits from the tensorflow.keras.Model class, while the NSMatchingModel uses a custom training loop. Nonetheless, the NSMatchingModel class exposes compile, fit, evaluate, predict, and suggest functions to ensure that the calling conventions in the user space are as consistent as possible between the two classes.

For instance, the compile method of the NSMatchingModel can be used to set the optimizer used during training.

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))

Model Training

The model is fitted using the fit function (see the fit documentation for details). The underlying training loop of the neural-symbolic model is custom. Training treats the field pair and record pair models as fuzzy logic predicates and adjusts their parameters to satisfy the following logic. First, for every matching example, all the field predicates and the record predicate should be (fuzzily) true. This rule is motivated by the observation that true record matches should in principle constitute matches in all subsets of their associated fields. Second, for every non-matching example, at least one of the field predicates and the record predicate should be (fuzzily) false. This rule is motivated by the observation that non-matches should be distinct in at least one of their associated fields; if they are not, then, in principle, they constitute a match.
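
To fix ideas, the following sketch expresses the two rules with the product t-norm as the fuzzy conjunction. It is only an illustration of the logic described above, not the package’s internal implementation, and the predicate values stand in for the outputs of the field pair and record pair models.

import numpy as np

def matching_truth(field_preds, record_pred):
    # Rule 1: for a matching pair, the conjunction of all field
    # predicates and the record predicate should be (fuzzily) true.
    return np.prod(field_preds) * record_pred

def non_matching_truth(field_preds, record_pred):
    # Rule 2: for a non-matching pair, at least one predicate should be
    # (fuzzily) false, i.e., the negation of the conjunction should hold.
    return 1.0 - np.prod(field_preds) * record_pred

matching_truth([0.98, 0.95, 0.99], 0.97)      # high truth value
non_matching_truth([0.98, 0.10, 0.99], 0.12)  # high truth value

Training searches for parameters that make both truth values as large as possible on the corresponding examples.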

The fit function expects the left, right, and matches datasets. As in the deep learning case, the mismatch_share parameter controls the ratio of counterexamples to matches. The satisfiability_weight parameter is a float value between 0 and 1 that controls the weight of the satisfiability loss in the total loss function. The default value is 1.0, i.e., the training process considers only the fuzzy logic axioms. A value of 0.0 means that only the binary cross-entropy loss is considered, which essentially reduces the model to a deep learning model. Any value strictly between 0 and 1 trains a hybrid model that balances the two losses.
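
Schematically, and under the assumption that the two losses enter linearly, the combined objective looks as follows. The snippet is a hedged sketch with placeholder values; none of these quantities are computed by the package in this form.

# Placeholder loss values used only to illustrate the weighting scheme.
satisfiability_weight = 0.5
satisfiability_loss, bce_loss = 0.12, 0.34
total_loss = (satisfiability_weight * satisfiability_loss
              + (1.0 - satisfiability_weight) * bce_loss)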

Finally, logging can be customized by setting the parameters verbose and log_mod_n. The verbose parameter controls whether the training process prints the loss values at each epoch. The log_mod_n parameter controls the frequency of the logging. For instance, if log_mod_n = 5, the training process logs the loss values every 5 epochs. The default value is 1.

model.fit(
    left,
    right,
    matches,
    epochs=100,
    batch_size=16,
    verbose=1,
    log_mod_n=10
)
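
Although this example does not use it further, the evaluation entry point mentioned above can be called once the model is fitted. The call below is a minimal sketch that assumes evaluate follows the same calling convention as fit.

# Hedged sketch: assumes evaluate accepts the same datasets as fit and
# returns matching-quality metrics for the trained model.
metrics = model.evaluate(left, right, matches)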

Predictions and Suggestions

Matching predictions and suggestions follow the same calling conventions as the corresponding deep learning member functions. Predictions are obtained by calling predict and passing the left and right datasets. The returned prediction probabilities are stored in row-major order: first the matching probabilities of the first row of left with all the rows of right, then the probabilities of the second row of left with all the rows of right, and so on. In total, the predict function returns a vector with length equal to the product of the numbers of rows in the left and right datasets.

predictions = model.predict(left, right)
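
Under this layout, the probability that record i of left matches record j of right is stored at flat position i * len(right) + j. For instance, the following snippet (an illustration of the ordering, not additional package functionality) retrieves the matching probability of left record 4 with right record 12.

# Matching probability of left record 4 with right record 12, recovered
# from the row-major prediction vector (position 4 * 39 + 12 = 168).
i, j = 4, 12
print(predictions[i * len(right) + j])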

fig, ax = plt.subplots()
# Plot the empirical cumulative distribution of the matching predictions.
counts, bins = np.histogram(predictions, bins=100)
cdf = np.cumsum(counts) / np.sum(counts)
ax.plot(bins[1:], cdf)
ax.set_xlabel("Matching Prediction")
ax.set_ylabel("Cumulative Density")
plt.show()

The suggest function also expects the left and right datasets as input. The function returns the top count matching predictions of the model for each row of the left dataset. The prediction probabilities of predict are grouped by the indices of the left dataset and sorted in descending order.

suggestions = model.suggest(left, right, count=3)
suggestions["true_match"] = suggestions.loc[:, ["left", "right"]].apply(
    lambda x: any((x.left == matches.left) & (x.right == matches.right)), axis=1
)
# no_duplicates is set in the preprocessing stage of Example 1; the last
# no_duplicates rows of matches correspond to the appended duplicate records.
suggestions = suggestions.join(
    matches.iloc[-no_duplicates:, :].assign(duplicate=True).set_index(["left", "right"]),
    on=["left", "right"],
    how="left",
).fillna(False)

suggestions
left right prediction true_match duplicate
0 0 0 0.993542 True False
14 0 14 0.993527 False False
22 0 22 0.033070 False False
40 1 1 0.993530 True False
62 1 23 0.011010 False False
49 1 10 0.010907 False False
80 2 2 0.993543 True False
85 2 7 0.025760 False False
113 2 35 0.025760 False False
120 3 3 0.993539 True False
147 3 30 0.025760 False False
138 3 21 0.011008 False False
160 4 4 0.993543 True False
168 4 12 0.993500 False False
156 4 0 0.025760 False False
200 5 5 0.993542 True False
222 5 27 0.011720 False False
210 5 15 0.011008 False False
240 6 6 0.993541 True False
258 6 24 0.011438 False False
265 6 31 0.000853 False False
280 7 7 0.993542 True False
291 7 18 0.025760 False False
308 7 35 0.025760 False False
320 8 8 0.993540 True False
321 8 9 0.013035 False False
322 8 10 0.000875 False False
360 9 9 0.993531 True False
368 9 17 0.033070 False False
372 9 21 0.025760 False False
400 10 10 0.993535 True False
413 10 23 0.993527 False False
391 10 1 0.010875 False False
440 11 11 0.993542 True False
458 11 29 0.025760 False False
429 11 0 0.011720 False False
480 12 12 0.993536 True False
472 12 4 0.993493 False False
468 12 0 0.025760 False False
520 13 13 0.993542 True False
544 13 37 0.993538 True True
540 13 33 0.993537 False False
560 14 14 0.993529 True False
546 14 0 0.993527 False False
568 14 22 0.033070 False False
600 15 15 0.993540 True False
609 15 24 0.024652 False False
612 15 27 0.011961 False False
640 16 16 0.993527 True False
641 16 17 0.011301 False False
658 16 34 0.011269 False False
680 17 17 0.993538 True False
672 17 9 0.033070 False False
684 17 21 0.025769 False False
720 18 18 0.993543 True False
737 18 35 0.993527 False False
738 18 36 0.993527 False False
760 19 19 0.993541 True False
774 19 33 0.025787 False False
754 19 13 0.025772 False False
800 20 20 0.993543 True False
812 20 32 0.993540 False False
793 20 13 0.025760 False False
840 21 21 0.993536 True False
847 21 28 0.032593 False False
836 21 17 0.025776 False False
880 22 22 0.993540 True False
858 22 0 0.033070 False False
872 22 14 0.033070 False False
920 23 23 0.993541 True False
907 23 10 0.993478 False False
898 23 1 0.011008 False False
960 24 24 0.993527 True False
951 24 15 0.025006 False False
942 24 6 0.011234 False False
1000 25 25 0.993543 True False
988 25 13 0.993537 False False
1008 25 33 0.993536 False False
1040 26 26 0.993539 True False
1052 26 38 0.993527 True True
1045 26 31 0.032593 False False
1080 27 27 0.993541 True False
1068 27 15 0.011961 False False
1058 27 5 0.011720 False False
1120 28 28 0.993541 True False
1113 28 21 0.032593 False False
1101 28 9 0.025760 False False
1160 29 29 0.993543 True False
1142 29 11 0.025760 False False
1162 29 31 0.011512 False False
1200 30 30 0.993539 True False
1173 30 3 0.025760 False False
1204 30 34 0.011512 False False
1240 31 31 0.993541 True False
1235 31 26 0.032593 False False
1247 31 38 0.032593 False False
1280 32 32 0.993543 True False
1268 32 20 0.993540 False False
1261 32 13 0.025760 False False
1320 33 33 0.993543 True False
1300 33 13 0.993537 False False
1312 33 25 0.993536 False False
1360 34 34 0.993540 True False
1343 34 17 0.011720 False False
1356 34 30 0.011512 False False
1383 35 18 0.993527 False False
1400 35 35 0.993527 True False
1401 35 36 0.993527 True True
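
One way to summarize the table is to compute the share of left records whose top suggestion is a true match. The snippet below is a sketch that relies on the descending sort of the suggestions within each left index; its output is omitted here.

# Keep the highest-probability suggestion per left record and compute
# the share of them that are true matches.
top1 = suggestions.groupby("left").head(1)
print(top1["true_match"].mean())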