Neural-symbolic Entity Matching¶
This example introduces the package's entity matching functionality using neural-symbolic learning. Its goals are (i) to introduce the available options, (ii) to discuss some aspects of the underlying model logic, and (iii) to contrast the neural-symbolic functionality with that of purely deep learning entity matching. For an introduction to the basic concepts and conventions of the package, consult the Entity Matching with Similarity Maps and Deep Learning example (henceforth Example 1). For the reasoning capabilities of the package, see the Reasoning example.
Prerequisites¶
Load the libraries we will use and set the seed for reproducibility.
from neer_match.matching_model import NSMatchingModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
Preprocessing¶
We use the gaming data of Example 1. As in the deep learning case, the neural-symbolic functionality leverages similarity maps and expects the same dataset structure as input. The concepts and requirements are detailed in Example 1; for brevity, we do not repeat them here. In summary, our preprocessing stage constructs three datasets, `left`, `right`, and `matches`, where `left` has 36 records, `right` has 39 records, and `matches` holds the indices of the matching records in `left` and `right`.
left, right, matches = prepare_data()
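The expected structure can be illustrated with a hypothetical toy stand-in. The names `toy_left`, `toy_right`, and `toy_matches`, and all of their records, are made up for illustration; the actual datasets come from the gaming data of Example 1, and the column names follow the similarity map used below.

```python
import pandas as pd

# Hypothetical toy records illustrating the expected dataset structure.
toy_left = pd.DataFrame({
    "title": ["Alpha Quest", "Beta Racer"],
    "platform": ["PC", "PS4"],
    "year": [2001, 2015],
    "developer": ["Acme Studios", "Bolt Games"],
})
toy_right = pd.DataFrame({
    "title": ["Alpha Quest!", "Gamma Wars"],
    "platform": ["PC", "Xbox"],
    "year": [2001, 2018],
    "dev": ["Acme Studio", "Comet Labs"],
})
# The matches dataset holds positional indices of matching record pairs.
toy_matches = pd.DataFrame({"left": [0], "right": [0]})
```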
Matching Model Setup¶
For simplicity, we employ the similarity map we used in Example 1.
instructions = {
    "title": ["jaro_winkler"],
    "platform": ["levenshtein", "discrete"],
    "year": ["euclidean", "discrete"],
    "developer~dev": ["jaro"],
}
similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
('title', 'title', 'jaro_winkler')
('platform', 'platform', 'levenshtein')
('platform', 'platform', 'discrete')
('year', 'year', 'euclidean')
('year', 'year', 'discrete')
('developer', 'dev', 'jaro')]
Matching models using neural-symbolic learning are constructed using the `NSMatchingModel` class. The constructor expects a similarity map object, exactly as in the pure deep learning case.
model = NSMatchingModel(similarity_map)
Although the exposed interface is similar to that of the deep learning model, the underlying class designs differ. The `DLMatchingModel` class inherits from the `tensorflow.keras.Model` class, while the `NSMatchingModel` uses a custom training loop. Nonetheless, the `NSMatchingModel` class exposes `compile`, `fit`, `evaluate`, `predict`, and `suggest` functions to keep the calling conventions in user space as consistent as possible between the two classes. For instance, the `compile` method of the `NSMatchingModel` can be used to set the optimizer used during training.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
Model Training¶
The model is fitted using the `fit` function (see the `fit` documentation for details). The training loop underlying the neural-symbolic model is custom. Training treats the field-pair and record-pair models as fuzzy logic predicates and adjusts their parameters to satisfy the following logic. First, for every matching example, all the field predicates and the record predicate should be (fuzzily) true. This rule is motivated by the observation that true record matches should in principle constitute matches in all subsets of their associated fields. Second, for every non-matching example, at least one of the field predicates and the record predicate should be (fuzzily) false. This rule is motivated by the observation that non-matches should be distinct in at least one of their associated fields, because if they are not, then, in principle, they constitute a match.
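The two rules can be sketched numerically. The snippet below uses the product t-norm as one possible fuzzy conjunction; the truth values are illustrative numbers, not model outputs, and the package's actual fuzzy semantics may differ.

```python
import numpy as np

# Illustrative fuzzy truth values of the field predicates and the
# record predicate for a single record pair (made-up numbers).
field_truths = np.array([0.95, 0.90, 0.98])
record_truth = 0.93

# Rule 1 (matching pair): the fuzzy conjunction of all field predicates
# and the record predicate should be close to 1.
match_satisfaction = np.prod(field_truths) * record_truth

# Rule 2 (non-matching pair): at least one predicate should be close to 0,
# i.e., the fuzzy disjunction of the negations should be close to 1.
# With the product t-norm, this equals one minus the conjunction above.
non_match_satisfaction = 1.0 - np.prod(field_truths) * record_truth
```

For a pair whose predicates are all close to true, rule 1 is well satisfied and rule 2 is not, so the training signal depends on whether the pair is labeled as a match.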
The `fit` function expects the `left`, `right`, and `matches` datasets. As in the deep learning case, the `mismatch_share` parameter controls the ratio of counterexamples to matches. The `satisfiability_weight` parameter is a float between 0 and 1 that controls the weight of the satisfiability loss in the total loss function. The default value is 1.0, i.e., the training process considers only the fuzzy logic axioms. A value of 0.0 means that only the binary cross entropy loss is considered, which essentially reduces the model to a deep learning model. Any value between 0 and 1 trains a hybrid model that balances the two losses.
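Schematically, the weighting works as in the sketch below. This is only an illustration of the blending logic described above, not the package's actual loss implementation.

```python
def total_loss(sat_loss, bce_loss, satisfiability_weight=1.0):
    """Blend the satisfiability loss with binary cross entropy (sketch).

    satisfiability_weight = 1.0 -> only the fuzzy logic axioms are used;
    satisfiability_weight = 0.0 -> reduces to a deep learning model.
    """
    w = satisfiability_weight
    return w * sat_loss + (1.0 - w) * bce_loss

print(total_loss(0.2, 0.6, satisfiability_weight=1.0))  # 0.2
print(total_loss(0.2, 0.6, satisfiability_weight=0.0))  # 0.6
```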
Finally, logging can be customized by setting the `verbose` and `log_mod_n` parameters. The `verbose` parameter controls whether the training process prints the loss values at each epoch. The `log_mod_n` parameter controls the frequency of the logging. For instance, if `log_mod_n = 5`, the training process logs the loss values every 5 epochs. The default value is 1.
model.fit(
    left,
    right,
    matches,
    epochs=100,
    batch_size=16,
    verbose=1,
    log_mod_n=10
)
Predictions and Suggestions¶
Matching predictions and suggestions follow the same calling conventions as the corresponding deep learning member functions. Predictions can be obtained by calling `predict` and passing the `left` and `right` datasets. The returned prediction probabilities are stored in row-major order: first the matching probabilities of the first row of `left` with all the rows of `right`, then the probabilities of the second row of `left` with all the rows of `right`, and so on. In total, the `predict` function returns a vector with as many rows as the product of the numbers of rows in the `left` and `right` datasets.
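Given this ordering, the probability for a specific record pair can be recovered by index arithmetic. The sketch below assumes a flat prediction vector of length `len(left) * len(right)` and uses made-up placeholder values.

```python
import numpy as np

# Hypothetical flat prediction vector for 3 left and 2 right records.
n_left, n_right = 3, 2
predictions = np.arange(n_left * n_right) / 10.0  # placeholder values

def pair_probability(predictions, i, j, n_right):
    """Probability that left record i matches right record j (row-major)."""
    return predictions[i * n_right + j]

# Equivalent view: reshape the vector to an (n_left, n_right) matrix.
matrix = predictions.reshape(n_left, n_right)
assert pair_probability(predictions, 2, 1, n_right) == matrix[2, 1]
```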
predictions = model.predict(left, right)
fig, ax = plt.subplots()
counts, bins = np.histogram(predictions, bins=100)
cdf = np.cumsum(counts) / np.sum(counts)
ax.plot(bins[1:], cdf)
ax.set_xlabel("Matching Prediction")
ax.set_ylabel("Cumulative Density")
plt.show()
The `suggest` function also expects the `left` and `right` datasets as input. The function returns the top `count` matching predictions of the model for each row of the `left` dataset. The prediction probabilities of `predict` are grouped by the indices of the `left` dataset and sorted in descending order.
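The grouping-and-sorting step can be reproduced from the raw `predict` output, as in the sketch below. `top_suggestions` is a hypothetical reimplementation for illustration; the package's actual `suggest` may differ in details such as tie-breaking.

```python
import numpy as np
import pandas as pd

def top_suggestions(predictions, n_left, n_right, count=3):
    """Hypothetical reconstruction of suggest from a flat row-major vector."""
    matrix = np.asarray(predictions).reshape(n_left, n_right)
    rows = []
    for i in range(n_left):
        # Sort the i-th row's probabilities in descending order, keep top count.
        top = np.argsort(matrix[i])[::-1][:count]
        for j in top:
            rows.append({"left": i, "right": int(j), "prediction": matrix[i, j]})
    return pd.DataFrame(rows)

# Toy predictions for 2 left and 4 right records (made-up values).
preds = [0.9, 0.1, 0.4, 0.2, 0.05, 0.8, 0.3, 0.7]
print(top_suggestions(preds, 2, 4, count=2))
```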
suggestions = model.suggest(left, right, count = 3)
suggestions["true_match"] = suggestions.loc[:, ["left", "right"]].apply(
    lambda x: any((x.left == matches.left) & (x.right == matches.right)), axis=1
)
suggestions = suggestions.join(
    matches.iloc[-no_duplicates:, :].assign(duplicate=True).set_index(["left", "right"]),
    on=["left", "right"],
    how="left",
).fillna(False)
suggestions
suggestions
 | left | right | prediction | true_match | duplicate |
---|---|---|---|---|---|
0 | 0 | 0 | 0.993542 | True | False |
14 | 0 | 14 | 0.993527 | False | False |
22 | 0 | 22 | 0.033070 | False | False |
40 | 1 | 1 | 0.993530 | True | False |
62 | 1 | 23 | 0.011010 | False | False |
49 | 1 | 10 | 0.010907 | False | False |
80 | 2 | 2 | 0.993543 | True | False |
85 | 2 | 7 | 0.025760 | False | False |
113 | 2 | 35 | 0.025760 | False | False |
120 | 3 | 3 | 0.993539 | True | False |
147 | 3 | 30 | 0.025760 | False | False |
138 | 3 | 21 | 0.011008 | False | False |
160 | 4 | 4 | 0.993543 | True | False |
168 | 4 | 12 | 0.993500 | False | False |
156 | 4 | 0 | 0.025760 | False | False |
200 | 5 | 5 | 0.993542 | True | False |
222 | 5 | 27 | 0.011720 | False | False |
210 | 5 | 15 | 0.011008 | False | False |
240 | 6 | 6 | 0.993541 | True | False |
258 | 6 | 24 | 0.011438 | False | False |
265 | 6 | 31 | 0.000853 | False | False |
280 | 7 | 7 | 0.993542 | True | False |
291 | 7 | 18 | 0.025760 | False | False |
308 | 7 | 35 | 0.025760 | False | False |
320 | 8 | 8 | 0.993540 | True | False |
321 | 8 | 9 | 0.013035 | False | False |
322 | 8 | 10 | 0.000875 | False | False |
360 | 9 | 9 | 0.993531 | True | False |
368 | 9 | 17 | 0.033070 | False | False |
372 | 9 | 21 | 0.025760 | False | False |
400 | 10 | 10 | 0.993535 | True | False |
413 | 10 | 23 | 0.993527 | False | False |
391 | 10 | 1 | 0.010875 | False | False |
440 | 11 | 11 | 0.993542 | True | False |
458 | 11 | 29 | 0.025760 | False | False |
429 | 11 | 0 | 0.011720 | False | False |
480 | 12 | 12 | 0.993536 | True | False |
472 | 12 | 4 | 0.993493 | False | False |
468 | 12 | 0 | 0.025760 | False | False |
520 | 13 | 13 | 0.993542 | True | False |
544 | 13 | 37 | 0.993538 | True | True |
540 | 13 | 33 | 0.993537 | False | False |
560 | 14 | 14 | 0.993529 | True | False |
546 | 14 | 0 | 0.993527 | False | False |
568 | 14 | 22 | 0.033070 | False | False |
600 | 15 | 15 | 0.993540 | True | False |
609 | 15 | 24 | 0.024652 | False | False |
612 | 15 | 27 | 0.011961 | False | False |
640 | 16 | 16 | 0.993527 | True | False |
641 | 16 | 17 | 0.011301 | False | False |
658 | 16 | 34 | 0.011269 | False | False |
680 | 17 | 17 | 0.993538 | True | False |
672 | 17 | 9 | 0.033070 | False | False |
684 | 17 | 21 | 0.025769 | False | False |
720 | 18 | 18 | 0.993543 | True | False |
737 | 18 | 35 | 0.993527 | False | False |
738 | 18 | 36 | 0.993527 | False | False |
760 | 19 | 19 | 0.993541 | True | False |
774 | 19 | 33 | 0.025787 | False | False |
754 | 19 | 13 | 0.025772 | False | False |
800 | 20 | 20 | 0.993543 | True | False |
812 | 20 | 32 | 0.993540 | False | False |
793 | 20 | 13 | 0.025760 | False | False |
840 | 21 | 21 | 0.993536 | True | False |
847 | 21 | 28 | 0.032593 | False | False |
836 | 21 | 17 | 0.025776 | False | False |
880 | 22 | 22 | 0.993540 | True | False |
858 | 22 | 0 | 0.033070 | False | False |
872 | 22 | 14 | 0.033070 | False | False |
920 | 23 | 23 | 0.993541 | True | False |
907 | 23 | 10 | 0.993478 | False | False |
898 | 23 | 1 | 0.011008 | False | False |
960 | 24 | 24 | 0.993527 | True | False |
951 | 24 | 15 | 0.025006 | False | False |
942 | 24 | 6 | 0.011234 | False | False |
1000 | 25 | 25 | 0.993543 | True | False |
988 | 25 | 13 | 0.993537 | False | False |
1008 | 25 | 33 | 0.993536 | False | False |
1040 | 26 | 26 | 0.993539 | True | False |
1052 | 26 | 38 | 0.993527 | True | True |
1045 | 26 | 31 | 0.032593 | False | False |
1080 | 27 | 27 | 0.993541 | True | False |
1068 | 27 | 15 | 0.011961 | False | False |
1058 | 27 | 5 | 0.011720 | False | False |
1120 | 28 | 28 | 0.993541 | True | False |
1113 | 28 | 21 | 0.032593 | False | False |
1101 | 28 | 9 | 0.025760 | False | False |
1160 | 29 | 29 | 0.993543 | True | False |
1142 | 29 | 11 | 0.025760 | False | False |
1162 | 29 | 31 | 0.011512 | False | False |
1200 | 30 | 30 | 0.993539 | True | False |
1173 | 30 | 3 | 0.025760 | False | False |
1204 | 30 | 34 | 0.011512 | False | False |
1240 | 31 | 31 | 0.993541 | True | False |
1235 | 31 | 26 | 0.032593 | False | False |
1247 | 31 | 38 | 0.032593 | False | False |
1280 | 32 | 32 | 0.993543 | True | False |
1268 | 32 | 20 | 0.993540 | False | False |
1261 | 32 | 13 | 0.025760 | False | False |
1320 | 33 | 33 | 0.993543 | True | False |
1300 | 33 | 13 | 0.993537 | False | False |
1312 | 33 | 25 | 0.993536 | False | False |
1360 | 34 | 34 | 0.993540 | True | False |
1343 | 34 | 17 | 0.011720 | False | False |
1356 | 34 | 30 | 0.011512 | False | False |
1383 | 35 | 18 | 0.993527 | False | False |
1400 | 35 | 35 | 0.993527 | True | False |
1401 | 35 | 36 | 0.993527 | True | True |