Reasoning with Neural-symbolic Entity Matching¶
This example illustrates the package's out-of-the-box reasoning functionality. The goals of the example are (i) to introduce the available options and (ii) to discuss the underlying logic of the reasoning. For an introduction to the basic concepts and conventions of the package, one may consult the Entity Matching with Similarity Maps and Deep Learning example (henceforth Example 1). For an introduction to the package's neural-symbolic entity matching functionality, one may consult the Neural-Symbolic Learning example (henceforth Example 2).
Prerequisites¶
Load the libraries we will use and set the seed for reproducibility.
from neer_match.reasoning import RefutationModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
Preprocessing¶
We use the gaming data of Example 1 and Example 2. Similar to the deep and neural-symbolic cases, the reasoning functionality leverages similarity maps and expects the same structure of datasets as inputs. More details are given in Example 1 and Example 2. In summary, the preprocessing stage, wrapped here in a prepare_data helper, constructs three datasets left, right, and matches, where left has 36 records, right has 39 records, and matches contains the indices of the matching records in left and right. In addition, we append a mock column of independently drawn random integers to each of left and right; these columns are used in the second refutation exercise below.
left, right, matches = prepare_data()
left["mock"] = np.random.randint(0, 1e5, left.shape[0])
right["mock"] = np.random.randint(0, 1e5, right.shape[0])
Refutation Model Setup¶
We employ the similarity map of Example 1 and Example 2, extended with an entry for the mock columns that are used in the second refutation exercise below.
instructions = {
"title": ["jaro_winkler"],
"platform": ["levenshtein", "discrete"],
"year": ["euclidean", "discrete"],
"developer~dev": ["jaro"],
"mock": ["euclidean"],
}
similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
('title', 'title', 'jaro_winkler')
('platform', 'platform', 'levenshtein')
('platform', 'platform', 'discrete')
('year', 'year', 'euclidean')
('year', 'year', 'discrete')
('developer', 'dev', 'jaro')
('mock', 'mock', 'euclidean')]
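The dictionary keys follow the association convention of Examples 1 and 2: a plain key associates identically named columns of the two datasets, while a key of the form left~right (such as developer~dev) associates differently named columns, as the printed map above shows. The following lines, which use only standard Python on the instructions dictionary, make the convention explicit:
# Illustration only: unpack each association key into its left and right column names
for key, similarities in instructions.items():
    lcol, rcol = key.split("~") if "~" in key else (key, key)
    print(f"{lcol} (left) ~ {rcol} (right): {similarities}")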
The out-of-the-box reasoning that the neer_match package provides allows one to test, and potentially refute, the significance of one or more conjectured associations in detecting entity matches. The functionality is provided by the RefutationModel class. RefutationModel inherits from the NSMatchingModel class and adds the functionality needed for refutation reasoning.
model = RefutationModel(similarity_map)
All the member functions exposed by NSMatchingModel can be used directly with the RefutationModel class. This includes the functions compile, fit, evaluate, predict, and suggest, which are used to train, evaluate, and apply entity matching models. The functions compile, evaluate, predict, and suggest do not modify the behavior of the underlying NSMatchingModel class. For example, the compile method of NSMatchingModel can be used to set the optimizer used during training.
model.compile(
optimizer=tf.keras.optimizers.Adam(
tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate=0.001,
decay_steps=10,
decay_rate=0.96,
staircase=True,
)
)
)
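The decaying learning-rate schedule is only one possible choice; any Keras optimizer can be passed to compile. A plain Adam optimizer with a fixed (illustrative) learning rate would also work; it is shown commented out so that the training results reported below remain those of the schedule above.
# Alternative (not executed here): a constant learning rate instead of a schedule
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))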
Model Training¶
The model is fitted using the fit function (see the fit documentation for details). Similar to the deep and neural-symbolic entity matching models, the fit function expects the left, right, and matches datasets. Moreover, the mismatch_share parameter controls the ratio of counterexamples to matches. The parameter satisfiability_weight controls the mixing of the binary cross entropy and the matching axioms' satisfiability during model training. Finally, logging verbosity and frequency are controlled by the parameters verbose and log_mod_n (see also Example 2).
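For orientation, a sketch of a fit call that sets these shared parameters is given below. The values are illustrative and the call is commented out, so the training runs reported later are unaffected.
# Illustration only (not executed): parameters shared with NSMatchingModel.fit
# model.fit(
#     left, right, matches,
#     refutation="title",         # refutation-specific arguments are discussed next
#     epochs=51,
#     batch_size=12,
#     mismatch_share=1.0,         # ratio of counterexamples to matches (illustrative value)
#     satisfiability_weight=0.5,  # mix of BCE and axiom satisfiability (illustrative value)
#     verbose=1,
#     log_mod_n=10,
# )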
More importantly, the refutation model's fit method introduces four new parameters, namely refutation, penalty_threshold, penalty_scale, and penalty_decay, to guide the model's refutation reasoning. The refutation parameter is either a field pair association or a dictionary with a single field pair association as key and a list of similarities as value. In the first case, the model tries to refute the association between the two fields using all the similarities documented for that pair in its similarity map. In the second case, the model tries to refute the association between the field pair using only the similarities in the list; both forms are sketched below. The refutation claim is that the supplied field pair's similarities are a (fuzzy) necessary condition for entity matching, i.e., whenever a record pair constitutes a match, the supplied field pair's similarities are close to one. Refutation involves minimizing the satisfiability of the refutation claim, while penalizing states for which the entity matching axioms are not satisfied.
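Concretely, with the similarity map defined above, the two accepted forms of the refutation argument look as follows (shown commented out, since the actual training calls come next):
# Form 1: refute using every similarity recorded for the field pair in the similarity map
# model.fit(left, right, matches, refutation="title", epochs=51)
# Form 2: refute using only the listed similarities of the field pair
# model.fit(left, right, matches, refutation={"title": ["jaro_winkler"]}, epochs=51)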
The penalty_threshold parameter controls the threshold for the penalty. If the satisfiability of the matching axioms is below the threshold, the model's objective has a linear penalty structure. The penalty is proportional to the difference between the threshold and the satisfiability of the matching axioms, with a scale parameter equal to penalty_scale. If the satisfiability of the matching axioms is above the threshold, the model's objective has an exponentially decaying penalty structure. The exponential decay is controlled by the penalty_decay parameter. The default values for penalty_threshold, penalty_scale, and penalty_decay are 0.95, 1.0, and 0.1, respectively.
The following example attempts to (fuzzily) refute the claim that title matching is a necessary condition for matching records between the left and right datasets. The training results indicate that the claim cannot be refuted: at the end of training, the refutation claim has a satisfiability value (CSat in the log below) closer to one than to zero, while the satisfiability of the matching axioms (ASat) also remains high. This indicates that the optimizer fails to find a network parameter configuration for which the claim can be refuted (i.e., has small satisfiability) while at the same time the matching axioms are satisfied.
model.fit(
left,
right,
matches,
refutation="title",
epochs=51,
penalty_threshold=0.99,
penalty_scale=2.0,
penalty_decay=0.1,
batch_size=12,
verbose=1,
log_mod_n=10,
)
| Epoch | BCE | Recall | Precision | F1 | CSat | ASat |
| 0 | 8.9875 | 0.0000 | nan | nan | 0.8902 | 0.7418 |
| 10 | 8.3274 | 0.0000 | nan | nan | 0.9014 | 0.7548 |
| 20 | 7.6994 | 0.0000 | nan | nan | 0.9153 | 0.7697 |
| 30 | 7.2188 | 0.0000 | nan | nan | 0.9284 | 0.7838 |
| 40 | 6.9653 | 0.0000 | nan | nan | 0.9371 | 0.7931 |
| 50 | 6.8437 | 0.0000 | nan | nan | 0.9421 | 0.7984 |
Training finished at Epoch 50 with DL loss 6.8437 and Sat 0.9421
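Since RefutationModel keeps the NSMatchingModel interface, the fitted model can be inspected with the inherited calls; for instance, an evaluation run could be added at this point. The call is shown commented out and its signature is the one assumed from Example 2.
# Optional follow-up (signature assumed to match Example 2)
# model.evaluate(left, right, matches)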
In contrast, the following example succeeds in refuting the claim that the randomly generated columns are necessary for entity matching. The mock columns of the left and right datasets are re-drawn with independently generated random values, and the corresponding association is part of the rebuilt similarity map. At the end of training, the optimizer finds a configuration with very low satisfiability for the refutation claim, while keeping the satisfiability of the matching axioms above \(0.61\).
left["mock"] = np.random.randint(0, 1e5, left.shape[0])
right["mock"] = np.random.randint(0, 1e5, right.shape[0])
instructions = {
"title": ["jaro_winkler"],
"platform": ["levenshtein", "discrete"],
"year": ["euclidean", "discrete"],
"developer~dev": ["jaro"],
"mock": ["euclidean"],
}
similarity_map = SimilarityMap(instructions)
model = RefutationModel(similarity_map)
model.compile(
optimizer=tf.keras.optimizers.Adam(
tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate=0.001,
decay_steps=10,
decay_rate=0.96,
staircase=True,
)
)
)
model.fit(
left,
right,
matches,
refutation="mock",
epochs=51,
penalty_threshold=0.99,
penalty_scale=2.0,
penalty_decay=0.1,
batch_size=12,
verbose=1,
log_mod_n=10,
)
| Epoch | BCE | Recall | Precision | F1 | CSat | ASat |
| 0 | 9.5453 | 1.0000 | 0.2500 | 0.4000 | 0.4723 | 0.7320 |
| 10 | 25.9528 | 1.0000 | 0.2500 | 0.4000 | 0.0705 | 0.6344 |
| 20 | 37.3276 | 1.0000 | 0.2500 | 0.4000 | 0.0201 | 0.6215 |
| 30 | 40.1300 | 1.0000 | 0.2500 | 0.4000 | 0.0148 | 0.6200 |
| 40 | 41.6946 | 1.0000 | 0.2500 | 0.4000 | 0.0125 | 0.6193 |
| 50 | 42.6372 | 1.0000 | 0.2500 | 0.4000 | 0.0113 | 0.6190 |
Training finished at Epoch 50 with DL loss 42.6372 and Sat 0.0113
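To visualize why the first claim survives while the second one is refuted, the claim satisfiability (CSat) logged every ten epochs in the two runs can be plotted with matplotlib, which was imported in the prerequisites. The values below are transcribed from the training logs above.
# Claim satisfiability (CSat) transcribed from the two training logs above
epochs = [0, 10, 20, 30, 40, 50]
csat_title = [0.8902, 0.9014, 0.9153, 0.9284, 0.9371, 0.9421]
csat_mock = [0.4723, 0.0705, 0.0201, 0.0148, 0.0125, 0.0113]

plt.plot(epochs, csat_title, marker="o", label="refutation='title'")
plt.plot(epochs, csat_mock, marker="o", label="refutation='mock'")
plt.xlabel("Epoch")
plt.ylabel("Claim satisfiability (CSat)")
plt.legend()
plt.show()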