Neural-symbolic Entity Matching

The example introduces the package’s entity matching functionality using neural-symbolic learning. The goals of the example are (i) to introduce the available options, (ii) to discuss some aspects of the underlying model logic, and (iii) to contrast the neural-symbolic functionality with that of purely deep learning entity matching. For an introduction to the basic concepts and conventions of the package, one may consult the Entity Matching with Similarity Maps and Deep Learning example (henceforth Example 1). For the reasoning capabilities of the package, see the Reasoning example.

Prerequisites

Load the libraries we will use and set the seed for reproducibility.

from neer_match.matching_model import NSMatchingModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

Preprocessing

We use the gaming data of Example 1. Similar to the deep learning case, the neural-symbolic functionality leverages similarity maps and expects the same structure of input datasets. The concepts and requirements are detailed in Example 1; we abstain from repeating them here for brevity. In summary, our preprocessing stage constructs three datasets, left, right, and matches, where left has 36 records, right has 39 records, and matches holds the indices of the matching records in left and right.

# prepare_data encapsulates the preprocessing steps of Example 1.
left, right, matches = prepare_data()

Matching Model Setup

For simplicity, we employ the similarity map we used in Example 1.

instructions = {
    "title": ["jaro_winkler"],
    "platform": ["levenshtein", "discrete"],
    "year": ["euclidean", "discrete"],
    "developer~dev": ["jaro"]
}

similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
  ('title', 'title', 'jaro_winkler')
  ('platform', 'platform', 'levenshtein')
  ('platform', 'platform', 'discrete')
  ('year', 'year', 'euclidean')
  ('year', 'year', 'discrete')
  ('developer', 'dev', 'jaro')]

Matching models using neural-symbolic learning are constructed using the NSMatchingModel class. The constructor expects a similarity map object, exactly as in the pure deep learning case.

model = NSMatchingModel(similarity_map)

Although the exposed interface is similar to the deep learning model, the underlying class designs are different. The DLMatchingModel class inherits from the tensorflow.keras.Model class, while the NSMatchingModel uses a custom training loop. Nonetheless, the NSMatchingModel class exposes compile, fit, evaluate, predict, and suggest functions to ensure that the calling conventions in the user space are as consistent as possible between the two classes.

For instance, the compile method of the NSMatchingModel can be used to set the optimizer used during training.

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))

Model Training

The model is fitted using the fit function (see the fit documentation for details). The underlying training loop of the neural-symbolic model is custom. Training treats the field pair and record pair models as fuzzy logic predicates and adjusts their parameters to satisfy the following logic. First, for every matching example, all the field predicates and the record predicate should be (fuzzily) true. This rule is motivated by the observation that true record matches should in principle constitute matches in all subsets of their associated fields. Second, for every non-matching example, at least one of the field predicates and the record predicate should be (fuzzily) false. This rule is motivated by the observation that non-matches should be distinct in at least one of their associated fields; if they are not, then, in principle, they constitute a match.
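
To fix ideas, the following sketch expresses the two rules with the product t-norm as the fuzzy conjunction. It is only an illustration of the logic described above, not the package’s internal implementation, and the predicate values stand in for the outputs of the field pair and record pair models.

import numpy as np

def matching_truth(field_preds, record_pred):
    # Rule 1: for a matching pair, the conjunction of all field
    # predicates and the record predicate should be (fuzzily) true.
    return np.prod(field_preds) * record_pred

def non_matching_truth(field_preds, record_pred):
    # Rule 2: for a non-matching pair, at least one predicate should be
    # (fuzzily) false, i.e., the negation of the conjunction should hold.
    return 1.0 - np.prod(field_preds) * record_pred

matching_truth([0.98, 0.95, 0.99], 0.97)      # high truth value
non_matching_truth([0.98, 0.10, 0.99], 0.12)  # high truth value

Training searches for parameters that make both truth values as large as possible on the corresponding examples.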

The fit function expects the left, right, and matches datasets. As in the deep learning case, the mismatch_share parameter controls the ratio of counterexamples to matches. The satisfiability_weight parameter is a float value between 0 and 1 that controls the weight of the satisfiability loss in the total loss function. The default value is 1.0, i.e., the training process considers only the fuzzy logic axioms. A value of 0.0 means that only the binary cross-entropy loss is considered, which essentially reduces the model to a deep learning model. Any value strictly between 0 and 1 trains a hybrid model that balances the two losses.
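
Schematically, and under the assumption that the two losses enter linearly, the combined objective looks as follows. The snippet is a hedged sketch with placeholder values; none of these quantities are computed by the package in this form.

# Placeholder loss values used only to illustrate the weighting scheme.
satisfiability_weight = 0.5
satisfiability_loss, bce_loss = 0.12, 0.34
total_loss = (satisfiability_weight * satisfiability_loss
              + (1.0 - satisfiability_weight) * bce_loss)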

Finally, logging can be customized by setting the parameters verbose and log_mod_n. The verbose parameter controls whether the training process prints the loss values at each epoch. The log_mod_n parameter controls the frequency of the logging. For instance, if log_mod_n = 5, the training process logs the loss values every 5 epochs. The default value is 1.

model.fit(
    left,
    right,
    matches,
    epochs=100,
    batch_size=16,
    verbose=1,
    log_mod_n=10
)
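
Although this example does not use it further, the evaluation entry point mentioned above can be called once the model is fitted. The call below is a minimal sketch that assumes evaluate follows the same calling convention as fit.

# Hedged sketch: assumes evaluate accepts the same datasets as fit and
# returns matching-quality metrics for the trained model.
metrics = model.evaluate(left, right, matches)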

Predictions and Suggestions

Matching predictions and suggestions follow the same calling conventions as the corresponding deep learning member functions. Predictions are obtained by calling predict and passing the left and right datasets. The returned prediction probabilities are stored in row-major order: first the matching probabilities of the first row of left with all the rows of right, then the probabilities of the second row of left with all the rows of right, and so on. In total, the predict function returns a vector with length equal to the product of the numbers of rows in the left and right datasets.

predictions = model.predict(left, right)
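
Under this layout, the probability that record i of left matches record j of right is stored at flat position i * len(right) + j. For instance, the following snippet (an illustration of the ordering, not additional package functionality) retrieves the matching probability of left record 4 with right record 12.

# Matching probability of left record 4 with right record 12, recovered
# from the row-major prediction vector (position 4 * 39 + 12 = 168).
i, j = 4, 12
print(predictions[i * len(right) + j])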

fig, ax = plt.subplots()
# Plot the empirical cumulative distribution of the matching predictions.
counts, bins = np.histogram(predictions, bins=100)
cdf = np.cumsum(counts) / np.sum(counts)
ax.plot(bins[1:], cdf)
ax.set_xlabel("Matching Prediction")
ax.set_ylabel("Cumulative Density")
plt.show()

The suggest function also expects the left and right datasets as input. The function returns the top count matching predictions of the model for each row of the left dataset. The prediction probabilities of predict are grouped by the indices of the left dataset and sorted in descending order.

suggestions = model.suggest(left, right, count=3)
suggestions["true_match"] = suggestions.loc[:, ["left", "right"]].apply(
    lambda x: any((x.left == matches.left) & (x.right == matches.right)), axis=1
)
# no_duplicates is set in the preprocessing stage of Example 1; the last
# no_duplicates rows of matches correspond to the appended duplicate records.
suggestions = suggestions.join(
    matches.iloc[-no_duplicates:, :].assign(duplicate=True).set_index(["left", "right"]),
    on=["left", "right"],
    how="left",
).fillna(False)

suggestions
left right prediction true_match duplicate
0 0 0 0.993542 True False
14 0 14 0.993527 False False
22 0 22 0.033070 False False
40 1 1 0.993530 True False
62 1 23 0.011010 False False
49 1 10 0.010907 False False
80 2 2 0.993543 True False
85 2 7 0.025760 False False
113 2 35 0.025760 False False
120 3 3 0.993539 True False
147 3 30 0.025760 False False
138 3 21 0.011008 False False
160 4 4 0.993543 True False
168 4 12 0.993500 False False
156 4 0 0.025760 False False
200 5 5 0.993542 True False
222 5 27 0.011720 False False
210 5 15 0.011008 False False
240 6 6 0.993541 True False
258 6 24 0.011438 False False
265 6 31 0.000853 False False
280 7 7 0.993542 True False
291 7 18 0.025760 False False
308 7 35 0.025760 False False
320 8 8 0.993540 True False
321 8 9 0.013035 False False
322 8 10 0.000875 False False
360 9 9 0.993531 True False
368 9 17 0.033070 False False
372 9 21 0.025760 False False
400 10 10 0.993535 True False
413 10 23 0.993527 False False
391 10 1 0.010875 False False
440 11 11 0.993542 True False
458 11 29 0.025760 False False
429 11 0 0.011720 False False
480 12 12 0.993536 True False
472 12 4 0.993493 False False
468 12 0 0.025760 False False
520 13 13 0.993542 True False
544 13 37 0.993538 True True
540 13 33 0.993537 False False
560 14 14 0.993529 True False
546 14 0 0.993527 False False
568 14 22 0.033070 False False
600 15 15 0.993540 True False
609 15 24 0.024652 False False
612 15 27 0.011961 False False
640 16 16 0.993527 True False
641 16 17 0.011301 False False
658 16 34 0.011269 False False
680 17 17 0.993538 True False
672 17 9 0.033070 False False
684 17 21 0.025769 False False
720 18 18 0.993543 True False
737 18 35 0.993527 False False
738 18 36 0.993527 False False
760 19 19 0.993541 True False
774 19 33 0.025787 False False
754 19 13 0.025772 False False
800 20 20 0.993543 True False
812 20 32 0.993540 False False
793 20 13 0.025760 False False
840 21 21 0.993536 True False
847 21 28 0.032593 False False
836 21 17 0.025776 False False
880 22 22 0.993540 True False
858 22 0 0.033070 False False
872 22 14 0.033070 False False
920 23 23 0.993541 True False
907 23 10 0.993478 False False
898 23 1 0.011008 False False
960 24 24 0.993527 True False
951 24 15 0.025006 False False
942 24 6 0.011234 False False
1000 25 25 0.993543 True False
988 25 13 0.993537 False False
1008 25 33 0.993536 False False
1040 26 26 0.993539 True False
1052 26 38 0.993527 True True
1045 26 31 0.032593 False False
1080 27 27 0.993541 True False
1068 27 15 0.011961 False False
1058 27 5 0.011720 False False
1120 28 28 0.993541 True False
1113 28 21 0.032593 False False
1101 28 9 0.025760 False False
1160 29 29 0.993543 True False
1142 29 11 0.025760 False False
1162 29 31 0.011512 False False
1200 30 30 0.993539 True False
1173 30 3 0.025760 False False
1204 30 34 0.011512 False False
1240 31 31 0.993541 True False
1235 31 26 0.032593 False False
1247 31 38 0.032593 False False
1280 32 32 0.993543 True False
1268 32 20 0.993540 False False
1261 32 13 0.025760 False False
1320 33 33 0.993543 True False
1300 33 13 0.993537 False False
1312 33 25 0.993536 False False
1360 34 34 0.993540 True False
1343 34 17 0.011720 False False
1356 34 30 0.011512 False False
1383 35 18 0.993527 False False
1400 35 35 0.993527 True False
1401 35 36 0.993527 True True
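
One way to summarize the table is to compute the share of left records whose top suggestion is a true match. The snippet below is a sketch that relies on the descending sort of the suggestions within each left index; its output is omitted here.

# Keep the highest-probability suggestion per left record and compute
# the share of them that are true matches.
top1 = suggestions.groupby("left").head(1)
print(top1["true_match"].mean())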