Neural-symbolic Entity Matching

This example introduces the package’s entity matching functionality using neural-symbolic learning. The goals of the example are (i) to introduce the available options, (ii) to discuss some aspects of the underlying model logic, and (iii) to contrast the neural-symbolic functionality with that of purely deep learning entity matching. For an introduction to the basic concepts and conventions of the package, consult the Entity Matching with Similarity Maps and Deep Learning example (henceforth Example 1). For the reasoning capabilities of the package, see the Reasoning example.

Prerequisites

Load the libraries we will use and set the seed for reproducibility.

from neer_match.matching_model import NSMatchingModel
from neer_match.similarity_map import SimilarityMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

Preprocessing

We use the gaming data of Example 1. As in the deep learning case, the neural-symbolic functionality leverages similarity maps and expects the same dataset structure as input. The concepts and requirements are detailed in Example 1; we abstain from repeating them here for brevity. In summary, our preprocessing stage constructs three datasets, left, right, and matches, where left has 36 records, right has 39 records, and matches holds the indices of the matching records in left and right.

left, right, matches = prepare_data()
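The helper prepare_data is part of the preprocessing code of Example 1 and is not reproduced here. For orientation only, a minimal sketch of the expected dataset structure (with illustrative records, not the actual gaming data) could look like this:

```python
import pandas as pd

# Illustrative stand-ins for the three datasets (NOT the real gaming data).
# left and right are plain pandas DataFrames containing the columns referenced
# by the similarity map; matches holds positional indices into left and right.
left = pd.DataFrame({
    "title": ["Game A", "Game B"],
    "platform": ["PC", "PS4"],
    "year": [2010, 2012],
    "developer": ["Acme", "Bravo"],
})
right = pd.DataFrame({
    "title": ["Game A", "Game B (GOTY)"],
    "platform": ["PC", "PS4"],
    "year": [2010, 2013],
    "dev": ["Acme", "Bravo"],  # named "dev" to match the developer~dev instruction
})
matches = pd.DataFrame({"left": [0, 1], "right": [0, 1]})
```

Note that the right dataset uses the column name dev, which is why the similarity map below associates the columns with the developer~dev key.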

Matching Model Setup

For simplicity, we employ the similarity map we used in Example 1.

instructions = {
    "title": ["jaro_winkler"],
    "platform": ["levenshtein", "discrete"],
    "year": ["euclidean", "discrete"],
    "developer~dev": ["jaro"]
}

similarity_map = SimilarityMap(instructions)
print(similarity_map)
SimilarityMap[
  ('title', 'title', 'jaro_winkler')
  ('platform', 'platform', 'levenshtein')
  ('platform', 'platform', 'discrete')
  ('year', 'year', 'euclidean')
  ('year', 'year', 'discrete')
  ('developer', 'dev', 'jaro')]

Matching models using neural-symbolic learning are constructed using the NSMatchingModel class. The constructor expects a similarity map object, exactly as in the pure deep learning case.

model = NSMatchingModel(similarity_map)

Although the exposed interface is similar to the deep learning model, the underlying class designs are different. The DLMatchingModel class inherits from the tensorflow.keras.Model class, while the NSMatchingModel uses a custom training loop. Nonetheless, the NSMatchingModel class exposes compile, fit, evaluate, predict, and suggest functions to ensure that the calling conventions in the user space are as consistent as possible between the two classes.

For instance, the compile method of the NSMatchingModel can be used to set the optimizer used during training.

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001))

Model Training

The model is fitted using the fit function (see the fit documentation for details). The underlying training loop of the neural-symbolic model is custom. Training treats the field-pair and record-pair models as fuzzy logic predicates and adjusts their parameters to satisfy the following logic. First, for every matching example, all the field predicates and the record predicate should be (fuzzily) true. This rule is motivated by the observation that true record matches should, in principle, constitute matches in all subsets of their associated fields. Second, for every non-matching example, at least one of the field predicates or the record predicate should be (fuzzily) false. This rule is motivated by the observation that non-matches should be distinct in at least one of their associated fields; if they are not, then, in principle, they constitute a match.
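The two rules can be illustrated with a toy calculation. Assuming, purely for illustration, a product t-norm for conjunction and the standard complement for negation (the fuzzy semantics actually used by the package may differ), the degree to which each rule is satisfied for a single record pair can be computed as follows:

```python
import numpy as np

# Hypothetical field-predicate truth degrees for one record pair
# (illustrative values, not produced by the package).
field_degrees = np.array([0.95, 0.90, 0.88])
record_degree = 0.92

# Rule 1 (matching pair): all field predicates AND the record predicate
# should be true. With a product t-norm, the conjunction degree is:
match_rule = np.prod(field_degrees) * record_degree

# Rule 2 (non-matching pair): at least one predicate should be false,
# i.e. NOT(p1 AND ... AND pn AND r). With the standard complement:
nonmatch_rule = 1.0 - np.prod(field_degrees) * record_degree

# Training pushes match_rule toward 1 on matching examples and
# nonmatch_rule toward 1 on counterexamples.
```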

The fit function expects the left, right, and matches datasets. As in the deep learning case, the mismatch_share parameter controls the ratio of counterexamples to matches. The satisfiability_weight parameter is a float between 0 and 1 that controls the weight of the satisfiability loss in the total loss function. The default value is 1.0, i.e., the training process considers only the fuzzy logic axioms. A value of 0.0 means that only the binary cross-entropy loss is considered, which essentially reduces the model to a deep learning model. Any value between 0 and 1 trains a hybrid model that balances the two losses.
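The role of satisfiability_weight can be sketched as a convex combination of the two losses. The helper below is a plain-Python illustration of this weighting, not the package's internal implementation:

```python
def hybrid_loss(sat_loss, bce_loss, satisfiability_weight=1.0):
    """Convex combination of satisfiability and binary cross-entropy losses.

    satisfiability_weight = 1.0 -> purely neural-symbolic training,
    satisfiability_weight = 0.0 -> purely deep learning training.
    """
    w = satisfiability_weight
    return w * sat_loss + (1.0 - w) * bce_loss

# Hypothetical batch loss values.
print(hybrid_loss(0.25, 0.75, satisfiability_weight=1.0))  # 0.25
print(hybrid_loss(0.25, 0.75, satisfiability_weight=0.0))  # 0.75
print(hybrid_loss(0.25, 0.75, satisfiability_weight=0.5))  # 0.5
```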

Finally, logging can be customized by setting the parameters verbose and log_mod_n. The verbose parameter controls whether the training process prints the loss values at each epoch. The log_mod_n parameter controls the frequency of the logging. For instance, if log_mod_n = 5, the training process logs the loss values every 5 epochs. The default value is 1.
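The logging cadence amounts to a simple modulo check; the snippet below illustrates which epochs report losses under log_mod_n = 5 (the package's exact epoch numbering convention may differ):

```python
# With log_mod_n = 5, losses are reported every 5 epochs.
epochs, log_mod_n = 20, 5
logged_epochs = [epoch for epoch in range(epochs) if epoch % log_mod_n == 0]
print(logged_epochs)  # [0, 5, 10, 15]
```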

model.fit(
    left,
    right,
    matches,
    epochs=100,
    batch_size=16,
    verbose=1,
    log_mod_n=10
)

Predictions and Suggestions

Matching predictions and suggestions follow the same calling conventions as the corresponding deep learning member functions. Predictions are obtained by calling predict with the left and right datasets. The returned prediction probabilities are stored in row-major order: first come the matching probabilities of the first row of left with all the rows of right, then the probabilities of the second row of left with all the rows of right, and so on. In total, the predict function returns a vector whose length equals the product of the numbers of rows in the left and right datasets.
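The row-major layout implies a simple index arithmetic. The helper names below (flat_index, pair_from_flat) are illustrative, not part of the package:

```python
# With predictions stored in row-major order, the flat position k of the
# pair (i, j) -- row i of left against row j of right -- is k = i * n_right + j.
# Conversely, k maps back to (k // n_right, k % n_right).
n_left, n_right = 36, 39  # sizes of the example datasets

def flat_index(i, j, n_right=n_right):
    return i * n_right + j

def pair_from_flat(k, n_right=n_right):
    return divmod(k, n_right)

assert flat_index(0, 0) == 0
assert flat_index(1, 0) == 39        # second left row starts after 39 pairs
assert pair_from_flat(40) == (1, 1)
assert flat_index(35, 38) == n_left * n_right - 1  # last pair
```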

predictions = model.predict(left, right)

fig, ax = plt.subplots()
counts, bins = np.histogram(predictions, bins = 100)
cdf = np.cumsum(counts)/np.sum(counts)
ax.plot(bins[1:], cdf)
ax.set_xlabel("Matching Prediction")
ax.set_ylabel("Cumulative Density")
plt.show()

The suggest function also expects the left and right datasets as input. The function returns the top count matching predictions of the model for each row of the left dataset. The prediction probabilities of predict are grouped by the indices of the left dataset and sorted in descending order.
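The grouping performed by suggest can be emulated directly from the output of predict. The snippet below is an illustrative reconstruction with made-up probabilities, not the package's implementation:

```python
import numpy as np
import pandas as pd

# Hypothetical flat predictions for 3 left rows x 4 right rows.
preds = np.array([
    0.9, 0.1, 0.2, 0.05,
    0.3, 0.8, 0.4, 0.10,
    0.2, 0.1, 0.7, 0.60,
])
n_left, n_right, count = 3, 4, 2

# Expand the row-major vector into (left, right, prediction) triples,
# then keep the top `count` predictions per left row.
frame = pd.DataFrame({
    "left": np.repeat(np.arange(n_left), n_right),
    "right": np.tile(np.arange(n_right), n_left),
    "prediction": preds,
})
top = (
    frame.sort_values(["left", "prediction"], ascending=[True, False])
    .groupby("left")
    .head(count)
    .reset_index(drop=True)
)
```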

suggestions = model.suggest(left, right, count = 3)
suggestions["true_match"] = suggestions.loc[:, ["left", "right"]].apply(
    lambda x: any((x.left == matches.left) & (x.right==matches.right)), axis=1
)
suggestions = suggestions.join(
    matches.iloc[-no_duplicates:,:].assign(duplicate=True).set_index(['left', 'right']),
    on = ["left", "right"],
    how = "left"
)
suggestions.duplicate = suggestions.duplicate.apply(
    lambda x: False if pd.isna(x) else x
)
suggestions = suggestions.sort_values(
    by=["left", "prediction"], ascending=[True, False]
)

suggestions

     left  right  prediction  true_match  duplicate
0       0      0    0.994650        True      False
1       0     14    0.994635       False      False
2       0     22    0.032443       False      False
3       1      1    0.994637        True      False
4       1     23    0.014025       False      False
5       1     10    0.013826       False      False
6       2      2    0.994651        True      False
7       2     36    0.024976       False      False
8       2     35    0.024976       False      False
9       3      3    0.994647        True      False
10      3     30    0.024976       False      False
11      3     21    0.014028       False      False
12      4      4    0.994651        True      False
13      4     12    0.994617       False      False
14      4     22    0.024976       False      False
15      5      5    0.994650        True      False
16      5     27    0.014272       False      False
17      5     15    0.014030       False      False
18      6      6    0.994649        True      False
19      6     24    0.014035       False      False
20      6     31    0.001203       False      False
21      7      7    0.994650        True      False
22      7     36    0.024976       False      False
23      7     35    0.024976       False      False
24      8      8    0.994648        True      False
25      8      9    0.014807       False      False
26      8      7    0.001235       False      False
27      9      9    0.994638        True      False
28      9     17    0.032443       False      False
29      9     21    0.024976       False      False
30     10     10    0.994641        True      False
31     10     23    0.994635       False      False
32     10      1    0.013791       False      False
33     11     11    0.994650        True      False
34     11     29    0.024976       False      False
35     11     14    0.014272       False      False
36     12     12    0.994642        True      False
37     12      4    0.994613       False      False
38     12     22    0.024976       False      False
39     13     13    0.994650        True      False
40     13     37    0.994645        True       True
41     13     33    0.994644       False      False
42     14     14    0.994636        True      False
43     14      0    0.994635       False      False
44     14     22    0.032443       False      False
45     15     15    0.994648        True      False
46     15     24    0.023320       False      False
47     15     27    0.014370       False      False
48     16     16    0.994635        True      False
49     16     17    0.014104       False      False
50     16     30    0.014014       False      False
51     17     17    0.994646        True      False
52     17      9    0.032443       False      False
53     17     28    0.024976       False      False
54     18     18    0.994651        True      False
55     18     36    0.994635       False      False
56     18     35    0.994635       False      False
57     19     19    0.994650        True      False
58     19     33    0.024989       False      False
59     19     13    0.024982       False      False
60     20     20    0.994651        True      False
61     20     32    0.994647       False      False
62     20     37    0.024976       False      False
63     21     21    0.994641        True      False
64     21     28    0.031826       False      False
65     21     17    0.024984       False      False
66     22     22    0.994647        True      False
67     22     14    0.032443       False      False
68     22      0    0.032443       False      False
69     23     23    0.994650        True      False
70     23     10    0.994604       False      False
71     23      1    0.014028       False      False
72     24     24    0.994635        True      False
73     24     15    0.023762       False      False
74     24      6    0.013815       False      False
75     25     25    0.994651        True      False
76     25     13    0.994643       False      False
77     25     33    0.994641       False      False
78     26     26    0.994647        True      False
79     26     38    0.994635        True       True
80     26     31    0.031826       False      False
81     27     27    0.994650        True      False
82     27     15    0.014370       False      False
83     27      5    0.014227       False      False
84     28     28    0.994649        True      False
85     28     21    0.031826       False      False
86     28      9    0.024976       False      False
87     29     29    0.994651        True      False
88     29     11    0.024976       False      False
89     29     31    0.014205       False      False
90     30     30    0.994646        True      False
91     30      3    0.024976       False      False
92     30     34    0.014205       False      False
93     31     31    0.994649        True      False
94     31     38    0.031826       False      False
95     31     26    0.031826       False      False
96     32     32    0.994651        True      False
97     32     20    0.994647       False      False
98     32     37    0.024976       False      False
99     33     33    0.994651        True      False
100    33     13    0.994643       False      False
101    33     25    0.994641       False      False
102    34     34    0.994647        True      False
103    34     17    0.014272       False      False
104    34     30    0.014205       False      False
105    35     36    0.994635        True       True
106    35     35    0.994635        True      False
107    35     18    0.994635       False      False