Similarity Encoding

Similarity encoding module.

The module provides functionality to store and manage similarity encoders.

class neer_match.similarity_encoding.SimilarityEncoder(similarity_map)

Similarity encoder class.

The class creates a similarity encoder from a similarity map. It can be used to encode pairs of records from two datasets.

similarity_map

The similarity map object.

Type: SimilarityMap

scalls

The similarity function names.

Type: list[str]

no_scalls

The number of similarity calls (field pairs and similarities).

Type: int

no_assoc

The number of associations (field pairs).

Type: int

assoc_begin

The beginning offsets of the associations.

Type: numpy.ndarray

assoc_sizes

The sizes (number of used similarities) of the associations.

Type: numpy.ndarray

assoc_end

The ending indices of the associations.

Type: numpy.ndarray
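The relationship between the offset attributes can be illustrated with a small sketch that does not use the library. It assumes that the similarity calls are laid out contiguously, one association after another, and that the ending indices are exclusive (beginning offset plus size); neither assumption is confirmed by this page. The dictionary, the field names, and the similarity names are placeholders, and plain lists stand in for the NumPy arrays.

```python
from itertools import accumulate

# Hypothetical similarity map: field pair -> similarity function names.
# The names are placeholders, not a statement about the library's catalogue.
similarity_map = {
    ("title", "title"): ["jaro_winkler", "levenshtein"],
    ("year", "year"): ["gaussian"],
    ("author", "author"): ["jaro", "discrete", "levenshtein"],
}

# One size per association: how many similarities it uses.
assoc_sizes = [len(names) for names in similarity_map.values()]

# Beginning offsets: running sum of the sizes, starting at 0.
assoc_begin = [0] + list(accumulate(assoc_sizes))[:-1]

# Assumption: ending indices are exclusive, i.e. begin + size.
assoc_end = [b + s for b, s in zip(assoc_begin, assoc_sizes)]

# Flat list of similarity names, one entry per similarity call.
scalls = [name for names in similarity_map.values() for name in names]
no_assoc = len(similarity_map)
no_scalls = len(scalls)

# Slice the flat list back into per-association groups.
for (left_field, right_field), b, e in zip(similarity_map, assoc_begin, assoc_end):
    print(left_field, right_field, scalls[b:e])
```

Under these assumptions, the calls belonging to association i are scalls[assoc_begin[i]:assoc_end[i]], and no_scalls equals the sum of assoc_sizes.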

__init__(similarity_map)

Initialize a similarity encoder object.

Parameters: similarity_map (SimilarityMap) – The similarity map.

encode_as_matrix(left, right)

Encode a pair of records as a matrix.

Calculate the similarities for each association (field pair) and return them all stacked together in a matrix (i.e., the similarity matrix).

Parameters:

left (DataFrame) – The left dataset.
right (DataFrame) – The right dataset.

Return type:

ndarray
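The stacking idea can be sketched without the library as follows. This is an illustration under stated assumptions, not the method's implementation: it assumes one row per record pair and one column per similarity call, and it pairs records by position, which this page does not confirm for encode_as_matrix. The similarity functions, the associations list, and stack_similarities are hypothetical, and nested lists stand in for the returned ndarray.

```python
# Hypothetical similarity functions standing in for the library's own.
def exact(a, b):
    return 1.0 if a == b else 0.0

def exact_ignore_case(a, b):
    return 1.0 if str(a).lower() == str(b).lower() else 0.0

# Hypothetical associations: (left field, right field, similarity functions).
associations = [
    ("name", "name", [exact, exact_ignore_case]),
    ("year", "year", [exact]),
]

left = [{"name": "Ada", "year": 1815}, {"name": "Alan", "year": 1912}]
right = [{"name": "ada", "year": 1815}, {"name": "Alan", "year": 1913}]

def stack_similarities(left, right):
    """Return one row per record pair and one column per similarity call."""
    rows = []
    # Assumption: records are paired by position.
    for l_rec, r_rec in zip(left, right):
        row = []
        for l_field, r_field, sims in associations:
            # Columns of one association are placed next to each other.
            row.extend(sim(l_rec[l_field], r_rec[r_field]) for sim in sims)
        rows.append(row)
    return rows

matrix = stack_similarities(left, right)
print(matrix)  # [[0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
```

Each row here has three columns: two for the name association and one for the year association.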

encoded_shape(batch_size=-1)

Return the shape of the encoded data.

Return type: List[Tuple[int, int]]

report_encoding(left, right)

Report encoding of a pair of records.

Calculate the similarities for each association (field pair) in the similarity map and return them in a list of data frames. The function expects that the left and right datasets have the same number of records. It does not operate on the cross product of the records, but rather on the records at the same position in both datasets.

Parameters:

left (Union[Series, DataFrame]) – The left dataset.
right (Union[Series, DataFrame]) – The right dataset.

Return type:

List[DataFrame]
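The difference between position-wise pairing and the cross product can be shown with a short sketch that does not use the library. The exact function is a hypothetical stand-in for a similarity, and plain lists stand in for the Series or DataFrame inputs.

```python
from itertools import product

# Hypothetical similarity: 1.0 on an exact match, 0.0 otherwise.
def exact(a, b):
    return 1.0 if a == b else 0.0

left = ["Ada", "Alan", "Grace"]
right = ["Ada", "Allan", "Grace"]

# Position-wise pairing requires both sides to have the same length.
if len(left) != len(right):
    raise ValueError("left and right must have the same number of records")

# Record i on the left is compared only with record i on the right.
positionwise = [exact(l, r) for l, r in zip(left, right)]

# The cross product compares every left record with every right record.
cross = [exact(l, r) for l, r in product(left, right)]

print(positionwise)  # [1.0, 0.0, 1.0]
print(len(cross))    # 9
```

With n records on each side, position-wise pairing yields n comparisons per similarity, while the cross product yields n * n.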