Similarity Encoding

Similarity encoding module.

The module provides functionality to store and manage similarity encoders.

class neer_match.similarity_encoding.SimilarityEncoder(similarity_map)

Similarity encoder class.

The class creates a similarity encoder from a similarity map. It can be used to encode pairs of records from two datasets.

similarity_map

The similarity map object.

Type:

SimilarityMap

scalls

The similarity function names.

Type:

list[str]

no_scalls

The number of similarity calls (field pairs and similarities).

Type:

int

no_assoc

The number of associations (field pairs).

Type:

int

assoc_begin

The beginning offsets of the associations.

Type:

numpy.ndarray

assoc_sizes

The sizes (number of used similarities) of the associations.

Type:

numpy.ndarray

assoc_end

The ending offsets of the associations.

Type:

numpy.ndarray
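The relationship between the offset attributes can be sketched in plain Python. This is an illustration only, not the library's implementation: it assumes that associations occupy consecutive slots (so each association ends where the next begins), it uses lists instead of numpy arrays, and the field and similarity names in `example_map` are hypothetical.

```python
from itertools import accumulate

# Hypothetical similarity map: each association (field pair) lists the
# similarity functions applied to it. All names are illustrative only.
example_map = {
    ("title", "title"): ["jaro_winkler", "levenshtein"],
    ("year", "year"): ["euclidean"],
    ("author", "creator"): ["jaro", "jaro_winkler", "lcsseq"],
}

# Number of similarities used by each association.
assoc_sizes = [len(sims) for sims in example_map.values()]

# Assumed layout: consecutive slots, so the end offsets are the running
# totals of the sizes and each begin offset is end minus size.
assoc_end = list(accumulate(assoc_sizes))
assoc_begin = [end - size for end, size in zip(assoc_end, assoc_sizes)]

no_assoc = len(assoc_sizes)   # number of field pairs
no_scalls = sum(assoc_sizes)  # number of (field pair, similarity) calls

print(assoc_sizes)  # [2, 1, 3]
print(assoc_begin)  # [0, 2, 3]
print(assoc_end)    # [2, 3, 6]
```

Under this assumed layout, the similarities of association `i` would sit in the half-open range from `assoc_begin[i]` to `assoc_end[i]`.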

__init__(similarity_map)

Initialize a similarity encoder object.

Parameters:

similarity_map (SimilarityMap) – The similarity map.

encode_as_matrix(left, right)

Encode a pair of records as a matrix.

Calculate the similarities for each association (field pair) and return them stacked in a single matrix (the similarity matrix).

Parameters:
  • left (DataFrame) – The left dataset.

  • right (DataFrame) – The right dataset.

Return type:

ndarray
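The idea of stacking similarities into one matrix can be sketched without the library. The code below is not the method's implementation: `ratio`, `exact`, `associations`, and `encode_rows` are hypothetical stand-ins, records are plain dictionaries rather than data frame rows, and pairing records by position is an assumption of this sketch.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in string similarity in [0, 1] (not a neer_match function)."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def exact(a, b):
    """Stand-in similarity: 1.0 if the values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical associations: (left field, right field, similarity functions).
associations = [
    ("title", "title", [ratio, exact]),
    ("year", "year", [exact]),
]

left = [{"title": "Dune", "year": 1965}, {"title": "Emma", "year": 1815}]
right = [{"title": "Dune", "year": 1965}, {"title": "Emma.", "year": 1816}]


def encode_rows(left, right):
    """Return one row per record pair and one column per similarity call."""
    matrix = []
    for l_rec, r_rec in zip(left, right):
        row = [
            sim(l_rec[l_field], r_rec[r_field])
            for l_field, r_field, sims in associations
            for sim in sims
        ]
        matrix.append(row)
    return matrix


matrix = encode_rows(left, right)
print(len(matrix), len(matrix[0]))  # 2 3
print(matrix[0])  # [1.0, 1.0, 1.0]
```

The number of columns equals the total number of similarity calls across all associations, which is the quantity the class documents as no_scalls.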

encoded_shape(batch_size=-1)

Return the shape of the encoded data.

Return type:

List[Tuple[int, int]]

report_encoding(left, right)

Report encoding of a pair of records.

Calculate the similarities for each association (field pair) in the similarity map and return them in a list of data frames. The function expects the left and right datasets to have the same number of records. It does not operate on the cross product of the records; it compares the records at the same position in both datasets.

Parameters:
  • left (Union[Series, DataFrame]) – The left dataset.

  • right (Union[Series, DataFrame]) – The right dataset.

Return type:

List[DataFrame]
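The difference between positional pairing and the cross product can be illustrated with a short, library-independent sketch. The `exact` function and the sample values are hypothetical, and lists stand in for the Series or DataFrame inputs.

```python
from itertools import product

left_titles = ["Dune", "Emma", "Ulysses"]
right_titles = ["Dune", "Emma.", "Ulises"]


def exact(a, b):
    """Stand-in similarity: 1.0 if the values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


# Positional pairing requires inputs of equal length.
if len(left_titles) != len(right_titles):
    raise ValueError("left and right must have the same number of records")

# Record i on the left is compared only with record i on the right.
positional = [exact(a, b) for a, b in zip(left_titles, right_titles)]

# For contrast: the cross product compares every left record with every
# right record, which is what the positional variant avoids.
cross = [exact(a, b) for a, b in product(left_titles, right_titles)]

print(positional)                   # [1.0, 0.0, 0.0]
print(len(positional), len(cross))  # 3 9
```

With n records on each side, positional pairing produces n comparisons per similarity, whereas the cross product produces n squared.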