Similarity Encoding
Similarity encoding module.
The module provides functionality to store and manage similarity encoders.
- class neer_match.similarity_encoding.SimilarityEncoder(similarity_map)
Similarity encoder class.
The class creates a similarity encoder from a similarity map. It can be used to encode pairs of records from two datasets.
- similarity_map
The similarity map object.
- Type:
SimilarityMap
- scalls
The similarity function names.
- Type:
list[str]
- no_scalls
The number of similarity calls (field pairs and similarities).
- Type:
int
- no_assoc
The number of associations (field pairs).
- Type:
int
- assoc_begin
The beginning offsets of the associations.
- Type:
numpy.ndarray
- assoc_sizes
The sizes (number of used similarities) of the associations.
- Type:
numpy.ndarray
- assoc_end
The ending indices of the associations.
- Type:
numpy.ndarray
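The relationship between the counting and offset attributes above can be illustrated with a small NumPy sketch. The `field_pairs` dictionary and the similarity names in it are hypothetical, and the sketch assumes that `assoc_end` is an exclusive end (`assoc_begin + assoc_sizes`). The documentation does not say whether the ending indices are inclusive, so this is one plausible layout rather than the library's implementation.

```python
import numpy as np

# Hypothetical associations: each field pair maps to the names of the
# similarity functions applied to it (illustrative names only).
field_pairs = {
    ("title", "title"): ["jaro_winkler", "levenshtein"],
    ("year", "year"): ["discrete"],
    ("author", "writer"): ["jaro", "jaro_winkler", "levenshtein"],
}

# Number of similarities used by each association.
assoc_sizes = np.array([len(sims) for sims in field_pairs.values()])

no_assoc = len(field_pairs)        # number of field pairs
no_scalls = int(assoc_sizes.sum())  # number of (field pair, similarity) calls

# Assumption: end offsets are exclusive, so begin + size == end.
assoc_end = np.cumsum(assoc_sizes)
assoc_begin = assoc_end - assoc_sizes

# Flat list of similarity names, in association order.
scalls = [name for sims in field_pairs.values() for name in sims]

# The offsets recover each association's slice of the flat list.
per_assoc = [scalls[b:e] for b, e in zip(assoc_begin, assoc_end)]
```

Under these assumptions, the offsets let a flat list of similarity calls be split back into one group per field pair.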
- __init__(similarity_map)
Initialize a similarity encoder object.
- Parameters:
similarity_map (SimilarityMap) – The similarity map.
- encode_as_matrix(left, right)
Encode a pair of records as a matrix.
Calculate the similarities for each association (field pair) and return them all stacked together in a matrix (i.e., the similarity matrix).
- Parameters:
left (DataFrame) – The left dataset.
right (DataFrame) – The right dataset.
- Return type:
ndarray
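The idea of computing every similarity for every association and stacking the results can be sketched as follows. This is not the library's code. The functions `exact` and `ratio` are hypothetical stand-ins for real similarity functions, the `associations` list is invented for illustration, and the sketch pairs records by position, which is an assumption because the description above does not state how records from the two datasets are paired.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def exact(a, b):
    """Hypothetical similarity: 1.0 if the values are equal, else 0.0."""
    return float(a == b)


def ratio(a, b):
    """Hypothetical similarity: difflib's sequence ratio on the string forms."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


# Hypothetical associations: (left field, right field, similarity functions).
associations = [
    ("title", "title", [exact, ratio]),
    ("year", "year", [exact]),
]


def encode_as_matrix_sketch(left, right):
    """Return one row per record pair and one column per similarity call."""
    columns = []
    for left_field, right_field, similarities in associations:
        for similarity in similarities:
            # Assumption: records are paired by position, not by cross product.
            pairs = zip(left[left_field], right[right_field])
            columns.append([similarity(a, b) for a, b in pairs])
    return np.column_stack(columns)


left = pd.DataFrame({"title": ["Dune", "Emma"], "year": [1965, 1815]})
right = pd.DataFrame({"title": ["Dune", "Emma."], "year": [1965, 1816]})
matrix = encode_as_matrix_sketch(left, right)
```

Here `matrix` has two rows (one per record pair) and three columns (two similarities for the title pair, one for the year pair).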
- encoded_shape(batch_size=-1)
Return the shape of the encoded data.
- Return type:
List[Tuple[int, int]]
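The source gives only the signature and return type of `encoded_shape`. One plausible reading, offered purely as an assumption, is that the list holds one `(batch_size, number_of_similarities)` tuple per association, with `-1` standing for an unspecified batch size. The helper `encoded_shape_sketch` below is hypothetical and only illustrates that reading.

```python
def encoded_shape_sketch(assoc_sizes, batch_size=-1):
    """Return one (batch_size, size) tuple per association.

    Assumption: each association contributes a 2-D block whose second
    dimension equals the number of similarities used for that field pair.
    """
    return [(batch_size, int(size)) for size in assoc_sizes]


# Three hypothetical associations using 2, 1, and 3 similarities.
shapes = encoded_shape_sketch([2, 1, 3])
batched = encoded_shape_sketch([2, 1, 3], batch_size=32)
```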
- report_encoding(left, right)
Report encoding of a pair of records.
Calculate the similarities for each association (field pair) in the similarity map and return them in a list of data frames. The function expects that the left and right datasets have the same number of records. It does not operate on the cross product of the records, but rather on the records at the same position in both datasets.
- Parameters:
left (Union[Series, DataFrame]) – The left dataset.
right (Union[Series, DataFrame]) – The right dataset.
- Return type:
List[DataFrame]
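The positional pairing described for `report_encoding` can be sketched with pandas alone. The similarity functions and the `associations` mapping below are hypothetical, and the sketch handles only `DataFrame` inputs, not the `Series` case. It shows the general shape of the result (one data frame per association, one column per similarity) under those assumptions and is not the library's implementation.

```python
import pandas as pd


def exact(a, b):
    """Hypothetical similarity: 1.0 if the values are equal, else 0.0."""
    return float(a == b)


def same_length(a, b):
    """Hypothetical similarity: 1.0 if the string forms have equal length."""
    return float(len(str(a)) == len(str(b)))


# Hypothetical associations: field pair -> named similarity functions.
associations = {
    ("title", "title"): {"exact": exact, "same_length": same_length},
    ("year", "year"): {"exact": exact},
}


def report_encoding_sketch(left, right):
    """Return one data frame per association, pairing records by position."""
    if len(left) != len(right):
        raise ValueError("left and right must have the same number of records")
    reports = []
    for (left_field, right_field), similarities in associations.items():
        data = {
            name: [fn(a, b) for a, b in zip(left[left_field], right[right_field])]
            for name, fn in similarities.items()
        }
        reports.append(pd.DataFrame(data))
    return reports


left = pd.DataFrame({"title": ["Dune", "Emma"], "year": [1965, 1815]})
right = pd.DataFrame({"title": ["Dune", "Emily"], "year": [1965, 1816]})
reports = report_encoding_sketch(left, right)
```

Row `i` of each report compares record `i` of `left` with record `i` of `right`, so no cross product is formed, and inputs of different lengths are rejected.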