Data Generator

Entity matching data generator module.

This module provides data generation functionality for entity matching tasks.

class neer_match.data_generator.DataGenerator(similarity_map, left, right, matches=None, batch_size=32, mismatch_share=0.1, shuffle=False)

Data generator class.

The class provides a data generator for entity matching tasks. It inherits from the tf.keras.utils.Sequence class. Instances generate batches of similarities for the associated fields of two records in the cross product of the left and right data frames. The cross product is not explicitly computed. Instead, instances emulate it using indexing calculations.
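The indexing idea can be illustrated with a minimal sketch. It assumes a row-major layout, in which the flat pair index k combines left row k // len(right) with right row k % len(right). The helper name pair_from_flat_index is hypothetical, plain lists stand in for the data frames, and the library's actual index layout may differ.

```python
def pair_from_flat_index(k, n_right):
    """Map a flat pair index to (left_row, right_row).

    Hypothetical helper; assumes a row-major layout of the cross product.
    """
    return divmod(k, n_right)


# Plain lists stand in for the left and right data frames.
left = ["ann", "bob", "cy"]
right = ["anne", "rob"]

# Number of potential record pairs; the pairs themselves are never stored.
full_size = len(left) * len(right)

# Each pair is recovered on demand from its flat index.
pairs = [pair_from_flat_index(k, len(right)) for k in range(full_size)]
```

Because a pair is derived from its index only when needed, memory use grows with the batch size rather than with the size of the cross product.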

left

The left DataFrame.

Type: pandas.DataFrame

right

The right DataFrame.

Type: pandas.DataFrame

matches

The matches DataFrame.

Type: pandas.DataFrame

batch_size

The number of record pairs per batch.

Type: int

mismatch_share

The share of mismatches.

Type: float

shuffle

Whether to shuffle the indices at the end of each epoch.

Type: bool

full_size

The number of potential record pairs.

Type: int

used_size

The number of record pairs that are used.

Type: int

no_used_mismatches_per_match

The number of used mismatches per match.

Type: int

no_batches

The number of batches per epoch.

Type: int

last_batch_size

The size of the last batch.

Type: int
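The relationship between the used size, the batch size, the number of batches, and the size of the last batch can be sketched as follows. This is an illustration under the assumption that a final partial batch is kept rather than dropped; the function name batch_layout is hypothetical and is not part of the library.

```python
import math


def batch_layout(used_size, batch_size):
    """Return (no_batches, last_batch_size) for the given sizes.

    Hypothetical helper; assumes the last, possibly smaller, batch is kept.
    """
    if used_size == 0:
        return 0, 0
    no_batches = math.ceil(used_size / batch_size)
    # Whatever remains after the full batches forms the last batch.
    last_batch_size = used_size - (no_batches - 1) * batch_size
    return no_batches, last_batch_size


partial = batch_layout(used_size=70, batch_size=32)
exact = batch_layout(used_size=64, batch_size=32)
```

With 70 used pairs and a batch size of 32, the sketch yields two full batches and a last batch of 6 pairs. When the used size is a multiple of the batch size, the last batch is a full one.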

similarity_map

The similarity map.

Type: SimilarityMap

similarity_encoder

The similarity encoder.

Type: SimilarityEncoder

indices

The used indices for the record pairs.

Type: numpy.ndarray

__getitem__(index)

Get the batch at the given index.

Return type: Union[dict, Tuple[dict, ndarray]]
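One plausible way to select the record pairs of a batch is to slice the stored indices by batch position. The sketch below shows only that slicing step, under the assumption that batches are consecutive, non-overlapping windows over the indices. The function name batch_indices is hypothetical, and a plain list stands in for the numpy.ndarray that the class stores. Computing the similarities and any labels is left out.

```python
def batch_indices(indices, index, batch_size):
    """Return the pair indices that belong to batch number `index`.

    Hypothetical helper; assumes consecutive, non-overlapping batches.
    """
    start = index * batch_size
    # Slicing past the end simply yields a shorter last batch.
    return indices[start:start + batch_size]


indices = list(range(10))
first_batch = batch_indices(indices, 0, 4)
last_batch = batch_indices(indices, 2, 4)
```

Here the last batch holds only the two remaining indices, which corresponds to a last batch that is smaller than the batch size.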

__init__(similarity_map, left, right, matches=None, batch_size=32, mismatch_share=0.1, shuffle=False)

Initialize a data generator object.

Prepare the indexing variables that are used in the data generation process.

Parameters:

similarity_map (SimilarityMap) – A similarity map object.
left (DataFrame) – The left DataFrame.
right (DataFrame) – The right DataFrame.
matches (DataFrame) – The matches DataFrame.
batch_size (int) – Batch size.
mismatch_share (float) – Mismatches share.
shuffle (bool) – Shuffle flag.

__len__()

Return the number of batches per epoch.

Return type: int

__str__()

Return a string representation of the data generator.

Return type: str

no_matches()

Return the number of matches.

Return type: int

no_mismatches()

Return the number of mismatches.

Return type: int

no_pairs()

Return the number of record pairs.

Return type: int

on_epoch_end()

Shuffle the indices at the end of each epoch if the shuffle flag is set.

Return type: None
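The conditional shuffle can be sketched with a small stand-in class. This is an illustration only: the class name IndexHolder and the seed argument are hypothetical, the standard library random module replaces whatever the library uses internally, and a plain list stands in for the numpy.ndarray of indices.

```python
import random


class IndexHolder:
    """Hypothetical stand-in that keeps only the index bookkeeping."""

    def __init__(self, size, shuffle=False, seed=0):
        self.indices = list(range(size))
        self.shuffle = shuffle
        self._rng = random.Random(seed)

    def on_epoch_end(self):
        # Reorder the indices in place only when shuffling is requested.
        if self.shuffle:
            self._rng.shuffle(self.indices)


fixed = IndexHolder(5, shuffle=False)
fixed.on_epoch_end()

mixed = IndexHolder(5, shuffle=True)
mixed.on_epoch_end()
```

Without shuffling, every epoch visits the pairs in the same order. With shuffling, the same set of indices is visited in a new order, so the composition of the batches changes between epochs.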