Data Generator

Entity matching data generator module.

This module provides data generation functionality for entity matching tasks.

class neer_match.data_generator.DataGenerator(similarity_map, left, right, matches=None, batch_size=32, mismatch_share=0.1, shuffle=False)

Data generator class.

The class provides a data generator for entity matching tasks. It inherits from the tf.keras.utils.Sequence class. Instances generate batches of similarities for the associated fields of two records in the cross product of the left and right data frames. The cross product is not explicitly computed. Instead, instances emulate it using indexing calculations.
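
The index emulation described above can be illustrated with a minimal sketch. It assumes a row-major layout, in which a flat pair index is split into a left row and a right row with integer division. The function name pair_from_flat_index and the row-major ordering are assumptions made for illustration, not the library's confirmed implementation.

```python
def pair_from_flat_index(flat_index, n_right):
    """Map a flat index of the emulated cross product to (left_row, right_row).

    Assumption: pairs are ordered row-major, i.e. all right records are
    enumerated for left row 0, then for left row 1, and so on.
    """
    return divmod(flat_index, n_right)


# Toy stand-ins for the rows of the left and right data frames.
left = ["a", "b", "c"]
right = ["x", "y"]

# Enumerate every pair without materializing the cross product itself.
pairs = [
    pair_from_flat_index(i, len(right))
    for i in range(len(left) * len(right))
]
```

With three left records and two right records, the six flat indices cover every (left, right) combination exactly once, so only the two original frames need to be kept in memory.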

left

The left DataFrame.

Type:

pandas.DataFrame

right

The right DataFrame.

Type:

pandas.DataFrame

matches

The matches DataFrame.

Type:

pandas.DataFrame

batch_size

Batch size.

Type:

int

mismatch_share

Mismatch share.

Type:

float

shuffle

Shuffle flag.

Type:

bool

full_size

The number of potential record pairs.

Type:

int

used_size

The number of record pairs actually used.

Type:

int

no_used_mismatches_per_match

The number of used mismatches per match.

Type:

int

no_batches

The number of batches per epoch.

Type:

int

last_batch_size

The size of the last batch.

Type:

int

similarity_map

The similarity map.

Type:

SimilarityMap

similarity_encoder

The similarity encoder.

Type:

SimilarityEncoder

indices

The used indices for the record pairs.

Type:

numpy.ndarray

__getitem__(index)

Get the batch at the given index.

Return type:

Union[dict, Tuple[dict, ndarray]]
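
The return type suggests that a batch is either a dictionary of features alone or a tuple of features and labels. A minimal sketch of that shape follows, assuming that labels are returned only when match information is available. The helper get_batch, the "pair_index" key, and the labels argument are hypothetical stand-ins and not part of the documented API.

```python
def get_batch(indices, batch_size, index, labels=None):
    """Return the batch at position `index` (hypothetical helper).

    Assumption: a batch is a contiguous slice of the used pair indices.
    The features dictionary stands in for the per-field similarities.
    """
    start = index * batch_size
    batch_indices = indices[start:start + batch_size]
    features = {"pair_index": batch_indices}
    if labels is None:
        # No match information: return the features only.
        return features
    # Match information available: return features together with labels.
    return features, [labels[i] for i in batch_indices]


indices = list(range(5))
features_only = get_batch(indices, batch_size=2, index=2)
features_and_labels = get_batch(
    indices, batch_size=2, index=0, labels=[1, 0, 0, 1, 0]
)
```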

__init__(similarity_map, left, right, matches=None, batch_size=32, mismatch_share=0.1, shuffle=False)

Initialize a data generator object.

Prepare the indexing variables that are used in the data generation process.

Parameters:
  • similarity_map (SimilarityMap) – A similarity map object.

  • left (DataFrame) – The left DataFrame.

  • right (DataFrame) – The right DataFrame.

  • matches (Optional[DataFrame]) – The matches DataFrame.

  • batch_size (int) – Batch size.

  • mismatch_share (float) – Mismatch share.

  • shuffle (bool) – Shuffle flag.

__len__()

Return the number of batches per epoch.

Return type:

int
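
The relation between used_size, batch_size, no_batches, and last_batch_size can be sketched with ceiling division. This is an assumption about how the attributes fit together, with a possibly smaller final batch; the helper batch_layout is hypothetical.

```python
import math


def batch_layout(used_size, batch_size):
    """Return (no_batches, last_batch_size) for the given sizes.

    Assumption: every batch is full except possibly the last one.
    """
    if used_size == 0:
        return 0, 0
    no_batches = math.ceil(used_size / batch_size)
    last_batch_size = used_size - (no_batches - 1) * batch_size
    return no_batches, last_batch_size


partial_last = batch_layout(70, 32)  # two full batches plus one of six pairs
full_last = batch_layout(64, 32)     # two full batches
```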

__str__()

Return a string representation of the data generator.

Return type:

str

no_matches()

Return the number of matches.

Return type:

int

no_mismatches()

Return the number of mismatches.

Return type:

int

no_pairs()

Return the number of record pairs.

Return type:

int
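
The three counting methods are plausibly related as follows, assuming that no_pairs() is the size of the emulated cross product and that every pair that is not a match counts as a mismatch. Both points are assumptions, and the helper pair_counts is hypothetical.

```python
def pair_counts(n_left, n_right, n_matches):
    """Return (no_pairs, no_matches, no_mismatches) under the stated assumptions."""
    # Size of the cross product of the left and right records.
    no_pairs = n_left * n_right
    # Every pair that is not a match is treated as a mismatch.
    no_mismatches = no_pairs - n_matches
    return no_pairs, n_matches, no_mismatches


counts = pair_counts(n_left=3, n_right=2, n_matches=2)
```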

on_epoch_end()

Shuffle the indices at the end of each epoch if the shuffle flag is set.

Return type:

None
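
A conditional shuffle of this kind can be sketched in plain Python. The class IndexHolder and its seed argument are hypothetical and only illustrate the pattern: the index order changes between epochs when shuffle is true and stays fixed otherwise.

```python
import random


class IndexHolder:
    """Hypothetical stand-in for the index bookkeeping of a data generator."""

    def __init__(self, size, shuffle=False, seed=None):
        self.indices = list(range(size))
        self.shuffle = shuffle
        self._rng = random.Random(seed)

    def on_epoch_end(self):
        # Reorder the indices in place only when shuffling is enabled.
        if self.shuffle:
            self._rng.shuffle(self.indices)


shuffled = IndexHolder(10, shuffle=True, seed=0)
shuffled.on_epoch_end()

fixed = IndexHolder(10)
fixed.on_epoch_end()
```

Shuffling changes only the order in which pairs are visited; the set of used indices is the same in every epoch.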