Data Generator
Entity matching data generator module.
This module provides data generation functionality for entity matching tasks.
- class neer_match.data_generator.DataGenerator(similarity_map, left, right, matches=None, batch_size=32, mismatch_share=0.1, shuffle=False)
Data generator class.
The class provides a data generator for entity matching tasks. It inherits from the tf.keras.utils.Sequence class. Instances generate batches of similarities for the associated fields of two records in the cross product of the left and right data frames. The cross product is not explicitly computed. Instead, instances emulate it using indexing calculations.
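The idea of emulating the cross product through index calculations can be illustrated with a minimal sketch. It assumes a row-major ordering of the pairs (all right records for the first left record, then all right records for the second, and so on). The helper name pair_from_flat_index is hypothetical, and the ordering actually used by DataGenerator may differ.

```python
def pair_from_flat_index(flat_index: int, n_right: int) -> tuple[int, int]:
    """Map a flat cross-product index to (left_row, right_row).

    Assumes row-major ordering; this is an illustrative choice,
    not necessarily the ordering used by the library.
    """
    return divmod(flat_index, n_right)


# Stand-ins for the row counts of the left and right data frames.
n_left, n_right = 2, 3

# The number of potential record pairs, computed without building them.
full_size = n_left * n_right

# Any pair can be recovered on demand from its flat index.
pairs = [pair_from_flat_index(i, n_right) for i in range(full_size)]
```

Because each pair is derived from a single integer, only the selected flat indices need to be stored, which keeps memory use proportional to the number of pairs used rather than to the size of the full cross product.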
- left
The left DataFrame.
- Type:
pandas.DataFrame
- right
The right DataFrame.
- Type:
pandas.DataFrame
- matches
The matches DataFrame.
- Type:
pandas.DataFrame
- batch_size
Batch size.
- Type:
int
- mismatch_share
Mismatch share.
- Type:
float
- shuffle
Shuffle flag.
- Type:
bool
- full_size
The number of potential record pairs.
- Type:
int
- used_size
The number of record pairs that are used.
- Type:
int
- no_used_mismatches_per_match
The number of mismatches used per match.
- Type:
int
- no_batches
The number of batches per epoch.
- Type:
int
- last_batch_size
The size of the last batch.
- Type:
int
- similarity_map
The similarity map.
- Type:
SimilarityMap
- similarity_encoder
The similarity encoder.
- Type:
- indices
The used indices for the record pairs.
- Type:
numpy.ndarray
- __getitem__(index)
Get the batch at the given index.
- Return type:
Union[dict, Tuple[dict, ndarray]]
- __init__(similarity_map, left, right, matches=None, batch_size=32, mismatch_share=0.1, shuffle=False)
Initialize a data generator object.
Prepare the indexing variables that are used in the data generation process.
- Parameters:
similarity_map (SimilarityMap) – A similarity map object.
left (DataFrame) – The left DataFrame.
right (DataFrame) – The right DataFrame.
matches (Optional[DataFrame]) – The matches DataFrame.
batch_size (int) – Batch size.
mismatch_share (float) – Mismatch share.
shuffle (bool) – Shuffle flag.
- __len__()
Return the number of batches per epoch.
- Return type:
int
- __str__()
Return a string representation of the data generator.
- Return type:
str
- no_matches()
Return the number of matches.
- Return type:
int
- no_mismatches()
Return the number of mismatches.
- Return type:
int
- no_pairs()
Return the number of record pairs.
- Return type:
int
- on_epoch_end()
Shuffle the indices at the end of each epoch if the shuffle flag is set.
- Return type:
None
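The relationship between used_size, batch_size, no_batches, and last_batch_size, together with the optional end-of-epoch shuffle, can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's implementation: it assumes the last batch is simply shorter when used_size is not a multiple of batch_size, and the helper names batch_layout and batch_indices are hypothetical.

```python
import math
import random


def batch_layout(used_size: int, batch_size: int) -> tuple[int, int]:
    """Return (no_batches, last_batch_size) for the given sizes.

    Assumes a shorter final batch rather than padding or dropping it.
    """
    if used_size == 0:
        return 0, 0
    no_batches = math.ceil(used_size / batch_size)
    last_batch_size = used_size - (no_batches - 1) * batch_size
    return no_batches, last_batch_size


def batch_indices(indices: list[int], index: int, batch_size: int) -> list[int]:
    """Return the slice of pair indices that belongs to batch `index`."""
    start = index * batch_size
    return indices[start:start + batch_size]


# 70 used pairs with the default batch size of 32.
indices = list(range(70))
no_batches, last_batch_size = batch_layout(len(indices), 32)

# Emulate an end-of-epoch shuffle when the shuffle flag is set.
shuffle = True
if shuffle:
    random.Random(0).shuffle(indices)

final_batch = batch_indices(indices, no_batches - 1, 32)
```

With 70 used pairs and a batch size of 32, this layout gives three batches, the last of which holds the remaining six pairs. Shuffling reorders which pairs land in which batch but leaves the batch count and sizes unchanged.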