Python API¶
Block Generator¶
Module that implements final block generation.
- blocklib.blocks_generator.check_block_object(candidate_block_objs: Sequence[blocklib.candidate_blocks_generator.CandidateBlockingResult])[source]¶
Check the type of each candidate block object and the type of its state.
- Raises
TypeError – if conditions aren’t met.
- Parameters
candidate_block_objs – A list of candidate block result objects from 2 data providers
- blocklib.blocks_generator.generate_blocks(candidate_block_objs: Sequence[blocklib.candidate_blocks_generator.CandidateBlockingResult], K: int) List[Dict[Any, List[Any]]] [source]¶
Generate final blocks from a list of candidate block objects supplied by two or more data providers.
- Parameters
candidate_block_objs – A list of CandidateBlockingResult from multiple data providers
K – the minimum number of parties in which a record must occur to be included in the final blocks
- Returns
A list of dictionaries from which records that appear in fewer than K parties have been filtered out
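The K filter described above can be illustrated with a minimal sketch. It assumes that "appearing in K parties" is decided per block key, and the helper name `filter_blocks_by_party_count` is hypothetical; this is not blocklib's implementation.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def filter_blocks_by_party_count(
    party_blocks: Sequence[Dict[Any, List[Any]]], k: int
) -> List[Dict[Any, List[Any]]]:
    """Keep only block keys that occur in at least k parties (hypothetical helper)."""
    # Each party contributes a key at most once, so this counts parties per key.
    key_counts = Counter(key for blocks in party_blocks for key in blocks)
    # Rebuild every party's block map without the keys that are too rare.
    return [
        {key: records for key, records in blocks.items() if key_counts[key] >= k}
        for blocks in party_blocks
    ]


party_a = {"b1": ["a1", "a2"], "b2": ["a3"]}
party_b = {"b1": ["b7"], "b3": ["b8"]}
filtered = filter_blocks_by_party_count([party_a, party_b], k=2)
print(filtered)
```

With two parties and `k=2`, only the block key shared by both parties survives in each party's map.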
- blocklib.blocks_generator.generate_blocks_psig(reversed_indices: Sequence[Dict], block_states: Sequence[blocklib.pprlpsig.PPRLIndexPSignature], threshold: int)[source]¶
Generate blocks for P-Sig
- Parameters
reversed_indices – A list of dictionaries where key is the block key and value is a list of record IDs.
block_states – A list of PPRLIndex objects that hold configuration of the blocking job
threshold – int; a pair is decided when the number of 1 bits in the Bloom filter is larger than or equal to threshold
- Returns
A list of dictionaries where blocks that don’t contain any matches are deleted
- blocklib.blocks_generator.generate_reverse_blocks(reversed_indices: Sequence[Dict])[source]¶
Invert a map from “blocks to records” to “records to blocks”.
- Parameters
reversed_indices – A list of dictionaries where key is the block key and value is a list of record IDs.
- Returns
A list of dictionaries where key is the record ID and value is a set of blocking keys the record belongs to.
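The inversion described above can be sketched in a few lines of plain Python. The helper name `invert_blocks` is hypothetical and the code is an illustration of the idea, not the library's source.

```python
from collections import defaultdict
from typing import Any, Dict, Sequence, Set


def invert_blocks(block_to_records: Dict[Any, Sequence[Any]]) -> Dict[Any, Set[Any]]:
    """Turn {block key: [record IDs]} into {record ID: {block keys}}."""
    record_to_blocks: Dict[Any, Set[Any]] = defaultdict(set)
    for block_key, record_ids in block_to_records.items():
        for record_id in record_ids:
            record_to_blocks[record_id].add(block_key)
    return dict(record_to_blocks)


# One reversed index per data provider.
reversed_indices = [{"b1": [0, 1], "b2": [1]}, {"b1": [5]}]
records_to_blocks = [invert_blocks(index) for index in reversed_indices]
print(records_to_blocks)
```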
Blocks¶
- class blocklib.candidate_blocks_generator.CandidateBlockingResult(blocking_result: blocklib.pprlindex.ReversedIndexResult, state: blocklib.pprlindex.PPRLIndex)[source]¶
Object for holding candidate blocking results.
- Variables
blocks – a dictionary that contains a mapping from the block ID to the record IDs in that block.
state – A PPRLIndex state that contains the configuration of blocking
stats – a dictionary containing the summary statistics of the generated blocks
- print_summary_statistics(output: typing.TextIO = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, round_ndigits: int = 4)[source]¶
Print the summary statistics of this candidate blocking result to ‘output’.
- Parameters
output – a file-like object to write to. Defaults to sys.stdout
round_ndigits – round floating point numbers to ndigits precision. Defaults to 4.
- blocklib.candidate_blocks_generator.generate_candidate_blocks(data: Sequence[Tuple[str, ...]], blocking_schema: Dict, header: Optional[List[str]] = None) blocklib.candidate_blocks_generator.CandidateBlockingResult [source]¶
- Parameters
data – list of tuples E.g. (‘0’, ‘Kenneth Bain’, ‘1964/06/17’, ‘M’)
blocking_schema – A description of how the signatures should be generated. See Blocking Schema
header – column names (optional). An exception should be raised if the blocking features are given as strings but header is None
- Returns
A 2-tuple containing a list of “signatures” per record in data and the internal state object from the signature generation (or None). Note that the signature above annotates the return value as a CandidateBlockingResult.
Base PPRL Index¶
- class blocklib.pprlindex.PPRLIndex(config: Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig])[source]¶
Base class for PPRL indexing/blocking.
- build_reversed_index(data: Sequence[Sequence], header: Optional[List[str]] = None)[source]¶
Build the index for the whole database.
- Parameters
data – list of tuples, PII dataset
header – file header, optional
- Return type
ReversedIndexResult
See derived classes for actual implementations.
- get_feature_to_index_map(data: Sequence[Sequence], header: Optional[List[str]] = None)[source]¶
Return feature name to feature index mapping if there is a header and feature is of type string.
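A minimal sketch of such a name-to-index mapping is shown below. The helper name `feature_to_index_map`, its `features` argument, and the choice to return None when all features are already integers are assumptions made for illustration, not the method's actual behaviour.

```python
from typing import Dict, List, Optional, Sequence, Union


def feature_to_index_map(
    features: Sequence[Union[int, str]], header: Optional[List[str]] = None
) -> Optional[Dict[str, int]]:
    """Map feature names to column indices (hypothetical helper)."""
    named = [f for f in features if isinstance(f, str)]
    if not named:
        return None  # integer features already are column indices
    if header is None:
        raise ValueError("header is required when features are given by name")
    # list.index raises ValueError if a name is missing from the header.
    return {name: header.index(name) for name in named}


header = ["id", "name", "dob", "gender"]
mapping = feature_to_index_map(["name", "dob"], header)
print(mapping)
```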
Signature Generator¶
- blocklib.signature_generator.generate_by_char_at(attr_ind: int, dtuple: Sequence, pos: List[Any])[source]¶
Generate signatures by selecting a subset of characters from the original feature.
>>> res = generate_by_char_at(2, ('harry potter', '4 Privet Drive', 'Little Whinging', 'Surrey'), [0, 3])
>>> assert res == 'Lt'
>>> res = generate_by_char_at(2, ('harry potter', '4 Privet Drive', 'Little Whinging', 'Surrey'), [":4"])
>>> assert res == 'Litt'
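The way `pos` mixes integer indices and slice strings can be illustrated with a small stand-alone sketch. The helper `chars_at` is hypothetical, and it assumes slice strings have the form "start:end" with either side optional; the library may accept other forms.

```python
from typing import List, Union


def chars_at(value: str, pos: List[Union[int, str]]) -> str:
    """Concatenate the characters of value selected by each entry of pos."""
    parts = []
    for p in pos:
        if isinstance(p, int):
            parts.append(value[p])
        elif ":" in p:
            # Parse "start:end"; an empty side leaves that end of the slice open.
            start, _, end = p.partition(":")
            lo = int(start) if start else None
            hi = int(end) if end else None
            parts.append(value[lo:hi])
        else:
            # A string without a colon is treated as a single index.
            parts.append(value[int(p)])
    return "".join(parts)


record = ("harry potter", "4 Privet Drive", "Little Whinging", "Surrey")
print(chars_at(record[2], [0, 3]))
print(chars_at(record[2], [":4"]))
```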
- blocklib.signature_generator.generate_by_feature_value(attr_ind: int, dtuple: Sequence)[source]¶
Generate signatures by simply returning the original feature at attr_ind.
- blocklib.signature_generator.generate_by_metaphone(attr_ind: int, dtuple: Sequence)[source]¶
Generate a phonetic encoding of features using metaphone.
>>> generate_by_metaphone(0, ('Smith', 'Schmidt', 2134))
'SM0XMT'
- blocklib.signature_generator.generate_signatures(signature_strategies: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]], dtuple: Sequence, feature_to_index: Optional[Dict[str, int]] = None)[source]¶
Generate signatures for one record.
- Parameters
signature_strategies – A list of PSigSignatureModel instances each describing a strategy to generate signatures.
dtuple – Raw data to generate signatures from
feature_to_index – Mapping from feature name to feature index
- Return signatures
set of str
P-Sig¶
- class blocklib.pprlpsig.PPRLIndexPSignature(config: Union[blocklib.validation.psig_validation.PSigConfig, Dict])[source]¶
Class that implements the PPRL indexing technique:
Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases.
This class includes an implementation of the P-Sig algorithm.
Configuration¶
- class blocklib.pprlpsig.PSigConfig(*, filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig], signatureSpecs: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]], **extra_data: Any)[source]¶
- blocking_filter: blocklib.validation.psig_validation.PSigBlockingBFFilterConfig¶
- class blocklib.validation.psig_validation.PSigBlockingBFFilterConfig(*, type: Literal['bloom filter'], **extra_data: Any)[source]¶
- bloom_filter_length: int¶
- compress_block_key: Optional[bool]¶
- number_of_hash_functions: int¶
- type: Literal['bloom filter']¶
- class blocklib.validation.psig_validation.PSigCharsAtSignatureConfig(*, pos: List[Union[blocklib.validation.constrained_types.PositiveInt, str]])[source]¶
- pos: List[Union[blocklib.validation.constrained_types.PositiveInt, str]]¶
- class blocklib.validation.psig_validation.PSigCharsAtSignatureSpec(*, type: typing.Literal[<PSigSignatureTypes.chars_at: 'characters-at'>], feature: typing.Union[int, str], config: blocklib.validation.psig_validation.PSigCharsAtSignatureConfig)[source]¶
- type: Literal[<PSigSignatureTypes.chars_at: 'characters-at'>]¶
- class blocklib.validation.psig_validation.PSigConfig(*, filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig], signatureSpecs: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]], **extra_data: Any)[source]¶
- blocking_features: Union[List[int], List[str]]¶
- blocking_filter: blocklib.validation.psig_validation.PSigBlockingBFFilterConfig¶
- filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig]¶
- record_id_column: Optional[int]¶
- class blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec(*, type: typing.Literal[<PSigSignatureTypes.feature_value: 'feature-value'>], feature: typing.Union[int, str])[source]¶
- type: Literal[<PSigSignatureTypes.feature_value: 'feature-value'>]¶
- class blocklib.validation.psig_validation.PSigFilterCountConfig(*, type: Literal['count'], max: blocklib.validation.constrained_types.PositiveInt, min: blocklib.validation.constrained_types.PositiveInt)[source]¶
- max: blocklib.validation.constrained_types.PositiveInt¶
- min: blocklib.validation.constrained_types.PositiveInt¶
- type: Literal['count']¶
- class blocklib.validation.psig_validation.PSigFilterRatioConfig(*, type: Literal['ratio'], max: blocklib.validation.constrained_types.UnitFloat, min: blocklib.validation.constrained_types.UnitFloat = 0.0)[source]¶
- max: blocklib.validation.constrained_types.UnitFloat¶
- min: blocklib.validation.constrained_types.UnitFloat¶
- type: Literal['ratio']¶
- class blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec(*, type: typing.Literal[<PSigSignatureTypes.metaphone: 'metaphone'>], feature: typing.Union[int, str])[source]¶
- type: Literal[<PSigSignatureTypes.metaphone: 'metaphone'>]¶
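To show how the configuration classes above nest, here is a hedged sketch of a P-Sig schema written as a plain Python dict. The keys mirror the attribute names listed in this reference; the key spelling accepted in an actual blocking schema may differ, and every value is an illustrative assumption rather than a recommended setting.

```python
# Hypothetical P-Sig schema; keys follow the attribute names in this reference.
psig_schema = {
    "type": "p-sig",
    "version": 1,
    "config": {
        "blocking_features": [1, 2],
        "record_id_column": 0,
        # One of the two filter kinds: 'ratio' (unit floats) or 'count' (positive ints).
        "filter": {"type": "ratio", "max": 0.02, "min": 0.0},
        "blocking_filter": {
            "type": "bloom filter",
            "number_of_hash_functions": 4,
            "bloom_filter_length": 2048,
            "compress_block_key": False,
        },
        # Each inner list is one signature strategy made of one or more specs.
        "signatureSpecs": [
            [
                {"type": "characters-at", "feature": 1, "config": {"pos": [":2"]}},
                {"type": "characters-at", "feature": 2, "config": {"pos": [1]}},
            ],
            [{"type": "metaphone", "feature": 1}],
            [{"type": "feature-value", "feature": 2}],
        ],
    },
}

# Summarise which spec types each strategy combines.
strategy_types = [
    [spec["type"] for spec in strategy]
    for strategy in psig_schema["config"]["signatureSpecs"]
]
print(strategy_types)
```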
Lambda Fold¶
- class blocklib.pprllambdafold.PPRLIndexLambdaFold(config: Union[blocklib.validation.lambda_fold_validation.LambdaConfig, Dict])[source]¶
Class that implements the PPRL indexing technique:
An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage.
This class includes an implementation of the Lambda-fold redundant blocking method.
Configuration¶
- class blocklib.pprllambdafold.LambdaConfig(*, Lambda: int, K: int, random_state: int, **extra_data: Any)[source]¶
- K: int¶
- Lambda: int¶
- block_encodings: bool¶
- bloom_filter_length: int¶
- number_of_hash_functions: int¶
- random_state: int¶
- class blocklib.validation.lambda_fold_validation.LambdaConfig(*, Lambda: int, K: int, random_state: int, **extra_data: Any)[source]¶
- K: int¶
- Lambda: int¶
- block_encodings: bool¶
- blocking_features: Union[List[int], List[str]]¶
- bloom_filter_length: int¶
- number_of_hash_functions: int¶
- random_state: int¶
- record_id_column: Optional[int]¶
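The roles of `Lambda`, `K` and `random_state` are not spelled out above. The sketch below assumes, in the style of Hamming LSH blocking, that `Lambda` is the number of redundant folds and `K` the number of Bloom filter bit positions sampled per fold. The helper `lambda_fold_block_keys` is hypothetical and is not the library's implementation.

```python
import random
from typing import List, Sequence


def lambda_fold_block_keys(
    bloom_filter: Sequence[int], Lambda: int, K: int, random_state: int
) -> List[str]:
    """Derive one block key per fold from K sampled bit positions (illustrative)."""
    # Seeding makes every record and every party sample the same positions.
    rng = random.Random(random_state)
    keys = []
    for fold in range(Lambda):
        positions = rng.sample(range(len(bloom_filter)), K)
        bits = "".join(str(bloom_filter[i]) for i in positions)
        # Prefix with the fold number so keys from different folds never collide.
        keys.append(f"{fold}_{bits}")
    return keys


bf = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
keys = lambda_fold_block_keys(bf, Lambda=3, K=4, random_state=0)
print(keys)
```

Because the positions depend only on `random_state`, two records with identical Bloom filters receive identical keys in every fold, while similar filters are likely to agree in at least one fold.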
Internal¶
blocklib uses pydantic for validation.
- class blocklib.validation.BlockingSchemaModel(*, version: int, type: blocklib.validation.BlockingSchemaTypes, config: Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig])[source]¶
- config: Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig]¶
- classmethod config_gen(config_to_validate, values) Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig] [source]¶
- version: int¶
- class blocklib.validation.BlockingSchemaTypes(value)[source]¶
An enumeration.
- lambdafold = 'lambda-fold'¶
- psig = 'p-sig'¶
- blocklib.validation.validate_blocking_schema(config: Dict) blocklib.validation.BlockingSchemaModel [source]¶
Validate blocking schema data with pydantic.
- Raises
ValueError – if passed an invalid config.
Encoding¶
Module that implements privacy-preserving encoding.
- blocklib.encoding.flip_bloom_filter(string: str, bf_len: int, num_hash_funct: int)[source]¶
Hash string and return indices of bits that have been flipped correspondingly.
- Parameters
string – str: the string to hash into the bloom filter
bf_len – int: length of bloom filter
num_hash_funct – int: number of hash functions
- Returns
bfset: a set of integers - indices that have been flipped to 1
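As an illustration of mapping one string to a set of bit positions, the sketch below uses double hashing over two digests. The choice of SHA-1 and SHA-256 and the helper name `bloom_indices` are assumptions; the hash scheme used by the library may differ.

```python
import hashlib
from typing import Set


def bloom_indices(string: str, bf_len: int, num_hash_funct: int) -> Set[int]:
    """Return the bit positions set for string, using double hashing (assumed scheme)."""
    data = string.encode("utf-8")
    h1 = int(hashlib.sha1(data).hexdigest(), 16)
    h2 = int(hashlib.sha256(data).hexdigest(), 16)
    # The i-th hash function is simulated as (h1 + i * h2) mod bf_len.
    return {(h1 + i * h2) % bf_len for i in range(num_hash_funct)}


indices = bloom_indices("smith", bf_len=1024, num_hash_funct=4)
print(sorted(indices))
```

The result is a set, so it can hold fewer than `num_hash_funct` entries when two hash functions land on the same position.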
- blocklib.encoding.generate_bloom_filter(list_of_strs: List[str], bf_len: int, num_hash_funct: int)[source]¶
Generate a bloom filter given list of strings.
- Parameters
return_cbf_index_sig_map –
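list_of_strs – list of strings to add to the bloom filter
bf_len – int: length of bloom filter
num_hash_funct – int: number of hash functions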
- Returns
bloom_filter_vector if return_cbf_index_sig_map is False else (bloom_filter_vector, cbf_index_sig_map)
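A minimal sketch of building a Bloom filter vector from a list of strings follows. It uses SHA-256 salted with the hash function index, which is an assumption chosen for brevity, and the helper `build_bloom_filter` is hypothetical; it does not reproduce the library's hashing or its optional index-to-signature map.

```python
import hashlib
from typing import List


def build_bloom_filter(list_of_strs: List[str], bf_len: int, num_hash_funct: int) -> List[int]:
    """Return a list of 0/1 bits with the positions of every string set to 1."""
    bits = [0] * bf_len
    for s in list_of_strs:
        for i in range(num_hash_funct):
            # Salt the input with the function index to get a distinct position per function.
            digest = hashlib.sha256(f"{i}:{s}".encode("utf-8")).digest()
            bits[int.from_bytes(digest, "big") % bf_len] = 1
    return bits


small = build_bloom_filter(["smith"], bf_len=64, num_hash_funct=3)
large = build_bloom_filter(["smith", "jones"], bf_len=64, num_hash_funct=3)
print(sum(small), sum(large))
```

Adding a string can only set more bits, so every bit set in `small` is also set in `large`.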