Python API

Block Generator

Module that implements final block generation.

blocklib.blocks_generator.check_block_object(candidate_block_objs: Sequence[blocklib.candidate_blocks_generator.CandidateBlockingResult])[source]

Check the types of the candidate block objects and of their states.

Raises

TypeError – if conditions aren’t met.

Parameters

candidate_block_objs – A list of candidate block result objects from 2 data providers
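
The kind of check described above can be sketched as follows. This is only an illustration: StubResult and StubState are hypothetical stand-ins for CandidateBlockingResult and a PPRLIndex state, and the exact conditions blocklib enforces are assumed, not taken from its source.

```python
from typing import Any, Sequence


class StubState:
    """Hypothetical stand-in for a PPRLIndex state object."""


class StubResult:
    """Hypothetical stand-in for CandidateBlockingResult."""

    def __init__(self, state: Any) -> None:
        self.state = state


def check_results(results: Sequence[Any]) -> None:
    """Raise TypeError unless every item is a StubResult and all states share one type."""
    for result in results:
        if not isinstance(result, StubResult):
            raise TypeError(f"unexpected result type: {type(result).__name__}")
    # Assumption: all parties must have used the same blocking method.
    state_types = {type(result.state) for result in results}
    if len(state_types) > 1:
        raise TypeError("all states must be of the same type")


# Two results with matching state types pass silently.
check_results([StubResult(StubState()), StubResult(StubState())])
```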

blocklib.blocks_generator.generate_blocks(candidate_block_objs: Sequence[blocklib.candidate_blocks_generator.CandidateBlockingResult], K: int) List[Dict[Any, List[Any]]][source]

Generate final blocks given a list of candidate block objects from two or more data providers.

Parameters
  • candidate_block_objs – A list of CandidateBlockingResult from multiple data providers

  • K – the minimum number of occurrences required for records to be included in the final blocks

Returns

List of dictionaries, with records that appear in fewer than K parties filtered out
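
A minimal sketch of this kind of filtering, assuming each party supplies a plain dictionary from block key to record IDs and that the count is taken per block key. The function name keep_shared_blocks is hypothetical and is not part of blocklib.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def keep_shared_blocks(
    party_blocks: Sequence[Dict[Any, List[Any]]], k: int
) -> List[Dict[Any, List[Any]]]:
    """Keep only the block keys that occur in at least k parties."""
    # Each dict has unique keys, so this counts parties per block key.
    occurrences = Counter(key for blocks in party_blocks for key in blocks)
    return [
        {key: records for key, records in blocks.items() if occurrences[key] >= k}
        for blocks in party_blocks
    ]


alice = {"b1": [0, 1], "b2": [2]}
bob = {"b1": [5], "b3": [6, 7]}
print(keep_shared_blocks([alice, bob], k=2))  # [{'b1': [0, 1]}, {'b1': [5]}]
```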

blocklib.blocks_generator.generate_blocks_psig(reversed_indices: Sequence[Dict], block_states: Sequence[blocklib.pprlpsig.PPRLIndexPSignature], threshold: int)[source]

Generate blocks for P-Sig

Parameters
  • reversed_indices – A list of dictionaries where key is the block key and value is a list of record IDs.

  • block_states – A list of PPRLIndex objects that hold configuration of the blocking job

  • threshold – int; a pair is accepted when the number of 1 bits in the Bloom filter is greater than or equal to threshold

Returns

A list of dictionaries where blocks that don’t contain any matches are deleted
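
The threshold description is terse. One possible reading, assumed here and not confirmed by this page, is that each party contributes a Bloom filter bit vector and a position counts as shared when at least threshold parties have a 1 there. The helper shared_positions below is hypothetical.

```python
from typing import List, Sequence


def shared_positions(party_filters: Sequence[Sequence[int]], threshold: int) -> List[int]:
    """Return bit positions set to 1 by at least `threshold` parties."""
    # zip(*...) walks the filters column by column, one bit position at a time.
    counts = [sum(column) for column in zip(*party_filters)]
    return [pos for pos, count in enumerate(counts) if count >= threshold]


filters = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(shared_positions(filters, threshold=2))  # [0, 2, 3]
```

Under this reading, blocks whose keys map only to positions outside the returned list would be the ones dropped.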

blocklib.blocks_generator.generate_reverse_blocks(reversed_indices: Sequence[Dict])[source]

Invert a map from “blocks to records” to “records to blocks”.

Parameters

reversed_indices – A list of dictionaries where key is the block key and value is a list of record IDs.

Returns

A list of dictionaries where key is the record ID and value is a set of blocking keys the record belongs to.
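
The inversion can be illustrated with plain dictionaries. This is a sketch of the idea only; invert_blocks is a hypothetical name and the library's own implementation may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Set


def invert_blocks(
    reversed_indices: Sequence[Dict[Any, List[Any]]]
) -> List[Dict[Any, Set[Any]]]:
    """Turn each 'block key -> record IDs' map into 'record ID -> block keys'."""
    inverted = []
    for block_to_records in reversed_indices:
        record_to_blocks: Dict[Any, Set[Any]] = defaultdict(set)
        for block_key, record_ids in block_to_records.items():
            for record_id in record_ids:
                record_to_blocks[record_id].add(block_key)
        inverted.append(dict(record_to_blocks))
    return inverted


result = invert_blocks([{"b1": [0, 1], "b2": [1]}])
assert result == [{0: {"b1"}, 1: {"b1", "b2"}}]
```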

Blocks

class blocklib.candidate_blocks_generator.CandidateBlockingResult(blocking_result: blocklib.pprlindex.ReversedIndexResult, state: blocklib.pprlindex.PPRLIndex)[source]

Object for holding candidate blocking results.

Variables
  • blocks – a dictionary that contains a mapping from the block ID to the record IDs in that block.

  • state – A PPRLIndex state that contains the configuration of blocking

  • stats – a dictionary containing the summary statistics of the generated blocks

print_summary_statistics(output: typing.TextIO = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, round_ndigits: int = 4)[source]

Print the summary statistics of this candidate blocking result to ‘output’.

Parameters
  • output – a file-like object to write to. Defaults to sys.stdout

  • round_ndigits – round floating point numbers to ndigits precision. Defaults to 4.

blocklib.candidate_blocks_generator.generate_candidate_blocks(data: Sequence[Tuple[str, ...]], blocking_schema: Dict, header: Optional[List[str]] = None) blocklib.candidate_blocks_generator.CandidateBlockingResult[source]
Parameters
  • data – list of tuples, e.g. (‘0’, ‘Kenneth Bain’, ‘1964/06/17’, ‘M’)

  • blocking_schema – A description of how the signatures should be generated. See Blocking Schema

  • header – column names (optional). An exception should be raised if the blocking features are given as strings but header is None

Returns

A CandidateBlockingResult containing the candidate blocks built from the signatures of each record in data, together with the internal state object from the signature generation.

Base PPRL Index

class blocklib.pprlindex.PPRLIndex(config: Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig])[source]

Base class for PPRL indexing/blocking.

build_reversed_index(data: Sequence[Sequence], header: Optional[List[str]] = None)[source]

Build the index for the whole database.

Parameters
  • data – list of tuples, PII dataset

  • header – file header, optional

Return type

ReversedIndexResult

See derived classes for actual implementations.

get_feature_to_index_map(data: Sequence[Sequence], header: Optional[List[str]] = None)[source]

Return feature name to feature index mapping if there is a header and feature is of type string.

classmethod select_reference_value(reference_data: Sequence[Sequence], ref_data_config: Dict)[source]

Load reference data for methods that need reference values.

set_blocking_features_index(blocking_features, feature_to_index: Optional[Dict[str, int]] = None)[source]

Set value of member variable blocking features index.

self.blocking_features may hold strings (column names) or ints (column indices); self.blocking_features_index must hold ints (column indices).
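
The relationship between column names, the header, and column indices can be sketched as below. Both helper names are hypothetical, and the error raised when names are used without a header is an assumption.

```python
from typing import Dict, List, Optional, Sequence, Union


def feature_to_index_map(header: Optional[Sequence[str]]) -> Optional[Dict[str, int]]:
    """Map each column name to its position, or return None without a header."""
    if header is None:
        return None
    return {name: index for index, name in enumerate(header)}


def resolve_feature_indices(
    blocking_features: Sequence[Union[int, str]],
    feature_to_index: Optional[Dict[str, int]] = None,
) -> List[int]:
    """Convert a mix of column names and column indices into column indices."""
    indices = []
    for feature in blocking_features:
        if isinstance(feature, int):
            indices.append(feature)
        elif feature_to_index is None:
            # Assumed behaviour: names cannot be resolved without a header.
            raise ValueError(f"no header available to resolve {feature!r}")
        else:
            indices.append(feature_to_index[feature])
    return indices


mapping = feature_to_index_map(["id", "name", "dob", "gender"])
print(resolve_feature_indices(["name", "dob"], mapping))  # [1, 2]
```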

Signature Generator

blocklib.signature_generator.generate_by_char_at(attr_ind: int, dtuple: Sequence, pos: List[Any])[source]

Generate signatures by selecting a subset of characters from the original features.

>>> res = generate_by_char_at(2, ('harry potter', '4 Privet Drive', 'Little Whinging', 'Surrey'), [0, 3])
>>> assert res == 'Lt'
>>> res = generate_by_char_at(2, ('harry potter', '4 Privet Drive', 'Little Whinging', 'Surrey'), [":4"])
>>> assert res == 'Litt'
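
The two position styles in the doctest above (integer indices and slice strings such as ":4") could be interpreted as in the following sketch. It is a standalone re-implementation written for illustration and named chars_at to make clear it is not the library function; how blocklib handles out-of-range positions is not covered here.

```python
from typing import List, Sequence, Union


def chars_at(attr_ind: int, dtuple: Sequence, pos: List[Union[int, str]]) -> str:
    """Concatenate the characters selected by each entry of `pos`."""
    feature = str(dtuple[attr_ind])
    parts = []
    for p in pos:
        if isinstance(p, int):
            parts.append(feature[p])
        elif ":" not in p:
            # A string without a colon is treated as a single index.
            parts.append(feature[int(p)])
        else:
            # "start:end" with either side optional, as in Python slicing.
            start, _, end = p.partition(":")
            parts.append(feature[int(start) if start else None : int(end) if end else None])
    return "".join(parts)


record = ("harry potter", "4 Privet Drive", "Little Whinging", "Surrey")
print(chars_at(2, record, [0, 3]))   # Lt
print(chars_at(2, record, [":4"]))   # Litt
```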
blocklib.signature_generator.generate_by_feature_value(attr_ind: int, dtuple: Sequence)[source]

Generate signatures by simply returning the original feature at attr_ind.

blocklib.signature_generator.generate_by_metaphone(attr_ind: int, dtuple: Sequence)[source]

Generate a phonetic encoding of features using metaphone.

>>> generate_by_metaphone(0, ('Smith', 'Schmidt', 2134))
'SM0XMT'
blocklib.signature_generator.generate_signatures(signature_strategies: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]], dtuple: Sequence, feature_to_index: Optional[Dict[str, int]] = None)[source]

Generate signatures for one record.

Parameters
  • signature_strategies – A list of PSigSignatureModel instances each describing a strategy to generate signatures.

  • dtuple – Raw data to generate signatures from

  • feature_to_index – Mapping from feature name to feature index

Return signatures

set of str

P-Sig

class blocklib.pprlpsig.PPRLIndexPSignature(config: Union[blocklib.validation.psig_validation.PSigConfig, Dict])[source]

Class that implements the PPRL indexing technique:

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases.

This class includes an implementation of the P-Sig algorithm.

build_reversed_index(data: Sequence[Sequence], header: Optional[List[str]] = None)[source]

Build the inverted index using the P-Sig method.

Configuration

class blocklib.pprlpsig.PSigConfig(*, filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig], signatureSpecs: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]], **extra_data: Any)[source]
blocking_filter: blocklib.validation.psig_validation.PSigBlockingBFFilterConfig
filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig]
signatures: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]]
class blocklib.validation.psig_validation.PSigBlockingBFFilterConfig(*, type: Literal['bloom filter'], **extra_data: Any)[source]
bloom_filter_length: int
compress_block_key: Optional[bool]
number_of_hash_functions: int
type: Literal['bloom filter']
class blocklib.validation.psig_validation.PSigCharsAtSignatureConfig(*, pos: List[Union[blocklib.validation.constrained_types.PositiveInt, str]])[source]
pos: List[Union[blocklib.validation.constrained_types.PositiveInt, str]]
class blocklib.validation.psig_validation.PSigCharsAtSignatureSpec(*, type: typing.Literal[<PSigSignatureTypes.chars_at: 'characters-at'>], feature: typing.Union[int, str], config: blocklib.validation.psig_validation.PSigCharsAtSignatureConfig)[source]
config: blocklib.validation.psig_validation.PSigCharsAtSignatureConfig
type: Literal[<PSigSignatureTypes.chars_at: 'characters-at'>]
class blocklib.validation.psig_validation.PSigConfig(*, filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig], signatureSpecs: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]], **extra_data: Any)[source]
blocking_features: Union[List[int], List[str]]
blocking_filter: blocklib.validation.psig_validation.PSigBlockingBFFilterConfig
filter: Union[blocklib.validation.psig_validation.PSigFilterRatioConfig, blocklib.validation.psig_validation.PSigFilterCountConfig]
record_id_column: Optional[int]
signatures: List[List[Union[blocklib.validation.psig_validation.PSigCharsAtSignatureSpec, blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec, blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec]]]
class blocklib.validation.psig_validation.PSigFeatureValueSignatureSpec(*, type: typing.Literal[<PSigSignatureTypes.feature_value: 'feature-value'>], feature: typing.Union[int, str])[source]
type: Literal[<PSigSignatureTypes.feature_value: 'feature-value'>]
class blocklib.validation.psig_validation.PSigFilterConfigBase(*, type: str)[source]
type: str
class blocklib.validation.psig_validation.PSigFilterCountConfig(*, type: Literal['count'], max: blocklib.validation.constrained_types.PositiveInt, min: blocklib.validation.constrained_types.PositiveInt)[source]
max: blocklib.validation.constrained_types.PositiveInt
min: blocklib.validation.constrained_types.PositiveInt
type: Literal['count']
class blocklib.validation.psig_validation.PSigFilterRatioConfig(*, type: Literal['ratio'], max: blocklib.validation.constrained_types.UnitFloat, min: blocklib.validation.constrained_types.UnitFloat = 0.0)[source]
max: blocklib.validation.constrained_types.UnitFloat
min: blocklib.validation.constrained_types.UnitFloat
type: Literal['ratio']
class blocklib.validation.psig_validation.PSigMetaphoneSignatureSpec(*, type: typing.Literal[<PSigSignatureTypes.metaphone: 'metaphone'>], feature: typing.Union[int, str])[source]
type: Literal[<PSigSignatureTypes.metaphone: 'metaphone'>]
class blocklib.validation.psig_validation.PSigSignatureSpecBase(*, type: str, feature: Union[int, str])[source]
feature: Union[int, str]
type: str
class blocklib.validation.psig_validation.PSigSignatureTypes(value)[source]

An enumeration.

chars_at = 'characters-at'
feature_value = 'feature-value'
metaphone = 'metaphone'

Lambda Fold

class blocklib.pprllambdafold.PPRLIndexLambdaFold(config: Union[blocklib.validation.lambda_fold_validation.LambdaConfig, Dict])[source]

Class that implements the PPRL indexing technique:

An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage.

This class includes an implementation of the Lambda-fold redundant blocking method.

build_reversed_index(data: Sequence[Any], header: Optional[List[str]] = None)[source]

Build the inverted index for the PPRL Lambda-fold blocking method.

Parameters
  • data – list of lists

  • header – file header, optional

Returns

reversed index as ReversedIndexResult
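
The general idea behind LSH-based redundant blocking on Bloom filters can be sketched as follows: for each of Lambda tables, K bit positions are sampled once, and the bits at those positions form the block key. This is a simplified illustration under assumptions. The function name, the key format, and the sampling scheme are hypothetical and may not match blocklib.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def lambda_fold_index(
    bit_vectors: Sequence[Sequence[int]], lambda_: int, k: int, random_state: int
) -> Dict[str, List[int]]:
    """Map block keys to record positions using lambda_ tables of k sampled bits."""
    bf_len = len(bit_vectors[0])
    rng = random.Random(random_state)  # seeded so that all parties sample alike
    tables = [rng.sample(range(bf_len), k) for _ in range(lambda_)]
    index: Dict[str, List[int]] = defaultdict(list)
    for record_id, bits in enumerate(bit_vectors):
        for table_id, positions in enumerate(tables):
            # Prefix with the table number so keys from different tables never clash.
            key = f"{table_id}_" + "".join(str(bits[p]) for p in positions)
            index[key].append(record_id)
    return dict(index)


vectors = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],  # identical to the first record
    [0, 1, 0, 0, 1, 1, 0, 1],  # bitwise complement of the first record
]
index = lambda_fold_index(vectors, lambda_=3, k=4, random_state=0)
```

Each record lands in Lambda blocks, which is where the redundancy comes from: similar records only need to agree on the sampled bits of one table to share a block.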

Configuration

class blocklib.pprllambdafold.LambdaConfig(*, Lambda: int, K: int, random_state: int, **extra_data: Any)[source]
K: int
Lambda: int
block_encodings: bool
bloom_filter_length: int
number_of_hash_functions: int
random_state: int
class blocklib.validation.lambda_fold_validation.LambdaConfig(*, Lambda: int, K: int, random_state: int, **extra_data: Any)[source]
K: int
Lambda: int
block_encodings: bool
blocking_features: Union[List[int], List[str]]
bloom_filter_length: int
number_of_hash_functions: int
random_state: int
record_id_column: Optional[int]

Internal

blocklib uses pydantic for validation.

class blocklib.validation.BlockingSchemaModel(*, version: int, type: blocklib.validation.BlockingSchemaTypes, config: Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig])[source]
config: Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig]
classmethod config_gen(config_to_validate, values) Union[blocklib.validation.psig_validation.PSigConfig, blocklib.validation.lambda_fold_validation.LambdaConfig][source]
type: blocklib.validation.BlockingSchemaTypes
classmethod validate_config(config_to_validate, values)[source]
version: int
class blocklib.validation.BlockingSchemaTypes(value)[source]

An enumeration.

lambdafold = 'lambda-fold'
psig = 'p-sig'
blocklib.validation.load_schema(file_name: str)[source]
blocklib.validation.validate_blocking_schema(config: Dict) blocklib.validation.BlockingSchemaModel[source]

Validate blocking schema data with pydantic.

Raises

ValueError – if the config passed in is invalid.

Encoding

Module that implements privacy-preserving encoding.

blocklib.encoding.flip_bloom_filter(string: str, bf_len: int, num_hash_funct: int)[source]

Hash a string and return the indices of the bits that are set as a result.

Parameters
  • string – str: the string to hash into the Bloom filter

  • bf_len – int: length of the Bloom filter

  • num_hash_funct – int: number of hash functions

Returns

bfset: a set of integers - indices that have been flipped to 1
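
A minimal sketch of how a string can be mapped to Bloom filter positions, assuming a double-hashing scheme built on SHA-1 and SHA-256. The hash functions blocklib actually uses are not stated on this page, so the choice here is an assumption and bloom_indices is a hypothetical name.

```python
import hashlib
from typing import Set


def bloom_indices(string: str, bf_len: int, num_hash_funct: int) -> Set[int]:
    """Return the set of bit positions that `string` sets to 1."""
    data = string.encode("utf-8")
    h1 = int(hashlib.sha1(data).hexdigest(), 16)
    h2 = int(hashlib.sha256(data).hexdigest(), 16)
    # Double hashing: the i-th position is (h1 + i * h2) mod bf_len.
    return {(h1 + i * h2) % bf_len for i in range(num_hash_funct)}


positions = bloom_indices("smith", bf_len=64, num_hash_funct=4)
```

Because the result is a set, collisions between hash functions can make it smaller than num_hash_funct.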

blocklib.encoding.generate_bloom_filter(list_of_strs: List[str], bf_len: int, num_hash_funct: int)[source]

Generate a bloom filter given list of strings.

Parameters
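  • return_cbf_index_sig_map

  • list_of_strs – list of strings to insert into the Bloom filter

  • bf_len – length of the Bloom filter

  • num_hash_funct – number of hash functions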

Returns

bloom_filter_vector if return_cbf_index_sig_map is False else (bloom_filter_vector, cbf_index_sig_map)
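
Building a Bloom filter vector from several strings amounts to setting, for every string, the bits chosen by each hash function. The sketch below uses salted SHA-256 digests as the hash family, which is an assumption made for illustration. It returns only the bit vector and does not model the cbf_index_sig_map mentioned above.

```python
import hashlib
from typing import List


def bloom_filter_bits(list_of_strs: List[str], bf_len: int, num_hash_funct: int) -> List[int]:
    """Return a list of 0/1 values of length bf_len encoding all the strings."""
    bits = [0] * bf_len
    for s in list_of_strs:
        for i in range(num_hash_funct):
            # Salting with i gives num_hash_funct different hash functions.
            digest = hashlib.sha256(f"{i}:{s}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % bf_len] = 1
    return bits


vector = bloom_filter_bits(["sm", "mi", "it", "th"], bf_len=32, num_hash_funct=2)
```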