Blocking API¶
Blocking is a technique that makes record linkage scalable. It is achieved by partitioning datasets into groups, called blocks and only comparing records in corresponding blocks. This can reduce the number of comparisons that need to be conducted to find which pairs of records should be linked.
There are two main metrics to evaluate a blocking technique - reduction ratio and pair completeness.
Reduction Ratio
Reduction ratio measures the proportion of number of comparisons reduced by using blocking technique. If we have two data providers each has number of records, then
Pair Completeness
Pair completeness measure how many true matches are maintained after blocking. It is evalauted as
Different blocking techniques have different methods to partition datasets in order to reduce as much number of comparisons as possible while maintain high pair completeness.
In this tutorial, we demonstrate how to use blocking in privacy preserving record linkage.
Load example Northern Carolina voter registration dataset:
[1]:
import blocklib
[2]:
# NBVAL_IGNORE_OUTPUT
import pandas as pd
df_alice = pd.read_csv('data/alice.csv')
df_alice.head()
[2]:
recid | givenname | surname | suburb | pc | |
---|---|---|---|---|---|
0 | 761859 | kate | chapman | brighton | 4017 |
1 | 1384455 | lian | hurse | carisbrook | 3464 |
2 | 1933333 | matthew | russo | bardon | 4065 |
3 | 1564695 | lorraine | zammit | minchinbury | 2770 |
4 | 5971993 | ingo | richardson | woolsthorpe | 3276 |
In this dataset, recid
is the voter registration number, with that we are able to verify the quality of a linkage between snapshots of this dataset taken at different times. pc
refers to postcode.
The next step is to configure how to block the data. There are two privacy preserving blocking methods currently supported by blocklib
:
Probabilistic signature (p-sig)
LSH based -fold redundant (lambda-fold)
This tutorial will demonstrate using both of these, starting with probabilistic signatures.
Blocking Methods - Probabilistic signature (p-sig)¶
The high level idea behind this blocking method is that it uses signatures as the blocking key and places records having the same signatures into the same block. You can find the original paper here: Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases.
An example blocking configuration using probabilistic signatures:
[3]:
blocking_config = {
"type": "p-sig",
"version": 1,
"config": {
"blocking-features": ['givenname', 'surname'],
"filter": {
"type": "ratio",
"max": 0.02,
"min": 0.00,
},
"blocking-filter": {
"type": "bloom filter",
"number-hash-functions": 20,
"bf-len": 2048,
},
"signatureSpecs": [
[
{"type": "characters-at", "config": {"pos": [0]}, "feature": "givenname"},
{"type": "characters-at", "config": {"pos": [0]}, "feature": "surname"},
],
[
{"type": "metaphone", "feature": "givenname"},
{"type": "metaphone", "feature": "surname"},
]
]
}
}
The blocking config can be fully validated to ensure all required types are present.
[4]:
blocklib.validation.validate_blocking_schema(blocking_config)
[4]:
BlockingSchemaModel(version=1, type=<BlockingSchemaTypes.psig: 'p-sig'>, config=PSigConfig(record_id_column=None, blocking_features=['givenname', 'surname'], filter=PSigFilterRatioConfig(type='ratio', max=0.02, min=0.0), blocking_filter=PSigBlockingBFFilterConfig(type='bloom filter', number_of_hash_functions=20, bloom_filter_length=2048, compress_block_key=False), signatures=[[PSigCharsAtSignatureSpec(type=<PSigSignatureTypes.chars_at: 'characters-at'>, feature='givenname', config=PSigCharsAtSignatureConfig(pos=[0])), PSigCharsAtSignatureSpec(type=<PSigSignatureTypes.chars_at: 'characters-at'>, feature='surname', config=PSigCharsAtSignatureConfig(pos=[0]))], [PSigMetaphoneSignatureSpec(type=<PSigSignatureTypes.metaphone: 'metaphone'>, feature='givenname'), PSigMetaphoneSignatureSpec(type=<PSigSignatureTypes.metaphone: 'metaphone'>, feature='surname')]]))
Step 1 - Generate Signatures¶
For a record r
, a signature is a sub-record derived from record r
with a signature strategy. An example signature strategy is to concatenate the initials of first and last name, e.g., the signature for record "John White"
is "JW"
.
blocklib
provides the following signature strategies:
feature-value
: the signature is generated by returning the selected featurecharacters-at
: the signature is generated by selecting a single character or a sequence of characters from selected featuremetaphone
: the signature is generated by phonetic encoding the selected feature using metaphone
The output of this step is a reversed index where keys are generated signatures / blocking key, and the values are lists of corresponding record IDs. A record ID could be row index or the actual record identifier if it is available in the dataset.
Signature strategies are defined in the signatureSpecs
section. For example, in the above configuration, we are going to generate two signatures for each record. The first signature produces initials:
[
{"type": "characters-at", "config": {"pos": [0]}, "feature": "givenname"},
{"type": "characters-at", "config": {"pos": [0]}, "feature": "surname"}
]
The second signature is generated by a combination of how the two components of a person’s name sounds:
[
{"type": "metaphone", "feature": "givenname"},
{"type": "metaphone", "feature": "surname"}
]
That is phonetic encoding of first name and last name.
One signature corresponds to one block. I will use signature and block interchangeably but they mean the same thing.
Step 2 - Filter Signatures¶
Signature strategies can create blocks with many records, and blocks with just one record. To impose limits on the minimum and maximum block size blocklib
provides configurable filtering.
For example, in the above configuration, the filter is configured as:
{
"type": "ratio",
"max": 0.02,
"min": 0.001
}
blocklib
will filter out all signatures / blocks whose number of records is greater than 2% of number of total records or is less than 0.1% of number of total records. Note these percentages are based on the data provided to blocklib
so only use on roughly symmetric sized record linkage.
Absolute filtering is also supported to filter by number of records. An example filter
configuration:
{
"type": "count",
"max": 100,
"min": 5
}
Step 3 - Anonymization¶
Given the aim of privacy preserving record linkage, the signatures themselves (e.g. "JW"
) are not going to be shared, instead following the p-sig
paper, the signatures all get encoded into a Bloom Filter. Here we use one Bloom Filter and map all filtered signatures into that Bloom Filter.
"blocking-filter": {
"type": "bloom filter",
"number-hash-functions": 20,
"bf-len": 2048,
}
After anonymization, the signature becomes the set of indices of bits 1 in the bloom filter and hence can preserve the privacy of data for each data provider.
Blocking Data¶
Now that we have configured how the P-Sig blocking will work, we can carry out our blocking job with blocklib
. Note blocklib
only accept list of tuples or lists as input data, so some pre-processing may be necessary. Example data input for blocklib
:
[
[761859, 'kate', 'chapman', 'brighton', 4017],
[1384455, 'lian', 'hurse', 'carisbrook', 3464],
[1933333, 'matthew', 'russo', 'bardon', 4065],
[1564695, 'lorraine', 'zammit', 'minchinbury', 2770],
[5971993, 'ingo', 'richardson', 'woolsthorpe', 3276]
]
Step 1 - Generate Candidate Blocks for Party A - Alice
[5]:
alice = df_alice.to_dict(orient='split')
print("Example PII", alice['data'][0])
Example PII [761859, 'kate', 'chapman', 'brighton', 4017]
[6]:
from blocklib import generate_candidate_blocks
block_obj_alice = generate_candidate_blocks(alice['data'], blocking_config, header=alice['columns'])
block_obj_alice.print_summary_statistics()
Statistics for the generated blocks:
Number of Blocks: 5028
Minimum Block Size: 1
Maximum Block Size: 61
Average Block Size: 1.8341
Median Block Size: 1
Standard Deviation of Block Size: 3.8372
Coverage: 100.0%
Individual statistics for each strategy:
Strategy: 0
Number of Blocks: 503
Minimum Block Size: 1
Maximum Block Size: 61
Average Block Size: 9.167
Median Block Size: 6
Standard Deviation of Block Size: 9.3427
Coverage: 100.0%
Strategy: 1
Number of Blocks: 4534
Minimum Block Size: 1
Maximum Block Size: 7
Average Block Size: 1.017
Median Block Size: 1
Standard Deviation of Block Size: 0.1584
Coverage: 100.0%
You can print the statistics of the blocks in order to inspect the block distribution and decide if this is a good blocking result.
generate_candidate_blocks
returns a CandidateBlockingResult
, the attribute we are most interested in is blocks
, a dict
that maps signatures to lists of records.
[7]:
list(block_obj_alice.blocks.keys())[0]
[7]:
'(1920, 1031, 142, 401, 1560, 671, 1830, 941, 52, 1211, 1470, 581, 1740, 851, 2010, 1121, 232, 491, 1650, 761)'
To protect privacy, the signature / blocking key is not the original signature such as JW
. Instead, it is a list of mapped indices of bits set to 1 in the Bloom Filter for the original signature. Next we want to do the same thing for another party - enter Bob.
Step2 - Generate Candidate Blocks for Party B - Bob
[8]:
# NBVAL_IGNORE_OUTPUT
df_bob = pd.read_csv('data/bob.csv')
bob = df_bob.to_dict(orient='split')
block_obj_bob = generate_candidate_blocks(bob['data'], blocking_config, header=bob['columns'])
block_obj_bob.print_summary_statistics()
Statistics for the generated blocks:
Number of Blocks: 5018
Minimum Block Size: 1
Maximum Block Size: 59
Average Block Size: 1.8378
Median Block Size: 1
Standard Deviation of Block Size: 3.8383
Coverage: 100.0%
Individual statistics for each strategy:
Strategy: 0
Number of Blocks: 500
Minimum Block Size: 1
Maximum Block Size: 59
Average Block Size: 9.222
Median Block Size: 6
Standard Deviation of Block Size: 9.3374
Coverage: 100.0%
Strategy: 1
Number of Blocks: 4529
Minimum Block Size: 1
Maximum Block Size: 4
Average Block Size: 1.0181
Median Block Size: 1
Standard Deviation of Block Size: 0.1445
Coverage: 100.0%
Generate Final Blocks¶
Now we have candidate blocks from both parties, we can generate final blocks by only including signatures that appear in both parties. Instead of directly comparing signatures, the algorithm maps the list of signatures into a Bloom Filter for each party called the candidate blocking filter, and then creates the combined blocking filter by only retaining the bits that are present in both candidate filters.
[9]:
from blocklib import generate_blocks
filtered_blocks_alice, filtered_blocks_bob = generate_blocks([block_obj_alice, block_obj_bob], K=2)
print('Alice: {} out of {} blocks are in common'.format(len(filtered_blocks_alice), len(block_obj_alice.blocks)))
print('Bob: {} out of {} blocks are in common'.format(len(filtered_blocks_bob), len(block_obj_bob.blocks)))
Alice: 2794 out of 5028 blocks are in common
Bob: 2794 out of 5018 blocks are in common
Assess Blocking¶
We can assess the blocking result when we have ground truth. There are two main metrics to assess blocking result as mentioned in the beginning of this tutorial. Here is a recap:
reduction ratio: relative reduction in the number of record pairs to be compared.
pair completeness: the percentage of true matches after blocking
[10]:
# NBVAL_IGNORE_OUTPUT
from blocklib.evaluation import assess_blocks_2party
subdata1 = [x[0] for x in alice['data']]
subdata2 = [x[0] for x in bob['data']]
rr, pc = assess_blocks_2party([filtered_blocks_alice, filtered_blocks_bob],
[subdata1, subdata2])
print(f'reduction ratio: {round(rr, 3)}')
print(f'pair completeness: {pc}')
assessing blocks: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2794/2794 [00:00<00:00, 223055.41key/s]
reduction ratio: 0.996
pair completeness: 1.0
Blocking Methods - LSH Based -fold Redundant¶
Now we look the other blocking method that we support - LSH Based -fold Redundant blocking. This blocking method uses a list of selected bits selected randomly from Bloom Filter for each record as block keys. refers the degree of redundancy i.e. we will conduct LSH-based blocking times, each forms a blocking group. Then those blocking groups are combined into one blocking results. This will make a record redundant times but will increase the recall.
Let’s see an example config, this time selecting the blocking features using column indices instead of column names:
[11]:
blocking_config = {
"type": "lambda-fold",
"version": 1,
"config": {
"blocking-features": [1, 2],
"Lambda": 5,
"bf-len": 2048,
"num-hash-funcs": 10,
"K": 40,
"random_state": 0,
"input-clks": False
}
}
blocklib.validation.validate_blocking_schema(blocking_config)
[11]:
BlockingSchemaModel(version=1, type=<BlockingSchemaTypes.lambdafold: 'lambda-fold'>, config=LambdaConfig(record_id_column=None, blocking_features=[1, 2], Lambda=5, bloom_filter_length=2048, number_of_hash_functions=10, K=40, block_encodings=False, random_state=0))
Now let’s explain the meaning of each argument:
blocking-features: a list of feature indice that we are going to use to generate blocks
Lambda: this number denotes the degree of redundancy - , where each represents one independent blocking group
bf-len: length of Bloom Filter for each record
num-hash-funcs: number of hash functions used to map record to Bloom Filter
K: number of bits we selected from Bloom Filter for each record
random_state: control random seed
Then we can carry out the blocking job and assess the result just like above steps
[12]:
print('Generating candidate blocks for Alice:')
block_obj_alice = generate_candidate_blocks(alice['data'], blocking_config)
block_obj_alice.print_summary_statistics()
print()
print('Generating candidate blocks for Bob: ')
block_obj_bob = generate_candidate_blocks(bob['data'], blocking_config)
block_obj_bob.print_summary_statistics()
Generating candidate blocks for Alice:
Statistics for the generated blocks:
Number of Blocks: 6050
Minimum Block Size: 1
Maximum Block Size: 873
Average Block Size: 3.8107
Median Block Size: 1
Standard Deviation of Block Size: 20.9703
Generating candidate blocks for Bob:
Statistics for the generated blocks:
Number of Blocks: 6085
Minimum Block Size: 1
Maximum Block Size: 862
Average Block Size: 3.7888
Median Block Size: 1
Standard Deviation of Block Size: 20.715
[13]:
filtered_blocks_alice, filtered_blocks_bob = generate_blocks([block_obj_alice, block_obj_bob], K=2)
print('Alice: {} out of {} blocks are in common'.format(len(filtered_blocks_alice), len(block_obj_alice.blocks)))
print('Bob: {} out of {} blocks are in common'.format(len(filtered_blocks_bob), len(block_obj_bob.blocks)))
Alice: 4167 out of 6050 blocks are in common
Bob: 4167 out of 6085 blocks are in common
[14]:
# NBVAL_IGNORE_OUTPUT
rr, pc = assess_blocks_2party([filtered_blocks_alice, filtered_blocks_bob],
[subdata1, subdata2])
print(f'reduction ratio: {round(rr, 3)}')
print(f'pair completeness: {pc}')
assessing blocks: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4167/4167 [00:00<00:00, 212918.95key/s]
reduction ratio: 0.872
pair completeness: 1.0
[ ]: