Blocking Schema

Each blocking method has its own configuration and tunable parameters. To keep the API as generic as possible, the blocking schema specifies the configuration of a blocking method, including the features used to generate blocks and its hyperparameters.

Currently we support two blocking methods:

  • “p-sig”: Probabilistic signature

  • “lambda-fold”: LSH-based λ-fold

which are proposed by the following publications:

The format of the blocking schema is defined in a separate JSON Schema specification document - blocking-schema.json.

Basic Structure

A blocking schema consists of three parts:

  • type, the blocking method to be used

  • version, the version number of the blocking schema

  • config, a JSON configuration whose contents vary with the blocking method

Example Schema

{
  "type": "lambda-fold",
  "version": 1,
  "config": {
    "blocking-features": ["name", "suburb"],
    "Lambda": 30,
    "bf-len": 2048,
    "num-hash-funcs": 5,
    "K": 20,
    "input-clks": true,
    "random_state": 0
  }
}
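
As a minimal sketch of the three-part structure, the following Python snippet parses a schema and checks its top-level keys. The helper `check_basic_structure` and the two constants are hypothetical names used only for illustration; authoritative validation should rely on blocking-schema.json.

```python
import json

# Hypothetical constants, taken from the values this document lists.
SUPPORTED_TYPES = {"p-sig", "lambda-fold"}
SUPPORTED_VERSIONS = {1}


def check_basic_structure(schema: dict) -> None:
    """Raise ValueError if a top-level part is missing or unsupported."""
    missing = {"type", "version", "config"} - schema.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if schema["type"] not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported blocking method: {schema['type']!r}")
    if schema["version"] not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version: {schema['version']!r}")
    if not isinstance(schema["config"], dict):
        raise ValueError("config must be a JSON object")


schema = json.loads("""
{
  "type": "lambda-fold",
  "version": 1,
  "config": {"blocking-features": ["name", "suburb"], "Lambda": 30}
}
""")
check_basic_structure(schema)  # returns None when the structure is valid
```

This check covers only the outer structure; the contents of config depend on the blocking method.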

Schema Components

type

String value which describes the blocking method.

Supported values (name: detailed description):

  • p-sig: Probabilistic Signature blocking method from Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

  • lambda-fold: LSH-based Lambda Fold Redundant blocking method from Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

version

Integer value that indicates the version of the blocking schema. Currently the only supported version is 1.

config

Configuration specific to each blocking method. The configuration of each supported blocking method is detailed in the sections that follow.

Probabilistic Signature Configuration

Attributes (name, type, description):

  • blocking-features (list[integer]): specifies which features to use in block generation

  • filter (dictionary): filtering threshold

  • blocking-filter (dictionary): type of filter used to generate blocks

  • signatureSpecs (list of lists): signature strategies, where each inner list is a combination of signature strategies

Filter Configuration

Attributes (name, type, description):

  • type (string): either “ratio” or “count”, representing proportional or absolute filtering

  • max (numeric): for ratio, it should be between 0 and 1; for count, it should not exceed the number of records
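
To make the two filter types concrete, here is a minimal sketch. It assumes that the filter is applied to block sizes and that a block is discarded when its size exceeds max, measured either as a fraction of all records (“ratio”) or as a record count (“count”). The function `filter_blocks` and this exact semantics are assumptions for illustration, not the library's implementation.

```python
def filter_blocks(blocks, n_records, filter_config):
    """Keep blocks whose size does not exceed the configured maximum.

    blocks: dict mapping block key -> list of record ids
    filter_config: dict with "type" ("ratio" or "count") and "max"
    """
    kind = filter_config["type"]
    max_value = filter_config["max"]
    if kind == "ratio":
        if not 0 <= max_value <= 1:
            raise ValueError("for ratio, max must be between 0 and 1")
        limit = max_value * n_records  # proportional threshold
    elif kind == "count":
        if max_value > n_records:
            raise ValueError("for count, max must not exceed the number of records")
        limit = max_value  # absolute threshold
    else:
        raise ValueError(f"unknown filter type: {kind!r}")
    return {key: ids for key, ids in blocks.items() if len(ids) <= limit}


blocks = {"a": [0, 1, 2, 3], "b": [4], "c": [5, 6]}
kept = filter_blocks(blocks, n_records=10, filter_config={"type": "ratio", "max": 0.2})
# The limit is 0.2 * 10 = 2 records, so block "a" (4 records) is dropped.
print(sorted(kept))  # ['b', 'c']
```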

Blocking-filter Configuration

A blocking filter is represented as a string listing the bit positions in the Bloom filter that are set to one, e.g.: “(3, 265, 403, 665, 927, 165, 41, 303, 565, 827, 965, 203, 465, 727, 865, 103, 365, 627, 503, 765)”. This representation consumes a considerable amount of space. If the indices are not needed for further processing, you can tell blocklib to replace these strings with a 5-byte hash by setting the compress-block-key flag.

Attributes (name, type, description):

  • type (string): currently only “bloom filter” is supported

  • number-hash-functions (integer): specifies how many bits will be flipped for each signature

  • bf-len (integer): defines the length of the blocking filter; for a Bloom filter this is usually 1024 or 2048

  • compress-block-key (boolean): optional; replaces the block key with a 5-byte hash of itself
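
The following sketch illustrates how number-hash-functions, bf-len and compress-block-key relate to each other. The hash constructions are assumptions chosen for illustration (SHA-256 salted with the hash function index, and BLAKE2b with a 5-byte digest); the hashes actually used by blocklib may differ, and the function names are hypothetical.

```python
import hashlib


def bit_positions(signature: str, num_hash_functions: int, bf_len: int) -> tuple:
    """Map a signature to one bit position per hash function.

    Assumption: hash function i is SHA-256 over the signature salted with i.
    """
    positions = []
    for i in range(num_hash_functions):
        digest = hashlib.sha256(f"{i}:{signature}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest, "big") % bf_len)
    return tuple(positions)


def compress_block_key(block_key: str) -> str:
    """Replace a block key with a 5-byte hash, written as 10 hex characters.

    Assumption: BLAKE2b with digest_size=5 stands in for the real hash.
    """
    return hashlib.blake2b(block_key.encode("utf-8"), digest_size=5).hexdigest()


positions = bit_positions("JS", num_hash_functions=4, bf_len=2048)
block_key = str(positions)  # string listing the four bit positions
short_key = compress_block_key(block_key)
print(block_key, short_key)
```

The compressed key has a fixed length regardless of how many positions the uncompressed key lists, which is where the space saving comes from.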

SignatureSpecs Configurations

It is better to illustrate this one with an example:

{
  "signatureSpecs": [
    [
      {"type": "characters-at", "config": {"pos": [0]}, "feature": 1},
      {"type": "characters-at", "config": {"pos": [0]}, "feature": 2}
    ],
    [
      {"type": "metaphone", "feature": 1},
      {"type": "metaphone", "feature": 2}
    ]
  ]
}

Here we generate two signatures for each record, where each signature is a combination of signature strategies:

  • the first signature is the first character of the feature at index 1, concatenated with the first character of the feature at index 2

  • the second signature is the metaphone transformation of the feature at index 1, concatenated with the metaphone transformation of the feature at index 2

The following specifies the current supported signature strategies:

Strategies (name: description):

  • feature-value: exact feature at the specified index

  • characters-at: substring of the feature

  • metaphone: phonetic encoding of the feature
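
A minimal sketch of how a signatureSpecs list could be evaluated for one record is given below. It covers only feature-value and characters-at, because metaphone needs a phonetic encoding library. The function names, the empty separator, and the restriction of "pos" to integer positions are assumptions made for this illustration.

```python
def apply_strategy(spec: dict, record: list) -> str:
    """Apply one signature strategy to one feature of a record."""
    value = str(record[spec["feature"]])
    if spec["type"] == "feature-value":
        return value
    if spec["type"] == "characters-at":
        # Assumption: "pos" holds integer positions; out-of-range ones are skipped.
        return "".join(
            value[p] for p in spec["config"]["pos"] if -len(value) <= p < len(value)
        )
    raise NotImplementedError(f"strategy not covered by this sketch: {spec['type']}")


def generate_signatures(signature_specs: list, record: list, sep: str = "") -> list:
    """Return one signature per inner list by concatenating its strategy outputs."""
    return [
        sep.join(apply_strategy(spec, record) for spec in combination)
        for combination in signature_specs
    ]


specs = [
    [
        {"type": "characters-at", "config": {"pos": [0]}, "feature": 1},
        {"type": "characters-at", "config": {"pos": [0]}, "feature": 2},
    ],
    [
        {"type": "feature-value", "feature": 1},
    ],
]
record = ["id-001", "Joyce", "Wang"]
print(generate_signatures(specs, record))  # ['JW', 'Joyce']
```

Records that produce the same signature would fall into the same candidate block, which is why combining several short strategies in one signature makes blocks more selective.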

Finally, a full example of a p-sig blocking schema:

{
  "type": "p-sig",
  "version": 1,
  "config": {
    "blocking-features": [1],
    "filter": {
      "type": "ratio",
      "max": 0.02,
      "min": 0.00
    },
    "blocking-filter": {
      "type": "bloom filter",
      "number-hash-functions": 4,
      "bf-len": 2048
    },
    "signatureSpecs": [
      [
        {"type": "characters-at", "config": {"pos": [0]}, "feature": 1},
        {"type": "characters-at", "config": {"pos": [0]}, "feature": 2}
      ],
      [
        {"type": "metaphone", "feature": 1},
        {"type": "metaphone", "feature": 2}
      ]
    ]
  }
}

LSH-based λ-fold Configuration

Attributes (name, type, description):

  • blocking-features (list[integer]): specifies which features to use in block generation

  • Lambda (integer): degree of redundancy, i.e. the number of blocking groups H^i, i = 1, 2, ..., Λ, where each H^i represents one independent blocking group

  • bf-len (integer): length of the Bloom filter

  • num-hash-funcs (integer): number of hash functions used to map a record to the Bloom filter

  • K (integer): number of bits selected from the Bloom filter for each record

  • random_state (integer): controls the random seed

  • input-clks (boolean): true if the input data are CLKs, false otherwise

Here is a full example of a lambda-fold blocking schema:

{
  "type": "lambda-fold",
  "version": 1,
  "config": {
     "blocking-features": [1, 2],
     "Lambda": 5,
     "bf-len": 2048,
     "num-hash-funcs": 10,
     "K": 40,
     "random_state": 0,
     "input-clks": false
  }
}
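
To show how Lambda, K, bf-len and random_state interact, here is a minimal sketch of bit-sampling LSH over Bloom filters. It assumes that each of the Lambda blocking groups samples K bit positions once, seeded by random_state, and that a record's block key in a group consists of its Bloom filter bits at those positions. The function names and the key format are hypothetical, and the real implementation may differ.

```python
import random


def sample_positions(lam: int, k: int, bf_len: int, random_state: int) -> list:
    """Draw K distinct bit positions for each of the Lambda blocking groups."""
    rng = random.Random(random_state)
    return [rng.sample(range(bf_len), k) for _ in range(lam)]


def block_keys(bloom_filter: list, positions: list) -> list:
    """Return one block key per group: the group index plus the selected bits."""
    return [
        f"{i}_" + "".join(str(bloom_filter[p]) for p in group)
        for i, group in enumerate(positions)
    ]


config = {"Lambda": 5, "bf-len": 2048, "K": 40, "random_state": 0}
positions = sample_positions(
    config["Lambda"], config["K"], config["bf-len"], config["random_state"]
)

# Toy Bloom filter with the first half of its bits set.
bloom_filter = [1] * 1024 + [0] * 1024
keys = block_keys(bloom_filter, positions)
print(len(keys))  # 5
```

Under this assumption, two records share a block in a group only if their Bloom filters agree on all K sampled bits, so a larger K gives smaller blocks, while a larger Lambda gives similar records more chances to meet in at least one group.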