{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Blocking API\n", "\n", "Blocking is a technique that makes record linkage scalable. It partitions each dataset into groups, called blocks, and compares only records that fall in corresponding blocks. This reduces the number of comparisons needed to find which pairs of records should be linked.\n", "\n", "There are two main metrics for evaluating a blocking technique: reduction ratio and pair completeness.\n", "\n", "**Reduction Ratio**\n", "\n", "Reduction ratio measures the proportion of comparisons eliminated by blocking. If each of two data providers has $N$ records, a full comparison requires $N^2$ comparisons, and\n", "\n", "$$\\text{reduction ratio}= 1 - \\frac{\\text{number of comparisons after blocking}}{N^2}$$\n", "\n", "**Pair Completeness**\n", "\n", "Pair completeness measures the proportion of true matches that are retained after blocking. It is evaluated as\n", "\n", "$$\\text{pair completeness}= \\frac{\\text{number of true matches after blocking}}{\\text{number of all true matches}}$$\n", "\n", "Different blocking techniques partition datasets in different ways, aiming to eliminate as many comparisons as possible while maintaining high pair completeness.\n", "\n", "In this tutorial, we demonstrate how to use blocking in privacy-preserving record linkage.\n", "\n", "Load the example North Carolina voter registration dataset:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | recid | givenname | surname | suburb | pc |\n", "
---|---|---|---|---|---|\n", "
0 | 761859 | kate | chapman | brighton | 4017 |\n", "
1 | 1384455 | lian | hurse | carisbrook | 3464 |\n", "
2 | 1933333 | matthew | russo | bardon | 4065 |\n", "
3 | 1564695 | lorraine | zammit | minchinbury | 2770 |\n", "
4 | 5971993 | ingo | richardson | woolsthorpe | 3276 |\n", "