How can I best construct data structures to retrieve similar values for demographic matching?

The job is person demographic matching/consolidation.

I receive incoming person demographic information and need to determine whether it matches an existing person in a dataset. I get the following data:

NAME_LAST VARCHAR2(40), 
NAME_FIRST VARCHAR2(40), 
NAME_MIDDLE VARCHAR2(40), 
NAME_MAIDEN VARCHAR2(40), 
RESIDENCE_ADDRESS VARCHAR2(60),
RESIDENCE_CITY VARCHAR2(50), 
RESIDENCE_STATE VARCHAR2(2), 
RESIDENCE_ZIP VARCHAR2(9),
RACE VARCHAR2(2), 
DATE_OF_BIRTH DATE, 
GENDER VARCHAR2(1),
TELEPHONE VARCHAR2(10),
SSN VARCHAR2(9)

The incoming and existing data can and do have typographic errors in any or all fields. I have written a probabilistic algorithm that takes an existing record and an incoming record and scores their similarity reasonably well (99.99%+).

The problem is performance. The match of two records is reasonably quick, but the dataset I need to match against currently has over 3.9 million rows. So obviously I can't try to match against all records in the dataset.

The common way around this is blocking: limiting searches with deterministic matches against limited subsets of the data. I apply Soundex and double metaphone "hashing" to the name fields and split DOB into year and MMDD segments. This blocking yields good results, but unless I cast a wide net I miss some matches, and if I cast a wide net, performance degrades.
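To make the blocking idea concrete, here is a minimal sketch in Python, used purely for illustration since the real implementation would be PL/SQL. Everything in it is an assumption rather than my actual code: the helper names (`soundex`, `blocking_keys`, `build_index`, `candidates`), the simplified American Soundex (which may differ from Oracle's `SOUNDEX` in edge cases), and the three particular key combinations. Each record produces several blocking keys, and the candidate set is the union of the records sharing any key with the incoming record.

```python
from collections import defaultdict
from datetime import date

# Letter -> digit table for a simplified American Soundex (assumed variant).
_CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(name):
    """Return a 4-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (out + "000")[:4]


def blocking_keys(rec):
    """Hypothetical blocking passes: each key is one deterministic 'net'."""
    keys = set()
    last = soundex(rec.get("NAME_LAST") or "")
    first = soundex(rec.get("NAME_FIRST") or "")
    dob = rec.get("DATE_OF_BIRTH")
    if last and dob:
        keys.add(("last+year", last, dob.year))
        keys.add(("last+mmdd", last, dob.strftime("%m%d")))
    if first and dob:
        keys.add(("first+dob", first, dob.isoformat()))
    return keys


def build_index(records):
    """Map every blocking key to the ids of the records that produce it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for key in blocking_keys(rec):
            index[key].add(rec_id)
    return index


def candidates(index, incoming):
    """Union of all records that share at least one blocking key."""
    found = set()
    for key in blocking_keys(incoming):
        found |= index.get(key, set())
    return found


existing = {
    1: {"NAME_LAST": "Smith", "NAME_FIRST": "John", "DATE_OF_BIRTH": date(1980, 3, 15)},
    2: {"NAME_LAST": "Smyth", "NAME_FIRST": "Jon", "DATE_OF_BIRTH": date(1981, 3, 15)},
    3: {"NAME_LAST": "Jones", "NAME_FIRST": "Mary", "DATE_OF_BIRTH": date(1975, 7, 4)},
}
index = build_index(existing)
incoming = {"NAME_LAST": "Smith", "NAME_FIRST": "John", "DATE_OF_BIRTH": date(1980, 3, 15)}
print(sorted(candidates(index, incoming)))  # [1, 2]
```

Record 2 is still found despite the spelling difference and the wrong birth year, because the last-name Soundex plus MMDD key matches. Only the records returned by `candidates` would then go through the expensive probabilistic scoring. Every extra key widens the net, which is exactly the recall versus performance trade-off I am struggling with.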

So the questions are:

  1. What types of "hashing" can I do, other than double metaphone & soundex, on the data elements which would be suitable for exact or range matching which would yield small subsets of data likely to contain the "best" match?
  2. Is there a better approach to creating a suitable data structure for matching?

The data is stored in an Oracle 19c database, and the main language at my disposal is PL/SQL.

Nikita Kipriyanov
This is not the best site for such questions. Either dba.stackexchange.com or stackoverflow.com would be better suited to this.
Paul Stearns
You are correct. I thought I was on stackoverflow. I will try again.