Identity Resolution: Twins and Little Bobby Tables

Trần Long, Pexels

We were processing a 150,000-member eligibility file for a jumbo employer when the Data Nexus flagged an anomaly: two members with the same first name, last name, and birthdate. The system read it as duplicate data entry, rejected both records, and logged the error.

Our team sent the inquiry. Two days later, the employer came back.

"Those are correct," they said. "They're twin boys, both named after their father. The only difference is their middle names."

I had several thoughts about this:

First: who does this to their children? Those boys are going to spend the rest of their lives saying, "No, I'm the other one."
Second: are both of these boys Juniors? Or does the elder get "the Second" and the younger "the Third"?
Third, and most relevant to our work: how many other systems is this family going to break?

The answer, we found out, was many. The twins had already broken several, and we worked with our downstream partners to make sure their systems weren't added to the list.

This incident reminded me of one of my favorite XKCD comics. The mother of Little Bobby Tables did it on purpose, and while these parents just wanted their boys to carry on the family name, every database those kids encounter is going to have the same problem ours did. Two people, identical on every field that usually matters, with no way to tell them apart other than a middle name that's not even consistently captured.

Those boys are 3 or 4 years old right now. School enrollment, government IDs, employer HR systems, health insurance carriers: the chaos is still mostly ahead of them, but there's no doubt it's coming.

This is identity resolution. And it's the problem beneath the problem you're solving every time you load a data set.

Most matching logic works on deterministic rules: if first name + last name + birthdate all match, it's the same person. It's simple, fast, and correct most of the time. But "most of the time" in a data set of 150,000 employee records (or 6 million claims records) still leaves thousands of people in ambiguous territory, and that's a problem.
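To make the failure mode concrete, here is a minimal sketch of a deterministic rule of this kind, assuming a simple list of dictionaries. The records, field names, and helper functions are hypothetical and invented for illustration; this is not the Data Nexus implementation.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records; names, IDs, and fields are invented for illustration.
records = [
    {"id": "A1", "first": "Robert", "middle": "James", "last": "Tables", "dob": date(2021, 5, 4)},
    {"id": "A2", "first": "Robert", "middle": "John", "last": "Tables", "dob": date(2021, 5, 4)},
    {"id": "A3", "first": "Maria", "middle": "", "last": "Lopez", "dob": date(1988, 1, 15)},
]


def deterministic_key(rec):
    """Build the match key: normalized first name, last name, and birthdate."""
    return (rec["first"].strip().lower(), rec["last"].strip().lower(), rec["dob"])


def find_collisions(recs):
    """Group records by key and return only the keys shared by two or more records."""
    groups = defaultdict(list)
    for rec in recs:
        groups[deterministic_key(rec)].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


collisions = find_collisions(records)
for key, ids in collisions.items():
    print(key, ids)
```

The rule cannot tell whether A1 and A2 are a duplicate entry or two real people, because the middle name, the only field that separates them, is not part of the key.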

The average healthcare organization has a 10–20% duplicate record rate, so your matching logic is often working against data that's already dirty even before you slam it up against another data set. Some are actual duplicates. Some are different people the system collapsed into one record. Two are twins named after their father.

This potential error rate does real-life damage: missed coverage, claims landing on the wrong account, a member who calls a provider and hears that they aren't eligible for care even when they should be. Bad data can harm people's health, so we have to do better.

Getting to 95%+ accuracy requires a different approach:

Probabilistic matching: the system weighs multiple signals together (address history, SSN fragments, enrollment dates, phone numbers, the presence or absence of a middle name) instead of failing on any single mismatch.
Persistent entity management: once you've resolved a record, you remember how you did it and apply that decision consistently across every downstream feed, forever. In the Data Nexus, we store one "canonical member" record across data sets, including that member's unique ID for each data source, so we know where little Bobby Tables lives in every new pile of records.
Human review: when all else fails, throw it to a human. Sometimes a data mystery requires a bit of human sleuthing to recognize new patterns, or just to send an email and ask, "Is this what you meant to send us?"
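As a rough illustration of how the first and third items fit together, the sketch below scores a pair of records with a simple weighted-agreement sum and routes anything in the gray zone to a person. This is a simplified stand-in for real probabilistic matching, not our production logic: the field names, weights, and thresholds are all assumptions invented for the example.

```python
# Hypothetical weights: (points when a field agrees, points when it disagrees).
# A real system would tune these per data source; the numbers are invented.
WEIGHTS = {
    "last4_ssn": (4.0, -4.0),
    "middle": (2.0, -3.0),
    "address": (2.0, -1.0),
    "phone": (1.5, -0.5),
}
MATCH_THRESHOLD = 5.0      # at or above: treat as the same person
DISTINCT_THRESHOLD = -2.0  # at or below: treat as different people


def score_pair(a, b):
    """Sum agreement and disagreement weights, skipping fields missing on either side."""
    total = 0.0
    for field, (agree, disagree) in WEIGHTS.items():
        left, right = a.get(field), b.get(field)
        if not left or not right:
            continue  # a missing value is treated as no evidence either way
        total += agree if left == right else disagree
    return total


def classify(a, b):
    """Return 'match', 'distinct', or 'review' (send the pair to a human)."""
    score = score_pair(a, b)
    if score >= MATCH_THRESHOLD:
        return "match"
    if score <= DISTINCT_THRESHOLD:
        return "distinct"
    return "review"


# Two hypothetical twins who share a name, birthdate, address, and phone number.
twin_a = {"middle": "james", "last4_ssn": "1234", "address": "12 elm st", "phone": "5550100"}
twin_b = {"middle": "john", "last4_ssn": "5678", "address": "12 elm st", "phone": "5550100"}

# The same twins as they might arrive in a sparse file, without middle name or SSN.
sparse_a = {"address": "12 elm st", "phone": "5550100"}
sparse_b = {"address": "12 elm st", "phone": "5550100"}

print(classify(twin_a, twin_b))      # full records
print(classify(sparse_a, sparse_b))  # sparse records
```

With full records, the conflicting middle names and SSN fragments outweigh the shared address and phone. With sparse records, the score lands between the two thresholds, which is the case where a human sends the email.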

And, of course, the whole system requires intentional, ongoing tuning, because every new employer brings edge cases you haven't seen before.

An engineer can build a deterministic matcher in a week, and it will handle most records correctly while quietly failing on the rest. Building one that handles the twins is a different job: it has to hold both records, request the missing field, resolve the conflict, propagate the correction downstream, and flag this family as a known edge case for every future file. That takes months to make production ready. It took us years to get right, and we're continuously improving it so our clients don't have to.
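The "remember the decision" part of that list can be pictured as a small crosswalk from each source system's ID to one canonical member. The sketch below is an in-memory toy under stated assumptions: the class names, source names, and IDs are hypothetical, and a real store would be persistent and far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalMember:
    """One resolved person, with that person's ID in each source system."""
    canonical_id: str
    source_ids: dict = field(default_factory=dict)  # source name -> source member ID


class MemberRegistry:
    """In-memory stand-in for a persistent store of resolution decisions."""

    def __init__(self):
        self._members = {}    # canonical_id -> CanonicalMember
        self._by_source = {}  # (source, source_id) -> canonical_id

    def link(self, canonical_id, source, source_id):
        """Record that a source-system ID belongs to a canonical member."""
        existing = self._by_source.get((source, source_id))
        if existing is not None and existing != canonical_id:
            raise ValueError(f"{source}:{source_id} is already linked to {existing}")
        member = self._members.setdefault(canonical_id, CanonicalMember(canonical_id))
        member.source_ids[source] = source_id
        self._by_source[(source, source_id)] = canonical_id

    def resolve(self, source, source_id):
        """Return the canonical ID for a source record, or None if it is unseen."""
        return self._by_source.get((source, source_id))


registry = MemberRegistry()
registry.link("M-001", "employer_hr", "E1001")  # first twin
registry.link("M-002", "employer_hr", "E1002")  # second twin, held as a separate person
registry.link("M-001", "carrier", "C77")        # first twin's ID at a hypothetical carrier

print(registry.resolve("carrier", "C77"))
```

Once the twins are linked as two separate canonical members, the next file that carries either source ID resolves to the same answer without re-running the match, and an attempt to link one twin's ID to the other is refused.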

Those boys have a lifetime of electronic pain ahead of them, but at least Data Nexus customers aren’t adding to the chaos.
