A simple technical framework in SQL for de-duplicating data.

Introduction

When de-duplicating data, we could consider simply doing a group-by on a number of fields, looking for groups into which two or more rows fall, and calling those rows duplicates. While this is computationally efficient, it is rather blunt: it will only capture duplicates where the grouped fields are identical across records.
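As a sketch, assuming a table called Individuals with Firstname, Surname and DOB columns (the names here are illustrative), the group-by approach looks something like this:

```sql
-- Find groups of two or more rows whose grouped fields are identical
SELECT Firstname, Surname, DOB, COUNT(*) AS [GroupSize]
FROM Individuals
GROUP BY Firstname, Surname, DOB
HAVING COUNT(*) > 1
```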

We can improve this situation a little using the concept of 'match keys'. This is where we create new fields in the data which are modified or combined versions of existing fields. For instance, if we wanted to match individuals on first name, but deal with the fact that the same name can be spelt differently, we could create a new field containing the soundex of the first name. Soundex is a very old algorithm for converting a word into a four-letter code which represents its sound. You can read about the soundex algorithm online, but for now it's enough to know that Neal and Neil will both resolve to the soundex 'N400'. In this way, you can still use the group-by method to look for duplicates in data, but cater for cases where the underlying values aren't exactly the same between rows.
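In T-SQL, for example, such a match key can be built with the built-in SOUNDEX function (table and column names again illustrative):

```sql
-- Add a match key column and populate it from the existing field
ALTER TABLE Individuals ADD SoundexFirst CHAR(4)

UPDATE Individuals SET SoundexFirst = SOUNDEX(Firstname)

-- The group-by now tolerates spelling variants of the first name
SELECT SoundexFirst, Surname, DOB, COUNT(*) AS [GroupSize]
FROM Individuals
GROUP BY SoundexFirst, Surname, DOB
HAVING COUNT(*) > 1
```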

However, some more advanced tests exist which simply cannot be done using a group-by. For example, consider the simple test: 'How many individual letter edits would it take to get from this piece of text to that piece of text?' (Various implementations of this test are known as 'edit distance' or 'Levenshtein distance'.) If I want to use this test when looking for duplicate individuals, say by applying it to first names, then I simply cannot use a group-by method - I need to consider individual pairs of records.
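SQL Server has no built-in Levenshtein function, so the sketch below assumes a hypothetical user-defined scalar function dbo.Levenshtein has been created (implementations are easy to find online). It shows the shape of a pairwise test:

```sql
-- Pairwise test: dbo.Levenshtein is a hypothetical user-defined function
-- returning the edit distance between its two arguments
SELECT A.ID, B.ID,
       dbo.Levenshtein(A.Firstname, B.Firstname) AS [Firstname_EditDistance]
FROM Individuals A
INNER JOIN Individuals B ON A.ID < B.ID
```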

So, at the other end of the scale, you could consider comparing every row of data against every other row (in SQL we can do this by self-joining a table of individuals to itself) and running a selection of tests on every single pair of records. If we did this, we could do tests equivalent to the group-by method (e.g. does A.Surname = B.Surname), but also more advanced pairwise tests like 'what is the edit distance between row A's surname and row B's surname?'

So it seems like the self-join method is better. However, if we actually did this, and we had N records to start with, then the resulting output would have N(N-1)/2 records - roughly N^2/2, so about 500 billion pairs for a million input records - most of which would be pairs of records which very definitely aren't duplicates. Consequently we would have wasted a lot of time and disk space comparing every record when we only want to compare some sets of records together.

Instead, this framework uses a mix of the two methods. We use a group-by to create 'windows' of records to compare in a pairwise fashion, put the resulting output into a results table, and then do it again using a different group-by window.

We can do this as many times as we like, for each 'scenario' of duplicates we are considering, and finally, when we have finished, we can turn our attention to our big set of results, and the process of deciding which of those potential pairs are actually duplicates, based on the results of each of the pairwise tests.
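Concretely, the results table the framework accumulates pairs into might be declared like this (a sketch; in practice there would be one column per pairwise test - these are the columns used in the example below):

```sql
CREATE TABLE dbo.MatchResults (
    ID1 INT NOT NULL,               -- lower of the two record IDs
    ID2 INT NOT NULL,               -- higher of the two record IDs
    Firstname_Match TINYINT,        -- 1 = test passed, 0 = failed
    SoundexFirst_Match TINYINT,
    Surname_Match TINYINT,
    DOB_Match TINYINT,
    DOB_DifferenceInDays INT
)
```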

In this way, this framework separates the computationally expensive process of 'testing' the records from the interesting process of working out which pairs are actually duplicates. (This also leaves us free to apply some interesting machine learning to the second problem...)

Example

Let's imagine we have data about individuals with the fields Firstname, Surname and DOB, and we are looking for duplicates. Let's imagine also that we have these four duplicate records for Neal Simpson in our data:

ID  Firstname  Surname  DOB
1   Neil       Simpson  08/06/1986
2   Neil       Simpson  08/06/1986
3   Neal       Simpson  08/06/1986
4   Neil       Simpson  09/06/1986

1. The most basic approach we could take is to group by Firstname, Surname and DOB.

This would correctly identify rows 1 and 2 as duplicates, but would miss 3 and 4.

2. Now we could try adding some match keys to the data, for instance, we could add a Soundex(Firstname) field:

ID  Firstname  Surname  DOB         SoundexFirst
1   Neil       Simpson  08/06/1986  N400
2   Neil       Simpson  08/06/1986  N400
3   Neal       Simpson  08/06/1986  N400
4   Neil       Simpson  09/06/1986  N400

Now we could group by 'SoundexFirst', Surname and DOB and the system would correctly identify 1, 2, and 3 as duplicates - but miss 4.

3. We could try to also catch row 4 in our group-by approach by grouping on the 'week' of the DOB rather than the date itself, on the basis that if two dates are close, they are likely to fall in the same week. However, in this case the 8th is a Sunday and the 9th is a Monday, so even though the dates are only one day apart, we would miss the match because they fall in different weeks. In order to capture the fact that the dates are one day apart, we must consider pairwise matching.
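For reference, a 'week of DOB' match key could be built in T-SQL with DATEPART. Using ISO weeks (which run Monday to Sunday) illustrates the problem just described - Sunday the 8th and Monday the 9th land in consecutive, different weeks:

```sql
-- ISO weeks run Monday to Sunday, so these two adjacent dates
-- produce different week numbers and would never group together
SELECT DATEPART(ISO_WEEK, CAST('1986-06-08' AS DATE)) AS [WeekOf8th],
       DATEPART(ISO_WEEK, CAST('1986-06-09' AS DATE)) AS [WeekOf9th]
```

(Plain DATEPART(WEEK, ...) behaves similarly but depends on the server's DATEFIRST setting.)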

So now consider doing a pairwise match on these rows and putting the results into a potential match table. We only want to consider pairs where SoundexFirst and Surname match:

INSERT INTO dbo.MatchResults
SELECT
	A.ID AS [ID1],
	B.ID AS [ID2],
	CASE WHEN A.Firstname = B.Firstname THEN 1 ELSE 0 END AS [Firstname_Match],
	1 AS [SoundexFirst_Match],  -- 1 by construction: the join requires these to match
	1 AS [Surname_Match],
	CASE WHEN A.DOB = B.DOB THEN 1 ELSE 0 END AS [DOB_Match],
	DATEDIFF(DAY, A.DOB, B.DOB) AS [DOB_DifferenceInDays]
FROM Individuals A
INNER JOIN Individuals B
	ON A.ID < B.ID
	AND A.SoundexFirst = B.SoundexFirst
	AND A.Surname = B.Surname

This would make a results table something like this:

ID1  ID2  Firstname_Match  SoundexFirst_Match  Surname_Match  DOB_Match  DOB_DifferenceInDays
1    2    1                1                   1              1          0
1    3    0                1                   1              1          0
1    4    1                1                   1              0          1
2    3    0                1                   1              1          0
2    4    1                1                   1              0          1
3    4    1                1                   1              0          1
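Once the results table is populated, deciding which pairs to treat as duplicates can start as a simple scoring query over the test columns (the rules and threshold here are purely illustrative):

```sql
-- Call a pair a duplicate if every test passed, or if the only
-- disagreement is a date of birth one day out
SELECT ID1, ID2
FROM dbo.MatchResults
WHERE (Firstname_Match + SoundexFirst_Match + Surname_Match + DOB_Match) >= 4
   OR (SoundexFirst_Match = 1 AND Surname_Match = 1
       AND ABS(DOB_DifferenceInDays) <= 1)
```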

4. Now in a real situation we would go on to consider a number of other scenarios, each with its own group-by window and set of pairwise tests.

 

To be continued...