I have a database with duplicate records in it and I want to know how many records I should have if I clean out the duplicates. Boy is this thing dirty! The dataset I’m working with a mid sized (approximately 2 million records) and it’s a dump from another system. One of the problems is that the data in the dump has been denormalized. The second part of the problem is that some data has been entered multiple times in the source system1.
Let me give you an example.
1
2
3
4
5
6
7
|
blog_example=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+-------------
public | addresses | table | markmansour
public | users | table | markmansour
(2 rows)
|
If I wanted to count the number of users, this would be straight forward, I’d just:
1
2
3
4
5
|
blog_example=# select count(*) from users;
count
-------
3
(1 row)
|
But let’s look at the data a bit more closely.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
blog_example=# select * from users;
id | name
----+-------
1 | Korny
2 | Tim
3 | Korny
(3 rows)
blog_example=# select * from phone_numbers;
id | user_id | number
----+---------+----------
1 | 1 | 11111111
2 | 1 | 22222222
3 | 2 | 33333333
4 | 3 | 11111111
5 | 3 | 22222222
(5 rows)
|
For the purposes of this example I’ll consider a duplicate to be a user with the exacly the same name, phone number and address – the main thing is that there are multiple one-to-many relationships and that there is repetition. In this example the user Korny (users with the id 1 & 3) have the same phone numbers and the same address and should be considered duplicates.
In
SQL the normal way to group things together is to use the cleverly named “group by” clause, but that doesn’t get us what we’re after
2. I’d like to see the following:
1
2
3
4
5
6
|
blog_example=# magic select name, number but put the numbers on the same line
name | number
-------+----------
Korny | 11111111, 22222222
Tim | 33333333
(2 rows)
|
This can be done with PostgreSql (if you know how to do this in MySql please let me know!) by
creating your own aggregate function . You’ve probably used an aggregate function like
MAX or
AVG before. I’m after a string aggregation function. You can define one like this:
1
2
3
4
5
6
|
CREATE AGGREGATE array_accum_text (
basetype = text,
sfunc = array_append,
stype = text[],
initcond = '{}',
sortop = >); |
This allows related rows to be grouped up, for example:
1
2
3
4
5
6
7
8
9
10
|
blog_example=# select u.*, array_to_string(array_accum_text(cast(ph.number as text)), ',') as all_phone_numbers
blog_example=# from users as u
blog_example=# inner join phone_numbers as ph on u.id = ph.user_id
blog_example=# group by u.id, u.name;
id | name | all_phone_numbers
----+-------+-------------------
3 | Korny | 11111111,22222222
1 | Korny | 11111111,22222222
2 | Tim | 33333333
(3 rows)
|
To take it a step further we can now group related fields together, but I’ll do it via a view. I want the users id to remain so that when I join the text together it doesn’t collapse all the telephone numbers from all the names even if their user ids are different (this is really hard to explain so I suggest trying it out without a view to see what I mean).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
blog_example=# create view extended_users as
blog_example-# select u.id as user_id,
blog_example-# u.name as name,
blog_example-# array_to_string(array_accum_text(cast(ph.number as text)), ',') as all_phone_numbers
blog_example-# from users as u
blog_example-# inner join phone_numbers as ph on u.id = ph.user_id
blog_example-# group by u.id, u.name;
CREATE VIEW
blog_example=# select name, all_phone_numbers from extended_users
blog_example-# group by name, all_phone_numbers;
name | all_phone_numbers
-------+-------------------
Korny | 11111111,22222222
Tim | 33333333
(2 rows)
|
From this query I know that there are only two records once I remove all duplicates. When I ran this over my dataset it took the rows from 2 million down to 1.4 million. That is a lot of redundancy that my users don’t want to see. My next action is going to be to writing some
Rails migrations to clean it up :), but that will have to wait for another post.
Footnotes
1 I want to talk about a technical solution so let ignore the politics of the situation (i.e. let’s presume that I can’t get the data keyed in a better way or have the data delivered in a more normalized format).
2 Some SQL:
1
2
3
4
5
6
7
8
9
|
blog_example=# select u.name, ph.number from users as u
blog_example-# inner join phone_numbers as ph on u.id = ph.user_id
blog_example-# group by u.name, ph.number;
name | number
-------+----------
Korny | 11111111
Korny | 22222222
Tim | 33333333
(3 rows)
|
aggregate, denormalize, postgres, postgresql, sql