Archive for June, 2008

Column editing with Emacs

Posted in Home on June 14th, 2008 by mark – 3 Comments


Emacs Column Editing from Mark Mansour on Vimeo.

My OS X lovin’ buddies love to point out how easy it is to manipulate columns in TextMate . But I’m old skool and I still love emacs . So to prove that column editing mode is a cinch I’ve put together my first screencast to demonstrate column editing mode in Emacs. Emacs ships with CUA mode which you need to turn on and then let the good times begin. Enjoy…

Cleaning dirty database data

Posted in Home on June 9th, 2008 by mark – Be the first to comment

I have a database with duplicate records in it and I want to know how many records I should have if I clean out the duplicates. Boy is this thing dirty! The dataset I’m working with a mid sized (approximately 2 million records) and it’s a dump from another system. One of the problems is that the data in the dump has been denormalized. The second part of the problem is that some data has been entered multiple times in the source system1.

Let me give you an example.

blog_example=# \d
              List of relations
 Schema |     Name      | Type  |    Owner    
--------+---------------+-------+-------------
 public | addresses     | table | markmansour
 public | users         | table | markmansour
(2 rows)

If I wanted to count the number of users, this would be straight forward, I’d just:

blog_example=# select count(*) from users;
 count 
-------
     3
(1 row)

But let’s look at the data a bit more closely.

blog_example=# select * from users;
 id | name  
----+-------
  1 | Korny
  2 | Tim
  3 | Korny
(3 rows)

blog_example=# select * from phone_numbers;
 id | user_id |  number  
----+---------+----------
  1 |       1 | 11111111
  2 |       1 | 22222222
  3 |       2 | 33333333
  4 |       3 | 11111111
  5 |       3 | 22222222
(5 rows)

For the purposes of this example I’ll consider a duplicate to be a user with the exacly the same name, phone number and address – the main thing is that there are multiple one-to-many relationships and that there is repetition. In this example the user Korny (users with the id 1 & 3) have the same phone numbers and the same address and should be considered duplicates.
In SQL the normal way to group things together is to use the cleverly named “group by” clause, but that doesn’t get us what we’re after2. I’d like to see the following:

blog_example=# magic select name, number but put the numbers on the same line
 name  |  number  
-------+----------
 Korny | 11111111, 22222222
 Tim   | 33333333
(2 rows)

This can be done with PostgreSql (if you know how to do this in MySql please let me know!) by creating your own aggregate function . You’ve probably used an aggregate function like MAX or AVG before. I’m after a string aggregation function. You can define one like this:

CREATE AGGREGATE array_accum_text (
    basetype = text,
    sfunc = array_append,
    stype = text[],
    initcond = '{}',
    sortop = >);

This allows related rows to be grouped up, for example:

blog_example=# select u.*, array_to_string(array_accum_text(cast(ph.number as text)), ',') as all_phone_numbers
blog_example=#   from users as u
blog_example=#   inner join phone_numbers as ph on u.id = ph.user_id
blog_example=#   group by u.id, u.name;
 id | name  | all_phone_numbers 
----+-------+-------------------
  3 | Korny | 11111111,22222222
  1 | Korny | 11111111,22222222
  2 | Tim   | 33333333
(3 rows)

To take it a step further we can now group related fields together, but I’ll do it via a view. I want the users id to remain so that when I join the text together it doesn’t collapse all the telephone numbers from all the names even if their user ids are different (this is really hard to explain so I suggest trying it out without a view to see what I mean).

blog_example=# create view extended_users as
blog_example-#   select u.id as user_id,
blog_example-#          u.name as name, 
blog_example-#          array_to_string(array_accum_text(cast(ph.number as text)), ',') as all_phone_numbers
blog_example-#     from users as u
blog_example-#     inner join phone_numbers as ph on u.id = ph.user_id
blog_example-#     group by u.id, u.name;
CREATE VIEW

blog_example=# select name, all_phone_numbers from extended_users
blog_example-#   group by name, all_phone_numbers;
 name  | all_phone_numbers 
-------+-------------------
 Korny | 11111111,22222222
 Tim   | 33333333
(2 rows)

From this query I know that there are only two records once I remove all duplicates. When I ran this over my dataset it took the rows from 2 million down to 1.4 million. That is a lot of redundancy that my users don’t want to see. My next action is going to be to writing some Rails migrations to clean it up :) , but that will have to wait for another post.

Footnotes

1 I want to talk about a technical solution so let ignore the politics of the situation (i.e. let’s presume that I can’t get the data keyed in a better way or have the data delivered in a more normalized format).

2 Some SQL:

blog_example=# select u.name, ph.number from users as u
blog_example-#        inner join phone_numbers as ph on u.id = ph.user_id
blog_example-#        group by u.name, ph.number;
 name  |  number  
-------+----------
 Korny | 11111111
 Korny | 22222222
 Tim   | 33333333
(3 rows)