<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>State Of Flux &#187; postgres</title>
	<atom:link href="http://stateofflux.com/tag/postgres/feed/" rel="self" type="application/rss+xml" />
	<link>http://stateofflux.com</link>
	<description>always changing</description>
	<lastBuildDate>Fri, 29 Jan 2010 03:29:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Australian GeoSpatial Data &#8211; Free</title>
		<link>http://stateofflux.com/2008/10/19/australian-geospatial-data-free/</link>
		<comments>http://stateofflux.com/2008/10/19/australian-geospatial-data-free/#comments</comments>
		<pubDate>Sun, 19 Oct 2008 09:44:00 +0000</pubDate>
		<dc:creator>mark</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[australia]]></category>
		<category><![CDATA[centroid]]></category>
		<category><![CDATA[esri]]></category>
		<category><![CDATA[geocoding]]></category>
		<category><![CDATA[geospatial]]></category>
		<category><![CDATA[lambertconformalconic]]></category>
		<category><![CDATA[postgis]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[projection]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[suburbs]]></category>
		<category><![CDATA[transform]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://markmansour.wordpress.com//2008/10/19/australian-geospatial-data-free</guid>
		<description><![CDATA[Edit: There are notes in the comments from Tim that explain the changes for PostgreSQL 8.4.  Thanks Tim! I’ve built a couple of sites that needed geospatial data. One was a social networking site that needed a way to list people who were near other people, the other was an art web site that allowed [...]]]></description>
			<content:encoded><![CDATA[<p><span style="color: #ff0000;">Edit: There are notes in the comments from Tim that explain the changes for PostgreSQL 8.4.  Thanks Tim!</span></p>
<p>I’ve built a couple of sites that needed geospatial data.  One was a social networking site that needed a way to list people who were near other people, the other was an art web site that allowed users to upload street art and show it on a map.  I thought it would be interesting to get the basics of an Australian suburb dataset up and running in a geospatial database and do some simple queries.</p>
<h3>Install PostgreSQL and PostGIS</h3>
<p>The first thing to do is set up PostgreSQL and PostGIS.  I’m sure you can do this in MySQL but I haven’t done it, so leave a note in the comments if you get that up and running :).  There are a few articles on how to do this and it is platform specific, so go and do that.</p>
<h3>Get some Suburb data</h3>
<p>Now we need some data.  The <span class="caps">ABS</span> is kind enough to provide Australia broken down into suburbs and postcodes on their site.  I’m going to deal with suburbs so go ahead and download the <a href="http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2923.0.30.0012006?OpenDocument">State Suburbs (SSC) 2006 Digital Boundaries in <span class="caps">ESRI</span> Shapefile format</a> data cube.  This data cube has every suburb in Australia defined as a Polygon (or a multipolygon) with each node defined as a latitude and longitude.</p>
<h3>Converting it to <span class="caps">SQL</span></h3>
<p>Unzip the downloaded shapefile and you’ll get 8 files but we are only concerned with the <code>SSC06aAUST_region.*</code> ones.  We are going to load the SSC06aAUST_region data into the database but first we need to convert it into <span class="caps">SQL</span>.</p>
<pre lang="bash">shp2pgsql -s 4283 -I -d SSC06aAUST_region.shp suburbs &gt; suburbs.sql</pre>
<p>shp2pgsql converts the <span class="caps">ESRI</span> Shapefile into <span class="caps">SQL</span>.  -I adds a spatial index (which is very important for speed) and -d drops and recreates the table.  The -s 4283 makes sure the suburb data is loaded with the correct spatial reference.  The earth isn’t a sphere and different parts of the earth are curved slightly differently so the geo-bods came up with a whole bunch of coordinate systems and projections.  4283 is the standardized number (SRID) for <span class="caps">GDA 1994</span>, which is the coordinate system the suburb data comes in (you can just take a peek inside the SSC06aAUST_region.prj file to see what it is).</p>
<h3>Create a Geo-enabled DB and load the data</h3>
<pre lang="bash">createdb australia
createlang plpgsql australia
psql -f /opt/local/share/postgis/lwpostgis.sql -d australia
psql -f /opt/local/share/postgis/spatial_ref_sys.sql -d australia
psql australia &lt; suburbs.sql</pre>
<p>Note: The directories for lwpostgis.sql and spatial_ref_sys.sql will vary from system to system so you’ll have to find them on your own machine.</p>
<p>You will also want to create a reference table for the Australian states:</p>
<pre lang="sql">create table aust_states (id integer primary key, state_name varchar, state_abbrev varchar);
insert into aust_states (id, state_name, state_abbrev) values (1, 'New South Wales', 'NSW');
insert into aust_states (id, state_name, state_abbrev) values (2, 'Victoria', 'VIC');
insert into aust_states (id, state_name, state_abbrev) values (3, 'Queensland', 'QLD');
insert into aust_states (id, state_name, state_abbrev) values (4, 'South Australia', 'SA');
insert into aust_states (id, state_name, state_abbrev) values (5, 'Western Australia', 'WA');
insert into aust_states (id, state_name, state_abbrev) values (6, 'Tasmania', 'TAS');
insert into aust_states (id, state_name, state_abbrev) values (7, 'Northern Territory', 'NT');
insert into aust_states (id, state_name, state_abbrev) values (8, 'Australian Capital Territory', 'ACT');
insert into aust_states (id, state_name, state_abbrev) values (9, 'Other Territories', 'OT');</pre>
<h3>Get some awesome answers!</h3>
<h4>Show me the polygon of Port Melbourne</h4>
<pre lang="sql">select name_2006, astext(the_geom)  from suburbs where name_2006 = 'Port Melbourne';</pre>
<p>This returns a whole bunch of lat and longs.  Pretty useless really.  Maybe having the center of a suburb would be more useful.</p>
<h4>Show me the center of Port Melbourne</h4>
<pre lang="sql">select name_2006, astext(centroid(the_geom))  from suburbs where name_2006 = 'Port Melbourne';

   name_2006    |                  astext
----------------+-------------------------------------------
 Port Melbourne | POINT(144.921987367191 -37.8328692507562)
(1 row)</pre>
<p>Much better!</p>
<h4>Show me the suburbs that surround Port Melbourne</h4>
<pre lang="sql">select surrounding.name_2006
    from suburbs source, suburbs surrounding
    where source.name_2006 = 'Port Melbourne'
        and touches(source.the_geom, surrounding.the_geom);

    name_2006
-----------------
 Albert Park
 Docklands
 South Melbourne
 Southbank
 Spotswood
 West Melbourne
 Yarraville
(7 rows)</pre>
<p>Here I select from the suburbs table twice: once as the source suburb (in this case Port Melbourne) and once as the surrounding suburb.  I then restrict my matches to only the polygons that touch the source.</p>
<h4>Show me the suburbs that surround Port Melbourne with distances between suburbs</h4>
<pre lang="sql">select surrounding.name_2006,
       distance(transform(centroid(source.the_geom),3112),
                transform(centroid(surrounding.the_geom),3112))
    from suburbs source, suburbs surrounding
    where source.name_2006 = 'Port Melbourne'
        and touches(source.the_geom, surrounding.the_geom);

    name_2006    |     distance
-----------------+------------------
 Albert Park     | 3908.06472236311
 Docklands       | 2316.21021732757
 South Melbourne | 3106.68573231296
 Southbank       | 3492.93829708397
 Spotswood       |  3035.6283677131
 West Melbourne  | 2682.84381789969
 Yarraville      | 3914.04324956383
(7 rows)</pre>
<p>The interesting part here is getting the distance between suburbs.  The distance() function gets the distance between two geometries, which for us are the two center points of our suburbs.  Unfortunately, if you measure the distance directly you’ll get an answer in degrees (lat and long are in degrees), which isn’t that useful.  So you need to transform the points from a degree based coordinate system to a <a href="http://postgis.refractions.net/pipermail/postgis-users/2008-June/020182.html">meter based projection</a>.  Australia happens to have one: a Lambert Conformal Conic projection known as number 3112.  Hence:</p>
<pre lang="sql">distance(transform(centroid(source.the_geom),3112),
                transform(centroid(surrounding.the_geom),3112))</pre>
<p>will get the distance, in meters, between the center points of two suburbs.</p>
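<p>To see why degrees are a poor unit of distance, here is a rough Python sketch.  It is only an illustration and not part of the PostGIS setup: it assumes a spherical earth with a mean radius of 6,371 km (PostGIS works with a proper ellipsoid), and the function name is made up for this example.  It works out how many meters one degree of longitude and one degree of latitude cover at the latitude of the Port Melbourne center point returned earlier.</p>
<pre lang="python">import math

# Assumption: a spherical earth with a mean radius of 6,371 km.
# PostGIS uses a proper ellipsoid, so treat these numbers as approximate.
EARTH_RADIUS_M = 6371000.0


def meters_per_degree(latitude_deg):
    """Return (meters per degree of longitude, meters per degree of latitude)."""
    one_degree = math.radians(1.0) * EARTH_RADIUS_M
    # Lines of longitude converge towards the poles, so a degree of
    # longitude shrinks by cos(latitude).
    return one_degree * math.cos(math.radians(latitude_deg)), one_degree


# Latitude of the Port Melbourne center point from the centroid query.
lon_m, lat_m = meters_per_degree(-37.8328692507562)
print("1 degree of longitude is roughly %.0f m" % lon_m)
print("1 degree of latitude is roughly %.0f m" % lat_m)</pre>
<p>Under that assumption a degree of longitude comes out at roughly 88 km and a degree of latitude at roughly 111 km, so a distance in degrees means something different depending on direction and on how far south you are.  Transforming to 3112 first avoids that problem.</p>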
<h4>Show me all the suburbs named Richmond</h4>
<pre lang="sql">select name_2006,
       state_name
    from suburbs
    inner join aust_states on suburbs.state_2006 = aust_states.id
    where name_2006 = 'Richmond';

 name_2006 |   state_name
-----------+-----------------
 Richmond  | Victoria
 Richmond  | South Australia
 Richmond  | Tasmania
(3 rows)</pre>
<h3>What’s next?</h3>
<p>This is all very nice, but when you start geocoding data and getting lat/longs of items you can store in the db then you can do some really fun stuff.  If this article generates enough interest I’ll follow up with some Ruby code and Google Maps integration.</p>
]]></content:encoded>
			<wfw:commentRss>http://stateofflux.com/2008/10/19/australian-geospatial-data-free/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>activerecord-postgresql-adapter in Rails 2.1</title>
		<link>http://stateofflux.com/2008/07/13/activerecord-postgresql-adapter-in-rails-2-1/</link>
		<comments>http://stateofflux.com/2008/07/13/activerecord-postgresql-adapter-in-rails-2-1/#comments</comments>
		<pubDate>Sun, 13 Jul 2008 01:20:00 +0000</pubDate>
		<dc:creator>mark</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[activerecord]]></category>
		<category><![CDATA[pg]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[rails]]></category>

		<guid isPermaLink="false">http://markmansour.wordpress.com//2008/07/13/activerecord-postgresql-adapter-in-rails-2-1</guid>
		<description><![CDATA[If you get the following message Please install the postgresql adapter: `gem install activerecord-postgresql-adapter` It means you don&#8217;t have the new &#8216;pg&#8217; postgresql library installed. This is easily fixed with a bit of sudo gem install pg]]></description>
			<content:encoded><![CDATA[<p>If you get the following message</p>
<pre lang="ruby">
Please install the postgresql adapter: `gem install activerecord-postgresql-adapter`
</pre>
<p>It means you don&#8217;t have the new &#8216;pg&#8217; postgresql library installed.  This is easily fixed with a bit of</p>
<pre lang="ruby">
sudo gem install pg
</pre>
]]></content:encoded>
			<wfw:commentRss>http://stateofflux.com/2008/07/13/activerecord-postgresql-adapter-in-rails-2-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cleaning dirty database data</title>
		<link>http://stateofflux.com/2008/06/09/cleaning-dirty-database-data/</link>
		<comments>http://stateofflux.com/2008/06/09/cleaning-dirty-database-data/#comments</comments>
		<pubDate>Mon, 09 Jun 2008 05:30:00 +0000</pubDate>
		<dc:creator>mark</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[aggregate]]></category>
		<category><![CDATA[denormalize]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://markmansour.wordpress.com//2008/06/09/cleaning-dirty-database-data</guid>
		<description><![CDATA[I have a database with duplicate records in it and I want to know how many records I should have if I clean out the duplicates. Boy is this thing dirty! The dataset I&#8217;m working with is mid sized (approximately 2 million records) and it&#8217;s a dump from another system. One of the problems is [...]]]></description>
			<content:encoded><![CDATA[<p>I have a database with duplicate records in it and I want to know how many records I should have if I clean out the duplicates.  Boy is this thing <strong>dirty!</strong>  The dataset I&#8217;m working with is mid sized (approximately 2 million records) and it&#8217;s a dump from another system.  One of the problems is that the data in the dump has been denormalized.  The second part of the problem is that some data has been entered multiple times in the source system<sup><a href="#fn1">1</a></sup>.</p>
<p>Let me give you an example.</p>
<pre lang="sql">
blog_example=# \d
              List of relations
 Schema |     Name      | Type  |    Owner
--------+---------------+-------+-------------
 public | addresses     | table | markmansour
 public | phone_numbers | table | markmansour
 public | users         | table | markmansour
(3 rows)
</pre>
<p>If I wanted to count the number of users, this would be straightforward.  I&#8217;d just:</p>
<pre lang="sql">
blog_example=# select count(*) from users;
 count
-------
     3
(1 row)
</pre>
<p>But let&#8217;s look at the data a bit more closely.</p>
<pre lang="sql">
blog_example=# select * from users;
 id | name
----+-------
  1 | Korny
  2 | Tim
  3 | Korny
(3 rows)

blog_example=# select * from phone_numbers;
 id | user_id |  number
----+---------+----------
  1 |       1 | 11111111
  2 |       1 | 22222222
  3 |       2 | 33333333
  4 |       3 | 11111111
  5 |       3 | 22222222
(5 rows)
</pre>
<p>For the purposes of this example I&#8217;ll consider a duplicate to be a user with exactly the same name, phone numbers and address &#8211; the main thing is that there are multiple one-to-many relationships and that there is repetition.  In this example the two users named Korny (ids 1 &#38; 3) have the same phone numbers and the same address and should be considered duplicates.<br />
In <span class="caps">SQL</span> the normal way to group things together is to use the cleverly named &#8220;group by&#8221; clause, but that doesn&#8217;t get us what we&#8217;re after<sup><a href="#fn2">2</a></sup>.  I&#8217;d like to see the following:</p>
<pre lang="sql">
blog_example=# magic select name, number but put the numbers on the same line
 name  |  number
-------+----------
 Korny | 11111111, 22222222
 Tim   | 33333333
(2 rows)
</pre>
<p>This can be done with PostgreSQL (if you know how to do this in MySQL please let me know!) by <a href="http://www.postgresql.org/docs/8.3/static/xaggr.html">creating your own aggregate function</a>.  You&#8217;ve probably used an aggregate function like <span class="caps">MAX</span> or <span class="caps">AVG</span> before.  I&#8217;m after a string aggregation function.  You can define one like this:</p>
<pre lang="sql">
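-- Collects every text value in a group into a text[] array.
-- No sortop here: that option is only meant for MIN/MAX style aggregates.
CREATE AGGREGATE array_accum_text (
    basetype = text,
    sfunc = array_append,
    stype = text[],
    initcond = '{}');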
</pre>
<p>This allows related rows to be grouped up, for example:</p>
<pre lang="sql">
blog_example=# select u.*, array_to_string(array_accum_text(cast(ph.number as text)), ',') as all_phone_numbers
blog_example-#   from users as u
blog_example-#   inner join phone_numbers as ph on u.id = ph.user_id
blog_example-#   group by u.id, u.name;
 id | name  | all_phone_numbers
----+-------+-------------------
  3 | Korny | 11111111,22222222
  1 | Korny | 11111111,22222222
  2 | Tim   | 33333333
(3 rows)
</pre>
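<p>If the aggregate feels a bit magic, here is a rough Python sketch of what array_accum_text and array_to_string are doing with the rows above.  It is only an illustration of the idea, not how PostgreSQL implements it, and the function name is made up for this example.</p>
<pre lang="python">from collections import OrderedDict

# Rows copied from the phone_numbers table above: (id, user_id, number).
phone_numbers = [
    (1, 1, "11111111"),
    (2, 1, "22222222"),
    (3, 2, "33333333"),
    (4, 3, "11111111"),
    (5, 3, "22222222"),
]


def accumulate_numbers(rows):
    """Group numbers by user_id, like array_accum_text with GROUP BY."""
    grouped = OrderedDict()
    for _id, user_id, number in rows:
        # The state starts as an empty list (initcond = '{}') and each row
        # is appended to it (sfunc = array_append).
        grouped.setdefault(user_id, []).append(number)
    # array_to_string(..., ',') joins each array into one string.
    return OrderedDict((uid, ",".join(nums)) for uid, nums in grouped.items())


all_phone_numbers = accumulate_numbers(phone_numbers)
for user_id, numbers in all_phone_numbers.items():
    print(user_id, numbers)</pre>
<p>Users 1 and 3 end up with the same combined string, which is exactly what makes the duplicate visible.</p>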
<p>To take it a step further we can now group on the combined fields, but I&#8217;ll do it via a view.  The view keeps the user&#8217;s id so that the phone numbers are joined together per user first; grouping by name alone would lump together the numbers of every user who shares a name, even when their user ids are different (this is really hard to explain so I suggest trying it out without a view to see what I mean).</p>
<pre lang="sql">
blog_example=# create view extended_users as
blog_example-#   select u.id as user_id,
blog_example-#          u.name as name,
blog_example-#          array_to_string(array_accum_text(cast(ph.number as text)), ',') as all_phone_numbers
blog_example-#     from users as u
blog_example-#     inner join phone_numbers as ph on u.id = ph.user_id
blog_example-#     group by u.id, u.name;
CREATE VIEW

blog_example=# select name, all_phone_numbers from extended_users
blog_example-#   group by name, all_phone_numbers;
 name  | all_phone_numbers
-------+-------------------
 Korny | 11111111,22222222
 Tim   | 33333333
(2 rows)
</pre>
<p>From this query I know that there are only two records once I remove all duplicates.  When I ran this over my dataset it took the rows from 2 million down to 1.4 million.  That is a lot of redundancy that my users don&#8217;t want to see.  My next action is going to be writing some <a href="http://api.rubyonrails.org/classes/ActiveRecord/Migration.html">Rails migrations</a> to clean it up :), but that will have to wait for another post.</p>
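<p>The same count can be sanity checked outside the database.  The Python sketch below is a hypothetical cross-check, not the approach used above: it treats two users as duplicates when they share a name and the same set of phone numbers, and it ignores addresses just like the view does.  Unlike the inner join in the view, it still counts users who have no phone numbers at all.</p>
<pre lang="python"># Rows copied from the example tables above.
users = [(1, "Korny"), (2, "Tim"), (3, "Korny")]  # (id, name)
phone_numbers = [
    (1, "11111111"), (1, "22222222"),
    (2, "33333333"),
    (3, "11111111"), (3, "22222222"),
]  # (user_id, number)


def count_distinct_users(users, phone_numbers):
    """Count users after collapsing those with the same name and numbers."""
    numbers_by_user = {}
    for user_id, number in phone_numbers:
        numbers_by_user.setdefault(user_id, set()).add(number)
    # A frozenset ignores the order the numbers were entered in, which the
    # comma separated string in the view does not.
    distinct = set(
        (name, frozenset(numbers_by_user.get(user_id, ())))
        for user_id, name in users
    )
    return len(distinct)


print(count_distinct_users(users, phone_numbers))  # prints 2</pre>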
<h3>Footnotes</h3>
<p id="fn1"><sup>1</sup> I want to talk about a technical solution so let&#8217;s ignore the politics of the situation (i.e. let&#8217;s presume that I can&#8217;t get the data keyed in a better way or have the data delivered in a more normalized format).</p>
<p id="fn2"><sup>2</sup> Some <span class="caps">SQL</span>:</p>
<pre lang="sql">
blog_example=# select u.name, ph.number from users as u
blog_example-#        inner join phone_numbers as ph on u.id = ph.user_id
blog_example-#        group by u.name, ph.number;
 name  |  number
-------+----------
 Korny | 11111111
 Korny | 22222222
 Tim   | 33333333
(3 rows)
</pre>
]]></content:encoded>
			<wfw:commentRss>http://stateofflux.com/2008/06/09/cleaning-dirty-database-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
