Waiting for PostGIS 2.1 - Faster Dump Points

BostonGIS BostonGIS Home About Postgres OnLine Journal PostGIS in Action

Friday, November 02. 2012

Posted by Regina Obe in gis, postgis postgresql, waiting_postgis_21

Comments (0)
Trackbacks (0)

Waiting for PostGIS 2.1 - Faster Dump Points

One of the latest exciting additions to PostGIS 2.1 is a complete C implementation of ST_DumpPoints which is a patch contributed by Nathan Wagner. Past versions of ST_DumpPoints piggy backed on ST_Dump and plpgsql for recursion so this has been a long-awaited todo. This function is dear to my heart because it is a sub-step of many geoprocessing workflows. What's dearer to my heart though is getting rid of the memory copy bottlenecks that make plpgsql processes inherently slower than C-native ones. That's probably too much to ask. In theory this new ST_DumpPoints version should be faster, but how much. There was also some intimation that the new ST_DumpPoints implementation has improvements over the C-based ST_Dump. I decided to put both of these theories to the test.

If you are on windows PostgreSQL 64-bit or using PostgreSQL 9.2+ - you can download experimental binaries for windows

The trivial lets Dump Points from multipoint collection

If you had a multipoint collection, in prior versions of PostGIS you'd be better off using ST_Dump instead of ST_DumpPoints, but in 2.1.0, this is not the case. Why was it faster to use ST_Dump -- ST_Dump is a pure C and ST_DumpPoints is just going to call ST_Dump anyway and it apparently does it less efficiently than I had hoped, probably because of unnecessary memory copy calls.

First we'll create a dummy table with a single row multipoint collection.

CREATE TABLE miscpoints(geom geometry(MULTIPOINTZ));
INSERT INTO miscpoints(geom)
SELECT ST_Collect(ST_MakePoint(x,y,z))
 FROM generate_series(1,100) As x
        CROSS JOIN generate_series(200,300) As y
       CROSS JOIN generate_series(-10,10) As z;

Dumping points from a multipoint with ST_DumpPoints

Now to test ST_DumpPoints under both 2.1 and 2.0 (I'm not bothering with 1.5 since I don't have that readily accessible and should be in theory same speed as 2.0)

This is also testing under PostgreSQL 9.2 32-bit on a windows 7 box with not much memory. PostGIS versions tested are:
2.1.0 SVN r10609
2.0.2 SVN r10315

-- 212,100 points
-- 2.1.0 SVN 5720 ms
-- 2.0.2 SVN 11806 ms
SELECT (gd).path[1] As pos, (gd).geom As geom
FROM (SELECT ST_DumpPoints(geom) As gd 
 FROM miscpoints) As foo;

-- without network effects ST_DumpPoints
-- 212,100 points
-- 2.1.0 SVN 230 ms
-- 2.0.2 SVN 5797 ms
SELECT count(*) 
FROM 
 (SELECT (gd).path[1] As pos, (gd).geom As geom
FROM (SELECT ST_DumpPoints(geom) As gd 
 FROM miscpoints) As foo
) as foo;

As you can see, the new implementation is clearly faster. It's about 10 times faster and the speed between the 2 versions is a bit dwarfed if you needed to transfer these records somewhere like to pgAdmin so in practice you may only see a 2-fold to 5 fold improvement.

-- bulk insert (partial network)
-- 2.0.2 SVN 7211 ms
-- 2.1.0 SVN 1306 ms
SELECT (gd).path[1] As pos, (gd).geom As geom
INTO miscdp
FROM (SELECT ST_DumpPoints(geom) As gd 
    FROM miscpoints) As foo;

Dumping to another table is not so bad

Dumping points from a multipoint with ST_Dump

ST_DumpPoints seems to be faster than it's older ST_Dump sibling as demonstrated here. In 2.0 you see ST_Dump is faster than using ST_DumpPoints for this use-case and shockingly much much faster. However in 2.1.0, the tides have changed, and ST_DumpPoints even for points is faster.

-- without network effects - ST_Dump
-- 212,100 points
-- 2.1.0 SVN 540 ms
-- 2.0.2 SVN 560 ms
SELECT count(*) 
FROM 
 (SELECT (gd).path[1] As pos, (gd).geom As geom
FROM (SELECT ST_Dump(geom) As gd FROM miscpoints) As foo
) as foo;

Dumping Points from Huge Multipolygons

One use case of dump points is to expand a multipolygon so you can feed it to something like PL/R to be reconstituted. This is one one of the reasons why the path variable is important. So I tried this with US State 2011 tiger boundaries (top 3 states with most points) which you can determine by running this query.

-- TX (58556), MN (41195), VA (39931) have most points
 SELECT stusps
    ,ST_NPoints(geom) As npoints
    , ROW_NUMBER() OVER(ORDER BY ST_NPoints(geom) DESC) As n
 FROM tl_2011_us_state
 ORDER BY n LIMIT 3;

Now to test the dumping speeds

-- 2.1.0 SVN r10609 139,682 rows, 3660ms
-- 2.0.2 SVN r10315 139,682 rows, 4209ms
SELECT stusps
        , (gd).path[1] As polyp
        , (gd).path[2] As ringp
        , (gd).path[3] As pointp
        , (gd).geom as pt
FROM (
        SELECT stusps, geom, ST_DumpPoints(geom) as gd
                FROM tl_2011_us_state
        WHERE stusps IN('TX', 'MN', 'VA') ) As foo;

-- without network effects
-- 2.1.0 SVN r10609 139,682 rows, 150ms
-- 2.0.2 SVN r10315 139,682 rows, 890ms
SELECT count(*)
FROM (
SELECT stusps
        , (gd).path[1] As polyp
        , (gd).path[2] As ringp
        , (gd).path[3] As pointp
        , (gd).geom as pt
FROM (
        SELECT stusps, geom, ST_DumpPoints(geom) as gd
                FROM tl_2011_us_state
        WHERE stusps IN('TX', 'MN', 'VA') ) As foo ) As foo;

So for this use-case, 2.1.0 ST_DumpPoints is about 8 times faster, but again the differences between the two speeds will be mostly drowned by network effects.

Trackbacks

Trackback specific URI for this entry

No Trackbacks

Comments

Display comments as (Linear | Threaded)

No comments

Add Comment

Name
Email
Homepage
In reply to
Comment	E-Mail addresses will not be displayed and will only be used for E-Mail notifications. To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly. Enter the string from the spam-prevention image above:
	Remember Information? Subscribe to this entry