Many a child has looked at a dog chasing its tail and wondered "What is the point of this game? and why do I not have a tail to chase?"
After much time pondering these intriguing questions of life, I have concluded that a dog chases its tail in a hope to catch it.
Once caught, the game then becomes "Can I catch my tail faster than I did before".
In this excerpt, I shall demonstrate the relational database version (PostgreSQL specifically) of the dog chasing its tail.
I call it this because it stirs up such vivid imagery and I have a great fondness for physical imagery that transforms seemingly stale abstract concepts into vibrant creatures.
I suppose that is also a statement of how my mind works: a stream of seemingly unrelated allegories strung together in a bizarre but interesting orchestra.
Getting back to our nearest neighbor problem. We know that expand uses indexes and distance does not. If only we could guess at the smallest expanse that encases
our k neighbors for each reference geometry, we would be in much better shape. We have the same problem as the dog. The faster the dog runs the faster his tail runs away from him. In our case, the larger we make our box, the more likely we would be to trap our neighbors, but then also the more difficult it would be to catalog all those distances of trapped objects.
So how do we guess at this magical expand box? We keep on looking back at the path we have traveled so far and if we have collected enough objects and then stop. We must also observe that our problem can be thought of as a sequence of rhetorical questions that are all asking "Are we there yet?" and that we can solve our problem by simply asking the same question over and over again "Are we there yet? Do we have enough k yet? Until the answer switches from no to YES!
For this particular approach, I am using an sql recursive function that will for each iteration calculate the difference between the current expand box and the previous and then continue with the next expand box if the sum of all these differences collected is still less than our target. Since each nested box is only returning the differential of the previous, it is already pretty much sorted by distance and even more sorted the more slices you slice and dice your problem into. So the cost is the effort to make slices vs. the cost to do distance against a larger slice. We stop calculating at the biggest box that has more neighbors than we need but where the previous expand box had fewer neighbors than we need. This is the solution that I was really hoping the PostgreSQL planner would reduce to with my previous exercises, but sadly didn't.
For this particular approach, we will use the expandoverlap_metric function we created in Battle for the Nearest Neighbor A small success.
Below is what the code looks like for this new function:
CREATE OR REPLACE FUNCTION fntestpolys_nn_evenbetter(gid integer, geom1 geometry, distguess double precision, numnn integer, maxslices integer, it integer)
RETURNS SETOF testpolys AS
$BODY$
--NOTE: it: the iteration we are currently at
--start at the bounding box of the object (expand 0) and move up until it has collected more objects than we need or it = maxslices whichever event ha
ppens first
SELECT thefinalit.* FROM
(SELECT currentit.*
FROM testpolys As currentit
WHERE $1 <> currentit.gid
AND expand($2,$3/$5*$6) && currentit.the_geom
AND expandoverlap_metric($2,currentit.the_geom, $3,$5) = $6
UNION
SELECT nextit.*
FROM fntestpolys_nn_evenbetter($1, $2, $3, $4, $5, ($6 + 1)) nextit
WHERE $1 <> nextit.gid AND $6 < $5 AND
(SELECT COUNT(tp3.gid)
FROM testpolys As tp3
WHERE $1 <> tp3.gid
AND expand($2,$3/$5*$6) && tp3.the_geom) < $4
) thefinalit
ORDER BY distance(thefinalit.the_geom, $2) LIMIT $4;
$BODY$
LANGUAGE 'sql' STABLE;
Here are some test results. The test data used is that which we produced in the Battle for the Nearest Neighbor A small success.
/**Problem calculate the 5 nearest neighbors in table testpolys for the first 500 records **/
--Total query runtime: 9156 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 44063 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 slices
SELECT g1.gid as gid_ref , (fntestpolys_nn_better(g1.gid, g1.the_geom, 10000, 5,100)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 98516 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors
SELECT g1.gid as gid_ref, (fntestpolys_nn(g1.gid, g1.the_geom, 10000, 5)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
This solution is on the order of 10 times faster than the original solution and on superficial inspection, seems to come up with the same answer. Better yet to handle our full 3000 simple polygon it still performs faster than the original when asked to calculate for the whole batch of 3000 simple polygons.
/**Problem calculate the 5 nearest neighbors in table testpolys for all records in testpolys (3000 records) **/
--Total query runtime: 95079 ms, 15000 rows retrieved.
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid) g1;
I guess this final solution really doesn't look much like a dog chasing its tail. It actually looks more like a game of tag to me.
I'm really sorry I lead people astray with this
dog chasing tail promise only to reduce it to a game of
Who is it?.
Note this doesn't deal with ties so if you have 2 neighbors of equal distance away from your reference object at the tail end, then one may get left out.