Tuesday, November 4, 2008

Fuzzy date matching in PostgreSQL

As part of one of my side-projects, I wanted a way for users of a web site to search for events by date, but with some flexibility. I also wanted users who are creating events to be able to express uncertainty about the dates on which they happened.
For example, if I record an event that happened sometime in September 1998, I can say it happened on September 15th, 1998 +/- 15 days. That should then be included in the results of a query for events that happened on October 1st, 1998 +/ 1 month, or June 1st, 1998 +/- 6 months, or even September 10th, 1998 +/- 1 day. An exact date would be represented as +/- 0 days.
One way to visualise these is as line segments on a literal date line, stretching from the past into the future, with a mark per day. PostgreSQL's spatial types can be used to represent this in an indexable way. First of all, we'll create a composite type to represent fuzzy dates:
CREATE TYPE fuzzydate AS (midpoint date, fuzziness interval);
The midpoint and fuzziness fields should be self-explanatory. Next, we create a function that will take a fuzzydate and convert it into a box. (PostgreSQL doesn't have an 'overlaps' operator for line segments, so I chose to use zero-height boxes.) The function calculates the start and end dates of the date range, and uses extract to convert them into epoch times (the number of seconds since the beginning of 1970). These are then divided by 86400 to get the number of days since 1970. These values form the x coordinates of the returned box.
CREATE FUNCTION GeometricDate(fd fuzzydate) RETURNS box AS $$
    start_day integer;
    end_day integer;
    start_day := extract(epoch from fd.midpoint - fd.fuzziness)::integer / 86400;
    end_day := extract(epoch from fd.midpoint + fd.fuzziness)::integer / 86400;
    RETURN box(point(start_day, 0), point(end_day, 0));
Now we can now perform a query like this:
SELECT GeometricDate(('2008-11-04', '2 days'));
to get a result like (14189,0),(14185,0), so let's create a table that we can perform our date range queries on:
  id serial PRIMARY KEY,
  name varchar(100),
  fdate fuzzydate
Note that the fdate column is only storing the original midpoint and fuzziness. So where is the geometric representation? Well, PostgreSQL allows us to index not just a column, but the results of applying a function to a column:
CREATE INDEX events_fdate_index ON events USING gist (GeometricDate(fdate));
Now we can use the geometric date in a query like this one, which says "give me all the events that were within 10 days before or after November 3rd, 2008":
SELECT * FROM events WHERE GeometricDate(fdate) && GeometricDate('2008-11-03', '10 days');
The really important thing to note here is that GeometricDate is not being called for every row in the table. The result of GeometricDate is taken directly from the index we created, and it's the IMMUTABLE flag on the function that tells PostgreSQL that it's okay to do this.