Geospatial processing with ClickHouse

In the geospatial world it is easy to find really small datasets, like an Excel file with client locations or a GeoJSON with the coordinates of a few thousand Starbucks coffee shops. Those cases are usually easy enough that even the browser can load, process, and visualize them without major problems.

But when we talk about core geospatial challenges, we talk about a lot of data. For example, raster data coming from satellite imagery (especially nowadays, with companies like Planet doing continuous planet monitoring), big cadastral shapefiles with millions of small polygons, or CSV files generated from IoT networks (OK, they are actually GPS traces with a lot of attached data, in the hope of eventually getting something useful out of them).

There are plenty of CSVs in S3 with GPS-like data. Take the NYC Taxi dataset as an example, which contains pickup information for all NYC taxis. By the way, this is becoming the benchmark dataset for the big data industry; you should check out Mark Litwintschik's blog posts, they are really good.

So, in general, when you have to analyze geospatial data you end up using PostgreSQL and PostGIS. The ability to query geospatial data with SQL is hard to beat with any other tool out there. The problem, on the other hand, is that PostgreSQL is not meant to work with huge amounts of data. Well, let me explain this a little bit more: you can load huge amounts of information into PostgreSQL and then, using things like the new BRIN index, select a small portion of that data really, really quickly. The problem shows up when you want to aggregate a lot of rows. In that case PostgreSQL is not that good, and that's why there are solutions like CitusDB, Greenplum, or Amazon Redshift that use columnar storage, parallel processing, and push-downs (among other things) to improve performance. So on the one hand we have PostgreSQL, which supports PostGIS, and on the other hand we have databases that work really fast with lots of data but have partial or nonexistent geospatial support (MapD, ClickHouse, Vertica, or Druid, to mention some non-batch technologies).

Going back to the NYC Taxi dataset: when you want to count the pickups in a specific zone, PostgreSQL takes "ages". Let me demonstrate it with a few queries on a local PostgreSQL instance (9.6):

[local] postgres@nyc-taxi-data=# select count(1) from trips;
  count
──────────
 78424550
(1 row)

Time: 25347.000 ms

So if I want to find out how many pickups took place within a radius of 300 meters around Times Square:

[local] postgres@nyc-taxi-data=# select count(1) from trips where st_dwithin(the_geom, 'SRID=4326;POINT(-73.985195 40.75896)', 180*300/(6371*1000*pi()));
 count
────────
 797801
(1 row)

Time: 76552.877 ms

(I'm using angles to calculate distance on a sphere, I know, but since it's only 300 m the approximation distance = arc × radius is good enough; the third argument is just 300 m converted into degrees.)
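For reference, here is that conversion written out step by step, as a minimal Python sketch (6371 km is the mean Earth radius used in the query above):

import math

# 300 m on a sphere of radius 6371 km, expressed as an angle:
# arc = distance / radius gives radians, then convert to degrees
threshold_deg = 300.0 / (6371 * 1000) * 180 / math.pi
print(threshold_deg)  # ~0.0027 degrees, same as 180*300/(6371*1000*pi()) in the SQL above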

76 seconds sounds like too much. I'm pretty sure there are ways to improve this, like sorting the data in better ways, tuning PostgreSQL, or using parallel queries, but we would not get a massive improvement: even on a big instance with 32 cores where the whole dataset fits in memory, the best we could hope for in theory is roughly 76 s / 32 ≈ 2.4 s.

There are other databases/frameworks better suited to work with that amount of data, like Spark, Presto, Vertica, MapD, and many others, as I said before. I recently got to know ClickHouse, and after testing it, I fell in love. I thought I was doing something wrong, but then Mark Litwintschik published a blog post about the strengths of ClickHouse.

So I gave it a try and loaded some NYC taxi data. ClickHouse does not have any geospatial toolkit, but for a radius this small you can emulate the filter with basic math, using the Euclidean distance in degrees.

:) select count(1) from a2 where (pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow(180*300/(6371*1000*pi()), 2);

SELECT count(1)
FROM a2
WHERE (pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow((180 * 300) / ((6371 * 1000) * pi()), 2)

┌─count(1)─┐
│   613491 │
└──────────┘
1 rows in set. Elapsed: 0.308 sec. Processed 45.45 million rows, 818.13 MB (147.63 million rows/s., 2.66 GB/s.)

308 ms.

The thing is, we are processing 45.5M rows (basically performing a full scan) and even that is fast as hell. If only we had an index to select just the rows close to Times Square (as PostgreSQL does), it would be much faster. Luckily, ClickHouse supports indices. The bad news is that it does not support indices on points or any other 2D data. Nevertheless, there are ways to linearize 2D space into a scalar one, so that traditional database indices work; Hilbert curves and geohash are among the most commonly used. In our case we decided to use quadkey.

Quadkey is basically a way to encode a tile coordinate as an integer: for each zoom level we use two bits to encode which of the 4 children the position falls into. This way we can encode any lat-lon pair in a 64-bit integer with a precision of 1.9 cm (the tile size at zoom 31). We could use Hilbert or geohash, but quadkey fits well with the way we visualize data (using quadtrees). It's fast to generate (check out the Python library; with native extensions it's extremely fast) and it works well with general database indices.
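To make the idea concrete, here is a minimal sketch of such an encoder in plain Python (not the library mentioned above; the Web Mercator tile math is standard, but the exact bit order is my assumption and may differ from the encoding used at CARTO):

import math

def lonlat_to_quadint(lon, lat, zoom=31):
    # Tile coordinates in the Web Mercator grid at the given zoom level
    n = 1 << zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    # Interleave the bits of x and y: two bits per zoom level,
    # i.e. which of the 4 children the point falls into at each level
    quadint = 0
    for i in range(zoom):
        quadint |= ((x >> i) & 1) << (2 * i)
        quadint |= ((y >> i) & 1) << (2 * i + 1)
    return quadint

# Times Square at full precision (zoom 31) fits in a 64-bit unsigned integer
print(lonlat_to_quadint(-73.985195, 40.75896))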

So, by creating a new table with a quadkey column computed from pickup_latitude and pickup_longitude, and using it as the index, we can filter on quadkey in the WHERE condition:

CREATE TABLE test (
    lat Nullable(Float64),
    lon Nullable(Float64),
    datetime Nullable(DateTime),
    value Nullable(Float64),
    date_index Date,
    quadkey UInt64
)
ENGINE = MergeTree(date_index, quadkey, 8192)

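The two constants in the BETWEEN clause below are the quadkey range covered by a single tile containing Times Square: quadkeys are prefix codes, so all the zoom-31 cells inside one tile form a contiguous interval of integers. Here is a sketch of how such a range could be derived, reusing the lonlat_to_quadint function from the snippet above (zoom 15 is my choice for illustration, and the values will only match the query if the bit layout matches the real encoder):

def tile_range(lon, lat, zoom):
    # Every zoom-31 cell inside the zoom-`zoom` tile containing (lon, lat)
    # shares that tile's quadint as a prefix, so they form one contiguous range.
    # Assumes lonlat_to_quadint from the previous sketch is in scope.
    prefix = lonlat_to_quadint(lon, lat, zoom)
    shift = 2 * (31 - zoom)
    return prefix << shift, ((prefix + 1) << shift) - 1

low, high = tile_range(-73.985195, 40.75896, 15)
# ...and then filter with: WHERE quadkey BETWEEN low AND high AND <distance check>

Note that a circle sitting near a tile border can spill into neighbouring tiles, in which case you would need the union of a few ranges; in this example a single range is enough to keep all the matching points.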
Run the query, this time including the quadkey index:

:) select count(1) from a2 where quadkey between 1013670044871163904 and 1013670049166131200 and (pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow(180*300/(6371*1000*pi()), 2)

SELECT count(1)
FROM a2
WHERE ((quadkey >= 1013670044871163904) AND (quadkey <= 1013670049166131200)) AND ((pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow((180 * 300) / ((6371 * 1000) * pi()), 2))

┌─count(1)─┐
│   613491 │
└──────────┘

1 rows in set. Elapsed: 0.032 sec. Processed 2.41 million rows, 56.07 MB (75.88 million rows/s., 1.77 GB/s.)

Same result as before, but this time ClickHouse only processed 2.41M rows and the query ran about 10x faster.

Do you want to know more about how we use quadkey as a geospatial index? More CARTO blog posts are coming to our "Inside" blog.

Did you like it? Join us