In the geospatial world it is easy to find really small datasets, like an excel file with client locations or a geojson with the coordinates for a few thousand Starbucks coffee shops. Those cases are usually easy enough to deal with that even the browser can load, process and visualize without major problems.
But when we talk about core geospatial challenges, we talk about a lot of data. For example, raster data coming from satellite imagery (especially nowadays with companies like Planet doing continuous planet monitoring), big cadastral shapefiles with millions of small polygons, or .csv files generated from IoT networks (ok, they are actually GPS traces with a lot of attached data expecting to get something useful out of it eventually).
There are plenty of CSVs in S3 with GPS-like data. Take the NYC Taxi data as an example, which contains pickup information for all the NYC taxis. By the way, this is becoming the benchmarking data for the Big data industry; you should check out Mark Litwintschik’s blog posts, they are really good.
So, in general, when you have to analyze geospatial data you end up using PostgreSQL and PostGIS. The ability to query geospatial data with SQL is hard to beat with any other tool out there. But on the other hand, the problem is that PostgreSQL is not meant to work with huge amounts of data. Well, let me explain this a little bit more… You can load huge amounts of information in PostgreSQL and then using things like the new BRIN index, you can select a small portion of that data really, really quickly. The problem shows up when you want to aggregate a lot of rows. In that case PostgreSQL is not that good and that’s why there are solutions like citusdb, greenplum or Amazon redshift that use columnar storage, parallel processing and push downs (among other things) to improve performance. So we have PostgreSQL which supports PostGIS on the one hand and other databases that work really fast with lots of data but with partial or inexistent support for geospatial (mapd, clickhouse, vertica or druiddb to mention some non-batch technologies) on the other hand.
Going back to NYC Taxi dataset, when you want to know the average amount of points in a specific zone, PostgreSQL takes “ages”. Let me demonstrate it with a few queries in a local PostgreSQL instance (9.6):
[local] postgres@nyc-taxi-data=# select count(1) from trips; count ────────── 78424550 (1 row) Time: 25347.000 ms
So if I want to find out how many pickups took place within a radius of 300 meters around Times Square:
[local] postgres@nyc-taxi-data=# select count(1) from trips where st_dwithin(the_geom, 'SRID=4326;POINT(-73.985195 40.75896)', 180*300/(6371*1000*pi())); count ──────── 797801 (1 row) Time: 76552.877 ms
(I’m using angles to calculate distance on a sphere, I know, but since it’s only 300m I use the formula distance = arc*radious)
76 seconds sounds like too much. I’m pretty sure there are ways to improve this like sorting data in better ways, tuning PostgreSQL or using parallel queries, but we will not get a massive improvement. So even if we use a big instance with 32 cores and where the whole dataset fits in memory, in theory we should be able to achieve 76s/32 ~ 2.3s
There are other databases/frameworks more suited to work with that amount of data like spark, presto, vertica, mapd and many others as I said before. I recently got to know Clickhouse, and after testing it, I fell in love. I thought I was doing something wrong but then Mark Litwintschik published a blog post about the strength of Clickhouse.
So I gave it a try and loaded some NYC taxi data. Clickhouse does not have any geospatial toolkit, but you can emulate it with basic math using euclidean distance.
:) select count(1) from a2 where (pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow(180*300/(6371*1000*pi()), 2); SELECT count(1) FROM trips WHERE (pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow((180 * 300) / ((6371 * 1000) * pi()), 2) ┌─count(1)─┐ │ 613491 │ └──────────┘ rows in set. Elapsed: 0.308 sec. Processed 45.45 million rows, 818.13 MB (147.63 million rows/s., 2.66 GB/s.)
The thing is we are processing 45.5M rows (basically performing a full scan) and even that is fast as hell. If only we had an index to select just the rows close to Times Square (as PostgreSQL does) it would be much faster. Luckily Clickhouse supports indices. The bad news is that obviously it does not support index on points or any 2D data. Nevertheless there are ways to linearize 2D space into a scalar one, so traditional database indices will work. Hilbert curve and geohash are more commonly used ones. In our case we decided to use quadkey.
Quadkey is basically a way to encode a tile coordinate in an integer, so for each zoom we use two bits to encode which of the 4 children is the one where each position is. In this way we can encode any lat-lon pair in a 64-bit integer with a precision of 1.9 cm (tile size at zoom 31). We could use Hilbert of geohash but quadkey fits pretty well with the way we visualize data (using quadtree). It’s fast to generate (check out python library – with native extensions it’s extremely fast) and it works pretty well with general database indices.
So by creating a new table with a quadkey column with pickup_latitude and pickup_longitude and using it as index, we can query it using quadkey in the where condition:
CREATE table test ( lat Nullable(Float64), lon Nullable(Float64), datetime Nullable(DateTime), value Nullable(Float64), date_index Date, quadkey UInt64 ) ENGINE = MergeTree(date_index, quadkey, 8192)
Run the query, this time including quadkey index:
:) select count(1) from a2 where quadkey between 1013670044871163904 and 1013670049166131200 and (pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow(180*300/(6371*1000*pi()), 2) SELECT count(1) FROM a2 WHERE ((quadkey >= 1013670044871163904) AND (quadkey <= 1013670049166131200)) AND ((pow(pickup_longitude - -73.985195, 2) + pow(pickup_latitude - 40.75896, 2)) < pow((180 * 300) / ((6371 * 1000) * pi()), 2)) ┌─count(1)─┐ │ 613491 │ └──────────┘ 1 rows in set. Elapsed: 0.032 sec. Processed 2.41 million rows, 56.07 MB (75.88 million rows/s., 1.77 GB/s.)
Same value as before but this time Clickhouse just processed 2.41M rows, 10x improvement.
Do you want to know more about how we use quadkey as a geospatial index? More CARTO blogpost are coming to our “inside” blog
Did you like it? Join us