Geospatial processing with ClickHouse, take 2
In an earlier post we talked about the potential of ClickHouse as a database for a GIS. Today we want to talk about some of the design decisions we made and some of the iterations that improved the performance of the system.
At CARTO, as in any other web map service, we use the Web Mercator projection to generate our maps. This projection accepts several shortcuts and limitations to reduce calculations and greatly increase speed, and it leaves us with 2 properties that are important throughout this post:
- There are limited zoom levels. That is, you cannot select an arbitrary resolution outside the ones provided by the system. For example, CARTO Builder is currently set up to allow zooms from 1 to 18, which goes from around 78 km per pixel down to around 0.6 ground meters per pixel.
- The map is divided into cache-friendly tiles, so for a given zoom level you only get a small number of possible tiles. When you want to show a map, the browser requests the pre-established tiles needed to cover the screen. The most common tile size is 256x256 pixels.
The quadtree keys, or quadkeys, system was designed with these two concepts in mind, and it also divides the map by zoom levels and tiles. At level 1 it divides the map into 4 tiles (0 to 3, or 00 to 11 in binary), and each subsequent zoom level divides every sub-tile into 4 parts. This can easily be encoded in a 64-bit integer by using 2 bits per level.
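As a concrete illustration, the 2-bits-per-level encoding can be sketched in Python. This is our own sketch, not the actual library code, and it assumes the common Bing-style digit order of `2 * y_bit + x_bit`:

```python
def tile_to_quadkey(x, y, zoom, max_zoom=31):
    """Interleave the bits of a tile's x/y into a 64-bit quadkey.

    Each level contributes 2 bits (digit = 2*y_bit + x_bit, most
    significant level first); the result is left-aligned to max_zoom
    levels so tiles containing each other share a common prefix.
    """
    key = 0
    for level in range(zoom):
        bit = zoom - level - 1
        key = (key << 2) | (((y >> bit) & 1) << 1) | ((x >> bit) & 1)
    return key << (2 * (max_zoom - zoom))

# The four level-1 tiles get keys 0..3 in the two highest used bits.
assert tile_to_quadkey(0, 0, 1) == 0
assert tile_to_quadkey(1, 1, 1) == 3 << 60
```

Left-aligning the key is what makes prefix comparisons work at any zoom level, which the index tricks below rely on.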
Let's suppose you have a ClickHouse database with ~400 million points on a single server (Intel E5-2670, 32 GB RAM) and another single server running CARTO Builder on top (specs not important for our tests), and you want to show some of the data on the map. When you load a map, several requests for tiles are created in this fashion:
This is essentially asking for the tile (2101, 3047) at zoom level 13. Once translated to Web Mercator coordinates, it means the request is asking for the tile between (-9759479.771451304, 5126784.361143341) and (-9754587.801641053, 5131676.330953592).
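For reference, here is a minimal Python sketch of that translation from tile coordinates to Web Mercator bounds. The constant is the standard Web Mercator extent; this is our illustration, not CARTO's actual code:

```python
# Web Mercator maps the world onto a square of +/- this many meters.
WEBMERCATOR_MAX = 20037508.342789244

def tile_bounds(x, y, zoom):
    """Return ((xmin, ymin), (xmax, ymax)) in Web Mercator meters for an
    XYZ tile, with y counted from the top as in the tile request above."""
    size = 2 * WEBMERCATOR_MAX / (1 << zoom)  # tile side length in meters
    xmin = -WEBMERCATOR_MAX + x * size
    ymax = WEBMERCATOR_MAX - y * size
    return (xmin, ymax - size), (xmin + size, ymax)

(xmin, ymin), (xmax, ymax) = tile_bounds(2101, 3047, 13)
# Matches the coordinates quoted above.
assert abs(xmin - -9759479.771451304) < 1e-5
assert abs(ymax - 5131676.330953592) < 1e-5
```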
In a straightforward way, this is translated into a query like this one:
Making use of the properties shared by quadkeys and the projection, we know that this tile is at zoom level 13, so if we translate the coordinates we see that the tile is formed by the quadkey range (0x0CAAD3B000000000 to 0x0CAAD3BFFFFFFFFF), which you can check with an AND operator:
And now, since we decided to add the quadkey to the index, which only works with simple operations (comparisons), we'll translate the coordinates to quadkeys (you can use our OSS library to do it):
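The translation the library performs can be sketched as follows. This is a simplified illustration under the same assumptions as before (Bing-style digit order, 31 encoded levels), not the library itself:

```python
def tile_quadkey_range(x, y, zoom, max_zoom=31):
    """All points inside a tile share the tile's quadkey prefix, so their
    keys form one contiguous range: the tile's own key is the lower
    bound, and filling the remaining levels with 1-bits gives the upper."""
    key = 0
    for level in range(zoom):
        bit = zoom - level - 1
        key = (key << 2) | (((y >> bit) & 1) << 1) | ((x >> bit) & 1)
    low = key << (2 * (max_zoom - zoom))
    return low, low | ((1 << (2 * (max_zoom - zoom))) - 1)

low, high = tile_quadkey_range(2101, 3047, 13)
assert (low, high) == (0x0CAAD3B000000000, 0x0CAAD3BFFFFFFFFF)

# The AND check mentioned above: masking away the lower levels of any
# point key inside the tile gives back the tile's own key.
zoom13_mask = ((1 << 26) - 1) << 36
point = low + 123456789  # some point inside the tile
assert point & zoom13_mask == low
```

A contiguous range is exactly what a sparse primary-key index can skip over efficiently, which is where the speedup comes from.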
This is a 5x improvement, mainly because we have reduced the number of rows read from 387M to just 2M and the index is always in memory. I'll explain later what index we use and why, but you can find a nice explanation of how these indexes work in this Medium post.
These are pretty good times, but we don't need the count of the points in the requested tile. Instead we need the count of matches for each pixel (256x256 = 65536 pixels), so we will aggregate them directly in ClickHouse to make use of its parallelization.
To aggregate the points into those 65k pixels we use the properties of quadkeys again: since we want 2^16 groups and we know each zoom level uses 2 bits, we can use the mask for our level + 8 (13 + 8 in this example) to aggregate them. In this case the mask for level 13 + 8 = 21 is 0x3FFFFFFFFFF00000 (4611686018426339328):
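The mask for any level can be derived with a couple of shifts; a quick sketch to double-check the numbers above:

```python
def quadkey_mask(zoom, max_zoom=31):
    """Mask that keeps the first `zoom` levels (2 bits each) of a
    left-aligned quadkey and zeroes out the finer levels."""
    return ((1 << (2 * zoom)) - 1) << (2 * (max_zoom - zoom))

# The mask for level 13 + 8 = 21 quoted above.
assert quadkey_mask(21) == 0x3FFFFFFFFFF00000 == 4611686018426339328

# Grouping by bitAnd(quadkey, mask) inside a zoom-13 tile buckets the
# points into its 2^16 sub-tiles: levels 14..21 contribute
# 8 levels * 2 bits = 16 bits, i.e. 65536 distinct group keys.
assert quadkey_mask(21) ^ quadkey_mask(13) == (2**16 - 1) << 20
```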
Note that 0x0CAAD3BFFFF00000 would also be a valid mask in this case, and that, since we only encode 31 levels of zoom, the first 2 bits of any quadkey are always 0.
For big datasets there is another nice ClickHouse feature that came in handy: sampling. You can SAMPLE either an approximate fraction of the data (e.g. SAMPLE 0.5 for 50%) or a fixed number of rows to read from the database (e.g. SAMPLE 1000000 for 1M rows), thus reducing the reads from disk.
SAMPLE is highly related to the database index, as the sampling expression is part of the primary key, and that left us with 2 options: use a sampling function that makes sure close points are also close in the index (for example, using some form of bitAnd(quadkey, appropriate_zoom_level_mask)), or use a sampling expression that makes sure that any sample of the data is heterogeneous, so we can quickly extrapolate results to the whole table.
In our case we initially went for grouping the nearest points together:
But after some testing, and considering that each tile is generated only once (that is what caches are there for) while widgets like histograms or categories are recalculated with every screen movement, we chose to disperse near points so that sampling worked better and allowed us to read fewer rows:
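The post does not show the final sampling expression, but one common way to disperse near points (purely our assumption here, for illustration) is to reverse the bit order of the quadkey, so that points sharing a long spatial prefix end up far apart in the sampling space:

```python
def reverse_bits64(v):
    """Reverse the bit order of a 64-bit integer. Spatially close points
    share their high quadkey bits; after reversal they differ in the
    high bits instead, spreading evenly across the sampling space."""
    r = 0
    for _ in range(64):
        r = (r << 1) | (v & 1)
        v >>= 1
    return r

a = 0x0CAAD3B000000001  # two points inside the same zoom-13 tile
b = 0x0CAAD3B000000002
# After reversal they land in distant parts of the 64-bit range,
# and reversing twice gets the original key back.
assert abs(reverse_bits64(a) - reverse_bits64(b)) >= 2**62
assert reverse_bits64(reverse_bits64(a)) == a
```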
This change alone made tile calculations around 2-5% slower but histogram queries became ~30% faster.
Our way to make up for that loss in tile calculation was to introduce a SAMPLE based on the zoom level, but instead of using it to read a percentage of the database we limited the number of rows read to 256 per possible pixel at that zoom level. For example, at zoom level 3 we have 2^3 tiles, each with 65k pixels, so: 256 rows/pixel * 2^16 pixels/tile * 2^3 tiles = 2^(8+16+3) = 134217728 rows:
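Following the arithmetic above, the row budget for the SAMPLE clause at each zoom level is straightforward to compute:

```python
def sample_row_budget(zoom, rows_per_pixel=256, pixels_per_tile=2**16):
    """Row budget for the SAMPLE clause, following the post's own
    scaling of 2^zoom tiles at a given zoom level."""
    return rows_per_pixel * pixels_per_tile * 2**zoom

# The zoom-3 example worked out above.
assert sample_row_budget(3) == 134217728
```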
This is always dependent on your data and sampling function, but in this specific query we reduced the number of rows read by ~25%, which meant around 10% faster responses with an output good enough for tile data.
Once we managed to get fast enough tiles, it became clear that we needed to speed up the queries related to the widgets that show information next to the map, as they were not only slower than tiles but sometimes took minutes to complete.
A histogram is just a division of a certain column into equal parts, counting the number of rows associated with each bucket. To do that you need to know the min and max values (to know the bucket width) and then generate the buckets.
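Conceptually, the bucketing the query has to perform looks like this plain-Python sketch (in the database it amounts to grouping by the computed bucket index):

```python
from collections import Counter

def histogram(values, min_val, max_val, buckets):
    """Assign each value to one of `buckets` equal-width bins between
    min_val and max_val, clamping values that fall outside the range."""
    width = (max_val - min_val) / buckets
    counts = Counter()
    for v in values:
        b = int((v - min_val) / width)
        counts[min(max(b, 0), buckets - 1)] += 1
    return [counts[b] for b in range(buckets)]

# Eight evenly spread values over [0, 8) in 4 buckets: 2 per bucket.
assert histogram([0, 1, 2, 3, 4, 5, 6, 7], 0, 8, 4) == [2, 2, 2, 2]
```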
To calculate the min and max values we decided to use quantiles, to get rid of possible odd values in the data, and since we needed some extra information we used a query like this:
The first thought to optimise this query is to use sampling to cut the rows read, but it is simpler and faster to store the return values you really need (min and max) and calculate the others (total and average) from the buckets query below. This way, except for the first time, we don't have to run this query at all and can simply reuse the min_max_val we have stored.
Once you have these values you can request the actual buckets:
With a 10% sample the query is ~25% faster (1.716 sec, 283.64M rows), but we can be even faster if we do some of the math beforehand: using the fact that quadkeys are both sampled and part of the index keys, we can add a prefilter on the quadkeys that intersect our coordinates (OR conditions):
So before all the changes we were at 3.0 + 2.3 seconds, and now we are just barely above 1 second, which is ~80% less time per query (except for the first one, as we don't have the needed max and min values yet) or 400% more queries in the same time.
One of the useful widgets we have is the option to aggregate some column of the data to see what the main categories are and how many points each includes. Here we initially followed this line of work:
- Get the number of total rows and null rows for that category.
- Get the count of the most frequent and most infrequent categories.
- Get all the categories with their counts.
This procedure is not very efficient, but with small datasets (in the thousands of points) that is not very important, as you end up hitting caches and the calculations are fast enough. As you might guess, the bigger the database the bigger the problem, so now we have a 1000x bigger problem. Easily enough, all three queries can be simplified into just the last one: use the first row as the max and the last row as the min value, and add everything up for the full count. From 5.1 + 7.8 + 7.8 to just 7.8 seconds: 60% less time.
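In other words, once you have the single ordered category/count result, the outputs of the other two queries fall out for free. A sketch with made-up category names:

```python
def category_stats(rows):
    """Given (category, count) rows already sorted by count descending,
    as the single grouped-and-ordered query returns them, derive
    everything the three original queries computed."""
    total = sum(count for _, count in rows)
    most_frequent = rows[0]    # first row: highest count
    least_frequent = rows[-1]  # last row: lowest count
    return total, most_frequent, least_frequent

# Hypothetical categories, just to show the derivation.
rows = [("roads", 500), ("parks", 300), ("rivers", 200)]
assert category_stats(rows) == (1000, ("roads", 500), ("rivers", 200))
```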
Although this was an important issue, it wasn't the most noticeable one, as even after the reduction there were moments when the connection with the database would slow to a crawl. Here is why: suppose you have a table with 400 million rows and you decide to count the different values of a column, each present in an average of 500 rows; that gives us 800 000 different categories. Now let's suppose that to send this output you use JSON and that each generated output row is about 50 bytes (some kind of id plus the count, both in UTF-8), so the full output is 40 MB of data that needs to be transmitted, of which you are going to discard 99%, as you only show the top 6.
While reading the values from the column and choosing the top categories is not avoidable, we can make sure we return just what we really need with WITH TOTALS, or LIMIT the query to a reasonable number of categories:
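What the LIMIT buys us can be illustrated client-side: keeping only the top n entries is exactly the trimming we want the server to do before serializing anything. The category names here are hypothetical:

```python
import heapq

def top_categories(counts, n=6):
    """Keep only the n categories with the highest counts; doing this
    server-side (via LIMIT on the ordered aggregation) means the other
    ~99% of rows are never serialized or transmitted."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

counts = {"a": 10, "b": 500, "c": 30, "d": 7}
assert top_categories(counts, 2) == [("b", 500), ("c", 30)]
```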
In these kinds of cases we are now over 90% faster, since we went from over two and a half minutes to 12 seconds. That is still high, but considering we are using a standalone server for these tests, we're happy with the initial results.
If you liked this post you might be interested in joining us.