Faster data updates with CartoDB

Summary

Optimize your CartoDB tables with efficient batch updates. Learn to enhance SQL queries for faster inserts, reducing table scans. Boost app performance!


A feature of CartoDB that makes it stand out from most other mapping platforms is the ability to create maps and build applications on top of data that grows and changes. The dynamic nature of the CartoDB stack is what makes it such a fun and powerful platform for so many of our users. Users who build applications or link CartoDB to their other technology most often perform these updates and modifications through the SQL API.



[Image: US election results map]

The SQL API allows you to use a library of PostgreSQL and PostGIS SQL functions to manage and analyze data. By far, though, the most common operations we see are users simply adding new records or updating existing records in their databases. Unfortunately, the method we see most frequently used is also one of the most inefficient. So we thought we would take a deeper look at how to efficiently batch update or add new records to your tables.
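For context, applications typically send statements like the ones below to the SQL API as HTTP requests. This is only a sketch of the general pattern; the account name, SQL statement, and API key are placeholders, and in general writes require your API key:

##_INIT_REPLACE_ME_PRE_##
https://{account}.cartodb.com/api/v2/sql?q={your URL-encoded SQL statement}&api_key={your_api_key}
##_END_REPLACE_ME_PRE_##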

Faster updates

Let's take, for example, the update of election night maps for all 3,100 counties in the United States. First, here is how we will UPDATE a single county:

##_INIT_REPLACE_ME_PRE_##
UPDATE election_results SET votes=1, pro=1 WHERE county_id = 1;
##_END_REPLACE_ME_PRE_##

The UPDATE causes the database to do a table scan, find the appropriate record(s) to update, and finish; nothing special there.
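If you want to see this for yourself, PostgreSQL's EXPLAIN will show you the query plan. A minimal sketch, assuming county_id is not indexed (in which case the plan will typically contain a sequential scan):

##_INIT_REPLACE_ME_PRE_##
-- Show the query plan for the single-county UPDATE;
-- without an index on county_id, expect a Seq Scan over election_results
EXPLAIN UPDATE election_results SET votes=1, pro=1 WHERE county_id = 1;
##_END_REPLACE_ME_PRE_##

So now, what happens when we perform three UPDATEs?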

##_INIT_REPLACE_ME_PRE_##
UPDATE election_results SET votes=52, pro=24 WHERE county_id = 1;
UPDATE election_results SET votes=33, pro=19 WHERE county_id = 2;
UPDATE election_results SET votes=11, pro=2  WHERE county_id = 3;
##_END_REPLACE_ME_PRE_##

Well, in fact, there are now three table scans. If you can't tell, our UPDATE just got really slow! If you are regularly updating hundreds or thousands of records, this just isn't going to be a fun process. The same operation can be rewritten to scan the table only once while joining it against the set of new values:

##_INIT_REPLACE_ME_PRE_##
UPDATE election_results o
SET votes=n.votes, pro=n.pro
FROM (VALUES (1, 11, 9),
             (2, 44, 28),
             (3, 25, 4)
     ) n(county_id, votes, pro)
WHERE o.county_id = n.county_id;
##_END_REPLACE_ME_PRE_##

Pretty cool, right? So instead of three table scans, we have reduced it down to only one. If you were updating 3,100 records instead of three, that's roughly 3,100 table scans reduced to a single one!
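You can check the plan of the batch version the same way. A sketch under the same assumptions as above; the exact plan depends on your data and PostgreSQL version, but the table itself should appear in only one scan node, joined against the in-memory VALUES list:

##_INIT_REPLACE_ME_PRE_##
-- One scan of election_results, joined against the VALUES list
EXPLAIN UPDATE election_results o
SET votes=n.votes, pro=n.pro
FROM (VALUES (1, 11, 9), (2, 44, 28), (3, 25, 4)) n(county_id, votes, pro)
WHERE o.county_id = n.county_id;
##_END_REPLACE_ME_PRE_##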



[Image: US election map]

Faster inserts

The same method can be used to perform more efficient INSERTs into your tables. The INSERT version of the optimization won't squeeze out as much of a performance boost, but it will be better nonetheless. Let's take a look at how you perform a batch INSERT (notice that the INSERT doesn't require you to alias the VALUES list with column names the way the batch UPDATE does):

##_INIT_REPLACE_ME_PRE_##
INSERT INTO election_results (county_id, votes, pro)
VALUES (1,  11, 8),
       (12, 21, 10),
       (78, 31, 27);
##_END_REPLACE_ME_PRE_##

The UPDATE/INSERT double-whammy

Now, what about applications that want to UPDATE some records and INSERT some new records at the same time? In the inefficient implementation, we would first run the UPDATE and then the INSERT as two distinct queries, taking two complete table scans. There is actually a really neat approach some applications use to combine the two statements into one, reducing the table scans down to one as well. The approach works by running the UPDATE and then INSERTing only the records that were not involved in the UPDATE.

An example is maintaining a table of per-IP page view statistics: new (previously unseen) IP addresses require an INSERT, while already known addresses need an UPDATE. This kind of operation is also known as an "upsert".
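To keep the examples below self-contained, here is a minimal sketch of a table they could run against. Only the table and column names come from the queries below; the types and the primary key on ip are assumptions:

##_INIT_REPLACE_ME_PRE_##
-- Hypothetical table backing the upsert examples below;
-- the types and the primary key are assumptions
CREATE TABLE page_views (
  ip     text PRIMARY KEY,
  visits integer,
  clicks integer
);
##_END_REPLACE_ME_PRE_##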

Let’s take a look at an example using a CTE to perform such an operation with a single record and a single table scan:

##_INIT_REPLACE_ME_PRE_##
WITH upsert AS (
 UPDATE page_views o
 SET visits=2, clicks=12
 WHERE o.ip = '192.168.1.1'
 RETURNING o.ip
)
INSERT INTO page_views (ip, visits, clicks)
SELECT '192.168.1.1', 2, 12
WHERE NOT EXISTS (SELECT 1 FROM upsert);
##_END_REPLACE_ME_PRE_##

To get the same single-table-scan performance with a batch upsert, we can write the query as follows:

##_INIT_REPLACE_ME_PRE_##
WITH
-- write the new values
n(ip, visits, clicks) AS (
 VALUES ('192.168.1.1', 2, 12),
        ('192.168.1.2', 6, 18),
        ('192.168.1.3', 3, 4)
),
-- update existing rows
upsert AS (
 UPDATE page_views o
 SET visits=n.visits, clicks=n.clicks
 FROM n WHERE o.ip = n.ip
 RETURNING o.ip
)
-- insert missing rows
INSERT INTO page_views (ip, visits, clicks)
SELECT n.ip, n.visits, n.clicks FROM n
WHERE n.ip NOT IN (
 SELECT ip FROM upsert
);
##_END_REPLACE_ME_PRE_##
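As an aside for readers on newer stacks: PostgreSQL 9.5 and later offer a native upsert via INSERT ... ON CONFLICT, which expresses the same operation in a single statement. A sketch for comparison, assuming the unique constraint on ip from the table sketch above:

##_INIT_REPLACE_ME_PRE_##
-- Native upsert on PostgreSQL 9.5+;
-- requires a unique constraint or primary key on ip
INSERT INTO page_views (ip, visits, clicks)
VALUES ('192.168.1.1', 2, 12),
       ('192.168.1.2', 6, 18),
       ('192.168.1.3', 3, 4)
ON CONFLICT (ip) DO UPDATE
SET visits = EXCLUDED.visits, clicks = EXCLUDED.clicks;
##_END_REPLACE_ME_PRE_##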

Pretty interesting! We know that many of you are updating data constantly. If you haven't already found some of these optimizations on your own, we hope this helps you get your applications in top condition. As always, if you have questions, feel free to drop a line in our community thread so we can all figure out the right answers together!