Data Synchronization with OGR

Summary

This post may describe functionality for an old version of CARTO. Find out about the latest and cloud-native version here.
Data Synchronization with OGR

CartoDB has an excellent and easy-to-use synced tables feature that reads data from a public URL on a regular schedule and keeps an associated CartoDB table up-to-date. However synced tables require the data to be publicly available so CartoDB can read it and that the destination table get over-written at each update.

For some users these limitations are not acceptable--their data is not public or they want to continually append new information to their table.

Fortunately it's easy to create your own live data tables in CartoDB updated on your own schedule from your private data using the ogr2ogr tool from the open source GDAL library. The support for CartoDB in GDAL has gotten a lot better recently so use the latest version from SVN or the upcoming GDAL 2.0 release.

Initial Load

You can load a fresh table directly into CartoDB using ogr2ogr like this (I'm loading my trusty 1:250K map grid for British Columbia):

##_INIT_REPLACE_ME_PRE_##
ogr2ogr \
  --config CARTODB_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
  -t_srs EPSG:4326 \
  -f CartoDB \
  "CartoDB:pramsey" \
  g250.shp
##END_REPLACE_ME_PRE_##

The commandline syntax of ogr2ogr is not the most obvious thing in the world.

  • The API key is required to allow write access into the database.
  • The -t_srs EPSG:4326 option tells ogr2ogr reproject the data to geographic coordinates which is required for data to be published via the CartoDB Editor.
  • The -f CartoDB option tells ogr2ogr to use the CartoDB driver for writing.
  • The CartoDB:pramsey entry is the destination data in this case the pramsey account at CartoDB.
  • The g250.shp entry is the source data.

Once the command has run though the data are available in the CartoDB dashboard ready for mapping. (It may take a minute for the dashboard to recognize the new table be patient and hit reload.)

Refreshing the Data

Once you've loaded and styled your data and published your map you want to keep the map up-to-date with the latest data. (Remember if your data are public then the easiest way to do this is with a synced table).

If you've read the ogr2ogr manual you may want to use the -overwrite option: don't.

The overwrite option will drop the table and then create a replacement. For some applications this would be fine but for CartoDB it means the associated visualizations will also get dropped. For that reason we want to leave the table in place but simply remove the data inside it before loading new data.

  • First remove all the existing data. We do that by sending the TRUNCATE command directly to the SQL API to quickly delete all the records in the table.

    ##_INIT_REPLACE_ME_PRE_##
    key=abcdefghijabcdefghijabcdefghijabcdefghij
    sql=`echo "TRUNCATE TABLE g250" | tr ' ' +`
    curl "http://pramsey.cartodb.com:80/api/v2/sql?api_key=${key}&q=${sql}"
    
    ##END_REPLACE_ME_PRE_##
  • Then load new data into the table. Now that the table is empty we can run ogr2ogr in -append mode to put the new records in place.

    ##_INIT_REPLACE_ME_PRE_##
    ogr2ogr \
        --config CARTODB_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
        -append \
        -t_srs EPSG:4326 \
        -f CartoDB \
        "CartoDB:pramsey" g250.shp
    
    ##END_REPLACE_ME_PRE_##

Now your table (and all the maps that depend on it) is full of the latest data and you didn't have to place the raw data on a public URL to make it happen.

Appending new Data

Sometimes applications with real-time data will not need to simply replace existing data but rather add new data as it comes in. You can do this by just running ogr2ogr in -append mode without first running the TRUNCATE command.

The table will grow and grow and grow as you add data and you may want to delete older data. If your raw data has a time stamp you can use that as a key to delete old data. If it doesn't you can alter your table after import to include a timestamp field that holds the date the data was loaded.

  • First load your initial raw data.

    ##_INIT_REPLACE_ME_PRE_##
    ogr2ogr \
        --config CARTODB_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
        -t_srs EPSG:4326 \
        -f CartoDB \
        "CartoDB:pramsey" g250.shp
    
    ##END_REPLACE_ME_PRE_##
  • Then add a time column with a default value so that it gets filled in automatically when new data are inserted.

    ##_INIT_REPLACE_ME_PRE_##
    key=abcdefghijabcdefghijabcdefghijabcdefghij
    sql=`echo "ALTER TABLE g250 ADD COLUMN ts TIMESTAMP DEFAULT now()" | tr ' ' +`
    curl "http://pramsey.cartodb.com:80/api/v2/sql?api_key=${key}&q=${sql}"
    
    ##END_REPLACE_ME_PRE_##
  • Then run your import in -append mode to see that new columns are timestamped with the time of import.

    ##_INIT_REPLACE_ME_PRE_##
    ogr2ogr \
        --config CARTODB_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
        -append \
        -t_srs EPSG:4326 \
        -f CartoDB \
        "CartoDB:pramsey" g250.shp
    
    ##END_REPLACE_ME_PRE_##
  • Finally run a scheduled job to delete all records you consider too old to map. For example 1 week old.

    ##_INIT_REPLACE_ME_PRE_##
    key=abcdefghijabcdefghijabcdefghijabcdefghij
    sql=`echo "DELETE FROM g250 WHERE ts < now() - '1 week'::interval" | tr ' ' +`
    curl "http://pramsey.cartodb.com:80/api/v2/sql?api_key=${key}&q=${sql}"
    
    ##END_REPLACE_ME_PRE_##

That's all there is to it! The GDAL 2.0 release is expected in the next month or so at which point binary builds for Windows and other platforms will be available. But for now pull down the source build and start syncing your data with ogr2ogr!