ETL into CARTO with ogr2ogr

Summary

Complex data loads can be automatated using the open source ogr2ogr tool.

This post may describe functionality for an old version of CARTO. Find out about the latest and cloud-native version here.
ETL into CARTO with ogr2ogr

The default CARTO data importer is a pretty convenient way to quickly get data into the platform  but for enterprises setting up automated update it has some limitations:

  • there's no way to define type coercions for CSV data;
  • some common GIS formats like File Geodatabase aren't supported;
  • the sync facility is "pull" only  so data behind a firewall is inaccessible; and
  • the sync cannot automatically filter the data before loading it into the system.

Fortunately  there's a handy command-line tool that can automate many common enterprise data loads: ogr2ogr.

Basic Operation

ogr2ogr has a well-earned reputation for being hard to use. The commandline options are plentiful and terse  the standard documentation page lacks examples  and format-specific documentation is hidden away with the driver documentation.

Shapefile

The basic structure of an ogr2ogr call is "ogr2ogr -f format destination source". Here's a simple shapefile load.

ogr2ogr \
  --debug ON \
  --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
  -t_srs "EPSG:4326" \
  -nln interesting \
  -f Carto \
  "Carto:pramsey" \
  interesting_things.shp

The parameters are:

  • --debug turns on verbose debugging  which is useful during development to see what's happening behind the scenes.
  • --config is used to pass generic "configuration parameters". In this case we pass our CARTO API key so we are allowed to write to the database.
  • -t_srs is the "target spatial reference system"  telling ogr2ogr to convert the spatial coordinates to "EPSG:4326" (WGS84) before writing them to CARTO. The CARTO driver expects inputs in WGS84  so this step is mandatory.
  • -nln is the "new layer name"  so the name of the uploaded table can differ from that of the input file.
  • -f is the format of the destination layer  so for uploads to CARTO  it is always "Carto".
  • Carto:pramsey is the "destination datasource"  so it's a CARTO source  in the "pramsey" account. Change this to your user name. (Note for multi-user accounts: you must supply your user name here  not your organization name.)
  • interesting_things.shp is the "source datasource"  which for a shapefile is just the path to the file.

File Geodatabase

Loading a File Geodatabase is almost the same as loading a shapefile  except that a file geodatabase can contain multiple layers  so the conversion must also specify which layer to convert  by adding the source layer name after the data source. You can load multiple layers in one run by providing multiple layer names.

ogr2ogr \
  --debug ON \
  --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
  -t_srs "EPSG:4326" \
  -nln cities \
  -f Carto \
  "Carto:pramsey" \
  CountyData.gdb Cities

In this example  we take the "Cities" layer from the county database  and write it into the "cities" table of CARTO. Note that if you do not re-map the layer name to all lower case  you'll get a mixed case layer in CARTO  which you may not want.

Filtering

You can use OGR on any input data source to filter the data prior to loading. This can be useful for loads of large inputs that are "only the data since time X" or "only the data in this region"  like this:

ogr2ogr \
  --debug ON \
  --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
  -t_srs "EPSG:4326" \
  -nln cities \
  -f Carto \
  -sql "SELECT * FROM Cities WHERE state_fips = 53" \
  "Carto:pramsey" \
  CountyData.gdb Cities

Since the filter is just a SQL statement  the filter can both reduce the number of records and also apply transforms to the output on the way: reduce the number of columns  apply some data transformations  anything that is possible using the [SQLite dialect] of SQL.

Overwrite or Append

By default  ogr2ogr runs in "append" mode (you can force it with the -append flag  so if you run the same translation multiple times  you'll get rows added into your table. This be useful for processes that regularly take the most recent entries and copy them into CARTO.

For translations where you want to replace the existing table  use the -overwrite mode  which will drop the existing table  and create a new one in its place.

Because of some limitations in how the OGR CARTO driver handles primary keys  the OGR -update mode does not work correctly.

OGR Virtual Format

As you can see  the command-line complexity of an OGR conversions starts high. The complexity only goes up as advanced features like filtering and arbitrary SQL are added.

To contain the complexity in one location  you can use the OGR "virtual format"  VRT files  to define your data sources. This is handy for managing a library of conversions in source control. Each data source becomes it's own VRT file  and the actual OGR commands become smaller.

CSV Type Enforcement

CSV files are convenient ways of passing data  but they are under-defined: they supply column names  but not column types. This forces CSV consumers to do type guessing based on the input data  or to coerce every input to a lowest common denominator string type.

Particularly for repeated and automated uploads it would nice to define the column types once beforehand and have them respected in the final CARTO table.

For example  take this tiny CSV file:

longitude latitude name the_date the_double the_int the_int16 the_int_as_str the_datetime
-120 51 "First Place" 2018-01-01 2.3 123456789 1234 00001234 2014-03-04 08:12:23
-121 52 "Second Place" 2017-02-02 4.3 423456789 4234 00004234 "2015-05-05 09:15:25"

Using a VRT  we can define a CSV file as a source  and also add the rich metadata needed to support proper type definitions:

<OGRVRTDataSource>
    <OGRVRTLayer name="test_csv">
        <SrcDataSource>/data/exports/test_csv.csv</SrcDataSource>
        <GeometryField encoding="PointFromColumns" x="longitude" y="latitude"/>
        <GeometryType>wkbPoint</GeometryType>
        <LayerSRS>WGS84</LayerSRS>
        <OpenOptions>
            <OOI key="EMPTY_STRING_AS_NULL">YES</OOI>
        </OpenOptions>
        <Field name="name" type="String" nullable="false" />
        <Field name="a_date" type="Date" src="the_date" nullable="true" />
        <Field name="the_double" type="Real" nullable="true" />
        <Field name="the_int" type="Integer" nullable="true" />
        <Field name="the_int16" type="Integer" subtype="Int16" nullable="true" />
        <Field name="the_int_as_str" type="String" nullable="true" />
        <Field name="the_datetime" type="DateTime" nullable="true" />
    </OGRVRTLayer>
</OGRVRTDataSource>

This example has a number of things going on:

  • The <SrcDataSource> is an OGR connection string  as defined in the driver documentation for the format. For a CSV  it's just the path to a file with a "csv" extension.
  • The <GeometryField> line maps coordinate columns into a point geometry.
  • The <LayerSRS> confirms the coordinates are WGS84. They could also be some planar format  and OGR can reproject them if requested.
  • The <OpenOptions> let us pass one of the many CSV open options.
  • The <Field> type definitions  using the "type" attribute to explicitly define types  including obscure ones like 16-bit integers.
  • Column renaming  in the "a_date" <Field>  maps the source column name "the_date" to "a_date" in the target.
  • Null enforcement  in the "name" <Field>  creates a target column with a NOT NULL constraint.

To execute the translation  we use the VRT as the source argument in the ogr2ogr call.

ogr2ogr \
  --debug ON \
  --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
  -t_srs "EPSG:4326" \
  -f Carto \
  "Carto:pramsey" \
  test_csv.vrt

Database-side SQL

Imagine you have an Oracle database with sales information in it  and you want to upload a weekly snaphot of transactions. The database is behind the firewall  and the transactions need to be joined to location data in order to mapped. How to do it?

With VRT tables and the OGR Oracle driver  it's just some more configuration during the load step:

<OGRVRTDataSource>
    <OGRVRTLayer name="detroit_locations">
        <SrcDataSource>OCI:scott/password@ora.company.com</SrcDataSource>
        <LayerSRS>EPSG:26917</LayerSRS>
        <GeometryField encoding="PointFromColumns" x="easting" y="northing"/>
        <SrcSQL>
            SELECT sales.sku  sales.amount  sales.tos 
                   locs.latitude  locs longitude
            FROM sales JOIN locs ON sales.loc_id = locs.loc_id
            WHERE locs.city = 'Detroit'
            AND sales.transaction_date > '2018-01-01'
        </SrcSQL>
    </OGRVRTLayer>
</OGRVRTDataSource>

Some things to note in this example:

  • The <SrcDataSource> holds the Oracle connection string
  • The coordinates are stored in UTM17  in northing/easting columns  but we can still easily map them into a point type for reprojection later.
  • The output data source is actually the result of a join executed on the Oracle database  attributing each sale with the location it was made. We don't have to ship the tables to CARTO separately.

The ability to run any SQL on the source database is a very powerful tool to ensure that the uploaded data is "just right" before it arrives on the CARTO side for analysis and display.

As before  the VRT is run with a simple execution of the ogr2ogr command line:

ogr2ogr \
  --debug ON \
  --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
  -t_srs "EPSG:4326" \
  -f Carto \
  "Carto:pramsey" \
  test_oracle.vrt

Multi-format Sources

Suppose you have a source of attribute information and a source of location information  but they are in different formats in different databases: how to bring them together in CARTO? One way  as usual  would be to upload them separately and join them on the CARTO side with SQL. Another way is to use the power of ogr2ogr and VRT to do the join during the data upload.

For example  imagine having transaction data in a PostgreSQL database  and store locations in a Geodatabase. How to bring them together? Here's a joined_stores.vrt file that does the join in ogr2ogr:

<OGRVRTDataSource>
    <OGRVRTLayer name="sales_data">
        <SrcDataSource>Pg:dbname=pramsey</SrcDataSource>
        <SrcLayer>sales.sales_data_2017</SrcLayer>
    </OGRVRTLayer>
    <OGRVRTLayer name="stores">
        <SrcDataSource>store_gis.gdb</SrcDataSource>
        <SrcLayer>Stores</SrcLayer>
    </OGRVRTLayer>
    <OGRVRTLayer name="joined">
        <SrcDataSource>joined_stores.vrt</SrcDataSource>
        <SrcSQL dialect="SQLITE">
            SELECT stores.*  sales_data.*
            FROM sales_data
            JOIN stores
            ON sales_data.store_id = stores.store_id
        </SrcSQL>
    </OGRVRTLayer>
</OGRVRTDataSource>

Some things to note:

  • The "joined" layer uses the VRT file itself in the <SrcDataSource> definition!
  • Each <OGRVRTLayer> is a full-fledged VRT layer  so you can do extra processing in them. Apply type definitions to CSV  run complex SQL on a remote database  whatever you want.
  • The join layer uses the "SQLite" dialect  so anything available in SQLite is available to you in the join step.

Almost an ETL

Combining the ability to read and write from multiple formats with the basic functionality of the SQLite engine  and chaining operations through multi-layer VRT layers  ogr2ogr provides the core functions of an "extract-transform-load" engine  in a package that is easy to automate and maintain.

For users with data behind a firewall  who need more complex processing during their loads  or who have data locked in formats the CARTO importer cannot read  ogr2ogr should be an essential tool.

Getting ogr2ogr

OGR is a subset of the GDAL suite of libraries and tools  so you need to install GDAL to get ogr2ogr

  • For Linux  look for "gdal" packages in your Linux distribution of choice.
  • For Mac OS X  use the GDAL Framework builds.
  • For Windows  use the MS4 package system or pull the Mapserver builds from GisInternals and use the included GDAL binaries.