ETL into CARTO with ogr2ogr
Complex data loads can be automatated using the open source ogr2ogr tool.
The default CARTO data importer is a pretty convenient way to quickly get data into the platform but for enterprises setting up automated update it has some limitations:
- there's no way to define type coercions for CSV data;
- some common GIS formats like File Geodatabase aren't supported;
- the sync facility is "pull" only so data behind a firewall is inaccessible; and
- the sync cannot automatically filter the data before loading it into the system.
Fortunately there's a handy command-line tool that can automate many common enterprise data loads: ogr2ogr.
ogr2ogr has a well-earned reputation for being hard to use. The commandline options are plentiful and terse the standard documentation page lacks examples and format-specific documentation is hidden away with the driver documentation.
The basic structure of an ogr2ogr call is "ogr2ogr -f format destination source". Here's a simple shapefile load.
The parameters are:
- --debug turns on verbose debugging which is useful during development to see what's happening behind the scenes.
- --config is used to pass generic "configuration parameters". In this case we pass our CARTO API key so we are allowed to write to the database.
- -t_srs is the "target spatial reference system" telling ogr2ogr to convert the spatial coordinates to "EPSG:4326" (WGS84) before writing them to CARTO. The CARTO driver expects inputs in WGS84 so this step is mandatory.
- -nln is the "new layer name" so the name of the uploaded table can differ from that of the input file.
- -f is the format of the destination layer so for uploads to CARTO it is always "Carto".
- Carto:pramsey is the "destination datasource" so it's a CARTO source in the "pramsey" account. Change this to your user name. (Note for multi-user accounts: you must supply your user name here not your organization name.)
- interesting_things.shp is the "source datasource" which for a shapefile is just the path to the file.
Loading a File Geodatabase is almost the same as loading a shapefile except that a file geodatabase can contain multiple layers so the conversion must also specify which layer to convert by adding the source layer name after the data source. You can load multiple layers in one run by providing multiple layer names.
In this example we take the "Cities" layer from the county database and write it into the "cities" table of CARTO. Note that if you do not re-map the layer name to all lower case you'll get a mixed case layer in CARTO which you may not want.
You can use OGR on any input data source to filter the data prior to loading. This can be useful for loads of large inputs that are "only the data since time X" or "only the data in this region" like this:
Since the filter is just a SQL statement the filter can both reduce the number of records and also apply transforms to the output on the way: reduce the number of columns apply some data transformations anything that is possible using the [SQLite dialect] of SQL.
Overwrite or Append
By default ogr2ogr runs in "append" mode (you can force it with the -append flag so if you run the same translation multiple times you'll get rows added into your table. This be useful for processes that regularly take the most recent entries and copy them into CARTO.
For translations where you want to replace the existing table use the -overwrite mode which will drop the existing table and create a new one in its place.
Because of some limitations in how the OGR CARTO driver handles primary keys the OGR -update mode does not work correctly.
OGR Virtual Format
As you can see the command-line complexity of an OGR conversions starts high. The complexity only goes up as advanced features like filtering and arbitrary SQL are added.
To contain the complexity in one location you can use the OGR "virtual format" VRT files to define your data sources. This is handy for managing a library of conversions in source control. Each data source becomes it's own VRT file and the actual OGR commands become smaller.
CSV Type Enforcement
CSV files are convenient ways of passing data but they are under-defined: they supply column names but not column types. This forces CSV consumers to do type guessing based on the input data or to coerce every input to a lowest common denominator string type.
Particularly for repeated and automated uploads it would nice to define the column types once beforehand and have them respected in the final CARTO table.
For example take this tiny CSV file:
Using a VRT we can define a CSV file as a source and also add the rich metadata needed to support proper type definitions:
This example has a number of things going on:
- The <SrcDataSource> is an OGR connection string as defined in the driver documentation for the format. For a CSV it's just the path to a file with a "csv" extension.
- The <GeometryField> line maps coordinate columns into a point geometry.
- The <LayerSRS> confirms the coordinates are WGS84. They could also be some planar format and OGR can reproject them if requested.
- The <OpenOptions> let us pass one of the many CSV open options.
- The <Field> type definitions using the "type" attribute to explicitly define types including obscure ones like 16-bit integers.
- Column renaming in the "a_date" <Field> maps the source column name "the_date" to "a_date" in the target.
- Null enforcement in the "name" <Field> creates a target column with a NOT NULL constraint.
To execute the translation we use the VRT as the source argument in the ogr2ogr call.
Imagine you have an Oracle database with sales information in it and you want to upload a weekly snaphot of transactions. The database is behind the firewall and the transactions need to be joined to location data in order to mapped. How to do it?
With VRT tables and the OGR Oracle driver it's just some more configuration during the load step:
Some things to note in this example:
- The <SrcDataSource> holds the Oracle connection string
- The coordinates are stored in UTM17 in northing/easting columns but we can still easily map them into a point type for reprojection later.
- The output data source is actually the result of a join executed on the Oracle database attributing each sale with the location it was made. We don't have to ship the tables to CARTO separately.
The ability to run any SQL on the source database is a very powerful tool to ensure that the uploaded data is "just right" before it arrives on the CARTO side for analysis and display.
As before the VRT is run with a simple execution of the ogr2ogr command line:
Suppose you have a source of attribute information and a source of location information but they are in different formats in different databases: how to bring them together in CARTO? One way as usual would be to upload them separately and join them on the CARTO side with SQL. Another way is to use the power of ogr2ogr and VRT to do the join during the data upload.
For example imagine having transaction data in a PostgreSQL database and store locations in a Geodatabase. How to bring them together? Here's a joined_stores.vrt file that does the join in ogr2ogr:
Some things to note:
- The "joined" layer uses the VRT file itself in the <SrcDataSource> definition!
- Each <OGRVRTLayer> is a full-fledged VRT layer so you can do extra processing in them. Apply type definitions to CSV run complex SQL on a remote database whatever you want.
- The join layer uses the "SQLite" dialect so anything available in SQLite is available to you in the join step.
Almost an ETL
Combining the ability to read and write from multiple formats with the basic functionality of the SQLite engine and chaining operations through multi-layer VRT layers ogr2ogr provides the core functions of an "extract-transform-load" engine in a package that is easy to automate and maintain.
For users with data behind a firewall who need more complex processing during their loads or who have data locked in formats the CARTO importer cannot read ogr2ogr should be an essential tool.
OGR is a subset of the GDAL suite of libraries and tools so you need to install GDAL to get ogr2ogr
- For Linux look for "gdal" packages in your Linux distribution of choice.
- For Mac OS X use the GDAL Framework builds.
- For Windows use the MS4 package system or pull the Mapserver builds from GisInternals and use the included GDAL binaries.