ETL into CARTO with ogr2ogr

The default CARTO data importer is a pretty convenient way to quickly get data into the platform, but for enterprises setting up automated update it has some limitations:

• there’s no way to define type coercions for CSV data;
• some common GIS formats like File Geodatabase aren’t supported;
• the sync facility is “pull” only, so data behind a firewall is inaccessible; and,
• the sync cannot automatically filter the data before loading it into the system.

Fortunately, there’s a handy command-line tool that can automate many common enterprise data loads: ogr2ogr.

Basic Operation

ogr2ogr has a well-earned reputation for being hard to use. The commandline options are plentiful and terse, the standard documentation page lacks examples, and format-specific documentation is hidden away with the driver documentation.

Shapefile

The basic structure of an ogr2ogr call is “ogr2ogr -f format destination source”. Here’s a simple shapefile load.

ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln interesting \
-f Carto \
"Carto:pramsey" \
interesting_things.shp


The parameters are:

• –debug turns on verbose debugging, which is useful during development to see what’s happening behind the scenes.
• –config is used to pass generic “configuration parameters”. In this case we pass our CARTO API key so we are allowed to write to the database.
• -t_srs is the “target spatial reference system”, telling ogr2ogr to convert the spatial coordinates to “EPSG:4326” (WGS84) before writing them to CARTO. The CARTO driver expects inputs in WGS84, so this step is mandatory.
• -nln is the “new layer name”, so the name of the uploaded table can differ from that of the input file.
• -f is the format of the destination layer, so for uploads to CARTO, it is always “Carto”.
• Carto:pramsey is the “destination datasource”, so it’s a CARTO source, in the “pramsey” account. Change this to your user name. (Note for multi-user accounts: you must supply your user name here, not your organization name.)
• interesting_things.shp is the “source datasource”, which for a shapefile is just the path to the file.

File Geodatabase

Loading a File Geodatabase is almost the same as loading a shapefile, except that a file geodatabase can contain multiple layers, so the conversion must also specify which layer to convert, by adding the source layer name after the data source. You can load multiple layers in one run by providing multiple layer names.

ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln cities \
-f Carto \
"Carto:pramsey" \
CountyData.gdb Cities


In this example, we take the “Cities” layer from the county database, and write it into the “cities” table of CARTO. Note that if you do not re-map the layer name to all lower case, you’ll get a mixed case layer in CARTO, which you may not want.

Filtering

You can use OGR on any input data source to filter the data prior to loading. This can be useful for loads of large inputs that are “only the data since time X” or “only the data in this region”, like this:

ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln cities \
-f Carto \
-sql "SELECT * FROM Cities WHERE state_fips = 53" \
"Carto:pramsey" \
CountyData.gdb Cities


Since the filter is just a SQL statement, the filter can both reduce the number of records and also apply transforms to the output on the way: reduce the number of columns, apply some data transformations, anything that is possible using the [SQLite dialect] of SQL.

Overwrite or Append

By default, ogr2ogr runs in “append” mode (you can force it with the -append flag, so if you run the same translation multiple times, you’ll get rows added into your table. This be useful for processes that regularly take the most recent entries and copy them into CARTO.

For translations where you want to replace the existing table, use the -overwrite mode, which will drop the existing table, and create a new one in its place.

Because of some limitations in how the OGR CARTO driver handles primary keys, the OGR -update mode does not work correctly.

OGR Virtual Format

As you can see, the command-line complexity of an OGR conversions starts high. The complexity only goes up as advanced features like filtering and arbitrary SQL are added.

To contain the complexity in one location, you can use the OGR “virtual format”, VRT files, to define your data sources. This is handy for managing a library of conversions in source control. Each data source becomes it’s own VRT file, and the actual OGR commands become smaller.

CSV Type Enforcement

CSV files are convenient ways of passing data, but they are under-defined: they supply column names, but not column types. This forces CSV consumers to do type guessing based on the input data, or to coerce every input to a lowest common denominator string type.

Particularly for repeated and automated uploads it would nice to define the column types once beforehand and have them respected in the final CARTO table.

For example, take this tiny CSV file:

longitude,latitude,name,the_date,the_double,the_int,the_int16,the_int_as_str,the_datetime
-120,51,"First Place",2018-01-01,2.3,123456789,1234,00001234,2014-03-04 08:12:23
-121,52,"Second Place",2017-02-02,4.3,423456789,4234,00004234,"2015-05-05 09:15:25"


Using a VRT, we can define a CSV file as a source, and also add the rich metadata needed to support proper type definitions:

<OGRVRTDataSource>
<OGRVRTLayer name="test_csv">
<SrcDataSource>/data/exports/test_csv.csv</SrcDataSource>
<GeometryField encoding="PointFromColumns" x="longitude" y="latitude"/>
<GeometryType>wkbPoint</GeometryType>
<LayerSRS>WGS84</LayerSRS>
<OpenOptions>
<OOI key="EMPTY_STRING_AS_NULL">YES</OOI>
</OpenOptions>
<Field name="name" type="String" nullable="false" />
<Field name="a_date" type="Date" src="the_date" nullable="true" />
<Field name="the_double" type="Real" nullable="true" />
<Field name="the_int" type="Integer" nullable="true" />
<Field name="the_int16" type="Integer" subtype="Int16" nullable="true" />
<Field name="the_int_as_str" type="String" nullable="true" />
<Field name="the_datetime" type="DateTime" nullable="true" />
</OGRVRTLayer>
</OGRVRTDataSource>


This example has a number of things going on:

• The <SrcDataSource> is an OGR connection string, as defined in the driver documentation for the format. For a CSV, it’s just the path to a file with a “csv” extension.
• The <GeometryField> line maps coordinate columns into a point geometry.
• The <LayerSRS> confirms the coordinates are WGS84. They could also be some planar format, and OGR can reproject them if requested.
• The <OpenOptions> let us pass one of the many CSV open options.
• The <Field> type definitions, using the “type” attribute to explicitly define types, including obscure ones like 16-bit integers.
• Column renaming, in the “a_date” <Field>, maps the source column name “the_date” to “a_date” in the target.
• Null enforcement, in the “name” <Field>, creates a target column with a NOT NULL constraint.

To execute the translation, we use the VRT as the source argument in the ogr2ogr call.

ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-f Carto \
"Carto:pramsey" \
test_csv.vrt


Database-side SQL

Imagine you have an Oracle database with sales information in it, and you want to upload a weekly snaphot of transactions. The database is behind the firewall, and the transactions need to be joined to location data in order to mapped. How to do it?

With VRT tables and the OGR Oracle driver, it’s just some more configuration during the load step:

<OGRVRTDataSource>
<OGRVRTLayer name="detroit_locations">
<LayerSRS>EPSG:26917</LayerSRS>
<GeometryField encoding="PointFromColumns" x="easting" y="northing"/>
<SrcSQL>
SELECT sales.sku, sales.amount, sales.tos,
locs.latitude, locs,longitude
FROM sales JOIN locs ON sales.loc_id = locs.loc_id
WHERE locs.city = 'Detroit'
AND sales.transaction_date > '2018-01-01'
</SrcSQL>
</OGRVRTLayer>
</OGRVRTDataSource>


Some things to note in this example:

• The <SrcDataSource> holds the Oracle connection string
• The coordinates are stored in UTM17, in northing/easting columns, but we can still easily map them into a point type for reprojection later.
• The output data source is actually the result of a join executed on the Oracle database, attributing each sale with the location it was made. We don’t have to ship the tables to CARTO separately.

The ability to run any SQL on the source database is a very powerful tool to ensure that the uploaded data is “just right” before it arrives on the CARTO side for analysis and display.

As before, the VRT is run with a simple execution of the ogr2ogr command line:

ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-f Carto \
"Carto:pramsey" \
test_oracle.vrt


Multi-format Sources

Suppose you have a source of attribute information and a source of location information, but they are in different formats in different databases: how to bring them together in CARTO? One way, as usual, would be to upload them separately and join them on the CARTO side with SQL. Another way is to use the power of ogr2ogr and VRT to do the join during the data upload.

For example, imagine having transaction data in a PostgreSQL database, and store locations in a Geodatabase. How to bring them together? Here’s a joined_stores.vrt file that does the join in ogr2ogr:

<OGRVRTDataSource>
<OGRVRTLayer name="sales_data">
<SrcDataSource>Pg:dbname=pramsey</SrcDataSource>
<SrcLayer>sales.sales_data_2017</SrcLayer>
</OGRVRTLayer>
<OGRVRTLayer name="stores">
<SrcDataSource>store_gis.gdb</SrcDataSource>
<SrcLayer>Stores</SrcLayer>
</OGRVRTLayer>
<OGRVRTLayer name="joined">
<SrcDataSource>joined_stores.vrt</SrcDataSource>
<SrcSQL dialect="SQLITE">
SELECT stores.*, sales_data.*
FROM sales_data
JOIN stores
ON sales_data.store_id = stores.store_id
</SrcSQL>
</OGRVRTLayer>
</OGRVRTDataSource>


Some things to note:

• The “joined” layer uses the VRT file itself in the <SrcDataSource> definition!
• Each <OGRVRTLayer> is a full-fledged VRT layer, so you can do extra processing in them. Apply type definitions to CSV, run complex SQL on a remote database, whatever you want.
• The join layer uses the “SQLite” dialect, so anything available in SQLite is available to you in the join step.

Almost an ETL

Combining the ability to read and write from multiple formats with the basic functionality of the SQLite engine, and chaining operations through multi-layer VRT layers, ogr2ogr provides the core functions of an “extract-transform-load” engine, in a package that is easy to automate and maintain.

For users with data behind a firewall, who need more complex processing during their loads, or who have data locked in formats the CARTO importer cannot read, ogr2ogr should be an essential tool.

Getting ogr2ogr

OGR is a subset of the GDAL suite of libraries and tools, so you need to install GDAL to get ogr2ogr

• For Linux, look for “gdal” packages in your Linux distribution of choice.
• For Mac OS X, use the GDAL Framework builds.
• For Windows, use the MS4 package system or pull the Mapserver builds from GisInternals and use the included GDAL binaries.

• Upgrades improve CARTO platform performance by > 20%

Here at CARTO we are constantly updating our platform so that our customers gain in system performance, reliability and security.

• 10,000 Dutch Bridges: A Foray into Using BigQuery Connector

One of my favorite parts about working at CARTO is getting to test out the new features that our product team is constantly shipping. A month or so ago, it was our Kepler c...

• Inside CARTO Engineering: Configuration Management Makeover

Our primary goal at CARTO is to be the world’s leading Location Intelligence platform, empowering our clients with the best data and the best spatial analysis. We frequentl...