The default CARTO data importer is a pretty convenient way to quickly get data into the platform, but for enterprises setting up automated update it has some limitations:
- there’s no way to define type coercions for CSV data;
- some common GIS formats like File Geodatabase aren’t supported;
- the sync facility is “pull” only, so data behind a firewall is inaccessible; and,
- the sync cannot automatically filter the data before loading it into the system.
Fortunately, there’s a handy command-line tool that can automate many common enterprise data loads:
ogr2ogr has a well-earned reputation for being hard to use. The commandline options are plentiful and terse, the standard documentation page lacks examples, and format-specific documentation is hidden away with the driver documentation.
The basic structure of an
ogr2ogr call is “ogr2ogr -f format destination source”. Here’s a simple shapefile load.
ogr2ogr \ --debug ON \ --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \ -t_srs "EPSG:4326" \ -nln interesting \ -f Carto \ "Carto:pramsey" \ interesting_things.shp
The parameters are:
- –debug turns on verbose debugging, which is useful during development to see what’s happening behind the scenes.
- –config is used to pass generic “configuration parameters”. In this case we pass our CARTO API key so we are allowed to write to the database.
- -t_srs is the “target spatial reference system”, telling
ogr2ogrto convert the spatial coordinates to “EPSG:4326” (WGS84) before writing them to CARTO. The CARTO driver expects inputs in WGS84, so this step is mandatory.
- -nln is the “new layer name”, so the name of the uploaded table can differ from that of the input file.
- -f is the format of the destination layer, so for uploads to CARTO, it is always “Carto”.
- Carto:pramsey is the “destination datasource”, so it’s a CARTO source, in the “pramsey” account. Change this to your user name. (Note for multi-user accounts: you must supply your user name here, not your organization name.)
- interesting_things.shp is the “source datasource”, which for a shapefile is just the path to the file.
Loading a File Geodatabase is almost the same as loading a shapefile, except that a file geodatabase can contain multiple layers, so the conversion must also specify which layer to convert, by adding the source layer name after the data source. You can load multiple layers in one run by providing multiple layer names.
ogr2ogr \ --debug ON \ --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \ -t_srs "EPSG:4326" \ -nln cities \ -f Carto \ "Carto:pramsey" \ CountyData.gdb Cities
In this example, we take the “Cities” layer from the county database, and write it into the “cities” table of CARTO. Note that if you do not re-map the layer name to all lower case, you’ll get a mixed case layer in CARTO, which you may not want.
You can use OGR on any input data source to filter the data prior to loading. This can be useful for loads of large inputs that are “only the data since time X” or “only the data in this region”, like this:
ogr2ogr \ --debug ON \ --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \ -t_srs "EPSG:4326" \ -nln cities \ -f Carto \ -sql "SELECT * FROM Cities WHERE state_fips = 53" \ "Carto:pramsey" \ CountyData.gdb Cities
Since the filter is just a SQL statement, the filter can both reduce the number of records and also apply transforms to the output on the way: reduce the number of columns, apply some data transformations, anything that is possible using the SQLite dialect of SQL.
Overwrite or Append
ogr2ogr runs in “append” mode (you can force it with the
-append flag, so if you run the same translation multiple times, you’ll get rows added into your table. This be useful for processes that regularly take the most recent entries and copy them into CARTO.
For translations where you want to replace the existing table, use the
-overwrite mode, which will drop the existing table, and create a new one in its place.
Because of some limitations in how the OGR CARTO driver handles primary keys, the OGR
-update mode does not work correctly.
OGR Virtual Format
As you can see, the command-line complexity of an OGR conversions starts high. The complexity only goes up as advanced features like filtering and arbitrary SQL are added.
To contain the complexity in one location, you can use the OGR “virtual format”, VRT files, to define your data sources. This is handy for managing a library of conversions in source control. Each data source becomes it’s own VRT file, and the actual OGR commands become smaller.
CSV Type Enforcement
CSV files are convenient ways of passing data, but they are under-defined: they supply column names, but not column types. This forces CSV consumers to do type guessing based on the input data, or to coerce every input to a lowest common denominator string type.
Particularly for repeated and automated uploads it would nice to define the column types once beforehand and have them respected in the final CARTO table.
For example, take this tiny CSV file:
longitude,latitude,name,the_date,the_double,the_int,the_int16,the_int_as_str,the_datetime -120,51,"First Place",2018-01-01,2.3,123456789,1234,00001234,2014-03-04 08:12:23 -121,52,"Second Place",2017-02-02,4.3,423456789,4234,00004234,"2015-05-05 09:15:25"
Using a VRT, we can define a CSV file as a source, and also add the rich metadata needed to support proper type definitions:
<OGRVRTDataSource> <OGRVRTLayer name="test_csv"> <SrcDataSource>/data/exports/test_csv.csv</SrcDataSource> <GeometryField encoding="PointFromColumns" x="longitude" y="latitude"/> <GeometryType>wkbPoint</GeometryType> <LayerSRS>WGS84</LayerSRS> <OpenOptions> <OOI key="EMPTY_STRING_AS_NULL">YES</OOI> </OpenOptions> <Field name="name" type="String" nullable="false" /> <Field name="a_date" type="Date" src="the_date" nullable="true" /> <Field name="the_double" type="Real" nullable="true" /> <Field name="the_int" type="Integer" nullable="true" /> <Field name="the_int16" type="Integer" subtype="Int16" nullable="true" /> <Field name="the_int_as_str" type="String" nullable="true" /> <Field name="the_datetime" type="DateTime" nullable="true" /> </OGRVRTLayer> </OGRVRTDataSource>
This example has a number of things going on:
<SrcDataSource>is an OGR connection string, as defined in the driver documentation for the format. For a CSV, it’s just the path to a file with a “csv” extension.
<GeometryField>line maps coordinate columns into a point geometry.
<LayerSRS>confirms the coordinates are WGS84. They could also be some planar format, and OGR can reproject them if requested.
<OpenOptions>let us pass one of the many CSV open options.
<Field>type definitions, using the “type” attribute to explicitly define types, including obscure ones like 16-bit integers.
- Column renaming, in the “a_date”
<Field>, maps the source column name “the_date” to “a_date” in the target.
- Null enforcement, in the “name”
<Field>, creates a target column with a
To execute the translation, we use the VRT as the source argument in the
ogr2ogr \ --debug ON \ --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \ -t_srs "EPSG:4326" \ -f Carto \ "Carto:pramsey" \ test_csv.vrt
Imagine you have an Oracle database with sales information in it, and you want to upload a weekly snaphot of transactions. The database is behind the firewall, and the transactions need to be joined to location data in order to mapped. How to do it?
With VRT tables and the OGR Oracle driver, it’s just some more configuration during the load step:
<OGRVRTDataSource> <OGRVRTLayer name="detroit_locations"> <SrcDataSource>OCI:email@example.com</SrcDataSource> <LayerSRS>EPSG:26917</LayerSRS> <GeometryField encoding="PointFromColumns" x="easting" y="northing"/> <SrcSQL> SELECT sales.sku, sales.amount, sales.tos, locs.latitude, locs,longitude FROM sales JOIN locs ON sales.loc_id = locs.loc_id WHERE locs.city = 'Detroit' AND sales.transaction_date > '2018-01-01' </SrcSQL> </OGRVRTLayer> </OGRVRTDataSource>
Some things to note in this example:
<SrcDataSource>holds the Oracle connection string
- The coordinates are stored in UTM17, in northing/easting columns, but we can still easily map them into a point type for reprojection later.
- The output data source is actually the result of a join executed on the Oracle database, attributing each sale with the location it was made. We don’t have to ship the tables to CARTO separately.
The ability to run any SQL on the source database is a very powerful tool to ensure that the uploaded data is “just right” before it arrives on the CARTO side for analysis and display.
As before, the VRT is run with a simple execution of the
ogr2ogr command line:
ogr2ogr \ --debug ON \ --config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \ -t_srs "EPSG:4326" \ -f Carto \ "Carto:pramsey" \ test_oracle.vrt
Suppose you have a source of attribute information and a source of location information, but they are in different formats in different databases: how to bring them together in CARTO? One way, as usual, would be to upload them separately and join them on the CARTO side with SQL. Another way is to use the power of
ogr2ogr and VRT to do the join during the data upload.
For example, imagine having transaction data in a PostgreSQL database, and store locations in a Geodatabase. How to bring them together? Here’s a
joined_stores.vrt file that does the join in
<OGRVRTDataSource> <OGRVRTLayer name="sales_data"> <SrcDataSource>Pg:dbname=pramsey</SrcDataSource> <SrcLayer>sales.sales_data_2017</SrcLayer> </OGRVRTLayer> <OGRVRTLayer name="stores"> <SrcDataSource>store_gis.gdb</SrcDataSource> <SrcLayer>Stores</SrcLayer> </OGRVRTLayer> <OGRVRTLayer name="joined"> <SrcDataSource>joined_stores.vrt</SrcDataSource> <SrcSQL dialect="SQLITE"> SELECT stores.*, sales_data.* FROM sales_data JOIN stores ON sales_data.store_id = stores.store_id </SrcSQL> </OGRVRTLayer> </OGRVRTDataSource>
Some things to note:
- The “joined” layer uses the VRT file itself in the
<OGRVRTLayer>is a full-fledged VRT layer, so you can do extra processing in them. Apply type definitions to CSV, run complex SQL on a remote database, whatever you want.
- The join layer uses the “SQLite” dialect, so anything available in SQLite is available to you in the join step.
Almost an ETL
Combining the ability to read and write from multiple formats with the basic functionality of the SQLite engine, and chaining operations through multi-layer VRT layers,
ogr2ogr provides the core functions of an “extract-transform-load” engine, in a package that is easy to automate and maintain.
For users with data behind a firewall, who need more complex processing during their loads, or who have data locked in formats the CARTO importer cannot read,
ogr2ogr should be an essential tool.
OGR is a subset of the GDAL suite of libraries and tools, so you need to install GDAL to get