The default CARTO data importer is a pretty convenient way to quickly get data into the platform, but for enterprises setting up automated updates it has some limitations.
Fortunately, there's a handy command-line tool that can automate many common enterprise data loads: ogr2ogr.
ogr2ogr has a well-earned reputation for being hard to use. The command-line options are plentiful and terse, the standard documentation page lacks examples, and format-specific documentation is hidden away with the driver documentation.
The basic structure of an ogr2ogr call is "ogr2ogr -f format destination source". Here's a simple shapefile load:
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln interesting \
-f Carto \
"Carto:pramsey" \
interesting_things.shp
The parameters are:
- --debug ON turns on verbose output, so you can see what ogr2ogr is doing and get more useful error messages.
- --config CARTO_API_KEY sets the API key used to authenticate against your CARTO account.
- -t_srs "EPSG:4326" tells ogr2ogr to convert the spatial coordinates to "EPSG:4326" (WGS84) before writing them to CARTO. The CARTO driver expects inputs in WGS84, so this step is mandatory.
- -nln interesting sets the "new layer name", the name the table will get in CARTO. Without it, the output table takes the name of the input layer.
- -f Carto sets the output format.
- "Carto:pramsey" is the destination, in this case the CARTO account named "pramsey".
- interesting_things.shp is the source data, a shapefile.

Loading a File Geodatabase is almost the same as loading a shapefile, except that a file geodatabase can contain multiple layers, so the conversion must also specify which layer to convert, by adding the source layer name after the data source. You can load multiple layers in one run by providing multiple layer names (there's an example of that below).
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln cities \
-f Carto \
"Carto:pramsey" \
CountyData.gdb Cities
In this example, we take the “Cities” layer from the county database, and write it into the “cities” table of CARTO. Note that if you do not re-map the layer name to all lower case, you’ll get a mixed case layer in CARTO, which you may not want.
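To load several layers in one run, list each layer name after the data source. Here's a minimal sketch, assuming the geodatabase also contains a "Roads" layer (a hypothetical name); note there is no -nln here, so each source layer lands in its own CARTO table:
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-f Carto \
"Carto:pramsey" \
CountyData.gdb Cities Roads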
You can use OGR on any input data source to filter the data prior to loading. This can be useful for loads of large inputs that are “only the data since time X” or “only the data in this region”, like this:
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln cities \
-f Carto \
-sql "SELECT * FROM Cities WHERE state_fips = 53" \
"Carto:pramsey" \
CountyData.gdb
Since the filter is just a SQL statement, the filter can both reduce the number of records and also apply transforms to the output on the way through: reduce the number of columns, apply some data transformations, anything that is possible using the SQLite dialect of SQL (enabled with the -dialect SQLITE option). Note that because the -sql option supplies the layer to convert, there is no layer name after the data source.
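For example, here's a sketch of a load that uses the SQLite dialect to filter rows and add a computed column on the way through; the city_name and pop_2010 columns are hypothetical, so substitute your own:
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln big_cities \
-f Carto \
-dialect SQLITE \
-sql "SELECT *, lower(city_name) AS city_lower FROM Cities WHERE pop_2010 > 100000" \
"Carto:pramsey" \
CountyData.gdb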
By default, ogr2ogr runs in "append" mode (you can force it with the -append flag), so if you run the same translation multiple times, you'll get rows added into your table. This can be useful for processes that regularly take the most recent entries and copy them into CARTO.
For translations where you want to replace the existing table, use the -overwrite mode, which will drop the existing table and create a new one in its place.
Because of some limitations in how the OGR CARTO driver handles primary keys, the OGR -update mode does not work correctly.
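As a sketch, a full refresh of the cities table from the earlier example just adds the -overwrite flag:
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-nln cities \
-overwrite \
-f Carto \
"Carto:pramsey" \
CountyData.gdb Cities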
As you can see, the command-line complexity of an OGR conversion starts high. The complexity only goes up as advanced features like filtering and arbitrary SQL are added.
To contain the complexity in one location, you can use the OGR "virtual format", VRT files, to define your data sources. This is handy for managing a library of conversions in source control. Each data source becomes its own VRT file, and the actual OGR commands become smaller.
CSV files are convenient ways of passing data, but they are under-defined: they supply column names, but not column types. This forces CSV consumers to do type guessing based on the input data, or to coerce every input to a lowest-common-denominator string type.
Particularly for repeated and automated uploads, it would be nice to define the column types once beforehand and have them respected in the final CARTO table.
For example, take this tiny CSV file:
longitude,latitude,name,the_date,the_double,the_int,the_int16,the_int_as_str,the_datetime
-120,51,"First Place",2018-01-01,2.3,123456789,1234,00001234,2014-03-04 08:12:23
-121,52,"Second Place",2017-02-02,4.3,423456789,4234,00004234,"2015-05-05 09:15:25"
Using a VRT, we can define a CSV file as a source, and also add the rich metadata needed to support proper type definitions:
<OGRVRTDataSource>
  <OGRVRTLayer name="test_csv">
    <SrcDataSource>/data/exports/test_csv.csv</SrcDataSource>
    <GeometryField encoding="PointFromColumns" x="longitude" y="latitude"/>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>WGS84</LayerSRS>
    <OpenOptions>
      <OOI key="EMPTY_STRING_AS_NULL">YES</OOI>
    </OpenOptions>
    <Field name="name" type="String" nullable="false" />
    <Field name="a_date" type="Date" src="the_date" nullable="true" />
    <Field name="the_double" type="Real" nullable="true" />
    <Field name="the_int" type="Integer" nullable="true" />
    <Field name="the_int16" type="Integer" subtype="Int16" nullable="true" />
    <Field name="the_int_as_str" type="String" nullable="true" />
    <Field name="the_datetime" type="DateTime" nullable="true" />
  </OGRVRTLayer>
</OGRVRTDataSource>
This example has a number of things going on:
- <SrcDataSource> is an OGR connection string, as defined in the driver documentation for the format. For a CSV, it's just the path to a file with a "csv" extension.
- The <GeometryField> line maps coordinate columns into a point geometry.
- <LayerSRS> confirms the coordinates are WGS84. They could also be in some planar projection, and OGR can reproject them if requested.
- <OpenOptions> lets us pass one of the many CSV open options, in this case reading empty strings as NULLs.
- The <Field> type definitions use the "type" attribute to explicitly define types, including obscure ones like 16-bit integers.
- The "a_date" <Field> maps the source column name "the_date" to "a_date" in the target.
- The "name" <Field> creates a target column with a NOT NULL constraint.

To execute the translation, we use the VRT as the source argument in the ogr2ogr call.
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-f Carto \
"Carto:pramsey" \
test_csv.vrt
Imagine you have an Oracle database with sales information in it, and you want to upload a weekly snapshot of transactions. The database is behind the firewall, and the transactions need to be joined to location data in order to be mapped. How to do it?
With VRT tables and the OGR Oracle driver, it’s just some more configuration during the load step:
<OGRVRTDataSource>
  <OGRVRTLayer name="detroit_locations">
    <SrcDataSource>OCI:scott/password@ora.company.com</SrcDataSource>
    <LayerSRS>EPSG:26917</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="easting" y="northing"/>
    <SrcSQL>
      SELECT sales.sku, sales.amount, sales.tos,
             locs.easting, locs.northing
      FROM sales JOIN locs ON sales.loc_id = locs.loc_id
      WHERE locs.city = 'Detroit'
      AND sales.transaction_date > '2018-01-01'
    </SrcSQL>
  </OGRVRTLayer>
</OGRVRTDataSource>
Some things to note in this example:
- <SrcDataSource> holds the Oracle connection string.
- <SrcSQL> can run any SQL on the source database, a very powerful tool to ensure that the uploaded data is "just right" before it arrives on the CARTO side for analysis and display.
As before, the VRT is run with a simple execution of the ogr2ogr command line:
ogr2ogr \
--debug ON \
--config CARTO_API_KEY abcdefghijabcdefghijabcdefghijabcdefghij \
-t_srs "EPSG:4326" \
-f Carto \
"Carto:pramsey" \
test_oracle.vrt
Suppose you have a source of attribute information and a source of location information, but they are in different formats in different databases: how to bring them together in CARTO? One way, as usual, would be to upload them separately and join them on the CARTO side with SQL. Another way is to use the power of ogr2ogr and VRT to do the join during the data upload.
For example, imagine having transaction data in a PostgreSQL database, and store locations in a Geodatabase. How to bring them together? Here's a joined_stores.vrt file that does the join in ogr2ogr:
<OGRVRTDataSource>
  <OGRVRTLayer name="sales_data">
    <SrcDataSource>Pg:dbname=pramsey</SrcDataSource>
    <SrcLayer>sales.sales_data_2017</SrcLayer>
  </OGRVRTLayer>
  <OGRVRTLayer name="stores">
    <SrcDataSource>store_gis.gdb</SrcDataSource>
    <SrcLayer>Stores</SrcLayer>
  </OGRVRTLayer>
  <OGRVRTLayer name="joined">
    <SrcDataSource>joined_stores.vrt</SrcDataSource>
    <SrcSQL dialect="SQLITE">
      SELECT stores.*, sales_data.*
      FROM sales_data
      JOIN stores
      ON sales_data.store_id = stores.store_id
    </SrcSQL>
  </OGRVRTLayer>
</OGRVRTDataSource>
Some things to note:
- The "joined" layer refers back to the VRT file itself in its <SrcDataSource> definition!
- Each <OGRVRTLayer> is a full-fledged VRT layer, so you can do extra processing in them: apply type definitions to CSV, run complex SQL on a remote database, whatever you want.

Combining the ability to read and write from multiple formats with the basic functionality of the SQLite engine, and chaining operations through multi-layer VRT files, ogr2ogr provides the core functions of an "extract-transform-load" engine, in a package that is easy to automate and maintain.
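As a sketch of what that automation might look like, a nightly refresh could be a small shell script driven by cron. The script name and the store_sales table name are hypothetical, and in practice the API key would come from a secrets store rather than being inlined; note that GDAL config options like CARTO_API_KEY can also be supplied as environment variables:
#!/bin/sh
# nightly_load.sh: hypothetical nightly refresh of the joined stores table.
export CARTO_API_KEY=abcdefghijabcdefghijabcdefghijabcdefghij
ogr2ogr \
--debug ON \
-t_srs "EPSG:4326" \
-nln store_sales \
-overwrite \
-f Carto \
"Carto:pramsey" \
joined_stores.vrt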
For users with data behind a firewall, who need more complex processing during their loads, or who have data locked in formats the CARTO importer cannot read, ogr2ogr should be an essential tool.
OGR is a subset of the GDAL suite of libraries and tools, so you need to install GDAL to get ogr2ogr.
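For example, on common platforms the stock packages include the command-line tools (package names current as of this writing):
# Debian/Ubuntu
sudo apt-get install gdal-bin
# macOS with Homebrew
brew install gdal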