Beyond the Shapefile with File Geodatabase and GeoPackage
Anyone who has programmed geospatial software has eventually come to a conclusion about data formats: there is only one truly de facto standard for geospatial data, the shapefile, and the shapefile sucks.
- It requires at least three files to define one spatial layer (more if you want to specify a coordinate reference system, character encoding, or spatial index).
- It only supports column names of 10 or fewer characters.
- It lacks a time or timestamp data type.
- It is limited to 2GB in file size.
- It only supports homogeneous spatial types for each layer.
- It only supports text fields of up to 255 characters.
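To make the column-name limit concrete, here is a minimal Python sketch of what naive truncation to 10 characters does to longer names. The helper `truncate_field_names` and the sample column names are hypothetical, and real tools typically rename colliding fields rather than silently merging them, so treat this as an illustration of the problem, not of any particular library's behavior.

```python
from collections import defaultdict

DBF_MAX_FIELD_NAME = 10  # the shapefile column-name limit described above


def truncate_field_names(names):
    """Group original names by their naive 10-character truncation."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:DBF_MAX_FIELD_NAME]].append(name)
    return dict(groups)


# Hypothetical column names, for illustration only.
columns = ["population_2010", "population_2020", "name"]
for short, originals in truncate_field_names(columns).items():
    status = "COLLISION" if len(originals) > 1 else "ok"
    print(f"{short!r} <- {originals} ({status})")
```

Two distinct columns end up competing for the same truncated name, which is why exports to shapefile so often come back with mangled attribute tables.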
Almost since it invented the shapefile, Esri has been trying to come up with replacements.
The "personal geodatabase" used the Microsoft Access MDB format as a storage layer, and stuffed geospatial information into that. Sadly, the format inherited all the limitations of MDB, which were substantial: file size, bloat, occasional corruption, and of course Windows-only platform dependencies.
The "file geodatabase" (FGDB) got around the MDB limitations by using a storage engine Esri wrote themselves. Unfortunately, the format was complex enough that they could never fully specify it and release a document on the format (and never seemed to want to). As a result, official support has only been through a proprietary binary API.
The FGDB format is very close to a shapefile replacement:
- It supports multiple layers in one directory.
- It has no practical size limitations.
- It has a rich set of data types.
- It includes useful metadata about coordinate reference system and character encoding.
Since it shipped with ArcGIS 10 in 2010, the FGDB format has become popular in the Esri ecosystem, and it's not uncommon to find FGDB files on government open data sites or to receive them from GIS practitioners when requesting data.
CARTO has supported the shapefile format since day one, but we only recently added support for FGDB. We were able to support FGDB because the GDAL library we use in our import process has an open source, read-only FGDB driver. Using the "open FGDB" driver allows us to use a stock build of GDAL without incorporating the proprietary Esri API libraries.
The file geodatabase format is a collection of files inside a directory named with a .gdb extension. To transfer that structure around, FGDB files are first zipped up, so any FGDB data you receive will be a zip file that unzips to a .gdb directory.
FGDB data are loaded into CARTO just like any other format.
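As a quick check before uploading, the following Python sketch looks inside a zip archive for .gdb directories using only the standard library. The function name `find_gdb_dirs` is hypothetical, and the sketch assumes nothing about the files inside the directory beyond the .gdb suffix described above.

```python
import zipfile
from pathlib import PurePosixPath


def find_gdb_dirs(zip_path):
    """Return the sorted names of all .gdb directories inside a zip archive."""
    found = set()
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Zip member names always use forward slashes.
            parts = PurePosixPath(member).parts
            for depth, part in enumerate(parts):
                # A component is a directory if something follows it,
                # or if the member itself is an explicit directory entry.
                is_dir = depth < len(parts) - 1 or member.endswith("/")
                if is_dir and part.lower().endswith(".gdb"):
                    found.add("/".join(parts[: depth + 1]))
                    break
    return sorted(found)
```

An empty result suggests the archive is not a zipped file geodatabase, or that the .gdb directory itself was left out when the files were zipped.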
- Use the "New Dataset" option and either browse to your FGDB .zip file or drag'n'drop it in.
- Or just drag'n'drop the .zip file directly into the datasets dashboard.
After loading, you will have one new dataset in your account for each layer in the FGDB, named using a datasetname_layername pattern.
For example, the Elections.zip file from Clark County, Nevada includes 11 layers, as we can see by looking at the ogrinfo output for the file.
INFO: Open of `Election.gdb'
      using driver `OpenFileGDB' successful.
1: senate_p (Multi Polygon)
2: school_p (Multi Polygon)
3: regent_p (Multi Polygon)
4: precinct_p (Multi Polygon)
5: ward_p (Multi Polygon)
6: congress_p (Multi Polygon)
7: pollpnts_x (Point)
8: educat_p (Multi Polygon)
9: township_p (Multi Polygon)
10: commiss_p (Multi Polygon)
11: assembly_p (Multi Polygon)
After upload, the file has been converted to 11 datasets with the standard naming pattern.
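The mapping from layer listing to dataset names can be sketched in a few lines of Python. This parses ogrinfo-style listing text and applies a datasetname_layername pattern; the function names, the lowercasing, and the `elections` file stem are assumptions made for illustration, not CARTO's documented normalization rules.

```python
import re

# Matches lines such as "7: pollpnts_x (Point)" from an ogrinfo layer listing.
LAYER_LINE = re.compile(r"^\s*(\d+):\s+(\S+)\s+\((.+)\)\s*$")


def parse_layers(ogrinfo_text):
    """Return (layer_name, geometry_type) pairs from ogrinfo listing text."""
    layers = []
    for line in ogrinfo_text.splitlines():
        match = LAYER_LINE.match(line)
        if match:
            layers.append((match.group(2), match.group(3)))
    return layers


def dataset_names(file_stem, layers):
    """Apply a datasetname_layername pattern (exact normalization is assumed)."""
    return [f"{file_stem}_{name}".lower() for name, _ in layers]


# A shortened copy of the listing above.
listing = """INFO: Open of `Election.gdb'
      using driver `OpenFileGDB' successful.
1: senate_p (Multi Polygon)
7: pollpnts_x (Point)
"""
layers = parse_layers(listing)
print(layers)
print(dataset_names("elections", layers))
```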
If the FGDB format is so much better than shapefiles, why doesn't the story end there?
Because FGDB still has a few major problems:
- There is no open source way to write to an FGDB file: that requires the proprietary Esri API libraries.
- The FGDB format is a directory, so shipping it around involves annoying extra zip/unzip steps each time.
- The FGDB format is closed, so there is no way to extend it for special use cases.
A couple of years after FGDB was released, the Open Geospatial Consortium (OGC) took on the task of defining a "shapefile replacement" format that learned all the lessons of shapefiles, personal geodatabases, and file geodatabases:
- Use open source SQLite as the storage engine: it is more reliable and platform independent than MDB, while keeping the advantage of easy, language-independent read/write access via SQL.
- The SQLite engine is open source and multi-platform, so no Windows dependency.
- The SQLite engine stores data in a single file so no need to zip/unzip all the time.
- Leverage existing OGC standards like the WKT standard for spatial reference systems and the WKB standard for binary geometry representation.
- Document the format and include an extension mechanism so it can evolve over time and so third parties can experiment with new extensions.
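Because the storage engine is SQLite, any language with a SQLite binding can inspect a GeoPackage without geospatial libraries. The sketch below uses Python's standard `sqlite3` module to read `gpkg_contents`, the table the GeoPackage standard uses to list the layers in a file. The function name `list_gpkg_layers` is hypothetical, and the sketch assumes the path points at an existing, valid GeoPackage.

```python
import sqlite3


def list_gpkg_layers(path):
    """Return (table_name, data_type, srs_id) rows from gpkg_contents."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT table_name, data_type, srs_id "
            "FROM gpkg_contents ORDER BY table_name"
        ).fetchall()
    finally:
        connection.close()
    return rows
```

Calling `list_gpkg_layers("example.gpkg")` (a placeholder path) on a file that is not a GeoPackage raises `sqlite3.OperationalError`, because the `gpkg_contents` table will be missing.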
The result is the GeoPackage (GPKG) format, which has become widely used in the open source world, and increasingly throughout the geospatial software ecosystem.
Loading GeoPackage into CARTO now works exactly the same as FGDB: use the "New Dataset" page or just drag the file into the datasets dashboard. All the layers will be imported using the filename_layername pattern.
You can also now use GeoPackage as an export format! Click the export button and select the GPKG format and you'll get a single-layer GeoPackage with your table inside, ready for sharing with the world.
All this works because of the wonderful multi-format tools in the GDAL library, which we use as part of our import process. You can exercise the power of GDAL yourself to directly solve your CARTO ETL problems using the ogr2ogr and ogrinfo tools in GDAL. Check it out!
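As a starting point, here is a minimal Python sketch that wraps ogr2ogr to convert a file geodatabase directory into a single GeoPackage file. The function names and any paths you pass in are hypothetical, and the sketch assumes the GDAL command-line tools are installed and on your PATH.

```python
import shutil
import subprocess


def gdb_to_gpkg_command(gdb_dir, gpkg_path):
    """Build an ogr2ogr command that converts every layer to GeoPackage."""
    # ogr2ogr takes the destination first, then the source.
    return ["ogr2ogr", "-f", "GPKG", gpkg_path, gdb_dir]


def convert(gdb_dir, gpkg_path):
    """Run the conversion, if the GDAL command-line tools are installed."""
    if shutil.which("ogr2ogr") is None:
        raise RuntimeError("ogr2ogr not found on PATH; install GDAL first")
    subprocess.run(gdb_to_gpkg_command(gdb_dir, gpkg_path), check=True)
```

For example, `convert("Election.gdb", "election.gpkg")` would run the conversion on an unzipped geodatabase, and running `ogrinfo` on the result should list the same layers as the source.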