Recently, as part of our ongoing mission to empower Data Scientists with the best data and analysis, we announced the integration of our platform with Databricks, using either our Direct SQL Connection feature or CARTOframes, our Python package.
Last week Raela Wang and Borja Muñoz hosted a webinar to explore examples where Databricks and CARTO come together as part of this collaborative Data Science process. This post summarizes the key points covered in the webinar including:
Data Science teams often have to perform spatial analytics over very large datasets. Combining CARTO’s spatial analysis functionality with the Databricks Unified Analytics Platform lets data scientists run that analysis on a distributed cluster.
Data is usually stored in siloed database systems, making it challenging to enrich and combine your datasets. The CARTO Direct SQL Connection feature allows you to access your spatial data from the Databricks platform and combine it with the data in your Delta Lake.
When you are working with a dataset with spatial information, you need to have a way to explore the data interactively on a map. The use of CARTOframes within Databricks notebooks allows you to generate insightful map visualizations from your spatial data.
Existing tables from the CARTO Spatial Database can be read into the Databricks platform as a data source, where transformations run on the Apache Spark cluster. The data can also be extracted, transformed, and loaded into a data lake (Delta Lake in this example), where advanced analytics can be run and machine learning models trained. Likewise, visualization, data enrichment, and analysis can be performed using CARTOframes. At any step of the pipeline, data can be persisted back into the CARTO Spatial Database.
CARTO is based on PostgreSQL and PostGIS, but because it is a managed service there is no direct access to the PostgreSQL server. Instead, the database and datasets can be accessed using the Direct SQL Connection feature we announced earlier this year, which can be used within a Databricks notebook to read and write data using Spark dataframes. In our post announcing the Databricks integration we walked through an example of this process.
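As a rough illustration of reading and writing through a PostgreSQL-compatible connection from a notebook, the sketch below uses Spark's generic JDBC data source. It assumes the PostgreSQL JDBC driver is available on the cluster and that `spark` is the notebook's existing session. The host, database, user, secret, and table name are placeholders, not real CARTO values; take the actual connection parameters from the instructions on your dashboard's settings page.

```python
def carto_jdbc_options(host, database, user, password, table, port=5432):
    """Build Spark JDBC options for a PostgreSQL-compatible endpoint.

    All arguments are supplied by the caller; nothing here is a real
    CARTO hostname or credential.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def read_table(spark, options):
    """Load the table named in options["dbtable"] as a Spark DataFrame."""
    return spark.read.format("jdbc").options(**options).load()


def write_table(df, options, mode="append"):
    """Persist a Spark DataFrame back through the same connection."""
    df.write.format("jdbc").options(**options).mode(mode).save()


# Placeholder values only (hypothetical names for illustration).
opts = carto_jdbc_options(
    host="<connection-host>",
    database="<database>",
    user="<user>",
    password="<secret>",
    table="my_table",
)
```

In a notebook, `read_table(spark, opts)` would return a Spark DataFrame that can be transformed on the cluster and later passed to `write_table`.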
The Direct SQL Connection method should be used when performing data engineering or manipulation tasks that are not well suited to relational databases, such as pivoting rows and columns (for example, wide-to-long transformations). It is also recommended when scalability is an issue, because you can scale and distribute the computation within your Databricks cluster.
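To make the wide-to-long case concrete, here is a minimal sketch that uses Spark SQL's `stack` generator through `selectExpr`. The column names (`pop_2018`, `pop_2019`) and the helper names are hypothetical. It assumes the value columns share a compatible type and that column names contain no quotes or backticks.

```python
def stack_expr(value_columns, key_name="variable", value_name="value"):
    """Build a Spark SQL stack() expression that unpivots value_columns.

    Each column becomes one output row holding the column name (as a
    string literal) and the column's value.
    """
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_columns)
    return f"stack({len(value_columns)}, {pairs}) as ({key_name}, {value_name})"


def wide_to_long(df, id_columns, value_columns):
    """Return a long-format Spark DataFrame: id columns plus key/value."""
    ids = [f"`{c}`" for c in id_columns]
    return df.selectExpr(*ids, stack_expr(value_columns))


# Hypothetical example: two yearly population columns.
expr = stack_expr(["pop_2018", "pop_2019"])
```

Because the reshaping is expressed as a Spark expression, the work is distributed across the cluster rather than done inside the database.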
CARTOframes should be used when you want to enrich, analyze, or visualize geospatial data directly within Databricks notebooks.
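A minimal sketch of a CARTOframes workflow might look like the following, assuming CARTOframes 1.x is installed on the cluster. The function name, table names, and credentials are hypothetical placeholders. How the resulting map renders depends on the notebook environment.

```python
def explore_and_save(source_table, target_table, username, api_key):
    """Read a CARTO table, save a copy, and return a map of it.

    Imports are local so this sketch can be loaded even where
    CARTOframes is not installed.
    """
    from cartoframes import read_carto, to_carto
    from cartoframes.auth import set_default_credentials
    from cartoframes.viz import Layer, Map

    # Credentials are supplied by the caller (placeholders in this sketch).
    set_default_credentials(username=username, api_key=api_key)

    # read_carto returns a GeoDataFrame for the given table.
    gdf = read_carto(source_table)

    # Persist the data under a new table name, replacing it if it exists.
    to_carto(gdf, target_table, if_exists="replace")

    # Build an interactive map with a single layer.
    return Map(Layer(gdf))
```

Any enrichment or analysis step would go between the read and the write, operating on the GeoDataFrame.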
These features are available right now to all enterprise customers. To start using the Direct SQL Connection, just go to the settings page in your dashboard and follow the instructions to set up a new connection. To get started with CARTOframes you can read the Quickstart guide. If you encounter any issues, just reach out to your Customer Success Manager.
Want to learn more? Watch the full webinar.