This content applies only to previous CARTO products.

Please check whether it is relevant to your use case. In October 2021 we released a new version of our platform. You can learn more and read the latest documentation at docs.carto.com.


How to connect to your CARTO database from Databricks

This tutorial demonstrates how to connect to your CARTO database from a Databricks cluster. We strongly recommend reading the Direct SQL Connection article to learn more about that feature before continuing.

Introduction

With the Direct SQL Connection you can connect directly from your Databricks cluster to your CARTO database. You can read CARTO datasets as Spark DataFrames, perform spatial analysis on massive datasets (using one of the many available libraries), and store the results back in CARTO for visualization.

Tested environments

  • Databricks Community Edition
  • Runtime 6.4 (Scala 2.11, Spark 2.4.5, OpenJDK 8)

Connect from a notebook

  • Go to the Cluster configuration page. Select the Spark Cluster UI - Master tab and get the master node IP address from the hostname label.
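If the label shows a hostname rather than a plain address, one way to check the public IP your cluster uses for outbound traffic is to query an external echo service from a notebook cell (ifconfig.me is just one example; any similar service works):

%sh curl -s https://ifconfig.me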

  • Through the Settings page in your CARTO dashboard, add this IP address to the list of IP addresses allowed to use the Direct SQL Connection.

  • Click Home in the sidebar and create a new Python notebook. The instructions below can be adapted to Scala with minimal changes.

  • Test connectivity from your cluster to the CARTO server by running the following instruction in a cell:

%sh nc -vz dbconn-geuw.carto.com 5432
  • Copy the certificate files to the master node. You can use an S3 bucket, the wget tool, or any other method. This step is only necessary if you want to implement client authentication using TLS certificates.
%sh wget {url} -O {path_to_file}/{filename}
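The JDBC driver expects the client key in DER PKCS #8 format. If the key you received is a PEM file (an assumption; adjust the paths to your setup), you can convert it with openssl from a notebook cell:

%sh openssl pkcs8 -topk8 -inform PEM -outform DER -in {path_to_file}/client.key -out {path_to_file}/client.key.pk8 -nocrypt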
  • Read a CARTO dataset as a Spark DataFrame (you must use the DER PKCS #8 key file, with a “.pk8” extension) with the following command if you want to use the TLS certificate for authentication:
# Read a CARTO dataset over the Direct SQL Connection, authenticating
# with both the API key and TLS client certificates.
remote_table = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://dbconn-geuw.carto.com:5432/cartodb") \
    .option("dbtable", "{dataset}") \
    .option("user", "{username}") \
    .option("password", "{API key}") \
    .option("ssl", True) \
    .option("sslmode", "verify-full") \
    .option("sslcert", "{path_to_file}/client.crt") \
    .option("sslkey", "{path_to_file}/client.key.pk8") \
    .option("sslrootcert", "{path_to_file}/server_ca.pem") \
    .load()
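A quick way to confirm that the read works is to inspect the schema and preview a few rows:

remote_table.printSchema()
display(remote_table.limit(10))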
  • If you can’t or don’t want to use client authentication with TLS certificates, you can change the sslmode option to verify-ca and remove the sslcert and sslkey options:
# Same read, but with server verification only (no client certificates).
remote_table = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://dbconn-geuw.carto.com:5432/cartodb") \
    .option("dbtable", "{dataset}") \
    .option("user", "{username}") \
    .option("password", "{API key}") \
    .option("ssl", True) \
    .option("sslmode", "verify-ca") \
    .option("sslrootcert", "{path_to_file}/server_ca.pem") \
    .load()
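Because the connection targets a PostgreSQL/PostGIS database, you can also push a SQL query down to CARTO instead of reading a whole table, replacing the dbtable option with the query option (supported since Spark 2.4). This is a sketch: cartodb_id and the_geom are the standard CARTO columns, and {dataset} is your table name:

# Push a PostGIS query down to CARTO; only the result set is transferred.
filtered = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://dbconn-geuw.carto.com:5432/cartodb") \
    .option("query", "SELECT cartodb_id, ST_AsText(the_geom) AS geom_wkt FROM {dataset} LIMIT 1000") \
    .option("user", "{username}") \
    .option("password", "{API key}") \
    .option("ssl", True) \
    .option("sslmode", "verify-ca") \
    .option("sslrootcert", "{path_to_file}/server_ca.pem") \
    .load()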
  • Run your spatial analysis in your Databricks cluster, then store the results back in a CARTO dataset, as sketched below.
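As a minimal sketch of that last step, the generic Spark JDBC writer can append results over the same connection. This assumes your API key has write permission on the target table, and that geometries have already been serialized to WKT text, since the JDBC writer does not understand PostGIS geometry types (results_df and {results_table} are hypothetical names):

# Append analysis results to a CARTO table over the same JDBC connection.
results_df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://dbconn-geuw.carto.com:5432/cartodb") \
    .option("dbtable", "{results_table}") \
    .option("user", "{username}") \
    .option("password", "{API key}") \
    .option("ssl", True) \
    .option("sslmode", "verify-ca") \
    .option("sslrootcert", "{path_to_file}/server_ca.pem") \
    .mode("append") \
    .save()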