How to Build a Tree Equity Index & Dashboard using BigQuery


Lack of trees is linked to higher temperatures & more violent crime. Discover how to use the BigQuery Spatial Extension & CARTO for React to build a tree score dashboard.

This post may describe functionality for an old version of CARTO. Find out about the latest and cloud-native version here.
How to Build a Tree Equity Index & Dashboard using BigQuery

Like many people who love trees and work in the geospatial field I was fascinated (and disheartened) by a recent article I read in the New York Times called Since When Have Trees Existed Only for Rich Americans? This op-ed and data visualization explores how city trees are far more likely to exist in wealthier neighborhoods vs. poorer neighborhoods. They show how this is generally true across many U.S. cities and describe how the lack of trees is linked to higher temperatures and more violent crime. It's disheartening to realize that something as simple as a tree on the side of the street is a luxury good.

This scale of spatial analysis wasn't always so easy. Only recently have we had access to the types of libraries frameworks and platforms that allow us to query across multiple large datasets to be able to better understand the built environment. Naturally I wanted to try this out and show how it could be done using Google Cloud BigQuery and CARTO's BigQuery Spatial Extension.

I built out a demo application that gives a tree score (based on number of trees and tree width) for the approx 6000 census block groups in New York City. This tutorial will explore the SQL queries I used to do this and hopefully inspire others to create custom index scores with similar datasets.

Building a Tree Equity Index

With two long SQL queries I was able to generate a custom tree score for each census block group in New York City. Thanks to BigQuery’s speed this was calculated in roughly 5-6 seconds. The datasets I worked with were publicly available in BigQuery:

Calculating ACS and Street Tree Data by Block Group (first query)

The first query has several Common Table Expression (CTEs) but it’s fairly easy to follow. Here are the CTEs:

  1. The first CTE (nycbgs) SELECTs census block groups geographies (that are located in NYC’s counties) and specific columns like population and median income from the ACS dataset.
  2. The second CTE (count_by_bg) calculates the number (COUNT) and size (average diameter at breast height) for each tree that falls within 10 meters of the block group.
  3. The final part of the query pulls it all together by joining back to the 2nd CTE and calculating the number of trees per 1000 square meters. If you don’t normalize by the block group total area it skews the index in odd ways because larger block groups naturally hold more trees on average.
  4. Finally this is all saved to a new table in my BigQuery account using the CREATE TABLE AS

I now have a table with rich ACS and tree information for each block group in the city. Here’s the query with many inline comments:

Creating the Custom Tree Index Score (second query)

I’ll now show how I created a query that gave each block group a custom ‘tree score’. This tree score gives us an overall view at tree quantity and size within an area and allows for comparison to other areas. We can also see if tree scores correlate with things like median income or other ACS sociodemographics similar to what the New York Times piece did.

This query also has several CTEs but I’ll break it all down and add many comments:

  1. The first CTE (all_and_stscaler) uses built-in BigQuery Machine Learning preprocessing functions to standardize our features (tree count and average width). This column should now have a mean of 0 and standard deviation of 1. This allows us to work with features that have different scales (tree count is a simple count whereas width is in inches). The ML.STANDARD_SCALER function does this quickly and seamlessly. This Stackoverflow thread is great if you need to learn more about using a Standard Scaler.
  2. The 2nd CTE (sum_to_create_score) sums the tree count and average tree size scores together. I decided to weight the tree count heavier than the average tree size by multiplying by different numbers.
  3. The 3rd CTE (capping) uses the SQL CASE WHEN syntax to cap the score at 3 or -3. This prevents extreme outliers from throwing off the overall index score.
  4. Finally we use a bit of math to calculate a 0-100 index score. This calculates how close to the max score each score is with 100 being the max. The index score could be 0 to 10 or 0 to 200 (or anything else). Nice round numbers make it easier for people to understand the score (since many other index scores use similar things).

The weighting part in the 2nd step is an important one. Index scores usually include multiple indicators and datasets. The index score creator can choose how they want to weight the variables. I decided to weight the quantity of trees a bit more than tree width but you might decide to do this differently. It would be interesting to include overall tree health or diversity of trees as part of the score. The index score creator should strive to be transparent about their weighting.

Here’s the query with comments:

NYC Tree Score Dashboard

I used our CARTO for React framework to build a simple dashboard to showcase the results of my tree index score. I included both the census block groups as well as all of the approx 600 000 trees from the New York City tree census. It's amazing to see how trees cluster in little pockets around the city.

The dashboard is available here:

New York City Trees and Demographics Correlation Explorations

One of the core arguments of the New York Times piece (I mentioned above) was that trees tend to be located in wealthier neighborhoods. I used the Seaborn plotting library in Python (using CARTOframes) to determine if there was a correlation between the tree index score and the median income as well as the tree score and different racial demographics. Here’s what it looks like for median income:

Correlation between tree index score and median income

There was a weak positive correlation (.20) between median income and the tree score I calculated.

There was a weak negative correlation (-0.14) between the black population percentage and the tree score:

Correlation between tree index score and black population

There was a weak positive correlation (0.19) for the white population percentage and the tree score

Correlation between tree index score and white population

There was a weak negative correlation (-0.07) between hispanic population percentage and the tree score

Correlation between tree index score and hispanic population

While the positive correlation between median income and abundance of trees is not ideal it's not as dramatic as I might have expected. It's also reassuring to see but there are not strong correlations between major racial demographics and trees. In many ways it's fair to say that New York City has a mostly egalitarian distribution of trees.

Thanks to my colleague Stephanie Schober for showing us how to calculate index scores.

map from Carto

Want to create dashboards like this? Access the Spatial Extension for BigQuery