No data? No problem with the Data Observatory

As you might have already heard, the Data Observatory just launched to help provide CartoDB users with a universe of data. One of the reasons we built the Data Observatory is because getting the third-party data you need is oftentimes the hardest part of analyzing your own data. Data wrangling shouldn’t be such a big roadblock to mapping and analyzing your world.

Some of the most desirable and most used datasets are also the hardest to handle. Datasets like the US Census are rich with information but can be large and distributed. What most people want, though, is just a snapshot of some boundaries and particular measures. For instance, if you were telling a story about unemployment in a city, you might want to grab Census Tracts for that city, the number of unemployed people for each tract, and the number of people 16 and over for each tract (as the denominator). All of this is no easy feat given the size of the datasets provided by the US Census.

For today’s post, I want to look at how we can use CartoDB’s Data Observatory to answer questions when we don’t have any data at all. Further, I want to focus on just the region I’m interested in, and avoid spending time on data and geometries that won’t be used in my analysis.

Let’s explore San Francisco’s current housing price boom.

We’ll pull Zillow’s San Francisco real estate data, available through the Data Observatory. For the boundaries, we will work with the US Census-generated ZIP Code Tabulation Area (ZCTA) boundaries, which correspond to ZIP codes for most purposes. To go beyond displaying a single measure, we will pull the most recent median home value per square foot and compare it to the same locations three years ago to show the explosive growth in the region.

Getting Appropriate Boundaries

To get started we’ll need the boundaries we’re interested in visualizing. Depending on the scale of your map (national, state, county), you will want different geometries. Zillow provides median home values for particular geographic regions, the most convenient of which is ZIP codes.

To find the available boundaries, we can use the function OBS_GetAvailableBoundaries which will list all of the IDs for boundaries available through the Data Observatory at the location we need them. Since we’re focused on San Francisco, we just pass the approximate location of the city and filter by descriptions which contain the word ‘zip’, like so:

SELECT * FROM
  OBS_GetAvailableBoundaries(CDB_LatLng(37.7749, -122.4194))
WHERE
  description ILIKE '%zip%'

The results look like this:

Get All Available Boundaries at a location

This will produce a temporary table of all the boundaries we can use that intersect that point. We want the ZCTA boundary since we’re focused on ZIP codes. I’ll also opt for the water-clipped version since we don’t need to do any analysis with the geometries; we’re only using them for visualization this time. The boundary_id we need is us.census.tiger.zcta5_clipped. You can read the boundary’s provenance right from this ID: the geometries come from US Census TIGER, they are ZCTA boundaries, and they are clipped (that is, water-clipped rather than statistical).

Now that we have our boundary ID, we can get our boundaries with the Data Observatory function OBS_GetBoundariesByPointAndRadius, which requires a center, a radius (in meters), and the boundary ID we’re interested in. I’ll populate an empty table called sf_housing which has a column for geometries (the_geom), and another for storing the text id of the geometry.
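A minimal sketch of that empty table in SQL (the column names follow the text; in CartoDB you could equally create the table from the dashboard):

```sql
-- Empty table to hold the boundaries we pull down
CREATE TABLE sf_housing (
  the_geom geometry,  -- boundary polygon
  geoid text          -- text id of the boundary
);
```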

INSERT INTO sf_housing(the_geom, geoid)
SELECT *
FROM OBS_GetBoundariesByPointAndRadius(
  	CDB_LatLng(37.7749, -122.4194), -- San Francisco, CA
    25 * 1609,  -- 25 mile radius, converted to meters (1 mile ≈ 1,609 m)
    'us.census.tiger.zcta5_clipped' -- use water-clipped geometries for visualization
  )

Checking out our table, we’ll find it filled with all of the ZCTA boundaries in the region we’re interested in, but there’s no data for the geometries yet!

Getting Housing Value Data

Our next step is getting the housing price data for each of the ZIP code areas. The Data Observatory makes that possible through the wonderfully versatile OBS_GetMeasureById. The reason the Data Observatory splits the work into these two pieces is to give you complete flexibility over which data you blend and which you ignore. With hundreds of measures available and growing, this flexibility keeps your work easy.

To find the measures that would be of interest, we’ll use the OBS_Search function through CartoDB’s SQL tray, like so:

SELECT * FROM OBS_Search('zillow')

In CartoDB, it will look like this:

list all measures that mention zillow

This gives back several IDs of interest, but let’s choose us.zillow.AllHomes_MedianValuePerSqft. This entry is just what it says, median home value per square foot for “All Homes” (i.e., single-family, condominium and co-operative homes). Let’s look at the most recent timestamp (March 2016).
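Before running the update below, the new column needs to exist on the table; a one-line sketch (the column name is the one used in the next query):

```sql
-- Add a numeric column to hold the March 2016 measure
ALTER TABLE sf_housing
  ADD COLUMN median_value_per_sqft_2016 numeric;
```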

We’ll put all of this information into OBS_GetMeasureById and populate a new column with the results:

UPDATE sf_housing
SET median_value_per_sqft_2016 = OBS_GetMeasureById(
       geoid, -- specify the place
       'us.zillow.AllHomes_MedianValuePerSqft', -- specify the measure
       'us.census.tiger.zcta5', -- specifies level of geometries
       '2016-03' -- specifies when in time
       )

Visualizing this in CartoDB, we get the following map:

Now we can see the median housing value per square foot easily visualized on a map, and we didn’t have to track down the datasets to do so!

Calculating Historical Changes

Now that we have recent housing price data for San Francisco, let’s compare it to data from March 2013 to look at the change in home value over time. To do this, we divide the current price by the historical price to get the ratio of change. Subtracting one from this ratio and multiplying by 100 gives the percentage change from 2013 to 2016.
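As a quick sanity check on the arithmetic with made-up numbers, a value that rose from $600 to $900 per square foot is a 50% increase:

```sql
-- Hypothetical values: $600/sqft in March 2013, $900/sqft in March 2016
SELECT 100.0 * (900.0 / 600.0 - 1) AS percent_change;  -- returns 50.0
```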

We can accomplish this with the Data Observatory as follows (after creating a new numeric column):

UPDATE sf_housing
SET percent_change_median_value_2016_2013 =
   100.0 * (median_value_per_sqft_2016 /
   OBS_GetMeasureById(geoid,
                  'us.zillow.AllHomes_MedianValuePerSqft',
                  'us.census.tiger.zcta5','2013-03') - 1)

Now that we have our data, let’s map it!

We went from an empty dataset to a value-added map of housing value change in San Francisco in under five minutes. And this is just the beginning of what can be done with the Data Observatory.

What’s next?

We hope you are starting to see the power of the Data Observatory. We see the Data Observatory as a source of enrichment and a way to make your analyses more powerful in CartoDB.

There’s so much more to explore, so watch our blog and Twitter for more. Also, read our documentation and check out our catalog of available data to get started.

What maps and datasets will you make from an empty table? Share your “something from nothing” maps with us!

Happy data mapping!

About the author

Andy Eschbacher is a data scientist at CARTO, where he integrates data science solutions into CARTO's infrastructure, solves spatial data science problems for clients, and builds out tools to better enable people working at the intersection of data science and GIS.
