As you might have already heard, the Data Observatory just launched to provide CartoDB users with a universe of data. One of the reasons we built the Data Observatory is that getting the third-party data you need is often the hardest part of analyzing your own data. Data wrangling shouldn't be such a big roadblock to mapping and analyzing your world.
Some of the most desirable and most used datasets are also the hardest to handle. Datasets like the US Census are rich with information but can be large and distributed. What most people want, though, is just a snapshot of some boundaries and particular measures. For instance, if you were telling a story about unemployment in a city, you might want to grab the Census Tracts for that city, the number of unemployed people in each tract, and the number of people 16 and over in each tract (as the denominator). None of this is easy given the size of the datasets the US Census provides.
For today's post I want to look at how we can use CartoDB's Data Observatory to answer questions when we don't already have any data at all. Further, I want to focus only on the region I'm interested in and avoid spending time working with data and geometries that play no part in my analysis.
Let's explore San Francisco's current housing price boom.
We'll pull data on San Francisco real estate provided by Zillow and available in the Data Observatory. For boundaries, we will work with the US Census ZIP Code Tabulation Areas (ZCTAs), which correspond to ZIP codes for most purposes. To go beyond displaying a single measure, we will pull the most recent median home value per square foot and compare it to the same measure from three years ago to show the explosive growth in the region.
Getting Appropriate Boundaries
To get started, we'll need the boundaries we're interested in visualizing. Depending on the scale of your map (national, state, county), you will want different geometries. Zillow provides median home values for particular geographical regions, the most convenient of which is ZIP codes.
To find the available boundaries we can use the function OBS_GetAvailableBoundaries, which lists the IDs of all boundaries available through the Data Observatory at the location we need them. Since we're focused on San Francisco, we just pass the approximate location of the city and filter for descriptions that contain the word 'zip', like so:

```sql
SELECT *
FROM OBS_GetAvailableBoundaries(CDB_LatLng(37.7749, -122.4194))
WHERE description ILIKE '%zip%'
```
This gives us all of the boundaries available at that point. We want the ZCTA boundary, since we're focused on ZIP codes. I'll also opt for the water-clipped version since we don't need to do any analysis with the geometries; we are only using them for visualization this time. The boundary_id that we need is us.census.tiger.zcta5_clipped. You can read the details of these boundaries right out of the ID: they come from US Census TIGER geometries, they are ZCTA boundaries, and they are clipped (meaning water-clipped rather than statistical boundaries).
Now that we have our boundary ID, we can get our boundaries with the Data Observatory function OBS_GetBoundariesByPointAndRadius, which requires a center, a radius (in meters), and the boundary ID we're interested in. I'll populate an empty table called sf_housing, which has a column for geometries (the_geom) and another for storing the text ID of each geometry.
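If you'd rather create that table in SQL than through the dashboard, a minimal sketch might look like the following (CDB_CartodbfyTable registers a raw table with CartoDB, adding bookkeeping columns like cartodb_id):

```sql
-- A sketch, assuming the table is created via the SQL tray rather than the UI
CREATE TABLE sf_housing (
  the_geom geometry(Geometry, 4326),  -- boundary polygons
  geoid text                          -- Data Observatory geometry ID
);

-- Register the table with CartoDB (adds cartodb_id and friends)
SELECT CDB_CartodbfyTable('sf_housing');
```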
```sql
INSERT INTO sf_housing (the_geom, geoid)
SELECT *
FROM OBS_GetBoundariesByPointAndRadius(
  CDB_LatLng(37.7749, -122.4194),   -- San Francisco, CA
  25000 * 1.609,                    -- 25 mile radius, in meters (1 mile ≈ 1.609 km)
  'us.census.tiger.zcta5_clipped'   -- water-clipped geometries for visualization
)
```
Checking out our table, we'll find it filled with all of the ZCTA boundaries in the region we're interested in, but there's no data attached to the geometries yet!
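If you want to double-check that the insert worked, a quick count does the trick (the exact number depends on the radius you chose above):

```sql
SELECT count(*) FROM sf_housing;  -- one row per ZCTA polygon within the radius
```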
Getting Housing Value Data
Our next step is getting the housing price data for each of the ZIP code areas. The Data Observatory makes that possible through the wonderfully versatile OBS_GetMeasureById. The reason the Data Observatory splits the work into these two pieces, boundaries and measures, is so that you have complete flexibility over which data you blend and which data you ignore. With hundreds of measures, and more on the way, this flexibility keeps your work easy.
To find the measures of interest, we'll use the OBS_Search function through CartoDB's SQL tray, like so:

```sql
SELECT * FROM OBS_Search('zillow')
```
This gives back several IDs of interest, but let's choose us.zillow.AllHomes_MedianValuePerSqft. This entry is just what it says: median home value per square foot for "All Homes" (i.e., single-family, condominium, and co-operative homes). We'll look at the most recent timestamp (March 2016).
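First, though, we need a numeric column to hold the values; a one-line sketch (you could equally add the column through the CartoDB UI):

```sql
ALTER TABLE sf_housing ADD COLUMN median_value_per_sqft_2016 numeric;
```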
We'll put all of this information into OBS_GetMeasureById and populate the new column with the results:

```sql
UPDATE sf_housing
SET median_value_per_sqft_2016 = OBS_GetMeasureById(
  geoid,                                    -- specify the place
  'us.zillow.AllHomes_MedianValuePerSqft',  -- specify the measure
  'us.census.tiger.zcta5',                  -- specify the level of geometry
  '2016-03'                                 -- specify the point in time
)
```
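To peek at the results, say the ten priciest ZIP codes, something like this works (NULLS LAST pushes ZCTAs without Zillow coverage to the bottom):

```sql
SELECT geoid, median_value_per_sqft_2016
FROM sf_housing
ORDER BY median_value_per_sqft_2016 DESC NULLS LAST
LIMIT 10;
```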
Visualizing this in CartoDB, we can see the median home value per square foot on a map, and we didn't have to track down the datasets to do so!
Calculating Historical Changes
Now that we have recent housing price data for San Francisco, let's compare it to data from March 2013 to look at the change in home value over time. Dividing the current price by the historical price gives us the ratio of change; subtracting one from that ratio and multiplying by 100 gives us the percentage change of 2016 as compared to 2013.
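As before, we add a numeric column for the result; the comment sanity-checks the formula on made-up numbers:

```sql
ALTER TABLE sf_housing ADD COLUMN percent_change_median_value_2016_2013 numeric;

-- Formula check with made-up numbers: a ZIP code going from $500/sqft to
-- $750/sqft gives 100.0 * (750.0 / 500.0 - 1) = 50, i.e. 50% growth.
```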
With the column in place, we can accomplish this with the Data Observatory as follows:
```sql
UPDATE sf_housing
SET percent_change_median_value_2016_2013 = 100.0 * (
  median_value_per_sqft_2016 /
  OBS_GetMeasureById(
    geoid,
    'us.zillow.AllHomes_MedianValuePerSqft',
    'us.census.tiger.zcta5',
    '2013-03'
  ) - 1
)
```
Now that we have our data, let's map it! We went from an empty dataset to a pretty cool map of housing value change in San Francisco, in under five minutes. And this is just the beginning of what can be done with the Data Observatory.
We hope you're starting to see the power of the Data Observatory. We see it as a source of enrichment and a way to make your analyses in CartoDB more powerful.
What maps and datasets will you make from an empty table? Share your "something from nothing" maps with us!
Happy data mapping!