Computing distance metrics is a challenging problem in spatial analysis. When studying urban travel, calculating distances between two locations as the crow flies is the most straightforward method but this approach often introduces gross errors. If we are after a more exact result, it is necessary to consider these distances as the shortest path taking into account the available transportation network.
At CARTO we have been working hard on making scalable spatial analytics possible through the implementation of spatial functions and procedures natively within Google BigQuery as part of our Spatial Extension. These functions can be further enhanced by using our curated data offering through the Data Observatory, also accessible through BigQuery.
For this scenario, we decided to take on the challenge of solving routing problems at scale natively in BigQuery, the results of which we are unveiling today with the release of a new routing module inside the CARTO Spatial Extension for BigQuery.
As with any spatial analysis challenge, the first step is to source relevant data. For routing, there are mainly two sources of information required. First we need the road network information over which we will compute distance metrics, the so-called shortest paths. OpenStreetMap data is a fantastic option for this, as it’s freely available in BigQuery. In particular, we are going to use the
bigquery-public-data:geo_openstreetmap.planet_ways table where public roads are stored.
Second we need a set of point location pairs to use as start and end points to compute distances. For that, we decided to use the San Francisco Bike Share Trips dataset, which contains almost 2 million Bay Area Bike Share Trips since 2013 and is also publicly available in BigQuery under the name
bigquery-public-data:san_francisco_bikeshare, thanks to the San Francisco Open Data Project.
Now that we have sourced our data, the next step is to transform the public road information offered by OpenStreetMap into a network object composed of nodes and edges. The aim of this object is to encode the road network information in a way that will enable us to efficiently perform shortest path computations. For that, we will make use of the
DECLARE county_geom GEOGRAPHY; SET county_geom = ( SELECT county_geom from `bigquery-public-data.geo_us_boundaries.counties` WHERE county_name LIKE '%San Francisco%'); SELECT * FROM UNNEST(( SELECT `bqcarto.routing.GENERATE_NETWORK`(ARRAY_AGG(STRUCT(geometry, 1.))) FROM `bigquery-public-data.geo_openstreetmap.planet_ways`, UNNEST(all_tags) tag WHERE ST_INTERSECTS(geometry, county_geom) AND key = 'highway' AND value IN UNNEST(['trunk','primary','secondary','tertiary','unclassified','residential','living_street','road','pedestrian','path','cycleway']) ));
The result of this query is publicly available and can be found in
From the input San Francisco’s roads linestrings, the
GENERATE_NETWORK function creates a collection of nodes and edges with an assigned weight representing the Euclidean distance between nodes. Moreover, as we can see in the map below, it drastically simplifies the topology whilst ensuring the total path’s lengths are preserved.
Now, let’s put our network to work by computing the shortest paths for all bike share trips completed on the 25th of April 2018, using the
WITH agg_network AS ( SELECT ARRAY_AGG(link) as network FROM `cartobq.docs.san-francisco-network` link ), shortest_paths AS ( SELECT `bqcarto.routing.FIND_SHORTEST_PATH_FROM_NETWORK`(network, start_station_geom, end_station_geom) as paths, trip_id FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`, agg_network WHERE DATE(start_date)="2018-04-25" AND start_station_id=59 AND end_station_id != 59 ) SELECT paths.geom, paths.weight, trip_id FROM shortest_paths
DISTANCE_MAP_FROM_NETWORK function we can also use the same network to compute the distance to each node of the network from bike station 59, which was the starting point of all the paths we computed in the previous step.
DECLARE start_geom GEOGRAPHY; SET start_geom = ( SELECT station_geom FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` WHERE station_id = 59); SELECT * FROM UNNEST(( SELECT `bqcarto.routing.DISTANCE_MAP_FROM_NETWORK`(ARRAY_AGG(link), start_geom) FROM `cartobq.docs.san-francisco-network` link));
This solution not only allows us to compute distances, but also to identify the most common routes and transited roads based on the assumption that users always choose the shortest path. Leveraging the processing power of BigQuery, let’s discover the most used street segments for more than a million bicycle trips.
-- Temporary function to split our routes into segments -- so we can later analyze their usage independently CREATE TEMP FUNCTION pairwise(myarray ARRAY<STRUCT<idx INT64, geom GEOGRAPHY>>) AS (( SELECT ARRAY_AGG(STRUCT(x AS a, myarray[SAFE_OFFSET(off-1)] AS b, ST_MAKELINE(x.geom, myarray[SAFE_OFFSET(off-1)].geom) AS geom)) FROM UNNEST(myarray) AS x WITH OFFSET off WHERE myarray[SAFE_OFFSET(off-1)] IS NOT NULL )); -- Get network to compute the shortest paths WITH agg_network AS ( SELECT ARRAY_AGG(link) network FROM `cartobq.docs.san-francisco-network` link ), -- Get bikeshare trips, always considering the starting -- station the one with lower ID bikeshare_trips AS ( SELECT IF (start_station_id>end_station_id, start_station_id, end_station_id) start_station_id, IF (start_station_id>end_station_id, end_station_id, start_station_id) end_station_id, ANY_VALUE( IF (start_station_id>end_station_id, start_station_geom, end_station_geom)) start_station_geom, ANY_VALUE( IF (start_station_id>end_station_id, end_station_geom, start_station_geom)) end_station_geom, COUNT(*) count FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` INNER JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` start_station ON start_station_id = start_station.station_id INNER JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_regions` start_region ON start_region.region_id = start_station.region_id INNER JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` end_station ON end_station_id = end_station.station_id INNER JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_regions` end_region ON end_region.region_id = end_station.region_id WHERE start_station_id<>end_station_id AND start_region.name="San Francisco" AND end_region.name="San Francisco" GROUP BY start_station_id, end_station_id), -- Compute shortest paths routes AS ( SELECT `bqcarto.routing.FIND_SHORTEST_PATH_FROM_NETWORK`(network, start_station_geom, end_station_geom) compacted_route, count, start_station_id, end_station_id FROM agg_network, bikeshare_trips) -- Get usage frequency for each segment of the resulting routes SELECT ANY_VALUE(segments.geom) geo, SUM(count) count, FROM routes route, UNNEST(pairwise(compacted_route.path)) segments GROUP BY CONCAT(segments.a.idx,segments.b.idx)
This is just a taste of the possibilities that route computation using the Spatial Extension for BigQuery can offer, with many more use cases available. For instance, the GENERATE_NETWORK function supports the assignment of custom weights to each node of the network. In the example provided in this post we have set uniform weights throughout the network to keep it simple, but this feature is critical for the routing in real-world use cases where the maximum speed allowed in each type of road matters. We are also working on adding directionality to the edges of the network, so that not all can be traversed bidirectionally as it is the case in the current implementation. We will keep you posted on the progress!
We’d love for you to give this a try and give us suggestions on how to utilize this Spatial Extension module further. To get access, even if you are not a CARTO customer, please fill in this form and we will get back to you to set up access. All functions used in this blogpost are documented here.
|This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.|
Over the past few months we have been working with Databricks to add built-in support for H3, and this added functionality was released recently. Native support for H3 mean...News
The ever-growing threat of climate change on the built environment cannot be ignored. Intensifying wildfires, storms, floods and sea level rise not only impact the health a...News
Please fill out the below form and we'll be in touch real soon.