CARTO: transforming Citi Bike data into new insights
CARTO is more than just a useful tool for visualizing spreadsheets. When you use CARTO you have the ability to tap into powerful data manipulation and GIS tools as well. To showcase these capabilities, we are creating a short series, Mapping the invisible. As the title may suggest, we'll be exploring different datasets, doing analysis and visualizations to tell stories that are hidden in that data so you can learn and get inspired to work with CARTO.
In our first installment, we're going to dig through some publically available data from Citi Bike. The data are all recorded bike trips over a short period of time. Like many datasets that governments and businesses publish, there is often more than meets the eye. Here we are going to show you how you can transform data into entirely new and insightful visualizations. If you work with data and find this interesting, stay tuned and follow us on Twitter to stay on top of the series. Let's go!
Mapping the invisible Part I
. Although it is difficult to know the breakdown of the costs of running the program, the cost of transporting bikes between stations is likely to be significant.
Here's a closer look at how Citi Bike transfers their bikes from one location to another using some simple analysis of the information through maps.
Oh the Places You'll Go!
Being a bike share, Citi Bike allows their riders to pick up and drop off a bike from any of their stations. Although this may seem obvious to many, car shares (i.e. Zipcar) generally require the users to return vehicles to the original pickup location. As a result, to make sure bikes are available where needed Citi Bike needs to move bikes from one location to another depending on the availability and demand of bikes as each station. In general, one would expect to have fewer costs the less often the company needs to move the bikes.
Getting to know a bike
Bike #17310 is the most widely used bike among all the bikes in the bike share, with over 1,600 rides as of May. The visualization below shows all the rides taken with the bike: red indicating trips by a rider and blue indicating trips done by a Citi Bike truck. In general, it looks like the bike is moved around without any extra help in the western part of Manhattan, but frequently needs to be moved by truck in the East Village and Brooklyn.
While the original data only records when the bike was riden by a user, you can use SQL to identify gaps in the network of Bike #17310's movement:
##_INIT_REPLACE_ME_PRE_##
select
a.bikeid a.endstation
a.starttime + interval '1 second' * a.tripdur as actreturntime
b.startstation
b.starttime as pickuptime
extract(epoch from (b.starttime - a.starttime)) as timediff
a.whichday a.whichhour
from test2.testtable as a
left join test2.testtable as b
on (a.bikeid = b.bikeid) and (a.bikeuserank + 1 = b.bikeuserank)
where b.startstation != a.endstation
and extract(epoch from (b.starttime - a.starttime)) < 3600;
##_END_REPLACE_ME_PRE_##
This is the SQL we used to create the blue lines above. Our analysis assumes that all unrecorded moves are relocation events, but clearly there could be cases where the bike needed to be taken in for maintainance or other factors. Assumptions aside, this simple SQL statement let's us start to look at the efficiency of the system.
To minimize costs, it would be important to find ways for the riders to do the majority of the moving, while the trucks move as few bikes as possible. On the other hand, the worst case scenario is where bikes need to be constantly ferried across the city. If this is spread among many stations, this will cost Citi Bike a significant amount of money to run their program. This begs the question, where should one place a station to minimize management effort?
Bikes Go In Bikes Go Out
The visualization below shows the commuter flow over various periods of time (use the layer toggle for different time periods), generated by this SQL
##_INIT_REPLACE_ME_PRE_##
select * st_transform(the_geom 3857) as the_geom_webmercator from
(
select startstation start_name start_lon start_lat
coalesce(a.departures 0) as departures
coalesce(arrivals 0) as arrivals
coalesce(a.departures 0) - coalesce(arrivals 0) as netdeps
st_setsrid(st_point(start_lon start_lat) 4326) as the_geom
from (
select count(startstation) as departures startstation start_name start_lon start_lat
FROM citibike_finished
where extract(epoch from (pickuptime - returntime)) > 300
and whichday <= 5 and whichhour between 17 and 20
group by startstation start_name start_lon start_lat
order by departures desc) as a
left join (
select count(endstation) as arrivals endstation end_name end_lon end_lat
FROM citibike_finished
where extract(epoch from (pickuptime - returntime)) > 300
and whichday <= 5 and whichhour between 17 and 20
group by endstation end_name end_lon end_lat
order by arrivals desc) as b
on a.startstation = b.endstation
order by netdeps desc) as aa
##_END_REPLACE_ME_PRE_##
The map helps demonstrate the general trends of where bikes are flowing from one locaton to another minus maintenance work.
Not surprisingly we can see that the largest flows are the morning and evening weekday commute from Times Square, Penn Station, and Grand Central Station. What is surprising is Bryant Park, a major hub of work commuters and very close to Grand Central and Port Authority, generated bike flow in the opposite directions. We wonder if there could be simple solutions for this cluster.
Where's The Action?
…which is not exactly what is happenening in the chart below. The visualization below shows actual pickup and dropoff times…
##_INIT_REPLACE_ME_PRE_##
select * st_transform(the_geom 3857) as the_geom_webmercator from
(SELECT cartodb_id 1 as dropoff
st_setsrid(st_point(start_lon start_lat) 4326) as the_geom
date '2014-01-01' + pickuptime::time as starttime
FROM citibike_finished
where extract(epoch from (pickuptime - returntime)) > 300 and whichday < 6
and pickuptime > date '2014-05-30'
union
SELECT cartodb_id 2 as dropoff
st_setsrid(st_point(end_lon end_lat) 4326) as the_geom
date '2014-01-01' + returntime::time as starttime
FROM citibike_finished
where extract(epoch from (pickuptime - returntime)) > 300 and whichday < 6
and pickuptime > date '2014-05-30'
) as a
where starttime between '2014-01-01 07:30:00' and '2014-01-01 09:30:00'
##_END_REPLACE_ME_PRE_##
…highlighting a surprisingly large number of blinking lights in many of the stations. One notable example is the stations around Grand Central Station around 42nd and Park Avenue. If this really is trucks ferrying bikes, wow! There you can see alternating red and blue lights indicating constant ferring of bikes in and out of a single station by vans or other non-riders. It makes us want to dig deeper to find out which of these are vans and which are oddities in the data.
This is just a simple example of how you can use CARTO to turn data into insights. If you're interested in checking it out create a free account and start mapping your own data today.