Making Maps with GDELT + CartoDB
What would it look like to literally map the world’s news as it happens? What if you could reach across a growing fraction of the world's news media every day, in real time, in 65 languages, and put a dot on a map for every mention, in every article, in every language, of any location on earth, along with the people, organizations, topics, and emotions associated with each place? That’s the vision that drives the GDELT Project.
In the wake of our GeoJourNews conference celebrating journalists, cartographers, and coders, we have an exciting partnership to announce with GDELT, one that we hope will only further support our community! We welcome Kalev H. Leetaru, a Senior Fellow at the George Washington University Center for Cyber & Homeland Security in Washington, DC, to author a guest post demoing how CartoDB can be used to map the world's news in real time!
The GDELT Project processes a growing fraction of the world's news media in real time, identifying the people, locations, organizations, themes, sources, emotions, counts, quotes, and events driving global society. The GDELT Project creates a free, open platform for computing on the entire world. In essence, GDELT acts as an automated, open-data, real-time metadata index over the world’s news media.
Working closely with governments, media organizations, think tanks, academics, NGOs, and ordinary citizens, GDELT has been steadily building one of the highest-resolution catalogs of the world's local media, which it monitors in real time and partners with the Internet Archive to preserve. Since much of the world's local news is not in English, GDELT uses one of the largest deployments of streaming machine translation to live-translate the world's news from 65 languages, accounting for 98.4% of the media it finds each day. In one of the largest deployments of sentiment analysis, GDELT brings together 24 emotion (tone) mining packages that assess more than 2,300 emotions and themes from every article, including native measures for 15 languages. One of the largest multilingual geocoding platforms completes the pipeline, identifying, disambiguating, and rendering to centroid geographic coordinates every mention of more than 10 million places worldwide across 65 languages.
All of this happens 24/7, with updates every 15 minutes around the clock, and that makes for some pretty powerful and timely maps!
Getting to Know GDELT
The GDELT Project compiles an enormous array of information about global human society spanning many different datasets.
Here's a taste of what it has to offer:
- GDELT Event Database consists of more than 313 million daily records from 1979-present, recording over 300 categories of "events," from riots and protests to peace appeals and diplomatic exchanges, all geocoded to the city level across the globe. You can query the entire dataset in Google BigQuery or download the raw CSV files.
- GDELT Emotions of American Television News Database processed more than 540,000 hours of English-language American domestic television news broadcasts monitored by the Internet Archive’s TV News Archive from July 2010 to October 2014, extracting every mention of a person, organization, location, and theme from the closed captioning of each broadcast, along with thousands of emotions.
- Africa and Middle East Global Knowledge Graph processed more than 21 billion words of academic literature, comprising the majority of the humanities and social sciences research literature on Africa and the Middle East since 1945 (including all relevant material from JSTOR, DTIC, CORE, CiteSeerX, and CIA publications, and the 1.7 billion open web PDFs archived by the Internet Archive since 1996). A massive array of socio-cultural information was extracted from every article, including every locative mention and the entire underlying citation graph.
- Human Rights Global Knowledge Graph processed more than 110,000 documents from Amnesty International, FIDH, Human Rights Watch, ICC, ICG, US State, and the United Nations, dating back to 1960, documenting human rights abuses across the world. A vast array of socio-cultural indicators was extracted, including all location mentions.
As you can see, there is so much data here to map, making for incredible opportunities for a mashup between GDELT and CartoDB. In December 2013, GDELT used CartoDB to produce animated and searchable maps of the geographic footprint of American television news using an earlier version of the dataset linked above. Likewise, CartoDB was used to create all of the geographic visualizations for the paper describing the Africa and Middle East GKG research. Each of these datasets includes rich geographic information, geocoded down to the city or hilltop level globally, and each is available in its entirety as open data for immediate download. However, due to their enormous size and complexity, these datasets require non-trivial programming expertise to manage and munge the data, not to mention substantial disk and CPU resources.
We’re going to focus here on one final GDELT dataset, called the Global Knowledge Graph (GKG). In a nutshell, the GKG processes every news article across all 65 languages and extracts a vast array of metadata indicators. We'll use a set of tools that do all of the hard work of reformatting these data, making them point-and-click easy for us to map.
The GDELT Project is one of the most ambitious programs ever attempted to codify the world's news into computable format, and, as a disclaimer, there will always be a certain level of error in the data it produces. First, there's a lot of news media out there, and monitoring local news outlets in every corner of the world is really hard. GDELT will always miss some portion of the news each day; it is not an exhaustive catalog of every report. Second, attempting to automate the parsing of narrative across 65 languages and literally all the world's news technology platforms is exquisitely difficult. Given the subtleties of geopolitical and placename identifiers, assumptions of shared locality, the mixture of textual and visual locative cues, and transcription and typographical error, the multilingual geocoding is especially challenging!
However, the GDELT team has been exploring the geography of text for more than a decade. Overall, the data that GDELT provides is a reasonably accurate representation of the world's media output.
Mapping with GDELT
All this is to say, you can do some pretty incredible things with GDELT data, and we're here to make that easier!
Get the Data
- Import from the GDELT CartoDB Account: we've created an hourly-synced dataset of GDELT data, available in the GDELT account and in the CartoDB Data Library. Fork an hourly capture of those data to your own account for experimentation (note: once you copy this table into your account, it will no longer update).
- Download the Raw Files: if you have "sync tables" enabled in your account, you can create a new table from the raw GeoJSON feed URLs and set it to sync every hour or every 24 hours.
- Learn about the API: the GDELT API allows you to create customized tables that include only your data of interest. See this tutorial for more on how to use the API.
Many geospatial analyses and approaches are possible with the API or the hourly data resources. You can check out the GDELT Public Profile on CartoDB for mapping ideas and ongoing experiments!
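If you would rather inspect a capture outside CartoDB first, a minimal sketch along these lines can pull point locations out of a GeoJSON document. The sample data and the `name` property are made up for illustration; the real feed's property names may differ, so check the feed before relying on any field.

```python
import json

# Hypothetical sample standing in for a downloaded GeoJSON capture.
# The "name" property is an assumption; the real feed's fields may differ.
SAMPLE = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "geometry": {"type": "Point", "coordinates": [-77.04, 38.9]},
   "properties": {"name": "Washington"}},
  {"type": "Feature",
   "geometry": null,
   "properties": {"name": "Unknown"}}
]}
"""

def extract_points(geojson_text):
    """Return (lon, lat, properties) for every Point feature."""
    doc = json.loads(geojson_text)
    points = []
    for feature in doc.get("features", []):
        geometry = feature.get("geometry")
        # GeoJSON allows a null geometry; skip those and non-point shapes.
        if not geometry or geometry.get("type") != "Point":
            continue
        lon, lat = geometry["coordinates"][:2]  # GeoJSON order is lon, lat
        points.append((lon, lat, feature.get("properties") or {}))
    return points

points = extract_points(SAMPLE)
```

Note that GeoJSON stores coordinates as longitude first, then latitude, which is easy to flip by accident when handing points to a mapping tool.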
For example, the map below explores the geography of discussion of protests (orange), cyber (purple), and unrest (red). An orange dot doesn't necessarily indicate that a protest is taking place at that location, only that protest-related language appears to be associated with it over the last hour.
Instead of filtering by topic, what if we displayed every worldwide location mentioned in an article monitored by GDELT over the past 12-24 hours and color-coded each location by the language of the first news article to mention it in a given 15-minute interval? We'd end up with the animated map below of the linguistic geography of the world's news!
Instead of language, what if we color-coded each location by the average "tone," from highly positive (green) to highly negative (red), of all worldwide news coverage mentioning each location in 15-minute increments? We'd get the real-time map below of the World's Happiest and Saddest News!
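To make the idea concrete, here is a rough sketch of the aggregation behind such a map: group mentions by location and 15-minute interval, average the tone, and map the result onto a red-to-green ramp. The record layout `(location, timestamp, tone)`, the function names, and the tone limit of 10 are all assumptions for illustration, not GDELT's or CartoDB's actual implementation.

```python
from collections import defaultdict
from datetime import datetime

def bucket_start(ts, minutes=15):
    """Round a datetime down to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

def average_tone(mentions):
    """Average tone per (location, interval start)."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count]
    for location, ts, tone in mentions:
        key = (location, bucket_start(ts))
        totals[key][0] += tone
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def tone_to_color(tone, limit=10.0):
    """Map tone to a hex color from red (negative) to green (positive).

    The limit is an assumed clamp, not a documented GDELT range.
    """
    clamped = max(-limit, min(limit, tone))
    share = (clamped + limit) / (2 * limit)  # 0.0 most negative, 1.0 most positive
    red = round(255 * (1 - share))
    green = round(255 * share)
    return f"#{red:02x}{green:02x}00"

# Made-up sample mentions: two fall in the 12:00 interval, one in 12:15.
mentions = [
    ("Paris", datetime(2015, 6, 1, 12, 3), 2.0),
    ("Paris", datetime(2015, 6, 1, 12, 14), 4.0),
    ("Paris", datetime(2015, 6, 1, 12, 16), -6.0),
]
averages = average_tone(mentions)
colors = {key: tone_to_color(value) for key, value in averages.items()}
```

Clamping before scaling keeps a handful of extreme articles from washing out the color ramp for everything else.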
Explore the Sandbox Search Tool!
To make your first maps of the world's news media, you don't even need to touch a single line of SQL. Instead, we've created an interactive Geographic News Search Tool using the CartoDB platform and the CartoDB.js library.
You can enter any major person or organization name, a GDELT Theme, the phrase "lang:" plus one of the 65 languages GDELT translates (to display all coverage written in that language), or "domain:" plus the domain name of a news outlet (to display all coverage from that domain). There's autocomplete functionality to guide your search toward relevant coverage over the past 24 hours. Be sure to check the linked vocabularies to generate the most robust maps!
Try searching for "lang:Portuguese" to view the locations being discussed in the Portuguese-language press, "domain:bbc.co.uk" to create an instant geographic search interface to the BBC, or the GDELT Theme "REFUGEES" to view all coverage across all 65 languages relating to refugees.
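The search syntax above can be pictured as a small dispatch on the query's prefix. The sketch below is purely illustrative: `parse_query` is a hypothetical helper, and treating an all-caps token as a theme is an assumption, not how the search tool itself is known to work.

```python
def parse_query(query):
    """Split a search string into a (kind, value) pair."""
    text = query.strip()
    for prefix, kind in (("lang:", "language"), ("domain:", "domain")):
        if text.lower().startswith(prefix):
            return kind, text[len(prefix):].strip()
    # Assumption: a single all-caps token such as REFUGEES is a theme.
    if text.isupper() and " " not in text:
        return "theme", text
    # Anything else is treated as a person or organization name.
    return "name", text

examples = [parse_query(q) for q in
            ("lang:Portuguese", "domain:bbc.co.uk", "REFUGEES")]
```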
With three layers accessible in the upper right dropdown, you can filter your search to the last hour of coverage, an animated heatmap view, or an emotional graph view showing you broader temporal patterns throughout the past 12-24 hours.
Read on to find out more about the Geographic News Search possibilities on the GDELT blog!
We'll be releasing more GDELT features and tutorials incrementally over the next few weeks. Stay tuned for the upcoming posts on how to use the GDELT API and CartoDB to create fully customized maps with UI flexibility, multilayer query mashups, and more extensive emotional/tonal analysis!
Looking for one last map before you go? Check out this visualization exploring how the world's news media groups countries into clusters. For every monitored news article published anywhere in the world that mentions a given country, we compile a list of all other countries also mentioned in those articles: in essence, a dynamic, time-varying geographic co-occurrence network. Read more about what the visualization shows or view the live interactive display!
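The co-occurrence idea can be sketched in a few lines: for each article, count every pair of countries mentioned together. The article lists below are made up, and this is only an illustration of the counting step, not the pipeline behind the visualization.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(articles):
    """Count how often each pair of countries appears in the same article."""
    pairs = Counter()
    for countries in articles:
        # Deduplicate and sort so (A, B) and (B, A) count as one pair.
        for a, b in combinations(sorted(set(countries)), 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up sample: each inner list is the countries mentioned in one article.
articles = [
    ["France", "Germany"],
    ["Germany", "France", "Greece"],
    ["Greece"],
]
network = cooccurrence(articles)
```

Each pair's count works as an edge weight, so recomputing it over successive time windows gives the time-varying network described above.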
Meanwhile, thanks, Kalev, and happy mapping to all!