Supported by Google Jigsaw, the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
A GDELT Project visualization highlighting the 25,000 newsmakers mentioned most frequently and the connections among them.
Recently they launched the GDELT Global Geographic Graph, the underlying dataset powering the GDELT GEO 2.0 API, covering more than 1.7 billion location mentions from worldwide English language online news coverage back to April 4, 2017, with full details of each mention, including a 600-character contextual snippet of its context and usage.
The GDELT Project is a realtime network diagram and database of global human society for open research
This post, guest written by GDELT, looks into how the integration with Google BigQuery can be leveraged to map this huge dataset, focusing on news coverage in the context of COVID-19 at different points during the pandemic.
The Global Geographic Graph now spans more than 1.7 billion location mentions in worldwide English language news coverage back to 2017, making it possible to map the geography of the global news narrative.
Creating a map from the Global Geographic Graph traditionally involved running a BigQuery SQL query to compile the results, downloading them as a CSV file and then importing that file into CARTO to create the actual map. However, CARTO actually has a built-in BigQuery connector, meaning you can simply paste your SQL query directly into CARTO, which runs the query on BigQuery and downloads the results into your map completely seamlessly and can even rerun the query on a fixed internal to keep the map constantly synced!
Let’s say we want to create a map of locations mentioned in English language news coverage in the context of COVID-19 during the month of May. Just log into your CARTO account and select “New Dataset” and click on “BigQuery” as your data source and then click on the “Connect” button. You’ll get the dialog below:
The “Billing Project ID” dropdown will automatically populate with the list of your Google Cloud projects (this is the project against whose quota the BigQuery query will be billed). You then paste your query in the SQL Query box and pick a name for the new CARTO dataset and then click “Connect to BQ”! That’s quite literally all there is to it! CARTO will then send your query to BigQuery to be executed and within a few moments the results will be streamed back to CARTO and populate your new dataset!
Here’s the actual query we’re going to run (we use “GeoType > 1” to include country-level matches):
SELECT ST_GeogPoint(Lon, Lat) as the_geom, count(1) Cnt FROM `gdelt-bq.gdeltv2.ggg` where (ContextualText like '%COVID%' and ContextualText like '%coronavirus%')and GeoType>1 and DATE(DateTime) >= "2020-05-01" AND DATE(DateTime) <= "2020-05-31" group by Lat,Lon order by Cnt desc limit 10000
The “ST_GeogPoint(Lon, Lat) as the_geom” at the beginning of the query takes our latitude and longitude coordinates and converts them into a Geography “Point” type.
Alternatively, since the Global Geographic Graph actually already has a GEOGRAPHY column called “the_geom” you could also use:
SELECT ANY_VALUE(the_geom) as the_geom, count(1) Cnt FROM `gdelt-bq.gdeltv2.ggg` where (ContextualText like '%COVID%' and ContextualText like '%coronavirus%') and GeoType>1 and DATE(DateTime) >= "2020-05-01" AND DATE(DateTime) <= "2020-05-31" group by TO_JSON_STRING(the_geom) order by Cnt desc limit 10000
It is important to remember that grouping by a geography column like we’re doing here is extremely bad practice for observed datasets due to the imprecise nature of most observed/sampled geographic data. When collecting GPS points, a stationary object can appear to constantly move in a small radius around its actual location due to GPS jitter, while sampling imprecision and error can yield clusters of points all slightly different from each other.
Simply snapping to a grid doesn’t solve this issue, since regardless of the grid resolution, a cluster of points will be split across the rounded coordinates. In the case of the Global Geographic Graph, all locations are drawn from a set of centroid-based gazeteers in which every reference to Paris, France will always yield precisely the same coordinate and thus here we can safely group by our coordinates, but for other kinds of data we strongly recommend against it. With observed GIS to perform this kind of “Top N” grouping you would need to cluster your points first.
Within a few moments the data appears as a standard CARTO table and you can instantly map it:
But what if we want to make an interactive clickable map that displays the top 5 articles mentioning each location?
The query below selects the top 5 articles per location and concatenates their titles and URLs into a block of HTML and selects one of their social media images as the header image.
select ST_GeogPoint(Lon, Lat) as the_geom, ARRAY_TO_STRING(ARRAY_AGG(Art ORDER BY cnt DESC LIMIT 5), '<BR><BR>') Articles, max(SharingImage) SharingImage, APPROX_TOP_COUNT(Loc, 1)[OFFSET(0)].value LocationName, Count(1) NumMentions from ( select CONCAT('<a href="', min(URL), '" target="blank">',max(Title),'</a>') Art, Lat, Lon, max(SharingImage) SharingImage, APPROX_TOP_COUNT(Location, 1)[OFFSET(0)].value Loc, count(1) cnt FROM `gdelt-bq.gdeltv2.ggg` WHERE (ContextualText like '%COVID%' and ContextualText like '%coronavirus%') and GeoType>1 and DATE(DateTime) >= "2020-05-01" AND DATE(DateTime) <= "2020-05-31" group by LOWER(Title), Lat, Lon ) group by Lat, Lon order by NumMentions desc limit 10000
Once again the data is imported into a CARTO dataset within moments. Following the same instructions as we used in April (but this time skipping the part about saving the data out of BigQuery as a CSV and loading back into CARTO since we don’t have to do that anymore!), we get the interactive map below.
The GDELT Project also created a video documenting this process which can be viewed below.
As has been demonstrated with our connector, you only need to enter a SQL query from BigQuery to connect your data & make it available on the CARTO platform—allowing you to easily visualize, perform spatial analyses & build applications.
Want to know more?Click to discover all the features of the BigQuery connector
This article was originally published on GDELT’s blog.
Like many people who love trees and work in the geospatial field, I was fascinated (and disheartened) by a recent article I read in the New York Times called Since When Hav...Use Cases
Most Data Scientists and Analysts understand that visualizing datasets can be a crucial way for users to engage with data. Knowing where median household income is across a...Use Cases
The urban growth of metropolitan areas around the world can be affected by a number of factors. During the industrial revolution the explosion in job availability fueled mu...Use Cases
Please fill out the below form and we'll be in touch real soon.