DJA manager Marianne Bouchart’s guide to sourcing data
“The term data journalist is a bit of a jack-of-all-trades term,” Marianne Bouchart told delegates this afternoon at the news:rewired conference in London. “Some call us computer-assisted reporter, journalist programmer, journo geek… unicorns. It varies.”
Bouchart is communications director and Data Journalism Award manager at the Global Editors Network, and the founder of Hei-Da.org, a not-for-profit organisation set up earlier this year that specialises in open-data-driven projects.
She is also the founder and editor of the Data Journalism Blog, and gave delegates some expert advice on how to source and use data sets.
Here is a list of sources, as recommended by her:
- Dataportals.org – a comprehensive list of open data portals from around the world. It is a good starting point for a diverse range of data
- FindTheData.com – quite similar to Dataportals.org, it contains a lot of data sets on various topics and industries
- EU Data Portal – launched last week and still in beta. Sponsored by the European Commission, it lets you browse official data sets
- European Union Open Data Portal – much like the EU Data Portal, it offers a lot of data sets on various subjects in Europe
- Data.gov.uk – the UK government data website. It contains public data to help people understand how data works and how policy is made
- Data.gov – the US counterpart to Data.gov.uk. Many other countries run similar websites
- Open Corporates – the largest open database of companies in the world. Its main goal is to have a URL for every company in the world, and it contains lots of specific business data
- WikiLeaks – people presume that WikiLeaks is dated, but Bouchart stresses that it is still an exceptional resource, with a website that is continually updated
- The World Bank – it has a data portal that offers free and open data about development around the world
- The UN Data Portal – grants access to a gigantic list of data sets, broken down by countries and themes
- The UNHCR Data Portal – dedicated to data about the refugee crisis. It is a very visual resource that often provides raw data sets
- The World Health Organisation Data – this resource offers a large data library with maps and reports, as well as country-specific statistics
- Google Public Data Explorer – enter keywords and it will bring you results of data sets according to what you’re looking for, broken down by the data sources
- GetTheData.org – a forum where users can ask others where to find specific data
- Crowdsourcing using Google Forms – previously used successfully by organisations like the Guardian when compiling data from their readers about how many Olympic tickets they got hold of
- WhatDoTheyKnow.com – this resource is a good tool to use when you can’t find the data you need. It is a website that gathers all of the Freedom of Information requests that have ever been submitted, and tells you whether they were successful or not
- Quora – can be used to browse information, and much like GetTheData.org, ask others where to source specific data
You can also find datasets directly on Google using the following search operators:
- filetype:csv, or filetype:xls for Excel spreadsheets
- filetype:shp for geo data
- filetype:mdb, filetype:sql, filetype:db for database extracts
- You can even look for filetype:pdf
- inurl:downloads filetype:xls to find spreadsheets on pages whose URL contains “downloads”
Example: site:Adidas-group.com filetype:pdf
This way you find not only the documents that companies or organisations make public, but also what they share internally.
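The operators above combine by simple concatenation, with no space after each colon. As a minimal sketch, the Python helper below assembles such a query and turns it into a search link. The function names `build_query` and `search_url` are made up for illustration, and the `https://www.google.com/search?q=` URL pattern is an assumption about how Google search links are commonly formed, not something Bouchart described.

```python
from urllib.parse import quote_plus


def build_query(filetype, site=None, inurl=None, keywords=""):
    """Assemble a search string from the operators listed in the article."""
    parts = []
    if keywords:
        parts.append(keywords)
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    # No space after the colon, otherwise the operator is read as a plain word.
    parts.append(f"filetype:{filetype.lower()}")
    return " ".join(parts)


def search_url(query):
    """Turn a query into a link (the URL pattern is an assumption)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


# The example from the article: PDFs hosted on one company's site.
query = build_query("pdf", site="Adidas-group.com")
print(query)
print(search_url(query))
```

Building the string in one place makes it easy to loop over several file types or sites when checking what an organisation has published.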
For the more advanced, try data scraping with Google. Here’s Bouchart’s one-line magic formula for use in a Google spreadsheet to scrape data from HTML tables:
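For readers working outside a spreadsheet, the same idea of pulling the rows of an HTML table into structured data can be sketched in Python. This is an illustration using only the standard library, not Bouchart's formula. The class name `TableExtractor`, the helper `extract_tables` and the sample table are all invented for the example, and the parser assumes well-formed, non-nested tables with explicit closing tags.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Return every table in the HTML as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample table, for illustration only.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>A</td><td>12</td></tr>
  <tr><td>B</td><td>7</td></tr>
</table>
"""

rows = extract_tables(SAMPLE)[0]
print(rows)
```

In practice the HTML would come from a downloaded page rather than a string literal, and real-world pages often need a more forgiving parser than this one.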
For more resources on how to interrogate data using spreadsheets, Bouchart recommends Berkeley’s tutorial on spreadsheets, as well as the Centre for Investigative Journalism’s Data Journalism Handbook.
Finally, don’t forget to clean your data! Bouchart says that holes in data mean wrong and unreliable data. She recommends using OpenRefine, which is free and open source. You need an internet connection only to download the software, not to use it.
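To make the idea of “holes” concrete, here is a small Python sketch that lists every empty cell in a CSV file before analysis begins. It is not a replacement for OpenRefine and is not taken from Bouchart's talk. The function name `find_holes` and the sample data are invented for illustration, and the row numbering assumes no cell spans more than one line.

```python
import csv
import io


def find_holes(csv_text):
    """Return (row_number, column_name) for every empty or missing cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    holes = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None:
                continue  # surplus cells beyond the header row
            # Short rows yield None; blank or whitespace-only cells count too.
            if value is None or value.strip() == "":
                holes.append((row_number, column))
    return holes


# Made-up sample data, for illustration only.
SAMPLE = "name,year,value\nA,2015,10\nB,,\nC,2015, \n"

holes = find_holes(SAMPLE)
print(holes)
```

A list like this shows where to go back to the source, or which rows to exclude, before drawing any conclusions from the numbers.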