#newsrw: Making sense of the numbers in data journalism

The next big development in data journalism is live data and getting your audience involved, according to Martin Stabe, an interactive producer.

He was one of four data journalists giving tips on what is in the data journalism toolkit: advice on tools, many of which are free, and on how to find the data and clean it.

James Ball, a data journalist on the Guardian investigations team who worked on the WikiLeaks cables, discussed the “use and abuse of statistics”.

He showed “a really awful infographic” on the amount of water it takes to make a pizza and a slice of bread.

“You don’t have to do much research to realise that is just tosh,” he said.

“We have to sense-check numbers.” He gave the example of culture secretary Jeremy Hunt putting the expected TV audience for the Royal Wedding at an unrealistic two billion. The estimated audience was 300 million.

He asked: “Why might it matter?” And explained the dangers of bad statistics and bad journalism. “The best bit of your toolkit is understanding a bit of maths,” he advised.
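A sense-check of the two billion figure can be as simple as a couple of divisions. The sketch below uses the two audience figures quoted in the talk; the world population of roughly seven billion is an assumption added here for scale, not a number Ball gave.

```python
# Figures quoted in the talk.
claimed_audience = 2_000_000_000    # the two billion viewers claimed
estimated_audience = 300_000_000    # the estimated audience

# Assumption, not from the talk: a world population of roughly 7 billion in 2011.
world_population = 7_000_000_000


def share_of(total, part):
    """Return part as a percentage of total."""
    return 100 * part / total


overstatement = claimed_audience / estimated_audience
claimed_share = share_of(world_population, claimed_audience)
estimated_share = share_of(world_population, estimated_audience)

print(f"Claim is {overstatement:.1f} times the estimate")
print(f"Claim implies {claimed_share:.0f}% of everyone on Earth watching")
print(f"Estimate implies {estimated_share:.0f}%")
```

On these numbers the claim is nearly seven times the estimate and would need more than a quarter of the planet to tune in, which is the kind of result that should prompt a second look.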

Kevin Anderson, a data journalism trainer and digital strategist who trained as a journalist in the US, gave more tips on tools. One of the revolutions is access to data, the other is access to tools, he said.

One tool in his kit is Google Docs, in particular Google Spreadsheets, which Anderson used when he was at the Guardian. He also recommended the OUseful blog.

You can import a live data feed, he said, and he suggested collecting your own data in a form. You can ask questions, including multiple choice, and embed the form into a story.

For easy mapping tools he advised Google and Zeemaps. Once you have the data, he said, the next process is “link scraping”.

You can “grab data” from existing sources. He gave the example of Outwit Hub, a Firefox plugin that lets you pull in the links, with their URLs, from any search and then export them as a Google Spreadsheet or SQL.
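For readers who want to see what link scraping involves, here is a minimal sketch using only Python's standard library. It is a stand-in for the idea, not a description of how Outwit Hub works: the `LinkCollector` class, the `links_to_csv` helper and the sample page fragment are all hypothetical names and data made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, URL) pairs from the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Start recording when an <a> tag opens.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Store the finished link when the <a> tag closes.
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_to_csv(html):
    """Return the links found in html as CSV text with a header row."""
    collector = LinkCollector()
    collector.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["text", "url"])
    writer.writerows(collector.links)
    return out.getvalue()


# Hypothetical page fragment standing in for a page of search results.
sample_html = """
<ul>
  <li><a href="https://example.com/a">First result</a></li>
  <li><a href="https://example.com/b">Second result</a></li>
</ul>
"""

print(links_to_csv(sample_html))
```

The CSV text this produces can be saved to a file and imported into a spreadsheet.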

Anderson also recommended tools to order data from text. He gave the example of OpenCalais, a Thomson Reuters tool, which “allows you to see patterns in your own coverage” and connections between stories.

He also pointed journalists towards ThinkMap and gave the example of ‘Who Runs Hong Kong’, a data visualisation showing the connections of power.

“The ability for news organisations to extract more value through data journalism is a huge opportunity,” he said.

Stabe, who like Anderson is originally from the States, described how data-driven news stories are handled by a team.

He explained the team consists of a reporter, “who really knows the story”, a producer such as Stabe, a designer and a developer.

“One of the best things you can do in your newsroom is to get your head round administrative geography,” he said, and to understand statistical data.

He said it is very difficult to get data on all local authorities, on when they hold local elections and how their public spending is changing. Local data is often coded in different ways, he explained, and gave the example of the “Cuts of £6bn hit the elderly the hardest” report.

When you have a large dataset you need to ask questions. But data may be “dirty”, with a mix of local coding conventions.

“The very act of cleaning the data is the key step,” Stabe explained.
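To make the cleaning step concrete, the sketch below reduces differently written council names to one key so that their figures can be added together. It is an illustration only: the rows and amounts are made up, and the `NOISE` word list and `clean_name` helper are hypothetical, not anything Stabe described.

```python
import re

# Hypothetical rows: the same councils written under different conventions.
raw_rows = [
    ("Leeds City Council", 120),
    ("LEEDS", 80),
    ("Leeds  city council ", 50),
    ("Cornwall Council", 40),
    ("Cornwall County Council", 60),
]

# Words that describe the type of authority rather than the place.
NOISE = {"city", "council", "county", "borough", "district", "of"}


def clean_name(name):
    """Reduce a council name to a lower-case key of its place words."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(w for w in words if w not in NOISE)


# Add up the amounts under the cleaned key.
totals = {}
for name, amount in raw_rows:
    key = clean_name(name)
    totals[key] = totals.get(key, 0) + amount

print(totals)
```

A word filter this crude will wrongly merge different authorities that share a place name, so for real work a lookup table of official area codes is the safer route.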

“Data is only useful if it’s personal”, Bella Hurrell from the BBC recently said on Paul Bradshaw’s blog. Stabe echoed the quote, giving the example of data on how likely a 16-year-old receiving free school lunches is to get good or bad GCSE results.

He pointed out that readers are usually only interested in one area, one school, so an interactive version allows people to drill down. The data journalism steps are to obtain, warehouse and publish the data.

In obtaining the data, “sometimes we ask for it nicely”, Stabe said, but usually the FT scrapes the data, and it then goes into a database.
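The obtain, warehouse and publish steps can be sketched in a few lines with Python's built-in `sqlite3` module. This is a toy version under stated assumptions, not the FT's actual setup: the table layout, the `area_summary` helper and every row of data are hypothetical.

```python
import sqlite3

# Obtain: hypothetical records (area, school, pupils, pupils with good results).
rows = [
    ("Area A", "School 1", 50, 20),
    ("Area A", "School 2", 40, 10),
    ("Area B", "School 3", 60, 30),
]

# Warehouse: load the obtained rows into a database (in memory here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (area TEXT, school TEXT, pupils INTEGER, good INTEGER)"
)
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)


def area_summary(area):
    """Publish: the percentage of good results per school in one area."""
    query = (
        "SELECT school, 100.0 * good / pupils FROM results "
        "WHERE area = ? ORDER BY school"
    )
    return conn.execute(query, (area,)).fetchall()


print(area_summary("Area A"))
```

Filtering by area in the query is what lets a reader drill down to the one place they care about without wading through the whole dataset.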

His tips for journalists include learning how to manipulate text in Excel.

Next came advice from Simon Rogers, editor of the Guardian’s Data Store and Datablog.

Newspapers are all about the geography of the newsroom, he said, describing how he sits beside the investigations team and news desk.

He spoke about the difficulty in getting usable public data and dealing with the government’s “annual resource accounts”.

The Guardian is now providing ordered data to the people in government who supplied it, he explained.

The Guardian’s data workflow starts from data it is sent, data from breaking news, recurring events and “theories to be exploited”. The journalists then have to think about how to “mash it together”, as combining the data makes it more interesting.

A couple of Rogers’ tips are to use ManyEyes and Google Spreadsheets, but “sometimes numbers alone are interesting enough to make a story,” he said.

He gave the example of a map made using Google Fusion Tables showing “patterns of awfulness”, with every death in Iraq mapped, which took about half an hour.

More recent examples include using data provided on the NATO Libya website. The site publishes a daily archive of what happens, including data on missions.

Every day the Guardian adds the NATO data to a map to show visually what has been hit where. It can also make stories as journalists notice patterns.
