Data (Visualization) Journalism

Here's a look at how newsrooms can collect, sanitize and store data, and some of the tools that help create visual representations out of it.


  1. Data visualization is important because data on its own can be difficult to understand. Imagine reading row after row of numbers, and you get the idea. Data visualization presents that information in an engaging way and helps communicate complex ideas more quickly. This is especially true on the web, where a visualization can catch a reader's eye and get information across much more easily than paragraph after paragraph of text.
  2. Data sources.

  3. Freely available. The world is generating more and more data. If you know where to look, you can find many free and useful datasets. The City of Toronto, for example, provides a large catalog of free data related to the city via its Open Data Toronto site that can be accessed by anyone.
  4. The data found on Open Data Toronto has even spawned a stand-alone website that helps drivers locate the worst places for receiving parking tickets in Toronto.
  5. One quick search shows the hours, days of the week and months in which the most parking tickets have been written on any given street in the city. According to the data, almost 50,000 tickets, the most in the city, were issued at Sunnybrook Hospital.
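The kind of lookup described above (tickets per hour, weekday and month for one street) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the date and time formats, and the sample rows are all hypothetical, and a real open-data export would need its own parsing rules.

```python
import csv
import io
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Made-up sample rows; the column names are assumptions, not a real schema.
SAMPLE_CSV = """date_of_infraction,time_of_infraction,location
2012-01-09,0915,100 EXAMPLE ST
2012-01-09,1430,100 EXAMPLE ST
2012-01-14,0940,20 OTHER RD
2012-02-06,0905,100 EXAMPLE ST
"""


def summarize_tickets(csv_text, street):
    """Count tickets on one street by hour, weekday name and month number."""
    by_hour, by_weekday, by_month = Counter(), Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Case-insensitive substring match on the location column.
        if street.upper() not in row["location"].upper():
            continue
        when = datetime.strptime(
            row["date_of_infraction"] + " " + row["time_of_infraction"],
            "%Y-%m-%d %H%M",
        )
        by_hour[when.hour] += 1
        by_weekday[WEEKDAYS[when.weekday()]] += 1
        by_month[when.month] += 1
    return by_hour, by_weekday, by_month


hours, weekdays, months = summarize_tickets(SAMPLE_CSV, "Example St")
print(hours.most_common(1), weekdays.most_common(1), months.most_common(1))
```

The three counters are the raw material for a chart: each one maps a bucket (hour, weekday, month) to a ticket count.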
  6. Crowdsourcing from citizens. Some data can also be collected from the general public. Here is an example of how SocMap and Re:Baltica used crowdsourced data.
  7. In February, we launched our very first application, HotBills, which we created in partnership with the Baltic Centre for investigative journalism (Re:Baltica). The idea behind the app is to determine how much people pay for heating in various parts of Latvia, so that the data can later be used in journalists’ research into heating prices, transparency and validity, as well as to give people an incentive to talk to their landlords about the prices, ask for explanations, and get adequate answers. We asked users to scan their bills and submit them.
  8. The following clip explains the example above, along with a number of other ways journalists around the world are using data journalism.
  9. Sarah Marshall: "Around the world in online innovation"
  10. More data journalism stories from Re:Baltica can be found on their site.
  11. Scrape sites for data. Newsrooms can have developers write their own screen scrapers to populate their databases automatically, but there is also a website where ready-made scrapers are shared.
  12. The PANDA team has written screen scrapers and made them available to others via ScraperWiki, a site for collaboratively building programmes to extract and analyse data.
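For readers who have not seen one, a screen scraper can be quite small. The sketch below uses only Python's standard `html.parser` module to pull the cells out of an HTML table. It is not the PANDA team's code or anything hosted on ScraperWiki. The `TableScraper` class, the `scrape_table` helper and the sample markup are hypothetical, and a real scraper would first download the page (for example with `urllib.request`) instead of using an inline string.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Ward</th><th>Inspections</th></tr>
  <tr><td>Ward 1</td><td>12</td></tr>
  <tr><td>Ward 2</td><td> 7 </td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
print(rows)
```

Each inner list is one table row, ready to be written to a spreadsheet or database.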
  13. Freedom of Information requests. For example, the data used by The Toronto Star in its award-nominated investigation, Known to Police, was gathered via freedom of information requests.
  14. 'Known to police' is about who police stop, question and document in encounters that typically involve no arrest or charge, where they do this, and why. What we’ve shown, using Toronto police data, is that, in every part of the city, black and “brown” people are being stopped at rates disproportionate to the populations of black and brown people living in these areas. This is even more so with young males. The analysis allows for a provocative question: Is it possible that police in certain areas of the city have documented every young male of colour who lives there? And, what does that do to a community?
  15. The data that serves as the foundation for the Known to Police series was obtained through a freedom of information request that was a follow-up to two requests made in 2000 and 2003.
  16. Toronto police data on arrests and charges served as the basis for Race & Crime, a 2002 series that found police in certain circumstances treated blacks more harshly than whites. Updated charge and arrest data, and data that shows who police stop and document in mostly non-criminal encounters, was requested in 2003. After a seven-year battle for the data — including court challenges — police released them in 2010, resulting in the series Race Matters that same year.
  17. An updated version of these data — and the latest available census demographic data for Toronto — serve as the basis for this series.
  18. Build a database of relevant... data.

  19. Much of the data from the aforementioned sources comes in various formats and document types: Excel spreadsheets, XML, PDFs, and others. Newsrooms need a centralized place to store this data, and they also need to clean and prepare it before it can be used for data visualizations.
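As a rough illustration of the cleaning and storing step, the sketch below reads the same kind of record from a CSV source and an XML source, normalizes the values, and loads everything into one SQLite table. All of it is assumed for the example: the field names, the sample values, the helper functions and the `bills` table are hypothetical, and SQLite is just one convenient choice of central store.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Made-up inputs in two formats; note the stray spaces and thousands separator.
CSV_SOURCE = '''Area,Avg Bill
 North ,"1,250.50"
South,980
'''
XML_SOURCE = """<bills>
  <bill><area>East</area><amount>1100.00</amount></bill>
</bills>"""


def clean_amount(text):
    """Turn strings such as ' 1,250.50 ' into a float."""
    return float(text.strip().replace(",", ""))


def rows_from_csv(text):
    """Yield (area, amount) pairs from the CSV source."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row["Area"].strip(), clean_amount(row["Avg Bill"])


def rows_from_xml(text):
    """Yield (area, amount) pairs from the XML source."""
    for bill in ET.fromstring(text).iter("bill"):
        yield bill.findtext("area").strip(), clean_amount(bill.findtext("amount"))


def build_database(connection):
    """Create one table and load the cleaned records from both sources."""
    connection.execute("CREATE TABLE bills (area TEXT, amount REAL)")
    records = list(rows_from_csv(CSV_SOURCE)) + list(rows_from_xml(XML_SOURCE))
    connection.executemany("INSERT INTO bills VALUES (?, ?)", records)
    connection.commit()
    return connection


db = build_database(sqlite3.connect(":memory:"))
stored = db.execute("SELECT area, amount FROM bills ORDER BY area").fetchall()
print(stored)
```

Once every source lands in the same table with the same types, a visualization tool only has to query one place.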
  20. Extracting the data.
  21. If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is — you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple interface. And now you can download Tabula and run it on your own computer, like you would with OpenRefine.