A Journalist's Guide to Finding the Data You Need

Editor's Note: This is part two of our two-part series by graduate journalism teacher Amanda Hickman of Factful. She researches ways to make contemporary state-of-the-art data processing and storage tools more accessible to investigative reporters. In this second part of the series, she completes her comprehensive roundup of data repositories, research guides, and other online tools that are valuable for reporters, investigators, and now librarians. In part one of the series, she told us about her tips and tricks from her Where to Find Data workshops, plus a list of newsroom data warehouses and newsroom collaborations. Here is the rest of the story.

This post originally appeared in Source, an OpenNews project designed to amplify the impact of journalism by connecting a network of developers, designers, journalists, and editors to collaborate on open technologies. It was originally written for journalists, but we thought the piece so unique and useful to librarians and library workers that we're reposting it on TechSoup for Libraries. Find the original here.

Data Repositories

In addition to newsroom data warehouses and newsroom collaborations, there are some far-reaching data warehouses and repositories and tools for publishing data that are pretty remarkable, as well as a few that kind of aren't. This is an A – Z list.

With Aleph, OCCRP, the Sarajevo-based Organized Crime and Corruption Reporting Project, is building a unified index of data. They have tackled a few important questions, including managing access to data that they can't advertise beyond a trusted network of reporters. Aleph is tightly focused on public accountability data and includes quite a few sources obtained through leaks. The data is well organized and includes a lot of accountability and anticorruption data that isn't available elsewhere. Aleph is free and open-source software, so hosting your own instance is also an option.

Awesome Public Data is a great big list of public datasets on GitHub, organized into broad topics. Anyone can propose data for addition by submitting a pull request. Awesome Public Data does a good job of continuously checking links and flagging broken links. And they point out canonical sources rather than trying to aggregate and store data. Unfortunately, there's no descriptive information, so users can't skim a list and have a sense of what kind of data is available at a particular source.

Registry of Open Data on AWS is a roundup of publicly available data stored on Amazon Web Services, with great usage examples. The AWS Open Data team vets submissions, so the registry includes a range of actively maintained and clearly documented data. The collection is pretty random, however: Amazon Customer Reviews, IRS 990 forms, soil chemistry, and data from Hubble Space Telescope instruments are all there, tagged but not organized in any particular structure.

CKAN is free and open-source software for data publishers. They maintain a list of almost 200 known instances, including quite a few national and regional governments.
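CKAN portals expose a JSON search endpoint (package_search in the Action API), so one way to look for datasets on a CKAN instance is to query it directly. The sketch below only builds the request URL and summarizes a response that has already been parsed; the portal address https://data.example.org and the helper names are hypothetical, and a real portal may return more fields than the two used here.

```python
from urllib.parse import urlencode


def build_search_url(portal, query, rows=5):
    """Build a CKAN Action API search URL for a portal (portal URL is an assumption)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def summarize(payload):
    """Return (title, last-modified) pairs from a parsed package_search response."""
    if not payload.get("success"):
        return []
    return [
        (d.get("title") or d.get("name"), d.get("metadata_modified"))
        for d in payload["result"]["results"]
    ]


# Hypothetical portal; fetch the URL with any HTTP client and pass the parsed JSON to summarize().
url = build_search_url("https://data.example.org", "311 calls", rows=3)
```

Checking the last-modified value on each result is a quick way to see whether a portal's copy is being maintained.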

Data Portals bills itself as a comprehensive worldwide index of data portals, which it is not. At a glance, a lot of smaller cities, like Berkeley and Oakland, California, are not listed. Anyone can propose new portals, but the list definitely isn't comprehensive yet.

Datasette is free and open-source software for publishing data alongside a clean view of the data. They don't maintain a commons, but if you're looking for a good way to publish data and make it accessible for both skimming and analysis, Datasette might be a good fit.

Data.world is a data collaboration platform. They encourage users to add data, which many have done, but they don't enforce any particular policy about preserving provenance, and the site is cluttered with samples and tests. Data.world did identify a handful of sources and mirror them wholesale, for example, Uniform Crime Reports or US EPA, and some newsrooms including the Associated Press and NJ Advance keep their public data collections on data.world. Unfortunately, there's no hierarchy to the site, or structure of any sort. Anyone can add data, so there's definitely some outright spam on the site. It's an interesting place to search for data ideas, and maybe an interesting place to aggregate data you have worked with. But once you find something interesting, you're going to want to head upstream to make sure you've got current, complete records.

Enigma Public is a relatively comprehensive collection of public and semi-public structured data. Data they consider semi-public includes information that they obtained via Freedom of Information requests. Enigma has improved their provenance metadata significantly in recent years, and the data they provide is well documented but scattershot. Coverage of major U.S. cities is much more complete than international data. Their list of governments includes a handful of countries outside the U.S., but in many cases only one or two datasets are actually available. A search for Oakland 311 turns up no Oakland results but does surface NYC 311 data, last updated eight months ago, as the top result. NYC's actual 311 call data is updated daily, but an Enigma user wouldn't necessarily know that more current data is available. Enigma can be a great resource, but users will want to manually check upstream if they need or want the most current data.
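The upstream check described above comes down to comparing two dates: when the mirrored copy was last updated and when the original source was. A minimal sketch follows, assuming you have looked up both dates by hand; the function names, the seven-day tolerance, and the example dates are all hypothetical.

```python
from datetime import date


def days_behind(mirror_updated, upstream_updated):
    """Number of days the mirrored copy lags the upstream source (never negative)."""
    return max((upstream_updated - mirror_updated).days, 0)


def needs_upstream_check(mirror_updated, upstream_updated, tolerance_days=7):
    """True when the mirror lags upstream by more than the tolerance (an assumed threshold)."""
    return days_behind(mirror_updated, upstream_updated) > tolerance_days


# Hypothetical dates: a mirror refreshed on January 1 and a source refreshed on January 31.
lag = days_behind(date(2019, 1, 1), date(2019, 1, 31))
```

A tolerance that makes sense for a daily feed like 311 calls would be far too strict for data that is only published once a year.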

Global Open Data Index, compiled by Open Knowledge International (OKFN), aims to provide a comprehensive snapshot of published government data. Their data is tightly organized by nation and topic, so OKFN can show you the state of public access to national legislative or land ownership data around the world, or public data in a handful of key topic areas for any one country. It appears that the index was last updated in 2015, but their sources can help you connect with current data sources. The Global Open Data Index is particularly useful to English-speaking researchers who need to find non-English-language data and may not be able to skim a foreign language government site in search of a specific data source.

Google's Dataset Search tool launched in the fall of 2018. Google crawls the web for data sources that include schema.org microdata, and incorporates it into search results. The result is that the data they're searching isn't necessarily vetted, current, or accurate. Dataset Search results include a lot of data attributed to Kaggle (see that entry, below), which is all user submitted and often detached from its original source, making it difficult to find current data upstream. As more data publishers incorporate schema microdata, however, Dataset Search will get more comprehensive.
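For publishers wondering what that schema.org markup looks like, here is a rough sketch of a Dataset description. It uses JSON-LD, one of the formats schema.org metadata can be written in, rather than inline microdata, and that choice is an assumption for brevity. The dataset name, URL, and date are made up, and the helper names are hypothetical.

```python
import json


def dataset_jsonld(name, description, url, date_modified):
    """Serialize a minimal schema.org Dataset description as JSON-LD."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "dateModified": date_modified,
    }
    return json.dumps(record, indent=2)


def as_script_tag(jsonld):
    """Wrap the JSON-LD so it can be embedded in the dataset's web page."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Hypothetical dataset, for illustration only.
markup = as_script_tag(
    dataset_jsonld(
        "City 311 Service Requests",
        "Service requests logged by a city's 311 line.",
        "https://data.example.org/311",
        "2019-01-31",
    )
)
```

Including a dateModified value gives readers at least one clue about whether the data is current.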

IRE's Database Library includes a few valuable business and transportation datasets that Investigative Reporters and Editors has compiled and cleaned, some dating back decades.

Kaggle bills itself as a project-based data science site, but the site includes a commons of user-contributed data; there were 14,000 datasets when I last looked. Kaggle's commons is an eclectic mashup of whatever users have supplied. They encourage users to supply provenance information and human-readable data dictionaries, but they don't support automatic updates, so their data isn't especially useful as source material. Their metadata includes the date data was added to Kaggle but doesn't indicate whether newer data might be available from the source, which it often is. Google acquired Kaggle in 2017, and (not surprisingly) Kaggle data shows up a lot in Google's Dataset Search tool.

Open Policing Project at Stanford has aggregated police stop data from 31 U.S. states and organized the data to facilitate comparisons across states. They're aiming to collect, clean, collate, and release data from all 50 U.S. states and have plans and funding to keep the data up to date.

ProPublica publishes and sometimes sells some data. Data they obtained through formal public records requests (i.e., FOIA) is generally available free of charge on request; data they've cleaned or reconciled is available for purchase and licensing. Their collection is scattered and reflects their reporting rather than a concerted effort to create a unified index of data, but they have a lot of very interesting data, and they do a very good job of being explicit about provenance and limitations.

Quilt is a Python package and business that facilitates Git-like data packaging that keeps provenance intact and supports tracking of any cleaning or transformation of data. Their commons includes any and all public data that users are storing there, so the quality and usefulness varies widely. Quilt is a super interesting option for reporters and newsrooms that want to publish data or share cleaned data, so if you're looking around for a better-than-GitHub way to publish data you've cleaned or transformed, Quilt is worth checking out.

Socrata, like CKAN, builds software that facilitates sharing public data. Socrata doesn't publish a list of instances, but many city, state, national, and regional governments publish public data through a Socrata portal.

Swirrl, or PublishMyData, is a U.K.-based linked data project with a lot of overlap with Socrata or CKAN. Swirrl primarily powers public data sites, such as Scottish Government. They include a cart functionality that facilitates cross-comparisons within a given data store. Swirrl doesn't publish a list of instances of their software, but quite a few local and national governments in the U.K. and Europe appear to use their software to publish public data.

Vigilant is a business that promises to track and compile public data and make it available to their customers in standardized formats. They don't publish any data publicly.

Still Looking? Try These Research Guides

Ally Jarmanning, a data reporter at WBUR in Boston, maintains a comprehensive guide to obtaining state court data (Google doc).

Charles Ornstein at ProPublica spent ten years covering healthcare. His guide to covering opioids with data (Google doc) is required reading if you're covering the opioid crisis, even if you don't think you're covering it with data.

Angilee Shaw compiled an extensive collection of sources for immigration data (Google sheet).

Jeremy Singer-Vine's newsletter, Data Is Plural, isn't strictly a research guide, but it's great. Jeremy is the data editor at BuzzFeed News, and every week he sends out a roundup of a few interesting datasets. He also maintains a structured archive (Google sheet) of recommendations that is a great place to look for inspiration, but probably not the best path if you already know what you want.

Berkeley Advanced Media Institute's roundup of US regulatory agencies is a great resource for looking into the data that federal and local regulatory agencies maintain.

The Newmark Graduate School of Journalism at CUNY maintains a series of research guides including a guide to using census data and a roundup of data resources. Their index of research databases is a great review of what is available if you have access to a library (you'll need a library barcode to access the databases themselves, but the index is a handy starting place).

Dan Nguyen keeps a thorough roundup of data reporting course syllabi that are definitely worth rooting around in — most data reporting classes (and sometimes a few CS courses) include a lesson on finding data.

It's People!

No data source roundup is complete without a loud reminder that data is only as good as the people who enter it. Before you rely on data for your reporting, you need to know who generated it and how the data you're looking at got into the database.

Data is almost always entered by people. The fastest way to reduce the number of felony robberies in a single police precinct is to start classifying incidents as misdemeanors, and there's good evidence that New York Police Department precincts did exactly that when the commissioner started rewarding precincts that got their serious crime rates down.

It isn't clear why the Baltimore County Police Department has more "unfounded" rape complaints than most departments nationwide, but BuzzFeed News found that many of those "unfounded" complaints were never really investigated.

Sometimes there are just quirks in the way data gets recorded. One report found that coroners don't have solid standards about how to decide whether to record a gun death as an accident or homicide, and as a result, accidental homicides are split between the two categories, making it hard to track down reliable data.

Data is powerful, but it is never a substitute for picking up the phone and making some calls. If you're just starting to think about where data fits in your reporting process, Samantha Sunne wrote an excellent introduction to the challenges and possible pitfalls of data journalism, and how you can avoid them.

So What Do You Do with All This Data?

If you're really new to data, knowing where to find it is only the beginning. You also need to get a handle on the tools you'll use to clean, sort, and understand the data.

NICAR trainings are a great way to get your bearings.
Source's guide to Working with Data includes a few tips for beginners.
Workbench tutorials are a great resource.

If you've already got a handle on the basics, Source's regular roundups of Things You Made should inspire you to stretch your own wings a bit.

About the Author

Amanda Hickman led BuzzFeed's Open Lab for Journalism, Technology, and the Arts from its founding in 2015 until the lab wrapped up in 2017. She has taught reporting, data analysis and visualization, and audience engagement at graduate journalism programs at UC Berkeley, Columbia University, and the City University of New York, and was DocumentCloud's founding program director. Amanda has a long history of collaborating with journalists, editors, and community organizers to design and create the tools they need to be more effective.