Editor's Note: This post originally appeared in Source, an OpenNews project designed to amplify the impact of journalism by connecting a network of developers, designers, journalists, and editors to collaborate on open technologies. It was originally written for journalists, but we thought the piece so unique and useful to librarians and library workers that we're reposting it on TechSoup for Libraries in a two-part series. Find the original here.
At Factful, where we're building technology for journalists and civil society researchers, we're researching ways to make state-of-the-art data processing and storage tools more accessible to investigative reporters. One question driving our research was whether it made sense to create a large-scale data commons, a place where publicly useful sets of information could be stored, curated, and compared for the common good. Ultimately, we decided that for us the answer is no, at least for now. There are plenty of incomplete or out-of-date data commons projects already, and building and maintaining a truly comprehensive project is a massive undertaking.
Along the way, we did compile a pretty comprehensive roundup of data repositories and commons projects that could be valuable tools for reporters, investigators, or anyone looking to increase accountability through publicly available information.
Data Is Awesome
Data is an incredibly powerful reporting tool. It lets us scrutinize public spending and policy outcomes, challenge long-held conventional wisdom, and participate more fully in public conversations. A decade of open data activism has left reporters and the general public with unprecedented access to public payrolls, traffic reports, police data, and much more. All of it allows us to hold policymakers accountable and understand the world in ways we couldn't without access to the numbers.
Whether you're a seasoned data journalist or brand-new to thinking about data as a source in your reporting, there are exceptional places to find data that you may never have considered.
And if you've got a lot of interesting data that you'd like to share, there are some excellent tools for doing just that, none of which have the traction they deserve.
So who has data now, and how can you get your hands on it?
Start at the Source
This is not a list of every civic data repository, public data source, or research organization, but those are some of the richest data mines.
When I teach data reporting, I always start with a workshop on finding data. We start by identifying a few beats that students are excited about — student loans, civil asset forfeiture, child welfare — and then we brainstorm potential sources for data on the subject.
The best way to start looking for data you need is almost always to ask yourself who could collect this data and look at where they might share it. Are there city, county, state, or national agencies that collect data? Do they publish it? If they don't publish data, what happens when you ask for it? Sometimes all you have to do is ask; sometimes you have to file a more formal Freedom of Information request for the data.
Are there private research organizations or nonprofits that keep data on the subject you're researching?
In the data reporting class, we compile our findings in a list of tips and tricks called Where to Find Data. That resource is not meant to be comprehensive. It's meant to help you think about where to start looking for the data you need for your reporting. If you're doing a good job, your first set of findings will leave you with additional questions. Those questions could send you back to the same source for more information, or they may lead you in a different direction. While there is no one centralized data commons to search, there is a rich patchwork of possibilities that will vary with each potential area of inquiry.
Where Else?
Once you've exhausted the direct approach, or you're just interested in sparking some inspiration, there are a few more great places to look for data and ideas.
Newsroom Data Warehouses
Lots of newsrooms push cleaned data (and code) to GitHub, but there's no unified way to find it all. The Washington Post has released a collection of data on school shootings, police-involved shootings, and unsolved homicides, along with valuable context about how the data was collected and processed. BuzzFeed News maintains an indexed overview of all the data they've published to GitHub, as does 538. Here are a few more:
- Arizona Central recently launched a data hub.
- BuzzFeed News
- Courier Journal (Louisville, Kentucky)
- Naples Daily News (Naples, Florida)
- New Jersey Advance publishes their data on data.world.
- New York Times maintains a repository that's mostly code. They also publish some data via The Upshot.
- News-Press (Cape Coral, Florida)
- NPR Visuals publishes mostly code.
- ProPublica (their GitHub repository includes more data as well as a fair amount of reusable code).
- Quartz includes data along with a ton of helpful code in their GitHub repository.
- Tallahassee Democrat (Tallahassee, Florida)
- Vancouver Sun
- Washington Post
- 538
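Because there is no single index of newsroom data on GitHub, one rough way to browse a newsroom's account is to list its public repositories through GitHub's REST API and skim for the ones that look like data. The following is a minimal sketch, not a definitive tool: the organization name `example-newsroom` is a placeholder, and the keyword filter is a simple assumption about how data repositories tend to be named or described.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def org_repos_url(org, per_page=100):
    """Build the GitHub REST API URL that lists an organization's public repositories."""
    return f"{GITHUB_API}/orgs/{org}/repos?per_page={per_page}"


def fetch_repos(org):
    """Download one page of repository metadata (needs network access)."""
    with urllib.request.urlopen(org_repos_url(org)) as response:
        return json.load(response)


def looks_like_data(repo, keywords=("data", "csv")):
    """Rough heuristic (an assumption): keep repositories whose name or
    description mentions one of the keywords."""
    text = f"{repo.get('name') or ''} {repo.get('description') or ''}".lower()
    return any(keyword in text for keyword in keywords)


# Example usage with a placeholder organization name:
#   for repo in fetch_repos("example-newsroom"):
#       if looks_like_data(repo):
#           print(repo["name"], repo["html_url"])
```

A keyword match will miss repositories with unhelpful names and will flag some that hold only code, so treat the results as a starting point for reading each repository's documentation, not as a complete inventory.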
Newsroom Collaborations
Chicago Data Collaborative includes data that newsrooms, academics, and advocates have compiled to better understand criminal justice in Chicago.
Wireservice's Lookup repository is a collection of very useful lookup tables for BLS, IPUMS and some Fed data. (Wireservice is a collaboration between a number of U.S. newsroom developers and data reporters.)
Most of those newsroom data warehouses are on the large code-hosting site GitHub or the data collaboration platform data.world. But there are more options for publishing data, including repositories and platforms such as OCCRP Data, CKAN, Datasette, Quilt, and Socrata.
We will cover those and many more data repositories in the second and final installment of "A Journalist's Guide to Finding the Data You Need."
About the Author
Amanda Hickman led BuzzFeed's Open Lab for Journalism, Technology, and the Arts from its founding in 2015 until the lab wrapped up in 2017. She has taught reporting, data analysis and visualization, and audience engagement at graduate journalism programs at UC Berkeley, Columbia University, and the City University of New York, and was DocumentCloud's founding program director. Amanda has a long history of collaborating with journalists, editors, and community organizers to design and create the tools they need to be more effective.