Editor's Note: This post originally appeared in Source, an OpenNews project designed to amplify the impact of journalism by connecting a network of developers, designers, journalists, and editors to collaborate on open technologies. It was originally written for journalists, but we thought the piece so unique and useful to libraries that we're reposting a somewhat shortened version. Find the original here.
Do you need to pay a lot of money to get reliable OCR results? Is Google Cloud Vision actually better than Tesseract? Are any cutting-edge neural-network-based OCR engines worth the time investment of getting them set up?
OCR, or optical character recognition, allows us to transform a scan or photograph of a letter or court filing into searchable, sortable text that we can analyze. One of our projects at Factful is to build tools that make state-of-the-art machine learning and artificial intelligence accessible to investigative reporters. We have been testing the components that already exist so we can prioritize our own efforts.
We couldn't find a single side-by-side comparison of the most accessible OCR options, so we ran a handful of documents through seven different tools and compared the results. Here they are.
Calamari is built on TensorFlow, an open-source machine learning library, which allows Calamari to take advantage of TensorFlow's neural network capabilities. It's relatively straightforward to use, but it comes with some tricky dependencies. Because Calamari only does text recognition, you have to use another engine (they recommend OCRopus) to increase contrast, deskew, and segment the images you want to read. OCRopus requires Python 2, and Calamari is written in Python 3 — not an insurmountable obstacle but one to be alert to.
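The hand-off between the two projects can be sketched as a short command pipeline driven from Python. The command names below come from each project's documentation, but the exact flags, the checkpoint name, and the output layout are assumptions that vary between versions:

```python
import shutil
import subprocess

# A sketch of the two-project pipeline: OCRopus binarizes/deskews and
# segments the page into line images, then Calamari recognizes them.
# Paths and the model checkpoint name are placeholders.
pipeline = [
    ["ocropus-nlbin", "page.png", "-o", "book"],        # binarize + deskew
    ["ocropus-gpageseg", "book/0001.bin.png"],          # split into line images
    ["calamari-predict", "--checkpoint", "model.ckpt",  # recognize each line
     "--files", "book/0001/*.png"],
]

for cmd in pipeline:
    if shutil.which(cmd[0]):  # run only the tools actually installed
        subprocess.run(cmd, check=True)
```

Because the two halves live in different Python versions, driving them as subprocesses like this sidesteps the Python 2 vs. 3 mismatch.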
Pricing: Calamari is free and open-source software.
OCRopus is a collection of document analysis tools that add up to a functional OCR engine if you throw in a final script to stitch the recognized output into a text file. OCRopus will also output hOCR, an HTML-based format that records the position of each piece of recognized text.
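That final stitching script can be a few lines of Python. This sketch assumes ocropus-rpred's usual output layout (one directory per page, one .txt file per recognized line, ordered by filename):

```python
from pathlib import Path

def stitch_lines(texts):
    """Join recognized line strings into one block of text."""
    return "\n".join(t.rstrip("\n") for t in texts)

def stitch(out_dir):
    """Walk an OCRopus output directory and stitch its per-line
    .txt files into a single page of text, in filename order."""
    files = sorted(Path(out_dir).rglob("*.txt"))
    return stitch_lines(f.read_text(encoding="utf-8") for f in files)
```

Writing the result of `stitch("book")` to a file gives you the searchable plain text the engine itself doesn't assemble for you.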
OCRopus requires Python 2.7, so you probably want to use virtualenv to install it and manage dependencies. We had hiccups using the installation instructions in the Readme file, but found workable installation instructions hiding in an issue. You'll also need to follow some specialized instructions to get matplotlib running in a Python 2.7 virtualenv.
Dan Vanderkam's blog post about his experiences with OCRopus is also helpful.
OCRopus needs higher-resolution images than the other OCR engines we tested. You'll see a lot of errors if your resolution is below 300 dpi. Unlike most tools we tested, OCRopus won't catch documents that are sideways or upside down, so you'll need to make sure your pages are oriented correctly.
Pricing: OCRopus is free and open-source software.
Kraken is a turnkey OCR system forked from OCRopus. Kraken can output recognized text with its geometry in hOCR or ALTO format. ALTO (Analyzed Layout and Text Object) is an XML schema for text and layout information. It's a well-developed standard, but we didn't encounter other tools that output ALTO in our testing. Kraken is essentially OCRopus bundled nicely, so the actual results will be on par with OCRopus results.
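Kraken chains its processing steps in a single command line. A sketch, assuming kraken's documented subcommand names (the hOCR and ALTO switches are taken from its docs and may differ between versions):

```python
import shutil
import subprocess

def kraken_cmd(image, output, fmt="text"):
    """Build a kraken invocation chaining binarize/segment/ocr.
    The -h (hOCR) and -a (ALTO) switches on the ocr step are
    assumptions based on kraken's documentation."""
    cmd = ["kraken", "-i", image, output, "binarize", "segment", "ocr"]
    if fmt == "hocr":
        cmd.append("-h")
    elif fmt == "alto":
        cmd.append("-a")
    return cmd

# Run only if kraken is actually installed on this machine.
if shutil.which("kraken"):
    subprocess.run(kraken_cmd("scan.png", "scan.xml", fmt="alto"), check=True)
```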
Pricing: Kraken is free and open-source software.
Tesseract is a free and open-source command-line OCR engine that was developed at Hewlett-Packard in the mid-1980s and has been maintained by Google since 2006. It is well documented. Tesseract is written in C/C++. The installation instructions are reasonably comprehensive; we were able to follow them and get Tesseract running without any additional troubleshooting.
Tesseract will return results as plain text, hOCR, or in a PDF, with text overlaid on the original image.
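Choosing among those output formats is just a matter of the config name appended to the command: plain text is the default, and `hocr` or `pdf` selects the others. A minimal sketch, with placeholder paths:

```python
import shutil
import subprocess

def tesseract_cmd(image, output_base, fmt="txt"):
    """Build a tesseract invocation. With no config name tesseract
    writes plain text to <output_base>.txt; appending "hocr" or
    "pdf" selects those outputs instead."""
    cmd = ["tesseract", image, output_base]
    if fmt != "txt":
        cmd.append(fmt)
    return cmd

# e.g. produce scan.pdf with recognized text overlaid on the image.
# Run only if the tesseract binary is installed.
if shutil.which("tesseract"):
    subprocess.run(tesseract_cmd("scan.png", "scan", "pdf"), check=True)
```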
Pricing: Tesseract is free and open-source software.
Tesseract accurately transcribed the handwritten text (Come again…) at the bottom of the Rikers commissary receipt above. None of the tools we tested accurately captured the handwriting at the top (Chips A'hoy Keeps Me Happy). Tesseract definitely garbled the prices in the document, however.
Adobe Acrobat Pro doesn't provide API access to their OCR tools, but they will batch process documents. Acrobat Pro only takes PDFs (no images) and only returns PDFs with searchable text inline. If you need a separate text file, you can use Docsplit to extract a plain text file from a PDF after you've run it through Acrobat.
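The Docsplit step is a one-liner. A sketch, assuming Docsplit's `text` command and its `-o` output-directory flag:

```python
import shutil
import subprocess

def docsplit_text_cmd(pdf, out_dir="text"):
    """Build a `docsplit text` invocation; the -o output-directory
    flag is an assumption based on Docsplit's documentation."""
    return ["docsplit", "text", pdf, "-o", out_dir]

# Run only if the docsplit gem is installed on this machine.
if shutil.which("docsplit"):
    subprocess.run(docsplit_text_cmd("acrobat-output.pdf"), check=True)
```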
Pricing: Adobe Acrobat Pro DC is a desktop app, but you pay a recurring monthly subscription to use it. The application is available to public libraries at discounted rates through TechSoup at $24 off Adobe's current rates for the first year. After you request this offer through TechSoup for a $5 admin fee, you'll pay $12.99 to Adobe directly each month for the first year of the discounted membership.
Adobe Acrobat Pro gave garbled results on the historical document above.
Of all the cloud services we tested, Abbyy Cloud is the most straightforward to set up because you aren't setting up access to a whole cloud platform — OCR is the only thing they do. [Editor's note: Abbyy's product line has changed considerably.] Abbyy has been in the OCR business since 1993, and in addition to their Cloud API service, they also sell a desktop app that starts at $200 and access to an SDK that developers can use to incorporate OCR functionality into software.
Abbyy did a better job of preserving spacing in their text-only results than most of the tools we tested. In addition to plain text, Abbyy will return JSON, XML, or a PDF with the text searchable inline.
Pricing: Abbyy will let you process 50 pages with a free account. After that you need to sign up for either a monthly subscription or a 90-day package. Packages start at 10¢ per page for 100 pages or 6¢ per page with a $29.99 monthly subscription. Pricing goes as low as 3¢ per page when you get into the tens of thousands of pages. Their desktop app is $200 and comes without any page count restrictions.
Abbyy preserved much of the formatting on the receipt but introduced some wonky spacing. It isn't clear why Abbyy couldn't read the Ramen Soup price.
Google's cloud services include an OCR tool, Cloud Vision. Of all the tools we tested, Cloud Vision did the best job of extracting useful results from the low-resolution images we fed it. There are a few steps to getting it up and running, but the documentation covers them well. If you follow the instructions, you should be able to get set up. If it feels like you're going in circles, you might still be on the right track.
When you create your account and first log in, you have to actually select Console from the landing page to get to the settings you need. From the console, start by creating a project. If it's not your first project, the option is hiding under Select a Project. If you don't see an option at the top left to create or select a project, try reloading the page — our Select a Project pulldown actually disappeared briefly.
Once you've selected, or created and then selected, a project, you will need to either search for "vision" to find the Cloud Vision API or select "APIs > Enable APIs and Services" and then select Cloud Vision API.
However you get there, your next step is the enable button, and then create credentials — you'll need to tell the system, again, which API you're using. Once the project is set up, you also need to create a "Service Account." We used "Project Owner" as the "role" for ours, but if you read the documentation, you might be able to make a more precise selection. Once you hit Create, you should be prompted to download your credentials. Save the file as credentials.json, and you're ready to run our script.
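Our script authenticates with that credentials.json through Google's client library. As an alternative sketch, the underlying `images:annotate` REST endpoint can also be called directly with an API key; the `GOOGLE_API_KEY` environment variable and `scan.png` path here are placeholders:

```python
import base64
import json
import os
import urllib.request

def vision_request(image_bytes):
    """Build the JSON body for Cloud Vision's images:annotate
    endpoint, requesting document-level text detection."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
        }]
    }

api_key = os.environ.get("GOOGLE_API_KEY")  # placeholder env var name
if api_key:
    body = json.dumps(vision_request(open("scan.png", "rb").read())).encode()
    req = urllib.request.Request(
        "https://vision.googleapis.com/v1/images:annotate?key=" + api_key,
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["responses"][0]["fullTextAnnotation"]["text"])
```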
Dan Nguyen has published a few additional Python scripts that he used to compare Cloud Vision and Tesseract.
Pricing: Your first 1,000 pages each month are free. After that, you'll pay $1.50 per thousand pages. In addition, Google Cloud Vision currently offers a free trial that will get you $300 in free credits, which is enough to process 200,000 pages in one month. When you get to 10 million pages, the price drops to $0.60 per thousand pages.
Google Cloud Vision did better than any other tool on this heavily redacted FISA warrant, but it still choked on an otherwise readable sentence. "1. (U) Identity of Federal Officer Making Application This application is made by" was reduced to "dentit made."
Computer Vision is Microsoft Azure's OCR tool. It's available as an API or as an SDK if you want to bake it into another application. Azure provides sample Jupyter notebooks, which is helpful. Their API doesn't return plain text results, however. The only way to get those is to scrape the text out of the bounding boxes. Our script or their sample scripts will do that nicely, though.
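Scraping the text back out of those bounding boxes is a small loop. This sketch assumes the response shape of Azure's older /ocr endpoint (regions containing lines containing words); the newer Read API nests lines under analyzeResult instead:

```python
def scrape_text(ocr_response):
    """Flatten a Computer Vision OCR JSON response into plain text.
    Assumes the legacy /ocr shape: regions > lines > words."""
    lines = []
    for region in ocr_response.get("regions", []):
        for line in region.get("lines", []):
            lines.append(" ".join(w["text"] for w in line.get("words", [])))
    return "\n".join(lines)

# A tiny hand-built response in the assumed shape:
sample = {"regions": [{"lines": [{"words": [{"text": "Ramen"},
                                            {"text": "Soup"}]}]}]}
print(scrape_text(sample))  # prints "Ramen Soup"
```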
There are a handful of steps that you need to follow to use Computer Vision. Their quickstart guide spells them out, but you need to set up an Azure cloud account, create a "resource" (the "location" option is oddly circular, but if you stick to the default you should be okay), and wait a moment for it to deploy. Then you will be able to actually go to resources to grab your credentials and the API endpoint. Add those to credentials.json in our Azure sample script and you're ready to run it. We inexplicably got locked out of our account — reentry took more steps than it should have, but we did get back in.
Pricing: Your first 5,000 pages each month are free. After that you'll pay $1.50 per thousand pages, and the per-thousand-page price drops again at 1,000,000 pages and at 5,000,000 pages.
Azure seemed to do a nice job of breaking the receipt into columns, but it was actually a bit erratic in its implementation. The segment shown here mixes numbers from the Price column with numbers from the Total column. This wouldn't be easy to import into a spreadsheet.
All the scripts we used to run each of these tests are available in a repository on GitHub.
Ted Han was the lead technologist behind DocumentCloud from 2011 to 2018, a successful project and hosting service used widely for publication of newsworthy documents and for document analysis. Ted has been involved in open-source software for 15 years, was lead developer at Investigative Reporters and Editors, and taught at the Missouri School of Journalism.
Amanda Hickman led BuzzFeed's Open Lab for Journalism, Technology, and the Arts from its founding in 2015 until the lab wrapped up in 2017. She has taught reporting, data analysis and visualization, and audience engagement at graduate journalism programs at UC Berkeley, Columbia University, and the City University of New York, and was DocumentCloud's founding program director. Amanda has a long history of collaborating with journalists, editors, and community organizers to design and create the tools they need to be more effective.