Case Study 7.10: University of Georgia Law Library—An Optical Character Recognition (OCR) Workflow
Another common accessibility workflow used for digital libraries and repositories is optical character recognition or OCR. In an OCR process, computer software is used to recognize the printed characters in an image-only PDF. Without OCRing, assistive devices cannot convert text to speech and the document will not be far less searchable and discoverable in a digital environment.
In their 2020 CALIcon presentation, three presenters from the University of Georgia School of Law Library explained how they constructed an OCRing workflow to update older PDFs in their institutional repository, Digital Commons @ Georgia Law. Library staff used the batch OCR process in Adobe Acrobat Pro to update the documents in the IR. Although the presentation highlights resultant SEO advantages to OCRing, it merits reiterating that this process is also extremely important for supporting accessibility.
It is worth noting that OCRing produces varied results depending on the typography and characters in any given text. In “Steps toward Accessibility: Improving the User Experience through OCR,” the presenters pointed out common OCR software including Abbyy FineReader, Tesseract OCR, and Adobe Acrobat Pro. Note that for non-Roman and historical characters, other options like Kraken may be more suitable.
Some institutions include corrective and quality-control steps after producing OCR. To learn more about some of these processes, take a look at the University of Illinois library guide, “Introduction to OCR and Searchable PDFs.”