
How To Extract All Text From PDFs (Including Text In Images) [Ubuntu]

By Ricky on March 14th, 2010 

This tutorial explains how to extract all text from PDFs (including text in images) using a combination of Ghostscript and a command-line OCR tool called tesseract-ocr. This is another guest post by StoneCut.

First, we need to convert the PDF into individual image files (TIFF) so that we can then run OCR on them. Ghostscript handles that step. It is probably already installed on your system, but just to be sure you can run:

Read more here.
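
As a rough illustration of the two-step approach described above (render the PDF pages to TIFF with Ghostscript, then OCR each page with tesseract), here is a minimal Python sketch. It is not the original article's command sequence. The function names, the `page_%04d.tif` naming pattern, the `tiffg4` output device, the 300 dpi resolution and the `eng` language are all assumptions chosen for the example, and both `gs` and `tesseract` are assumed to be installed and on the PATH.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript call that renders each PDF page to its own TIFF.

    The file pattern, device and resolution are illustrative assumptions.
    """
    pattern = str(Path(out_dir) / "page_%04d.tif")
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4",           # 1-bit TIFF, usually adequate for OCR
        f"-r{dpi}",                  # rendering resolution in dpi
        f"-sOutputFile={pattern}",   # %04d is replaced by the page number
        str(pdf_path),
    ]


def tesseract_command(tiff_path, lang="eng"):
    """Build a tesseract call; it writes <output base>.txt next to the TIFF."""
    tiff_path = Path(tiff_path)
    output_base = tiff_path.with_suffix("")  # tesseract appends ".txt" itself
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang]


def extract_text(pdf_path, out_dir, dpi=300, lang="eng"):
    """Run both steps and return the recognized text of all pages."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir, dpi), check=True)

    pages = []
    for tiff in sorted(out_dir.glob("page_*.tif")):
        subprocess.run(tesseract_command(tiff, lang), check=True)
        pages.append(tiff.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling `extract_text("scan.pdf", "pages")` (both names are placeholders) would leave the intermediate TIFF and text files in the `pages` directory and return the combined text. Building the command lists in separate functions keeps them easy to inspect or adjust, for example to try a grayscale device or a different language.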









How To Extract All Text From PDFs (Including Text In Images) [Ubuntu] was originally published on Digitizor.com on March 14, 2010 - 1:30 am (Indian Standard Time)