This tutorial explains how to extract all text from PDFs, including text inside images, using a combination of Ghostscript and a command-line OCR tool called tesseract-ocr. This is yet another guest post by StoneCut.
First we need to convert our PDF into individual image files (TIFF) so that we can then run OCR on them. We need Ghostscript for that. It's probably already installed on your system, but just to be sure you can run:
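As a rough sketch of the workflow described above, and not necessarily the commands the author used, the shell function below renders each PDF page to a TIFF with Ghostscript and then runs tesseract on every page. The function name `pdf_to_text`, the file names, the 300 dpi resolution and the `tiffg4` output device are all assumptions chosen for illustration, as are the Debian/Ubuntu package names in the comment.

```shell
#!/bin/sh
# Sketch only: extract text from a PDF with Ghostscript + tesseract.
# Assumed, not taken from the article: the function name, file names,
# 300 dpi, and the tiffg4 (black-and-white fax-compressed TIFF) device.
#
# On Debian/Ubuntu both tools can usually be installed with (assumption):
#   sudo apt-get install ghostscript tesseract-ocr

pdf_to_text() {
    pdf=$1    # input PDF
    out=$2    # file that receives the combined text
    workdir=$(mktemp -d) || return 1

    # Render every page to its own TIFF: page-001.tif, page-002.tif, ...
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 \
       -sOutputFile="$workdir/page-%03d.tif" "$pdf" || return 1

    # OCR each page; tesseract writes <base>.txt for each image.
    : > "$out"
    for tif in "$workdir"/page-*.tif; do
        [ -e "$tif" ] || continue
        tesseract "$tif" "${tif%.tif}" >/dev/null 2>&1 || return 1
        cat "${tif%.tif}.txt" >> "$out"
    done

    rm -rf "$workdir"
}

# Usage (hypothetical file names):
#   pdf_to_text input.pdf output.txt
```

Rendering at a higher resolution generally gives the OCR step more to work with at the cost of larger intermediate files; for colour or greyscale scans a different Ghostscript TIFF device such as `tiffgray` may be a better fit than `tiffg4`.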