PDF to text using OCR Software

I received a flyer from a friend in PDF format and wanted a quick technical way to translate it from German to English so that I could read it. I couldn't find any text in ASCII format in the PDF file, so I had to resort to another method.

I’ve not had much success in the past with OCR software, so I was sceptical about how much I could accomplish. To my surprise, the recognition worked far better than I had expected, and I was able to break the conversion down into a simple, repeatable process.

The OCR magic comes from a piece of software called Tesseract, now maintained by Google.

Download and install
Download tesseract-ocr-3.02.02.tar.gz (the OCR engine) and tesseract-ocr-3.02.deu.tar.gz (the German language data).
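
Before starting, it can help to confirm that the command-line tools used in this post (pdftk, gs and tesseract) are actually on the PATH. The snippet below is only a small sketch of such a check; the function name missing_tools is my own invention and not part of any of these packages.

```shell
# Print the name of every required tool that is not found on the PATH.
# With no arguments it checks the three tools used in this post.
# (missing_tools is a hypothetical helper name, chosen for illustration.)
missing_tools() {
    [ $# -gt 0 ] || set -- pdftk gs tesseract
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the tool
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Calling missing_tools with no arguments prints nothing when everything is installed, and otherwise lists what still has to be installed.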

The conversion process

  • If the cover page of the PDF has no text, you can remove it first.
  • $ pdftk original_flyer.pdf cat 2-end output flyer.pdf
  • tesseract 3.02 cannot read PDF files directly, so we use ghostscript to render the pages to a greyscale .tiff file
  • $ gs -o flyer.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw flyer.pdf
  • The scanned text is in German, so tesseract needs the German language data (deu) to recognise umlauts, etc. It appends .txt to the output name itself, so we pass flyer rather than flyer.txt
  • $ tesseract flyer.tiff flyer -l deu
  • Copy the contents of the text file flyer.txt into Google Translate
  • https://translate.google.com/
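
The steps above can be wrapped in a small shell script so that the process really is repeatable. This is a minimal sketch rather than a polished tool: the function names, the DRY_RUN switch and the assumption of a US Letter page (8.5 x 11 inches, which is what 6120x7920 pixels at 720 dpi works out to) are mine.

```shell
#!/bin/sh
# Sketch of the flyer pipeline: PDF -> grey TIFF -> German OCR text.
# Function names and the DRY_RUN switch are illustrative assumptions.

# Pixel size of a US Letter page (8.5in x 11in) at the given dpi.
# Integer arithmetic only: 8.5 * dpi is written as dpi * 85 / 10.
letter_geometry() {
    dpi=$1
    echo "$((dpi * 85 / 10))x$((dpi * 11))"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# convert_flyer INPUT.pdf BASENAME [DPI]
# Writes BASENAME.pdf, BASENAME.tiff and BASENAME.txt.
convert_flyer() {
    input=$1
    base=$2
    dpi=${3:-720}
    # Drop the cover page, render to grey TIFF, then OCR with German data.
    run pdftk "$input" cat 2-end output "$base.pdf" &&
    run gs -o "$base.tiff" -sDEVICE=tiffgray -r"${dpi}x${dpi}" \
        -g"$(letter_geometry "$dpi")" -sCompression=lzw "$base.pdf" &&
    run tesseract "$base.tiff" "$base" -l deu
}
```

Setting DRY_RUN=1 and then calling convert_flyer original_flyer.pdf flyer only prints the three commands, which is a cheap way to check them before running anything. A lower dpi value as the third argument should speed things up, at some cost in recognition quality.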

