A friend sent me a flyer as a PDF, and I wanted a quick technical way to translate it from German to English so that I could read it. I couldn’t find any plain ASCII text in the PDF file, so I had to resort to another method.

I’ve not had much success with OCR software in the past, so I was sceptical about how much I could accomplish. To my surprise, the recognition worked far better than I had expected, and I was able to break the conversion down into a simple, repeatable process.

The OCR magic comes from a piece of software called Tesseract, now maintained by Google.

Download and install
Download tesseract-ocr-3.02.02.tar.gz and tesseract-ocr-3.02.deu.tar.gz (the German language data).

The conversion process

  • If the cover page has no text, you can remove it from the PDF.
  • $ pdftk original_flyer.pdf cat 2-end output flyer.pdf
  • Tesseract 3.02 only reads greyscale TIFF files, so we use Ghostscript to convert the PDF.
  • $ gs -o flyer.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw flyer.pdf
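  • The scanned text is in German, so Tesseract needs the German language data to recognise umlauts and similar characters.
  • $ tesseract flyer.tiff flyer -l deu
  • Tesseract appends .txt to the output name, so the recognised text ends up in flyer.txt. Copy its contents into Google Translate.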
  • https://translate.google.com/
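
Since the process is repeatable, the three commands above can be chained in a small shell script. What follows is only a sketch under a few assumptions: the script name ocr_flyer.sh, the intermediate file name ending in .body.pdf, and the DRY_RUN switch are my own inventions for illustration, and the page size and resolution are simply copied from the Ghostscript command above. It also assumes the cover is always page 1 and should always be dropped.

```shell
#!/bin/sh
# ocr_flyer.sh (hypothetical name): chain pdftk, gs and tesseract for one PDF.
# Usage: sh ocr_flyer.sh original_flyer.pdf [lang]
# Set DRY_RUN=1 to print the commands instead of running them.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ocr_flyer() {
    input=$1
    lang=${2:-deu}                    # German by default, as in the steps above
    base=$(basename "$input" .pdf)    # output files land in the current directory

    # 1. Drop the cover page (assumed to be page 1).
    run pdftk "$input" cat 2-end output "$base.body.pdf" || return 1

    # 2. Render the remaining pages to a greyscale, LZW-compressed TIFF.
    run gs -o "$base.tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 \
        -sCompression=lzw "$base.body.pdf" || return 1

    # 3. OCR the TIFF; tesseract appends .txt to the output base name.
    run tesseract "$base.tiff" "$base" -l "$lang" || return 1
}

# Only run when a file name was given on the command line.
if [ "$#" -gt 0 ]; then
    ocr_flyer "$@"
fi
```

Running it as DRY_RUN=1 sh ocr_flyer.sh original_flyer.pdf prints the three commands without touching any files, which is a cheap way to check the file names before committing to a slow 720 dpi render. The translation step stays manual: paste the resulting text file into Google Translate as before.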


