I received a flyer from a friend in PDF format and wanted to translate it from German to English so that I could read it, using a quick technical solution. I couldn’t find the text in ASCII format in the PDF file, so I had to resort to another method: OCR.
I’ve not had much success in the past with OCR software, so I was sceptical about how much I could accomplish. To my surprise, the recognition worked far better than I had expected, and I was able to break the conversion down into a simple, repeatable process.
The OCR magic comes from a piece of software called Tesseract, now maintained by Google.
Download and install
Download tesseract-ocr-3.02.02.tar.gz and tesseract-ocr-3.02.deu.tar.gz (the German language data).
The conversion process
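# Drop the first page: keep pages 2 to the end
$ pdftk original_flyer.pdf cat 2-end output flyer.pdf
# Render the PDF to a grayscale, LZW-compressed TIFF at 720 dpi (6120x7920 pixels)
$ gs -o flyer.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw flyer.pdf
# Run OCR with the German language data; tesseract appends .txt to the output name, so this writes flyer.txt
$ tesseract -l deu flyer.tiff flyer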
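The -g value in the Ghostscript command is not arbitrary: it is the page size in inches multiplied by the resolution given with -r. At 720 dpi, 6120x7920 pixels works out to 8.5 x 11 inches, so I am assuming a US Letter page here. The sketch below does that arithmetic for other resolutions; page_geometry is a made-up helper name of mine, not part of Ghostscript.

```shell
#!/bin/sh
# page_geometry is a hypothetical helper (not part of Ghostscript).
# Usage: page_geometry DPI WIDTH_IN HEIGHT_IN
# Prints WIDTHxHEIGHT in pixels, rounded to the nearest pixel.
page_geometry() {
    awk -v dpi="$1" -v w="$2" -v h="$3" \
        'BEGIN { printf "%dx%d\n", dpi * w + 0.5, dpi * h + 0.5 }'
}

# Assumed page size: US Letter, 8.5 x 11 inches.
page_geometry 720 8.5 11   # prints 6120x7920
page_geometry 300 8.5 11   # prints 2550x3300
```

If you change -r, the -g value should change with it (for example `-r300x300 -g$(page_geometry 300 8.5 11)`), otherwise the rendered page will most likely end up cropped or padded.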