I’ve posted about combining a bunch of images into one PDF, but how about going the other way?
This site has a great tutorial for using GhostScript to convert a PDF into PNGs suitable for OCR. They do a great job explaining the different GhostScript flags and give some tips for getting the best resolution for the PNGs. The one step they don’t show is how to get each page of a PDF into a separate PNG (so a 10-page PDF makes 10 PNGs).
Here’s how to do that:
In the output image name, add: %03d
This inserts an automatically incremented number, zero-padded to three digits. That means the first file is numbered 001, then 002, then 003, and so forth. The padding keeps the files in the same order whether they are sorted alphabetically or numerically. Without it, a file ending in 11 sorts before one ending in 2.
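To see why the padding matters, here is a small Python sketch. GhostScript treats %03d as a printf-style format, and Python’s % operator follows the same convention for this specifier. The `scan-` file names are made up for the illustration.

```python
# Illustrative page numbers; "scan-" is a hypothetical file name prefix.
pages = [1, 2, 11]

# Without padding, a plain alphabetical sort puts 11 before 2.
unpadded = sorted("scan-%d.png" % n for n in pages)

# With %03d, alphabetical order and numerical order agree.
padded = sorted("scan-%03d.png" % n for n in pages)

print(unpadded)  # ['scan-1.png', 'scan-11.png', 'scan-2.png']
print(padded)    # ['scan-001.png', 'scan-002.png', 'scan-011.png']
```

Three digits is enough for PDFs of up to 999 pages; a longer document would need %04d to keep the same ordering.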
Here is the complete command I have been using:
gs -dSAFER -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -r300 -o Zsuzsa_Polgar-%03d.png -c 30000000 setvmthreshold -f Polgar_Zsuzsa-1574-10.03.1992.pdf
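If the same conversion has to be repeated for several PDFs, the command above could be wrapped in a short Python script. This is only a sketch: the function names `build_gs_command` and `pdf_to_pngs` are my own invention, and it assumes the `gs` executable is on the PATH. The flags are copied unchanged from the command above.

```python
import shutil
import subprocess


def build_gs_command(pdf_path, output_prefix, dpi=300, threads=8):
    """Return the GhostScript argument list for one PDF (hypothetical helper)."""
    return [
        "gs",
        "-dSAFER",
        "-sDEVICE=png16m",
        "-dINTERPOLATE",
        f"-dNumRenderingThreads={threads}",
        f"-r{dpi}",
        # %03d is passed through literally; GhostScript fills in the page number.
        "-o", f"{output_prefix}-%03d.png",
        "-c", "30000000", "setvmthreshold",
        "-f", pdf_path,
    ]


def pdf_to_pngs(pdf_path, output_prefix):
    """Run GhostScript on one PDF, raising an error if gs is missing or fails."""
    if shutil.which("gs") is None:
        raise RuntimeError("gs was not found on the PATH")
    subprocess.run(build_gs_command(pdf_path, output_prefix), check=True)
```

Calling `pdf_to_pngs("Polgar_Zsuzsa-1574-10.03.1992.pdf", "Zsuzsa_Polgar")` would then run the same command as above. Passing the arguments as a list avoids any shell quoting issues with the % sign.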
So my workflow has been like this:
1. If I have scanned files in PDF form, I run the above GhostScript command. This results in a folder of PNG images.
2. I run a new watermark/OCR tool on the folder of images. It is a Ruby script that uses ImageMagick to create a watermark and Tesseract to run OCR on the images. You can find this program here:
This creates a folder called ‘output’ with a PDF of all the images (kind of redundant when starting with a PDF, but now the pages have the watermark on them), and two sub-folders, one with the OCR files, and one with the watermarked copies.
3. Now I can get rid of the PNGs that were created with the GhostScript command.
Now that I have each page OCRed, I can search these files, where otherwise I had to read through the entire PDF page by page. For example, today I’m looking through a 40+ page PDF transcript of a survivor interview to find the parts where she talks about her experiences at the Porta Westfalica camp. I’ll still read through each page, but to get a sense of where I should be looking I can now search the OCRed pages for the term ‘Porta’.
Now I know that I’ll find some description of her time in Porta Westfalica on at least pages 47 and 48.
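A search like this can be sketched in a few lines of Python. The sketch assumes the OCR step wrote one plain-text .txt file per page into a single folder. That layout is my assumption, not something the tool guarantees, and `pages_mentioning` is a made-up name.

```python
from pathlib import Path


def pages_mentioning(ocr_dir, term):
    """Return the names of .txt files in ocr_dir that contain term.

    The match is case-insensitive, and files are checked in sorted order,
    so zero-padded page numbers come back in page order.
    """
    needle = term.lower()
    hits = []
    for path in sorted(Path(ocr_dir).glob("*.txt")):
        # OCR output can contain stray bytes; replace them rather than fail.
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(path.name)
    return hits
```

Something like `pages_mentioning("output/ocr", "Porta")` would list the matching page files, where `output/ocr` stands in for wherever the OCR files actually ended up. OCR is not perfect, so a page where the word was misread will not show up in the results.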