Tag: OCR

Converting PDFs to PNGs & My Workflow

I’ve posted about combining a bunch of images into one PDF, but how about going the other way?

This site has a great tutorial for using GhostScript to convert a PDF into PNGs suitable for OCR. It does a great job of explaining the different GhostScript flags and offers some tips for getting the best resolution for the PNGs. The one step it doesn’t show is how to get each page of a PDF into a separate PNG (so a 10-page PDF makes 10 PNGs).

Here’s how to do that:

In the output image name, add: %03d

This inserts an automatically incremented number, zero-padded to three digits: the first file gets 001, the next 002, then 003, and so forth. This keeps the files in the same order alphabetically and numerically. Without the padding, a file ending in 11 would sort before one ending in 2.
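To see the effect of the padding without running GhostScript, here is a small sketch using the shell's own printf and sort. The page-… file names and the two function names are made up for illustration; the %03d format behaves the same way as in the output image name.

```shell
# Build file names WITHOUT zero padding and sort them as plain text.
unpadded_names() {
  printf 'page-%d.png\n' "$@" | LC_ALL=C sort
}

# Build file names WITH %03d padding and sort them as plain text.
padded_names() {
  printf 'page-%03d.png\n' "$@" | LC_ALL=C sort
}

# Without padding, page 11 sorts before page 2.
unpadded_names 1 2 11

# With padding, the text order matches the page order.
padded_names 1 2 11
```

Three digits are enough for PDFs of up to 999 pages; a longer document would need a wider pad such as %04d.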

Here is the complete command I have been using:

gs -dSAFER -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -r300 -o Zsuzsa_Polgar-%03d.png -c 30000000 setvmthreshold -f Polgar_Zsuzsa-1574-10.03.1992.pdf
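If there are several PDFs to convert, the same command can be wrapped in a loop. This is only a sketch: it assumes the PDFs sit in the current directory and that each set of PNGs should be named after its PDF, and the helper name png_pattern_for is invented here. The gs flags are copied unchanged from the command above.

```shell
# Derive an output pattern such as "scan-%03d.png" from "scan.pdf".
# (Hypothetical helper; the naming scheme is an assumption.)
png_pattern_for() {
  printf '%s-%%03d.png\n' "$(basename "$1" .pdf)"
}

# Convert every PDF in the current directory, one PNG per page.
for pdf in ./*.pdf; do
  [ -e "$pdf" ] || continue   # no PDFs present: the glob stays literal, so skip it
  gs -dSAFER -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -r300 \
     -o "$(png_pattern_for "$pdf")" -c 30000000 setvmthreshold -f "$pdf"
done
```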

So my workflow has been like this:

1. If I have scanned files in PDF form, I run the above GhostScript command. This results in a folder of PNG images.

2. I run a new watermark/OCR tool on the folder of images. It is a Ruby script which utilizes ImageMagick for creating a watermark and Tesseract for running OCR on the images. You can find this program here:

https://github.com/mossiso/cowl

This creates a folder called ‘output’ with a PDF of all the images (kind of redundant when starting with a PDF, but now the pages have the watermark on them), and two sub-folders, one with the OCR files, and one with the watermarked copies.

3. Now I can get rid of the PNGs that were created with the GhostScript command.

Now that I have each page OCRed, I can search these files, whereas before I had to read through the entire PDF page by page. For example, today I’m looking through a 40+ page PDF transcript of a survivor interview to find the parts where she talks about her experiences at the Porta Westfalica camp. I’ll still read through each page, but to get a sense of where I should be looking, I can now search the OCRed pages to find where the term ‘Porta’ appears.

[Screenshot: search results for ‘Porta’ in the OCRed pages]

Now I know that pages 47 and 48, at least, contain some description of her time in Porta Westfalica.
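The same kind of search can be run from the command line with grep. This is a rough sketch, not part of the watermark/OCR tool: it assumes the OCR results are plain-text files ending in .txt, and both the function name pages_mentioning and the folder name output/ocr in the example call are placeholders to adjust to wherever the OCR files actually live.

```shell
# List the OCR text files in a folder that mention a term.
#   $1 = search term (matched case-insensitively)
#   $2 = folder holding the .txt files (assumed layout)
pages_mentioning() {
  term=$1
  dir=$2
  grep -i -l -- "$term" "$dir"/*.txt | LC_ALL=C sort
}

# Example call (folder name is a placeholder):
#   pages_mentioning Porta output/ocr
```

Because the page files carry zero-padded numbers, the sorted list of matching files comes out in page order.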

Work at the Porta Labor Camps

Job List

Reinhold Blanke-Bohne completed his dissertation on the Nazi SS labor camps at Porta Westfalica in 1984. Inmates were assigned to many different commands, and they switched commands often for various reasons. Blanke-Bohne lists 26 different commands; not all of them existed at the same time.

Some of the jobs at the labor camp in Porta Westfalica:

  1. Höhle 1 (= unteres System im Jakobsberg);
  2. Höhle 2 (= oberes System im Jakobsberg); (Beide Kommandos hatten mehrere Unterkommandos)
  3. Denkmalstollen
  4. Weserstollen
  5. Häverstädter Stollen (ebenfalls mit Unterkommandos)
  6. Stollenkippe (= oberes System im Jakobsberg)
  7. Betonwerk Weber (siehe Teil 4.6)
  8. Verschiedene Baukommandos für Erdarbeiten, Zementtransport und Mischung, Klinkerbau- und Transport, Betonbau (Betriebe: OT Einsatzgruppe Philipp Holzmann, ARGE Herford u.a.)
  9. Brunnenbaukommando
  10. Betonkolonne
  11. Kommando Kiesgrube
  12. Kommando Uhde
  13. Kommando Edeleanu
  14. Kommando Saupe und Hielke
  15. Kommando Be- und Entwässerung
  16. Kommando Barackenbau
  17. Verschiedene Transportkommandos
  18. Waldarbeiterkommando
  19. Kommando Büscher
  20. Kommando Maschinenbau
  21. Kommando Hammerwerke
  22. Kommando Baumgarten
  23. Gleisbau Walther
  24. Kommando SS Haus.
  25. Lagerkommando
  26. Kommando Badeheizer.

And English translations (Better, more accurate suggestions are welcome. Just add a comment to this post.)

  1. Large tunnel or cave one (the lower tunnel system in Jakobsberg)
  2. Cave Two or Phillip works (Upper tunnel system in Jakobsberg)
  3. Memorial gallery
  4. Weser tunnel
  5. Häverstädter gallery
  6. Gallery dump
  7. Weber concrete works
  8. Various construction commands for earthworks, cement transport and mixing, brick construction and transport, and concrete construction (firms: OT Einsatzgruppe Philipp Holzmann, ARGE Herford, and others)
  9. Well construction command
  10. Concrete crew
  11. Command gravel pit
  12. Command Uhde
  13. Command Edeleanu
  14. Command Saupe and Mielke
  15. Command irrigation and drainage
  16. Barrack construction command
  17. Various transport commands
  18. Forest workers command
  19. Command Büscher
  20. Machine construction command
  21. Command hammer works
  22. Command Baumgarten
  23. Track construction Walther
  24. Command SS-house
  25. Camp command
  26. Command bath heater
Monument in Porta Westfalica to the former laborers.

Technical Notes

I have a copy of Reinhold Blanke-Bohne’s dissertation thanks to the extreme generosity of several individuals. Foremost is Wolfgang Walter from Minden, who allowed his copy of the dissertation to be copied. Second is Dr. Gerhard Franke, who had the copies made and sent them to me while I was in Berlin. And third is Dirk Volkening at Kopiertechnik, who made the copies. He actually scanned them to PDF files, which is even better than paper copies. I then opened the PDF in Adobe Acrobat Pro and converted it to a searchable document (open the Text tool, select the Recognize Text menu, and click the “In This File” option; this may be different in your version of Adobe Acrobat Pro).

Making a PDF searchable in Adobe.

Another option is to upload the PDF to Google Docs.

First make sure the upload settings are set to automatically convert the document on upload, or at least ask you on each upload. When you view the PDF document in Google Docs, rather than Google Drive Viewer, you will have a searchable text page after each image page.