Category: Technology

Dissertation is live and online

I am in the final stages of editing… hopefully. The plan is to finish this summer, defend in September, and graduate in December 2016.

Now that I have all of the chapters written, and they just need some work (apparently lots and lots of work), I have put the text and images online on their own website.

A write-up of the technology and the decisions made in building the site is forthcoming.

All of the primary sources that I reference are also online and available. The process and decisions for creating this repository will be covered in a following blog post.

All of the incremental changes made to the “official” version of the dissertation can be seen in the GitHub repository for the dissertation text.

There is still a lot to work on. Another 20 books of supporting material to read through and incorporate into the text.

Grammar and style to fix…

Overwhelming, but nearing the end.

The Writing Stack: Zotero -> Scrivener -> ODT -> Docx -> Markdown -> HTML

Scrivener to Markdown and HTML

How to write in Scrivener and display in HTML, Markdown, ODT, or Doc and keep the footnotes and images.

This is the process I use for getting my chapters out of Scrivener and formatted into Markdown and HTML for putting on the web: Markdown for GitHub, HTML for a static website, and Docx for turning in to advisors and the library.

Write it

Use Scrivener to bring all of the notes and sources together in one place.

Note it

The process of writing actually begins while reading through books and looking at original source documents. For each source (whether book, document, image, or web page) I create an entry in Zotero. With an entry in Zotero, I create a child-note for that entry and take notes in that child-note. I always include the page number in the notes for easy referencing later. A typical note for an entry in Zotero looks like this:

Kaj Björn Karbo (July 4, 1920)

{ | Karbo, 1947 | | |zu:312:A6J3JADD} 

page 1,
1400 men were supposed to wash in half an hour at 20 faucets.
Longest roll call was 4 hours because a couple of men had escaped.

page 2,
Relationship to Kapos was bad, also to Russians, and somewhat so to other nationalities.
Kapos were German, Russian, Polish and Czech

page 3,
Punishments consisted of beatings with boards from a bed and truncheon. 
Stretched over a bench and held by four men and then beat

page 4,
Was part of many different work commands. In January 1945 was Schieber, 
the lowest rung of prison hierarchy. He was in charge of a 16 man work 
gang. They helped German civilian workers build a factory for synthetic fuel.

The part in curly braces { | Karbo, 1947 | | |zu:312:A6J3JADD} comes in handy later when adding citations in Scrivener.

Compose it

With all of the notes taken (for now; it can be a never-ending process), copy and paste the relevant notes into the correct section of the Scrivener outline. Basically, each idea gets its own ‘page’. This boils down to each paragraph, more or less, on its own ‘page’.

Export it

First step is to export the chapter from Scrivener.

  • Export it as the OpenOffice (.odt) format. Give it a name like chapter2.odt.

Scan it

To get the footnotes into the correct format (MLA, Chicago, etc), we’ll scan the .odt file with Zotero. This creates a new file.

  • Open Zotero, click the gear, and select ‘RTF/ODF Scan’.
  • Select the file you created above (chapter2.odt).
  • Create a new name and place to save it (chapter2-citations.odt)

Cite it

The Zotero scan converts all of the coded citations from Scrivener into ‘normal’ citations.

from this: { | Blanke-Bohne, 1984 | p. 16 | |zu:312:KMQEIBU0N}

to this: Blanke-Bohne, 1984.

To get it into a different citation style, we’ll open up the file in LibreOffice and change the citation style using the Zotero ‘Set Document Preferences’ menu.

from this: Blanke-Bohne, 1984.

to this: Blanke-Bohne, Reinhold. "Die unterirdische Verlagerung von Rüstungsbetrieben und die Aßuenlager des KZ Neuengamme in Porta Westfalica bei Minden." Dissertation, University of Bremen, 1984.

After the changes finish (this could take a while), save the document as a Word file: do a ‘Save As’ and choose the .docx format (chapter2-citations.docx).

Fix it

Only the .docx format is supported by pandoc for extracting images, so we’ll need to use Word as the final format before converting to Markdown and HTML. Frankly, it also has much better grammar and spell checking.

Open the .docx in Microsoft Word and fix up any formatting issues.

I also turn this version in to my advisors for review.

Convert it

In the terminal, we’ll use the pandoc command to convert the file to Markdown and HTML.

This will convert the .docx file to a markdown file, extracting the images and putting them in a ‘files/media/’ directory.

The images are given default names, keeping their extensions, and are numbered incrementally in the order they are encountered in the document. If I had four images in the file (two JPEGs, one PNG, and one GIF), they would be extracted and named like so: image1.jpeg, image2.jpeg, image3.png, image4.gif.

We’ll have to go in and fix the tables and check for other formatting issues.

pandoc --smart --extract-media=files -f docx -t markdown_github chapter1-citations.docx -o chapter1.md

Next we can create an HTML file using pandoc and the .docx file.

pandoc --smart --extract-media=files --ascii --html-q-tags --section-divs -f docx -t html5 chapter1-citations.docx -o chapter1.html

This creates an HTML file with the images linked to the files in the files/media/ directory and the footnotes converted to hyperlinks.

Version it

Now these files can more easily be tracked with a versioning system, like git, and the HTML files can be uploaded for a static website version of the
dissertation. Styling can easily be applied if used in a Jekyll site.

For sharing on GitHub, there are two repos, main and gh-pages.

main repo

The main repo is simply the chapter directories with each of the document versions and the extracted media files. Once edits and conversions are done, this is updated with a simple

git add .
git commit -m "Updates chapter X"
git push

gh-pages repo

The gh-pages repo contains the files needed to convert the HTML version of the documents into a Jekyll-based static website. The trick here is to get all of the updates from the main repo into this gh-pages repo. This is accomplished by running the following command while on the gh-pages branch.

git checkout master -- chapterX

Before I can push the new changes to Github, I’ll need to fix a few things in the html version of the chapter.

First is to add some YAML front matter. I add this to the beginning of the HTML version.

---
layout: page
title: Chapter X
---

Second, update the path for the images so that they will work. I open the file in Vim and do a simple search and replace:

:%s/img src="files/img src="..\/files/g
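The same replacement can also be done with sed outside of Vim, which is handy when several chapters need the fix at once. A minimal sketch, shown here on a sample line rather than in-place (macOS’s BSD sed handles in-place edits differently from GNU sed):

```shell
# A sample line from the exported HTML.
line='<img src="files/media/image1.jpeg">'

# Using | as the sed delimiter avoids escaping the forward slashes.
echo "$line" | sed 's|img src="files|img src="../files|g'
# prints: <img src="../files/media/image1.jpeg">
```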

Now I can update the gh-pages branch and the site.

git add .
git commit -m "Add updates from chapterX"
git push
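The whole gh-pages update could be wrapped in a small shell function. This is a sketch, not my exact setup: the branch names, chapter layout, and file naming follow the conventions above, but adjust them to your own repo.

```shell
# publish_chapter: pull a chapter's files from master into gh-pages,
# prepend the Jekyll front matter, fix the image paths, and push.
publish_chapter() {
  local chapter="$1"                 # e.g. chapter2
  git checkout gh-pages
  git checkout master -- "$chapter"

  local html="$chapter/$chapter.html"
  # Prepend the YAML front matter to the HTML version.
  printf -- '---\nlayout: page\ntitle: %s\n---\n' "$chapter" \
    | cat - "$html" > "$html.tmp" && mv "$html.tmp" "$html"

  # Point the image links at the parent directory.
  sed -i.bak 's|img src="files|img src="../files|g' "$html" && rm -f "$html.bak"

  git add "$chapter"
  git commit -m "Add updates from $chapter"
  git push
}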

Methodology of a visualization


Visual representations of data offer a quick way to express a lot of information. As the old adage goes, a picture is worth a thousand words. One of the facets of digital humanities research is providing information in the form of visuals: graphs, maps, charts, etc.

I was already writing up some notes on a visualization I was creating for the dissertation when I read this excellent blog post by Fred Gibbs (a version of a presentation at the AHA 2015). In this essay I think Fred accurately identifies the digital humanities field as one in need of stepping up to the next level. It is no longer enough to present visuals as humanities research, but it is time to start critiquing what is presented, and for researchers to start explicitly explaining the choices that went into creating that visualization.

With those thoughts in mind, I present the methodology, the decisions, and the visualization of over 200 deaths at the KZ Porta Westfalica Barkhausen, during a one year period.

A change is happening (at least for me) in how data is analyzed. I have a spreadsheet of over 200 deaths, with various information: death date, location, nationality, etc. The desire to create a visualization came from wanting to understand the data and see the commonalities and differences. The first question I had was how many nationalities are represented, and which countries. The second question was the distribution of the deaths by month.

The following is how I came to a visualization that answers the first question.

Data Compilation

Data is taken from two locations and merged.

  • The first set of data is a large spreadsheet obtained from the KZ Neuengamme Archiv containing all of their data on the prisoners that died and were at KZ Neuengamme or one of the satellite camps. This file contains 23,393 individuals.
  • The second data set is another set of files from the KZ Neuengamme Archiv, derived from a list compiled by French authorities and available online. The files were split into three sections listing the dead from Barkhausen, Porta Westfalica, and Lerbeck. These files contained a total of 177 individuals.

Combining the individuals from both data sets who were in a Porta Westfalica KZ left around 280 individuals.

Data Cleaning

There were a number of steps needed in order to have useful information from the data.

  • First of all, the data from the French archive was highly abbreviated. For example, the column containing the locations of internment held two- or three-letter abbreviations of location names. Elie Barioz, for example, had the locations “Wil, Ng (Po, Bar)” which, when translated, turn into “Wilhelmshaven, Neuengamme (Porta Westfalica, Porta Westfalica-Barkhausen)”
    • The process of translating the abbreviations was quite labor intensive. First, I had to search on the French site for an individual:
    • Search for ‘Barioz’. (Note: the Chrome web browser can automatically translate the pages on this site.)
    • The correct individual can be determined by comparing the full name and the birthdate. The citation to the location in the book is a hyperlink to that record (e.g. Part III, No. 14 list. (III.14.)).
    • The abbreviations for this individual’s internment locations are hyperlinks to more information, part of which is the full name of the location. Clicking on ‘Wil’ results in a pop-up window describing the KZ at Wilhelmshaven and information about the city.
    • After determining that ‘Wil’ meant ‘Wilhelmshaven’, all occurrences of ‘Wil’ in that column can be changed to ‘Wilhelmshaven’. This process is repeated until all of the abbreviations have been translated.
  • Remove extraneous asterisks. It was quite frustrating to note that the French site did not include information on what the asterisks and other odd symbols mean. (Another odd notation is the numbers in parentheses after the birth location.) I simply had to delete the asterisks, losing any meaning they might have had.
  • Combine duplicates. Keep as much information from both records as possible.
  • Fix dates. They should all be the same format. This is tricky, in that Europe keeps dates in the format DD-MM-YYYY. For clarity’s sake, it would be best to use “Month DD, YYYY”. I left them as is for now. Editing 280 dates is not fun…
  • Fix nationality. The Tableau software references current nations, while the data in the spreadsheets uses the nations current at the time of creation. For example, some individuals were noted with the nationality of ‘Soviet Union (Ukraine)’. These needed to be brought to the present as ‘Ukraine’. More problematic were the individuals from ‘Czechoslovakia’. Presently, there is the Czech Republic and Slovakia, so the question is which present-day nationality to pick. The birth place column could potentially solve the issue, but it only records where the individual was born, which does not always help. Jan Siminski, for example, was born in the Polish town of Obersitz (the German name), so his birth place cannot clarify whether his nationality should be Czech or Slovakian.
  • This brings up another issue: the translation of place names. City names in German, especially during the Third Reich, differ from the current German names for the city, which differ from the English names, which differ from what the nation itself calls the city. I need to standardize the names, probably picking English. Tableau seemed to have no problem with the ethnic city names, or the German versions, so I left them as is.
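Some of these fixes lend themselves to scripting. The nationality modernization, for example, can be done as a batch replacement over an exported copy of the spreadsheet. A sketch, shown on a sample row; the field layout and the exact labels are stand-ins for whatever your export actually contains:

```shell
# A sample row from a hypothetical CSV export of the death records.
row='Karbo;Soviet Union (Ukraine);04.07.1920'

# Batch-modernize the nationality labels.
echo "$row" | sed -e 's/Soviet Union (Ukraine)/Ukraine/g' \
                  -e 's/Soviet Union (Russia)/Russia/g'
# prints: Karbo;Ukraine;04.07.1920
```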


Tool Picking

I used the free program Tableau Public.

This allows for very quick visuals, and a very easy process. The website has a number of free tutorials to get started.


The first visualization I wanted to make was a map showing where the prisoners were from, their nationality. The map would also show the number of prisoners from each country. (This is not a tutorial on how to use Tableau, but a walk through of the pertinent choices I made to make sense of the data, it is methodology, not tech support. 🙂 )

Using the default settings (basically, just double clicking on the Nationality field to create the map) results in a dot on each country represented in the data.

This can be transformed into a polygon highlight of the country by selecting a “Filled Map”.

Next step was to apply shading to the filled map; the larger the number of prisoners who died from that country the darker the fill color.
The default color was shades of green. I wanted a more dull color to fit in with the theme of the visualization, “death”. I picked a light orange to brown default gradient, separated into 13 steps (there are 13 countries represented).


While just a filled map with gradient-colored countries is helpful, the information would be more complete, more fully understandable, with a legend. This can be created by using a plain table listing the countries and the number of dead from each country. Each row is color-coordinated with the map by using the same color scheme and number of steps as on the map.




In Tableau, you create a dashboard to combine the different work sheets, maps, tables, graphs, etc. In this case, a full page map, with the table overlaid completes the visualization.


The result is a very simple map, created in about ten minutes (after a few video tutorials to refresh my memory on how to create the effects I wanted).

(See a fully functioning result below this image.)


Benefits of Tableau

Tableau has some limitations. The results are hosted on their servers, which carries the potential for lock-in. They use proprietary, closed-source code and applications.

But there are many benefits. The default visualizations look great. It is very easy to create simple and powerful visualizations, and the product is capable of producing very sophisticated statistical representations. It can even use the free and open-source stats program R. The visualizations are embeddable in any website using JavaScript.

The biggest benefit of using Tableau is the automatic link back to the original data source. I think the most needed shift in the humanities (particularly the history profession), and the biggest benefit of “digital” capabilities for the humanities, is the ability to link to the source material. This makes it infinitely easier for readers and other scholars to follow the source trail in order to provide better and more accurate feedback (read: critique and support).

To see the underlying data in this visualization, click on a country in the map or the table. A pop-up window appears with minimal data.


Click on the “View Data” icon.


Select the “Underlying” tab and check the “Show all columns” box. Voilà!


Behold the intoxicating power of being able to view the underlying data for a visualization!

Digital Humanities Improvement Idea

Imagine, if you will, the typical journal article or book, with footnotes or endnotes referencing some primary document or a page in another book or article. With digital media, that footnote turns into a hyperlink: a link to a digital copy of the primary document at the archive’s site, or at the author’s own personal archive site. Or it links to a Google Books page with the relevant page of the book or journal displayed. Now you have the whole document, or at least a whole page of text, to provide appropriate context to the citation.

Way too often I have been met with a dead end in following citations, especially references to documents in an archive. Inevitably, archives change catalog formats, documents move within an archive or are no longer available to researchers, and so on. It would be so much easier to have a link to what some researcher has already spent time finding. Let’s build on each other’s shoulders, rather than make each scholar waste time repeating archival research that has already been done.

I think it incumbent upon all researchers to provide more than a dead-text citation to their sources. In this digital age, it is becoming more and more trivial to set up a repository of the sources used in research, and the skills needed to provide a link to an item in a repository are less and less demanding. Here are some ideas on how to accomplish this already.

  • Set up a free, hosted version of Omeka. Add all of your source material to it, and provide a link to the document in Omeka along with your citation in the footnote or endnote.
  • Create a free WordPress account. Add a post for each source document, and provide a link to that post in your citation.
  • Most universities have a free faculty or student web hosting environment. Dump all of your digital copies of your documents in that space (nicely organized in descriptive folders and with descriptive file names: no spaces in the names, of course), and provide a link to that resource in your citation.
  • Set up a free Zotero account. Set up a Group Library as Public and publish all of your sources to this library.

I intend to take my own advice. I have an Omeka repository already set up, with a few resources there already: NaziTunnels Document Repository. Once I start publishing the text of my dissertation, there will be links back to the primary document in the footnotes.

I would love to see this type of digital citation become as ubiquitous as the present-day dead-text citation.

I have not addressed copyright issues here. Copyright restrictions will severely limit the resources that can be used in an online source repository, but there are certainly ways to work around this.

If hosting the sources on your own, one quick fix would be to put the digital citation sources behind a password (available in the book or journal text). Another option might be to get permission from the archive if only low quality reproductions are offered.
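On an Apache-hosted repository, the password could be a simple HTTP Basic Auth setup. A sketch, assuming Apache with .htaccess overrides enabled; the paths and the user name are placeholders:

```shell
# First create the password file (one-time), e.g.:
#   htpasswd -c /home/you/.htpasswd reader
# Then protect the sources directory with a .htaccess file:
cat > .htaccess <<'EOF'
AuthType Basic
AuthName "Dissertation Sources"
AuthUserFile /home/you/.htpasswd
Require valid-user
EOF
```

The password itself could then be printed in the book or journal text, as suggested above.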


Let me know if you find the live-text or digital citation idea viable. Do you have other ideas for providing a repository of your sources?

Drop me a note if you want more detail on how I created the map in Tableau. I’m by no means proficient, nor am I technical support for Tableau, but I’ll do what I can to guide and advise.

A Map of KZ Porta Westfalica

I needed to get the latitude and longitude of several places for the GIS project. I used Google Maps to get the data. Just click on a point on the map and the info box shows you the lat/long.



While playing with this, I figured I’d make a more permanent map showing some of the important locations. That map is found here:


I was able to find the locations with the help of a couple of maps I found in archives.


The nice thing about the Google map is that I can attach photos to the points I marked (as seen in the featured image above).




Converting PDFs to PNGs & My Workflow

I’ve posted about combining a bunch of images into one PDF, but how about going the other way?

This site has a great tutorial for using GhostScript to convert a PDF into PNGs suitable for using for OCR. They do a great job explaining the different flags for GhostScript and some tips for getting the best resolution for the PNGs. The one step they don’t show is how to get each page of a PDF into a separate PNG (so a 10 page PDF makes 10 PNGs).

Here’s how to do that:

In the output image name, add: %03d

This will insert an automatically incremented number, zero-padded to three digits. That means the first number will be 001, then 002, then 003, and so forth. This is really helpful in keeping the files in alphabetical and numerical order; otherwise you’ll get a file ending in 11 coming before 2.
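You can see what the %03d placeholder expands to with a quick printf in the terminal:

```shell
# %03d zero-pads the counter to three digits, so alphabetical
# and numerical order agree.
for i in 1 2 10 100; do
  printf 'page-%03d.png\n' "$i"
done
# prints: page-001.png  page-002.png  page-010.png  page-100.png
```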

Here is the complete command I have been using:

gs -dSAFER -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -r300 -o Zsuzsa_Polgar-%03d.png -c 30000000 setvmthreshold -f Polgar_Zsuzsa-1574-10.03.1992.pdf

So my workflow has been like this:

1. If I have a scanned copy of files in PDF form I run the above GhostScript command. This results in a folder of PNG images.

2. I run a new watermark/OCR tool on the folder of images. It is a Ruby script which utilizes ImageMagick for creating a watermark and Tesseract for running OCR on the images. You can find this program here:

This creates a folder called ‘output’ with a PDF of all the images (kind of redundant when starting with a PDF, but now the pages have the watermark on them), and two sub-folders, one with the OCR files, and one with the watermarked copies.

3. Now I can get rid of the PNGs that were created with the GhostScript command.

Now that I have each page OCRed, I can search these files, where otherwise I had to read through the entire PDF page by page. For example, today I’m looking through a 40+ page PDF transcript of a survivor interview to find the parts where she talks about her experiences at the Porta Westfalica camp. While I’ll still read through each page, I can now search the OCRed pages to find where the term ‘Porta’ appears and get a sense of where I should be looking.
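Since the OCR output is plain text, the search can even be done from the terminal with grep. A sketch using sample files; the real files would live in the ‘output’ folder the watermark/OCR tool creates:

```shell
# Make a couple of sample OCR text files to search.
mkdir -p ocr-demo
echo 'arrived at KZ Porta Westfalica' > ocr-demo/page-047.txt
echo 'no mention of the camp here'    > ocr-demo/page-048.txt

# List the pages that mention Porta, ignoring case.
grep -l -i 'porta' ocr-demo/*.txt
# prints: ocr-demo/page-047.txt
```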


Now I know that pages 47 and 48 are where I’ll find some description of her time in Porta Westfalica.

Copying Files From Mac to Linux With Umlauts

I ran into an issue today where I wanted to copy some files from my laptop to the web server.

Usually, I just run the scp command like so:

scp -r /path/to/files/on/laptop/

This will copy all of the files without problems.

The problem is that there were nearly 300 files to copy, so I left the laptop to do the copy. In the meantime, it went to sleep, stopping the copy. scp is not smart enough to copy only the files that didn’t make it; it will copy all nearly 300 files again.

There is a program that has this intelligence, though: rsync!

Run this command like so:

rsync -avz /path/to/files/on/laptop/ -e ssh


This usually works great… except when there are umlauts in the file names. Apparently Macs and Linux use different UTF-8 normalizations for file names.

The default Mac version of rsync is woefully out of date, though, and doesn’t support an option to fix this issue.

The solution!

You’ll need to have Homebrew installed in order to update to the latest version of rsync. If you don’t have Homebrew installed already, go install it.

Then it’s a simple install command:

brew install rsync

And now you can do the rsync command again:

rsync -avz --iconv=UTF8-MAC,UTF-8 /path/to/files/on/laptop/ -e ssh


The --iconv option allows Mac and Linux to speak the same UTF-8 language.

Special thanks to Janak Singh for the rsync option and detailed information on the issue.


Update: December 9, 2014.

There were some issues with the umlauts on the Linux server, and with the names of the files as I put them into Omeka, so I decided to do away with the special characters altogether. But how to change all of the file names? Easy, use the rename command.

On the Linux server it was easy as:

rename ü ue *.png

on the Mac I needed to install the rename command with homebrew first:

brew install rename

The syntax is a little bit different on the Mac:

rename -s ü ue *.png


You can also do a dry run to make sure the command doesn’t do something you don’t like.

rename -n -s ü ue *.png


That takes care of the special characters issue.
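If several special characters need replacing, a small loop saves retyping the rename command for each one. This is a portable sketch using mv and sed instead of rename, so it works the same on the Mac and the Linux server:

```shell
# Transliterate the common German special characters in PNG file names.
for f in *.png; do
  [ -e "$f" ] || continue   # no PNGs in this directory
  new=$(printf '%s' "$f" | sed -e 's/ü/ue/g' -e 's/ö/oe/g' -e 's/ä/ae/g' -e 's/ß/ss/g')
  if [ "$f" != "$new" ]; then
    mv "$f" "$new"
  fi
done
```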

Watermarking and OCRing your images

I have accumulated nearly 2000 images, all scans of documents, relating to the dissertation. One goal of the project is to make these documents open and available in an Omeka database. In order to more correctly attribute these documents to the archives where I got them, I need to place a watermark on each image.

I also need the content of the documents in a format to make it easy to search and copy/paste.

The tools to do each of those steps are readily available, and easy to use, but I needed a script to put them together so I can run them on a handful of images at a time, or even hundreds at a time.

To layout the solution, I’ll walk through the problem and how I solved it.

When at the Neuengamme Concentration Camp Memorial Archive near Hamburg in the summer of 2013, I found about 25 testimonials of former inmates. In most cases I took a picture of the written testimonial (the next day I realized I could use their copier/scanner and make nicer copies). So I ended up with quite a number of folders, each containing a number of images.


So the goal became to watermark each of the images, and then to run an OCR program on them to grab the contents into plain text.


There are many options for watermarking images. I chose to use the incredibly powerful ImageMagick tool. The ImageMagick website has a pretty good tutorial on adding watermarks to single images. I chose to add a smoky gray rectangle to the bottom of the image with the copyright text in white.

The image watermark command by itself goes like this:

width=$(identify -format %w "/path/to/copies/filename.png"); \
s=$((width/2)); \
convert -background '#00000080' -fill white -size "$s" \
-font "/path/to/font/file/font.ttf" label:"Copyright ©2014 Ammon" miff:- | \
composite -gravity south -geometry +0+3 - \
"/path/to/copies/filename.png" "/path/to/marked/filename.png"

This command can actually be run on the command line as is (replacing the paths to images, the font file, and the copyright text, of course). I’ll explain the command below.

The first line gets the width of the image to be watermarked and sets it to the variable “width”. The second line gets half the value of the width, and sets it to the variable “s”.

The third line starts the ImageMagick command (and is broken onto several lines using the \ to denote that the command continues). The code from ‘convert’ to the pipe ‘|’ creates the watermark, a dark grey rectangle with white text at the bottom of the image.



Most of the images I have are of typed up documents, so they are good candidates for OCR (Optical Character Recognition), or grabbing the text out of the image.

OCR is done using a program called tesseract.

The tesseract command is relatively simple. Give it an input file name, an output file name, and an optional language.

tesseract "/path/to/input/file.png" "/path/to/output/file" -l deu

This will OCR file.png and create a file named file.txt. The -l (lowercase letter L) option sets the language to German (deut[sch]).
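To run that over a whole folder of page images, a loop like this works. A sketch, wrapped in a function; it assumes tesseract and the German language data are installed:

```shell
# ocr_dir: OCR every PNG in the current directory with tesseract.
# "${f%.png}" strips the extension; tesseract adds .txt itself.
ocr_dir() {
  for f in *.png; do
    [ -e "$f" ] || continue   # no PNGs in this directory
    tesseract "$f" "${f%.png}" -l deu
  done
}
```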


The Script

The script is available at my GitHub repo:

Here is how to use the script.

Download the ocrwm file and put it in the directory that has the image files.

Open the file with a text editor and set the default label to use in the watermark. If desired, you can also specify a font file to use.


On the command line (the terminal), simply type:

bash ocrwm

At its most basic, this will make a “copies” directory and put in it a copy of each image file (it will find images of the format JPG, GIF, TIF, and PNG in the directory where you run the command).


To OCR and Watermark the images do:

bash ocrwm -ow

This will make the copies as above, but will also create a directory named “ocr” and a directory named “marked” and add respective files therein.


You can also create a single pdf file from the images in the directory like so:

bash ocrwm -pow


Adding the l (lowercase letter L) option allows you to set the text in the watermark.

bash ocrwm -powl "Copyright ©2014 Me"


There is an option to not copy the files. This is useful if the files have been copied using this script previously (say you ran the script but only did watermarks and not OCR; to just do the OCR you can run the script again without copying the files again).

bash ocrwm -co




Here are things to look out for when running the script.

By default, the script will run the OCR program, tesseract, with German as the default language. You can change that to English by deleting the “-l deu” part on the line that calls tesseract. The list of language abbreviations and languages available is in the tesseract manual; to see it, on the command line type:

man tesseract


A few times I had PDFs as the original format to work with. In most cases these were multi-page PDFs. In order to use the script with these, I first needed to break out each page of the PDF and convert it to a PNG format. See here for a reason to choose PNG over other formats.

The ImageMagick command ‘convert’ will take care of that:

convert -density 600 -quality 100 original.pdf newfile.png

Depending on how many pages are in the PDF, the command can take quite a while to run. For a 30 page PDF, it took my laptop about 5 minutes. The end result is a PNG image for each page incrementally numbered beginning with zero. If the PDF above had four pages, I would end up with the following PNGs: newfile-0.png, newfile-1.png, newfile-2.png, newfile-3.png

Now I could run the ocrwm script in the directory and get OCR’ed and watermarked images. In this case I could leave off the ‘p’ option because I began with a PDF with all pages combined.

bash ocrwm -ow


Feel free to download the script, make changes or improvements, and send them back to me (via the github page).