Transkribus Training

After Easter, I attended three sessions of training on Transkribus, the handwriting recognition software that works a bit like Optical Character Recognition (OCR) to create automatic transcriptions of manuscripts. This is the first in a series of three posts about the training.

The first session was a bit of an epic – 4 hours of training explaining how it works and how to use it. They started out by telling us that it is particularly useful for:

  • Manually and automatically transcribing your handwritten and printed documents (it is good for manual transcription because you can work in a split-screen environment)
  • Training models to read particular sets of handwriting that are useful for you
  • Collaborating on transcribing documents
  • Searching in documents for key words

Once you’ve got your transcriptions, you still need to read them or prepare them for Natural Language Processing.

Handwritten Text Recognition (HTR) turns images of manuscripts, printed texts and newspapers into machine-readable text. OCR only focusses on single characters, whereas HTR processes the whole line, so it tends to create a better transcription. This is because in handwriting the form of a letter changes depending on its position in the word, and there is often no space between one character and the next. HTR can be trained and has improved recently thanks to advances in AI. There are lots of HTR models that you can try, and you need to find the one that works best for your document. It can transcribe between 100 and 200 pages in an hour (actually, I think they said half an hour, but I can’t be sure, so I thought I’d better go for the lower value!).

Honestly, the last time I tried it on one of my sixteenth-century documents, a couple of years ago, it generated gibberish. My attempt last week wasn’t perfect, but it was perfectly legible to human eyes (a machine running corpus linguistics would be a different matter), so I would be able to read through the transcriptions to look for anything that was relevant to my work.

The training went through how to create a collection, upload images, and test the transcription on a few pages to see if any of the existing HTR models work. If the results are pretty good, you can correct them; if they aren’t that great, you can train your own model. They explained how to do that by creating what they call a ‘ground truth’ of around 75 properly transcribed images. Once you have corrected the transcription you can tag it, and download it either as text or as a two-layer image that allows you to look at the original image and the text overlay.
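For anyone who wants a rough way of comparing models outside Transkribus, here is a minimal sketch in Python. It is not how Transkribus itself scores models: it simply uses the standard library's `difflib` to measure how close each model's output is to a line you have corrected by hand. The model names and sample lines are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Return a score between 0 and 1; 1.0 means the two texts are identical."""
    # autojunk=False stops difflib from ignoring very frequent characters
    # in longer texts, which would distort the score.
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()


def best_model(reference: str, outputs: dict[str, str]) -> str:
    """Return the name of the model whose output is closest to the reference."""
    return max(outputs, key=lambda name: similarity(reference, outputs[name]))


# Hypothetical data: one hand-corrected line and two invented model outputs.
reference_line = "the quick brown fox"
model_outputs = {
    "model_a": "the quick brovvn fox",
    "model_b": "tlie qnick hrown f0x",
}

for name, text in model_outputs.items():
    print(name, round(similarity(reference_line, text), 3))
print("closest:", best_model(reference_line, model_outputs))
```

A score like this is only a quick sanity check on a handful of pages. It says nothing about whether the errors that remain are the ones that matter for your research.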

I’ll be honest… four hours was too long for me and by the time they got to discussing tables in documents I wasn’t taking anything in, but happily the session was recorded and I will be able to go back and look at that later.

When I last tried it a couple of years ago it was, frankly, poor for early modern handwriting, so I was dubious. However, I had a quick go with it during the training and it looks to me like it has improved dramatically. It won’t produce perfect transcriptions (but then nor, generally, do humans), but it isn’t half bad and it would be quicker for me to correct the transcriptions it produces than to do them by hand.

There is a fly in the ointment though: the free plan only allows users to transcribe 100 images a month. My next project has thousands… Paying for Transkribus to transcribe them would still be considerably cheaper than paying someone to do them all manually, and far quicker than doing it myself. Even so, relatively cheap as it is, it would still cost a lot to process the numbers of images that we would need for major projects. But if pre-modernists could get it to do even the basics of the transcriptions they might need, it would save an enormous amount of time, meaning that we could cut down on the time it takes us to complete even relatively small research projects. It would also open up texts to corpus analysis techniques which are currently impossible because the corpus is all handwritten.
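To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 100 images a month and the 100 to 200 pages an hour come from the training. The project size of 3,000 images is a made-up figure standing in for "thousands", and I am assuming one image equals one page.

```python
import math

FREE_IMAGES_PER_MONTH = 100  # free-plan allowance mentioned in the training


def months_on_free_plan(total_images: int, per_month: int = FREE_IMAGES_PER_MONTH) -> int:
    """Whole months needed to get through a project on the free allowance alone."""
    if total_images < 0 or per_month <= 0:
        raise ValueError("total_images must be >= 0 and per_month must be > 0")
    return math.ceil(total_images / per_month)


def machine_hours(total_images: int, pages_per_hour: int) -> float:
    """Processing time in hours, assuming one image is one page."""
    return total_images / pages_per_hour


project_images = 3000  # hypothetical project size, not a real figure

print(months_on_free_plan(project_images))  # 30 months on the free plan
print(machine_hours(project_images, 100))   # 30.0 hours at the slower quoted speed
print(machine_hours(project_images, 200))   # 15.0 hours at the faster quoted speed
```

On those assumptions the free plan would take two and a half years for a job the software could process in a day or two, which is why paying becomes hard to avoid for larger projects.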
