#it's only about 50% the OCR's fault here and it's 50% the PDF just being terrible to start with | Explore Tumblr posts and blogs

finnlongman · 9 months ago

Note

i hope this doesn't sound patronising if you've already tried that route, but in case you haven't: if this is a text that has been given to you as is and been produced by a general OCR, it might be worth looking into whether ppl have already trained language-specific OCR models for your field of study and using that to transcribe the scanned text again. there's a lot of transcription solutions/software and different fields prefer different ones, and idk the standard for Celtic studies personally, but a site i use often (transkribus) has 2 Irish models whose related projects you might be able to use as a starting point for research at least. best of luck to you either way!

So there are several factors at work with the OCR problems with this text specifically.

The PDF of the text is from Archive. The library copy it was scanned from has various pencil markings and annotations that are interfering with the printed text -- it's not a clean scan. It's also not super high definition, so letters like "h" sometimes get misread as "li", even though they're totally readable to human eyes.

The edition uses frequent italics and brackets to show where abbrevations in the manuscript has been expanded. Individual italicised letters confuse the OCR, as do random square brackets in the middle of words.

It also has a lot of superscript numbers corresponding to manuscirpt variants in the footnotes. Sometimes these are in the middle of a word. This also confuses most OCR systems, even if it can tell that the footnotes are separate from the main text.

The language of the text is late Middle / Early Modern Irish, from two different manuscripts that have their own unique spelling quirks (for example, one of them loves to spell Cú Chulainn's name "Cú Cholain", which is a vibe).

In order to run the text through a more sophisticated OCR system that was equipped to cope with a) annotations, b) weird formatting and punctuation, c) incredibly frequent footnotes (variants), and d) non-standardised spelling (which throws off many language models), I would probably still need to have a reliable, clear, and high-definition scan of the text. Which would require re-digitising it from scratch.

So, the quickest and easiest way to get a version of the text that I personally can use is to sit here and type up 20,000 words into a document. This is 2-4 days' work, depending on how focused I am, and gives me the chance to go through the text in detail and spot things I might miss otherwise, so it's probably a whole lot less effort for more benefit than trying to adapt an entire language model that could read this terrible PDF. Especially as I have no experience of using these programmes so would have a steep learning curve.

Now, somebody absolutely should do that, so we could get proper searchable editions of more things. But honestly, if using transcription tools for medieval/early modern Irish I think there are higher priorities than things already available in printed form, so I doubt it's at the top of anyone's to-do list!

#it's only about 50% the OCR's fault here and it's 50% the PDF just being terrible to start with #which is. a different problem #finn is not doing a phd #answered

16 notes · View notes