The promise of OCR

Fact: there are a lot of Ancient Greek texts available in PDF documents online. The heyday of scholarly activity for many of the pseudepigraphical texts, for instance, was in the late 1800s or early 1900s (though work on many of them is picking up again). That means that the standard editions of many of these texts, or at least good copies of them, are available for free online (thank you!). There is also the Patrologia Graeca, which has an estimated 50 million+ words of Greek text, many different editions of the NT, and so forth. In other words, a veritable bounty of Greek texts is available for free online.

Problem: As anyone who has tried to read a long PDF already knows, PDFs have their difficulties, especially on small screens. What can be done to read all these different texts in a more readable format?

Solution: Enter OCR, or Optical Character Recognition. If you don’t know what this is, it is simply the means by which a computer “grabs” text from an image and converts it into editable text. Pretty cool. OCR works great for English and is easy to get for free. What about Ancient Greek? Well, that is a bit more of a problem. I recently set out to find a good way to OCR Ancient Greek text. Here I compare three different means of OCR-ing Ancient Greek texts to see which one gives the most reliable and useful results. The results were surprising, to me at least.

The contestants: the OCR engines tested

I ended up testing three different means of OCR-ing Ancient Greek text. My initial set was two OCR engines that have been trained specifically for working with Ancient Greek. These are both free.

  1. Ancient Greek OCR: Ancient Greek OCR is free software (the program itself is called gImageReader) for accurately converting scans of printed Greek into Unicode text and PDF files, using the Tesseract OCR engine trained for Ancient Greek typography, syntax, and vocabulary. Here is a description of how Tesseract was trained for Ancient Greek and the difficulties involved.
  2. Antigrapheus: basically an online version of the above software which “aims to be no better or worse than you would get from downloading, installing, and configuring Tesseract, but without the need to do all those things.”

These two recommend themselves as two OCR engines which have been trained specifically for OCR-ing Ancient Greek text. I’ll cover the third OCR tool later.
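
For anyone who would rather run Tesseract locally than through a front end, the call can be sketched roughly as follows. This is a minimal sketch, not the setup used for the tests here: it assumes the tesseract command-line program is installed and that Ancient Greek trained data is available under the language code grc. The image name 4baruch.png and both function names are hypothetical placeholders.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, lang: str = "grc") -> list[str]:
    """Build the command line for one OCR run.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "grc") -> str:
    """Run tesseract on one image and return the recognized text."""
    # Assumption: the tesseract binary is on PATH and the trained data
    # for `lang` is installed alongside it.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Hypothetical file name; only the command is shown so nothing is executed here.
print(" ".join(build_tesseract_command("4baruch.png")))
```

Calling ocr_image("4baruch.png") would then return the recognized text as a string, ready to be compared against the other tools.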

The texts

To test the OCR engines I fed in one picture of clean, neat, very legible Greek text (taken with my phone) and one more difficult example copied from a PDF of the Patrologia Graeca.[1] For reference, here are the pictures:

4 Baruch 1.1-8
John Moschus “Spiritual Meadows” chapter 7

The first text is obviously cleaner and easier to read. I did nothing to clean up the quality of the pictures, as is generally suggested for better OCR results. Let’s take a look at the results turned out by these two engines.

The results

To put it bluntly, and with all due respect to those who have done the work of putting these OCR engines together, the results were underwhelming. So, after my initial results, I entered a dark-horse candidate into the running. In a stroke of necessity-fueled genius I thought, “Why not try a Modern Greek OCR engine?” Since all the letters are the same and there is more impetus to design a high-functionality OCR engine for Modern Greek, it seemed that such an engine would at least give the words accurately (for the most part) and might even do better than the two specialized engines. Obviously, Modern Greek does not have all the accents, breathing marks, and iota subscripts found in printed Ancient Greek texts, but that might be a small loss if the engine can at least copy the letters accurately enough to make the text readable. The basic idea was that reading a text where every accent is converted to an acute is better than reading a text where an upsilon gets converted into two iotas.

So I tried out i2OCR. The results were astounding! Following is a brief comparison (one sentence from each of the test passages) showing the results provided by the different OCR tools. The results speak for themselves.

Antigrapheus: 5 καὶ ἐλάλησεν Ἱερεμίας λέγων» Κύριε παντοκράτωρ, παραδιδι,»ς τὴυ πύλῃ, Τήν ἑκλεῃή.) εἰς χεῖρας τῶν χαλδωων, ἵνα καυχήσηηιι ὸ βασιλεὺς μετὰ τοῦ πλήθους τοῦ λαοῦ αῡωῡ, καὶ εἴπῃ ὅτι. ’Ἱηχυσα ἐπὶ τήν ιεράν Πύλῃ) τοῦ Θεωῃ
Ancient Greek OCR: 5 καὶ ἐλάλησεν Ἱερεμίας λεγωιι» κύριε παντοκράτωρ, ιιειιιιιδιδιιις τὴιι πόλιν τήν εκλεκιὴι) εἱς χεῖρας τῶν χειλδιιιων, ἵνα κωιχιιιιιιηιι ὸ βασιλεὺς μετὰ τοῦ πλήθους
τιιῡ λωιῡ αὺιιιῡ, καὶ εἰιιῃ ὅτι. ’Ἱηχιισα επὶ τήν ιεράι, Πόλιν τοῦ Θεοῡ;
i2OCR: 5 Καὶ ἐλάλησεν Ἱερεμίας λέγων’ Κύριε παντοκράτωρ, παραδίδως τὴν πόλιν τὴν ἐκλεκτὴν εἰς χεῖρας τῶν Χαλδαίων, ἵνα καυχήσηται ὁ βασιλεὺς μετὰ τοῦ πλήθους
τοῦ λαοῦ αὐτοῦ, καὶ εἴπῃ ὅτι, Ἴσχυσα ἐπὶ τὴν ἱερὰν πόλιν τοῦ Θεοῦ;
1st Test: 4 Baruch 1.5

There are some obvious points of struggle on this text. The clear winner is the Modern Greek i2OCR. It even wins on matters of accents and breathing marks (though not perfect, by any means). The second example is below:

Antigrapheus: Γέρων εις ικάαηω ιν πῃ λαύρᾳ κῶν Πυργίων, οἱ οὖν «μασι-ἀρα τῆς αὐτῆς λαύρας καὶ οι λοιποι Δᾰιλψοι, ιπειιὴ ἦν τιλειωιιεις αὐτῶν ο ἡγούμενος, ὡς μιγιιν χιι ιῡάμιαωυ ωτι, ῃέλησαν αὑτὸν ποιῑμ «ιι ηγοῡμενινι.
Ancient Greek OCR: Γέρων εις ικάαηω ιν πῃ λαύρᾳ κῶν Πυργίων, οἱ
οὖν «μασι-ἀρα τῆς αὐτῆς λαύρας καὶ οι λοιποι
Δᾰιλψοι, ιπειιὴ ἦν τιλειωιιεις αὐτῶν ο ἡγούμενος,
ὡς μιγιιν χιι ιῡάμιαωυ ωτι, ῃέλησαν αὑτὸν ποιῑμ
«ιι ηγοῡμενινι.
i2OCR: Γέρων τις ἐχάθητο ἓν τῇ λαύρᾳα τῶν Πυργίων. Οἱ
οὖν πρεσθύτεροι τῆς αὑτῆς λαύρας καὶ οἱ λοιποὶ
ἁδελφοὶ, ἐπειδὴ ἂν τελειωθεὶς αὐτῶν ὁ ἡγούμενος,
ὡς µέγαν καὶ εὐάρεστον ὄντα, Ἰθέλησαν αὐτὸν ποιῆ-
σαι ἠγούμενου.
2nd Test: “Spiritual Meadows” chp 7, first sentence

Again, when compared to the image, the clear winner is i2OCR. This image quality is lower, mainly due to the original printing being fairly low quality and, now, getting on towards 200 years old. The bigger difficulty seems to be the odd font which the Patrologia Graeca series uses, as that accounts for most of the errors which have slipped into the i2OCR results. While the font may have been standard for its day (I have no idea if it was or not), it is clearly not one that has been extensively trained into this OCR engine.
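
“Clear winner” can also be put into numbers. A common yardstick is the character error rate: the edit distance between the OCR output and a trusted transcription, divided by the length of the transcription. The following is a minimal sketch rather than the method used in this comparison. It assumes the i2OCR reading of two words from the first test is correct and uses it as a stand-in reference; a real measurement would need a hand-checked transcription.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text."""
    # NFC makes precomposed and decomposed accents compare as equal.
    reference = unicodedata.normalize("NFC", reference)
    ocr_output = unicodedata.normalize("NFC", ocr_output)
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Assumption: the i2OCR reading is taken as correct for these two words.
REFERENCE = "τῶν Χαλδαίων"
OUTPUTS = {
    "Antigrapheus": "τῶν χαλδωων",
    "Ancient Greek OCR": "τῶν χειλδιιιων",
    "i2OCR": "τῶν Χαλδαίων",
}

for tool, text in OUTPUTS.items():
    print(f"{tool}: {char_error_rate(REFERENCE, text):.2f}")
```

A lower number is better. Note that this counts a wrong accent and a wrong letter as equally bad, which is stricter than the readability test applied above.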

The verdict: why not use a Modern Greek OCR?

All of the OCR tools tested are free. They are all easily accessible. And, there is a clear winner when it comes to quality: the Modern Greek OCR tool. And, on top of the better accuracy, it was also the fastest to return its results every time.

After thinking about it more, this is actually not surprising. We tend to think of Modern Greek as a monotonic language (that is, it uses only one accent mark and no breathing marks when written). That is true, but it has only been true for 40-50 years. Before that, government-sanctioned Greek still used the same accents that we find in printed Ancient Greek texts, as well as the iota subscript. So for an OCR tool to be useful for scanning a wide variety of pre-digital-age texts, it needs to be trained to handle most of the symbols used in Ancient Greek texts. And since there is much more reason to develop a highly skilled Modern Greek OCR engine than an Ancient Greek one, it makes sense that the Modern Greek engines give better results.
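
The relationship between the two systems is easy to see at the Unicode level. The sketch below folds polytonic text down to a monotonic-looking form: grave and circumflex accents become an acute, and breathing marks and iota subscripts are dropped. It is an illustration only and makes simplifying assumptions (real monotonic spelling, for example, leaves most monosyllables unaccented, which this does not attempt). The function name is made up for this example.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent (the single monotonic accent)
ACCENTS = {
    "\u0300",  # combining grave
    "\u0301",  # combining acute
    "\u0342",  # combining Greek perispomeni (circumflex)
}
DROPPED = {
    "\u0313",  # smooth breathing
    "\u0314",  # rough breathing
    "\u0345",  # iota subscript
}


def to_monotonic(text: str) -> str:
    """Fold polytonic diacritics into a single acute accent."""
    pieces = []
    # NFD splits each precomposed letter into a base letter plus marks.
    for char in unicodedata.normalize("NFD", text):
        if char in ACCENTS:
            pieces.append(ACUTE)
        elif char in DROPPED:
            continue
        else:
            pieces.append(char)
    # NFC recombines base letters and accents into single code points.
    return unicodedata.normalize("NFC", "".join(pieces))


# Words taken from the i2OCR output of the first test.
print(to_monotonic("τὴν πόλιν τὴν ἐκλεκτὴν εἰς χεῖρας τῶν Χαλδαίων"))
```

Going in this direction is mechanical; going the other way is not, which is why an engine that only knew monotonic Greek could not have produced the breathing marks seen in the i2OCR output.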

My plan going forward? I will be using the i2OCR tool, and I may see whether there are any more online tools that also work well. The output can easily be pasted into a LibreOffice document to use its Ancient Greek spell-checker, which will quickly point out most of the places where the scan needs manual correction.
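
The spell-check step amounts to looking each word up in a word list and flagging the misses. Roughly, and as a sketch only: the tiny lexicon below is a hypothetical stand-in for a real Ancient Greek dictionary, the function name is invented, and none of this reflects how LibreOffice itself is implemented.

```python
import re
import unicodedata


def flag_unknown_words(ocr_text: str, lexicon: set[str]) -> list[str]:
    """Return the words of ocr_text that are missing from lexicon."""
    # Normalize both sides so the same accented letter always compares equal.
    text = unicodedata.normalize("NFC", ocr_text)
    known = {unicodedata.normalize("NFC", word).casefold() for word in lexicon}
    flagged = []
    for token in re.findall(r"\w+", text):
        if token.casefold() not in known:
            flagged.append(token)
    return flagged


# Hypothetical four-word lexicon; a real one would hold many thousands of forms.
LEXICON = {"καὶ", "ἐλάλησεν", "Ἱερεμίας", "λέγων"}

# Opening words of the first test as returned by Ancient Greek OCR.
SAMPLE = "καὶ ἐλάλησεν Ἱερεμίας λεγωιι"

print(flag_unknown_words(SAMPLE, LEXICON))
```

Anything flagged this way still has to be checked against the page image by eye, since a rare but correct word will be flagged just like a misread one.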

The results of this experiment have moved me from seeing fairly little value in OCR-ing Ancient Greek text to seeing a clear path forward for doing it well, even when the input text is not of the greatest quality.

[1] The first is a picture from the text of 4 Baruch according to Herzer’s edition. This is not in the public domain and is used for illustrative purposes only (the text is essentially identical to the versions which are in the public domain; it is the apparatus that is valuable in this work, and that is not pictured). The second is a picture from John Moschus’ “Spiritual Meadows” taken from Patrologia Graeca column 2857, chapter 7. This is in the public domain.