The promise of OCR

Fact: there are a lot of Ancient Greek texts available as PDF documents online. The heyday of scholarly activity for many of the pseudepigraphical texts, for instance, was in the late 1800s or early 1900s (though work on many of them is picking up again). That means that the standard editions of many of these texts, or at least good copies of them, are available for free online (thank you archive.org!). There is also the Patrologia Graeca, which has an estimated 50 million+ words of Greek text, many different editions of the NT, and so forth. In other words, a veritable bounty of Greek texts is available for free online.

Problem: Of course, as anyone who has tried to read a long PDF already knows, PDFs have their difficulties, especially on small screens. What can be done to get all these texts into a more readable format?

Solution: Enter OCR, or Optical Character Recognition. If you don’t know what this is, it is simply the means by which a computer “grabs” text from an image and converts it into editable text. Pretty cool. OCR works great for English and is easy to get for free. What about Ancient Greek? Well, that is a bit more of a problem. I recently set out to find a good way to OCR Ancient Greek text. Here I compare three different means of OCR-ing Ancient Greek texts to see which one gives the most reliable and useful results. The results were surprising, to me at least.

The contestants: the OCR engines tested

I ended up testing three different means of OCR-ing Ancient Greek text. My initial set was two OCR engines that have been trained specifically for working with Ancient Greek. These are both free.

Ancient Greek OCR: Ancient Greek OCR is free software (it’s called gImageReader) to accurately convert scans of printed Greek into Unicode text and PDF files, using the Tesseract OCR engine trained for Ancient Greek typography, syntax, and vocabulary. Here is a description of how Tesseract was trained for Ancient Greek and the difficulties involved.
Antigrapheus: basically an online version of the above software which “aims to be no better or worse than you would get from downloading, installing, and configuring Tesseract, but without the need to do all those things.”

Both recommend themselves because they have been trained specifically for OCR-ing Ancient Greek text. I’ll cover the third OCR tool later.

For an updated take on using a newer version of Tesseract and its seriously good results (my new default OCR tool), check out the Power of Tesseract.
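For readers who would rather script the underlying engine than use either front end, here is a minimal sketch of calling the Tesseract command-line tool on a page image. It assumes `tesseract` is installed along with the `grc` (Ancient Greek) language data; the helper names and file names are hypothetical, and this is not necessarily how either tool above is configured.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, languages=("grc",)):
    """Build the argument list for `tesseract IMAGE OUTPUTBASE -l LANGS`."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l",
        "+".join(languages),  # several language models are joined with "+"
    ]


def ocr_image(image_path, output_base, languages=("grc",)):
    """Run Tesseract and return the recognized text.

    Assumes the `tesseract` binary is on PATH and the requested
    language data files are installed.
    """
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)
    # For plain-text output Tesseract writes to OUTPUTBASE.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With a scan saved as `page.png` (a made-up name), `ocr_image("page.png", "page_out")` would write `page_out.txt` and return its contents. Passing `languages=("grc", "ell")` asks Tesseract to combine the Ancient and Modern Greek models, which may be worth trying in light of the results below.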

The texts

To test the OCR engines I fed in one picture of clean, neat, very legible Greek text (taken with my phone) and one more difficult example copied from a PDF of the Patrologia Graeca.[1] For reference, here are the pictures:

John Moschus “Spiritual Meadows” chapter 7

The first text is obviously cleaner and easier to read. I did nothing to clean up the quality of the pictures, as is generally suggested for better OCR results. Let’s take a look at the results turned out by the two Ancient Greek engines.

The results

To put it bluntly, and with all due respect to those who have done the work of putting these OCR engines together, the results were underwhelming. So, after my initial results, I entered a dark-horse candidate into the running. In a stroke of necessity-fueled genius I thought, “Why not try a Modern Greek OCR engine?” Since all the letters are the same and there is more impetus to design a high-functioning OCR engine for Modern Greek, it seemed that it would at least give the words accurately (for the most part) and might even do better than the two Ancient Greek engines.

Obviously, Modern Greek does not have all the accents, breathing marks, and iota subscripts found in printed Ancient Greek texts, but that might be a small loss if the engine can at least copy the letters accurately enough to make the text readable. The basic idea was that reading a text where every accent is converted to an acute is better than reading a text where an upsilon gets converted into two iotas.

So I tried out i2OCR. The results were astounding! Following is a brief comparison (one sentence from each of the test passages) showing the results provided by the different OCR tools. The results speak for themselves.

Antigrapheus	5 καὶ ἐλάλησεν Ἱερεμίας λέγων» Κύριε παντοκράτωρ, παραδιδι,»ς τὴυ πύλῃ, Τήν ἑκλεῃή.) εἰς χεῖρας τῶν χαλδωων, ἵνα καυχήσηηιι ὸ βασιλεὺς μετὰ τοῦ πλήθους τοῦ λαοῦ αῡωῡ, καὶ εἴπῃ ὅτι. ’Ἱηχυσα ἐπὶ τήν ιεράν Πύλῃ) τοῦ Θεωῃ
Ancient Greek OCR	5 καὶ ἐλάλησεν Ἱερεμίας λεγωιι» κύριε παντοκράτωρ, ιιειιιιιδιδιιις τὴιι πόλιν τήν εκλεκιὴι) εἱς χεῖρας τῶν χειλδιιιων, ἵνα κωιχιιιιιιηιι ὸ βασιλεὺς μετὰ τοῦ πλήθους τιιῡ λωιῡ αὺιιιῡ, καὶ εἰιιῃ ὅτι. ’Ἱηχιισα επὶ τήν ιεράι, Πόλιν τοῦ Θεοῡ;
i2OCR	5 Καὶ ἐλάλησεν Ἱερεμίας λέγων’ Κύριε παντοκράτωρ, παραδίδως τὴν πόλιν τὴν ἐκλεκτὴν εἰς χεῖρας τῶν Χαλδαίων, ἵνα καυχήσηται ὁ βασιλεὺς μετὰ τοῦ πλήθους τοῦ λαοῦ αὐτοῦ, καὶ εἴπῃ ὅτι, Ἴσχυσα ἐπὶ τὴν ἱερὰν πόλιν τοῦ Θεοῦ;

1st Test: 4 Baruch 1.5

There are some obvious points of struggle on this text. The clear winner is the Modern Greek i2OCR. It even wins on matters of accents and breathing marks (though not perfect, by any means). The second example is below:

Antigrapheus	Γέρων εις ικάαηω ιν πῃ λαύρᾳ κῶν Πυργίων, οἱ οὖν «μασι-ἀρα τῆς αὐτῆς λαύρας καὶ οι λοιποι Δᾰιλψοι, ιπειιὴ ἦν τιλειωιιεις αὐτῶν ο ἡγούμενος, ὡς μιγιιν χιι ιῡάμιαωυ ωτι, ῃέλησαν αὑτὸν ποιῑμ «ιι ηγοῡμενινι.
Ancient Greek OCR	Γέρων εις ικάαηω ιν πῃ λαύρᾳ κῶν Πυργίων, οἱ οὖν «μασι-ἀρα τῆς αὐτῆς λαύρας καὶ οι λοιποι Δᾰιλψοι, ιπειιὴ ἦν τιλειωιιεις αὐτῶν ο ἡγούμενος, ὡς μιγιιν χιι ιῡάμιαωυ ωτι, ῃέλησαν αὑτὸν ποιῑμ «ιι ηγοῡμενινι.
i2OCR	Γέρων τις ἐχάθητο ἓν τῇ λαύρᾳα τῶν Πυργίων. Οἱ οὖν πρεσθύτεροι τῆς αὑτῆς λαύρας καὶ οἱ λοιποὶ ἁδελφοὶ, ἐπειδὴ ἂν τελειωθεὶς αὐτῶν ὁ ἡγούμενος, ὡς µέγαν καὶ εὐάρεστον ὄντα, Ἰθέλησαν αὐτὸν ποιῆ- σαι ἠγούμενου.

2nd Test: “Spiritual Meadows” chp 7, first sentence

Again, when compared to the image, the clear winner is i2OCR. The image quality here is lower, mainly because the original printing was fairly low quality and is now getting on towards 200 years old. The bigger difficulty seems to be the odd font which the Patrologia Graeca series uses, as that accounts for most of the errors which have slipped into the i2OCR results. While the font may have been standard for its day (I have no idea if it was or not), it is clearly not one that has been extensively trained into this OCR engine.
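I judged the winner by eye, but the comparison could be put on a number. What follows is a rough sketch, not something I ran for this post: a character error rate (edit distance divided by the length of a hand-corrected reference), with an option to ignore accents, breathing marks, and iota subscripts so that letter errors can be counted separately from diacritic errors. The function names are my own, and the reference transcription is something you would have to supply yourself.

```python
import unicodedata


def strip_diacritics(text):
    """Remove accents, breathings, and iota subscripts, keeping base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks left behind by decomposition.
    base_only = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", base_only)


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis, ignore_diacritics=False):
    """Edit distance between OCR output and reference, per reference character."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if ignore_diacritics:
        ref, hyp = strip_diacritics(ref), strip_diacritics(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

For example, taking πόλιν as the reference for one word of the first test, the reading πύλῃ counts as wrong even with `ignore_diacritics=True`, because two letters differ and one is missing. A reading such as είπη for εἴπῃ, by contrast, scores zero once diacritics are ignored. That is the distinction between "wrong accents" and "wrong letters" that made the Modern Greek engine attractive in the first place.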

The verdict: why not use a Modern Greek OCR?

All of the OCR tools tested are free. They are all easily accessible. And, there is a clear winner when it comes to quality: the Modern Greek OCR tool. And, on top of the better accuracy, it was also the fastest to return its results every time.

After thinking about it more, this is actually not surprising. We tend to think of Modern Greek as a monotonic language (that is, it uses only one accent mark and no breathing marks when written). That is true. But it has only been true for about 40 years: the monotonic system was officially adopted in 1982. Before that, government-sanctioned Greek still used the same accents that we find in printed Ancient Greek texts, as well as the iota subscript. So for an OCR tool to be useful for scanning a wide variety of pre-digital-age texts, it needs to be trained to handle most of the symbols used in Ancient Greek texts. And since there is much more reason to develop a highly capable Modern Greek OCR engine than an Ancient Greek one, it makes sense that these produce better results.
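To make the monotonic/polytonic distinction concrete, here is a small sketch that checks whether a piece of OCR output contains any polytonic marks at all. It is only an illustration: the function name and the choice of marks to look for are my own assumptions, and a plain acute accent is deliberately not counted, since Unicode treats it the same as the monotonic tonos.

```python
import unicodedata

# Combining marks used in polytonic Greek but not in monotonic Greek.
POLYTONIC_MARKS = {
    "\u0300",  # grave accent (varia)
    "\u0313",  # smooth breathing (psili)
    "\u0314",  # rough breathing (dasia)
    "\u0342",  # circumflex (perispomeni)
    "\u0345",  # iota subscript (ypogegrammeni)
}


def is_polytonic(text):
    """True if the text contains at least one polytonic-only mark."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return any(ch in POLYTONIC_MARKS for ch in decomposed)
```

Run over the output of a Modern Greek OCR tool, a check like this gives a quick indication of whether the engine is actually emitting breathings and circumflexes or flattening everything to a single accent.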

My plan going forward? I will be using the i2OCR tool, and maybe see if there are any more online that also work well. The output can easily be pasted into a LibreOffice document to use its Ancient Greek spell-checker, which will quickly point out most places that need manual correction.
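The same spell-check idea can be sketched in a few lines for anyone who wants to script it. This is a simplified stand-in, not what LibreOffice does internally: it assumes you already have a set of known Greek word forms (the `known_words` argument, which is hypothetical here) and simply lists the tokens that are not in it.

```python
import unicodedata

# Punctuation to trim from the edges of each token.
PUNCTUATION = ".,;:·!?()[]«»\"'’‘“”-"


def flag_unknown_words(text, known_words):
    """Return the tokens of `text` that do not appear in `known_words`."""
    lexicon = {unicodedata.normalize("NFC", w).lower() for w in known_words}
    flagged = []
    for raw in unicodedata.normalize("NFC", text).split():
        token = raw.strip(PUNCTUATION)
        if token and token.lower() not in lexicon:
            flagged.append(token)
    return flagged
```

Given a lexicon containing καὶ, ἐλάλησεν, Ἱερεμίας, and λέγων, the opening of the Ancient Greek OCR output from the first test would have only λεγωιι flagged for manual attention. How useful this is depends entirely on how complete the word list is.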

The results of this experiment have moved me from seeing fairly little value in OCR-ing Ancient Greek text to seeing a clear path forward for doing it well, even when the input text is not the greatest quality.

Update: As of 11/28/2020, I have been using this Modern Greek OCR again and just want to report that it has gotten even better! Not only is the text cleaner and the interface nicer, but they have made a huge improvement in letting the user upload PDF files to scan, rather than having to convert them to image files on your own. It handles all the necessary manipulations itself. Very useful improvement.

[1] The first is a picture from the text of 4 Baruch according to Herzer’s edition. This is not in the public domain and is used for illustrative purposes only (the text is essentially identical to the versions which are in the public domain; it is the apparatus that is valuable in this work, and that is not pictured). The second is a picture from John Moschus’ “Spiritual Meadows” taken from Patrologia Graeca column 2857, chapter 7. This is in the public domain.

8 thoughts on “OCR Ancient Greek texts: the best tool may be a surprise”

Dewayne Dulaney says:

May 22, 2020 at 9:29 am

This could be helpful in producing epub or other digital formats of Greek works, maybe. The epubs I’ve seen had badly garbled Greek text, much worse than the results you had at first. Don’t know if this is an OCR issue or a problem with the epub formatting. Any thoughts?


1. Nathaniel J. Erickson says:
  
  May 22, 2020 at 10:06 am
  
  I’ve never tried anything with epub and Greek. Epub does support Unicode text (UTF-8), so in theory it should not have any trouble handling Ancient Greek text, provided whoever puts the Greek works together uses a font that has the required characters and builds the file the right way. That being said, I’ve had lots of trouble with various online text-converting services for creating e-reader files, which really struggle with non-English input. I suspect there are two issues: poor OCR that is not corrected and/or using a service to create the epub file that is not set up to work with non-English texts. I’ll have to play with this some. It would be really nice to be able to put some more Greek texts into an e-reader file format.
  
  
  1. Dewayne Dulaney says:
    
    May 22, 2020 at 11:18 am
    
    Yes, it would be nice. Don’t remember if I’ve looked at any Greek works in Kindle or not. Would be nice to have some for that format, as well. I especially like epubs and pdfs, though. Especially if they are searchable with Greek input terms.
    
  2. Nathaniel J. Erickson says:
    
    May 22, 2020 at 12:29 pm
    
    I can now verify that it is easy to make epub Greek texts. I just used calibre (https://calibre-ebook.com/), which converts a variety of file types to epub, as well as various other e-reader formats, including Kindle. All you need is clean Greek text to put in. Works with .docx files, HTML, different e-reader formats, etc. In principle, you could OCR the entire Patrologia Graeca if you wanted and then turn it into an ebook with this software. I imagine shorter works would be more in order, though.
    
Dewayne Dulaney says:

May 22, 2020 at 11:19 am

And I like it when they have hyperlinks and bookmarks for longer works.


Pingback: OCR Ancient Greek text: the power of Tesseract – NT Greek et al.
Danse Noble says:

September 17, 2024 at 8:32 am

I have similar observations to yours: in spite of all the hype lavished on Tesseract, its Polytonic trained data consistently mishandles κ, χ, θ and β, which shows that it still needs a lot of work before it does its job properly. In comparison, a typical Modern Greek OCR software does a much better job. Having said that, using Modern Greek OCR is not an option for me, as I realized after spending a seemingly infinite amount of time fixing all the wrong and missing diacritics. I am looking for somebody providing better trained data for Polytonic Greek than the default one offered by the Tesseract project right now.


1. Nathaniel J. Erickson says:
  
  September 17, 2024 at 2:01 pm
  
  I haven’t done anything with OCR for a while. I suspect that the existence of the Thesaurus Linguae Graecae puts a bit of a damper on serious work to enhance Ancient Greek OCR, since most texts have already been digitized there. Doesn’t help people who aren’t in a position to pay for it, though.
  Personally, I’d advise focusing on making sure the characters are right and ignoring the diacritical markings. That assumes a certain degree of Greek proficiency, but it is kind of fun if you can do it.
  

NT Greek et al.

Thinking about Koine Greek, the New Testament, and related topics

OCR Ancient Greek texts: the best tool may be a surprise
