A while back I wrote a blog post on OCR technologies available for Ancient Greek. My finding at that point was that the best freely available and easily accessible tool was probably the i2OCR website on its Modern Greek setting (its “Greek Ancient” setting does not, in my opinion, perform any better than Modern Greek). I tested the site against a couple of attempts to train the Tesseract engine (an OCR engine originally developed at Hewlett-Packard, later sponsored by Google, and now an open-source project hosted on GitHub) for Ancient Greek. I was underwhelmed by those tools, removed Tesseract from my computer, and did not think about it again…until today.
The return of Tesseract
Fast-forward to today. I was poking around the internet looking for a website or program that can OCR the German Fraktur typeface (commonly known as Gothic script). This is the old way German was printed, and it has a distinctly runic character about it. I was hoping to turn some PDFs into a file format that is more forgiving for reading on a smartphone. The most commonly cited solutions to this problem all pointed to the Tesseract engine, so I gave it another go.
In becoming reacquainted with Tesseract I was pleasantly surprised to find that it has grown up considerably since those digital pioneers were trying to train it to read Ancient Greek. There is also a really handy project from the Mannheim University Library that makes Tesseract far easier to get onto a personal computer that is not running Linux. While I was testing the Fraktur capabilities of Tesseract (which are pretty good out of the box, in case you are wondering, but could certainly stand some further training), I figured I might as well give its Modern Greek capabilities a go, seeing how Modern Greek OCR technologies seem to handle Ancient Greek text well. The results?
The power of Tesseract
In short…it works great!!!!
Running the Tesseract OCR tool on a text I had previously OCR’d at i2OCR produced markedly better results. On top of its accuracy, it also handles complete PDFs in one pass, which is far faster than the website. i2OCR converts a PDF into a series of pictures, which can then be OCR’d manually one at a time. That is much faster than converting and splitting the PDF yourself and then uploading the results one page at a time, but it is still tedious to process each page by hand. The greatest area of difficulty for Tesseract’s Modern Greek setting is, predictably, telling rough breathing marks from smooth ones. This is a forgivable shortfall. As a real test of its Ancient Greek OCR prowess, I ran a passage from the Patrologia Graeca through the OCR engine. Tesseract handled the much more difficult Patrologia Graeca text, with its odd, small font and its old, cheap printing quality, admirably well. Tesseract is certainly not up to the quality of a trained human eye in its character recognition, but it is close enough to be meaningfully effective without much oversight.
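For anyone who wants to see how much of the remaining error really comes down to breathing marks, here is a minimal Python sketch. It is not part of Tesseract or i2OCR; the function names are my own invention, and it assumes you have a hand-corrected reference text whose words line up one-to-one with the OCR output. It folds away the smooth (psili) and rough (dasia) breathings and reports the words that differ only in that respect.

```python
import unicodedata

# Combining marks used for Greek breathings once text is decomposed (NFD).
SMOOTH_BREATHING = "\u0313"  # psili
ROUGH_BREATHING = "\u0314"   # dasia


def fold_breathings(text):
    """Remove smooth and rough breathings, leaving accents and letters intact."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if ch not in (SMOOTH_BREATHING, ROUGH_BREATHING)
    )
    return unicodedata.normalize("NFC", kept)


def breathing_only_errors(reference, ocr_output):
    """Pair words by position and return those that differ only in breathing.

    Assumption: both texts contain the same words in the same order.
    """
    errors = []
    for ref_word, ocr_word in zip(reference.split(), ocr_output.split()):
        ref_word = unicodedata.normalize("NFC", ref_word)
        ocr_word = unicodedata.normalize("NFC", ocr_word)
        same_after_folding = fold_breathings(ref_word) == fold_breathings(ocr_word)
        if ref_word != ocr_word and same_after_folding:
            errors.append((ref_word, ocr_word))
    return errors
```

Calling breathing_only_errors on a reference line and the matching OCR line returns a list of (reference word, OCR word) pairs. If the OCR output drops or adds words, the pairing by position goes out of step, so this is only useful on short, already-aligned passages.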
How to acquire (and use) Tesseract
The easiest way to acquire Tesseract is from the Mannheim University Library’s GitHub page. It is available there as a download for Windows machines (I imagine there is a way to get Tesseract for Mac, but not being a Mac user, I have never bothered to find out). Follow the instructions for installation. When asked to choose components to add, make sure the installer is grabbing the Greek tools. There is a Modern Greek and a Greek Script language setting. Modern Greek works better. I imagine Greek Script is meant for something more like Greek handwriting. If you miss downloading them at this point, it is not a big deal. They can be downloaded later from within the graphical user interface.
After installing Tesseract you may be disappointed to learn that the program does not yet exist on your computer, at least not in the sense that most of us think of programs as existing. If you are the sort who loves to run programs from the command prompt, then you are set to go! Most of us will need to download a Graphical User Interface (GUI) in order to actually use Tesseract. The GUI is simply the part of the program that allows you to click on things to make the program do things (in technical speak). gImageReader is a free and useful GUI for using Tesseract on whatever document you please. Once downloaded, it takes care of all the backend work of making the Tesseract OCR tool actually run. It can load PDFs and other image formats, and it outputs either plain text or text that retains formatting data and can be converted into a PDF. The results can be edited within the program while comparing them to the image from which they came.
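For the command-prompt crowd, here is a rough Python sketch of what a single Tesseract run looks like. It assumes the tesseract program is on your PATH, that the Modern Greek data was installed (Tesseract's language code for it is "ell"), and that the pages have already been exported as image files; as far as I know, the command-line tool reads images rather than PDFs directly. The function names and file names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="ell"):
    """Return the argument list for one Tesseract run.

    "ell" is Tesseract's language code for Modern Greek.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="ell"):
    """Run Tesseract on one page image and return the recognized text."""
    command = build_tesseract_command(image_path, output_base, lang)
    # check=True raises an error if Tesseract exits with a failure code.
    subprocess.run(command, check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With a hypothetical page image named page1.png, ocr_image("page1.png", "page1") would write page1.txt next to it and return its contents. Swapping lang for another installed language code is how the same call would be pointed at a different script, Fraktur included.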
OCR of Ancient Greek: a state of the field
OCR’ing Greek texts on a small scale is at a point where it still requires a fair degree of oversight. Tesseract will take a reasonably clear input and return a basically correct output. Its biggest struggle is that it does not differentiate very well between text and structural material, such as line numbers, which can leave lines of nonsense scattered between perfectly recognized Greek text. Can’t have everything, at least not yet.
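Some of that nonsense can be swept up after the fact. The following is a simple heuristic of my own, not a Tesseract feature: it keeps blank lines and any line in which at least half of the non-space characters are Greek letters, and it drops the rest. The 0.5 threshold and the function names are arbitrary choices for illustration.

```python
import unicodedata


def is_greek_letter(ch):
    """True if the character is a letter whose Unicode name marks it as Greek."""
    return ch.isalpha() and "GREEK" in unicodedata.name(ch, "")


def greek_ratio(line):
    """Fraction of non-space characters in the line that are Greek letters."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_greek_letter(ch) for ch in chars) / len(chars)


def drop_structural_lines(text, min_ratio=0.5):
    """Keep blank lines and lines that are mostly Greek; drop everything else."""
    kept = []
    for line in text.splitlines():
        if not line.strip() or greek_ratio(line) >= min_ratio:
            kept.append(line)
    return "\n".join(kept)
```

This only catches junk that sits on its own line, such as a stray line number or a run of misread marginalia. A number that ends up inside an otherwise good line of Greek will pass through untouched, and Latin footnotes or headings would be thrown away along with the noise, so the output still wants a human eye.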
Summing up, using Tesseract trained for Modern Greek works surprisingly well and will be a go-to OCR resource for me in the future.