Khmer OCR: Convert hard-copy Khmer text to digital

Have you ever thought that text from a book or a sheet of paper could be converted into a word document on your computer? How much easier would life be if you did not have to re-type everything on your laptop?

Now, with Khmer OCR software, you can simply scan the paper, upload it, and the software will generate the Khmer characters in digital format for you. Never again will you have to spend hours typing reference documents or texts for school! Geeks In Cambodia spoke with Mr. Ly Sovannra, one of the members of this project.

(This interview has been edited for flow and clarity.)

//

Can you introduce yourself and explain what Khmer OCR is?

Mr. Ly Sovannra gave an interview to Geeks In Cambodia about Khmer OCR

I’m Mr. Ly Sovannra, and I’m part of the Khmer Optical Character Recognition (OCR) project team alongside four other members. They are: Danh Hong (Team Leader), E Tola (Font Designer), Thim Chanrithy (Programming), and Uch Sarak (Web Developer).

Mr. Danh Hong created the Khmer Unicode font, and he has been researching Khmer OCR since 2012. We have spoken with many people about supporting this project, but have not received any support. Therefore, the five of us have committed to doing this ourselves.

With Khmer OCR, you can convert documents into digital format. Once these are uploaded to the Internet, they will be easier to search and archive. In May this year, we started testing the software for the first time. After that, we designed the font and fed in more data. In our first accuracy test, we got around 50% to 60%. This was only for the font Khmer OS Battambang at 26pt.

How does Khmer OCR work?

We use Tesseract, open-source software by Google, to do OCR. Simply input an image of a document with text, and Tesseract will convert it into text on the computer. However, as it is only 50% to 60% accurate, we have to train the data to increase accuracy. This requires us to replace the wrong words with the right ones. It is currently still in beta, as we wish to train the software to work with many Khmer fonts. Anyone can train the data just by going to our website.

At the end of this month, we hope to get 60% accuracy for the Khmer OS font.
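The interview does not say how the team measures accuracy. As one illustration, the sketch below computes a character-level accuracy by comparing OCR output against hand-corrected reference text using edit distance. The function names and the choice of metric are assumptions made for this example, not the project's own method. Characters are counted as Unicode code points, so a Khmer consonant and its attached vowel sign count separately.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, reference: str) -> float:
    """Share of the reference text recovered correctly (hypothetical metric).

    Defined here as 1 - edit_distance / len(reference), clamped at 0.
    """
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference)
    return max(0.0, 1.0 - errors / len(reference))
```

With a metric like this, each correction a volunteer makes to the OCR output also produces reference text that later results can be scored against.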

What was the reason behind starting Khmer OCR?

We feel that Cambodia’s development has come late, and many young people now need documents for education and research. I’ve spoken with a law association that scans many documents. However, they are unable to search them for what they need. By converting these hard-copy texts into digital format, people can search them, put them on websites or even make e-books.

What are some of the challenges you have faced?

Funding is a huge problem, as having it would dramatically speed up our process. Though we have spoken to many people, we have not received any. Right now, Society For Better Books In Cambodia (an American NGO) is providing us with one staff member to train the data.

What are the next steps for Khmer OCR?

We are planning to build a mobile application for romanisation, and also one for foreigners, where you can just take a picture and upload it for translation.

However, this is dependent on when we gather a large enough OCR database to create a mobile application.

Do you have any advice for Cambodians who want to create such beneficial software?

Support for the Khmer language in new technology is progressing very slowly. I implore those who have the skills to spend some time improving it.

//

Sovannra works a full-time job unrelated to Khmer OCR, and spends every night training the data for the benefit of all Cambodians. If you would like to volunteer by training the software or would like to contact Sovannra, you can do so by emailing him at sovannra@gmail.com.