Assessment of Abbyy software for OCRing Romanized Tibetan with Diacritics
Contributor(s): Zach Rowinski
Assessment of Abbyy software (http://www.abbyy.com/) for OCRing romanized Tibetan text with diacritics:
It is hard to say without knowing how Abbyy OCR works. For example, if text recognition uses a probabilistic language model to make decisions about novel strings of letters, then it is unlikely to do well with Romanized Tibetan, since none of the unusual Romanized Tibetan words will be part of the model. In that case, supplying Abbyy OCR with a special-purpose list of Romanized Tibetan words may be useful, assuming that is possible. If instead the OCR recognizes words letter by letter and doesn't attempt any probabilistic context correction, it should do OK. The problem then is making sure the letters with diacritics are in the training model....
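To make the first concern concrete, here is a toy sketch of dictionary-based correction. It is not how Abbyy works (that is unknown here); it only assumes a recognizer that snaps each word to the closest entry in some vocabulary. The function `correct` and both word lists are made up for illustration.

```python
import difflib

def correct(word, vocabulary, cutoff=0.6):
    """Snap a recognized word to its closest vocabulary entry, if one is close enough.

    Hypothetical stand-in for a context-correction step; words with no
    sufficiently similar entry are returned unchanged.
    """
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Illustrative vocabularies, not real dictionaries.
english_vocab = ["charm", "chose", "sang", "song", "rays"]
tibetan_vocab = ["chos", "sangs", "rgyas", "bla", "ma"]

# With only an English vocabulary, a correctly read Tibetan word is "fixed" wrongly.
print(correct("chos", english_vocab))                  # chose

# Once the Tibetan word list is added, the exact match wins and the word survives.
print(correct("chos", english_vocab + tibetan_vocab))  # chos
```

The same effect would apply to words with diacritics: if the vocabulary only contains plain-ASCII near matches, a corrector of this kind may replace the correct reading with one of them.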
Unfortunately, I have no idea how Abbyy OCR works. It appears to be very advanced and customizable, so it seems like it could accommodate Romanized Tibetan with a little help.
I would suggest that Jeremy try to supply Abbyy with a specialized Romanized Tibetan vocabulary, complete with diacritics, if he can. (THL doesn't have anything like this, right?) Beyond this, Jeremy may want to check whether Abbyy has different language settings. It may be possible to run recognition using the settings for another language that uses diacritics…
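If no such vocabulary exists yet, one could be derived from any Romanized Tibetan text that is already in electronic form. The sketch below is only one possible approach: it assumes the source text is Unicode, and the function names `build_wordlist` and `diacritic_letters` are hypothetical. How the resulting list would be loaded into Abbyy is not covered here. The second function lists the non-ASCII letters that occur, which is a quick way to see which diacritic characters the recognizer's alphabet would need to include.

```python
import re
import unicodedata

def build_wordlist(text):
    """Return the sorted, de-duplicated, lowercased words found in `text`.

    The text is normalized to NFC first so that a base letter followed by a
    combining mark becomes a single precomposed letter where one exists.
    Letters with no precomposed form would still be split by the regex.
    """
    normalized = unicodedata.normalize("NFC", text)
    # [^\W\d_]+ matches runs of Unicode letters (no digits or underscores).
    words = re.findall(r"[^\W\d_]+", normalized)
    return sorted({w.lower() for w in words})

def diacritic_letters(words):
    """Return the sorted set of non-ASCII characters used in `words`."""
    return sorted({ch for w in words for ch in w if ord(ch) > 127})

# Illustrative sample in a diacritic-based transliteration (not real corpus data).
sample = "śes rab ñid ṅag śes"
wordlist = build_wordlist(sample)
print(wordlist)
print(diacritic_letters(wordlist))
```

Writing `wordlist` out one word per line would give a plain-text file that is easy to inspect by hand before trying it with any OCR package.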