PhD project: Machine learning for recognizing handwritten Thai

Name: Olarik Surinta

Supervisors:
prof. dr. L.R.B. (Lambert) Schomaker
dr. M.A. (Marco) Wiering

Summary of PhD project:

There exists no generic method for recognizing handwritten scripts from different writing systems, cultures or historical periods. Asian scripts pose a number of interesting fundamental problems at the levels of image processing, text segmentation, feature extraction, shape classification and language modeling. Instead of spending human efforts at each of these leveld, the current challenge is to exploit machine learning methods. The main objective of the PhD project is to automatically recognize handwritten Thai and to automatically convert documents written in Thai to text files. There are two processes needed for this research: 1) Image processing for converting documents to words and characters, and 2) pattern recognition to convert words and characters to text files. Although targeted at Thai script, the new algorithms should be generalizable to a family of similar scripts with curved ink shapes that are contained in well-separable character rectangles, as opposed to connected-cursive Western styles for words.

First, an image processing scheme is applied to extract the information from the image documents, this scheme consists of 4 main steps including background elimination, line segmentation, character segmentation, and feature extraction. The last step is used to create a number of novel feature extractors to compute feature vectors from the handwritten Thai character image. One of these feature extractors will be the novel deep neural network architecture algorithm. The feature vectors can then be used to classify Thai characters.

Second, in the pattern recognition scheme, machine learning is used to identify each Thai character based on the feature vectors. Since different feature vectors are used as input, we will focus on ensemble classifiers that combine multiple machine learning algorithms to obtain the best performance. An important element in this project is to bootstrap the system, in the beginning there are few labels, but by using the system more and more correct labels will be obtained until finally the system is able to correctly classify most of the characters without any supervision.

Finally, state-of the art systems should be able to handle continuous learning. A system should be able to correctly classify characters from different authors and written in different years when trained on a particular author in a single year. For this robust ways have to be found that transform the characters to other ways of writing them, and using novelty detection to instantiate new classes if needed.

Figure 1: Illustration of the Thai handwritten images. (a) Thai handwritten characters and (b) Thai handwritten digits.

Figure 2: Illustration of the diversity in writing styles of the Thai handwritten dataset. (a), (b) and (c) sample of Thai handwritten characters and (d), (e) and (f) Thai handwritten digits.

Last modified:

26 January 2024 4.43 p.m.