Machine learning for historical document-image analysis

Contact person: prof.dr. Lambert Schomaker

Digital document-image analysis in historical collections is a difficult area, which makes it a fertile testing ground for evaluating and improving (deep) machine-learning methods. The group maintains an e-Science cloud service (Monk) that users in the humanities use to label and index historical manuscripts. This service acts as an observatory for {image, label} tuples. The problems addressed range from image preprocessing, layout analysis, and segmentation to (handwritten) text recognition. Other machine-learning tasks include document dating and writer identification. A recurring difficulty is the lack of labeled data when a new historical collection in an unknown script style and language is ingested. As a result, current deep-learning methods cannot be applied immediately, and bootstrapping algorithms need to be developed to reach a critical mass of labeled data.

Last modified: 13 December 2022 1.23 p.m.