Domain adaptation for parsing

08 December 2011

PhD ceremony: Ms. B. Plank, 12.45 p.m., Aula Academiegebouw, Broerstraat 5, Groningen

Dissertation: Domain adaptation for parsing

Promotor(s): prof. G.J.M. van Noord, prof. J. Nerbonne

Faculty: Arts

The ultimate goal of natural language processing is to build computer systems that are able to understand and produce natural language, just like humans do. Building such systems is a difficult task, given the ambiguity of natural language. In this dissertation Plank focuses on parsing, the process of syntactic analysis of natural language sentences. The ambiguity problem in parsing is characterized by multiple plausible alternative syntactic analyses for a given input sentence, from which the parser has to choose.

Current natural language processing systems employ supervised machine learning to infer a model from annotated training data. For parsing, the training data consists of a collection of syntactically annotated sentences. The parameters of the model are estimated to best reflect the characteristics of the training data, at the cost of portability. There is a substantial drop in performance when the system is applied to data drawn from a related but different distribution than the training data. Most parsers are trained on newspaper texts and, as a consequence, they do not perform well when applied to other kinds of text, for example scientific texts.

The focus of this dissertation is to investigate the domain dependence of natural language parsing systems. The contribution of this dissertation is threefold. First, the effectiveness of existing as well as novel domain adaptation techniques is evaluated in the context of a grammar-driven parsing system for Dutch, the Alpino parser. In contrast, most previous work on domain adaptation for parsing has focused on data-driven parsing systems. Second, Plank assesses the sensitivity of parsing systems to domain shifts. She compares the grammar-driven system Alpino to data-driven parsing systems. The hypothesis tested is that the grammar-driven system is less affected by domain shifts and that, consequently, data-driven systems are more in need of domain adaptation techniques. The dissertation shows that Alpino is robust in comparison to the data-driven parsers. The last contribution of this dissertation is to establish a measure of domain similarity for automatically selecting data that is beneficial for a new target domain. Most previous work assumed that data is available for the new domain, which is not always the case. The results show that a simple technique based on relative frequencies of words is effective for both languages examined, English and Dutch.
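
To make the idea of selecting data by word-frequency similarity more concrete, the following is a minimal sketch, not the dissertation's implementation. It represents each text by the relative frequencies of its words and ranks candidate texts by their closeness to the target text. The choice of Jensen-Shannon divergence as the comparison function, the function names, and the toy texts are all assumptions made for illustration; the summary above does not state which similarity function was used.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Assumption: this comparison function is chosen for illustration only.
    The value is 0 for identical distributions and 1 for disjoint ones.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two distributions
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def select_similar(target_tokens, candidates, k):
    """Return the names of the k candidate texts closest to the target.

    `candidates` maps a (hypothetical) text name to its list of tokens.
    """
    target = relative_frequencies(target_tokens)
    ranked = sorted(
        candidates.items(),
        key=lambda item: js_divergence(target, relative_frequencies(item[1])),
    )
    return [name for name, _ in ranked[:k]]


# Toy example: which candidate text resembles the target text most?
target_text = "the protein binds the receptor".split()
candidate_texts = {
    "news": "the minister said the budget".split(),
    "bio": "the receptor binds the ligand".split(),
}
print(select_similar(target_text, candidate_texts, k=1))
```

In this toy example the biomedical candidate shares more of its vocabulary with the target than the news candidate does, so it is ranked first. In a realistic setting the candidates would be annotated articles, and the selected ones would serve as training data for the new domain.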

Last modified: 13 March 2020 01.09 a.m.
