Domain adaptation for parsing
PhD ceremony: Ms. B. Plank, 12.45 p.m., Aula Academiegebouw, Broerstraat 5, Groningen
Dissertation: Domain adaptation for parsing
Promotor(s): prof. G.J.M. van Noord, prof. J. Nerbonne
Faculty: Arts
The ultimate goal of natural language processing is to build computer systems that are able to understand and produce natural language, just as humans do. Building such systems is difficult because natural language is ambiguous. In this dissertation Plank focuses on parsing, the process of syntactic analysis of natural language sentences. In parsing, the ambiguity problem takes the form of multiple plausible syntactic analyses for a given input sentence, from which the parser has to choose.
Current natural language processing systems employ supervised machine learning to infer a model from annotated training data. For parsing, the training data consists of a collection of syntactically annotated sentences. The parameters of the model are estimated to best reflect the characteristics of the training data, at the cost of portability: performance drops substantially when the system is applied to data drawn from a related but different distribution from that of the training data. Most parsers are trained on newspaper texts and, as a consequence, do not perform well when applied to other kinds of text, for example scientific texts.
The focus of this dissertation is to investigate the domain dependence of natural language parsing systems. Its contribution is threefold. First, the effectiveness of existing as well as novel domain adaptation techniques is evaluated in the context of a grammar-driven parsing system for Dutch, the Alpino parser. In contrast, most previous work on domain adaptation for parsing has focused on data-driven parsing systems. Second, Plank assesses the sensitivity of parsing systems to domain shifts by comparing the grammar-driven system Alpino to data-driven parsing systems. The hypothesis tested is that the grammar-driven system is less affected by domain shifts and that, consequently, data-driven systems are more in need of domain adaptation techniques. The results show that Alpino is robust in comparison to the data-driven parsers. Third, the dissertation establishes a measure of domain similarity that is used to automatically select training data that is beneficial for a new target domain. Most previous work assumed that data is available for the new domain, which is not always the case. The results show that a simple technique based on relative frequencies of words is effective for both languages examined, English and Dutch.
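The idea of a domain similarity measure built on relative word frequencies can be illustrated with a minimal sketch. The announcement does not say which measure the dissertation uses, so the Jensen-Shannon divergence below is an assumption chosen for illustration, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Assumed measure, not taken from the dissertation.
    0.0 means identical distributions; 1.0 means no shared vocabulary.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = (pw + qw) / 2  # probability of the word under the mixture
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def rank_by_similarity(target_tokens, candidates):
    """Return candidate names ordered from most to least similar to the target."""
    target = relative_frequencies(target_tokens)
    scores = {
        name: js_divergence(target, relative_frequencies(tokens))
        for name, tokens in candidates.items()
    }
    return sorted(scores, key=scores.get)


# Toy example (hypothetical data): pick the candidate corpus closest to the target.
target = "the protein binds the receptor".split()
candidates = {
    "news": "the minister visited the city".split(),
    "biomed": "the receptor binds the ligand".split(),
}
print(rank_by_similarity(target, candidates))
```

In this sketch the candidate texts with the lowest divergence from the target would be the ones selected as training data. Note that only raw, unannotated text from the target domain is needed to compute the ranking.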
Last modified: 13 March 2020 01.09 a.m.