Domain adaptation for parsing
PhD ceremony: Ms. B. Plank, 12.45 hrs, Aula Academiegebouw, Broerstraat 5, Groningen
Dissertation: Domain adaptation for parsing
Promotor(s): prof. G.J.M. van Noord, prof. J. Nerbonne
Faculty: Arts
The ultimate goal of natural language processing is to build computer systems that are able to understand and produce natural language, just as humans do. Building such systems is difficult because natural language is ambiguous. In this dissertation Plank focuses on parsing, the process of syntactic analysis of natural language sentences. In parsing, the ambiguity problem means that a given input sentence has multiple plausible syntactic analyses, from which the parser has to choose.
Current natural language processing systems employ supervised machine learning to infer a model from annotated training data. For parsing, the training data consists of a collection of syntactically annotated sentences. The parameters of the model are estimated to best reflect the characteristics of the training data, at the cost of portability: performance drops substantially when the system is applied to data drawn from a related but different distribution than the training data. Most parsers are trained on newspaper texts and, as a consequence, do not perform well when applied to other kinds of text, for example scientific texts.
The focus of this dissertation is to investigate the domain dependence of natural language parsing systems. The contribution of this dissertation is threefold. First, the effectiveness of existing as well as novel domain adaptation techniques is evaluated in the context of a grammar-driven parsing system for Dutch, the Alpino parser. In contrast, most previous work on domain adaptation for parsing has focused on data-driven parsing systems. Second, Plank assesses the sensitivity of parsing systems to domain shifts by comparing the grammar-driven system Alpino to data-driven parsing systems. The hypothesis tested is that the grammar-driven system is less affected by domain shifts and that, consequently, data-driven systems are more in need of domain adaptation techniques. The results show that Alpino is robust in comparison to the data-driven parsers. The third contribution of this dissertation is a measure of domain similarity for automatically selecting training data that is beneficial for a new target domain. Most previous work assumed that annotated data is available for the new domain, which is not always the case. The results show that a simple technique based on relative frequencies of words is effective for both languages examined, English and Dutch.
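As a rough illustration of the third contribution, the sketch below ranks candidate training texts by how closely their relative word frequencies match those of a target text. The summary above does not say which similarity function the dissertation uses; Jensen-Shannon divergence is an assumption made here for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the text."""
    if not tokens:
        raise ValueError("cannot compute frequencies of an empty text")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Assumption: this particular measure is chosen for illustration only.
    0.0 means identical distributions, 1.0 means no shared words.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mean = 0.5 * (pw + qw)
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mean)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mean)
    return divergence


def rank_by_similarity(target_tokens, candidates):
    """Return (name, divergence) pairs, most similar candidate first.

    `candidates` maps a hypothetical text name to its list of tokens.
    """
    target = relative_frequencies(target_tokens)
    scored = [
        (name, js_divergence(target, relative_frequencies(tokens)))
        for name, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    target = "the parser parses the sentence".split()
    candidates = {
        "news": "the market fell the shares".split(),
        "linguistics": "the parser analyses the sentence".split(),
    }
    for name, score in rank_by_similarity(target, candidates):
        print(name, round(score, 3))
```

In a setting like the one described, the texts ranked closest to the target would be the ones selected as training data for the parser; whitespace tokenization is used here only to keep the example short.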
Last modified: 13 March 2020 01.09 a.m.