Domain adaptation for parsing

08 December 2011

PhD ceremony: Ms. B. Plank, 12:45 p.m., Aula Academiegebouw, Broerstraat 5, Groningen

Dissertation: Domain adaptation for parsing

Promotor(s): prof. G.J.M. van Noord, prof. J. Nerbonne

Faculty: Arts

The ultimate goal of natural language processing is to build computer systems that are able to understand and produce natural language, just as humans do. Building such systems is difficult because natural language is ambiguous. In this dissertation Plank focuses on parsing, the process of syntactic analysis of natural language sentences. In parsing, ambiguity means that a given input sentence has multiple plausible alternative syntactic analyses, from which the parser has to choose.

Current natural language processing systems employ supervised machine learning to infer a model from annotated training data. For parsing, the training data consists of a collection of syntactically annotated sentences. The parameters of the model are estimated to best reflect the characteristics of the training data, at the cost of portability. There is a substantial drop in performance when the system is applied to data drawn from a related but different distribution than the training data. Most parsers are trained on newspaper texts and, as a consequence, they do not perform well when applied to other kinds of text, for example, scientific texts.

The focus of this dissertation is to investigate the domain dependence of natural language parsing systems. The contribution of this dissertation is threefold. First, the effectiveness of existing as well as novel domain adaptation techniques is evaluated in the context of a grammar-driven parsing system for Dutch, the Alpino parser. In contrast, most previous work on domain adaptation for parsing has focused on data-driven parsing systems. Second, Plank assesses the sensitivity of parsing systems to domain shifts. She compares the grammar-driven system Alpino to data-driven parsing systems. The hypothesis tested is that the grammar-driven system is less affected by domain shifts and that, consequently, data-driven systems are more in need of domain adaptation techniques. This comparison shows that Alpino is robust in comparison to the data-driven parsers. The last contribution of this dissertation is to establish a measure of domain similarity that can be used to automatically select data that is beneficial for a new target domain. Most previous work assumed that there is data available for the new domain, which is not always the case. The results show that a simple technique based on relative frequencies of words is effective for both languages examined, English and Dutch.
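
To give a rough idea of what a domain similarity measure based on relative word frequencies could look like, here is a minimal Python sketch. It is an illustration only and not the dissertation's implementation: the choice of Jensen-Shannon divergence as the comparison function, the function names and the toy texts are all assumptions made for this example.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Assumption: this divergence is used here only as one plausible way
    to compare relative frequencies; 0 means identical distributions.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two distributions
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def rank_by_similarity(target_tokens, candidates):
    """Sort candidate texts from most to least similar to the target."""
    target = relative_frequencies(target_tokens)
    scored = [
        (name, js_divergence(target, relative_frequencies(tokens)))
        for name, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical toy data: one target-domain sentence, two candidate sources.
target_text = "the protein binds the receptor".split()
candidate_texts = {
    "news": "the minister said the budget".split(),
    "bio": "the receptor binds the ligand".split(),
}
ranking = rank_by_similarity(target_text, candidate_texts)
print([name for name, _ in ranking])
```

In this toy setting the candidate that shares more vocabulary with the target text ranks first. With real data, the candidates would be articles or subcorpora of annotated training data, and the highest-ranked ones would be selected to train the parser for the new domain.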

Last modified:

13 March 2020 01.09 a.m.

