A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions

Haagsma, H., 2020, [Groningen]: University of Groningen. 205 p.

Research output: Thesis › Thesis fully internal (DIV)

  • Hessel Haagsma
In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'.

In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on building a large idiom corpus, which involves developing a system for the automatic extraction of potentially idiomatic expressions and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches.

In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions (PIEs), and the benefits such a corpus provides for further research. It makes it possible to quickly test hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation, and it permits fine-grained, reliable evaluation of such disambiguation systems.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
Award date: 3-Sep-2020
Place of Publication: [Groningen]
Print ISBNs: 9789403425269
Electronic ISBNs: 9789403425252
Publication status: Published - 2020

ID: 131057087