Publication

Normalization and parsing algorithms for uncertain input

van der Goot, R. M., 2019, [Groningen]: University of Groningen. 196 p.

Research output: Thesis › Thesis fully internal (DIV) › Academic

  • Rob Matthijs van der Goot
The automatic analysis (parsing) of natural language is an important ingredient for many natural language processing applications (search engines, automatic translation, speech processing, etc.), as it is the first step towards interpretation. For standard texts, like well-edited news articles, current parsers perform very well. However, for user-generated content, such as tweets, parser performance drops dramatically.

In this research, we attempt to improve the automatic analysis of spontaneous language by translating it to 'normal' language. For example, the sentence "new pix comming tomorroe" is translated to "new pictures coming tomorrow". In this example sentence, a variety of phenomena occur: 'pix' is a replacement based on the pronunciation, whereas 'comming' is probably a typo. This translation is also referred to as 'normalization'. Based on the observation that the normalization problem actually consists of multiple sub-problems, we developed a modular normalization model: MoNoise. This normalization model reaches a new state-of-the-art performance on a variety of languages.
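The modular idea can be illustrated with a minimal Python sketch. This is not the MoNoise implementation: the replacement list `LOOKUP`, the word list `VOCAB`, the similarity threshold, and the additive scoring are all hypothetical stand-ins chosen for illustration. Separate modules each propose candidates for a word, and a simple scoring function then picks one candidate per word.

```python
from difflib import SequenceMatcher

# Hypothetical toy resources (assumptions for illustration only);
# a real system would derive these from annotated data.
LOOKUP = {"pix": ["pictures", "pics"]}
VOCAB = ["new", "pictures", "coming", "tomorrow", "common"]


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def generate_candidates(word, threshold=0.8):
    """Gather candidates from three independent modules."""
    candidates = {word}                      # module 1: keep the word as is
    candidates.update(LOOKUP.get(word, []))  # module 2: known replacements
    for entry in VOCAB:                      # module 3: spelling-similar words
        if similarity(word, entry) >= threshold:
            candidates.add(entry)
    return sorted(candidates)


def score(word, candidate):
    """Toy additive score: in-vocabulary bonus, lookup bonus, similarity."""
    in_vocab = 1.0 if candidate in VOCAB else 0.0
    in_lookup = 1.0 if candidate in LOOKUP.get(word, []) else 0.0
    return in_vocab + in_lookup + similarity(word, candidate)


def normalize(sentence):
    """Replace each word by its highest-scoring candidate."""
    words = sentence.split()
    best = [max(generate_candidates(w), key=lambda c: score(w, c)) for w in words]
    return " ".join(best)


print(normalize("new pix comming tomorroe"))  # new pictures coming tomorrow
```

Keeping candidate generation separate from ranking means that a module for a new phenomenon (for example, phonetic replacements) can be added without touching the ranking step.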

Normalizing social media texts leads to a performance increase for syntactic parsers. In the basic setup, we use only the single best normalization candidate for each word, which might lead to error propagation. Hence, we introduce two novel methods to let the parser take multiple normalization candidates into account per position, leading to further improvements in parser performance.
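The abstract does not describe the two methods themselves, so the following Python sketch only illustrates the kind of input they imply: instead of one normalized word per position, the parser receives a short, weighted list of candidates. The function name `top_k_candidates`, the example scores, and the softmax renormalization are assumptions made for illustration, not the thesis's method.

```python
import math


def top_k_candidates(scored, k=2):
    """Keep the k best-scoring candidates for one position and
    renormalize their scores into probabilities with a softmax."""
    best = sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(math.exp(s) for _, s in best)
    return [(cand, math.exp(s) / total) for cand, s in best]


# Hypothetical ranker scores per position for "new pix comming tomorroe".
sentence_scores = [
    {"new": 2.0},
    {"pictures": 2.4, "pics": 1.6, "pix": 1.0},
    {"coming": 1.9, "comming": 1.0},
    {"tomorrow": 1.9, "tomorroe": 1.0},
]

# One weighted candidate list per position, ready to hand to a parser
# that can consume alternatives rather than a single word sequence.
lattice = [top_k_candidates(position) for position in sentence_scores]

for position in lattice:
    print([(cand, round(prob, 2)) for cand, prob in position])
```

Passing the weights along lets a downstream parser fall back on the second candidate when the top-ranked normalization is wrong, which is how keeping alternatives can reduce error propagation.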
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
Supervisors/Advisors
Award date: 4-Apr-2019
Place of Publication: [Groningen]
Publisher
Print ISBNs: 978-94-034-1458-4
Electronic ISBNs: 978-94-034-1457-7
Publication status: Published - 2019

ID: 78256478