Normalization and parsing algorithms for uncertain input
|PhD ceremony:||Mr R.M. (Rob) van der Goot|
|When:||April 04, 2019|
|Supervisors:||prof. dr. G.J.M. (Gertjan) van Noord, prof. dr. M. (Malvina) Nissim|
|Where:||Academy building RUG|
The automatic analysis (parsing) of natural language is an important ingredient for many natural language processing applications (search-engines, automatic translation, speech-processing, etc.), as it is the first step towards interpretation. For standard texts, like well-edited news articles, current parsers perform very well. However, for user-generated content, such as tweets, parser performance drops dramatically.
In this research, we attempt to improve the automatic analysis of spontaneous language by translating it to 'normal' language. For example, the sentence "new pix comming tomorroe" is translated to "new pictures coming tomorrow". In this example sentence, a variety of phenomena occurs: 'pix' is a replacement based on the pronunciation, whereas 'comming' is probably a typo. This translation is also referred to as 'normalization'. Based on the observation that the normalization problem actually consists of multiple sub-problems, we developed a modular normalization model: MoNoise. This normalization model reaches a new state-of-art performance on a variety of languages.
Normalizing social media texts leads to a performance increase for syntactic parsers. In the basic setup, we use only the single best normalization candidate for each word, which might lead to error propagation. Hence, we introduce two novel methods to let the parser to take multiple normalization candidates into account per position, leading to further improvements in parser performance.