Computational Linguistics

Introduction

The computational linguistics group focuses on natural language processing by computers, from theoretical, experimental and applied perspectives. Areas of interest are wide-coverage grammars (especially for Dutch), machine translation, machine learning, and dialectometry. Strong ties to the researchers specializing in semantics and descriptive linguistics have developed from common interests in mathematical linguistics and corpora.

Research program

Parsing

The computational linguistics group has constructed a natural language understanding system for Dutch. This includes a wide-coverage grammar based on the insights of Head-Driven Phrase Structure Grammar (HPSG), a large-scale lexicon, a parser, and a disambiguation component. Further research focuses on domain adaptation of wide-coverage grammars, question answering using syntactic analysis and information retrieval, hybrid machine translation, and language generation.

Dialectometry

Techniques for computational assessment of the linguistic distance between the pronunciations of Dutch dialect varieties were developed. Our current research extends these techniques to dialects of other languages and seeks to account for dialect variation through dialect area, geography and/or settlement size. Methods from bioinformatics are also incorporated for calculating linguistic distances and clustering sites. Research in phonology is also pursued.
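
As an illustration of what a pronunciation distance between two dialect sites can look like, the following is a minimal sketch that assumes a plain Levenshtein (edit) distance over transcription symbols, normalized by word length and averaged over the words two sites share. The text above does not name the measure the group uses, so the choice of Levenshtein distance, the function names `edit_distance` and `site_distance`, and the toy transcriptions are all assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of transcription symbols."""
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def site_distance(site_a, site_b):
    """Mean length-normalized edit distance over the words both sites share."""
    shared = sorted(set(site_a) & set(site_b))
    if not shared:
        raise ValueError("the two sites share no words")
    total = 0.0
    for word in shared:
        x, y = site_a[word], site_b[word]
        total += edit_distance(x, y) / max(len(x), len(y), 1)
    return total / len(shared)


# Invented toy data (hypothetical, not real dialect transcriptions).
site_1 = {"milk": "melk", "house": "hus"}
site_2 = {"milk": "malk", "house": "hus"}
print(site_distance(site_1, site_2))
```

A matrix of such pairwise site distances could then be passed to a clustering method to group sites, which is the step the paragraph above refers to as clustering sites.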

Research projects

* Building ICT Research Capacity in Uganda: this project aims to extend ICT research capacity at the four Public Universities in Uganda (MAK, MUST, KYU, and GU). Project objectives include training of staff of the four universities at PhD level in ICT, development of ICT infrastructure, and implementation of educational programs.

* COREA (cOreference Resolution for Extracting Answers): The COREA project has created a Dutch corpus annotated with coreference relations and a robust system for automatically assigning coreference relations between noun phrases. The coreference resolution was evaluated in the context of information extraction and question answering tasks.

* DAISY (Dutch lAnguage Investigation of Summarization technologY): The DAISY project designs, develops and evaluates essential technologies for automatic summarization of Dutch informative texts. Innovative algorithms for topic detection and discrimination, rhetorical classification of content, sentence compression and text generation are developed.

* Determinants of dialectal variation: Techniques for assessing the linguistic distance between the pronunciation of Dutch dialectal varieties have been shown to be consistent and valid in adducing classifications for which expert consensus exists. This project extended these results to Dutch lexis (vocabulary) and syntax and to German pronunciation and examined quantitative models that seek to account for the variation through dialect area (tribal history), geography and/or settlement size.

* DuOMAn (Dutch Online Media Analysis): DuOMAn aims to transform the volumes of online information that threaten to leave media analysts information-bound into aggregates of attitudes organized by topic by employing classification, information extraction, and cross-document linking.

* IRME (Identification and Representation of Multiword Expressions): this project developed innovative methods and tools for the automatic identification and lexical representation of multiword expressions. The project contributed to the development of electronic lexicons, in particular for Dutch. The resulting MWE database was integrated into lexical resources for Dutch.

* LASSY (Large Scale Syntactic Annotation of written Dutch): In the LASSY project a large corpus of written Dutch texts (1,000,000 words) is syntactically annotated with manual correction. In addition, a full corpus is developed as the successor of D-COI (500,000,000 words) by automatic annotation. Various browse and search tools for syntactically annotated corpora are developed further.

* Linguistic determinants of mutual intelligibility in Scandinavia: The three mainland Scandinavian languages have a reputation of being mutually intelligible, which means that speakers are able to communicate while each uses his or her own language. However, in daily practice inter-Scandinavian communication sometimes fails. The problems are commonly explained by extra-linguistic factors such as linguistic experience and language attitude. Linguistic explanations have mostly been neglected due to the lack of a suitable method for quantifying linguistic distance. Recently, such methods have been developed. The aim of the present project is to use and refine these newly developed methods in order to measure communicatively relevant linguistic distances among the spoken Scandinavian languages. On the basis of these measurements, a model will be developed that explains mutual intelligibility in Scandinavia.

* Measuring linguistic unity and diversity in Europe: This project takes well-established quantitative techniques for measuring language diversity and applies them to the language varieties of Bulgaria and the surrounding territory. These methods have already been used successfully for the study of language diversity and unity in Western Europe, but the cultural and linguistic context of Bulgaria within the larger Balkan language area presents new challenges not encountered in the Western context.

* PaCo-MT (Parse and COrpus based Machine Translation): The PaCo-MT project aims at developing an open domain hybrid MT system integrating proper linguistic analysis and syntactic transfer into a data-driven approach to be used by professional translators. Translation will be based on transfer (lexical and syntactic) from a parsed source language sentence into a corresponding target language structure. From this the final output is generated using information from a large target language Treebank that will ensure grammaticality and fluency.

* Question Answering for Dutch using Dependency Relations: this project, conducted in the context of the IMIX (Interactive Multimodal Information Extraction) project, has investigated the use of sophisticated linguistic knowledge and robust natural language processing for QA. In particular, it investigated how syntactic and semantic dependency relations in the question and in potential answer texts can support QA.

* SCRATCH (SCRipt Analysis Tools for the Cultural Heritage): this project is focused on methods for information retrieval in large collections of handwritten-document images. Within this project the computational linguistics group focused on the study of the textual and linguistic regularities of document content in a given homogeneous archive.

* The Mutual Comprehensibility of Language Varieties in the Low Lands: differences in the comprehensibility of closely related language varieties are often attributed to the similarity of the languages, familiarity with a language, and language attitude. However, these factors are often invoked without systematic research. This project aims to investigate the impact of these factors on language comprehension. For that purpose the mutual intelligibility of Dutch varieties from Belgium and the Netherlands is tested experimentally, and the intelligibility scores are systematically related to linguistic and non-linguistic factors.

Research Results

In recent years, the computational linguistics group has successfully participated in various projects focusing on dialectometry, question answering systems, coreference resolution, identification of multiword expressions, and syntactic annotation of large corpora. The relevance of the group's research is also reflected in the participation of its members in various important (inter)national organizations, journals, conferences, and workshops.

The computational linguistics group of CLCG obtained the maximum score (5, 5, 5, 5) in the external evaluation procedure over 1998-2004.
Last modified: 15 September 2016 11.08 a.m.