Resources

Over the years, the Computational Linguistics group has produced many resources that can be valuable for the community. They are listed here, divided into two sections: Demos & Tools and Corpora.

Demos & Tools

BERTje: the Dutch BERT

Our group, led by PhD student Wietse de Vries, trained a Dutch version of BERT, called BERTje, that is freely available. Details can be found in the paper and on the GitHub page.

Woordwaark (Dutch only)

Do you speak Gronings, Drents, Twents, Achterhoeks or Veluws? We made a quiz, based on old dialect recordings from the north and east of the Netherlands, that can guess where you are from! Try it yourself on the Woordwaark website.

The Canonizer

This tool presents an application of the Dutch novels 1800-2000 dataset; you can pick a novel or enter your own text, and the system will tell you how this text compares to the novels in the corpus. You can also plot the frequency of a word or phrase for each year across the corpus. It was developed by Andreas van Cranenburgh.

MoNoise

MoNoise is a modular system that translates noisy chat texts into more canonical language, developed by Rob van der Goot. It exploits traditional spelling correction algorithms as well as word embeddings. The online demo includes models for multiple languages.

Dutchcoref

A Dutch coreference resolution system, building on Alpino and BERTje. The system is being developed by Andreas van Cranenburgh. The code is available under an open source license and there is an online demo.

PaQu: Parse & Query

PaQu is a web-based platform for syntactic search in Dutch treebanks, created in a series of projects financed by Clarin and Clariah. It was developed by Peter Kleiweg and Gertjan van Noord.

Alpino

Alpino is an integrated natural language analysis system for Dutch. It includes a highly accurate, robust, fast dependency parser, a generation component, and a related set of tools for treebank search and for pre-processing. Alpino is available under the conditions of the GNU Lesser General Public License and was developed by Gertjan van Noord.

Authorship attribution & author profiling demo

A web interface for authorship attribution and author profiling using our tools. Developed by Malvina Nissim.

Word frequency on Twitter

This Dutch demo, developed by Peter Kleiweg, shows the frequency of words on Twitter for certain periods of time. It can also visualise where in the Netherlands a word is tweeted, based on the geolocation of the tweets.

Computational Linguistics GitHub

The Computational Linguistics group also has a GitHub page with interesting software.

Corpora

The Groningen Meaning Bank

The Groningen Meaning Bank is a syntactically and semantically annotated corpus of public domain English texts. Each document in the corpus has a corresponding meaning representation in the form of a discourse representation structure. All tagging layers together with the raw text can be downloaded, but it is also possible to view the contents using the explorer. The latest release (04-07-2014) contains 10,000 annotated documents.

The Parallel Meaning Bank

The Parallel Meaning Bank, the successor of the Groningen Meaning Bank, is a semantically annotated corpus of English texts aligned with translations in Dutch, German and Italian. Each sentence is syntactically and semantically annotated, ultimately producing scoped meaning representations in a language-neutral format. It contains fully manually checked documents as well as automatically tagged documents. To see all our current documents in an interactive format, please visit the explorer. Multiple official releases of this data are also available. Moreover, as part of IWCS 2019, a shared task was organized on producing the scoped meaning representations for English.

DALC

DALC, created by Tommaso Caselli, is a corpus of Twitter messages annotated for abusive and offensive language. It is the first publicly available resource of this kind in Dutch. The corpus contains a total of 11,292 manually annotated messages. The offensiveness dimension contains both aggregated and disaggregated annotations for the explicitness and the target layers.

Dutch novels 1800-2000

A dataset of textual features and metadata for a corpus of Dutch novels (1800-2000), created to study why some novels become classics while others are forgotten (canonicity). The dataset is available under a Creative Commons license and was created by Andreas van Cranenburgh. For more details, see the blog post.

Shared Task Annotation

Shared tasks are indisputably drivers of progress and interest for problems in NLP. To qualify some of the characteristics and potential problems, we annotated a random set of shared tasks by hand. The process is described in the paper Sharing is Caring: The Future of Shared Tasks.

Taxonomy for Normalization

This dataset supports in-depth evaluation of a normalization model (e.g. MoNoise). It is based on a taxonomy containing different normalization actions. A small Twitter corpus was annotated by two annotators and is publicly available.

An annotated corpus for the analysis of VP ellipsis

This is an annotated corpus of VP ellipsis (VPE) in all 25 sections of the Wall Street Journal corpus (WSJ) distributed with the Penn Treebank. The resulting corpus will be useful for studying VPE phenomena as well as for evaluating natural language processing systems equipped with ellipsis resolution algorithms. It was annotated by Johan Bos and Jennifer Spenader.

Last modified: 17 February 2025 2.46 p.m.