Colloquium Computer Science - Harm de Vries, ServiceNow

When: Fr 17-03-2023 16:00 - 17:00
Where: 5161.0222 Bernoulliborg

Title: BigCode: open and responsible development of large language models for code

Abstract:

In this talk, I’ll cover the recent progress of the BigCode project, an open scientific collaboration working on the responsible development of Large Language Models (LLMs) for code. Code LLMs can increase developer productivity by completing code snippets from both natural language instructions and other code fragments. I’ll describe how we created The Stack, a large dataset for training code LLMs, and discuss some of its legal, ethical, and governance concerns, including (i) how to let developers opt their code repositories out of the training data and (ii) how to give proper attribution when the model generates verbatim copies of other people’s code. Finally, I’ll go over the lessons learned from our first model, SantaCoder, a 1.1B-parameter model trained on Java, JavaScript, and Python. SantaCoder outperforms other open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the MultiPL-E benchmark, despite being substantially smaller.
