Protecting voices, sharing knowledge: responsible data practices in speech research
We interviewed Dr. Matt Coler about embedding Open Science into speech technology research, with a focus on responsible data management, data protection, and inclusivity.

Dr. Matt Coler is an Associate Professor at Campus Fryslân and Director of the MSc Speech Technology. With an undergraduate background in Philosophy and Chinese and graduate training in linguistics, he earned his PhD at the intersection of language and anthropology, documenting endangered languages in the Andes. This transformative work revealed the urgency of preserving linguistic diversity and the limits of traditional methods.
After leading the cognitive systems unit at an AI startup, he saw how speech recognition, synthesis, and other tools could support underrepresented languages and speakers. His research now focuses on creating speech technology for communities often excluded from mainstream systems - whether minority languages like Frisian or Aymara, or speakers with atypical speech patterns - pioneering innovations that value diversity over uniformity.
Good data management isn't just about compliance or best practices - it makes your research more efficient, more reproducible, and more impactful.
1. In the UG Open Science Blog, you previously spoke about embedding Open Science into the foundation of the MSc Speech Technology. How has that vision evolved when it comes to the way you and your students handle data?
Responsible openness demands careful judgment.
When we launched the program in 2021, we made a bold commitment to build it entirely around open science principles. Three years in, I can say that vision has only strengthened, and it's become far more nuanced than our initial idealism.
Early on, I thought openness was straightforward: make everything public, use open source tools, share all data. But responsible openness demands careful judgment. We encountered our first major challenge when students wanted to work with clinical populations - people with speech disorders who could benefit from better technology, but whose data needed careful protection. We had to develop new workflows that balanced accessibility with privacy.
Industry partnerships also forced us to evolve. Companies wanted to collaborate but couldn't share proprietary datasets. Instead of abandoning our principles, we found creative solutions. We explored using synthetic data, focusing on open benchmarks, or having students work on complementary problems using public data. These constraints actually improved our research by forcing us to think more carefully about generalizability.
Students are developing judgment about when complete openness serves the greater good and when more controlled sharing protects vulnerable communities.
Perhaps the biggest evolution has been moving beyond technical compliance toward ethical reasoning. Students now routinely ask not just "can we share this?" but "should we share this?" and "how might this be misused?" They're developing judgment about when complete openness serves the greater good and when more controlled sharing protects vulnerable communities. That's a more mature understanding of open science than we started with.
2. As an Open Science ambassador, how are you helping embed Open Science practices into the culture at Campus Fryslân?
Being an Open Science ambassador means being on the lookout for teachable moments and practical applications. I try to model the behavior I want to see - when I publish, I choose open access venues; when I work on software development, it goes on public repositories; when I give talks, the slides are freely available.
But culture change happens through relationships, not mandates. I work with our Open Science Ambassadors network, which creates bridges between central support and individual faculty communities. A colleague struggling with data sharing doesn't need another lecture about benefits - they need practical help navigating publisher requirements or understanding licensing options.
What works is connecting open science to values faculty already hold. Most researchers want their work to have impact, to advance knowledge, to benefit society. Open science isn't an additional burden - it's a better way to achieve those goals. When someone sees their openly shared dataset being cited and built upon, or when their open source code gets adopted by other research groups, the value becomes self-evident.
Culture change happens through relationships, not mandates… it requires both hearts and minds, but it also requires removing friction from the process.
Looking ahead, I'm exploring an idea I call "collaborative course content" - open course materials that students iteratively improve and expand as part of their coursework, hosted on a GitHub repository or similar. Instead of static texts, students would contribute real-world case studies from their independent studies or thesis projects, update datasets with current examples, or write explanatory sections that clarify concepts their peers struggled with. Each contribution gets peer-reviewed by other students before integration, creating a scholarly community within the classroom.
I'm planning to pilot this next year with our research methods course, a subject where student perspectives and current examples are crucial. This could revolutionize how we think about education and student engagement with open scholarship, especially at a time when public faith in academia is declining. Students become genuine contributors to knowledge rather than passive consumers. Culture change requires both hearts and minds, but it also requires removing friction from the process.
The RDMS has been valuable because it provides secure storage with granular access controls.
3. How do you handle sensitive data in your projects, especially when working with vulnerable populations or emotion-rich datasets like sarcasm detection?
Speech data is inherently personal in ways that other research data often isn't. Voices express not just words and phrases, but identity markers, emotional states, and cultural background. Even when we're analyzing publicly available content from TV shows to study sarcasm detection, we're still dealing with material that could be misused. So there’s no simple “one-size-fits-all” response to this question.
The RDMS (Research Data Management System) has been valuable here because it provides secure storage with granular access controls. We can share derived datasets while keeping raw recordings protected, or provide access to qualified researchers while maintaining participant privacy. The key is designing privacy protection into the research workflow from the beginning, not treating it as an afterthought when it's time to publish.
The key is designing privacy protection into the research workflow from the beginning, not treating it as an afterthought when it's time to publish.
4. What are your considerations regarding concerns about bias, privacy, or unintended surveillance in your research?
Speech technology has the potential to make communication more accessible and inclusive, but the same tools can reinforce existing biases or enable surveillance we never intended.
Bias is built into our datasets, whether we acknowledge it or not. Historical speech corpora overrepresent certain demographics, accents, and speaking styles. When we train models on this data, we embed those biases into our systems. We try to be explicit about these limitations in our publications and actively seek more diverse training data, but it's an ongoing challenge.
The goal isn't to avoid all risk … but we have a responsibility to anticipate consequences and design safeguards where possible.
Privacy concerns go beyond individual data protection to broader questions about consent and agency. When we develop speech recognition systems, we're potentially contributing to a world where every conversation could be monitored and analyzed. We can't control how our research gets used after publication, but we can be thoughtful about what we choose to work on and how we frame our contributions.
I'm looking into incorporating ethical impact assessments into our research planning. Before starting a new project, we ask: Who benefits if this technology works perfectly? Who might be harmed if it's misused? Are there vulnerable populations who might be unfairly affected? These questions don't always change what we do, but they change how we do it and how we communicate about it. The goal isn't to avoid all risk - that would mean avoiding all innovation. But we have a responsibility to anticipate consequences and design safeguards where possible.
The best data management plan is one you actually follow, not one that looks impressive on paper.
5. What advice would you give to new UG researchers on getting started with responsible data management planning?
Start simple and start early. Don't wait until you have data to think about how you'll manage it. The best data management plan is one you actually follow, not one that looks impressive on paper.
Don't reinvent the wheel when good tools exist.
My first piece of advice is to use the resources already available. The Digital Competence Centre (DCC) offers workshops and consultation services. The RDMS provides secure storage and sharing capabilities that most researchers don't need to build from scratch. Don't reinvent the wheel when good tools exist.
Second, think about your future self. Six months from now, will you remember why you organized files in a particular way? Will your naming conventions make sense? Can you find the specific version of code that generated a particular result? Good data management is mostly about being kind to your future self and your collaborators.
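To make that concrete, here is a minimal sketch - in Python, with entirely hypothetical speaker IDs, task names, and folder layout - of the kind of naming convention that keeps a speech-data project legible to your future self:

```python
from datetime import date
from pathlib import Path


def recording_filename(speaker_id: str, task: str, session: date, take: int) -> str:
    """Build a predictable name like 'spk042_sarcasm_2025-03-14_take01.wav'.

    The fields and their order are an illustrative choice, not a standard:
    the point is that every file name encodes who, what, and when.
    """
    return f"{speaker_id}_{task}_{session.isoformat()}_take{take:02d}.wav"


# Hypothetical usage: keep raw recordings under data/raw/<task>/
target = Path("data/raw/sarcasm") / recording_filename("spk042", "sarcasm", date(2025, 3, 14), take=1)
target.parent.mkdir(parents=True, exist_ok=True)
print(target)  # data/raw/sarcasm/spk042_sarcasm_2025-03-14_take01.wav
```

Whatever convention you adopt, write it down once and apply it everywhere; consistency matters more than the particular scheme.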
Third, embrace version control and documentation from day one. Even if you're working alone, Git will save you from yourself when you accidentally delete something important. And write README files like your career depends on it (because it might).
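In the same spirit, a README does not need to be polished from the start. The sketch below - again in Python, with placeholder section headings and contact details - simply writes a skeleton next to a dataset folder so there is always something to fill in:

```python
from datetime import date
from pathlib import Path
from textwrap import dedent


def write_readme_stub(dataset_dir: Path, title: str, contact: str) -> Path:
    """Create a README.md skeleton with the sections a future reader will need.

    The headings are only a suggestion; the habit of creating the file on
    day one is what matters.
    """
    dataset_dir.mkdir(parents=True, exist_ok=True)
    readme = dataset_dir / "README.md"
    readme.write_text(dedent(f"""\
        # {title}

        - Created: {date.today().isoformat()}
        - Contact: {contact}

        ## What is in this folder
        (recordings, formats, and the naming convention used)

        ## How it was collected
        (consent procedure, recording setup, known limitations)

        ## How to reproduce derived files
        (scripts, their versions, and the order to run them in)
        """))
    return readme


# Hypothetical usage:
write_readme_stub(Path("data/raw/sarcasm"), "Sarcasm detection recordings", "your.name@example.org")
```

A stub like this, committed alongside the first recordings, is far easier to keep current than a document written retrospectively.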
You don't need to master every aspect of data management before you start collecting data.
Finally, don't let perfect be the enemy of good. You don't need to master every aspect of data management before you start collecting data. Focus on the basics: secure storage, clear naming conventions, regular backups, and basic documentation. You can always improve your practices as you learn more.
Remember that good data management isn't just about compliance or best practices - it makes your research more efficient, more reproducible, and more impactful. The time you invest upfront pays dividends throughout your project… and beyond.
Good data management is mostly about being kind to your future self and your collaborators.