University of Groningen Library
Library Open Access
Open Science Blog

Good practices for FAIR data management - an interview with Don van Ravenzwaaij on research data de-identification and FAIR implementation in the Behavioural Social Sciences

Date: 16 June 2025
Author: Alba Soares Capellas
Don van Ravenzwaaij

Part of open science is that researchers make their data FAIR: Findable, Accessible, Interoperable and Reusable. But how to do this? In this series, we ask researchers to tell us more about their data management choices.

In this interview, we speak with Prof. Don van Ravenzwaaij, a leading advocate for open science and Professor at the Department of Psychometrics and Statistics at the Faculty of Behavioural and Social Sciences of the UG, about his recent co-authored article “De-identification when making data sets findable, accessible, interoperable and reusable (FAIR): Two worked examples from the Behavioral and Social Sciences”. He shares key challenges, practical solutions, and advice for researchers navigating the balance between openness and data protection.

“We hoped to share the fruits of our labour as some kind of tutorial document”

The article provides practical footholds for researchers who would like to publish datasets as open as possible but as closed as necessary. What motivated you to write it?
The genesis of this article lies in my time as chair of the BSS Open Science Committee, whose mandate was to identify the bottom-up needs of researchers in the faculty in the area of open science. One thing we quickly noticed was that there are quite a few resources available on how to make data FAIR, or where to publish your paper, but that practical worked-out examples were few and far between. So as a committee, we decided to send out a call to applied researchers in our faculty.

Everyone interested in learning how to properly de-identify their data for purposes of making it publicly available could send us their empirical datasets, and we would work alongside them to make them ready. We hoped to share the fruits of our labour as some kind of tutorial document (so that everyone in the faculty could benefit from it), or even a published paper (so that everyone in the social sciences could benefit from it). It turned out to be quite a journey. I, for one, had not anticipated how much work such an endeavour would be, as we had to properly familiarize ourselves with the content of datasets that were not our own. However, it was very rewarding, as I learned a lot from it!

“Part of making data accessible was providing a roadmap for converting the proprietary files to open files.”

The article is shaped as a tutorial paper, showcasing two examples and the challenges that were encountered during the de-identification process and in making the data FAIR. What can researchers learn from it?
Our original focus, which I am stealing from our abstract, was assisting researchers with “Navigating the balance between protecting participants’ privacy and making one’s dataset as open as possible”. To that end, we provide a step-by-step de-identification process, which we implement for two datasets. The checklist, the datasets (both unprocessed datasets with identifying values replaced by simulated values and their respective de-identified versions), and all associated materials can be found on OSF (https://osf.io/eqbd3). Researchers can also learn from the article what FAIR data means concretely, as some of the original terminology in the Wilkinson et al. paper may not be as accessible to the typical applied researcher in the social sciences.

Could you highlight some challenges that you encountered in the process of de-identifying the datasets?
Several, but I’ll try to limit myself to three :-) The first one: at the BSS faculty, datasets are often analyzed in the statistical software package IBM SPSS. Part of making data accessible (in our opinion) was providing a roadmap for converting the proprietary .sav files to .csv files. In theory, you can do this with the R package “foreign”, as many online sources and papers will tell you. In practice, some weird conversion issues arise, such as (1) time variables becoming weird long numbers (which, after some research, turn out to be the time stamps converted to seconds since October 14, 1582, or the start of the Gregorian calendar); and (2) open-text variables getting chopped into separate variables, because the default number of characters for an open-text variable in SPSS is larger than in R. None of these are insurmountable, but if you are not too familiar with R, you will not be able to solve them so easily. Hopefully our annotated code will help with that.
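The epoch quirk described above is easy to work around once you know where those long numbers come from. A minimal sketch (in Python rather than the paper's annotated R code, and with a function name of my own invention) of converting an SPSS time value back into a readable date:

```python
from datetime import datetime, timedelta

# SPSS stores date/time values as seconds elapsed since October 14, 1582.
SPSS_EPOCH = datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds: float) -> datetime:
    """Convert an SPSS time value (seconds since the SPSS epoch) to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)

# Usage: pass the raw numeric value as it appears in the converted .csv,
# e.g. spss_seconds_to_datetime(raw_value)
```

The same arithmetic can of course be done in R (or in SPSS itself before export); the point is only that the "weird long number" is a plain offset from a fixed epoch, not corrupted data.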

“It is essential that the research info and informed consent specify that properly de-identified data might become publicly available!”

The second one: what if a dataset includes variables or responses in a non-English language (say, German)? How much translating should be expected of a researcher publishing their data, and how much should be the responsibility of the end user? Put the bar too high for the publishing researcher, and they will not want to engage with making their data accessible, but put the bar too low and the end user will not be able to navigate the data at all. We settled on leaving the data as is, but including a code book with clear translations of each variable as a compromise. 

Finally, this may be stating the obvious, but it is essential that the research information and informed consent specify that properly de-identified data might become publicly available! Early on in the process, we had started working on an example dataset, assuming that this informed consent had been collected by the submitting researcher. The committee felt appropriately embarrassed when we later found out we could not actually use this data as an example dataset!

“In many cases, datasets can be de-identified without losing any value for potential future users of the data.”

One of the co-authors on the article is a data steward from the UG Digital Competence Centre (UG DCC), Marlon de Jong. How did this become a collaborative process, and how does this collaboration enrich the article?
The action editor of our paper commented in a revise/resubmit decision letter that the paper would really be strengthened by including a local data steward, and we completely agreed. I reached out to Marlon at the time, and Marlon was fortunately interested in joining the project! Marlon’s involvement meant we did a massive overhaul of the paper, revising way more than the editor and reviewers asked for, but it made the paper much stronger. To name some specifics: Marlon’s knowledge of the GDPR, of certain aspects of the FAIR principles, and of FAIR implementation profiles (or FIPs) was an important addition to the expertise present in the author team. On a more personal note, Marlon was clear from the outset that she was not interested in being a figurehead, and that she would only join if she could make some significant changes to the paper. That attitude made the collaboration much more fruitful (I appreciate when people speak their minds, so that expectations from both sides are clear).

“Make sure you get to know your local data steward: they have a ton of expertise.”

Do you think that the FAIR principles and the GDPR are compatible? Are there ways that researchers can ensure that the privacy of participants is protected, while still providing data that can be valuable for reuse?  
Absolutely, I’d say that’s the main message of our paper! Naturally, there is some friction between the two. It’s all about finding the right balance, but I believe that in many cases datasets can be de-identified without the dataset losing any value for potential future users of the data. Citing from our paper: “Anonymizing or pseudonymizing raw data will not always be possible, for instance, in cases in which the raw data include video material of participants. Video-editing techniques exist for blurring faces and distorting voices, but they may not always be sufficient to fully anonymize the data. In such cases, we recommend researchers make the processed data available, for instance, the coding of the behavior in the video, along with the codebook that explains how the coding was done. At the very least, this would flag that these data exist and enable others to contact the authors of those data.”

What advice would you give to researchers in social sciences who want to improve their data management practices to align with FAIR principles and contribute to a more open scientific community?
Subscribe to the DCC Up to Data newsletter (if you have not already); it contains a ton of useful workshops! Also, make sure you get to know your local data steward: they have a ton of expertise (although, depending on the faculty, they may also be a bit overworked). Finally, read our paper! If anything, it contains a lot of useful resources for further study!

About the author

Alba Soares Capellas

Communications Officer at the UG Digital Competence Centre (UG DCC)
