Good practices for FAIR data management (2) - an interview with Lilian Peters on the COVID-19 EHR cohort dataset
Date: 02 December 2022
Author: Jitka Vavra & Leon ter Schure
Part of open science is that researchers make their data FAIR: Findable, Accessible, Interoperable and Reusable. But how do you do this in practice? In this series, we ask researchers to tell us more about the data management choices they made to make their dataset FAIR.
In this edition we highlight the dataset ‘COVID-19 EHR cohort: linked and harmonized Electronic Health Records (EHR) in primary care and secondary care’.
We asked project lead Lilian Peters (Epidemiologist and Assistant Professor at the Department of General Practice and Elderly care Medicine/Midwifery Science UMCG) and her colleagues Feikje Groenhof (database coordinator Academisch Huisarts Ontwikkel Netwerk (AHON), data steward UMCG), Karina Sulim (data manager AHON, UMCG) and Eline Meijer (data scientist at the Data Science Center in Health (DASH), UMCG) to tell us more. Co-project leads are Lotte Ramerman and Isabelle Bos from Nivel.
Can you describe your dataset?
The COVID-19 EHR cohort links (or can link) data from various sources at patient level. First of all, it contains so-called Electronic Health Record (EHR) data from approximately 385,000 patients. These patients received care from general practitioners (GPs) affiliated with the GP networks of the University Medical Center Groningen (AHON), Radboud University Medical Center (Family Medicine Network: Fa-Me-net), and Maastricht University Medical Center+ (Research Network Family Medicine: RNFM) in the years 2019-2021. These data were harmonized and cleaned, and are enriched by various other data sources, namely:
EHR data from approximately one million patients who received care from general practices and out-of-office services affiliated with the Netherlands Institute for Health Services Research (Nivel). This is about 10% of the Dutch population.
Microdata from Statistics Netherlands (CBS). These are linkable data on socio-economic status, urbanization, migration background, mortality, Dutch hospital data and more, made available to us under strict conditions for statistical research.
Regional COVID-19 test and vaccination data from various Public Health Services (GGD Groningen, GGD Fryslân, GGD Drenthe) and COVID-19 test data from Certe (medical diagnostics and laboratory medicine) for 2020-2021.
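To make the idea of linkage at patient level more concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual pipeline: the field names (`pseudonym`, `icpc_code`, `result`), the helper functions and the four toy records are all hypothetical and made up for this example. It assumes that every source carries the same pseudonymized patient key, and it shows that some records may remain unlinked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkedRecord:
    pseudonym: str
    icpc_code: str
    test_result: Optional[str]  # None when no test record could be linked


def link_on_pseudonym(ehr_rows, test_rows):
    """Left-join EHR rows to test rows on a shared pseudonym key (hypothetical field)."""
    tests_by_id = {row["pseudonym"]: row["result"] for row in test_rows}
    return [
        LinkedRecord(
            pseudonym=row["pseudonym"],
            icpc_code=row["icpc_code"],
            test_result=tests_by_id.get(row["pseudonym"]),
        )
        for row in ehr_rows
    ]


def linkage_rate(linked):
    """Share of EHR records for which a test record was found."""
    if not linked:
        return 0.0
    return sum(r.test_result is not None for r in linked) / len(linked)


# Toy records invented for illustration; they do not come from the cohort.
ehr_rows = [
    {"pseudonym": "p001", "icpc_code": "R96"},
    {"pseudonym": "p002", "icpc_code": "R95"},
    {"pseudonym": "p003", "icpc_code": "R96"},
    {"pseudonym": "p004", "icpc_code": "R95"},
]
test_rows = [
    {"pseudonym": "p001", "result": "positive"},
    {"pseudonym": "p003", "result": "negative"},
    {"pseudonym": "p004", "result": "negative"},
]

linked = link_on_pseudonym(ehr_rows, test_rows)
print(linkage_rate(linked))  # 0.75
```

A left join keeps every EHR record, including those without a match in the other source, so the share of linked records can be reported explicitly rather than silently dropping patients.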
Various research projects contributed to the creation of the COVID-19 EHR cohort dataset. Initially, the dataset was designed for the ZonMw-funded project COVID-GP to investigate the impact of the COVID-19 pandemic on GP care provision. It was subsequently enriched with EHR data from 2021 for another ZonMw-funded research project that examines risk factors for developing post-COVID syndrome.
The dataset(s) contain(s) personal data of patients. How does this impact its accessibility?
The data of the COVID-19 EHR cohort cannot and should not be shared openly, because they contain sensitive (albeit pseudonymized) personal data. Interested parties may only obtain access if they receive approval from the scientific members of the COVID-19 GP cohort, as well as from the committees of the GP networks affiliated with the UMCG, RadboudUMC, MUMC+, and Nivel. Requests for access need to follow a strict procedure that involves submitting a research proposal.
Once approval is obtained, we share EHR-data of the involved institutions in a secure way. The data is first stored in the Azure DRE data environment. This is a digital environment in which the researcher has access to and can work with the data. The EHR-data can potentially be enriched with data from Statistics Netherlands on individual characteristics such as mortality, migration background or socio-economic status.
Depending on the research question, we provide access to Azure DRE and/or the Statistics Netherlands environment. We need to use these two different research environments, because linkage with Statistics Netherlands is only possible for approximately 80% of the data due to the regulations for its pseudonymization process. Also, the anonymized free text fields (e.g. GP notes) are not available within the Statistics Netherlands environment because of its governance rules. Access to the data also involves financial arrangements, which we specify for each data request.
Why is this dataset FAIR and what were the main challenges in making it FAIR?
This dataset cannot be made openly accessible. One of our main challenges is therefore to make the information about the dataset as findable as possible, in accordance with the FAIR principle ‘as open as possible, as closed as necessary’. The metadata are made available on the Health-RI COVID-19 data portal. More information about the research projects involved (COVID-GP and Long COVID) can also be found on Health-RI, which is a Dutch initiative that facilitates the reuse of health data by providing an integrated health data infrastructure.
We realize that the accessibility of the COVID-19 EHR cohort will have to be actively maintained, also after related research projects have ended. This is important, because we should always be able to process incoming requests for data sharing for new research questions. In this way, we will contribute to assessing the long-term effects caused by the COVID-19 pandemic.
We enhanced the interoperability of the data (the ‘I’ in FAIR) by making use of the CEDAR metadata template. Where possible, we also used controlled vocabularies and ontologies, such as the International Classification of Primary Care (ICPC-1 and ICPC-2), the International Statistical Classification of Diseases and Related Health Problems (ICD-10), and the Anatomical Therapeutic Chemical (ATC) classification system.
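One practical benefit of controlled vocabularies is that codes can be checked automatically before data are shared. The Python sketch below is a simplified illustration, not the project's own tooling: the patterns only test the general shape of a code (for example a letter followed by two digits for ICPC) and do not verify that the code actually exists in the official classification. The function name `has_valid_shape` and the `PATTERNS` table are hypothetical.

```python
import re

# Simplified shape checks (assumptions for illustration, not official validators):
# ICPC:   one letter + two digits, e.g. "R96"
# ICD-10: one letter + two digits, optionally "." and one or two digits, e.g. "J45.9"
# ATC:    full seven-character code, e.g. "R03AC02"
PATTERNS = {
    "ICPC": re.compile(r"[A-Z]\d{2}"),
    "ICD-10": re.compile(r"[A-Z]\d{2}(\.\d{1,2})?"),
    "ATC": re.compile(r"[A-Z]\d{2}[A-Z]{2}\d{2}"),
}


def has_valid_shape(vocabulary, code):
    """Return True if `code` has the general shape expected for `vocabulary`."""
    pattern = PATTERNS[vocabulary]
    return pattern.fullmatch(code.strip().upper()) is not None


print(has_valid_shape("ICPC", "R96"))     # True
print(has_valid_shape("ATC", "R03AC02"))  # True
print(has_valid_shape("ATC", "R96"))      # False
```

Because some shapes overlap (a code such as "R96" fits both the ICPC and the ICD-10 pattern), the check is done against a named vocabulary instead of trying to guess the vocabulary from the code itself.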
Each new researcher will have to learn how to work with the dataset. To make this process easier, we try to make our data as reusable as possible by documenting all the steps as clearly as possible with comments in the scripts and descriptive readme files. We also apply a standard folder structure and standards for file naming, and where possible, make use of version control software. Our data scientists have also developed and are maintaining an R package to facilitate a more standardized data processing workflow for researchers who work with R, but this is not a prerequisite because we store data intended for re-use in open and shareable formats such as .csv. Our data managers and data scientists are always available to provide support if needed.
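File naming standards are easiest to keep when they can be checked by a script. The Python sketch below assumes a purely hypothetical convention, `<project>_<content>_<YYYYMMDD>_v<NN>.csv`; the interview does not state which convention the team uses, so the pattern and the function name `parse_file_name` are illustrative assumptions only.

```python
import re
from datetime import datetime

# Hypothetical naming convention: <project>_<content>_<YYYYMMDD>_v<NN>.csv
NAME_PATTERN = re.compile(
    r"(?P<project>[a-z0-9]+)_(?P<content>[a-z0-9-]+)_(?P<date>\d{8})_v(?P<version>\d{2})\.csv"
)


def parse_file_name(name):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    try:
        # Reject impossible dates such as month 13.
        date = datetime.strptime(match["date"], "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "project": match["project"],
        "content": match["content"],
        "date": date.isoformat(),
        "version": int(match["version"]),
    }


print(parse_file_name("covidgp_consultations_20221202_v01.csv"))
print(parse_file_name("final data (2).csv"))  # None
```

A check like this could be run over a project folder before data are handed over, so that files with ad hoc names are noticed early.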
What kinds of reuse do you anticipate for this dataset? What are the advantages of making data FAIR and which stakeholders will benefit from this?
The COVID-19 EHR cohort dataset is being used to identify different trajectories of COVID-19 patients. Within the COVID-GP and Long COVID projects, the dataset has already helped to examine patient consultation trends within general practices before and during the COVID-19 pandemic. It has also allowed us to investigate the nature and occurrence of persistent complaints after COVID-19 infection, i.e. post-COVID syndrome (PCS). The dataset allowed us to develop a definition of PCS and to estimate how many people suffer from it. Moreover, we are able to investigate risk factors and care pathways of these patients. The dataset was also used to develop and train an AI model to estimate COVID-19 prevalence in GP care.
The data has a high degree of reusability and can be re-used to study different research topics for the total population and for patient subgroups. This has benefited further research on the impact of the COVID-19 pandemic on general practitioner care for different subgroups, such as:
Patients suffering from chronic diseases (e.g. asthma and COPD)
Pregnant women who differ in socio-economic backgrounds (project ongoing)
Patients with dementia including medical prescriptions (project ongoing)
Children (project ongoing)
Patients with mental health problems (project ongoing)
We expect many more requests for access to the data. Other researchers could benefit from this dataset because it contains many other potential subgroups of patients.
ZonMw formulated specific open science requirements for COVID-19 research, for instance regarding the metadata standards that had to be used. How did this affect your research project?
The ZonMw FAIR requirements for COVID-19 projects have been more stringent than for other ZonMw subsidy applications. We clearly see the benefits of the increased potential for re-use of this very important data. We learned that the open science requirements formulated by ZonMw for COVID-19 research projects are divided into two sets: the first concerned the grant application phase and the second the project phase. For the grant application, we filled out a form that addressed four requirements: 1) alignment and reuse, 2) preregistration, 3) FAIR data within COVID-19 research and 4) budget for FAIR data and open access publication.
The second set of requirements has to be applied during the project phase: 5) a data management plan (DMP) and information on key items, 6) (pre)registration of the research project, 7) sharing research findings through open access publications and 8) access to research data and metadata.
The project leads, together with the data stewards, are aware of and collaborate on these requirements. Some key items have been finalized (such as the DMP, data format, data model, vocabularies and ontologies, and the metadata scheme), while others are currently in progress (DOI, digital repositories, online catalog where the data can be registered, and the terms of (re)use).
Project page of the COVID-GP project (in Dutch)
About the author
Jitka Vavra is a FAIR data specialist at the Digital Competence Centre of the University Medical Center Groningen (UMCG DCC). She contributes to the Pillar FAIR Data & Software in the Open Science Program of the University of Groningen.
Leon ter Schure is a scholarly information specialist at the UG Library and pillar leader for the FAIR Data & Software pillar in the Open Science Program of the University of Groningen.