Digital Competence Centre

your one-stop for research IT and data

De-identification

Before you start your research, think about which data you actually need to collect. Data minimization means collecting the minimum amount of identifiable data that you need: only what is strictly necessary, and at no finer level of detail than you require (e.g. an age category is often sufficient, so you do not need the year or full date of birth). If you need to link datasets, you can use an ID code instead of a name. This is always the basic principle, in line with the GDPR requirement of data protection by design and by default.

How can you de-identify your data?

After data collection, you might no longer need all the data you collected, or you may want to make the data less identifiable for privacy reasons. De-identification involves removing data (deleting) or masking it (through pseudonymization, generalization and aggregation).

Anonymisation

Anonymisation involves removing direct and indirect personal identifiers. It is important for data processing and data sharing. However, removing or aggregating all indirect identifiers can make a dataset useless for research, so some records in a research dataset may remain re-identifiable. When that is the case, it is important to put administrative and technical measures in place.

Generalization

Generalization is something that you usually apply in your research design: it means classifying information in categories. Examples of generalization are recording an age, or even an age category (10-19, 20-29), instead of a date or year of birth, and using the postcode as an approximation of an address rather than the street name and house number. If it is impossible to apply generalization during data collection, generalize your data as soon as possible after obtaining it.
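As an illustration, the two examples above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a prescribed procedure: the field names (`participant_id`, `date_of_birth`, `street`, `house_number`, `postcode`) and the sample record are hypothetical, and the age bands are written as non-overlapping ranges so that every age falls into exactly one band.

```python
from datetime import date


def age_on(birth: date, reference: date) -> int:
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


def age_band(age: int, width: int = 10) -> str:
    """Map an age to a non-overlapping band such as '20-29'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize(record: dict, reference: date) -> dict:
    """Keep only coarse fields; date of birth, street and house number are dropped."""
    age = age_on(record["date_of_birth"], reference)
    return {
        "participant_id": record["participant_id"],
        "age_band": age_band(age),
        "postcode": record["postcode"],
    }


# Hypothetical record, for illustration only.
record = {
    "participant_id": "P001",
    "date_of_birth": date(1990, 6, 15),
    "street": "Example Street",
    "house_number": "12",
    "postcode": "9712 CP",
}

print(generalize(record, reference=date(2022, 5, 17)))
# {'participant_id': 'P001', 'age_band': '30-39', 'postcode': '9712 CP'}
```

Fixing the reference date (for example the date of data collection) keeps the derived age band stable, whenever the script is run.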

Aggregation

Aggregation also involves grouping information in larger clusters, and is applied when unique categories still occur in the data despite the generalization that has been applied. An example is choosing age categories so that there are ten people in each group, rather than using the exact age or fixed age categories. Another example of aggregation is to remove the letters from the postcode, should it become apparent that only one person, or only a few people, are registered at a particular postcode. In the same vein, you could use the area code instead of the full telephone number.
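The age and postcode examples above might look like this in Python. It is a sketch only: the function names are made up for this illustration, the greedy grouping is one of several possible ways to form groups of a minimum size, and the postcode helper assumes the Dutch format of four digits followed by two letters.

```python
def min_size_age_groups(ages, min_size):
    """Split ages into consecutive (low, high) ranges of at least min_size people.

    Greedy sketch: identical ages are never split across two groups, and a
    leftover group that is too small is merged into the previous one. If there
    are fewer than min_size people in total, a single smaller group is returned.
    """
    ordered = sorted(ages)
    groups, current = [], []
    for i, age in enumerate(ordered):
        current.append(age)
        is_last = i == len(ordered) - 1
        next_differs = is_last or ordered[i + 1] != age
        if len(current) >= min_size and next_differs:
            groups.append(current)
            current = []
    if current:
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return [(group[0], group[-1]) for group in groups]


def strip_postcode_letters(postcode):
    """Keep only the digits of a postcode, e.g. '9712 CP' becomes '9712'."""
    return "".join(ch for ch in postcode if ch.isdigit())


# Hypothetical ages, for illustration only.
ages = [21, 22, 22, 25, 31, 34, 38, 47, 52, 60]
print(min_size_age_groups(ages, min_size=3))  # [(21, 22), (25, 34), (38, 60)]
print(strip_postcode_letters("9712 CP"))      # 9712
```

Note that the resulting age ranges depend on the data, so they need to be documented alongside the dataset.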


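Quasi-identifiers are sometimes aggregated to raise the k-anonymity of a dataset to a certain level (a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records), but aggregation alone does not necessarily result in anonymous data.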
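To check the effect of aggregation, you can compute k as the size of the smallest group of records that share the same quasi-identifier values. The sketch below assumes records stored as dictionaries; the column names and sample values are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("k-anonymity is undefined for an empty dataset")
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values())


# Hypothetical records, for illustration only.
detailed = [
    {"age_band": "20-29", "postcode": "9712 CP"},
    {"age_band": "20-29", "postcode": "9712 AB"},
    {"age_band": "30-39", "postcode": "9713 GX"},
    {"age_band": "30-39", "postcode": "9713 GZ"},
]
# Aggregate the postcode by keeping only its four digits.
aggregated = [{**r, "postcode": r["postcode"][:4]} for r in detailed]

print(k_anonymity(detailed, ["age_band", "postcode"]))    # 1
print(k_anonymity(aggregated, ["age_band", "postcode"]))  # 2
```

A higher k only describes the listed quasi-identifiers; other variables in the dataset can still make a record identifiable.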

Pseudonymisation

Pseudonymisation involves separating directly identifying personal data from substantive data, optionally maintaining a link between the two through an arbitrary key. The GDPR explicitly mentions pseudonymisation as one measure that helps to comply with its requirements and that increases the privacy and security of personal data processing.
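The separation described above can be sketched in Python as follows. This is an illustration under assumptions, not a complete procedure: which fields count as direct identifiers (`name` and `email` here) is hypothetical and depends on your dataset, and the key table would in practice be stored separately from the research data, with restricted access.

```python
import secrets

# Hypothetical choice of directly identifying fields.
DIRECT_IDENTIFIERS = ("name", "email")


def pseudonymise(records):
    """Split records into a key table and research data linked by a random pseudonym."""
    key_table = {}      # pseudonym -> direct identifiers (store separately)
    research_data = []  # substantive data plus the pseudonym
    for record in records:
        pseudonym = secrets.token_hex(8)
        while pseudonym in key_table:  # regenerate in the unlikely case of a collision
            pseudonym = secrets.token_hex(8)
        key_table[pseudonym] = {f: record[f] for f in DIRECT_IDENTIFIERS}
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row["pseudonym"] = pseudonym
        research_data.append(row)
    return key_table, research_data


# Hypothetical participants, for illustration only.
participants = [
    {"name": "A. Example", "email": "a@example.org", "age_band": "30-39", "score": 7},
    {"name": "B. Example", "email": "b@example.org", "age_band": "20-29", "score": 5},
]

key_table, research_data = pseudonymise(participants)
print(research_data)  # contains no names or e-mail addresses
```

The pseudonym is random rather than derived from the name, so it cannot be recomputed by someone who knows who took part. Deleting the key table removes the link, although the remaining data may still be identifiable through indirect identifiers.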


More information on de-identification (≠ anonymization):

Risk management for research data about people (incl. pseudonymization) (LCRDM)

About pseudonymization and securing the pseudonymization key (LCRDM) (Dutch only)

Removing identifiers from human data (guide from the University of Sydney)

De-identification training exercises by the UK Data Service (exercises using quantitative and qualitative data)

10 misunderstandings about anonymisation (European Data Protection Supervisor)


Last modified: 17 May 2022 11.43 a.m.