‘Safe Outputs’ and Statistical Disclosure Control in OpenSAFELY | Bennett Institute for Applied Data Science

We have previously described how OpenSAFELY allows efficient access to data whilst maintaining extremely high standards for data privacy by following the Five Safes framework. Here, we will take a deeper look at the ‘Safe Outputs’ dimension of this framework, and how it is applied in OpenSAFELY.

What data is available in OpenSAFELY?

In the previous blog, having considered the other 4 dimensions of the Five Safes framework, we described Safe Outputs as the assessment of any residual risk in outputs being released from a secure environment like OpenSAFELY. To understand this residual risk, it is first useful to understand the types of data available within OpenSAFELY.

OpenSAFELY allows access to the full pseudonymised primary care record for more than 57 million patients. This contains information such as medical diagnoses, clinical tests and medications prescribed to a patient, as well as information from linked datasets such as hospital admissions (made available via the Secondary Uses Service, SUS) and registered deaths (available via ONS). This is an incredibly rich data resource, especially during a global pandemic; it contains large volumes of longitudinal data that can be used for epidemiological research, including understanding disease risk, monitoring the delivery of healthcare, describing uptake of new treatments, assessing patient safety and informing public health guidance and policy. However, it does present challenges for protecting patient confidentiality.

Individual-level data like that available in OpenSAFELY is commonly referred to as microdata. Microdata may contain both direct identifiers and indirect identifiers (sometimes called quasi-identifiers). Direct identifiers are attributes that are considered personally identifiable; these include names, medical record numbers and dates of birth. Indirect identifiers are attributes that can’t be used on their own to identify an individual, but in combination may allow an individual to be identified; these include attributes such as medical conditions, ethnicity or a prescription record. Both of these types of attributes can be classified as confidential data, that is, data that might cause harm to an individual if released without appropriate controls. It is therefore very important to consider the risk of disclosure when processing this information.

How do we protect patient confidentiality?

Pseudonymisation is a widely used method to protect patient confidentiality whereby direct identifiers are removed and replaced with pseudonyms using a technique called masking. An example is replacing patient names with a pseudo ID number. In OpenSAFELY, all available data is pseudonymised at source by the EHR providers; only data processor staff working for the EHR vendor (and GP clinical staff who can access the data for the purposes of direct patient care) have access to the raw data. This protects against primary disclosure, the situation where confidential information can be obtained directly from the data source. Whilst this reduces risk from direct identifiers, pseudonymisation alone is not enough to preserve privacy.

Pseudonymisation does not address the potential for indirect identifiers, whether in the data itself or in other data sources, to be linked to the pseudonymised data and so allow re-identification. This is known as secondary disclosure. For example, it has been shown that the majority of the population of the United States are likely to be unique based on only three simple variables (5-digit ZIP, gender and date of birth). In addition, events recorded in electronic health records often have dates attached, which further increases the risk of re-identification; apparently non-identifiable patterns in longitudinal data are often distinct and can be used to identify an individual.
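The uniqueness argument can be checked on any dataset by counting how many records share each combination of indirect identifiers. The Python sketch below is a toy illustration under stated assumptions: the function name, the field names and the four records are invented, and it measures only uniqueness within the sample, not within a wider population.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears only once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(
        1 for r in records
        if combos[tuple(r[q] for q in quasi_identifiers)] == 1
    )
    return unique / len(records)


# Made-up records with no direct identifiers at all.
people = [
    {"zip": "02138", "gender": "F", "dob": "1960-01-01"},
    {"zip": "02138", "gender": "F", "dob": "1960-01-01"},
    {"zip": "02138", "gender": "M", "dob": "1960-01-01"},
    {"zip": "02139", "gender": "F", "dob": "1975-06-30"},
]

by_zip_only = unique_fraction(people, ("zip",))                  # 0.25
by_all_three = unique_fraction(people, ("zip", "gender", "dob"))  # 0.5
```

Adding variables can only keep the unique fraction the same or raise it, which is why combinations of individually harmless attributes need to be considered together.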

It is therefore necessary for increased protection to be given to the outputs of analyses run on confidential datasets like those used by OpenSAFELY. One key feature of OpenSAFELY, as well as other remote research environments, is that no analysis results can be released from the secure environment without first being independently checked by two reviewers. This is required to ensure no disclosive data or information that could lead to a data subject being re-identified is released. Assessing the residual risk of disclosure from these research outputs and the approaches to controlling this risk is known as output Statistical Disclosure Control (SDC) and is used by statistical agencies around the world to ensure confidentiality of sensitive data.

Statistical disclosure control

Assessing disclosure risk

The goal of SDC is to prevent the release of information that could identify individuals while still allowing the release of high-quality statistical data. Striking this balance is crucial to ensuring maximum public benefit from confidential data sources.

Research outputs can be categorised into different types (e.g. frequency tables, survival model results), each of which can be classified as ‘safe statistics’ or ‘unsafe statistics’. For safe statistics, the default position is that they will be released; only by exception will they be withheld. Unsafe statistics are outputs which by default will not be released unless the requester can show that they are non-disclosive.

There are two approaches to deciding whether an output can be released: rules-based and principles-based. In a rules-based approach, a fixed set of ‘hard rules’ is developed for different types of outputs; these rules must be met before an output is released. The advantages of this approach are that it is very clear what can and can’t be released, the rules are consistent and transparent, and it does not require specialist training to implement. However, this approach is also inflexible and can be overly restrictive, especially in environments such as OpenSAFELY, where a wide range of research outputs are produced using a variety of different statistical approaches.

Like the rules-based approach, the principles-based approach starts with a set of rules. However, the ‘hard rules’ are instead treated as guidelines by which outputs can be assessed given the context in which they have been produced. This approach means that reviews of outputs take longer and that output checkers need specialist training, but it is more flexible to the needs of the researchers requesting review of the outputs, which can ultimately result in more efficient use of the underlying data.

In OpenSAFELY, we operate a principles-based approach; researchers provide context around all of their study outputs, which is used by our trained output checkers to 1) understand the data being requested for release, 2) decide whether there are situations where disclosure is likely to occur if an output is released, and 3) suggest disclosure controls that could be applied to make the data safe for release.

Reducing disclosure risk

There are many different methods commonly used to reduce disclosure risk for microdata. These methods can be broadly divided into perturbative and non-perturbative methods. Perturbative methods sacrifice data truthfulness by introducing some error into the data and include approaches such as rounding, microaggregation (aggregation of continuous variables into ordinal variables, e.g. binning a continuous age variable into age bands) and record swapping (swapping of a sensitive variable between records which are similar for other characteristics but differ in their sensitive variable at a specified rate). Non-perturbative methods sacrifice data completeness and include approaches such as sampling, recoding (reducing the number of distinct categories of values, e.g. in OpenSAFELY we allow ethnicity to be aggregated into 16 or 6 groups) and cell suppression (redacting values within a table). Choosing the correct approach depends on the type of analysis the data is used for and the types of outputs required. This handbook is a useful reference, covering the disclosure considerations for common output types as well as suggested approaches to reducing their risk. More in-depth guidelines also exist for specific types of output such as regression and survival analysis.
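As a concrete illustration of two of the methods above, the Python sketch below bins a continuous age into age bands and recodes a detailed category into a broader one. The band width, the top band and the three-entry mapping are assumptions chosen for the example; in particular, the mapping is not the ethnicity grouping used by OpenSAFELY.

```python
def age_band(age, width=10, top=90):
    """Bin an age in whole years into a band such as '30-39' (top band is open-ended)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def recode(value, mapping, default="Other"):
    """Collapse a detailed category into a broader group."""
    return mapping.get(value, default)


# Illustrative mapping only: a few detailed categories and their broader group.
BROAD_GROUP = {
    "Indian": "Asian or Asian British",
    "Pakistani": "Asian or Asian British",
    "Bangladeshi": "Asian or Asian British",
}

bands = [age_band(a) for a in (0, 37, 89, 95)]
groups = [recode(v, BROAD_GROUP) for v in ("Indian", "Pakistani", "Not listed")]
```

Both helpers reduce the number of distinct values a variable can take, so fewer patients end up in a category of their own, at the cost of some detail in the analysis.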

The most common disclosure control method applied in OpenSAFELY is suppression. We require that any statistic describing 5 or fewer patients, either directly or indirectly, should be redacted. This threshold of 5 is commonly used for safe dissemination of health statistics and is used by statistical agencies such as the Office for National Statistics for high-risk data.

This includes not only tables containing patient counts, but also graphical figures whose underlying values describe 5 or fewer patients. We also strongly recommend rounding all counts produced using OpenSAFELY to reduce the risk of secondary disclosure; for research conducted on populations as large as those available through OpenSAFELY, this is an easy way to reduce disclosure risk whilst having a minimal impact on the accuracy of statistical conclusions.
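The suppression and rounding steps described above can be sketched in a few lines of Python. The threshold of 5 comes from the text; the rounding base of 5, the function name and the example table are assumptions for illustration, and this is not OpenSAFELY's own tooling. Treating a count of zero as redacted follows the "5 or fewer" wording literally.

```python
def suppress_and_round(count, threshold=5, base=5):
    """Redact counts at or below the threshold; round the rest to the nearest base.

    Returns None for a redacted cell.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count <= threshold:
        return None  # redacted: describes 5 or fewer patients
    # Integer arithmetic avoids floating-point surprises.
    return base * ((count + base // 2) // base)


# Made-up frequency table: category -> number of patients.
table = {"group A": 1234, "group B": 3, "group C": 47}
safe_table = {label: suppress_and_round(n) for label, n in table.items()}
# {'group A': 1235, 'group B': None, 'group C': 45}
```

For a count in the thousands, rounding to the nearest 5 changes the value by a fraction of a percent, which illustrates why rounding has little effect on conclusions drawn from large populations.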

Conclusion

This post gives a brief overview of SDC and how it is used by OpenSAFELY to ensure the ‘Safe Outputs’ aspect of the Five Safes framework is met. In the next post in this series, we will see what this looks like in practice and describe the OpenSAFELY output checking service in more detail.