Using electronic health records and open science in the COVID-19 pandemic
We recently presented on behalf of our team (grp-EHR) at an open science workshop hosted by the MRC Biostatistics Unit at the University of Cambridge. This was a great opportunity to reflect on our team’s experiences using OpenSAFELY and, more broadly, embracing open and team science approaches to our work. In this blog, we’ll introduce grp-EHR, give an overview of the open and team science approaches we’ve adopted, reflect on their advantages and disadvantages, and finally share what we learned at the workshop.
Our team and what we do
The “EHR” in grp-EHR stands for electronic health records. grp-EHR is an interdisciplinary group of medical statisticians, epidemiologists and health data scientists from the Universities of Bristol, Cambridge, Plymouth and Leicester. The group formed during the COVID-19 pandemic to address urgent, policy-relevant questions using linked electronic health record data. From the beginning, the group has embraced “team science” and adopted open science practices, which we introduce and describe below.
Open Science practices
We have broken down the grp-EHR open science approach into four components: protocols, code review, output checking and preprints. Below, we introduce each of these and describe how we have used them within our work.
Pre-specified protocols
Writing pre-specified protocols (sometimes referred to as analysis plans) has become an integral part of our workflow. This way of working can reduce bias, ensure the analytical approach is based on the most rigorous methods and streamline the production of a scientific manuscript. A protocol generally consists of an introduction, the research questions, the data sources to be used, the study population, exposures, outcomes and an outline of the planned analyses. Of course, projects can change course over time, so we track and openly document any changes made to the protocol throughout the project. The protocol is made openly available in the GitHub repository alongside our code.
Version control and code review using Git
Learning to use Git and GitHub has probably been the biggest change grp-EHR has adopted in comparison to previous workflows. Grp-EHR projects are all tracked using Git and are published on GitHub, so that the whole team has access to each other’s code. Within projects, we aim to always have more than one researcher involved in code development. We typically have a main branch, which is the version of the code that is eventually run on the real data. We create separate branches to add or update code, and before merging to the main branch we create a pull request, which we assign to another researcher to review. These are some of the advantages that we’ve found with this process:
- Having more than one set of eyes on the code means we’re more likely to identify errors
- It’s much easier to revert to previous versions of the project
- We’ve learnt a lot from each other’s coding styles
- We’ve found that we code differently when we know it’s being reviewed – it’s much more readable and better annotated (rather than only thinking about readability and annotation at the end of the project!)
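The branch-and-review cycle described above can be sketched with a few Git commands. This is a minimal, illustrative example (the branch and file names are made up, and in our real workflow the merge happens on GitHub via a pull request approved by a second researcher; here we simply merge locally so the sketch is self-contained):

```shell
# Work in a throwaway repository so the sketch is self-contained.
repo=$(mktemp -d) && cd "$repo"
git init -q
git checkout -q -b main
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Create a separate branch for the change rather than committing to main.
git checkout -q -b update-outcome-codelist
echo "# updated outcome codelist" > outcomes.py
git add outcomes.py
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Update outcome codelist"

# In practice: push the branch, open a pull request, and assign a
# colleague to review it. Once approved, the branch is merged into main.
git checkout -q main
git merge -q --no-edit update-outcome-codelist
git log --oneline
```

The key point is that no change reaches `main` without passing through a branch that someone else has looked at.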
Output checking
The data we work with are highly sensitive, so it is crucial that any outputs released from the secure environment pass rigorous statistical disclosure control. For example, a researcher might want to release results from a regression analysis. They submit a request for the results to be released, and two independent trained output checkers are assigned to review the outputs in accordance with the Five Safes framework: a set of principles for making decisions about the secure and effective access and use of data. The output can only be released once both checkers have approved it. Everyone who performs output checking on our team is an ONS accredited safe researcher and has completed an output checking course, which includes an exam with a strict pass mark. The advantage of this approach is that we are far less likely to accidentally release disclosive results. On the flip side, output checking can be time consuming, but we are continuously looking for ways to improve the efficiency of this essential process.
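To make statistical disclosure control concrete, here is a minimal Python sketch of one common rule: suppressing small cell counts before a table of results leaves the secure environment. The threshold of 5 and the example table are assumptions chosen purely for illustration, and in practice output checking is performed by trained human checkers, not a script:

```python
# Illustrative sketch: suppress any cell count below a threshold so that
# small groups of patients cannot be identified from released tables.
# The threshold (5) and the example counts are assumptions for illustration.

SMALL_CELL_THRESHOLD = 5

def redact_small_counts(table, threshold=SMALL_CELL_THRESHOLD):
    """Replace counts below `threshold` with "[REDACTED]"."""
    return {
        group: (count if count >= threshold else "[REDACTED]")
        for group, count in table.items()
    }

counts = {"vaccinated": 10432, "unvaccinated": 871, "unknown": 3}
print(redact_small_counts(counts))  # the small 'unknown' cell is suppressed
```

A rule like this catches the most obvious disclosure risk (small counts), but human checkers also look for subtler problems, such as results that become disclosive when combined across releases.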
Preprints
Once a manuscript is ready to submit to a journal for peer review, we aim to post it on a preprint server such as medRxiv. This means the work is openly available during the peer review process, which can take many months. This has been particularly important for work relating to the COVID-19 pandemic, where there is an urgent need to disseminate new findings as quickly as possible. A disadvantage of preprinting is that the findings have not yet been peer-reviewed, and so should be interpreted with caution.
Advantages and disadvantages of these open science practices
Common advantages across all these practices are that they increase transparency, reproducibility, and reusability of our work. Many of the practices also improve the quality of our work, as errors are more likely to be identified.
It’s important to also acknowledge that adopting these practices can feel like a time-burden at first, especially when working to tight deadlines (as we often are with COVID-19 related work). Most researchers in grp-EHR were not familiar with Git and GitHub before joining the team – learning to use these tools was a steep learning curve. The level of teamwork and collaboration was also new to many of us who were used to working as the sole analyst on a project, so waiting for colleagues to review code and outputs could sometimes feel like a bottleneck. However, in the long run we’ve found that these practices allow us to be much more efficient in building on current work (e.g. addressing comments following peer review), replicating work (e.g. in other populations) and setting up new studies.
Learnings from an Open Science workshop
We received an immensely positive reaction from our presentation at the workshop, so one take-home message was that grp-EHR are doing great work in embracing open and team science! It was also interesting to hear from researchers in other fields.
One talk that stood out was from Guillermo Reales, who (with Chris Wallace) investigated whether sharing GWAS (genome-wide association study) summary statistics results in more citations. They found that “sharers get on average ~75% more citations, independently of journal of publication and impact factor, and that this effect is sustained over time” (preprint and Twitter thread). This got us wondering whether publishing clinical code lists has a similar effect in the EHR space.
Reflecting on grp-EHR’s adoption of open science and team science approaches made us really proud of how much we’ve all adapted our working style over the past couple of years, on top of delivering timely results under the pressure of a pandemic! The Open Science workshop was a great opportunity to share approaches and learn from other researchers, and we’d be keen to attend more events on open science in the future.
For more information about grp-EHR, please contact Jonathan Sterne (Jonathan.Sterne@bristol.ac.uk).