Re-running your study, quickly and easily: why automated pipelines matter for research
Health data research almost always depends on writing complex computational code to analyse increasingly large datasets. Writing these scripts is time-consuming and challenging, yet the code is often discarded at the end of the project or on completion of the final paper. Very rarely is the code re-run or reused, although doing so has clear efficiency benefits. Furthermore, this code is often written as individual scripts that must be manually executed in sequence by a person, sometimes with additional steps (such as moving files or checking outputs) that a human must perform to deliver the final tables and graphs.
With OpenSAFELY, we set out to rationalise and automate the whole process of data curation and analysis in end-to-end pipelines. This was necessary to deliver our overarching goals around reproducibility, security and efficiency. In this blog, we present an example of how these kinds of pipelines - which run by themselves once triggered - can also help researchers re-run a complex analysis quickly and easily in order to update the study results.
The study at hand is our BMJ paper from 2021, which looked at whether adults living with school-age children had a higher (or lower) risk of COVID-19 death, hospital admission, ICU admission, or infection. We initially conceptualised the study in the late Summer months of 2020, immediately after the first wave, and the code was written then. However, the scale of the data (>17 million adults), plus the complexity of linking primary care and secondary care data - and clustering by household - meant that we ran into some significant computational limits. This required work to improve the capacity of the OpenSAFELY platform and remove some bottlenecks, as well as adjusting our code to use fewer resources where possible. As this was time-sensitive research, completing the largest and most complex study we had run to date took a coordinated cross-disciplinary effort. The BMJ published our paper in early 2021.
Then, at the end of Summer 2021, people began to ask whether the results might have changed in the third wave.
Normally, the prospect of re-opening and re-running a huge analysis of this kind might make a research team feel weary! In this case the circumstances were very different. We had done all the coding to run the analysis - and optimised the code to run efficiently - in the first wave. In OpenSAFELY, the whole pipeline of curation and multiple analyses is defined in a single “project.yaml” file that lists every step to be run. We created a new study repository, changed the index dates to capture the third wave, pressed the button on the job server to re-run the analysis almost a year later… and it ran without issue, on completely updated data.
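To illustrate, a pipeline of this kind looks roughly like the sketch below (the action names and file paths here are hypothetical, not taken from the study's actual repository): each action declares the command it runs, the outputs it produces, and the earlier actions it depends on, so the whole chain can be triggered with one click.

```yaml
version: "3.0"

actions:
  # Hypothetical example: extract the study population, then run a model on it.
  generate_study_population:
    run: cohortextractor:latest generate_cohort --study-definition study_definition
    outputs:
      highly_sensitive:
        cohort: output/input.csv

  run_model:
    run: stata-mp:latest analysis/model.do
    needs: [generate_study_population]
    outputs:
      moderately_sensitive:
        log: logs/model.log
```

Because the dependency graph lives in one file, re-running the entire analysis on updated data means re-triggering the pipeline, rather than a person executing scripts in the right order by hand.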
The most complicated question was a methodological and interpretative one: how to deal with the Euro championship at the end of the third wave, which could confound our results, and how to account for the population-wide COVID-19 vaccination programme.
This made us happy for two reasons. Firstly, it meant we didn’t have huge headaches and traumas on that single project. But secondly, it showed the power of investing time and effort up-front in standardising your curation and analysis pipelines, and ensuring that whole analytic arcs can be sent off to run with a single click on the OpenSAFELY Job Server.
For us, there are three key takeaways from this story:
Solve complex computational problems once! It took two software developers to help resolve the issues with clustering in a large dataset during the first wave. It was a significant investment, but it has paid dividends: the study has now been run three times, and we learned a lot along the way.
Work hard to produce generalisable code that can be reused! We ran our code in the OpenSAFELY platform, which has a specific way of writing a study definition, including defining index dates as variables rather than baking bespoke dates into every line of the analysis. Because these dates are visible in the study definition, we could change and update them quickly and easily. We have also done this with our Service Restoration work, for example our weekly vaccine uptake reports, which are available at reports.opensafely.org.
Magic happens when epidemiologists work closely with professional software developers! We have a mixed team of developers, clinicians and epidemiologists, and this brings huge advantages for writing repeatable and reusable code. Epidemiologists know a lot about designing and running an analysis. Developers are used to working with code that gets re-run many times and have methods and tools to increase efficiency and speed: not just esoteric techniques for computational efficiency, but hard-won practical lessons around keeping code legible and generalisable. This means understanding both current and future problems, and choosing the right trade-off: code that can quickly be reused for other similar tasks, without delaying delivery of the urgent output at hand. That kind of work was key to the success of this one study, and to the whole work of the OpenSAFELY platform and collaboration!
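The "index dates as variables" idea from the second takeaway can be sketched in plain Python. This is an illustrative pattern only, not OpenSAFELY's actual study-definition API; the names below are made up.

```python
from datetime import date, timedelta

# Illustrative only: a single, clearly visible index date drives the whole
# analysis window, instead of hard-coded dates scattered through every
# script. Re-running the study for a later wave means changing one value.
INDEX_DATE = date(2021, 6, 1)   # hypothetical start of a later wave
FOLLOW_UP_DAYS = 90

def study_window(index_date: date, follow_up_days: int) -> tuple[date, date]:
    """Return the (start, end) dates of the follow-up period."""
    return index_date, index_date + timedelta(days=follow_up_days)

start, end = study_window(INDEX_DATE, FOLLOW_UP_DAYS)
print(start, end)  # 2021-06-01 2021-08-30
```

The point of the design is that every downstream step derives its dates from the one declared variable, so updating the study is a one-line change that the whole pipeline picks up automatically.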