Why ehrQL? | Bennett Institute for Applied Data Science

ehrQL is a query language for electronic health record data. It was designed to support the privacy, reproducibility and transparency needs of the OpenSAFELY platform; but it is a standalone tool usable outside of OpenSAFELY and, potentially, on datasets outside of healthcare altogether.

The philosophy and design goals of ehrQL are best explained as an answer to the very reasonable question …

Why didn’t you just use SQL?

1. We need to be able to write queries once and then run them in situ against data stored in very different databases

The OpenSAFELY platform runs against data stored in a growing variety of different databases. For the most part, these speak SQL. But they all speak different dialects of SQL and support different sets of features. Or – worse still – they seem to provide equivalent features but behave in subtly different ways. And they have radically different performance characteristics meaning that queries need to be structured very differently for each database in order to run in a sensible time.

A common solution to this problem is to pool the data in a single system where it can be queried using a single SQL dialect. But OpenSAFELY’s privacy and security goals require that we minimise data movement and, as far as possible, query data in situ in whatever system it happens to be stored in.

ehrQL provides a unified, consistent language that can be used across any database (once we’ve written the necessary adaptor code). An extensive test framework ensures that queries behave in the same way across databases. And a query optimiser deals with the different efficiency needs of each database.

In a sense, this is the original – unrealised – promise of SQL: describe the data you want using a standardised language and the database will work out the most efficient way of retrieving it. ehrQL is able to deliver on this promise because it is tightly focused on solving a specific set of problems in a specific domain, and not attempting to be a fully general language for working with data (see “What kinds of query is ehrQL designed for?” below).

2. We want to be able to build a library of reusable variables

As a language, SQL does not lend itself to composability and reusability: it’s not easy to take part of a SQL query, say the part which identifies patients with severe asthma, and extract it out into something which can be easily slotted in to any other query which needs to identify asthma cases. It’s even harder if the asthma definition needs to interact with other parts of the query e.g. “find asthma diagnoses that were at least four weeks prior to these hospitalisation events”.

ehrQL was designed so that queries can be split into small components which can then be re-used between studies, reducing error-prone and duplicative work.

3. We need to be able to develop queries and analysis code without having direct access to the data

OpenSAFELY inverts the traditional relationship between researchers and the public on whose data they are working: traditionally, researchers can see the public’s data but the public cannot see the researcher’s code; in OpenSAFELY the public can see the researcher’s code but the researcher does not have direct access to public’s data.

This is only possible if researchers have some way of developing their code, and confirming that it works as expected, without access to the data. To support this, ehrQL can generate dummy data. That is, it can take a query and generate pseudo-random results which involve no real patient data (and hence no privacy concerns) but whose basic structural properties correspond to those of the real data. This allows the researcher to confirm that their ehrQL and their analysis code is doing what they intend it to do without ever touching real patient data.

Generating this dummy data requires code which can “work backwards” through an ehrQL query and determine what kind of data it needs to generate in order that the query will have enough results to be usable for verification purposes. This is challenging code to write, and would be more challenging still, if not impossible, with a fully general query language like SQL. The constrained and focused semantics of ehrQL thus make this dummy data generation feasible in a way it would not be with SQL.

4. We want a language which is readable by non-experts

It’s important for the goals of OpenSAFELY that even those without deep technical knowledge or extensive training in ehrQL can understand what a given ehrQL query is doing.

SQL queries of any size and complexity can quickly become unwieldy. Partly this is because SQL doesn’t naturally support breaking queries down into small, clearly named components which can be gradually built up into a more complex whole. It is also because SQL needs to support a wide variety of types of query against data structured in a wide variety of ways.

ehrQL is designed to be composable and to have a more natural “flow” than SQL. Because it is focused on a specific problem domain it can dispense with a lot of the complexity (e.g. explicit JOINS between tables) which can make the equivalent SQL hard to interpret.

What kinds of query is ehrQL designed for?

The above features of ehrQL are only possible because ehrQL isn’t attempting to be an all-purpose query language, suitable for any kind of query on any dataset. Rather, ehrQL is designed to query datasets containing facts and events related to individuals. (At present, these individuals are always patients recorded within an EHR system but nothing in the fundamental design of ehrQL restricts us to this.) Having the individual as the fundamental unit of analysis which ties together entries spread across disparate tables is what allows us to make the simplifications and affordances that give ehrQL its advantages over SQL.

ehrQL also benefits from reducing the scope of its ambitions with respect to SQL. ehrQL isn’t a language designed to let you answer any question you might have about your data; rather it is a language for getting the data you need in order to answer whatever questions you have. That is, it functions as step one in an analytic pipeline where the actual analysis is done in R or Python or Stata or whatever other language you like.

Can I use ehrQL now?

Yes! As with all our OpenSAFELY code, ehrQL is freely available under an open license. We have extensive technical documentation, including a getting started guide and full language reference available at:
https://docs.opensafely.org/ehrql/

ehrQL is the result of a huge amount of effort by many people (over 4,000 commits by 16 contributers over two years) and is still under very active development, all of which happens in the open so you can follow along here:
https://github.com/opensafely-core/ehrql