OpenPathology: Issues with reference ranges — Part 3 | Bennett Institute for Applied Data Science

This is the third instalment in our series of commentaries on using reference ranges to interpret pathology test results.

Reference ranges vary between labs

Classically, the reference range is defined statistically: it is the interval within which 95% of the values from a healthy reference population fall. Therefore 2.5% of healthy people will have (for example) a haemoglobin concentration below the lower limit, and 2.5% will have one above the upper limit. (Read more in our previous blog).
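
As a rough illustration of the statistical definition above, the following Python sketch derives a 95% interval from a simulated healthy population and counts how many healthy results fall outside it. The population (mean 135 g/L, standard deviation 10 g/L) and the function name `reference_interval` are assumptions made for this example; real reference range studies use carefully selected populations and more involved methods.

```python
import random
import statistics


def reference_interval(values):
    """Return the (lower, upper) limits enclosing the central 95% of values."""
    # n=40 splits the data into 40 equal-probability groups, so the 39 cut
    # points sit at 2.5%, 5%, ..., 97.5%. The first and last are the limits.
    cuts = statistics.quantiles(values, n=40)
    return cuts[0], cuts[-1]


# Simulated "healthy" haemoglobin values in g/L. The mean and spread are
# made-up illustration values, not taken from any real reference population.
rng = random.Random(0)
healthy = [rng.gauss(135, 10) for _ in range(10_000)]

lower, upper = reference_interval(healthy)

# By construction, roughly 5% of these healthy people fall outside the range.
outside = sum(1 for v in healthy if v < lower or v > upper) / len(healthy)
print(f"reference range: {lower:.0f} to {upper:.0f} g/L")
print(f"healthy results flagged as abnormal: {outside:.1%}")
```

The point of the sketch is that a statistical range flags about one in twenty healthy people by design, whatever the analyte.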

Sometimes intervals are set by consensus (as a “decision limit”) instead. For example, the upper limit of normal for Prostate Specific Antigen (PSA), a test used to identify cases of possible prostate cancer, is defined in the guidelines as 4 ng/mL (but action should be taken above 3 ng/mL for men aged 50-69), and this is reflected in reference ranges. In other words, a result outside a consensus reference range is supposed to be indicative of some kind of pathology (at a certain level of confidence).

Which is the correct reference range?

There is often considerable debate about the correct levels. The WHO definition of anaemia uses haemoglobin reference ranges defined in 1968, and these have continued to be debated on the basis of real-world outcomes (such as all-cause mortality) ever since.

In an ideal world, most reference ranges would be created through a process of scientific consensus based on clinical indications, but pragmatically, the majority are statistical. This creates an interesting disconnect with guidelines, which refer to absolute numbers. For example, abnormally high Thyroid Stimulating Hormone (TSH) is defined by the outliers in a normal population as somewhere between 4 and 5 mU/L, and this is usually the figure given in reference ranges; whereas guidelines suggest thyroxine treatment should be initiated above the precise figure of 10 mU/L.

How do labs come up with different reference ranges?

Traditionally, it has been recommended that each lab produce its own reference ranges based on testing local normal populations, using their own analysers. In practice, this rarely happens. One survey of labs in Birmingham showed 35% used ranges provided by the manufacturer; 14% used literature ranges; 18% used in-house ranges; and 34% didn’t know the source of their ranges.

Manufacturer ranges themselves may have been determined directly on selected populations, or may be derived from the literature. And it’s not clear how you would judge the relative merits of two reference ranges, both derived from population studies: one study that generated its own reference ranges found 25% of its chemistry tests disagreed with Abbott’s own reference ranges by more than a fifth.

The result: different labs may use very different reference ranges for no good reason; or there may be a good reason, but the users of the service may not know.

Does this matter?

We see a lot of different reference ranges in OpenPathology data. For example, a 40-year-old woman with a haemoglobin of 119 g/L would be flagged as abnormal in Cornwall, but normal across the border in North Devon.
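
To make the mechanism concrete, here is a minimal Python sketch of how one result can be flagged differently by two labs. The lab names `lab_a` and `lab_b`, the limits, and the `flag` function are all hypothetical: the actual ranges used in Cornwall and North Devon are not given here, so the numbers only illustrate a narrower range against a wider one.

```python
# Hypothetical haemoglobin reference ranges for adult women, in g/L
# (119 g/L is 11.9 g/dL). These limits are illustrative assumptions only,
# not the real ranges used by any lab.
RANGES = {
    "lab_a": (120, 160),  # narrower range
    "lab_b": (115, 165),  # wider range
}


def flag(value, lab):
    """Classify a result as 'low', 'normal' or 'high' for the given lab."""
    low, high = RANGES[lab]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


result = 119  # the same blood result, reported by two different labs
for lab in RANGES:
    print(lab, flag(result, lab))
```

Nothing about the patient changes between the two calls; only the lookup table does.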

This is reflected in outcomes. We see that Cornwall, with its narrower reference range, does define more results as abnormal:

When a lab changes equipment, it will usually switch to using the reference ranges recommended by the manufacturer. The following chart shows the effect that a new analyser can have on results. In 2018 you can see the proportion of TSH tests above range started going down, in Cornwall only. The lab changed its analyser in September of that year, but we’re currently applying today’s reference values to all of the data, including results produced on the old analyser.
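
One way to avoid judging old results against today's range is to keep a dated history of ranges and look up the one in force when each test was run. The Python sketch below shows the idea. Everything in it is an assumption for illustration: the `HISTORY` table, the limits of 5.5 and 4.5 mU/L, the exact changeover date (only the month, September 2018, is known), and the function names. It is not how OpenPathology currently processes its data.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of TSH upper limits (mU/L) for a single lab, as
# (effective_from, upper_limit) pairs sorted by date. The limits and the
# day of the changeover are invented for this example.
HISTORY = [
    (date(2015, 1, 1), 5.5),  # old analyser
    (date(2018, 9, 1), 4.5),  # new analyser
]


def upper_limit_on(test_date, history=HISTORY):
    """Return the upper limit that was in force on test_date."""
    starts = [start for start, _ in history]
    i = bisect_right(starts, test_date) - 1
    if i < 0:
        raise ValueError("no reference range recorded for this date")
    return history[i][1]


def is_above_range(value, test_date):
    """Flag a result against the range in force when it was measured."""
    return value > upper_limit_on(test_date)


todays_limit = HISTORY[-1][1]
old_result, old_date = 5.0, date(2018, 3, 1)
print("against today's range:", old_result > todays_limit)
print("against range at the time:", is_above_range(old_result, old_date))
```

With these made-up numbers, a result from before the changeover counts as above range under today's limit but not under the limit that applied at the time, which is the kind of artefact that can produce a step change in a chart.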

In an ideal world, all analysers would be calibrated to a single reference standard, and this would make the use of fixed national reference ranges completely defensible. Indeed, the Pathology Harmony Group pioneered a process for setting a number of reference ranges in the UK, which are now used by the majority of labs; but these only cover a few analytes. In reality, reference ranges come from a hodge-podge of different sources, making it hard to compare outcomes between regions.