Spot the difference. Part 1: Source Datasets | Bennett Institute for Applied Data Science

In this two-part blog series, one of our pharmacists, Vicky Speed, explains why OpenPrescribing measures may not always be exactly the same as those on alternative data analysis platforms, and why these differences are often to be expected.

One of the most common questions we are asked at OpenPrescribing is: ‘Why does my OpenPrescribing output not match what I have found on another data analysis platform?’

At OpenPrescribing we anticipate that our results will not always be exactly the same as those displayed on alternative data analysis platforms, for two main reasons:

1. the other platform is using a different source dataset,

2. analysts working on the other platform have made different analytic choices from those we have made on OpenPrescribing.

In the first of this two-part blog series we will walk through why differences in the source dataset can sometimes make results look a little different.

Is the source dataset you are using the same?

At OpenPrescribing we use community pharmacy reimbursement data to deliver our products that illustrate prescribing in England. These data are owned by the National Health Service Business Services Authority (NHSBSA). Briefly, the NHSBSA is responsible for paying community pharmacies for the drugs they supply. Consequently, the NHSBSA has a substantial amount of data on what medicines have been prescribed, by whom, and how much has been paid for them. The NHSBSA then makes various subsets of these data available. Some data are available privately within the NHS (e.g. via a system called ePACT2 and Medicines dispensed in Primary Care NHS Business Services Authority data) and some are openly available.

At OpenPrescribing we use two openly available datasets, the English Prescribing Dataset (EPD) and the Prescription Cost Analysis (PCA), which are available in the Community Prescribing & Dispensing section of the NHSBSA Open Data Portal.

What is the difference between PCA and the EPD?

The NHSBSA describe the differences between the two datasets in the Coherence and Comparability section of their methodology for PCA data here. Whilst these two datasets sound similar, they actually contain quite different underlying data. In the following paragraphs we briefly set out how they differ. This might be helpful if you are using another dataset and want to think through why your results might be expected to differ from those on OpenPrescribing.

The EPD can be considered a prescribing view (although it still only includes medications which were dispensed). It shows the prescribing organisation and the item that was written on the prescription form. It is generated from prescriptions that have been prescribed in England (clue is in the name!) but could have been dispensed in England, Wales, Scotland, Guernsey, Alderney, Jersey, or the Isle of Man. These data are available down to practice level. More details on the EPD methodology can be found here.

The PCA, by contrast, focuses on dispensing. It shows the item that is most likely to have been dispensed (which is sometimes different to what was prescribed). PCA data contain all prescription items dispensed in the community in England on a monthly basis and submitted to the NHSBSA for reimbursement. In this dataset, the prescription could have been prescribed elsewhere in the UK. PCA data are only available down to Integrated Care Board level. More details on the PCA methodology can be found here.
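To make the distinction concrete, here is a minimal sketch in Python of how the same underlying records could yield different totals depending on which view is taken. Everything in it is hypothetical: the `Record` fields, the toy numbers and the two filter functions are illustrative assumptions, not the NHSBSA's actual schema or processing. It also only models the geographic difference (where an item was prescribed versus where it was dispensed), and ignores the separate point that the PCA reports the item most likely dispensed rather than the item written on the prescription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical, simplified reimbursement record (not the NHSBSA schema)."""
    prescribed_in: str  # country where the prescription was written
    dispensed_in: str   # country where the item was dispensed
    items: int          # number of prescription items


# Made-up toy data, for illustration only.
records = [
    Record("England", "England", 100),
    Record("England", "Wales", 5),     # counted in the EPD-style view only
    Record("Scotland", "England", 3),  # counted in the PCA-style view only
    Record("Wales", "Wales", 50),      # counted in neither view
]


def epd_style_total(rows):
    """Prescribing view: items prescribed in England, wherever dispensed."""
    return sum(r.items for r in rows if r.prescribed_in == "England")


def pca_style_total(rows):
    """Dispensing view: items dispensed in England, wherever prescribed."""
    return sum(r.items for r in rows if r.dispensed_in == "England")


epd_total = epd_style_total(records)
pca_total = pca_style_total(records)

print(f"EPD-style total: {epd_total}")
print(f"PCA-style total: {pca_total}")
```

In this toy example the two totals differ even though both are computed from the same four records, which is the kind of expected discrepancy described above.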

Source Dataset Summary

Whilst the data made available by the NHSBSA all originate from the same community pharmacy reimbursement data, they are published as different subsets. These subsets have some key differences, which means that reported results will often differ. Therefore, when comparing results between data analysis platforms, it is always important to check which source dataset has been used.

In our next blog we will discuss how different data analysis platforms such as OpenPrescribing, which ‘sit on top’ of all or part of the underlying data and are used to analyse and visualise the underlying source dataset, may report different results. The people running each platform will make different analytic choices for data preparation, analysis and visualisation, which can mean that, even though the source dataset is the same, the results that are reported are different.