A nurse leans around a patient with her hands placed reassuringly on her shoulders.

Whilst working with one of the UK’s biggest GP databases, DPUK scientist, Tim, realised how valuable it could be for dementia research. Here, the researcher who created one of DPUK’s biggest, most representative cohorts describes how other researchers can use the resource to answer the big questions in dementia research.

In the UK, every time someone sees their GP, has a test performed or is prescribed a drug, that information is coded and stored in the GP’s electronic medical record system. These ‘routinely-collected data’ are primarily recorded for healthcare purposes – a GP needs to have a comprehensive medical record to care for their patients safely.

These records – when anonymised – have an amazing second value for us researchers. We use diagnoses, symptoms, test results and prescriptions to learn more about the causes and consequences of diseases. GP data may be particularly useful for dementia research because the vast majority of people living with dementia are at home, or under the care of their GP, rather than in hospital. So when I came across a dataset which contains GP data for 80% of the Welsh population – as well as hospital, death and deprivation data for the entire population – I was mystified: why on earth wasn’t every dementia researcher using this?

The answer became clear the minute we opened the files – we were working with the SAIL databank datasets and they are huge. Think billions of rows of data – very messy and complicated. GP data are a far cry from the ‘clean’, processed datasets that are available in other DPUK cohorts. However, we were not to be put off by a challenge when the prospects for research were so great. We had an idea – to work with the SAIL team to create a user-friendly version of the datasets for dementia research. After DPUK provided the funding, we established a new collaboration between the University of Edinburgh and Swansea University, and the SAIL Dementia electronic Cohort (SAIL-DeC) was born!

In creating SAIL-DeC, we used algorithms and code lists to turn the billions of rows of messy GP, hospital admissions, mortality and deprivation data into a series of clean, user-friendly datasets. We produced tables containing information on participant demographics (such as month and year of birth, sex, socioeconomic status, smoking, etc.) and health problems (such as heart disease, stroke, Parkinson’s disease, epilepsy, depression, etc.). We also built on research we conducted within DPUK to determine which participants did and did not develop dementia during follow-up. Importantly, we designed SAIL-DeC to be flexible, so it can be used for different study types and can improve over time – we can adjust our underlying algorithms as new research becomes available, allowing us to improve the quality and range of derived variables within the cohort.
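For readers unfamiliar with code lists, the general idea of deriving a clean variable from coded event rows can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAIL-DeC algorithm: the `Event` record, the `first_dementia_dates` function and the codes in `DEMENTIA_CODES` are all hypothetical placeholders, and the real code lists and algorithms are more involved.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical code list: placeholder strings, NOT real clinical codes.
DEMENTIA_CODES = {"DEM_A", "DEM_B"}


@dataclass(frozen=True)
class Event:
    """One coded row from a (hypothetical) GP or hospital events table."""
    person_id: int
    event_date: date
    code: str


def first_dementia_dates(events, code_list):
    """Return {person_id: earliest date on which a listed code was recorded}."""
    first = {}
    for ev in events:
        if ev.code in code_list:
            prev = first.get(ev.person_id)
            # Keep the earliest matching date seen so far for this person.
            if prev is None or ev.event_date < prev:
                first[ev.person_id] = ev.event_date
    return first


# Toy data: person 1 has two matching codes, person 2 has none.
events = [
    Event(1, date(2010, 5, 1), "BP_CHECK"),
    Event(1, date(2014, 3, 9), "DEM_A"),
    Event(1, date(2012, 7, 2), "DEM_B"),
    Event(2, date(2011, 1, 15), "BP_CHECK"),
]

result = first_dementia_dates(events, DEMENTIA_CODES)
print(result)
```

Keeping the code list separate from the function mirrors the flexibility described above: the list can be revised as new research becomes available, and the derived variable can be regenerated without changing the logic.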

Three reasons why the SAIL dementia e-cohort is great

1. Big numbers

At present, SAIL-DeC contains information for 1.2 million participants, of whom 130,000 people developed dementia during follow-up. These enormous numbers reveal the strength of using an entire country’s health data for research. The numbers will only increase over time as more data become available.

2. Unconsented data – what this adds to dementia research

SAIL-DeC is different from other DPUK cohorts in that it is composed entirely of anonymised, unconsented, routinely-collected healthcare data. Anonymised health data are very commonly used in health research, but these data have their limitations: we can only access information that is collected routinely. We do not have information on genetics, brain imaging or other dementia biomarker investigations; some information is likely to be missing for many patients; and, as the data are not collected with the intention of research, they may not be as accurate as those found in traditional, ‘consented’ cohorts.

That’s why researchers like to use traditional consented cohorts – so why use SAIL-DeC for research, instead of a ‘consented’ cohort? One important factor is generalisability. Participants who sign up to cohorts are amazing – my colleagues and I could not do our research without them – but they’re definitely not normal! People who take part in cohort studies tend to be more health-conscious and therefore healthier than the average population. Depending on the cohort, they may also be wealthier and more often 'whiter' than the wider population. The advantage of an unconsented, ‘virtual’ cohort such as SAIL-DeC is that we avoid this issue, so we can be more confident that our findings are generalisable to the wider population.

3. Open for business

We have designed the SAIL-DeC cohort in the spirit of open science – we have made the cohort metadata, code lists, data dictionary and code freely available, and have submitted a cohort profile to an open access journal. Our hope is that by making SAIL-DeC available within DPUK, we can avoid duplication of effort for researchers, thereby reducing costs and increasing efficiency in dementia research.

The e-cohort is open for business! Researchers can apply to access SAIL-DeC by contacting the SAIL databank directly or via the DPUK Data Portal. If you have questions on the design of the cohort, you can contact us.