Discover how DPUK uses data curation to make the data accessed via our Data Portal efficient for dementia researchers to work with.

Good data is the basis for good science. Data curation builds confidence in data by making it more accurate, trustworthy, and up to date. It enables high-quality research and creates a reliable platform for collecting, validating, and communicating data.

[Image: A visual of how information is processed in DPUK's Data Portal]

By using data curation to manage scientific data, scientists can reuse published data from many sources and find new value for it in new research contexts. To ensure that scientists can find what they need in the enormous pool of published data, data management procedures are needed to organise data by its features and points of interest. DPUK uses these data curation methods, and this blog discusses how.

Data curation activities vary depending on the area in which they are used, but the goal is always the same: to expand awareness and knowledge of a specific subject.

But how do data curators actually work?

Essentially, data curation involves collecting information using sound research methodology and then transforming independent data into organised datasets. In simpler terms, it is the process of looking at the results of a study and verifying that the collected data makes sense.
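
To make that concrete, here is a minimal sketch of the kind of plausibility check a curator might run, written in Python with pandas. The field names, values, and ranges are illustrative assumptions, not DPUK's actual pipeline.

```python
import pandas as pd

# Hypothetical study extract -- the field names and values are illustrative only.
records = pd.DataFrame({
    "participant_id": ["P001", "P002", "P003", "P004"],
    "age": [67, 142, 71, None],          # 142 and the missing value should be flagged
    "mmse_score": [28, 25, 31, 22],      # MMSE is scored 0-30, so 31 is implausible
})

# Basic plausibility checks a curator might run before accepting the data.
issues = pd.DataFrame({
    "age_out_of_range": ~records["age"].between(40, 110),
    "age_missing": records["age"].isna(),
    "mmse_out_of_range": ~records["mmse_score"].between(0, 30),
})

# Keep only the records with at least one flagged problem for review.
flagged = records[issues.any(axis=1)]
print(flagged)
```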

Study data may come from many different sources, such as Electronic Health Records (EHRs), Electronic Case Report Forms (eCRFs), and Electronic Patient Reported Outcomes (ePROs). The goal of data curation is to understand the nature of data at its source. The standard process involves organising data into different query sets to identify patterns. The team begins by conducting a broad, comprehensive, and systematic survey of the literature related to their study area. The data is then refined to address the requirements of the research study.
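
The sketch below illustrates one way such query sets might be built: joining hypothetical EHR, eCRF, and ePRO extracts on a shared identifier with pandas and then filtering for a pattern of interest. All column names and thresholds are invented for illustration.

```python
import pandas as pd

# Illustrative extracts from three hypothetical sources; column names are assumptions.
ehr = pd.DataFrame({"participant_id": ["P001", "P002", "P003"],
                    "diagnosis": ["MCI", "AD", "healthy"]})
ecrf = pd.DataFrame({"participant_id": ["P001", "P002", "P003"],
                     "visit_count": [4, 1, 3]})
epro = pd.DataFrame({"participant_id": ["P001", "P002", "P003"],
                     "self_reported_memory_concern": [True, True, False]})

# Bring the sources together on a shared identifier.
combined = ehr.merge(ecrf, on="participant_id").merge(epro, on="participant_id")

# A "query set": records matching a pattern the study cares about,
# e.g. a memory concern reported but only one recorded visit.
query_set = combined[combined["self_reported_memory_concern"] &
                     (combined["visit_count"] < 2)]
print(query_set)
```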

When curating data, it is important to make the process as easy and transparent as possible. Users and data curators should be able to easily see the queries and data they need to review, specific to their roles. Query management tools give individuals a shared list of items to curate, often called a worklist. Researchers in different locations can see and work from the same worklist, which is especially useful when a study has many different sites. When issues are raised, the data curator worklist can be queried and filtered by case file, subject, timeframe, and so on, to check for completeness and accuracy. If a flagged value turns out to be reasonable and is validated, the query can be accepted and locked. If more complete data is needed, or the data needs further cleaning, the query is sent back to the site worklist for review.
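
As a rough illustration of the idea, the following sketch models a worklist as a list of query records that can be filtered by site and timeframe and then either accepted and locked or returned to the site. The schema, field names, and status values are assumptions for illustration, not DPUK's query management tooling.

```python
from dataclasses import dataclass
from datetime import date

# A hypothetical worklist entry -- field names are illustrative, not a real DPUK schema.
@dataclass
class Query:
    query_id: str
    subject_id: str
    site: str
    raised_on: date
    description: str
    status: str = "open"   # open -> accepted_and_locked | returned_to_site

worklist = [
    Query("Q1", "P001", "Oxford", date(2022, 3, 1), "Missing follow-up MMSE score"),
    Query("Q2", "P002", "Cardiff", date(2022, 3, 4), "Visit date precedes consent date"),
]

# Curators can filter the shared worklist by subject, site, timeframe, etc.
recent_oxford = [q for q in worklist
                 if q.site == "Oxford" and q.raised_on >= date(2022, 1, 1)]

# If the flagged value is validated, the query is accepted and locked;
# otherwise it is returned to the site worklist for further cleaning.
def resolve(query: Query, validated: bool) -> None:
    query.status = "accepted_and_locked" if validated else "returned_to_site"

resolve(recent_oxford[0], validated=True)
print(worklist[0].status)   # accepted_and_locked
```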

Data curation in DPUK

The DPUK Data Portal facilitates multimodal remote access to data on 3.4 million individuals from 42 cohorts through a secure and robust data repository. To accommodate the analysis of multiple independent datasets, DPUK curates cohort data using a bespoke ontology called C-Surv, developed internally by DPUK scientists.

Researchers can identify which cohorts are relevant to their proposed research question or area of study, apply for access to the data, and then analyse it in a secure, remote environment complete with data linkage, analytical software packages, and cross-cohort capability.

Using the C-Surv data model

The C-Surv ontology is a data model optimised for analysing epidemiological survey data. It enables analysts to conduct multi-cohort, multimodal analyses rapidly, reducing the administrative load for researchers and data managers. It also reduces artefactual variation in the data by standardising data pre-processing.

C-Surv provides a harmonised dataset of 30 common data elements (variables) that are widely used in neurodegenerative and bio-epidemiological research and can be visualised in the DPUK Cohort Explorer. The C-Surv standard data structure allows efficient data discovery and selection and, building on the DPUK Cohort Matrix and Cohort Directory, provides a framework for developing ergonomic data discovery and data selection tools.
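
The real C-Surv ontology is far richer than anything shown here, but the following sketch conveys the general idea of harmonisation: cohort-specific variable names are mapped onto a shared set of common data elements so that analysts can query one standardised structure across cohorts. The cohort names, variables, and mapping are all made up for illustration.

```python
import pandas as pd

# Two hypothetical cohorts record the same concepts under different variable names.
cohort_a = pd.DataFrame({"id": ["A1", "A2"], "yrs_edu": [12, 16], "sex_mf": ["F", "M"]})
cohort_b = pd.DataFrame({"id": ["B1", "B2"], "education_years": [10, 14], "gender": ["M", "F"]})

# Illustrative mapping of cohort-specific names onto shared common data elements;
# the actual C-Surv ontology is much richer than this toy dictionary.
mappings = {
    "cohort_a": {"yrs_edu": "education_years", "sex_mf": "sex"},
    "cohort_b": {"education_years": "education_years", "gender": "sex"},
}

# Rename each cohort's columns to the common elements and stack them into one table.
harmonised = pd.concat([
    cohort_a.rename(columns=mappings["cohort_a"]).assign(cohort="cohort_a"),
    cohort_b.rename(columns=mappings["cohort_b"]).assign(cohort="cohort_b"),
], ignore_index=True)

# Analysts can now query one standardised table across both cohorts.
print(harmonised[["cohort", "id", "education_years", "sex"]])
```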