Data curation in scientific research: how does data curation work at DPUK?
8 February 2023
Discover how DPUK uses data curation to make the data available through our Data Portal easier for dementia researchers to find and use.
Good data is the basis for good science. Data curation allows us to build more confidence in data, making it more accurate, trustworthy, and up-to-date. It allows for high-quality research, and creates a reliable platform to collect, validate, and communicate data.
By using data curation to manage scientific data, scientists can reuse published data from many sources and put it to new use in other research contexts. To ensure that scientists find what they need in the enormous pool of published data, data management procedures are needed to organise data by its features and points of interest. DPUK uses these data curation methods, and this blog discusses how we use them.
Data curation activities vary from one field to another. In every field, however, the goal is to expand awareness and knowledge of a specific subject.
But how do data curators actually work?
Essentially, data curation involves collecting information using research methodology and then organising separate pieces of data into structured data sets. In simpler terms, it’s when someone looks at the results of a study and verifies that the collected data makes sense.
Study data may come from many different sources, such as Electronic Health Records (EHRs), electronic Case Report Forms (eCRFs), and electronic Patient Reported Outcomes (ePROs). The goal of data curation is to understand the nature of data at its source. The standard process involves customising data into different query sets to identify patterns. The team begins by conducting a broad, comprehensive, and systematic survey of the literature related to their study area. The data is then refined to meet the requirements of the research study.
When curating data, it is important to make the process as easy and transparent as possible. Users and data curators should be able to easily see the queries and data that they need to review, specific to their roles. Query management tools let individuals curate the data through a shared list of queries, often called a worklist. Researchers in different locations can see and work from the same worklist, which is especially useful when a study has many different sites. When issues are raised, the data curator worklist can be queried and filtered by case file, subject, timeframe, and so on, to check for completeness and accuracy. If a flagged issue has been checked and the data validated, the query can be accepted and locked. If, however, more complete data is needed, or the data needs further cleaning, the query is sent back to the site worklist for review.
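To make the worklist idea concrete, here is a minimal Python sketch of filtering a shared worklist and then either locking a query or returning it to the site. It is an illustration only, not DPUK's tooling: the `Query` fields, the status values, the site names, and the `filter_worklist` and `review` functions are all invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Query:
    """One item on a (hypothetical) curation worklist."""
    query_id: int
    site: str
    subject_id: str
    raised_on: date
    status: str = "open"  # assumed states: "open", "locked", "returned"


def filter_worklist(worklist, site=None, subject_id=None, since=None):
    """Return the queries that match every filter that was supplied."""
    return [
        q for q in worklist
        if (site is None or q.site == site)
        and (subject_id is None or q.subject_id == subject_id)
        and (since is None or q.raised_on >= since)
    ]


def review(query, data_is_complete):
    """Lock a validated query, or send it back to the site worklist."""
    query.status = "locked" if data_is_complete else "returned"
    return query


# Example data; the sites and subjects are made up.
worklist = [
    Query(1, "Oxford", "S001", date(2023, 1, 10)),
    Query(2, "Cardiff", "S002", date(2023, 1, 15)),
    Query(3, "Oxford", "S003", date(2023, 2, 1)),
]

oxford_queries = filter_worklist(worklist, site="Oxford")
review(oxford_queries[0], data_is_complete=True)   # accepted and locked
review(oxford_queries[1], data_is_complete=False)  # needs more cleaning
```

Because every site works from the same list, a status change made by one reviewer is visible to everyone else who filters the worklist afterwards.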
Data curation in DPUK
The DPUK Data Portal facilitates multimodal remote access to data on 3.4 million individuals from 42 cohorts using a secure and robust data repository. To accommodate the analysis of multiple independent datasets, DPUK curates cohort data based on a bespoke ontology called C-Surv, developed internally by DPUK scientists.
Researchers can identify which cohorts are relevant to their proposed research question or area of study, apply for access to the data, and then analyse it in a secure, remote environment complete with data linkage, analytical software packages, and cross-cohort capability.
Using the C-Surv data model
The C-Surv ontology is a data model optimised for analysing epidemiologic survey data. It enables analysts to conduct multi-cohort, multimodal analyses rapidly, reducing the administrative load for researchers and data managers. It also reduces artefactual variation in data by standardising the data pre-processing.
C-Surv provides a harmonised dataset of 30 common data elements (variables) which are widely used in neurodegenerative and bio-epidemiological research and can be visualised in the DPUK Cohort Explorer. The C-Surv standard data structure allows efficient data discovery and data selection. Building on the DPUK Cohort Matrix and Cohort Directory, the C-Surv standard data structure provides a framework for the development of ergonomic data discovery and data selection tools.
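The general idea of harmonising variables across cohorts can be sketched in a few lines of Python. This is a simplified illustration, not the C-Surv implementation: the cohort names, the native column names, and the common variable names (`age`, `sex`, `mmse`) are assumptions made up for this example.

```python
# Hypothetical mapping from each cohort's native column names
# to a shared set of common variable names.
COHORT_MAPPINGS = {
    "cohort_a": {"age_yrs": "age", "gender": "sex", "mmse_total": "mmse"},
    "cohort_b": {"AGE": "age", "SEX": "sex", "MMSE": "mmse"},
}


def harmonise(cohort, records):
    """Rename a cohort's native variables to the common names.

    Variables missing from a record become None; variables outside
    the mapping are dropped from the harmonised output.
    """
    mapping = COHORT_MAPPINGS[cohort]
    harmonised = []
    for record in records:
        row = {common: record.get(native) for native, common in mapping.items()}
        row["cohort"] = cohort  # keep track of where each row came from
        harmonised.append(row)
    return harmonised


# Two cohorts that record the same measures under different names.
cohort_a = [{"age_yrs": 71, "gender": "F", "mmse_total": 28}]
cohort_b = [{"AGE": 68, "SEX": "M", "MMSE": 26, "SITE": "x"}]

# After harmonisation the rows share one structure and can be pooled.
pooled = harmonise("cohort_a", cohort_a) + harmonise("cohort_b", cohort_b)
```

Applying one mapping step to every cohort before analysis is what keeps the pre-processing consistent, so differences in the pooled data are less likely to be artefacts of how each cohort was prepared.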