Messy and Complex Data 2. This slide deck was produced for an in-person workshop at the ADR UK 2023 conference, Tuesday 14 November. It is the second of two slide decks produced for this workshop.
This session. Recap/summary of session 1: principles; data structures. This session: messiness & cleaning; error & linkage (definition, linkage techniques, evaluation of error, implications for analysis). Focus on 1:1 person linkages.
Introductions Presenters Jonathan Swan Head of ADRUK Data Engineering, ONS Jen Hampton Head of ADRUK Linkage, ONS
Recap Data Engineering Principles Having the right principles in place facilitates the delivery of data to researchers that is easy, accurate, and consistent. Data Structure Structuring data can be difficult. Providing a good structure means dealing with complexity to produce data that enable analysis. Analysis over complex structure is not easy – but implementing good structure makes it less complicated for researchers.
Definitions: Complex Data. “Data that are hard to process and translate into a digestible format.” Complexity may arise from: size (wide or tall); number of sources; structure; multiple entities (e.g. people, households, jobs, employers all in the same data); relationships within the data, between entities, or between sources; the type of data (pictures, sound, video, free text, etc.).
Definitions: Messy Data Data where there is a barrier to using the data for analysis. More detail shortly!
Messiness and Error in Data
Error and Messiness in Data. Error: the value held differs from the true value (e.g. Joan vs Jane), or differs from the true value by an amount outside an acceptable range, e.g. 3.1 vs 3.14 (precision); 3.14 vs 3.16 (accuracy). Messiness: the value held or provided, whilst true, differs from similar data held in another source or from the anticipated framework; or is outside the anticipated range of responses; or cannot be accurately rendered in the response. Examples: Rob, Bob, Robert; Cygnus House vs No. 24; 02/28/2023 (US date order); a Zip Code where a postcode such as PO15 5RR is expected; male or female when other; Sölden vs Soelden; a specific date vs “December to January”.
Types of Error. There is no single accepted framework for data error. The classic “Survey Error” framework is of limited use for data error. Survey Error framework: interviewer error; respondent error; instrument error; mode of data collection.
Towards a Data Error Framework? Source of error: supply error; recording or entry error; storage error; transmission error; processing error. Manifestation of error: cell error; row error; file error.
File and Row level errors. Often neglected? A problem for system and data engineers? Will it all be fixed by the time researchers use the data? Source of error: supply error; recording or entry error; storage error; transmission error; processing error.
File and Row level errors. Detection: file specification; manifest files; row counts; checksums; good documentation; value/range checks; plausibility checks. Prevention: good collection instruments; good system design; use of good file standards (CSVW, Parquet); robust coding. Correction: file resupply; robust coding.
Cell level errors: recalling principles (all classified as User principles).

Principle | Practice | Implication for researchers
Change minimum | Keep data values as unchanged as possible; even if adding value, keep originals. | Researchers can be confident that data are as recorded.
Derived variables in source data | Where several users need DVs, agree and add DVs to researcher data. | Users can work off consistent information and don’t need to re-invent the wheel.
Standardise data | Provide data that meet standards; where necessary include original and standardised data. | Users can more easily interpret and compare data and results.
Use meaningful terms | Where possible use meaningful file and variable names, and meaningful category names (Male, Female vs 1, 2). | Makes analysis easier for the user and reduces the risk of errors.
Ease of use | Prioritise ease of use over storage efficiency. | Easier for researchers.
Dealing with Cell Error: Prevent; Detect; Manage. Manage: Correct; Flag; Standardise.
Standardisation 1. Case: simple (UPPER, lower); complex (Title Case; Title-case; Title-Case; Heading case). Removal of special characters: alphanumeric only (A-Z, a-z, 0-9 + <space>); alphabetic only (A-Z, a-z + <space>); numeric only (0-9); allow some, e.g. O’Conner, Jones-Smith. ‘Convert’ accents: Café -> Cafe; Degauß -> Degauss; BUT in German ö -> oe (Sölden -> Soelden).
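A minimal sketch of these standardisation steps. The helper name and the language flag are illustrative, not an ONS function; note the German folding must run before generic accent stripping, exactly because ö -> oe differs from the default ö -> o.

```python
import re
import unicodedata

# Hypothetical helper illustrating the standardisation steps above.
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def standardise_name(value: str, german: bool = False) -> str:
    """Upper-case, fold accents, keep alphabetic characters plus a few allowed marks."""
    text = value.strip()
    if german:
        # Language-aware folding must run before generic accent stripping
        for src, repl in GERMAN_MAP.items():
            text = text.replace(src, repl).replace(src.upper(), repl.upper())
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Alphabetic only, but allow apostrophe, hyphen, space (O'Conner, Jones-Smith)
    return re.sub(r"[^A-Z' \-]", "", text)

print(standardise_name("Café"))                  # CAFE
print(standardise_name("Sölden", german=True))   # SOELDEN
print(standardise_name("O'Conner"))              # O'CONNER
```

The order of operations is the design point: apply any locale-specific rules first, then the generic Unicode decomposition.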
Standardisation 2. Apply a coding framework: SIC; SOC; UPRN (Unique Property Reference No.). Convert to a fixed response range: Male, Female, Other, Unknown. Nullify unexpected values: use system <NULL>. Convert to a single framework: 01/01/23, 1 Jan 2023, 1st January 23 -> 01/01/2023.
Detecting Error 1. Allowed values: code lists (e.g. Male, Female, Other); value ranges (e.g. age >= 0 and < 120). Fixed formats: NHS No.; NINO; postcode. Checksums: e.g. the 10th digit of an NHS No.
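The NHS number check digit is a Modulus 11 scheme: the first nine digits are weighted 10 down to 2, summed, and the remainder mod 11 determines the 10th digit. A sketch (943 476 5919 is a widely used synthetic test number, not a real person's):

```python
def nhs_check_digit_valid(nhs_no: str) -> bool:
    """Validate a 10-digit NHS number via its Modulus 11 check digit."""
    digits = [int(c) for c in nhs_no if c.isdigit()]
    if len(digits) != 10:
        return False
    # Weights 10..2 applied to the first nine digits
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False  # 10 is never a valid check digit
    return check == digits[9]

print(nhs_check_digit_valid("943 476 5919"))  # True (synthetic test number)
print(nhs_check_digit_valid("943 476 5918"))  # False
```

A checksum like this detects most single-digit and transposition errors at the point of entry, before the value ever reaches a researcher.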
Detecting Error 2. Permitted change over time: Alive -> Deceased is allowed, Deceased -> Alive is not; year on year, Age -> Age or Age -> Age + 1. Other: change in value range, e.g. % price increase over time. Plausibility against other data: an 18-year-old brain surgeon.
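The permitted-change checks above are simple to encode. A minimal sketch, taking the allowed transitions and the annual age rule directly from this slide:

```python
# Permitted transitions between successive observations of the same person
PERMITTED_STATUS = {
    ("Alive", "Alive"), ("Alive", "Deceased"), ("Deceased", "Deceased"),
}

def status_change_plausible(earlier: str, later: str) -> bool:
    """Alive -> Deceased is permitted; Deceased -> Alive is not."""
    return (earlier, later) in PERMITTED_STATUS

def annual_age_plausible(age_then: int, age_now: int) -> bool:
    """Between annual observations, age stays the same or increases by one."""
    return age_now in (age_then, age_then + 1)

print(status_change_plausible("Deceased", "Alive"))  # False
print(annual_age_plausible(34, 36))                  # False
```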
Correcting error. Postcodes have a fixed pattern of numbers and letters: 0 -> O (Po15 5RR to PO15 5RR); l -> 1 (SWlAA to SW1AA). Borrowing from another source: choosing the most likely value if sources conflict; replacing invalid with valid (maybe applying a similarity test). (Approximate) calculation from other data. Imputation: borrowing data from a similar case (used in the Census).
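The postcode repairs above exploit the fixed pattern: certain positions must be digits and others letters, so a letter O in a digit slot is almost certainly a mis-read 0, and vice versa. A minimal sketch, assuming the last three characters are the inward code (digit + two letters); this is illustrative, not a full UK postcode validator:

```python
import re

def repair_postcode(raw: str) -> str:
    """Sketch repair of common 0/O and l/1 confusions in UK postcodes."""
    pc = raw.upper().replace(" ", "")
    if len(pc) < 5:
        return raw
    outward, inward = pc[:-3], pc[-3:]
    # Outward code: leading letters, then digits; L/O after the leading
    # letters are usually mis-read 1/0
    m = re.match(r"^([A-Z]{1,2})(.*)$", outward)
    if m:
        area, district = m.groups()
        outward = area + district.replace("L", "1").replace("O", "0")
    # Inward code is always digit + letter + letter
    inward = (inward[0].replace("O", "0").replace("L", "1")
              + inward[1:].replace("0", "O").replace("1", "I"))
    return f"{outward} {inward}"

print(repair_postcode("Po15 5RR"))  # PO15 5RR
print(repair_postcode("swla 1aa"))  # SW1A 1AA
```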
Free text fields. Free text fields pose a significant challenge! Potential security or disclosure risk: the SRS only allows them in extremely exceptional circumstances. Detect and suppress: convert to <NULL>. Detection tactics: use of @ (emails); use of a number / multiple numbers (e.g. allow up to 999, which covers most addresses); capitals / block caps (names, postcodes).
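A hypothetical detect-and-suppress pass using the tactics above. The patterns are illustrative and deliberately over-cautious (a real screening tool would be far more exhaustive); anything that trips a pattern is nullified rather than edited.

```python
import re

# Illustrative detection patterns for the tactics on this slide
PATTERNS = [
    re.compile(r"\S+@\S+"),                                    # e-mail addresses
    re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}\b"),   # UK postcodes
    re.compile(r"\b\d{4,}\b"),                                 # numbers > 999
    re.compile(r"\b[A-Z]{2,}\b"),                              # block-capital tokens
]

def suppress_free_text(text):
    """Return None (system NULL) if the field may contain personal data."""
    for pattern in PATTERNS:
        if pattern.search(text):
            return None
    return text

print(suppress_free_text("Contact me at jo@example.com"))  # None
print(suppress_free_text("3 previous visits"))             # 3 previous visits
```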
Some thoughts. We all have a role in dealing with error. Error handling will vary from dataset to dataset. There is a menu of techniques: choose the ones that apply, and there may be items not on the menu. Handling messiness and error goes hand in hand with QA.
Over to you. What methods have you used to deal with messiness or error? Do you have any cautionary examples? Do you have any other good detection, correction, or management techniques? Please don’t forget file and record level errors!
Data Linkage and Error
Record linkage. The aim is to identify records that belong to the same person across different data sources. This is straightforward when clean, unique identifiers are common across the data: exact matching / joins.
Record linkage Problem: common unique identifiers are inconsistently available in the data Solution: use characteristics / information that uniquely identifies each person, common across sources
Record linkage Linkage can be thought of as determining the true match-status of record pairs We want to link records that belong to the same person and not link those that don’t
Record linkage Correctly assigned match-status Records we link that do belong to the same person are true matches: true positives LINK = TRUE MATCH
Record linkage Correctly assigned match-status Records we don’t link that belong to different people are true non-matches: true negatives NO LINK = TRUE NON-MATCH
Record linkage Incorrectly assigned match-status Records we link together that don’t belong to the same people, are false matches: false positives LINK = FALSE MATCH
Record linkage Incorrectly assigned match-status Records we don’t link together that do belong to the same people, are missed matches: false negatives NO LINK = MISSED MATCH
Linkage variables Information, common across sources, that, when used in combination, uniquely identifies a person Activity: What variables might be used, in combination with each other, to determine if records should (or shouldn’t be linked)? What problems with these variables might lead us to incorrect conclusions about the true match status of record pairs?
Linkage variables Information, common across sources, that, when used in combination, uniquely identifies a person Activity: What variables might be used, in combination with each other, to determine if records should (or shouldn’t be linked)? First, middle, last name Address, postcode, UPRN Date of Birth Sex Ethnicity What problems with these variables might lead us to incorrect conclusions about the true match status of record pairs? Personal Identifying Information
Linkage variables. Issues with linkage variables: missingness (whole record / variable level); recording error / inconsistencies (typographical, phonetic, optical character recognition, transpositions); inconsistency between data sources; low distinguishing power (e.g. sex in prison population data); instability over time (longitudinal linkage). There are instances where data are not of good enough quality, coverage, or commonality to meet minimum linkage or research requirements. Responses: derived variables / rule-based matching / string comparators; match-key design / probabilistic weights; linkage design; data preparation.
Linkage techniques – exact matching. When linkage variables are of good enough quality, coverage, distinguishing power, and stability: exact (deterministic) matching. Does not allow for error; minimises false positives, potentially increases false negatives.

Source A: Forename | Surname | Sex | Date of Birth | Postcode
Shirley | Barker | F | 07/07/2001 | CF23 9PQ
James | Jones | M | 31/05/1986 | NP5 4PL
Benjamin | Kelly | M | 29/10/1982 | SA18 5BC
Emily | Mercer | F | 16/09/1994 | CF3 1AA

Source B: Forename | Surname | Sex | Date of Birth | Postcode
Shirley | Barker | F | 07/07/2001 | CF23 9PQ
James | Jones | M | 31/05/1986 | NP5 4PL
Ben | Kelly | M | 29/10/1982 | SA18 5BC
Emily | Mercer | F | 16/09/1994 | CF3 7NP
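Exact matching over the records above can be sketched as a join on the full tuple of linkage variables; any discrepancy (Benjamin vs Ben, a changed postcode) blocks the link.

```python
# Exact (deterministic) matching: link only when every variable agrees exactly
source_a = [
    ("Shirley", "Barker", "F", "07/07/2001", "CF23 9PQ"),
    ("James", "Jones", "M", "31/05/1986", "NP5 4PL"),
    ("Benjamin", "Kelly", "M", "29/10/1982", "SA18 5BC"),
    ("Emily", "Mercer", "F", "16/09/1994", "CF3 1AA"),
]
source_b = [
    ("Shirley", "Barker", "F", "07/07/2001", "CF23 9PQ"),
    ("James", "Jones", "M", "31/05/1986", "NP5 4PL"),
    ("Ben", "Kelly", "M", "29/10/1982", "SA18 5BC"),
    ("Emily", "Mercer", "F", "16/09/1994", "CF3 7NP"),
]

# Set intersection is an exact join on the full tuple of variables
links = set(source_a) & set(source_b)
print(len(links))  # 2: the nickname and the moved postcode become false negatives
```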
Linkage techniques – match-key rules. Relaxing the strictness of matches between the linkage variables adjusts for error: deterministic rule-based matching. Declared and implemented as a series of hierarchical match-keys, each designed to account for different errors and inconsistencies in linkage variables, including the use of derived variables. Match-key design is iterative, informed by the data, and reaches a saturation point: if too exhaustive, we run the risk of introducing too many false positives.

Match-key | Accounting for
First name, last name, sex, DoB, postcode area | Partial error in postcode / local moves
First name initial, last name, sex, DoB, postcode | Error in first name or nickname
First name, last name, sex, postcode | Missing or incorrect DoB
[Last name, first name], sex, DoB, postcode | Transposition of names
Linkage techniques – match-key rules. Relaxing the strictness of matches between the linkage variables adjusts for error: deterministic rule-based matching. Allows for recording error and inconsistencies in linkage variables; reduces false negatives, increases false positives.

Forename | Surname | Sex | Date of Birth | Postcode | Link via
Shirley | Barker | F | 07/07/2001 | CF23 9PQ | Exact
James | Jones | M | 31/05/1986 | NP5 4PL | Exact
Benjamin / Ben | Kelly | M | 29/10/1982 | SA18 5BC | Relax name
Emily | Mercer | F | 16/09/1994 | CF3 1AA vs CF3 7NP | Relax postcode
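The hierarchical match-keys from the table can be sketched as key functions tried in order; a pair links on the first key under which its derived tuples agree. The helper names are illustrative, not a production implementation.

```python
import re

def postcode_area(pc: str) -> str:
    """Leading letters of the postcode, e.g. 'CF3 1AA' -> 'CF'."""
    m = re.match(r"[A-Z]+", pc.upper())
    return m.group() if m else ""

# Records are (first, last, sex, dob, postcode) tuples
MATCH_KEYS = [
    # 1: first name, last name, sex, DoB, postcode area (local moves)
    lambda r: (r[0].upper(), r[1].upper(), r[2], r[3], postcode_area(r[4])),
    # 2: first-name initial, last name, sex, DoB, postcode (nicknames)
    lambda r: (r[0][:1].upper(), r[1].upper(), r[2], r[3], r[4]),
    # 3: first name, last name, sex, postcode (missing/incorrect DoB)
    lambda r: (r[0].upper(), r[1].upper(), r[2], r[4]),
    # 4: transposed names
    lambda r: (r[1].upper(), r[0].upper(), r[2], r[3], r[4]),
]

def linked(rec_a, rec_b):
    """Return the 1-based number of the first agreeing match-key, or None."""
    for i, key in enumerate(MATCH_KEYS, start=1):
        if key(rec_a) == key(rec_b):
            return i
    return None

a = ("Emily", "Mercer", "F", "16/09/1994", "CF3 1AA")
b = ("Emily", "Mercer", "F", "16/09/1994", "CF3 7NP")
print(linked(a, b))  # 1: links on postcode area despite the local move
```

The hierarchy matters: stricter keys run first, so each pair is attributed to the most demanding rule it satisfies.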
Linkage techniques – string comparators. Accounts for two main types of error. 1. Variations in spelling: phonetic encoding (e.g. Soundex, Double Metaphone).
Linkage techniques – string comparators. Accounts for two main types of error. 2. Typographical mistakes: distance measures (Levenshtein) / similarity measures (Jaro / Jaro-Winkler). Edit distance, i.e. the number of substitutions, insertions, and deletions: kitten → sitten (substitution of “s” for “k”); sitten → sittin (substitution of “i” for “e”); sittin → sitting (insertion of “g” at the end); edit distance = 3.
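The edit distance above is computed with a standard dynamic-programming recurrence; a minimal implementation reproducing the kitten → sitting example:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```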
Linkage techniques – string comparators. Accounts for two main types of error: variations in spelling; typographical mistakes. The optimal method depends on the data, and comparators can be built into match-keys. The score is compared to a threshold to determine whether to declare a link or non-link. Note that pairs can have the same Soundex codes but radically different distance measures.
Linkage techniques – probabilistic weights. A score-based method to measure partial agreement between record pairs, plus probability theory. Agreement and disagreement weights adjust the prior that any pair is a match: agreement = +1; disagreement = -1; null = 0. Scores are summed into an overall weight for each pair and compared to a threshold to determine link/non-link. Accounts for the quality and distinguishing power of each variable; thresholds act as a mechanism to balance linkage errors.
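A toy version of the scoring above, using the slide's +1 / -1 / 0 weights. Real probabilistic (Fellegi-Sunter style) linkage derives per-variable weights from m- and u-probabilities rather than fixed units; this sketch only shows the sum-and-threshold mechanism, and the threshold of 2 is an arbitrary illustration.

```python
def pair_weight(rec_a: dict, rec_b: dict, variables) -> int:
    """Sum +1 for agreement, -1 for disagreement, 0 for null, per variable."""
    score = 0
    for var in variables:
        va, vb = rec_a.get(var), rec_b.get(var)
        if va is None or vb is None:
            score += 0   # null: no evidence either way
        elif va == vb:
            score += 1   # agreement
        else:
            score -= 1   # disagreement
    return score

a = {"first": "SHIRLEY", "last": "BARKER", "sex": "F", "dob": "07/07/2001"}
b = {"first": "SHIRLEY", "last": "BARKER", "sex": "F", "dob": None}
w = pair_weight(a, b, ["first", "last", "sex", "dob"])
print(w, "link" if w >= 2 else "non-link")  # 3 link
```

Raising the threshold trades false positives for false negatives, which is exactly the balancing mechanism the slide describes.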
Linkage techniques – probabilistic weights Score-based method to measure partial agreement between record pairs Splink example…
Evaluating error – quality assurance methods. Clerical review of samples: manual and high accuracy (especially with contextual knowledge); time and resource intensive. Gold standard linked data: known true status of records; infer quality and test the linkage algorithm (as with synthetic data). Positive/negative controls: implausible links; linkage variable quality. Typically, these require access to linkage variables. Separation of duty: akin to negative controls, but doesn’t need linkage information.
Evaluating error – precision & recall. A means to quantify the quality of the linkage, based on false positives and negatives.

Decision \ True status | Match | Non-match
Link | True Positive | False Positive
Non-link | False Negative | True Negative

Precision = TP / (TP + FP); Recall = TP / (TP + FN).
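The two measures follow directly from the confusion matrix; the counts below are invented for illustration only.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the links we made, how many are true
    recall = tp / (tp + fn)     # of the true matches, how many did we find
    return precision, recall

p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```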
Implications of error What are the implications of linkage error? What impact could these errors have on analysis? MISSED MATCH FALSE MATCH
Implications of error What are the implications of linkage error? What impact could these errors have on analysis? Information bias Misclassification & measurement error False matches undercounting; ‘merging’ Missed matches double-counting; ‘splitting’ Selection bias Incorrect inclusion or exclusion in analysis
Evaluating error – distribution of errors. Errors have most impact when there is non-random error in variables of interest. Typically, linkage error is associated with age, gender, ethnicity, and socioeconomic status. Comparing linked and unlinked records can be more useful with payload/attribute data, but requires us to expect all records in one source to link to the other. Comparing aggregate characteristics to external reference statistics requires the reference data to be representative of the linkage data.
Evaluating error – distribution of errors We typically quantify using proportional discrepancy Measure of over- or under-representation, proportional to overall match-rate
Handling linkage error in analysis Linkage error is almost inescapable Precision Function of false positives (incorrect links) Recall Function of false negatives (missed matches) Non-random distribution of errors Impacts on analysis Misclassification & measurement error Incorrect inclusion/exclusion Missing attribute data & diminished power How can we mitigate the impact of errors in analysis and on inferences drawn?
Handling linkage error in analysis Report errors & limitations Bias analysis Needs rates & distribution of errors Sensitivity analysis Needs uncertain links & info on match quality Probabilistic analysis Needs uncertain links & info on match quality Separation of duty requires engagement & communication between linkers & analysts, operating under transparency & with feedback mechanisms in place
Closing remarks. Linkage error is almost inescapable: a result of measurement error and collection idiosyncrasies. Linkage involves correcting and balancing these errors (precision vs recall). It impacts analysis: information and selection bias. Analytic methods: bias, sensitivity, and probabilistic analyses. Separation of duty requires engagement and communication between linkers and analysts. Thank you!
Resources. Quality assessment in data linkage - GOV.UK (www.gov.uk). A guide to evaluating linkage quality for the analysis of linked data | International Journal of Epidemiology | Oxford Academic (oup.com). Reflections on modern methods: linkage error bias | International Journal of Epidemiology | Oxford Academic (oup.com). Introduction - Splink (moj-analytical-services.github.io). Data Linkage Hub @ ONS: [email protected]. ADR @ ONS: [email protected].