ADR UK workshop: Messy and complex data part 1

EleanorCollard, 36 slides, Aug 12, 2024

About This Presentation

ADR UK workshop


Slide Content

Messy and Complex Data 1
This slide deck was produced for an in-person workshop at the ADR UK 2023 conference, Tuesday 14 November.

This session
- Intro: getting started and definitions
- Principles of Data Engineering
- Structure and complexity

Introductions
Presenters:
- Jonathan Swan, Head of ADR UK Data Engineering, ONS
- Jen Hampton, Head of ADR UK Linkage, ONS

Definitions: Complex Data
"Data that are hard to process and translate into a digestible format."
Complexity may be because of:
- Size (wide or tall)
- Number of sources
- Structure
- Multiple entities (e.g. people, households, jobs, employers all in the same data)
- Relationships within the data, between entities, or between sources
- The type of data (pictures, sound, video, free text, etc.)

Definitions: Messy Data
Data where there is a barrier to using the data for analysis.
Examples of messy data:
- Missing data
- Unstructured data
- Multiple variables in one column
- Variables stored in the wrong places
- Observations split incorrectly, or left together against normalisation rules
- Switched columns and rows
- Extra spaces
Also: uncertainty and imprecision.

Principles of Data Engineering

Why do data engineering principles matter?
"I'm a researcher – why should I care about data engineering?"
Data engineering includes the process of preparing data so that users and researchers can use it. Using the right principles enables:
- Ease of use
- Consistency of use
- Accuracy of the data
- Accuracy in interpretation
- Avoidance of error

A standard set of data engineering principles? Many competing lists exist:
- 7 Data Engineering Principles You Should Be Aware Of
- 6 data integration principles for data engineers to live by
- The Three P's of Data Engineering
- Data engineering principles - AWS Prescriptive Guidance
- Essential Data Engineering Concepts and Principles
- Data Engineering Design Principles
- SOLID Principles in Data Engineering
- 10 Major DataOps Principles to Overcome Data Engineer Burnout Simplified

Questions for this session
- Can we form a list of practical data engineering principles?
- What do they mean to us in practice?
- Does usage for research data change anything?
The art of a data engineer is dealing with messy and complex data whilst maintaining clarity. Embrace complexity; avoid complicated.

Two main principal frameworks (sort of)
Software Engineering Principles:
- Flexibility
- Reproducibility
- Reusability
- Scalability
- Auditability
SOLID framework:
- Single Responsibility
- Open/Closed
- Liskov Substitution
- Interface Segregation
- Dependency Inversion
... and the acronyms:
- DRY: Don't Repeat Yourself
- YAGNI: You Ain't Gonna Need It
- KISS: Keep It Simple, Stupid

For the record - SOLID
- Single responsibility: a class should have only one responsibility.
- Open/closed: classes should be open for extension, but closed for modification.
- Liskov substitution: if A is a subtype of class B, we should be able to replace B with A without disrupting program behaviour. (If B is the class Car, then code that works for Car should work for ElectricCar.)
- Interface segregation: larger interfaces should be split into smaller ones, so that implementing classes only need to be concerned with the methods that interest them.
- Dependency inversion: high-level modules, which provide complex logic, should be easily reusable and unaffected by changes in low-level modules, which provide utility features.

Towards a Principles Framework for Data Engineering to Support Research
- Delivery Principles: principles that help data engineers deliver
- User Principles: principles that aid research
- Implementation Principles: principles for technical implementation (like SOLID)

What principles are key?

Classification | Principle | Practice | Implication for researchers
User | Change minimum | Keep data values as unchanged as possible. Even if adding value, keep originals. | Researchers can be confident that data are as recorded.
User | Derived variables in source data | Where several users need DVs, agree and add DVs to researcher data. | Users can work off consistent information and don't need to re-invent the wheel.
User | Standardise data | Provide data that meets standards; where necessary include original and standardised data. | Users can more easily interpret and compare data and results.
User | Use meaningful terms | Where possible use meaningful file and variable names, and meaningful category names (Male, Female vs 1, 2). | Makes analysis easier for the user, and reduces the risk of errors.
User | Ease of use | Prioritise ease of use over storage efficiency. | Easier for researchers.

Classification | Principle | Practice | Implication for researchers
Delivery | Reproducibility | Use a reproducible pipeline that meets good coding standards. | Reliable delivery, able to consistently re-supply; with adaptation, enables re-supply with changes.
Delivery | Reusability | Use functions and modular code. Generalise where possible. | Quicker, more reliable development. Helps consistency.
Delivery | Scalability | Write code and use tools that facilitate scaling to increased (or decreased) data volumes. |
Delivery | Auditability | Use procedures, tools, and code so that changes to pipelines and data can be traced, sourced, reasoned, and justified. | Helps ensure the reliability of data.
Delivery | Document and share | Document everything (code, derivations, etc.) and share documentation and code. | Researchers can understand how their data are derived.

Over to you
- Do you have any comments on these principles?
- What other principles are important, especially to researchers?
A reminder (classifications: User, Delivery, (Implementation)):
- User principles: Change minimum; Derived variables in source data; Standardise data; Use meaningful terms
- Delivery principles: Reproducibility; Reusability; Scalability; Auditability; Document and share

Structuring data for researchers

Structuring data
Underlying principle: make the data as easy to use as possible.
But modern data get complex fast! And linking across sources ratchets up the complexity!

Starting simple – a flat table
Simplest to analyse. Ideal for simple surveys, e.g. the Opinions Survey.

ID   | Type of House | Number of Bedrooms
HID1 | Semi detached | 2
HID2 | Terrace       | 4
HID3 | Flat          | 1

Sparse tables
- Advantages of a flat table, but not efficient for memory
- Good for linked data, e.g. ASHE linked to Census

ID   | House_type    | No. Bedrooms
HID1 | Semi detached | 2
HID2 | Terrace       | 4
HID3 | Flat          | 1

ID   | No. Cars
HID1 | 3
HID3 |
HID4 | 1

Linked:

ID   | House_type    | No. Bedrooms | No. Cars
HID1 | Semi detached | 2            | 3
HID2 | Terrace       | 4            | <NULL>
HID3 | Flat          | 1            |
HID4 | <NULL>        | <NULL>       | 1

Consider: null in the original vs null = no match.
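The null-in-original vs null-as-no-match distinction can be sketched as a small full join over the toy house/car tables above. This is an illustrative stdlib-only sketch, not any production linkage code; the sentinel value and field names are assumptions.

```python
# Sketch: linking two flat tables into one sparse table, keeping a null
# recorded in the source distinct from a null that just means "no match".
# Mirrors the toy house/car data on the slide; names are illustrative.

houses = {
    "HID1": {"House_type": "Semi detached", "No_bedrooms": 2},
    "HID2": {"House_type": "Terrace", "No_bedrooms": 4},
    "HID3": {"House_type": "Flat", "No_bedrooms": 1},
}
cars = {
    "HID1": {"No_cars": 3},
    "HID3": {"No_cars": None},  # null actually recorded in the source
    "HID4": {"No_cars": 1},
}

NO_MATCH = "<no match>"  # sentinel: row absent from that source

linked = {}
for hid in sorted(set(houses) | set(cars)):
    row = {"House_type": NO_MATCH, "No_bedrooms": NO_MATCH, "No_cars": NO_MATCH}
    row.update(houses.get(hid, {}))
    row.update(cars.get(hid, {}))
    linked[hid] = row

# HID3's car count is a genuine recorded null; HID2 simply has no car record.
assert linked["HID3"]["No_cars"] is None
assert linked["HID2"]["No_cars"] == NO_MATCH
```

If the two kinds of null are collapsed into one marker at linkage time, researchers can no longer tell missing data from unmatched records, which is exactly the "Consider" point on the slide.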

Wide, Stacked, Summarised
Three ways to structure the same data.

Summarised:
ID   | No. Cars
HID1 | 3
HID3 |
HID4 | 1

Stacked:
ID   | Model
HID1 | Ford Focus
HID1 | Toyota Aygo
HID1 | Range Rover Velar
HID3 | <NULL>
HID4 | Bentley Continental

Wide:
ID   | Model 1             | Model 2     | Model 3
HID1 | Ford Focus          | Toyota Aygo | Range Rover Velar
HID3 | <NULL>              | <NULL>      | <NULL>
HID4 | Bentley Continental | <NULL>      | <NULL>
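The three layouts are mechanical reshapes of one another. A minimal stdlib sketch of stacked-to-wide and wide-to-summarised, using the toy car data (column names are assumptions for illustration):

```python
# Sketch: reshaping the same car-model data between stacked (long), wide,
# and summarised layouts.
from collections import defaultdict

stacked = [
    ("HID1", "Ford Focus"),
    ("HID1", "Toyota Aygo"),
    ("HID1", "Range Rover Velar"),
    ("HID3", None),              # household present, model not recorded
    ("HID4", "Bentley Continental"),
]

# Stacked -> wide: one row per household, Model 1..n columns, padded with None.
by_id = defaultdict(list)
for hid, model in stacked:
    by_id[hid].append(model)

width = max(len(models) for models in by_id.values())
wide = {hid: models + [None] * (width - len(models)) for hid, models in by_id.items()}

# Wide -> summarised: count of non-null models per household.
summarised = {hid: sum(m is not None for m in models) for hid, models in wide.items()}

assert wide["HID1"] == ["Ford Focus", "Toyota Aygo", "Range Rover Velar"]
assert wide["HID4"] == ["Bentley Continental", None, None]
assert summarised == {"HID1": 3, "HID3": 0, "HID4": 1}
```

Note the padding step: the wide layout forces every row to the width of the widest household, which is why wide tables become impractical when the number of entries varies a lot.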

Over to you
- How would you structure the house data linked to the car models data? And why?
  - HID, House Type, No. Bedrooms
  - HID, Make 1, Make 2, etc.
- How would you structure data on pupils that gave all their qualifications? And why?
  - ID, Qualification type, Subject, Grade/Result

More on stacking
Useful when the row is about an entity of interest:
- e.g. ASHE data relate to employments
- Each person can have several employments
- Some research interest is around employments
Useful for data covering several time periods, e.g. (again) ASHE data.
A wide data structure is sometimes not practical:
- If there are lots of entries for an entity (e.g. cars to a household)
- Or if there is large variation in the number of entries
Stacking doesn't work well if there is lots of variation in data contents, e.g. New Earnings Survey (NES) questions varied dramatically from year to year.

Multiple flat files
What it says on the tin: separate flat files linked by common keys. E.g. the Labour Force Survey: Individual file, Household file, Longitudinal file.

ID       | HHID | Age | Employment Status
Person 1 | HH 1 | 45  | Full Time
Person 2 | HH 2 | 28  | Unemployed
Person 3 | HH 2 | 1   | <NULL>
Person 4 | HH 3 | 36  | Full Time
Person 5 | HH 3 | 34  | Economically inactive
Person 6 | HH 3 | 10  | <NULL>

HHID | Property type | No. Employed
HH1  | Flat          | 1
HH2  | Terrace       |
HH3  | Semi-detached | 1

Table + Spine
Used for stacked data. The spine contains 'fixed' or persistent data:
- Traditionally one spine entry per secondary entity
- But can be more than one if details change

PID | Income | Job Start
P1  | 28,000 | 01/04/2022
P1  | 32,000 | 01/08/2023
P2  | 96,000 | 01/01/2018

PID | Age at xx | Sex
P1  | 32        | Male
P2  | 52        | Female

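Analysing a table-plus-spine layout usually means attaching the spine's persistent attributes to each stacked row. A minimal sketch over the toy PID data (field names are illustrative, not from any real ASHE extract):

```python
# Sketch: joining a stacked jobs table to its person spine, so each job row
# carries the person's 'fixed' attributes.

jobs = [
    {"PID": "P1", "Income": 28000, "Job_start": "2022-04-01"},
    {"PID": "P1", "Income": 32000, "Job_start": "2023-08-01"},
    {"PID": "P2", "Income": 96000, "Job_start": "2018-01-01"},
]
spine = {
    "P1": {"Age": 32, "Sex": "Male"},
    "P2": {"Age": 52, "Sex": "Female"},
}

# One spine lookup per stacked row: each job inherits its person's attributes.
analysis = [{**job, **spine[job["PID"]]} for job in jobs]

assert analysis[0]["Sex"] == "Male" and analysis[0]["Income"] == 28000
assert len([r for r in analysis if r["PID"] == "P1"]) == 2
```

Storing Age and Sex once in the spine rather than on every job row is the storage/consistency trade the slide describes; the join recreates the flat view only when analysis needs it.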

Multiple Tables + Spine
Used for data covering multiple sources, where the data structure is not amenable to stacking etc.
- The data spine indicates the source of data and may contain fixed data; one entry per data subject
- The demographic spine gives 'fixed' or persistent data
- Can use multiple spines; typically one data spine + one demographic spine

Multiple Tables + Spine - example

Exam results:
ID   | Subject | Grade
ID 1 | English | 8
ID 1 | Maths   | 9
ID 2 | Greek   | 7
ID 2 | Latin   | 4
ID 3 | Physics | 9
ID 3 | French  | 2
ID 4 | Biology | 3

PLASC:
ID   | Sex | Age at xx | School ID
ID 1 | F   | 17        | ABC
ID 2 | M   | 18        | ABC
ID 4 | F   | 18        | DEF
ID 5 | M   | 11        | GHI
ID 6 | F   | 9         | GHI

Degree results:
ID   | Subject       | Result
ID 3 | Physics       | First
ID 7 | Media studies | 2:2

Data spine (which sources hold each subject):
ID   | Exam  | PLASC | Degree
ID 1 | TRUE  | TRUE  | FALSE
ID 2 | TRUE  | TRUE  | FALSE
ID 3 | TRUE  | FALSE | TRUE
ID 4 | TRUE  | TRUE  | FALSE
ID 5 | FALSE | TRUE  | FALSE
ID 6 | FALSE | TRUE  | FALSE
ID 7 | FALSE | FALSE | TRUE

Demographic spine:
ID   | Age at xx | Sex    | Postcode
ID 1 | 17        | M      | SW1A 1AA
ID 1 | 18        | M      | SW1A 1AA
ID 2 | 18        | F      | SW1A 0AA
ID 2 | 18        | F      | SW1A 0PW
ID 3 | 19        | <NULL> | PO15 5RR
ID 4 | 18        | F      | NP10 8XG
ID 5 | 11        | M      | DL1 5AD
ID 6 | 9         | F      | M1 6EU
ID 7 | 22        | F      | SW1P 4DF
ID 7 | 22        | F      | EC3N 4AB

Other data structures: JSON
JavaScript Object Notation
- Plain text
- Language independent
- Used to talk computer to computer
- Great for complex data
- But a pig to analyse without conversion!

JSON Example

{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" },
      { "id": "1004", "type": "Devil's Food" }
    ]
  },
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ]
}
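The "conversion" the previous slide mentions typically means flattening nested JSON into rows. A sketch using Python's standard json module on a trimmed version of the donut document (the stacked row layout chosen here is one of several reasonable options):

```python
# Sketch: converting nested JSON into a flat, stacked table of rows.
import json

doc = json.loads("""
{"id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55,
 "batters": {"batter": [{"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"}]},
 "topping": [{"id": "5001", "type": "None"},
             {"id": "5002", "type": "Glazed"}]}
""")

# One stacked row per (product, batter, topping) combination.
rows = [
    {"product": doc["name"], "batter": b["type"], "topping": t["type"]}
    for b in doc["batters"]["batter"]
    for t in doc["topping"]
]

assert len(rows) == 4  # 2 batters x 2 toppings
assert rows[0] == {"product": "Cake", "batter": "Regular", "topping": "None"}
```

Notice the blow-up: two nested lists become a cross-product of rows, which is one reason JSON is "great for complex data but a pig to analyse without conversion".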

Other data structures: Relational Databases
- A way of storing data across multiple tables
- Can be made available as multiple related tables
- Easier to analyse if translated to stacked or wide tables

Other data structures: Relational Database example

Student_ID | Name
1          | Alice
2          | Bob
3          | Cate

Student_ID | Course_ID
1          | 1
1          | 2
2          | 1
2          | 3
3          | 1
3          | 2
3          | 3

Course_ID | Course
1         | English
2         | Maths
3         | Science
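The student/course example can be reproduced with Python's built-in sqlite3 module, including the join that translates the three related tables into a single stacked table for analysis (table and column names are my own choices for illustration):

```python
# Sketch: the student/course relational example, translated to a stacked
# table with one SQL join.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE students   (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses    (course_id  INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE enrolments (student_id INTEGER, course_id INTEGER);
INSERT INTO students   VALUES (1,'Alice'),(2,'Bob'),(3,'Cate');
INSERT INTO courses    VALUES (1,'English'),(2,'Maths'),(3,'Science');
INSERT INTO enrolments VALUES (1,1),(1,2),(2,1),(2,3),(3,1),(3,2),(3,3);
""")

# Stacked view: one row per student-course pair.
stacked = con.execute("""
    SELECT s.name, c.course
    FROM enrolments e
    JOIN students s ON s.student_id = e.student_id
    JOIN courses  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.course
""").fetchall()

assert stacked[0] == ("Alice", "English")
assert len(stacked) == 7
```

The enrolments table is the many-to-many link; without the join, neither names nor course titles are visible in it, which is why relational data is usually translated before analysis.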

Other data structures: Graph databases
- Stores nodes and relationships, not tables
- Useful for complex linkage and matching
- Useful for visualisation
- Useful for uncertainty in data/relationships

Graph Database Example Image: By Ole Mussmann - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=87002327
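The node-and-relationship idea can be sketched without a graph database at all, as a plain list of labelled edges. Real systems (e.g. Neo4j) add query languages, indexing, and storage on top; the node labels and relationship types here are invented for illustration:

```python
# Sketch: nodes and typed relationships as a plain edge list, with a
# linkage-style query ("who shares Ann's address?").

edges = [
    ("Person:Ann", "LIVES_AT", "Address:1 High St"),
    ("Person:Bob", "LIVES_AT", "Address:1 High St"),
    ("Person:Ann", "EMPLOYED_BY", "Employer:Acme"),
]

def neighbours(node, rel):
    """All nodes reached from `node` via relationship type `rel`."""
    return {dst for src, r, dst in edges if src == node and r == rel}

ann_home = neighbours("Person:Ann", "LIVES_AT").pop()
cohabitants = {src for src, r, dst in edges
               if r == "LIVES_AT" and dst == ann_home}

assert cohabitants == {"Person:Ann", "Person:Bob"}
```

Relationships being first-class values (rather than implied by shared keys across tables) is what makes this layout convenient for linkage, matching, and attaching uncertainty to individual links.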

Other data structures (recap)
- JSON
- Relational databases
- Graph databases

Towards the future
Data queries:
- Pre-canned data queries
- Update when the data does
- Useful for consistency and for sharing common queries
- Creating/updating derived variables
Data views:
- Build a view for analysis; it re-presents the underlying data
- Can be built across several sources
- When the data are updated/changed, this is reflected in the view
- Cloud-based tools are making building custom views much easier
Dashboards:
- Present summary information, can be graphical
- Update when the data does
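A SQL view illustrates the "re-presents underlying data, updates when the data does" idea in miniature. A sketch with the stdlib sqlite3 module (the table, threshold, and view name are invented for illustration):

```python
# Sketch: a data view that re-presents underlying data and tracks changes
# to it, using a SQL view.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pay (pid TEXT, income INTEGER);
INSERT INTO pay VALUES ('P1', 28000), ('P2', 96000);
CREATE VIEW high_earners AS SELECT pid FROM pay WHERE income > 50000;
""")

assert [r[0] for r in con.execute("SELECT pid FROM high_earners")] == ["P2"]

# The view tracks the underlying table: new rows appear without rebuilding it.
con.execute("INSERT INTO pay VALUES ('P3', 70000)")
assert {r[0] for r in con.execute("SELECT pid FROM high_earners")} == {"P2", "P3"}
```

The view stores no data of its own, only the query, so every reader sees a consistent, up-to-date presentation of the source table; cloud warehouse views follow the same principle at scale.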

Structuring data – over to you!
- Are there any other data structures you have used?
- When have you had difficulty with structure?
- When has a structure worked well for you (for complex data)?
- What techniques have you found useful?