Beyond Availability: The Seven Dimensions for Data Product SLOs

ScyllaDB · 26 slides · Jun 26, 2024

About This Presentation

In the software world, we're used to SLOs built around latency and availability. But in the data engineering universe, data products have different usage patterns, which means data product SLOs shouldn't rely only on latency and availability to guide product development. In fact, sometime...


Slide Content

A Better SLO for Data-Intensive Systems
Emily Gorcenski, Lead Data Scientist

The Value of Reliability

Common Signs of an Impending Data Engineering Disaster
Across businesses of all sizes, spread across all industries, we see common challenges with data architectures. These are warning signs you are not getting the most value out of your data investment.
- Data engineers solving ad hoc tickets for "data pulls"
- Data engineers constantly addressing regressions, backfills, schema changes, etc.
- Running studies/experiments and being unable to interpret the results
- Months of machine learning development, no productionization
- Concerns about scaling when your data volumes aren't in the terabytes
- Proliferation of shadow data infrastructure

Why Reliability Matters: Data Engineering Job Churn
Industry reports show Data Engineering as one of the fastest-growing tech jobs (50% YoY growth in 2019, according to Dice), while the Data Science market is contracting. Building a good work experience by empowering data engineers is important to sustained success in data. Institutional knowledge is irreplaceable.

Why Reliability Matters: Product and Data OKRs are Misaligned
Data teams often report to different stakeholders than product teams, and have different objectives and measures of success. This makes data subservient to product whims. We've seen this before: the pre-DevOps days, where software teams and operations teams were disconnected and not working together.
"Developers and IT/Ops professionals had separate (and often competing) objectives, separate department leadership, separate key performance indicators by which they were judged, and often worked on separate floors or even separate buildings. The result was siloed teams concerned only with their own fiefdoms, long hours, botched releases, and unhappy customers."

Data, AI, and Analytics are in rough shape
At companies of all sizes, in all industries, data teams are seeing the same problems:
- Data engineers always "firefighting"
- Low data quality and poor discoverability
- Long time-to-production
- No time to innovate/experiment
- High flight risk for data engineers
"Our biggest pain point is finding the right data…" "I spent 80% of my time cleaning data…" "It takes four to six weeks to develop an insight…" "Our data platform teams are creating a bottleneck…"

Service Level Theory: SLAs, SLOs, SLIs, what's the difference?
SL*s form the cornerstone of reliability engineering theory, but the concepts are frequently confused. SLOs are built around SLIs, and SLAs are written around SLOs.
- Service Level Indicators (SLIs): what we measure and how; they should be quantifiable and monitorable.
- Service Level Objectives (SLOs): communicate an intention; they are the target that drives engineering effort and should be connected to business and product value.
- Service Level Agreements (SLAs): agreements between parties specifying the intended service levels, the means by which they are measured, and the expectations of both parties.

Data is different.
We love to talk about SLAs for data teams, but we do them wrong and don't understand them. Analytical and big data users/stakeholders are suffering from poor data reliability, but we haven't really solved their problems yet.
Operational systems (e.g. microservices): low marginal value per transaction; high transaction volume; the user holds the data, the system performs a function; SLOs typically built around performance and availability.
Analytical systems (e.g. BI systems): high marginal value per transaction; low transaction volume; the system holds the data, the user performs a function; SLAs built around minimizing failures and bug fixes.

Good SLOs are really hard to define: microservices SLOs don't work
While the notions of latency and availability are still relevant, they are only two possible dimensions of several that matter for data-intensive applications. Moreover, the conventional ways of measuring latency and availability don't make sense for the application context.
- SLOs should be connected to user value
- SLOs should provide "wiggle room" to allow for experimentation
- SLOs should inform engineering improvements
- SLOs should be monitorable and visible

Let's build a smarter availability SLO

A better availability SLO: designing an availability SLO that makes sense for data-intensive systems
Problem: Our daily sales report needs to be ready by start of business every workday, so that our analysts can supply our sales and inventory management teams with the latest numbers. The job to compile the report runs nightly and takes several hours. At least once per week, the numbers are wrong or the job has failed.
Attempt 1. Proposed SLO: the data warehouse system must have 99.9% uptime.
Why to reject it:
- Not connected to user value: an empty table returns a successful query, but that doesn't help us
- We're paying for availability at 11 PM on Saturday night when no users are working

Attempt 2 (same problem). Proposed SLO: the daily report job must be completed by 7 AM every day.
Why to reject it:
- Binary outcome leaves no wiggle room: either the job is or isn't done
- Isn't decomposable and doesn't tell us how to do better
- No good way to measure a meaningful distribution

Attempt 3 (same problem). Proposed SLO: the daily report job must be completed by 7 AM 99% of the time.
Why to reject it:
- We have an error budget, but the outcome is still binary, and it's not clear what window we're measuring over
- Isn't decomposable and doesn't tell us how to do better
- Not really connected to user value

Attempt 4 (same problem). Proposed SLO: the number of minutes past 7 AM on any workday when the job isn't successfully completed cannot exceed 270 minutes over any 30-working-day window.
Why to accept it:
- Gives us lots of wiggle room to miss by a little bit here and there
- We can trace where our job runs long and improve it
- Not paying for reliability when no one is using the system
- Directly connected to user value: minutes matter

Attempt 4, rewritten as an uptime figure:
12 working hours/day = 720 min; 30 working days × 720 min = 21,600 min; 270 min down / 21,600 min = 1.25% downtime, i.e. 98.75% uptime.
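To make the arithmetic concrete, here is a minimal sketch (not from the slides) of how the Attempt 4 SLO could be monitored from per-workday completion timestamps; the function names, the data shape, and the assumption that each timestamp falls on the workday the report is due are all illustrative.

```python
from datetime import datetime, time

DEADLINE = time(7, 0)    # report must be ready by 7 AM
BUDGET_MINUTES = 270     # allowed lateness over any 30-working-day window
WINDOW_DAYS = 30

def minutes_late(completed_at: datetime) -> int:
    """Minutes past the 7 AM deadline; 0 if the job finished on time.
    Assumes the timestamp is dated on the workday the report is due."""
    deadline = completed_at.replace(hour=DEADLINE.hour, minute=DEADLINE.minute,
                                    second=0, microsecond=0)
    delta_minutes = (completed_at - deadline).total_seconds() / 60
    return max(0, int(delta_minutes))

def slo_met(completions: list[datetime]) -> bool:
    """completions: one completion timestamp per workday, oldest first."""
    lateness = [minutes_late(c) for c in completions]
    # Check every rolling 30-working-day window against the 270-minute budget.
    return all(sum(lateness[i:i + WINDOW_DAYS]) <= BUDGET_MINUTES
               for i in range(len(lateness) - WINDOW_DAYS + 1))
```

Fed one completion timestamp per workday, this checks the same budget as the slide's 270 min / 21,600 min calculation, but as a rolling window rather than a single uptime percentage.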

Six more SLOs for data systems

Seven useful SLO dimensions: let's design more agile, useful, and reliable data systems
Availability, Freshness, Timeliness, (System) Latency, Impact, Accuracy, Completeness.
You don't have to measure all of these. It's probably not a great idea to try to optimize around all of them.

But Emily, that's not canon: out with the old, in with the new!
The classical six data quality dimensions are good. But I think some of them are misleading and box us into centralized thinking. And I think some of them are overrated.
Image source: https://www.researchgate.net/figure/Data-quality-dimensions-8_fig1_355105056

A better freshness SLO: no more moldy data
Problem: Not all data can be real time, microbatched, or even processed daily. Consider a late-payment report. Monthly payments are due on the first of the month, but there's a five-day grace period and a necessary manual confirmation step that takes at least 10 days. This data can reliably be no fresher than 15 days, which sets a lower bound for freshness.
Proposed SLO: good ol' p99 latency! 99% of relevant results should be received within 15+n days (better: 15 days + x hours!).
Why to accept it: don't aim for 100%. There are always exceptions that need to be handled late. Don't punish yourself for being flexible enough to have exceptions.
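As one way to operationalize this, here is a small sketch (an assumption, not from the talk) that computes p99 freshness for the late-payment report from pairs of due date and available-in-report date; the two extra days of slack on top of the 15-day lower bound are made up for illustration.

```python
import numpy as np

FRESHNESS_TARGET_DAYS = 15 + 2   # 15-day lower bound plus an assumed 2 days of slack

def freshness_p99(records):
    """records: iterable of (payment_due_date, available_in_report_date) pairs."""
    ages_days = [(available - due).days for due, available in records]
    return np.percentile(ages_days, 99)

def freshness_slo_met(records) -> bool:
    # 99% of relevant results received within the target window.
    return freshness_p99(records) <= FRESHNESS_TARGET_DAYS
```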

Who owns data quality? It's not always the data engineers.
The previous example actually builds a data quality SLO around a process, not a system. Data engineers can't invent quality out of thin air. Being data driven means you're thinking about data quality at every step of the process. How might this SLO drive changes in behaviors? Exercise for the viewer…

A better timeliness SLO: data ASAP!
Problem: Different from freshness, we'll measure timeliness by how long it takes to use data once it's ready. Payroll closes on the 15th, and it takes 3 days to process timesheets. Payroll must be submitted to the processor on the 20th in order for employees to receive their salary by the 24th. Timeliness distinguishes the time it takes the data to be sourced from the time it takes for it to be usable.
Proposed SLO: we can use p99 like with freshness, but with a different threshold. Many people consider freshness and timeliness the same, but I don't: they reflect fundamentally different processes to measure.
Why to accept it: it's easy to measure and understandable. It gives us wiggle room and maps more closely to the user's value (i.e. the payroll department).
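A sketch of the same p99 mechanism applied to timeliness, measuring from the moment the data is ready to the moment it is used; the five-day target (ready on the 15th, submitted by the 20th) and the field names are assumptions.

```python
import numpy as np

TIMELINESS_TARGET_DAYS = 5   # assumed: timesheets ready on the 15th, submitted by the 20th

def timeliness_slo_met(events) -> bool:
    """events: iterable of (ready_at, used_at) datetime pairs, one per payroll run."""
    lags_days = [(used - ready).total_seconds() / 86400 for ready, used in events]
    return np.percentile(lags_days, 99) <= TIMELINESS_TARGET_DAYS
```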

A better (system) latency SLO: slow systems be gone!
Problem: Processing times for training machine learning models or running BI queries vary dramatically. Because of the unpredictable nature of these jobs, it's hard to right-size the compute cluster. And because the jobs vary in complexity and data volume, it's hard to know what good looks like. Users (data scientists, analysts) deserve the same love as our customers! A consistent user experience of data platforms should be a key objective.
Proposed SLO: p99 latency for one or more reference jobs. Experimentation is needed to identify a reasonable threshold.
Why to accept it: it's easy to implement and understand, and having a common baseline will help identify patterns to help with scheduling, cluster sizing, etc. Other proxy measures might include monitoring CPU, disk, and memory usage, cache size, etc.
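One possible shape for this, sketched under the assumption that `run_reference_job` stands in for a fixed, representative query or training step; the 120-second threshold is a placeholder to be tuned by experimentation, as the slide suggests.

```python
import time
import numpy as np

LATENCY_THRESHOLD_S = 120.0   # assumed threshold; tune based on observed baselines

def run_reference_job():
    ...  # placeholder: e.g. a fixed BI query against a fixed dataset

def measure_reference_latency(samples: int = 50) -> list[float]:
    """Run the reference job repeatedly and record wall-clock latency in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        run_reference_job()
        latencies.append(time.monotonic() - start)
    return latencies

def latency_slo_met(latencies: list[float]) -> bool:
    return np.percentile(latencies, 99) <= LATENCY_THRESHOLD_S
```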

A better impact SLO: no more wasted data engineering
Problem: Lots of high-performance systems are shuffling around a lot of big data at a lot of big cost, and none of it is being used. Consider a BI ecosystem with self-service dashboards. Multiple dashboards include a "number of sold items" KPI. Are they all needed? The ratio of dashboards to employees is close to or greater than 1 in some organizations. Is this really the most effective use of data?
Proposed SLO: frequency of use divided by frequency of update.
Why to accept it: data must be valuable to the end user to justify the cost of processing it. Of course, some data processing is necessary for compliance reasons, and some reports don't need to be used every day. But a good starting point is to assume this value to be O(1).
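A rough sketch of the use-to-update ratio as an SLI, assuming view and refresh events can be pulled from access logs; the event shape and the threshold of 1.0 (the O(1) starting point from the slide) are illustrative.

```python
from collections import Counter

def impact_ratios(view_events, refresh_events):
    """Both inputs: iterables of asset names, one entry per view / per refresh."""
    views, refreshes = Counter(view_events), Counter(refresh_events)
    return {asset: views[asset] / max(refreshes[asset], 1) for asset in refreshes}

def low_impact_assets(view_events, refresh_events, threshold=1.0):
    """Assets updated more often than they are used, i.e. candidates to retire."""
    ratios = impact_ratios(view_events, refresh_events)
    return [asset for asset, ratio in ratios.items() if ratio < threshold]
```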

A better accuracy SLO (BI edition): good enough is good enough
Problem: BI reporting often demands high accuracy. No one wants to tell the tax man the wrong numbers! Imagine putting together a logistics SLA compliance report. The accuracy of this report has an impact on your contracts and costs. Very high accuracy is important, but 100% accuracy is not always achievable or necessary, particularly if the output KPI is not continuous.
Proposed SLO: error rate can be really hard to compute. Instead, consider anomaly detection: track statistically significant changes in input distributions likely to affect the KPI. A p-value can be a good SLO.
Why to accept it: SLO misses aren't a bad thing. If you detect anomalies in your data, this merits deeper investigation. The data may end up being accurate, and that's ok.
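For example, a two-sample Kolmogorov-Smirnov test is one common way to turn "statistically significant change in an input distribution" into a p-value; this sketch assumes a trusted baseline window and a 0.01 threshold, both illustrative choices rather than anything prescribed by the talk.

```python
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01   # assumed significance level for the SLO

def input_drift_detected(baseline_values, current_values) -> bool:
    """Two-sample KS test between a trusted historical window and the current batch.
    A True result is an SLO miss: a prompt to investigate, not necessarily a bug."""
    result = ks_2samp(baseline_values, current_values)
    return result.pvalue < P_VALUE_THRESHOLD
```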

A better accuracy SLO (machine learning edition): users change, change with them
Problem: Machine learning models are periodically retrained, and data scientists use a variety of metrics to judge them. The models keep getting better, but the customer experience (and revenue) keeps getting worse! Unfortunately, many machine learning metrics are academic and uncoupled from user value. A model may get a very good score if trained on scenarios the customer doesn't care about!
Proposed SLO: anything but your standard ML metrics! F1 means nothing in the real world! AUC means nothing in the real world! What's important is to track model performance over time based on the job to be done.
Why to accept it: user behavior is constantly changing, and decisions need to be made with respect to today's truth, not yesterday's.
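As a sketch of "track performance based on the job to be done", the example below follows an assumed business signal (the share of served recommendations that users act on) over a rolling window instead of an offline score; the metric, target, and window are all placeholders.

```python
import numpy as np

TARGET_ACCEPT_RATE = 0.15   # assumed business target for acted-on recommendations
WINDOW_DAYS = 14            # assumed rolling window

def accuracy_slo_met(daily_accept_rates: list[float]) -> bool:
    """daily_accept_rates: fraction of served recommendations acted on, per day."""
    if len(daily_accept_rates) < WINDOW_DAYS:
        return True   # not enough history yet to judge the SLO
    recent = daily_accept_rates[-WINDOW_DAYS:]
    return float(np.mean(recent)) >= TARGET_ACCEPT_RATE
```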

A better completeness SLO: quantify the effect of missing data, not the amount of missing data
Problem: Imagine doing product analysis and wanting to understand the demographics of who is buying your products. A fraction of users have opted not to share their gender or age information. Not all missing data is wrong: historical data that didn't include a field, optional fields, or withheld information are all valid forms of incomplete data.
Proposed SLO: estimate the effect of incomplete data by sampling from your complete dataset and mocking incomplete data. Use this to set a relevant incomplete-data percentage KPI.
Why to accept it: completeness plays against other SLOs, such as timeliness. Is it really necessary to have the entire population to compute relevant insights?
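One way to run that estimate, sketched with assumed pandas column names and an assumed KPI: mask a fraction of a known-complete sample, recompute the KPI, and see how far it moves.

```python
import numpy as np
import pandas as pd

def kpi(df: pd.DataFrame) -> float:
    """Illustrative KPI: share of purchases made by the 18-34 age group."""
    return float(df["age"].between(18, 34).mean())

def completeness_effect(complete_df: pd.DataFrame, missing_frac: float,
                        trials: int = 100, seed: int = 0) -> float:
    """Average absolute KPI shift when `missing_frac` of ages are withheld."""
    rng = np.random.default_rng(seed)
    base = kpi(complete_df)
    shifts = []
    for _ in range(trials):
        masked = complete_df.copy()
        drop = rng.random(len(masked)) < missing_frac     # randomly withhold ages
        masked.loc[drop, "age"] = np.nan
        shifts.append(abs(kpi(masked.dropna(subset=["age"])) - base))
    return float(np.mean(shifts))
```

Sweeping `missing_frac` over a range then shows how much incompleteness the KPI can tolerate before the answer meaningfully changes, which is the percentage to write into the SLO.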