[DSC DACH 24] Data contracts and data quality in streaming - Ivan Dundovic

DataScienceConferenc1 24 views 21 slides Sep 18, 2024

About This Presentation

Treating data as a product has been a hot topic in data engineering for a while now. Data governance is crucial for teams to make sure they are delivering high-quality data products. Data contracts can be thought of as a governance bridge between data products and consumers. They represent an agreem...


Slide Content

Data Contracts and Data Quality in Streaming Ivan Dundović 8-Sep-24

About us: CROZ is a Croatian-German biz-tech consulting company that brings the most complex organizations to where new and improved value is created.

Agenda: Data quality in streaming; Data contracts; Putting it all together

Data quality: Data quality measures the fitness of data for its intended uses in operations, decision making, and planning. Without good data quality, organizations can’t trust their data.

Data quality dimensions

Data quality process

Data quality in static data: Data quality improvement is a continuous process. Measuring data quality in static data is usually done once per day, or sometimes even less frequently. We want to measure and improve data quality to ensure data can be trusted.

Data quality in streaming data: Due to the nature of streaming systems, data quality over multiple records in a dataset is hard to evaluate in (near) real time: consistency, uniqueness, accuracy.

[Architecture diagram. Labels: Data sources (Application, Database, API); Streaming platform; Operational systems (Transactions); Analytical systems (Analytics, ML)]

Data quality in streaming data: The least we want to achieve is to make it transparent to all users of the system if some records fail certain quality checks. [Diagram. Labels: Streaming data producers; Original topic; Data quality checker; OK; NOT OK]
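
The OK / NOT OK split on this slide can be sketched as a function that divides a batch of records into two outputs. This is a minimal Python sketch and not the Kafka Streams implementation from the talk; the record shape, the check names, and the `route` helper are assumptions made for illustration.

```python
from typing import Callable, Iterable

Record = dict
Check = Callable[[Record], bool]

# Hypothetical checks; in the talk's setup they would come from the data contract.
CHECKS: dict[str, Check] = {
    "has_id": lambda r: r.get("id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def route(records: Iterable[Record], checks: dict[str, Check]):
    """Split records into (ok, not_ok).

    Failed records are wrapped together with the names of the checks they
    failed, so consumers can see why a record was rejected.
    """
    ok, not_ok = [], []
    for record in records:
        failed = [name for name, check in checks.items() if not check(record)]
        if failed:
            not_ok.append({"record": record, "failed_checks": failed})
        else:
            ok.append(record)
    return ok, not_ok


ok, not_ok = route(
    [{"id": 1, "amount": 10.0}, {"id": None, "amount": -5}],
    CHECKS,
)
```

Attaching the failed check names to each rejected record is one way to provide the transparency the slide asks for: nothing is dropped silently, and consumers of the NOT OK output can see the reason.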

A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. https://datacontract.com/

[Diagram. Labels: Data producers; Original topic; Data Contract Checker (KStreams); Published data contract; Valid data; Invalid data; Data catalog; Stakeholders, roles, responsibilities; Open Data Contract Standard; Metadata; Master source]

Data Contract Checker: The data contract is defined following the Open Data Contract Standard (YAML). The real-time data contract checker validates: whether data ownership is defined; the schema (columns, data types, (non-)NULL constraints); DQ rules for specific elements or a list of values; DQ rules for a specific range of values; DQ rules for telephone numbers or address formats; SLA rules (is data delivered on time).
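
The per-record checks listed above (ownership, schema, list of values, range, format) can be sketched as follows. This is a simplified Python illustration, not the checker from the talk: the contract is a plain dictionary whose key names are hypothetical and are not the actual Open Data Contract Standard field names, and the rules are hard-coded rather than read from YAML.

```python
import re

# Hypothetical, simplified contract; the key names are illustrative only
# and are NOT the real Open Data Contract Standard field names.
CONTRACT = {
    "owner": "payments-team",
    "columns": {
        "customer_id": {"type": int, "nullable": False},
        "currency": {"type": str, "nullable": False, "allowed": ["EUR", "USD"]},
        "age": {"type": int, "nullable": True, "min": 0, "max": 120},
        "phone": {"type": str, "nullable": True, "pattern": r"^\+?[0-9]{7,15}$"},
    },
}


def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the record is valid."""
    violations = []
    # Ownership is a property of the contract itself, not of the record.
    if not contract.get("owner"):
        violations.append("contract: data ownership is not defined")
    for name, spec in contract["columns"].items():
        value = record.get(name)
        if value is None:
            if not spec["nullable"]:
                violations.append(f"{name}: must not be NULL")
            continue
        if not isinstance(value, spec["type"]):
            violations.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "allowed" in spec and value not in spec["allowed"]:
            violations.append(f"{name}: not in list of allowed values")
        if "min" in spec and value < spec["min"]:
            violations.append(f"{name}: below minimum")
        if "max" in spec and value > spec["max"]:
            violations.append(f"{name}: above maximum")
        if "pattern" in spec and not re.fullmatch(spec["pattern"], value):
            violations.append(f"{name}: does not match expected format")
    return violations


good = {"customer_id": 7, "currency": "EUR", "age": 41, "phone": "+38512345678"}
bad = {"customer_id": None, "currency": "GBP", "age": 130, "phone": "n/a"}
```

Here `validate(good, CONTRACT)` returns an empty list, while `validate(bad, CONTRACT)` reports one violation per column. The phone pattern is a deliberately loose assumption; a real contract would state its own format.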

Data contract general information

Dataset/table description

Columns description: Using the Common Expression Language, we can write data quality checks that are evaluated at runtime.

Common Expression Language (CEL): CEL is an expression language created by Google. It implements common semantics for expression evaluation, enabling different applications to interoperate more easily. CEL is designed to be embedded in an application and is ideal for extending declarative configurations. Project Nessie offers a Java implementation of CEL that we used in KStreams. https://github.com/projectnessie/cel-java

How we ensure data quality in streaming. Completeness: besides the Data Contract checker, Avro can be used. Validity: definitions of valid data are defined in the Data Contract through CEL and checked in near real time. Timeliness: SLAs are defined in the Data Contract and checked in near real time.
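
The timeliness check can be sketched as a comparison between a record's event time and the time it arrives at the checker. This is a minimal Python sketch under stated assumptions: the five-minute threshold, the `MAX_DELAY` constant, and the `is_on_time` helper are hypothetical, and the talk does not specify how its SLA rules are expressed.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: a record must reach the checker within 5 minutes of the
# event time it carries. The threshold is an illustrative value only.
MAX_DELAY = timedelta(minutes=5)


def is_on_time(event_time: datetime, arrival_time: datetime,
               max_delay: timedelta = MAX_DELAY) -> bool:
    """True if the record arrived within the allowed delay after its event time."""
    return arrival_time - event_time <= max_delay


event = datetime(2024, 9, 18, 12, 0, 0, tzinfo=timezone.utc)
on_time = is_on_time(event, event + timedelta(minutes=2))
late = is_on_time(event, event + timedelta(minutes=9))
```

In a streaming job, the arrival time would typically be the processing time at the checker and the event time would come from the record itself. Both timestamps need to be timezone-aware (or both naive) for the subtraction to work.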

Recap: In the PoC we demonstrated how data contracts and CEL can be used for data quality checks. Next steps: integrate with a data governance tool like Collibra; apply data contract changes to the checker at runtime.

Thank you!