Treating data as a product has been a hot topic in data engineering for a while now. Data governance is crucial for teams that want to make sure they are delivering high-quality data products. Data contracts can be thought of as a governance bridge between data products and consumers. They represent an agreement that data products will conform to the specified structure, format, timelines, data quality, and more. Data contracts, data quality, and data governance in general have many aspects, and each one brings its own set of challenges that need to be considered when designing data products. We at CROZ have been working with data for over 15 years and have been involved in various data quality challenges. In one of our latest projects, we helped our client implement data governance, with special attention on implementing data contracts, especially in streaming applications. In this talk, you will hear about some of the challenges we faced on the project and how data contract enforcement drives better data quality.
Size: 2.76 MB
Language: en
Added: Sep 18, 2024
Slides: 21 pages
Slide Content
Data Contracts and Data Quality in Streaming Ivan Dundović 8-Sep-24
About us: CROZ is a Croatian-German biz-tech consulting company that brings the most complex organizations to where new and improved values are created.
Agenda: data quality in streaming; data contracts; putting it all together.
Data quality: Data quality measures the fitness of data for its intended uses in operations, decision making, and planning. Without good data quality, organizations can't trust their data.
Data quality dimensions
Data quality process
Data quality in static data: Data quality improvement is a continuous process. Measuring data quality in static data is usually done once per day, or sometimes even less frequently. We want to measure and improve data quality to ensure data can be trusted.
Data quality in streaming data: Due to the nature of streaming systems, data quality over multiple records in a dataset is hard to evaluate in (near) real time. This applies to consistency, uniqueness, and accuracy.
[Diagram labels: data sources (application, database, API); streaming platform; operational systems (transactions); analytical systems (analytics, ML)]
Data quality in streaming data: The least we want to achieve is to make it transparent to all users of the system when records fail certain quality checks. [Diagram labels: streaming data producers; original topic; data quality checker; OK; NOT OK]
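The OK / NOT OK split on this slide can be illustrated with a minimal Python sketch. It is not the checker from the project: the check names, record fields, and the `route` function are hypothetical, and in-memory lists stand in for the output topics.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = dict
Check = Callable[[Record], bool]


def route(records: Iterable[Record], checks: Dict[str, Check]) -> Tuple[List[Record], List[dict]]:
    """Split records into (ok, not_ok).

    Each not_ok entry keeps the original record plus the names of the
    checks it failed, so consumers can see why it was rejected.
    """
    ok, not_ok = [], []
    for record in records:
        failed = [name for name, check in checks.items() if not check(record)]
        if failed:
            not_ok.append({"record": record, "failed_checks": failed})
        else:
            ok.append(record)
    return ok, not_ok


# Hypothetical quality checks; the field names are assumptions for illustration.
checks = {
    "customer_id_present": lambda r: r.get("customer_id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}

records = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": None, "amount": -5},
]

ok, not_ok = route(records, checks)
```

Keeping the failed check names next to each rejected record is one way to provide the transparency the slide asks for: nothing is silently dropped, and the reason for rejection travels with the data.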
A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers (https://datacontract.com/).
[Diagram labels: data producers; original topic; Data Contract Checker (KStreams); valid data; invalid data; published data contract; data catalog; stakeholders, roles, responsibilities; Open Data Contract Standard; metadata; master source]
Data Contract Checker: The data contract is defined following the Open Data Contract Standard (YAML). The real-time data contract checker validates: whether data ownership is defined; schema (columns, data types, (non-)NULL constraints); DQ rules for specific elements or lists of values; DQ rules for a specific range of values; DQ rules for telephone number or address formats; SLA rules (is data delivered on time).
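The kinds of checks listed on this slide could look roughly like the Python sketch below. It is a simplified illustration under stated assumptions: the `CONTRACT` dictionary is a hypothetical stand-in for a contract already parsed from YAML and does not follow the actual Open Data Contract Standard layout, and the rules are plain Python rather than CEL expressions evaluated in KStreams.

```python
import re
from typing import Any, Dict, List

# Hypothetical, simplified contract (NOT the real ODCS structure).
CONTRACT: Dict[str, Any] = {
    "owner": "customer-data-team",
    "columns": {
        "customer_id": {"type": int, "required": True},
        "country": {"type": str, "required": True, "allowed": ["HR", "DE", "AT"]},
        "age": {"type": int, "required": False, "min": 0, "max": 120},
        "phone": {"type": str, "required": False, "pattern": r"^\+[0-9]{8,15}$"},
    },
}


def validate(record: Dict[str, Any], contract: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations for one record (empty if valid)."""
    violations = []
    # Ownership check: the contract itself must name an owner.
    if not contract.get("owner"):
        violations.append("contract: owner not defined")
    for name, spec in contract["columns"].items():
        value = record.get(name)
        # (Non-)NULL constraint.
        if value is None:
            if spec.get("required"):
                violations.append(f"{name}: must not be NULL")
            continue
        # Data type check; skip further rules if the type is wrong.
        if not isinstance(value, spec["type"]):
            violations.append(f"{name}: expected {spec['type'].__name__}")
            continue
        # DQ rule: value must be in a list of allowed values.
        if "allowed" in spec and value not in spec["allowed"]:
            violations.append(f"{name}: not in allowed values")
        # DQ rule: value must fall in a range.
        lo, hi = spec.get("min"), spec.get("max")
        if (lo is not None and value < lo) or (hi is not None and value > hi):
            violations.append(f"{name}: outside allowed range")
        # DQ rule: value must match a format (for example a phone number).
        if "pattern" in spec and not re.fullmatch(spec["pattern"], value):
            violations.append(f"{name}: does not match expected format")
    return violations


good = {"customer_id": 7, "country": "HR", "age": 34, "phone": "+385911234567"}
bad = {"customer_id": None, "country": "FR", "age": 130, "phone": "0911234567"}

good_violations = validate(good, CONTRACT)
bad_violations = validate(bad, CONTRACT)
```

A record with an empty violation list would go to the valid output and any other record to the invalid output, together with its violations.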
Data contract general information
Dataset/table description
Columns description: Using the Common Expression Language, we can write data quality checks that are validated at runtime.
Common Expression Language (CEL): CEL is an expression language created by Google. It implements common semantics for expression evaluation, enabling different applications to interoperate more easily. CEL is designed to be embedded in an application and is ideal for extending declarative configurations. Project Nessie offers a Java implementation of CEL, which we used in KStreams: https://github.com/projectnessie/cel-java
How we ensure data quality in streaming: Completeness: besides the Data Contract checker, Avro can be used. Validity: definitions of valid data are specified in the data contract through CEL and checked in near real time. Timeliness: SLAs are defined in the data contract and checked in near real time.
Recap: In the PoC we demonstrated how data contracts and CEL can be used for data quality checks. Next steps: integrate with a data governance tool such as Collibra; apply data contract changes to the checker at runtime.
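A timeliness check against an SLA can be sketched as a comparison between the time an event was produced and the time it was received. The function name, the five-minute SLA, and the timestamps below are assumptions for illustration; the slides do not specify how the SLA is expressed in the contract.

```python
from datetime import datetime, timedelta, timezone


def is_on_time(event_time: datetime, received_time: datetime, max_delay: timedelta) -> bool:
    """Return True if the record arrived within the allowed delay."""
    return received_time - event_time <= max_delay


# Hypothetical SLA: records must arrive within five minutes of being produced.
sla = timedelta(minutes=5)
event_time = datetime(2024, 9, 18, 12, 0, 0, tzinfo=timezone.utc)

on_time = is_on_time(event_time, event_time + timedelta(minutes=3), sla)
late = is_on_time(event_time, event_time + timedelta(minutes=8), sla)
```

Using timezone-aware timestamps avoids comparing producer and consumer clocks that are expressed in different local times.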