Data Contracts Course - Data Management & Data Quality

MirkoPeters · 124 slides · Jan 16, 2025

About This Presentation

Enroll Now and Transform Your Data Skills! https://go.tdaa.link/DataContracts

Take your data management skills to the next level with our Data Contracts Certification Course! This comprehensive program is designed to equip professionals with the foundational knowledge and practical expertise requir...


Slide Content

Data Management and Quality Course
Strategies, Governance, and Innovations for Effective Data Practices

Table of contents

1. Introduction to Data Quality
2. Importance of Data Quality
3. Role of Data Contracts
4. Building Effective Data Systems
5. Adoption Strategies for Data Contracts
6. Defining Data Products
7. Design Considerations for Data Products
8. Practical Examples of Data Products
9. Data Consumers and Generators
10. Responsibilities in Data Management
11. Feedback Loops in Data Management
12. Necessity of Data Governance
13. Common Applications of Data Governance
14. Promoting Governance through Data Contracts
15. Data Architecture Council
16. Federated Governance Implementation
17. Evolution of Data Storage Technologies
18. Amazon Redshift
19. SQL-Compatible Warehouses
20. Modern Data Stack Components
21. Data Lakehouse Concept
22. Operational Data Store (ODS)
23. Cost and Accessibility of Data Lakehouses
24. Data Accessibility in Organizations
25. Data Culture and Quality
26. Competitive Advantage through Data
27. Case Study: Consumer Sector
28. Definition of Data Contracts
29. Schema Components
30. Flexibility vs. Interoperability
31. Data Quality Checks
32. Utility of Data Contracts
33. Introduction to Data Mesh
34. Principles of Data Mesh
35. Role of Data Contracts in Data Mesh
36. Adoption of Data Contracts
37. Roles in Data Management
38. Cultural Change in Data Management
39. Engagement Strategy for Data Generators
40. Communicating Benefits of Data Contracts
41. Creating Usable Data Products
42. E-commerce Example: Discounts
43. Data Management Issues in E-commerce
44. Simplification of Database Schema
45. Roles of Data Consumers and Generators
46. Responsibilities of Data Engineers
47. Data Analysts and Business Users
48. Understanding Data Structure
49. Data Dependability and Performance
50. Value Delivery through Data Contracts
51. Breaking Changes in Data Management
52. Non-Breaking Changes in Data Management
53. Migration Path for Data Contracts
54. Data Lineage Tools
55. Expectations Management in Data Migration
56. Decentralized Data Governance
57. Roles and Responsibilities in Data Governance
58. Data Contracts for Metadata Management
59. Data Architecture Council
60. Federated Data Governance
61. Example of YAML-Based Data Contract
62. Defining a Schema for Data Contracts
63. Schema Example: Customer Record
64. Importance of Structured Data
65. Tooling and Functionality of Schemas
66. Documentation in Schemas
67. Ownership in Data Contracts
68. Elements of a Data Contract
69. Metadata Capture in Data Contracts
70. Recommended Languages for Data Contracts
71. Organizational Context for Data Contracts
72. Contract-Driven Data Architecture
73. Data Processing Services
74. Anonymization Strategies
75. Data Governance and Visibility
76. Empowerment of Data Generators
77. Formation of Data Infrastructure Team
78. Guidelines and Guardrails for Data Generators
79. Agility and Autonomy in Data Management
80. Consistency in Data Management
81. Incident Management in Data Contracts
82. Return on Investment in Data Contracts
83. Creating a Data Contract
84. Anonymization Service Example
85. Components of a Data Contract
86. Providing Interfaces to Data
87. BigQuery Schema Creation
88. Managing BigQuery Tables with Pulumi
89. Populating a Central Schema Registry
90. Schema Retrieval in Confluent Registry
91. Version Management in Schema Registry
92. Schema Evolution in Data Contracts
93. Non-Breaking Changes Example
94. Getting Started with Data Contracts
95. Migrating to Data Contracts
96. Discovering Data Contracts
97. Building a Data Contracts-Backed Culture
98. Migration Plan for Data Assets
99. Collaboration with Data Consumers
100. Engagement with Data Generators
101. Setting Deadlines for Migration
102. Further Reading on Data Management
103. Ensuring Data Quality in Publishing
104. Post-Publishing Monitoring
105. Data Observability Tools
106. Performance and Dependability Monitoring
107. Transactional Outbox Pattern
108. Event Generation in Outbox Pattern
109. Performance Improvement with Outbox Pattern
110. Drawbacks of Outbox Pattern
111. Popularity of Outbox Pattern
112. Conclusion and Next Steps

1. Introduction to Data Quality
Understanding Data Quality: Data quality refers to the condition of a dataset, determined by factors such as accuracy,
completeness, reliability, and relevance. High-quality data is essential for effective decision-making and operational
efficiency within an organization.
Importance of Data Quality: Ensures reliable insights: Quality data leads to accurate analysis and informed decision-
making, reducing the risk of errors in business strategies. Enhances operational efficiency: By minimizing data
discrepancies, organizations can streamline processes and reduce the time spent on data correction and validation.
Challenges in Maintaining Data Quality: Lack of clear expectations: Users often have unclear expectations regarding data
reliability and origin, leading to potential misuse and misinterpretation. Reactive management: Traditional approaches
often react to data quality issues rather than proactively addressing them at the source, resulting in ongoing challenges.
Strategies for Improving Data Quality: Shift-left approach: Assigning responsibility for data quality to data generators
ensures that those closest to the data understand its structure and implications, fostering accountability. Implementing
data quality checks: Establishing checks at the source can mitigate risks and reinforce the importance of data quality
among data generators.
The Role of Data Contracts: Data contracts serve as formal agreements that outline expectations for data quality,
including schema definitions and service level objectives (SLOs). They promote accountability and ensure that both data
generators and consumers are aligned on quality standards.

2. Importance of Data Quality
Poor data quality can lead to costly errors and inefficiencies.
By shifting responsibility for data quality to data generators
and implementing proactive quality checks at the source,
organizations can reduce incidents caused by upstream data
changes.
This shift minimizes the risk of invalid data affecting
downstream processes and lowers the overall costs
associated with data management and resolution of data-
related issues.
Mitigating Risks and Costs
Investing in data quality directly correlates with
enhanced business outcomes.
Organizations that prioritize data quality can expect to
see significant returns on their investments, as quality
data supports the development of data-driven products
and services.
Top-performing retailers that utilize data effectively are
reported to be 83% more profitable, showcasing the
financial benefits of quality data.
Enhancing Business Value
Foundation for Decision-Making
High-quality data is essential for informed decision-
making within organizations.
It enables leaders to understand past and present
situations accurately and make reliable predictions
for the future.
Organizations that leverage quality data can make
faster, more effective decisions, leading to improved
operational efficiency and competitive advantage.

3. Role of Data Contracts
Driving Cultural Change in Data
Management
Encourages a cultural shift towards valuing
data quality and reliability.
Leads to better business outcomes and
increased investment in data initiatives.
Supporting Decentralization and
Autonomy
Empowers data generators to own their data
contracts.
Organizations can reduce bottlenecks and
enable faster, more efficient data management
processes.
Promoting Data Quality and Governance
Include provisions for data quality checks and
governance policies.
Allows data generators to manage metadata
effectively, including data classification and
sensitivity.
Enhancing Collaboration
Facilitate communication between data
generators and consumers.
Fosters a partnership that aligns data
production with business needs and enhances
trust in data quality.
Establishing Clear Expectations
Data contracts define the schema,
expectations, and service level objectives
(SLOs) for data delivery.
Ensures that both data generators and
consumers have a mutual understanding of
their responsibilities.

4. Building Effective Data Systems
Establish a robust data governance framework that includes
compliance with data policies and external regulations.
This framework should also incorporate data quality checks
to monitor and maintain the integrity of the data products.
By ensuring that data governance is embedded within the
data management processes, organizations can maximize
the value of their data as a strategic asset.
Implement Governance and
Quality Checks
Encourage a collaborative environment between data generators
and consumers.
This partnership is crucial for enhancing data quality and
usability.
Data generators should feel a sense of ownership over the data
outcomes, motivated by clear communication from consumers
about their needs.
This collaboration will lead to the development of reliable data
products that can be confidently utilized across the organization.
Foster Collaboration
Implement data contracts as formal agreements
that outline the schema, expectations, and service
level objectives (SLOs) for the data being generated.
These contracts facilitate better communication
between data generators and consumers, ensuring
that both parties have a mutual understanding of the
data's purpose and quality standards.
Establish Data Contracts
Begin by establishing a clear definition of what
constitutes a data product within your organization.
A data product is a high-quality dataset designed for
consumption, meeting the specific requirements and
expectations of its users.
It should be scoped to a single business domain and
owned by the relevant teams, ensuring that it relates to
specific business entities such as customers or orders.
Define Data Products

5. Adoption Strategies for Data Contracts
Select a relevant use case that aligns with your objectives
to serve as a proof of concept.
This POC should demonstrate the value of data contracts
and involve the necessary resources and personnel.
By successfully delivering this initial project, you can build
momentum and support for broader adoption across the
organization.
Implement a Proof of Concept (POC)
Foster collaboration between data generators (those who
create data) and data consumers (those who utilize data).
Involve both parties early in the planning process to define
data needs and requirements.
This engagement not only builds a sense of ownership
but also ensures that the data contracts meet the actual
needs of the business.
Engage Data Generators and
Consumers
Begin by identifying the key objectives for implementing
data contracts within your organization.
Focus on specific problems you aim to solve, such as
improving data pipeline dependability or enhancing user
trust in data.
This clarity will guide the adoption process and ensure
alignment with business goals.
Define Clear Objectives

6. Defining Data Products
Data products can be derived from other data
products.
Creates a supply chain that enhances data
utilization across various business
applications.
Promotes efficiency and maximizes the value
extracted from data within the organization.
Interconnectedness of Data Products
Each data product holds intrinsic value for
decision-making and operational processes.
Must be easily discoverable and accessible
through a stable interface.
Supported by comprehensive documentation
detailing the data's fields, values, and
limitations.
Value and Accessibility
Data products are scoped to a single business
domain.
Owned by the teams within that domain.
Ensures relevance and tailoring to specific
business entities, such as customers or orders.
Ownership and Scope
What is a Data Product?
A data product is a high-quality dataset
specifically designed for consumption by
others.
Ensures it meets their requirements and
expectations.
Serves as a trustworthy resource that users
can confidently build upon.

7. Design Considerations for Data Products
Prioritize high-quality data to enhance the
product's trustworthiness.
Implement data quality checks and set clear
expectations for data completeness (100%
expected), timeliness (data available within 60
minutes), and availability (95% accessible for
querying) to foster user confidence and
support informed decision-making.
Focusing on Data Quality and
Reliability
Create formal agreements that outline the
schema, expectations, and service level
objectives (SLOs) for the data product.
Considerations for data access methods,
performance impacts, and data format
stability, ensuring clarity and accountability
between data generators and consumers.
Establishing Data Contracts
Understanding Consumer
Requirements
Identify the specific needs and expectations of
data consumers to ensure the data product is
tailored to solve relevant business problems.
Engaging with consumers helps define a clear
schema and enhances the product's usability.

8. Practical Examples of Data Products
Overview: This data product is designed to manage and analyze order processing
within an e-commerce platform, focusing on discounts applied to products.
Key Features:
Data Schema: The product includes a well-defined schema that captures essential
fields such as `created_at`, `items`, `product_id`, `price`, and `quantity`. This
structure allows for clear understanding and utilization of the data.
Service Level Objectives (SLOs): The data product sets specific expectations for
performance, including:
Completeness: Ensures 100% of order data is captured.
E-commerce Order Processing Data Product
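As a rough illustration, the order data product described above might be captured in a YAML-based contract along the following lines. The SLO values mirror the design considerations slide (100% completeness, data within 60 minutes, 95% availability); the field types, version, and owning team name are illustrative assumptions rather than the course's canonical template.

```yaml
# Illustrative sketch only; types, owner, and version field are assumptions.
data_contract:
  name: orders
  version: 1.0.0
  owner: ecommerce-orders-team        # hypothetical owning team
  description: Orders placed on the e-commerce platform, including applied discounts
  schema:
    fields:
      - name: created_at
        type: timestamp
        required: true
      - name: items
        type: array
        items:
          fields:
            - name: product_id
              type: string
              required: true
            - name: price
              type: decimal
              required: true
            - name: quantity
              type: integer
              required: true
  slos:
    completeness: "100%"      # all order data captured
    timeliness: 60 minutes    # data available within 60 minutes
    availability: "95%"       # accessible for querying
```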

9. Data Consumers and Generators
Ownership of Data Quality: Data generators are responsible
for maintaining the quality and reliability of the data they
produce. This includes understanding the impact of their
data on consumers and being incentivized to uphold high
standards.
Consumer Accountability: Data consumers must
demonstrate the value derived from the data, ensuring that
investments in data quality yield tangible business
outcomes. This reciprocal relationship enhances the
overall effectiveness of data management within the
organization.
Value Generation and Accountability
Data Contracts: Formal agreements that outline
expectations, responsibilities, and service-level objectives
(SLOs) between data consumers and generators. These
contracts enhance communication and ensure that both
parties understand their roles in the data ecosystem.
Feedback Mechanisms: Establishing regular feedback
loops between consumers and generators fosters a
culture of accountability and continuous improvement,
ensuring that data quality meets consumer needs.
Collaboration and Communication
Understanding Roles
Data Consumers: Individuals or teams that utilize
data for analysis, reporting, and decision-making.
They require clear access to data and
understanding of its structure and context to derive
business value.
Data Generators: Individuals or services that create
data for future use. This includes software
engineers, data engineers, and third-party services,
each playing a crucial role in the data supply chain.

10. Responsibilities in Data Management
Data generators are responsible for owning
and managing data contracts.
These contracts define the expectations
around data quality, structure, and governance.
Ownership enables informed decisions
regarding data generation, classification, and
compliance with organizational policies.
Fostering accountability and enhancing data
quality.
Ownership of Data Contracts
Empowerment of Data Generators
Data generators are tasked with providing
accurate and relevant data to meet the needs
of data consumers.
They must balance trade-offs between
consumer preferences and the feasibility of
data generation.
Ensuring that the data produced aligns with
organizational standards.

11. Feedback Loops in Data Management
Foster a culture of continuous improvement by encouraging regular
feedback from data consumers regarding the data provided.
This feedback should be used to make iterative adjustments to data
products and processes.
By actively seeking input and adapting based on consumer
experiences, organizations can enhance data quality, build trust, and
ensure that data products evolve to meet changing business needs
effectively.
Encourage Iterative Feedback and
Adaptation
Develop and utilize data contracts that outline the expectations
regarding data quality, reliability, and performance.
These contracts serve as formal agreements that clarify the
requirements of data consumers and the responsibilities of data
generators.
By having these contracts in place, organizations can create a
structured framework that facilitates accountability and
encourages continuous feedback on data usage and quality.
Implement Data Contracts
Initiate effective communication between data generators and
consumers to ensure that both parties understand their roles and
responsibilities.
This step is crucial for defining expectations and fostering a
collaborative environment where feedback can be shared openly.
By establishing these channels, data generators can gain insights
into the needs of data consumers, leading to improved data
quality and usability.
Establish Clear Communication
Channels

12. Necessity of Data Governance
Mitigating Risks
By implementing clear roles and
responsibilities within data governance,
organizations can better manage risks
associated with data misuse and breaches.
This proactive approach helps in identifying
and addressing vulnerabilities before they lead
to significant issues.
Fostering a Data-Driven Culture
Successful data governance initiatives
promote data literacy across the organization,
encouraging a culture where data is valued and
utilized in decision-making processes.
This cultural shift is essential for maximizing
the potential of data assets.
Promoting Data Accessibility
A well-structured governance framework
facilitates easier access to data for authorized
users while maintaining security protocols.
This balance is vital for empowering teams to
utilize data effectively without compromising
sensitive information.
Enhancing Data Quality
Effective data governance establishes
standards and processes that ensure data is
accurate, consistent, and reliable.
This enhances the overall quality of data, which
is essential for informed decision-making and
operational efficiency.
Ensuring Compliance
Organizations must adhere to regulatory
requirements for data handling to avoid
significant fines and reputational damage.
Compliance with laws such as the General
Data Protection Regulation (GDPR) is crucial
for managing personal data effectively.

13. Common Applications of Data Governance
Data governance initiatives play a crucial role in
fostering a data-driven culture within
organizations.
By enhancing data literacy and accessibility,
these initiatives empower employees to utilize
data effectively, driving better business
outcomes and encouraging informed decision-
making across all levels of the organization.
Data Culture Promotion
Effective data governance frameworks
establish standards and processes for
maintaining data quality.
This includes implementing data quality
checks to identify and rectify issues, ensuring
that data remains accurate, consistent, and
reliable for decision-making.
Data Quality Management
Organizations implement data governance to
ensure adherence to various regulatory
requirements, such as GDPR.
This compliance helps avoid significant fines
and protects the organization's reputation by
ensuring that personal data is handled securely
and responsibly.
Regulatory Compliance

14. Promoting Governance through Data Contracts
Utilize data contracts as a source of truth for metadata, which
includes data classification, sensitivity, and governance policies. This
ensures that data remains up to date and compliant with regulatory
requirements.
Implement tooling that automates governance processes, allowing for
seamless integration with privacy and data management systems,
thereby enhancing the overall quality and accessibility of data across
the organization.
Automation and Compliance
Empower data generators with autonomy to manage their data,
reducing bottlenecks associated with centralized control. This
approach allows for quicker decision-making and enhances the
responsiveness of data management practices.
Establish clear standards and documentation from a central data
governance council to guide data generators in classifying and
managing their data effectively, ensuring compliance with
organizational policies.
Decentralized Data Governance
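As an illustration of the metadata such a contract can act as a source of truth for, a governance block might look roughly like the sketch below; the field names and policy values are assumptions for the sake of example, not a prescribed format.

```yaml
# Hypothetical governance metadata carried by a data contract;
# field names and policy values are illustrative assumptions.
governance:
  data_classification: personal_data          # e.g. personal vs. non-personal data
  sensitivity: high
  retention_period: 365 days
  deletion_policy: delete_on_customer_request
  anonymization_strategy: pseudonymize_customer_identifiers
```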

15. Data Architecture Council
Define the council's scope and objectives,
focusing on promoting a data-driven culture
and effective collaboration.
Limit membership to no more than 10
participants to enhance decision-making and
communication efficiency.
Secure sponsorship from a senior leader to
provide authority and accountability, ensuring
the council's objectives are met.
Setting Up the Council
Data Product Managers: Ensure the quality and
relevance of data products, addressing
concerns related to data delivery.
Legal, Privacy, and Security Experts: Provide
guidance on compliance with regulations and
clarify data handling requirements.
Data Platform Representatives: Implement
data contracts and associated tooling to
support governance efforts.
Key Roles and Responsibilities
Establishes a framework for effective data
governance within the organization.
Facilitates collaboration among cross-
functional teams to define and implement data
governance policies and standards.
Purpose and Functionality

16. Federated Governance Implementation
Regularly review and update
governance policies and practices
based on feedback and evolving
organizational needs.
This iterative approach ensures that
the federated governance model
remains effective, balancing risk
management with the agility
required for data-driven decision-
making.
Monitor and Adjust
Governance Practices
Foster a culture of collaboration
between data consumers and
generators.
Establish feedback mechanisms to
ensure that data quality and
governance practices are continuously
improved.
This step is crucial for adapting to
changing data needs and maintaining
the relevance of governance policies.
Promote Collaboration
and Feedback Loops
Introduce data contracts that allow
data generators to manage metadata
related to their data.
These contracts should include
classifications, sensitivity levels, and
policies for data handling, such as
deletion or anonymization strategies.
Automating these processes through
tooling will streamline governance and
enhance compliance.
Implement Data
Contracts
Clearly outline the roles within the
governance framework, emphasizing
the autonomy of data generators.
Each data generator should be
empowered to manage their data,
supported by self-service tools and
guidelines provided by the council.
This decentralization helps avoid
bottlenecks and promotes faster
decision-making.
Define Roles and
Responsibilities
Form a cross-functional team that
includes representatives from
various business areas, such as
data product managers, legal
experts, and data infrastructure
leads.
This council will define the policies
and standards necessary for
effective data governance, ensuring
that all stakeholders are aligned on
objectives and responsibilities.
Establish a Data
Governance Council

17. Evolution of Data Storage Technologies
•The data lakehouse architecture combines the
best features of data lakes and data
warehouses, providing a unified platform for
both structured and unstructured data.
•This evolution supports advanced analytics
and machine learning, allowing organizations
to derive insights from diverse data types while
maintaining data quality and governance.
Rise of Data Lakehouses (2020s)
•Cloud storage technologies have
transformed data management by offering
scalable, cost-effective solutions for data
storage and access.
•Services like Amazon S3 and Google
Cloud Storage enable organizations to
store and retrieve data from anywhere,
promoting collaboration and enhancing
data accessibility.
Advent of Cloud Storage Solutions
(2010s-Present)
•The emergence of data lakes revolutionized
data storage by enabling the storage of vast
amounts of unstructured and semi-structured
data.
•This technology allowed organizations to
retain raw data in its native format, providing
flexibility for future analysis and reducing the
need for upfront data modeling.
Introduction of Data Lakes (2010s)
•The inception of data warehousing
marked a significant shift in data
management, allowing organizations to
consolidate data from various sources
into a central repository.
•This era focused on structured data
storage, primarily using relational
databases, which facilitated complex
queries and reporting.
Early Data Warehousing (1980s-1990s)

18. Amazon Redshift
Modern Data Stack Integration: Launched in
2012, Redshift has played a pivotal role in the
evolution of the modern data stack, addressing
bottlenecks in data access and usage.
Its integration with various data tools enhances
data accessibility and supports a data-driven
culture within organizations.
Impact on Data Management
Scalability: Redshift allows users to start with a
small amount of data and scale up to petabytes
as needed, accommodating growing data
needs without significant upfront investment.
Performance: Utilizing columnar storage and
parallel processing, Redshift delivers fast query
performance, enabling users to run complex
queries on large datasets efficiently.
Key Features and Benefits
Introduction to Amazon Redshift
Amazon Redshift is a fully managed, petabyte-
scale data warehouse service in the cloud.
It enables organizations to analyze large
volumes of data quickly and cost-effectively,
making it a cornerstone of modern data
architecture.

19. SQL-Compatible Warehouses
Makes data accessible through SQL, reducing reliance on
specialized data engineering teams.
Democratizes data access, fostering a culture of data-driven
decision-making.
Enables more stakeholders to leverage data for strategic
initiatives.
Impact on Data Accessibility
Seamlessly integrate with various reporting and business
intelligence tools.
Empowers users to create dashboards and visualizations.
Enhances data-driven decision-making across the organization.
Integration with Reporting Tools
Offer scalability, allowing organizations to handle large volumes of
data efficiently.
Provide robust performance for complex queries.
Ensure users can retrieve insights quickly and effectively.
Key Features
SQL-compatible warehouses are data storage solutions designed to
support SQL queries.
Enables users to interact with data using familiar SQL syntax.
Facilitates easier access and manipulation of data for a wide range
of users, from data engineers to business analysts.
Definition and Purpose

20. Modern Data Stack Components
Purpose: Enable the cleaning, structuring, and
enriching of data to make it suitable for
analysis.
Key Approaches: Techniques such as ELT
(Extract, Load, Transform) and ETL (Extract,
Transform, Load) are employed to ensure data
quality and usability, supporting various
analytical use cases.
Data Transformation Frameworks
Purpose: Provide scalable and efficient storage
for large volumes of data.
Types: Options include Data Lakes for raw data
storage and Data Warehouses for structured
data, allowing organizations to choose based
on their data processing needs.
Data Storage Solutions
Data Ingestion Tools
Purpose: Facilitate the extraction of data from
various sources into a centralized data
repository.
Examples: Tools like Apache Kafka and
Fivetran enable real-time data streaming and
batch processing, ensuring timely access to
data for analysis.

21. Data Lakehouse Concept
Business Value and Competitive
Advantage
Enhanced Decision-Making: By providing timely
and accurate data, organizations can make
informed decisions that drive business outcomes.
Increased Profitability: Companies leveraging data
lakehouses can gain a competitive edge, as
evidenced by top-performing retailers who utilize
data effectively to enhance customer experiences
and profitability.
Integration with Data Products
Data Contracts: Establish clear expectations and
governance for data usage, ensuring that data
products are reliable and meet business
requirements.
Single Source of Truth: Data lakehouses serve as
a central repository, minimizing data duplication
and enhancing data integrity across the
organization.
Data Management Efficiency
Reduced Bottlenecks: By decentralizing data
production, organizations can alleviate
bottlenecks associated with centralized data
engineering teams.
Support for Diverse Data Types: Accommodates
various data formats, enabling organizations to
manage both traditional datasets and modern
data streams effectively.
Key Features
Querying Capabilities: Users familiar with SQL
can query data directly, enhancing accessibility
for a broader range of users.
Data Accessibility: Facilitates the
democratization of data, allowing various
stakeholders to access and utilize data without
relying solely on central data teams.
Data Lakehouse Concept
A data lakehouse combines the features of
data lakes and data warehouses, providing a
unified platform for storing and managing
structured and unstructured data.
It enables organizations to leverage the
scalability of data lakes while maintaining the
performance and reliability of data
warehouses.

22. Operational Data Store (ODS)
Improved Decision-Making: By providing timely
and accurate data, an ODS enhances the ability
of organizations to make informed decisions
quickly.
Reduced Data Redundancy: Centralizing
operational data minimizes duplication and
inconsistencies, leading to more efficient data
management.
Enhanced Data Quality: ODS facilitates data
quality checks and governance, ensuring that
the data used for operational purposes meets
organizational standards.
Benefits of Implementing an ODS
Real-Time Data Access: ODS provides near
real-time access to current data, enabling
timely insights and decision-making.
Data Integration: It integrates data from
multiple operational systems, ensuring
consistency and accuracy across the
organization.
Support for Business Operations: ODS is
designed to support day-to-day business
operations, providing a reliable source of
information for operational reporting and
analytics.
Key Features
An Operational Data Store (ODS) serves as a
centralized repository for operational data,
allowing organizations to consolidate data
from various sources for real-time reporting
and analysis.
It acts as an intermediary between
transactional systems and data warehouses,
ensuring that data is readily available for
operational decision-making.
Definition and Purpose

23. Cost and Accessibility of Data Lakehouses
The ability to access and analyze data efficiently
fosters a data-driven culture within organizations.
This cultural shift encourages teams to make
decisions based on data insights rather than
intuition.
As organizations invest in data lakehouses, they
position themselves to gain a competitive
advantage by leveraging data effectively across all
levels of the business.
Support for Data-Driven Culture
Data lakehouses facilitate the use of previously
inaccessible 'dark data,' unlocking potential insights
that can drive business value. This increased
accessibility can lead to innovative applications and
services.
By making data available for reporting tools,
organizations can enhance their analytics
capabilities, leading to more informed business
strategies.
Improved Data Utilization
Modern data lakehouses allow users familiar with
SQL to query data directly, promoting broader
access across the organization. This
democratization of data enables non-technical
users to leverage data for decision-making.
The architecture supports self-service capabilities,
allowing various stakeholders to access and utilize
data without relying solely on central data
engineering teams.
Enhanced Accessibility
Cost Efficiency
Data lakehouses combine the benefits of data lakes
and data warehouses, reducing the need for
separate systems. This integration can lead to
significant cost savings in data storage and
management.
By utilizing a single architecture, organizations can
minimize expenses related to data duplication and
maintenance, ultimately lowering the total cost of
ownership.

24. Data Accessibility in Organizations
The effectiveness of data accessibility is often
hindered by the quality of data and the
prevailing data culture.
Organizations need to focus on improving data
ingestion processes and fostering a culture
that values data quality to unlock the full
potential of their data assets.
Addressing Data Quality and Cultural
Limitations
Modern data lakehouses allow users familiar
with SQL to query data directly, enhancing
accessibility.
This capability supports the integration of data
into various reporting tools, making it easier for
less technical users to extract insights.
Utilization of Modern Data
Lakehouses
Empowering Users Across the
Organization
Organizations must move away from reliance
on a central data engineering team to ensure
broader access to data.
This shift enables various users to leverage
data for decision-making, fostering a more
data-driven culture.

25. Data Culture and Quality
Fostering a Data-Driven Culture: Emphasizing the importance of a data product mindset to enhance data accessibility and
drive business outcomes. Organizations should create an environment where data is viewed as a valuable asset, encouraging
collaboration between data generators and consumers to maximize its utility.
Empowering Data Generators: Data generators must take ownership of their data products, ensuring they meet the needs of
data consumers. This includes understanding the implications of their data changes and being accountable for data quality.
By shifting responsibilities to those closest to the data, organizations can improve reliability and reduce costs associated
with data incidents.
Implementing Data Contracts: Establishing clear data contracts between data generators and consumers is crucial for
setting expectations and ensuring data governance. These contracts facilitate communication, define responsibilities, and
promote a shared understanding of data quality standards, ultimately leading to more reliable data products.
Enhancing Data Quality through Collaboration: Collaboration between data teams and business units is essential for
improving data quality. By integrating feedback loops and encouraging open communication, organizations can address data
quality issues proactively, ensuring that data products are not only accurate but also aligned with business objectives.
Continuous Improvement and Governance: A mature data governance process is necessary to maintain high data quality
standards. This involves regularly reviewing data management practices, implementing automated tools for compliance, and
fostering a culture of accountability among all stakeholders involved in data generation and consumption.

26. Competitive Advantage through Data
As technology and data become increasingly
integral to business success, organizations
across various industries are investing heavily
in data science.
This investment is not limited to tech
companies; it spans multiple sectors,
emphasizing the universal importance of data
in achieving competitive advantages.
Investment in Data Science
In the consumer sector, top-performing
retailers utilize data to refine customer
interactions.
A McKinsey report indicates that the 25 leading
retailers are 83% more profitable and have
captured over 90% of market gains,
showcasing how data-driven strategies can
enhance customer satisfaction and loyalty.
Enhanced Customer Experiences
Data-Driven Decision Making
Organizations that effectively utilize data can
gain a significant competitive edge.
By leveraging data to understand past and
present situations, businesses can make
informed predictions about future trends,
leading to faster and more strategic decision-
making.

27. Case Study: Consumer Sector
Challenges in Data Accessibility
Despite the potential, many organizations
struggle with data accessibility, limiting the use
of valuable insights.
Moving away from reliance on centralized data
teams to a more decentralized approach can
unlock the full potential of data across the
organization.
Investment in Data Science
The increasing importance of data has led
organizations across various industries to
invest heavily in data science initiatives.
This investment is not limited to tech
companies; traditional sectors are also
recognizing the value of data in driving
business success.
Enhancing Customer Experience
Data is crucial at every customer touchpoint,
enabling personalized interactions and
improved service delivery.
Retailers that harness data analytics can
respond swiftly to market trends and customer
preferences, driving loyalty and sales.
Impact of Data Utilization
A McKinsey report indicates that the 25 leading
retailers are 83% more profitable than their
competitors.
These retailers have captured over 90% of
market gains, showcasing the financial
benefits of a robust data strategy.
Data-Driven Competitive Advantage
Organizations leveraging data effectively can
gain a significant edge in the market.
The consumer sector exemplifies this, with top-
performing retailers utilizing data to enhance
customer experiences and operational
efficiency.

28. Definition of Data Contracts
Data contracts facilitate compliance with
governance policies by categorizing data and
defining retention and deletion policies.
They also support the implementation of
service level objectives (SLOs) that ensure data
quality metrics, such as completeness and
timeliness, are met consistently.
Governance and Compliance
Ownership: Each data contract must have a
designated owner, typically the data generator, who
is responsible for the data's accuracy and integrity.
Schema Definition: Contracts should include
detailed schemas that specify fields, data types,
and any relevant metadata, such as version
numbers and data classification (e.g., personal
data).
Key Elements of a Data Contract
Purpose and Importance
Data contracts serve as formal agreements
between data generators and consumers,
outlining expectations, responsibilities, and the
schema of the data being shared.
They are essential for ensuring data quality,
governance, and effective communication,
ultimately leading to better business
outcomes.

29. Schema Components
The schema registry acts as a central repository for schemas,
ensuring they are accessible and serve as the source of truth for both
data generators and consumers.
This registry facilitates version control, allowing applications to
reference the same schema version, which is crucial for maintaining
data integrity across different systems.
Schema Registry
A schema serves as a blueprint for data structure, defining the fields
and their types.
For example, a schema may include elements such as email patterns
and language preferences, ensuring data consistency and validation.
Schema Definition
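A hedged sketch of such a schema definition is shown below, with the email pattern and language preference mentioned above; the regular expression, allowed values, and the version number referenced from the schema registry are illustrative assumptions rather than the course's exact example.

```yaml
# Illustrative schema fragment; pattern, values, and version are assumptions.
schema:
  name: customer_profile
  version: 2                  # version registered in the central schema registry
  fields:
    - name: email
      type: string
      pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"   # simple email pattern
    - name: language_preference
      type: string
      allowed_values: [en, de, fr]
```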

30. Flexibility vs. Interoperability
•Interoperability emphasizes the capability of different data
systems and products to work together seamlessly.
•It involves ensuring that data can be shared and utilized
across various platforms and applications without
compatibility issues.
•Effective data contracts play a vital role in establishing clear
interfaces and expectations, enabling data consumers and
generators to collaborate efficiently.
•This interconnectedness is essential for creating a cohesive
data ecosystem, where data products serve as reliable
sources of truth, minimizing duplication and enhancing
overall data quality.
Interoperability
•Flexibility in data management refers to the ability of data
systems to adapt to changing requirements and
environments.
•This includes the capacity to modify data structures,
schemas, and processes without significant downtime or
resource allocation.
•For instance, data generators need to autonomously change
database schemas to deliver product features efficiently,
allowing for regular updates and enhancements.
•This adaptability is crucial for organizations aiming to
respond quickly to market demands and internal needs,
fostering innovation and responsiveness.
Flexibility

31. Data Quality Checks
Shift accountability for data quality from data
engineering teams to data generators who
have the most insight into data structure and
generation.
Encourage data generators to understand their
role in maintaining data quality, leading to
reduced incidents and improved data reliability.
Proactive Responsibility
Assignment
Data Validations: Implement checks for unique
values, regular expression matching, and
referential integrity to ensure data meets
specified standards.
Source-Level Checks: Conduct data quality
checks at the source to catch issues before
they propagate downstream, reinforcing
accountability among data generators.
Types of Data Quality Checks
Ensures reliability and accuracy of data for
decision-making.
High-quality data is essential for effective data
products and business outcomes.
Importance of Data Quality
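A minimal sketch of how the source-level checks named above (unique values, regular expression matching, referential integrity) might be declared alongside a data contract; the check names and the specific fields are hypothetical and only illustrate the idea.

```yaml
# Hypothetical quality checks attached to a contract; fields are illustrative.
quality_checks:
  - field: order_id
    check: unique                    # no duplicate order identifiers
  - field: customer_email
    check: matches_regex
    pattern: "^[^@\\s]+@[^@\\s]+$"
  - field: product_id
    check: referential_integrity     # must exist in the products data product
    references: products.product_id
```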

32. Utility of Data Contracts
Shift organizational mindset towards valuing data quality and
reliability as integral to business success.
Encourage investment in data products and governance, ultimately
leading to better business outcomes and decision-making.
Driving Cultural Change
Simplify data processing by reducing complexity in data pipelines,
making them quicker and more cost-effective.
Enable data consumers to meet their SLOs confidently, enhancing
trust in the data provided.
Streamlining Data Pipelines
Promote communication between data generators and consumers
to refine requirements and adjust performance expectations.
Foster a sense of ownership among data generators, leading to
improved data stewardship and accountability.
Facilitating Collaboration
Establish clear expectations and responsibilities between data
generators and consumers.
Define Service Level Objectives (SLOs) for data quality, including
completeness, timeliness, and availability, ensuring reliable data
delivery.
Enhancing Data Quality

33. Introduction to Data Mesh
Formal agreements that define expectations
around data quality, access, and performance.
Promote collaboration between data
generators and consumers, ensuring clarity
and reducing misunderstandings.
Importance of Data Contracts
Data as a Product: Each data domain is
responsible for delivering high-quality data
products that meet consumer needs.
Federated Computational Governance:
Establishes a framework for governance that
balances autonomy with accountability across
data domains.
Key Components
Concept of Data Mesh
A decentralized approach to data management
that treats data as a product.
Emphasizes domain ownership, allowing
teams to take responsibility for their data.

34. Principles of Data Mesh
Self-Serve Data Infrastructure
Create a self-serve data platform that enables
teams to access and manage their data
independently.
Reduce bottlenecks and enhance agility.
Allow teams to respond quickly to changing
business needs.
Data Contracts
Establish clear data contracts that define
expectations, responsibilities, and service level
objectives (SLOs) between data producers and
consumers.
Promote transparency and accountability in
data management.
Federated Computational Governance
Implement a governance model that balances
centralized oversight with decentralized
execution.
Allow for flexibility and innovation while
maintaining necessary standards and
compliance.
Domain Ownership
Empower domain teams to take ownership of
their data.
Each team is responsible for the quality and
governance of the data they produce.
Foster accountability and a sense of
ownership.
Data as a Product
Treat data as a product that provides value to
its consumers.
Understand the needs of data users.
Ensure data products are designed to meet
those needs effectively.

35. Role of Data Contracts in Data Mesh
Data contracts empower data generators to
take ownership of their datasets, allowing them
to self-serve and provision data interfaces
without bottlenecks from central teams.
This autonomy supports a more agile data
management approach, enabling quicker
adaptations to changing business
requirements and data evolution.
Facilitating Decentralization and Autonomy
By fostering communication between data
generators and consumers, data contracts
promote a collaborative environment that
enhances trust in data quality and reliability.
This partnership is crucial for aligning data
products with business needs, ultimately
driving better decision-making and outcomes.
Enhancing Collaboration and
Trust
Data contracts serve as formal agreements
between data generators and consumers,
defining the schema, data quality, and service
level objectives (SLOs).
They clarify responsibilities, ensuring that both
parties understand their roles in data
management and governance.
Establishing Clear Expectations

36. Adoption of Data Contracts
After implementing the data contracts, continuously
measure the progress of adoption using metrics such as
adoption rates, data incidents, and ETL costs.
Regularly communicate these metrics to stakeholders to
highlight the impact of data contracts on the organization.
This iterative approach allows for adjustments based on
feedback and ensures that the data contracts evolve to
meet changing business needs.
Iterate and Measure Progress
Engage data generators and consumers in collaborative
discussions to refine data needs and requirements.
This step is crucial for building a sense of ownership and
accountability among teams.
By working together, they can define clear expectations
and responsibilities, which will facilitate smoother
implementation and enhance the overall effectiveness of
the data contracts.
Foster Collaboration Between
Teams
Choose a specific use case that aligns with the identified
objectives to serve as a proof of concept (POC).
This use case should involve both data generators and
consumers, ensuring that the necessary resources and
personnel are available to support the initiative.
The POC will demonstrate the value of data contracts
and lay the groundwork for broader adoption.
Select a Relevant Use Case
Begin by clearly defining the primary goals for
implementing data contracts within the organization.
This could include enhancing data pipeline reliability,
improving user trust in data, or making data more
accessible for critical applications like machine
learning.
Establishing these objectives will guide the subsequent
steps and ensure alignment with business needs.
Identify Key Objectives

37. Roles in Data Management
Role: Act as a bridge between data generators
and consumers, ensuring that the requirements
of data consumers are understood and met.
Importance: They facilitate collaboration,
manage data contracts, and help define the
expectations around data products, enhancing
overall data governance and quality.
Data Product Managers
Definition: Users who access and utilize data
for analysis, reporting, and decision-making.
Expectations: They require clarity on data
structure, dependability, and performance
metrics to effectively leverage data for
business processes.
Data Consumers
Data Generators
Definition: Individuals or teams responsible for
creating and supplying data within an
organization.
Responsibilities: They must understand the
needs of data consumers, manage data quality,
and ensure compliance with organizational
policies regarding data categorization and
retention.

38. Cultural Change in Data Management
Empowering Data Generators: Shift responsibility for data quality from centralized data teams to data generators
who have the most insight into the data's structure and generation process. This proactive approach enhances
accountability and ensures that those closest to the data are responsible for its accuracy and reliability.
Fostering Collaboration: Establish clear communication channels between data consumers and generators to
facilitate knowledge transfer. This includes sharing domain models, change histories, and metadata, which are
essential for effective data utilization and decision-making.
Implementing Data Contracts: Utilize data contracts as tools to define expectations and responsibilities within the
data ecosystem. These contracts promote a culture of ownership and accountability, ensuring that data products
meet business requirements and governance standards.
Promoting a Data-Driven Mindset: Encourage a cultural shift towards viewing data as a product rather than a
byproduct. This involves adopting a product mindset across teams, emphasizing the importance of data quality
and its role in driving business outcomes.
Continuous Improvement and Feedback Loops: Create mechanisms for ongoing feedback between data
consumers and generators. This iterative process helps identify issues early, fosters a culture of continuous
improvement, and enhances the overall quality and utility of data products.

39. Engagement Strategy for Data Generators
Develop clear guidelines and self-service tools
that assist data generators in adhering to data
management standards without requiring deep
expertise.
Ensure that these tools are integrated into their
daily workflows, promoting efficiency and
reducing bottlenecks in data generation
processes.
Guidelines and Support Tools
Facilitate regular interactions between data
generators and data consumers to enhance
understanding of data needs and expectations.
Implement feedback loops where data
consumers can provide insights on data utility,
helping data generators refine their outputs
and improve quality.
Collaboration and Communication
Empowerment and Ownership
Encourage data generators to take ownership
of their data contracts, ensuring they
understand the importance of their role in data
quality and reliability.
Provide training and resources to help them
articulate the value of the data they generate,
fostering a sense of responsibility and
accountability.

40. Communicating Benefits of Data Contracts
Enhanced Data Quality and Trust: Data contracts establish clear expectations and responsibilities between data
generators and consumers, leading to improved data quality. By defining service level objectives (SLOs) such as
completeness, timeliness, and availability, organizations can foster trust in the data being utilized for decision-making.
Streamlined Data Management Processes: Implementing data contracts simplifies data pipelines, making them more
efficient and cost-effective. This reduction in complexity allows data generators to focus on producing high-quality data
while ensuring that consumers can easily access and utilize the data they need.
Facilitated Collaboration and Ownership: Data contracts promote collaboration between data generators and consumers,
creating a partnership that enhances accountability. By involving both parties in the design and implementation of data
contracts, organizations can ensure that the data products meet the actual needs of the business.
Support for Data-Driven Decision Making: By clearly articulating the value of data and the positive outcomes it generates,
data contracts empower organizations to leverage data for strategic decision-making. This alignment with business goals
encourages investment in data quality initiatives, ultimately driving better business outcomes.
Cultural Shift Towards Data Investment: The introduction of data contracts signifies a cultural shift within organizations,
emphasizing the importance of data quality and reliability. This shift encourages teams to prioritize data management
practices and fosters a data-centric mindset across the organization, leading to sustained improvements in data
governance and usage.

41. Creating Usable Data Products
As the business evolves, continuously iterate on the data product
based on feedback from consumers and changes in requirements.
Versioning the data contract allows for the introduction of new
features or adjustments while maintaining stability in core data
models.
This ensures that consumers can rely on the data product for
consistent performance and dependability, ultimately enhancing
their confidence in data-driven decision-making.
Iterate and Version the Data Product
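As a hypothetical example of such versioning, a non-breaking revision might be published by adding an optional field and bumping the contract's minor version, roughly as follows; the names and the versioning scheme are assumptions, not a prescribed convention.

```yaml
# Hypothetical non-breaking revision of a contract; names are illustrative.
data_contract:
  name: orders
  version: 1.1.0                  # was 1.0.0; minor bump for a backwards-compatible change
  schema:
    fields:
      - name: created_at
        type: timestamp
        required: true
      - name: delivery_notes      # new optional field; existing consumers are unaffected
        type: string
        required: false
```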
Once consumer requirements are understood, create a data
contract that outlines the schema, expectations, and service level
objectives (SLOs) for the data product.
This contract serves as a formal agreement that documents
essential elements such as the owner, description, and versioning
of the data product.
It establishes a stable interface for data access, promoting
discoverability and usability.
Define a Data Contract
Begin by engaging with data consumers to gather insights about
their specific needs and expectations.
This step is crucial for defining the schema of the data product,
ensuring it aligns with the business objectives and provides the
necessary value.
Clear communication helps data generators appreciate the
business context and fosters a sense of ownership over the data
outcomes.
Understand Consumer
Requirements
Creating Usable Data Products
Discount Strategy Evolution
Initial Approach: Discounts were initially applied directly within the products table, tying each
discount to a single product. This approach restricted the ability to manage discounts
across multiple items effectively.
Transition to Discounts Table: A new discounts table was introduced to enhance flexibility,
allowing discounts to be applied across various products. However, challenges arose due
to incomplete data backfilling and retention of the old discount column, complicating data
management.
Data Management Challenges
Data Quality Issues: The transition led to poor data quality, making it difficult for users to
trust the discount data. Users faced confusion due to discrepancies between the old and
new discount systems.
E-commerce Example: Discounts
Data consumers are often not notified of
upstream logic changes, resulting in confusion
and a loss of trust in the data.
This disconnect can lead to discrepancies in
data availability and reliability, ultimately
affecting business performance and decision-
making.
Lack of Communication on
Changes
Despite the critical role of discount data in
driving sales strategies and marketing efforts,
the quality of this data remains poor.
This inadequacy complicates its usage and
undermines the effectiveness of data-driven
decision-making processes.
Data Quality Concerns
Complexity of Data Pipelines
Data consumers face challenges with intricate
SQL queries, often exceeding 800 lines, which
incorporate extensive business logic.
This complexity leads to increased
maintenance costs and difficulties in execution,
hindering efficient data management.
Data Management Issues in E-
commerce
The database schema was intentionally simplified by excluding
other tables, such as the customers table.
This approach allowed for a concentrated focus on the relationship
between discounts and products, facilitating clearer data
management and analysis.
Focus on Core Elements
The transition faced challenges, including the lack of backfilling
data into the new discounts table.
The retention of the old discount column in the products table
created confusion and potential data integrity issues.
Data Management Issues
A new discounts table was created to allow discounts to be
applied across multiple products.
This change enhanced the ability to respond to market dynamics
and stock levels effectively.
Transition to a Separate Discounts Table
The original database schema included a discount field directly in
the products table.
This design limited discounts to specific products, reducing
flexibility in discount application.
Initial Implementation Challenges
Simplification of Database Schema
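To make the schema change above concrete, here is a minimal sketch of the two designs, using sqlite3 purely for illustration; the column and table definitions are assumptions, not the course's actual e-commerce schema.

```python
# Illustrative only: hypothetical products/discounts schemas for the example above.
import sqlite3

conn = sqlite3.connect(":memory:")

# Initial implementation: the discount lives directly on the products table,
# so a discount can only ever target a single product.
conn.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        price REAL NOT NULL,
        discount REAL  -- legacy column retained during the transition
    )
""")

# Transition: a separate discounts table lets one discount apply to many
# products and carry its own validity window.
conn.execute("""
    CREATE TABLE discounts (
        id INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES products(id),
        percentage REAL NOT NULL,
        valid_from TEXT,
        valid_to TEXT
    )
""")
conn.commit()
```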
Communication: Effective collaboration
between data consumers and generators is
essential for understanding requirements and
expectations. This fosters a sense of
ownership among data generators, motivating
them to provide high-quality data.
Feedback Mechanisms: Regular feedback from
consumers reinforces the importance of data
quality and encourages continuous
improvement in data generation practices,
ultimately leading to better business
outcomes.
Collaboration for Enhanced Data Quality
Data Consumers: Expected to articulate their
data needs clearly, provide feedback on data
quality, and demonstrate the value derived
from data usage. They play a crucial role in
shaping data contracts by defining
requirements and expectations.
Data Generators: Hold the responsibility for
data quality and reliability, managing data in
accordance with organizational policies. They
must understand the implications of their data
generation processes and maintain ongoing
support for the data they produce.
Responsibilities and
Accountabilities
Data Consumers: Individuals or teams that
utilize data for analysis, decision-making, and
driving business processes. This includes roles
such as data analysts, business users, and
product engineering teams who rely on data to
enhance services and create value.
Data Generators: Individuals or services
responsible for creating data. This includes
software engineers who generate data through
system actions, data/BI/analytics engineers
who build data products, and third-party
services that provide data via APIs.
Definition of Roles
Roles of Data Consumers and Generators
Data engineers play a crucial role in bridging the gap between data
generators and consumers. They must communicate effectively with
both parties to understand data requirements and expectations.
By fostering a collaborative environment, data engineers can help
establish clear data contracts that define responsibilities and ensure
that data generators are aware of the implications of their data
outputs on downstream processes.
Facilitating Collaboration Between Data Generators and
Consumers
Data engineers are tasked with maintaining the integrity and accuracy
of data throughout its lifecycle. This includes implementing data
quality checks and monitoring data pipelines to identify and rectify
issues promptly.
They must also respond to upstream changes that may affect data
quality, ensuring that any modifications do not compromise the
reliability of the data consumed downstream.
Ensuring Data Quality and Reliability
Responsibilities of Data Engineers
Clarity on data timeliness, correctness,
completeness, and availability is vital for
analysts to build confidence in their data-driven
decisions.
Understanding data ownership and support
levels is crucial, as it informs analysts about
who to approach for assistance and how to
manage data-related issues effectively.
Dependability and Performance
Expectations
To effectively utilize data, analysts must
comprehend its structure, including available
fields and context.
Documentation is essential for defining
semantics, compliance, and governance
aspects, ensuring users understand data
confidentiality and processing permissions.
Understanding Data Structure
Roles and Responsibilities
Data analysts and business users play a crucial
role in leveraging data for decision-making and
business processes.
They query curated data products through
Business Intelligence (BI) tools or
spreadsheets, requiring data to be accessible
in their preferred formats.
Data Analysts and Business Users
Effective data products should be modeled around business
entities or domains rather than internal data structures.
This approach enhances discoverability and usability, allowing data
consumers to leverage data effectively for decision-making and
operational processes.
Modeling Data Products
Data contracts define the schema and expectations for data
products, ensuring clarity in data structure.
They facilitate communication between data generators and
consumers, promoting a shared understanding of data usage and
governance.
Importance of Data Contracts
Primitive Structures: Basic data types such as integers, floats, and
characters that serve as the building blocks for more complex
structures.
Composite Structures: More complex arrangements like arrays,
lists, and trees that allow for the organization of data in a way that
reflects relationships and hierarchies.
Types of Data Structures
A data structure is a systematic way of organizing and storing data
to enable efficient access and modification.
It serves as the foundation for data management, allowing for the
effective handling of data products.
Definition of Data Structure
Understanding Data Structure
Implementing data quality checks at the source
is essential for maintaining high standards of
dependability.
By shifting responsibility for data quality to
data generators, organizations can enhance
accountability and reduce the risk of
downstream data issues, ultimately leading to
more reliable data products.
Proactive Quality Management
To assess data performance, organizations
should focus on three primary metrics:
Completeness: Ensures that all required data is
present and accounted for, minimizing gaps
that could lead to erroneous conclusions.
Timeliness: Measures how quickly data is
available for use, impacting the relevance of
insights derived from it.
Availability: Assesses the accessibility of data
when needed, ensuring that users can rely on it
for their operational and analytical needs.
Key Performance Metrics
Data dependability refers to the reliability and
accuracy of data over time, which is crucial for
effective decision-making.
It encompasses the consistency of data quality
and its ability to meet the expectations set by
data consumers.
Understanding Data Dependability
Data Dependability and Performance
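As an illustration of the metrics above, the following is a minimal, hypothetical sketch of how completeness and timeliness could be measured for a small batch of records; the field names, sample data, and 60-minute window are assumptions. Availability is usually measured separately, from uptime or query monitoring, rather than per record.

```python
from datetime import datetime, timedelta

records = [
    {"id": 1, "email": "a@example.com", "landed_at": datetime(2024, 1, 1, 10, 5)},
    {"id": 2, "email": None,            "landed_at": datetime(2024, 1, 1, 12, 30)},
]
generated_at = datetime(2024, 1, 1, 10, 0)
slo_window = timedelta(minutes=60)

# Completeness: share of records with all required fields populated.
required = ("id", "email")
complete = sum(all(r.get(f) is not None for f in required) for r in records)
completeness = complete / len(records)

# Timeliness: share of records available within the agreed window.
on_time = sum(r["landed_at"] - generated_at <= slo_window for r in records)
timeliness = on_time / len(records)

print(f"completeness={completeness:.0%}, timeliness={timeliness:.0%}")
```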
Cultural Shift Towards Data Ownership
The introduction of data contracts promotes a
cultural change within organizations, encouraging
data generators to take ownership of their data
products.
This shift leads to a greater investment in data
quality and a more data-driven organizational
mindset, ultimately driving better business
outcomes.
Streamlined Data Pipelines
Implementing data contracts simplifies data
pipelines, reducing complexity and operational
costs.
This efficiency enables faster data delivery,
allowing organizations to respond quickly to
business needs and market changes.
Service Level Objectives (SLOs)
Defining SLOs within data contracts sets
measurable targets for data quality.
Targets such as 100% completeness and 95%
availability give organizations concrete
benchmarks for tracking performance and making
informed adjustments to data processes.
Collaboration Between Teams
Data contracts foster a partnership between
data generators and consumers, promoting
open communication and understanding of
data needs.
This collaboration helps align data products
with business objectives, ensuring that the
data delivered is relevant and valuable.
Enhanced Data Quality
Establishing clear expectations through data
contracts ensures that data generators are
accountable for the quality of the data they
produce.
This leads to improved completeness,
timeliness, and availability of data, which are
critical for reliable decision-making.
Value Delivery through Data Contracts
A comprehensive migration plan must be developed to
facilitate the transition from the old schema to the new one.
This plan should outline the steps for consumers to follow,
including timelines and support resources.
It may involve running both the old and new versions
concurrently for a specified period, allowing consumers to
adapt without immediate disruption to their operations.
Establish a Migration Plan
Once breaking changes are identified, it is essential to
communicate these changes to all affected data consumers.
This communication should occur well in advance, allowing
consumers to provide feedback and prepare for necessary
adjustments.
Effective communication helps mitigate the risk of unplanned
downtime and ensures that consumers understand the
implications of the changes.
Communicate with Data
Consumers
Breaking changes are modifications that negatively impact
existing data consumers, requiring them to adjust their services
or analytics.
Examples include removing required fields or altering data types,
which can render previously valid data incompatible with older
schemas.
Recognizing these changes is crucial for effective data
management.
Identify Breaking Changes
Breaking Changes in Data Management
Data generators can implement non-breaking changes
with minimal friction, promoting agility in data
management.
By allowing these changes to occur without significant
barriers, organizations can enhance their data
products and services while ensuring that consumers
are not adversely affected, thus fostering a more
collaborative data environment.
Facilitating Low-Friction Updates
The implementation of non-breaking changes has a low
impact on existing data consumers.
They can continue to operate as usual, as these changes
do not require immediate adjustments to their systems.
This flexibility allows consumers to adapt at their own
pace, ensuring a smoother transition to newer data
structures when they are ready.
Impact on Data Consumers
Common examples include adding optional fields to a
schema or removing non-required fields that have
default values.
For instance, introducing a new address field in a
customer schema allows existing consumers to
ignore this field until they choose to upgrade, thereby
maintaining their current operations without
interruption.
Examples of Non-Breaking
Changes
Non-breaking changes refer to modifications made to a
data schema that do not disrupt existing data consumers.
These changes allow data generated against a new
schema version to be read by services using previous
versions without any data loss or impact.
This ensures that current applications and analytics
remain functional and reliable.
Definition of Non-Breaking
Changes
Non-Breaking Changes in Data Management
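A minimal sketch of the additive change described above: version 2 of a hypothetical customer schema adds an optional address field, and a consumer that only knows version 1 keeps working. The schema representation and field names are assumptions for illustration.

```python
schema_v1 = {"required": ["id", "email"], "optional": []}
schema_v2 = {"required": ["id", "email"], "optional": ["address"]}  # additive only

def validate(record: dict, schema: dict) -> bool:
    """A record is valid if every required field is present."""
    return all(field in record for field in schema["required"])

# Data produced against the new schema version...
new_record = {"id": "c-42", "email": "jane@example.com", "address": "12 Main St"}

# ...is still valid for a consumer that only knows schema v1; the extra
# optional field is simply ignored until the consumer chooses to upgrade.
assert validate(new_record, schema_v1)
assert validate(new_record, schema_v2)
```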
Articulate the benefits of data contracts to data
generators by demonstrating how these contracts
enhance data quality and usability.
Align the value of the data with company-wide goals
to incentivize data generators, fostering a sense of
ownership and responsibility for the data they
produce.
Communicate Value to Data
Generators
Form a working group comprising data consumers,
such as data/analytics engineers and data scientists,
to prioritize critical datasets for migration.
This collaborative effort will help identify core data
models essential to the business, ensuring that the
most valuable data is transitioned first and that
consumer needs are adequately addressed.
Engage Key Data Consumers
Begin by developing a structured migration plan that
balances the urgency of transitioning to data
contracts with the ongoing commitments of product
teams.
This plan should outline timelines, resource allocation,
and key milestones to ensure a smooth transition
while minimizing disruption to existing workflows.
Establish a Migration Plan
Migration Path for Data Contracts
Supporting Decentralized Data
Architecture
Data lineage tools empower data generators by
clarifying ownership and responsibilities within
specific business domains.
They provide essential insights for impact
analysis, compliance, and effective data
management, fostering a mature data culture.
Benefits for Data Engineering
Teams
Enable teams to optimize data pipelines by
identifying performance bottlenecks and
understanding data usage patterns.
Assist in troubleshooting issues within
complex data environments, ensuring efficient
data processing and reliability.
Integration with Data Catalogs
Many data lineage tools include cataloging
functionalities, allowing users to discover and
understand data governed by contracts.
This integration supports a unified approach to
data management, enhancing the overall data
governance framework.
Types of Data Lineage Tools
Paid Solutions: Offer comprehensive features
and support for enterprise-level data lineage
management.
Open Source Options: Provide flexibility and
customization for organizations looking to
implement data lineage without significant
financial investment.
Purpose of Data Lineage Tools
Facilitate the tracking of data relationships,
origins, transformations, and usage across
various applications.
Enhance transparency and trust in data by
providing clear visibility into data flows and
dependencies.
Data Lineage Tools
Define specific SLOs to set clear expectations
for data completeness, timeliness, and
availability.
For example, aim for 100% data completeness,
data availability within 60 minutes of
generation, and 95% accessibility for querying.
Establishing Service Level Objectives
(SLOs)
A structured migration plan is crucial to
transition consumers to new schema versions
without causing disruptions.
The complexity of the plan should consider the
size of the change, the criticality of the data,
and the number of consumers affected.
Developing a Migration Plan
Data generators may need to update schemas
to accommodate new consumer requirements
or enhance service features.
Clear communication with data consumers is
essential to ensure that the new data contract
aligns with their needs and expectations.
Understanding Schema Evolution
Expectations Management in Data Migration
A cross-functional data governance council is established to define
policies and standards that guide data management practices. This
council ensures that while data generators operate independently, they
are supported by a framework that balances risk management with
the need for agility in data usage.
The council provides essential documentation and guardrails, enabling
data generators to make informed decisions about data classification
and handling, thus promoting a federated governance model that
aligns with organizational goals.
Role of the Data Governance Council
Data generators are granted autonomy to manage their own data,
fostering a sense of ownership and accountability. This decentralized
approach alleviates bottlenecks often created by central teams,
allowing for quicker access and utilization of data.
By utilizing self-service tools and guidelines, data generators can
effectively populate and maintain metadata, including data
classification and sensitivity, which enhances the overall governance
process.
Empowering Data Generators
Decentralized Data Governance
Engagement: Provide feedback to data
generators to ensure data meets their needs
and expectations.
Collaboration: Work closely with data
generators to foster a data-driven culture and
enhance the overall quality of data
management.
Data Consumers
Ownership: Responsible for managing their
data, including classification, access
management, and compliance with
organizational policies.
Autonomy: Empowered to make local
decisions regarding data handling, supported
by self-service tools and guidelines.
Data Generators
Data Governance Council
Purpose: A cross-functional group that defines
data governance policies and standards.
Composition: Includes representatives from
various business areas, such as data product
managers, legal experts, and data
infrastructure teams.
Roles and Responsibilities in Data
Governance
Quality Assurance and
Compliance
Data contracts include provisions for data
quality checks and compliance with governance
policies.
By defining data quality metrics and retention
policies, organizations can ensure that
metadata remains accurate and relevant,
supporting effective data management
practices.
Machine-Readable Formats
Data contracts can be designed in machine-
readable formats, enabling seamless
integration with tools for privacy, data catalogs,
and governance.
This capability enhances the accessibility and
usability of metadata across the organization.
Decentralized Governance
By empowering data generators to manage
their metadata through data contracts,
organizations can implement a decentralized
governance model.
This reduces bottlenecks associated with
central teams and enhances agility in data
management.
Comprehensive Metadata Capture
Data contracts serve as a source of truth for
metadata, capturing essential elements such
as version numbers, data access methods,
primary keys, and data classification.
This comprehensive approach facilitates better
governance and understanding of data assets.
Ownership of Metadata
Data contracts establish clear ownership,
typically assigned to data generators who
possess the most context about the data.
This ownership ensures accountability and
responsibility for maintaining accurate
metadata.
Data Contracts for Metadata Management
Define the council's scope and objectives,
focusing on promoting a data-driven culture
and effective collaboration.
Limit membership to a maximum of 10
participants to enhance decision-making and
communication efficiency.
Setting Up the Council
Data Product Managers: Ensure the quality and
relevance of data products, addressing
concerns related to data delivery.
Legal, Privacy, and Security Experts: Provide
guidance on compliance with regulations and
clarify data handling requirements.
Data Platform Representatives: Implement
data contracts and manage the necessary
tooling for data governance.
Key Roles and Responsibilities
Establishes a framework for effective data
governance within the organization.
Facilitates collaboration among cross-
functional teams to define data policies and
standards.
Purpose and Functionality
Data Architecture Council
Implementation of Data Contracts
Data contracts play a crucial role in federated
governance by setting expectations around
data management.
Ensures that metadata is accurate and up-to-
date, and facilitates automation in data
handling.
Risk Management and Agility
This governance model acknowledges the
potential for human error while enabling faster
decision-making.
Avoids bottlenecks that can hinder data
availability and usage.
Autonomy for Data Generators
Data generators are empowered to make local
decisions regarding data management.
Supported by guidelines and self-service tools
from the governance council, fostering
ownership and accountability.
Role of the Data Governance Council
A central body responsible for defining policies,
standards, and processes for data
classification and access.
Ensures compliance and risk management
while supporting data generators.
Definition and Importance
Federated data governance is a model that
balances local decision-making with
centralized oversight.
Allows organizations to manage data
effectively while promoting agility and
innovation.
Federated Data Governance
Human-Readable Format: YAML is both
human-friendly and machine-readable, making
it accessible for data generators and
consumers alike.
Flexibility and Extensibility: The structure
allows for easy updates and modifications as
data requirements evolve, supporting version
control and schema management.
Integration Capabilities: YAML contracts can be
converted into other formats (e.g., Protocol
Buffers) for use in various data processing
systems, ensuring compatibility across
platforms.
Benefits of Using YAML for Data Contracts
Fields Definition: Each field in the contract
specifies attributes such as type, description,
and requirements.
Example: id: Type: string, Description: Unique
identifier for the customer, Required: true,
Primary Key: true
Personal Data Classification: Fields containing
sensitive information must be clearly marked
and include anonymization strategies.
Example: email: Type: string, Description:
Customer's email address, Required: true,
Anonymization Strategy: email
Key Components of a YAML Data Contract
A data contract serves as a formal agreement
between data generators and consumers,
defining the structure and semantics of the
data being shared.
It ensures clarity and consistency in data
usage, enhancing trust and reliability in data
management practices.
Data Contract Overview
Example of YAML-Based Data Contract
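A plausible rendering of the contract attributes listed above as a YAML document, embedded in a short Python snippet to show that the same file is machine-readable. The exact key names are assumptions rather than the course's template, and PyYAML is assumed to be installed.

```python
import yaml  # pip install pyyaml

contract_yaml = """
name: customer
version: 1
owner: customer-service-team
fields:
  id:
    type: string
    description: Unique identifier for the customer
    required: true
    primary_key: true
  email:
    type: string
    description: Customer's email address
    required: true
    personal_data: true
    anonymization_strategy: email
"""

# The same human-readable document can be loaded and used by tooling.
contract = yaml.safe_load(contract_yaml)
print(contract["fields"]["email"]["anonymization_strategy"])  # -> "email"
```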
Understanding the Schema Structure: The schema serves as a blueprint for data contracts, detailing the structure and
organization of data. It includes essential components such as field names, data types, and documentation for each
field, ensuring clarity and consistency in data representation.
Key Components of a Schema:
Field Names and Data Types: Clearly define each field's name and its corresponding data type (e.g., string, integer). This establishes expectations for data format and usage.
Documentation: Provide comprehensive descriptions for each field, outlining its purpose, limitations, and any specific constraints. This aids data consumers in understanding how to utilize the data effectively.
Incorporating Metadata and Quality Checks:
Metadata Elements: Include additional metadata such as primary keys, data quality rules, entity relationships, and data classification (e.g., confidential, public). This enhances the schema's utility and governance.
Data Quality Checks: Define valid and invalid data values, including constraints like minimum/maximum values and format adherence (e.g., email addresses). This ensures data integrity and reliability.
Flexibility and Interoperability Considerations: While schemas can be defined in various formats (YAML, JSON, code), it
is crucial to balance flexibility with interoperability. Choose formats that allow for easy integration across different
systems while maintaining the ability to evolve as organizational needs change.
Versioning and Evolution of Schemas: Establish a versioning strategy for schemas to accommodate changes over time.
This allows data contracts to evolve in response to new requirements while ensuring that data consumers can transition
smoothly to updated versions without disruption.
Defining a Schema for Data Contracts
Schema Registry Utilization: The schema is
stored in a central schema registry, serving as
the source of truth.
This registry allows data generators and
consumers to access the latest schema
versions, ensuring consistency across
applications and facilitating effective data
management practices.
Governance and Accessibility
Managing Changes: The schema supports
multiple versions, allowing for non-breaking
changes, such as adding optional fields without
disrupting existing consumers.
This ensures that the schema can evolve
alongside business needs while maintaining
compatibility.
Version Control and Evolution
Field Types and Descriptions: The customer
record schema includes essential fields such
as:
email: A string that must match a specific
pattern, ensuring valid email addresses.
language: A string indicating the customer's
language preference, with options like English
(en), French (fr), and Spanish (es).
Schema Definition
Schema Example: Customer Record
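A minimal sketch of the checks implied by the customer record schema above, validating the email pattern and the language preference; the specific regular expression and function names are assumptions for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_LANGUAGES = {"en", "fr", "es"}

def validate_customer(record: dict) -> list:
    """Return a list of data quality violations for one customer record."""
    errors = []
    if not EMAIL_PATTERN.match(record.get("email", "")):
        errors.append("email does not match the required pattern")
    if record.get("language") not in ALLOWED_LANGUAGES:
        errors.append("language must be one of en, fr, es")
    return errors

print(validate_customer({"email": "ana@example.com", "language": "fr"}))  # []
print(validate_customer({"email": "not-an-email", "language": "de"}))     # two violations
```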
A structured approach to data management supports effective
governance by establishing clear guidelines and standards for data
usage.
This structure aids in compliance with internal policies and external
regulations, ensuring that data is managed responsibly.
By implementing data contracts, organizations can set expectations
around data quality and accountability, fostering a culture of
ownership among data generators and consumers.
Facilitation of Data Governance and Compliance
Structured data ensures that information is organized in a predictable
format, which facilitates easier access and analysis.
This organization minimizes errors and inconsistencies, leading to
higher data quality.
By shifting responsibility for data quality to data generators,
organizations can proactively address issues at the source, reducing
the likelihood of invalid data affecting downstream processes.
Enhanced Data Quality and Reliability
Importance of Structured Data
Maintaining version control is crucial for
ensuring that all applications refer to the same
schema version.
A schema registry supports this by storing
multiple versions of schemas, which aids in
managing schema evolution and reinforces
governance and control mechanisms within
data management practices.
Version Control and Governance
Schemas can be implemented using various
serialization formats such as Apache Avro,
Protocol Buffers, and JSON Schema.
These formats enable data generators to
produce data that conforms to the defined
schemas, while also allowing data consumers
to deserialize and validate the data effectively.
Integration with Serialization
Formats
A schema registry serves as a central
repository for schemas, ensuring accessibility
for both data generators and consumers.
It acts as the definitive source for schema
definitions, promoting consistency across
applications and facilitating effective schema
management.
Schema Registry as a Source of
Truth
Tooling and Functionality of Schemas
Importance of Schema Documentation: Clear documentation is essential for understanding the structure and purpose
of data schemas. It serves as a reference for both data generators and consumers, ensuring that everyone involved has
a shared understanding of the data being used.
Key Components of Schema Documentation:
Schema Definition: Each schema should include a detailed definition that outlines the fields, data types, and any constraints (e.g., required fields, valid ranges). This clarity helps prevent errors in data generation and consumption.
Data Quality Checks: Documentation should specify the data quality checks that are in place, such as validation rules for data formats and acceptable value ranges. This ensures that data integrity is maintained throughout its lifecycle.
Utilization of Schema Registries: A schema registry acts as the source of truth for all documented schemas, making
them easily accessible to both data generators and consumers. This centralization promotes consistency and helps
manage version control effectively.
Version Control and Governance: Maintaining version control is crucial for schemas, as it ensures that all applications
refer to the same version. Documentation should include version history and changes made, which aids in governance
and compliance with data management policies.
Collaboration and Communication: Effective schema documentation fosters collaboration between data generators and
consumers. Regular discussions and updates regarding schema changes should be documented to ensure that all
stakeholders are informed and aligned on data expectations.
Documentation in Schemas
A formal document capturing the data contract
is essential.
This documentation should include key
elements such as service-level objectives
(SLOs), data access methods, and data
classification.
Proper governance ensures compliance with
organizational policies and enhances data
quality and integrity.
Documentation and Governance
Effective data contracts require ongoing
collaboration between data generators and
consumers.
This partnership helps refine requirements,
adjust performance expectations, and ensures
that both parties are aligned on the data's value
and utility.
Collaboration with Data
Consumers
Data generators are the primary owners of data
contracts, possessing the full context of their
services and the data they produce.
This ownership ensures accountability and a
deeper understanding of the data's purpose
and usage.
Data Generator Responsibility
Ownership in Data Contracts
Ownership and Responsibility: Data contracts must have a designated owner, typically the data generator, who is accountable for the
data's quality and integrity. This ownership ensures that those who understand the data best are responsible for its management and
delivery.
Contract Components: Essential elements to include in a data contract are:
Version Number: Tracks changes and updates to the contract over time.
Service Level Objectives (SLOs): Defines expectations for data quality, including completeness (100%), timeliness (60 minutes), and availability (95%).
Access Methods: Specifies how consumers can access the data, such as through data warehouse tables or streaming platforms.
Data Classification: Categorizes data according to sensitivity and compliance requirements, such as confidential or public.
Retention and Deletion Policies: Outlines how long data will be kept and the procedures for its deletion or anonymization.
Schema Definition: The data contract should clearly define the schema, including the structure and types of data fields. This clarity
helps ensure that both data generators and consumers have a mutual understanding of the data being exchanged.
Documentation and Communication: A formal document should capture all discussions and decisions related to the data contract. This
documentation serves as a reference point for both data generators and consumers, facilitating ongoing communication and
adjustments as needed.
Governance and Compliance: Data contracts must align with organizational governance policies, ensuring that data handling practices
comply with legal and regulatory standards. This includes implementing necessary controls for personal data and ensuring data quality
checks are in place.
Elements of a Data Contract
Defining metadata in a machine-readable
format enhances integration with tools for
privacy, data catalogs, and governance.
Automated validation processes can be
implemented to ensure metadata accuracy,
facilitating efficient data handling and access
control management.
Machine-Readable Metadata for
Automation
Data generators are responsible for
maintaining and updating metadata, leveraging
their contextual knowledge of the data.
This ownership fosters accountability and
ensures that metadata remains accurate and
relevant as data evolves.
Role of Data Generators in Metadata
Management
Capturing extensive metadata is crucial for
effective data governance and management.
Metadata includes details such as data
classification (e.g., confidential, public),
retention periods, and deletion policies,
ensuring compliance with organizational
standards.
Importance of Comprehensive Metadata
Metadata Capture in Data Contracts
Considerations for Language
Selection
The choice of language should align with
organizational requirements and existing tools
to ensure ease of adoption.
Familiarity among team members with the
selected language can significantly enhance
the implementation process and reduce
learning curves.
TypeScript
Combines the benefits of JavaScript with static
typing, making it suitable for defining data
contracts in web-based applications.
Ensures type safety, which can help prevent
errors during data handling and processing.
Python
A versatile programming language that can be
utilized for defining data contracts, especially
in environments where data manipulation is
frequent.
Offers extensive libraries and frameworks that
can enhance the functionality of data
contracts.
YAML
Widely recognized for its human-readable
format, making it accessible for users with
varying technical backgrounds.
Ideal for organizations without established
tooling, providing a straightforward approach
to defining data contracts.
Jsonnet
A flexible data definition language favored for
its integration with existing infrastructure tools.
Allows for dynamic configuration and is
particularly useful for teams already familiar
with it.
Recommended Languages for Data Contracts
Structured Governance: Effective data
governance is necessary to ensure compliance
and quality. This includes assigning
responsibilities and promoting governance
through data contracts.
Tooling Support: Organizations must develop
tools that facilitate the self-service capabilities
of data generators, allowing them to manage
their data autonomously while adhering to
established standards.
Governance and Tooling for Success
Key Objectives: Organizations should focus on
improving data pipeline dependability,
enhancing user trust, and making data more
accessible for critical applications, such as
machine learning.
Proof of Concept (POC): Selecting relevant use
cases for POCs is essential. These should align
with organizational objectives and involve both
data generators and consumers to ensure
successful implementation.
Implementation Objectives and Strategies
Empowering Data Generators: Data generators,
often from product engineering teams, are now
recognized as key players in the data
ecosystem. They are responsible for the data
they produce, fostering a sense of ownership
and accountability.
Role of Data Consumers: Data consumers,
including data engineers and analysts, rely on
the data generated to make informed
decisions. Their collaboration with data
generators is crucial for defining data needs
and expectations.
Cultural Shift Towards Data Ownership
Organizational Context for Data Contracts
Data generators are responsible for owning the
data contract, as they possess the necessary
context to make informed decisions about their
data.
Collaboration between data generators and
consumers is essential for refining
requirements, ensuring that the data contract
meets the needs of both parties and supports
effective data governance.
Ownership and Collaboration in Data
Contracts
Guidelines and Guardrails for Data
Generators
Tools are provided to assist data generators in
adhering to data management standards,
enabling them to focus on their data products
without needing to be experts in data
contracts.
These guidelines streamline workflows, reduce
bottlenecks, and promote agility, allowing data
generators to operate independently while
ensuring compliance with organizational
standards.
Contract-Driven Data Architecture
Provide tools and standards that assist data
generators in adhering to data management
practices without requiring them to be experts.
These guidelines promote consistency in data
management, enabling data generators to work
autonomously while ensuring compliance with
organizational standards.
Guidelines and Guardrails for Data
Generators
Implement data transformations and business
logic at the source to reduce redundancy and
improve efficiency.
This approach minimizes the need for
extensive processing downstream, allowing for
quicker access to quality data.
Shift Left Approach in Data
Processing
Establish formal agreements that outline the
schema, expectations, and service level
objectives (SLOs) for data generation.
Promote accountability and clarity in roles
between data generators and consumers,
ensuring that both parties understand their
responsibilities.
Understanding Data Contracts
Data Processing Services
Data Masking: Replaces sensitive information
with fictional data that retains the same
format, allowing for data analysis without
exposing real data.
Tokenization: Substitutes sensitive data
elements with non-sensitive equivalents,
known as tokens, which can be mapped back
to the original data only by authorized systems.
Hashing: Converts data into a fixed-size string
of characters, which is irreversible, ensuring
that original data cannot be retrieved from the
hashed value.
Generalization: Reduces the precision of data
by replacing specific values with broader
categories, making it harder to identify
individuals while still allowing for useful
analysis.
Common Anonymization Techniques
Purpose of Anonymization
Protects personal data by ensuring that
individuals cannot be identified from the data.
Essential for compliance with data protection
regulations and maintaining user privacy.
Anonymization Strategies
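To illustrate two of the techniques above, here is a minimal sketch of data masking and generalization; the output formats and bucket sizes are assumptions.

```python
def mask_email(email: str) -> str:
    """Data masking: keep the shape of the value but hide the real content."""
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

def generalize_age(age: int) -> str:
    """Generalization: replace a precise value with a broader bucket."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

print(mask_email("jane.doe@example.com"))  # j***@example.com
print(generalize_age(37))                  # 30-39
```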
Clear visibility of the council's objectives and
activities fosters engagement across the
organization.
Regular updates and communication help
maintain alignment and accountability in data
governance practices.
Promoting Transparency and
Communication
Data Governance Council: A cross-functional
team responsible for defining policies and
standards, including representatives from data
product management, legal, privacy, and
security.
Data Generators: Individuals who manage their
data, supported by self-service tools and
guidelines, empowered to classify data and
ensure compliance with established standards.
Roles and Responsibilities
Establishes a framework for managing data
effectively across the organization.
Ensures data is accessible, usable, accurate,
consistent, secure, and compliant with
regulations.
Importance of Data Governance
Data Governance and Visibility
Establishing well-defined roles for data generators enhances
collaboration with data consumers, ensuring that both parties
understand their impact on data quality and utility.
Data contracts play a crucial role in clarifying these responsibilities,
enabling data generators to manage metadata, classify data
sensitivity, and adhere to organizational policies effectively. This
structured approach promotes a proactive stance on data quality, as
generators are now accountable for the data they create and its
implications for downstream users.
Clear Roles and Responsibilities
Data generators are encouraged to take ownership of their data,
allowing them to manage and provide data products independently.
This autonomy reduces reliance on central teams, which can create
bottlenecks and slow down data accessibility.
By utilizing self-service tools and guidelines, data generators can
efficiently create and maintain their data products, fostering a sense
of responsibility and accountability for the quality of the data they
produce.
Autonomy in Data Management
Empowerment of Data Generators
Foster collaboration between data generators
and consumers to bridge gaps in
communication and understanding.
Integrate with existing data governance
councils to align on policies, standards, and
best practices for data management.
Collaboration and Integration
Develop and support data contract tooling to
ensure compliance with organizational
standards.
Provide self-service capabilities for data
generators, enabling them to manage their
datasets effectively without constant
oversight.
Key Responsibilities
Establish a dedicated team to support the
implementation of data contracts and enhance
data management practices.
Focus on building and maintaining tooling that
facilitates data governance, quality, and
accessibility across the organization.
Purpose and Objectives
Formation of Data Infrastructure Team
Standardized Tooling: The use of standardized tools promotes
uniformity in data management, making it easier for data consumers
to discover and understand data governed by contracts.
Incident Management Support: Clear access to service configurations
and observability metrics allows data generators to respond
effectively to incidents, ensuring that data quality and reliability are
maintained.
Ensuring Compliance and Consistency
Autonomy in Data Management: Data generators are provided with
tools that enable them to create and manage their data products
independently, fostering a sense of ownership and accountability.
Streamlined Workflows: By implementing guidelines and guardrails,
data generators can adhere to data management standards without
needing to be experts, thus reducing bottlenecks and enhancing
development speed.
Empowering Data Generators
Guidelines and Guardrails for Data Generators
Standardized tooling promotes uniformity in
data management practices across the
organization.
By ensuring that all data generators follow the
same guidelines, organizations can maintain
compliance with data policies while enhancing
the overall quality and reliability of data
products.
Consistency and Compliance
Implementing contract-driven architectures
allows data generators to adhere to data
management standards without needing to be
experts.
This approach minimizes bottlenecks,
enhances efficiency, and accelerates the
development of data products, ultimately
leading to faster insights and actions.
Streamlined Workflows
Provide tools and guidelines that enable data
generators to operate independently, reducing
reliance on central data teams.
This autonomy fosters a sense of ownership
and accountability, motivating teams to
produce high-quality data products.
Empowering Data Generators
Agility and Autonomy in Data Management
Standardized Tooling for Data Generators: Implementing uniform tools across the organization promotes
consistency in data management practices. This ensures that all data generators can easily manage their datasets,
leading to a more cohesive data environment.
Clear Expectations and Access: Establishing data contracts sets clear expectations around data usage, access
controls, and ownership. This clarity allows data consumers to discover and understand the data more effectively,
fostering a reliable data ecosystem.
Incident Management and Recovery: Providing data generators with access to service configurations and
observability metrics enhances their ability to respond to incidents. This structured approach minimizes downtime
and ensures that data quality is maintained consistently.
Agility and Autonomy in Data Generation: By allowing data generators to work independently within established
guidelines, organizations can reduce bottlenecks. This autonomy empowers teams to manage their data effectively
while adhering to organizational standards.
Return on Investment in Data Infrastructure: A consistent approach to data management increases the
effectiveness of investments in data infrastructure. By minimizing the need for disparate solutions, organizations
can focus on generating business value through improved data quality and accessibility.
Consistency in Data Management
Well-defined backup and recovery processes
are essential for ensuring data integrity and
availability.
Data generators should have established
protocols to recover data swiftly in the event of
an incident, thereby maintaining business
continuity and trust in data quality.
Backup and Recovery Processes
Implementing observability metrics enables
data generators to monitor the health and
performance of their data products
continuously.
By tracking key performance indicators, teams
can detect anomalies early and take corrective
actions before they escalate into significant
incidents.
Observability Metrics for Proactive
Monitoring
Clear Access to Service
Configurations
Data generators have straightforward access
to their service configurations, which is crucial
for effective incident management.
This transparency allows teams to quickly
identify and address issues as they arise,
minimizing downtime and disruption.
Incident Management in Data
Contracts
Investing in data contracts leads to the development of data
products that drive business-critical applications, such as analytics
and machine learning.
As data products evolve, they unlock new use cases and
opportunities for revenue generation, reinforcing the strategic
importance of quality data investments.
Long-term Business Value
Data contracts foster a partnership between data generators and
consumers, enhancing mutual understanding of data needs and
expectations.
Regular communication regarding changes in data schema ensures
minimal disruption to business operations, promoting a culture of
collaboration.
Stronger Collaboration and Communication
Simplification of data pipelines reduces operational costs and
time, making data processing quicker and more cost-effective.
Streamlined workflows minimize bottlenecks, allowing teams to
focus on value creation rather than data management hurdles.
Increased Efficiency in Data Pipelines
Improved accountability for data quality through defined roles for
data generators and consumers.
Establishment of Service Level Objectives (SLOs) ensures data
completeness, timeliness, and availability, leading to more reliable
data delivery.
Enhanced Data Quality
Return on Investment in Data Contracts
Before finalizing the data contract, data generators must
consider various trade-offs and constraints.
This includes assessing the cost of generating comprehensive
data, the performance impacts on data generation services, and
the feasibility of meeting specific performance requirements.
Careful evaluation at this stage is crucial to ensure a realistic
and achievable contract.
Evaluate Trade-offs
Data generators should collect detailed needs from
data consumers, which may include the types of data
required, the desired structure, the interface for data
consumption, and the timeliness of data delivery.
This ensures that the data contract aligns with
consumer expectations and business objectives.
Gather Requirements
Begin by understanding the intended use of the data
product.
Determine who the data is for, how it will be utilized, and
the specific problems it aims to solve.
This step encourages collaboration between data
generators and consumers to gather essential
requirements.
Identify the Purpose
Creating a Data Contract
Data Contract Integration: The service operates based on predefined
data contracts, which outline the rules and expectations for data
handling, including anonymization strategies.
Tooling Capabilities: The implementation supports various
functionalities such as automated data quality checks to ensure the
integrity of anonymized data, regular backups and data transfer
capabilities to enhance data management efficiency, and SLA
reporting to monitor compliance with data governance standards.
Key Features and Benefits
The anonymization service utilizes a Python script (anonymize.py) to
transform personal data into non-identifiable formats.
This process is crucial for protecting sensitive information while
maintaining data utility.
Example transformations include converting names and emails into
hashed formats, ensuring compliance with data protection
regulations.
Overview of Anonymization Process
Anonymization Service Example
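The course does not reproduce anonymize.py itself, so the following is only a sketch of the transformation it describes, hashing names and emails with a salted SHA-256 digest; the salt, field list, and record shape are assumptions.

```python
import hashlib

SALT = "replace-with-a-secret-salt"       # assumption: a secret salt kept outside the code
PERSONAL_FIELDS = ("name", "email")        # assumption: fields flagged as personal data

def anonymize(record: dict) -> dict:
    """Replace personal fields with irreversible salted SHA-256 digests."""
    out = dict(record)
    for field in PERSONAL_FIELDS:
        if out.get(field) is not None:
            out[field] = hashlib.sha256((SALT + out[field]).encode()).hexdigest()
    return out

print(anonymize({"id": "c-42", "name": "Jane Doe", "email": "jane@example.com"}))
```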
Schema Definition: Outlines the structure of the
data, including fields and data types.
Data Quality Checks: Establishes criteria for
ensuring data integrity, such as valid ranges
and format matching.
Retention and Deletion Policies: Specifies how
long data will be kept and the procedures for
its deletion or anonymization, ensuring
compliance with governance policies.
Metadata and Schema Definition
Version Number: Tracks changes and updates
to the contract over time.
Service-Level Agreements (SLAs): Defines
expectations for data quality, including metrics
such as completeness, timeliness, and
availability.
Data Access Methods: Specifies how
consumers can access the data, whether
through data warehouse tables or streaming
platforms.
Data Classification: Identifies the sensitivity of
the data (e.g., confidential, public) and outlines
necessary security controls.
Contract Elements
Every data contract must have a designated
owner, typically the data generator, who is
accountable for the data's quality and integrity.
This ownership ensures that those with the
most context about the data are responsible
for its management and evolution.
Ownership and Responsibilities
Components of a Data Contract
Transactional Outbox Pattern
Implement the transactional outbox pattern to
maintain consistency between source systems
and data consumer interfaces.
Ensures reliable data flow by writing events to
an outbox table before they are sent to the data
contract-backed interface.
Access Control Management
Configuring access controls is essential but
can be managed outside the data contract
during initial phases.
Allows for rapid proof of concept (POC)
implementations while ensuring security
measures are in place.
Self-Service Deployment
Empower data generators to autonomously
deploy interfaces without bottlenecks from
central teams.
Enhances efficiency and allows for quicker
adaptation to changing data needs.
Utilizing Existing Platforms
Leverage established data platforms such as
Snowflake, Google BigQuery, or event
streaming services like Apache Kafka to
provision interfaces.
Minimizes complexity and utilizes familiar
tools for data management.
Data Contract Provisioning
Establish formal agreements that define the
schema, expectations, and service level
objectives (SLOs) for data interfaces.
Ensures clarity and alignment between data
generators and consumers.
Providing Interfaces to Data
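A minimal sketch of the transactional outbox pattern mentioned above, using sqlite3 purely for illustration; the table layout and topic name are assumptions. The key point is that the business write and the outbox write share one transaction, and a separate relay publishes the outbox rows to the contract-backed interface.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT)")

order = {"id": "o-1001", "total": 42.50}
with conn:  # single transaction: both writes commit together or not at all
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["total"]))
    conn.execute(
        "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
        ("orders.v1", json.dumps(order)),
    )

# A separate relay process would poll the outbox table, publish each row to the
# data contract-backed interface (e.g. a Kafka topic), and then mark it as sent.
for topic, payload in conn.execute("SELECT topic, payload FROM outbox"):
    print(topic, payload)
```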
Use Pulumi to create and manage the BigQuery table based on
the generated JSON schema.
This step involves defining the table resources in your Pulumi
application, specifying the dataset and table names, and applying
the schema.
By executing the `pulumi up` command, you will provision the
BigQuery table, ensuring it aligns with the data contract and is
ready for data ingestion.
Implement with Pulumi
Utilize the defined data contract to generate a BigQuery
schema in JSON format.
This involves iterating over the fields specified in the
YAML contract, extracting necessary metadata, and
formatting it according to BigQuery's requirements.
The resulting JSON schema will detail each field's
attributes, including its type and whether it is mandatory.
Convert to JSON Schema
Begin by creating a YAML-based interface that
outlines the structure and requirements of the data to
be stored in BigQuery.
This data contract serves as the foundation for the
schema, capturing essential metadata such as field
names, types, and whether fields are required or
optional.
Define the Data Contract
BigQuery Schema Creation
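A minimal sketch of the conversion step described above: iterate over the contract's fields and emit BigQuery's JSON schema format. The contract structure and the type mapping are assumptions for illustration.

```python
import json

# Assumed shape of the fields section after loading the YAML contract.
contract_fields = {
    "id":    {"type": "string", "required": True},
    "email": {"type": "string", "required": True},
    "name":  {"type": "string", "required": False},
}

TYPE_MAP = {"string": "STRING", "integer": "INTEGER", "float": "FLOAT"}

bigquery_schema = [
    {
        "name": field_name,
        "type": TYPE_MAP[spec["type"]],
        "mode": "REQUIRED" if spec["required"] else "NULLABLE",
    }
    for field_name, spec in contract_fields.items()
]

print(json.dumps(bigquery_schema, indent=2))
```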
After confirming the update, Pulumi will execute the
provisioning.
You can verify the successful creation of your BigQuery
resources by accessing the Google Cloud Console.
Navigate to the BigQuery section to see your dataset
and table, ensuring that the schema aligns with your
data contract.
Verify Resource Creation
Use the command `pulumi up` to initiate the provisioning
process.
During this step, Pulumi will prompt you to confirm the
creation of resources, including your BigQuery dataset
and table.
For example, you may see a preview indicating that a
dataset named defaultDataset and a table named
defaultTable will be created.
Provision BigQuery Resources
Create a data contract that outlines the schema for
your BigQuery table.
This contract serves as a structured definition of the
data, detailing the fields and their attributes.
For instance, you might define fields such as id,
name, and email, specifying which are required and
their data types.
Define Your Data Contract
Begin by installing the Pulumi CLI and setting up your
project.
Ensure you have a Google Cloud account with the
necessary permissions to create BigQuery datasets and
tables.
Modify the Pulumi.yaml configuration file to specify your
Google Cloud project, which is essential for resource
management.
Set Up Your Pulumi Environment
Managing BigQuery Tables with Pulumi
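A minimal Pulumi (Python) sketch of the steps above. It assumes the pulumi and pulumi_gcp packages are installed and that Pulumi.yaml points at your Google Cloud project; the resource names follow the defaultDataset / defaultTable example from the slide, and the schema is assumed to come from the earlier conversion step.

```python
# __main__.py of the Pulumi project
import json
import pulumi
import pulumi_gcp as gcp

dataset = gcp.bigquery.Dataset("defaultDataset", dataset_id="default_dataset")

table = gcp.bigquery.Table(
    "defaultTable",
    dataset_id=dataset.dataset_id,
    table_id="default_table",
    deletion_protection=False,
    # Schema derived from the data contract (see the conversion step above).
    schema=json.dumps([
        {"name": "id",    "type": "STRING", "mode": "REQUIRED"},
        {"name": "email", "type": "STRING", "mode": "REQUIRED"},
        {"name": "name",  "type": "STRING", "mode": "NULLABLE"},
    ]),
)

pulumi.export("table_id", table.table_id)
```

Running `pulumi up` in this project previews and then provisions the dataset and table described above.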
After converting the data contracts, publish the schema to the central
schema registry.
This involves using the registry's API to upload the schema, ensuring it is
stored as the source of truth for your data.
The registry should support operations such as publishing new schemas,
updating existing ones, and retrieving schemas by version.
This step establishes a centralized repository that all applications can
reference, promoting consistency across your data management practices.
Publish the Schema to the
Registry
Once the schema structure is defined, convert your data
contracts into JSON Schema format.
This conversion is essential for compatibility with various
schema registries, allowing for easier management and retrieval
of schemas.
The JSON Schema format provides a standardized way to
describe the structure of your data, making it accessible to
different applications and services.
Convert Data Contracts to JSON
Schema
Begin by outlining the schema structure that will be used to represent
your data.
This includes identifying the necessary fields, data types, and any
validation rules that need to be applied.
For example, a schema for a customer might include fields such as
email, language preference, and other relevant attributes.
This step ensures that the schema accurately reflects the data
requirements of your applications.
Define the Schema Structure
Populating a Central Schema Registry
89
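As a sketch of the publishing step, assuming a Confluent-compatible registry listening on `localhost:8081` and a `Customer` subject, the snippet below registers a JSON Schema through the registry's REST API.

```python
# Sketch: register a JSON Schema for the "Customer" subject in a
# Confluent-compatible schema registry. URL and subject are illustrative.
import json
import requests  # pip install requests

REGISTRY_URL = "http://localhost:8081"

customer_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Customer",
    "type": "object",
    "properties": {
        "email": {"type": "string", "format": "email"},
        "language_preference": {"type": "string"},
    },
    "required": ["email"],
}

response = requests.post(
    f"{REGISTRY_URL}/subjects/Customer/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schemaType": "JSON", "schema": json.dumps(customer_schema)}),
)
response.raise_for_status()
print("Registered schema id:", response.json()["id"])
```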

The Confluent schema registry supports multiple versions of a
schema, allowing users to manage and retrieve specific versions.
To see how many versions exist for a subject, the command `curl
http://localhost:8081/subjects/Customer/versions` can be used,
which will indicate the available versions, such as `[1]`.
Users can fetch a specific version or the latest version of a schema
using commands like `curl
http://localhost:8081/subjects/Customer/versions/1/schema` for a
specific version or `curl
http://localhost:8081/subjects/Customer/versions/latest/schema` for
the most recent version.
Version Management and Retrieval
Utilize API calls to retrieve schemas based on their names, referred to
as subjects in the Confluent schema registry.
To list all subjects, the command `curl http://localhost:8081/subjects`
can be executed, which will return a list of available subjects, such as
`[Customer]`.
Accessing Schemas by Subject
Schema Retrieval in Confluent Registry
90

The schema registry supports multiple
versions of a schema, enabling users to
retrieve specific versions or the latest version
as needed.
This functionality ensures that data consumers
can seamlessly interact with the correct
schema version, minimizing disruptions in data
processing and application functionality.
Compatibility and Retrieval
Implementing a semantic versioning approach
is crucial.
Major versions indicate breaking changes (e.g.,
Customer.v1 to Customer.v2), while minor
versions signify compatible updates (e.g., v1.1,
v1.2).
This strategy helps maintain clarity and
compatibility across different applications.
Versioning Strategy
Schemas evolve over time to accommodate
changes in organizational data needs.
Each schema version represents a specific
structure of data at a given point in time,
allowing for historical tracking and
management of changes.
Understanding Schema Versions
Version Management in Schema Registry
91
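A toy heuristic (not a registry feature) for deciding the bump between two schema versions: removing a field or changing a type suggests a major version, while purely additive changes suggest a minor one. The field layout below is an illustrative convention.

```python
# Toy heuristic for choosing a semantic version bump between two schema versions.
# Fields are described as name -> {"type": ..., "required": ...} for illustration.
def suggest_bump(old_fields: dict, new_fields: dict) -> str:
    """Return 'major' for breaking changes, 'minor' for compatible additions."""
    for name, spec in old_fields.items():
        new_spec = new_fields.get(name)
        if new_spec is None:
            return "major"                    # field removed -> breaking
        if new_spec["type"] != spec["type"]:
            return "major"                    # type changed -> breaking
    return "minor"                            # only new fields were added

v1 = {"id": {"type": "string", "required": True}}
v1_1 = {"id": {"type": "string", "required": True},
        "address": {"type": "string", "required": False}}

print(suggest_bump(v1, v1_1))  # -> "minor", e.g. Customer v1 -> v1.1
```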

Migration Planning: A well-defined migration plan is essential to
transition consumers to new schema versions without disrupting
existing applications. The complexity of this plan varies based on the
size of the change and the number of affected consumers.
Version Management: For minor changes, running both old and new
schema versions concurrently for a limited time may suffice. However,
significant changes may require maintaining multiple versions longer
and providing migration libraries to assist consumers in adapting to
the new schema.
Migration Strategies for Schema Changes
Adapting to Consumer Needs: Data contracts must evolve to meet
changing requirements from data consumers, ensuring that the data
remains relevant and useful.
Enhancing Service Features: Schema updates can facilitate the
introduction of new features and improvements in performance,
thereby enhancing the overall service quality.
Importance of Schema Evolution
Schema Evolution in Data Contracts
92

Low Impact: Non-breaking changes have
minimal effect on existing data consumers,
allowing them to continue using the data as
documented.
This fosters a smoother transition and
encourages the adoption of new features
without the need for immediate updates to
their systems.
Impact on Data Consumers
Adding Optional Fields: Introducing a new field,
such as 'address,' to a customer schema.
Existing consumers can ignore this new field
without affecting their operations.
Removing Non-Required Fields: Removing a
field that is not mandatory and has a default
value, ensuring that current consumers are not
impacted by this change.
Common Examples
Definition of Non-Breaking Changes
Non-breaking changes are modifications to a
data schema that do not disrupt existing data
consumers.
They allow data generated against a new
schema version to be read by services using
previous versions without any data loss or
impact.
Non-Breaking Changes Example
93
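A small illustration of the optional-field case above, using the `jsonschema` package: a record produced against v1 still validates against the v2 schema, so existing consumers are unaffected.

```python
# Sketch: adding an optional 'address' field is a non-breaking change —
# records produced against v1 still validate against the v2 schema.
from jsonschema import validate  # pip install jsonschema

customer_v1 = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "name": {"type": "string"}},
    "required": ["id", "name"],
}

customer_v2 = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "address": {"type": "string"},   # new, optional
    },
    "required": ["id", "name"],          # 'address' is not required
}

old_record = {"id": "42", "name": "Ada"}     # written before the change
validate(old_record, customer_v2)            # passes: consumers are unaffected
print("v1 record is still valid under v2")
```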

Aim to create a minimum viable
product that supports the POC.
Deliver this MVP quickly and
iteratively, incorporating user feedback
to refine and enhance the data
contracts-backed platform.
This approach will help establish a
solid foundation for broader adoption
of data contracts across the
organization.
Focus on Building a
Minimum Viable Product
(MVP)
Develop tooling that supports the
decentralization and ownership
goals of data contracts.
Ensure that data generators can
self-serve the tooling to avoid
bottlenecks and maintain autonomy.
Involve stakeholders in the design
process to enhance ownership and
improve outcomes.
Design Appropriate
Tooling
Encourage open communication
and collaboration between data
generators (those who produce
data) and data consumers (those
who utilize data).
This collaboration is essential for
defining data needs, requirements,
and expectations, ultimately leading
to a more effective data contract.
Foster Collaboration
Between Teams
Choose a use case that aligns with
your identified objectives to serve
as a proof of concept (POC).
This use case should have the
necessary resources and
personnel, including both data
generators and data consumers, to
ensure successful implementation
and value delivery.
Select a Relevant Use
Case
Begin by determining the primary
goals for implementing data contracts
within your organization.
Focus on specific issues such as
improving data pipeline dependability,
enhancing user trust in data, or
increasing accessibility for critical
applications like machine learning.
Establishing clear objectives will guide
the entire implementation process.
Identify Key Objectives
Getting Started with Data Contracts
94

Articulate the value of the data to both data
consumers and end users, making a compelling case
for the adoption of data contracts.
Align these discussions with company-wide goals to
incentivize data generators, ensuring they understand
the benefits of participating in the migration process
and the importance of their contributions.
Communicate Value to Data
Generators
Collaborate with key data consumers, such as
data/analytics engineers and data scientists, to
prioritize critical datasets for migration.
Form a working group to identify core data models
that are essential to the business, facilitating a shared
understanding of data needs and fostering
collaboration between teams.
Engage Key Data Consumers
Establish a structured migration plan that balances
the need for timely transition to data contracts with
the ongoing commitments of product teams.
This plan should consider the complexity of the
migration process and the specific objectives of the
organization, ensuring that it aligns with both
immediate and long-term goals.
Develop a Migration Plan
Migrating to Data Contracts
95

Schema Definition: Clearly defines the structure of the data, including
fields and data types, ensuring that all parties understand the format
and requirements.
Service Level Objectives (SLOs): Establishes measurable goals for
data quality, such as completeness, timeliness, and availability, which
are crucial for maintaining trust and reliability in data delivery. For
instance, a typical SLO might specify 100% completeness and 95%
availability, guiding data generators in their commitments.
Key Components of Data Contracts
Data contracts serve as formal agreements between data generators
and consumers, outlining expectations, responsibilities, and the
schema of the data being shared.
They are essential for ensuring clarity and alignment in data
management practices.
Understanding Data Contracts
Discovering Data Contracts
96
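Purely as an illustration of how these components can sit together, here is a contract fragment expressed as a Python dictionary, using the SLO targets mentioned above; the key names and owner value are assumptions, not a formal specification.

```python
# Illustrative data contract combining a schema definition with SLO targets.
# The key names below are an example convention, not a formal standard.
customer_contract = {
    "name": "customer",
    "version": "1.0",
    "schema": {
        "fields": [
            {"name": "id", "type": "STRING", "required": True},
            {"name": "email", "type": "STRING", "required": True},
            {"name": "language_preference", "type": "STRING", "required": False},
        ]
    },
    "slos": {
        "completeness": 1.00,    # 100% of expected records present
        "availability": 0.95,    # queryable 95% of the time
        "timeliness_minutes": 60,
    },
    "owner": "team-customer-platform",
}
```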

A mature data governance process is essential,
with policies set centrally but responsibilities
assigned locally to data generators.
Providing guidelines and automated tools can
streamline workflows, reduce bottlenecks, and
empower data generators to work
independently while adhering to organizational
standards.
Implementing Structured Governance and
Support
Establishing strong communication channels
between data generators and consumers is
crucial for defining data needs and
expectations.
Regular discussions help refine requirements,
adjust performance expectations, and build a
partnership focused on achieving business
goals.
Fostering Collaboration Between
Teams
Emphasizing Ownership and
Accountability
Data generators must take ownership of data
contracts, ensuring they understand the full
context of their data and its implications for
the organization.
This ownership fosters a sense of
responsibility, encouraging data generators to
maintain high standards of data quality and
reliability.
Building a Data Contracts-Backed
Culture
97

Consider establishing a deadline for
the migration to maintain momentum,
while being mindful of the potential
risks of reduced buy-in from data
generators.
Regularly measure and communicate
migration progress using metrics such
as adoption rates and data incidents,
ensuring transparency and
accountability throughout the process.
Set Deadlines and
Monitor Progress
Articulate the value of the data
contracts to data generators,
emphasizing how these contracts
align with company-wide goals.
This communication should
highlight the benefits of reliable data
for both consumers and end users,
incentivizing data generators to
actively participate in the migration
process.
Communicate Value to
Data Generators
Collaborate with key data consumers,
such as data/analytics engineers and
data scientists, to prioritize critical
datasets for migration.
Form a working group to facilitate
discussions and ensure that the
most important data models are
identified and addressed first,
fostering a sense of ownership
among stakeholders.
Engage Key
Stakeholders
Create a structured migration plan
that balances the need for timely
completion with the ongoing
commitments of product teams.
This strategy should outline the
steps for transitioning data assets
while minimizing disruption to
existing workflows and ensuring
that product teams can continue to
meet their roadmaps.
Develop a Migration
Strategy
Begin by identifying the key objectives
for migrating data assets to data
contracts.
This includes improving data pipeline
dependability, enhancing user trust,
and ensuring data accessibility for
critical applications.
Clearly defined goals will guide the
migration process and align it with
organizational priorities.
Establish Objectives
Migration Plan for Data Assets
98

Encourage data generators to take ownership
of the data they produce, enhancing
accountability and quality.
Provide self-service tools that enable teams to
manage data effectively without requiring
extensive expertise in data governance.
Empowering Teams for Data
Quality
Utilize data contracts to define roles,
responsibilities, and expectations between
data consumers and generators.
Foster better communication and
collaboration, ensuring that data consumers
articulate their needs clearly.
Establishing Data Contracts
Understanding Requirements
Identify the specific business needs and
expectations of data consumers to build
effective data products.
Engage in discussions to clarify who the data
products are for and what problems they aim
to solve.
Collaboration with Data Consumers
99

By shifting accountability for data quality
upstream to data generators, organizations can
proactively address data issues at the source.
This approach minimizes the risk of
downstream data incidents and enhances
overall data reliability, leading to better
business outcomes.
Shift-Left Approach to Data Quality
Providing structured guidelines and self-service
tools enables data generators to adhere to data
management standards without needing deep
expertise.
This support reduces bottlenecks and
enhances agility, allowing teams to focus on
delivering quality data products.
Guidelines and Support Tools
Effective engagement requires clear
communication between data generators and
consumers.
Regular feedback loops help data generators
understand consumer needs, while consumers
must articulate their data requirements to
ensure alignment and value generation.
Collaboration and Communication
Empowerment and Ownership
Data generators should take ownership of data
contracts, as they possess the context
necessary for informed decision-making.
This ownership fosters a sense of
responsibility and accountability for the quality
and reliability of the data they produce.
Engagement with Data Generators
100

Foster collaboration among data generators
and consumers to create a shared
understanding of the migration goals.
Engage stakeholders in the planning process to
enhance commitment and ensure alignment
with organizational objectives.
Encouraging Collaboration and Buy-In
Consider the potential risks associated with
setting strict deadlines.
Deadlines may lead to reduced buy-in from
data generators, who might prioritize speed
over quality in their work.
Assessing Risks of Deadlines
Establishing a Migration Timeline
Define a clear timeline for the migration
process to ensure a structured approach.
Balance the urgency of decommissioning
legacy systems with the ongoing commitments
of product teams.
Setting Deadlines for Migration
101

The concept of data products emphasizes the importance of creating
stable, accessible, and useful data solutions.
Data products should be designed around business entities rather
than internal data structures, fostering collaboration between data
generators and consumers.
This approach not only reduces costs and data duplication but also
enhances the overall business value derived from data, making it a
strategic asset for organizations.
Building Effective Data Products
Understanding the necessity of data governance is crucial for
maximizing the value of data as an asset.
This includes defining roles, establishing a data governance council,
and ensuring compliance with data policies and external regulations.
Effective governance promotes accountability and responsibility in
data management, which is vital for maintaining data quality and
usability.
Data Governance Essentials
Further Reading on Data Management
102

Adopting a shift-left approach means
addressing data quality issues at the source,
where data is generated. This proactive
strategy reduces redundancy and enhances
efficiency in data processing.
By performing necessary transformations and
implementing business logic early in the data
lifecycle, organizations can minimize the risk of
errors and improve overall data reliability for
critical applications.
Shift-Left Approach to Data Quality
Establishing formal data contracts is essential
for defining expectations regarding data
quality, schema, and service level objectives
(SLOs). These contracts serve as a mutual
agreement that clarifies roles and
responsibilities between data generators and
consumers.
Data contracts should include specific quality
checks, such as valid ranges and format
matching, to ensure data integrity before it is
published.
Implementing Data Contracts
Data generators must take responsibility for
the quality of the data they produce. This
ownership fosters a sense of accountability,
motivating them to maintain high standards
and reliability in their outputs.
Clear communication from data consumers
about their needs enhances this accountability,
ensuring that data generators understand the
impact of their work on downstream users.
Ownership and Accountability
Ensuring Data Quality in Publishing
103
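A minimal sketch of the pre-publish checks mentioned above (a valid range and a format match) using only the standard library; the field names, range, and email pattern are illustrative.

```python
# Sketch: quality checks run before a record is published, as required by the contract.
# Field names, the range, and the email pattern are illustrative.
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_record(record: dict) -> list:
    """Return a list of violations; an empty list means the record may be published."""
    violations = []
    if not EMAIL_PATTERN.match(record.get("email", "")):
        violations.append("email does not match the required format")
    if not 0 <= record.get("discount_percent", 0) <= 100:
        violations.append("discount_percent outside the valid range 0-100")
    return violations

record = {"email": "ada@example.com", "discount_percent": 120}
problems = check_record(record)
if problems:
    print("Blocked from publishing:", problems)
```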

Foster a culture of continuous improvement by regularly reviewing
monitoring outcomes and incorporating feedback from data
consumers to enhance data product quality and usability.
Continuous Improvement and Feedback Loops
Develop a robust monitoring framework that tracks SLO
compliance, providing alerts for any breaches to maintain data
integrity and prompt corrective actions.
Utilizing Monitoring Systems
Implement automated systems for collecting and reporting SLOs,
utilizing tools and resources that facilitate real-time monitoring of
data performance.
Automating Metrics Collection
Define key performance indicators such as completeness,
timeliness, and availability within data contracts to ensure data
quality and reliability.
Establishing Service-Level Objectives (SLOs)
Post-Publishing Monitoring
104
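One possible shape for automated SLO reporting: compare measured indicators for the reporting window against the targets declared in the contract and alert on any breach. The metric names, measured values, and targets are placeholders.

```python
# Sketch: automated SLO reporting — compare measured indicators against the
# targets declared in the data contract and alert on any breach.
SLO_TARGETS = {"completeness": 1.00, "availability": 0.95}

def evaluate_slos(measured: dict, targets: dict) -> dict:
    """Return the SLOs that were breached in the reporting window."""
    return {name: (measured.get(name, 0.0), target)
            for name, target in targets.items()
            if measured.get(name, 0.0) < target}

# Placeholder measurements for the last 24 hours.
measured = {"completeness": 0.998, "availability": 0.97}

breaches = evaluate_slos(measured, SLO_TARGETS)
for name, (value, target) in breaches.items():
    print(f"ALERT: {name} at {value:.1%}, below target {target:.1%}")
```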

Improved Data Quality: By identifying issues
early, organizations can maintain high
standards of data integrity.
Enhanced Collaboration: Fosters
communication between data generators and
consumers, aligning expectations and
responsibilities for data quality and
governance.
Benefits of Implementing Observability
Tools
Real-Time Monitoring: Provides continuous
insights into data pipelines, allowing for
immediate detection of anomalies or
disruptions.
Data Lineage Tracking: Enables users to trace
the origin and transformation of data, ensuring
transparency and accountability in data
management.
Key Features of Effective Tools
Enhances the ability to monitor and understand
data flows within an organization.
Facilitates quick identification of data quality
issues, ensuring reliable data for decision-
making.
Importance of Data Observability
Data Observability Tools
105

Defining Ownership and
Responsibilities
Clearly defining ownership and responsibilities
related to data contracts is essential.
Ensures accountability among data generators.
Fosters a culture of reliability and trust in data
management practices.
Setting Clear Consumer Expectations
Crucial for data generators to set explicit
expectations for data consumers.
Helps avoid assumptions about performance
and dependability.
Prevents loss of trust if expectations are
unmet.
Timeliness Measurement Example
Track the time difference between record
creation and availability for querying.
Suggested SLO: records should become available for querying within one hour of creation.
Ensures data is accessible when needed.
Establishing Service-Level Objectives
(SLOs)
SLOs are derived from SLIs and expressed as
percentages over time.
They define the expectations for data
consumers.
Help data consumers understand the
timeliness and reliability of the data provided.
Understanding Service-Level Indicators
(SLIs)
SLIs are direct measurements of system
performance from the user's perspective.
They should be continuously monitored to
ensure data quality and reliability.
Immediate alerts should be raised for any unhealthy indicators.
Performance and Dependability Monitoring
106
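A sketch of the timeliness example above: the SLI is the lag between record creation and availability for querying, checked against a one-hour target; over a reporting window the SLO would be expressed as the percentage of records meeting that target. The timestamps are placeholders for values read from the pipeline.

```python
# Sketch: a timeliness SLI — lag between record creation and availability for
# querying — checked against a one-hour target. Timestamps are placeholders.
from datetime import datetime, timedelta, timezone

SLO_MAX_LAG = timedelta(hours=1)

def timeliness_sli(created_at: datetime, available_at: datetime) -> timedelta:
    """Lag between when a record was created and when it became queryable."""
    return available_at - created_at

created = datetime(2025, 1, 16, 9, 0, tzinfo=timezone.utc)
available = datetime(2025, 1, 16, 9, 40, tzinfo=timezone.utc)

lag = timeliness_sli(created, available)
print(f"lag={lag}, within SLO: {lag <= SLO_MAX_LAG}")
```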

Increased Database Load: The pattern
introduces an additional write operation in the
critical path, which can affect performance.
Separate Event Processing: Requires a
separate process to listen to events from the
outbox table, adding complexity to the
architecture.
Considerations and Drawbacks
Transactional Guarantees: Ensures that writes
to both the application and outbox tables occur
within the same database transaction, allowing
for rollback in case of failures.
Decoupling Event Structure: Events published
do not need to match the application's
database structure, providing flexibility for data
generators to evolve their internal models
without impacting downstream consumers.
Key Benefits
A design approach used in event-driven
architectures to ensure data consistency
across services and databases.
Involves creating an outbox table within the
application's database to log events alongside
changes made to the main application tables.
Overview of the Pattern
Transactional Outbox Pattern
107
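A minimal sketch of the pattern using SQLite from the standard library: the application row and the outbox event are written in one transaction, so either both commit or both roll back. Table, column, and event names are illustrative.

```python
# Sketch: transactional outbox — the application row and the outbox event are
# written in the same database transaction. Uses SQLite for brevity.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def create_customer(customer_id: str, email: str) -> None:
    with conn:  # one transaction: both inserts commit, or both roll back
        conn.execute("INSERT INTO customers (id, email) VALUES (?, ?)",
                     (customer_id, email))
        event = {"customer_id": customer_id, "email": email}  # shaped for the contract
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("CustomerCreated", json.dumps(event)))

create_customer("42", "ada@example.com")
print(conn.execute("SELECT event_type, payload FROM outbox").fetchall())
```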

The outbox pattern can improve application performance by
queuing events locally and enabling batch processing into data
warehouses or event streaming platforms.
However, it introduces additional load on the database due to the
extra write operation, necessitating careful management of
database resources.
Performance Considerations
Events can be generated using the context available at the time of
the change, allowing for alignment with data contracts required by
downstream consumers.
The structure of the events does not need to match the database
schema, promoting adaptability to evolving business needs.
Event Generation Process
Events are written to an outbox table within the same database
transaction as the main application changes.
This mechanism allows for rollback in case of failures, ensuring
data consistency and integrity.
Transactional Guarantees
A design approach that decouples event generation from the
application's database structure, enhancing flexibility in data
management.
Ensures that events are published reliably while maintaining
transactional integrity across services.
Overview of the Outbox Pattern
Event Generation in Outbox Pattern
108

The outbox pattern allows data generators to
focus on their primary services without
managing separate data transformation
processes.
Reduces the maintenance burden and
streamlines workflows, enabling teams to
operate more efficiently and effectively.
Simplified Maintenance
By utilizing the outbox pattern, data generators
can reliably publish events to data consumers.
Minimizes the risk of data loss or
inconsistency, as events are only sent after
successful writes to the internal models.
Enhanced Data Reliability
Ensures that writes to both the application
database and the outbox table occur within the
same transaction.
If one write fails, the entire transaction is rolled
back, maintaining data integrity and
consistency.
Transactional Guarantees
Performance Improvement with Outbox Pattern
109
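Continuing the SQLite sketch above, this is one way the separate relay process could drain the outbox: poll for unpublished rows, send them in a batch, and mark them published. `publish_batch` is a stand-in for whatever event streaming or warehouse client is in use.

```python
# Sketch: relay process that drains the outbox in batches. The table layout
# matches the outbox sketch above; `publish_batch` is a placeholder client.
import json

def publish_batch(events: list) -> None:
    """Placeholder for sending events to Kafka, Pub/Sub, a warehouse, etc."""
    for event in events:
        print("publishing", event)

def drain_outbox(conn, batch_size: int = 100) -> None:
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?", (batch_size,)).fetchall()
    if not rows:
        return
    publish_batch([{"type": row[1], **json.loads(row[2])} for row in rows])
    with conn:  # mark the batch as published only after a successful send
        conn.executemany("UPDATE outbox SET published = 1 WHERE id = ?",
                         [(row[0],) for row in rows])
```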

The effectiveness of the outbox pattern hinges on the reliability of
database transactions.
If a service fails to write to its internal models or the outbox, it can
lead to inconsistencies in data availability.
This dependency can create challenges in ensuring data integrity,
especially in high-availability environments where transaction failures
may occur.
Dependency on Database Transactions
The transactional outbox pattern requires additional infrastructure to
manage the outbox table, which can complicate the architecture of the
system.
This added complexity may lead to increased development time and
potential integration challenges with existing systems.
Complexity in Implementation
Drawbacks of Outbox Pattern
110

Many organizations already have the
transactional outbox pattern implemented,
making it easier to generate events using
existing tooling and libraries.
This integration reduces the need for additional
infrastructure and leverages current
capabilities to enhance data management
processes.
Integration with Existing Tooling
By utilizing the outbox pattern, organizations
can effectively generate events for data
contracts.
This method allows for the separation of
concerns, where the service writes to its
database and an outbox table, from which
events are later sent to the data contract-
backed interface.
Facilitating Event Generation
The outbox pattern guarantees that all writes
occur within the same database transaction.
If a service fails to write to its internal models,
an event will not be written to the outbox,
ensuring consistency between the source
system and the data consumer's interface.
Ensuring Data Consistency
Popularity of Outbox Pattern
111

Conclusion and Next Steps
112
Establishing robust data governance
frameworks and implementing service
level objectives (SLOs) will ensure data
integrity and reliability, ultimately driving
business value and fostering a data-
driven culture.
Implement Data Governance and Quality
Checks
Strengthening the partnership between
data generators and consumers through
clear communication and data contracts
will lead to better-defined requirements
and improved data quality.
Foster Collaboration Between Data
Generators and Consumers
Adopting a data product mindset is
crucial for organizations to enhance
data quality and accessibility, ensuring
that data products are treated with the
same rigor as traditional products.
Emphasize Data Product Mindset