4070949. 89-Test-12-File.pdf

raypoll198 32 views 18 slides Jul 15, 2024

About This Presentation

Data engineering


Data Engineering - Best Practices
Suraj Acharya, Director, Engineering
Singh Garewal, Director, Marketing

Data Engineering Drivers
Advanced analytics / ML coming of age
Industry-spanning adoption
Technology innovation: hardware, cloud and storage
Increased financial scrutiny
Role evolution: CDO, Data Curator
...

VISION
Accelerate innovation by unifying data science, engineering and business
WHO WE ARE
•Original creators of Apache Spark and Databricks Delta
•2000+ global companies use our platform across the big data & machine learning lifecycle
SOLUTION
Unified Analytics Platform

Apache Spark: The 1st Unified Analytics Engine

Runtime
Delta
Spark Core Engine

Big Data Processing: ETL + SQL + Streaming
Machine Learning: MLlib + SparkR
Uniquely combined Data & AI technologies

Databricks Delta
Adds data reliability and performance to data lakes

●Co-designed compute & storage

●Compatible with Spark APIs

●Built on open standards (Parquet)

Databricks Delta
Indexes & Stats
Transactional Log
Versioned Parquet Files
Leverages your cloud blob storage
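
To make the pairing of a transactional log with versioned Parquet files concrete, here is a minimal sketch. It assumes a simplified log in which each commit is a list of ("add", path) or ("remove", path) actions; the action names, file names and the snapshot function are illustrative assumptions, not the actual Databricks Delta format.

```python
def snapshot(log, version=None):
    """Replay commits 0..version (all commits if None) and return the live files."""
    live = set()
    for v, actions in enumerate(log):
        if version is not None and v > version:
            break
        for op, path in actions:
            if op == "add":
                live.add(path)
            elif op == "remove":
                live.discard(path)
    return live

# Hypothetical log: two appends, then a rewrite that replaces both files.
log = [
    [("add", "part-0.parquet")],
    [("add", "part-1.parquet")],
    [("remove", "part-0.parquet"), ("remove", "part-1.parquet"), ("add", "part-2.parquet")],
]
print(sorted(snapshot(log)))      # files visible at the latest version
print(sorted(snapshot(log, 1)))   # files visible at version 1
```

Because old files are only marked as removed in the log, an earlier version can still be read by replaying fewer commits.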

Data Engineering Playing Field
Message Log
Dashboarding/Reporting/BI
Storage
Data Model
Data Catalog/Lineage
Compute: ETL, analytics, ML
Sandbox
Orchestration and Workflow
CI/CD
Data Quality

Data Model
What
Data organization and the relation of the different top-level data sets to each other.

How
•Audience segmentation
•Table categorization
•Data types
•Modeling discipline

Data Catalog + Lineage
What
Easy discovery of data sets
Policy enforcement

How
•Explore data model
•Search + suggestions
•Column and table annotations and grouping
•Lineage tracking
•Automatic flagging of PII + sensitive columns
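
One simple way to approach automatic flagging is to match column names against a list of patterns. The sketch below assumes a hypothetical pattern list and helper name; a real catalog would likely also inspect column values and annotations.

```python
import re

# Hypothetical name patterns, chosen for illustration only.
PII_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"e[-_]?mail", r"phone", r"ssn", r"(first|last|full)_?name", r"address", r"birth")
]

def flag_pii_columns(columns):
    """Return the column names that match any PII pattern, in input order."""
    return [c for c in columns if any(p.search(c) for p in PII_PATTERNS)]

schema = ["user_id", "email", "signup_ts", "phone_number", "last_name", "country"]
print(flag_pii_columns(schema))
```

Name matching is cheap but produces false positives (for example, ip_address matches the address pattern), so flagged columns are best treated as candidates for review.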

Storage Architecture
What
Where data is stored and in what formats.

How
•Columnar formats
•Minimize metadata lookups
•Compaction
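
Compaction can be pictured as grouping many small files into fewer, larger ones. The following is a minimal planning sketch, assuming file sizes are already known; the 128 MB target and the plan_compaction name are assumptions for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files, smallest first, into batches of roughly target_mb.

    A single file larger than target_mb ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes_mb.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Hypothetical small files and their sizes in MB.
files = {"a": 40, "b": 50, "c": 60, "d": 100}
print(plan_compaction(files))
```

Each batch would then be rewritten as one file, which also reduces the number of metadata lookups per query.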

Message Log
What
Source of streaming and batch data.

How
•Read logs into “raw” tables with minimal preprocessing
•Firehose
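
As a sketch of "minimal preprocessing", the loader below stores each log line unparsed and only attaches ingestion metadata. The column names (source, ingested_at, offset, payload) and the function name are assumptions for illustration.

```python
from datetime import datetime, timezone

def to_raw_rows(lines, source, ingested_at=None):
    """Wrap each log line in a row without parsing or validating the payload."""
    ingested_at = ingested_at or datetime.now(timezone.utc).isoformat()
    return [
        {"source": source, "ingested_at": ingested_at, "offset": i, "payload": line.rstrip("\n")}
        for i, line in enumerate(lines)
    ]

# Malformed lines are kept as-is; parsing is left to downstream jobs.
rows = to_raw_rows(['{"user": 1}\n', "not json\n"], source="clicks")
for row in rows:
    print(row["offset"], row["payload"])
```

Keeping the payload untouched means a parsing bug can be fixed later by reprocessing the raw table instead of re-reading the message log.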

Sandbox
What
Isolated environment for experimentation and exploration.

How
•Notebook collaboration
•Tracking
•Management
•Source control

Compute / Data Processing


What
Execution engine used to process data.
Layer where “jobs” run.

How
•Multiple frameworks and languages
•SQL compatibility
•Connectors for your data-sources
•Less data scanned => faster job execution
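
To illustrate why scanning less data helps, the sketch below compares a full scan with a scan that skips irrelevant partitions. It assumes an in-memory table partitioned by date; the layout and function names are hypothetical stand-ins for what an execution engine does with partitioned files.

```python
def full_scan(partitions, wanted_date):
    """Read every row in every partition, then filter by date."""
    scanned, matches = 0, []
    for rows in partitions.values():
        for row in rows:
            scanned += 1
            if row["date"] == wanted_date:
                matches.append(row)
    return matches, scanned

def pruned_scan(partitions, wanted_date):
    """Read only the partition whose key equals the wanted date."""
    rows = partitions.get(wanted_date, [])
    return list(rows), len(rows)

# Hypothetical table partitioned by date.
partitions = {
    "2024-07-14": [{"date": "2024-07-14", "v": i} for i in range(3)],
    "2024-07-15": [{"date": "2024-07-15", "v": i} for i in range(2)],
}
print(full_scan(partitions, "2024-07-15")[1])    # rows read without pruning
print(pruned_scan(partitions, "2024-07-15")[1])  # rows read with pruning
```

Both scans return the same rows; only the amount of data read differs.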

Orchestration and Workflow
What
Scheduling and triggering jobs
Job Dependencies

How
•“DAG” : Graphical view of job dependencies and status
•Describe dependencies in code
•Retry policies
•Backfill policies

Dashboarding/ Reporting/ BI
What
Static reports and auto-updating
dashboards
Business facing


How
•Static graphs + emailed reports
•Rollups + aggregations
•Data modelling + Data Analyst
•Real-time dashboards
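
Rollups pre-aggregate detailed events so that dashboards read a small summary table instead of raw data. A minimal sketch, assuming events with hypothetical ts, country and amount_cents fields:

```python
from collections import defaultdict

def daily_rollup(events):
    """Aggregate events into per-(day, country) event counts and totals."""
    totals = defaultdict(lambda: {"events": 0, "amount_cents": 0})
    for e in events:
        key = (e["ts"][:10], e["country"])  # first 10 chars of an ISO timestamp are the date
        totals[key]["events"] += 1
        totals[key]["amount_cents"] += e["amount_cents"]
    return dict(totals)

# Hypothetical events; amounts are integer cents to avoid float rounding.
events = [
    {"ts": "2024-07-15T10:00:00", "country": "US", "amount_cents": 500},
    {"ts": "2024-07-15T11:30:00", "country": "US", "amount_cents": 250},
    {"ts": "2024-07-16T09:00:00", "country": "US", "amount_cents": 100},
]
for key, value in daily_rollup(events).items():
    print(key, value)
```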

Quality : Monitoring and Alerting
What
Mechanisms for detecting and fixing incorrect and stale data-sets
Anomaly detection


How
•Monitor job failures
•Prioritization and coalescing
•Emit metrics during and after jobs
•Metrics database + Graphing
•Monitoring dashboards
•Define KPIs and create alerts

CI/CD
What
Development tools and processes

How
•Sandbox queries, job code and workflows in source control
•Deployment process: life of a PR
•Multiple environment support
•Test data sets: sampling, obfuscation, randomization

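Test data sets built by sampling and obfuscation can be sketched with hashing, so the same rows are selected on every run. The salts, the 12-character token length and the function names are assumptions for illustration; salted hashing is a simple stand-in, not a complete anonymization scheme.

```python
import hashlib

def _digest(value, salt):
    """Hex SHA-256 of the salted value."""
    return hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()

def in_sample(key, percent, salt="sample-v1"):
    """Deterministically keep roughly `percent` percent of keys."""
    return int(_digest(key, salt), 16) % 100 < percent

def obfuscate(value, salt="obfuscate-v1"):
    """Replace a sensitive value with a stable, non-reversible token."""
    return _digest(value, salt)[:12]

def build_test_rows(rows, key, sensitive, percent):
    """Sample rows by key and obfuscate the sensitive columns."""
    return [
        {**row, **{col: obfuscate(row[col]) for col in sensitive}}
        for row in rows
        if in_sample(row[key], percent)
    ]

# Hypothetical production rows.
rows = [{"user_id": i, "email": f"user{i}@example.com"} for i in range(1000)]
test_rows = build_test_rows(rows, key="user_id", sensitive=["email"], percent=10)
print(len(test_rows))
```

Hash-based sampling keeps a given user either in or out of the sample across all tables, so joins in the test environment still line up.
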
Questions?
Check out Databricks Delta databricks.com/delta

Thank you