Data Engineering - Best Practices
Suraj Acharya, Director, Engineering
Singh Garewal, Director, Marketing
Data Engineering Drivers
•Advanced analytics / ML coming of age
•Industry-spanning adoption
•Technology innovation: hardware, cloud and storage
•Increased financial scrutiny
•Role evolution: CDO, Data Curator
VISION
Accelerate innovation by unifying data science, engineering and business
WHO WE ARE
•Original creators of Apache Spark, Databricks Delta & MLflow
•2000+ global companies use our platform across the big data & machine learning lifecycle
SOLUTION
Unified Analytics Platform
Apache Spark: The 1st Unified Analytics Engine
Runtime
Delta
Spark Core Engine
Big Data Processing: ETL + SQL + Streaming
Machine Learning: MLlib + SparkR
Uniquely combined Data & AI technologies
Databricks Delta
Adds data reliability and performance to data lakes
Data Engineering Playing Field
•Message Log
•Dashboarding / Reporting / BI
•Storage
•Data Model
•Data Catalog / Lineage
•Compute: ETL, analytics, ML
•Sandbox
•Orchestration and Workflow
•CI/CD
•Data Quality
Data Model
What
How the data is organized and how the top-level data sets relate to each other.
How
•Audience segmentation
•Table categorization
•Data types
•Modeling discipline
Data Catalog + Lineage
What
Easy discovery of data sets
Policy enforcement
How
•Explore data model
•Search + suggestions
•Column and table annotations
and grouping
•Lineage tracking
•Automatic flagging of PII +
sensitive columns
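Automatic flagging can start with something as simple as matching column names. The sketch below assumes a hypothetical pattern list (`PII_PATTERNS`) and a hypothetical helper (`flag_pii_columns`); neither comes from a particular catalog product, and a real catalog would likely also sample column values.

```python
import re

# Hypothetical name patterns; tune these for your own schemas.
PII_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"e[-_]?mail",
        r"phone",
        r"ssn",
        r"(first|last|full)[-_]?name",
        r"address",
        r"birth",
    )
]


def flag_pii_columns(columns):
    """Return the column names that match any PII name pattern."""
    return [c for c in columns if any(p.search(c) for p in PII_PATTERNS)]


flagged = flag_pii_columns(
    ["user_id", "email_address", "signup_ts", "Phone", "last_name"]
)
```

Name matching is cheap but misses badly named columns, so it works best as a first pass that feeds manual annotation.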
Storage Architecture
What
Where data is stored and in what formats.
How
•Columnar formats
•Minimize metadata lookups
•Compaction
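Compaction rewrites many small files into fewer large ones, which also reduces metadata lookups. The planner below is a minimal sketch, assuming file sizes are already known; `plan_compaction` and the greedy grouping are illustrative choices, not the behavior of any specific storage engine.

```python
def plan_compaction(file_sizes, target_bytes):
    """Group small files into batches of at most target_bytes each.

    file_sizes maps file name -> size in bytes. Files already at or above
    the target are left alone. Returns a list of batches (lists of names).
    """
    small = sorted(
        (item for item in file_sizes.items() if item[1] < target_bytes),
        key=lambda item: item[1],
        reverse=True,  # place the largest small files first
    )
    batches, current, current_size = [], [], 0
    for name, size in small:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    # Rewriting a single file on its own gains nothing, so drop those batches.
    return [batch for batch in batches if len(batch) > 1]


plan = plan_compaction({"a": 10, "b": 20, "c": 30, "d": 128, "e": 60}, 100)
```

Each returned batch would be read and rewritten as one larger file; the file `d` is untouched because it already exceeds the target.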
Message Log
What
Source of streaming and batch data.
How
•Read logs into “raw” tables with
minimal preprocessing
•Firehose
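"Minimal preprocessing" can mean storing each log line verbatim next to a little ingestion metadata, so that nothing is lost before downstream jobs parse it. The sketch below assumes JSON-lines input; the function `to_raw_rows` and the row fields are hypothetical names chosen for illustration.

```python
import json
from datetime import datetime, timezone


def to_raw_rows(lines, source, ingested_at=None):
    """Turn log lines into "raw" rows, keeping each payload verbatim."""
    ingested_at = ingested_at or datetime.now(timezone.utc).isoformat()
    rows = []
    for offset, line in enumerate(lines):
        try:
            json.loads(line)
            is_valid = True
        except json.JSONDecodeError:
            # Keep the bad record; downstream jobs decide how to handle it.
            is_valid = False
        rows.append(
            {
                "source": source,
                "offset": offset,
                "ingested_at": ingested_at,
                "payload": line,
                "is_valid_json": is_valid,
            }
        )
    return rows


rows = to_raw_rows(
    ['{"user": 1}', "not json"],
    source="clicks",
    ingested_at="2024-01-01T00:00:00+00:00",
)
```

Keeping malformed records in the raw table, flagged rather than dropped, makes it possible to reprocess them once the parser is fixed.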
Sandbox
What
Isolated environment for
experimentation and exploration.
How
•Notebook collaboration
•Tracking
•Management
•Source control
Compute / Data Processing
What
Execution engine used to process
data.
Layer where “jobs” run.
How
•Multiple frameworks and languages
•SQL compatibility
•Connectors for your data-sources
•Less data scanned => faster job
execution
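One common way to scan less data is partition pruning: skip whole partitions that a query's filter rules out. The sketch below assumes date-partitioned data described by plain dictionaries; the partition layout, paths and `prune_partitions` are hypothetical, and real engines do this inside the query planner.

```python
from datetime import date


def prune_partitions(partitions, start, end):
    """Keep only partitions whose date falls in [start, end]."""
    return [p for p in partitions if start <= p["dt"] <= end]


# Hypothetical partition metadata for a date-partitioned table.
partitions = [
    {"dt": date(2024, 1, 1), "path": "events/dt=2024-01-01", "bytes": 500},
    {"dt": date(2024, 1, 2), "path": "events/dt=2024-01-02", "bytes": 700},
    {"dt": date(2024, 1, 3), "path": "events/dt=2024-01-03", "bytes": 300},
]

selected = prune_partitions(partitions, date(2024, 1, 2), date(2024, 1, 3))
scanned_bytes = sum(p["bytes"] for p in selected)
total_bytes = sum(p["bytes"] for p in partitions)
```

Here the query reads 1000 of 1500 bytes; the saving grows with the number of partitions the filter excludes.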
Orchestration and Workflow
What
Scheduling and triggering jobs
Job Dependencies
How
•“DAG” : Graphical view of job
dependencies and status
•Describe dependencies in code
•Retry policies
•Backfill policies
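Dependencies, retries and backfills can all be described in code. The sketch below uses Python's standard `graphlib` module (3.9+) rather than any particular scheduler; the job names and the helpers `run_with_retries`, `run_dag` and `backfill_dates` are hypothetical.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter  # Python 3.9+


def run_with_retries(task, max_attempts=3):
    """Call task() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure


def run_dag(dependencies, tasks, max_attempts=3):
    """Run tasks in dependency order and return the order executed.

    dependencies maps each job name to the set of jobs it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_retries(tasks[name], max_attempts)
    return order


def backfill_dates(start, end, completed):
    """List the dates in [start, end] that have no completed run."""
    days = (end - start).days + 1
    candidates = (start + timedelta(days=d) for d in range(days))
    return [day for day in candidates if day not in completed]


attempts = {"clean": 0}


def flaky_clean():
    """Fails once, then succeeds, to exercise the retry policy."""
    attempts["clean"] += 1
    if attempts["clean"] < 2:
        raise RuntimeError("transient failure")


dependencies = {"clean": {"ingest"}, "report": {"clean"}}
tasks = {"ingest": lambda: None, "clean": flaky_clean, "report": lambda: None}
order = run_dag(dependencies, tasks)
missing = backfill_dates(date(2024, 1, 1), date(2024, 1, 4), {date(2024, 1, 2)})
```

A production scheduler would add delays between retries, persist run state and run independent jobs in parallel; this sketch only shows the shape of the policies.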
Dashboarding/ Reporting/ BI
What
Static reports and auto-updating
dashboards
Business facing
How
•Static graphs + emailed reports
•Rollups + aggregations
•Data modelling + Data Analyst
•Real-time dashboards
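Rollups precompute the aggregates that dashboards read, so reports do not rescan detail rows. A minimal sketch, assuming events carry an ISO-8601 timestamp string and a numeric amount (the field names `ts` and `amount` and the function `daily_rollup` are hypothetical):

```python
from collections import defaultdict


def daily_rollup(events):
    """Aggregate events into a per-day count and total amount."""
    rollup = defaultdict(lambda: {"count": 0, "total": 0.0})
    for event in events:
        day = event["ts"][:10]  # assumes ISO-8601, e.g. 2024-01-01T09:30:00
        rollup[day]["count"] += 1
        rollup[day]["total"] += event["amount"]
    return dict(rollup)


events = [
    {"ts": "2024-01-01T09:30:00", "amount": 10.0},
    {"ts": "2024-01-01T17:45:00", "amount": 5.0},
    {"ts": "2024-01-02T08:00:00", "amount": 2.5},
]
summary = daily_rollup(events)
```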
Quality : Monitoring and Alerting
What
Mechanisms for detecting and fixing
incorrect and stale data-sets
Anomaly detection
How
•Monitor job failures
•Prioritization and coalescing
•Emit metrics during and after jobs
•Metrics database + Graphing
•Monitoring dashboards
•Define KPIs and create alerts
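Two simple checks cover a lot of ground: a freshness check for stale data sets and a row-count check for anomalies. The sketch below assumes row counts from previous runs are available as a list; the z-score threshold of 3 and the helper names are illustrative assumptions, not a recommended standard.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def is_stale(last_updated, now, max_age):
    """True if the data set has not been updated within max_age."""
    return now - last_updated > max_age


def is_anomalous(history, value, threshold=3.0):
    """True if value is more than `threshold` standard deviations
    away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


def check_table(name, row_counts, latest_count, last_updated, now, max_age):
    """Return a list of alert messages for one table (empty if healthy)."""
    alerts = []
    if is_stale(last_updated, now, max_age):
        alerts.append(f"{name}: stale")
    if is_anomalous(row_counts, latest_count):
        alerts.append(f"{name}: row count anomaly")
    return alerts


alerts = check_table(
    "daily_orders",
    row_counts=[1000, 1020, 980, 1010, 990],
    latest_count=400,
    last_updated=datetime(2024, 1, 8, 12, 0),
    now=datetime(2024, 1, 10, 12, 0),
    max_age=timedelta(hours=24),
)
```

Returning all alerts for a table in one list makes it easier to coalesce them into a single notification instead of paging once per check.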
CI/CD
What
Development tools and processes
How
•Sandbox queries, job code and
workflows in source control.
•Deployment process : life of a PR
•Multiple environment support
•Test data sets: sampling, obfuscation, randomization.
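Test data sets can be derived from production data by sampling rows and obfuscating sensitive fields. The sketch below assumes hash-based sampling on a key column and salted hashing for obfuscation; the helper names are hypothetical, and a salted hash is pseudonymization rather than strong anonymization.

```python
import hashlib


def in_sample(key, percent):
    """Deterministically keep about `percent`% of keys, based on a hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def obfuscate(value, salt):
    """Replace a value with a salted hash token.

    Equal inputs map to equal tokens, so joins on the column still work.
    """
    data = (salt + str(value)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:16]


def make_test_rows(rows, key, sensitive, percent, salt):
    """Sample rows by key and obfuscate the sensitive columns."""
    return [
        {k: obfuscate(v, salt) if k in sensitive else v for k, v in row.items()}
        for row in rows
        if in_sample(row[key], percent)
    ]


rows = [
    {"user_id": 1, "email": "a@example.com", "amount": 10},
    {"user_id": 2, "email": "b@example.com", "amount": 20},
]
test_rows = make_test_rows(
    rows, key="user_id", sensitive={"email"}, percent=100, salt="dev-only-salt"
)
```

Hashing the key, instead of calling a random number generator, keeps the sample stable across runs, so the same users appear in every table sampled this way.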
Questions?
Check out Databricks Delta: databricks.com/delta
Thank you