Fundamentals of Data Engineering covers the core concepts of creating and managing data infrastructure: data ingestion, storage, processing, and pipeline automation with SQL, Python, Hadoop, and cloud platforms. It focuses on building scalable systems that handle data efficiently while maintaining reliability, security, and support for real-time processing.
Slide Content
Fundamentals of Data Engineering
What is Data Engineering?
The process of designing, building, and managing systems for data collection, storage, and processing.
Ensures that data is accurate, readily available, and ready for analysis.
Connects raw data to useful discoveries.
Key Components of Data Engineering
Data Collection – Gathering information from multiple sources.
Data Storage – Utilizing databases, data lakes, and warehouses.
Data Processing – Converting raw data into formats that can be used effectively.
Data Workflow Orchestration – Streamlining the movement of data through automation.
Data Governance & Security – Maintaining compliance and safeguarding data integrity.
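To make these components concrete, here is a minimal sketch of the collect → process → store path in plain Python. The CSV source, field names, and in-memory SQLite target are illustrative assumptions, not tools named on the slides.

```python
import csv
import io
import sqlite3

# Illustrative raw source; in practice this would be an API, log files, etc.
RAW_CSV = """user_id,signup_date,country
1,2024-01-05,US
2,2024-02-11,DE
2,2024-02-11,DE
3,,FR
"""

def collect(raw: str) -> list[dict]:
    """Data Collection: read rows from the raw source."""
    return list(csv.DictReader(io.StringIO(raw)))

def process(rows: list[dict]) -> list[dict]:
    """Data Processing: drop incomplete rows and deduplicate by user_id."""
    seen, clean = set(), []
    for row in rows:
        if row["signup_date"] and row["user_id"] not in seen:
            seen.add(row["user_id"])
            clean.append(row)
    return clean

def store(rows: list[dict], conn: sqlite3.Connection) -> None:
    """Data Storage: persist the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (user_id TEXT, signup_date TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (:user_id, :signup_date, :country)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
store(process(collect(RAW_CSV)), conn)
print(conn.execute("SELECT * FROM users").fetchall())
```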
Data Engineering vs Data Science

Aspect    | Data Engineering                | Data Science
Focus     | Data pipelines & infrastructure | Analysis & modeling
Key Tools | SQL, Spark, Airflow             | Python, ML libraries
Goal      | Reliable data for analytics     | Insights & predictions
Data Storage Technologies
Relational databases (e.g., PostgreSQL and MySQL, queried with SQL).
NoSQL databases (MongoDB and Cassandra).
Data warehouses (BigQuery, Snowflake, and Redshift).
Data lakes (S3 and Delta Lake).
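The split between relational and NoSQL storage can be sketched in a few lines of Python. Here SQLite stands in for a relational server and a plain JSON document stands in for a NoSQL record; both stand-ins and the order data are assumptions for illustration.

```python
import json
import sqlite3

# Relational model: fixed schema, rows and columns (PostgreSQL/MySQL style;
# sqlite3 stands in here so the sketch runs without a server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'acme', 99.50)")
print(conn.execute("SELECT customer, total FROM orders").fetchall())

# Document model: schema-flexible records (MongoDB style), shown here as a
# JSON document rather than a real NoSQL client.
order_doc = {
    "id": 1,
    "customer": "acme",
    "total": 99.50,
    "items": [{"sku": "A-1", "qty": 2}],  # nested data needs no extra table
}
print(json.dumps(order_doc))
```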
Data Processing Frameworks
Batch Processing (ETL) – for example, Apache Spark and Hadoop.
Stream Processing – for example, Apache Kafka and Flink.
Hybrid Approaches – Combining batch & real-time processing.
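As a rough illustration of batch processing, here is a minimal PySpark aggregation. It assumes `pyspark` is installed (`pip install pyspark`); the sales rows and column names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-etl-sketch").getOrCreate()

# Illustrative raw batch of sales records.
sales = spark.createDataFrame(
    [("2024-01-01", "US", 120.0), ("2024-01-01", "DE", 80.0), ("2024-01-02", "US", 200.0)],
    ["date", "country", "amount"],
)

# Transform: aggregate the raw rows into a per-country revenue summary.
summary = sales.groupBy("country").agg(F.sum("amount").alias("revenue"))
summary.show()

spark.stop()
```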
Data Pipeline Orchestration
Workflow automation tools: Apache Airflow, Prefect, Dagster.
Steps in a pipeline:
1. Data ingestion
2. Cleaning & transformation
3. Storage & indexing
4. Delivery to consumers
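A pipeline like this might be wired up in Airflow roughly as follows. This is a sketch assuming Airflow 2.4+ (for the `schedule` argument; older versions use `schedule_interval`), and the DAG id and task bodies are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def step(name: str) -> None:
    # Placeholder task body; a real task would ingest, clean, store, or deliver data.
    print(f"running step: {name}")

with DAG(
    dag_id="pipeline_sketch",       # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingestion", python_callable=step,
                            op_args=["data ingestion"])
    clean = PythonOperator(task_id="transformation", python_callable=step,
                           op_args=["cleaning & transformation"])
    store = PythonOperator(task_id="storage", python_callable=step,
                           op_args=["storage & indexing"])
    deliver = PythonOperator(task_id="delivery", python_callable=step,
                             op_args=["delivery to consumers"])

    # Chain the tasks in the same order as the four steps on the slide.
    ingest >> clean >> store >> deliver
```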
Data Governance & Security
Data Quality – Validation, deduplication, and anomaly detection.
Security – Encryption and access control (IAM).
Compliance – GDPR, HIPAA, SOC 2.
Metadata Management – Classifying data to make it discoverable.
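The data-quality checks above (validation, deduplication, anomaly detection) can be sketched in plain Python. The records, the required field, and the 10× median threshold are illustrative assumptions, not rules from the slides.

```python
from statistics import median

records = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 105.0},
    {"id": 2, "amount": 105.0},    # duplicate
    {"id": 3, "amount": None},     # fails validation
    {"id": 4, "amount": 9999.0},   # anomaly
]

# Validation: reject rows missing required values.
valid = [r for r in records if r["amount"] is not None]

# Deduplication: keep the first occurrence of each id.
seen, unique = set(), []
for r in valid:
    if r["id"] not in seen:
        seen.add(r["id"])
        unique.append(r)

# Anomaly detection: flag amounts far above the median (crude threshold rule).
med = median(r["amount"] for r in unique)
anomalies = [r for r in unique if r["amount"] > 10 * med]

print(unique)
print(anomalies)
```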
Tools & Technologies in Data Engineering
Data Storage: PostgreSQL, MongoDB, and Snowflake.
Processing: Spark, Flink, and dbt.
Orchestration: Airflow and Prefect.
Cloud Platforms: AWS, GCP, and Azure.