All code and step-by-step instructions provided
Basic SQL knowledge required
Azure Account
Basic Python programming knowledge required
Cloud fundamentals will be beneficia
Size: 2.85 MB
Language: en
Added: Jun 02, 2024
Slides: 77 pages
Slide Content
Introduction to Azure Synapse Analytics
Azure Synapse Analytics is a limitless analytics
service that brings together data integration,
enterprise data warehousing and big data
analytics.
Azure Synapse Analytics -Introduction
Azure Synapse Analytics is a limitless analytics
servicethat brings together data integration,
enterprise data warehousing and big data
analytics.
Azure Synapse Analytics -Introduction
Azure Synapse Analytics is a limitless analytics
service that brings together data integration,
enterprise data warehousing and big data
analytics.
Azure Synapse Analytics -Introduction
Azure Synapse Analytics -Introduction
Data Warehouse
Emergence of Data Lakes
Modern Data Warehouse Architecture
Modern Data Warehouse -With Azure Synapse Analytics
Modern Data Warehouse –Without Azure Synapse Analytics
Data Warehouse
ETL
Data Consumers
Data Sources
Operational
Data
External DataData Warehouse
/ Mart
Data Warehouse
Lack of support for unstructured data
Longer to ingest new data
Proprietary data formats
Scalability
Expensive to store data
Lack of support for ML/ AI workloads
Data Lake
Ingest
Data Sources
Operational
Data
External Data
Transform
Data Science/ ML
workloads
Data LakeData Lake
BI Reports
Data Warehouse /
Mart
Modern Data Warehouse
Data Sources
Operational
Data
External Data
IngestExplore &
Prepare
Transform
& Enrich
Model &
Serve
Visualize
Data Lake Storage
Azure Data
Factory
ADF –Data
Flows
Azure
Databricks
Azure SQL
Data
Warehouse
Power BI
Azure Data Lake Gen2
ADF –Data
Flows
Azure
Databricks
Data
Visualization
Data
Integration
Storage
Modern Data Warehouse
Compute
Azure
DatabricksADF Data Flows
Azure SQL Data
Warehouse
Azure Data
Factory
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Modern Data Warehouse
Too many services/ workspaces
Difficult to monitor
Management & Security overhead
No Serverless option
Metadata not shared
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Azure
DatabricksADF Data Flows
Azure SQL Data
Warehouse
Azure Data
Factory
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Azure
Databricks
Synapse Data
Flows
Azure SQL Data
Warehouse
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Azure SQL Data
Warehouse
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta store
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Azure Synapse Analytics –Preview Features
Data Explorer Pool
Synapse Link for SQL Server 2022
Many more features
Azure Synapse Analytics
Azure Synapse Analytics is a limitless analytics servicethat brings together
data integration, enterprise data warehousing and big data analytics.
Create Synapse Analytics Workspace
Lab
Create Synapse Analytics Workspace
User Subscription
Azure Managed Resource GroupUser Managed Resource Group
Synapse Workspace
Primary ADLS Gen2
Storage Account
Role -Storage Blob
Data Contributor
Serverless SQL
Pool
Workspace
Container
SQL Admin User
Create Synapse Analytics Workspace
Project Overview
What is NYC Taxi
NYC Taxi data source & datasets
Prepare the data for the project
Project Requirements
Solution Architecture
Data Overview –NYC Taxi Trips
Data Overview –NYC Taxis
Yellow Taxis
Data Overview –NYC Taxis
Yellow Taxis
Green Taxis
Data Overview –NYC Taxis
Yellow Taxis
For Hire Vehicles
Green Taxis
High Volume For Hire Vehicles
Data Overview –NYC Taxis
Yellow Taxis
For Hire Vehicles
Green Taxis
High Volume For Hire Vehicles
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Trip Data
Taxi ZoneTrip Type
Payment Type
Vendor
NYC Taxi Data Files Overview
Rate Code
Calendar
Trip Data
Taxi ZoneTrip Type
Payment Type
Vendor
NYC Taxi Data Files Overview
Rate Code
Calendar
CSV
CSV
TSV
JSON
JSON
CSV
Quoted
Parquet
CSV
Delta
Import NYC Taxi Data to Data Lake
Project
Requirements
Data exploration capability on the raw data
Discovery using pay-per-query model
Discovery using T-SQL
Schema applied to the raw data
Data Discovery
Ingested data to be stored as Parquet
Ingestion using pay-per-query model
Ability to query the ingested data using SQL
Ingested data to be stored as tables/ views
Data Ingestion
Join the key information required for reporting to
create a new table.
Transformed data must be stored in columnar
format (i.e., Parquet)
Must be able to analyze the transformed data
via T-SQL
Data Transformation
Join the key information required for Analysis to
create a new table.
Scheduled to run at regular interval
Scheduling Requirements
Ability to monitor pipelines
Ability to re-run failed pipelines
Ability to set-up alerts on failures
Azure Synapse
Data
Visualization
Data
Integration
Storage
Solution Architecture
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Report
Solution Architecture –Serverless SQL Pool
Bronze
Layer
Silver
Layer
Gold
Layer
IngestTransformPresent
NYC Taxi
Data
Discovery
Analysts
Synapse Pipelines
Report
Solution Architecture –Spark Pool
Bronze
Layer
Silver
Layer
Gold
Layer
Present
NYC Taxi
Data
Discovery
Analysts
TransformIngest
Synapse Pipelines
Report
Solution Architecture –Dedicated SQL Pool
Bronze
Layer
Silver
Layer
Gold
LayerNYC Taxi
Data
Discovery
Analysts
TransformPresentIngest
Synapse Pipelines
Azure Synapse AnalyticsAzure Cosmos DB Container
Auto
Sync
Transactional Store
Row store optimized
for transactional
reads and writes
Analytical Store
Column store
optimized for
analytical queries
Spark Pool
Serverless SQL
Pool
Machine Learning
Bigdata Analytics
BI Reporting
Azure Synapse Link
/ HTAP
Solution Architecture -Synapse Link
Transactional
Data
Data Exploration
Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure SynapseAnalytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Section Overview –Serverless SQL Pool
Serverless SQL Pool Architecture
Cost Control
Connecting to Azure Data Studio
T-SQL Support
Features & Use Cases
Serverless SQL Pool
Azure Synapse
Data
Visualization
Data
Integration
Storage
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Serverless SQL Pool
Serverless SQL Pool
Serverless SQL pool is a serverless distributed
query engine that you can use to query data over
your data lake using T-SQL.
Serverless SQL Pool
Serverless SQL pool is a serverlessdistributed
query engine that you can use to query data over
your data lake using T-SQL.
Serverless SQL Pool
Serverless SQL pool is a serverless distributed
query enginethat you can use to query data over
your data lake using T-SQL.
Compute
Polaris -Distributed
Query Processing
Engine
Azure Storage
Control Node
Compute NodeCompute NodeCompute Node
Serverless SQL Pool -Architecture
User or
Application
T-SQL
https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf
Serverless SQL Pool –Key Features
Serverless
Distributed query engine
Query using T-SQL
Robust
Pay-per-query pricing model
Not a storage
Synapse Link
Query spark tables
Serverless SQL Pool –Supported Data Sources
SQL API
Parquet
Delta Lake
Cosmos
Dataverse
Azure Storage Account
SQL Server 2022 (Preview)
MongoDB API
Delimited –CSV, TSV etc
JSON
Serverless SQL Pool –Use Cases
Discovery & Exploration
Logical Data Warehouse
Data Transformation
Serverless SQL Pool –Who is it far?
Data Engineers
Data Scientists
Data Analysts / BI Developers
Serverless SQL Pool
Cost Management
Cost Calculation
Amount of data read from storage
Billed for Data ProcessedAmount of data in intermediate results
Amount of data written to storage
Data Processed rounded to the nearest MB
Minimum of 10MB per query
Currently $6.25 per 1TB
Cost Control
Serverless SQL Pool
Cost Management
Lab
Azure Data Studio
Section Overview -Query CSV Files
Header Row
Field Terminator
Row Terminator
Quoted Files and escaping characters
Query CSV Files
OPENROWSET function overview
Query using OPENROWSET function
Data Types & Collations
Query subset of columns
Quoted strings & Escape Char
Query Tab Separated Values (TSV) file
Serverless SQL
Pool
OPENROWSET Function
SELECT *
FROM OPENROWSET(BULK ‘blob file path‘,
FORMAT = [‘CSV’ |
'PARQUET’ |
‘DELTA’]
) AS [file]
Azure Storage
User/
Application
OPENROWSET
Select
Section Overview -Query CSV Files
Line Delimited JSON
Section Overview -Query CSV Files
Single Line JSON
Standard JSON
Section Overview -Query CSV Files
Single Line JSON
Standard JSON
Classic JSON
Serverless SQL Pool –T-SQL Support
Databases
Schemas
Stored Procedures
Views
Inline table value functions
External Resources –data sources, file
formats and tables
Supported
Serverless SQL Pool –T-SQL Support
Tables
Triggers
DDL statements other than ones related
to views and security
Materialized views
DML statements
Not Supported
Serverless SQL Pool –T-SQL Support
Logins and users
Credentials to control access to storage accounts
Grant, deny, and revoke permissions per object level
Azure Active Directory integration
Security
Serverless SQL Pool –T-SQL Support
CETAS -CREATE EXTERNAL TABLE AS SELECT
Extension to OPENROWSET to aid querying data in
data lake
Additional
Features
Serverless SQL Pool –Monitoring Queries
Lab
Discovery & Exploration
Lab
Serverless SQL –Data Discovery
Identify the volume of the data
Qualify of data
Duplicates
Missing values
Invalid data
Ability join datasets (e.g. keys exist)
Ability to get business value
Total record count
Record count per day/ Week/ Month
Right columns exist
Transformations
Aggregations
Identify additional data required
Serverless SQL –Data Discovery
Identify duplicates in data
Check for missing data values
Invalid/ Unexpected data in columns
Join data from multiple files
Summarize/ Aggregate data
Apply some transforms