introduction to azure synapse analytics.

GravenGuan 94 views 77 slides Jun 02, 2024
Slide 1
Slide 1 of 77
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57
Slide 58
58
Slide 59
59
Slide 60
60
Slide 61
61
Slide 62
62
Slide 63
63
Slide 64
64
Slide 65
65
Slide 66
66
Slide 67
67
Slide 68
68
Slide 69
69
Slide 70
70
Slide 71
71
Slide 72
72
Slide 73
73
Slide 74
74
Slide 75
75
Slide 76
76
Slide 77
77

About This Presentation

All code and step-by-step instructions provided
Basic SQL knowledge required
Azure Account
Basic Python programming knowledge required
Cloud fundamentals will be beneficia


Slide Content

Introduction to Azure Synapse Analytics

Azure Synapse Analytics is a limitless analytics
service that brings together data integration,
enterprise data warehousing and big data
analytics.
Azure Synapse Analytics -Introduction

Azure Synapse Analytics is a limitless analytics
servicethat brings together data integration,
enterprise data warehousing and big data
analytics.
Azure Synapse Analytics -Introduction

Azure Synapse Analytics is a limitless analytics
service that brings together data integration,
enterprise data warehousing and big data
analytics.
Azure Synapse Analytics -Introduction

Azure Synapse Analytics -Introduction
Data Warehouse
Emergence of Data Lakes
Modern Data Warehouse Architecture
Modern Data Warehouse -With Azure Synapse Analytics
Modern Data Warehouse –Without Azure Synapse Analytics

Data Warehouse
ETL
Data Consumers
Data Sources
Operational
Data
External DataData Warehouse
/ Mart

Data Warehouse
Lack of support for unstructured data
Longer to ingest new data
Proprietary data formats
Scalability
Expensive to store data
Lack of support for ML/ AI workloads

Data Lake
Ingest
Data Sources
Operational
Data
External Data
Transform
Data Science/ ML
workloads
Data LakeData Lake
BI Reports
Data Warehouse /
Mart

Modern Data Warehouse
Data Sources
Operational
Data
External Data
IngestExplore &
Prepare
Transform
& Enrich
Model &
Serve
Visualize
Data Lake Storage
Azure Data
Factory
ADF –Data
Flows
Azure
Databricks
Azure SQL
Data
Warehouse
Power BI
Azure Data Lake Gen2
ADF –Data
Flows
Azure
Databricks

Data
Visualization
Data
Integration
Storage
Modern Data Warehouse
Compute
Azure
DatabricksADF Data Flows
Azure SQL Data
Warehouse
Azure Data
Factory
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security

Modern Data Warehouse
Too many services/ workspaces
Difficult to monitor
Management & Security overhead
No Serverless option
Metadata not shared

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Azure
DatabricksADF Data Flows
Azure SQL Data
Warehouse
Azure Data
Factory
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Azure
Databricks
Synapse Data
Flows
Azure SQL Data
Warehouse
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Azure SQL Data
Warehouse
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta store

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure Synapse Analytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link

Azure Synapse Analytics –Preview Features
Data Explorer Pool
Synapse Link for SQL Server 2022
Many more features

Azure Synapse Analytics
Azure Synapse Analytics is a limitless analytics servicethat brings together
data integration, enterprise data warehousing and big data analytics.

Create Synapse Analytics Workspace
Lab

Create Synapse Analytics Workspace

User Subscription
Azure Managed Resource GroupUser Managed Resource Group
Synapse Workspace
Primary ADLS Gen2
Storage Account
Role -Storage Blob
Data Contributor
Serverless SQL
Pool
Workspace
Container
SQL Admin User
Create Synapse Analytics Workspace

Project Overview
What is NYC Taxi
NYC Taxi data source & datasets
Prepare the data for the project
Project Requirements
Solution Architecture

Data Overview –NYC Taxi Trips

Data Overview –NYC Taxis
Yellow Taxis

Data Overview –NYC Taxis
Yellow Taxis
Green Taxis

Data Overview –NYC Taxis
Yellow Taxis
For Hire Vehicles
Green Taxis
High Volume For Hire Vehicles

Data Overview –NYC Taxis
Yellow Taxis
For Hire Vehicles
Green Taxis
High Volume For Hire Vehicles
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Trip Data
Taxi ZoneTrip Type
Payment Type
Vendor
NYC Taxi Data Files Overview
Rate Code
Calendar

Trip Data
Taxi ZoneTrip Type
Payment Type
Vendor
NYC Taxi Data Files Overview
Rate Code
Calendar
CSV
CSV
TSV
JSON
JSON
CSV
Quoted
Parquet
CSV
Delta

Import NYC Taxi Data to Data Lake

Project
Requirements

Data exploration capability on the raw data
Discovery using pay-per-query model
Discovery using T-SQL
Schema applied to the raw data
Data Discovery

Ingested data to be stored as Parquet
Ingestion using pay-per-query model
Ability to query the ingested data using SQL
Ingested data to be stored as tables/ views
Data Ingestion

Join the key information required for reporting to
create a new table.
Transformed data must be stored in columnar
format (i.e., Parquet)
Must be able to analyze the transformed data
via T-SQL
Data Transformation
Join the key information required for Analysis to
create a new table.

Taxi Demand
Reporting Requirements
Credit Card Campaign
Operational Reporting

Scheduled to run at regular interval
Scheduling Requirements
Ability to monitor pipelines
Ability to re-run failed pipelines
Ability to set-up alerts on failures

Azure Synapse
Data
Visualization
Data
Integration
Storage
Solution Architecture
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link

Report
Solution Architecture –Serverless SQL Pool
Bronze
Layer
Silver
Layer
Gold
Layer
IngestTransformPresent
NYC Taxi
Data
Discovery
Analysts
Synapse Pipelines

Report
Solution Architecture –Spark Pool
Bronze
Layer
Silver
Layer
Gold
Layer
Present
NYC Taxi
Data
Discovery
Analysts
TransformIngest
Synapse Pipelines

Report
Solution Architecture –Dedicated SQL Pool
Bronze
Layer
Silver
Layer
Gold
LayerNYC Taxi
Data
Discovery
Analysts
TransformPresentIngest
Synapse Pipelines

Azure Synapse AnalyticsAzure Cosmos DB Container
Auto
Sync
Transactional Store
Row store optimized
for transactional
reads and writes
Analytical Store
Column store
optimized for
analytical queries
Spark Pool
Serverless SQL
Pool
Machine Learning
Bigdata Analytics
BI Reporting
Azure Synapse Link
/ HTAP
Solution Architecture -Synapse Link
Transactional
Data
Data Exploration

Azure Synapse
Data
Visualization
Data
Integration
Storage
Azure SynapseAnalytics
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link

Section Overview –Serverless SQL Pool
Serverless SQL Pool Architecture
Cost Control
Connecting to Azure Data Studio
T-SQL Support
Features & Use Cases

Serverless SQL Pool

Azure Synapse
Data
Visualization
Data
Integration
Storage
Compute
Synapse Data
Flows
Dedicated SQL
Pool
Synapse
Pipelines
Azure Data Lake
Storage Gen2
Power BI
Development / Monitoring / Management & Security
Spark Pool
Serverless SQL
Pool
Meta storeSynapse Link
Serverless SQL Pool

Serverless SQL Pool
Serverless SQL pool is a serverless distributed
query engine that you can use to query data over
your data lake using T-SQL.

Serverless SQL Pool
Serverless SQL pool is a serverlessdistributed
query engine that you can use to query data over
your data lake using T-SQL.

Serverless SQL Pool
Serverless SQL pool is a serverless distributed
query enginethat you can use to query data over
your data lake using T-SQL.

Compute
Polaris -Distributed
Query Processing
Engine
Azure Storage
Control Node
Compute NodeCompute NodeCompute Node
Serverless SQL Pool -Architecture
User or
Application
T-SQL
https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf

Serverless SQL Pool –Key Features
Serverless
Distributed query engine
Query using T-SQL
Robust
Pay-per-query pricing model
Not a storage
Synapse Link
Query spark tables

Serverless SQL Pool –Supported Data Sources
SQL API
Parquet
Delta Lake
Cosmos
Dataverse
Azure Storage Account
SQL Server 2022 (Preview)
MongoDB API
Delimited –CSV, TSV etc
JSON

Serverless SQL Pool –Use Cases
Discovery & Exploration
Logical Data Warehouse
Data Transformation

Serverless SQL Pool –Who is it far?
Data Engineers
Data Scientists
Data Analysts / BI Developers

Serverless SQL Pool
Cost Management

Cost Calculation
Amount of data read from storage
Billed for Data ProcessedAmount of data in intermediate results
Amount of data written to storage
Data Processed rounded to the nearest MB
Minimum of 10MB per query
Currently $6.25 per 1TB

Cost Control

Serverless SQL Pool
Cost Management
Lab

Azure Data Studio

Section Overview -Query CSV Files
Header Row
Field Terminator
Row Terminator
Quoted Files and escaping characters

Query CSV Files
OPENROWSET function overview
Query using OPENROWSET function
Data Types & Collations
Query subset of columns
Quoted strings & Escape Char
Query Tab Separated Values (TSV) file
Serverless SQL
Pool

OPENROWSET Function
SELECT *
FROM OPENROWSET(BULK ‘blob file path‘,
FORMAT = [‘CSV’ |
'PARQUET’ |
‘DELTA’]
) AS [file]
Azure Storage
User/
Application
OPENROWSET
Select

Section Overview -Query CSV Files
Line Delimited JSON

Section Overview -Query CSV Files
Single Line JSON
Standard JSON

Section Overview -Query CSV Files
Single Line JSON
Standard JSON
Classic JSON

Query JSON Files
OPENROWSET
CSV Parser
JSON_VALUE
OPENJSON
Line-delimited JSON
FIELDTERMINATOR –0x0b
FIELDQUOTE–0x0b
Standard JSON
FIELDTERMINATOR –0x0b
FIELDQUOTE–0x0b
ROWTERMINATOR –0x0b

Serverless SQL Pool –T-SQL Support
Databases
Schemas
Stored Procedures
Views
Inline table value functions
External Resources –data sources, file
formats and tables
Supported

Serverless SQL Pool –T-SQL Support
Tables
Triggers
DDL statements other than ones related
to views and security
Materialized views
DML statements
Not Supported

Serverless SQL Pool –T-SQL Support
Logins and users
Credentials to control access to storage accounts
Grant, deny, and revoke permissions per object level
Azure Active Directory integration
Security

Serverless SQL Pool –T-SQL Support
CETAS -CREATE EXTERNAL TABLE AS SELECT
Extension to OPENROWSET to aid querying data in
data lake
Additional
Features

Serverless SQL Pool –Monitoring Queries
Lab

Discovery & Exploration
Lab

Serverless SQL –Data Discovery
Identify the volume of the data
Qualify of data
Duplicates
Missing values
Invalid data
Ability join datasets (e.g. keys exist)
Ability to get business value
Total record count
Record count per day/ Week/ Month
Right columns exist
Transformations
Aggregations
Identify additional data required

Serverless SQL –Data Discovery
Identify duplicates in data
Check for missing data values
Invalid/ Unexpected data in columns
Join data from multiple files
Summarize/ Aggregate data
Apply some transforms
Tags