Migrating Mapping Dataflows
From Synapse Analytics to Spark Notebooks
Johan Kangasniemi | 2024-09-13
Who am I? – Johan’s introduction
•Swedish with Sami origin
•First time at Data:Scotland
•First time presenting (!) – this all started at SQLBits ‘24!
•Background originally in SQL Server
•Working within Azure (almost exclusively) since 2017
•Data and Analytics Lead in Oil and Gas in Aberdeen
•Co-organiser of Data Platform and Cloud Aberdeen (DPaCAbz)
•https://www.meetup.com/dpacabz
•Living in Aberdeen with my wife and Marzipan, the 21-year-old cat
Quick overview: Mapping Dataflows
•Visually designed data transformations
•Drag and Drop, Point and Click
•Azure Data Factory and/or Azure Synapse Pipelines
•Low/No Code option for Data Engineering
•Flowlets can be very neat for standardised, quick data engineering
•Runs on Azure Data Factory managed clusters
•Code translation, path optimisation and execution performed for you
•Uses scaled-out Apache Spark Clusters, managed for your convenience
•Generates Spark jobs
•Does not generate Spark Notebooks or other artifacts we can reuse directly (a hand-written equivalent is sketched below)
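For orientation, here is a minimal hand-written PySpark sketch of what a simple dataflow (source → filter → select → sink) boils down to in a notebook. All paths and column names are hypothetical placeholders, not taken from the session.

```python
# Minimal PySpark equivalent of a simple Mapping Dataflow:
# source -> filter -> select -> sink. Paths and column names
# are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source step: read the raw dataset
orders = spark.read.parquet(
    "abfss://raw@examplelake.dfs.core.windows.net/orders/")

# Filter transformation: keep only recent rows
recent = orders.filter(F.col("order_date") >= "2024-01-01")

# Select (column mapping): project and rename columns
projected = recent.select(
    F.col("order_id").alias("OrderId"),
    F.col("order_date").alias("OrderDate"),
    F.col("amount").alias("Amount"),
)

# Sink step: write the result out
projected.write.mode("overwrite").parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/orders_recent/")
```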
Reasons for listening to this talk?
Moving to Fabric
oNo Mapping Dataflows in Fabric
oMigration path mostly leans on "manual migration"
oLooking for structure in migration
Staying in Synapse
o"Ease" of Debug
oSource Control options
▪Development Environment
oCurious about cost
Quick overview: Mapping Dataflows – Cost in Azure
•Costs more than "pure Spark":
•£0.218 per vCore-hour (general)
•£0.279 per vCore-hour (memory-optimised)
•(UK West, 2024-09-10)
•Azure Managed Clusters
•Little-to-no control of size/scale/scope
•Hard to debug and optimise for cost
•Only Azure Integration Runtimes supported
Quick overview: Spark Notebooks
•Interactive interface with multiple language support
•Develop locally, or against cloud clusters
oDeploy using standard version control mechanisms
•Azure Synapse:
–Spark Pool cluster - £0.116 per vCore-hour (memory-optimised)
–Mapping Dataflows ~1.8-2.4x more expensive
•(UK West, 2024-09-10)
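The ~1.8-2.4x figure follows directly from the listed rates: £0.218 / £0.116 ≈ 1.88 for general-purpose compute and £0.279 / £0.116 ≈ 2.41 for memory-optimised, before accounting for any difference in cluster sizing or runtime.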
We're moving to Notebooks – What now?
•Planning our migration
•No native conversion from Mapping Dataflows to: Notebooks, Dataflows Gen2, Spark Job definitions
•Understanding our migration burden (see the inventory sketch below)
•How many Mapping Dataflows?
•How complex are the Dataflows?
•Looking at our options:
•Manual Migration
•Syntactical Parsers
•LLM-based migration
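One way to answer the "how many, how complex" questions: in a git-connected workspace, each Mapping Dataflow is stored as a JSON file, so a short script can inventory them. A rough sketch, assuming the default `dataflow/` folder of a Synapse/ADF git repo; the folder name and the crude complexity score are assumptions.

```python
# Rough inventory of Mapping Dataflows in a git-connected
# Synapse/ADF repo. Assumes the default layout where each
# dataflow is a JSON file under dataflow/; adjust the path
# for your repository structure.
import json
from pathlib import Path

for path in sorted(Path("dataflow").glob("*.json")):
    props = json.loads(path.read_text())["properties"]["typeProperties"]
    sources = len(props.get("sources", []))
    sinks = len(props.get("sinks", []))
    transforms = len(props.get("transformations", []))
    # Crude complexity score: every step adds migration effort
    print(f"{path.stem}: {sources} sources, {sinks} sinks, "
          f"{transforms} transformations "
          f"(complexity {sources + sinks + transforms})")
```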
Manual Migration
•Quick to get going
•Manual – it's in the name
•Heavy human involvement = costly (in real or opportunity terms)
•Suitable if:
•Few flows
•Low* complexity
•Few data sources/sinks
Syntactical Parser:
Arun Sethia, https://github.com/sethiaarun/mapping-data-flow-to-spark
•Java and Scala-based syntactical parsing with limitations
•Not all operations are supported, only:
▪Source
▪Select Column Mapping
▪Join
▪Union
▪Filter
▪Sink
▪Flatten (unroll_by)
•If your Mapping Dataflow fits within these limitations, and you're allowed to install the prerequisites, this may be an option
•Requires Java 11 and Python to run
•Your organisation might not approve
•Code formatting leaves something to be desired
•Parameterised Mapping Dataflows will require some tweaking before the output is ready
•The output is, however, dependable and should not require validation (though it may need adjustments from a technical point of view)
LLM-based migration
•ChatGPT or other LLM
•UI or API – take your pick
•Limitations here:
•Token length limitation
▪Depends on your service/model: gpt-4 8k, gpt-4-turbo 128k
▪Long scripts will be truncated and will fail
•Will require some tweaking and validation
•Remember: confident incorrectness is a thing
•Best option if your Mapping Dataflow is complex and/or doesn't fit the syntactical parser's limitations (a minimal API sketch follows)
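A minimal sketch of the API route, using the OpenAI Python SDK (v1). The model name, prompt wording, and the exported script file are illustrative assumptions; as the slides stress, the output must be checked before it is run.

```python
# Minimal sketch: ask an LLM to translate a Mapping Dataflow
# script into PySpark via the OpenAI API (openai>=1.0).
# Model and prompt are illustrative; long scripts may exceed
# the context window and need chunking.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Exported dataflow script (file name is a placeholder)
dataflow_script = open("my_dataflow_script.txt").read()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # 128k-token context, per the slide
    messages=[
        {"role": "system",
         "content": "Translate Azure Data Factory Mapping Dataflow "
                    "scripts into equivalent PySpark notebook code."},
        {"role": "user", "content": dataflow_script},
    ],
)
print(response.choices[0].message.content)
```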
Syntactical Parser
•There is no magic bullet
•Manual intervention still required
•Large flows, small flows – all the same as long as you stay inside the limitations
•If you’re savvy with Scala – extend it!
Demo: LLM-based Migration
•Demo
•LLM-based Migration
•Using ChatGPT UI here – API works the same
LLM-based Migration
•There is no magic bullet
•Manual intervention still required
•Very quick to get started
•Accuracy is (from experience) really good most of the time – but when it’s bad, it’s bad
•Check the code before you run it!
Do the outputs run in Spark Notebooks?
•Let’s see?
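One way to answer that question beyond "it runs": compare the migrated notebook's output against the original dataflow's sink. A hedged sketch; the paths are hypothetical and `spark` is the session a Synapse notebook provides by default.

```python
# Parity check between the original dataflow's sink and the
# migrated notebook's output. Paths are placeholders; `spark`
# is predefined in a Synapse notebook session.
old = spark.read.parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/orders_recent_dataflow/")
new = spark.read.parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/orders_recent_notebook/")

assert old.count() == new.count(), "row counts differ"
assert old.schema == new.schema, "schemas differ"

# Rows present in one output but not the other (order-insensitive)
diff = old.exceptAll(new).union(new.exceptAll(old))
print(f"{diff.count()} mismatched rows")
```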
Recap
•Migrating from Mapping Dataflows is usually done for one of two reasons:
•Moving to an environment where they are not supported
•Optimising cost
•Mapping Dataflows are actually "just" Spark
•Three main options for migration
•Manual migration
•Syntactical Parsing
•LLM
•Your choice will depend on your circumstances
•There will be manual intervention required in almost all instances
Questions?
•Thank you!
•Special thanks to:
John Martin, Rob Sewell, Ben Weissman, Benni De Jagere, Craig Porteous and Alexander Arvidsson