Migrating Mapping Dataflows
From Synapse Analytics to Spark Notebooks
Johan Kangasniemi | 2024-09-13
Who am I? – Johan’s introduction
•Swedish with Sami origin
•First time at Data:Scotland
•First time presenting (!) – this all started at SQLBits ‘24!
•Background originally in SQL Server
•Working within Azure (almost exclusively) since 2017
•Data and Analytics Lead in Oil and Gas in Aberdeen
•Co-organiser of Data Platform and Cloud Aberdeen (DPaCAbz)
•https://www.meetup.com/dpacabz
•Living in Aberdeen with my wife and Marzipan, the 21-year-old cat
Quick overview: Mapping Dataflows
•Visually designed data transformations
•Drag and Drop, Point and Click
•Azure Data Factory and/or Azure Synapse Pipelines
•Low/No Code option for Data Engineering
•Flowlets can be very neat for standardised, quick data engineering
•Runs on Azure Data Factory managed clusters
•Code translation, path optimisation and execution performed for you
•Uses scaled-out Apache Spark Clusters, managed for your convenience
•Generates Spark jobs
•Does not generate Spark Notebooks or other artifacts we can reuse directly (a hand-written equivalent is sketched below)
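For orientation, here is a minimal hand-written PySpark sketch of what a simple dataflow (source → filter → select → sink) boils down to in a notebook. All paths and column names are hypothetical placeholders, not taken from the session.

```python
# Minimal PySpark equivalent of a simple Mapping Dataflow:
# source -> filter -> select -> sink. Paths and column names
# are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source step: read the raw dataset
orders = spark.read.parquet(
    "abfss://raw@examplelake.dfs.core.windows.net/orders/")

# Filter transformation: keep only recent rows
recent = orders.filter(F.col("order_date") >= "2024-01-01")

# Select (column mapping): project and rename columns
projected = recent.select(
    F.col("order_id").alias("OrderId"),
    F.col("order_date").alias("OrderDate"),
    F.col("amount").alias("Amount"),
)

# Sink step: write the result out
projected.write.mode("overwrite").parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/orders_recent/")
```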
Reasons for listening to this talk?
Moving to Fabric
oNo Mapping Dataflows in Fabric
oMigration path mostly leans on "manual migration"
oLooking for structure in migration
Staying in Synapse
o"Ease" of Debug
oSource Control options
▪Development Environment
oCurious about cost
Quick overview: Mapping Dataflows – Cost in Azure
•Costs more than "pure Spark":
•£0.218 per vCore-hour (general)
•£0.279 per vCore-hour (memory-optimised)
•(UK West, 2024-09-10)
•Azure Managed Clusters
•Little-to-no control of size/scale/scope
•Hard to debug and optimise for cost
•Only Azure Integration Runtimes supported
Quick overview: Spark Notebooks
•Interactive interface with multiple language support
•Develop locally, or against cloud clusters
oDeploy using standard version control mechanisms
•Azure Synapse:
–Spark Pool cluster - £0.116 per vCore-hour (memory-optimised)
–Mapping Dataflows ~1.8-2.4x more expensive
•(UK West, 2024-09-10)
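The ~1.8-2.4x figure follows directly from the listed rates: £0.218 / £0.116 ≈ 1.88 for general-purpose compute and £0.279 / £0.116 ≈ 2.41 for memory-optimised, before accounting for any difference in cluster sizing or runtime.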
We're moving to Notebooks – What now?
•Planning our migration
•No native conversion from Mapping Dataflows to: Notebooks, Dataflows Gen2, Spark Job definitions
•Understanding our migration burden (see the inventory sketch below)
•How many Mapping Dataflows?
•How complex are the Dataflows?
•Looking at our options:
•Manual Migration
•Syntactical Parsers
•LLM-based migration
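One way to answer the "how many, how complex" questions: in a git-connected workspace, each Mapping Dataflow is stored as a JSON file, so a short script can inventory them. A rough sketch, assuming the default `dataflow/` folder of a Synapse/ADF git repo; the folder name and the crude complexity score are assumptions.

```python
# Rough inventory of Mapping Dataflows in a git-connected
# Synapse/ADF repo. Assumes the default layout where each
# dataflow is a JSON file under dataflow/; adjust the path
# for your repository structure.
import json
from pathlib import Path

for path in sorted(Path("dataflow").glob("*.json")):
    props = json.loads(path.read_text())["properties"]["typeProperties"]
    sources = len(props.get("sources", []))
    sinks = len(props.get("sinks", []))
    transforms = len(props.get("transformations", []))
    # Crude complexity score: every step adds migration effort
    print(f"{path.stem}: {sources} sources, {sinks} sinks, "
          f"{transforms} transformations "
          f"(complexity {sources + sinks + transforms})")
```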
Manual Migration
•Quick to get going
•Manual – it's in the name
•Heavy human involvement = costly (in real or opportunity terms)
•Suitable if:
•Few flows
•Low* complexity
•Few data sources/sinks
Syntactical Parser:
Arun Sethia, https://github.com/sethiaarun/mapping-data-flow-to-spark
•Java and Scala-based syntactical parsing with limitations
•Not all operations are supported, only:
▪Source
▪Select Column Mapping
▪Join
▪Union
▪Filter
▪Sink
▪Flatten (unroll_by)
•If your Mapping Dataflow fits within these limitations, and you're allowed to install the prerequisites, this may be an option
•Requires Java 11 and Python to run
•Your organisation might not approve
•Code formatting leaves something to be desired
•Parameterised Mapping Dataflows will require some tweaking before the output is ready
•The output is, however, dependable and should not require validation (though it may need adjustments from a technical point of view)
LLM-based migration
•ChatGPT or other LLM
•UI or API – take your pick
•Limitations here:
•Token length limitation
▪Depends on your service/model: gpt-4 8k, gpt-4-turbo 128k
▪Long scripts will be truncated and will fail
•Will require some tweaking and validation
•Remember: confident incorrectness is a thing
•Best option if your Mapping Dataflow is complex and/or doesn't fit the syntactical parser's limitations (a minimal API sketch follows)
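A minimal sketch of the API route, using the OpenAI Python SDK (v1). The model name, prompt wording, and the exported script file are illustrative assumptions; as the slides stress, the output must be checked before it is run.

```python
# Minimal sketch: ask an LLM to translate a Mapping Dataflow
# script into PySpark via the OpenAI API (openai>=1.0).
# Model and prompt are illustrative; long scripts may exceed
# the context window and need chunking.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Exported dataflow script (file name is a placeholder)
dataflow_script = open("my_dataflow_script.txt").read()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # 128k-token context, per the slide
    messages=[
        {"role": "system",
         "content": "Translate Azure Data Factory Mapping Dataflow "
                    "scripts into equivalent PySpark notebook code."},
        {"role": "user", "content": dataflow_script},
    ],
)
print(response.choices[0].message.content)
```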
Syntactical Parser
•There is no magic bullet
•Manual intervention still required
•Large flows, small flows – all the same as long as you stay inside the limitations
•If you’re savvy with Scala – extend it!
Demo: LLM-based Migration
•Demo
•LLM-based Migration
•Using ChatGPT UI here – API works the same
LLM-based Migration
•There is no magic bullet
•Manual intervention still required
•Very quick to get started
•Accuracy is (from experience) really good most of the time – but when it’s bad, it’s bad
•Check the code before you run it!
Do the outputs run in Spark Notebooks?
•Let’s see?
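One way to answer that question beyond "it runs": compare the migrated notebook's output against the original dataflow's sink. A hedged sketch; the paths are hypothetical and `spark` is the session a Synapse notebook provides by default.

```python
# Parity check between the original dataflow's sink and the
# migrated notebook's output. Paths are placeholders; `spark`
# is predefined in a Synapse notebook session.
old = spark.read.parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/orders_recent_dataflow/")
new = spark.read.parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/orders_recent_notebook/")

assert old.count() == new.count(), "row counts differ"
assert old.schema == new.schema, "schemas differ"

# Rows present in one output but not the other (order-insensitive)
diff = old.exceptAll(new).union(new.exceptAll(old))
print(f"{diff.count()} mismatched rows")
```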
Recap
•Migrating from Mapping Dataflows is usually done for one of two reasons:
•Moving to an environment where they are not supported
•Optimising cost
•Mapping Dataflows are actually "just" Spark
•Three main options for migration
•Manual migration
•Syntactical Parsing
•LLM
•Your choice will depend on your circumstances
•There will be manual intervention required in almost all instances
Questions?
•Thank you!
•Special thanks to:
John Martin, Rob Sewell, Ben Weissman, Benni De Jagere, Craig Porteous and Alexander Arvidsson