Nowadays, Business Intelligence is a mature technology. Technological migrations are more frequent than one might imagine. Regardless if we are moving to cloud or just reengineering old integration jobs, rewriting ETL from scratch in a new technology might be looming over our heads. As one might ima...
Nowadays, Business Intelligence is a mature technology. Technological migrations are more frequent than one might imagine. Regardless if we are moving to cloud or just reengineering old integration jobs, rewriting ETL from scratch in a new technology might be looming over our heads. As one might imagine, doing this manually might be an absolutely colossal task. Hundreds if not thousands mutually interdependent jobs. Written by somebody else years ago, often badly. Knowing what the job is actually supposed to do is more often an exception than a rule. For this purpose, we can use tools for automatic ETL migration from technology to technology. They offer obvious benefits, but also have limitations. In this lecture, we will dive deeply into the World of ETL migration. When to do it automatically, when to do it by hand? What are the threats on our journey? Last but not the least, we shall provide some real World experiences and examples. Success stories, perhaps an epic failure or two. There is only one way to find out - attend DSC DACH 24 Conference!
Size: 5.36 MB
Language: en
Added: Sep 21, 2024
Slides: 31 pages
Slide Content
Automatic ETL Migration
On-prem to Cloud and More
DSC Dach, Vienna
September2024
Speaker
2
Miljenko Vuković
VIS Solutions
Founder, Data Engineer [email protected]
VIS Solutions
Zagrebačkiinovacijskicentar
AvenijaDubrovnik 15 (Paviljon12)
10000 Zagreb, Croatia
Web page: https://vis-solutions.eu/
Phone number: +385 1 353 55 71
Mail address: [email protected]
INTRODUCTION
Data Integration
6
ETL
•Extract, Transform, Load
ELT
•Extract, Load, Transform
Streaming
File Ingestion
… and many more
DATA
Architectures
7
Data Warehouse
•Among the oldest and most well established data architectures
Operational Data Store
Data Lake
•Why not simply dump all your data –structured and not –into a central place?
Data Lakehouse
Modern Data Architecture
•Cloud Data Warehouse
Lambda, Kappa, Data Mesh, Architecture for IoT and many more
CORPORATE ARCHITECTURE
Example
8
Data
Warehouse
Data Science
Analytics
Reporting
ERP
CRM
HR
IoT
(Big Data)
DataLake
Streaming
INTEGRATION
Technologies
9
Stored Procedures
•Early DWH projects often used this solution, especially with source and target
on the same database technology
Blue Chip (and other) Vendors’ ETL Tools
•MS SSIS, Oracle Data Integrator (ODI), IBM DataStage, Informatica…
On Cloud
•Exporting data in files (csv, xml, json…), loading it into cloud stage area,
executing transformation in tools like dbtor writing stored procedures
Cloud Data
Warehouse
Source
System(s)
Cloud
StagingArea
ETL Migration
When & Why?
10
•Moving to cloud (or back ☺)
•Existing technology is outdated / deprecated
•Reengineering estimated as long term cheaper than maintaining old
code
•Company politics (e.g. changing primary vendor)
•… and many, many more
ETL Migration
Recommended steps
11
1.Migrate existing jobs to new technology
as-is
•New job has to generate exactly the same rows
and columns as the old one
2.Technical refactoring
•Remove unnecessary steps, simplify
•Performance optimization
3.Functional refactoring
•Include business / BA to update logic if
necessary
1.AS IS
2. Technical
3. Functional
As-is ETL Migration
Automatic vs. Manual
12
1.Manual
•Basically, opening integration job in the old
technology and rewriting it into the new
technology
2.Automatic
•Technologies, algorithms and procedures for
automatic parsing of original job and generating
the new one
Automatic ETL Migration
Principles
13
1.Direct migration from one technology to another
•Reading file export of integration routine or integration
tool’s repository
•Generating script for the new job, or generating new job
directly in the target tool repository
2.Using middle layer (meta definition of job)
•Source parsing uses middle layer as target
•Target generation uses middle layer as source
Source
Technology
Target
Technology
Source
Technology
Target
Technology
Meta
Format
Automatic ETL Migration
MiddleLayerConcept
14
Automatic ETL Migration
Excel as Middle Layer
15
Using Excel as middle layer solves (integration routine’s) documentation in one
step
* There is also „where” part as well as part with integration routine’s variables
„Select” part
„From” part
Automatic ETL Migration
IBM DataStage as source
16
Example of using file export as conversion source
Export to
XML Select file,
click Start
Automatic ETL Migration
Cloud as Target
17
Example of Cloud Data Architecture
17
Cloud Data
Warehouse
Extract
(csv, json, xml…)
Data Science
Analytics
Reporting
Source
System(s)
Cloud
StagingArea
Load Transform
Automatic ETL Migration
Snowflake as target
18
Automation of ETL migration to Snowflakeusingstoredproceduresin
„Transform” step
Select file,
Click Start
Automatic ETL Migration
Stored Procedures as Target
19
Using templates, stored procedures can be easily generated for any target
technology
Select file,
Click Start
Automatic ETL Migration
Development from scratch
20
•Typically, developers will investigate data relations
using SQL
•That will quite often result in a draft of integration
logic in SQL, regardless of target technology
•Only then they will develop integration job in the
target technology, trying to get the same results as
SQL in the previous bullet
•The million dollars question –can the last step be
automated? ??????
Automatic ETL Migration
SQL as source
21
Use SQL Parser (select file, click) to
generate Excel
Use Excel to generate target technology
integration routines (select file, click)
Automatic ETL Migration
Stored Procedures as Source
22
•Extension of previous case since stored procedures
typically contain main logic in SQL
•Even if there are multiple SQL statements, stored
procedure can still be migrated as series of
integration jobs in target technology
•Intermediate data will be stored in tmptables
Automatic ETL Migration
ODI as Target
23
Example of generating target job (mappingin ODI terminology) directly
in repository
Select file,
Click Start
Automatic ETL Migration
ODI as Source
24
This time, since ODI is basically SQL generator,we use generated
SQL as a source
Select file,
Click Start
Automatic ETL Migration
Staging Area
25
Yes, the routines for loading staging area tables (copy
of source tables) can also be automatically generated,
regardless if:
•Staging area tables are loaded using python scripts
•Staging area tables are loaded using ETL tool itself
•Staging area tables are loaded using stored
procedures
Automatic ETL Migration
Benefits
26
•Smaller time (i.e. money) consumption per integration routine
•Productivity boost
•Standardization and best practice enforcement
•Self-documenting process
Automatic ETL Migration
Minuses
27
•Initial investment
•Time required to develop a solution or
select / implement existing one
•Even existing solutions probably will not be
able to process 100% of jobs!
•Technical and functional refactoring can
not be done automatically (yet ☺)
CONCLUSION
28
•If you have a small number of integration
routines (1 –2 digits), migrate ETL
manually
•If you have a large number of integration
routines (hundreds or more), try any
automation you can
WIN A PRIZE !
Trip to Vis Island for Two
29
Visit us at our stand and apply!
QUESTIONS?
…and Discussion
30
??????
SELECTED
Literature
31
Fundamentals of Data Engineering –Plan and Build Robust Data Systems
Joe Reis & Matt Housley, 2022 O’Reilly Media
The Data Warehouse Toolkit –The Definitive Guide to Dimensional Modelling
Ralph Kimball & MargyRoss, 2013 (3rd Edition), John Wiley & Sons
https://www.vis-solutions.eu/products/epic-ingenioso