AWS Community Day Poland 2022 - Building a Data Lake.pdf

Anurag896857 34 views 30 slides May 05, 2024

About This Presentation

A talk about the process of building a data lake for a use case, spanning from business needs to technology choices. It covers the AWS services considered and the choices made along the way.


Slide Content

WARSZAWA
14th OCTOBER 2022

Building a Serverless Event Store
Store, distribute and discover events
Anurag Kale
Cloud Software Architect
Architecture & Standards
Polestar Digital

There is no sense in talking about the
solution before we agree on the problem,
and no sense talking about the
implementation steps before we agree on
the solution.
- Efrat Goldratt-Ashlag [The buy-in process according to the Theory of Constraints]

What’s in this session?
•Explore business use case
•Blueprint the business solution
•Evaluate different technical candidates for the problem
•Logical Solution Architecture
•Data Flow Diagram
•What’s next?

$whoami
Anurag Kale
Cloud Software Architect at Polestar Digital
AWS Data Hero
@iAnuragKale

Current Landscape
EDA Software
Enterprise COTS
Factory Systems
Serverless EDA Software
Enterprise COTS
R&D Systems
Data & Analytics
Data & Analytics
Industrial Software
External Providers
Polestar Managed Landscape

The Problem

The problem starts from here...
Sources: Source 1, Source 2, Source 3
Systems: System 1, System 2, System 3
Integrations: Integration 1, Integration 2

Soon this happens...
Sources: Source 1, Source 2, Source 3, Source 4, Source N
Systems: System 1, System 2, System 3, System 4, System 5, System N
Integrations: Integration 1, Integration 2, Integration 3

What are we trying to address?
•Standardise collection of data from external sources
•Avoid point-to-point integrations
•Consolidate efforts for commonly used data sources
•Own Polestar data in unmutated format
•Enable the current way of building event-driven architectures
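
To make the point-to-point concern concrete, here is a rough count. It assumes the worst case, in which every consuming system builds its own integration to every source, and compares that with a shared platform that needs one inbound link per source and one outbound link per system. The function names are illustrative and not taken from the talk.

```python
def point_to_point_integrations(sources: int, systems: int) -> int:
    """Worst case: every system integrates separately with every source."""
    return sources * systems


def hub_integrations(sources: int, systems: int) -> int:
    """Shared platform: one inbound link per source, one outbound link per system."""
    return sources + systems


if __name__ == "__main__":
    # Compare both approaches as the landscape grows.
    for n in (3, 5, 10):
        print(n, point_to_point_integrations(n, n), hub_integrations(n, n))
```

Under this assumption the number of integrations grows with the product of sources and systems, while the shared platform grows with their sum.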

Ideal Solution - Business Architecture
Sources: Source 1, Source 2, Source 3, Source 4, Source N
Systems: System 1, System 2, System 3, System 4, System 5
Black Box
Integration Options
Distribution Options


Ideal properties
The proposed black box
•Support multiple input methods
•Flexible: can store multiple data formats
•Store data in original format, i.e. unmutated data
•Store data for the long run in a cost-efficient way
•Good data discoverability
•Distribute data as events
•Adheres to IT Principles (Serverless / Cloud Native)
•Open to future expansions
•High sustainability impact (minimal CO2 emissions)
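
Two of the properties above, storing data unmutated and distributing it as events, can be sketched together. This is a minimal illustration only: the in-memory store stands in for a cloud object store, and the key layout, event fields, and all names are assumptions rather than the design presented in the talk.

```python
import hashlib
from datetime import datetime, timezone


class InMemoryObjectStore:
    """Hypothetical stand-in for a cloud object store (key -> raw bytes)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def get(self, key: str) -> bytes:
        return self.objects[key]


def ingest(store: InMemoryObjectStore, source: str, payload: bytes,
           received_at: datetime) -> dict:
    """Store the payload byte-for-byte and return an event describing it."""
    digest = hashlib.sha256(payload).hexdigest()
    # Assumed key layout: partition by source and arrival date.
    key = f"raw/{source}/{received_at:%Y/%m/%d}/{digest}"
    store.put(key, payload)  # no parsing or transformation: data stays unmutated
    return {
        "type": "ObjectStored",  # hypothetical event type
        "source": source,
        "key": key,
        "sha256": digest,
        "size": len(payload),
    }


if __name__ == "__main__":
    store = InMemoryObjectStore()
    event = ingest(store, "source-1", b'{"speed": 42 }',
                   datetime(2022, 10, 14, tzinfo=timezone.utc))
    print(event["key"])
```

Because the event carries only a pointer and a checksum, consumers can fetch the original bytes on demand and verify that nothing changed in transit.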

Candidates for Black Box


Data Architectures
•Enterprise Data Warehouse Approach
•Data Lake
•Data Lake House
•Delta Lakes
•Data Mesh / Data Fabric

Enterprise Data Warehouse
The Usual Suspect
Read Optimised Relational DB
ETL Tool / CDC / Upload
ETL Tool
Query Tool


Evaluation of EDW Approach
•Proprietary tools, OS and expertise required
•Heavy licence costs
•Needs provisioned capacity for compute, network, storage
•Connection pooling issues (not suitable for AWS Lambda)
•Violates IT principles (requires downtime, non-modular, does not work well with EDA, etc.)
•Long project efforts for the addition of every new data source
•24x7 provisioning → higher CO2 emissions

Data Lake in Cloud
New kid on the block
Inputs: Stream, Batch, Upload, CDC
Object Store, ETL
Outputs: Events, Batch, Query, Read


Evaluation of Data Lake Approach
•Cloud native, flat file storage
•No licence cost involved
•No provisioning of servers needed
•API-based reads / writes
•Fits into all IT principles
•Adding a new source is a trivial task
•Significantly lower CO2 emissions
•Leaves room for expansion


Data Lake as Platform
[Diagram; labels recovered from the slide]
Connected systems: DB, IoT, Apps, Disk, DB, DWH
Consumers: Report, Dashboard, ML Tools, DS Notebooks
Data movement: ETL, CDC, Upload; Stream; Stream, write; Upload, replicate; Batch; Real time; ETL, SQLaaS; Read, Replicate
Persistence: Object Store, Index, Catalog
Process: FaaS, SQLaaS
Governance: IAM, Policies, Lineage

Other Data Architectures
Data Lake House
Source: AWS

Other Data Architectures
Data Mesh / Data Fabric
Source: AWS
Virtualisation Layer
DB, Lakes, DWH
Cloud, On Prem, Edge
DB, Files, DWH, Cache, Telemetry, Local Storage

Other Data Architectures
Delta Lakes

Our Solution

Polestar Data Platform
Logical components

Envelope Format
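
The envelope shown on this slide is not reproduced in the text, so the following is only a sketch of what an event envelope could look like. Every field name here is hypothetical. The idea it illustrates is general: wrap the original payload in a small set of metadata fields, and encode the payload so that it survives JSON serialisation byte-for-byte.

```python
import base64
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical envelope: metadata plus the untouched payload."""
    event_id: str
    source: str
    content_type: str
    received_at: str   # ISO 8601 timestamp supplied by the caller
    payload_b64: str   # original bytes, base64-encoded for JSON transport


def wrap(source: str, content_type: str, payload: bytes,
         received_at: str) -> Envelope:
    """Wrap raw bytes in an envelope without altering them."""
    return Envelope(
        event_id=str(uuid.uuid4()),
        source=source,
        content_type=content_type,
        received_at=received_at,
        payload_b64=base64.b64encode(payload).decode("ascii"),
    )


def unwrap(envelope: Envelope) -> bytes:
    """Recover the exact original payload."""
    return base64.b64decode(envelope.payload_b64)


def to_json(envelope: Envelope) -> str:
    """Serialise the envelope for distribution as an event."""
    return json.dumps(asdict(envelope), sort_keys=True)


if __name__ == "__main__":
    env = wrap("source-1", "text/csv", b"a;b\n1;2\n", "2022-10-14T09:00:00Z")
    print(to_json(env))
```

Keeping the payload opaque means the platform does not need to understand each source's format in order to store and distribute it.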

Polestar Data Platform
Data Flow Diagram

What will we achieve?
•A staging area to store and operate on a variety of data
•Polestar will own the data in its landscape
•Highly suited platform for the current way of software design
•Highly decoupled, cloud native and modular software
•Lightweight; minimal inertia to get off the ground
•Pay-as-you-go model
•No lock-in or commitments
•Contains reusable patterns
•Self-service oriented and discoverable
•Sustainable solution


What more can be done?
•Standard support for multiple input sources
•Privacy module to scramble, encrypt or mask sensitive data
•Provide reusable patterns to consume data from the platform
•Enable data lineage capabilities
•Provide standard BI reports
•Standard access patterns for power users, i.e. data scientists, data engineers, etc.
Open source version: https://github.com/mikaelvesavuori/example-aws-stream-data-to-events
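
As an illustration of the privacy module idea listed above, here is a minimal sketch of two of the three operations: masking and scrambling. It is not taken from the talk or from the linked repository. The rule format and function names are assumptions, scrambling is implemented here as a keyed hash (HMAC-SHA256), and encryption is left out because it would need a dedicated cryptography library.

```python
import hashlib
import hmac


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide all but the last `visible` characters; short values are fully hidden."""
    if visible <= 0 or len(value) <= visible:
        return fill * len(value)
    hidden = len(value) - visible
    return fill * hidden + value[hidden:]


def scramble(value: str, key: bytes) -> str:
    """Replace a value with a keyed hash: stable for joins, not reversible."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def apply_privacy(record: dict, rules: dict, key: bytes) -> dict:
    """Return a copy of `record` with the per-field rules applied."""
    out = dict(record)
    for field, action in rules.items():
        if field not in out:
            continue
        if action == "mask":
            out[field] = mask(str(out[field]))
        elif action == "scramble":
            out[field] = scramble(str(out[field]), key)
        else:
            raise ValueError(f"unknown privacy action: {action}")
    return out


if __name__ == "__main__":
    record = {"vin": "YV1ABCDE1234", "email": "driver@example.com", "speed": 42}
    rules = {"vin": "mask", "email": "scramble"}
    print(apply_privacy(record, rules, key=b"demo-key"))
```

The function returns a copy, so the original record can still be stored unmutated while only the protected version is distributed.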

Thank You! Questions?