Crafting highly scalable and performant Modern Data Platforms

SameerParadkar2 62 views 33 slides Jul 10, 2024

© 2021 Atos GB Ltd. All Rights Reserved. Commercial in Confidence.
© EVIDEN SAS
Data at Scale: Crafting highly scalable and performant Modern Data Platforms
Sameer Paradkar – Enterprise Architect – Digital
Distinguished Expert – Modern Applications
an atos business 30/09/2023

Agenda – Modern Data Platforms
01 Data Driven Economy
02 Traditional Deployment Limitations
03 Modern Data Platforms - Goals and Objectives
04 Data Warehouse Vs Data Lake Vs Data Lakehouse
05 Lambda Vs Kappa Vs Delta
06 Data Analytics Platform – Reference Architecture

Four V's of Data Domain: Understanding Volume, Velocity, Variety, and Veracity
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes and even petabytes of information.
Velocity: For time-sensitive processes such as catching fraud, big data must be analyzed as it streams into the enterprise to maximize its business value.
Variety: Big data extends beyond structured data to include unstructured data of all varieties: text, sensor data, audio, video, click streams, log files and more.
Veracity: As the complexity of information grows, organizations must improve the level of trust users have in information, ensure consistency across the organization and safeguard the information. Establishing confidence is essential for driving better business results.

Data Driven Economy
The role of data & analytics is expanding as data becomes more strategic and mission-critical.

Modern Demands on Data Platforms

Traditional Model Limitations

Drive Better Business Outcomes
Organizations want to modernize their data & analytics platform to drive better business outcomes by using data as a differentiated asset for
innovation and competitive advantage.

Modern Data Platform is embedded in digital business platform

Journey to Cloud Data Platform

Modern Cloud Data Platforms: Workloads

Strategizing Modern Data Platforms: From Ingestion to Insightful Action
• Enable stakeholders to take evidence-based decisions based on analytical, transactional and raw data, and to extract value from data;
• Data scientists and analysts can consume and extract value from platform data. Enable data scientists to collaborate and use predictive and prescriptive analytics pipelines that identify metadata, insights and patterns in structured and unstructured data, using analytics engines for big data processing;
• Expose the outcomes of these pipelines as human and machine interfaces (Web UIs and services) and enable end-users and systems to search and consume the outcomes of data analytics activities;
• Apply advanced analytics such as text mining, NLP and AI to past and future unstructured information to augment it with metadata that lends itself to data-driven interrogation;
• Implement and enforce data governance policies and processes;
• Design a user-friendly, cost-effective, highly available and scalable Modern Data Platform that provides orchestration and monitoring of data pipelines, data governance, deployment automation, access control and auditing;
• The Modern Data Platform is designed leveraging a variety of SaaS/PaaS services within distinct layers (load, store, process, serve, and consume), and the various components will be exposed via APIs.
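The five layers named in the last bullet (load, store, process, serve, consume) can be pictured as a chain of small functions. The sketch below is illustrative only: the function names, the in-memory dictionary standing in for storage, and the sample order records are all assumptions, not part of the original deck.

```python
from collections import defaultdict

# Hypothetical in-memory store; a real platform would use SaaS/PaaS services.
STORE = {}

def load(records):
    """Load layer: accept raw records from a source (here, a Python list)."""
    return list(records)

def store(name, records):
    """Store layer: persist records under a dataset name."""
    STORE[name] = records
    return name

def process(name):
    """Process layer: aggregate order amounts per customer."""
    totals = defaultdict(float)
    for rec in STORE[name]:
        totals[rec["customer"]] += rec["amount"]
    return dict(totals)

def serve(totals):
    """Serve layer: expose results through an API-like callable."""
    def api(customer):
        return totals.get(customer, 0.0)
    return api

# Consume layer: an end user or system calls the served interface.
raw = load([
    {"customer": "acme", "amount": 120.0},
    {"customer": "acme", "amount": 30.0},
    {"customer": "globex", "amount": 75.5},
])
api = serve(process(store("orders_raw", raw)))
print(api("acme"))
```

Keeping each layer behind its own function mirrors the slide's point that components are reached through interfaces rather than directly.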

Modern Data & Analytics Platform –Logical Architecture

Data Analytics Platform –Reference Architecture
[Diagram. Data Sources: Data Source 1 to 8. Data Analytics Platform: Data Ingestion, Data Processing, Data Storage, Data Analytics & Cognitive Services, Data Visualization, Development Tools. Cross-cutting: Orchestration & Governance, Monitoring, Security, Cost Management, DevOps. Data Consumers: Use Case 1 to 8. Personas: General Public with Limited Access; Consumers of Reports & Dashboards; Produces Reports & Dashboards; Data Scientist leveraging AI.]

Modern Data Platforms: Speed, Savings, and Secure Governance
Our Data Modernization Platform provides vital capabilities empowering organizations to:
• Rapidly ingest, migrate and amalgamate enormous volumes of structured and unstructured data from any legacy systems up to 4 times faster at up to 60% savings.
• Manage the universe of enterprise data similarly across multiple operating units at over 75% savings.
• Create a robust data dictionary while "in-flight" during migration.
• Provide self-service data discovery and exploration toolsets, including hundreds of connectors.
• Capture and index context-sensitive metadata.
• Implement principles for ethical data governance, including enabling sandboxes for development, experimentation and security.
• Benefit from an industry-best pre-built compliance framework that promotes security, data privacy, and audit trails.

Data Modernization Framework
Data is a Strategic Asset – Data is a high-interest commodity and must be leveraged in a way that brings both immediate and lasting value.
Collective Data Stewardship – Assign data stewards, data custodians, and a set of functional data managers to achieve accountability throughout the entire data lifecycle.
Data Ethics – Ethics at the forefront of all thought and actions as it relates to how data is collected, used, and stored.
Data Collection – Enable electronic collection of data at the point of creation and maintain the pedigree of that data at all times.
Enterprise-Wide Data Access and Availability – Data must be made available for use by all authorized individuals and non-person entities through appropriate mechanisms.
Data for Artificial Intelligence – Data sets for A.I. training and algorithmic models will increasingly become the DoD's most valuable digital assets and we must create a framework for managing them across the data lifecycle that provides protected visibility and responsible brokerage.
Data Fit for Purpose – Must carefully consider any ethical concerns in data collection, sharing, use, rapid data integration as well as minimization of any sources of unintended bias.
Design for Compliance – Must implement IT solutions that provide an opportunity to fully automate the information management lifecycle, properly secure data, and maintain end-to-end records management.

Modern Data Platform: Logical Architecture
The Modern Data Platform enables data analytics for multiple purposes, serves as the base on which services are built, and meets privacy and information security standards. Specifically, this platform must offer a number of capabilities:
• Common data integration layer that allows ingestion of structured, semi-structured and unstructured data, supporting both batch and near real-time capabilities;
• Common hot and cold storage layer with segregated areas for raw data, derived data and curated data. All of this data can also be segregated by subject area/domain, with security policies applied to each of these;
• Common data sets (like Orders, Product, Customer or Referential master data) as a building block for analysis across the board;
• Common multi-purpose, multi-format analysis tools for data scientists;
• Common Business Intelligence and collaboration spaces for power users.
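The storage layout described on this slide (hot and cold tiers, with raw, derived and curated areas segregated by subject area and governed by security policies) can be sketched as a path convention plus a policy lookup. This is a minimal illustration; the path order, role names and policy table are hypothetical and not taken from the deck.

```python
from pathlib import PurePosixPath

ZONES = ("raw", "derived", "curated")  # storage areas named on the slide

def dataset_path(zone, domain, dataset, tier="hot"):
    """Build a storage path segregated by tier, zone and subject area."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    if tier not in ("hot", "cold"):
        raise ValueError(f"unknown tier: {tier}")
    return PurePosixPath(tier) / zone / domain / dataset

# Hypothetical policy table: which roles may read each zone.
READ_POLICY = {
    "raw": {"data_engineer"},
    "derived": {"data_engineer", "data_scientist"},
    "curated": {"data_engineer", "data_scientist", "power_user"},
}

def can_read(role, path):
    """Check a role against the policy of the zone encoded in the path."""
    zone = path.parts[1]  # parts are (tier, zone, domain, dataset)
    return role in READ_POLICY[zone]

p = dataset_path("curated", "sales", "orders")
print(p, can_read("power_user", p))
```

Encoding the zone and domain in the path is one way to let policies attach to a whole area rather than to individual files.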

Modern Data Platform: Conceptual Architecture

Data Sharing and Cross Region Replication

Data Pipeline Architecture

Data Warehouse Vs Data Lake Vs Lakehouse

Databricks Unified Data Analytics Platform

Data Warehouse Vs Data Lake Vs Lakehouse
| Architectural Dimension | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Data Structure | Highly structured (relational tables, schema-on-write) | Mostly unstructured or semi-structured (schema-on-read) | Combines structured & unstructured (both schema-on-read & schema-on-write) |
| Storage Layer | Proprietary storage optimized for query performance | Distributed file or object storage (e.g., HDFS, S3) | Hybrid: combines proprietary storage and flat file systems |
| Processing Engine | SQL-based engines (e.g., Teradata, Snowflake) | Distributed processing (e.g., MapReduce, Spark) | SQL & distributed processing (e.g., Delta Lake with Spark) |
| Data Ingestion | Batch ETL processes | Real-time and batch ingestion (e.g., Kafka, Flume) | Both real-time and batch ingestion |
| Data Storage Cost | Higher due to optimizations | Lower due to commodity file or object storage (e.g., HDFS, S3) | Medium: optimized storage for hot data, object storage for cold data |
| Scalability | Vertical scalability | Horizontal scalability | Both vertical and horizontal scalability |
| Latency | Low latency for queries (optimized storage) | High latency for big data processing | Optimized for low-latency queries on structured data & scalable for big data |
| Flexibility | Less flexible due to fixed schema | Highly flexible with support for various data formats | Combines the flexibility of data lakes with structured querying of DW |
| Security & Governance | Mature with access controls and auditing | Role-based access, encryption, data masking (still maturing) | Inherits mature features from DW and flexible features from DL |
| Integration Capabilities | Integrated BI tools & dashboards | Requires integration with external processing & analytics tools | Supports integrated BI tools, dashboards, and advanced analytics |
| Data Quality & Curation | High (data is cleaned & curated before storage) | Raw (data is stored as-is, curation happens during processing) | Both raw and curated storage with support for refining data |
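The schema-on-write versus schema-on-read distinction in the comparison above can be made concrete with a small sketch. The two classes and the order schema below are hypothetical stand-ins, not real warehouse or lake APIs: the "warehouse" validates on write, while the "lake" stores anything and applies the schema only when read.

```python
import json

ORDER_SCHEMA = {"order_id": int, "amount": float}  # hypothetical schema

def validate(record, schema):
    """Return the record coerced to the schema, or raise on a mismatch."""
    return {field: cast(record[field]) for field, cast in schema.items()}

class WarehouseTable:
    """Schema-on-write: bad rows are rejected when they are written."""
    def __init__(self, schema):
        self.schema, self.rows = schema, []
    def write(self, record):
        self.rows.append(validate(record, self.schema))

class LakeFile:
    """Schema-on-read: anything is stored; readers apply the schema."""
    def __init__(self):
        self.lines = []
    def write(self, record):
        self.lines.append(json.dumps(record))
    def read(self, schema):
        for line in self.lines:
            try:
                yield validate(json.loads(line), schema)
            except (KeyError, ValueError, TypeError):
                continue  # malformed rows are only discovered at read time

good = {"order_id": "1", "amount": "9.5"}
bad = {"order_id": "oops"}

lake = LakeFile()
lake.write(good)
lake.write(bad)            # accepted as-is
table = WarehouseTable(ORDER_SCHEMA)
table.write(good)
rejected = False
try:
    table.write(bad)       # rejected at write time
except (KeyError, ValueError):
    rejected = True
```

A lakehouse, in the table's terms, would offer both behaviours over the same storage; this sketch keeps them separate only to show the contrast.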

Delta Lake is an open source storage layer that brings improved reliability to data lakes

Building a Data Driven Culture
[Diagram: personas and the questions they ask of the Cloud Data Warehouse and Data Lake (CDW/L).
Roles shown: Data Scientist, Data Engineer, Architects, Citizen Consumer, Data Governance Lead, Operations Leader, Data Management Leader, Data Steward.
Questions shown: "How do I manage changes in the CDW/L?"; "Is my domain performing? What actions can we take to improve quality?"; "How can I find data to onboard into the CDW/L?"; "Where can I find, discover, understand data required for my analysis?"; "Are we achieving our adoption targets? Are we achieving our efficiency targets?"; "How do I ensure our policies are adhered to?"
Other labels: Quality, Privacy; Steward, Measure, Control, Build, Operate, Consume; Data Democratization; Automation, Scale, Intelligence, Powered by CLAIRE.]

Self Service Data Management and Analytics
[Diagram labels: Powerful User Interactions; Business Focused; Scalable; Simple; Enterprise-ready. Users shown: Architect, Business Leader, Citizen Data Scientist, Data Analyst, Citizen Integrators, IT Specialist, SaaS/EDW Owner. Any User, Any Pattern, Any Data.]

Embed Quality & Governance in CDW/DL Architecture
[Diagram: quality and governance embedded in the cloud data warehouse / data lake architecture.
Sources: Streaming, On-Premises, IoT, Machine Data, Social, Log files, Apps, Mobile, Databases, Application Servers, Documents, Mainframe, Data Warehouse, SaaS, ERP, DRM.
Cloud Data Lake: Landing Zone, Data Enrichment, Enterprise Zone; Cloud Storage, Spark Processing, Stream Storage, Stream Processing.
Numbered components: Data Catalog & Data Governance (Discovery, Lineage, Glossary), Data Ingestion, Data Integration & Quality, Data Integration, Data Provisioning, Stream Processing.
Targets: Cloud Data Warehouse, Data Science/AI, Real-time Analytics, Enterprise Analytics, Line of Business / Self-Service Analytics.
Users: Data Scientist, Data Engineer, Line of Business, Data Analyst, Business User.]
Data Quality – Cleanse, Parse, De-dupe, Standardize
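The four data quality steps named on this slide (cleanse, parse, de-dupe, standardize) can be sketched as small composable functions. The field names, sample records and the specific rule applied in each step are assumptions made for illustration; the deck does not specify them.

```python
import re

def cleanse(record):
    """Strip surrounding whitespace and drop empty fields."""
    cleaned = {k: v.strip() for k, v in record.items()}
    return {k: v for k, v in cleaned.items() if v}

def parse(record):
    """Split a free-text 'name' field into first and last name."""
    first, _, last = record.get("name", "").partition(" ")
    return {**record, "first_name": first, "last_name": last}

def standardize(record):
    """Lower-case e-mail addresses and keep only digits in phone numbers."""
    out = dict(record)
    if "email" in out:
        out["email"] = out["email"].lower()
    if "phone" in out:
        out["phone"] = re.sub(r"\D", "", out["phone"])
    return out

def dedupe(records, key="email"):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec.get(key) not in seen:
            seen.add(rec.get(key))
            unique.append(rec)
    return unique

raw = [
    {"name": " Ada Lovelace ", "email": "ADA@Example.org", "phone": "+44 20-7946"},
    {"name": "Ada Lovelace", "email": "ada@example.org ", "phone": ""},
]
clean = dedupe([standardize(parse(cleanse(r))) for r in raw])
print(clean)
```

In this sketch de-duplication runs after standardization, a design choice that lets values differing only in case or spacing compare as equal.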

Data Analytics Platform –Logical Architecture

Lambda Vs Kappa Architecture
Data that flows into the hot path is constrained by latency requirements imposed by the speed layer, so that it can be processed as quickly as possible. Often, this requires a tradeoff of some level of accuracy in favor of data that is ready as quickly as possible.
Eventually, the hot and cold paths converge at the analytics client application. If the client needs to display timely, yet potentially less accurate data in real time, it will acquire its result from the hot path. Otherwise, it will select results from the cold path to display less timely but more accurate data.
In the world of big data, if Lambda is the seasoned chef balancing two pans, then Kappa is the millennial with a single Insta-worthy skillet, cooking everything on the fly.
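The hot-path versus cold-path choice described on this slide can be sketched in a few lines. Everything here is a simplified assumption: the event shape, the page-view counting use case, and the idea that the speed layer skips deduplication to stay fast, which is one possible source of the accuracy tradeoff the slide mentions.

```python
from collections import Counter

def batch_view(master_events):
    """Cold path: recompute exact counts from the full, deduplicated history."""
    unique = {e["id"]: e for e in master_events}  # duplicates collapse by id
    return Counter(e["page"] for e in unique.values())

class SpeedLayer:
    """Hot path: increment counts as events arrive, with no deduplication."""
    def __init__(self):
        self.counts = Counter()
    def on_event(self, event):
        self.counts[event["page"]] += 1

def query(page, batch_counts, speed_counts, timely=True):
    """Client-side choice: hot path for timely, cold path for accurate."""
    return speed_counts[page] if timely else batch_counts[page]

events = [
    {"id": 1, "page": "home"},
    {"id": 2, "page": "home"},
    {"id": 2, "page": "home"},  # duplicate delivery of event 2
]
speed = SpeedLayer()
for e in events:
    speed.on_event(e)
cold = batch_view(events)

hot_result = query("home", cold, speed.counts, timely=True)
cold_result = query("home", cold, speed.counts, timely=False)
print(hot_result, cold_result)
```

The two results differ because the hot path counted the duplicate delivery, while the cold path had time to remove it.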

Lambda Vs Kappa Vs Delta Architecture
| Architectural Dimension | Lambda Architecture | Kappa Architecture | Delta Architecture |
|---|---|---|---|
| Processing Layers | Two: Batch and Stream (Speed) | Single: Stream | Unified: Both Batch and Stream |
| Data Processing | Separate paths for batch and real-time processing | All data is processed as a stream | Unified processing system for all data |
| Complexity | High (due to dual paths) | Low (single processing paradigm) | Moderate (merged paths but need for stateful processing) |
| Latency | Low latency for real-time, high for batch | Low latency (real-time) | Low latency for both batch and real-time |
| System Maintenance | More complex (maintain two systems) | Simpler (maintain one system) | Moderate (unified system but may have complexities) |
| Data Freshness | Real-time layer provides fresh data, batch layer lags | Provides fresh data due to real-time processing | Fresh data due to unified processing |
| Fault Tolerance & Recovery | Requires separate recovery strategies for batch & stream | Recovery based on replaying streams | Recovery based on unified system approach |
| Tooling & Technologies | Different tools for batch (e.g., Hadoop) & stream (e.g., Storm) | Common tools for all data (e.g., Kafka Streams) | Unified tools (e.g., Apache Spark) |
| Data Consistency | Challenging due to dual paths | Easier due to single stream processing | Achieved through unified processing |

Just as rivers have different paths to the sea, data architectures have their courses too. Whether it's Lambda's dual highway, Kappa's speed lane, or Delta's scenic route, they all ensure data flows to its right destination. Choose wisely, and may your streams never run dry!
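The Kappa row on fault tolerance ("recovery based on replaying streams") can be illustrated with a toy append-only log. The `EventLog` class below is a hypothetical in-memory stand-in for a durable stream, not a real Kafka API, and the running-total job is an assumed example workload.

```python
class EventLog:
    """Append-only log standing in for a durable, replayable stream."""
    def __init__(self):
        self._events = []
    def append(self, event):
        self._events.append(event)
    def read_from(self, offset=0):
        return iter(self._events[offset:])
    def __len__(self):
        return len(self._events)

def running_total(log, offset=0, state=0.0):
    """The single stream job: fold events from an offset into a running total."""
    for event in log.read_from(offset):
        state += event["amount"]
    return state

log = EventLog()
for amount in (10.0, 5.0, 2.5):
    log.append({"amount": amount})

live_state = running_total(log)                 # normal stream processing
recovered = running_total(log, offset=0)        # recovery: replay from the start

checkpoint = len(log)                           # remember how far we have read
log.append({"amount": 1.0})
resumed = running_total(log, offset=checkpoint, state=live_state)
print(live_state, recovered, resumed)
```

Because the same job handles both live processing and replay, there is only one code path to maintain, which is the simplicity the table attributes to Kappa.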

Data Analytics Platform: End to End Architecture

Data Analytics Platform –Data Analytics Capabilities
• Ingesting, storing and managing structured, semi-structured and unstructured data capabilities;
• Hot and cold storage of large volumes of data for different analytical purposes;
• Data and metadata governance capabilities for EMA, EMRN and their stakeholders' data required for data analytics;
• Data auditability and monitoring capabilities;
• Batch and streaming data integration and processing capabilities;
• Data ingestion and data publication capabilities supporting both human and machine interfaces;
• Multi-purpose, multi-format, multi-technology, data exploration and analytics capabilities;
• Predictive analytics capabilities enabling the application of Artificial Intelligence algorithms to data by internal and external stakeholders;
• Prescriptive analytics capabilities from internal and external data sources;
• Self-service Business Intelligence and data analytics capabilities.

[email protected] @sameersparadkar https://www.linkedin.com/in/sameerparadkar/
Closing Notes and Q & A
"And remember folks, while Modern Data Platforms sound complex, at the end of the day, they're just like a Swiss Army knife for data – versatile, efficient, and always there when you need to slice and dice some numbers! Thanks for tuning in!"

Confidential information owned by EVIDEN SAS, to be used by the recipient only. This
document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor
quoted without prior written approval from EVIDEN SAS.
Thank you!
[email protected]