Big Data Architecture

gschmutz 26,399 views 41 slides Sep 26, 2015
Slide 1
Slide 1 of 41
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41

About This Presentation

The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures which have proven their suitability over years. This session discusses the different Big Data Architectures which have evolved over time, including tradition...


Slide Content

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Big Data Architecture
Guido Schmutz

Guido Schmutz
Working for Trivadis for more than 18 years
Oracle ACE Director for Fusion Middleware and SOA
Co-Author of different books
Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Twitter: gschmutz

Agenda
1. Introduction
2. Traditional Architecture for Big Data
3. Streaming Analytics Architecture for Fast Data
4. Lambda/Kappa/Unifed Architecture for Big Data
5. Summary
INFOBOX – Read and delete

• If the agenda is used as an interim
page, please highlight the relevant
chapter in red font.

• To allow optimum alignment of objects,
display the drawing guides (right-click
on the page)

Introduction
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

Big Data is still “work in progress”
Choosing the right architecture is key for any (big data) project

Big Data is still quite a young field and therefore there are no standard architectures
available which have been used for years

In the past few years, a few architectures have evolved and have been discussed online

Know the use cases before choosing your architecture

To have one/a few reference architectures can help in choosing the right components

Hadoop Ecosystem – many choices ….
Management /
Monitoring
Core
Analytics Workflow/Job Unstructured
Data Sources
Structured Data
Sources
SQL on
Hadoop
Serialization Data Storage Security

Important Properties to choose a Big Data Architecture
Latency

Keep raw and un-interpreted data “forever” ?

Volume, Velocity, Variety, Veracity

Ad-Hoc Query Capabilities needed ?

Robustness & Fault Tolerance

Scalability

From Volume and Variety to Velocity
Big Data has evolved …





and the Hadoop Ecosystem as well ….
Past
Big Data = Volume & Variety
Present
Big Data = Volume & Variety & Velocity
Past
Batch Processing
Time to insight of Hours
Present
Batch & Stream Processing
Time to insight in Seconds
Adapted from Cloudera blog article

Traditional Architecture for Big
Data
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

“Traditional Architecture” for Big Data
Data
Collection
(Analytical) Data Processing




Result Storage Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
Batch
compute
Stage
Result Store
Query
Engine
Computed
Information
Raw Data
(Reservoir)
= Data in Motion = Data at Rest

Use Case 1) – Click Stream analysis: 360 degree view
of customer
Data
Collection
(Analytical) Data Processing




Result Store Data
Sources
Data
Access
Channel
Batch
compute
Computed
Information
Raw Data
(Reservoir)
Result Store
Query
Engine
Reports
Analytic
Tools
Logfiles

Use Case 2) – Ingest Relational Data into Hadoop and
make it accessible
Data
Collection
(Analytical) Data Processing




Result Store Data
Sources
Data
Access
RDBMS
Batch
compute
Computed
Information
Raw Data
(Reservoir)
Result Store
Query
Engine
Reports
Service
Analytic
Tools
Alerting
Tools

Use Case 2a) – Ingest Relational Data into Hadoop and
make it accessible
Data
Collection
(Analytical) Data Processing




Result Store Data
Sources
Data
Access
RDBMS
(CDC)
Batch
compute
Computed
Information
Raw Data
(Reservoir)
Result Store
Query
Engine
Reports
Service
Analytic
Tools
Alerting
Tools
Channel

“Hadoop Ecosystem” Technology Mapping
Data
Collection
(Analytical) Data Processing




Result Storage Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
Batch
compute
Staging
Result Store
Query
Engine
Computed
Information
Raw Data
(Reservoir)
= Data in Motion = Data at Rest

Apache Spark – the new kid on the block
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed 2009 in UC Berkley’s AMPLab
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk
• One of the largest OSS communities in big data with over 200 contributors in 50+
organizations
• Open Sourced in 2010 – since 2014 part of Apache Software foundation
• Supported by many vendors

Motivation – Why Apache Spark?
Apache Hadoop MapReduce: Data Sharing on Disk







Apache Spark: Speed up processing by using Memory instead of Disks
map reduce . . .
Input
HDFS
read
HDFS
write
HDFS
read
HDFS
write
op1 op2
. . .
Input
Output
Output

Apache Spark “Ecosystem”
Spark SQL
(Batch
Processing)
Blink DB
(Approximate
Querying)
Spark Streaming
(Real-Time)
MLlib, Spark R
(Machine
Learning)
GraphX
(Graph
Processing)
Spark Core API and Execution Model
Spark
Standalone
MESOS YARN HDFS
Elastic
Search
NoSQL S3
Libraries
Core Runtime
Cluster Resource Managers Data / Data Stores

Use Case 3) – Predictive Maintenance through Machine
Learning on collected data
Data
Collection
(Analytical) Data Processing




Result Store Data
Sources
Data
Access
Machine
Batch
compute
Computed
Information
Raw Data
(Reservoir)
Result Store
Query
Engine
Reports
Service
Analytic
Tools
Alerting
Tools
DB
Staging File
Channel

“Spark Ecosystem” Technology Mapping
Data
Collection
(Analytical) Data Processing




Result Store Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
Batch
compute
Stage
Result Store
Query
Engine
Computed
Information
Raw Data
(Reservoir)
= Data in Motion = Data at Rest

Traditional Architecture for Big Data
• Batch Processing
• Not for low latency use cases
• Spark can speed up, but if positioned as alternative to Hadoop Map/
Reduce, it’s still Batch Processing
• Spark Ecosystems offers a lot of additional advanced analytic capabilities
(machine learning, graph processing, …)

Streaming Analytics Architecture
for Big Data
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

Streaming Analytics Architecture for Big Data
aka. (Complex) Event Processing)
Data
Collection
Batch
compute
Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
Logfiles
Sensor
RDBMS
ERP
Mobile
Machine
(Analytical) Real-Time Data
Processing




Stream/Event
Processing
Info Delivery

Messaging
Result Store
= Data in Motion = Data at Rest

Use Case 4) Alerting in Internet of Things (IoT)
Data
Collection
Batch
compute
Data
Sources
Channel
Data
Access
Analytic
Tools
Alerting
Tools
Sensor
Machine
(Analytical) Real-Time Data
Processing




Stream/Event
Processing
Info Delivery

Messaging
Result Store

Use Case 5) Real-Time Analytics on Sensor Events
Data
Collection
Batch
compute
Data
Sources
Channel
Data
Access
Analytic
Tools
Sensor
Machine
(Analytical) Real-Time Data
Processing




Stream/Event
Processing
Info Delivery

Messaging
Result Store

Unified Log (Event) Processing
Stream processing allows for
computing feeds off of other feeds


Derived feeds are no
different than
original feeds
they are computed off


Single deployment of “Unified
Log” but logically different feeds

Meter
Readings
Collector
Enrich /
Transform
Aggregate by
Minute
Raw Meter
Readings
Meter with
Customer
Meter by Customer by
Minute
Customer
Aggregate by
Minute
Meter by Minute
Persist
Meter by
Minute
Persist
Raw Meter
Readings

Streaming Analytics Technology Mapping
Data
Collection
Batch
compute
Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
Logfiles
Sensor
RDBMS
ERP
Mobile
Machine
(Analytical) Real-Time Data
Processing




Stream/Event
Processing
Info Delivery

Messaging
Result Store
= Data in Motion = Data at Rest

Streaming Analytics Architecture for Big Data
The solution for low latency use cases

Process each event separately => low latency

Process events in micro-batches => increases latency but offers better
reliability

Previously known as “Complex Event Processing”

Keep the data moving / Data in Motion instead of Data at Rest => raw events
are (often) not stored

Lambda Architecture for Big Data
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

“Lambda Architecture” for Big Data
Data
Collection
(Analytical) Batch Data Processing




Batch
compute
Batch Result
Store
Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
(Analytical) Real-Time Data Processing



Stream/Event
Processing
Batch
compute
Real-Time
Result Store

Messaging
Result Store
Query
Engine
Result Store
Computed
Information
Raw Data
(Reservoir)
= Data in Motion = Data at Rest

Use Case 6) Social Media and Social Network Analysis
Data
Collection
(Analytical) Batch Data Processing




Batch
compute
Batch Result
Store
Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
(Analytical) Real-Time Data Processing



Stream/Event
Processing
Batch
compute
Real-Time
Result Store

Messaging
Result Store
Query
Engine
Result Store
Computed
Information
Raw Data
(Reservoir)
= Data in Motion = Data at Rest

Lambda Architecture for Big Data
Combines (Big) Data at Rest with (Fast) Data in Motion

Closes the gap from high-latency batch processing

Keeps the raw information forever

Makes it possible to rerun analytics operations on whole data set if necessary
=> because the old run had an error or
=> because we have found a better algorithm we want to apply

Have to implement functionality twice
• Once for batch
• Once for real-time streaming

„Kappa“ Architecture for Big Data
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

“Kappa Architecture” for Big Data
Data
Collection
“Raw Data Reservoir”
Batch
compute
Data
Sources
Messaging
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
Logfiles
Sensor
RDBMS
ERP
Mobile
Machine
(Analytical) Real-Time Data Processing



Stream/Event
Processing
Result Store

Messaging
Result Store
Raw Data
(Reservoir)
= Data in Motion = Data at Rest

Kappa Architecture for Big Data
The solution for low latency use cases

Process each event separately => low latency

Process events in micro-batches => increases latency but offers better
reliability

Previously known as “Complex Event Processing”

Keep the data moving / Data in Motion instead of Data at Rest

„Unified“ Architecture for Big Data
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

“Unified Architecture” for Big Data
Data
Collection
(Analytical) Batch Data Processing
(Calculate Models of incoming data)




Batch
compute
Batch Result
Store
Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
Social
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
(Analytical) Real-Time Data Processing




Stream/Event
Processing
Batch
compute
Real-Time
Result Store

Messaging
Result Store
Query
Engine
Result Store
Computed
Information
Raw Data
(Reservoir)
= Data in Motion = Data at Rest
Prediction
Models

Use Case 7) Fraud Detection
Data
Collection
(Analytical) Batch Data Processing
(Calculate Models of incoming data)




Batch
compute
Batch Result
Store
Data
Sources
Channel
Data
Access
Reports
Service
Analytic
Tools
Alerting
Tools
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
(Analytical) Real-Time Data Processing




Stream/Event
Processing
Batch
compute
Real-Time
Result Store

Messaging
Result Store
Query
Engine
Result Store
Computed
Information
Raw Data
(Reservoir)
Prediction
Models

Summary
INFOBOX – Read and delete

• In the chapter divider, the chapter text
is written centered in the text field
• Please keep chapter names as short as
possible, less text and punchy titles are
better!

Summary
Know your use cases and then choose your architecture and the relevant components/
products/frameworks

You don’t have to use all the components of the Hadoop Ecosystem to be successful

Big Data is still quite a young field and therefore there are no standard architectures
available which have been used for years

Lambda, Kappa Architecture are best practices architectures which you have to adapt to
your environment

INFOBOX – Read and delete

• Slide if reference to other information,
e.g. books, websites, etc.

Name Referent
Titel Referent
Tel. +00 00 000 00 00
[email protected]
INFOBOX – Read and delete

• There are two versions of the last slide
available, one for the contact details of
a speaker, and one for two or more
speakers.
• Name, title and location always
underneath one another in one row
(Shift+Return)
• This idea is that this is the last slide
(also for questions and answers) and is
on the screen for a long time at the end
of the presentation, so the viewers
have the chance to write down the
contact data !