Big Data Ecosystem


About This Presentation

All about Big Data components and the best tools to ingest, process, store and visualize the data.

This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.


Slide Content

BIG DATA Ecosystem
Lucian Neghina
Big Data & Cloud Computing
by Developer for Developers

Introduction
Let’s see what Big Data is
1

BIG DATA CHALLENGES
The 3 Vs of Big Data
VOLUME: Terabytes, Records, Transactions, Tables, Files
VELOCITY: Batch, Near time, Real time, Streaming
VARIETY: Structured, Unstructured, Semi-structured, All of the above

BIG DATA PIPELINE
Source -> Ingest -> Process -> Analyze -> Store -> Visualize
Source: Structured, Semi-Structured, Unstructured
Ingest: Messaging, API/ODBC, ETL, Replication
Process: Real Time, Batch
Analyze: Interactive, AI/ML
Store: Data Lake, Operational Data Store
Visualize: Web Dashboards, Mobile Devices, Web Services

BIG DATA POPULAR USE CASES
Fraud Detection
Security Intelligence
Price Optimization
Behavioral Analytics
Recommendation Engines
Social Media Analysis and Response
Internet of Things
Financial Trading
Improving Science and Research
Performance Optimisation
Improving Healthcare


Big Data is data sets that are too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.

Ingest
Big Data Component
2

WHAT IS DATA INGESTION
Source systems -> Ingest / Collect -> Destination system
CATEGORIES OF DATA
● Data in motion
● Data at rest

DATA INGESTION SQOOP
RDBMS: PostgreSQL, Oracle, MySQL, ...
Sqoop Import: a Sqoop job launches parallel Map tasks on the Hadoop cluster that copy table data from the RDBMS into HDFS
Sqoop Export: Map tasks read data from HDFS and write it back to the RDBMS
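
A minimal import command matching the flow above; the connection string, credentials, table and target directory are illustrative placeholders:

sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4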

DATA INGESTION KAFKA
PRODUCER-CONSUMER: Producers -> Kafka Cluster -> Consumers
CONNECT: Data Source -> Kafka Connect -> Kafka Cluster -> Kafka Connect -> Data Sink
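
A minimal producer/consumer sketch using the kafka-python client; the broker address, topic name and group id are placeholders:

from kafka import KafkaProducer, KafkaConsumer

# producer: publish an event to a topic
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("clicks", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# consumer: read the topic from the beginning as part of a consumer group,
# stopping after 10 seconds without new messages
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="broker:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
)
for message in consumer:
    print(message.key, message.value)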

DATA INGESTION FLUME
External Source -> Flume Agent -> HDFS / File
Flume Agent: Source -> Channel 1 -> Sink 1, Channel 2 -> Sink 2
Data flows through the agent as Events.
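
A minimal single-agent configuration in Flume's properties format, matching the diagram above; the agent name, spool directory and HDFS path are illustrative:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# source: pick up files dropped into a local spooling directory
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/incoming
agent1.sources.src1.channels = ch1

# channel: in-memory buffer of events between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# sink: write the events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1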

DATA INGESTION NIFI
Edge Data: IoT Devices, Mobile, Containers (Client Libraries, MiNiFi)
Gateway: MiNiFi
Regional Center: NiFi Server Cluster
Core Data Center: NiFi Server Cluster -> Kafka, Storm, Spark, Others...

Storage
Big Data Component
3

DATA STORAGE CAP
Consistency: all clients see the same data regardless of updates or deletes
Availability: the system continues to operate as expected even with node failures
Partition Tolerance: the system continues to operate as expected despite network or message failures

DATA STORAGE HDFS
Distributed File System
Master/Slave architecture
Provides file permissions and authentication
High fault-tolerance
Read/Write terabytes of data per second
Streaming data access
Replicates the data for durability
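
A small write/read sketch using the hdfs Python package (HdfsCLI) over WebHDFS; the NameNode URL, user and paths are placeholders:

from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")

# write a small file; HDFS replicates its blocks across DataNodes for durability
client.write("/data/raw/events/sample.json", data=b'{"id": 1}', overwrite=True)

# stream the file back and list the directory
with client.read("/data/raw/events/sample.json") as reader:
    print(reader.read())
print(client.list("/data/raw/events"))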

DATA STORAGE HBASE
NoSQL database
Consistency and Partition Tolerance
No data types
Stores data in HDFS
Optimized for reads
Column-Oriented
Automatic sharding and load balancing
Master/Slave architecture
Supports aggregation
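
A minimal read/write sketch using the happybase client (assumes an HBase Thrift server is running; the host, table and column names are placeholders):

import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("metrics")

# rows are keyed byte strings; columns live in column families (here "d")
table.put(b"sensor-1|2018-01-08", {b"d:temperature": b"21.4", b"d:unit": b"C"})

# point read by row key
print(table.row(b"sensor-1|2018-01-08"))

# range scan over sorted row keys
for key, data in table.scan(row_prefix=b"sensor-1|"):
    print(key, data)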

DATA STORAGE CASSANDRA
NoSQL database
Optimized for writes
No Single Point of Failure
Column-Oriented
Tunable Consistency
Ring architecture
Availability and Partition Tolerance
Scalable with large clusters
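
A small sketch using the DataStax cassandra-driver; the contact points, keyspace and table are placeholders:

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-node1", "cassandra-node2"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id text, event_time timestamp, payload text,
        PRIMARY KEY (user_id, event_time)
    )
""")

# any node in the ring can take the write; consistency level is tunable per query
session.execute(
    "INSERT INTO demo.events (user_id, event_time, payload) VALUES (%s, toTimestamp(now()), %s)",
    ("user-42", '{"page": "/home"}'),
)
for row in session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("user-42",)):
    print(row)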

DATA STORAGE SOLR
Full-Text Search
Linear Scalability
Distributed Index
Schema / Schemaless
Auto Index Replication
Inverted Indexing
Auto Failover and Recovery
Sharding and Replication
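
A minimal indexing and full-text query sketch with the pysolr client; the Solr URL and core name are placeholders:

import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/articles", always_commit=True)

# index a couple of documents; text fields are analyzed into an inverted index
solr.add([
    {"id": "1", "title": "Big Data Ecosystem", "body": "ingest, store, process, visualize"},
    {"id": "2", "title": "Search at scale", "body": "sharding and replication in SolrCloud"},
])

# full-text query against the indexed body field
for doc in solr.search("body:sharding"):
    print(doc["id"], doc["title"])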

DATA STORAGE REDIS
Persistence via Snapshot / Journal
Key-Value NoSQL database
In memory data store
Keys can have expiry time
Master/Slave architecture
Publish / Subscribe system
Consistency and Partition Tolerance
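
A short sketch with the redis-py client covering key-value access with expiry and publish/subscribe; the host and key names are placeholders:

import redis

r = redis.Redis(host="redis-host", port=6379)

# key-value write with an expiry time of one hour
r.set("session:42", '{"user": "alice"}', ex=3600)
print(r.get("session:42"))

# publish/subscribe: any client subscribed to "alerts" receives this message
r.publish("alerts", "disk usage above 90%")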

DATA STORAGE TITAN
Graph database
CAP according to backend storage
Geo, numeric range, full text: ElasticSearch, Solr, Lucene
Supports ACID and Eventual Consistency
Very large graphs
Storage backends: Cassandra, HBase, Oracle
Concurrent Transactions and Operational Graph Processing
Elastic and linear scalability

Process & Analyze
Big Data Component
4

DATA PROCESSING
BATCH
Data arrives and is processed at certain intervals.

NEAR REAL-TIME
The time between when data arrives and when it is processed is very small (micro-batches).

REAL TIME
Data arrives and is processed in a continuous manner.
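
A sketch of the batch and micro-batch modes using PySpark (Spark is named elsewhere in the deck; the paths and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("processing-modes").getOrCreate()

# BATCH: process everything that has landed so far, at a scheduled interval
batch = spark.read.json("/data/raw/events")
batch.groupBy("user_id").count().write.mode("overwrite").parquet("/data/agg/events_per_user")

# NEAR REAL-TIME: Structured Streaming picks up new files in micro-batches
stream = spark.readStream.schema(batch.schema).json("/data/raw/events")
query = (
    stream.groupBy("user_id").count()
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()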

DATA ANALYTICS
INTERACTIVE
Set of approaches to explore data, supporting
exploration at the rate of human thought.
MACHINE LEARNING
Turning data into information using automated
methods without direct human intervention.
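
As a small illustration of the machine-learning side, a scikit-learn sketch (scikit-learn and the toy dataset are stand-ins for illustration, not tools named in the deck):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# toy dataset standing in for features produced by the processing stage
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit a model automatically and score it on held-out data
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))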

Visualization
Big Data Component
5

DATA VISUALIZATION
Monitoring
Audiences: Business users; Data scientists, developers
Tools: Notebooks, Business Intelligence, Frameworks (D3.js, Chart.js, Google Charts)
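
The frameworks named above are browser-side; as a notebook-side sketch in Python, a hypothetical aggregate from the pipeline could be charted with matplotlib (the values are made up for illustration):

import matplotlib.pyplot as plt

# made-up aggregate from the pipeline, purely for illustration
events_per_day = {"Mon": 120, "Tue": 340, "Wed": 280, "Thu": 410, "Fri": 390}

plt.bar(list(events_per_day.keys()), list(events_per_day.values()))
plt.title("Events per day")
plt.ylabel("events")
plt.show()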

Thank You !
@eSolutionsGrup
www.esolutions.ro