Apache Spark Introduction.pdf

Mahesh Pandit | 55 slides | Aug 01, 2022

About This Presentation

This presentation introduces Apache Spark and then moves into the deeper details of the technology.


Slide Content

© 2018 YASH Technologies | www.yash.com | Confidential
Apache Spark
-Mahesh Pandit

Agenda
What is Big Data
Sources for Big Data
Big Data Exploration
Hadoop Introduction
Limitations of Hadoop
Why Spark
Spark Ecosystem
Features of Spark
Use cases of Spark
Spark Architecture
Spark Structured APIs
Spark RDD
DataFrame & Datasets
Best Source to learn
Question-Answers

BIG DATA
What is Big Data
Source of Big Data

BIG DATA

What is Big Data
Data sets that are too large or complex for traditional data-processing application software to adequately deal with.
Big data can be described by the following characteristics:
Volume: the scale of the data
Velocity: the speed at which data is generated and processed
Variety: the different forms of data
Veracity: the uncertainty of the data
Value: data is of no use until it is turned into value

Sources of Big Data
Human
Machine

Apache Hadoop
Introduction
Storage in Hadoop (HDFS)
Big Limitations of Hadoop

Apache Hadoop
Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on clustered systems.
It was inspired by the "Google File System" paper.
Its co-founders are Doug Cutting and Mike Cafarella.
Hadoop 0.1.0 was released in April 2006.
Apache Hadoop has two main parts:
Hadoop Distributed File System (HDFS)
MapReduce
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Data Storage in HDFS

Big Limitations of Hadoop

Introduction to Spark
Introduction
Spark Ecosystem
Iterative & Interactive Data Mining by Spark
Features of Spark
Spark Cluster Manager
Spark Use Cases

Why Spark?

Top Companies Using Spark

Introduction to Spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework.
It provides high-level APIs in Java, Scala, Python and R.
It provides an optimized engine that supports general execution graphs.
Spark can be up to 100 times faster than MapReduce for in-memory workloads.
It also supports a rich set of higher-level tools, including Spark SQL, Spark Streaming, MLlib and GraphX.

Apache Spark Ecosystem


Iterative and Interactive Data Mining
MapReduce simplified "Big Data" analysis on large clusters, but it is inefficient at handling iterative algorithms and interactive data mining.
This led to the innovation of a new technology, Spark, which provides abstractions for leveraging distributed memory.
[Diagram: iterative jobs and repeated interactive queries reuse data across runs, in contrast to one-time processing from the input.]

Features of Spark

Lazy Evaluation

DAG (Directed Acyclic Graph)

Spark Cluster Managers
There are three types of cluster managers (see the configuration sketch after this list):
Spark Standalone
Hadoop YARN
Apache Mesos
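A minimal, hedged sketch of how the cluster manager is selected through the master URL when building a Spark session; the host names and ports below are hypothetical placeholders, not values from the slides.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // Pick exactly one master URL:
  .master("local[*]")                      // run locally on all cores (no cluster manager)
  // .master("spark://master-host:7077")   // Spark Standalone cluster
  // .master("yarn")                       // Hadoop YARN (requires HADOOP_CONF_DIR to be set)
  // .master("mesos://mesos-host:5050")    // Apache Mesos
  .getOrCreate()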

Spark Use Cases

Apache Spark Architecture
SparkContext
Spark Driver
DAG Scheduler
Stages
Executor

SparkContext
It is the entry gate to Apache Spark functionality.
It allows your Spark application to access the Spark cluster with the help of a resource manager.
The resource manager can be one of these three: Spark Standalone, YARN, or Apache Mesos.
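As a hedged illustration of SparkContext being the entry gate, a minimal Scala sketch that creates a SparkContext directly; the application name and local master are placeholder values, and a resource manager URL could be used instead.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and local master; swap in a cluster manager URL for a real cluster.
val conf = new SparkConf().setAppName("SparkContextDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

// The SparkContext is now the handle to the cluster: it creates RDDs, accumulators and broadcast variables.
val rdd = sc.parallelize(1 to 5)
println(rdd.count())   // prints 5

sc.stop()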

Functions of SparkContext in Apache Spark

Workflow of Spark Architecture
The client submits the Spark user application code. When the application code is submitted, the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph, called a DAG.
After that, it converts the logical graph (the DAG) into a physical execution plan with many stages.
After converting into a physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.

Workflow of Spark Architecture
Now the driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver.
At this point, the driver sends the tasks to the executors based on data placement.
When executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing tasks.
During the execution of the tasks, the driver program monitors the set of executors that are running.
The driver node also schedules future tasks based on data placement.

Spark Executor
Executors in Spark are processes launched on worker nodes.
They are in charge of running the individual tasks of a given Spark job.
They are launched at the start of a Spark application and typically run for the entire lifetime of the application.
As soon as they have run a task, they send the results to the driver.
Executors also provide in-memory storage for Spark RDDs that are cached by user programs, through the Block Manager.

Stages in Spark
A stage is nothing but a step in the physical execution plan.
A stage is a set of parallel tasks, i.e. one task per partition.
Basically, each job gets divided into smaller sets of tasks, and each such set is a stage.

Spark Stage: An Introduction to the Physical Execution Plan
Stages in Apache Spark fall into two categories:
1. ShuffleMapStage
It is an intermediate Spark stage in the physical execution of the DAG.
It produces data for another stage (or stages), so it can be seen as the input for the stages that follow it in the DAG.
A ShuffleMapStage may contain any number of pipelined operations, such as map and filter, before the shuffle operation.
2. ResultStage
A ResultStage is the stage that runs a function on an RDD to execute a Spark action in a user program.
It is considered the final stage in a Spark job.
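To make the two stage types concrete, a small hedged sketch (assuming an existing SparkContext named sc): the reduceByKey below forces a shuffle, so Spark plans a ShuffleMapStage for the map side and a ResultStage for the collect action.

// Assumes an existing SparkContext named sc.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "scala"))

// map is pipelined into the ShuffleMapStage that writes the shuffle output.
val pairs = words.map(word => (word, 1))

// reduceByKey introduces a shuffle boundary between stages.
val counts = pairs.reduceByKey(_ + _)

// collect is the action; it runs in the final ResultStage and returns the results to the driver.
counts.collect().foreach(println)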

Spark Structured APIs
RDD
DataFrame
Dataset

Spark Structured APIs
The Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files.
Spark RDD
It is a read-only, partitioned collection of records. It allows a programmer to perform in-memory computations.
Spark DataFrame
Unlike an RDD, the data is organized into named columns (like a table in an RDBMS).
A DataFrame allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.
Spark Dataset
Datasets in Apache Spark are an extension of the DataFrame API which provides a type-safe, object-oriented programming interface.
It takes advantage of Spark's Catalyst optimizer by exposing expressions and data fields to a query planner.
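A compact, hedged sketch contrasting the three abstractions on the same records (it assumes an existing SparkSession named spark; the Person case class is purely illustrative):

// Illustrative case class; assumes an existing SparkSession named spark.
case class Person(name: String, age: Int)
import spark.implicits._

val people = Seq(Person("Ann", 31), Person("Bob", 25))

val rdd = spark.sparkContext.parallelize(people)   // RDD: opaque objects, no schema known to Spark
val df  = people.toDF()                            // DataFrame: named columns (name, age)
val ds  = people.toDS()                            // Dataset: named columns plus the compile-time type Person

df.filter($"age" > 30).show()                      // column expression, checked at runtime
ds.filter(p => p.age > 30).show()                  // typed lambda, checked at compile time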

RDD (Resilient Distributed Dataset)
Introduction
Features of RDD
Way to Create RDD
Working of RDD
RDD Persistence and Caching
RDD Operations

Spark RDD
RDD stands for Resilient Distributed Dataset.
It is the fundamental data structure of Apache Spark.
RDDs are an immutable collection of objects which are computed on different nodes of the cluster.
Resilient
i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions after node failures.
Distributed
since the data resides on multiple nodes.
Dataset
represents the records of the data you work with. The user can load the data set externally, which can be a JSON file, CSV file, text file or a database via JDBC, with no specific data structure.

RDD Properties
[Diagram: the HDFS blocks of an input dataset become the partitions of an RDD; textFile(), filter() and count() build a lineage over the HDFS RDD, Error RDD and Log File RDD shown. The illustrated RDD properties are: #1 a list of partitions, #2 a compute function, #3 a list of dependencies.]

Ways to Create RDD
[Diagram: three ways to create an RDD in Spark: from a parallelized collection, from external data, and from an existing RDD.]

Ways to Create RDD
Parallelized collection (parallelizing)
val data = spark.sparkContext.parallelize(Seq(("maths", 52), ("english", 75), ("science", 82), ("computer", 65), ("maths", 85)))
val sorted = data.sortByKey()
sorted.foreach(println)
External datasets
val dataRDD = spark.read.csv("path/of/csv/file").rdd
val dataRDD = spark.read.json("path/of/json/file").rdd
val dataRDD = spark.read.textFile("path/of/text/file").rdd
Creating an RDD from an existing RDD
val words = spark.sparkContext.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)

Working of RDD

MEMORY_ONLY
Here the RDD is stored as de-serialized Java objects in the JVM.
If the size of the RDD is greater than the available memory, some partitions will not be cached and will be recomputed each time they are needed.
The space used for storage is very high, and the CPU computation time is low.
The data is stored in memory only; it does not make use of the disk.
MEMORY_AND_DISK
Here the RDD is stored as de-serialized Java objects in the JVM.
When the size of the RDD is greater than the size of the memory, the excess partitions are stored on disk and retrieved from disk whenever required.
The space used for storage is high, and the CPU computation time is medium.
It makes use of both in-memory and on-disk storage.

MEMORY_ONLY_SER
It stores the RDD as serialized Java objects.
It is more space-efficient compared to de-serialized objects.
The storage space is low, and the CPU computation time is high.
The data is stored in memory only; it does not make use of the disk.
MEMORY_AND_DISK_SER
It drops the partitions that do not fit into memory to disk, rather than recomputing them each time they are needed.
The space used for storage is low, and the CPU computation time is high.
It makes use of both in-memory and on-disk storage.

DISK_ONLY
The RDD is stored only on disk.
The space used for storage is low, and the CPU computation time is high.
It makes use of on-disk storage only.
Advantages of In-Memory Processing
It is good for real-time risk management and fraud detection.
The data becomes highly accessible.
The computation speed of the system increases.
It improves complex event processing.
A large amount of data can be cached.
It is economical, as the cost of RAM has fallen over time.
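A hedged sketch of choosing these storage levels in code (it assumes an existing SparkContext named sc; the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// Placeholder input path; assumes an existing SparkContext named sc.
val logs = sc.textFile("hdfs:///data/app/logs.txt")
val errors = logs.filter(line => line.contains("ERROR"))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
errors.cache()

// Any of the levels described above could be chosen explicitly instead of cache():
// errors.persist(StorageLevel.MEMORY_AND_DISK)
// errors.persist(StorageLevel.MEMORY_ONLY_SER)
// errors.persist(StorageLevel.MEMORY_AND_DISK_SER)
// errors.persist(StorageLevel.DISK_ONLY)

println(errors.count())   // the first action materializes and caches the partitions
println(errors.count())   // the second action reads from the cache instead of recomputing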

Spark RDD Operations
Apache Spark RDD operations are of two kinds: Transformations and Actions.
Transformations: the operations that are applied to create a new RDD.
Actions: they are applied on an RDD to instruct Apache Spark to apply the computation and pass the result back to the driver.
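A minimal, hedged sketch of the two kinds of operations (assuming an existing SparkContext named sc); nothing is computed until the action on the last lines runs:

// Assumes an existing SparkContext named sc.
val numbers = sc.parallelize(1 to 10)

// Transformations: each returns a new RDD and is evaluated lazily.
val doubled = numbers.map(_ * 2)
val bigOnes = doubled.filter(_ > 10)

// Action: triggers the actual computation and returns a result to the driver.
val result = bigOnes.collect()
println(result.mkString(", "))   // 12, 14, 16, 18, 20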

RDD Transformations
A Spark transformation is a function that produces a new RDD from existing RDDs.
Applying transformations builds an RDD lineage, containing all the parent RDDs of the final RDD(s).
The RDD lineage is also known as the RDD operator graph or RDD dependency graph.
It is a logical execution plan, i.e. a directed acyclic graph (DAG) of all the parent RDDs of the RDD.
Transformations are lazy in nature, i.e. they execute only when we call an action.
There are two types:
Narrow transformations
Wide transformations

Narrow transformation
All the elements that are required to compute the records in a single partition live in a single partition of the parent RDD.
Examples: map, flatMap, filter, union, sample, mapPartitions.
Wide transformation
All the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD.
Examples: intersection, distinct, reduceByKey, groupByKey, join, cartesian, repartition.
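A hedged sketch contrasting the two kinds (assuming an existing SparkContext named sc): filter is narrow because each output partition depends on a single parent partition, while reduceByKey is wide because records with the same key must be shuffled together.

// Assumes an existing SparkContext named sc.
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 5), ("apples", 2), ("pears", 1)))

// Narrow transformation: each result partition is computed from one parent partition.
val bigSales = sales.filter { case (_, qty) => qty > 1 }

// Wide transformation: values for the same key may sit in many parent partitions, so a shuffle is needed.
val totals = bigSales.reduceByKey(_ + _)

totals.collect().foreach(println)   // e.g. (apples,5) and (pears,5)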

Spark DataFrame and Dataset
Introduction of DataFrame
DataFrame from various sources
Dataset in Spark
Features of Dataset

DataFrame
Data organized as a distributed collection into named columns is what we call a Spark DataFrame.
It is the same as a table in a relational database.
We can construct a Spark DataFrame from a wide array of sources, for example structured data files, tables in Hive, or external databases.
It can also handle petabytes of data.
[Diagram: ways of creating a Spark DataFrame: from local data frames, from Hive tables, and from data sources.]

DataFrame from Various Data Sources
A DataFrame can be created from many sources: Hive data, CSV data, JSON data, RDBMS data, XML data, Parquet data, Cassandra data, and existing RDDs.
[Diagram: the resulting DataFrame laid out as rows and named columns.]

a. From local data frames
The simplest way to create a Spark DataFrame is to convert a local R data frame into a SparkDataFrame.
df <- as.DataFrame(faithful)
# Displays the first part of the SparkDataFrame
head(df)
## eruptions waiting
##1 3.600 79
##2 1.800 54
##3 3.333 74

b. From data sources
The general method for creating DataFrames from data sources is read.df.
This method takes in the path of the file to load.
Spark supports reading JSON, CSV and Parquet files natively.
# in Python
df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
## age name
##1 NA Michael
##2 30 Andy
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
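For completeness, a hedged Scala sketch of the same kind of read (assuming an existing SparkSession named spark; the path is reused from the example above):

// Assumes an existing SparkSession named spark.
val flightDf = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
flightDf.printSchema()
flightDf.show(5)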

c. From Hive tables
We can also use Hive tables to create SparkDataFrames. For this, we need to create a SparkSession with Hive support, which also gives access to the tables in the Hive metastore.
SparkR attempts to create a SparkSession with Hive support enabled by default (enableHiveSupport = TRUE).
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
results <- sql("FROM src SELECT key, value")
# results is now a SparkDataFrame
head(results)
## key value
## 1 238 val_238
## 2 86 val_86

Dataset in Spark
The Dataset emerged to overcome the limitations of the RDD and the DataFrame.
In the DataFrame there is no provision for compile-time type safety, and the data cannot be manipulated without knowing its structure.
In the RDD there is no automatic optimization, so optimization has to be done manually when needed.
A Dataset is an interface that provides the benefits of RDDs (strong typing) together with Spark SQL's optimization.
Datasets are strictly a Java Virtual Machine (JVM) language feature and work only with Scala and Java.
The Spark Dataset provides both type safety and an object-oriented programming interface.
It represents structured queries with encoders. It is an extension of the DataFrame API.
Conceptually, Datasets are equivalent to a table in a relational database or a DataFrame in R or Python.
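A brief, hedged sketch of a typed Dataset (it assumes an existing SparkSession named spark; the Employee case class is purely illustrative):

// Illustrative case class; assumes an existing SparkSession named spark.
case class Employee(name: String, salary: Double)
import spark.implicits._

val ds = Seq(Employee("Ann", 90000.0), Employee("Bob", 60000.0)).toDS()

// Typed operations are checked at compile time...
val wellPaid = ds.filter(e => e.salary > 70000.0)

// ...while the same Dataset still benefits from Catalyst through untyped column expressions.
wellPaid.select($"name").show()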

Features of Dataset in Spark
Optimized queries
Analysis at compile time
Persistent storage
Inter-convertible
Faster computation
Less memory consumption
A single API for Java and Scala
Strongly typed

Best Books for Spark and Scala

Question-Answers

Feel free to write to me at:
[email protected]
in case of any queries / clarifications.