What is Spark? Apache Spark is an open-source, distributed processing system used for big data analytics. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python, and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing. Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and open sourced in 2010. In 2013, the project was donated to the Apache Software Foundation and switched its license from BSD to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.
Spark is used by organizations in every industry, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management and computation engine, it typically uses Hadoop for storage only. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Features of Apache Spark Speed: Spark helps run applications in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; intermediate processing data is stored in memory. Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also offers around 80 high-level operators for interactive querying. Advanced Analytics: Spark supports not only 'Map' and 'Reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
The following diagram shows three ways in which Spark can be built with Hadoop components.
There are three ways of Spark deployment, as explained below. Standalone: A Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster. Hadoop YARN: Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack. Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
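To make the deployment options concrete, here is a minimal PySpark sketch (assuming a local PySpark installation) showing that the choice of cluster manager is largely a matter of the master URL passed to Spark; the application name, host name, and port shown are placeholders, and running against YARN assumes a configured Hadoop environment.
from pyspark import SparkConf, SparkContext
# Placeholder application name and master URLs; adjust for your environment.
conf = SparkConf().setAppName("DeploymentSketch")
conf.setMaster("local[*]")                    # single machine, using all local cores
# conf.setMaster("spark://master-host:7077")  # Spark standalone cluster manager
# conf.setMaster("yarn")                      # Hadoop YARN (requires HADOOP_CONF_DIR to be set)
sc = SparkContext(conf=conf)
print(sc.master)   # confirms which cluster manager the application connected to
sc.stop()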
Components of Spark
Apache Spark Core Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems. Spark SQL Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark Streaming Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
MLlib (Machine Learning Library) MLlib is a distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface). GraphX GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs, along with an optimized runtime for this abstraction.
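To make the component descriptions concrete, here is a small, self-contained sketch that exercises Spark SQL on top of Spark Core using the DataFrame API (the successor to SchemaRDD); the application name, column names, and sample rows are illustrative only.
from pyspark.sql import SparkSession
# Build a SparkSession (which wraps a SparkContext) for local experimentation.
spark = SparkSession.builder.master("local[*]").appName("ComponentsSketch").getOrCreate()
# Spark SQL: create a small DataFrame of structured data and query it with SQL.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id > 1").show()
spark.stop()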
Spark Architecture
Apache Spark Architecture is based on two main abstractions: Resilient Distributed Datasets (RDDs) and the Directed Acyclic Graph (DAG). RDDs are the building blocks of any Spark application. RDD stands for: Resilient: fault tolerant and capable of rebuilding data on failure. Distributed: data is distributed among the multiple nodes in a cluster. Dataset: a collection of partitioned data with values. Directed Acyclic Graph (DAG): Directed: a transformation moves a data partition from one state to another (from A to B). Acyclic: a transformation cannot return to an older partition. A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation applied to the data. The DAG abstraction helps eliminate the Hadoop MapReduce multistage execution model and provides performance enhancements over Hadoop.
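The lineage behind this DAG can be inspected directly. The sketch below is a minimal PySpark example with arbitrary sample data; it chains two transformations and prints the resulting dependency chain (in recent PySpark versions toDebugString() returns bytes, hence the decode).
from pyspark import SparkContext
sc = SparkContext("local", "LineageSketch")
rdd = sc.parallelize(range(10))                  # base RDD
doubled = rdd.map(lambda x: x * 2)               # transformation: adds an edge to the DAG
evens = doubled.filter(lambda x: x % 4 == 0)     # another transformation, another edge
# toDebugString shows the chain of parent RDDs (the lineage) that Spark
# uses to plan stages and to rebuild lost partitions.
print(evens.toDebugString().decode("utf-8"))
sc.stop()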
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager: the Master Daemon (Master/Driver Process) and the Worker Daemon (Slave Process/Executor). A Spark cluster has a single master and any number of slaves/workers. Master Node: Driver Program: The central program that drives the entire Spark application. It contains the main function and creates the SparkContext. SparkContext: Acts as the gateway to all Spark functionality. It is responsible for initializing the Spark application and communicating with the Cluster Manager.
Cluster Manager: Manages the cluster resources and schedules tasks. Spark can work with different cluster managers such as Apache Hadoop YARN, Apache Mesos, or the built-in standalone cluster manager. Worker Nodes: Executors: These are the worker processes that run tasks and store data for the application. Each worker node can run multiple executors. Tasks: Units of work sent by the SparkContext to be executed on the executors. Each task operates on a partition of the data.
SparkContext Initialization: The SparkContext connects to the Cluster Manager to allocate resources. Task Distribution: The SparkContext sends the tasks to the Cluster Manager. The Cluster Manager assigns these tasks to the Executors on the Worker Nodes. Task Execution: Each worker node executes the tasks on its data partitions. For example, one worker might process lines 1-1000 of the input file while another processes lines 1001-2000. Data Processing: Executors perform the transformations (flatMap, mapToPair, reduceByKey) on their data partitions. Aggregation: Results from each worker node are aggregated to produce the final output, which in this case is the word count. Fault Tolerance: If a worker node fails, Spark can re-execute the tasks on another node using the lineage information stored in the RDDs.
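A minimal word count in PySpark makes this flow concrete; the input path is a placeholder, and the Python API uses map where the Java API uses mapToPair, but the steps are otherwise the same as described above.
from pyspark import SparkContext
sc = SparkContext("local[*]", "WordCount")
lines = sc.textFile("input.txt")                        # placeholder path; each partition holds a block of lines
counts = (lines.flatMap(lambda line: line.split(" "))   # split each line into words
               .map(lambda word: (word, 1))             # pair each word with 1
               .reduceByKey(lambda a, b: a + b))        # aggregate counts per word across partitions
for word, count in counts.collect():                    # action: results are aggregated back at the driver
    print(word, count)
sc.stop()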
Advantages: Performance: In-memory processing and DAG (Directed Acyclic Graph) execution make Spark much faster than traditional MapReduce. Fault Tolerance: RDDs provide fault tolerance through lineage, allowing automatic recovery from node failures. Ease of Use: High-level APIs in Java, Scala, Python, and R make Spark accessible to a wide range of users. Versatility: Spark supports various workloads including batch processing, real-time processing (Spark Streaming), machine learning (MLlib), and graph processing (GraphX).
Spark vs. Hadoop
Performance: Hadoop: Slow due to heavy reliance on disk I/O operations. Spark: Fast because it leverages in-memory operations; however, performance may degrade when using YARN for resource management. Fault Tolerance: Hadoop: High fault tolerance achieved through data replication across multiple nodes. Spark: Uses Resilient Distributed Datasets (RDDs), which provide fault tolerance by maintaining the lineage of transformations. Processing: Hadoop: Primarily supports batch processing. Spark: Supports batch processing, streaming, and graph processing, making it more versatile.
Ease of Use: Hadoop: Does not offer an interactive mode. Spark: Provides an interactive mode with support for APIs in multiple languages, making it more user-friendly for developers. Language Support: Hadoop: Supports Java, Python, R, and C++. Spark: Supports Scala, Python, R, and Java.
Scalability: Both Hadoop and Spark offer high scalability. Machine Learning: Spark includes MLlib, a library for scalable machine learning, which is a notable advantage over Hadoop's more limited built-in machine learning capabilities. Scheduler: Hadoop: Relies on external tools (such as Oozie for workflow scheduling and ZooKeeper for coordination). Spark: Has its own built-in scheduler, which can simplify deployment and management.
Cluster Design Cluster design involves planning and configuring the hardware and software components that make up a distributed computing environment. Key considerations include: Node Types: Master Nodes: Manage and coordinate the cluster. In Hadoop, these include NameNodes and the ResourceManager. In Spark, these include the Driver and the Cluster Manager. Worker Nodes: Execute tasks and store data. In Hadoop, these are DataNodes; in Spark, these run the Executors.
Hardware Configuration: CPU: More cores help in parallel processing. Memory: Essential for in-memory operations, especially in Spark. Storage: Sufficient disk space and high I/O throughput for data-intensive tasks. Network: High bandwidth and low latency for fast data transfer between nodes. Redundancy and Fault Tolerance: Multiple master nodes (with failover mechanisms) to ensure high availability. Data replication to prevent data loss. Scalability: Ability to add more nodes to the cluster without significant reconfiguration. Efficient resource allocation and load balancing.
Cluster Management Cluster management involves the administration and monitoring of the cluster to ensure efficient operation and optimal performance. Key activities include: Resource Management: YARN (Yet Another Resource Negotiator): Used in Hadoop to manage resources. Standalone, Mesos, and Kubernetes: The various cluster managers used with Spark. Job Scheduling: Fair scheduling to allocate resources evenly across jobs. Priority scheduling for important jobs.
Monitoring and Maintenance: Monitoring tools (e.g., Ganglia, Nagios, and Ambari) to track cluster performance and health. Regular maintenance to apply updates and patches. Security Management: Authentication and authorization to control access. Data encryption to protect sensitive information.
Performance Task Execution Time: Minimizing the time it takes to execute individual tasks. Efficient parallelism and task distribution. Data Locality: Ensuring tasks are executed close to where the data is stored to reduce data transfer time. Resource Utilization: Efficient use of CPU, memory, and disk resources. Avoiding resource contention and bottlenecks.
Throughput and Latency: Maximizing the number of tasks completed per unit time (throughput). Minimizing the delay in task execution (latency). Tuning and Optimization: Configuring parameters like block size in HDFS, the number of partitions in Spark, and garbage collection settings. Using performance profiling tools to identify and resolve bottlenecks.
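As a rough illustration of these tuning knobs, the following PySpark sketch sets a few common configuration properties; the property values are placeholders and need to be tuned for a specific cluster (in local mode the executor settings have little effect).
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("TuningSketch")
        .set("spark.executor.memory", "4g")          # memory per executor (placeholder value)
        .set("spark.executor.cores", "2")            # cores per executor (placeholder value)
        .set("spark.default.parallelism", "64"))     # default number of partitions (placeholder value)
sc = SparkContext(conf=conf)
# The partition count can also be set per RDD when it is created.
rdd = sc.parallelize(range(1000000), numSlices=64)
print(rdd.getNumPartitions())
sc.stop()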
APPLICATION PROGRAMMING INTERFACE (API) Spark makes its cluster computing capabilities available to an application in the form of a library. This library is written in Scala, but it provides an application programming interface (API) in multiple languages. The Spark API is available in Scala, Java, Python, and R. You can develop a Spark application in any of these languages. The Spark API consists of two important abstractions: SparkContext and Resilient Distributed Datasets (RDDs). An application interacts with Spark using these two abstractions. These abstractions allow an application to connect to a Spark cluster and use the cluster resources.
Spark Context SparkContext is a class defined in the Spark library. It is the main entry point into the Spark library. It represents a connection to a Spark cluster. It is also required to create other important objects provided by the Spark API. A Spark application must create an instance of the SparkContext class. Currently, an application can have only one active instance of SparkContext. To create another instance, it must first stop the active instance. The SparkContext class provides multiple constructors. The simplest one does not take any arguments.
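A short PySpark sketch of SparkContext creation is shown below (the master URL and application names are illustrative); it also shows that the active SparkContext must be stopped before another one can be created.
from pyspark import SparkConf, SparkContext
# Construct a SparkContext from a SparkConf.
conf = SparkConf().setMaster("local[2]").setAppName("ContextSketch")
sc = SparkContext(conf=conf)
# Only one SparkContext can be active at a time; stop it before creating another.
sc.stop()
sc2 = SparkContext("local", "SecondContext")   # master URL and application name passed directly
sc2.stop()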
REFER TO THE 8TH PROGRAM FOR SPARK CONTEXT AND CREATION OF RDD
Resilient Distributed Datasets (RDD) An RDD represents a collection of partitioned data elements that can be operated on in parallel. It is an immutable distributed collection of objects, which can contain various types of objects, including Python, Scala, or Java objects. It is the primary data abstraction mechanism in Spark and is defined as an abstract class in the Spark library. Conceptually, an RDD represents a distributed dataset, and it supports lazy operations. The key characteristics of an RDD are: 1. Immutable: An RDD is an immutable data structure. Once created, it cannot be modified in place. An operation that modifies an RDD actually returns a new RDD, as the sketch after this list of characteristics illustrates.
2. Partitioned: Data represented by an RDD is split into partitions. These partitions are generally distributed across a cluster of nodes. However, when Spark is running on a single machine, all the partitions are on that machine. Note that there is a mapping from RDD partitions to physical partitions of a dataset. An RDD provides an abstraction for data stored in distributed data sources, which generally partition data and distribute it across a cluster of nodes.
3. Fault Tolerant: An RDD is designed to be fault tolerant. An RDD represents data distributed across a cluster of nodes, and any node can fail. An RDD automatically handles node failures: when a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node. Spark stores lineage information for each RDD. Using this lineage information, it can recover parts of an RDD, or even an entire RDD, in the event of a node failure. 4. Interface: It is important to remember that an RDD is an interface for processing data. It is defined as an abstract class in the Spark library.
An RDD provides a uniform interface for processing data from a variety of data sources, such as HDFS, HBase, Cassandra, and others. The same interface can also be used to process data stored in memory across a cluster of nodes. Spark provides concrete implementation classes for representing different data sources. Examples of concrete RDD implementation classes include HadoopRDD, ParallelCollectionRDD, JdbcRDD, and CassandraRDD. 5. Strongly Typed: The RDD class definition has a type parameter, which allows an RDD to represent data of different types. An RDD is a distributed collection of homogeneous elements, which can be of type Integer, Long, Float, String, or a custom type defined by an application developer.
Thus, an application always works with an RDD of some type. It can be an RDD of Integer, Long, Float, Double, String, or a custom type. 6. In Memory: The RDD class provides the API for enabling in-memory cluster computing. Spark allows RDDs to be cached or persisted in memory.
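The short PySpark sketch below ties the immutability and in-memory characteristics together: a transformation returns a new RDD rather than modifying the original, and cache() (or persist()) asks Spark to keep the computed partitions in memory for reuse.
from pyspark import SparkContext, StorageLevel
sc = SparkContext("local", "CacheSketch")
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)       # immutability: a new RDD is returned; rdd itself is unchanged
squared.cache()                          # keep the computed partitions in memory
# squared.persist(StorageLevel.MEMORY_AND_DISK)   # alternative with an explicit storage level
print(rdd.collect())       # [1, 2, 3, 4, 5]
print(squared.collect())   # [1, 4, 9, 16, 25]; the first action computes and caches the data
print(squared.count())     # reuses the cached partitions instead of recomputing them
sc.stop()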
Creating an RDD Since RDD is an abstract class, you cannot create an instance of the RDD class directly. The SparkContext class provides factory methods to create instances of concrete implementation classes. Parallelize This method creates an RDD from a local collection (a Scala collection in the Scala API, or a Python list as in the example below). It partitions and distributes the elements of the collection and returns an RDD representing those elements.
from pyspark import SparkContext
# Step 1: Initialize SparkContext
sc = SparkContext("local", "RDD Example")
# Step 2: Create an RDD from a Python list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Step 3: Apply a map transformation to square each number
rdd_squared = rdd.map(lambda x: x * x)
# Step 4: Perform an action to collect the results
result = rdd_squared.collect()
print(result)  # Output: [1, 4, 9, 16, 25]
# Step 5: Stop the SparkContext
sc.stop()
textFile The textFile method creates an RDD from a text file. It can read a file, or multiple files in a directory, stored on a local file system, HDFS, Amazon S3, or any other Hadoop-supported storage system. It returns an RDD of Strings, where each element represents a line in the input file.
val rdd = sc.textFile("hdfs://namenode:9000/path/to/file-or-directory")
The preceding code will create an RDD from a file or directory stored on HDFS. The textFile method can also read compressed files. In addition, it supports wildcards as an argument for reading multiple files from a directory. An example is shown next.
val rdd = sc.textFile("hdfs://namenode:9000/path/to/directory/*.gz")
wholeTextFiles This method reads all text files in a directory and returns an RDD of key-value pairs. Each key-value pair in the returned RDD corresponds to a single file. The key part stores the path of a file and the value part stores the content of a file. This method can also read files stored on a local file system, HDFS, Amazon S3, or any other Hadoop-supported storage system.
val rdd = sc.wholeTextFiles("path/to/my-data/*.txt")
sequenceFile The sequenceFile method reads key-value pairs from a sequence file stored on a local file system, HDFS, or any other Hadoop-supported storage system. It returns an RDD of key-value pairs. In addition to providing the name of an input file, you have to specify the data types for the keys and values as type parameters when you call this method.
val rdd = sc.sequenceFile[String, String]("some-file")
RDD Operations Spark applications process data using the methods defined in the RDD class or classes derived from it.
These methods are also referred to as operations. The beauty of Spark is that the same RDD methods can be used to process data ranging in size from a few bytes to several petabytes. In addition, a Spark application can use the same methods to process datasets stored on either a distributed storage system or a local file system. This flexibility allows a developer to develop, debug, and test a Spark application on a single machine and deploy it on a large cluster without making any code changes. RDD operations can be categorized into two types: transformations and actions. A transformation creates a new RDD. An action returns a value to the driver program.
RDD Transformations An RDD transformation returns a pointer to a new RDD and allows you to create dependencies between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD. Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution, as in the word count example shown earlier; the snippet below illustrates this laziness. An RDD transformation, therefore, is not a set of data but a step in a program (possibly the only step) telling Spark how to get data and what to do with it.
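The following minimal PySpark sketch (with arbitrary sample data) demonstrates this laziness: the transformations return new RDDs immediately without doing any work, and only the final action triggers job creation and execution.
from pyspark import SparkContext
sc = SparkContext("local", "LazySketch")
rdd = sc.parallelize(range(1, 1001))
doubled = rdd.map(lambda x: x * 2)         # transformation: returns immediately, no job runs
large = doubled.filter(lambda x: x > 100)  # transformation: only extends the dependency chain
# Nothing has been executed so far. The action below triggers job creation and execution.
print(large.count())
sc.stop()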
LAZY OPERATIONS (Actions)
Saving, Compiling, and Running the RDD Refer to the 6th program to run in Ubuntu.
Spark Jobs Map The map method is a higher-order method that takes a function as input and applies it to each element in the source RDD to create a new RDD. The map() transformation is used to apply any operation to each element, such as adding a column or updating a column. In our word count example, we use it to pair each word with the value 1; the result is a pair RDD (PairRDDFunctions in the Scala/Java API) containing key-value pairs, with the word (of type String) as the key and 1 (of type Int) as the value.
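In PySpark, this step of the word count looks like the following self-contained sketch; the sample words are placeholders.
from pyspark import SparkContext
sc = SparkContext("local", "MapSketch")
words = sc.parallelize(["spark", "map", "spark"])
pairs = words.map(lambda word: (word, 1))    # each word becomes a (word, 1) key-value pair
print(pairs.collect())   # [('spark', 1), ('map', 1), ('spark', 1)]
sc.stop()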
Filter The filter method is a higher-order method that takes a Boolean function as input and applies it to each element in the source RDD to create a new RDD. A Boolean function takes an input and returns true or false. The filter method returns a new RDD formed by selecting only those elements for which the input Boolean function returned true. Thus, the new RDD contains a subset of the elements in the original RDD.
val lines = sc.textFile("...")
val longLines = lines filter { l => l.length > 80 }
FlatMap The flatMap() transformation flattens the RDD after applying the function and returns a new RDD. In the example below, it first splits each record by spaces and then flattens the result, so the resulting RDD contains a single word in each record.
rdd2 = rdd.flatMap(lambda x: x.split(" "))
MapPartitions The higher-order mapPartitions method allows you to process data at the partition level. Instead of passing one element at a time to its input function, mapPartitions passes a partition in the form of an iterator. The input function to the mapPartitions method takes an iterator as input and returns another iterator as output. The mapPartitions method returns a new RDD formed by applying a user-specified function to each partition of the source RDD.
val lines = sc.textFile("...")
val lengths = lines mapPartitions { iter => iter.map { l => l.length } }
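A rough PySpark equivalent of the Scala example above is sketched below; in the Python API the supplied function receives an iterator over one partition's elements and must return an iterable. The sample data and the partition count of 2 are arbitrary.
from pyspark import SparkContext
sc = SparkContext("local", "MapPartitionsSketch")
lines = sc.parallelize(["spark", "map partitions example", "hello world"], 2)
# The function is called once per partition with an iterator and returns an iterator.
lengths = lines.mapPartitions(lambda it: (len(l) for l in it))
print(lengths.collect())   # [5, 22, 11]
sc.stop()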