Big Data Processing Using Spark.pptx




Slide Content

Big Data Processing with Apache Spark (Jan 16, 2024)

Agenda
- What is Apache Spark?
- Hadoop and Spark
- Features of Spark
- Spark ecosystem
- Spark architecture
- How does Apache Spark integrate with Hadoop?
- How to choose between Hadoop and Spark?
- Limitations of Spark
- Demo 1 – Data ingestion, transformation, and visualization using PySpark
- Demo 2 – Big data ingestion using PySpark
- Industry implementations
- Resources
- Q&A

What is Apache Spark?
Apache Spark is a cluster-computing platform that provides an API for distributed programming similar to the MapReduce model, but it is designed to be fast for interactive queries and iterative algorithms. Designed specifically to replace MapReduce, Spark also processes data in batches, with workloads distributed across a cluster of interconnected servers. Like its predecessor, the engine supports single- and multi-node deployments and a master-slave architecture: each Spark cluster has a single master node (the driver) that manages tasks and numerous slaves (executors) that perform operations. That is roughly where the likeness ends.
The main difference between Hadoop and Spark lies in how data is processed. MapReduce stores intermediate results on local disks and reads them back later for further calculations. Spark, in contrast, caches data in main memory (RAM). Even the best possible disk read time lags far behind RAM speeds, so it is no surprise that Spark runs workloads up to 100 times faster than MapReduce when all the data fits in RAM. When datasets are so large or queries so complex that results must spill to disk, Spark still outperforms the Hadoop engine by roughly ten times.
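To make the in-memory point concrete, here is a minimal PySpark sketch of caching a dataset that is reused across several actions, so Spark serves it from RAM instead of rereading storage each time. The input path and column name are illustrative assumptions, not part of the original deck.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path and column, purely for illustration.
events = spark.read.parquet("/data/events")

# Keep the dataset in executor memory so repeated actions avoid re-reading disk.
events.cache()

# Both actions below reuse the cached partitions instead of rescanning storage.
print(events.count())
events.groupBy("event_type").count().show()

spark.stop()
```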

What is Apache Spark? (continued)
The Spark driver: the driver is the program or process responsible for coordinating the execution of a Spark application. It runs the main function and creates the SparkContext, which connects to the cluster manager.
The Spark executors: executors are worker processes responsible for executing tasks in Spark applications. They are launched on worker nodes and communicate with the driver program and the cluster manager. Executors run tasks concurrently and keep data in memory or on disk for caching and intermediate storage.
The cluster manager: the cluster manager allocates resources and manages the cluster on which the Spark application runs. Spark supports several cluster managers, including Apache Mesos, Hadoop YARN, and its own standalone cluster manager.
Task: a task is the smallest unit of work in Spark, representing a unit of computation performed on a single partition of data. The driver divides a Spark job into tasks and assigns them to executors for execution.
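A minimal sketch of the driver side, assuming a local deployment: the script below is itself the driver program, creating the SparkSession gives it a SparkContext, and the parallelize/reduce calls are split by the driver into tasks that run on executor processes (local threads here). The local[4] master is an assumption standing in for a real cluster manager.

```python
from pyspark.sql import SparkSession

# This script is the driver program; .master("local[4]") is an assumption that
# stands in for a real cluster manager (standalone, YARN, Mesos, Kubernetes).
spark = SparkSession.builder.master("local[4]").appName("driver-demo").getOrCreate()
sc = spark.sparkContext

# 8 partitions -> the driver schedules 8 tasks across the available executors.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()
```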

Hadoop vs Spark

Criterion | Hadoop | Apache Spark
Data processing | Batch processing | Batch/stream processing
Real-time processing | None | Near real-time
Performance | Slower, as the disk is used for storage | 100 times faster due to in-memory operations
Fault tolerance | Replication used for fault tolerance | Checkpointing and RDDs provide fault tolerance
Latency | High latency | Low latency
Interactive mode | No | Yes
Resource management | YARN | Spark standalone, YARN, Mesos
Ease of use | Complex; need to understand low-level APIs | Abstracts most of the distributed-system details
Language support | Java, Python | Scala, Java, Python, R, SQL
Cloud support | Yes | Yes
Machine learning | Requires Apache Mahout | Provides MLlib
Cost | Lower cost, as disk drives are cheaper | Higher cost, since it is a memory-intensive solution
Security | Highly secure | Basic security

[Diagram: MapReduce architecture]

MapReduce
MapReduce is a framework for writing applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner.
Phases of MapReduce:
- Mapping: the first phase of MapReduce programming. The mapping phase accepts key-value pairs (k, v) as input, where the key identifies each record and the value is the record content. Its output is also in key-value format (k', v').
- Shuffling and sorting: the output of the various map tasks (k', v') then goes through the shuffling and sorting phase, where values are grouped together by key. The output is again key-value pairs, this time a key with an array of values (k, v[]).
- Reducing: the output of the shuffling and sorting phase (k, v[]) is the input of the reducer phase. Here the reducer function's logic is executed over the values collected for each key; the reducer consolidates the outputs of the various mappers and computes the final result.
- Combining: an optional phase used to optimize performance. A combiner pre-aggregates map output locally, which makes the shuffling and sorting phase faster.

MapReduce (continued): numeric example

User_Id | Movie_Id | Rating | Timestamp
196 | 242 | 3 | 881250949
186 | 302 | 3 | 891717742
196 | 377 | 1 | 878887116
244 | 51 | 2 | 880606923
166 | 346 | 1 | 886397596
186 | 474 | 4 | 884182806
186 | 265 | 2 | 881171488

MapReduce (continued)
Step 1 – Map: emit each record as a user:movie key-value pair.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265
Step 2 – Shuffle and sort: group the mapped values by key (user).
166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
Step 3 – Reduce: after steps 1 and 2, the reducer processes each key's grouped values to produce the final output (a PySpark sketch of these steps follows).
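A sketch of the same example using PySpark's RDD API, under the assumption that the rows from the table above are available as (user, movie) tuples: map emits the key-value pairs, groupByKey performs the shuffle-and-sort grouping, and a final loop stands in for the reducer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-demo").getOrCreate()
sc = spark.sparkContext

# Rows from the numeric example, kept as (User_Id, Movie_Id) pairs.
rows = [(196, 242), (186, 302), (196, 377), (244, 51),
        (166, 346), (186, 474), (186, 265)]

# Map phase: emit key-value pairs keyed by user.
mapped = sc.parallelize(rows).map(lambda r: (r[0], r[1]))

# Shuffle/sort phase: group movie ids by user.
grouped = mapped.groupByKey().mapValues(list)

# Reduce phase: here the "reducer" simply formats each key's value list.
for user, movies in sorted(grouped.collect()):
    print(f"{user}: {movies}")

spark.stop()
```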

Features of Spark
- Speed: Spark takes MapReduce to the next level with less expensive shuffles during data processing. Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when the same dataset is needed multiple times; this can be several times faster than other big data technologies.
- Fault tolerance: Apache Spark achieves fault tolerance through an abstraction called the RDD (Resilient Distributed Dataset), which is designed to handle worker-node failure.
- Lazy evaluation: Spark evaluates big data queries lazily, which helps optimize the steps in a data-processing workflow. It provides a higher-level API to improve developer productivity and a consistent architectural model for big data solutions (see the sketch after this list).
- Multiple language support: Spark supports several programming languages, and you can use it interactively from the Scala, Python, R, and SQL shells.
- Real-time stream processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.
- Decoupled storage and compute: Spark can connect to virtually any storage system, from HDFS to Cassandra to S3, and import data from a myriad of sources.
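A small sketch of lazy evaluation, with an assumed input path and columns: the filter and select transformations below only record a plan; nothing runs on the cluster until the count action is called, at which point Spark optimizes and executes the whole chain.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Hypothetical input; the path and column names are assumptions for illustration.
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# Transformations: nothing executes yet, Spark only builds a logical plan.
big_orders = orders.filter(F.col("amount") > 100).select("order_id", "amount")

# Action: triggers optimization of the full plan and the actual computation.
print(big_orders.count())

spark.stop()
```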

Spark ecosystem
- Spark SQL: exposes Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools. Spark SQL lets users ETL their data from whatever format it is currently in (such as JSON, Parquet, or a database), transform it, and expose it for ad hoc querying (see the sketch after this list).
- Spark Streaming: used for processing real-time streaming data, based on a micro-batch style of computing. It uses the DStream, essentially a series of RDDs, to process real-time data.
- MLlib: Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
- GraphX: a collection of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices in a path.
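A minimal Spark SQL sketch, assuming a hypothetical JSON input path and column names: the JSON file is loaded into a DataFrame, registered as a temporary view, and queried with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Hypothetical JSON source; path and fields are assumptions for illustration.
sales = spark.read.json("/data/sales.json")

# Register the DataFrame so it can be queried with SQL.
sales.createOrReplaceTempView("sales")

# Ad hoc query over the view.
top_products = spark.sql("""
    SELECT product, SUM(amount) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
    LIMIT 10
""")
top_products.show()

spark.stop()
```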

Spark architecture
Step 1: The client submits the Spark application code. When the application code is submitted, the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage it also performs optimizations such as pipelining transformations.
Step 2: The driver then converts the logical DAG into a physical execution plan with many stages. After building the physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
Step 3: The driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. At this point the driver sends tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors running its tasks.
Step 4: During task execution, the driver program monitors the set of executors and schedules future tasks based on data placement.
[Diagrams: Spark architecture; DAG-based processing]
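To peek at the logical-to-physical planning these steps describe, here is a sketch (with an assumed input path and columns) that prints Spark's logical and physical plans for a small transformation chain via DataFrame.explain; the extended mode shown is available in Spark 3.x.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

# Assumed CSV input, purely for illustration.
df = spark.read.csv("/data/clicks.csv", header=True, inferSchema=True)

# A chain of transformations the driver will turn into a DAG of stages.
agg = (df.filter(F.col("country") == "US")
         .groupBy("page")
         .agg(F.count("*").alias("hits")))

# Show the parsed/analyzed/optimized logical plans and the physical plan.
agg.explain(mode="extended")

spark.stop()
```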

How does Apache Spark integrate with Hadoop?
Unlike Hadoop, which unites storage, processing, and resource management capabilities, Spark is for processing only and has no native storage system. Instead, it can read and write data from and to different sources, including but not limited to HDFS, HBase, and Apache Cassandra. It is also compatible with a plethora of data repositories outside the Hadoop ecosystem, such as Amazon S3.
Because it processes data across multiple servers, Spark cannot control resources (mainly CPU and memory) by itself; for this it needs a resource or cluster manager. Currently the framework supports four options: Standalone, a simple pre-built cluster manager; Hadoop YARN, the most common choice for Spark; Apache Mesos, used to control the resources of entire data centers and heavy-duty services; and Kubernetes, a container orchestration platform. Running Spark on Kubernetes makes sense if a company plans to move its entire tech stack to cloud-native infrastructure.
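A hedged sketch of how the cluster-manager choice surfaces in application code: the master URL selects standalone, YARN, Mesos, or Kubernetes. The host names below are placeholders, and in practice the URL is usually supplied via spark-submit rather than hard-coded; local[*] is used here so the sketch runs anywhere.

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager. Typical forms (hosts are placeholders):
#   standalone : spark://master-host:7077
#   YARN       : yarn
#   Mesos      : mesos://mesos-host:5050
#   Kubernetes : k8s://https://k8s-apiserver:6443
# local[*] is used below only so the sketch can run without a real cluster.
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")
         .getOrCreate())

print(spark.sparkContext.master)
spark.stop()
```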

How to choose between Hadoop and Spark?
The choice is not between Spark and Hadoop but between two processing engines, since Hadoop is more than an engine. A clear advantage of MapReduce is that you can perform large, delay-tolerant processing tasks at relatively low cost. It works best for archived data that can be analyzed later, say, during night hours. Some real-life use cases:
- Online sentiment analysis to understand how people feel about your products.
- Predictive maintenance to address issues with equipment before they actually happen.
- Log file analysis to prevent security breaches.
Spark, in turn, shines when speed is prioritized over price. It is a natural choice for fraud detection and prevention, stock market trend prediction, near real-time recommendation systems, and risk management.

Limitations of Spark
- Pricey hardware: RAM prices are higher than those of the hard disks used by MapReduce, making Spark operations more expensive.
- Near, but not truly, real-time processing: Spark Streaming and in-memory caching let you analyze data very quickly, but it is still not truly real time, since the module works with micro-batches, small groups of events collected over a predefined interval. Genuine real-time processing tools handle data streams at the moment they are generated (see the micro-batch sketch below).
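A hedged sketch of the micro-batch behavior using Structured Streaming's built-in rate source: events are collected over the trigger interval (5 seconds here) and each batch is processed as a unit rather than event by event. The source, sink, and interval are illustrative choices, not from the deck.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# The built-in "rate" source generates rows continuously, handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Each micro-batch groups the events that arrived during the 5-second trigger
# interval; processing happens per batch, not per individual event.
query = (stream.writeStream
         .format("console")
         .trigger(processingTime="5 seconds")
         .outputMode("append")
         .start())

query.awaitTermination(timeout=30)  # run for ~30 seconds, then stop
spark.stop()
```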

Demo 1 – Data ingestion, transformation, and visualization using PySpark
Analyze retail data with PySpark and Databricks. Objectives:
- Use modern tools like Databricks and PySpark to find hidden insights in the data.
- Ingest retail data, available in CSV format, from DBFS.
- Use the PySpark DataFrame API to perform a variety of transformations and actions.
- Use graphical representations to enhance understanding and analysis of the results.
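A hedged outline of what Demo 1 might look like in a Databricks notebook; the DBFS path and column names are assumptions, and display is the Databricks notebook helper for charting (outside Databricks, show() or a pandas plot would stand in).

```python
from pyspark.sql import functions as F

# In a Databricks notebook a SparkSession named `spark` already exists.
# Hypothetical retail CSV on DBFS; path and columns are illustrative only.
retail = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("dbfs:/FileStore/demo/retail.csv"))

# Transformations: derive revenue and aggregate it per country.
revenue = (retail
           .withColumn("revenue", F.col("quantity") * F.col("unit_price"))
           .groupBy("country")
           .agg(F.sum("revenue").alias("total_revenue"))
           .orderBy(F.desc("total_revenue")))

# In Databricks, display() renders the result as a table or chart.
display(revenue)
```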

Demo 2 – Big data ingestion using PySpark
Ingest big data files available in PDF format and translate the text to a desired language. Objectives:
- Install the required libraries in a Databricks notebook.
- Create functions to extract text and tables, and to convert table data to plain text.
- Ingest and read text from PDF files available in DBFS into a DataFrame.
- Translate the text.
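A hedged sketch of the PDF-ingestion step, assuming the pdfplumber library has been installed in the notebook (e.g. via %pip install pdfplumber) and that the PDFs sit under an assumed DBFS folder; the translation step is left as a placeholder since the demo's translation library is not specified.

```python
import glob
import pdfplumber

# Assumed location of the PDFs; on Databricks, dbfs:/... is mounted at /dbfs/...
pdf_paths = glob.glob("/dbfs/FileStore/demo/pdfs/*.pdf")

def extract_text(path):
    """Concatenate the text of every page in one PDF."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

# Build (path, text) rows and load them into a Spark DataFrame.
rows = [(p, extract_text(p)) for p in pdf_paths]
docs = spark.createDataFrame(rows, schema=["path", "text"])

# Translation would be applied here, e.g. via a UDF wrapping a translation
# library or API (not specified in the demo, so left as a placeholder).
docs.show(truncate=80)
```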

Industry implementations
Walk through the Databricks end-to-end pipeline. Run the pipeline and show the DAG created by Spark.
[Diagrams: DAG / query execution plan; end-to-end data pipeline]

Resources
- Spark Architecture
- PySpark
- Pandas
- Install Hadoop on Windows – Step by Step
- Install Apache Spark on Windows – Step by Step
- Generate fake data using the Python Faker library

Q&A

© 2024 Wipfli LLP. All rights reserved.