big data analytics ppt Unit 2.pptx

Mukta88 · 39 slides · Sep 09, 2025

Slide Content

What Exactly Is Apache Spark?

Apache Spark is a lightning-fast, open-source engine for processing very large datasets across a cluster of computers. It started in 2009 at UC Berkeley's AMPLab and became an Apache top-level project in 2014. Spark was built to overcome the limitations of Hadoop's MapReduce by keeping data in memory, which makes it up to 100× faster for some workloads.

Key Features and Why Spark Shines

- Speed: Spark processes data in memory, slashing time spent on disk reads and writes. That makes it dramatically faster, especially for iterative tasks like machine learning.
- Ease of Use: Programmers can write Spark apps in Java, Scala, Python, or R. Spark also offers an interactive shell for rapid testing.
- Unified Analytics Engine: Spark comes bundled with powerful built-in libraries: Spark SQL for working with structured data using SQL syntax, Spark Streaming for near real-time stream processing, MLlib for scalable machine learning, and GraphX for graph analysis (see the sketch after this list).
- Flexible Deployment: You can run Spark on Hadoop YARN, Apache Mesos, Kubernetes, or standalone. It also taps into diverse storage systems like HDFS, Cassandra, and S3.
- Rich Ecosystem: Spark's open-source roots mean it integrates well with tools like Hadoop and HBase, making it adaptable and future-proof.
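To make the "unified engine" point concrete, here is a minimal PySpark sketch: the same SparkSession drives both the DataFrame API and Spark SQL. The app name, data, and column names are invented for illustration.

```python
# Minimal sketch of Spark as a unified engine: one SparkSession serves
# both DataFrame operations and SQL. Data and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# A tiny DataFrame of (user, amount) purchase records.
df = spark.createDataFrame(
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    ["user", "amount"],
)

# Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("purchases")
spark.sql(
    "SELECT user, SUM(amount) AS total FROM purchases GROUP BY user"
).show()

spark.stop()
```

The same job could be written with DataFrame methods (`groupBy` and `sum`) instead of SQL; both paths compile to the same execution plan, which is the point of the unified engine.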

How Does Spark Work? Breaking Down the Architecture

Here's how Spark operates, step by step:

1. Driver & SparkContext: Your application starts a Driver, which creates the SparkContext, the entry point to all Spark operations.
2. Cluster Manager: The SparkContext requests resources from a cluster manager such as YARN, Mesos, Kubernetes, or Spark's own standalone manager.
3. Executors: The cluster manager launches executor processes on worker nodes. These executors do the actual work, running tasks and returning results to the Driver.
4. RDDs & DAG: Spark works with Resilient Distributed Datasets (RDDs), immutable collections of data partitioned across the cluster. Operations on RDDs are either transformations (e.g., map, filter), which define new RDDs, or actions (e.g., collect, count), which produce results. Spark builds a Directed Acyclic Graph (DAG) from these operations, and its DAG scheduler optimizes task execution across nodes; this is one reason Spark runs faster than traditional MapReduce.
5. Fault Tolerance: Spark tracks the lineage (history) of each RDD. If a partition is lost, Spark can rebuild it by replaying those recorded operations. A short sketch of the RDD model follows these steps.
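The transformation/action split can be seen in a few lines of PySpark. This is a sketch, with the local master, app name, and data chosen arbitrarily; the two transformations are lazy and only extend the lineage, and nothing executes until an action triggers the DAG scheduler.

```python
# Sketch of lazy transformations vs. eager actions on an RDD.
from pyspark import SparkContext

# Local master and app name are arbitrary choices for this demo.
sc = SparkContext("local[*]", "rdd-demo")

numbers = sc.parallelize(range(10))  # distribute data as an RDD

# Transformations: lazy; they define new RDDs and record lineage.
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions: trigger the DAG scheduler to run tasks on the executors.
print(squares.collect())  # [0, 4, 16, 36, 64]
print(squares.count())    # 5

sc.stop()
```

The lineage is also what fault tolerance rests on: if a partition of `squares` is lost, Spark can recompute it by replaying `filter` and `map` on the original data rather than restoring from a checkpoint.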