Big data and Hadoop, Section 2


Slide Content

Big data and HDFS Section 2

Agenda: DFS. What are the problems with big data? Hadoop is the solution! What is Hadoop? HDFS. YARN.

DFS. A distributed file system (DFS) is a file system that enables clients to access file storage from multiple hosts through a computer network as if they were accessing local storage. Files are spread across multiple storage servers and multiple locations, which enables users to share data and storage resources.

Why store data in general? Long-term information storage. Access the result of a process later. Store large amounts of information. Enable access by multiple processes.

File

Buy a bigger disk? Or copy data to an external hard drive?

PERSONAL WORK

Diagram: a Distributed File System (DFS) built from racks of cluster nodes.

Diagram: the data file is split into blocks 1-5 and spread across the rack's nodes.

Diagram: a block (e.g. block 5) can be analyzed on the node that stores it.

Diagram: each block is replicated on several nodes of the rack.

Diagram: several readers can read replicas of the same block at once. High concurrency vs. low consistency.

Distributed File System (DFS). When many storage computers (racks of nodes) are connected through the network, we call it a DFS. Its key properties are data scalability, fault tolerance, high concurrency, data replication, and data partitioning.
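The following is a toy illustration (plain Java, not Hadoop code) of the partitioning and replication idea: a file's bytes are cut into fixed-size blocks and each block is assigned to several nodes, so losing one node never loses a block. The node names, block size, and replication factor are made up for the example.

import java.util.ArrayList;
import java.util.List;

// Toy sketch: split a file's bytes into fixed-size blocks and assign each
// block to several nodes, mimicking how a DFS partitions and replicates data.
public class ToyDfsPlacement {
    public static void main(String[] args) {
        byte[] file = new byte[1300];          // pretend this is the file's contents
        int blockSize = 512;                   // block size in bytes (HDFS uses 128 MB by default)
        int replication = 3;                   // copies kept of every block
        String[] nodes = {"node1", "node2", "node3", "node4", "node5"};

        int blockCount = (file.length + blockSize - 1) / blockSize;
        for (int b = 0; b < blockCount; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // simple round-robin placement; real HDFS placement is rack-aware
                replicas.add(nodes[(b + r) % nodes.length]);
            }
            System.out.println("block " + b + " -> " + replicas);
        }
    }
}

Real HDFS keeps the same kind of block-to-node bookkeeping on the NameNode, with rack-aware placement instead of simple round-robin.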

Big data problems: storing huge and exponentially growing datasets; processing data with complex structure; the bottleneck of bringing huge amounts of data to the computation unit.

Hadoop is the solution! What is Hadoop?

Hadoop Ecosystem. It is a set of services and frameworks (open-source projects) that solve big data problems. You can consider it a suite that encompasses a number of services (ingesting, storing, analyzing, and maintaining data). For example, to analyze transaction data from an RDBMS, we first need to ingest it into the Hadoop Distributed File System (HDFS).


HDFS. The Hadoop Distributed File System is the core component, or you could say the backbone, of the Hadoop Ecosystem. HDFS is what makes it possible to store different types of large data sets (structured, unstructured, and semi-structured data). It stores our data across various nodes and maintains metadata about the stored data (like a log file). HDFS has two core components: the NameNode and the DataNode.

HDFS. 1. The NameNode is the main node and does not store the actual data. It contains metadata, like a log file or a table of contents. Therefore, it requires less storage but high computational resources. 2. On the other hand, all of your data is stored on the DataNodes, which therefore require more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment, which is why Hadoop solutions are very cost effective. 3. You always communicate with the NameNode when writing data; it then tells the client which DataNodes to store and replicate the data on.
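As a rough sketch of that flow, the snippet below uses the standard Hadoop FileSystem Java API to write and then read a file. The NameNode address and the file path are placeholders, and in practice fs.defaultFS usually comes from core-site.xml rather than being set in code.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode host/port
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");        // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the block data to them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations and the client
        // reads the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}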


For more information on how it works, visit: https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

YARN. Yet Another Resource Negotiator (YARN). Consider YARN the brain of your Hadoop Ecosystem. It performs all of your processing activities by allocating resources and scheduling tasks. It has two major components: the ResourceManager and the NodeManager. The ResourceManager controls all the resources and decides who gets what. The NodeManager is responsible for launching containers with the specified resource constraints. NodeManagers are installed on every DataNode.

YARN components (diagram): ResourceManager (Scheduler, ApplicationsManager), NodeManager, Container (a bundle of resources on a node), ApplicationMaster. The Scheduler performs the scheduling of jobs based on policies and priorities defined by the administrator. The ResourceManager monitors the health of the ApplicationMasters in the cluster and manages failover in case of failures. The ApplicationMaster manages the execution of tasks within the containers, monitors their progress, and handles any task failures or reassignments.
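To make the ResourceManager/NodeManager split concrete, here is a small sketch using the YarnClient API to ask the ResourceManager which NodeManagers are running and what resources they offer. It assumes a reachable cluster whose yarn-site.xml is on the classpath; this is an illustrative sketch, not part of the presentation's material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml from the classpath
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a report on every running NodeManager
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + "  memory=" + node.getCapability().getMemorySize() + "MB"   // getMemory() on older Hadoop 2.x
                    + "  vcores=" + node.getCapability().getVirtualCores());
        }
        yarnClient.stop();
    }
}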


Hive: A data warehousing and SQL-like query language tool that allows analysts to interact with data stored in HDFS using familiar SQL syntax. Pig: A high-level scripting platform for processing and analyzing large datasets. It simplifies complex data transformations using a simple scripting language.
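For example, an analyst could query HDFS-resident data through HiveServer2 with plain JDBC. The sketch below assumes the hive-jdbc driver is on the classpath and uses a hypothetical transactions table; the host, port, and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; with JDBC 4.0 the driver is auto-loaded from the classpath
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // familiar SQL syntax over files that live in HDFS
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(amount) FROM transactions GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}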

Spark: Although not originally part of Hadoop, Apache Spark is often integrated into the ecosystem. It offers an in-memory processing engine for faster data processing, machine learning, and graph processing.
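A minimal Spark sketch (Java, assuming the spark-sql dependency and a hypothetical CSV file in HDFS) showing how Spark reads data from HDFS and aggregates it in memory:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkOnHdfs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-on-hdfs")
                .getOrCreate();

        // hypothetical CSV of transactions stored in HDFS
        Dataset<Row> tx = spark.read()
                .option("header", "true")
                .csv("hdfs://namenode:9000/user/demo/transactions.csv");

        // in-memory aggregation across the cluster
        tx.groupBy("product").count().show();

        spark.stop();
    }
}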

Zookeeper in Hadoop can be considered a centralized repository where distributed applications can store data and retrieve it. It makes a distributed system work together as a whole through its synchronization, serialization, and coordination capabilities.
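A small sketch with the ZooKeeper Java client, assuming a hypothetical three-node ensemble: one process publishes a piece of coordination data as an ephemeral znode, and any other process can read or watch the same path.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // hypothetical ensemble addresses; 2181 is ZooKeeper's default client port
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait until the session is established

        // publish coordination data as an ephemeral znode; it disappears
        // automatically if this process dies, which helps with failover
        String path = "/active-namenode";
        zk.create(path, "namenode-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // any other client can read (or watch) the same znode
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}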

Advantages of Hadoop for Big Data. Speed: Hadoop's concurrent processing, MapReduce model, and HDFS let users run complex queries in just a few seconds. Diversity: HDFS can store different data formats, like structured, semi-structured, and unstructured data. Cost-effective: Hadoop is an open-source data framework. Resilient: data stored on a node is replicated to other cluster nodes, ensuring fault tolerance. Scalable: since Hadoop functions in a distributed environment, you can easily add more servers.

Next, we will begin with Hadoop.