Agenda
- DFS
- What are the problems with big data?
- Hadoop is the solution!
- What is Hadoop?
- HDFS
- YARN
DFS

A distributed file system (DFS) is a file system that enables clients to access file storage on multiple hosts through a computer network as if they were accessing local storage. Files are spread across multiple storage servers in multiple locations, which enables users to share data and storage resources.
Why store data in general?
- Long-term information storage
- Access the result of a process later
- Store large amounts of information
- Enable access by multiple processes
File storage for personal work: buy a bigger disk? Copy the data to an external hard drive? OR...
[Diagram sequence: a Distributed File System (DFS) is a cluster of nodes organized into racks. A data set is partitioned into blocks (1-5) that are distributed across the nodes of a rack, so each part can be processed on the node that holds it ("analyze part 5 here!"). Each block is then replicated on several nodes, and multiple readers can read the same block (e.g. block 5) concurrently from different replicas: high concurrency vs. low consistency.]
Distributed File System (DFS)

When many storage computers (nodes, organized into racks) are connected through the network, we call it a DFS. A DFS provides:
- Data scalability
- Fault tolerance
- High concurrency
- Data partitioning
- Data replication
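The two mechanisms above, partitioning and replication, can be sketched in a few lines of Python. This is a toy illustration, not real DFS code: the block size, replication factor, node names, and round-robin placement are all illustrative assumptions.

```python
# Toy sketch of DFS-style partitioning and replication (not real HDFS code).
BLOCK_SIZE = 4          # bytes per block (tiny, for demonstration)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Partition the data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"hello distributed file systems!"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), NODES)
for b, nodes_for_block in placement.items():
    print(f"block {b} ({blocks[b]!r}) -> {nodes_for_block}")
```

Because every block lives on several nodes, losing one node loses no data (fault tolerance), and several readers can fetch the same block from different replicas at once (high concurrency).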
Big data problems
- Storing huge and exponentially growing datasets.
- Processing data that has a complex structure.
- The bottleneck of bringing huge amounts of data to the computation unit.
Hadoop is the solution! What is Hadoop?
Hadoop Ecosystem

Hadoop is a set of services and frameworks (open-source projects) that solves big data problems. You can consider it a suite which encompasses a number of services (ingesting, storing, analyzing, and maintaining data). Ex: to analyze the transaction data from an RDBMS, we first need to ingest it into the Hadoop Distributed File System (HDFS).
HDFS

The Hadoop Distributed File System is the core component, or you can say the backbone, of the Hadoop Ecosystem. HDFS is what makes it possible to store different types of large data sets (i.e. structured, unstructured, and semi-structured data). It helps us store our data across various nodes while maintaining a log file about the stored data (metadata). HDFS has two core components: the NameNode and the DataNodes.
HDFS

1. The NameNode is the main node, and it does not store the actual data. It contains metadata, like a log file or a table of contents. Therefore, it requires less storage but high computational resources.
2. All your data, on the other hand, is stored on the DataNodes, which therefore require more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment. That is why Hadoop solutions are very cost-effective.
3. You always communicate with the NameNode when writing data. The NameNode then tells the client which DataNodes the data should be stored and replicated on.
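The NameNode/DataNode split described above can be sketched as a toy model. This is a hypothetical simplification, not the real HDFS protocol: the class names, round-robin placement, and `"file#block"` IDs are illustrative assumptions. The key point it shows is that the NameNode keeps only metadata, while the actual bytes land on the DataNodes.

```python
# Toy sketch of the NameNode/DataNode division of labour (not real HDFS).
class DataNode:
    def __init__(self):
        self.blocks = {}                  # block_id -> bytes (the actual data)

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes        # name -> DataNode
        self.replication = replication
        self.metadata = {}                # filename -> [(block index, [node names])]

    def write(self, filename, blocks):
        """Client asks the NameNode where to write; bytes go to DataNodes."""
        names = list(self.datanodes)
        entries = []
        for i, block in enumerate(blocks):
            targets = [names[(i + r) % len(names)] for r in range(self.replication)]
            for t in targets:             # data is stored and replicated on DataNodes
                self.datanodes[t].store(f"{filename}#{i}", block)
            entries.append((i, targets))
        self.metadata[filename] = entries # the NameNode records metadata only

datanodes = {f"dn{i}": DataNode() for i in range(4)}
nn = NameNode(datanodes)
nn.write("report.txt", [b"part-0", b"part-1"])
print(nn.metadata["report.txt"])
```

Note how `NameNode.metadata` stays small (a table of contents) no matter how large the blocks are, which is why the real NameNode needs little storage but the DataNodes need a lot.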
For more information on how it works, visit: https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
YARN

Yet Another Resource Negotiator (YARN). Consider YARN the brain of your Hadoop Ecosystem. It performs all your processing activities by allocating resources and scheduling tasks. It has two major components: the ResourceManager and the NodeManager. The ResourceManager controls all the resources and decides who gets what. The NodeManager is responsible for launching containers with the specified resource constraints. A NodeManager is installed on every DataNode.
YARN components:
- ResourceManager (Scheduler + Application Manager): the Scheduler performs the scheduling of jobs based on policies and priorities defined by the administrator; the Application Manager monitors the health of the App Masters in the cluster and manages failover in case of failures.
- NodeManager: manages the execution of tasks within containers on its machine, monitors their progress, and handles any task failures or reassignments.
- Container: a bundle of resources (memory, CPU) allocated on a machine, inside which an App Master or task runs.
Hive: A data warehousing and SQL-like query language tool that allows analysts to interact with data stored in HDFS using familiar SQL syntax. Pig: A high-level scripting platform for processing and analyzing large datasets. It simplifies complex data transformations using a simple scripting language.
Spark: Although not originally part of Hadoop, Apache Spark is often integrated into the ecosystem. It offers an in-memory processing engine for faster data processing, machine learning, and graph processing.
ZooKeeper in Hadoop can be considered a centralized repository where distributed applications can put data into and retrieve data from. It makes a distributed system work together as a whole through its synchronization, serialization, and coordination services.
Advantages of Hadoop for Big Data
- Speed. Hadoop's concurrent processing, MapReduce model, and HDFS let users run complex queries in just a few seconds.
- Diversity. HDFS can store different data formats: structured, semi-structured, and unstructured.
- Cost-effectiveness. Hadoop is an open-source data framework.
- Resilience. Data stored on a node is replicated on other cluster nodes, ensuring fault tolerance.
- Scalability. Since Hadoop functions in a distributed environment, you can easily add more servers.