About This Presentation
HDFS, MapReduce, Versions 1 and 2
The Hadoop Distributed File System (HDFS) was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed to run on low-cost hardware.
Slide Content
Hadoop – HDFS, MapReduce Version 1, 2 | Big Data Analytics (BDA) | GTU #3170722
HDFS MapReduce
Hadoop Architecture
Map Reduce
MapReduce is the core processing component of the Hadoop ecosystem: it provides the processing logic. MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. By using distributed and parallel algorithms, MapReduce carries the processing logic across the cluster and helps write applications that transform big data sets into manageable results. MapReduce uses two functions, Map() and Reduce(): the Map() function performs actions such as filtering, grouping, and sorting, while the Reduce() function aggregates and summarizes the results produced by the Map() function, as the sketch below illustrates.
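To make the Map() and Reduce() roles concrete, here is a minimal word-count sketch using the Hadoop Java MapReduce API. The class names (WordCountMapper, WordCountReducer) are illustrative and not part of the original slides; the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): splits each input line into words and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce(): aggregates all counts emitted for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```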
Hadoop Distributed File System
The Hadoop Distributed File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed for low-cost hardware. HDFS holds very large amounts of data and provides easy access. To store such huge data, files are stored across multiple machines, and they are stored redundantly to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing. Features of HDFS: it is suitable for distributed storage and processing; Hadoop provides a command interface to interact with HDFS; the built-in servers of the NameNode and DataNodes help users easily check the status of the cluster; it provides streaming access to file system data; and it provides file permissions and authentication. A minimal client sketch follows below.
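As a minimal sketch of programmatic access to HDFS, the snippet below uses the Hadoop FileSystem Java API to write a small file and list a directory. The NameNode address (hdfs://namenode:9000) and the paths are hypothetical placeholders; in practice fs.defaultFS usually comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical cluster address

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them.
        Path file = new Path("/user/demo/hello.txt");     // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // List the directory to check that the file landed.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```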
HDFS Master-Slave Architecture
HDFS follows a master-slave architecture and has the following elements.
Hadoop Core Components - HDFS
The NameNode represents every file and directory used in the namespace. A DataNode manages the state of an HDFS node and allows clients to interact with the blocks: it stores the actual data blocks and serves read and write requests.
Secondary NameNode
The Secondary NameNode works concurrently with the primary NameNode as a helper daemon/process. It constantly reads all the file system metadata from the RAM of the NameNode and writes it to the hard disk or the file system. It is responsible for combining the EditLogs with the FsImage from the NameNode: it downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode starts. Hence, the Secondary NameNode performs regular checkpoints in HDFS, which is why it is also called the CheckpointNode.
Block
A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. A configuration sketch follows below.
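As a sketch of how the block size can be configured, the snippet below sets dfs.blocksize programmatically and also overrides it for a single file at creation time. The value, path, and replication factor are illustrative; clusters normally set dfs.blocksize in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size (here 256 MB instead of the 128 MB default).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/demo/large-output.dat");    // hypothetical path

        // The block size can also be chosen per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(p, true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.write(new byte[1024]);                        // placeholder payload
        }
        fs.close();
    }
}
```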
HDFS Architecture
Hadoop Cluster
To improve network performance: communication between nodes residing on different racks is directed via a switch. In general, there is greater network bandwidth between machines in the same rack than between machines residing in different racks. Rack awareness helps reduce write traffic between racks, providing better write performance, and read performance improves because the bandwidth of multiple racks is used. To prevent loss of data: we do not have to worry about the data even if an entire rack fails because of a switch failure or power failure, since replicas are kept on more than one rack. As the saying goes, never put all your eggs in the same basket.
HDFS Write Architecture
Suppose an HDFS client wants to write a file named "example.txt" of size 248 MB, and assume the block size is configured as 128 MB (the default). The client will divide "example.txt" into two blocks: one of 128 MB (Block A) and one of 120 MB (Block B), as the small sketch below illustrates.
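The arithmetic behind that split can be sketched in a few lines of Java; the numbers mirror the example above and are purely illustrative.

```java
public class BlockSplitSketch {
    public static void main(String[] args) {
        final long MB = 1024L * 1024;
        long blockSize = 128 * MB;   // default HDFS block size
        long fileSize  = 248 * MB;   // size of "example.txt" from the slide

        int blockNo = 0;
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            long len = Math.min(blockSize, fileSize - offset);
            System.out.printf("Block %c: %d MB%n", (char) ('A' + blockNo++), len / MB);
        }
        // Prints "Block A: 128 MB" and "Block B: 120 MB".
        // The final block occupies only its actual length on disk, not a full 128 MB.
    }
}
```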
MapReduce Application
Developing a MapReduce Application
MapReduce is a programming model for building applications that can process big data in parallel on multiple nodes. It provides analytical capabilities for analyzing large amounts of complex data. The traditional model is not suitable for processing such large amounts of data, which cannot be handled by standard database servers; Google solved this problem with the MapReduce algorithm. MapReduce is a distributed data-processing algorithm introduced by Google and influenced by the functional programming model. In a cluster environment, the MapReduce algorithm is used to process large volumes of data efficiently, reliably, and in parallel. It uses a divide-and-conquer approach: the input task is divided into manageable sub-tasks that execute in parallel.
How MapReduce Works?
MapReduce divides a task into small parts and assigns them to many computers. The results are collected in one place and integrated to form the result dataset. The MapReduce algorithm contains two important tasks: Map (splits and mapping) and Reduce (shuffling and reducing). The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs). The Reduce task takes the output from the Map as input and combines those data tuples (key-value pairs) into a smaller set of tuples. The Reduce task is always performed after the Map job.
MapReduce Architecture Example
Input: "Welcome to Hadoop Class / Hadoop is good / Hadoop is bad". Output of the MapReduce task (word counts): bad 1, Class 1, good 1, Hadoop 3, is 2, to 1, Welcome 1. A driver that wires this word-count job together is sketched below.
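A minimal driver for this word-count example, assuming the hypothetical WordCountMapper and WordCountReducer classes sketched earlier, could look like this; the input and output paths are passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // splits & mapping
        job.setReducerClass(WordCountReducer.class);  // shuffling & reducing
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input text, e.g. the lines above
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would be run with something like `hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output` (paths illustrative).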
How MapReduce Works? – Cont.
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities: the Job Tracker, which acts like a master (responsible for complete execution of the submitted job), and multiple Task Trackers, which act like slaves, each performing part of the job. For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.
How MapReduce Works? – Cont.
A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster. It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes. Execution of each individual task is then looked after by a task tracker, which resides on every data node executing part of the job. The task tracker's responsibility is to send progress reports to the job tracker. In addition, the task tracker periodically sends a 'heartbeat' signal to the JobTracker to notify it of the current state of the system. Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
MapReduce Algorithm – Cont.
Input Phase: a Record Reader translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs. Map: a user-defined function that takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs. Intermediate Keys: the key-value pairs generated by the mapper are known as intermediate keys. Combiner: a combiner is a type of local reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not part of the main MapReduce algorithm; it is optional. A sketch of wiring a combiner into the word-count job is shown below.
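As a sketch of how a combiner is wired in, the word-count Reducer from earlier can double as the combiner, because summing partial counts is associative and commutative; the class names are the same hypothetical ones used above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class WordCountWithCombiner {
    // Same word-count job as before, but the Reducer is also registered as a
    // combiner, so partial sums are computed locally on each mapper before the
    // shuffle, reducing the data sent across the network. This is only safe
    // because summing is associative and commutative.
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCountMapper.class);     // hypothetical mapper from earlier
        job.setCombinerClass(WordCountReducer.class);  // local, per-mapper aggregation
        job.setReducerClass(WordCountReducer.class);   // final, global aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}
```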
MapReduce Algorithm – Cont.
Shuffle and Sort: the Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups equivalent keys together so that their values can be iterated easily in the Reducer task. Reducer: the Reducer takes the grouped key-value paired data as input and runs a Reducer function on each group. Here, the data can be aggregated, filtered, and combined in a number of ways, and it can require a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step. Output Phase: in the output phase, an output formatter translates the final key-value pairs from the Reducer function and writes them to a file using a record writer.
Difference between Hadoop 1 and Hadoop 2
1. Components: Hadoop 1 has MapReduce, whereas Hadoop 2 has YARN (Yet Another Resource Negotiator) and MapReduce version 2.
Difference between Hadoop 1 and Hadoop 2
2. Working: In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both resource management and data processing. This extra workload on MapReduce affects performance. In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is YARN, which works as resource management: it allocates the resources and keeps everything running.
Difference between Hadoop 1 and Hadoop 2
3. Limitations: Hadoop 1 has a master-slave architecture consisting of a single master and multiple slaves. If the master node crashes, then regardless of how good your slave nodes are, your cluster is destroyed, and recreating that cluster (copying system files, image files, etc. to another system) is too time-consuming for organizations to tolerate today. Hadoop 2 also has a master-slave architecture, but it consists of multiple masters (i.e., active NameNodes and standby NameNodes) and multiple slaves. If the master node crashes here, a standby master node takes over, and you can make multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.
Difference between Hadoop 1 and Hadoop 2 4. Ecosystem:
Difference between Hadoop 1 and Hadoop 2
Resource Management: Hadoop 1.x has a single JobTracker for resource management and job scheduling. Hadoop 2.x uses YARN (Yet Another Resource Negotiator), which separates resource management (ResourceManager) from job scheduling (ApplicationMaster) and allows multiple processing engines to share resources efficiently.
Scalability: Hadoop 1.x faces challenges in horizontal scalability, with the single JobTracker as a bottleneck. Hadoop 2.x addresses scalability issues with YARN, enabling more flexible and scalable resource management.
Processing Engines: Hadoop 1.x is primarily designed for the MapReduce programming model. Hadoop 2.x supports multiple processing engines, not just MapReduce, allowing coexistence of MapReduce and other processing models.
High Availability: Hadoop 1.x has a single point of failure with the JobTracker. Hadoop 2.x introduces high availability for the NameNode in HDFS, as well as high-availability configurations for the ResourceManager and other components.
Job Scheduling: Hadoop 1.x has a centralized JobTracker that schedules MapReduce jobs. Hadoop 2.x uses decentralized job scheduling with YARN; each application has its own ApplicationMaster, which negotiates resources and schedules tasks.
Fault Tolerance: Hadoop 1.x ensures fault tolerance through data replication in HDFS but faces challenges with JobTracker failure. Hadoop 2.x improves fault tolerance with high-availability configurations for the ResourceManager and NameNode.
Difference between Hadoop 1 and Hadoop 2
Architecture: Hadoop 1.x has a single JobTracker for resource management and job scheduling. Hadoop 2.x uses the YARN architecture with a separate ResourceManager and ApplicationMasters, allowing more flexible resource utilization.
Processing Flexibility: Hadoop 1.x is primarily designed for batch processing using MapReduce. Hadoop 2.x supports various processing engines, making it suitable for batch processing, interactive queries, streaming, and more.
Resource Utilization: In Hadoop 1.x, resource utilization is less flexible, as the JobTracker manages resources for all jobs. In Hadoop 2.x, YARN provides a more flexible and fine-grained approach to resource management, allowing better utilization across different applications.
Ecosystem Evolution: In Hadoop 1.x, initial ecosystem projects emerge but remain limited compared to later versions. In Hadoop 2.x, the ecosystem expands with additional projects such as Apache Spark, Apache Flink, and Apache Tez, enabling diverse data processing scenarios.
Community and Support: Hadoop 1.x laid the foundation for the Hadoop ecosystem, but with certain limitations. In Hadoop 2.x, active development and community support continue, with ongoing improvements and adaptations.
Version Compatibility: MapReduce-based applications need adaptation for Hadoop 2.x. Hadoop 2.x maintains compatibility with MapReduce-based applications while enabling the development of new applications using YARN.
Hadoop 3.0
Hadoop 3.0 is a major release in the Apache Hadoop project, building upon the foundations of Hadoop 2.x. Released in December 2017, Hadoop 3.0 introduces several significant features, improvements, and enhancements over its predecessors. Key aspects of Hadoop 3.0 include:
Erasure Coding in HDFS: Hadoop 3.0 introduces erasure coding as an alternative to replication for data fault tolerance in the Hadoop Distributed File System (HDFS). Erasure coding is more storage-efficient than replication, reducing storage overhead.
GPU Support: Hadoop 3.0 includes support for GPU (Graphics Processing Unit) acceleration. This enables certain computations to be offloaded to GPUs, enhancing performance for specific tasks.
Improved Shell Scripting: Shell scripts can be used directly in Hadoop commands without the need for Java wrappers. This simplifies the development and execution of scripts in the Hadoop ecosystem.
Native Support for Microsoft Azure Data Lake and Aliyun Object Storage: Hadoop 3.0 provides native integration with Microsoft Azure Data Lake Storage and Aliyun (Alibaba Cloud) Object Storage, expanding its cloud compatibility.
Hadoop 3.0
Enhanced Resource Utilization and Performance: Hadoop 3.0 includes improvements in resource utilization and overall performance, optimizing resource management and job execution.
Support for Java 8: Hadoop 3.0 fully supports Java 8, allowing users to leverage the features and improvements introduced in that version of Java.
Compatibility and Upgradability: Efforts have been made to maintain backward compatibility with previous Hadoop releases, making it easier for users to upgrade their existing Hadoop clusters to version 3.0.
Continued Evolution: Hadoop 3.0 is part of the ongoing evolution of the Hadoop ecosystem to meet the growing demands of big data processing. While maintaining compatibility with existing applications, Hadoop 3.0 introduces features that enhance performance, storage efficiency, and integration with modern technologies.
Difference between Hadoop 1.x, Hadoop 2.x, and Hadoop 3.0
Resource Management: Hadoop 1.0 has a single JobTracker for resource management and job scheduling. Hadoop 2.0 introduces YARN (Yet Another Resource Negotiator), separating resource management (ResourceManager) from job scheduling (ApplicationMaster). In Hadoop 3.0, YARN continues as the resource manager, with further optimizations and enhancements.
Scalability: Hadoop 1.0 faces challenges in horizontal scalability, with the single JobTracker as a bottleneck. Hadoop 2.0 addresses scalability issues with YARN, enabling more flexible and scalable resource management. Hadoop 3.0 brings ongoing improvements in scalability and optimizations.
Processing Engines: Hadoop 1.0 is primarily designed for the MapReduce programming model. Hadoop 2.0 supports multiple processing engines, not just MapReduce, allowing coexistence of MapReduce and other processing models. Hadoop 3.0 continues support for diverse processing engines and optimizations.
Difference between Hadoop 1.x, Hadoop 2.x, and Hadoop 3.0
High Availability: Hadoop 1.0 has a single point of failure with the JobTracker. Hadoop 2.0 introduces high availability for the NameNode in HDFS, as well as high-availability configurations for the ResourceManager and other components. Hadoop 3.0 brings high-availability enhancements and ongoing improvements.
Job Scheduling: Hadoop 1.0 has a centralized JobTracker that schedules MapReduce jobs. Hadoop 2.0 uses decentralized job scheduling with YARN; each application has its own ApplicationMaster, which negotiates resources and schedules tasks. In Hadoop 3.0, YARN continues to handle decentralized job scheduling.
Fault Tolerance: Hadoop 1.0 ensures fault tolerance through data replication in HDFS but faces challenges with JobTracker failure. Hadoop 2.0 improves fault tolerance with high-availability configurations for the ResourceManager and NameNode. Hadoop 3.0 brings further improvements in fault tolerance and optimizations.
Difference between Hadoop 1.x, Hadoop 2.x, and Hadoop 3.0
MapReduce Framework: Hadoop 1.0 has the original MapReduce framework. Hadoop 2.0 introduces MapReduce 2.0 (MRv2) on YARN, allowing greater flexibility and improved resource utilization. Hadoop 3.0 brings ongoing improvements in MapReduce performance and capabilities.
HDFS Features: Hadoop 1.0 offers basic features with data replication for fault tolerance. Hadoop 2.0 introduces HDFS Federation for improved scalability and high availability for the NameNode. Hadoop 3.0 introduces erasure coding for storage efficiency, along with optimizations and continued enhancements.
Containerization: Hadoop 1.0 has limited support for containerization. Hadoop 2.0 increases support for containerization with YARN, allowing applications to run in Docker containers. Hadoop 3.0 enhances support for containerization, leveraging technologies like Docker and Kubernetes.
Difference between Hadoop 1.x, Hadoop 2.x, and Hadoop 3.0
Java Support: Hadoop 1.0 is compatible with Java 6 and 7. Hadoop 2.0 is compatible with Java 7. Hadoop 3.0 fully supports Java 8 and is compatible with newer versions.
Security Enhancements: Hadoop 1.0 has basic security features. Hadoop 2.0 introduces Kerberos-based authentication, authorization, and HDFS encryption. Hadoop 3.0 brings ongoing enhancements in security and introduces newer security features.
Ecosystem Growth: Hadoop 1.0 has a growing ecosystem with MapReduce-centric tools. Hadoop 2.0 has an expanded ecosystem with diverse tools for data processing, storage, and analytics. Hadoop 3.0 sees continued growth and diversification of the Hadoop ecosystem.