YARN
In Hadoop version 1.0, also referred to as MRv1 (MapReduce Version 1), MapReduce performed both processing and resource management functions. It consisted of a JobTracker, which was the single master. The JobTracker allocated resources, performed scheduling, and monitored the processing jobs. It assigned map and reduce tasks to a number of subordinate processes called TaskTrackers. The TaskTrackers periodically reported their progress to the JobTracker.
This design resulted in a scalability bottleneck due to the single JobTracker. The practical limits of such a design are reached with a cluster of about 4,000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of computational resources is inefficient in MRv1. The Hadoop framework was also limited to the MapReduce processing paradigm.
To overcome these issues, YARN was introduced in Hadoop version 2.0 in 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of resource management and job scheduling. YARN also gave Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
MapReduce limitations (Version 1, Hadoop MapReduce)
● Scalability:
  Maximum cluster size: 4,000 nodes
  Maximum concurrent tasks: 40,000
  Coarse synchronization in the JobTracker
● Single point of failure:
  The NameNode or JobTracker can become the choking point
  A failure kills all queued and running jobs
  Jobs need to be resubmitted by users
  Restart is very tricky due to complex state
● Hard partition of resources into map and reduce slots
YARN
The two major functionalities of the JobTracker are resource management and job scheduling/monitoring. The JobTracker runs into problems because the competing demands for resources and execution cycles all converge on this single point of control. The fundamental idea of YARN is to split up the two major functionalities of the JobTracker into separate processes. In the new architecture, there are two modules: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
YARN enables users to run whatever processing suits their requirements on the same cluster, using a variety of tools such as Spark for real-time processing, Hive for SQL, HBase for NoSQL, and others.
In YARN, the functionalities of resource management and job scheduling/monitoring are split into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). The ResourceManager and the NodeManagers form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
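As a minimal sketch of this relationship (assuming a running Hadoop 2.x/3.x cluster and the hadoop-yarn-client library on the classpath), a client can ask the ResourceManager for the NodeManagers it knows about and the resources each one has reported:

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The RM aggregates what every NodeManager reports about itself.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()   // total memory/vcores on the node
                    + " used=" + node.getUsed());             // resources currently allocated
        }
        yarnClient.stop();
    }
}

Note that the RM does not probe the nodes itself in this exchange; it simply serves the usage figures that each NodeManager has been reporting to it.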
The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
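To make this concrete, the sketch below is an illustration only (it assumes the hadoop-yarn-client library; the container count and sizes are arbitrary) of the core calls an ApplicationMaster makes: register with the RM, request containers from the Scheduler, receive allocations, and unregister when done:

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Step 1: register this ApplicationMaster with the ResourceManager.
        rmClient.registerApplicationMaster("", 0, "");

        // Step 2: ask the RM's Scheduler for two containers of 1 GB and 1 vcore each.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        for (int i = 0; i < 2; i++) {
            rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        }

        // Step 3: poll allocate() until the RM grants the containers.
        int granted = 0;
        while (granted < 2) {
            for (Container c : rmClient.allocate(0.1f).getAllocatedContainers()) {
                granted++;
                System.out.println("Got container " + c.getId() + " on " + c.getNodeId());
            }
            Thread.sleep(1000);
        }

        // Step 4: tell the RM the application is done.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        rmClient.stop();
    }
}

In a real ApplicationMaster, each allocated container would then be handed to an NMClient to launch work, as shown in the NodeManager section below.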
The primary components and their focus areas are:
● ResourceManager (RM)
● ApplicationsManager
● NodeManager
● ApplicationMaster (AM)
● Container
The ResourceManager has two main components: the Scheduler and the ApplicationsManager.

Scheduler
● Responsible for allocating resources to the various running applications, subject to constraints of capacities, availability, and resource queues (illustrated in the sketch after this list).
● Performs pure scheduling: it schedules based on resource containers, which specify memory, disk, and CPU.
● Offers no guarantee about restarting failed tasks, whether they fail due to application failure or hardware failures.

ApplicationsManager
● Responsible for accepting job submissions.
● Negotiates the first container for executing the application-specific AM.
● Provides the service for restarting the AM container on failure.
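As an illustration of how the Scheduler organizes capacity into resource queues (a sketch assuming the hadoop-yarn-client library and whatever queues the cluster has configured), a client can ask the RM for its queues and their capacities:

import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListQueues {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Each queue exposes its configured share of the cluster and its current usage.
        for (QueueInfo queue : yarnClient.getAllQueues()) {
            System.out.println(queue.getQueueName()
                    + " capacity=" + queue.getCapacity()                 // configured fraction of cluster resources
                    + " currentCapacity=" + queue.getCurrentCapacity()); // fraction currently in use
        }
        yarnClient.stop();
    }
}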
NodeManager
The NodeManager is a per-machine agent responsible for:
● Launching containers for applications once the Scheduler allocates them to the application (sketched below).
● Monitoring container resources to ensure that allocated containers do not exceed their allocated resource slices on the machine.
● Setting up the environment of the container for task execution, including binaries, libraries, and jars.
● Managing local storage on the node. Applications can continue to use the local storage even when they do not have an active allocation on the node, thus providing scalability and availability.
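The following sketch is illustrative only; it assumes an ApplicationMaster has already been granted a Container through AMRMClient, and the shell command is an arbitrary placeholder. It shows how the NodeManager is asked to set up a container's launch context and start it:

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LaunchContainer {
    // 'container' is an allocation previously returned by AMRMClient.allocate().
    static void launch(NMClient nmClient, Container container) throws Exception {
        // The launch context describes the container's environment:
        // commands, environment variables, and local resources (binaries, libraries, jars).
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList(
                "echo hello from a YARN container 1>> stdout 2>> stderr"));

        // Ask the NodeManager that owns this container to set it up and start it.
        nmClient.startContainer(container, ctx);
    }

    public static void main(String[] args) {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();
        // launch(nmClient, someAllocatedContainer);
    }
}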
ApplicationMaster (AM)
● Negotiates resources with the RM.
● Manages application scheduling and task execution with the NodeManagers.
● Recovers the application on its own failure: depending on recovery success, it will either recover the application from the saved persistent state or just run the application from the very beginning (see the sketch below).
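The recovery logic itself is application-specific, so the sketch below only shows how an AM running inside its container can detect that it is a restarted attempt (assuming Hadoop 2.8 or later, where ContainerId.fromString is available); what state it reloads, if any, is up to the application:

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class AttemptCheck {
    public static void main(String[] args) {
        // The NodeManager sets CONTAINER_ID in the AM container's environment.
        String containerIdStr = System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name());
        ContainerId containerId = ContainerId.fromString(containerIdStr);
        int attempt = containerId.getApplicationAttemptId().getAttemptId();

        if (attempt > 1) {
            // A previous AM attempt failed: try to resume from persisted state
            // (e.g. progress the application itself saved to HDFS).
            System.out.println("Restarted attempt " + attempt + ": recovering saved state");
        } else {
            System.out.println("First attempt: starting the application from the beginning");
        }
    }
}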
Container: a package of resources (RAM, CPU, network, disk, etc.) on a single node.
Application Workflow in YARN
1. The client submits an application (see the client-side sketch after this list).
2. The ResourceManager allocates a container to start the ApplicationMaster.
3. The ApplicationMaster registers with the ResourceManager.
4. The ApplicationMaster asks the ResourceManager for containers.
5. The ApplicationMaster notifies the NodeManager to launch the containers.
6. The application code is executed in the containers.
7. The client contacts the ResourceManager/ApplicationMaster to monitor the application's status.
8. Once processing is complete, the ApplicationMaster unregisters with the ResourceManager.
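As a minimal client-side sketch (assuming the hadoop-yarn-client library; the AM command and resource sizes are placeholders rather than a real application), steps 1 and 7 look roughly like this from the client's side, while step 8 happens inside the ApplicationMaster as in the earlier AM sketch:

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApp {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the RM for a new application and describe the AM container.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        // Placeholder command: a real application would launch its ApplicationMaster here.
        amContainer.setCommands(Collections.singletonList("sleep 30"));
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512, 1)); // memory (MB) and vcores for the AM

        ApplicationId appId = appContext.getApplicationId();
        yarnClient.submitApplication(appContext); // steps 2-6 now happen inside the cluster

        // Step 7: poll the RM for the application's status until it finishes.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        while (report.getYarnApplicationState() != YarnApplicationState.FINISHED
                && report.getYarnApplicationState() != YarnApplicationState.FAILED
                && report.getYarnApplicationState() != YarnApplicationState.KILLED) {
            Thread.sleep(1000);
            report = yarnClient.getApplicationReport(appId);
        }
        System.out.println("Final state: " + report.getFinalApplicationStatus());
        yarnClient.stop();
    }
}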