Hadoop Fundamentals - YARN Concepts

Introduction

Generally speaking, the components in the Hadoop 1.0 environment are well-established and well-understood. The JobTracker receives a job submission from the client, and then launches the necessary map and reduce tasks by way of the TaskTrackers running on each node. The tasks on each node report status to their local TaskTracker, which periodically reports status to the JobTracker. As each task completes, the associated TaskTracker performs the necessary cleanup and reports task slot availability to the JobTracker. When the overall job is complete, the JobTracker reports the end state of the job to the client.

A diagram showing the relationships between the Hadoop 1.0 components

As part of the initiative to create Hadoop 2.0 (YARN), the base implementation of the paradigm was completely redesigned. A number of the fundamental Hadoop concepts associated with Hadoop 1.0 were redefined, merged, or divided to create new concepts applicable to YARN.

A diagram showing the relationships between the YARN components

This post will go through the details of each new YARN concept, drawing parallels to Hadoop 1.0 where appropriate.


Container

The Container is a brand new concept for YARN, and is the fundamental piece that makes the revolutionary benefits of Hadoop 2.0 possible. A Container represents the specific set of resources that a particular application has the right to use: any subset of memory, CPU, network bandwidth, etc. on a specific node of the cluster. In order to run an application, one or more Containers are requested and, once allocated, used to perform the processing.

The Container concept completely replaces the "task slot" concept used in Hadoop 1.0. Since the base elements of the implementation are no longer "map task slots" and "reduce task slots", any distributed processing framework can be run using YARN - including the original MapReduce.
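To make this concrete, here is a minimal sketch (using the YARN Java client API; the memory and CPU values are arbitrary assumptions for illustration) of how an application describes a Container it would like to use:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static ContainerRequest buildRequest() {
        // A Container is just a bundle of resources on some node:
        // here, 1024 MB of memory and 1 virtual core (arbitrary example values).
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);

        // Passing null for the node and rack lists means the Scheduler
        // is free to place the container on any node in the cluster.
        return new ContainerRequest(capability, null, null, priority);
    }
}
```

Notice that the request names only resources, never map or reduce slots, which is exactly why frameworks other than MapReduce can run on YARN.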

Resource Manager

In Hadoop 1.0, the JobTracker was responsible for both scheduling resources and monitoring job progress. One consequence of that combination was very complex code that was difficult to refactor and optimize. In Hadoop 2.0, the Resource Manager is analogous to the JobTracker, but with the roles separated into two sub-processes: the Scheduler and the Applications Manager. These sub-processes run on a single node in the cluster and manage their responsibilities globally.

Scheduler

The Scheduler is the process that handles container allocation for all applications running in the cluster. Part of this allocation process is to consider any constraints that have been specified (per user, per group, disk quotas, etc.). In addition to the various rules managed by the cluster administrator, the Scheduler can make use of a variety of different scheduling algorithms to ensure fair access to resources (the Capacity Scheduler, the Fair Scheduler, etc.).
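The scheduler implementation is pluggable and is selected with the yarn.resourcemanager.scheduler.class property, which is normally set in yarn-site.xml; the sketch below just presents the same key/value pair programmatically, assuming the Fair Scheduler is the one desired:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigSketch {
    public static Configuration fairSchedulerConf() {
        Configuration conf = new Configuration();
        // Equivalent to the yarn-site.xml entry that tells the Resource
        // Manager which pluggable scheduler implementation to load.
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        return conf;
    }
}
```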

Applications Manager

The Applications Manager is responsible for accepting job submissions from the client and negotiating the first container for each application. This first container is used to run the Application Master (described below). The Applications Manager is also responsible for restarting any failed Application Masters.
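As a rough sketch of that hand-off (a minimal outline using the YarnClient API; the application name is an arbitrary placeholder and the Application Master launch details are omitted), a client submission looks something like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class SubmitSketch {
    public static ApplicationId submit(Configuration conf) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("yarn-concepts-demo"); // arbitrary name

        // ... set the Application Master's launch context and resources here ...

        // Submission hands the application to the Applications Manager, which
        // negotiates the first container and runs the Application Master in it.
        return yarnClient.submitApplication(appContext);
    }
}
```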

Node Manager

The Node Manager is very similar to the TaskTracker in Hadoop 1.0. Every node in the cluster that will perform work for jobs must have this process running, and all of its responsibilities are handled at the node level. The Node Manager is responsible for launching containers as directed by the Resource Manager and for periodically reporting container status to the Application Masters (described below). The Node Manager also monitors overall resource usage on its node so that the Resource Manager has current information when making scheduling decisions.
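A minimal sketch of that interaction (using the NMClient API; the shell command is an arbitrary placeholder, and the Container is assumed to have already been granted by the Scheduler) might look like this:

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSketch {
    public static void launch(Configuration conf, Container container) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // The launch context describes what the Node Manager should run
        // inside the allocated container; here, a trivial shell command.
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("echo hello-from-yarn"));

        // The Node Manager on the container's host starts the process and
        // supervises it for the rest of the container's lifetime.
        nmClient.startContainer(container, ctx);
    }
}
```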

Application Master

The Application Master is another brand new concept for YARN. Each application running in the cluster must have its own Application Master, including traditional MapReduce jobs. The Application Master negotiates any necessary resource containers from the Resource Manager's Scheduler. It is also responsible for communicating with the Node Managers of the cluster in order to track the status of all containers associated with the application. Since the Application Master itself runs in a special container negotiated by the Resource Manager's Applications Manager, it will be restarted automatically in the event of a failure.
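Putting those responsibilities together, a bare-bones Application Master loop might look something like the sketch below (using the AMRMClient API; the container sizing is an arbitrary assumption, and container launch and completion handling are omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AppMasterSketch {
    public static void negotiate(Configuration conf) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();

        // Register with the Resource Manager so the Scheduler knows this
        // Application Master is alive (empty values: no tracking host/URL).
        rmClient.registerApplicationMaster("", 0, "");

        // Ask the Scheduler for one worker container (arbitrary sizing).
        rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        // Heartbeat; the response carries any containers granted so far,
        // which would then be started through the Node Managers.
        AllocateResponse response = rmClient.allocate(0.0f);
        System.out.println("Granted: " + response.getAllocatedContainers().size());
    }
}
```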

Where can I go for more information?

Announcing Apache Hadoop 2
Apache Hadoop NextGen MapReduce
Or feel free to leave your question or comment below!
