Hadoop Fundamentals - YARN and MRv2

Introduction

A defining feature of the Hadoop 1.0 ecosystem is that the tools typically packaged in a distribution are abstraction layers on top of the MapReduce paradigm, meant to be used in conjunction with HDFS. In practice, this means that all functionality must resolve to an implementation built from a map function (which processes a single key/value pair at a time) followed by a reduce function (which processes all of the values grouped under a common key).
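To make the map-then-reduce shape concrete, here is a minimal sketch in plain Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the grouping step stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: handle ONE key/value pair (line number, line text) and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: receive every value grouped under one key and emit a single total."""
    return key, sum(values)


def run_job(records):
    """Illustrative driver: map each record, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)  # stand-in for the shuffle phase
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))


lines = [
    (0, "yarn runs mapreduce"),
    (1, "mapreduce reads hdfs"),
    (2, "yarn reads hdfs"),
]
counts = run_job(lines)
print(counts)
```

Anything that cannot be phrased as this pair of functions has to be broken into a chain of such jobs, which is the constraint that Hadoop 1.0 tools work around.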

Hadoop 2.0 completely rethinks this philosophy around two main components: YARN and MRv2.



YARN

YARN can be succinctly described as a generic framework for running distributed applications, which typically pull their data from HDFS. The main difference between Hadoop 1.0 and Hadoop 2.0 is that MapReduce is no longer the cornerstone of the paradigm. HDFS is still a main component, but the second piece of the package is now YARN instead of MapReduce. The main benefit of this drastic switch is that graph traversal, MPI, traditional MapReduce, and other types of distributed applications can all run on top of YARN, side by side in the same cluster.
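The idea of one resource layer serving several kinds of applications can be sketched with a toy model. This is not YARN's API: `ToyResourceManager`, `Node`, and `request_container` are hypothetical names, and the first-fit placement is an assumption made purely to keep the example short. The point is only that the allocator hands out memory without caring which framework is asking.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int


@dataclass
class ToyResourceManager:
    """Toy stand-in for a cluster-wide resource negotiator (not the real YARN API)."""
    nodes: list
    allocations: list = field(default_factory=list)

    def request_container(self, app_name, framework, memory_mb):
        """Grant memory on the first node with enough room, or return None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                grant = (app_name, framework, node.name, memory_mb)
                self.allocations.append(grant)
                return grant
        return None  # cluster is full; the framework label played no part


rm = ToyResourceManager([Node("node1", 4096), Node("node2", 2048)])

# Three different application types share the same cluster.
a = rm.request_container("wordcount", "mapreduce", 2048)
b = rm.request_container("pagerank", "graph", 2048)
c = rm.request_container("simulation", "mpi", 2048)
d = rm.request_container("late-job", "mapreduce", 1024)

for grant in (a, b, c, d):
    print(grant)
```

In this sketch the `framework` field is recorded but never inspected when placing a container, which mirrors the shift described above: MapReduce becomes one client of the resource layer among several.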

"YARN" should be treated as more of a codename than an acronym, although it does stand for something - "Yet Another Resource Negotiator".

MRv2

MapReduce version 2 (MRv2) is a redesign of the MapReduce framework implementation that allows MapReduce applications to run on top of YARN. Several changes in the environment (covered in a separate blog post) must be considered when creating an application designed for Hadoop 2.0, and the new version of MapReduce accounts for them.

Where can I go for more information?

Announcing Apache Hadoop 2
Or feel free to leave your question or comment below!
