Hadoop Single Point of Failure: A Thing of the Past?

Background

The initial architecture of Apache Hadoop has often been criticized for the single point of failure (SPOF) in its single-NameNode design. A failure of the Hadoop NameNode is improbable but not impossible, so system administrators should prepare their clusters for this event. Vendors have implemented solutions that could make this perceived weakness of Hadoop a thing of the past, if it is not already.

Impact of SPOF

The loss of a NameNode can affect an organization by making the cluster unavailable, or even by losing data if the proper precautions are not taken.

Availability

If the NameNode process or machine fails, the entire cluster is unavailable until the NameNode is either restarted or assigned to and started on another machine. A restarted NameNode is not available until it receives block reports from the DataNodes listing the block locations for all of the files they store. This can take hours on large clusters, so an unexpected outage results in significantly decreased availability.
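
While the restarted NameNode waits for those block reports, it stays in safe mode and rejects writes. A sketch of how an administrator might watch this from the command line, assuming a Hadoop 1.x client:

# report whether the NameNode is still in safe mode
hadoop dfsadmin -safemode get
# block until the NameNode has left safe mode
hadoop dfsadmin -safemode wait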

Data Loss

The single NameNode contains the metadata about all of the file blocks stored in HDFS. This metadata is a registry of which blocks make up each HDFS file. Without this registry, there is no way to know which blocks belong to which HDFS files. The locations of the file blocks are sent to the NameNode in periodic block reports from the DataNodes.

In the event of a NameNode failure, the file blocks that make up HDFS files would normally not be lost, since HDFS blocks are not stored on the NameNode itself. As mentioned, the NameNode holds a registry of all of the blocks in HDFS. This information is kept in an image file called fsimage and in an edit log that keeps track of all changes to the filesystem. If these files are lost or corrupted, there is no record of which blocks belong to which HDFS file, resulting in data loss for the entire cluster. Hadoop has built-in mechanisms, and there are administration practices, to protect against this case.
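
For concreteness, on a Hadoop 1.x NameNode these files live under the directory configured by dfs.name.dir. A sketch of the typical on-disk layout (paths are illustrative):

${dfs.name.dir}/current/fsimage   <- checkpoint image of the filesystem namespace
${dfs.name.dir}/current/edits     <- edit log of changes since the last checkpoint
${dfs.name.dir}/current/fstime    <- timestamp of the last checkpoint
${dfs.name.dir}/current/VERSION   <- layout version and namespace information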

In the initial Hadoop architecture, the Secondary NameNode is not really a second NameNode. It is responsible for applying edit log changes to the fsimage file through a checkpointing process, but it does not directly assist in NameNode failover. The checkpointing process runs every hour by default. In the event of a crashed NameNode, copies of the fsimage and edit log can be restored from the Secondary NameNode as a stale backup.
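
The checkpoint interval is controlled by the fs.checkpoint.period property (in seconds). As a sketch, assuming Hadoop 1.x defaults, lowering it to 30 minutes shrinks the window of changes a stale backup can lose:

<property>
  <name>fs.checkpoint.period</name>
  <!-- seconds between Secondary NameNode checkpoints; the default is 3600 (one hour) -->
  <value>1800</value>
</property>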

The stale backup might mean losing all changes made since the last checkpoint was performed. To avoid having to restore from a stale backup, Hadoop can be configured, via dfs.name.dir in hdfs-site.xml, to write the namespace metadata to a second, NFS-mounted directory that can be backed by a reliable technology such as RAID.

<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name,/share/backup/namenode</value>
</property>

Vendor Solutions

Vendors have been very active in addressing this reliability concern, and customers now have several solutions to the single point of failure inherent in the original Hadoop architecture.

MapR

MapR designed its distribution to avoid this single point of failure from the beginning, building its architecture around a distributed HA NameNode. This architecture protects against the failure of a single NameNode and also allows for greater scalability, since MapRFS does not restrict the number of files. While MapR has been touting its solution and reminding prospects about the single point of failure in Hadoop's initial architecture, its competitor Cloudera has not remained passive.

Cloudera

In an effort to eliminate MapR's competitive advantage in the area of reliability, Cloudera has been leading the effort for an HA NameNode in Apache Hadoop. This work was done in several phases.

Phase 1

The goal of Phase 1 was a Standby NameNode to which an administrator could manually fail over at any time, in a matter of seconds. The implementation sends block reports from the DataNodes to both the Active NameNode and the Standby NameNode in parallel. It relies on shared network storage between the two NameNodes for the HDFS edit logs: the Active NameNode writes to the shared storage and the Standby reads from it. This phase was completed in March of 2012, and Cloudera committed its changes to Hadoop 0.22.
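
With both NameNodes running, a planned hand-off in this setup comes down to a single administrative command. A sketch, where nn1 and nn2 are placeholders for the service IDs configured in hdfs-site.xml:

# fail over from the currently active NameNode (nn1) to the standby (nn2)
hdfs haadmin -failover nn1 nn2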

Phase 2

Phase 1 solved some use cases, such as planned maintenance, but had some limitations:

  • It still required a manual failover, so administrators had to notice that an outage had occurred.
  • The shared storage directory became the new single point of failure, and some customers do not have reliable shared storage in their environments.

Cloudera committed resources to address these concerns. Phase 2 removed the dependency on the shared storage directory and added a ZooKeeper-based automatic failover daemon to better handle unplanned outages.


To address the need for shared storage, the NameNode edit log is instead maintained by a quorum of journal machines: the Active NameNode writes each edit to a majority of them, and the Standby reads from them.
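
A minimal hdfs-site.xml sketch of this quorum-based setup, assuming a nameservice named mycluster and placeholder hostnames jn1-jn3 and zk1-zk3:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <!-- a quorum of JournalNodes replaces the NFS shared edits directory -->
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <!-- starts a ZooKeeper failover controller (ZKFC) on each NameNode -->
  <value>true</value>
</property>
<property>
  <!-- this property belongs in core-site.xml -->
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>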


These changes were merged into the Hadoop trunk in October of 2012 and are included in CDH4.1. To understand which vendor releases and versions of Hadoop have the HA NameNode features, it helps to understand Hadoop's lineage.

Hadoop Lineage

A popular diagram of Hadoop's release lineage comes from the blog below.

[Hadoop lineage diagram]
Since the HA NameNode changes were committed by Cloudera in Hadoop 0.22, you can see that CDH4 has the changes, as does Apache Hadoop 2.0, since both are derived from Hadoop 0.22. CDH3 is derived from Hadoop 0.20.2, so it does not have the HA NameNode functionality. For more information, see this additional resource: http://hadoopilluminated.com/hadoop_book/Hadoop_Versions.html

Hortonworks

Since Cloudera committed the HA NameNode changes back to Apache Hadoop, Hortonworks 2.0.0 (Apache Hadoop 2.0.2 alpha) will have the HA NameNode features, but Hortonworks 1.3.x will not, since it is on the Hadoop 1.0 branch. On its own, Hortonworks has developed a NameNode monitoring service for HDP 1.3 that can integrate with Linux HA or VMware HA. This addresses the single point of failure for Hadoop 1.0 clusters.

MapR and Hortonworks already offer their own HA NameNode architectures today. As a result of the work by Cloudera, Apache Hadoop 2.0 has an HA NameNode solution to the single-point-of-failure concerns with Hadoop. This provides better reliability for Hadoop customers and further advances the platform as a leader in parallel processing.
