"HDFS error: data could only be replicated to 0 nodes, instead of 1"

Versions

CDH3 
CDH4

Overview


After certain service interruptions, permission changes, or node reformatting, you may run into data replication errors, specifically "HDFS error: {data} could only be replicated to 0 nodes, instead of 1".  The source of the issue varies with the situation, but before starting a debugging deep-dive or reformatting the namenode, check the "namespaceID" assigned to each node in the working cluster.  A simple ID mismatch may be causing the problem, leaving data nodes unrecognized by the namenode.

Pinpointing a Possible Mismatch 

To uncover a possible inconsistency in assigned namespace identifiers, compare the namespaceID on the namenode with the one on the first accessible data node.  This value is stored in the "VERSION" file on each machine, inside the working dfs directory, for example:

/dfs/nn/current/VERSION
and 
/dfs/dn/current/VERSION 

Open each of these files and note the value assigned to namespaceID:



cat /dfs/nn/current/VERSION

namespaceID=2008523184
cTime=0
storageType=NAME_NODE
layoutVersion=-19


then

cat /dfs/dn/current/VERSION

namespaceID=2008523787
cTime=0
storageType=DATA_NODE
layoutVersion=-19


Note the inconsistency between the name and data nodes in this particular example: the namenode reports 2008523184, while the data node reports 2008523787.
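
The comparison above can also be scripted. What follows is only a minimal sketch, assuming both VERSION files are readable from the machine where it runs (for example, after copying the data node's file over). The helper names get_namespace_id and namespace_ids_match are made up for illustration and are not part of Hadoop.

```shell
# Print the namespaceID value stored in a VERSION file.
# (get_namespace_id is a hypothetical helper name, not a Hadoop command.)
get_namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# Succeed (exit status 0) only if both VERSION files carry the same,
# non-empty namespaceID; fail (non-zero status) otherwise.
namespace_ids_match() {
    local nn_id dn_id
    nn_id=$(get_namespace_id "$1")
    dn_id=$(get_namespace_id "$2")
    [ -n "$nn_id" ] && [ "$nn_id" = "$dn_id" ]
}
```

With the example paths from this post, a check might look like: namespace_ids_match /dfs/nn/current/VERSION /dfs/dn/current/VERSION || echo "namespaceID mismatch"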

Repairing the Issue 

With the root of the issue uncovered, the namespace identifiers need to match across the cluster.  This involves editing the VERSION file on each data node and replacing its namespaceID with the namespaceID found in the namenode's VERSION file.  Do the following:

a.) Stop Hadoop services.

b.) Copy the namespaceID value located in the VERSION file associated with the namenode.  

c.) Access the VERSION file associated with the data node and replace the namespaceID with that copied from the namenode.  If your cluster involves multiple machines, this will likely need to be performed in bulk.  A simple bash script that accomplishes this task will be available in an upcoming post.  

d.) Start Hadoop services.  
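
For a single data node, step c.) could be sketched roughly as below. This is an illustration, not the bulk script mentioned above. It assumes Hadoop services are already stopped, that the namespaceID is a plain positive integer as in the example files, and that keeping a backup next to the original is acceptable. The function name set_namespace_id is hypothetical.

```shell
# set_namespace_id FILE NEW_ID
# Rewrite the namespaceID line in FILE, keeping a backup at FILE.bak.
# (Hypothetical helper for illustration; not a Hadoop tool.)
set_namespace_id() {
    local version_file new_id
    version_file=$1
    new_id=$2

    # Assumption: IDs are digits only; refuse anything else.
    case $new_id in
        ''|*[!0-9]*) echo "invalid namespaceID: $new_id" >&2; return 1 ;;
    esac

    # Back up first, then write the edited copy over the original.
    cp -p "$version_file" "$version_file.bak" || return 1
    sed "s/^namespaceID=.*/namespaceID=$new_id/" "$version_file.bak" > "$version_file"
}
```

Using the namenode value from this post's example, the call would be: set_namespace_id /dfs/dn/current/VERSION 2008523184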

Summary 

Data replication errors can often be attributed to deeper and less straightforward issues than the one explained here.  This post is meant simply to serve as a checkpoint for one of the simpler, more common sources of this error, before significant time is spent on more complex troubleshooting and platform debugging.
