Hadoop Cluster Administration: Manually setting up a bare minimum CDH4 cluster


This post explains how to manually set up the most basic CDH4 (Cloudera's Distribution including Apache Hadoop, version 4.3.0) cluster on multiple machines. This procedure is handy when the machines you want to install CDH4 on do not have Internet access.


Assumptions

  • CentOS 6.2 or 6.3
  • SELinux is disabled
  • The /etc/hosts file is in a correct state; for example:
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
 
172.17.133.21  node1.localdomain node1
172.17.133.22  node2.localdomain node2
172.17.133.23  node3.localdomain node3
172.17.133.24  node4.localdomain node4
172.17.133.25  node5.localdomain node5
  • The /etc/sysconfig/network file is in a correct state; for example:
NETWORKING=yes
HOSTNAME=node1
  • The required RPMs have been copied to all nodes (a sketch of one way to do this follows this list)
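
A minimal sketch of that propagation step, assuming the RPMs live in cdh4-installation and cdh4-installation-dependencies directories (the same directory names used by the install commands below):
# Push the RPM directories to every node; adjust the node list and paths to your environment.
for i in `seq -w 1 5` ; do
  scp -pr cdh4-installation cdh4-installation-dependencies node${i}:
done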

This tutorial uses the following node configuration:
  • node1
    • JobTracker
    • NameNode (no Secondary NameNode)
  • node2
    • TaskTracker
    • DataNode
  • node3
    • TaskTracker
    • DataNode
  • node4
    • TaskTracker
    • DataNode
  • node5
    • TaskTracker
    • DataNode





Install process

Installing RPMS on a Cluster

Install CDH4 with MRv1

Install each type of daemon package on the appropriate system(s), as follows. The commands below assume your $PWD is the cdh4-installation directory that holds the RPMs.

Install the JobTracker on the first node (node1):
yum --nogpgcheck localinstall  \
  hadoop-0.20-mapreduce-jobtracker-2.0.0+1357-1.cdh4.3.0.p0.21.el6.noarch.rpm  \
  hadoop-0.20-mapreduce-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm  \
  hadoop-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm  \
  hadoop-hdfs-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm  \
  bigtop-jsvc-1.0.10-1.cdh4.3.0.p0.14.el6.x86_64.rpm  \
  zookeeper-3.4.5+19-1.cdh4.3.0.p0.14.el6.noarch.rpm  \
  bigtop-utils-0.6.0+73-1.cdh4.3.0.p0.17.el6.noarch.rpm  \
  ../cdh4-installation-dependencies/nc-1.84-22.el6.x86_64.rpm

Install the NameNode on the first node (node1):
yum --nogpgcheck localinstall hadoop-hdfs-namenode-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm

Install the TaskTracker on all other nodes (node2 through node5):
yum --nogpgcheck localinstall  \
  hadoop-0.20-mapreduce-tasktracker-2.0.0+1357-1.cdh4.3.0.p0.21.el6.noarch.rpm  \
  hadoop-0.20-mapreduce-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm  \
  hadoop-hdfs-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm  \
  hadoop-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm  \
  bigtop-jsvc-1.0.10-1.cdh4.3.0.p0.14.el6.x86_64.rpm  \
  zookeeper-3.4.5+19-1.cdh4.3.0.p0.14.el6.noarch.rpm  \
  bigtop-utils-0.6.0+73-1.cdh4.3.0.p0.17.el6.noarch.rpm  \
  ../cdh4-installation-dependencies/nc-1.84-22.el6.x86_64.rpm

Install the DataNode on all other nodes (node2 through node5):
yum --nogpgcheck localinstall hadoop-hdfs-datanode-2.0.0+1357-1.cdh4.3.0.p0.21.el6.x86_64.rpm


Deploying HDFS on a Cluster

Configuration steps

  • Important: the configuration files for this Hadoop cluster live in /etc/hadoop/conf.my_cluster.
  • Execute this on all nodes (a verification command follows this list):
sudo cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster
sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
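
To confirm that the new configuration directory is active, you can inspect the alternatives entry:
# The output should show /etc/hadoop/conf currently pointing to /etc/hadoop/conf.my_cluster.
sudo alternatives --display hadoop-conf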

Customize Configuration Files

  • Define and set fs.defaultFS within conf/core-site.xml (an example snippet follows this list).
  • Optionally, define and set dfs.permissions.superusergroup within conf/hdfs-site.xml.
  • scp these configs to all other nodes:
for i in `seq -w 2 5` ; do
 scp -pr /etc/hadoop/conf.my_cluster/*   node${i}:/etc/hadoop/conf.my_cluster/
done
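
As a rough sketch, a minimal core-site.xml could look like the following; node1 comes from the node layout above, while port 8020 (the conventional NameNode RPC port) is an assumption here. The optional dfs.permissions.superusergroup property goes into hdfs-site.xml in the same directory.
# Write a minimal core-site.xml on node1 (example only; adjust host and port to your cluster).
sudo tee /etc/hadoop/conf.my_cluster/core-site.xml >/dev/null <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1.localdomain:8020</value>
  </property>
</configuration>
EOF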

Configure Local Storage Directories

  • Define and set dfs.namenode.name.dir within hdfs-site.xml on the NameNode.
  • Define and set dfs.datanode.data.dir within hdfs-site.xml on each DataNode.
  • scp this config to all other DataNodes: 
for i in `seq -w 3 5` ; do 
 scp -pr /etc/hadoop/conf.my_cluster/*   node${i}:/etc/hadoop/conf.my_cluster/
done
  • On the NameNode host: create the dfs.namenode.name.dir local directories.
  • On all DataNode hosts: create the dfs.datanode.data.dir local directories.
  • On the NameNode host: configure the owner of dfs.namenode.name.dir to be hdfs:hdfs.
  • On all DataNode hosts: configure the owner of dfs.datanode.data.dir to be hdfs:hdfs.
  • On the NameNode host: set the permissions of dfs.namenode.name.dir to 0700.
  • On all DataNode hosts: set the permissions of dfs.datanode.data.dir to 0700 (a combined sketch of these commands follows this list).
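
A combined sketch of the directory preparation above, using /data/1/dfs/nn and /data/1/dfs/dn purely as example paths; they are assumptions and must match whatever you set dfs.namenode.name.dir and dfs.datanode.data.dir to.
# On the NameNode (node1): create, own, and lock down the NameNode metadata directory.
sudo mkdir -p /data/1/dfs/nn
sudo chown -R hdfs:hdfs /data/1/dfs/nn
sudo chmod 0700 /data/1/dfs/nn

# On every DataNode (node2 through node5): create, own, and lock down the data directory.
sudo mkdir -p /data/1/dfs/dn
sudo chown -R hdfs:hdfs /data/1/dfs/dn
sudo chmod 0700 /data/1/dfs/dn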

Configure DataNodes to Tolerate Local Storage Directory Failure

Since this is a bare minimum installation, this step is skipped.

Format the NameNode

On the NameNode host (node1), format the file system before starting the NameNode for the first time:
sudo -u hdfs hadoop namenode -format

Configure a Remote NameNode Storage Directory

Since this is a bare minimum installation, this step is skipped.

Configure the Secondary NameNode

Since this is a bare minimum installation, this step is skipped.

Enable Trash

Since this is a bare minimum installation, this step is skipped.

Enable WebHDFS

Since this is a bare minimum installation, this step is skipped.

Deploying MapReduce v1 (MRv1) on a Cluster

Step 1: Configuring Properties for MRv1 Clusters

  • Define and set mapred.job.tracker within conf/mapred-site.xml (an example snippet follows this list).
  • scp this config to all other nodes:
for i in `seq -w 2 5` ; do
 scp -pr /etc/hadoop/conf.my_cluster/mapred-site.xml  node${i}:/etc/hadoop/conf.my_cluster/
done
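
A minimal sketch of the resulting mapred-site.xml; node1 comes from the node layout above, while port 8021 (the conventional JobTracker port) is an assumption here.
# Write a minimal mapred-site.xml (example only); mapred.local.dir is added in the next step.
sudo tee /etc/hadoop/conf.my_cluster/mapred-site.xml >/dev/null <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node1.localdomain:8021</value>
  </property>
</configuration>
EOF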

Step 2: Configure Local Storage Directories for Use by MRv1 Daemons

  • Define and set mapred.local.dir within mapred-site.xml on each TaskTracker.
  • scp this config to all other nodes: 
for i in `seq -w 3 5` ; do
 scp -pr /etc/hadoop/conf.my_cluster/mapred-site.xml  node${i}:/etc/hadoop/conf.my_cluster/
done
  • On all TaskTrackers: create the mapred.local.dir local directories.
  • On all TaskTrackers: configure the owner of mapred.local.dir to be mapred:hadoop.
  • On all TaskTrackers: set the permissions of mapred.local.dir to 0755 (a sketch of these commands follows this list).
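
A sketch of the TaskTracker directory preparation, using /data/1/mapred/local as an example path; it is an assumption and must match the value you set for mapred.local.dir.
# On every TaskTracker (node2 through node5): create the local MapReduce scratch directory.
sudo mkdir -p /data/1/mapred/local
sudo chown -R mapred:hadoop /data/1/mapred/local
sudo chmod 0755 /data/1/mapred/local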

Step 3: Configure a Health Check Script for DataNode Processes

Since this is a bare minimum installation, this step is skipped.

Step 4: Configure JobTracker Recovery

Since this is a bare minimum installation, this step is skipped.

Step 5: Deploy your Custom Configuration to your Entire Cluster

This step has already been performed.

Step 6: Start HDFS on Every Node in the Cluster

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do
 sudo service $x start
done

Step 7: Create the HDFS /tmp Directory

sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Step 8: Create MapReduce /var directories

sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the HDFS File Structure

sudo -u hdfs hadoop fs -ls -R /
You should see:
drwxrwxrwt   - hdfs     supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging


Step 10: Create and Configure the mapred.system.dir Directory in HDFS

sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system


Step 11: Start MapReduce

On the JobTracker node, run:
sudo service hadoop-0.20-mapreduce-jobtracker start

On each TaskTracker node, run:
sudo service hadoop-0.20-mapreduce-tasktracker start

Step 12: Create a Home Directory for each MapReduce User

Create a home directory for each MapReduce user. For our organization it is best to do this on the NameNode; for example:
sudo -u hdfs hadoop fs -mkdir  /user/<user>
sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.


Configuring the Daemons to Start in an MRv1 Cluster

On the JobTracker node, run:
sudo chkconfig hadoop-0.20-mapreduce-jobtracker on

On each TaskTracker node, run:
sudo chkconfig hadoop-0.20-mapreduce-tasktracker on

On the NameNode, run:
sudo chkconfig hadoop-hdfs-namenode on

On each DataNode, run:
sudo chkconfig hadoop-hdfs-datanode on


