Benchmarking Hadoop : Know Thine Cluster

Hadoop Benchmarking

Once your first Hadoop cluster is running, it is important to have a standard set of benchmark tests that you can run to compare the performance of any new clusters you create.  Benchmarking is usually done just after installation, before the cluster is released to its users, so that it can be tested with no other load on the system.  If users are already on the system when the benchmarks run, the results will vary with whatever else the cluster is doing.  When selecting your benchmark tests, it is worth evaluating PUMA, a Hadoop benchmarking project developed at Purdue University.

The PUMA project extends the Hadoop examples jar with many more tests, covering areas of MapReduce beyond the standard examples in the Hadoop distribution such as TeraSort, WordCount, and Grep.  The additional tests evaluate the performance of your cluster for Self-Joins, Histograms, KMeans, Sequence Counting, Classification, Term Vector and more.

Test Data

PUMA provides not only the test jar but also data downloads for the tests.  There is a data set for each test or, where necessary, a script to generate the test data.  The data sets even come in several sizes to suit different cluster sizes.

Running The Test

You can download the jar file for the tests from the PUMA download page.  Once you have the jar file, the command to run a test is similar to the following, shown here for word count:

bin/hadoop jar hadoop-*-examples.jar wordcount -r <num-reduces> <input-dir> <output-dir>

The * represents your Hadoop distribution's version number.  The hadoop-*-examples.jar should be located in your Hadoop root directory.  The command line parameters for every test in PUMA are listed on the download page.
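
Since the point of benchmarking is to compare one cluster against another, it helps to record how long each run takes.  The following is only a minimal sketch of one way to do that in bash, not something PUMA provides: the function name time_benchmark and the RESULTS_FILE variable are made up for illustration.  It runs whatever command you hand it, measures wall-clock seconds, and appends a line to a CSV file.

```shell
#!/usr/bin/env bash
# Sketch only: time one benchmark run and append the result to a CSV log.
# time_benchmark and RESULTS_FILE are hypothetical names, not part of
# Hadoop or PUMA.

RESULTS_FILE="${RESULTS_FILE:-benchmark_results.csv}"

time_benchmark() {
  local label="$1"
  shift
  local start end status=0

  start=$(date +%s)
  "$@" || status=$?        # run the benchmark command exactly as given
  end=$(date +%s)

  # One CSV row per run: label, elapsed wall-clock seconds, exit status
  echo "${label},$((end - start)),${status}" >> "$RESULTS_FILE"
  return "$status"
}

# Example call (fill in the placeholders for your own cluster first):
#   time_benchmark wordcount bin/hadoop jar hadoop-*-examples.jar \
#       wordcount -r <num-reduces> <input-dir> <output-dir>
```

Running the same labels on each new cluster gives you a simple file of timings to compare against your first cluster's numbers.  Whole-second resolution from date is coarse, but it should be adequate for jobs that run for minutes.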

More Information
For more information on PUMA click here.

