Hadoop Integration: Installing Apache Cassandra on HortonWorks Data Platform Part 1

Overview

This post will serve as a guide to installing Cassandra on HortonWorks Data Platform 1.3.0, and will outline a generalized procedure for integrating with Hadoop so that MapReduce jobs may receive native Cassandra data.  Hadoop integration at the MapReduce level is achieved through custom implementations of MapReduce's InputSplit, InputFormat, and RecordReader classes.

What is Cassandra?

Apache Cassandra is a fault-tolerant, distributed NoSQL database management system that runs as a service and is designed to handle very large amounts of data across networks of commodity servers. It adheres to the conventional key-value store, column family structure typical of Hadoop-based components such as Apache HBase and other NoSQL databases, but it was not designed exclusively to operate within the Hadoop ecosystem; rather, it serves as a distributed database running atop a collection of machines, the "Cassandra" cluster, across which data is managed. Cassandra's data structure leverages column indexes "with the performance of log-structured updates," and offers significant support for denormalization and materialized views along with an internal caching system. Within Cassandra's model, keys map to arrays of values, which are collected into column families. The set of column families is fixed upon database creation, but a family may accept new columns at any time. Columns are specified per key, so different keys can have varying numbers of columns in any given family, as the short cassandra-cli session below illustrates. Cassandra offers an API for integrating with Hadoop, whose binding classes and methods are utilized within custom MapReduce jobs. Newer versions of the API offer Apache Pig support, with Apache Hive and Oozie support in the works. DataStax, a big data application company, offers an enterprise version of Cassandra that ships with Hadoop support out of the box.
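
To illustrate the per-key column model, here is a minimal cassandra-cli session; the Demo keyspace, Users column family, and row keys are purely illustrative:

$ cassandra-cli -h localhost
[default@unknown] create keyspace Demo with placement_strategy = 'SimpleStrategy' and strategy_options = {replication_factor:1};
[default@unknown] use Demo;
[default@Demo] create column family Users with comparator = UTF8Type;
[default@Demo] set Users[utf8('alice')][utf8('email')] = utf8('alice@example.com');
[default@Demo] set Users[utf8('bob')][utf8('phone')] = utf8('555-0100');

Both rows live in the same Users family, yet 'alice' carries only an email column while 'bob' carries only a phone column.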

Installing Cassandra 

Prerequisites

  • A working HortonWorks Data Platform (HDP) 1.3.0 cluster
  • A copy of apache-cassandra-1.2.6-bin.tar.gz
  • A Java 6 JDK (preferably Oracle JDK 1.6.0_31) 

 
Retrieve the Cassandra tarball, version 1.2.6; a download command is shown below.  It is best practice to install from tarball in order to avoid some Java configuration confusion in the process.  The Cassandra rpm package automatically installs and implements OpenJDK, which may lead to dependency discrepancies with other applications on the system that use Oracle implementations of the JDK.  Though the two implementations are largely interchangeable, this still stands as an inconvenience.
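
At the time of writing, the 1.2.6 tarball can be fetched from the Apache archive (mirror URLs may vary) and moved into /usr/local for the steps below:

$ wget http://archive.apache.org/dist/cassandra/1.2.6/apache-cassandra-1.2.6-bin.tar.gz
$ sudo mv apache-cassandra-1.2.6-bin.tar.gz /usr/local/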

Unpack the Cassandra tarball:

$ cd /usr/local
$ sudo tar -xzvf apache-cassandra-1.2.6-bin.tar.gz
$ sudo ln -s apache-cassandra-1.2.6/ cassandra
$ sudo chown -R $USER apache-cassandra-1.2.6
$ sudo rm apache-cassandra-1.2.6-bin.tar.gz

Set the Cassandra home variable:

$ vim ~/.bash_profile 

CASSANDRA_HOME=/usr/local/cassandra
export CASSANDRA_HOME 

PATH=$PATH:$CASSANDRA_HOME/bin

export PATH 

Then reload the profile:

$ source ~/.bash_profile

Create Cassandra's default log and data directories and give your user permission to write to them:

$ sudo mkdir -p /var/log/cassandra /var/lib/cassandra
$ sudo chown -R $USER /var/log/cassandra /var/lib/cassandra

Open Cassandra's primary config file and edit the following, replacing $FQDN with the node's fully qualified domain name: 

$ vim /usr/local/cassandra/conf/cassandra.yaml 

cluster_name: '<cluster_name>'
- seeds: "127.0.0.1, $FQDN"      
listen_address: $FQDN
rpc_address: $FQDN
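
For example, on a single-node setup whose fully qualified domain name is node1.example.com (cluster name and hostname here are placeholders), the edited values would read:

cluster_name: 'MyCluster'
- seeds: "127.0.0.1, node1.example.com"
listen_address: node1.example.com
rpc_address: node1.example.com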

Start Cassandra in the foreground (the -f flag) with:

$ cassandra -f 

or if not in PATH:

$ $CASSANDRA_HOME/bin/cassandra -f 
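
Once the startup log shows the node listening for Thrift clients, you can confirm it is up with nodetool, which ships in the same bin directory; a healthy node is reported as UN (Up/Normal):

$ nodetool status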

Connecting with Hadoop 

Interfacing with Hadoop occurs primarily through three Cassandra Java classes, which together allow Cassandra column families to be read in a file-like manner and recognized as legitimate input objects by MapReduce methods: 

ConfigHelper.class
ColumnFamilyInputFormat.class
ColumnFamilyOutputFormat.class 
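
All three live in the org.apache.cassandra.hadoop package. Part 2 covers their use in detail, but as a preview, here is a minimal, illustrative driver sketch showing how ConfigHelper and ColumnFamilyInputFormat wire a column family into a job; the host, keyspace, and column family names are assumptions for the example:

import java.nio.ByteBuffer;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputSetup {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "cassandra-input-example");

        // Pull input splits from Cassandra instead of HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Thrift endpoint of any live node, used to discover the ring.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "node1.example.com");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");

        // Must match the partitioner configured in cassandra.yaml.
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");

        // Keyspace and column family to read; names are illustrative.
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Demo", "Users");

        // An empty slice range requests every column of each row.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.wrap(new byte[0]),
                               ByteBuffer.wrap(new byte[0]),
                               false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        // A mapper reading this input extends
        // Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, ...> in Cassandra 1.2.
        return job;
    }
}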


See how we call Cassandra in MapReduce in Part 2. 
