A Data Scientist's Guide to Getting Started with RHive

What is RHive? 

RHive is an R package developed by NexR to provide distributed analysis capabilities through Hadoop. RHive allows an analyst to interact with data stored in a Hadoop Distributed File System (HDFS) cluster using the familiar SQL-like constructs provided by Hive. Furthermore, distributed analysis can be extended beyond the stock functions included in Hive by writing user-defined R functions (UDFs) that are distributed and processed on the cluster. In this article, we will illustrate how to connect to a cluster and perform some basic operations using RHive.

Prerequisites

This article assumes that you have already completed the client-side and server-side installation steps necessary to begin executing RHive commands.

Initializing RHive and Connecting to Data


Once RHive is installed on your system, you must first initialize some environment variables within R to allow RHive to use the Hadoop and Hive libraries loaded on your system.  You accomplish this using the rhive.init command:

>rhive.init(hive='/path/to/hive',hadoop_home='/path/to/hadoop'[,hadoop_conf='/path/to/hadoop_conf'])



Note that hadoop_conf is optional; however, it is often useful when connecting to remote Hadoop clusters, because it lets you keep each cluster's configuration separate. It is also possible to initialize a few other variables, such as additional Hive and Hadoop libraries.

Once RHive is initialized, connecting to your Hive instance is as simple as running:

>rhive.connect('hive-server-url')
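Because both steps depend on local paths being correct, it can help to check them before calling RHive. The following is a minimal sketch, not part of RHive itself: missing_paths and start_rhive are hypothetical helper names, and the argument names passed to rhive.init are the ones used in this article, which may differ in other RHive versions.

```r
# Hypothetical helper (not part of RHive): report which of the given
# directories do not exist on this machine.
missing_paths <- function(...) {
  paths <- c(...)
  paths[!dir.exists(paths)]
}

# Hypothetical wrapper around the two calls shown above. The argument names
# passed to rhive.init follow this article and may differ between versions.
start_rhive <- function(hive, hadoop_home, host) {
  bad <- missing_paths(hive, hadoop_home)
  if (length(bad) > 0) {
    stop("Directory not found: ", paste(bad, collapse = ", "))
  }
  RHive::rhive.init(hive = hive, hadoop_home = hadoop_home)
  RHive::rhive.connect(host)
}
```

With the placeholders from this article, a call would look like start_rhive('/path/to/hive', '/path/to/hadoop', 'hive-server-url'). Failing early on a bad path tends to give a clearer message than an error raised later from inside the Hadoop or Hive libraries.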

Basic Operations


After RHive is configured and connected, you can run some simple commands to verify the connection and to kick-start your analysis:

>rhive.list.tables() - lists all tables available in the Hive database

>rhive.desc.table('table-name') - returns the schema of a table in the database

>sample<-rhive.query('select * from table limit 1000') - runs a query on a table; in this case, it returns at most 1000 rows, which serves as a quick (not random) sample
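If you pull samples from several tables, the query string can be built by a small function instead of being typed each time. This is a sketch under stated assumptions: build_sample_query and fetch_sample are hypothetical names, not RHive functions, and the wrapper assumes that rhive.query returns a data frame on an already-connected session.

```r
# Hypothetical helper (not part of RHive): build the sampling query as a string.
build_sample_query <- function(table_name, n = 1000) {
  # Accept only plain identifiers (optionally db.table) so the name cannot
  # smuggle extra SQL into the query.
  stopifnot(grepl("^[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?$", table_name))
  stopifnot(is.numeric(n), n >= 1)
  sprintf("select * from %s limit %d", table_name, as.integer(n))
}

# Hypothetical wrapper: run the query on an already-connected session.
# Assumes rhive.query returns a data frame.
fetch_sample <- function(table_name, n = 1000) {
  RHive::rhive.query(build_sample_query(table_name, n))
}
```

For example, sample <- fetch_sample('web_logs') would request up to 1000 rows from a table named web_logs (a made-up name), after which base R functions such as str(sample) and nrow(sample) can be used to inspect the result locally.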
