Supplementing Hive's Format Restrictions: Compensating with R

Overview

Apache Hive provides a capable formatting system for dealing with incoming data in text and CSV files destined for storage in its warehouse structure.  Specifically, the conventional way to instruct Hive how to correctly divide data fields is to specify a termination character, which may be as simple as a comma or a quotation mark, using the declaration ROW FORMAT DELIMITED FIELDS TERMINATED BY 'my_delimiter_character'.  In the simplest circumstances, where these divisions are well defined, the process proves straightforward, such as when the delimiter is a comma and the field layout is structured as:

myfile.csv

field A, field B, field C
value A, value B, value C  

One may create the obvious schema, declaring the delimiter on the table itself, and then successfully load the data with the following:

CREATE TABLE myschema (fieldA STRING, fieldB STRING, fieldC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/HDFS/PATH/myfile.csv' INTO TABLE myschema;

This example of course represents the simplest case, but a minor change in the values can confuse the process:

myfile2.csv

field A, field B, field C
value A, "value, B, is a string,
         that has 'multiple' lines", value C



Traditionally, Hive advertises the ability to handle situations in which the delimiter character appears inside an actual field value, but in some practical applications the process has proven to fail at differentiating the characters that separate fields from arbitrary characters within the fields themselves, particularly when, as above, a field also spans multiple lines.
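
To make the ambiguity concrete, here is a minimal sketch in base R, using a slightly simplified single-line rendering of the record above: splitting the raw text on commas alone cannot tell the field terminators apart from the commas inside the quoted value, while a quote-aware parser recovers the intended three fields.

# the raw record, with the embedded newline written out explicitly
raw <- 'value A,"value, B, is a string,\nthat has \'multiple\' lines",value C'

# naive splitting on the delimiter: the commas inside the quoted field are
# indistinguishable from the field terminators, so six pieces come back
strsplit(raw, ",")[[1]]

# a quote-aware parser keeps the quoted field together and returns three columns
read.csv(text = paste("field A,field B,field C", raw, sep = "\n"),
         stringsAsFactors = FALSE)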

Compensating with R 

In cases where Hive has difficulty differentiating arbitrary values from field terminators, R along with the RHive package can offer an alternative method for achieving the desired formatting and loading the data in line with the associated schema.  Considering the case of "myfile2.csv", one may apply some simple R processing to the data and then use a single RHive function to create the final data structure in Hive.  The process is as follows:


# load RHive and connect to the Hive server
# (rhive.init() reads HIVE_HOME/HADOOP_HOME; connection details are environment specific)
library(RHive)
rhive.init()
rhive.connect()

# load the CSV into a data frame, keeping text columns as character rather than factors
mytable <- read.csv("~/path/to/myfile2.csv", stringsAsFactors = FALSE)

# replace the embedded newlines in the multi-line "field B" column with a pipe
mytable[[2]] <- gsub("\n", "|", mytable[[2]])

# replace commas with semicolons and strip apostrophes and quotes in every column,
# keeping the data frame shape intact
mytable[] <- lapply(mytable, function(s) gsub("'|\"", "", gsub(",", ";", s)))

# write the cleaned data frame to Hive (the table takes the data frame's name by default)
rhive.write.table(mytable)
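
As a quick sanity check (a minimal sketch, assuming the RHive connection above is still open and that the table has taken the data frame's name, mytable), the result can be inspected directly from R with rhive.query:

# confirm the new table is present in the warehouse
rhive.query("SHOW TABLES")
# pull back the cleaned rows for inspection
rhive.query("SELECT * FROM mytable LIMIT 10")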


There are a number of tools and applications for working around any format restrictions that become apparent when using Hive.  For most analytics-driven environments that leverage distributed processing through Hadoop, however, R is already a usual component, and it offers a streamlined solution with little added complexity.  One should be sure to understand the data handling abilities of each of the tools within a Hadoop system, as many offer handy processing and formatting solutions where others tend to fail.

 
