Hadoop Fundamentals: Loading Data into Apache Hive

Overview

When working with Hive in Hadoop, it is important to have a concise, repeatable, and well-defined method for moving data into the datastore, whether the source files live on the local machine or already reside in HDFS.  This guide provides such a method, which can be scaled to the specific loading needs of the user.

Step 1 - Backup Data Files


Make a copy of the file destined for transfer into HDFS.  Alternatively, be sure to at least retain the original package in which the file(s) were received (zip, tar, etc.).
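As a minimal sketch, a local backup might look like the following (the file name and contents are placeholders for illustration):

```shell
# Create a sample data file (stand-in for the real file to be loaded).
printf 'id,name\n1,alice\n2,bob\n' > data.csv

# Make a backup copy before any modification.
cp data.csv data.csv.bak
```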

Step 2 - Remove Column Headers


When working with csv or txt data files, column headers may be included within the original file.  Remove the column headers so the header row is not ingested as a data row when the file(s) eventually populate a Hive table.  (Alternatively, Hive can be told to skip the header at table-creation time with TBLPROPERTIES ("skip.header.line.count"="1").)
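One simple way to strip a single header row is with tail, which can emit everything from the second line onward (file names here are illustrative):

```shell
# Create a sample CSV with a header row (stand-in for the real file).
printf 'id,name\n1,alice\n2,bob\n' > data.csv

# Write a header-free copy: tail -n +2 outputs line 2 through end of file.
tail -n +2 data.csv > data_noheader.csv
```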

Step 3 - Move Selected Files to HDFS


Be sure the destination directory in which the file is to be placed exists; if not, create it first with hadoop fs -mkdir -p /hdfs/destination/path.  Copy the file into HDFS using:

hadoop fs -copyFromLocal /path/to/file /hdfs/destination/path


OR:


hadoop fs -put /path/to/file /hdfs/destination/path

Step 4 - Create the Hive Table


At this point, if it hasn't been done already, establish the particular table schema and create the corresponding Hive table that will hold the data in transfer, using the format:

CREATE TABLE my_table (col1 <TYPE>, col2 <TYPE>, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<my_delimiter>';

If the table created is to be an external table, the following syntax applies:

CREATE EXTERNAL TABLE my_table (col1 <TYPE>, col2 <TYPE>, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<my_delimiter>'
LOCATION '/hdfs/path/to/data/folder';

Step 5 - Load Data into Hive Table


With the Hive table created, data may be imported to populate the schema.  Load the data file into the associated Hive table by executing the command below.  Note that LOAD DATA INPATH moves the file from its source HDFS location into the table's directory, rather than copying it:

LOAD DATA INPATH '/hdfs/path/to/file' INTO TABLE my_table;  

For cases in which the data file lives on the local machine and not within HDFS, the command is:

LOAD DATA LOCAL INPATH '/local/path/to/file' INTO TABLE my_table; 


Step 6 - Confirm Successful Loading


Confirm the data has been loaded by running a set of test queries from within the Hive shell.

A - Check the alignment of the data with the intended schema by executing:

select * from my_table limit 10;

B - Confirm that the row count matches the source file by executing:

select count(*) from my_table;
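To know what count(*) should return, the local (header-free) file can be checked before loading; the two counts should agree.  A sketch, with an illustrative file name and contents:

```shell
# Count data rows in the local header-free file; this should match
# the count(*) result returned by Hive after loading.
printf '1,alice\n2,bob\n3,carol\n' > data_noheader.csv
wc -l < data_noheader.csv
```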
 



2 comments:

  1. You have an error, having the same query for loading data twice. LOAD DATA INPATH '/hdfs/path/to/file' INTO TABLE my_table; need to add the 'local' keyword and change the 'hdfs' path ;-)

  2. Thanks for pointing that out, Daniel. The issue has been corrected. Thanks for reading!
