Hadoop Fundamentals: Loading Data into Apache Hive


When working with Hive on Hadoop, it is important to have a concise, repeatable, and well-defined method for transferring data, whether local or already within HDFS, into the datastore.  This guide provides such a method, which can be scaled to the specific loading needs of the user.

Step 1 - Backup Data Files

Make a copy of the file destined for transfer into HDFS.  Alternatively, at least retain the original package in which the file(s) were received (zip, tar, etc.).
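As a minimal sketch (the filename sales.csv is hypothetical, standing in for your data file), a backup copy can be made with:

```shell
# Hypothetical data file "sales.csv"; create a small sample for illustration.
printf 'id,name\n1,alice\n' > sales.csv
# Retain an untouched copy of the original before any edits are made.
cp sales.csv sales.csv.orig
```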

Step 2 - Remove Column Headers

When working with CSV or TXT data files, column headers may be included in the original file.  Remove the column headers to avoid offsetting data positions when the file(s) are eventually loaded into a Hive table.
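One simple way to strip the header is with tail, which can print a file starting from a given line.  A sketch, again using a hypothetical sales.csv:

```shell
# Hypothetical file "sales.csv" with a header row; create a sample.
printf 'id,name\n1,alice\n2,bob\n' > sales.csv
# Emit everything from line 2 onward, dropping the header line.
tail -n +2 sales.csv > sales_noheader.csv
```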

Step 3 - Move Selected Files to HDFS

Be sure the destination directory in which the file is to be placed exists (it can be created with hadoop fs -mkdir -p /hdfs/destination/path).  Move the file using either of the following equivalent commands:

hadoop fs -copyFromLocal /path/to/file /hdfs/destination/path

hadoop fs -put /path/to/file /hdfs/destination/path

Step 4 - Create the Hive Table

At this point, if it has not been done already, establish the table schema and create the corresponding Hive table that will hold the data, using the format:

CREATE TABLE my_table (col1 <TYPE>, col2 <TYPE>, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<my_delimiter>';

If the table created is to be an external table, the following syntax applies:

CREATE EXTERNAL TABLE my_table (col1 <TYPE>, col2 <TYPE>, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<my_delimiter>' LOCATION '</hdfs/path/to/data/folder>';
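As a concrete sketch, assuming a comma-delimited text file with three illustrative columns (the table and column names here are hypothetical), a managed table might be declared as:

```sql
-- Hypothetical schema for a comma-delimited sales file:
CREATE TABLE sales (
  id      INT,
  name    STRING,
  amount  DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```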

Step 5 - Load Data into Hive Table

With the Hive table created, data may be imported to populate the schema.  For a file already within HDFS, load it into the associated Hive table by executing:

LOAD DATA INPATH '/hdfs/path/to/file' INTO TABLE my_table;  

For cases in which the data file lives on the local machine and not within HDFS, the command is:

LOAD DATA LOCAL INPATH '/local/path/to/file' INTO TABLE my_table; 
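Note that LOAD DATA INPATH moves the file from its current HDFS location into the table's directory rather than copying it.  Also, the statements above append to any existing data; if existing rows should instead be replaced, HiveQL accepts an OVERWRITE clause.  A sketch, using the same illustrative table name:

```sql
-- Replace any existing data in the table instead of appending to it:
LOAD DATA INPATH '/hdfs/path/to/file' OVERWRITE INTO TABLE my_table;
```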

Step 6 - Confirm Successful Loading

Confirm the data has been loaded by running a set of test queries from within the Hive shell.  

A - check the alignment of the data with your intended schema by executing:

select * from my_table limit 10; 

B - confirm a consistent row count by executing:

select count(*) from my_table;
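The count returned by Hive should match the line count of the header-stripped source file.  Locally, that reference count can be obtained with wc (sales_noheader.csv is a hypothetical stand-in for your prepared file):

```shell
# Hypothetical header-stripped source file; create a two-row sample.
printf '1,alice\n2,bob\n' > sales_noheader.csv
# This line count should equal the count(*) reported by Hive after loading.
wc -l < sales_noheader.csv
```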

