OverviewIt becomes imperative, when working with Hive in Hadoop, to have a very concise, repeatable and well defined method for transferring data locally, or from within HDFS, into the datastore. This guide will provide such a method which may be scaled to the specific loading needs of the user.
Step 1 - Backup Data Files
Make a copy of the file destined for transfer into HDFS. Alternatively, be sure to at least retain the original package in which the file(s) were received (zip / tar / etc).
Step 2 - Remove Column Headers
When working with csv or txt data files, column headers may be included within the original file. Remove the column headers to avoid an offsetting of data position when the file(s) eventually move to populate a Hive table.
Step 3 - Move Selected Files to HDFS
Be sure the destination directory to which the file is to placed exists. Move the file using:
hadoop fs -copyFromLocal /path/to/file /hdfs/destination/path
hadoop fs -put /path/to/file /hdfs/destination/path
At this point, if it hasn't been done already, establish the particular table schema and create the corresponding Hive table that will hold the data in transfer, using the format:
CREATE TABLE 'my_table' (col1, <TYPE>, col2 <TYPE> ...) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '<my_delimiter>';
If the table created is to be an external table, the following syntax applies:
CREATE EXTERNAL TABLE 'my_table' (col1, <TYPE>, col2 <TYPE> ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<my_delimiter>' LOCATION '</hdfs/path/to/data/folder>';
Step 5 - Load Data into Hive Table
With the Hive table created, data may be imported to populate the schema. Load the data file into the associated Hive table by executing:
LOAD DATA INPATH '/hdfs/path/to/file' INTO TABLE my_table;
For cases in which the data file lives on the local machine and not within HDFS, the command is:
LOAD DATA LOCAL INPATH '/local/path/to/file' INTO TABLE my_table;
Step 6 - Confirm Successful Loading
Confirm the data has been loaded by running a set of test queries from within the Hive shell.
A - check the alignment of the data with your intended schema by executing:
select * from my_table limit 10;
B - confirm a consistent row count by executing:
select count(*) from my_table;