A Beginner's Guide To Apache Sqoop: Basic Navigation and Useful Commands

What is Sqoop?

For anyone facing the challenge of moving data out of a data warehouse and into the Hadoop environment, Apache Sqoop is a lifesaver. Sqoop provides an efficient, parallelized method of importing data from a relational database (Oracle, MySQL, etc.) either directly into HDFS or into a more managed environment such as HBase or Hive. Sqoop also supports the reverse just as efficiently, exporting data stored in HDFS back to the relational database.

What do I need to get started?

This guide assumes that Sqoop, Hive, and core Hadoop are installed and working properly. Once you have everything set up, you will also need access to an Oracle database through the appropriate JDBC connection string, using an account with the proper access privileges.

I have everything, what now?

List available databases

Given the JDBC connection string, the following command connects to the database and returns a list of all databases the account has access to. This is a quick way to verify that both the connection and the permission levels are working as expected. The -P flag, used throughout this guide, prompts for the password on the console so that it never appears in your shell history.

$ sqoop list-databases --connect <JDBC_STRING> --username <USERNAME> -P
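
As a concrete sketch, the command below uses a hypothetical Oracle host dbhost with service name ORCL and user scott (placeholders, not values from this guide); substitute your own connection details.

$ sqoop list-databases --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P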

Load data into HDFS

The following command imports data from the specified table into HDFS, placing it in a directory of the form <PARENT_DIR>/<TABLE_NAME>.

$ sqoop import --connect <JDBC_STRING> --table <TABLE_NAME> --username <USERNAME> -P --mysql-delimiters --warehouse-dir <PARENT_DIR>
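
For example, the following hypothetical invocation imports an EMPLOYEES table, producing files under /data/warehouse/EMPLOYEES (all names here are placeholders):

$ sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --table EMPLOYEES --mysql-delimiters --warehouse-dir /data/warehouse

Note that --mysql-delimiters simply selects MySQL's default field and line delimiters for the output files; it does not require a MySQL source.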

Load data into Hive

Sqoop also supports importing data directly into Hive's underlying warehouse structure so that the data will be available for querying.

If the hive-overwrite option is present, any data already in the table will be overwritten. The default behavior is to append the newly imported data to the existing table.

The split-by option is required to support parallelization in the case where a primary key is not set in the table. If manually specified, the chosen column should be one with a wide range of values so that the data splits will be meaningful. For example, if the chosen column is a boolean, the maximum possible number of parallel tasks will be only two.

$ sqoop import --connect <JDBC_STRING> --username <USERNAME> -P --table <TABLE> --hive-import --hive-overwrite --split-by <SPLIT_COLUMN>
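
A sketch of a complete invocation, again with placeholder names: EMPLOYEE_ID is assumed to be a numeric column whose values are spread across a wide range, which makes it a reasonable split column.

$ sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --table EMPLOYEES --hive-import --hive-overwrite --split-by EMPLOYEE_ID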

Once the import completes, the following Hive commands will show that the new table has been properly loaded into Hive.

$ hive
> show tables;
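
To spot-check the contents as well as the table name, a small bounded query is enough (assuming the placeholder EMPLOYEES table imported above):

> select * from employees limit 10;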

Import query results

Sqoop also allows the user to execute SQL queries against the database from the command line. The results of the query are printed to the screen.

$ sqoop eval --connect <JDBC_STRING> --username <USERNAME> -P --query <QUERY>
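
For example, a quick row count against the placeholder EMPLOYEES table; the query is wrapped in quotes so the shell passes it to Sqoop as a single argument:

$ sqoop eval --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --query "SELECT COUNT(*) FROM EMPLOYEES"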

Once a query has been shown to produce useful results, Sqoop can take the results and import them directly into HDFS. In order to support parallelization, Sqoop requires the query to be supplemented with an additional "WHERE $CONDITIONS" clause. During processing, Sqoop will replace the $CONDITIONS variable with a value that makes each instance of the query unique.

$ sqoop import --connect <JDBC_STRING> --username <USERNAME> -P --query '<QUERY> WHERE $CONDITIONS' --split-by <SPLIT_COLUMN> --target-dir <TABLE_DIR>
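
A hypothetical end-to-end example: import only the high-salary rows of the placeholder EMPLOYEES table. Because this query already has its own WHERE clause, $CONDITIONS is attached with AND instead, and the whole query is single-quoted so the shell does not try to expand $CONDITIONS itself.

$ sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --query 'SELECT EMPLOYEE_ID, NAME, SALARY FROM EMPLOYEES WHERE SALARY > 50000 AND $CONDITIONS' --split-by EMPLOYEE_ID --target-dir /data/warehouse/high_earners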

Export data to database

Sqoop can also be used to export data from HDFS into an existing table in the database.

If the table does not exist, the Sqoop eval command can be used to create it.

$ sqoop eval --connect <JDBC_STRING> --username <USERNAME> -P --query "<CREATE_TABLE>"
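
For instance, the following sketch creates a minimal placeholder table to receive the exported rows (the table and column names are illustrative only):

$ sqoop eval --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --query "CREATE TABLE EMPLOYEE_EXPORT (EMPLOYEE_ID NUMBER, SALARY NUMBER)"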

Once the table exists, use the following Sqoop export command with the appropriate connection information and the list of columns to export.

$ sqoop export --connect <JDBC_STRING> --username <USERNAME> -P --export-dir <HDFS_FOLDER> --table <DB_TABLE> --columns <COLUMN_LIST> --direct --mysql-delimiters
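
Continuing the placeholder example, this exports two columns from the earlier import directory into the EMPLOYEE_EXPORT table created above. The --direct flag is omitted here because direct mode depends on a database-specific connector and is not available for every database; --mysql-delimiters must match the delimiters the data was written with.

$ sqoop export --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --export-dir /data/warehouse/EMPLOYEES --table EMPLOYEE_EXPORT --columns EMPLOYEE_ID,SALARY --mysql-delimiters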

Support reuse through Sqoop jobs

Some import and export commands will need to be repeated, either on a set schedule or as part of the development process. Sqoop provides the ability to save commands as "jobs" so that they can be executed at any point in the future. The essential commands related to Sqoop jobs are shown below.

List Sqoop jobs

$ sqoop job --list

Create Sqoop job

$ sqoop job --create <jobName> -- <command>

Execute Sqoop job

$ sqoop job --exec <jobName>

As an example, the following commands would create a Sqoop job called "listDatabases" and then execute it so that the list of databases accessible through the connection information is printed to the screen.

$ sqoop job --create listDatabases -- list-databases --connect <JDBC_STRING> --username <USERNAME> -P
$ sqoop job --exec listDatabases
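
A recurring import can be saved the same way. As a sketch using the placeholder names from the earlier sections, the Hive import becomes a reusable job:

$ sqoop job --create importEmployees -- import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username scott -P --table EMPLOYEES --hive-import --hive-overwrite --split-by EMPLOYEE_ID
$ sqoop job --exec importEmployees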

Where can I go for more information?

The official Apache Sqoop User Guide on the project site (https://sqoop.apache.org) covers every command in this guide in far more depth, including the full option lists for import, export, and the job tool. Or feel free to leave your question or comment below!
