Four Useful Tricks for Working with Hive

Four Useful "Tricks" for working with Hive


1. Display database name and column headers


After start Hive command line interface (CLI), type
hive> set hive.cli.print.current.db=true;
hive> set hive.cli.print.header=true;
The first command line tell hive to display current database name in Hive CLI
The second command line tell hive to print column headers in results


2. Replicate IN/EXIST in SQL by using LEFT SEMI JOIN


SQL allows you to write IN/EXIST subquery syntax, such as

SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);

Since Hive currently does not support IN/EXIST subquery, we can rewrite with LEFT SEMI JOIN in Hive. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses.

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)


3. Avoid NULLs in inner join (default join)


Assume we have two tables (impressions and creative) and we want to join them on id. 

SELECT * FROM
FROM impression I JOIN creative C on (I.creativeid = C.id)

The joining procedure in Hive is demonstrated in the following picture

Records with the NULL key all end up at the same reducer. This will skew the load and break parallelism. Null keys will never be used in INNER JOIN, so it is pointless for them to get to the reducer. Reducing is slow. Try to avoid reducing phase if possible. We should adjust our query as

SELECT * FROM
FROM impression I JOIN creative C on (I.creativeid = C.id
AND I.creativeid IS NOT NULL
AND C.id IS NOT NULL)


4. Change the delimiter when exporting hive table


When exporting hive table, hive by default sets the delimiters as “^A” (ctrl-A '\001'), which is uncommon for other software.
Here is an argument that we can use to change the delimiter in the exporting table
hadoop fs -cat /user/hive/warehouse/database_name.db/table_name/* | tr "\001" "," > table_name.csv



This will change the delimited from the default delimiter to comma. You can also change it to tab by replacing "," with "\t"


For even more insights, check out the blog post called “A List of Subtle Differences Between HiveQL and SQL



Feel free to leave your question or comment below!




No comments:

Post a Comment