Four Useful "Tricks" for working with Hive
1. Display database name and column headers
After start Hive command line interface (CLI), type
• hive> set hive.cli.print.current.db=true;
• hive> set hive.cli.print.header=true;
The first command line tell hive to display current database name in Hive CLI
The second command line tell hive to print column headers in results
2. Replicate IN/EXIST in SQL by using LEFT SEMI JOIN
SQL allows you to write IN/EXIST subquery syntax, such as
SELECT a.key, a.value
WHERE a.key in
Since Hive currently does not support IN/EXIST subquery, we can rewrite with LEFT SEMI JOIN in Hive. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses.
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
3. Avoid NULLs in inner join (default join)
Assume we have two tables (impressions and creative) and we want to join them on id.
SELECT * FROM
FROM impression I JOIN creative C on (I.creativeid = C.id)
The joining procedure in Hive is demonstrated in the following picture
Records with the NULL key all end up at the same reducer. This will skew the load and break parallelism. Null keys will never be used in INNER JOIN, so it is pointless for them to get to the reducer. Reducing is slow. Try to avoid reducing phase if possible. We should adjust our query as
SELECT * FROM
FROM impression I JOIN creative C on (I.creativeid = C.id
AND I.creativeid IS NOT NULL
AND C.id IS NOT NULL)
4. Change the delimiter when exporting hive table
When exporting hive table, hive by default sets the delimiters as “^A” (ctrl-A '\001'), which is uncommon for other software.
Here is an argument that we can use to change the delimiter in the exporting table
• hadoop fs -cat /user/hive/warehouse/database_name.db/table_name/* | tr "\001" "," > table_name.csv
This will change the delimited from the default delimiter to comma. You can also change it to tab by replacing "," with "\t"
For even more insights, check out the blog post called “A List of Subtle Differences Between HiveQL and SQL”
Feel free to leave your question or comment below!