A List of Subtle Differences Between HiveQL and SQL

When a developer first transitions from working in SQL to working in HiveQL, the similarities between the two languages is often extremely helpful in ensuring a smooth transition. However, once the first few starter queries are written, a small amount of frustration often appears due to the more subtle differences in syntax and behavior. This post is an attempt to list a few of the most common differences that are helpful to know up front.

Join Syntax

In SQL, usually the default join is the "inner" join, in which the result includes entries where both sides of the condition are not null. However in HiveQL, the default is the "equi" join, in which the only entries that are returned are the ones where the condition is true and returns no null values. This distinction often causes problems when the user does not explicitly specify the type of join that Hive should use. One thing to note is that when converting existing SQL queries into HiveQL, the syntax of "LEFT JOIN" or "RIGHT JOIN" does not work. Hive requires that the join be specified as "LEFT OUTER JOIN" or "RIGHT OUTER JOIN".

Largest Table Last
Since queries in HiveQL are being converted into MapReduce jobs, for some queries it is important to keep that in mind. In order to improve run time, Hive will attempt to perform a map-side join where it loads the first table into memory and reads the second table in as normal input to the map function. When writing queries, try to facilitate this as much as possible and order the tables used in the join so that the largest table is last.


Group By

Like SQL, HiveQL provides support for the "group by" command, in which multiple rows in a table can be collected into groups with the same values in specific columns. Once the groups are formed, the columns being used in the group by can be accessed as usual in the select statement. The columns that were not used in the group by should be accessed through aggregation functions, like average or sum. If a column that is not part of the group by is accessed as a single element in the select statement, this will cause Hive to throw an error. Thankfully, the error displayed by Hive will alert the user to which column is being used incorrectly. The issue can be resolved by ensuring that all columns in the select statement are either part of the group by and accessed as a single element, or not part of the group by and accessed through aggregation.

Collect Set

One additional thing to consider when using group by is the existence of a special aggregation function: collect_set. This allows a column not used in the group by to be aggregated into a set. The values in the set are accessible using normal array-like syntax and can be used the same way as any column in the original table. The elements in the array will not be sorted, so any ordering will need to happen through the use of a user defined function. Additionally, Hive will not evaluate an expression to calculate an index.

Limit 

One effective way to understand the results of a query is to run it and use the "limit" function to view a small subset of the results. It is important to keep in mind that unless the "sort by" command is used as part of the query, the result will not be ordered in any way.

Where can I go for more information?

Apache Hive Documentation
Or feel free to leave your question or comment below!

No comments:

Post a Comment