Compression with Hive

Hive has compression settings that can reduce the amount of disk space its queries use.  This is especially helpful on development clusters used for ad-hoc Hive processing, where each query can create a large result set.

Intermediate Compression
If a Hive query requires more than one MapReduce job, the contents of the intermediate files passed between jobs can be compressed with the following setting in the hive-site.xml file.


<property>
 <name>hive.exec.compress.intermediate</name>
 <value>true</value>
</property>


Once compression is enabled, Hive will use whichever compression codec is configured.   Hadoop has a default codec, but the codec can be specified in mapred-site.xml, in hive-site.xml, or per Hive session.  The Snappy codec is often used for intermediate data because it favors low CPU usage over compression ratio.


<property>
 <name>mapred.map.output.compression.codec</name>
 <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>


Intermediate compression only saves disk space for queries that require multiple MapReduce jobs.  For further disk space savings, the actual Hive output files can be compressed.


Hive Output
When the hive.exec.compress.output property is set to true, Hive will use the codec configured by the mapred.output.compression.codec property to compress the files it stores in HDFS.  These properties can be set in hive-site.xml or in the Hive session via the Hive command line interface.  Once the Hive session ends, any property changes made during the session are no longer in effect.


<property>
 <name>hive.exec.compress.output</name>
 <value>true</value>
</property>


The bzip2 codec compresses data into splittable blocks, so the compressed output can still be used efficiently as input for subsequent MapReduce jobs.


<property>
 <name>mapred.output.compression.codec</name>
 <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>


There is a choice to leave compression on or off by default; users can always enable or disable it in the Hive session for each query.  If compression is enabled, there will be an extra step to extract readable data from HDFS.

Remember, properties can be set at the Hive command line as follows for quick testing:

hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

For one of our projects we were working with CSV files, so we created a table from the CSV files first with compression off and then with compression on, as sketched below.
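
A rough sketch of how the two tables could be created (csv_source is a stand-in for the table over the raw CSV files; the table names match the listings below):

hive> -- compression off (the default)
hive> CREATE TABLE intermed_comp_off AS SELECT * FROM csv_source;
hive> -- with hive.exec.compress.output=true and the bzip2 codec set as above
hive> CREATE TABLE bzip2_comp_on AS SELECT * FROM csv_source;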


[spry@localhost compress-test]$ hadoop fs -ls /user/hive/warehouse/intermed_comp_off
Found 2 items
-rw-r--r--   3 spry supergroup  267628173 2013-09-10 16:37 /user/hive/warehouse/intermed_comp_off/000000_0
-rw-r--r--   3 spry supergroup   38765577 2013-09-10 16:37 /user/hive/warehouse/intermed_comp_off/000001_0


After creating the table called bzip2_comp_on, the files in HDFS will have the .bz2 extension (with the default codec, the extension would be .deflate).  Hive queries can decode the compressed data, and this is transparent to the user running the queries.


[spry@localhost compress-test]$ hadoop fs -ls /user/hive/warehouse/bzip2_comp_on
Found 2 items
-rw-r--r--   3 spry supergroup   26178373 2013-09-11 16:27 /user/hive/warehouse/bzip2_comp_on/000000_0.bz2
-rw-r--r--   3 spry supergroup    3563411 2013-09-11 16:27 /user/hive/warehouse/bzip2_comp_on/000001_0.bz2
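
For example, querying the compressed table requires no special syntax; Hive handles the decompression:

hive> SELECT COUNT(*) FROM bzip2_comp_on;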


In this case, compression reduced the data size by about 90% (from roughly 306 MB to 30 MB).  This can tremendously reduce storage usage in a cluster that is utilized for ad-hoc analysis.


Performance

One would expect compression to slow down MapReduce processing.  In our test, loading the test data took more time, and the query on the compressed data took 53.95 seconds versus 32.142 seconds on the uncompressed data.  On the other hand, the reduced data size means much less disk IO, so jobs that are heavy on disk IO may see little impact on processing time.  CPU-intensive jobs running with limited CPU resources are more likely to be slowed down.  Developers and administrators will have to monitor the performance implications of the various compression options.   There are several codecs that can be used, each with its own advantages and drawbacks:

Codec    Class
DEFLATE  org.apache.hadoop.io.compress.DefaultCodec
gzip     org.apache.hadoop.io.compress.GzipCodec
bzip2    org.apache.hadoop.io.compress.BZip2Codec
LZO      com.hadoop.compression.lzo.LzopCodec
LZ4      org.apache.hadoop.io.compress.Lz4Codec
Snappy   org.apache.hadoop.io.compress.SnappyCodec
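
Note that codecs that do not ship with Hadoop, such as LZO, generally have to be registered (and their native libraries installed) before they can be used.  A sketch of such a registration in core-site.xml via the io.compression.codecs property:

<property>
 <name>io.compression.codecs</name>
 <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzopCodec</value>
</property>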

A much more extensive study of the impact of compression, not only on disk space savings but also on performance, was done by another blogger and reported here.


Extracting Compressed Data

To extract data from HDFS that has been compressed with the BZip2 codec, you can use the -text option of the hadoop fs command, which decompresses the data as it streams it out:


hadoop fs -text /user/hive/warehouse/final_comp_on/* > data.txt
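
Alternatively, individual files can be copied out of HDFS and decompressed locally with the standard bzip2 tool (the file name below is illustrative):

hadoop fs -get /user/hive/warehouse/final_comp_on/000000_0.bz2 .
bzip2 -d 000000_0.bz2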


Conclusion

With some administration and awareness by users and developers, Hive cluster disk space can be utilized much more efficiently with a manageable impact on cluster performance.  Administrators can also set the MapReduce compression options at the Hadoop level rather than the Hive level to enforce compression policies for other MapReduce jobs, as sketched below.  Developers can also specify compression programmatically in their job code if desired.
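
For example, an administrator could turn on output compression for all MapReduce jobs in mapred-site.xml (a sketch using the same codec as above):

<property>
 <name>mapred.output.compress</name>
 <value>true</value>
</property>
<property>
 <name>mapred.output.compression.codec</name>
 <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>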
