Avoiding Dependency Duplication: The Oozie ShareLib

What Is Oozie?

Oozie is a workflow engine that allows a user to define and execute a series of actions in an order specified by a directed acyclic graph. These actions can use a range of technologies: in addition to plain Java code, Oozie supports Hive, Pig, Sqoop, and MapReduce actions.


What Is the ShareLib?

The ShareLib is a collection of Hadoop ecosystem jar files that Oozie can place on any workflow's classpath, so that actions depending on these common jar files can access them automatically.

For example, for a Pig action to succeed without the ShareLib, the user must determine the correct version of the Pig jar file and include it in the workflow's lib folder in HDFS. If the user creates multiple workflows that all use Pig actions, that Pig jar file must either be duplicated into each workflow's lib folder or be explicitly added to the classpath inside each new workflow. Since workflows typically use several Hadoop packages (for example, Sqoop, Hive, and Pig in the same workflow), this can cause a substantial amount of duplication. If the version of any of these packages changes, the jars in every lib folder must be updated to match.
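To make the duplication concrete, here is a minimal dry-run sketch of the per-workflow copying described above. The workflow paths and the jar name are hypothetical, and the script only prints the `hadoop fs -put` commands instead of running them.

```shell
# Hypothetical jar name and workflow directories, for illustration only.
PIG_JAR=pig.jar
WORKFLOWS="/user/alice/wf-ingest /user/alice/wf-report /user/alice/wf-export"

# Dry run: print the upload each workflow would need instead of executing it.
count=0
for wf in $WORKFLOWS; do
    echo "hadoop fs -put $PIG_JAR $wf/lib/"
    count=$((count + 1))
done

# Every one of these copies has to be replaced when the Pig version changes.
echo "$count copies of $PIG_JAR to keep in sync"
```

The same loop would have to be repeated for each additional package (Hive, Sqoop, and so on) that the workflows use.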

With the ShareLib, the dependencies for the supported Oozie actions are stored in one central location that Oozie can be instructed to place on the classpath. This keeps the dependencies matched to the cluster the workflow is running on and makes them available to any action that needs to reference them.


How Do I Make Use of the ShareLib?

First, install the ShareLib files into HDFS as the oozie user using the following steps:

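# Unpack the ShareLib tarball in a temporary directory
$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz
# Upload the extracted share directory to HDFS as the oozie user
$ sudo -u oozie hadoop fs -put share /user/oozie/share
# Clean up the temporary directory
$ rm -rf /tmp/ooziesharelib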
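After the upload, it can be worth checking that the files landed where expected. The sketch below assumes the tarball unpacks to a `share/lib` directory with one subdirectory per action type; that layout is an assumption and may differ between Oozie versions.

```shell
# Assumption: the tarball unpacks to share/lib/<action>/, so the upload
# should leave the ShareLib directories under this HDFS path.
SHARELIB_DIR=/user/oozie/share/lib

if command -v hadoop >/dev/null 2>&1; then
    # List the uploaded ShareLib directories
    hadoop fs -ls "$SHARELIB_DIR"
else
    # Not on a cluster node: nothing to check from here
    echo "hadoop not found on PATH; run this on a cluster node" >&2
fi
```

If the listing is empty or the path does not exist, the `-put` step above most likely did not complete.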


Then, for each workflow that needs to access any of these jar files, add the following line to the workflow's job.properties file:

oozie.use.system.libpath=true

Now any actions in the workflow will reference the ShareLib, and Oozie will automatically add any necessary Hadoop ecosystem dependencies to the classpath.
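For context, here is a rough sketch of a complete job.properties file with the ShareLib switch included. The host names, ports, and application path are hypothetical placeholders; only the `oozie.use.system.libpath` line comes from the steps above.

```shell
# Write a minimal job.properties for a workflow that uses the ShareLib.
# Host names, ports, and the application path are hypothetical placeholders.
# The quoted 'EOF' keeps the shell from expanding the ${...} references,
# which are meant to be resolved by Oozie, not by the shell.
cat > job.properties <<'EOF'
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/my-workflow
# Tell Oozie to add the ShareLib jars to each action's classpath
oozie.use.system.libpath=true
EOF

# Confirm the ShareLib switch is present before submitting
grep -n 'oozie.use.system.libpath' job.properties
```

The workflow could then be submitted with the Oozie command-line client, for example `oozie job -oozie http://localhost:11000/oozie -config job.properties -run`, where the server URL is an assumption about your setup.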

Where Can I Go For More Information?

Feel free to leave your question or comment below!
