How to access HDFS files from a Cascading flow


One of the more difficult things to do in Cascading is to access small pieces of side data that you might not want to bring in through a join.  Cascading does have a mechanism for this, but it is limited.

Query Planning vs. Runtime

One thing beginners with Cascading have to understand is that most of the Java code they write only builds the query plan; it does not run while the data is being processed.  This limits where in your code you can actually access HDFS.  Access goes through the FlowProcess object, and Cascading hands you a FlowProcess only in certain places, because so much of the code exists just to generate the query plan.

Mechanism

The mechanism in Cascading for accessing an HDFS file is the openForRead method on a Tap.  Any local file or HDFS file can be a Tap, but a local file must be on the node where the flow is running.  Notice that the method takes a FlowProcess object.

public TupleEntryIterator openForRead(FlowProcess<Config> flowProcess)
                               throws IOException

http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/Tap.html#openForRead(cascading.flow.FlowProcess)

A similar method, openTapForRead, is available on the flow itself.

http://docs.cascading.org/cascading/2.5/javadoc/cascading/flow/BaseFlow.html#openTapForRead(cascading.tap.Tap)

Once the method is called, you have a TupleEntryIterator over the tap, from which you can retrieve the tuple data.
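
As a rough illustration of the flow-level variant, here is a minimal sketch that drains a tap into a list using openTapForRead on a Flow.  It assumes Cascading 2.5 is on the classpath; the class name TapDump and the method readAll are made up for this example and are not part of Cascading.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import cascading.flow.Flow;
import cascading.tap.Tap;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntryIterator;

// Hypothetical helper class, not part of Cascading.
public final class TapDump {

    // Reads every tuple from the given tap through the flow and returns copies.
    // Intended for small files only, since everything is held in memory.
    @SuppressWarnings("rawtypes")
    public static List<Tuple> readAll(Flow flow, Tap tap) throws IOException {
        List<Tuple> rows = new ArrayList<Tuple>();
        TupleEntryIterator it = flow.openTapForRead(tap);
        try {
            while (it.hasNext()) {
                // Copy each tuple, because the iterator may reuse its entry object.
                rows.add(it.next().getTupleCopy());
            }
        } finally {
            it.close(); // always release the underlying file handle
        }
        return rows;
    }
}
```

Because this goes through the Flow, it is code that runs on the client side, for example to inspect a small file before or after the flow runs, rather than inside an operation.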

The FlowProcess object roughly corresponds to a session of a flow while it is running.  It is only available to programmers in some of the Cascading objects.  These include operations such as Buffers, Aggregators, Functions, and Filters.  Assemblies do not have this option.  The pattern is to read the HDFS file in the operation's prepare method and store the data in the context.  The context should then be released in the cleanup method.
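
The prepare and cleanup pattern described above might look like the following sketch.  It assumes Cascading 2.5 on the Hadoop platform and a small tab-delimited lookup file with two columns.  The class name LookupFunction, the field names "key" and "label", and the file layout are all assumptions made for this example.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.operation.OperationCall;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascading.tuple.TupleEntryIterator;

// Hypothetical Function that maps its single argument to a label taken
// from a small HDFS lookup file.  The context is the in-memory lookup map.
@SuppressWarnings({"rawtypes", "unchecked"})
public class LookupFunction extends BaseOperation<Map<String, String>>
        implements Function<Map<String, String>> {

    private final String lookupPath; // assumed: HDFS path supplied by the job driver

    public LookupFunction(String lookupPath) {
        super(1, new Fields("label")); // one argument in, one "label" field out
        this.lookupPath = lookupPath;
    }

    @Override
    public void prepare(FlowProcess flowProcess, OperationCall<Map<String, String>> call) {
        // Runs at runtime, so a FlowProcess is available for reading the tap.
        Tap tap = new Hfs(new TextDelimited(new Fields("key", "label"), "\t"), lookupPath);
        Map<String, String> lookup = new HashMap<String, String>();
        try {
            TupleEntryIterator it = tap.openForRead(flowProcess);
            try {
                while (it.hasNext()) {
                    TupleEntry entry = it.next();
                    lookup.put(entry.getString("key"), entry.getString("label"));
                }
            } finally {
                it.close();
            }
        } catch (IOException e) {
            throw new RuntimeException("could not read lookup file " + lookupPath, e);
        }
        call.setContext(lookup);
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Map<String, String>> call) {
        String key = call.getArguments().getString(0);
        // Emits null as the label when the key is not in the lookup file.
        call.getOutputCollector().add(new Tuple(call.getContext().get(key)));
    }

    @Override
    public void cleanup(FlowProcess flowProcess, OperationCall<Map<String, String>> call) {
        call.setContext(null); // drop the map so it can be garbage collected
    }
}
```

Only the path string is stored as a field of the operation.  The map lives in the context, so it is built at runtime in prepare instead of being created while the query plan is assembled.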
