An Intro to Cascading


What is Cascading?

Cascading is a Java framework created by Concurrent, Inc. that allows Hadoop applications to be run as a single workflow, unifying all steps of the application process.  Typically, Hadoop workflows are run as separate parts: MapReduce jobs, data analysis in R, and other transformations in scripts such as Pig.  Cascading treats the workflow as an entire application, so it can be packaged and run as a single JAR file in an enterprise environment.  An upcoming competitor is Continuuity with its AppFabric product.


Why use Cascading?

  • Java developers can extend their existing Java applications to perform distributed processing.
  • Enterprises that need to run repeatable workflows in Hadoop can implement the workflow as a single Java application in Cascading, rather than having disparate parts of the workflow implemented by separate scripts such as Pig.
  • Cascading aims to replace common analysis functions that would otherwise be done in R with Java code, so the analysis can be more easily integrated with enterprise systems.
  • Java developers who are more familiar with data flow than with MapReduce can be more productive.
  • It supports test-driven development, including patterns for data quality and sampling.

Cascading Approach
Cascading uses a 'plumbing' analogy for the data flowing through a workflow. It is a refreshing way of looking at data processing that makes sense to both developers and analysts.


Terminology

Tuple - Represents a row of data with named fields.
Tap - Represents a resource on the file system: a data source that can be read or a data sink that can be written.
Pipe - Represents a data stream.
Pipe Assembly - A chain of pipes.
Flow - Coordinates how tuples flow through pipes, in and out of the data sources and sinks.  Flows that rely on other flows are called 'cascading'; a set of connected flows forms a Cascade.
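
As a minimal illustration of the first of these concepts (the field names and values here are made up), a tuple pairs row values with named fields:

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

// Fields name the columns; a Tuple holds the corresponding row values.
Fields fields = new Fields( "word", "count" );
Tuple tuple = new Tuple( "hadoop", 42 );

// A TupleEntry pairs the two so values can be read by field name.
TupleEntry entry = new TupleEntry( fields, tuple );
int count = entry.getInteger( "count" ); // 42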



Taps, pipes, pipe assemblies, and flows are set up as a workflow in the application logic using the Cascading SDK API.  


With this workflow-oriented approach, analysts can provide input directly to the application developers who automate the workflow for their data processes.

At runtime, Cascading generates a directed acyclic graph (DAG) of the MapReduce jobs that implement the connected flow of the application.  The MapReduce jobs are scheduled and executed, with the output written to the file represented by the data sink tap.


Example to copy a file
A simple example that copies a file starts with a source (input) tap pointing to a document in HDFS. A pipe then connects the input tap to the sink (output) tap. The flow is then defined and connected. When the program is run, a MapReduce job is generated and tuples flow from the source tap to the sink tap.

// required imports (Cascading 2.x package layout)
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// input and output paths, e.g. taken from the command-line arguments
String docPath = args[ 0 ];
String wcPath = args[ 1 ];

// initialization: tell Cascading which application jar to ship to the cluster
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create the source tap: a comma-delimited file with a header row
Tap inTap = new Hfs( new TextDelimited( true, ",", "\"" ), docPath );

// create the sink tap, using the same delimited scheme
Tap outTap = new Hfs( new TextDelimited( true, ",", "\"" ), wcPath );

// specify a pipe to connect the taps
Pipe copyPipe = new Pipe( "copy" );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource( copyPipe, inTap )
    .addTailSink( copyPipe, outTap );

// run the flow
flowConnector.connect( flowDef ).complete();
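
The DAG that Cascading plans for a flow can also be inspected before it runs. As a minimal sketch building on the flowDef above (the output file name is illustrative), the connect call can be split out so the Flow object's writeDOT method can dump the plan in Graphviz DOT format:

import cascading.flow.Flow;

// connect the flow, write the planned DAG to a DOT file, then run it
Flow flow = flowConnector.connect( flowDef );
flow.writeDOT( "copy-flow.dot" ); // render with Graphviz to inspect the plan
flow.complete();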



Pipes
Pipes are used to perform the logical operations on the data in each workflow.  The following diagram shows Java code alongside the workflow it represents. This workflow is similar to what would be done in a Pig script.


[Figure: Java pipe assembly code and the corresponding workflow.  Figure credit: http://www.cascading.org]
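
As a rough sketch of this kind of pipe assembly (field names and the token-splitting regex are illustrative, in the spirit of the Cascading word-count sample), each pipe applies an operation to the tuple stream:

import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// split each incoming "line" field into one "token" tuple per word
Fields token = new Fields( "token" );
Fields line = new Fields( "line" );
Pipe docPipe = new Each( "wc", line, new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" ) );

// group the token stream and count the occurrences of each token
Pipe wcPipe = new GroupBy( docPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );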

Conclusion
If your use case involves a repeatable workflow that has to run in an enterprise environment, Cascading is worth a look as an alternative to Pig scripts on Hadoop.  In Part 2 of this introduction to Cascading, the Cascading sample TF-IDF algorithm will be explored on a small radiology dataset.
