Cascading: Break Point Debugging

Overview

One of the challenges of using Cascading to build large applications is debugging problems that arise somewhere between reading the source taps and writing the sinks.  Since tangible data isn't really available until the flow completes, pinpointing which part of a faulty flow definition is failing can be a frustrating experience.  Enter Cascading's Debug "pipe" and the "break point" debugging technique.  This approach helps in stepping through a flow to isolate problems that show up in the final writeout.

Cascading's Debugger Pipe

The Debug operation is a debugging tool in Cascading's base libraries that prints tuples to stdout or stderr.  It is a Filter that never removes a tuple, so the stream passes through it unchanged.  A developer can see what is being streamed along a pipe without having to open and view an output file at the end of a process.  The Debug "pipe" is called that here because it is used in an assembly resembling the following:

Pipe A = ${some_assembly};

// print each tuple, along with its field names, as it passes through A
A = new Each(A, new Debug(true));

...

.addTail(A)
.addSink(A, sinkA);
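
The pass-through behavior of such a debug step can be illustrated outside Cascading with a minimal plain-Java sketch. The class DebugPassThrough and its debug method are hypothetical names invented for this illustration and are not part of the Cascading API. The sketch only assumes that a debug step prints each record and hands the stream on untouched.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only: DebugPassThrough and debug(...) are hypothetical
// names, not Cascading classes or methods.
final class DebugPassThrough {

    // Prints every record to the given stream and returns the input untouched,
    // mirroring how a debug step observes tuples without altering the pipe.
    static <T> List<T> debug(String prefix, List<T> records, PrintStream out) {
        for (T record : records) {
            out.print(prefix + ": " + record + "\n");
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("alpha", "beta");
        List<String> after = debug("A", a, System.out);
        // The very same list comes back: the debug step changed nothing.
        System.out.println(after == a);
    }
}
```

Because the step returns exactly what it received, it can be dropped into a chain of operations, or taken out again, without affecting anything downstream.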

The Break Point Technique 

The term "break point" is used here because the debugging assembly above can be placed at various points in a flow, where the data stream is printed to the console so that the form of the data can be analyzed at that particular stage of the flow.  More often than not, data streams undergo several transformations during a flow, and being able to drive a wedge in between these steps aids tremendously in debugging.  The example below uses two break points:



Tap debugSink1;
Tap debugSink2; 

Pipe A = ${some_assembly};
Pipe B = ... 
Pipe D = ...
...

Pipe C = new CoGroup(A, new Fields(a), B, new Fields(b), new InnerJoin());

C = new Each(C, Fields.ALL, new SomeFunction());

// debug

C = new Each(C, new Debug(true)); 

Pipe E = new HashJoin(D, new Fields(d), C, new Fields(c));

E = new Each(E, Fields.ALL, new SomeOtherFunction());

// debug

E = new Each(E, new Debug(true));

...

${any_other_group_of_operations}

return FlowDef.flowDef()
...
.addTail(C)
.addSink(C, debugSink1)
.addTail(E)
.addSink(E, debugSink2);


Here, two points in the flow are isolated after their respective operations, so the output tuples can be reviewed for any unexpected behavior that may be disrupting the results expected at that point.  Specifically, the interest is in viewing the stream after the first and the second function calls.  If necessary, the debug pipes can be moved to other points in the flow to analyze them, as long as each one stays attached to its associated stream.
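
The reasoning behind reviewing two break points can be sketched in plain Java, independent of Cascading. Every name below (BreakPointSketch, each, breakPoint, run, firstDeviation) is hypothetical and chosen only for illustration, and the second transformation is deliberately buggy. The sketch assumes the developer knows what the stream should look like at each break point and wants to find the first place where it deviates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy only: every name here is hypothetical, none is a Cascading API.
final class BreakPointSketch {

    // Stand-in for a step that applies a function to every record.
    static <T, R> List<R> each(List<T> in, Function<T, R> fn) {
        return in.stream().map(fn).collect(Collectors.toList());
    }

    // Break point: keep a copy of the stream under a label, pass the stream on unchanged.
    static <T> List<T> breakPoint(String label, List<T> stream, Map<String, List<?>> snapshots) {
        snapshots.put(label, new ArrayList<T>(stream));
        return stream;
    }

    // Two transformations with a break point after each one.
    static Map<String, List<?>> run(List<Integer> input) {
        Map<String, List<?>> snapshots = new LinkedHashMap<>();
        List<Integer> c = each(input, x -> x * 10); // first function
        c = breakPoint("C", c, snapshots);
        List<Integer> e = each(c, x -> x - 1);      // second function, buggy: meant to be x + 1
        breakPoint("E", e, snapshots);
        return snapshots;
    }

    // Label of the first break point whose snapshot differs from the expectation, or null.
    static String firstDeviation(Map<String, List<?>> snapshots, Map<String, List<?>> expected) {
        for (Map.Entry<String, List<?>> entry : snapshots.entrySet()) {
            if (!entry.getValue().equals(expected.get(entry.getKey()))) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<?>> expected = new LinkedHashMap<>();
        expected.put("C", Arrays.asList(10, 20, 30));
        expected.put("E", Arrays.asList(11, 21, 31));
        Map<String, List<?>> seen = run(Arrays.asList(1, 2, 3));
        System.out.println(seen);
        System.out.println("first deviation at: " + firstDeviation(seen, expected));
    }
}
```

In this sketch the snapshot at C matches the expectation and the one at E does not, which narrows the fault to the operation between the two break points. That is the same conclusion a developer would draw from reading the two debug outputs of the flow.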

Summary

The method described here is one of several ways to approach debugging in Cascading.  Different approaches suit different situations.  Checkpointing and failure traps are two other common methods that can be extended to provide a sufficient debugging platform.  Cascading is unique in its dynamics, so its debugging and testing techniques more often than not require a unique or somewhat unconventional approach.

   
