SQL with Cascading: The NVL Function

Overview

 

With Apache Hadoop becoming more in more in demand in terms of big data handling, it is likely that many conventional database systems, such as warehouses, RDBMS and the like will forfeit their functionality over to tools and applications which run within the Hadoop sphere.  Just as likely is the fact that in the interest of bypassing Map Reduce programming, Cascading may become one of the primary tools used when performing these integrations.  If such a fact proves true,  much of the integration technique will involve recreation of the common SQL functionalities present in the operation of these conventional database systems.  With Cascading and the vast array of OOB functions and methods, many of these common functionalities may recreated quite easily. In the following example and additional blogs in the series, common SQL functions will be recreated in Cascading.

Function: NVL()   

 

Here, the NVL, or null value function will be considered.  This function is used to replace any null values in a particular column with a declared replacement, and is initialized as follows:

SELECT ...

NVL(column A, ${replacement}) AS A,
NVL(column B, 0) AS B,
NVL(column C, 'Y'),
NVL(column D, 4.01);

...

FROM table A

Cascading Implementation

 

In Cascading, imagine table A as a source tap attached to pipe A.  The respective columns from table have now become fields in the tuple.  Field A1 corresponding to column A from table A has an unknown number of NULL values which should now be changed to meaningful values consisting of the Integer entry "1".  



Custom Functions

 

A custom function can be created in order to accomplish the entry above for the NULL values in Field A1.  Such a function would resemble the following:

public class NVL extends BaseOperation implements Function {

 String nullField;
 Object replaceField;

public NVL(Fields fields, String nullField, Object replaceField) {

 super(1, fields);
 this.nullField = nullField;
 this.replaceField = replaceField;

}

public void operate(FlowProcess flowProcess, FunctionCall functionCall) {

 Object nullFieldObj = functionCall.getArguments().getObject(nullField);
 nullFieldObj = (nullFieldObj == null) ? (replaceField) : (nullFieldObj);

 functionCall.getOutputCollector().add(new Tuple(nullFieldObj));

  }
}

Execution 

 

Implementing the function above involves replacing the NULL values in Field A1 with the output from the function.  The function takes care of checking the values for the field in each row, determining where the NULL values are, and outputting a tuple with the declared replace value. Again, all NULL values are to be replaced with the value "1".  The newly created NVL function is used as follows:

Pipe A = new Pipe(table A);

A = new Each(A, new Fields(Field A1), new NVL(new Field(Field A1),"Field A1", 1), Fields.REPLACE); 

Here, again, an each pipe is used in combination with the custom NVL function to examine each row and gather the proper tuple.  The NULL value is simply replaced with the declared replace value using Cascading's REPLACE flag.


Summary   

 

This function displays the simplest case and can be easily expanded to handle multiple NVL calls over multiple fields with arrays of replace objects.  Implementation involves adding a new constructor to the function to handle each case.  Cascading does well at offering general operations that are seen often with SQL based programs, such as JOIN, GROUP and SUM, but the more specific functionalities such as NVL, DECODE, CASE may need to be recreated using processes resembling those shown above. 


 

No comments:

Post a Comment