Lingual and Cascading Integration

Background

As mentioned in the previous post, Lingual 1.0 was released at the end of 2013 alongside Cascading 2.2. It generates Cascading code for SQL queries, and Lingual flows can be integrated into a Cascading Cascade alongside other Cascading and Lingual flows. There are some caveats in doing this, since Lingual is a 1.0 release, and some additional tasks you might not think about coming from Cascading. For our project, the developer could choose on a per-module basis whether to implement a workflow in Lingual. In some cases, for moderately complex but still simple enough queries, the processing was first attempted using Lingual, with Cascading as a fallback if Lingual could not generate a query plan. We had to make a few adjustments and discovered some limitations in Lingual's first release. Teams will have to decide whether the benefits of using Lingual outweigh the additional caveats around schemas and an early release. Lingual is a promising step in the evolution of tools for data processing on Hadoop.
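
To make the integration concrete, here is a minimal sketch of wiring a Lingual flow into a Cascade alongside an ordinary Cascading flow, based on Lingual's SQLPlanner. The tap variables, table name, and otherCascadingFlow are placeholders for illustration.

// Build a Lingual flow from a SQL statement using Lingual's SQLPlanner.
// salesTap, resultsTap, and otherCascadingFlow are placeholders.
String statement = "select \"field1\", \"field2\" from \"example\".\"sales\"";

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sql flow" )
  .addSource( "example.sales", salesTap )  // register the tap under its table name
  .addSink( "results", resultsTap )
  .addAssemblyPlanner( new SQLPlanner().setSql( statement ) );

Flow lingualFlow = new HadoopFlowConnector().connect( flowDef );

// A Lingual flow participates in a Cascade like any other Cascading flow.
Cascade cascade = new CascadeConnector().connect( otherCascadingFlow, lingualFlow );
cascade.complete();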

Schemas
Since it is SQL processing, Lingual needs to know the types (Java classes) for all of the fields the taps input and output. This is more type restrictive than Cascading. In your Cascading application, if you don't specify the field names, Cascading will parse them from a header like the one below when you use Fields.ALL.

field1, field2, field3

So in Cascading you can get away with reading the field names from the header like this, since you don't need type information.

Tap tap = new Hfs( new TextDelimited( Fields.ALL, true, DataFile.DELIMITER ), tapPath, sinkMode );


One way to supply this information to Lingual automatically is to put the types in the header, separated from the field names by a colon.

field1:int, field2:int, field3:String

Then you can use Lingual's SQLTypeTextDelimited scheme to automatically parse the field names and types from the header.

Tap tap = new Hfs( new SQLTypeTextDelimited( ",", "\"" ), tapPath, SinkMode.KEEP );



By default, Cascading flows will not output these typed headers, so Cascading taps could not serve as inputs or outputs to the Lingual flows that required the types. Since we had the majority of the project written in Cascading, the solution we came up with was to specify the fields and types programmatically for the taps Lingual needed, instead of using the headers.

Tap tap = new Hfs( new TextDelimited( new Fields( new String[] { "field1", "field2", "field3" }, new Class[] { int.class, String.class, String.class } ), true, DataFile.DELIMITER ), tapPath, sinkMode );

This approach worked well because we already had most of the fields defined in interfaces, so only the class types needed to be added. A different approach might be to have Cascading write out the type information with a custom scheme. However, the processing was to be run on customer data extracts, so the types would have to be added to the input headers by some upstream process, at least for the sources Lingual needed.
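
As a rough sketch of that custom scheme idea, the typed header could be rendered from a Fields instance that carries its types. The typedHeader method below is a hypothetical helper, not part of Cascading, and it assumes every field has a Class type attached.

// Hypothetical helper: render a Fields instance as a Lingual-style typed
// header, e.g. "field1:int,field2:string", assuming each field carries a Class type.
static String typedHeader( Fields fields )
  {
  StringBuilder header = new StringBuilder();

  for( int i = 0; i < fields.size(); i++ )
    {
    if( i > 0 )
      header.append( "," );

    Class type = (Class) fields.getType( i ); // field types were added to Fields in Cascading 2.2

    header.append( fields.get( i ) ).append( ":" ).append( type.getSimpleName().toLowerCase() );
    }

  return header.toString();
  }

A custom scheme could write this string as the first line of the sink so that SQLTypeTextDelimited can read the names and types back in.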

Release Limitations
  1. Requires a prefix on the tables in your SQL (this is fixed in the next version past what currently ships with Cascading); see the example query after this list.
  2. A good number, but not all, types of queries and subqueries are supported.
  3. Lingual needs to be in a separate flow from other Cascading code.
  4. You need to escape all field and table names to maintain a lower case schema if you are using one. Lingual defaults to Oracle's behavior, where names are made upper case unless they are escaped. (This has been fixed in Optiq and needs to percolate up to Cascading.)
  5. No UDF support.
  6. Per Hadoop limitations, SQL INSERT is not able to directly insert into a Cascading output pipe with existing data. We actually found that the INSERT part of the query seemed inconsequential at this point in time; Cascading writes the output of the SELECT statement to the output tap, as the example after this list shows.
  7. Only one SQL statement can be implemented per flow.
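
To illustrate items 1, 4, and 6, here is the shape of a statement we would hand to Lingual, with placeholder schema and table names:

// Placeholder schema/table names. The "example" prefix is required (item 1),
// escaping keeps the lower case names intact (item 4), and the INSERT target
// is effectively ignored: the SELECT output goes to the flow's sink tap (item 6).
String statement = "INSERT INTO \"example\".\"results\" "
  + "SELECT \"field1\", \"field2\" FROM \"example\".\"sales\"";
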
Tips
  1. Do not use -ea in your JVM settings. Lingual has some benign assertions that, when enabled, will cause the tests to stop running.
  2. You may want to pull down the Lingual sources into an IDE like IntelliJ. We found that during query planning you can get mismatched field types, and the field names are not output to the console. Setting breakpoints in Optiq or Cascading was required for quicker debugging.
