Hadoop without Programming


Vendors are building the next generation of analysis tools for Hadoop, which attempt to eliminate the labor-intensive programming and scripting that have been associated with Hadoop data processing since its inception. This is a sign that Hadoop has been widely adopted and is now maturing as a platform. This progression, in which a new technology comes to require less specialized skill and labor to implement, has been typical of the major advances in software over the last several decades.

Now that organizations have adopted Hadoop, they are looking for ways to increase productivity and the return on their investment. That desire is shaping the next generation of BI tools for Hadoop, which seek to eliminate programming MapReduce in Java or writing Pig scripts to process information on Hadoop. Established companies in the ETL market are now scrambling to participate in this transition or risk being left behind without a Hadoop solution, much as Microsoft was left behind before it realized the internet was something it should embrace.

The driving force behind this transition is the end users. Once they have adopted Hadoop, they do not want to depend on developers, because of resource shortages, costs, and the turnaround time needed to complete their tasks. Vendors are now promising to let them transform and analyze their big data without coding MapReduce or Pig scripts. Many vendors claim success; Platfora and Datameer take two similar but distinct approaches. Both are integrated platforms that support data transformation, analysis, and visualization on Hadoop.


Platfora Approach

Platfora wants to reduce the latency of constructing visualizations, so its platform requires extra hardware to cache the results of transformations for use by the visualization tool. In theory, this means the user should have to run fewer MapReduce steps, since the dataset is cached in memory.

Datasets

To get started with Platfora, you first have to create a Dataset. This means that Platfora generates metadata about your existing HDFS or Hive data and stores it in its PostgreSQL database. Platfora also has a file upload capability, limited to 50MB, for small files outside of Hadoop. Datasets are configured through the Platfora web app.


Lenses

Lenses represent a view of the data from the datasets that users build visualizations on. They are configured through the web app. Minor transformations such as counting and summing can be set up on the lens. Lenses then go through a build process, a MapReduce job that distributes the data among the Platfora worker nodes to be cached in memory. Everything that the user wants in the visualization should be made part of the lens.

Visualizations

Visualizations are then constructed on the cached data from the lens build.
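
The dataset, lens, and visualization flow described above can be illustrated with a minimal Python sketch. This is not Platfora's API or implementation; the names `RAW_ROWS`, `build_lens`, and `query_lens` are hypothetical, and a plain in-memory dictionary stands in for the worker-node cache. It only shows the general idea of paying for one expensive pass over the raw data and then answering visualization queries from the cached aggregates.

```python
from collections import defaultdict

# Hypothetical raw records, standing in for rows stored in HDFS or Hive.
RAW_ROWS = [
    {"region": "east", "product": "a", "amount": 10.0},
    {"region": "east", "product": "b", "amount": 5.0},
    {"region": "west", "product": "a", "amount": 7.5},
    {"region": "east", "product": "a", "amount": 2.5},
]


def build_lens(rows, dimensions, measure):
    """Expensive step, run once: scan every row and cache a count and a sum
    for each combination of dimension values (the analogue of a lens build)."""
    cache = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cell = cache[key]
        cell["count"] += 1
        cell["sum"] += row[measure]
    return dict(cache)


def query_lens(lens, dimensions, group_by):
    """Cheap step: roll up the cached cells by one dimension.
    The raw rows are never rescanned."""
    idx = dimensions.index(group_by)
    totals = defaultdict(float)
    for key, cell in lens.items():
        totals[key[idx]] += cell["sum"]
    return dict(totals)


DIMENSIONS = ("region", "product")
lens = build_lens(RAW_ROWS, DIMENSIONS, "amount")

# Two different "visualizations" served from the same cached lens.
by_region = query_lens(lens, DIMENSIONS, "region")
by_product = query_lens(lens, DIMENSIONS, "product")
```

The sketch also shows why everything needed by a visualization has to be part of the lens: a field that was not included when the cache was built cannot be queried without rebuilding it.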



Platfora Architecture


[Figure: Platfora architecture diagram. Courtesy of Platfora.]

Platfora & HDFS


A snapshot of Platfora's HDFS directory shows the JAR directories and the directory for uploaded files. Platfora stores all of its necessary HDFS data in one configurable directory.


Datameer Approach


Instead of trying to solve Hadoop's high-latency problem, Datameer provides a simple but elegant way to manage it. Datameer provides a workbook that shows a sample of the data and lets users modify their transformations while previewing the impact of their changes. Once the transformation is perfected, the high-latency MapReduce job is run or scheduled.

Before Datameer can access data, the data must first be processed so that Datameer can store the metadata in its MySQL database and store sample records from the data source in HDFS for the workbook preview. This applies even to data already in Hadoop. The recommended process sets up Datalinks to the existing data, with samples copied to HDFS. The workbooks can then be used to create transformations using a variety of built-in functions.
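
The preview-then-run workflow described above can be sketched in a few lines of Python. This is an illustration only, not Datameer's API: `FULL_DATA`, `take_sample`, `apply_steps`, and the two transformation steps are hypothetical names, and an in-memory list stands in for data in HDFS. The point is that the same chain of steps is tuned cheaply against a small sample and only then applied to the full data.

```python
import random

# Hypothetical full dataset, standing in for a large data source in Hadoop.
FULL_DATA = [
    {"order_id": i, "amount_cents": (i % 7 - 1) * 250} for i in range(1000)
]


def take_sample(records, size, seed=0):
    """Copy a small random sample for the preview (seeded so it is repeatable)."""
    if len(records) <= size:
        return list(records)
    return random.Random(seed).sample(records, size)


def drop_non_positive(rows):
    """Transformation step: keep only rows with a positive amount."""
    return [r for r in rows if r["amount_cents"] > 0]


def add_dollars(rows):
    """Transformation step: derive a dollar column from the cents column."""
    return [{**r, "amount_dollars": r["amount_cents"] / 100} for r in rows]


def apply_steps(records, steps):
    """Run each transformation step in order over the given records."""
    for step in steps:
        records = step(records)
    return records


steps = [drop_non_positive, add_dollars]

# Fast feedback loop: preview the effect of the steps on 20 sample rows.
preview = apply_steps(take_sample(FULL_DATA, 20), steps)

# Once the steps look right, run them once over the full data
# (the analogue of the high-latency job that is run or scheduled).
full_result = apply_steps(FULL_DATA, steps)
```

One design consequence is that the preview is only as representative as the sample: a rare condition in the full data may never appear in the sampled rows.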

Visualizations utilize the results of the workbooks.  The visualizations from our evaluation could not be shown.

Datameer Architecture



Datameer & HDFS

Results of the transformations are stored in workbooks in HDFS, and Datameer can also upload files to HDFS.


There are trade-offs with any tool, particularly with respect to your specific needs. To find out how well vendors fulfill their promises to provide Hadoop without programming, engage in a POC or evaluation of each vendor's software, or contact an independent Hadoop solution provider such as Spry, Inc.
