Connecting Live Data in HDP Hive to Tableau

Introduction

To get more familiar with Hadoop and Tableau, I was asked to visualize some analytic work that a colleague had originally created in R and Excel, based on statistics relating to transportation.  I had limited experience with Tableau and no experience connecting Hadoop directly to it.  The Excel file I was given contained a few charts, and after an initial review of the data in its tables I anticipated that the process would be straightforward.  Instead, I ran into several challenges along the way, and I would like to walk through my experience in the hope that others can learn from it and decide when it is best to connect live to Hadoop data in Tableau.

Configuration

Cisco VPN Client v5.0.07.0440 for Windows
Tableau Desktop 8.0
PuTTY release 0.63
Hortonworks HDP 1.3.0 5 node cluster
Hive 0.11  

Initial Connection to Hadoop from Tableau

To connect to the cluster in Tableau, I first clicked the "Connect to Data" link on the Tableau start page.  Once the options appeared, I selected "Hortonworks Hadoop Hive" to bring up the connection dialogue:

[Screenshot: Hortonworks Hadoop Hive connection dialogue]
In the connection dialogue box, I entered the remote machine's IP address.  The default port of 10000 was pre-filled.  I then selected "HiveServer 2" for the Type and "User Name" for the Authentication.  Next, I typed the user name for my remote machine into the "Username" field and clicked the "Connect" button.  The connection loaded for a few seconds to confirm, and Step 4 then became active.  I was ready to select the database and tables that I wanted to visualize.

Note: The IP address and username for my server have been concealed in this example.

[Screenshot: completed connection settings, with server details concealed]
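
As an aside, what Tableau is doing in this step is simply opening a HiveServer2 session, and the same settings can be verified outside of Tableau with a small script.  Here is a minimal sketch in Python, assuming the PyHive package is installed; the host and username values below are placeholders, not my actual server details:

    from pyhive import hive  # pip install 'pyhive[hive]'

    # Placeholder connection details -- substitute your own cluster's values.
    conn = hive.Connection(
        host='10.0.0.1',     # the remote machine IP entered in the Tableau dialogue
        port=10000,          # HiveServer2 default port, pre-filled by Tableau
        username='hdpuser',  # matches the "User Name" authentication option
        database='default'
    )

    cursor = conn.cursor()
    cursor.execute('SHOW DATABASES')
    for (db_name,) in cursor.fetchall():
        print(db_name)

If this lists your databases, Tableau's Step 4 should succeed with the same settings.
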
I clicked the magnifying glass next to 'default' to load the databases on the cluster and located the database whose tables contained the transportation data.  In Step 5, I selected the specific table that the Excel file had originally been generated from by clicking the magnifying glass next to the search box.  I left Step 6 at its default name setting.

[Screenshot: database and table selection in the Tableau connection dialogue]
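
The database and table browsing that Tableau performs in Steps 4 and 5 corresponds to ordinary HiveQL metadata queries.  Continuing the sketch above (the database and table names here are hypothetical stand-ins for the transportation data):

    # List the tables in a database, then inspect one table's columns.
    cursor.execute('USE transportation')  # hypothetical database name
    cursor.execute('SHOW TABLES')
    print(cursor.fetchall())

    cursor.execute('DESCRIBE trips')      # hypothetical table name
    for col_name, col_type, comment in cursor.fetchall():
        print(col_name, col_type)
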
First Approach - Import All Data

At this point, Tableau asked me to decide what type of connection to use for this data set.  Knowing that I would be experimenting with the data, I initially decided to import all of it to my local machine so that Tableau would have the records and I could work with them locally.  I selected the second option, "Import all data".  Unfortunately, this turned out to be a tedious exercise because of the size of the underlying table: the first table I wanted to visualize held about fourteen million rows.


After about 30 minutes, my machine had imported only about 600,000 rows, roughly 4% of the data I would need.  This import could obviously go faster with a better internet connection or a more powerful remote machine, but in most work scenarios those aren't realistic options; it is not efficient for the business to wait for files to import, and this table was only one of many I would need for the project.  I also faced an issue with VPN connectivity: after letting Tableau import a majority of the records over a few hours, the VPN would disconnect, forcing me to restart the process all over again.
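
In hindsight, a quick row count before choosing "Import all data" would have flagged the problem immediately.  Here is a short check through the same kind of scripted connection sketched earlier (the table name is hypothetical):

    # Gauge the extract size before committing to a full import.
    cursor.execute('SELECT COUNT(*) FROM trips')
    (row_count,) = cursor.fetchone()
    print('{:,} rows'.format(row_count))  # ~14,000,000 for this table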

Second Approach - Download Data from the Remote Machine to Local Disk

At this point, I consulted a colleague and expressed my frustration with my multiple attempts to import the large data set into Tableau.  After a bit of discussion, we arrived at the idea of using a program outside of Tableau to download the file from the remote machine to my local disk.  Deciding this could be a viable option, I loaded up WinSCP Version 5.1.7.  Using the IP address from my original Tableau connection, I logged onto the remote machine and began downloading the file.  Again, I was dismayed to see an extremely slow progress bar as the program worked through the large file, though it did seem quite a bit faster than the Tableau import, so I left my laptop to let the download run.  I returned about a half hour later to check on the progress.  At this point, I began to see the clear challenges associated with working in the "Big Data" space: the file had only partially downloaded, yet I had already run out of space on my local hard drive, and this was after downloading only a portion of one table out of many.  With my limited resources, this did not seem like the ideal route.
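
One lesson from this attempt: compare the size of the remote file against your local free space before starting a long download.  A quick check of the local side, sketched in Python (the drive path is a placeholder):

    import shutil

    # How much room is actually left on the local disk?
    free_bytes = shutil.disk_usage('C:\\').free
    print('Free space: {:.1f} GB'.format(free_bytes / 1e9))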

Third Approach - Import Some Data

Understanding that importing all of the data would take too long, I decided to try Tableau's "Import some data" option.  Selecting this option brings up the Tableau data filter menu for choosing the columns (or specific values) you would like to import.

[Screenshot: Tableau data filter menu]
This approach requires the user to know which data they want to use in their visualizations.  Since it was my first time working with a live connection and with this particular data set, I wasn't sure which columns would make an impact in my visualizations, and it was difficult to narrow down the data using the filters.  I did my best, then selected the "Extract" button, and the import process began.  I ran into the same problem as in my first approach: the import moved at a snail's pace, and even after a half hour I did not have anywhere near the complete file.
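
For this kind of experimentation, another way to get a feel for the columns without a full extract is to pull a small sample through a scripted connection and profile it locally.  A sketch using pandas with a simple LIMIT (the table name is hypothetical, and pandas will warn about a plain DBAPI connection):

    import pandas as pd

    # Pull a manageable sample of the fourteen-million-row table.
    sample = pd.read_sql('SELECT * FROM trips LIMIT 100000', conn)
    print(sample.describe())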



Final Approach - Connect Live

Realizing the challenges associated with my first three approaches, and with my deadline quickly approaching, I decided to work with the data through a live connection from my local Tableau desktop to the remote cluster via the Cisco VPN.  After an initial one-minute load, Tableau indicated that it was ready to work with the data.  The columns were sorted into measures and dimensions, and I was ready to begin visualizing.  Colleagues had warned me that this method would be inefficient because Tableau would have to pass HiveQL queries to Hive and wait for the results before any visualization or filtering.  That turned out to be partially correct: on average, I faced a load time of 30 seconds to one minute for each action in Tableau that required a Hive query.  While this was time-consuming, it was nowhere near the amount of time the other approaches would have taken for multiple tables from multiple databases.
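
That 30-second-to-one-minute figure is easy to estimate for your own cluster by timing a representative aggregation, similar to the queries Tableau generates for a single view.  A sketch (the column and table names are hypothetical):

    import time

    # Time one round trip for a typical aggregation query.
    start = time.time()
    cursor.execute('SELECT travel_mode, COUNT(*) FROM trips GROUP BY travel_mode')
    results = cursor.fetchall()
    print('Query round trip: {:.1f}s'.format(time.time() - start))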


Conclusion - When is it best to connect live?


The best use case for connecting live is when you are working with large amounts of data that you can filter on request and you need a quick turnaround.  If the data set is only a few megabytes, it is far easier to import the entire set and filter within Tableau.  In my case, obtaining the large data sets by downloading from the Hadoop cluster or importing the full set into Tableau was extremely difficult and time-consuming.  Connecting live to "Big Data" through Hive from Tableau is still slow, but with limited processing resources and bandwidth, it can be the best option for visualizing extremely large data sets.

Further Reading

If you are running into some of the same speed issues I encountered, there are other options.  One would be to run Tableau 8.0 on a machine in the remote environment so that data imports run at LAN speed; this requires your Tableau license to be on the remote computer, which was not an option in my case.  More on this over at the Tableau community.  For a tutorial on connecting the Hortonworks Sandbox to Tableau, head over to the Hortonworks tutorial.
