An Intro to Cascading - Part 2 (tf-idf implementation)

Overview
In part 2 of this introduction to Cascading, this article explores the sample tf-idf algorithm implemented in Cascading, with some alterations so it can be applied to a real-world data source.

Background

A hospital was interested in improving patient care by learning about follow-ups to radiology studies over the last several years. The studies were stored in a separate system, so a dump of the information was exported to a CSV file. The most valuable information about required follow-ups was the doctor's impression of the study, contained in a free-form text field. Some analysis had been done on this field using R to determine the effectiveness of follow-up care based upon regular expressions. A tf-idf workflow was already implemented in a Cascading example, so this workflow was modified to perform the algorithm on the doctor's impression field.


tf-idf Example Workflow

[Figure: the tf-idf example workflow diagram. Figure credit: www.cascading.org]

Environment for Evaluation
OS: Ubuntu 12.04
Cluster: Spry CDH3 Quickstart VM
Distribution: CDH3 (Hadoop 0.20.2)
Cascading SDK: 2.1 (03-12-2013)
Gradle: 1.5


Setting up Sample Projects
Gradle is used to build the sample programs and run tests. Download Gradle 1.5 and unzip it. Add the unzipped location to the path on the virtual machine by modifying /etc/profile.

export GRADLE_HOME=/home/spry/cdh3/gradle-1.5
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin:$HBASE_HOME/bin:$GRADLE_HOME/bin
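
After reloading the profile, verify that Gradle is available on the path:

source /etc/profile
gradle -v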

SDK
Download and unzip the Cascading 2.1 SDK.

Sample Projects

Clone the forked repo of the Cascading "Impatient" example.



git clone git://github.com/MarkFuini/Impatient.git



The samples are split into parts, each located in its own sub-directory. Once in one of the part directories you can execute commands to build the jar, put data files into HDFS, and run the program. Paths and file names will vary by part.
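
For example, to work on part 5 (the sub-directory name is assumed from the repo layout):

cd Impatient/part5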



gradle clean jar



Data Setup

The Cascading implementation requires an input source for the documents to be processed and a stop word input file.


RHive was used on a project to extract the impression lines from a larger set of radiology report records into a CSV file. We used the resulting file, imps.csv, as the data source for the Cascading source tap.

Sample of imps.csv file
"id","aid","report"
"1","123456-1234",":||1. Right upper lobe pneumonia with underlying bronchiectasis and right paramediastinal fibrosis. Chest Xray follow up to resolution is suggested with CT follow up if clinically appropriate.||2. Left apical loculated fluid collection.||Findings were discussed with Dr XXXXX at 2pm on 6/28/07."
"2","123456-1234",": Left upper lobe anterior segment mass lesion for which CT is |recommended.  This will be communicated to the referring service.  | |DK/MTS/ps"

Sample of stop word file
--
-
stop
a
about
after
all
along
an
and

Load the data into HDFS
hadoop dfs -put imps.csv /user/spry/data/imps.csv
hadoop dfs -put en.stops /user/spry/data/en.stops
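
For reference, here is a minimal sketch of how these two files might be wired up as Cascading source taps. It assumes the Cascading 2.1 Hadoop APIs; the field names follow the samples above, and the wrapper class is only for illustration.

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class TapSetup {
  public static void main( String[] args ) {
    // Source tap for the impressions CSV: skip the header row and honor quoted fields.
    Fields impFields = new Fields( "id", "aid", "report" );
    Tap docTap = new Hfs( new TextDelimited( impFields, true, ",", "\"" ), "/user/spry/data/imps.csv" );

    // Source tap for the stop word list: one token per line under a "stop" header.
    Tap stopTap = new Hfs( new TextDelimited( new Fields( "stop" ), true, "\t" ), "/user/spry/data/en.stops" );
  }
}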


Project Setup

The IDEA project for part 5 of the sample code was created so that IntelliJ could be used as an IDE to write and debug the Cascading code. Cascading provides a Gradle task to create the IntelliJ project files.


gradle ideaProject



Unfortunately, dependencies for the Hadoop and Cascading jars needed to be configured manually, as that was not done in the IntelliJ project created by Gradle.




Modifications were made to the program for new delimiters, fields, document-specific tokens, and the scrub function, and some experiments were done to test GroupBy in Cascading. The code flow was observed in the IDE by stepping through it with breakpoints to understand program execution in Cascading.
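
To illustrate one such modification, here is a minimal sketch of a scrub operation along the lines of the sample's ScrubFunction. The exact cleanup rules shown, such as stripping the "|" separators visible in the impression text, are assumptions for illustration.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class ScrubFunction extends BaseOperation implements Function {
  public ScrubFunction( Fields fieldDeclaration ) {
    // Expects one argument field; declares the scrubbed output field.
    super( 1, fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall ) {
    String token = functionCall.getArguments().getString( 0 );
    // Lower-case the token and strip the "|" separators embedded in the impression text.
    String scrubbed = token.trim().toLowerCase().replaceAll( "\\|", "" );

    // Drop tokens that are empty after scrubbing.
    if( scrubbed.length() > 0 )
      functionCall.getOutputCollector().add( new Tuple( scrubbed ) );
  }
}

The function is then applied to the token stream with an Each pipe, something like new Each( docPipe, new Fields( "token" ), new ScrubFunction( new Fields( "token" ) ), Fields.REPLACE ).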

Running the Program

$HADOOP_HOME/bin/hadoop jar your-application.jar [your arguments]
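
For example, the modified part 5 job might be launched as follows. The jar path and the document/word-count/stop-word/tf-idf argument order are assumptions based on the part 5 sample; adjust them to match your build.

$HADOOP_HOME/bin/hadoop jar ./build/libs/impatient.jar /user/spry/data/imps.csv /user/spry/out/wc /user/spry/data/en.stops /user/spry/out/tfidf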

Results
Per the tf-idf algorithm, the higher the score for a term, the less often that term occurs across the documents; a score of zero means the term occurs in every document. In this case, the tf-idf is calculated per individual word (token), not per phrase. Further investigation would be required for an implementation that calculates the tf-idf of multi-word phrases.
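
For reference, the usual formulation is shown below; the sample's exact constants and log base may differ.

tfidf(token, doc) = tf(token, doc) * ln( N / df(token) )

Here tf is the count of the token within a document, N is the total number of documents, and df is the number of documents containing the token. Because ln( N / df ) is constant for a given token, the scores below come in exact multiples: a document where followup appears twice scores exactly double one where it appears once.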

Sample of Results
These documents have the highest scores for the term followup, indicating that the term is mentioned less often across the document set.
docid,tfidf,token
2416,4.744432253321599,followup
9010,4.744432253321599,followup
8110,9.488864506643198,followup
1273,9.488864506643198,followup
203,9.488864506643198,followup
5194,9.488864506643198,followup
418,4.744432253321599,followup
5196,4.744432253321599,followup
5002,4.744432253321599,followup


Conclusion
The approach utilized by Cascading is a logical and efficient way to process data in Hadoop without having to understand all the restrictions and techniques needed to "MapReducify" your algorithms. I would definitely recommend evaluating it for your Hadoop data processing needs. Cascading Pattern has also been released, which provides many algorithms already implemented in Cascading, with more promised in the future.
