Background
Cascading is a system for transforming data files in Hadoop. It is widely adopted in production environments. In keeping with the IT trend of standardization over configuration, it is important to set up standards for data management. This article covers some ideas from our implementation that may be useful in other environments.
Use Case
This particular use case was improving the speed of month-end processing for financial reporting at a Fortune 500 company. The company had several brands that needed to be processed separately, each from a similar but separate data set.
Data File Naming Conventions
- The flows will be executed per brand, so the directory structure must keep each brand's data files separate, as the brands may be run simultaneously.
- The business would like to run multiple months at the same time.
Cascading / Java Requirements
- The structure should be able to hold expectation files for JUnit tests.
- The structure should hold output files that are needed by individual flows. If a flow reads a source and writes to a sink that is actually supposed to be the same data entity, then the sink needs to be stored in a separate directory.
These requirements can be accomplished with the following directory structure. These directories will be the same on the cluster system as on the developer system. In our case, the input files were not segregated by company brand so there is one input directory.
input - original files from the business after data jujitsu
output - any files generated by Cascading
expectation - files to compare against in JUnit tests; may not be needed on the cluster
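To illustrate how the expectation directory might be used, here is a minimal sketch that checks a generated file against its expectation file line by line. The class name `ExpectationCheck`, the method `matches`, and the file names are hypothetical, not taken from our code base. The sketch uses plain Java so it runs without a test framework; in a JUnit test the result would typically be passed to an assertion instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: compares a generated file with its expectation file.
final class ExpectationCheck {

    /** Returns true when both files contain exactly the same lines in the same order. */
    static boolean matches(Path generated, Path expectation) throws IOException {
        List<String> actual = Files.readAllLines(generated, StandardCharsets.UTF_8);
        List<String> expected = Files.readAllLines(expectation, StandardCharsets.UTF_8);
        return expected.equals(actual);
    }

    public static void main(String[] args) throws IOException {
        // Demo data in a temporary directory; the file names are made up for illustration.
        Path dir = Files.createTempDirectory("expectation-demo");
        Path out = Files.write(dir.resolve("output.txt"), Arrays.asList("a\t1", "b\t2"));
        Path exp = Files.write(dir.resolve("expectation.txt"), Arrays.asList("a\t1", "b\t2"));
        System.out.println(matches(out, exp)); // prints "true"
    }
}
```

Note that this comparison is order-sensitive, so it assumes the flow writes its records in a deterministic order.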
The structure is as follows for the November 2013 month:
/MMYY/[brand]/[module]/[FlowName]/[brand agnostic file name].txt
/MMYY/[brand]/[module]/[TestName]/[brand agnostic file name].txt
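As a rough sketch of how a flow could build these paths consistently, the helper below assembles a file location from the month, brand, module, and flow name. The class `DataPaths`, the method `flowFile`, and the sample brand, module, and flow names are hypothetical. Prefixing the pattern with a root such as `output` is also an assumption made for illustration.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for the /MMYY/[brand]/[module]/[FlowName]/[file].txt convention.
final class DataPaths {

    // Two-digit month followed by two-digit year, e.g. 1113 for November 2013.
    private static final DateTimeFormatter MMYY = DateTimeFormatter.ofPattern("MMyy");

    /**
     * Builds root/MMYY/brand/module/flowName/fileName.txt.
     * The root argument (for example "output") is an assumed prefix.
     */
    static String flowFile(String root, YearMonth month, String brand,
                           String module, String flowName, String fileName) {
        return String.join("/", root, month.format(MMYY), brand, module,
                flowName, fileName + ".txt");
    }

    public static void main(String[] args) {
        YearMonth november2013 = YearMonth.of(2013, 11);
        // Sample names below are invented for the demo.
        System.out.println(flowFile("output", november2013, "brandA",
                "ledger", "SummarizeFlow", "accounts"));
        // prints "output/1113/brandA/ledger/SummarizeFlow/accounts.txt"
    }
}
```

Because the month and brand are part of every path, runs for different months or brands never write to the same directory, which is what allows them to execute at the same time.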
The next blog post will cover a Flow Runner strategy that demonstrates how flows from various modules can be assembled into a cascade automatically by defining only the flow dependencies. This will allow multiple months of processing to be deployed on the cluster simultaneously.