Understanding an Apache Giraph Application

A member of the Spry team was selected to write a guest post for the Safari Books Online Blog. The introduction appears below; the rest of the post is published on the Safari Books Online Blog.

Apache Giraph is a graph processing framework: to accomplish your graph processing goal, you implement the methods it defines. Two of the examples that come packaged with Giraph are Shortest Path and PageRank. These examples are meant to be extended and altered to fit the needs of any new custom application. This post goes through these two example implementations in detail to explain the steps necessary to write a Giraph application.


Determine the Input Data Format

The first step of any Giraph application is to determine the format for the graph input data. Giraph includes many built-in input formats, each of which defines data types for the following pieces of information describing the graph.

Vertex ID

The Vertex ID is the identifier for a vertex in the graph. The framework does not restrict the definition, so the ID can be as simple as a label or as complex as a fully initialized object with several fields. The only limitation is that the information must be represented in a form that Giraph can parse from the input file.
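To make the "label or complex object" distinction concrete, here is a minimal plain-Java sketch of a composite vertex ID. The class name `CompositeVertexId` and the `namespace:key` text form are hypothetical, chosen only for illustration. In Giraph itself the ID type is expected to be a Hadoop `WritableComparable`; this sketch leaves serialization out so that it stays self-contained, and only shows the parsing, equality, and ordering such an ID needs.

```java
import java.util.Objects;

/** Hypothetical composite vertex ID: a namespace plus a numeric key. */
final class CompositeVertexId implements Comparable<CompositeVertexId> {
    final String namespace;
    final long key;

    CompositeVertexId(String namespace, long key) {
        this.namespace = Objects.requireNonNull(namespace);
        this.key = key;
    }

    /** Parses the assumed text form "namespace:key", for example "user:42". */
    static CompositeVertexId parse(String text) {
        int sep = text.lastIndexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("Expected namespace:key but got " + text);
        }
        String namespace = text.substring(0, sep);
        long key = Long.parseLong(text.substring(sep + 1));
        return new CompositeVertexId(namespace, key);
    }

    /** Orders by namespace first, then by numeric key. */
    @Override
    public int compareTo(CompositeVertexId other) {
        int byNamespace = namespace.compareTo(other.namespace);
        return byNamespace != 0 ? byNamespace : Long.compare(key, other.key);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeVertexId)) return false;
        CompositeVertexId that = (CompositeVertexId) o;
        return key == that.key && namespace.equals(that.namespace);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, key);
    }

    @Override
    public String toString() {
        return namespace + ":" + key;
    }
}

class VertexIdDemo {
    public static void main(String[] args) {
        CompositeVertexId a = CompositeVertexId.parse("user:42");
        CompositeVertexId b = CompositeVertexId.parse("user:7");
        System.out.println(a + " compared to " + b + " gives " + a.compareTo(b));
    }
}
```

A simple label-style ID would skip all of this and use a single number or string; the composite form is only worth the extra code when the ID really carries more than one field.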

Vertex Value

The Vertex Value is optional and provides a place to store additional information associated with a vertex. Typically, this field stores values or objects that are updated during graph processing.
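As an illustration of a vertex value that changes during processing, the sketch below keeps a "best distance so far" in the value field and lowers it whenever a smaller candidate arrives, which is the kind of update a shortest-path computation performs. It is plain Java rather than Giraph code: `DistanceVertex` and `update` are hypothetical names, and the candidate distances stand in for whatever a real application would receive from neighboring vertices.

```java
import java.util.Arrays;

/** Hypothetical vertex whose value is the shortest known distance from a source. */
final class DistanceVertex {
    final long id;
    // The vertex value: starts at infinity, meaning "not reached yet".
    double value = Double.POSITIVE_INFINITY;

    DistanceVertex(long id) {
        this.id = id;
    }

    /**
     * Lowers the vertex value to the smallest candidate distance, if any
     * candidate beats the current value. Returns true when the value changed.
     */
    boolean update(Iterable<Double> candidateDistances) {
        double best = value;
        for (double candidate : candidateDistances) {
            best = Math.min(best, candidate);
        }
        boolean improved = best < value;
        value = best;
        return improved;
    }
}

class VertexValueDemo {
    public static void main(String[] args) {
        DistanceVertex v = new DistanceVertex(3L);
        boolean first = v.update(Arrays.asList(5.0, 3.0));
        boolean second = v.update(Arrays.asList(4.0));
        System.out.println("changed=" + first + ", then changed=" + second + ", value=" + v.value);
    }
}
```

Returning whether the value changed is a design choice for this sketch: an algorithm of this shape usually only needs to notify neighbors when a vertex has actually improved.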

Edge Tuples

The final piece of input data is the collection of information necessary to define the set of out-edges associated with the source vertex ID. This information is composed of tuples with two elements per edge: the destination Vertex ID and the Edge Weight, where the Edge Weight is optional. Since Giraph only expects out-edges to be specified in this definition, any bi-directional edges must be defined as two separate out-edges with opposite directions.
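The out-edge rule above can be sketched in a few lines of plain Java. Everything here is illustrative rather than Giraph API: `OutEdge` and `AdjacencyBuilder` are hypothetical names, and the `toLine` output layout `[id,value,[[destination,weight],...]]` is an assumption modeled on a JSON-array style of input line, not a statement about what any particular input format requires. The point is only that one bi-directional edge becomes two out-edge tuples, one stored under each endpoint.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One edge tuple: destination vertex ID plus an edge weight. */
final class OutEdge {
    final long targetId;
    final float weight;

    OutEdge(long targetId, float weight) {
        this.targetId = targetId;
        this.weight = weight;
    }
}

/** Collects out-edges per source vertex (hypothetical helper, not Giraph API). */
final class AdjacencyBuilder {
    private final Map<Long, List<OutEdge>> outEdges = new LinkedHashMap<>();

    /** Records a single directed edge under its source vertex. */
    void addOutEdge(long source, long target, float weight) {
        outEdges.computeIfAbsent(source, k -> new ArrayList<>())
                .add(new OutEdge(target, weight));
    }

    /** A bi-directional edge is stored as two out-edges with opposite directions. */
    void addBidirectionalEdge(long a, long b, float weight) {
        addOutEdge(a, b, weight);
        addOutEdge(b, a, weight);
    }

    List<OutEdge> outEdgesOf(long vertexId) {
        return outEdges.getOrDefault(vertexId, Collections.emptyList());
    }

    /** Renders one vertex in the assumed layout [id,value,[[destination,weight],...]]. */
    String toLine(long vertexId, double vertexValue) {
        StringBuilder sb = new StringBuilder();
        sb.append('[').append(vertexId).append(',').append(vertexValue).append(",[");
        List<OutEdge> edges = outEdgesOf(vertexId);
        for (int i = 0; i < edges.size(); i++) {
            if (i > 0) sb.append(',');
            OutEdge e = edges.get(i);
            sb.append('[').append(e.targetId).append(',').append(e.weight).append(']');
        }
        return sb.append("]]").toString();
    }
}

class EdgeTupleDemo {
    public static void main(String[] args) {
        AdjacencyBuilder graph = new AdjacencyBuilder();
        graph.addBidirectionalEdge(1L, 2L, 1.5f);
        System.out.println(graph.toLine(1L, 0.0));
        System.out.println(graph.toLine(2L, 0.0));
    }
}
```

If the edge weight is not needed, the second tuple element can simply carry a constant; the two-out-edges rule for bi-directional links stays the same.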
