An Introduction to Apache Giraph

A member of the Spry team was selected to write a guest post for the Safari Books Online Blog. The introduction appears below; click through here to view the rest of the post.

This post introduces Apache Giraph, covering graphs in general and the benefits of using Giraph.

Graph Vertices and Edges

Discovering and understanding connections is one of the most important tasks for any analyst, and it typically takes the form of graph processing. A graph is a collection of vertices and edges, where the vertices can have values and the edges can have directions and weights. Two of the most common ways of interacting with a graph are traversal (following edges between a set of connected vertices) and search (finding vertices in the graph that meet a set of conditions).
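To make the definition concrete, here is a minimal Python sketch of a small directed, weighted graph with one traversal and one search. The vertex names, values, and weights are invented for illustration, and `traverse` and `search` are hypothetical helper names, not part of Giraph.

```python
from collections import deque

# Hypothetical toy graph (illustrative data only).
# vertex_values maps each vertex to its value;
# edges maps each vertex to {neighbor: weight} for its outgoing edges.
vertex_values = {"A": 5, "B": 12, "C": 7, "D": 20}
edges = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"D": 2.0},
    "C": {"D": 1.5},
    "D": {},
}


def traverse(start):
    """Traversal: follow directed edges breadth-first from `start`."""
    visited = [start]
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in edges[vertex]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


def search(predicate):
    """Search: return every vertex whose value satisfies `predicate`."""
    return [v for v, value in vertex_values.items() if predicate(value)]


print(traverse("A"))                     # vertices reachable from A, in visit order
print(search(lambda value: value > 10))  # vertices whose value exceeds 10
```

Traversal depends on the edges, while search here looks only at vertex values; real graph queries often combine the two.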
This graph structure of vertices and edges is extremely useful in a wide variety of real-world applications. For example, imagine the problem of finding driving directions between two addresses. In this scenario, a graph of the roadway system can be built by treating roads as the edges and intersections as the vertices. A related, larger problem over that same graph is optimizing the order in which a business makes its deliveries.
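The driving-directions case can be sketched with Dijkstra's shortest-path algorithm, one common choice for this kind of problem. The intersections and travel times below are made up, `shortest_route` is a hypothetical helper name, and every road is treated as one-way to keep the example short.

```python
import heapq

# Hypothetical road network (illustrative data only):
# intersection -> {neighboring intersection: travel time in minutes}.
roads = {
    "Home": {"Main&1st": 2, "Oak&2nd": 5},
    "Main&1st": {"Oak&2nd": 1, "Office": 7},
    "Oak&2nd": {"Office": 3},
    "Office": {},
}


def shortest_route(graph, source, target):
    """Dijkstra's algorithm. Returns (total_cost, path), or (inf, []) if unreachable."""
    best = {source: 0}      # cheapest known cost to each intersection
    previous = {}           # back-pointers used to rebuild the path
    heap = [(0, source)]
    while heap:
        cost, vertex = heapq.heappop(heap)
        if vertex == target:
            break
        if cost > best.get(vertex, float("inf")):
            continue        # stale heap entry; a cheaper route was already found
        for neighbor, weight in graph[vertex].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = vertex
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


cost, path = shortest_route(roads, "Home", "Office")
print(cost, path)
```

The delivery-ordering problem mentioned above is considerably harder than a single shortest path, since it must choose the order of many stops; this sketch does not attempt it.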
Even a natural system like the brain can be treated as a graph, where the billions of neurons are the vertices and the trillions of synapses connecting them are the edges. Once the brain is represented in this manner, research can be conducted to help us understand general brain function as well as diseases and conditions that affect these passageways, such as Alzheimer’s.

Graphs and MapReduce

In today’s analytic world, where data volume and velocity are growing faster than ever before, the benefits of using a parallel computing platform like Hadoop to process the information are clear. The appeal of Hadoop’s main building block, MapReduce, is that by transforming the input data into key/value pairs and splitting the pairs among the workers, the parts can be processed independently and then merged together to form the final result set. By design, tasks do not communicate with one another, so no synchronization overhead affects task completion. Unfortunately, the traditional MapReduce style of execution does not lend itself to graph applications. The bulk of the processing in a graph algorithm usually consists of sending messages between vertices or performing “hops” that travel across an edge from one vertex to another. Executing this in typical MapReduce fashion requires each hop or message to be processed in its own job.
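The cost of one job per hop can be seen in a small single-process simulation. This is a sketch in plain Python, not the Hadoop API: `map_phase`, `reduce_phase`, and `hop_distances` are hypothetical names, the four-vertex chain is invented, and each pass through the loop stands in for one complete MapReduce job.

```python
from collections import defaultdict

INF = float("inf")

# Hypothetical directed graph (illustrative data only): vertex -> out-neighbors.
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}


def map_phase(distances):
    """Map: each vertex re-emits its own distance; reached vertices also
    emit distance + 1 to every out-neighbor (one hop along each edge)."""
    for vertex, dist in distances.items():
        yield vertex, dist
        if dist != INF:
            for neighbor in graph[vertex]:
                yield neighbor, dist + 1


def reduce_phase(pairs):
    """Reduce: group the emitted pairs by vertex and keep the smallest distance."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: min(values) for key, values in grouped.items()}


def hop_distances(source):
    """Run map/reduce passes until nothing changes; count the passes as jobs."""
    distances = {v: (0 if v == source else INF) for v in graph}
    jobs = 0
    while True:
        updated = reduce_phase(map_phase(distances))
        jobs += 1  # every pass here would be a separate MapReduce job
        if updated == distances:
            return distances, jobs
        distances = updated


distances, jobs = hop_distances("A")
print(distances, jobs)
```

On this chain, reaching the vertex three hops away takes three passes, plus a fourth that only confirms nothing changed. In a real cluster, each pass would also pay job start-up cost and write the entire graph state back to storage.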
Click here to view the rest of the post.
