Joseph E. Gonzalez authored

---
layout: global
title: GraphX Programming Guide
---
- This will become a table of contents (this text will be scraped).
{:toc}
Overview
GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing the Resilient Distributed property Graph (RDG): a directed graph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of functions (e.g., `mapReduceTriplets`) as well as optimized variants of the Pregel and GraphLab APIs. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
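To make the property graph idea concrete, here is a minimal sketch in plain Scala that runs on local collections with no Spark dependency. It is not the GraphX API: `PropertyGraph`, `Edge`, `Triplet`, and this local `mapReduceTriplets` are hypothetical names chosen for illustration, and the real operators work on distributed RDDs rather than in-memory maps. The sketch only shows the shape of the abstraction: vertices and edges carry properties, a triplet joins an edge with both endpoint properties, and a map step followed by a reduce step aggregates messages per vertex.

```scala
// Local, Spark-free sketch of a property graph. All names are illustrative
// assumptions and do not mirror the actual GraphX classes.
object PropertyGraphSketch {
  type VertexId = Long

  // A directed edge carrying a user-defined property.
  final case class Edge[ED](src: VertexId, dst: VertexId, attr: ED)

  // A triplet joins an edge with the properties of both of its endpoints.
  final case class Triplet[VD, ED](
      src: VertexId, srcAttr: VD,
      dst: VertexId, dstAttr: VD,
      attr: ED)

  final case class PropertyGraph[VD, ED](
      vertices: Map[VertexId, VD],
      edges: Seq[Edge[ED]]) {

    // Every edge viewed together with its source and destination properties.
    def triplets: Seq[Triplet[VD, ED]] =
      edges.map(e => Triplet(e.src, vertices(e.src), e.dst, vertices(e.dst), e.attr))

    // Map each triplet to zero or more (target vertex, message) pairs,
    // then combine all messages addressed to the same vertex.
    def mapReduceTriplets[A](
        mapFn: Triplet[VD, ED] => Iterable[(VertexId, A)],
        reduceFn: (A, A) => A): Map[VertexId, A] =
      triplets
        .flatMap(mapFn)
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceFn) }
  }

  // A tiny example graph: vertex properties are names, edge properties are labels.
  val graph: PropertyGraph[String, String] = PropertyGraph(
    vertices = Map(1L -> "alice", 2L -> "bob", 3L -> "carol"),
    edges = Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows"), Edge(2L, 1L, "follows"))
  )

  // In-degree: each edge sends the message 1 to its destination; messages are summed.
  val inDegrees: Map[VertexId, Int] =
    graph.mapReduceTriplets[Int](t => Seq(t.dst -> 1), _ + _)

  def main(args: Array[String]): Unit =
    println(inDegrees.toSeq.sortBy(_._1))
}
```

In this sketch a vertex that receives no messages (vertex 3 above) simply has no entry in the result, so callers that need a value for every vertex would have to join the result back against `vertices` themselves.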
Background on Graph-Parallel Computation
From social networks to language modeling, the growing scale and importance of graph data have driven the development of numerous new graph-parallel systems (e.g., Giraph and GraphLab). By restricting the types of computation that can be expressed and introducing new techniques to partition and distribute graphs, these systems can execute sophisticated graph algorithms orders of magnitude faster than more general data-parallel systems.
However, the same restrictions that enable these substantial performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph-analytics pipelines compose graph-parallel and data-parallel systems, which leads to extensive data movement and duplication, as well as a complicated programming model.