    cdaf4ce9
    [SPARK-18480][DOCS] Fix wrong links for ML guide docs · cdaf4ce9
    Zheng RuiFeng authored
    ## What changes were proposed in this pull request?
    1, There are two `[Graph.partitionBy]` in `graphx-programming-guide.md`, the first one had no effect.
    2, `DataFrame`, `Transformer`, `Pipeline` and `Parameter`  in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
    3, `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because class `PythonMLLibAPI` is private.
    4, Other link updates.
    ## How was this patch tested?
     manual tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #15912 from zhengruifeng/md_fix.
graphx-programming-guide.md 51.42 KiB
layout: global
displayTitle: GraphX Programming Guide
title: GraphX
description: GraphX graph processing library guide for Spark SPARK_VERSION_SHORT
  • This will become a table of contents (this text will be scraped). {:toc}

GraphX

Overview

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

Getting Started

To get started you first need to import Spark and GraphX into your project, as follows:

{% highlight scala %}
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD
{% endhighlight %}

If you are not using the Spark shell you will also need a SparkContext. To learn more about getting started with Spark refer to the Spark Quick Start Guide.
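For a standalone application, creating the context might look like the minimal sketch below. The application name and the `local[*]` master URL are placeholder assumptions, not values taken from this guide; a real deployment would normally supply its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application name and master URL; adjust both for a real deployment.
val conf = new SparkConf()
  .setAppName("GraphXExample")
  .setMaster("local[*]")

// getOrCreate reuses an already running context (such as the shell's `sc`) if one exists.
val sc = SparkContext.getOrCreate(conf)

// A trivial job to confirm that the context is usable.
val sanityCount = sc.parallelize(1 to 3).count()
```

Using `getOrCreate` rather than `new SparkContext(conf)` keeps the snippet safe to paste into a session where a context is already running.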

The Property Graph

The property graph is a directed multigraph with user defined objects attached to each vertex and edge. A directed multigraph is a directed graph with potentially multiple parallel edges sharing the same source and destination vertex. The ability to support parallel edges simplifies modeling scenarios where there can be multiple relationships (e.g., co-worker and friend) between the same vertices. Each vertex is keyed by a unique 64-bit long identifier (VertexId). GraphX does not impose any ordering constraints on the vertex identifiers. Similarly, edges have corresponding source and destination vertex identifiers.

The property graph is parameterized over the vertex (VD) and edge (ED) types. These are the types of the objects associated with each vertex and edge respectively.
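To make these definitions concrete, the sketch below builds a small graph whose vertex type (VD) and edge type (ED) are both `String`, with two parallel edges between the same pair of vertices. The user names, relationship labels, and the local-mode context are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumed local-mode context; reuses an existing one if present.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("MultigraphSketch").setMaster("local[*]"))

// Two hypothetical users, keyed by 64-bit VertexId values.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))

// Two parallel edges sharing the same source and destination vertex.
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "co-worker"), Edge(1L, 2L, "friend")))

// VD = String (user name), ED = String (relationship label).
val graph: Graph[String, String] = Graph(users, relationships)

// Both edges are retained; the multigraph does not merge them.
val parallelEdgeCount: Long =
  graph.edges.filter(e => e.srcId == 1L && e.dstId == 2L).count()
```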

GraphX optimizes the representation of vertex and edge types when they are primitive data types (e.g., int, double), reducing the in-memory footprint by storing them in specialized arrays.

In some cases it may be desirable to have vertices with different property types in the same graph. This can be accomplished through inheritance. For example, to model users and products as a bipartite graph we might do the following:

{% highlight scala %}
class VertexProperty()
case class UserProperty(val name: String) extends VertexProperty
case class ProductProperty(val name: String, val price: Double) extends VertexProperty
// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null
{% endhighlight %}
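As a rough illustration of how such a graph could be populated, the sketch below creates one user, one product, and a single edge between them. The sample values, the "bought" edge label, and the local-mode context are assumptions made for the example. Note the explicit element type on the vertex sequence: because RDDs are invariant, the tuples need to be typed as `(VertexId, VertexProperty)` rather than as their concrete subclasses.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumed local-mode context; reuses an existing one if present.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("BipartiteSketch").setMaster("local[*]"))

class VertexProperty()
case class UserProperty(name: String) extends VertexProperty
case class ProductProperty(name: String, price: Double) extends VertexProperty

// Hypothetical sample data: one user vertex and one product vertex.
val vertices: RDD[(VertexId, VertexProperty)] =
  sc.parallelize(Seq[(VertexId, VertexProperty)](
    (1L, UserProperty("alice")),
    (2L, ProductProperty("notebook", 4.5))))

// A single user-to-product edge with an illustrative label.
val purchases: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "bought")))

val bipartite: Graph[VertexProperty, String] = Graph(vertices, purchases)

// Pattern matching recovers the concrete property type of each vertex.
val userCount: Long = bipartite.vertices.filter {
  case (_, _: UserProperty) => true
  case _                    => false
}.count()
```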

Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph (i.e., unaffected structure, attributes, and indices) are reused in the new graph reducing the cost of this inherently functional data structure. The graph is partitioned across the executors using a range of vertex partitioning heuristics. As with RDDs, each partition of the graph can be recreated on a different machine in the event of a failure.
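The immutability described above can be observed with a small sketch: transforming the vertex attributes with `mapVertices` yields a new graph, while the original keeps its values. The sample data and the local-mode context are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumed local-mode context; reuses an existing one if present.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("ImmutabilitySketch").setMaster("local[*]"))

// Hypothetical two-vertex graph with one edge.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val original: Graph[String, String] = Graph(users, follows)

// mapVertices returns a new graph; `original` is left untouched.
val upper: Graph[String, String] =
  original.mapVertices((_, name) => name.toUpperCase)

val originalNames: Array[String] = original.vertices.map(_._2).collect().sorted
val upperNames: Array[String] = upper.vertices.map(_._2).collect().sorted
```

Because only the vertex attributes change here, the edge structure does not need to be rebuilt for the new graph.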

Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the properties for each vertex and edge. As a consequence, the graph class contains members to access the vertices and edges of the graph:

{% highlight scala %}
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}
{% endhighlight %}

The classes VertexRDD[VD] and EdgeRDD[ED] extend, and are optimized versions of, RDD[(VertexId, VD)] and RDD[Edge[ED]] respectively. Both VertexRDD[VD] and EdgeRDD[ED] provide additional functionality built around graph computation and leverage internal optimizations. We discuss the VertexRDD and EdgeRDD APIs in greater detail in the section on vertex and edge RDDs, but for now they can be thought of as simply RDDs of the form RDD[(VertexId, VD)] and RDD[Edge[ED]].
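Treating the two views as plain RDDs, ordinary operations such as `filter`, `map`, and `count` apply directly. The sketch below assumes a small hypothetical graph in which the vertex attribute is a name and the edge attribute is an integer weight; the data and the local-mode context are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumed local-mode context; reuses an existing one if present.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("VertexEdgeViewSketch").setMaster("local[*]"))

// Hypothetical data: vertex attribute is a name, edge attribute is a weight.
val names: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val weights: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(2L, 3L, 2), Edge(3L, 1L, 9)))
val graph: Graph[String, Int] = Graph(names, weights)

// graph.vertices can be used like an RDD[(VertexId, String)].
val longNameIds: Array[VertexId] = graph.vertices
  .filter { case (_, name) => name.length > 3 }
  .map { case (id, _) => id }
  .collect()
  .sorted

// graph.edges can be used like an RDD[Edge[Int]].
val heavyEdgeCount: Long = graph.edges.filter(e => e.attr > 5).count()
```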