diff --git a/docs/index.md b/docs/index.md
index 4ac0982ae54f194826a738b6f135e38420da0aba..7fe6b43d32af72e6bd59767797293e9a02cefb17 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -103,6 +103,8 @@ options for deployment:
 * [Security](security.html): Spark security support
 * [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
 * [3<sup>rd</sup> Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions
+* Integration with other storage systems:
+  * [OpenStack Swift](storage-openstack-swift.html)
 * [Building Spark with Maven](building-with-maven.html): build Spark using the Maven system
 * [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md
new file mode 100644
index 0000000000000000000000000000000000000000..c39ef1ce59e1c1f7b3e4fba7e1f1447f5bd7315b
--- /dev/null
+++ b/docs/storage-openstack-swift.md
@@ -0,0 +1,152 @@
+---
+layout: global
+title: Accessing OpenStack Swift from Spark
+---
+
+Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
+same URI formats as in Hadoop. You can specify a path in Swift as input through a
+URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
+Swift security credentials, through <code>core-site.xml</code> or via
+<code>SparkContext.hadoopConfiguration</code>.
+The current Swift driver requires Swift to use the Keystone authentication method.
+
+# Configuring Swift for Better Data Locality
+
+Although not mandatory, it is recommended to configure the proxy server of Swift with
+<code>list_endpoints</code> for better data locality. More information is
+[available here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
+
+
+# Dependencies
+
+The Spark application should include the <code>hadoop-openstack</code> dependency.
+For example, for Maven support, add the following to the <code>pom.xml</code> file:
+
+{% highlight xml %}
+<dependencyManagement>
+  ...
+  <dependency>
+    <groupId>org.apache.hadoop</groupId>
+    <artifactId>hadoop-openstack</artifactId>
+    <version>2.3.0</version>
+  </dependency>
+  ...
+</dependencyManagement>
+{% endhighlight %}
+
+
+# Configuration Parameters
+
+Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
+There are two main categories of parameters that should be configured: the declaration of the
+Swift driver and the parameters required by Keystone.
+
+Configuration of Hadoop to use the Swift file system is achieved via the following property:
+
+<table class="table">
+<tr><th>Property Name</th><th>Value</th></tr>
+<tr>
+  <td><code>fs.swift.impl</code></td>
+  <td><code>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</code></td>
+</tr>
+</table>
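+
+Once the driver is declared, data in Swift can be read like any other Hadoop-compatible path.
+The following is a minimal sketch for the Spark shell; the container name <code>logs</code> and
+the object name <code>data.txt</code> are hypothetical, and <code>SparkTest</code> is the
+illustrative provider name used throughout this page:
+
+{% highlight scala %}
+// Read an object from a Swift container; the Keystone credentials for the
+// "SparkTest" provider must already be configured (see below).
+val data = sc.textFile("swift://logs.SparkTest/data.txt")
+
+// Any RDD action works as usual, e.g. counting the lines of the object.
+data.count()
+{% endhighlight %}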
+
+Additional parameters are required by Keystone (v2.0) and must be provided to the Swift driver;
+they are used to perform authentication in Keystone in order to access Swift. The following table
+lists these parameters. <code>PROVIDER</code> can be any name.
+
+<table class="table">
+<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
+  <td>Keystone Authentication URL</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
+  <td>Keystone endpoints prefix</td>
+  <td>Optional</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
+  <td>Tenant</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.username</code></td>
+  <td>Username</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.password</code></td>
+  <td>Password</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
+  <td>HTTP port</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.region</code></td>
+  <td>Keystone region</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td><code>fs.swift.service.PROVIDER.public</code></td>
+  <td>Indicates whether all URLs are public</td>
+  <td>Mandatory</td>
+</tr>
+</table>
+
+For example, assume <code>PROVIDER=SparkTest</code> and Keystone contains user <code>tester</code> with password <code>testing</code>
+defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:
+
+{% highlight xml %}
+<configuration>
+  <property>
+    <name>fs.swift.impl</name>
+    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.auth.url</name>
+    <value>http://127.0.0.1:5000/v2.0/tokens</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
+    <value>endpoints</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.http.port</name>
+    <value>8080</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.region</name>
+    <value>RegionOne</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.public</name>
+    <value>true</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.tenant</name>
+    <value>test</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.username</name>
+    <value>tester</value>
+  </property>
+  <property>
+    <name>fs.swift.service.SparkTest.password</name>
+    <value>testing</value>
+  </property>
+</configuration>
+{% endhighlight %}
+
+Notice that
+<code>fs.swift.service.PROVIDER.tenant</code>,
+<code>fs.swift.service.PROVIDER.username</code>, and
+<code>fs.swift.service.PROVIDER.password</code> contain sensitive information, so keeping them in
+<code>core-site.xml</code> is not always a good approach.
+We suggest keeping those parameters in <code>core-site.xml</code> for testing purposes only, when running Spark
+via <code>spark-shell</code>.
+For job submissions they should be provided via <code>SparkContext.hadoopConfiguration</code>, as sketched below.
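+
+As a sketch of the programmatic approach, the same properties used in the
+<code>core-site.xml</code> example above can be set on the Hadoop configuration before any
+Swift path is accessed. The <code>SparkTest</code> provider and the
+<code>tester</code>/<code>testing</code> credentials are the same illustrative values as above,
+and the application name is hypothetical:
+
+{% highlight scala %}
+import org.apache.spark.{SparkConf, SparkContext}
+
+val sc = new SparkContext(new SparkConf().setAppName("SwiftExample"))
+val hc = sc.hadoopConfiguration
+
+// Declare the Swift file system implementation.
+hc.set("fs.swift.impl", "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
+
+// Keystone authentication parameters for the "SparkTest" provider.
+hc.set("fs.swift.service.SparkTest.auth.url", "http://127.0.0.1:5000/v2.0/tokens")
+hc.set("fs.swift.service.SparkTest.tenant", "test")
+hc.set("fs.swift.service.SparkTest.username", "tester")
+hc.set("fs.swift.service.SparkTest.password", "testing")
+hc.set("fs.swift.service.SparkTest.http.port", "8080")
+hc.set("fs.swift.service.SparkTest.region", "RegionOne")
+hc.set("fs.swift.service.SparkTest.public", "true")
+{% endhighlight %}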