Skip to content
Snippets Groups Projects
  1. Jul 30, 2014
    • Michael Armbrust's avatar
      [SQL] Fix compiling of catalyst docs. · 2248891a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1653 from marmbrus/fixDocs and squashes the following commits:
      
      0aa1feb [Michael Armbrust] Fix compiling of catalyst docs.
      2248891a
    • Yin Huai's avatar
      [SPARK-2179][SQL] Public API for DataTypes and Schema · 7003c163
      Yin Huai authored
      The current PR contains the following changes:
      * Expose `DataType`s in the sql package (internal details are private to sql).
      * Users can create Rows.
      * Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
      * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
      * `ScalaReflection.typeOfObject` provides a way to infer the Catalyst data type based on an object. Also, we can compose `typeOfObject` with some custom logics to form a new function to infer the data type (for different use cases).
      * `JsonRDD` has been refactored to use changes introduced by this PR.
      * Add a field `containsNull` to `ArrayType`. So, we can explicitly mark if an `ArrayType` can contain null values. The default value of `containsNull` is `false`.
      
      New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
      [sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).
      
      An example of using `applySchema` is shown below.
      ```scala
      import org.apache.spark.sql._
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      
      val schema =
        StructType(
          StructField("name", StringType, false) ::
          StructField("age", IntegerType, true) :: Nil)
      
      val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
      val peopleSchemaRDD = sqlContext. applySchema(people, schema)
      peopleSchemaRDD.printSchema
      // root
      // |-- name: string (nullable = false)
      // |-- age: integer (nullable = true)
      
      peopleSchemaRDD.registerAsTable("people")
      sqlContext.sql("select name from people").collect.foreach(println)
      ```
      
      I will add new contents to the SQL programming guide later.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2179
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:
      
      1d45977 [Yin Huai] Clean up.
      a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      c712fbf [Yin Huai] Converts types of values based on defined schema.
      4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      e5f8df5 [Yin Huai] Scaladoc.
      122d1e7 [Yin Huai] Address comments.
      03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      2476ed0 [Yin Huai] Minor updates.
      ab71f21 [Yin Huai] Format.
      fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      bd40a33 [Yin Huai] Address comments.
      991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
      1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
      3edb3ae [Yin Huai] Python doc.
      692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      1d93395 [Yin Huai] Python APIs.
      246da96 [Yin Huai] Add java data type APIs to javadoc index.
      1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      d48fc7b [Yin Huai] Minor updates.
      33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      b9f3071 [Yin Huai] Java API for applySchema.
      1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
      624765c [Yin Huai] Tests for applySchema.
      aa92e84 [Yin Huai] Update data type tests.
      8da1a17 [Yin Huai] Add Row.fromSeq.
      9c99bc0 [Yin Huai] Several minor updates.
      1d9c13a [Yin Huai] Update applySchema API.
      85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      e495e4e [Yin Huai] More comments.
      42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
      68525a2 [Yin Huai] Update JSON unit test.
      3209108 [Yin Huai] Add unit tests.
      dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
      9168b83 [Yin Huai] Update comments.
      fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
      949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
      7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
      43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
      0266761 [Yin Huai] Format
      03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
      3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
      16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
      7003c163
  2. Jul 29, 2014
    • Michael Armbrust's avatar
      [SPARK-2054][SQL] Code Generation for Expression Evaluation · 84467468
      Michael Armbrust authored
      Adds a new method for evaluating expressions using code that is generated though Scala reflection.  This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.
      
      Evaluation can be done in several specialized ways:
       - *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row.  This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
       - *Ordering* - Compares two rows based on a list of `SortOrder` expressions
       - *Condition* - Returns `true` or `false` given an input row.
      
      For each of the above operations there is both a Generated and Interpreted version.  When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class.  Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.
      
      This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code.  Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`.  This functionality will be extended in a future PR.
      
      This PR also performs several clean ups that simplified the implementation:
       - The notion of `Binding` all expressions in a tree automatically before query execution has been removed.  Instead it is the responsibly of an operator to provide the input schema when creating one of the specialized evaluators defined above.  In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`.  There are a few reasons for this change:  First, there were many operators where it just didn't work before.  For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken.  Specifically, we have had a few bugs where partitioning breaks because of the binding.
       - A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner.  Before this was done ad-hoc for the nodes that needed this.  However, this required a lot of boilerplate as one had to always remember to make it `transient` and also had to modify the `otherCopyArgs`.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #993 from marmbrus/newCodeGen and squashes the following commits:
      
      96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
      f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
      67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
      4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      fed3634 [Michael Armbrust] Inspectors are not serializable.
      ef8d42b [Michael Armbrust] comments
      533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
      3cd773e [Michael Armbrust] Allow codegen for Generate.
      64b2ee1 [Michael Armbrust] Implement copy
      3587460 [Michael Armbrust] Drop unused string builder function.
      9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      1a61293 [Michael Armbrust] Address review comments.
      0672e8a [Michael Armbrust] Address comments.
      1ec2d6e [Michael Armbrust] Address comments
      033abc6 [Michael Armbrust] off by default
      4771fab [Michael Armbrust] Docs, more test coverage.
      d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
      be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
      bc88ecd [Michael Armbrust] Style
      6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      4220f1e [Michael Armbrust] Better config, docs, etc.
      ca6cc6b [Michael Armbrust] WIP
      9d67d85 [Michael Armbrust] Fix hive planner
      fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
      e742640 [Michael Armbrust] Remove unneeded changes and code.
      675e679 [Michael Armbrust] Upgrade paradise.
      0093376 [Michael Armbrust] Comment / indenting cleanup.
      d81f998 [Michael Armbrust] include schema for binding.
      0e889e8 [Michael Armbrust] Use typeOf instead tq
      f623ffd [Michael Armbrust] Quiet logging from test suite.
      efad14f [Michael Armbrust] Remove some half finished functions.
      92e74a4 [Michael Armbrust] add overrides
      a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
      84467468
    • Hari Shreedharan's avatar
      [STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current pu... · 800ecff4
      Hari Shreedharan authored
      ...sh model
      
      Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
      receiver fails, it currently has to be restarted on the same node to be able to receive data.
      
      This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
      DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
      Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
      multiple threads for better performance.
      
      Author: Hari Shreedharan <harishreedharan@gmail.com>
      Author: Hari Shreedharan <hshreedharan@apache.org>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: harishreedharan <hshreedharan@cloudera.com>
      
      Closes #807 from harishreedharan/master and squashes the following commits:
      
      e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
      96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
      e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
      5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
      981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
      7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
      a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
      1f47364 [Hari Shreedharan] Minor fixes.
      73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
      65b76b4 [Hari Shreedharan] Fixing the unit test.
      e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
      f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
      3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
      799509f [Hari Shreedharan] Fix a compile issue.
      3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
      d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
      10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
      1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
      8c00289 [Hari Shreedharan] More debug messages
      393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
      120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
      9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
      8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
      86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
      205034d [Hari Shreedharan] Merging master in
      4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
      bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
      0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
      3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
      70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
      d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
      e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
      9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
      c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
      0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
      800ecff4
  3. Jul 28, 2014
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix) · a7a9d144
      Cheng Lian authored
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar.
      
      In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:
      
      629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
      ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
      a7a9d144
  4. Jul 27, 2014
    • Patrick Wendell's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · e5bbce9a
      Patrick Wendell authored
      This reverts commit f6ff2a61.
      e5bbce9a
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · f6ff2a61
      Cheng Lian authored
      (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
      
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1600 from liancheng/jdbc and squashes the following commits:
      
      ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      f6ff2a61
  5. Jul 25, 2014
    • Michael Armbrust's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Michael Armbrust authored
      This reverts commit 06dc0d2c.
      
      #1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
      afd757a2
    • Yin Huai's avatar
      [SPARK-2682] Javadoc generated from Scala source code is not in javadoc's index · a19d8c89
      Yin Huai authored
      Add genjavadocSettings back to SparkBuild. It requires #1585 .
      
      https://issues.apache.org/jira/browse/SPARK-2682
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1584 from yhuai/SPARK-2682 and squashes the following commits:
      
      2e89461 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2682
      54e3b66 [Yin Huai] Add genjavadocSettings back.
      a19d8c89
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      06dc0d2c
  6. Jul 15, 2014
    • Yin Huai's avatar
      [SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer... · 8af46d58
      Yin Huai authored
      [SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer failed to resolve references in the format of "tableName.fieldName"
      
      Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for how to reproduce the problem and my understanding of the root cause.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1406 from yhuai/SPARK-2474 and squashes the following commits:
      
      96b1627 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2474
      af36d65 [Yin Huai] Fix comment.
      be86ba9 [Yin Huai] Correct SQL console settings.
      c43ad00 [Yin Huai] Wrap the relation in a Subquery named by the table name in OverrideCatalog.lookupRelation.
      a5c2145 [Yin Huai] Support sql/console.
      8af46d58
    • Takuya UESHIN's avatar
      [SPARK-2467] Revert SparkBuild to publish-local to both .m2 and .ivy2. · e2255e4b
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1398 from ueshin/issues/SPARK-2467 and squashes the following commits:
      
      7f01d58 [Takuya UESHIN] Revert SparkBuild to publish-local to both .m2 and .ivy2.
      e2255e4b
  7. Jul 11, 2014
    • Prashant Sharma's avatar
      [SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES · b23e9c3e
      Prashant Sharma authored
      NOTE: It is not possible to use both env variable  `SBT_MAVEN_PROFILES`  and `-P` flag at same time. `-P` if specified takes precedence.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1374 from ScrapCodes/SPARK-2437/rename-MAVEN_PROFILES and squashes the following commits:
      
      8694bde [Prashant Sharma] [SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES
      b23e9c3e
  8. Jul 10, 2014
    • Prashant Sharma's avatar
      [SPARK-1776] Have Spark's SBT build read dependencies from Maven. · 628932b8
      Prashant Sharma authored
      Patch introduces the new way of working also retaining the existing ways of doing things.
      
      For example build instruction for yarn in maven is
      `mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
      in sbt it can become
      `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
      Also supports
      `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:
      
      a8ac951 [Prashant Sharma] Updated sbt version.
      62b09bb [Prashant Sharma] Improvements.
      fa6221d [Prashant Sharma] Excluding sql from mima
      4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
      72651ca [Prashant Sharma] Addresses code reivew comments.
      acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
      ac4312c [Prashant Sharma] Revert "minor fix"
      6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
      65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
      446768e [Prashant Sharma] minor fix
      89b9777 [Prashant Sharma] Merge conflicts
      d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
      dccc8ac [Prashant Sharma] updated mima to check against 1.0
      a49c61b [Prashant Sharma] Fix for tools jar
      a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
      cf88758 [Prashant Sharma] cleanup
      9439ea3 [Prashant Sharma] Small fix to run-examples script.
      96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
      36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
      4973dbd [Patrick Wendell] Example build using pom reader.
      628932b8
  9. Jul 01, 2014
  10. Jun 20, 2014
    • Marcelo Vanzin's avatar
      Fix some tests. · 648553d4
      Marcelo Vanzin authored
      - JavaAPISuite was trying to compare a bare path with a URI. Fix by
        extracting the path from the URI, since we know it should be a
        local path anyway/
      
      - b9be1609 excluded the ASM dependency everywhere, but easymock needs
        it (because cglib needs it). So re-add the dependency, with test
        scope this time.
      
      The second one above actually uncovered a weird situation: the maven
      test target works, even though I can't find the class sbt complains
      about in its classpath. sbt complains with:
      
        [error] Uncaught exception when running org.apache.spark.util
        .random.RandomSamplerSuite: java.lang.NoClassDefFoundError:
        org/objectweb/asm/Type
      
      To avoid more weirdness caused by that, I explicitly added the asm
      dependency to both maven and sbt (for tests only), and verified
      the classes don't end up in the final assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #917 from vanzin/flaky-tests and squashes the following commits:
      
      d022320 [Marcelo Vanzin] Fix some tests.
      648553d4
  11. Jun 17, 2014
    • Yin Huai's avatar
      [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL · d2f4f30b
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2060
      
      Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
      
      Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #999 from yhuai/newJson and squashes the following commits:
      
      227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ce8eedd [Yin Huai] rxin's comments.
      bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      94ffdaa [Yin Huai] Remove "get" from method names.
      ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      79ea9ba [Yin Huai] Fix typos.
      5428451 [Yin Huai] Newline
      1f908ce [Yin Huai] Remove extra line.
      d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      7ea750e [Yin Huai] marmbrus's comments.
      6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      83013fb [Yin Huai] Update Java Example.
      e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
      6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4fbddf0 [Yin Huai] Programming guide.
      9df8c5a [Yin Huai] Python API.
      7027634 [Yin Huai] Java API.
      cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
      d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ab810b0 [Yin Huai] Make JsonRDD private.
      6df0891 [Yin Huai] Apache header.
      8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
      8ffed79 [Yin Huai] Update the example.
      a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
      65b87f0 [Yin Huai] Fix sampling...
      8846af5 [Yin Huai] API doc.
      52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      0387523 [Yin Huai] Address PR comments.
      666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      a2313a6 [Yin Huai] Address PR comments.
      f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
      0576406 [Yin Huai] Add Apache license header.
      af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
      f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
      d2f4f30b
  12. Jun 12, 2014
    • Doris Xin's avatar
      SPARK-1939 Refactor takeSample method in RDD to use ScaSRS · 1de1d703
      Doris Xin authored
      Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      Author: dorx <doris.s.xin@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #916 from dorx/takeSample and squashes the following commits:
      
      5b061ae [Doris Xin] merge master
      444e750 [Doris Xin] edge cases
      3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
      82dde31 [Xiangrui Meng] update pyspark's takeSample
      48d954d [Doris Xin] remove unused imports from RDDSuite
      fb1452f [Doris Xin] allowing num to be greater than count in all cases
      1481b01 [Doris Xin] washing test tubes and making coffee
      dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
      64e445b [Doris Xin] logwarnning as soon as it enters the while loop
      55518ed [Doris Xin] added TODO for logging in rdd.py
      eff89e2 [Doris Xin] addressed reviewer comments.
      ecab508 [Doris Xin] "fixed checkstyle violation
      0a9b3e3 [Doris Xin] "reviewer comment addressed"
      f80f270 [Doris Xin] Merge branch 'master' into takeSample
      ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
      065ebcd [Doris Xin] Merge branch 'master' into takeSample
      9bdd36e [Doris Xin] Check sample size and move computeFraction
      e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
      7cab53a [Doris Xin] fixed import bug in rdd.py
      ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
      1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
      1de1d703
    • Patrick Wendell's avatar
      SPARK-1843: Replace assemble-deps with env variable. · 1c04652c
      Patrick Wendell authored
      (This change is actually small, I moved some logic into
      compute-classpath that was previously in spark-class).
      
      Assemble deps has existed for a while to allow developers to
      run local code with new changes quickly. When I'm developing I
      typically use a simpler approach which just prepends the Spark
      classes to the classpath before the assembly jar. This is well
      defined in the JVM and the Spark classes take precedence over those
      in the assembly.
      
      This approach is portable across both builds which is the main reason I'd
      like to switch to it. It's also a bit easier to toggle on and off quickly.
      
      The way you use this is the following:
      ```
      $ ./bin/spark-shell # Use spark with the normal assembly
      $ export SPARK_PREPEND_CLASSES=true
      $ ./bin/spark-shell # Now it's using compiled classes
      $ unset SPARK_PREPEND_CLASSES
      $ ./bin/spark-shell # Back to normal
      ```
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #877 from pwendell/assemble-deps and squashes the following commits:
      
      8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps
      faa3168 [Patrick Wendell] Adding a warning for compatibility
      3f151a7 [Patrick Wendell] Small fix
      bbfb73c [Patrick Wendell] Review feedback
      328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
      1c04652c
  13. Jun 11, 2014
    • Prashant Sharma's avatar
      [SPARK-2069] MIMA false positives · 5b754b45
      Prashant Sharma authored
      Fixes SPARK 2070 and 2071
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1021 from ScrapCodes/SPARK-2070/package-private-methods and squashes the following commits:
      
      7979a57 [Prashant Sharma] addressed code review comments
      558546d [Prashant Sharma] A little fancy error message.
      59275ab [Prashant Sharma] SPARK-2071 Mima ignores classes and its members from previous versions too.
      0c4ff2b [Prashant Sharma] SPARK-2070 Ignore methods along with annotated classes.
      5b754b45
  14. Jun 06, 2014
    • witgo's avatar
      [SPARK-1841]: update scalatest to version 2.1.5 · 41c4a331
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #713 from witgo/scalatest and squashes the following commits:
      
      b627a6a [witgo] merge master
      51fb3d6 [witgo] merge master
      3771474 [witgo] fix RDDSuite
      996d6f9 [witgo] fix TimeStampedWeakValueHashMap test
      9dfa4e7 [witgo] merge bug
      1479b22 [witgo] merge master
      29b9194 [witgo] fix code style
      022a7a2 [witgo] fix test dependency
      a52c0fa [witgo] fix test dependency
      cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest
      046540d [witgo] fix RDDSuite.scala
      2c543b9 [witgo] fix ReplSuite.scala
      c458928 [witgo] update scalatest to version 2.1.5
      41c4a331
  15. Jun 05, 2014
    • Marcelo Vanzin's avatar
      Remove compile-scoped junit dependency. · 668cb1de
      Marcelo Vanzin authored
      This avoids having junit classes showing up in the assembly jar.
      I verified that only test classes in the jtransforms package
      use junit.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #794 from vanzin/junit-dep-exclusion and squashes the following commits:
      
      274e1c2 [Marcelo Vanzin] Remove junit from assembly in sbt build also.
      ad950be [Marcelo Vanzin] Remove compile-scoped junit dependency.
      668cb1de
  16. Jun 03, 2014
    • Reynold Xin's avatar
      SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. · 1faef149
      Reynold Xin authored
      I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #897 from rxin/hll and squashes the following commits:
      
      4d83f41 [Reynold Xin] New error bound and non-randomness.
      f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
      e367527 [Reynold Xin] One more round of code review.
      41e649a [Reynold Xin] Update final mima list.
      9e320c8 [Reynold Xin] Incorporate code review feedback.
      e110d70 [Reynold Xin] Merge branch 'master' into hll
      354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
      acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
      6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
      1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
      9221b27 [Reynold Xin] Merge branch 'master' into hll
      88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
      1294be6 [Reynold Xin] Updated HLL+.
      e7786cb [Reynold Xin] Merge branch 'master' into hll
      c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
      1faef149
    • tzolov's avatar
      Add support for Pivotal HD in the Maven build: SPARK-1992 · b1f28535
      tzolov authored
      Allow Spark to build against particular Pivotal HD distributions. For example to build Spark against Pivotal HD 2.0.1 one can run:
      ```
      mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package
      ```
      
      Author: tzolov <christian.tzolov@gmail.com>
      
      Closes #942 from tzolov/master and squashes the following commits:
      
      bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]
      b1f28535
  17. May 31, 2014
    • Michael Armbrust's avatar
      Optionally include Hive as a dependency of the REPL. · 7463cd24
      Michael Armbrust authored
      Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt.  That said, it is kinda nice to be able to launch all things directly from SBT when developing.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #801 from marmbrus/hiveRepl and squashes the following commits:
      
      9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL.
      7463cd24
  18. May 30, 2014
    • Prashant Sharma's avatar
      [SPARK-1971] Update MIMA to compare against Spark 1.0.0 · 79fa8fd4
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #910 from ScrapCodes/enable-mima/spark-core and squashes the following commits:
      
      79f3687 [Prashant Sharma] updated Mima to check against version 1.0
      1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.
      79fa8fd4
  19. May 29, 2014
  20. May 19, 2014
  21. May 15, 2014
    • witgo's avatar
      fix different versions of commons-lang dependency and apache/spark#746 addendum · bae07e36
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #754 from witgo/commons-lang and squashes the following commits:
      
      3ebab31 [witgo] merge master
      f3b8fa2 [witgo] merge master
      2083fae [witgo] repeat definition
      5599cdb [witgo] multiple version of sbt  dependency
      c1b66a1 [witgo] fix different versions of commons-lang dependency
      bae07e36
  22. May 14, 2014
  23. May 12, 2014
  24. May 10, 2014
    • Prashant Sharma's avatar
      Enabled incremental build that comes with sbt 0.13.2 · 70bcdef4
      Prashant Sharma authored
      More info at. https://github.com/sbt/sbt/issues/1010
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #525 from ScrapCodes/sbt-inc-opt and squashes the following commits:
      
      ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2
      70bcdef4
    • Sean Owen's avatar
      SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure · 2b7bd29e
      Sean Owen authored
      TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure.
      
      I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?)
      
      velvia notes:
      "I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."
      
      There are at least 3 versions of Netty in play in the build:
      
      - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
      - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
      - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
      
      The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.
      
      The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.
      
      But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.
      
      If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.
      
      So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:
      
      - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
      - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
      - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
      - Update SBT build accordingly
      
      A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #723 from srowen/SPARK-1789 and squashes the following commits:
      
      43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
      2b7bd29e
    • Michael Armbrust's avatar
      [SQL] Upgrade parquet library. · 4d605532
      Michael Armbrust authored
      I think we are hitting this issue in some perf tests: https://github.com/Parquet/parquet-mr/commit/6aed5288fd4a1398063a5a219b2ae4a9f71b02cf
      
      Credit to @aarondav !
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #684 from marmbrus/upgradeParquet and squashes the following commits:
      
      e10a619 [Michael Armbrust] Upgrade parquet library.
      4d605532
    • witgo's avatar
      [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar · 56151086
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #688 from witgo/SPARK-1644 and squashes the following commits:
      
      56ad6ac [witgo] review commit
      87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644
      6ffa7e4 [witgo] review commit
      a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
      56151086
  25. May 06, 2014
    • Matei Zaharia's avatar
      [SPARK-1549] Add Python support to spark-submit · 951a5d93
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
      951a5d93
    • Thomas Graves's avatar
      SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter · 1e829905
      Thomas Graves authored
      We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in spark on yarn but are not included it in the assembly jar.
      
      I tested this on yarn cluster by removing the yarn jars from the classpath and spark runs fine now.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #406 from tgravescs/SPARK-1474 and squashes the following commits:
      
      1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
      1e829905
  26. May 05, 2014
    • Sean Owen's avatar
      SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versions · 73b0cbcc
      Sean Owen authored
      See related discussion at https://github.com/apache/spark/pull/468
      
      This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.
      
      - Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
      - Removes `hadoop.major.version`
      - Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
        - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
        - like the jets3t version issue now
      - Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
      - _(YARN profiles in the parent now only exist to add the sub-module)_
      - Fixes the jets3t dependency issue
       - and makes it a runtime dependency
       - and centralizes config of this guy in the parent pom
      - Updates build docs
      - Updates SBT build too
        - and fixes a regex problem along the way
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #629 from srowen/SPARK-1556 and squashes the following commits:
      
      c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc
      a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles
      274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config
      bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)
      f21f356 [Sean Owen] Build changes to set up for jets3t fix
      73b0cbcc
  27. May 04, 2014
    • Sean Owen's avatar
      SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as... · f5041579
      Sean Owen authored
      SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
      
      For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #635 from srowen/SPARK-1629.2 and squashes the following commits:
      
      a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
      f5041579
Loading