  1. Mar 02, 2014
    • SPARK-1121: Include avro for yarn-alpha builds · c3f5e075
      Patrick Wendell authored
      This lets us explicitly include Avro based on a profile for 0.23.X
      builds. It makes me sad how convoluted it is to express this logic
      in Maven. @tgraves and @sryza curious if this works for you.
      
      I'm also considering just reverting to how it was before. The only
      real problem was that Spark advertised a dependency on Avro
      even though it only really depends transitively on Avro through
      other deps.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #49 from pwendell/avro-build-fix and squashes the following commits:
      
      8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
    • SPARK-1084.2 (resubmitted) · fd31adbf
      Sean Owen authored
      (Ported from https://github.com/apache/incubator-spark/pull/650 )
      
      This adds one more change, though, to fix the Scala version warning recently introduced by json4s.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #32 from srowen/SPARK-1084.2 and squashes the following commits:
      
      9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency
      1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
    • Ignore RateLimitedOutputStreamSuite for now. · 353ac6b4
      Reynold Xin authored
      This test has been flaky. We can re-enable it after @tdas has a chance to look at it.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #54 from rxin/ratelimit and squashes the following commits:
      
      1a12198 [Reynold Xin] Ignore RateLimitedOutputStreamSuite for now.
    • SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID · 46bcb955
      Aaron Davidson authored
      Previously, ZooKeeperPersistenceEngine would crash the whole Master process if
      there was stored data from a prior Spark version. Now, we just delete these files.
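      A minimal sketch of that recovery pattern in Scala (file-based persistence stands in for ZooKeeper here; this is illustrative, not the actual ZooKeeperPersistenceEngine code):

      ```scala
      import java.io.{File, FileInputStream, ObjectInputStream}

      // If persisted state fails to deserialize (e.g. a serialVersionUID
      // mismatch left behind by an older Spark version), delete it and
      // carry on instead of letting the exception kill the process.
      def readPersisted[T](file: File): Option[T] =
        try {
          val in = new ObjectInputStream(new FileInputStream(file))
          try Some(in.readObject().asInstanceOf[T]) finally in.close()
        } catch {
          case e: Exception =>
            System.err.println(s"Unreadable persisted state in $file, deleting: $e")
            file.delete()
            None
        }
      ```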
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #4 from aarondav/zookeeper2 and squashes the following commits:
      
      fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
    • Remove remaining references to incubation · 1fd2bfd3
      Patrick Wendell authored
      This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #51 from pwendell/tlp and squashes the following commits:
      
      d553b1b [Patrick Wendell] Remove remaining references to incubation
    • Update io.netty from 4.0.13.Final to 4.0.17.Final · b70823c9
      Binh Nguyen authored
      This update contains a lot of bug fixes and some new performance improvements.
      It is also binary compatible with the current 4.0.13.Final.
      
      For more information: http://netty.io/news/2014/02/25/4-0-17-Final.html
      
      Author: Binh Nguyen <ngbinh@gmail.com>
      
      Closes #41 from ngbinh/master and squashes the following commits:
      
      a9498f4 [Binh Nguyen] update io.netty to 4.0.17.Final
    • Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic. · 012bd5fb
      Michael Armbrust authored
      This allows developers to pass options (such as -D) to sbt. I also modified the SparkBuild to ensure Spark-specific properties are propagated to forked test JVMs.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #14 from marmbrus/sbtScripts and squashes the following commits:
      
      c008b18 [Michael Armbrust] Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
    • Initialized the regVal for first iteration in SGD optimizer · 6fc76e49
      DB Tsai authored
      Ported from https://github.com/apache/incubator-spark/pull/633
      
      In runMiniBatchSGD, the regVal (for the first iteration) should be initialized
      as the sum of squares of the weights for an L2 update; for an L1 update, the same logic is followed.

      It may not be important here for SGD, since the updater doesn't take the loss
      as a parameter to find the new weights, but it will give us the correct history of loss.
      However, for the LBFGS optimizer we implemented, the correct loss with regVal is crucial
      for finding the new weights.
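      For illustration, a hedged sketch of the term in question (not MLlib's actual updater code): for an L2 updater the regularization value is 0.5 * regParam * ||w||^2, and computing it for the initial weights before iteration 1 makes the loss history correct from the first entry.

      ```scala
      // Illustrative only: the L2 regularization value for a weight vector.
      def l2RegVal(weights: Array[Double], regParam: Double): Double = {
        val normSquared = weights.map(w => w * w).sum  // ||w||^2
        0.5 * regParam * normSquared
      }

      // Before iteration 1:
      // loss(0) = dataLoss(initialWeights) + l2RegVal(initialWeights, regParam)
      ```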
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #40 from dbtsai/dbtsai-smallRegValFix and squashes the following commits:
      
      77d47da [DB Tsai] In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed.
  2. Mar 01, 2014
    • [SPARK-1100] prevent Spark from overwriting directory silently · 3a8b698e
      CodingCat authored
      Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).

      The current saveAsTextFile/saveAsSequenceFile will silently overwrite the output directory if it already exists. This behaviour is not desirable because:

      overwriting the data silently is not user-friendly

      if the number of partitions changes between two write operations, the output directory will contain the results generated by both runs
      
      My fix includes:

      adding new APIs with a flag for users to specify whether they want to overwrite the directory:
      if the flag is set to true, the output directory is deleted first and the new data is then written into it, so that the output directory never contains results from multiple rounds of running;

      if the flag is set to false, Spark will throw an exception if the output directory already exists

      changing the Java API part

      the default behaviour is overwriting (see the sketch below)
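      A hedged sketch of how such a flag could behave (names and signatures here are illustrative, not the patch's exact API):

      ```scala
      import org.apache.hadoop.fs.{FileSystem, Path}
      import org.apache.spark.SparkContext
      import org.apache.spark.rdd.RDD

      // Hypothetical helper: delete the directory first when overwrite = true,
      // otherwise fail fast if it already exists.
      def saveTextFile(sc: SparkContext, data: RDD[String],
                       dir: String, overwrite: Boolean): Unit = {
        val path = new Path(dir)
        val fs = FileSystem.get(sc.hadoopConfiguration)
        if (fs.exists(path)) {
          if (overwrite) fs.delete(path, true)  // recursively remove old results
          else throw new RuntimeException(s"Output directory $dir already exists")
        }
        data.saveAsTextFile(dir)
      }
      ```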
      
      Two questions:

      should we deprecate the old APIs without such a flag?

      I noticed that Spark Streaming also calls these APIs; I assume we don't need to change the related part in Streaming? @tdas
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #11 from CodingCat/SPARK-1100 and squashes the following commits:
      
      6a4e3a3 [CodingCat] code clean
      ef2d43f [CodingCat] add new test cases and code clean
      ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat
      ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
    • [SPARK-1150] fix repo location in create script (re-open) · fe195ae1
      CodingCat authored
      Reopened for https://spark-project.atlassian.net/browse/SPARK-1150
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #52 from CodingCat/script_fixes and squashes the following commits:
      
      fc05a71 [CodingCat] fix repo location in create script
    • Revert "[SPARK-1150] fix repo location in create script" · ec992e18
      Patrick Wendell authored
      This reverts commit 9aa09571.
    • [SPARK-1150] fix repo location in create script · 9aa09571
      Mark Grover authored
      https://spark-project.atlassian.net/browse/SPARK-1150
      
      fix the repo location in create_release script
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #48 from CodingCat/script_fixes and squashes the following commits:
      
      01f4bf7 [Mark Grover] Fixing some nitpicks
      d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY
    • [SPARK-979] Randomize order of offers. · 556c5668
      Kay Ousterhout authored
      This commit randomizes the order of resource offers to avoid scheduling
      all tasks on the same small set of machines.
      
      This is a much simpler solution to SPARK-979 than #7.
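      The change is conceptually small; a hedged sketch (hypothetical types, not the scheduler's actual code):

      ```scala
      import scala.util.Random

      // Hypothetical stand-in for the scheduler's resource-offer type.
      case class WorkerOffer(executorId: String, host: String, cores: Int)

      // Shuffle the offers each scheduling round so tasks don't always land
      // on the same few machines that happen to appear first in the list.
      def randomizedOffers(offers: Seq[WorkerOffer]): Seq[WorkerOffer] =
        Random.shuffle(offers)
      ```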
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #27 from kayousterhout/randomize and squashes the following commits:
      
      435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.
  3. Feb 26, 2014
    • Updated link for pyspark examples in docs · 26450351
      Jyotiska NK authored
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #22 from jyotiska/pyspark_docs and squashes the following commits:
      
      426136c [Jyotiska NK] Updated link for pyspark examples
    • Deprecated and added a few Java API methods for the corresponding Scala APIs. · 0e40e2b1
      Prashant Sharma authored
      PR [402](https://github.com/apache/incubator-spark/pull/402), ported from the incubator repo.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits:
      
      11d0c2b [Prashant Sharma] Integer -> java.lang.Integer
      737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs.
      3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs
    • Removed reference to incubation in README.md. · 84f7ca13
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1 from rxin/readme and squashes the following commits:
      
      b3a77cd [Reynold Xin] Removed reference to incubation in README.md.
    • SPARK-1115: Catch depickling errors · 12738c1a
      Bouke van der Bijl authored
      This surrounds the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason.
      
      @JoshRosen
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #644 from bouk/catch-depickling-errors and squashes the following commits:
      
      f0f67cc [Bouke van der Bijl] Lol indentation
      0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
    • SPARK-1135: fix broken anchors in docs · c86eec58
      Matei Zaharia authored
      A recent PR that added Java vs Scala tabs for streaming also
      inadvertently added some bad code to a document.ready handler, breaking
      our other handler that manages scrolling to anchors correctly with the
      floating top bar. As a result the section title ended up always being
      hidden below the top bar. This removes the unnecessary JavaScript code.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #3 from mateiz/doc-links and squashes the following commits:
      
      e2a3488 [Matei Zaharia] SPARK-1135: fix broken anchors in docs
    • SPARK-1078: Replace lift-json with json4s-jackson. · fbedc8ef
      William Benton authored
      The aim of the Json4s project is to provide a common API for
      Scala JSON libraries.  It is Apache-licensed, easier for
      downstream distributions to package, and mostly API-compatible
      with lift-json.  Furthermore, the Jackson-backed implementation
      parses faster than lift-json on all but the smallest inputs.
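      For reference, a small usage sketch of the json4s-jackson API being adopted (details illustrative):

      ```scala
      import org.json4s._
      import org.json4s.jackson.JsonMethods._

      object Json4sExample extends App {
        // Parse a JSON string into the json4s AST, navigate it, render it back.
        val json = parse("""{"name": "spark", "stable": true}""")
        val JString(name) = json \ "name"  // extract a field by pattern matching
        println(name)                      // spark
        println(compact(render(json)))     // {"name":"spark","stable":true}
      }
      ```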
      
      Author: William Benton <willb@redhat.com>
      
      Closes #582 from willb/json4s and squashes the following commits:
      
      7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
    • SPARK-1053. Don't require SPARK_YARN_APP_JAR · b8a18719
      Sandy Ryza authored
      It looks like this just requires taking out the checks.
      
      I verified that, with the patch, I was able to run spark-shell through yarn without setting the environment variable.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #553 from sryza/sandy-spark-1053 and squashes the following commits:
      
      b037676 [Sandy Ryza] SPARK-1053.  Don't require SPARK_YARN_APP_JAR
  4. Feb 25, 2014
    • For SPARK-1082, Use Curator for ZK interaction in standalone cluster · c852201c
      Raymond Liu authored
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #611 from colorant/curator and squashes the following commits:
      
      7556aa1 [Raymond Liu] Address review comments
      af92e1f [Raymond Liu] Fix coding style
      964f3c2 [Raymond Liu] Ignore NodeExists exception
      6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
    • Graph primitives2 · 1f4c7f7e
      Semih Salihoglu authored
      Hi guys,
      
      I'm following Joey and Ankur's suggestions to add collectEdges and pickRandomVertex. I'm also adding the tests for collectEdges and refactoring one method getCycleGraph in GraphOpsSuite.scala.
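      A small usage sketch of the two new primitives (based on the names in this PR; exact signatures may differ):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.graphx._

      val sc = new SparkContext("local", "graph-primitives")
      val edges = sc.parallelize(Seq(Edge(1L, 2L, "a"), Edge(2L, 3L, "b")))
      val graph = Graph.fromEdges(edges, defaultValue = 0)

      // collectEdges: for each vertex, gather the adjacent edges in a direction.
      graph.collectEdges(EdgeDirection.Either)
        .collect()
        .foreach { case (id, es) => println(s"vertex $id: ${es.length} edge(s)") }

      // pickRandomVertex: pick an arbitrary vertex id, e.g. to seed a traversal.
      val start: VertexId = graph.pickRandomVertex()
      ```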
      
      Thank you,
      
      semih
      
      Author: Semih Salihoglu <semihsalihoglu@gmail.com>
      
      Closes #580 from semihsalihoglu/GraphPrimitives2 and squashes the following commits:
      
      937d3ec [Semih Salihoglu] - Fixed the scalastyle errors.
      a69a152 [Semih Salihoglu] - Adding collectEdges and pickRandomVertices. - Adding tests for collectEdges. - Refactoring a getCycle utility function for GraphOpsSuite.scala.
      41265a6 [Semih Salihoglu] - Adding collectEdges and pickRandomVertex. - Adding tests for collectEdges. - Recycling a getCycle utility test file.
  5. Feb 24, 2014
    • Include reference to twitter/chill in tuning docs · a4f4fbc8
      Andrew Ash authored
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #647 from ash211/doc-tuning and squashes the following commits:
      
      b87de0a [Andrew Ash] Include reference to twitter/chill in tuning docs
    • For OutputFormats that are Configurable, call setConf before sending data to them. · 4d880304
      Bryn Keller authored
      [SPARK-1108] This allows us to use, e.g. HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured.
      
      Note this bug also affects branch-0.9
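      The shape of the fix, as a hedged sketch (illustrative, not the exact PairRDDFunctions change):

      ```scala
      import org.apache.hadoop.conf.{Configurable, Configuration}

      // After instantiating the OutputFormat, hand it the Hadoop configuration
      // if it is Configurable, so formats like HBase's TableOutputFormat can
      // read settings such as the output table name before any data arrives.
      def prepareOutputFormat[F](formatClass: Class[F], conf: Configuration): F = {
        val format = formatClass.newInstance()
        format match {
          case c: Configurable => c.setConf(conf)
          case _ =>  // plain OutputFormats need no extra configuration step
        }
        format
      }
      ```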
      
      Author: Bryn Keller <bryn.keller@intel.com>
      
      Closes #638 from xoltar/SPARK-1108 and squashes the following commits:
      
      7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review
      7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
    • Merge pull request #641 from mateiz/spark-1124-master · d8d190ef
      Matei Zaharia authored
      SPARK-1124: Fix infinite retries of reduce stage when a map stage failed
      
      In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null". See https://spark-project.atlassian.net/browse/SPARK-1124 for an example.
      
      This PR also cleans up code style slightly where there was a variable named "s" and some weird map manipulation.
    • Fix removal from shuffleToMapStage to search for a key-value pair with · 0187cef0
      Matei Zaharia authored
      our stage instead of using our shuffleID.
    • SPARK-1124: Fix infinite retries of reduce stage when a map stage failed · cd32d5e4
      Matei Zaharia authored
      In the previous code, if you had a failing map stage and then tried to
      run reduce stages on it repeatedly, the first reduce stage would fail
      correctly, but the later ones would mistakenly believe that all map
      outputs are available and start failing infinitely with fetch failures
      from "null".
  6. Feb 23, 2014
    • SPARK-1071: Tidy logging strategy and use of log4j · c0ef3afa
      Sean Owen authored
      Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger.
      
      Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j.
      
      This leaves the same behavior in Spark. It means that downstream users who want to use something other than log4j should:
      
      - Exclude dependencies on log4j, slf4j-log4j12 from Spark
      - Include dependency on log4j-over-slf4j
      - Include dependency on another logger X, and another slf4j-X
      - Recreate any log config that Spark does, that is needed, in the other logger's config
      
      That sounds about right.
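      As a sketch, that downstream recipe might look like this in an sbt build (the logback choice and versions are illustrative):

      ```scala
      // Hypothetical downstream build.sbt: swap Spark's log4j binding for logback.
      libraryDependencies += ("org.apache.spark" %% "spark-core" % "0.9.0")
        .exclude("log4j", "log4j")              // drop log4j itself
        .exclude("org.slf4j", "slf4j-log4j12")  // and the SLF4J -> log4j binding

      libraryDependencies += "org.slf4j" % "log4j-over-slf4j" % "1.7.5"       // route log4j calls into SLF4J
      libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.0.13"  // the replacement backend
      ```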
      
      Here are the key changes:
      
      - Include the jcl-over-slf4j shim everywhere by depending on it in core.
      - Exclude dependencies on commons-logging from third-party libraries.
      - Include the jul-to-slf4j shim everywhere by depending on it in core.
      - Exclude slf4j-* dependencies from third-party libraries to prevent collisions or warnings
      - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
      
      And minor/incidental changes:
      
      - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2
      - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
      - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #570 from srowen/SPARK-1071 and squashes the following commits:
      
      52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
      77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j