  1. Aug 05, 2014
    • Thomas Graves's avatar
      SPARK-1528 - spark on yarn, add support for accessing remote HDFS · 2c0f705e
      Thomas Graves authored
      Add a config (spark.yarn.access.namenodes) to allow applications running on YARN to access other secure HDFS clusters. The user just specifies the namenodes of the other clusters, and we obtain tokens for those and ship them with the Spark application.
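For illustration, a hypothetical spark-defaults.conf entry for this option might look like the following (the cluster hostnames are invented; check the Spark YARN docs for the exact value format):

```
# spark-defaults.conf -- hypothetical remote secure HDFS clusters
spark.yarn.access.namenodes  hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020
```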
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1159 from tgravescs/spark-1528 and squashes the following commits:
      
      ddbcd16 [Thomas Graves] review comments
      0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS
      2c0f705e
    • jerryshao's avatar
      [SPARK-1022][Streaming] Add Kafka real unit test · e87075df
      jerryshao authored
      This PR is an updated version of (https://github.com/apache/spark/pull/557) that actually tests sending and receiving data through Kafka, and fixes the previous flaky issues.
      
      @tdas, would you mind reviewing this PR? Thanks a lot.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1751 from jerryshao/kafka-unit-test and squashes the following commits:
      
      b6a505f [jerryshao] code refactor according to comments
      5222330 [jerryshao] Change JavaKafkaStreamSuite to better test it
      5525f10 [jerryshao] Fix flaky issue of Kafka real unit test
      4559310 [jerryshao] Minor changes for Kafka unit test
      860f649 [jerryshao] Minor style changes, and tests ignored due to flakiness
      796d4ca [jerryshao] Add real Kafka streaming test
      e87075df
    • Reynold Xin's avatar
      [SPARK-2856] Decrease initial buffer size for Kryo to 64KB. · 184048f8
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1780 from rxin/kryo-init-size and squashes the following commits:
      
      551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
      184048f8
    • wangfei's avatar
      [SPARK-1779] Throw an exception if memory fractions are not between 0 and 1 · 9862c614
      wangfei authored
      Author: wangfei <scnbwf@yeah.net>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #714 from scwf/memoryFraction and squashes the following commits:
      
      6e385b9 [wangfei] Update SparkConf.scala
      da6ee59 [wangfei] add configs
      829a195 [wangfei] add indent
      717c0ca [wangfei] updated to make more concise
      fc45476 [wangfei] validate memoryfraction in sparkconf
      2e79b3d [wangfei] && => ||
      43621bd [wangfei] && => ||
      cf38bcf [wangfei] throw IllegalArgumentException
      14d18ac [wangfei] throw IllegalArgumentException
      dff1f0f [wangfei] Update BlockManager.scala
      764965f [wangfei] Update ExternalAppendOnlyMap.scala
      a59d76b [wangfei] Throw exception when memoryFracton is out of range
      7b899c2 [wangfei] 【SPARK-1779】
      9862c614
    • Andrew Or's avatar
      [SPARK-2857] Correct properties to set Master / Worker ports · a646a365
      Andrew Or authored
      `master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1779 from andrewor14/master-worker-port and squashes the following commits:
      
      8475e95 [Andrew Or] Update docs to reflect changes in configs
      4db3d5d [Andrew Or] Stop using configs that don't actually work
      a646a365
    • Matei Zaharia's avatar
      SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections · 4fde28c2
      Matei Zaharia authored
      This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before), and also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).
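A toy model of the grant policy described above, in Python (names are illustrative, not Spark's actual ShuffleMemoryManager API; the real manager also makes a thread wait until its 1/2N minimum is available, which this sketch omits in favor of partial grants):

```python
class ShuffleMemoryPool:
    """Toy model: with N active threads, each thread may hold at most
    pool/N bytes, so every thread is guaranteed pool/(2N) before any
    spill is forced."""

    def __init__(self, pool_bytes):
        self.pool = pool_bytes
        self.held = {}  # thread_id -> bytes currently held

    def try_acquire(self, thread_id, num_bytes):
        self.held.setdefault(thread_id, 0)
        n = len(self.held)               # number of active threads
        cap = self.pool / n              # hard per-thread cap: 1/N of the pool
        free = self.pool - sum(self.held.values())
        grant = max(min(num_bytes, cap - self.held[thread_id], free), 0)
        self.held[thread_id] += grant
        return grant  # may be partial; 0 means "spill now"

pool = ShuffleMemoryPool(1000)
print(pool.try_acquire("t1", 800))  # 800 (alone, capped at 1000/1)
print(pool.try_acquire("t2", 800))  # 200 (cap is 500, but only 200 is free)
```

With two threads the second request is capped well below its ask, which is exactly what prevents one collection from starving the rest.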
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1707 from mateiz/spark-2711 and squashes the following commits:
      
      debf75b [Matei Zaharia] Review comments
      24f28f3 [Matei Zaharia] Small rename
      c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
      315e3a5 [Matei Zaharia] Some review comments
      b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections
      4fde28c2
    • Matei Zaharia's avatar
      SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove() · 066765d6
      Matei Zaharia authored
      Replaces this with an O(1) operation that does not have to shift over
      the whole tail of the array into the gap produced by the element removed.
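The trick can be sketched in Python (the actual change is in Scala's ExternalAppendOnlyMap): instead of `remove(i)`, which shifts the whole tail left, overwrite slot i with the last element and pop.

```python
def swap_remove(buf, i):
    """Remove buf[i] in O(1) by replacing it with the last element,
    instead of shifting the whole tail left as buf.pop(i) would."""
    buf[i] = buf[-1]
    buf.pop()

xs = [10, 20, 30, 40]
swap_remove(xs, 1)
print(xs)  # [10, 40, 30] -- order is not preserved, which is fine here
```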
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
      
      1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse ArrayBuffers
      eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid buffer.remove()
      066765d6
  2. Aug 04, 2014
    • Reynold Xin's avatar
      [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext · 05bf4e4a
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1772 from rxin/accumulator-dagscheduler and squashes the following commits:
      
      6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext.
      05bf4e4a
    • Davies Liu's avatar
      [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple · 9fd82dbb
      Davies Liu authored
      The serializer module is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times.
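The idempotence fix can be sketched like this (a simplified stand-in for PySpark's actual `_hijack_namedtuple`; the patching body is elided): mark the replacement function so that a second call becomes a no-op.

```python
import collections

def _hijack_namedtuple():
    """Replace collections.namedtuple with a wrapper, but only once:
    re-importing the module must not wrap the wrapper again."""
    if getattr(collections.namedtuple, "_already_hijacked", False):
        return  # safe to call repeatedly (e.g. from repeated doctests)
    original = collections.namedtuple

    def hijacked(*args, **kwargs):
        cls = original(*args, **kwargs)
        # ... the real hook patches cls for picklability here ...
        return cls

    hijacked._already_hijacked = True
    collections.namedtuple = hijacked

_hijack_namedtuple()
first = collections.namedtuple
_hijack_namedtuple()                  # second call is a no-op
print(collections.namedtuple is first)  # True
```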
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1771 from davies/fix and squashes the following commits:
      
      1a9e336 [Davies Liu] fix unit tests
      9fd82dbb
    • Matei Zaharia's avatar
      SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter · 8e7d5ba1
      Matei Zaharia authored
      All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes makes sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and that would confuse the previous code into reading them as part of the next batch. There are also cleanup improvements to make sure files are closed.
      
      In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
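The problem and fix can be sketched as follows (the file format here is illustrative, not Spark's actual spill format): record each batch's exact byte length at write time, then read back exactly that many bytes, so a serializer's trailing bytes are never misread as the start of the next batch.

```python
import io
import pickle

TRAILER = b"\x70"  # stand-in for trailing serializer bytes like TC_RESET

def write_batches(f, batches):
    """Write each batch and record its exact byte length."""
    sizes = []
    for batch in batches:
        data = pickle.dumps(batch) + TRAILER
        f.write(data)
        sizes.append(len(data))
    return sizes

def read_batches(f, sizes):
    """Read exactly sizes[i] bytes per batch -- never more, never less --
    so one batch's trailing bytes don't bleed into the next batch."""
    for size in sizes:
        chunk = f.read(size)
        yield pickle.loads(chunk[:-len(TRAILER)])

buf = io.BytesIO()
sizes = write_batches(buf, [[1, 2], [3, 4]])
buf.seek(0)
print(list(read_batches(buf, sizes)))  # [[1, 2], [3, 4]]
```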
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1722 from mateiz/spark-2792 and squashes the following commits:
      
      5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
      18fe865 [Matei Zaharia] Update docs on objectStreamReset
      576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
      0374217 [Matei Zaharia] Remove super paranoid code to close file handles
      bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
      0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
      9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
      8e7d5ba1
    • Davies Liu's avatar
      [SPARK-1687] [PySpark] pickable namedtuple · 59f84a95
      Davies Liu authored
      Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs.
      
      PS: pyspark should be imported BEFORE "from collections import namedtuple"
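The core of the trick can be sketched in plain Python (names here are illustrative; the real hook patches `collections.namedtuple` itself): give the generated class a custom `__reduce__` that rebuilds the class at unpickle time, so instances pickle even when the class was defined in `__main__`.

```python
import pickle
from collections import namedtuple

def _restore(name, fields, values):
    """Rebuild the namedtuple class at unpickle time, then the instance."""
    return namedtuple(name, fields)(*values)

def picklable_namedtuple(name, fields):
    """Like namedtuple, but instances survive pickling even when the
    class was created dynamically (e.g. in __main__)."""
    cls = namedtuple(name, fields)
    cls.__reduce__ = lambda self: (_restore, (name, fields, tuple(self)))
    return cls

Point = picklable_namedtuple("Point", ["x", "y"])
p = pickle.loads(pickle.dumps(Point(1, 2)))
print(p.x, p.y)  # 1 2
```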
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1623 from davies/namedtuple and squashes the following commits:
      
      045dad8 [Davies Liu] remove unrelated code changes
      4132f32 [Davies Liu] address comment
      55b1c1a [Davies Liu] fix tests
      61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
      98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      f7b1bde [Davies Liu] add hack for CloudPickleSerializer
      0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
      93b03b8 [Davies Liu] pickable namedtuple
      59f84a95
    • Liquan Pei's avatar
      [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words · e053c558
      Liquan Pei authored
      This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
      
      To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
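The partition-and-merge scheme can be sketched like this (toy vectors and hypothetical names; the real combine step merges full model state and uses a weighted sum rather than the plain average shown here):

```python
def merge_partition_models(models):
    """Average per-partition word vectors into one model.
    `models` is a list of dicts: word -> vector (list of floats)."""
    merged = {}
    counts = {}
    for model in models:
        for word, vec in model.items():
            acc = merged.setdefault(word, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[word] = counts.get(word, 0) + 1
    for word, acc in merged.items():
        merged[word] = [v / counts[word] for v in acc]
    return merged

part1 = {"china": [1.0, 0.0]}                       # model from partition 1
part2 = {"china": [0.0, 1.0], "japan": [1.0, 1.0]}  # model from partition 2
print(merge_partition_models([part1, part2]))
```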
      
      One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:
      
      taiwan 0.8077646146334014
      korea 0.740913304563621
      japan 0.7240667798885471
      republic 0.7107151279078352
      thailand 0.6953217332072862
      tibet 0.6916782118129544
      mongolia 0.6800858715972612
      macau 0.6794925677480378
      singapore 0.6594048695593799
      manchuria 0.658989931844148
      laos 0.6512978726001666
      nepal 0.6380792327845325
      mainland 0.6365469459587788
      myanmar 0.6358614338840394
      macedonia 0.6322366180313249
      xinjiang 0.6285291551708028
      russia 0.6279951236068411
      india 0.6272874944023487
      shanghai 0.6234544135576999
      macao 0.6220588462925876
      
      The result with 10 partitions and 5 iterations is:
      taiwan 0.8310495079388313
      india 0.7737171315919039
      japan 0.756777901233668
      korea 0.7429767187102452
      indonesia 0.7407557427278356
      pakistan 0.712883426985585
      mainland 0.7053379963140822
      thailand 0.696298191073948
      mongolia 0.693690656871415
      laos 0.6913069680735292
      macau 0.6903427690029617
      republic 0.6766381604813666
      malaysia 0.676460699141784
      singapore 0.6728790997360923
      malaya 0.672345232966194
      manchuria 0.6703732292753156
      macedonia 0.6637955686322028
      myanmar 0.6589462882439646
      kazakhstan 0.657017801081494
      cambodia 0.6542383836451932
      
      Author: Liquan Pei <lpei@gopivotal.com>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #1719 from Ishiihara/master and squashes the following commits:
      
      2ba9483 [Liquan Pei] minor fix for Word2Vec test
      e248441 [Liquan Pei] minor style change
      26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
      c14da41 [Xiangrui Meng] fix styles
      384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
      e93e726 [Liquan Pei] use treeAggregate instead of aggregate
      1a8fb41 [Liquan Pei] use weighted sum in combOp
      7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
      6bcc8be [Liquan Pei] add multiple iteration support
      720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
      2e92b59 [Liquan Pei] modify according to feedback
      57dc50d [Liquan Pei] code formatting
      e4a04d3 [Liquan Pei] minor fix
      0aafb1b [Liquan Pei] Add comments, minor fixes
      8d6befe [Liquan Pei] initial commit
      e053c558
  3. Aug 03, 2014
    • DB Tsai's avatar
      SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent... · ae58aea2
      DB Tsai authored
      SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
      
      Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.
      
      In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.
      
      There are two implementations of `VectorTransformer` now, and both can be easily extended with PMML transformation support.
      
      1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
      
      2) `Normalizer` - Normalizes samples individually to unit L^n norm
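The two transformations can be sketched in plain Python (population variance is used here for concreteness; the real scaler computes its statistics from column summaries over the training set and its variance estimator may differ):

```python
import math

def standard_scale(column):
    """StandardScaler idea: remove the column mean, scale to unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]

def normalize(vector, p=2):
    """Normalizer idea: scale one sample to unit L^p norm."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector]

print(standard_scale([1.0, 2.0, 3.0]))  # zero mean, unit variance
print(normalize([3.0, 4.0]))            # unit L^2 norm
```

Note the difference in direction: StandardScaler works column-wise across samples, Normalizer works row-wise on each sample.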
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
      
      78c15d3 [DB Tsai] Alpine Data Labs
      ae58aea2
    • Sarah Gerweck's avatar
      Fix some bugs with spaces in directory name. · 5507dd8e
      Sarah Gerweck authored
      Any time you use the directory name (`FWDIR`) it needs to be surrounded
      in quotes. If you're also using wildcards, you can safely put the quotes
      around just `$FWDIR`.
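A minimal illustration of the quoting rule (the directory name here is hypothetical):

```shell
FWDIR="/tmp/spark test"          # hypothetical directory name with a space

set -- $FWDIR/bin                # unquoted: word-splits into two arguments
echo "unquoted: $# words"

set -- "$FWDIR"/bin              # quoted; a wildcard could sit outside the quotes
echo "quoted: $# words"
```

The unquoted form yields two words ("/tmp/spark" and "test/bin"), so any command receiving it sees two broken paths; the quoted form yields one.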
      
      Author: Sarah Gerweck <sarah.a180@gmail.com>
      
      Closes #1756 from sarahgerweck/folderSpaces and squashes the following commits:
      
      732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
      5507dd8e
    • Anand Avati's avatar
      [SPARK-2810] upgrade to scala-maven-plugin 3.2.0 · 6ba6c3eb
      Anand Avati authored
      Needed for Scala 2.11 compiler-interface
      
      Signed-off-by: Anand Avati <avati@redhat.com>
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the following commits:
      
      9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0
      6ba6c3eb
    • Davies Liu's avatar
      [SPARK-1740] [PySpark] kill the python worker · 55349f9f
      Davies Liu authored
      Kill only the python worker related to cancelled tasks.
      
      The daemon will start a background thread to monitor all the opened sockets for all workers. If a socket is closed by the JVM, this thread will kill the worker.
      
      When a task is cancelled, the socket to the worker will be closed, and the worker will then be killed by the daemon.
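The monitoring idea can be sketched with a socket pair (simplified stand-in; the real daemon watches many worker sockets and kills OS processes rather than setting a flag):

```python
import socket
import threading

def monitor(sock, kill_worker):
    """Daemon-side watcher: when the JVM closes its end of the socket,
    recv() returns b'' and we kill the corresponding worker."""
    while True:
        if sock.recv(1024) == b"":   # peer closed -> task was cancelled
            kill_worker()
            return

jvm_end, daemon_end = socket.socketpair()
killed = threading.Event()           # stand-in for killing a worker process
t = threading.Thread(target=monitor, args=(daemon_end, killed.set))
t.start()

jvm_end.close()                      # simulate the JVM closing the socket
t.join(timeout=5)
print("worker killed:", killed.is_set())  # worker killed: True
```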
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1643 from davies/kill and squashes the following commits:
      
      8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
      46ca150 [Davies Liu] address comment
      acd751c [Davies Liu] kill the worker when task is canceled
      55349f9f
    • Yin Huai's avatar
      [SPARK-2783][SQL] Basic support for analyze in HiveContext · e139e2be
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2783
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1741 from yhuai/analyzeTable and squashes the following commits:
      
      7bb5f02 [Yin Huai] Use sql instead of hql.
      4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      e3ebcd4 [Yin Huai] Renaming.
      c170f4e [Yin Huai] Do not use getContentSummary.
      62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      db233a6 [Yin Huai] Trying to debug jenkins...
      fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      f0501f3 [Yin Huai] Fix compilation error.
      24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      8918140 [Yin Huai] Wording.
      23df227 [Yin Huai] Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.
      e139e2be
    • Cheng Lian's avatar
      [SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native commands · ac33cbbf
      Cheng Lian authored
      JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1753 from liancheng/spark-2814 and squashes the following commits:
      
      c74a3b2 [Cheng Lian] Fixed SPARK-2814
      ac33cbbf
    • Michael Armbrust's avatar
      [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect' · 236dfac6
      Michael Armbrust authored
      Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read hive tables.  In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing.  For SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
      
      The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
      
      **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
      
      For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
      
      ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
      20c43f8 [Michael Armbrust] override function instead of just setting the value
      7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
      236dfac6
    • Joseph K. Bradley's avatar
      [SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use · 2998e38a
      Joseph K. Bradley authored
      Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
      * Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.
      
      Other improvements to Decision Trees for ease of use with Java:
      * impurity classes: Added instance() methods to help with Java interface.
      * Strategy: Added Java-friendly constructor
      --> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
      
      0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
      519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
      f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
      d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
      320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
      13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
      f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      2998e38a
    • Allan Douglas R. de Oliveira's avatar
      SPARK-2246: Add user-data option to EC2 scripts · a0bcbc15
      Allan Douglas R. de Oliveira authored
      Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>
      
      Closes #1186 from douglaz/spark_ec2_user_data and squashes the following commits:
      
      94a36f9 [Allan Douglas R. de Oliveira] Added user data option to EC2 script
      a0bcbc15
    • Stephen Boesch's avatar
      SPARK-2712 - Add a small note to maven doc that mvn package must happen ... · f8cd143b
      Stephen Boesch authored
      Per request by Reynold adding small note about proper sequencing of build then test.
      
      Author: Stephen Boesch <javadba@gmail.com>
      
      Closes #1615 from javadba/docs and squashes the following commits:
      
      6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
      5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test
      f8cd143b
    • Andrew Or's avatar
      [Minor] Fixes on top of #1679 · 3dc55fdf
      Andrew Or authored
      Minor fixes on top of #1679.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:
      
      3b46f5e [Andrew Or] Minor fixes
      3dc55fdf
  4. Aug 02, 2014
    • Sean Owen's avatar
      SPARK-2414 [BUILD] Add LICENSE entry for jquery · 9cf429aa
      Sean Owen authored
      The JIRA concerned removing jquery, and this does not remove jquery. But while jquery is distributed by Spark, strictly speaking it should have an accompanying line in LICENSE, as per http://www.apache.org/dev/licensing-howto.html
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1748 from srowen/SPARK-2414 and squashes the following commits:
      
      2fdb03c [Sean Owen] Add LICENSE entry for jquery
      9cf429aa
    • Sean Owen's avatar
      SPARK-2602 [BUILD] Tests steal focus under Java 6 · 33f167d7
      Sean Owen authored
      As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be resolved for Java 6 with the java.awt.headless system property, which never hurt anyone running a command line app. I tested it and seemed to get rid of focus stealing.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1747 from srowen/SPARK-2602 and squashes the following commits:
      
      b141018 [Sean Owen] Set java.awt.headless during tests
      33f167d7
    • Michael Armbrust's avatar
      [SPARK-2739][SQL] Rename registerAsTable to registerTempTable · 1a804373
      Michael Armbrust authored
      There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.
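The rename-with-deprecation pattern can be sketched like this (a Python stand-in with invented names; the real change is to SchemaRDD in Scala, where `@deprecated` does the warning):

```python
import warnings

class SchemaRDDSketch:
    """Illustrative only: the new name does the work, the old name warns."""

    def registerTempTable(self, name):
        self._registered_as = name   # stand-in for the real registration

    def registerAsTable(self, name):
        warnings.warn("registerAsTable is deprecated, use "
                      "registerTempTable instead", DeprecationWarning)
        self.registerTempTable(name)

rdd = SchemaRDDSketch()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rdd.registerAsTable("people")    # old name still works, but warns
print(caught[0].category.__name__, "->", rdd._registered_as)
# DeprecationWarning -> people
```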
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
      
      d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      4dff086 [Michael Armbrust] Fix .java files too
      89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
      1a804373
    • Yin Huai's avatar
      [SPARK-2797] [SQL] SchemaRDDs don't support unpersist() · d210022e
      Yin Huai authored
      The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
      
      7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called without the input parameter (blocking) from PySpark.
      d210022e
    • Cheng Lian's avatar
      [SPARK-2729][SQL] Added test case for SPARK-2729 · 866cf1f8
      Cheng Lian authored
      This is a follow up of #1636.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1738 from liancheng/test-for-spark-2729 and squashes the following commits:
      
      b13692a [Cheng Lian] Added test case for SPARK-2729
      866cf1f8
    • Michael Armbrust's avatar
      [SPARK-2785][SQL] Remove assertions that throw when users try unsupported Hive commands. · 198df11f
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1742 from marmbrus/asserts and squashes the following commits:
      
      5182d54 [Michael Armbrust] Remove assertions that throw when users try unsupported Hive commands.
      198df11f
    • Michael Armbrust's avatar
      [SPARK-2097][SQL] UDF Support · 158ad0bb
      Michael Armbrust authored
      This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.
      
      Scala:
      ```scala
      registerFunction("strLenScala", (_: String).length)
      sql("SELECT strLenScala('test')")
      ```
      Python:
      ```python
      sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
      sqlCtx.sql("SELECT strLenPython('test')")
      ```
      Java:
      ```java
      sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
        @Override
        public Integer call(String str) throws Exception {
          return str.length();
        }
      }, DataType.IntegerType);
      
      sqlContext.sql("SELECT stringLengthJava('test')");
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1063 from marmbrus/udfs and squashes the following commits:
      
      9eda0fe [Michael Armbrust] newline
      747c05e [Michael Armbrust] Add some scala UDF tests.
      d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
      005d684 [Michael Armbrust] Fix naming and formatting.
      d14dac8 [Michael Armbrust] Fix last line of autogened java files.
      8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
      40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
      6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
      7a83101 [Michael Armbrust] Drop toString
      795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
      e54fb45 [Michael Armbrust] Docs and tests.
      437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
      01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
      8e6c932 [Michael Armbrust] WIP
      3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
      6237c8d [Michael Armbrust] WIP
      2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
      0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
      158ad0bb
    • GuoQiang Li's avatar
      SPARK-2804: Remove scalalogging-slf4j dependency · 4c477117
      GuoQiang Li authored
      This also Closes #1701.
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1208 from witgo/SPARK-1470 and squashes the following commits:
      
      422646b [GuoQiang Li] Remove scalalogging-slf4j dependency
      4c477117
    • Chris Fregly's avatar
      [SPARK-1981] Add AWS Kinesis streaming support · 91f9504e
      Chris Fregly authored
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #1434 from cfregly/master and squashes the following commits:
      
      4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
      0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
      691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
      0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
      e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
      d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
      912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
      db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
      338997e [Chris Fregly] improve build docs for kinesis
      828f8ae [Chris Fregly] more cleanup
      e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      cd68c0d [Chris Fregly] fixed typos and backward compatibility
      d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
      91f9504e
    • Yin Huai's avatar
      [SQL] Set outputPartitioning of BroadcastHashJoin correctly. · 67bd8e3c
      Yin Huai authored
      I think we will not generate the plan triggering this bug at this moment. But, let me explain it...
      
      Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
      ```sql
      SELECT l.key, count(*)
      FROM (SELECT key, count(*) as cnt
            FROM src
            GROUP BY key) l // This is buildPlan
      JOIN r // This is the streamedPlan
      ON (l.cnt = r.value)
      GROUP BY l.key
      ```
      Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning` of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s outputPartitioning may not match the required distribution of the last `GROUP BY` and we fail to group data correctly.
      
      JIRA is being reindexed. I will create a JIRA ticket once it is back online.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:
      
      96d9cb3 [Yin Huai] Set outputPartitioning correctly.
      67bd8e3c
    • Joseph K. Bradley's avatar
      [SPARK-2478] [mllib] DecisionTree Python API · 3f67382e
      Joseph K. Bradley authored
      Added experimental Python API for Decision Trees.
      
      API:
      * class DecisionTreeModel
      ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
      ** numNodes()
      ** depth()
      ** __str__()
      * class DecisionTree
      ** trainClassifier()
      ** trainRegressor()
      ** train()
      
      Examples and testing:
      * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
      * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
      
      Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
      
      3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
      6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
      67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
      aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
      fa10ea7 [Joseph K. Bradley] Small style update
      7968692 [Joseph K. Bradley] small braces typo fix
      e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
      db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
      6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      93953f1 [Joseph K. Bradley] Likely done with Python API.
      6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
      188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
      cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
      3f67382e
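      The `_linear_predictor_typecheck` fix mentioned above (using `isinstance()` instead of `type()`) matters because an exact `type()` comparison rejects subclasses. A minimal standalone illustration, using stand-in classes rather than the real PySpark RDD:

      ```python
      class RDD:                 # stand-in for pyspark.rdd.RDD
          pass

      class PipelinedRDD(RDD):   # PySpark returns RDD subclasses from many transformations
          pass

      data = PipelinedRDD()

      # Old check: exact type match, so any RDD subclass fails the check.
      print(type(data) == RDD)      # False -- wrongly rejects PipelinedRDD
      # New check: isinstance() accepts the class and all of its subclasses.
      print(isinstance(data, RDD))  # True
      ```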
    • Andrew Or's avatar
      [HOTFIX] Do not throw NPE if spark.test.home is not set · e09e18b3
      Andrew Or authored
      `spark.test.home` was introduced in #1734. This is fine for SBT but fails Maven tests. Either way it shouldn't throw an NPE.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1739 from andrewor14/fix-spark-test-home and squashes the following commits:
      
      ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set
      e09e18b3
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 87738bfa
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #706 (close requested by 'pwendell')
      Closes #453 (close requested by 'pwendell')
      Closes #557 (close requested by 'tdas')
      Closes #495 (close requested by 'tdas')
      Closes #1232 (close requested by 'pwendell')
      Closes #82 (close requested by 'pwendell')
      Closes #600 (close requested by 'pwendell')
      Closes #473 (close requested by 'pwendell')
      Closes #351 (close requested by 'pwendell')
      87738bfa
    • Patrick Wendell's avatar
      HOTFIX: Fix concurrency issue in FlumePollingStreamSuite. · 44460ba5
      Patrick Wendell authored
      This has been failing on master. One possible cause is that the port
      gets contended if multiple test runs happen concurrently and they
      hit this test at the same time. Since this test takes a long time
      (60 seconds), that's very plausible. This patch randomizes the port
      used in this test to avoid contention.
      44460ba5
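      Randomizing the port only reduces the chance of a collision; binding to port 0 and letting the OS hand out a free ephemeral port avoids it entirely. A small illustration of both approaches in plain Python sockets (not the Flume suite's actual code):

      ```python
      import random
      import socket

      # Approach 1 (roughly what the patch does): pick a port at random so
      # two concurrent test runs are unlikely to collide on a fixed port.
      port = random.randint(1024, 65535)

      # Approach 2: bind to port 0 and let the kernel assign an unused
      # ephemeral port -- no collision is possible at bind time.
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
          s.bind(("127.0.0.1", 0))
          free_port = s.getsockname()[1]

      print(port, free_port)
      ```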
    • Patrick Wendell's avatar
      HOTFIX: Fixing test error in maven for flume-sink. · 25cad6ad
      Patrick Wendell authored
      We needed to add an explicit dependency on scalatest since this
      module does not inherit it from spark-core as other modules do.
      25cad6ad
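      A module that does not depend on spark-core has to declare the test framework itself. The dependency would look roughly like this; the exact artifact name, version, and scope in flume-sink's pom are not copied from the actual file and are illustrative only:

      ```xml
      <!-- Illustrative sketch, not the actual flume-sink pom entry -->
      <dependency>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest_${scala.binary.version}</artifactId>
        <scope>test</scope>
      </dependency>
      ```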
    • Anand Avati's avatar
      [SPARK-1812] sql/catalyst - Provide explicit type information · 08c095b6
      Anand Avati authored
      For Scala 2.11 compatibility.
      
      Without the explicit type specification, withNullability
      return type is inferred to be Attribute, and thus calling
      at() on the returned object fails in these tests:
      
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:370: value at is not a
      [ERROR]     val c4_notNull = 'a.boolean.notNull.at(3)
      [ERROR]                                         ^
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:371: value at is not a
      [ERROR]     val c5_notNull = 'a.boolean.notNull.at(4)
      [ERROR]                                         ^
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:372: value at is not a
      [ERROR]     val c6_notNull = 'a.boolean.notNull.at(5)
      [ERROR]                                         ^
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:558: value at is not a
      [ERROR]     val s_notNull = 'a.string.notNull.at(0)
      
      Signed-off-by: Anand Avati <avati@redhat.com>
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1709 from avati/SPARK-1812-notnull and squashes the following commits:
      
      0470eb3 [Anand Avati] SPARK-1812: sql/catalyst - Provide explicit type information
      08c095b6
    • Andrew Or's avatar
      [SPARK-2454] Do not ship spark home to Workers · 148af608
      Andrew Or authored
      When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.
      
      The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.
      
      This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:
      
      f71f391 [Andrew Or] Revert changes in python
      1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
      188fc5d [Andrew Or] Avoid using spark.home where possible
      09272b7 [Andrew Or] Always use Worker's working directory as spark home
      148af608