Commits · 0ff38c22205f14770ecca1e66378e7c207ca2d1d · cs525-sp18-g07 / spark

Jan 29, 2014

Merge pull request #494 from tyro89/worker_registration_issue · 0ff38c22

Erik Selin authored 11 years ago

Issue with failed worker registrations

I've been going through the Spark source after having some odd issues with workers dying and not coming back. After some digging (I'm very new to Scala and Spark) I believe I've found a worker registration issue. It looks to me like a failed registration follows the same code path as a successful registration, which ends up with workers believing they are connected (since they received a `RegisteredWorker` event) even though they are not registered on the Master.

This is a quick fix that I hope addresses this issue (assuming I didn't completely misread the code and I'm about to look like a silly person :P)

I'm opening this pr now to start a chat with you guys while I do some more testing on my side :)

Author: Erik Selin <erik.selin@jadedpixel.com>

== Merge branch commits ==

commit 973012f8a2dcf1ac1e68a69a2086a1b9a50f401b
Author: Erik Selin <erik.selin@jadedpixel.com>
Date: Tue Jan 28 23:36:12 2014 -0500

break logwarning into two lines to respect line character limit.

commit e3754dc5b94730f37e9806974340e6dd93400f85
Author: Erik Selin <erik.selin@jadedpixel.com>
Date: Tue Jan 28 21:16:21 2014 -0500

add log warning when worker registration fails due to attempt to re-register on same address.

commit 14baca241fa7823e1213cfc12a3ff2a9b865b1ed
Author: Erik Selin <erik.selin@jadedpixel.com>
Date: Wed Jan 22 21:23:26 2014 -0500

address code style comment

commit 71c0d7e6f59cd378d4e24994c21140ab893954ee
Author: Erik Selin <erik.selin@jadedpixel.com>
Date: Wed Jan 22 16:01:42 2014 -0500

Make a failed registration not persist, not send a `RegisteredWorker` event and not run `schedule` but rather send a `RegisterWorkerFailed` message to the worker attempting to register.
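
The behavior this commit describes can be sketched roughly as below. This is a minimal, self-contained model and not the actual Master code: `MiniMaster`, `WorkerInfo`, `scheduleCalls`, and the fields of the message classes are hypothetical stand-ins. Only the names `RegisteredWorker`, `RegisterWorkerFailed`, and `schedule` come from the commit message.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the real messages and Master state.
sealed trait Reply
case class RegisteredWorker(masterUrl: String) extends Reply
case class RegisterWorkerFailed(reason: String) extends Reply

case class WorkerInfo(id: String, address: String)

class MiniMaster(masterUrl: String) {
  private val addressToWorker = mutable.HashMap[String, WorkerInfo]()
  var scheduleCalls = 0 // counts how often scheduling was triggered

  // Returns true only if the worker was actually recorded.
  private def registerWorker(worker: WorkerInfo): Boolean = {
    if (addressToWorker.contains(worker.address)) {
      false // a worker is already registered at this address
    } else {
      addressToWorker(worker.address) = worker
      true
    }
  }

  private def schedule(): Unit = scheduleCalls += 1

  // Success and failure now take different paths: only a successful
  // registration triggers scheduling and a RegisteredWorker reply.
  def handleRegister(worker: WorkerInfo): Reply = {
    if (registerWorker(worker)) {
      schedule()
      RegisteredWorker(masterUrl)
    } else {
      RegisterWorkerFailed("Attempted to re-register worker at same address: " + worker.address)
    }
  }
}
```

With this shape, a second registration from the same address gets a failure reply instead of a success message, so the worker no longer believes it is connected.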

0ff38c22

Jan 28, 2014

Merge pull request #497 from tdas/docs-update · 79302096

Tathagata Das authored 11 years ago

Updated Spark Streaming Programming Guide

Here is the updated version of the Spark Streaming Programming Guide. This is still a work in progress, but the major changes are in place. So feedback is most welcome.

In general, I have tried to make the guide easier to understand even if the reader does not know much about Spark. The updated website is hosted here -

http://www.eecs.berkeley.edu/~tdas/spark_docs/streaming-programming-guide.html

The major changes are:
- Overview illustrates the use cases of Spark Streaming - various input sources and various output sources
- An example right after the overview to quickly give an idea of what a Spark Streaming program looks like
- Made Java API and examples a first class citizen like Scala by using tabs to show both Scala and Java examples (similar to AMPCamp tutorial's code tabs)
- Highlighted the DStream operations updateStateByKey and transform because of their powerful nature
- Updated driver node failure recovery text to highlight automatic recovery in Spark standalone mode
- Added information about linking and using the external input sources like Kafka and Flume
- In general, reorganized the sections to better show the Basic section and the more advanced sections like Tuning and Recovery.

Todos:
- Links to the docs of external Kafka, Flume, etc
- Illustrate window operation with figure as well as example.

Author: Tathagata Das <tathagata.das1565@gmail.com>

== Merge branch commits ==

commit 18ff10556570b39d672beeb0a32075215cfcc944
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Tue Jan 28 21:49:30 2014 -0800

    Fixed a lot of broken links.

commit 34a5a6008dac2e107624c7ff0db0824ee5bae45f
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Tue Jan 28 18:02:28 2014 -0800

    Updated github url to use SPARK_GITHUB_URL variable.

commit f338a60ae8069e0a382d2cb170227e5757cc0b7a
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Mon Jan 27 22:42:42 2014 -0800

    More updates based on Patrick and Harvey's comments.

commit 89a81ff25726bf6d26163e0dd938290a79582c0f
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Mon Jan 27 13:08:34 2014 -0800

    Updated docs based on Patrick's PR comments.

commit d5b6196b532b5746e019b959a79ea0cc013a8fc3
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Sun Jan 26 20:15:58 2014 -0800

    Added spark.streaming.unpersist config and info on StreamingListener interface.

commit e3dcb46ab83d7071f611d9b5008ba6bc16c9f951
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Sun Jan 26 18:41:12 2014 -0800

    Fixed docs on StreamingContext.getOrCreate.

commit 6c29524639463f11eec721e4d17a9d7159f2944b
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Thu Jan 23 18:49:39 2014 -0800

    Added example and figure for window operations, and links to Kafka and Flume API docs.

commit f06b964a51bb3b21cde2ff8bdea7d9785f6ce3a9
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Wed Jan 22 22:49:12 2014 -0800

    Fixed missing endhighlight tag in the MLlib guide.

commit 036a7d46187ea3f2a0fb8349ef78f10d6c0b43a9
Merge: eab351d a1cd1851
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Wed Jan 22 22:17:42 2014 -0800

    Merge remote-tracking branch 'apache/master' into docs-update

commit eab351d05c0baef1d4b549e1581310087158d78d
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date:   Wed Jan 22 22:17:15 2014 -0800

    Update Spark Streaming Programming Guide.

79302096

Merge pull request #523 from JoshRosen/SPARK-1043 · f8c742ce

Josh Rosen authored 11 years ago

Switch from MUTF8 to UTF8 in PySpark serializers.

This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.
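
For context, the 64kB limit comes from Java's `DataOutputStream.writeUTF`, which writes modified UTF-8 behind a 2-byte length prefix. The sketch below shows that limit and one possible alternative: a 4-byte length followed by standard UTF-8 bytes. The object and method names are hypothetical, and the actual serializer change in the PR may differ in detail.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream, UTFDataFormatException}
import java.nio.charset.StandardCharsets

object Utf8Write {
  // writeUTF uses modified UTF-8 with a 2-byte length prefix, so it rejects
  // strings whose encoded form exceeds 65535 bytes.
  def writeUtfFails(s: String): Boolean = {
    val out = new DataOutputStream(new ByteArrayOutputStream())
    try { out.writeUTF(s); false }
    catch { case _: UTFDataFormatException => true }
  }

  // Assumed alternative: 4-byte length prefix followed by standard UTF-8 bytes.
  def writeLengthPrefixedUtf8(s: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    out.writeInt(bytes.length)
    out.write(bytes)
    out.flush()
    buffer.toByteArray
  }
}
```

A reader on the other side would read the 4-byte length, then that many bytes, and decode them as UTF-8.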

f8c742ce

Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72

Josh Rosen authored 11 years ago

This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.

1381fc72

Jan 27, 2014

Merge pull request #466 from liyinan926/file-overwrite-new · 84670f27

Reynold Xin authored 11 years ago

Allow files added through SparkContext.addFile() to be overwritten

This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.

84670f27

Merge pull request #516 from sarutak/master · 3d5c03e2

Reynold Xin authored 11 years ago

modified SparkPluginBuild.scala to use https protocol for accessing gith...

We cannot build Spark behind a proxy even when running sbt with the -Dhttp(s).proxyHost, -Dhttp(s).proxyPort, -Dhttp(s).proxyUser and -Dhttp(s).proxyPassword options.
This is because the build uses the git protocol to clone junit_xml_listener.git.
After modifying SparkPluginBuild.scala to use https, I could build.

I reported this issue to JIRA.
https://spark-project.atlassian.net/browse/SPARK-1046

3d5c03e2

Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined · f16c21e2

Reynold Xin authored 11 years ago

Replace the check for None Option with isDefined and isEmpty in Scala code

Propose to replace the Scala check for Option "!= None" with Option.isDefined and "=== None" with Option.isEmpty.

I think using a method call where possible, rather than an operator function plus argument, will make the Scala code easier to read and understand.

Compiles and passes tests.
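
A small illustration of the proposed style, with the replaced comparison shown in comments. `OptionChecks` and its methods are hypothetical names used only for this example.

```scala
object OptionChecks {
  // Before: if (maybeHome != None) ...
  def describe(maybeHome: Option[String]): String =
    if (maybeHome.isDefined) "set to " + maybeHome.get
    else "unset"

  // Before: maybeHome == None
  def isMissing(maybeHome: Option[String]): Boolean =
    maybeHome.isEmpty
}
```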

f16c21e2

Merge pull request #460 from srowen/RandomInitialALSVectors · f67ce3e2

Sean Owen authored 11 years ago

Choose initial user/item vectors uniformly on the unit sphere

...rather than within the unit square to possibly avoid bias in the initial state and improve convergence.

The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N grows even modestly, the vectors tend strongly to point into the "corner", towards (1,1,1,...,1). The vectors are not unit vectors either.

I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This gets you uniform random choices on the unit sphere which is more what's of interest here. It has worked a little better for me in the past.
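
A minimal sketch of the suggested sampling, assuming nothing about the ALS code itself (`RandomFactor` and `randomUnitVector` are hypothetical names): draw each component from N(0,1), then divide by the Euclidean norm.

```scala
import scala.util.Random

object RandomFactor {
  // Draw each component from N(0, 1), then scale the vector to unit length.
  // Because the Gaussian distribution is rotationally symmetric, the result
  // is uniformly distributed on the unit sphere.
  def randomUnitVector(rank: Int, rand: Random): Array[Double] = {
    val v = Array.fill(rank)(rand.nextGaussian())
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }
}
```

Taking the absolute value of each draw before normalizing would be one way to keep all components positive, though whether the later commit in this branch does exactly that is an assumption.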

This is pretty minor but wanted to warm up suggesting a few tweaks to ALS.
Please excuse my Scala, pretty new to it.

Author: Sean Owen <sowen@cloudera.com>

== Merge branch commits ==

commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6
Author: Sean Owen <sowen@cloudera.com>
Date:   Mon Jan 27 08:05:25 2014 +0000

    Style: spaces around binary operators

commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e
Author: Sean Owen <sowen@cloudera.com>
Date:   Sun Jan 19 22:50:03 2014 +0000

    Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460

commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7
Author: Sean Owen <sowen@cloudera.com>
Date:   Sat Jan 18 15:54:42 2014 +0000

    Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence

f67ce3e2

modified SparkPluginBuild.scala to use https protocol for accessing github. · 6a5af7b7
sarutak authored 11 years ago

6a5af7b7

Jan 26, 2014

Merge pull request #504 from JoshRosen/SPARK-1025 · c40619d4

Reynold Xin authored 11 years ago

Fix PySpark hang when input files are deleted (SPARK-1025)

This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.

c40619d4

Merge pull request #511 from JoshRosen/SPARK-1040 · c66a2ef1

Reynold Xin authored 11 years ago

Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)

This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException. I applied the same fix to the Spark Streaming Java APIs. The commit message describes the fix in more detail.

I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.

c66a2ef1

Jan 25, 2014

Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040) · 740e865f

Josh Rosen authored 11 years ago

This fixes an issue where collectAsMap() could
fail when called on a JavaPairRDD that was derived
by transforming a non-JavaPairRDD.

The root problem was that we were creating the
JavaPairRDD's ClassTag by casting a
ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]].
To fix this, I cast a ClassTag[Tuple2[_, _]]
instead, since this actually produces a ClassTag
of the appropriate type because ClassTags don't
capture type parameters:

scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res8: Boolean = true

scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res9: Boolean = false

740e865f

Increase JUnit test verbosity under SBT. · 531d9d75

Josh Rosen authored 11 years ago

Upgrade junit-interface plugin from 0.9 to 0.10.

I noticed that the JavaAPISuite tests didn't
appear to display any output locally or under
Jenkins, making it difficult to know whether they
were running.  This change increases the verbosity
to more closely match the ScalaTest tests.

531d9d75

Jan 23, 2014

Merge pull request #505 from JoshRosen/SPARK-1026 · 05be7047

Patrick Wendell authored 11 years ago

Deprecate mapPartitionsWithSplit in PySpark (SPARK-1026)

This commit deprecates `mapPartitionsWithSplit` in PySpark (see [SPARK-1026](https://spark-project.atlassian.net/browse/SPARK-1026)) and removes the remaining references to it from the docs.

05be7047

Deprecate mapPartitionsWithSplit in PySpark. · 4cebb79c
Josh Rosen authored 11 years ago
```
Also, replace the last reference to it in the docs.

This fixes SPARK-1026.
```
4cebb79c

Merge pull request #503 from pwendell/master · 3d6e7541

Patrick Wendell authored 11 years ago

Fix bug on read-side of external sort when using Snappy.

This case wasn't handled correctly and this patch fixes it.

3d6e7541

Minor fix · ff447321
Patrick Wendell authored 11 years ago

ff447321

Merge pull request #502 from pwendell/clone-1 · c3196171

Patrick Wendell authored 11 years ago

Remove Hadoop object cloning and warn users making Hadoop RDD's.

The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in various file formats,
including Avro and another custom format.

This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
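
Copying can matter because Hadoop record readers typically reuse a single mutable object for every record. The sketch below models that without any Hadoop dependency. `MutableRecord` and `ReuseDemo` are hypothetical stand-ins, not Hadoop or Spark classes.

```scala
// Hypothetical stand-in for a mutable Hadoop Writable such as Text.
class MutableRecord(var value: String)

object ReuseDemo {
  // Mimics a record reader that reuses one object for every record.
  def reusingIterator(values: Seq[String]): Iterator[MutableRecord] = {
    val shared = new MutableRecord("")
    values.iterator.map { v => shared.value = v; shared }
  }

  // Storing references without copying: every element aliases the same
  // object, so all of them end up showing the last record's contents.
  def withoutCopy(values: Seq[String]): List[String] =
    reusingIterator(values).toList.map(_.value)

  // Copying each record's contents out before storing it.
  def withCopy(values: Seq[String]): List[String] =
    reusingIterator(values).map(r => r.value).toList
}
```

In this model, `withoutCopy(Seq("a", "b", "c"))` yields three copies of `"c"`, while `withCopy` preserves the original values.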

c3196171

Merge pull request #501 from JoshRosen/cartesian-rdd-fixes · cad3002f

Patrick Wendell authored 11 years ago

Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034

This pull request fixes two bugs in PySpark's `cartesian()` method:

- [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception
- [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result

The JIRAs have more details describing the fixes.

cad3002f

Minor changes after auditing diff from earlier version · 268ecbd2
Patrick Wendell authored 11 years ago

268ecbd2
Fix for SPARK-1025: PySpark hang on missing files. · f8306849
Josh Rosen authored 11 years ago

f8306849
Response to Matei's review · c58d4ea3
Patrick Wendell authored 11 years ago

c58d4ea3
Fix bug on read-side of external sort when using Snappy. · 0213b403
Patrick Wendell authored 11 years ago
```
This case wasn't handled correctly and this patch fixes it.
```
0213b403

Remove Hadoop object cloning and warn users making Hadoop RDD's. · 71010178

Patrick Wendell authored 11 years ago

The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in various file formats,
including Avro and another custom format.

This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.

71010178

Fix SPARK-978: ClassCastException in PySpark cartesian. · 61569906
Josh Rosen authored 11 years ago

61569906
Fix SPARK-1034: Py4JException on PySpark Cartesian Result · 0035dbbc
Josh Rosen authored 11 years ago

0035dbbc

Merge pull request #406 from eklavya/master · fad6aacf

Josh Rosen authored 11 years ago

Extending Java API coverage

Hi,

I have added three new methods to JavaRDD.

Please review and merge.

fad6aacf

Merge pull request #499 from jianpingjwang/dev1 · a2b47dae
Reynold Xin authored 11 years ago
```
Replace commons-math with jblas in SVDPlusPlus
```
a2b47dae
fixed ClassTag in mapPartitions · 60e74572
eklavya authored 11 years ago

60e74572
Add jblas dependency · 19a01c1b
Jianping J Wang authored 11 years ago

19a01c1b
Add jblas dependency · a5a513e2
Jianping J Wang authored 11 years ago

a5a513e2
Replace commons-math with jblas · cc0fd331
Jianping J Wang authored 11 years ago

cc0fd331

Jan 22, 2014

Merge pull request #496 from pwendell/master · a1cd1851

Patrick Wendell authored 11 years ago

Fix bug in worker clean-up in UI

Introduced in d5a96fec (/cc @aarondav).

This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.

a1cd1851

Merge pull request #447 from CodingCat/SPARK-1027 · 034dce2a

Patrick Wendell authored 11 years ago

fix for SPARK-1027

fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)

FIXES

1. change sparkhome from String to Option[String] in ApplicationDesc

2. remove sparkhome parameter in LaunchExecutor message

3. adjust involved files

034dce2a

Fix bug in worker clean-up in UI · 62855131
Patrick Wendell authored 11 years ago
```
Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
```
62855131
refactor sparkHome to val · 2b3c4614
CodingCat authored 11 years ago
```
clean code
```
2b3c4614

Merge pull request #495 from srowen/GraphXCommonsMathDependency · 3184facd

Patrick Wendell authored 11 years ago

Fix graphx Commons Math dependency

`graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`. However the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts. But I can tell you this hasn't been true for a month or so: building against a recent Hadoop would fail. (That's how we noticed.)

The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where newer 3.x releases are. Drop-in replacement, but different artifact and package name. Changing this single usage to `commons-math3` works and tests pass, which isn't surprising, so it is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.)

It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.

3184facd

Also add graphx commons-math3 dependency in sbt build · 4476398f
Sean Owen authored 11 years ago

4476398f
Merge pull request #492 from skicavs/master · a1238bb5
Patrick Wendell authored 11 years ago
```
fixed job name and usage information for the JavaSparkPi example
```
a1238bb5

Depend on Commons Math explicitly instead of accidentally getting it from... · fd0c5b8c

Sean Owen authored 11 years ago

Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x) and also use the newer commons-math3

fd0c5b8c