Commits · 06dc0d2c6b69c5d59b4d194ced2ac85bfe2e05e2 · cs525-sp18-g07 / spark

Jul 25, 2014

[SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c

Cheng Lian authored 10 years ago

JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server

06dc0d2c

[SPARK-2683] unidoc failed because org.apache.spark.util.CallSite uses Java keywords as value names · 32bcf9af

Yin Huai authored 10 years ago

Renaming `short` to `shortForm` and `long` to `longForm`.

JIRA: https://issues.apache.org/jira/browse/SPARK-2683

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1585 from yhuai/SPARK-2683 and squashes the following commits:

5ddb843 [Yin Huai] "short" and "long" are Java keyworks. In order to generate javadoc, renaming "short" to "shortForm" and "long" to "longForm".

32bcf9af

replace println to log4j · a2715ccd

fireflyc authored 10 years ago

Our program needs to receive a large amount of data and run for a long
time.
We set the log level to WARN but "Storing iterator" "received single"
as such message written to the log file. (over yarn)

Author: fireflyc <fireflyc@126.com>

Closes #1372 from fireflyc/fix-replace-stdout-log and squashes the following commits:

e684140 [fireflyc] 'info' modified into the 'debug'
fa22a38 [fireflyc] replace println to log4j

a2715ccd

[SPARK-2665] [SQL] Add EqualNS & Unit Tests · 184aa1c6

Cheng Hao authored 10 years ago

Hive Supports the operator "<=>", which returns same result with EQUAL(=) operator for non-null operands, but returns TRUE if both are NULL, FALSE if one of the them is NULL.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1570 from chenghao-intel/equalns and squashes the following commits:

8d6c789 [Cheng Hao] Remove the test case orc_predicate_pushdown
5b2ca88 [Cheng Hao] Add cases into whitelist
8e66cdd [Cheng Hao] Rename the EqualNSTo ==> EqualNullSafe
7af4b0b [Cheng Hao] Add EqualNS & Unit Tests

184aa1c6

[SPARK-2529] Clean closures in foreach and foreachPartition. · eb82abd8

Reynold Xin authored 10 years ago

Author: Reynold Xin <rxin@apache.org>

Closes #1583 from rxin/closureClean and squashes the following commits:

8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.

eb82abd8

SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup · 8529ced3

Matei Zaharia authored 10 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-2657

Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.

There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.

Author: Matei Zaharia <matei@databricks.com>

Closes #1555 from mateiz/compact-groupby and squashes the following commits:

845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
07621a7 [Matei Zaharia] Review comments
0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
f61f040 [Matei Zaharia] Fix line lengths
59da88b0 [Matei Zaharia] Fix line lengths
197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers

8529ced3

[SPARK-2656] Python version of stratified sampling · 2f75a4a3

Doris Xin authored 10 years ago

exact sample size not supported for now.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1554 from dorx/pystratified and squashes the following commits:

4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
bdc3f8b [Doris Xin] updated unit to check sample holistically
7713c7b [Doris Xin] Python version of stratified sampling

2f75a4a3

[SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd

Davies Liu authored 10 years ago

During aggregation in Python worker, if the memory usage is above spark.executor.memory, it will do disk spilling aggregation.

It will split the aggregation into multiple stage, in each stage, it will partition the aggregated data by hash and dump them into disks. After all the data are aggregated, it will merge all the stages together (partition by partition).

Author: Davies Liu <davies.liu@gmail.com>

Closes #1460 from davies/spill and squashes the following commits:

cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
37d71f7 [Davies Liu] balance the partitions
902f036 [Davies Liu] add shuffle.py into run-tests
dcf03a9 [Davies Liu] fix memory_info() of psutil
67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
400be01 [Davies Liu] address all the comments
6178844 [Davies Liu] refactor and improve docs
fdd0a49 [Davies Liu] add long doc string for ExternalMerger
1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
e6cc7f9 [Davies Liu] Merge branch 'master' into spill
3652583 [Davies Liu] address comments
e78a0a0 [Davies Liu] fix style
24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
57ee7ef [Davies Liu] update docs
286aaff [Davies Liu] let spilled aggregation in Python configurable
e9a40f6 [Davies Liu] recursive merger
6edbd1f [Davies Liu] Hash based disk spilling aggregation

14174abd

Jul 24, 2014

[SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default · eff9714e

Prashant Sharma authored 10 years ago

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1051 from ScrapCodes/SPARK-2014/pyspark-cache and squashes the following commits:

f192df7 [Prashant Sharma] Code Review
2a2f43f [Prashant Sharma] [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default

eff9714e

[SPARK-2464][Streaming] Fixed Twitter stream stopping bug · a45d5480

Tathagata Das authored 10 years ago

Stopping the Twitter Receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an Exception to be thrown to the listener. This exception caused the Receiver to be restarted. This patch check whether the receiver was stopped or not, and accordingly restarts on exception.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #1577 from tdas/twitter-stop and squashes the following commits:

011b525 [Tathagata Das] Fixed Twitter stream stopping bug.

a45d5480

SPARK-2250: show stage RDDs in UI · fec641b8

Neville Li authored 10 years ago

Author: Neville Li <neville@spotify.com>

Closes #1188 from nevillelyh/neville/ui and squashes the following commits:

d3ac425 [Neville Li] SPARK-2250: show persisted RDD in stage UI
f075db9 [Neville Li] SPARK-2035: show call stack even when description is available

fec641b8

[SPARK-2037]: yarn client mode doesn't support spark.yarn.max.executor.failures · 323a83c5

GuoQiang Li authored 10 years ago

Author: GuoQiang Li <witgo@qq.com>

Closes #1180 from witgo/SPARK-2037 and squashes the following commits:

3d52411 [GuoQiang Li] review commit
7058f4d [GuoQiang Li] Correctly stop SparkContext
6d0561f [GuoQiang Li] Fix: yarn client mode doesn't support spark.yarn.max.executor.failures

323a83c5

[SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests · c960b505

Xiangrui Meng authored 10 years ago

Allow small errors in comparison.

@dbtsai , this unit test blocks https://github.com/apache/spark/pull/1562 . I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

5076a7f [Xiangrui Meng] fix binary metrics unit tests

c960b505

[SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java... · b352ef17

Yin Huai authored 10 years ago

[SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala

In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from a Java Map/List and then load to a Scala Map/List. We can use Scala wrappers to wrap those Java collections instead of using toMap/toList.

I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val jsonData = sc.textFile("...")
jsonData.cache.count

val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
jsonSchemaRDD.registerAsTable("jt")

sqlContext.sql("select count(*) from jt").collect
```
Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.

From the result, there was no significant difference on running `jsonRDD`. For the simple aggregation query, results are attached below.
```
Original:
Run 1: 26.1s
Run 2: 27.03s
Run 3: 27.035s

With this change:
Run 1: 21.086s
Run 2: 21.035s
Run 3: 21.029s
```

JIRA: https://issues.apache.org/jira/browse/SPARK-2603

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1504 from yhuai/removeToMapToList and squashes the following commits:

6831b77 [Yin Huai] Fix failed tests.
09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.

b352ef17

[Build] SPARK-2619: Configurable filemode for the spark/bin folder in debian package · 9fd14147

tzolov authored 10 years ago

Add  a `<deb.bin.filemode>744</deb.bin.filemode>` property to the `assembly/pom.xml` that defaults to `744`.
Use this property for ../bin folder <filemode>.

This patch doesn't change the current default modes but allows one override the modes at build time:
`-Ddeb.bin.filemode=<new mode>`

Author: tzolov <christian.tzolov@gmail.com>

Closes #1531 from tzolov/SPARK-2619 and squashes the following commits:

6d95343 [tzolov] [Build] SPARK-2619: Configurable filemode for the spark/bin folder in the .deb package

9fd14147

SPARK-2150: Provide direct link to finished application UI in yarn resou... · 46e224aa

Rahul Singhal authored 10 years ago

...rce manager UI

Use the event logger directory to provide a direct link to finished
application UI in yarn resourcemanager UI.

Author: Rahul Singhal <rahul.singhal@guavus.com>

Closes #1094 from rahulsinghaliitd/SPARK-2150 and squashes the following commits:

95f230c [Rahul Singhal] SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI

46e224aa

[SPARK-2661][bagel]unpersist old processed rdd · 42dfab7d

Daoyuan authored 10 years ago

Unpersist useless rdd during bagel iteration to make full use of memory.

Author: Daoyuan <daoyuan.wang@intel.com>

Closes #1519 from adrian-wang/bagelunpersist and squashes the following commits:

182c9dd [Daoyuan] rename var nextUseless to lastRDD
87fd3a4 [Daoyuan] bagel unpersist old processed rdd

42dfab7d

SPARK-2310. Support arbitrary Spark properties on the command line with ... · e34922a2

Sandy Ryza authored 10 years ago

...spark-submit

The PR allows invocations like
  spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:

1dc9855 [Sandy Ryza] More doc and cleanup
00edfb9 [Sandy Ryza] Review comments
91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit

e34922a2

[SPARK-2658][SQL] Add rule for true = 1. · 78d18fdb

Michael Armbrust authored 10 years ago

Author: Michael Armbrust <michael@databricks.com>

Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:

ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.

78d18fdb

SPARK-2662: Fix NPE for JsonProtocol · 9e7725c8

GuoQiang Li authored 10 years ago

Author: GuoQiang Li <witgo@qq.com>

Closes #1511 from witgo/JsonProtocol and squashes the following commits:

2b6227f [GuoQiang Li] Fix NPE for JsonProtocol

9e7725c8

Jul 23, 2014

Replace RoutingTableMessage with pair · 2d25e348

Ankur Dave authored 10 years ago

RoutingTableMessage was used to construct routing tables to enable
joining VertexRDDs with partitioned edges. It stored three elements: the
destination vertex ID, the source edge partition, and a byte specifying
the position in which the edge partition referenced the vertex to enable
join elimination.

However, this was incompatible with sort-based shuffle (SPARK-2045). It
was also slightly wasteful, because partition IDs are usually much
smaller than 2^32, though this was mitigated by a custom serializer that
used variable-length encoding.

This commit replaces RoutingTableMessage with a pair of (VertexId, Int)
where the Int encodes both the source partition ID (in the lower 30
bits) and the position (in the top 2 bits).

Author: Ankur Dave <ankurdave@gmail.com>

Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:

697e17b [Ankur Dave] Replace RoutingTableMessage with pair

2d25e348

[SPARK-2484][SQL] Build should not run hivecompatibility tests by default. · 60f0ae3d

witgo authored 10 years ago

Author: witgo <witgo@qq.com>

Closes #1403 from witgo/hive_compatibility and squashes the following commits:

4e5ecdb [witgo] The default does not run hive compatibility tests

60f0ae3d

[SPARK-2549] Functions defined inside of other functions trigger failures · 9b763329

Prashant Sharma authored 10 years ago

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1510 from ScrapCodes/SPARK-2549/fun-in-fun and squashes the following commits:

9458bc5 [Prashant Sharma] Tested by removing an inner function from excludes.
bc03b1c [Prashant Sharma] SPARK-2549 Functions defined inside of other functions trigger failures

9b763329

[SPARK-2102][SQL][CORE] Add option for kryo registration required and use a... · efdaeb11

Ian O Connell authored 10 years ago

[SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.

Author: Ian O Connell <ioconnell@twitter.com>

Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:

5498566 [Ian O Connell] Docs update suggested by Patrick
20e8555 [Ian O Connell] Slight style change
f92c294 [Ian O Connell] Add docs for new KryoSerializer option
f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer

efdaeb11

[SPARK-2569][SQL] Fix shipping of TEMPORARY hive UDFs. · 1871574a

Michael Armbrust authored 10 years ago

Instead of shipping just the name and then looking up the info on the workers, we now ship the whole classname. Also, I refactored the file as it was getting pretty large to move out the type conversion code to its own file.

Author: Michael Armbrust <michael@databricks.com>

Closes #1552 from marmbrus/fixTempUdfs and squashes the following commits:

b695904 [Michael Armbrust] Make add jar execute with Hive. Ship the whole function class name since sometimes we cannot lookup temporary functions on the workers.

1871574a

SPARK-2226: [SQL] transform HAVING clauses with aggregate expressions that... · e060d3ee

William Benton authored 10 years ago

SPARK-2226: [SQL] transform HAVING clauses with aggregate expressions that aren't in the aggregation list

This change adds an analyzer rule to
1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
2. push these expressions down to the underlying aggregates, and then
3. project them away above the filter.

It also enables the `HAVING` queries in the Hive compatibility suite.

Author: William Benton <willb@redhat.com>

Closes #1497 from willb/spark-2226 and squashes the following commits:

92c9a93 [William Benton] Removed unnecessary import
f1d4f34 [William Benton] Cleanups missed in prior commit
0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
541d4ee [William Benton] Cleanups from review
5a12647 [William Benton] Explanatory comments and stylistic cleanups.
c7f2b2c [William Benton] Whitelist HAVING queries.
29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)

e060d3ee

SPARK-2277: clear host->rack info properly · 91903e0a

Rui Li authored 10 years ago

Hi mridulm, I just think of this issue of [#1212](https://github.com/apache/spark/pull/1212): I added FakeRackUtil to hold the host -> rack mapping. It should be cleaned up after use so that it won't mess up with test cases others may add later.
Really sorry about this.

Author: Rui Li <rui.li@intel.com>

Closes #1454 from lirui-intel/SPARK-2277-fix-UT and squashes the following commits:

f8ea25c [Rui Li] SPARK-2277: clear host->rack info properly

91903e0a

[SPARK-2588][SQL] Add some more DSLs. · 1b790cf7

Takuya UESHIN authored 10 years ago

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1491 from ueshin/issues/SPARK-2588 and squashes the following commits:

43d0a46 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2588
1023ea0 [Takuya UESHIN] Modify tests to use DSLs.
2310bf1 [Takuya UESHIN] Add some more DSLs.

1b790cf7

[CORE] SPARK-2640: In "local[N]", free cores of the only executor should be... · f776bc98

woshilaiceshide authored 10 years ago

[CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.

Make spark's "local[N]" better.
In our company, we use "local[N]" in production. It works exellentlly. It's our best choice.

Author: woshilaiceshide <woshilaiceshide@qq.com>

Closes #1544 from woshilaiceshide/localX and squashes the following commits:

6c85154 [woshilaiceshide] [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.

f776bc98

[SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap · 25921110

Andrew Or authored 10 years ago

It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. Right now everything looks a little jumbled and we can't tell which lines belong to the same thread. For instance:

```
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (194 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 10 MB to disk (197 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 9 MB to disk (45 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 23 MB to disk (198 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 38 MB to disk (25 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 161 MB to disk (25 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 0 MB to disk (199 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (166 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (199 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (200 times so far)
```

Author: Andrew Or <andrewor14@gmail.com>

Closes #1517 from andrewor14/external-log and squashes the following commits:

90e48bb [Andrew Or] Log thread ID when spilling

25921110

[SPARK-2617] Correct doc and usages of preservesPartitioning · 4c7243e1

Xiangrui Meng authored 10 years ago

The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `maPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning

4c7243e1

Remove GraphX MessageToPartition for compatibility with sort-based shuffle · 6c2be93f

Ankur Dave authored 10 years ago

MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).

Author: Ankur Dave <ankurdave@gmail.com>

Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:

f9d0054 [Ankur Dave] Remove MessageToPartition
ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg

6c2be93f

Jul 22, 2014

[YARN] SPARK-2577: File upload to viewfs is broken due to mount point re... · 02e45729

Gera Shegalov authored 10 years ago

Opting to the option 2 defined in SPARK-2577, i.e., retrieve and pass the correct file system object to addResource.

Author: Gera Shegalov <gera@twitter.com>

Closes #1483 from gerashegalov/master and squashes the following commits:

90c9087 [Gera Shegalov] [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution

02e45729

[YARN][SPARK-2606]:In some cases,the spark UI pages display incorrect · ddadf1b0

GuoQiang Li authored 10 years ago

The issue is caused by #1112 .

Author: GuoQiang Li <witgo@qq.com>

Closes #1501 from witgo/webui_style and squashes the following commits:

4b34998 [GuoQiang Li] In some cases, pages display incorrect in WebUI

ddadf1b0

Graphx example · 5f7b9916

CrazyJvm authored 10 years ago

fix examples

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:

663457a [CrazyJvm] outDegrees does not take parameters
7cfff1d [CrazyJvm] fix example for joinVertices

5f7b9916

[SPARK-2615] [SQL] Add Equal Sign "==" Support for HiveQl · 79fe7634

Cheng Hao authored 10 years ago

Currently, the "==" in HiveQL expression will cause exception thrown, this patch will fix it.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1522 from chenghao-intel/equal and squashes the following commits:

f62a0ff [Cheng Hao] Add == Support for HiveQl

79fe7634

SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage · 85d3596e

Aaron Davidson authored 10 years ago

### Why and what?
Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.

This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.

Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl

### Memory implications
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.

Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).

The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
This results in a worst-case sorting overhead of 4N bytes.

Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.

### Performance implications
As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.

Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".

<table>
<tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
<tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
<tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
<tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
<tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
</table>

The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.

In short, this patch should significantly improve performance for users running either Java 6 or 7.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1502 from aarondav/sort and squashes the following commits:

652d936 [Aaron Davidson] Update license, move Sorter to java src
a7b5b1c [Aaron Davidson] fix licenses
5c0efaf [Aaron Davidson] Update tmpLength
ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
034bf10 [Aaron Davidson] Change to Apache v2 Timsort
b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage

85d3596e

[MLLIB] make Mima ignore updateFeatures (private) in ALS · 14078717

Xiangrui Meng authored 10 years ago

Fix Mima issues in #1521.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1533 from mengxr/mima-als and squashes the following commits:

78386e1 [Xiangrui Meng] make Mima ignore updateFeatures (private) in ALS

14078717

[SPARK-2612] [mllib] Fix data skew in ALS · 75db1742

peng.zhang authored 10 years ago

Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #1521 from renozhang/fix-als and squashes the following commits:

b5727a4 [peng.zhang] Remove no need argument
1a4f7a0 [peng.zhang] Fix data skew in ALS

75db1742

[SPARK-2452] Create a new valid for each instead of using lineId. · 81fec992

Prashant Sharma authored 10 years ago

Author: Prashant Sharma <prashant@apache.org>

Closes #1441 from ScrapCodes/SPARK-2452/multi-statement and squashes the following commits:

26c5c72 [Prashant Sharma] Added a test case.
7e8d28d [Prashant Sharma] SPARK-2452, create a new valid for each instead of using lineId, because Line ids can be same sometimes.

81fec992