[SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.

Say there are 3 categories A, B, C. We consider 3 splits:

* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).

This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9474 from sethah/SPARK-10788.
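The subtraction described in the commit message can be illustrated with a small, self-contained sketch. It is not the code from this patch and does not reflect the actual layout of `DTStatsAggregator`. The object name `UnorderedSplitStats`, the `Stats` alias, the helper methods, and the per-category label counts are all hypothetical, chosen only to show that storing the left-hand subsets plus the parent totals is enough to recover each right-hand subset.

```scala
// Illustrative sketch only: every name and number here is hypothetical,
// not part of Spark's API or of this patch.
object UnorderedSplitStats {
  // Per-class label counts, standing in for impurity sufficient statistics.
  type Stats = Array[Double]

  // Element-wise sum of two stats arrays.
  def add(x: Stats, y: Stats): Stats =
    x.zip(y).map { case (a, b) => a + b }

  // stats(right) = stats(parent) - stats(left), element-wise.
  def subtract(parent: Stats, left: Stats): Stats = {
    require(parent.length == left.length, "stats arrays must have equal length")
    parent.zip(left).map { case (p, l) => p - l }
  }

  // Assumed two-class label counts for each category at a single node.
  val perCategory: Map[String, Stats] = Map(
    "A" -> Array(3.0, 1.0),
    "B" -> Array(2.0, 2.0),
    "C" -> Array(0.0, 4.0))

  // Stats for the whole node, i.e. stats(A,B,C).
  val parent: Stats = perCategory.values.reduce(add)

  // Only the left-hand side of each of the 3 splits is aggregated.
  val leftSubsets: Seq[Set[String]] = Seq(Set("A"), Set("A", "B"), Set("A", "C"))

  // Stats for one subset of categories.
  def leftStats(subset: Set[String]): Stats =
    subset.toSeq.map(perCategory).reduce(add)

  def main(args: Array[String]): Unit = {
    for (left <- leftSubsets) {
      val l = leftStats(left)
      val r = subtract(parent, l) // right child recovered, never stored
      println(s"left={${left.mkString(",")}} stats=${l.mkString("[", ", ", "]")} " +
        s"right stats=${r.mkString("[", ", ", "]")}")
    }
  }
}
```

In this toy setup only 3 subset arrays and 1 parent array are kept instead of 6 subset arrays, which mirrors the saving the commit message describes.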
Showing 9 changed files:
- mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala (6 additions, 9 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala (6 additions, 9 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala (35 additions, 24 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala (3 additions, 3 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Entropy.scala (0 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala (0 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala (0 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Variance.scala (0 additions, 1 deletion)
- mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala (4 additions, 0 deletions)