Skip to content
Snippets Groups Projects
Commit 13db11cb authored by Joseph K. Bradley's avatar Joseph K. Bradley Committed by Xiangrui Meng
Browse files

[SPARK-10061] [DOC] ML ensemble docs

User guide for spark.ml GBTs and Random Forests.
The examples are copied from the decision tree guide and modified to run.

I caught some issues I had somehow missed in the tree guide as well.

I have run all examples, including Java ones.  (Of course, I thought I had previously as well...)

CC: mengxr manishamde yanboliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8369 from jkbradley/ml-ensemble-docs.
parent cb2d2e15
No related branches found
No related tags found
No related merge requests found
...@@ -30,7 +30,7 @@ The Pipelines API for Decision Trees offers a bit more functionality than the or ...@@ -30,7 +30,7 @@ The Pipelines API for Decision Trees offers a bit more functionality than the or
Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html). Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html).
# Inputs and Outputs (Predictions) # Inputs and Outputs
We list the input and output (prediction) column types here. We list the input and output (prediction) column types here.
All output columns are optional; to exclude an output column, set its corresponding Param to an empty string. All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
...@@ -234,7 +234,7 @@ IndexToString labelConverter = new IndexToString() ...@@ -234,7 +234,7 @@ IndexToString labelConverter = new IndexToString()
// Chain indexers and tree in a Pipeline // Chain indexers and tree in a Pipeline
Pipeline pipeline = new Pipeline() Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter}); .setStages(new PipelineStage[] {labelIndexer, featureIndexer, dt, labelConverter});
// Train model. This also runs the indexers. // Train model. This also runs the indexers.
PipelineModel model = pipeline.fit(trainingData); PipelineModel model = pipeline.fit(trainingData);
...@@ -315,10 +315,13 @@ print treeModel # summary only ...@@ -315,10 +315,13 @@ print treeModel # summary only
## Regression ## Regression
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier). More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).
{% highlight scala %} {% highlight scala %}
import org.apache.spark.ml.Pipeline import org.apache.spark.ml.Pipeline
...@@ -347,7 +350,7 @@ val dt = new DecisionTreeRegressor() ...@@ -347,7 +350,7 @@ val dt = new DecisionTreeRegressor()
.setLabelCol("label") .setLabelCol("label")
.setFeaturesCol("indexedFeatures") .setFeaturesCol("indexedFeatures")
// Chain indexers and tree in a Pipeline // Chain indexer and tree in a Pipeline
val pipeline = new Pipeline() val pipeline = new Pipeline()
.setStages(Array(featureIndexer, dt)) .setStages(Array(featureIndexer, dt))
...@@ -365,9 +368,7 @@ val evaluator = new RegressionEvaluator() ...@@ -365,9 +368,7 @@ val evaluator = new RegressionEvaluator()
.setLabelCol("label") .setLabelCol("label")
.setPredictionCol("prediction") .setPredictionCol("prediction")
.setMetricName("rmse") .setMetricName("rmse")
// We negate the RMSE value since RegressionEvalutor returns negated RMSE val rmse = evaluator.evaluate(predictions)
// (since evaluation metrics are meant to be maximized by CrossValidator).
val rmse = - evaluator.evaluate(predictions)
println("Root Mean Squared Error (RMSE) on test data = " + rmse) println("Root Mean Squared Error (RMSE) on test data = " + rmse)
val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel] val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel]
...@@ -377,14 +378,15 @@ println("Learned regression tree model:\n" + treeModel.toDebugString) ...@@ -377,14 +378,15 @@ println("Learned regression tree model:\n" + treeModel.toDebugString)
<div data-lang="java" markdown="1"> <div data-lang="java" markdown="1">
More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html). More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/regression/DecisionTreeRegressor.html).
{% highlight java %} {% highlight java %}
import org.apache.spark.ml.Pipeline; import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel; import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage; import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.evaluation.RegressionEvaluator; import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.feature.*; import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.ml.regression.DecisionTreeRegressionModel; import org.apache.spark.ml.regression.DecisionTreeRegressionModel;
import org.apache.spark.ml.regression.DecisionTreeRegressor; import org.apache.spark.ml.regression.DecisionTreeRegressor;
import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.regression.LabeledPoint;
...@@ -396,17 +398,12 @@ import org.apache.spark.sql.DataFrame; ...@@ -396,17 +398,12 @@ import org.apache.spark.sql.DataFrame;
RDD<LabeledPoint> rdd = MLUtils.loadLibSVMFile(sc.sc(), "data/mllib/sample_libsvm_data.txt"); RDD<LabeledPoint> rdd = MLUtils.loadLibSVMFile(sc.sc(), "data/mllib/sample_libsvm_data.txt");
DataFrame data = jsql.createDataFrame(rdd, LabeledPoint.class); DataFrame data = jsql.createDataFrame(rdd, LabeledPoint.class);
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
StringIndexerModel labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data);
// Automatically identify categorical features, and index them. // Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer() VectorIndexerModel featureIndexer = new VectorIndexer()
.setInputCol("features") .setInputCol("features")
.setOutputCol("indexedFeatures") .setOutputCol("indexedFeatures")
.setMaxCategories(4) // features with > 4 distinct values are treated as continuous .setMaxCategories(4)
.fit(data); .fit(data);
// Split the data into training and test sets (30% held out for testing) // Split the data into training and test sets (30% held out for testing)
...@@ -416,61 +413,49 @@ DataFrame testData = splits[1]; ...@@ -416,61 +413,49 @@ DataFrame testData = splits[1];
// Train a DecisionTree model. // Train a DecisionTree model.
DecisionTreeRegressor dt = new DecisionTreeRegressor() DecisionTreeRegressor dt = new DecisionTreeRegressor()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures"); .setFeaturesCol("indexedFeatures");
// Convert indexed labels back to original labels. // Chain indexer and tree in a Pipeline
IndexToString labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels());
// Chain indexers and tree in a Pipeline
Pipeline pipeline = new Pipeline() Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter}); .setStages(new PipelineStage[] {featureIndexer, dt});
// Train model. This also runs the indexers. // Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData); PipelineModel model = pipeline.fit(trainingData);
// Make predictions. // Make predictions.
DataFrame predictions = model.transform(testData); DataFrame predictions = model.transform(testData);
// Select example rows to display. // Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5); predictions.select("label", "features").show(5);
// Select (prediction, true label) and compute test error // Select (prediction, true label) and compute test error
RegressionEvaluator evaluator = new RegressionEvaluator() RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol("indexedLabel") .setLabelCol("label")
.setPredictionCol("prediction") .setPredictionCol("prediction")
.setMetricName("rmse"); .setMetricName("rmse");
// We negate the RMSE value since RegressionEvalutor returns negated RMSE double rmse = evaluator.evaluate(predictions);
// (since evaluation metrics are meant to be maximized by CrossValidator).
double rmse = - evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse); System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);
DecisionTreeRegressionModel treeModel = DecisionTreeRegressionModel treeModel =
(DecisionTreeRegressionModel)(model.stages()[2]); (DecisionTreeRegressionModel)(model.stages()[1]);
System.out.println("Learned regression tree model:\n" + treeModel.toDebugString()); System.out.println("Learned regression tree model:\n" + treeModel.toDebugString());
{% endhighlight %} {% endhighlight %}
</div> </div>
<div data-lang="python" markdown="1"> <div data-lang="python" markdown="1">
More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier). More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor).
{% highlight python %} {% highlight python %}
from pyspark.ml import Pipeline from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import StringIndexer, VectorIndexer from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.mllib.util import MLUtils from pyspark.mllib.util import MLUtils
# Load and parse the data file, converting it to a DataFrame. # Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them. # Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous. # We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\ featureIndexer =\
...@@ -480,26 +465,24 @@ featureIndexer =\ ...@@ -480,26 +465,24 @@ featureIndexer =\
(trainingData, testData) = data.randomSplit([0.7, 0.3]) (trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model. # Train a DecisionTree model.
dt = DecisionTreeRegressor(labelCol="indexedLabel", featuresCol="indexedFeatures") dt = DecisionTreeRegressor(featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline # Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt]) pipeline = Pipeline(stages=[featureIndexer, dt])
# Train model. This also runs the indexers. # Train model. This also runs the indexer.
model = pipeline.fit(trainingData) model = pipeline.fit(trainingData)
# Make predictions. # Make predictions.
predictions = model.transform(testData) predictions = model.transform(testData)
# Select example rows to display. # Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5) predictions.select("prediction", "label", "features").show(5)
# Select (prediction, true label) and compute test error # Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator( evaluator = RegressionEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="rmse") labelCol="label", predictionCol="prediction", metricName="rmse")
# We negate the RMSE value since RegressionEvalutor returns negated RMSE rmse = evaluator.evaluate(predictions)
# (since evaluation metrics are meant to be maximized by CrossValidator).
rmse = -evaluator.evaluate(predictions)
print "Root Mean Squared Error (RMSE) on test data = %g" % rmse print "Root Mean Squared Error (RMSE) on test data = %g" % rmse
treeModel = model.stages[1] treeModel = model.stages[1]
......
This diff is collapsed.
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment