Timothy Hunter authored
This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested. Author: Timothy Hunter <timhunter@databricks.com> Closes #10207 from thunterdb/spark-8517.
layout: global
title: Classification and regression - spark.ml
displayTitle: Classification and regression in spark.ml
\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]
Table of Contents
- This will become a table of contents (this text will be scraped). {:toc}
In MLlib, we implement popular linear methods such as logistic regression and linear least squares with $L_1$ or $L_2$ regularization. Refer to the linear methods in mllib for details. In `spark.ml`, we also include the Pipelines API for elastic net, a hybrid of $L_1$ and $L_2$ regularization proposed in Zou et al., "Regularization and variable selection via the elastic net".
Mathematically, it is defined as a convex combination of the $L_1$ and the $L_2$ regularization terms:
\[ \alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0 \]
By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$ regularization as special cases. For example, if a linear regression model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a Lasso model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a ridge regression model.
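The two special cases can be checked numerically. The following is a minimal sketch of the penalty term only, not the `spark.ml` implementation; the function name `elastic_net_penalty` and the sample weights are hypothetical and chosen for illustration.

```python
def elastic_net_penalty(weights, alpha, lam):
    """Return alpha * (lam * ||w||_1) + (1 - alpha) * (lam / 2 * ||w||_2^2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return alpha * (lam * l1_norm) + (1.0 - alpha) * (lam / 2.0 * l2_norm_squared)


# Hypothetical weight vector, used only to illustrate the special cases.
weights = [3.0, -4.0]

# alpha = 1 keeps only the L_1 term (the Lasso penalty).
lasso = elastic_net_penalty(weights, alpha=1.0, lam=0.5)

# alpha = 0 keeps only the L_2 term (the ridge penalty).
ridge = elastic_net_penalty(weights, alpha=0.0, lam=0.5)

# Any alpha in between mixes the two penalties linearly.
mixed = elastic_net_penalty(weights, alpha=0.5, lam=0.5)
```

Because the combination is convex, the mixed penalty always lies between the pure Lasso and pure ridge values for the same $\lambda$.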
We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
Classification
Logistic regression
Logistic regression is a popular method to predict a binary response. It is a special case of generalized linear models that predicts the probability of the outcome.
For more background and more details about the implementation, refer to the documentation of the logistic regression in `spark.mllib`.
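As a rough illustration of "predicting the probability of the outcome", the sketch below applies the logistic function to a linear combination of the features. It is plain Python, not the `spark.ml` API; the function names, the sample coefficients, and the 0.5 decision threshold are assumptions made for this example.

```python
import math


def predict_probability(weights, intercept, features):
    """Probability of the positive class under a binary logistic model."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    # Branch on the sign of the margin so that math.exp never overflows.
    if margin >= 0.0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def predict_label(weights, intercept, features, threshold=0.5):
    """Turn the probability into a 0.0/1.0 label (threshold is an assumption)."""
    return 1.0 if predict_probability(weights, intercept, features) > threshold else 0.0


# Hypothetical coefficients, for illustration only.
weights = [2.0, -1.0]
intercept = 0.5

on_boundary = predict_probability(weights, intercept, [1.0, 2.5])  # margin is 0
positive = predict_label(weights, intercept, [1.0, 0.0])           # margin is 2.5
```

A point whose margin is exactly zero sits on the decision boundary and receives probability 0.5.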
The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass classification will be added in the future.
Example
The following example shows how to train a logistic regression model with elastic net regularization. `elasticNetParam` corresponds to $\alpha$ and `regParam` corresponds to $\lambda$.
The `spark.ml` implementation of logistic regression also supports extracting a summary of the model over the training set. Note that the predictions and metrics, which are stored as `DataFrame`s in `BinaryLogisticRegressionSummary`, are annotated `@transient` and hence only available on the driver.
`LogisticRegressionTrainingSummary` provides a summary for a `LogisticRegressionModel`. Currently, only binary classification is supported and the summary must be explicitly cast to `BinaryLogisticRegressionTrainingSummary`. This will likely change when multiclass classification is supported.
Continuing the earlier example:
{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
Decision tree classifier
Decision trees are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the section on decision trees.
Example
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the training set, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the decision tree algorithm can recognize.
More details on parameters can be found in the Scala API documentation.
{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
More details on parameters can be found in the Java API documentation.
{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}
More details on parameters can be found in the Python API documentation.
{% include_example python/ml/decision_tree_classification_example.py %}
Random forest classifier
Random forests are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the section on random forests.
Example
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the training set, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
Refer to the Scala API docs for more details.
{% include_example scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala %}
Refer to the Java API docs for more details.
{% include_example java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java %}
Refer to the Python API docs for more details.
{% include_example python/ml/random_forest_classifier_example.py %}
Gradient-boosted tree classifier
Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees.
More information about the `spark.ml` implementation can be found further in the section on GBTs.
Example
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the training set, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
Refer to the Scala API docs for more details.
{% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala %}
Refer to the Java API docs for more details.
{% include_example java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java %}
Refer to the Python API docs for more details.
{% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
Multilayer perceptron classifier
Multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network.
MLPC consists of multiple layers of nodes.
Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by performing a linear combination of the inputs with the node's weights $\wv$ and bias $\bv$ and applying an activation function.
It can be written in matrix form for MLPC with $K+1$ layers as follows:
\[ \mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K) \]
Nodes in intermediate layers use the sigmoid (logistic) function:
\[ \mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}} \]
Nodes in the output layer use the softmax function:
\[ \mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}} \]
The number of nodes $N$ in the output layer corresponds to the number of classes.
MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
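The forward pass described by the formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the `spark.ml` implementation: the helper names, the layer sizes, and the weight values are all hypothetical, and training (backpropagation with L-BFGS) is not shown.

```python
import math


def sigmoid(values):
    """Element-wise logistic function, used by the intermediate layers."""
    result = []
    for z in values:
        if z >= 0.0:
            result.append(1.0 / (1.0 + math.exp(-z)))
        else:
            exp_z = math.exp(z)
            result.append(exp_z / (1.0 + exp_z))
    return result


def softmax(values):
    """Softmax over the output layer; subtracting the max avoids overflow."""
    largest = max(values)
    exps = [math.exp(z - largest) for z in values]
    total = sum(exps)
    return [e / total for e in exps]


def affine(node_weights, biases, inputs):
    """One linear combination per node: row j holds node j's incoming weights."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(node_weights, biases)
    ]


def forward(layers, inputs):
    """Apply sigmoid on every layer except the last, which uses softmax."""
    activations = inputs
    for index, (node_weights, biases) in enumerate(layers):
        z = affine(node_weights, biases, activations)
        is_output_layer = index == len(layers) - 1
        activations = softmax(z) if is_output_layer else sigmoid(z)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 3 output classes.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.3, 0.3], [-0.5, 0.8]], [0.0, 0.0, 0.0]),
]
class_probabilities = forward(layers, [1.0, 2.0])
```

The output has one entry per class and the entries sum to one, which is why the number of output nodes $N$ must match the number of classes.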
Example
One-vs-Rest classifier (a.k.a. One-vs-All)
`OneVsRest` is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All."
`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
Predictions are made by evaluating each binary classifier; the index of the most confident classifier is output as the label.
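The reduction described above has two parts: relabeling the data once per class, and picking the most confident binary classifier at prediction time. The sketch below shows both in plain Python. It does not use the `OneVsRest` API; the function names and the three stand-in scoring functions are hypothetical.

```python
def one_vs_rest_labels(labels, num_classes):
    """Build one binary label list per class: 1.0 for class i, 0.0 otherwise."""
    return [
        [1.0 if label == i else 0.0 for label in labels]
        for i in range(num_classes)
    ]


def one_vs_rest_predict(binary_scorers, features):
    """Return the index of the binary classifier with the highest confidence."""
    scores = [scorer(features) for scorer in binary_scorers]
    return max(range(len(scores)), key=scores.__getitem__)


# Relabeling step: each class gets its own "is it class i or not" problem.
binary_problems = one_vs_rest_labels([0, 2, 1, 2], num_classes=3)

# Hypothetical stand-ins for three trained binary classifiers; each returns
# a confidence that the example belongs to its own class.
binary_scorers = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: -x[0] - x[1],
]
predicted_class = one_vs_rest_predict(binary_scorers, [0.2, 0.9])
```

In this sketch ties are broken in favor of the lowest class index, which is a property of Python's `max` rather than a statement about `spark.ml`.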
Example
The example below demonstrates how to load the Iris dataset, parse it as a `DataFrame`, and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm's accuracy.
Refer to the Scala API docs for more details.
{% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
Refer to the Java API docs for more details.
{% include_example java/org/apache/spark/examples/ml/JavaOneVsRestExample.java %}
Regression
Linear regression
The interface for working with linear regression models and model summaries is similar to the logistic regression case.
Example
The following example demonstrates training an elastic net regularized linear regression model and extracting model summary statistics.
Decision tree regression
Decision trees are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the section on decision trees.
Example