    ad1a8466
    [SPARK-15141][EXAMPLE][DOC] Update OneVsRest Examples · ad1a8466
    Zheng RuiFeng authored
    ## What changes were proposed in this pull request?
    1, Add python example for OneVsRest
    2, remove args-parsing
    
    ## How was this patch tested?
    manual tests
    `./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #12920 from zhengruifeng/ovr_pe.
ml-classification-regression.md 30.41 KiB
layout: global
title: Classification and regression - spark.ml
displayTitle: Classification and regression - spark.ml

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]

Table of Contents

  • This will become a table of contents (this text will be scraped). {:toc}

In spark.ml, we implement popular linear methods such as logistic regression and linear least squares with L_1 or L_2 regularization. Refer to the linear methods in mllib for details about implementation and tuning. We also include a DataFrame API for elastic net, a hybrid of L_1 and L_2 regularization proposed in Zou et al, Regularization and variable selection via the elastic net.

Mathematically, it is defined as a convex combination of the L_1 and the L_2 regularization terms:

\[ \alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0 \]

By setting \alpha properly, elastic net contains both L_1 and L_2 regularization as special cases. For example, if a linear regression model is trained with the elastic net parameter \alpha set to 1, it is equivalent to a Lasso model. On the other hand, if \alpha is set to 0, the trained model reduces to a ridge regression model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
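The two special cases can be checked numerically. The following is a minimal sketch in plain Python, not Spark code: the function name `elastic_net_penalty` and the sample weight vector are made up for illustration, and the argument names merely mirror the `regParam` (\lambda) and `elasticNetParam` (\alpha) parameters discussed in this guide.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Evaluate alpha * (lambda * ||w||_1) + (1 - alpha) * (lambda / 2 * ||w||_2^2).

    Hypothetical helper for illustration only; it is not part of spark.ml.
    """
    if not 0.0 <= elastic_net_param <= 1.0:
        raise ValueError("elastic_net_param (alpha) must lie in [0, 1]")
    if reg_param < 0.0:
        raise ValueError("reg_param (lambda) must be non-negative")

    l1_norm = sum(abs(w) for w in weights)       # ||w||_1
    l2_norm_sq = sum(w * w for w in weights)     # ||w||_2^2

    l1_term = reg_param * l1_norm                # Lasso-style term
    l2_term = 0.5 * reg_param * l2_norm_sq       # ridge-style term
    return elastic_net_param * l1_term + (1.0 - elastic_net_param) * l2_term


# Illustrative weight vector: ||w||_1 = 7 and ||w||_2^2 = 25.
weights = [3.0, -4.0]

# alpha = 1 keeps only the L_1 term (Lasso): 0.5 * 7 = 3.5
lasso_penalty = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=1.0)

# alpha = 0 keeps only the L_2 term (ridge): 0.5 / 2 * 25 = 6.25
ridge_penalty = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.0)

# alpha = 0.5 mixes the two equally: 0.5 * 3.5 + 0.5 * 6.25 = 4.875
mixed_penalty = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.5)
```

Any \alpha strictly between 0 and 1 yields a penalty that lies between the Lasso and ridge values for the same \lambda, which is what "convex combination" means here.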

Classification

Logistic regression

Logistic regression is a popular method to predict a binary response. It is a special case of generalized linear models that predicts the probability of the outcome. For more background and more details about the implementation, refer to the documentation of logistic regression in spark.mllib.
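To make "predicts the probability of the outcome" concrete, here is a small sketch of the standard logistic model in plain Python. It is not the spark.ml implementation: the helper names, the sample weights, and the default 0.5 decision threshold are assumptions chosen for illustration.

```python
import math


def predict_probability(weights, intercept, features):
    """Return P(label = 1 | features) under a logistic model.

    Hypothetical helper: applies the logistic function to the linear
    margin w . x + b, using a form that avoids overflow for large margins.
    """
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    if margin >= 0.0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def predict_label(weights, intercept, features, threshold=0.5):
    """Turn the probability into a 0/1 label (threshold is an assumed default)."""
    probability = predict_probability(weights, intercept, features)
    return 1.0 if probability > threshold else 0.0


# Illustrative model and data point: margin = 0.5 + 2.0 * 1.0 - 1.0 * 0.5 = 2.0
weights = [2.0, -1.0]
intercept = 0.5
features = [1.0, 0.5]

probability = predict_probability(weights, intercept, features)  # about 0.88
label = predict_label(weights, intercept, features)              # 1.0
```

A margin of zero corresponds to a probability of exactly 0.5, so the sign of the margin decides the label when the threshold is 0.5.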

The current implementation of logistic regression in spark.ml only supports binary classes. Support for multiclass classification will be added in the future.

Example

The following example shows how to train a logistic regression model with elastic net regularization. elasticNetParam corresponds to \alpha and regParam corresponds to \lambda.

{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
{% include_example python/ml/logistic_regression_with_elastic_net.py %}

The spark.ml implementation of logistic regression also supports extracting a summary of the model over the training set. Note that the predictions and metrics, which are stored as a DataFrame in BinaryLogisticRegressionSummary, are annotated @transient and hence are only available on the driver.

LogisticRegressionTrainingSummary provides a summary for a LogisticRegressionModel. Currently, only binary classification is supported and the summary must be explicitly cast to BinaryLogisticRegressionTrainingSummary. This will likely change when multiclass classification is supported.

Continuing the earlier example:

{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}

{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}

Logistic regression model summary is not yet supported in Python.

Decision tree classifier

Decision trees are a popular family of classification and regression methods. More information about the spark.ml implementation can be found further in the section on decision trees.

Example

The following examples load a dataset in LibSVM format, split it into training and test sets, train on the training set, and then evaluate on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the DataFrame which the Decision Tree algorithm can recognize.
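The data-preparation steps described above can be sketched without Spark. The plain-Python helpers below are hypothetical and simplified: `parse_libsvm_line` reads the usual `label index:value ...` text layout with 1-based indices, `random_split` shuffles and cuts at a fixed fraction (Spark's own splitting is randomized per row, so its split sizes are only approximate), and `index_labels` assigns index 0 to the most frequent label, which is the ordering a label indexer typically uses. The sample lines and the 0.75 fraction are made up.

```python
import random
from collections import Counter


def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {zero_based_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        # LibSVM indices start at 1; shift to 0-based positions.
        features[int(index_text) - 1] = float(value_text)
    return label, features


def random_split(rows, train_fraction, seed):
    """Shuffle with a fixed seed, then cut into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def index_labels(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Made-up sample data in LibSVM text format.
sample_lines = [
    "1 3:0.5 7:1.0",
    "0 1:2.0",
    "1 2:1.5 3:0.25",
    "1 7:4.0",
]

rows = [parse_libsvm_line(line) for line in sample_lines]
train_rows, test_rows = random_split(rows, train_fraction=0.75, seed=42)
label_index = index_labels(label for label, _ in rows)
```

Fitting the label index on the full dataset before splitting, as done here, ensures that every label seen at evaluation time already has an index.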

More details on parameters can be found in the Scala API documentation.

{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}

More details on parameters can be found in the Java API documentation.

{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}

More details on parameters can be found in the Python API documentation.

{% include_example python/ml/decision_tree_classification_example.py %}

Random forest classifier