    [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update · a6428292
    Xusen Yin authored
    ## What changes were proposed in this pull request?
    
    This PR is an update for [https://github.com/apache/spark/pull/12738] which:
    * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
    * Various fixes for bugs found
      * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
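
As a minimal sketch of the weightCol handling described above (not the actual Spark code; `resolve_weight_col` is a hypothetical helper, and an unset Param is modelled here as `None`), unset and empty-string values can both be mapped to "no weight column":

```python
from typing import Optional


def resolve_weight_col(weight_col: Optional[str]) -> Optional[str]:
    """Return the weight column name, or None when no weighting is requested.

    Hypothetical helper: an unset Param is modelled as None, and an empty
    string is treated exactly like an unset value.
    """
    if weight_col is None or weight_col == "":
        return None
    return weight_col
```

Callers then only need to check for `None` instead of checking for both an unset value and an empty string.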
    
    Defaults changed:
    * Scala
     * LogisticRegression: weightCol defaults to not set (instead of empty string)
     * StringIndexer: labels default to not set (instead of empty array)
     * GeneralizedLinearRegression:
       * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
       * weightCol defaults to not set (instead of empty string)
     * LinearRegression: weightCol defaults to not set (instead of empty string)
    * Python
     * MultilayerPerceptron: layers default to not set (instead of [1,1])
     * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
    
    ## How was this patch tested?
    
    Generic unit test. Manually verified the unit test by changing defaults and confirming that the change broke the test.
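
The idea behind such a default-value check can be sketched in plain Python. This is an illustration under assumptions, not the test added in this PR: `find_default_mismatches` and the dictionaries of defaults are hypothetical stand-ins for the values a real test would read from the Python wrapper and its Scala counterpart.

```python
from typing import Any, Dict, List, Tuple

# Sentinel meaning "this side defines no default for the param".
_UNSET = object()


def find_default_mismatches(
    python_defaults: Dict[str, Any],
    scala_defaults: Dict[str, Any],
) -> List[Tuple[str, Any, Any]]:
    """Return (param name, python default, scala default) for every mismatch.

    Hypothetical helper: a param missing from one side is reported with the
    placeholder string "<unset>" for that side.
    """
    mismatches = []
    for name in sorted(set(python_defaults) | set(scala_defaults)):
        py = python_defaults.get(name, _UNSET)
        sc = scala_defaults.get(name, _UNSET)
        if py is _UNSET or sc is _UNSET or py != sc:
            mismatches.append((
                name,
                "<unset>" if py is _UNSET else py,
                "<unset>" if sc is _UNSET else sc,
            ))
    return mismatches
```

For example, the ChiSqSelector case listed above (no Python default against a default of 50) would be reported by `find_default_mismatches({}, {"numTopFeatures": 50})`, while identical dictionaries yield an empty list.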
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    Author: yinxusen <yinxusen@gmail.com>
    
    Closes #12816 from jkbradley/yinxusen-SPARK-14931.
classification.py 55.73 KiB