diff --git a/docs/ml-features.md b/docs/ml-features.md
index d67fce3c9528a8cca83a8f6fd3d98df218d20893..13d97a2290dc3bd100c6fc6af306bb182cd71d25 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1426,9 +1426,9 @@ categorical features. ChiSqSelector uses the
 features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
-* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
+* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
 * `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
-* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
+* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
 By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
 The user can choose a selection method using `setSelectorType`.

diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index acd28943132db2fadd03559f73ea02997c90ca9f..75aea7060187521f4f77427406232625fa626db1 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -231,9 +231,9 @@ features to choose.
 It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
-* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
+* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
 * `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
-* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
+* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
 By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
 The user can choose a selection method using `setSelectorType`.

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 353bd186daf019c9c638453ded6da284da3e41b3..16abc4949dea35bdaba0ebdcc7dcf600970b2764 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -143,13 +143,13 @@ private[feature] trait ChiSqSelectorParams extends Params
  * `fdr`, `fwe`.
  *  - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
  *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
- *  - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
+ *  - `fpr` chooses all features whose p-values are below a threshold, thus controlling the false
  *    positive rate of selection.
  *  - `fdr` uses the [Benjamini-Hochberg procedure]
  *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
  *    to choose all features whose false discovery rate is below a threshold.
- *  - `fwe` chooses all features whose p-values is below a threshold,
- *    thus controlling the family-wise error rate of selection.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
  * By default, the selection method is `numTopFeatures`, with the default number of top features
  * set to 50.
  */

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
index 9dea3c3e843c4af5fa0dad71ff704cb541aabb60..862be6f37e7e3ab501410149b6d0ba81a548586a 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
@@ -175,13 +175,13 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
  * `fdr`, `fwe`.
  *  - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
  *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
- *  - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
+ *  - `fpr` chooses all features whose p-values are below a threshold, thus controlling the false
  *    positive rate of selection.
  *  - `fdr` uses the [Benjamini-Hochberg procedure]
  *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
  *    to choose all features whose false discovery rate is below a threshold.
- *  - `fwe` chooses all features whose p-values is below a threshold,
- *    thus controlling the family-wise error rate of selection.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
  * By default, the selection method is `numTopFeatures`, with the default number of top features
  * set to 50.
  */

diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
index f6c68b9314cadbb97338224d0579bf2206d1fe17..482e5d54260d4bf8a7c97e7d955df22b6710a489 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
@@ -35,22 +35,77 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
     // Toy dataset, including the top feature for a chi-squared test.
     // These data are chosen such that each feature's test has a distinct p-value.
-    /* To verify the results with R, run:
-     library(stats)
-     x1 <- c(8.0, 0.0, 0.0, 7.0, 8.0)
-     x2 <- c(7.0, 9.0, 9.0, 9.0, 7.0)
-     x3 <- c(0.0, 6.0, 8.0, 5.0, 3.0)
-     y <- c(0.0, 1.0, 1.0, 2.0, 2.0)
-     chisq.test(x1,y)
-     chisq.test(x2,y)
-     chisq.test(x3,y)
+    /*
+     *  Contingency tables
+     *  feature1 = {6.0, 0.0, 8.0}
+     *  class  0 1 2
+     *    6.0||1|0|0|
+     *    0.0||0|3|0|
+     *    8.0||0|0|2|
+     *  degree of freedom = 4, statistic = 12, pValue = 0.017
+     *
+     *  feature2 = {7.0, 9.0}
+     *  class  0 1 2
+     *    7.0||1|0|0|
+     *    9.0||0|3|2|
+     *  degree of freedom = 2, statistic = 6, pValue = 0.049
+     *
+     *  feature3 = {0.0, 6.0, 3.0, 8.0}
+     *  class  0 1 2
+     *    0.0||1|0|0|
+     *    6.0||0|1|2|
+     *    3.0||0|1|0|
+     *    8.0||0|1|0|
+     *  degree of freedom = 6, statistic = 8.66, pValue = 0.193
+     *
+     *  feature4 = {7.0, 0.0, 5.0, 4.0}
+     *  class  0 1 2
+     *    7.0||1|0|0|
+     *    0.0||0|2|0|
+     *    5.0||0|1|1|
+     *    4.0||0|0|1|
+     *  degree of freedom = 6, statistic = 9.5, pValue = 0.147
+     *
+     *  feature5 = {6.0, 5.0, 4.0, 0.0}
+     *  class  0 1 2
+     *    6.0||1|1|0|
+     *    5.0||0|2|0|
+     *    4.0||0|0|1|
+     *    0.0||0|0|1|
+     *  degree of freedom = 6, statistic = 8.0, pValue = 0.238
+     *
+     *  feature6 = {0.0, 9.0, 5.0, 4.0}
+     *  class  0 1 2
+     *    0.0||1|0|1|
+     *    9.0||0|1|0|
+     *    5.0||0|1|0|
+     *    4.0||0|1|1|
+     *  degree of freedom = 6, statistic = 5, pValue = 0.54
+     *
+     *  To verify the results with R, run:
+     *  library(stats)
+     *  x1 <- c(6.0, 0.0, 0.0, 0.0, 8.0, 8.0)
+     *  x2 <- c(7.0, 9.0, 9.0, 9.0, 9.0, 9.0)
+     *  x3 <- c(0.0, 6.0, 3.0, 8.0, 6.0, 6.0)
+     *  x4 <- c(7.0, 0.0, 0.0, 5.0, 5.0, 4.0)
+     *  x5 <- c(6.0, 5.0, 5.0, 6.0, 4.0, 0.0)
+     *  x6 <- c(0.0, 9.0, 5.0, 4.0, 4.0, 0.0)
+     *  y <- c(0.0, 1.0, 1.0, 1.0, 2.0, 2.0)
+     *  chisq.test(x1,y)
+     *  chisq.test(x2,y)
+     *  chisq.test(x3,y)
+     *  chisq.test(x4,y)
+     *  chisq.test(x5,y)
+     *  chisq.test(x6,y)
      */
+
     dataset = spark.createDataFrame(Seq(
-      (0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0))), Vectors.dense(8.0)),
-      (1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0))), Vectors.dense(0.0)),
-      (1.0, Vectors.dense(Array(0.0, 9.0, 8.0)), Vectors.dense(0.0)),
-      (2.0, Vectors.dense(Array(7.0, 9.0, 5.0)), Vectors.dense(7.0)),
-      (2.0, Vectors.dense(Array(8.0, 7.0, 3.0)), Vectors.dense(8.0))
+      (0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0))), Vectors.dense(6.0)),
+      (1.0, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0))), Vectors.dense(0.0)),
+      (1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0))), Vectors.dense(0.0)),
+      (1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), Vectors.dense(0.0)),
+      (2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), Vectors.dense(8.0)),
+      (2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)), Vectors.dense(8.0))
     )).toDF("label", "features", "topFeature")
   }
@@ -69,19 +124,25 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
   test("Test Chi-Square selector: percentile") {
     val selector = new ChiSqSelector()
-      .setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.34)
+      .setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.17)
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }

   test("Test Chi-Square selector: fpr") {
     val selector = new ChiSqSelector()
-      .setOutputCol("filtered").setSelectorType("fpr").setFpr(0.2)
+      .setOutputCol("filtered").setSelectorType("fpr").setFpr(0.02)
+    ChiSqSelectorSuite.testSelector(selector, dataset)
+  }
+
+  test("Test Chi-Square selector: fdr") {
+    val selector = new ChiSqSelector()
+      .setOutputCol("filtered").setSelectorType("fdr").setFdr(0.12)
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }

   test("Test Chi-Square selector: fwe") {
     val selector = new ChiSqSelector()
-      .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
+      .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.12)
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index dbd17e01d221308d235612bcab0ef9edb9d2840d..ac90c899d91f2f726e6fc01487076014ca5856bb 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2629,7 +2629,8 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
     """
     .. note:: Experimental

-    Creates a ChiSquared feature selector.
+    Chi-Squared feature selection, which selects categorical features to use for predicting a
+    categorical label.

     The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
     `fdr`, `fwe`.
@@ -2638,15 +2639,15 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja

     * `percentile` is similar but chooses a fraction of all features instead of a fixed number.

-    * `fpr` chooses all features whose p-value is below a threshold,
+    * `fpr` chooses all features whose p-values are below a threshold,
       thus controlling the false positive rate of selection.

     * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
       False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
       to choose all features whose false discovery rate is below a threshold.

-    * `fwe` chooses all features whose p-values is below a threshold,
-      thus controlling the family-wise error rate of selection.
+    * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+      1/numFeatures, thus controlling the family-wise error rate of selection.

     By default, the selection method is `numTopFeatures`, with the default number of top features
     set to 50.

diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index 61f2bc7492ad69fbd2b29044fa97c24a2c27c041..e5231dc3a27a8f8da18569d56c06a1e304fca2a3 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -282,15 +282,15 @@ class ChiSqSelector(object):

    * `percentile` is similar but chooses a fraction of all features instead of a fixed number.

-   * `fpr` chooses all features whose p-value is below a threshold,
+   * `fpr` chooses all features whose p-values are below a threshold,
      thus controlling the false positive rate of selection.

    * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
      False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
      to choose all features whose false discovery rate is below a threshold.

-   * `fwe` chooses all features whose p-values is below a threshold,
-     thus controlling the family-wise error rate of selection.
+   * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+     1/numFeatures, thus controlling the family-wise error rate of selection.

    By default, the selection method is `numTopFeatures`, with the default number of top features
    set to 50.
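The updated test thresholds can be sanity-checked without Spark. The sketch below is not Spark's implementation (helper names are ours); it applies the three p-value-based rules, as the docs describe them, to the six per-feature p-values listed in the suite's comment (0.017, 0.049, 0.193, 0.147, 0.238, 0.54), using the thresholds from the updated tests. Each rule should keep only feature 0, the dataset's `topFeature`.

```python
# Per-feature chi-squared p-values from the ChiSqSelectorSuite comment,
# indexed 0..5 for feature1..feature6.
p_values = [0.017, 0.049, 0.193, 0.147, 0.238, 0.54]

def select_fpr(pvals, alpha):
    # `fpr`: keep every feature whose p-value is below the raw threshold.
    return [i for i, p in enumerate(pvals) if p < alpha]

def select_fwe(pvals, alpha):
    # `fwe`: Bonferroni-style; the threshold is scaled by 1/numFeatures.
    return [i for i, p in enumerate(pvals) if p < alpha / len(pvals)]

def select_fdr(pvals, alpha):
    # `fdr`: Benjamini-Hochberg; find the largest rank k (1-based, over
    # p-values sorted ascending) with p_(k) <= alpha * k / n, then keep
    # the k features with the smallest p-values.
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / len(pvals):
            k_max = rank
    return sorted(order[:k_max])

print(select_fpr(p_values, 0.02))   # setFpr(0.02)
print(select_fdr(p_values, 0.12))   # setFdr(0.12)
print(select_fwe(p_values, 0.12))   # setFwe(0.12), cutoff 0.12/6 = 0.02
```

With these inputs all three calls return `[0]`, matching the suite's expectation that each selector picks only the top feature; note `fwe` at 0.12 behaves like `fpr` at 0.02 here because the family-wise threshold is divided by the six features.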