---
layout: global
title: Extracting, transforming and selecting features - spark.ml
displayTitle: Extracting, transforming and selecting features - spark.ml
---

This section covers algorithms for working with features, roughly divided into these groups:

  • Extraction: Extracting features from "raw" data
  • Transformation: Scaling, converting, or modifying features
  • Selection: Selecting a subset from a larger set of features

Table of Contents

  • This will become a table of contents (this text will be scraped).
{:toc}

Feature Extractors

TF-IDF (HashingTF and IDF)

Term Frequency-Inverse Document Frequency (TF-IDF) is a common text pre-processing step. In Spark ML, TF-IDF is separated into two parts: TF (+hashing) and IDF.

TF: HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words. The algorithm combines Term Frequency (TF) counts with the hashing trick for dimensionality reduction.

IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.

Please refer to the MLlib user guide on TF-IDF for more details on Term Frequency and Inverse Document Frequency.

In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
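For illustration, here is a minimal Scala sketch of this flow. It assumes a Spark 2.x `SparkSession` named `spark` and uses made-up toy sentences; the bundled examples referenced below are the canonical versions.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Toy corpus: one sentence per row (values are illustrative only).
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each bag of words into a fixed-length term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

// Fit IDF on the corpus, then rescale the term-frequency vectors.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show(false)
```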

Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/TfIdfExample.scala %}

Refer to the HashingTF Java docs and the IDF Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaTfIdfExample.java %}

Refer to the HashingTF Python docs and the IDF Python docs for more details on the API.

{% include_example python/ml/tf_idf_example.py %}

Word2Vec

Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.

In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
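A minimal Scala sketch of this (as before, assuming a `SparkSession` named `spark` and toy documents):

```scala
import org.apache.spark.ml.feature.Word2Vec

// Each row holds one document as a sequence of words.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Train a Word2VecModel that maps each word to a 3-dimensional vector.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

// Each document becomes the average of its word vectors.
val result = model.transform(documentDF)
result.show(false)
```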

Refer to the Word2Vec Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/Word2VecExample.scala %}

Refer to the Word2Vec Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaWord2VecExample.java %}

Refer to the Word2Vec Python docs for more details on the API.

{% include_example python/ml/word2vec_example.py %}

CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter "minDF" also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.

Examples

Assume that we have the following DataFrame with columns id and texts:

 id | texts
----|----------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")

Each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c); after transformation, the output column "vector" contains:

 id | texts                           | vector
----|---------------------------------|---------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])

Each vector represents the token counts of the document over the vocabulary.
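A Scala sketch that reproduces the table above (the vocabSize and minDF values are chosen for this toy corpus; a SparkSession named `spark` is assumed):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "texts")

// Fit a CountVectorizerModel: keep at most 3 vocabulary terms,
// each of which must appear in at least 2 documents.
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("texts")
  .setOutputCol("vector")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

cvModel.transform(df).show(false)
```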

Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/CountVectorizerExample.scala %}

Refer to the CountVectorizer Java docs and the CountVectorizerModel Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java %}

Feature Transformers

Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.

RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (regex, default: \s+) is used as the delimiter to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes "tokens" rather than splitting gaps; in that case, all matching occurrences are returned as the tokenization result.
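As a sketch, the two tokenizers can be used as follows (the regex pattern "\W", splitting on non-word characters, is just one possible choice):

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")

// Simple whitespace tokenization.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(sentenceDataFrame).select("words").show(false)

// Regex tokenization: split on any non-word character instead of whitespace.
// (Alternatively, set gaps to false with a pattern such as "\\w+" to match tokens directly.)
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W")
regexTokenizer.transform(sentenceDataFrame).select("words").show(false)
```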

Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/TokenizerExample.scala %}

Refer to the Tokenizer Java docs and the RegexTokenizer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaTokenizerExample.java %}

Refer to the Tokenizer Python docs and the RegexTokenizer Python docs for more details on the API.

{% include_example python/ml/tokenizer_example.py %}

StopWordsRemover

Stop words are words which should be excluded from the input, typically because the words appear frequently and don't carry as much meaning.

StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. We provide a list of stop words by default, accessible by calling getStopWords on a newly instantiated StopWordsRemover instance. A boolean parameter caseSensitive indicates if the matches should be case sensitive (false by default).

Examples

Assume that we have the following DataFrame with columns id and raw:

 id | raw
----|----------
 0  | [I, saw, the, red, balloon]
 1  | [Mary, had, a, little, lamb]

Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:

 id | raw                          | filtered
----|------------------------------|---------------------
 0  | [I, saw, the, red, balloon]  | [saw, red, balloon]
 1  | [Mary, had, a, little, lamb] | [Mary, little, lamb]

In filtered, the stop words "I", "the", "had", and "a" have been filtered out.
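A short Scala sketch of this example (assuming a SparkSession named `spark`):

```scala
import org.apache.spark.ml.feature.StopWordsRemover

val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

// Drop the default English stop words from each input sequence.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
remover.transform(dataSet).show(false)
```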

Refer to the StopWordsRemover Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala %}

Refer to the StopWordsRemover Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaStopWordsRemoverExample.java %}

Refer to the StopWordsRemover Python docs for more details on the API.

{% include_example python/ml/stopwords_remover_example.py %}

n-gram

An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.

NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
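For example, a bigram (n = 2) transformation might look like this sketch:

```scala
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes"))
)).toDF("id", "words")

// Produce bigrams: each output element is two consecutive words joined by a space.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")
ngram.transform(wordDataFrame).select("ngrams").show(false)
```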

Refer to the NGram Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/NGramExample.scala %}

Refer to the NGram Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaNGramExample.java %}

Refer to the NGram Python docs for more details on the API.

{% include_example python/ml/n_gram_example.py %}

Binarizer

Binarization is the process of thresholding numerical features to binary (0/1) features.

Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
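A minimal sketch with a threshold of 0.5 (toy values):

```scala
import org.apache.spark.ml.feature.Binarizer

val dataFrame = spark.createDataFrame(
  Array((0, 0.1), (1, 0.8), (2, 0.2))
).toDF("id", "feature")

// Values > 0.5 become 1.0; values <= 0.5 become 0.0.
val binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)
binarizer.transform(dataFrame).show()
```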

Refer to the Binarizer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/BinarizerExample.scala %}

Refer to the Binarizer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaBinarizerExample.java %}

Refer to the Binarizer Python docs for more details on the API.

{% include_example python/ml/binarizer_example.py %}

PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
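A sketch of the 5-dimensional to 3-dimensional projection described above (assuming Spark 2.x, where vectors live in org.apache.spark.ml.linalg):

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// Fit a PCAModel that projects the 5-dimensional vectors onto the top 3 principal components.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)
pca.transform(df).select("pcaFeatures").show(false)
```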

Refer to the PCA Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/PCAExample.scala %}

Refer to the PCA Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaPCAExample.java %}

Refer to the PCA Python docs for more details on the API.

{% include_example python/ml/pca_example.py %}

PolynomialExpansion

Polynomial expansion is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A PolynomialExpansion class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
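A sketch expanding 2-dimensional vectors into a degree-3 polynomial space:

```scala
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.dense(2.0, 1.0),
  Vectors.dense(0.0, 0.0),
  Vectors.dense(3.0, -1.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// Expand each 2-dimensional vector into the monomials of its components up to degree 3.
val polyExpansion = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(3)
polyExpansion.transform(df).show(false)
```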

Refer to the PolynomialExpansion Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala %}

Refer to the PolynomialExpansion Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaPolynomialExpansionExample.java %}

Refer to the PolynomialExpansion Python docs for more details on the API.

{% include_example python/ml/polynomial_expansion_example.py %}

Discrete Cosine Transform (DCT)

The Discrete Cosine Transform transforms a length $N$ real-valued sequence in the time domain into another length $N$ real-valued sequence in the frequency domain. A DCT class provides this functionality, implementing the DCT-II and scaling the result by $1/\sqrt{2}$ such that the representing matrix for the transform is unitary. No shift is applied to the transformed sequence (e.g. the 0th element of the transformed sequence is the 0th DCT coefficient and not the $N/2$th).
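A sketch applying the forward DCT to a few 4-dimensional vectors:

```scala
import org.apache.spark.ml.feature.DCT
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.dense(0.0, 1.0, -2.0, 3.0),
  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
  Vectors.dense(14.0, -2.0, -5.0, 1.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// Apply the forward DCT-II; setInverse(true) would apply the inverse transform instead.
val dct = new DCT()
  .setInputCol("features")
  .setOutputCol("featuresDCT")
  .setInverse(false)
dct.transform(df).select("featuresDCT").show(false)
```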

Refer to the DCT Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/DCTExample.scala %}

Refer to the DCT Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaDCTExample.java %}

StringIndexer