    Principal Component Analysis · 66a03e5f
    Reza Zadeh authored
    # Principal Component Analysis
    
    Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The returned coefficient matrix is n-by-k: each column contains the coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.
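
The description above (center the data, take an SVD, return an n-by-k coefficient matrix) can be sketched on a single machine with NumPy. This is only an illustration of the technique, not the distributed implementation in this commit; the function name `pca_coefficients` and the sample matrix are made up for the example.

```python
import numpy as np

def pca_coefficients(X, k):
    """Return an n-by-k matrix of the top-k principal component coefficients.

    Hypothetical single-machine helper; assumes k <= min(m, n).
    """
    X = np.asarray(X, dtype=float)
    # Center each column (variable) so it has zero mean.
    centered = X - X.mean(axis=0)
    # Thin SVD: centered = U @ diag(s) @ Vt, singular values in descending order.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Rows of Vt are the principal directions; transpose so that each
    # column of the result holds the coefficients of one component.
    return vt[:k].T

# Four observations (rows) of two variables (columns).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.5]])
coeffs = pca_coefficients(X, 1)  # shape (2, 1)
```

Because the singular values come back in descending order, the columns are already ordered by component variance. The sign of each column is arbitrary, so two correct implementations may return columns that differ by a factor of -1.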
    
    ## Testing
    Tests included:
     * All principal components
     * Only top k principal components
     * Dense SVD tests
     * Dense/sparse matrix tests
    
    The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html
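
When comparing against another implementation such as MATLAB's pca, exact equality is usually too strict: floating-point results differ slightly, and each principal component is only defined up to sign. The sketch below shows one way such a comparison could be written. It is an assumption for illustration, not the `assertMatrixApproximatelyEquals` helper from this commit, and the name `columns_match_up_to_sign` is hypothetical.

```python
import numpy as np

def columns_match_up_to_sign(a, b, tol=1e-6):
    """Return True if every column of a equals the same column of b or its negation,
    within an absolute tolerance (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    for j in range(a.shape[1]):
        same = np.allclose(a[:, j], b[:, j], rtol=0.0, atol=tol)
        flipped = np.allclose(a[:, j], -b[:, j], rtol=0.0, atol=tol)
        if not (same or flipped):
            return False
    return True
```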
    
    ## Documentation
    Added to mllib-guide.md
    
    ## Example Usage
    Added to examples directory under SparkPCA.scala
    
    Author: Reza Zadeh <rizlar@gmail.com>
    
    Closes #88 from rezazadeh/sparkpca and squashes the following commits:
    
    e298700 [Reza Zadeh] reformat using IDE
    3f23271 [Reza Zadeh] documentation and cleanup
    b025ab2 [Reza Zadeh] documentation
    e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals
    3787bb4 [Reza Zadeh] stylin
    c6ecc1f [Reza Zadeh] docs
    aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense
    56975b0 [Reza Zadeh] docs
    2df9bde [Reza Zadeh] docs update
    8fb0015 [Reza Zadeh] rcond documentation
    dbf7797 [Reza Zadeh] correct argument number
    a9f1f62 [Reza Zadeh] documentation
    4ce6caa [Reza Zadeh] style changes
    9a56a02 [Reza Zadeh] use rcond relative to larget svalue
    120f796 [Reza Zadeh] housekeeping
    156ff78 [Reza Zadeh] string comprehension
    2e1cf43 [Reza Zadeh] rename rcond
    ea223a6 [Reza Zadeh] many style changes
    f4002d7 [Reza Zadeh] more docs
    bd53c7a [Reza Zadeh] proper accumulator
    a8b5ecf [Reza Zadeh] Don't use for loops
    0dc7980 [Reza Zadeh] filter zeros in sparse
    6115610 [Reza Zadeh] More documentation
    36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation
    bc4599f [Reza Zadeh] configurable rcond
    86f7515 [Reza Zadeh] compute per parition, use while
    09726b3 [Reza Zadeh] more style changes
    4195e69 [Reza Zadeh] private, accumulator
    17002be [Reza Zadeh] style changes
    4ba7471 [Reza Zadeh] style change
    f4982e6 [Reza Zadeh] Use dense matrix in example
    2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops
    72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean
    f807be9 [Reza Zadeh] fix typo
    2d7ccde [Reza Zadeh] Array interface for dense svd and pca
    cd290fa [Reza Zadeh] provide RDD[Array[Double]] support
    398d123 [Reza Zadeh] style change
    55abbfa [Reza Zadeh] docs fix
    ef29644 [Reza Zadeh] bad chnage undo
    472566e [Reza Zadeh] all files from old pr
    555168f [Reza Zadeh] initial files
mllib-guide.md 1.70 KiB
layout: global
title: Machine Learning Library (MLlib)

MLlib is a Spark implementation of some common machine learning (ML) functionality, as well as associated tests and data generators. MLlib currently supports four common types of machine learning problem settings, namely binary classification, regression, clustering, and collaborative filtering, as well as an underlying gradient descent optimization primitive.

Available Methods

The following links provide a detailed explanation of the methods and usage examples for each of them:

Dependencies

MLlib uses the jblas linear algebra library, which itself depends on native Fortran routines. You may need to install the gfortran runtime library if it is not already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries automatically.

To use MLlib in Python, you will need NumPy version 1.7 or newer and Python 2.7.
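
A quick way to confirm the NumPy requirement on a node is to compare the installed version string against the minimum. The helper below is a minimal sketch using only the standard library; `meets_minimum` is a hypothetical name, not part of MLlib or NumPy.

```python
import re

def meets_minimum(version, minimum):
    """Return True if a dotted version string is at least `minimum` (a tuple of ints)."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits of each component, e.g. "0rc1" -> 0.
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts) >= tuple(minimum)

# Typical use on a worker node (requires NumPy to be importable):
#   import numpy
#   meets_minimum(numpy.__version__, (1, 7))
```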