    [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels · 4a9034b1
    VinceShieh authored
    ## What changes were proposed in this pull request?
    This PR enhances the ML StringIndexer.
    Before this PR, StringIndexer supported only the "skip" and "error" options for handling unseen labels.
    Unseen records can still be useful, and in some use cases users want to keep them.
    This PR adds a third option, "keep", which assigns every unseen label the index numLabels.
    
    Before:
    StringIndexer().setHandleInvalid("skip")
    StringIndexer().setHandleInvalid("error")
    After (a third option, "keep", is supported):
    StringIndexer().setHandleInvalid("keep")
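
The difference between the three options can be illustrated with a small pure-Python sketch. This is not Spark's implementation: `fit_labels` and `transform` are hypothetical helper names, and the alphabetical tie-break between equally frequent labels is an assumption made only to keep the example deterministic. It only mirrors the behaviour described above, where "keep" maps every unseen label to the index numLabels.

```python
from collections import Counter

VALID_MODES = ("skip", "error", "keep")

def fit_labels(values):
    """Return distinct labels ordered by descending frequency.

    Ties are broken alphabetically here (an assumption for this sketch).
    """
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

def transform(values, labels, handle_invalid="error"):
    """Map each value to its label index, treating unseen labels per handle_invalid."""
    if handle_invalid not in VALID_MODES:
        raise ValueError(f"handle_invalid must be one of {VALID_MODES}")
    index = {label: float(i) for i, label in enumerate(labels)}
    unseen_index = float(len(labels))  # numLabels: one extra bucket for unseen labels
    out = []
    for value in values:
        if value in index:
            out.append((value, index[value]))
        elif handle_invalid == "keep":
            out.append((value, unseen_index))  # keep the row, use the extra bucket
        elif handle_invalid == "skip":
            continue  # drop the row
        else:
            raise ValueError(f"Unseen label: {value}")
    return out

labels = fit_labels(["a", "b", "a", "c", "a", "b"])  # a: 3, b: 2, c: 1
kept = transform(["a", "d", "c"], labels, handle_invalid="keep")
skipped = transform(["a", "d", "c"], labels, handle_invalid="skip")
```

With three fitted labels, the unseen value "d" is kept with index 3.0 under "keep" and dropped under "skip"; under "error" the same call raises.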
    
    ## How was this patch tested?
    A test was added to StringIndexerSuite.
    
    Signed-off-by: VinceShieh <vincent.xie@intel.com>
    
    Author: VinceShieh <vincent.xie@intel.com>
    
    Closes #16883 from VinceShieh/spark-17498.