    [SPARK-17675][CORE] Expand Blacklist for TaskSets · 9ce7d3e5
    Imran Rashid authored
    ## What changes were proposed in this pull request?
    
    This is a step along the way to SPARK-8425.
    
    To enable incremental review, the first step proposed here is to expand blacklisting within tasksets. In particular, this enables blacklisting for:
    * (task, executor) pairs (this already exists via an undocumented config)
    * (task, node)
    * (taskset, executor)
    * (taskset, node)
    
    Adding (task, node) is critical to making Spark tolerant of a single bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important for avoiding many misleading task failures and long scheduling delays when there is one bad node in a large cluster.
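
As a rough illustration of how the four levels interact, here is a minimal Python sketch. It is not the code in this patch: the class name `TaskSetBlacklistSketch`, the method names, and the threshold parameters and their default values are all hypothetical, chosen only to show the idea for a single taskset.

```python
from collections import defaultdict


class TaskSetBlacklistSketch:
    """Hypothetical model of the four blacklist levels for one taskset.

    All names and default thresholds here are assumptions for illustration,
    not Spark's actual classes or configuration values.
    """

    def __init__(self, max_task_attempts_per_executor=1,
                 max_task_attempts_per_node=2,
                 max_failed_tasks_per_executor=2,
                 max_failed_executors_per_node=2):
        self.max_task_attempts_per_executor = max_task_attempts_per_executor
        self.max_task_attempts_per_node = max_task_attempts_per_node
        self.max_failed_tasks_per_executor = max_failed_tasks_per_executor
        self.max_failed_executors_per_node = max_failed_executors_per_node
        # executor -> task index -> number of failed attempts
        self.failures = defaultdict(lambda: defaultdict(int))
        # node -> executors on that node that have seen at least one failure
        self.node_to_executors = defaultdict(set)

    def record_failure(self, task, executor, node):
        self.failures[executor][task] += 1
        self.node_to_executors[node].add(executor)

    def task_blacklisted_on_executor(self, task, executor):
        # (task, executor): this task failed too often on this executor.
        count = self.failures.get(executor, {}).get(task, 0)
        return count >= self.max_task_attempts_per_executor

    def task_blacklisted_on_node(self, task, node):
        # (task, node): this task failed too often across the node's executors.
        total = sum(self.failures[e].get(task, 0)
                    for e in self.node_to_executors.get(node, ()))
        return total >= self.max_task_attempts_per_node

    def executor_blacklisted(self, executor):
        # (taskset, executor): too many distinct tasks failed on this executor.
        distinct = len(self.failures.get(executor, {}))
        return distinct >= self.max_failed_tasks_per_executor

    def node_blacklisted(self, node):
        # (taskset, node): too many executors on this node are blacklisted.
        bad = sum(1 for e in self.node_to_executors.get(node, ())
                  if self.executor_blacklisted(e))
        return bad >= self.max_failed_executors_per_node

    def can_schedule(self, task, executor, node):
        return not (self.task_blacklisted_on_executor(task, executor)
                    or self.task_blacklisted_on_node(task, node)
                    or self.executor_blacklisted(executor)
                    or self.node_blacklisted(node))


if __name__ == "__main__":
    bl = TaskSetBlacklistSketch()
    bl.record_failure(0, "exec1", "hostA")
    bl.record_failure(0, "exec2", "hostA")
    # Task 0 is now kept off every executor on hostA, including fresh ones.
    print(bl.can_schedule(0, "exec3", "hostA"))
    print(bl.can_schedule(0, "exec3", "hostB"))
```

In this sketch, two failures of the same task on different executors of one host are enough to keep that task off the whole host, which is the (task, node) behavior described above: a retry lands on a different machine instead of using up the task's remaining attempts on the same bad disk.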
    
    Note that some of the code changes here aren't strictly required for this change alone -- they put pieces in place for SPARK-8425 even though they are not used yet (e.g., the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).
    
    ## How was this patch tested?
    
    Added unit tests; ran tests via Jenkins.
    
    Author: Imran Rashid <irashid@cloudera.com>
    Author: mwws <wei.mao@intel.com>
    
    Closes #15249 from squito/taskset_blacklist_only.