    dc2714da
    [SPARK-22290][CORE] Avoid creating Hive delegation tokens when not necessary.
    Marcelo Vanzin authored
    Hive delegation tokens are only needed when the Spark driver has no access
    to the Kerberos TGT. That happens only in two situations (see the sketch
    after this list):
    
    - when using a proxy user
    - when using cluster mode without a keytab
    
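    A minimal sketch of that decision, assuming the usual
    spark.submit.deployMode / spark.yarn.keytab settings and Hadoop's
    UserGroupInformation API; the helper name and structure are illustrative,
    not the provider's actual code:

        // Illustrative only: mirrors the two situations above, not Spark's
        // internal code.
        import org.apache.hadoop.security.UserGroupInformation
        import org.apache.spark.SparkConf

        def hiveTokensRequired(conf: SparkConf): Boolean = {
          val ugi = UserGroupInformation.getCurrentUser
          val isProxyUser = ugi.getAuthenticationMethod ==
            UserGroupInformation.AuthenticationMethod.PROXY
          val clusterMode = conf.get("spark.submit.deployMode", "client") == "cluster"
          val hasKeytab = conf.contains("spark.yarn.keytab")
          // Tokens are needed only when the driver cannot use the TGT directly.
          isProxyUser || (clusterMode && !hasKeytab)
        }
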
    This change modifies the Hive provider so that it only generates delegation
    tokens in those situations, and tweaks the YARN AM to make the proper user
    visible to the Hive code when running with a keytab, so that the TGT can be
    used instead of a delegation token.
    
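    As a hedged illustration of the doAs pattern involved (an assumed shape,
    not the actual AM code), the keytab login yields a UGI and Hive-facing
    work runs inside it:

        // Hypothetical helper: log in from the keytab and run Hive-facing
        // code as that user, so the TGT authenticates instead of a token.
        import java.security.PrivilegedExceptionAction
        import org.apache.hadoop.security.UserGroupInformation

        def runAsKeytabUser[T](principal: String, keytab: String)(body: => T): T = {
          val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
          ugi.doAs(new PrivilegedExceptionAction[T] {
            override def run(): T = body
          })
        }
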
    The effect of this change is that it's now possible to initialize multiple,
    non-concurrent SparkContext instances in the same JVM. Before, the second
    invocation would fail to fetch a new Hive delegation token, which could
    then make the second (or third or...) application fail once the token
    expired. With this change, the TGT is used to authenticate to the HMS
    instead.
    
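    For example, a sequential loop like the one below (the app logic is
    illustrative; a kerberized cluster with a TGT is assumed) now works,
    since each context authenticates to the metastore with the TGT:

        // Illustrative: sequential (non-concurrent) contexts in one JVM.
        import org.apache.spark.{SparkConf, SparkContext}

        val conf = new SparkConf().setAppName("multi-context-demo")
        for (i <- 1 to 3) {
          val sc = new SparkContext(conf)
          println(s"run $i sum = ${sc.parallelize(1 to 10).sum()}")
          sc.stop()  // stop before creating the next context
        }
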
    This change also avoids polluting the currently logged-in user's credentials
    when launching applications. The credentials are copied only when running
    applications as a proxy user. This makes it possible to implement SPARK-11035
    later, where multiple threads might be launching applications, and each app
    should have its own set of credentials.
    
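    A minimal sketch of that copy-for-proxy-users idea, using Hadoop's
    UserGroupInformation API (the helper name is hypothetical):

        // Hypothetical: a freshly created proxy UGI carries no tokens, so
        // copy the real user's credentials into it rather than mutating
        // the logged-in user's own credential set.
        import org.apache.hadoop.security.UserGroupInformation

        def proxyUserUgi(proxyName: String): UserGroupInformation = {
          val realUser = UserGroupInformation.getCurrentUser
          val proxy = UserGroupInformation.createProxyUser(proxyName, realUser)
          proxy.addCredentials(realUser.getCredentials)  // copy, don't share
          proxy
        }
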
    Tested by verifying HDFS and Hive access in the following scenarios:
    - client and cluster mode
    - client and cluster mode with proxy user
    - client and cluster mode with principal / keytab
    - long-running cluster app with principal / keytab
    - pyspark app that creates (and stops) multiple SparkContext instances
      throughout its lifetime
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #19509 from vanzin/SPARK-22290.