[SPARK-2242] HOTFIX: pyspark shell hangs on simple job (commit 5603e4c4)
    Andrew Or authored
    This reverts a change introduced in 38702487, which redirected all stderr to the OS pipe instead of directly to the `bin/pyspark` shell output. This causes a simple job to hang in two ways:
    
    1. If the cluster is not configured correctly or does not have enough resources, the job hangs without producing any output, because the relevant warning messages are masked.
2. If the stderr volume is large, redirecting everything to the OS pipe can deadlock the parent and child. From the [python docs](https://docs.python.org/2/library/subprocess.html) (a short sketch of the hazard follows the quote):
    
    ```
Note: Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock
    based on the child process output volume. Use Popen with the communicate() method
    when you need pipes.
    ```
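To make the hazard concrete, here is a minimal, hypothetical sketch (not the actual `java_gateway.py` code) contrasting the two patterns: piping stderr, which can fill the OS pipe buffer and block the child, versus letting stderr inherit the shell's stream, which is what this commit restores.

```python
import subprocess

# Hazardous pattern (what the reverted change did, sketched): stderr=PIPE
# sends the child's stderr into a fixed-size OS pipe buffer (~64 KiB on
# Linux). If nobody drains that pipe, a chatty child eventually blocks
# writing to stderr -- and if the parent is meanwhile blocked reading
# stdout, the two processes deadlock.
#
#   proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
#                           stderr=subprocess.PIPE)

# Safe pattern (what this commit restores): pipe only stdout, which carries
# the small py4j port handshake, and let stderr flow straight to the shell,
# so warnings stay visible and no pipe buffer can fill up.
cmd = ["python", "-c",
       "import sys; print(25333); sys.stderr.write('some warning\\n')"]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
first_line = proc.stdout.readline()  # handshake line; stderr went to the terminal
proc.wait()
```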
    
    Note that we cannot remove `stdout=PIPE` in a similar way, because we currently use it to communicate the py4j port. However, it should be fine (as it has been for a long time) because we do not produce a ton of traffic through `stdout`.
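For illustration, a hedged sketch of that stdout handshake; `read_gateway_port` and its error message are hypothetical, not the real `java_gateway.py` helper, but they mirror the idea in the squashed commits of failing with helpful output when stdout is garbled.

```python
def read_gateway_port(stdout):
    # Hypothetical helper (the real java_gateway.py differs): the launcher
    # is expected to print the py4j port as the first line of its stdout.
    # Since stdout is the only piped stream, any stray output here means
    # something interfered with the handshake, so fail with a descriptive
    # error instead of a bare ValueError.
    line = stdout.readline().decode("utf-8", "replace").strip()
    try:
        return int(line)
    except ValueError:
        raise RuntimeError(
            "Could not parse py4j port from launcher output: %r" % line)
```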
    
    That commit was not merged in branch-1.0, so this fix is for master only.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #1178 from andrewor14/fix-python and squashes the following commits:
    
    e68e870 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-python
    20849a8 [Andrew Or] Tone down stdout interference message
    a09805b [Andrew Or] Return more than 1 line of error message to user
    6dfbd1e [Andrew Or] Don't swallow original exception
    0d1861f [Andrew Or] Provide more helpful output if stdout is garbled
    21c9d7c [Andrew Or] Do not mask stderr from output