Feb 12, 2019

Fix issue #148 + refactor load_checkpoint.py (#153) · 1210f412
Neta Zmora authored
The root cause of issue #148 is that DataParallel modules cannot execute on the CPU,
on machines that have both CPUs and GPUs.
Therefore, we don't wrap models loaded onto the CPU with DataParallel, but we do wrap
models loaded onto the GPUs with DataParallel (to make them run faster).
The names of the module keys saved in a checkpoint file depend on whether the modules
are wrapped by a DataParallel module, which prefixes every key with "module.".
So loading a checkpoint that was created on the GPU onto a CPU model (and vice versa)
will fail on the keys.
This is PyTorch behavior and, despite the community asking for a fix
(e.g. https://github.com/pytorch/pytorch/issues/7457), it is still pending.
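
A minimal sketch of the device-dependent wrapping described above (the helper name
wrap_for_device is illustrative, not Distiller's actual API):

```python
import torch.nn as nn

def wrap_for_device(model, device):
    # Hypothetical helper: DataParallel cannot execute on the CPU, so we
    # only wrap the model when it will run on the GPU(s).
    if device == 'cuda':
        return nn.DataParallel(model).cuda()
    return model.cpu()
```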
      
This commit contains code to catch key errors when loading a GPU-generated checkpoint
(i.e. one saved from a DataParallel-wrapped model) onto a CPU model, and to convert
the names of the keys.
      
This PR also merges refactoring of load_checkpoint.py done by @barrh, who also added
a test to further exercise checkpoint loading.
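
A minimal sketch of the catch-and-convert technique this commit describes (assuming
the checkpoint stores the weights under a 'state_dict' key; the function names are
illustrative, not Distiller's actual API):

```python
import torch

def strip_dataparallel_keys(state_dict):
    # nn.DataParallel prefixes every parameter name with 'module.';
    # strip it so the checkpoint loads into a plain (CPU) model.
    return {k[len('module.'):] if k.startswith('module.') else k: v
            for k, v in state_dict.items()}

def load_cpu_model(model, checkpoint_path):
    # map_location='cpu' remaps GPU tensors to CPU memory at load time.
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    state_dict = checkpoint['state_dict']  # assumed checkpoint layout
    try:
        model.load_state_dict(state_dict)
    except (KeyError, RuntimeError):
        # Older PyTorch versions raise KeyError on mismatched keys, newer
        # ones raise RuntimeError; either way, convert the names and retry.
        model.load_state_dict(strip_dataparallel_keys(state_dict))
    return model
```

Loading a CPU-saved checkpoint into a DataParallel-wrapped model is the mirror case:
there the "module." prefix must be added rather than stripped.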