Feb 12, 2019

Fix issue #148 + refactor load_checkpoint.py (#153) · 1210f412
Neta Zmora authored
The root cause of issue #148 is that DataParallel modules cannot execute on the CPU,
on machines that have both CPUs and GPUs.
Therefore, we don't wrap models loaded onto the CPU with DataParallel, but we do wrap
models loaded onto the GPUs with DataParallel (to make them run faster).
The names of the module keys saved in a checkpoint file depend on whether the modules
are wrapped by a DataParallel module, which prefixes every key with "module.".
So loading a checkpoint that was created on the GPU onto a CPU model (and vice versa)
will fail on the keys.
This is PyTorch behavior and, despite the community asking for a fix
(e.g. https://github.com/pytorch/pytorch/issues/7457), it is still pending.
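
A minimal sketch of the device-dependent wrapping described above (the helper name
wrap_for_device is illustrative, not Distiller's actual API):

```python
import torch.nn as nn

def wrap_for_device(model, device):
    # Hypothetical helper: DataParallel cannot execute on the CPU, so we
    # only wrap the model when it will run on the GPU(s).
    if device == 'cuda':
        return nn.DataParallel(model).cuda()
    return model.cpu()
```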
      
This commit contains code to catch key errors when loading a GPU-generated checkpoint
(i.e. one saved from a DataParallel-wrapped model) onto a CPU model, and to convert
the names of the keys.
      
This PR also merges refactoring of load_checkpoint.py done by @barrh, who also added
a test to further exercise checkpoint loading.
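
A minimal sketch of the catch-and-convert technique this commit describes (assuming
the checkpoint stores the weights under a 'state_dict' key; the function names are
illustrative, not Distiller's actual API):

```python
import torch

def strip_dataparallel_keys(state_dict):
    # nn.DataParallel prefixes every parameter name with 'module.';
    # strip it so the checkpoint loads into a plain (CPU) model.
    return {k[len('module.'):] if k.startswith('module.') else k: v
            for k, v in state_dict.items()}

def load_cpu_model(model, checkpoint_path):
    # map_location='cpu' remaps GPU tensors to CPU memory at load time.
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    state_dict = checkpoint['state_dict']  # assumed checkpoint layout
    try:
        model.load_state_dict(state_dict)
    except (KeyError, RuntimeError):
        # Older PyTorch versions raise KeyError on mismatched keys, newer
        # ones raise RuntimeError; either way, convert the names and retry.
        model.load_state_dict(strip_dataparallel_keys(state_dict))
    return model
```

Loading a CPU-saved checkpoint into a DataParallel-wrapped model is the mirror case:
there the "module." prefix must be added rather than stripped.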