  1. Nov 11, 2019
    • Pruning with virtual Batch-norm statistics folding (#415) · c849a25f
      Neta Zmora authored
      * pruning: add an option to virtually fold BN into Conv2D for ranking
      
      PruningPolicy can be configured using a new control argument, `fold_batchnorm`: when set to `True`, the weights of BatchNorm modules are folded into the weights of Conv2D modules (if Conv2D->BN edges exist in the model graph).  Each weight filter is attenuated using a different pair of (gamma, beta) coefficients, so `fold_batchnorm` is relevant for fine-grained and filter-ranking pruning methods.  We attenuate using the running values of the mean and variance, as is done in quantization (a sketch of the folding computation follows the example below).
      This control argument is only supported for Conv2D modules (i.e. other convolution variants and Linear operations are not supported).
      e.g.:
      policies:
        - pruner:
            instance_name : low_pruner
            args:
              fold_batchnorm: True
          starting_epoch: 0
          ending_epoch: 30
          frequency: 2
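      
      For reference, a minimal sketch of the folding math (illustrative only, not the actual PruningPolicy code; only the weight scaling matters for ranking, so the beta/running-mean part of the fold is omitted):
      ```
      import torch
      import torch.nn as nn
      
      def fold_bn_for_ranking(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> torch.Tensor:
          # Scale each output filter k by gamma[k] / sqrt(running_var[k] + eps),
          # using the BN running statistics (as in quantization-style folding).
          scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
          return conv.weight * scale.view(-1, 1, 1, 1)
      ```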
      
      * AGP: non-functional refactoring
      
      distiller/pruning/automated_gradual_pruner.py – change `prune_to_target_sparsity`
      to `_set_param_mask_by_sparsity_target`, which is a more appropriate function
      name as we don’t really prune in this function
      
      * Simplify GEMM weights input-channel ranking logic
      
      Ranking weight matrices by input channels is similar to ranking 4D
      Conv weights by input channels, so there is no need for duplicate logic
      (see the sketch at the end of this item).
      
      distiller/pruning/ranked_structures_pruner.py
      -change `prune_to_target_sparsity` to `_set_param_mask_by_sparsity_target`,
      which is a more appropriate function name as we don’t really prune in this
      function
      -remove the code handling ranking of matrix rows
      
      distiller/norms.py – remove rank_cols.
      
      distiller/thresholding.py – in expand_binary_map treat `channels` group_type
      the same as the `cols` group_type when dealing with 2D weights
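      
      A rough sketch of the shared input-channel ranking idea (illustrative; the function name and reshaping are mine, not the Distiller API):
      ```
      import torch
      
      def per_input_channel_l1(weights: torch.Tensor) -> torch.Tensor:
          # Works for 2D GEMM weights (out x in) and 4D Conv weights (out x in x kH x kW).
          if weights.dim() == 4:
              # Collapse each input channel's (out x kH x kW) slice into one column.
              weights = weights.transpose(0, 1).contiguous().view(weights.size(1), -1).t()
          # Columns now correspond to input channels in both cases.
          return weights.abs().sum(dim=0)
      ```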
      
      * AGP: add example of ranking filters with virtual BN-folding
      
      Also update resnet20 AGP examples
  2. Nov 10, 2019
    • early-exit: further refactoring and resnet50-imagenet · 795590c8
      Neta Zmora authored
      Refactor EE code and place in a separate file.
      Fix resnet50-earlyexit (the inputs of the nn.Linear layers were wrong).
      
      Caveats:
      1. resnet50-earlyexit needs to be tested for performance.
      2. there is still too much EE code dispersed in apputils/image_classifier.py
      and compress_classifier.py
  3. Oct 06, 2019
    • Low-level pruning API refactor (#401) · 05d5592e
      Neta Zmora authored
      Some refactoring of the low-level pruning API
      
      Added distiller/norms.py - for calculating norms of various sub-tensors.
      
      ranked_structures_pruner.py:
      -Removed l1_magnitude, l2_magnitude; use distiller.norms.l1_norm instead
      -Lots of refactoring
      -replaced LpRankedStructureParameterPruner.ch_binary_map_to_mask with
      distiller.thresholding.expand_binary_map
      -FMReconstructionChannelPruner.rank_and_prune_channels used L2-norm
      by default and now uses L1-norm (i.e. magnitude_fn=l2_magnitude was
      replaced with magnitude_fn=distiller.norms.l1_norm)
      
      thresholding.py:
      -Delegated lots of the work to the new norms.py.
      -Removed support for 4D (entire convolution layers) since that has not been
      maintained for a long time. This may break some old scripts that remove entire
      layers.
      -added expand_binary_map() explicitly so others can use it (see the sketch at
      the end of this message). It might need to move to a different file
      -removed threshold_policy()
      
      utils.py:
      -use distiller.norms.xxx for sparsity stats
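      
      A rough sketch of what "expanding a binary map" means for the `channels` group type (illustrative; the real expand_binary_map handles more group types):
      ```
      import torch
      
      def expand_channel_binary_map(param: torch.Tensor, binary_map: torch.Tensor) -> torch.Tensor:
          # param:      4D Conv weights (out_channels x in_channels x kH x kW)
          # binary_map: 1D keep/drop (1/0) vector with one entry per input channel
          return binary_map.view(1, -1, 1, 1).expand_as(param).to(param.dtype)
      ```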
  4. Sep 01, 2019
    • AMC: add pruning of FC layers · 3f7a9408
      Neta Zmora authored
      FMReconstructionChannelPruner: add support for nn.Linear layers
      utils.py: add non_zero_channels()
      thinning: support removing channels from FC layers preceding Conv layers
      test_pruning.py: add test_row_pruning()
      scheduler: init from a dictionary of Maskers
      coach_if.py – fix imports of Clipped-PPO and TD3
  5. Aug 21, 2019
  6. Aug 06, 2019
    • AMC and other refactoring - large merge (#339) · 02054da1
      Neta Zmora authored
      * An implementation of AMC (the previous implementation code has moved
      to a new location under /distiller/examples/auto_compression/amc).
      AMC is aligned with the ‘master’ branch of Coach.
      * compress_classifier.py is refactored.  The base code moved
      to /distiller/apputils/image_classifier.py.  Further refactoring
      will follow.
      We want to provide a simple and small API to the basic features of
      a classifier-compression application.
      This will help applications that want to use the main features of a
      classifier-compression application, without the standard training
      regimen.
      AMC is one example of a stand-alone application that needs to leverage
      the capabilities of a classifier-compression application, but is currently
      coupled to `compress_classifier.py`.
      `multi-finetune.py` is another example.
      * ranked_structures_pruner.py:
      ** Added support for grouping channels/filters
      Sometimes we want to prune a group of structures: e.g. groups of
      8 channels.  This feature does not force the groups to be adjacent,
      so it is more like a set of structures.  E.g. when pruning channels
      from a 64-channel convolution, grouped by 8 channels, we will prune
      exactly one of 0/8/16/24/32/40/48/56 channels, i.e. always a multiple
      of 8 channels, excluding the set of all 64 channels (see the sketch
      at the end of this message).
      ** Added FMReconstructionChannelPruner – this is channel
      pruning using L1-magnitude to rank and select channels to
      remove, and feature-map reconstruction to improve the
      resilience to the pruning.
      * Added a script to run multiple instances of an 
      experiment, in different processes:
       examples/classifier_compression/multi-run.py
      * Set the seed value even when not specified by the command-line
      arguments, so that we can try and recreate the session.
      * Added pruning ranking noise -
      Ranking noise introduces Gaussian noise when ranking channels/filters
      using Lp-norm.  The noise is introduced using the epsilon-greedy
      methodology, where ranking using exact Lp-norm is considered greedy.
      * Added configurable rounding of the pruning level: choose whether to
      round up or down when rounding the number of structures to prune
      (rounding is always to an integer).
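      
      A small sketch of the grouping and rounding behavior described above (illustrative; the function name and defaults are mine, not the Distiller API):
      ```
      import math
      
      def num_structures_to_prune(total: int, fraction: float, group_size: int = 1,
                                  round_up: bool = False) -> int:
          # E.g. total=64, group_size=8 can only yield 0, 8, 16, ..., 56 (never all 64).
          raw = total * fraction / group_size
          groups = math.ceil(raw) if round_up else math.floor(raw)
          groups = min(groups, total // group_size - 1)  # never prune every structure
          return max(groups, 0) * group_size
      ```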
  7. May 20, 2019
    • Thinning: added support for group-wise convolutions · 6b832025
      Neta Zmora authored
      Group-wise convolutions with num_groups == num_in_channels, as
      configured in MobileNet for example, create attribute and shape dependency
      chains that are longer than convolutions with num_groups == 1.
      For example in the sequence below, changing the number of filters of the
      weights of Conv1, triggers changes in BN1, Conv2, BN2, and Conv3
      (g indicates the number of groups):
      
      Conv1(g=1) => BN1 ==> Conv2(g=32) ==> BN2 ==> Conv3(g=1)
      
      Changing the number of filters used in Conv1 affects the parameters and
      attributes of BN1 and Conv2 - the number of input channels of Conv2 is
      changed, as explained in
      https://nervanasystems.github.io/distiller/tutorial-struct_pruning.html.
      
      However, since Conv2 has num_groups == num_in_channels, we have to change
      num_groups, which triggers a change in num_out_channels.  This is akin to
      changing the number of filters of Conv2, which triggers a change in BN2
      and Conv3.
      
      models/mobilenet.py:
      Changed the code that flattens the output of the
      feature extractor and prepares it as input to the classifier.
      The code was written using hard-coded shape values, which made it
      impossible to use in thinned models (where dimensions are changed).
      
      tests/test_pruning.py:
      Added test for thinning MobileNet (grouped convolutions)
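      
      A rough illustration of the extra dependency for depthwise (num_groups == num_in_channels) convolutions; this is not the actual thinning code, and it assumes out_channels == in_channels as in MobileNet:
      ```
      import torch.nn as nn
      
      def thin_depthwise_conv(conv: nn.Conv2d, kept_channels: int) -> nn.Conv2d:
          # Shrinking in_channels also shrinks groups, and therefore out_channels,
          # which then propagates to the following BN and Conv modules.
          assert conv.groups == conv.in_channels == conv.out_channels
          return nn.Conv2d(kept_channels, kept_channels,
                           kernel_size=conv.kernel_size, stride=conv.stride,
                           padding=conv.padding, groups=kept_channels,
                           bias=conv.bias is not None)
      ```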
  8. May 16, 2019
  9. Apr 08, 2019
  10. Apr 01, 2019
    • Load optimizer from checkpoint (BREAKING - see details) (#182) · 992291cf
      Bar authored
      
      * Fixes issues #70, #145 and replaces PR #74
      * checkpoint.py
        * save_checkpoint will now save the optimizer type in addition to
          its state
        * load_checkpoint will now instantiate an optimizer based on the
          saved type and load its state
      * config.py: file/dict_config now accept the resumed epoch to pass to
        LR schedulers
      * policy.py: LRPolicy now passes the current epoch to the LR scheduler
      * Classifier compression sample
        * New flag '--resume-from' for properly resuming a saved training
          session, inc. optimizer state and epoch #
        * Flag '--reset-optimizer' added to allow discarding of a loaded
          optimizer.
        * BREAKING:
          * Previous flag '--resume' is deprecated and is mapped to
            '--resume-from' + '--reset-optimizer'. 
          * However, the old resuming behavior had an inconsistency: the epoch
            count would continue from the saved epoch, but the LR scheduler
            was set up as if we were starting from epoch 0.
          * Using '--resume-from' + '--reset-optimizer' now will simply
            RESET the epoch count to 0 for the whole environment.
          * This means that scheduling configurations (in YAML or code)
            which assumed use of '--resume' might need to be changed to
            reflect the fact that the epoch count now starts from 0
          * All relevant YAML files under 'examples' modified to reflect
            this change
      * Initial support for ReduceLROnPlateau (#161):
        * Allow passing **kwargs to policies via the scheduler
        * Image classification now passes the validation loss to the
          scheduler, to be used by ReduceLROnPlateau
        * The current implementation is experimental and subject to change
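      
      A simplified sketch of the save/load idea (illustrative; the key names and the placeholder lr are mine, not the actual checkpoint.py code):
      ```
      import torch
      
      def save_checkpoint_sketch(path, model, optimizer, epoch):
          torch.save({'epoch': epoch,
                      'state_dict': model.state_dict(),
                      'optimizer_type': type(optimizer),  # saved so load can re-instantiate
                      'optimizer_state_dict': optimizer.state_dict()}, path)
      
      def load_optimizer_sketch(path, model):
          chkpt = torch.load(path)
          # Re-create an optimizer of the saved type; load_state_dict restores its
          # hyperparameters (lr, momentum, ...) and buffers, so lr=0.1 is only a placeholder.
          optimizer = chkpt['optimizer_type'](model.parameters(), lr=0.1)
          optimizer.load_state_dict(chkpt['optimizer_state_dict'])
          return optimizer, chkpt['epoch']
      ```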
  11. Feb 26, 2019
  12. Dec 01, 2018
    • Important changes to pruning channels and filters (#93) · a0bf2a8f
      Neta Zmora authored
      This commit contains the main fix for issue #85.  It contains a couple of changes to the YAML structure pruning API, with examples.
      I urge you to read the documentation in the Wiki (https://github.com/NervanaSystems/distiller/wiki/Pruning-Filters-&-Channels).
      
      New syntax for defining Structured AGP.  I tried to make the syntax similar to fine-grained
      (i.e. element-wise) pruning.  All you need to do is add: ```group_type: Filters```.
      ```
        low_pruner:
          class: L1RankedStructureParameterPruner_AGP
          initial_sparsity : 0.10
          final_sparsity: 0.50
          group_type: Filters
          weights: [module.layer3.0.conv2.weight,
                    module.layer3.0.downsample.0.weight,
                    module.layer3.1.conv2.weight,
                    module.layer3.2.conv2.weight]
      ```
      
      If you want to define “leader-based” pruning dependencies, add ```group_dependency: Leader```:
      ```
        low_pruner:
          class: L1RankedStructureParameterPruner_AGP
          initial_sparsity : 0.10
          final_sparsity: 0.50
          group_type: Filters
          group_dependency: Leader
          weights: [module.layer3.0.conv2.weight,
                    module.layer3.0.downsample.0.weight,
                    module.layer3.1.conv2.weight,
                    module.layer3.2.conv2.weight]
      ```
      
      Retired the old ```reg_regims``` API for describing one-shot structured-pruning.
      
      The new YAML API is very similar to AGP structured-pruning, which is much better
      than before.
      The new API also allows us to describe data-dependencies when doing one-shot
      structured pruning, just like AGP structured-pruning.
      
      This commit also includes further code refactoring.
      
      Old API:
      ```
        filter_pruner:
           class: 'L1RankedStructureParameterPruner'
           reg_regims:
             'module.layer1.0.conv1.weight': [0.6, '3D']
             'module.layer1.1.conv1.weight': [0.6, '3D']
      ```
      
      New API:
      ```
       filter_pruner:
          class: 'L1RankedStructureParameterPruner'
          group_type: Filters
          desired_sparsity: 0.6
          weights: [
            module.layer1.0.conv1.weight,
            module.layer1.1.conv1.weight]
      ```
      
      thresholding.py – separate the generation of the binary_map from the pruning_mask so that we
      can cache the binary map and share it between several modules.
      
      pruning/automated_gradual_pruner.py – major refactoring to support “leader-based”
      sub-graph pruning dependencies.  The concept is explained in issue #85
      
      
      agp-pruning/resnet20_filters.schedule_agp.yaml
      agp-pruning/resnet20_filters.schedule_agp_2.yaml
      agp-pruning/resnet20_filters.schedule_agp_3.yaml
      network_trimming/resnet56_cifar_activation_apoz.yaml
      network_trimming/resnet56_cifar_activation_apoz_v2.yaml
  13. Jul 22, 2018
  14. Jul 21, 2018
    • Mag pruner doc (#33) · 9f0c0832
      Neta Zmora authored
      MagnitudeParameterPruner: document and test
      
      This is in response to a question in issue #19 
  15. Jul 15, 2018
    • Thinning: bug fixes · b48908c3
      Neta Zmora authored
      There are two different “namespaces” referring to module names:
      normalized and de-normalized.
      Normalized module names are module names that have the same
      format for both data-parallel and data-serial models.
      De-normalized module names are the “raw” PyTorch module names
      that reflect the full model graph.  So if there is a container module
      such as nn.DataParallel in the model, then a sub-module’s name
      will have the “module” substring somewhere in it.
      
      SummaryGraph operates by converting the PyTorch model to ONNX, and
      I’ve had issues handling nn.DataParallel in this process.
      Therefore, SummaryGraph uses only normalized names internally.
      
      PruningRecipe, on the other hand, uses de-normalized names
      because it needs to operate on the model itself.
      
      This is a sticky situation that can create really annoying bugs and makes
      for some ugly code.  Nonetheless, this is the best I can do right now,
      and I’ll probably revisit this soon to make it nicer.
      For now, I’m pushing this commit that fixes the distinction between the
      two namespaces, and fixes related bugs – in the hope that it is not too
      brittle.
      
      append_module_directive – now uses denormalize_module_name to
      ensure recipe module names are denormalized.
      
      append_param_directive – because we are dealing with parameters,
      I can’t use denormalize_module_name as easily as in append_module_directive.
      The clean solution is kept for later :-(
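      
      A rough sketch of what the two namespaces look like in practice (illustrative; the real Distiller helpers consult the model rather than just munging strings):
      ```
      def normalize_module_name(name: str) -> str:
          # 'module.layer1.0.conv1' (data-parallel) -> 'layer1.0.conv1' (normalized)
          return '.'.join(part for part in name.split('.') if part != 'module')
      
      def denormalize_module_name(raw_module_names, normalized_name: str) -> str:
          # Map a normalized name back to the raw name used by this specific model.
          matches = [n for n in raw_module_names if normalize_module_name(n) == normalized_name]
          return matches[0] if matches else normalized_name
      ```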
  16. Jul 13, 2018
    • ADC (Automatic Deep Compression) example + features, tests, bug fixes (#28) · 718f777b
      Neta Zmora authored
      This is a merge of the ADC branch and master.
      ADC (using a DDPG RL agent to compress image classifiers) is still WIP and requires
      an unreleased version of Coach (https://github.com/NervanaSystems/coach).
      
      Small features in this commit:
      -Added model_find_module() - find module object given its name
      - Add channel ranking and pruning: pruning/ranked_structures_pruner.py
      - Add a CIFAR10 VGG16 model: models/cifar10/vgg_cifar.py
      - Thinning: change the level of some log messages – some of the messages were
      moved to ‘debug’ level because they are not usually interesting.
      - Add a function to print nicely formatted integers - distiller/utils.py
      - Sensitivity analysis for channels-removal
      - compress_classifier.py – handle keyboard interrupts
      - compress_classifier.py – fix re-raise of exceptions, so they maintain call-stack
      
      -Added tests:
      -- test_summarygraph.py: test_simplenet() - Added a regression test to target a bug that occurs when taking the predecessor of the first node in a graph
      -- test_ranking.py - test_ch_ranking, test_ranked_channel_pruning
      -- test_model_summary.py - test_png_generation, test_summary (sparsity/ compute/model/modules)
      
      - Bug fixes in this commit:
      -- Thinning bug fix: handle zero-sized 'indices' tensor
      During the thinning process, the 'indices' tensor can become zero-sized,
      and will have an undefined length. Therefore, we need to check for this
      situation when assessing the number of elements in 'indices' (see the sketch
      after this list).
      -- Language model: adjust main.py to new distiller.model_summary API
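      
      The guard is roughly along these lines (illustrative; the actual thinning.py change may differ):
      ```
      import torch
      
      def safe_num_indices(indices: torch.Tensor) -> int:
          # len() is not defined for a zero-dim tensor, so count elements explicitly.
          return indices.nelement()
      ```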
  17. Jul 11, 2018
    • Extend pruning tests to parallel models · e3e41ba6
      Neta Zmora authored
    • More robust handling of data-parallel/serial graphs (#27) · b64be690
      Neta Zmora authored
      Remove the complicated logic trying to handle data-parallel models as
      serially-processed models, and vice versa.
      
      *Function distiller.utils.make_non_parallel_copy() does the heavy lifting of
      replacing  all instances of nn.DataParallel in a model with instances of
      DoNothingModuleWrapper.
      The DoNothingModuleWrapper wrapper does nothing but forward to the
      wrapped module.  This is a trick we use to transform a data-parallel model
      to a serial-processed model.
      
      *SummaryGraph uses a copy of the model after the model is processed by
      distiller.make_non_parallel_copy() which renders the model non-data-parallel.
      
      *The same goes for model_performance_summary()
      
      *Model inputs are explicitly placed on the Cuda device, since now all models are
      executed on the CPU.  Previously, if a model was not created using
      nn.DataParallel, then the model was not explicitly placed on the Cuda device.
      
      *The logic in distiller.CompressionScheduler that attempted to load a
      model parallel model and process it serially, or load a serial model and
      process it data-parallel, was removed.  This removes a lot of fuzziness and makes
      the code more robust: we do not needlessly try to be heroes.
      
      * model summaries - remove pytorch 0.4 warning
      
      * create_model: remove redundant .cuda() call
      
      * Tests: support both parallel and serial tests
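      
      A minimal sketch of the wrapper-replacement idea described above (illustrative, not the actual distiller.utils code):
      ```
      import copy
      import torch.nn as nn
      
      class DoNothingModuleWrapper(nn.Module):
          # Stands in for nn.DataParallel and simply forwards to the wrapped module.
          def __init__(self, module):
              super().__init__()
              self.module = module
      
          def forward(self, *args, **kwargs):
              return self.module(*args, **kwargs)
      
      def make_non_parallel_copy_sketch(model: nn.Module) -> nn.Module:
          def replace_data_parallel(container):
              for name, child in container.named_children():
                  if isinstance(child, nn.DataParallel):
                      setattr(container, name, DoNothingModuleWrapper(child.module))
                  else:
                      replace_data_parallel(child)
      
          new_model = copy.deepcopy(model)
          if isinstance(new_model, nn.DataParallel):
              new_model = DoNothingModuleWrapper(new_model.module)
          replace_data_parallel(new_model)
          return new_model
      ```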
  18. Jun 30, 2018
    • Bug fix: add support for thinning the optimizer · b21f449b
      Neta Zmora authored
      You no longer need to use --momentum=0 when removing structures
      dynamically.
      The SGD momentum update (velocity) is dependent on the weights, which
      PyTorch optimizers cache internally.  This caching is not a problem for
      filter/channel removal (thinning) because although we dynamically
      change the shapes of the weights tensors, we don’t change the weights
      tensors themselves.
      PyTorch’s SGD creates tensors to store the momentum updates, and these
      tensors have the same shape as the weights tensors.  When we change the
      weights tensors, we need to make the appropriate changes in the Optimizer,
      or disable the momentum.
      We added a new function - thinning.optimizer_thinning() - to do this.
      This function is brittle as it is tested only on optim.SGD and relies on the
      internal representation of the SGD optimizer, which can change w/o notice.
      For example, optim.Adam uses state['exp_avg'] and state['exp_avg_sq'],
      which also depend on the shape of the weight tensors.
      We needed to pass the Optimizer instance to Thinning policies
      (ChannelRemover, FilterRemover) via the callbacks, which required us
      to change the callback interface.
      In the future we plan a bigger change to the callback API, to allow
      passing of arbitrary context from the training environment to Distiller.
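      
      A rough, SGD-specific sketch of what thinning the optimizer involves (illustrative; the real thinning.optimizer_thinning() handles more details):
      ```
      import torch
      from torch.optim import SGD
      
      def thin_sgd_momentum(optimizer: SGD, param: torch.nn.Parameter,
                            dim: int, keep_indices: torch.Tensor) -> None:
          # The momentum buffer mirrors the weight tensor's shape, so when the weights
          # are thinned along `dim` it must be shrunk with the same (Long) indices.
          state = optimizer.state.get(param, {})
          buf = state.get('momentum_buffer')
          if buf is not None and buf.size(dim) != keep_indices.numel():
              state['momentum_buffer'] = torch.index_select(buf, dim, keep_indices)
      ```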
      
      Also in this commit:
      * compress_classifier.py had special handling for resnet layer-removal, which
      is used in examples/ssl/ssl_4D-removal_training.yaml.
      This is a brittle and ugly hack.  Until we have a more elegant solution, I’m
      removing support for layer-removal.
      * Added to the tests invocation of forward and backward passes over a model.
      This tests more of the real flows, which use the optimizer and construct
      gradient tensors.
      * Added a test of a special case of convolution filter-pruning which occurs
      when the next layer is fully-connected (linear)
  19. Jun 26, 2018
  20. Jun 14, 2018
    • Fixed a couple of bugs and added tests · 1d62f96f
      Neta Zmora authored
      When removing channels and thinning, the number of filters of the
      next layer was not set correctly.
      
      When loading a model that has already been thinned (e.g when loading
      a model, thinning, saving, loading), don’t crash on wrong tensor sizes.
      
      Cache the thinning recipe in the model when loading from checkpoint.
      Without this, a loaded thin model will lose its recipes when saved to
      a checkpoint.
  21. Jun 13, 2018