
# Distiller Model Zoo

## How to contribute models to the Model Zoo

We encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. To ensure that models and algorithms shared with others are of high quality, please commit your models with the following:

- Command-line arguments
- Log files
- PyTorch model

## Contents

The Distiller model zoo is not a "traditional" model zoo, because it does not necessarily contain best-in-class compressed models. Instead, it contains a number of deep learning models that have been compressed using Distiller, following some well-known research papers. These are meant to serve as examples of how Distiller can be used.

Each model comes with a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs, and TensorBoard logs.

| Paper | Dataset | Network | Method & Granularity | Schedule | Features |
|-------|---------|---------|----------------------|----------|----------|
| Learning both Weights and Connections for Efficient Neural Networks | ImageNet | Alexnet | Element-wise pruning | Iterative; Manual | Magnitude thresholding based on a sensitivity quantifier.<br>Element-wise sparsity sensitivity analysis |
| To prune, or not to prune: exploring the efficacy of pruning for model compression | ImageNet | MobileNet | Element-wise pruning | Automated gradual; Iterative | Magnitude thresholding based on target level |
| Learning Structured Sparsity in Deep Neural Networks | CIFAR10 | ResNet20 | Group regularization | 1. Train with group-lasso<br>2. Remove zero groups and fine-tune | Group Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) |
| Pruning Filters for Efficient ConvNets | CIFAR10 | ResNet56 | Filter ranking; guided by sensitivity analysis | 1. Rank filters<br>2. Remove filters and channels<br>3. Fine-tune | One-shot ranking and pruning of filters; with network thinning |

## Learning both Weights and Connections for Efficient Neural Networks

This schedule is an example of "Iterative Pruning" for Alexnet/ImageNet, as described in Chapter 3 of Song Han's PhD dissertation, Efficient Methods and Hardware for Deep Learning, and in his paper Learning both Weights and Connections for Efficient Neural Networks.

The Distiller schedule uses SensitivityPruner, which is similar to MagnitudeParameterPruner but, instead of specifying "raw" thresholds, uses a "sensitivity parameter". Song Han's paper says that "the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer's weights," and this is not explained much further. In Distiller, the "quality parameter" is referred to as "sensitivity" and is based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weight tensors are normally distributed, the standard deviation acts as a threshold normalizer.
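As a rough illustration of this thresholding rule, the sketch below builds a pruning mask whose per-layer threshold is the sensitivity multiplied by the standard deviation of the layer's weights. It is a minimal PyTorch sketch of the idea only, not Distiller's SensitivityPruner implementation; the sensitivity value of 0.4 and the randomly initialized layer (shaped like features.module.3.weight above) are arbitrary choices for the example.

```python
import torch

def sensitivity_mask(weights: torch.Tensor, sensitivity: float) -> torch.Tensor:
    """Return a binary mask that keeps weights whose magnitude exceeds
    sensitivity * std(weights).  Illustrative only, not Distiller's SensitivityPruner."""
    threshold = sensitivity * weights.std().item()
    return (weights.abs() > threshold).float()

# A randomly initialized conv layer with the same shape as features.module.3.weight.
conv = torch.nn.Conv2d(64, 192, kernel_size=5)

# 0.4 is an arbitrary illustrative sensitivity, not a recommended setting.
mask = sensitivity_mask(conv.weight.data, sensitivity=0.4)
conv.weight.data.mul_(mask)  # zero out the weights below the threshold, in place

sparsity = 100.0 * (1.0 - mask.mean().item())
print(f"element-wise sparsity: {sparsity:.2f}%")
```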

Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In the dissertation, Song Han describes a threshold that grows at each iteration, which requires n+1 hyper-parameters (n being the number of pruning iterations): the initial threshold plus a threshold increase (delta) for each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, so a fixed threshold "traps" more weights over time. Thus, we can use fewer hyper-parameters and achieve the same results.
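The toy snippet below illustrates why a single, fixed threshold can be enough (an illustration only, not Distiller's pruning loop): the threshold is computed once, and because the surviving weights shrink between pruning steps (here crudely emulated by scaling them down, as a stand-in for fine-tuning pulling weights toward zero), the same threshold prunes more elements at each iteration. The tensor size, the sensitivity of 0.9, and the shrink factor are all arbitrary.

```python
import torch

torch.manual_seed(0)
w = torch.randn(1000)             # stand-in for one layer's weight tensor
threshold = 0.9 * w.std().item()  # set once, before the first pruning step

for it in range(3):
    mask = (w.abs() > threshold).float()
    w = w * mask                  # prune: zero out weights under the threshold
    print(f"iteration {it}: sparsity {100.0 * (1.0 - mask.mean().item()):.1f}%")
    w = w * 0.9                   # crude stand-in for fine-tuning shrinking the weights
```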

### Results

Our reference is TorchVision's pretrained Alexnet model, which has a Top1 accuracy of 56.55 and a Top5 accuracy of 79.09. We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45. Song Han prunes 89% of the parameters, which is slightly better than our result.

Parameters:

```
+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+
|    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean
|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|
|  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |
|  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |
|  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |
|  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |
|  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |
|  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |
|  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |
|  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |
|  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |
+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+
 2018-04-04 21:30:52,499 - Total sparsity: 88.44

 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------
 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)
 2018-04-04 21:31:35,357 - ==> Top1: 51.838    Top5: 74.817    Loss: 2.150

 2018-04-04 21:31:39,251 - --- test ---------------------
 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)
 2018-04-04 21:32:01,274 - ==> Top1: 56.606    Top5: 79.446    Loss: 1.893
```

## To prune, or not to prune: exploring the efficacy of pruning for model compression

In their paper, Zhu and Gupta "compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint." They also "propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning."

This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once per epoch (the model is fine-tuned between pruning events), which is a small deviation from Zhu and Gupta's paper: the paper specifies the schedule in terms of mini-batches, while our implementation specifies it in terms of epochs. We feel that using epochs performs well and is more "stable", since the number of mini-batches changes if you change the batch size.
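For reference, the paper's gradual pruning schedule raises the target sparsity from an initial value to a final value along a cubic ramp over the pruning interval. The sketch below computes that per-epoch target; it illustrates the formula only (it is not the AutomatedGradualPruner API), and the epochs and sparsity levels in the usage example are arbitrary.

```python
def agp_target_sparsity(epoch: int,
                        start_epoch: int,
                        end_epoch: int,
                        initial_sparsity: float,
                        final_sparsity: float) -> float:
    """Cubic sparsity ramp from Zhu & Gupta's gradual pruning schedule,
    expressed per epoch.  Illustrative sketch, not Distiller's API."""
    if epoch <= start_epoch:
        return initial_sparsity
    if epoch >= end_epoch:
        return final_sparsity
    progress = (epoch - start_epoch) / (end_epoch - start_epoch)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Ramp from 5% to 88% sparsity between epochs 0 and 30 (arbitrary example values).
for epoch in (0, 10, 20, 30):
    print(epoch, round(agp_target_sparsity(epoch, 0, 30, 0.05, 0.88), 3))
```

At each pruning event, the pruner raises each tensor's sparsity to this target by masking its smallest-magnitude elements, i.e., the "magnitude thresholding based on target level" noted in the table above.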

ImageNet files: