Add README.md files for AGP and DropFilter

## Automated Gradual Pruner (AGP) Pruning Examples
### Introduction
In [To prune, or not to prune: exploring the efficacy of pruning for model compression](https://arxiv.org/abs/1710.01878),
authors Michael Zhu and Suyog Gupta provide an algorithm to schedule iterative level pruning.
> We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value (usually 0) to a final sparsity value over a span of n pruning steps.
The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are
abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.
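
For reference, equation (1) of the paper defines the sparsity at pruning step `t` as a cubic ramp from the initial sparsity `s_i` to the final sparsity `s_f`, starting at training step `t_0` and pruning every `Δt` steps:
```latex
s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\,\Delta t}\right)^{3},
\qquad t \in \{t_0,\; t_0 + \Delta t,\; \ldots,\; t_0 + n\,\Delta t\}
```
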
The authors describe AGP:
> - Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.
> - Doesn't require much hyper-parameter tuning.
> - Shown to perform well across different models.
> - Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.
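
A minimal standalone sketch of this schedule (illustrative parameter values; not Distiller's implementation) makes the behavior concrete: pruning is rapid at first and tapers off as the target sparsity approaches `s_f`:
```python
def agp_sparsity(step, s_initial=0.0, s_final=0.80, t0=0, n=10, dt=1):
    """Target sparsity at training step `step` under the AGP cubic schedule
    (arbitrary illustrative defaults; not Distiller's implementation)."""
    if step < t0:
        return s_initial
    t = min(step, t0 + n * dt)                 # sparsity saturates at s_final
    progress = (t - t0) / float(n * dt)
    return s_final + (s_initial - s_final) * (1.0 - progress) ** 3

# The ramp is steep early and flat late -- most of the pruning happens
# while redundant weights are still abundant:
print([round(agp_sparsity(t), 2) for t in range(11)])
# [0.0, 0.22, 0.39, 0.53, 0.63, 0.7, 0.75, 0.78, 0.79, 0.8, 0.8]
```
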
### Distiller
* The original AGP paper describes the application of AGP to fine-grained pruning; in Distiller we have also implemented AGP for structured pruning.
* We also provide examples of applying AGP to prune language models. The results and
methodology are discussed at length in the [documentation](https://nervanasystems.github.io/distiller/tutorial-lang_model.html).
### Examples
The tables below summarize the results of the experimental pruning schedules that
appear in this directory. Each example YAML schedule file contains the command line
used to execute the experiment, along with further details.
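
For example, the header of one of the ResNet-50 schedules in this directory records the following command line (the data path and worker count are machine-specific):
```
time python3 compress_classifier.py -a=resnet50 --pretrained -p=50 ../../../data.imagenet/ -j=22 --epochs=80 --lr=0.001 --compress=resnet50.schedule_agp.yaml
```
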
#### Element-wise sparsity
| Model | Granularity | Sparsity (%) | Top1 | Baseline Top1
| --- | :--- | ---: | ---: | ---: |
| AlexNet | Fine | 88.3 | 56.528 | 56.55
| MobileNet v1 (width=1)| Fine | 51.6 | 68.8 | 68.9
| ResNeXt-101-32x4d| Fine | 75.0 | 78.66 | 78.19
| ResNet-18 | Fine | 59.9 | 69.87 | 69.76
| ResNet-50 | Fine | 26.0 | 76.54 | 76.15
| ResNet-50 | Fine | 80.0 | 75.99 | 76.15
| ResNet-50 | Fine | 84.6 | 75.66 | 76.15
#### Block sparsity
| Model | Granularity | Sparsity (%) | Top1 | Baseline Top1
| --- | :--- | ---: | ---: | ---: |
| ResNet-50 | 1x1x8 | 36.7 | 76.36 | 76.15
#### Filter pruning with thinning
Our objective here is to minimize compute by performing thinning. Therefore,
element-wise sparsity is often 0%, but the number of parameters is reduced as
filters are removed.
In this table we look for a <b>lower</b> value of `Parameters Kept (%)` and, more importantly, of
`Compute Kept (%)` (see the sketch after the table).
| Model | Granularity | Sparsity (%) | Parameters Kept (%) | Compute Kept (%)| Top1 | Baseline Top1
| --- | :--- | ---: | ---: | ---: | ---: | ---: |
| ResNet-50 | Filters| 0.0 | 43.37 | 44.56 | 74.47 | 76.15
| ResNet-50 (2) | Filters| 0.0 | 49.69 | 49.82 | 74.78 | 76.15
| ResNet-50 (3) | Filters| 0.0 | 67.95 | 67.33 | 75.75 | 76.15
| ResNet-50 (w/ FC) | Filters| 11.6 | 42.74 | 44.56 | 74.56 | 76.15
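
To make the thinning arithmetic concrete, here is a minimal sketch using a hypothetical pair of back-to-back 3x3 convolutions (the layer sizes are made up and are not taken from the schedules above): removing filters from the first layer shrinks both its own weight tensor and the input channels, and hence the compute, of the layer that follows.
```python
def conv_params_and_macs(c_in, c_out, k, h, w):
    """Parameters and multiply-accumulates of a k x k convolution that
    produces a c_out x h x w output (bias and stride ignored for brevity)."""
    params = c_out * c_in * k * k
    macs = params * h * w
    return params, macs

# Hypothetical pair of back-to-back 3x3 convolutions on a 56x56 feature map.
p1, m1 = conv_params_and_macs(c_in=64,  c_out=128, k=3, h=56, w=56)
p2, m2 = conv_params_and_macs(c_in=128, c_out=128, k=3, h=56, w=56)

# Thinning: remove half of the first layer's filters.  Its output channels,
# and therefore the second layer's input channels, drop from 128 to 64,
# while the surviving weights remain dense (element-wise sparsity stays 0%).
q1, n1 = conv_params_and_macs(c_in=64, c_out=64,  k=3, h=56, w=56)
q2, n2 = conv_params_and_macs(c_in=64, c_out=128, k=3, h=56, w=56)

print("Parameters kept: %.1f%%" % (100.0 * (q1 + q2) / (p1 + p2)))  # 50.0%
print("Compute kept:    %.1f%%" % (100.0 * (n1 + n2) / (m1 + m2)))  # 50.0%
# In this two-layer toy both ratios equal the fraction of filters kept; in a
# real network the numbers depend on which layers are thinned (see the table above).
```
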
## DropFilter
DropFilter is a regularization method, similar to Dropout, that drops entire convolutional
filters instead of individual neurons.
However, unlike the original intent of DropFilter - to act as a regularizer and reduce the generalization error
of the network - here we employ higher rates of filter-dropping (the rates are increased over time following an AGP
schedule) in order to make the network more robust to filter-pruning. We test this robustness using sensitivity
analysis.
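
A rough sketch of the mechanism (a standalone PyTorch module, not Distiller's actual implementation; the inverted-dropout rescaling is an assumption of this sketch): on every forward pass a Bernoulli mask is sampled over the output channels and the selected feature maps are zeroed, while the drop rate is kept in a plain attribute so that a schedule can raise it every few mini-batches.
```python
import torch
import torch.nn as nn

class DropFilter(nn.Module):
    """Zero entire output feature maps (filters) with probability `drop_rate`."""
    def __init__(self, drop_rate=0.1):
        super().__init__()
        self.drop_rate = drop_rate   # a scheduler may raise this every n mini-batches

    def forward(self, x):
        if not self.training or self.drop_rate == 0.0:
            return x
        assert 0.0 < self.drop_rate < 1.0
        keep_prob = 1.0 - self.drop_rate
        # x has shape (N, C, H, W): sample one keep/drop decision per channel
        # and broadcast it over the batch and spatial dimensions.
        mask = torch.bernoulli(torch.full((1, x.size(1), 1, 1), keep_prob, device=x.device))
        return x * mask / keep_prob   # rescale so the expected activation is unchanged

# Usage sketch: place it right after a convolution, e.g.
#   block = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), DropFilter(0.05),
#                         nn.BatchNorm2d(32), nn.ReLU())
```
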
A relevant quote from [3]:
> To train slimmable neural networks, we begin with a naive approach, where we directly train a
shared neural network with different width configurations. The training framework is similar to the
one of our final approach, as shown in Algorithm 1. The training is stable, however, the network
obtains extremely low top-1 testing accuracy around 0.1% on 1000-class ImageNet classification.
Error curves of the naive approach are shown in Figure 2. We conjecture the major problem in
the naive approach is that: for a single channel in a layer, different numbers of input channels in
previous layer result in different means and variances of the aggregated feature, which are then
rolling averaged to a shared batch normalization layer. The inconsistency leads to inaccurate batch
normalization statistics in a layer-by-layer propagating manner. Note that these batch normalization
statistics (moving averaged means and variances) are only used during testing, in training the means
and variances of the current mini-batch are used.
### Examples
Dropping filters requires finer control over the scheduling process since we want to drop different sets
of filters every `n` training iterations/steps (i.e. mini-batches), whereas usually we make such decisions
at the epoch boundary.
1. [plain20_cifar_dropfilter_training.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/plain20_cifar_dropfilter_training.yaml)
In this example we train a Plain20 model with increasing levels of dropped filters,
starting at a 5% drop rate and going as high as 50%. Our aim is to make the network more robust to filter-pruning.<br>
* The network is trained from scratch.
* We spend the first few epochs just training, so that the filter-dropping schedule starts from weights that are already somewhat trained.
* We use AGP to control the schedule of the slow increase in the percentage of dropped filters.
* To choose which filters to drop we use a Bernoulli probability function.
| Model | Drop Rate | Top1 | Baseline Top1
| --- | :---: | ---: | ---: |
| Plain-20 | 5-50%| 89.61 | 90.18
2. [plain20_cifar_dropfilter_training_regularization.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/plain20_cifar_dropfilter_training_regularization.yaml)
In this example we use DropFilter for regularization.
* The network is trained from scratch.
* To choose which filters to drop we use a Bernoulli probability function.
| Model | Drop Rate | Top1 | Baseline Top1
| --- | :---: | ---: | ---: |
| Plain-20 | 10%| 90.88 | 90.18
3. [resnet20_cifar_randomlevel_training.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/resnet20_cifar_randomlevel_training.yaml)
In this example we randomly choose a percentage of filters to prune (the pruning level), and then use L1-norm ranking to
choose which filters to drop.
| Model | Drop Rate | Top1 | Baseline Top1
| --- | :---: | ---: | ---: |
| ResNet-20 | 10-20%| 90.80 | 90.78
### References
[1] Zhengsu Chen, Jianwei Niu, Qi Tian.
DropFilter: Dropout for Convolutions.
<br>https://arxiv.org/abs/1810.09849
[2] Hengyue Pan, Hui Jiang, Xin Niu, Yong Dou.
DropFilter: A Novel Regularization Method for Learning Convolutional Neural Networks.
<br>https://arxiv.org/abs/1811.06783
[3] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, Thomas Huang.
Slimmable Neural Networks. ICLR 2019.
<br>https://arxiv.org/abs/1812.08928