diff --git a/examples/agp-pruning/README.md b/examples/agp-pruning/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ad2846a115391b7e39cdb385f839e2cc1560e3f7 --- /dev/null +++ b/examples/agp-pruning/README.md @@ -0,0 +1,59 @@
+## Automated Gradual Pruner (AGP) Pruning Examples
+
+### Introduction
+In [To prune, or not to prune: exploring the efficacy of pruning for model compression](https://arxiv.org/abs/1710.01878),
+authors Michael Zhu and Suyog Gupta provide an algorithm for scheduling iterative level pruning.
+
+> We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value (usually 0) to a final sparsity value over a span of n pruning steps.
+The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are
+abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.
+
+The authors describe AGP as follows:
+- Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.
+- It doesn't require much hyper-parameter tuning.
+- It is shown to perform well across different models.
+- It does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.
+
+### Distiller
+* The original AGP paper described the application of AGP to fine-grained pruning; in Distiller we also implement AGP for structured pruning.
+* We also provide examples of applying AGP to pruning language models. The results and
+methodology are discussed at length in the [documentation](https://nervanasystems.github.io/distiller/tutorial-lang_model.html).
+
+### Examples
+
+The tables below provide the results of the experimental pruning schedules that
+appear in this directory. Each example YAML schedule-file contains the command-line
+used to execute the experiment, and further details.
+
+#### Element-wise sparsity
+| Model | Granularity | Sparsity (%) | Top1 | Baseline Top1
+| --- | :--- | ---: | ---: | ---: |
+| AlexNet | Fine | 88.3 | 56.528 | 56.55
+| MobileNet v1 (width=1) | Fine | 51.6 | 68.8 | 68.9
+| ResNeXt-101-32x4d | Fine | 75.0 | 78.66 | 78.19
+| ResNet-18 | Fine | 59.9 | 69.87 | 69.76
+| ResNet-50 | Fine | 26.0 | 76.54 | 76.15
+| ResNet-50 | Fine | 80.0 | 75.99 | 76.15
+| ResNet-50 | Fine | 84.6 | 75.66 | 76.15
+
+#### Block sparsity
+| Model | Granularity | Sparsity (%) | Top1 | Baseline Top1
+| --- | :--- | ---: | ---: | ---: |
+| ResNet-50 | 1x1x8 | 36.7 | 76.36 | 76.15
+
+#### Filter pruning with thinning
+
+Our objective here is to minimize compute by performing thinning. Therefore,
+sparsity is often at 0%, but the number of parameters (and the compute) is reduced as
+entire filters are removed; the short worked example below illustrates why.
+
+In the table that follows, we look for a <b>lower</b> value of `Parameters Kept (%)` and, more importantly,
+`Compute Kept (%)`.
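To make the connection between removed filters and these two columns concrete, here is a small, self-contained sketch (plain Python; it is illustrative only, not Distiller code, and the layer sizes are made up rather than taken from ResNet-50). The key point is that removing filters from one convolution also removes the matching input channels of the next convolution, so both parameter count and MACs drop even though the surviving weights stay dense:

```python
# Illustrative only: removing entire filters shrinks both the parameter count
# and the MACs, even though the surviving weights remain fully dense (0% sparsity).

def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return c_out * c_in * k * k

def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one forward pass of that convolution."""
    return conv_params(c_in, c_out, k) * h_out * w_out

# Two back-to-back 3x3 convolutions on 56x56 feature maps (made-up sizes).
params_before = conv_params(64, 128, 3) + conv_params(128, 128, 3)
macs_before = conv_macs(64, 128, 3, 56, 56) + conv_macs(128, 128, 3, 56, 56)

# Thinning: physically remove 50% of the first layer's filters.  The second
# layer then loses the matching input channels, so it shrinks as well.
params_after = conv_params(64, 64, 3) + conv_params(64, 128, 3)
macs_after = conv_macs(64, 64, 3, 56, 56) + conv_macs(64, 128, 3, 56, 56)

print(f"Parameters kept: {100.0 * params_after / params_before:.1f}%")  # 50.0%
print(f"Compute kept:    {100.0 * macs_after / macs_before:.1f}%")      # 50.0%
```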
+ +| Model | Granularity | Sparsity (%) | Parameters Kept (%) | Compute Kept (%)| Top1 | Baseline Top1 +| --- | :--- | ---: | ---: | ---: | ---: | ---: | +| ResNet-50 | Filters| 0.0 | 43.37 | 44.56 | 74.47 | 76.15 +| ResNet-50 (2) | Filters| 0.0 | 49.69 | 49.82 | 74.78 | 76.15 +| ResNet-50 (3) | Filters| 0.0 | 67.95 | 67.33 | 75.75 | 76.15 +| ResNet-50 (w/ FC) | Filters| 11.6 | 42.74 | 44.56 | 74.56 | 76.15 \ No newline at end of file diff --git a/examples/agp-pruning/prune_resnext101_fine75.yaml b/examples/agp-pruning/prune_resnext101_fine75.yaml index 4a23d1fb83e48c248caf80f28da4f673f58cc337..f135508e1080b2134c28465286e0932437b65821 100755 --- a/examples/agp-pruning/prune_resnext101_fine75.yaml +++ b/examples/agp-pruning/prune_resnext101_fine75.yaml @@ -1,7 +1,7 @@ # python3 ${DISTILLER_PATH}/examples/classifier_compression/compress_classifier.py --arch=resnext101_32x4d --pretrained --compress=${THIS} --epochs=81 --lr 0.01 ${IMAGENET_PATH} --vs=0 # Parameters: -# 2019-03-18 13:15:41,611 - +-----+-------------------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ +# +-----+-------------------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ # | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | # |-----+-------------------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| # | 0 | module.features.4.0.0.0.0.0.weight | (128, 64, 1, 1) | 8192 | 2048 | 0.00000 | 0.00000 | 20.31250 | 75.00000 | 37.50000 | 75.00000 | 0.04950 | -0.00209 | 0.01767 | diff --git a/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml b/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml index 4b35651067a8f09dbb7801100372d0750da73af6..2c206a2371b9782b1fdd973a0b696baf492f8b35 100755 --- a/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml +++ b/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml @@ -2,7 +2,7 @@ # of ResNet50 using the ImageNet dataset. # Top1 is 76.538 (=23.462 error rate) vs the published Top1: 76.15 (https://pytorch.org/docs/stable/torchvision/models.html) # -# I ran this for 80 epochs, but it can probably run for a much shorter time and prodcue the same results (50 epochs?) +# I ran this for 80 epochs, but it can probably run for a much shorter time and produce the same results (50 epochs?) 
#
# time python3 compress_classifier.py -a=resnet50 --pretrained -p=50 ../../../data.imagenet/ -j=22 --epochs=80 --lr=0.001 --compress=resnet50.schedule_agp.yaml
#
diff --git a/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml b/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml index 3489c2654a276db72f5d7de0ac77415d4a402297..5a231c72fb0c022aac0c86c370c445e2ecbb1859 100755 --- a/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml +++ b/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml @@ -31,7 +31,7 @@
# time python3 compress_classifier.py --arch=plain20_cifar ../../../data.cifar -p=50 --lr=0.1 --epochs=180 --batch=128 --compress=../baseline_networks/cifar/plain20_cifar_baseline_training.yaml --gpu=0 -j=1 --deterministic
#
# Results:
-# Top1 = 90.18 - which is 0.3% lower than ower goal.
+# Top1 = 90.18 - which is 0.3% lower than our goal.
# *For better results, with much shorter training, see the explanation after the tables below.
#
# Parameters:
diff --git a/examples/drop_filter/README.md b/examples/drop_filter/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7afc45a2aeec6d9c2ca8a2047a783d5f61a58792 --- /dev/null +++ b/examples/drop_filter/README.md @@ -0,0 +1,71 @@
+## DropFilter
+DropFilter is a regularization method similar to Dropout, which drops entire convolutional
+filters instead of individual neurons.
+However, unlike the original intent of DropFilter (to act as a regularizer and reduce the generalization error
+of the network), here we employ higher rates of filter-dropping (rates are increased over time by following an AGP
+schedule) in order to make the network more robust to filter-pruning. We test this robustness using sensitivity
+analysis.
+
+
+A relevant quote from [3]:
+> To train slimmable neural networks, we begin with a naive approach, where we directly train a
+shared neural network with different width configurations. The training framework is similar to the
+one of our final approach, as shown in Algorithm 1. The training is stable, however, the network
+obtains extremely low top-1 testing accuracy around 0.1% on 1000-class ImageNet classification.
+Error curves of the naive approach are shown in Figure 2. We conjecture the major problem in
+the naive approach is that: for a single channel in a layer, different numbers of input channels in
+previous layer result in different means and variances of the aggregated feature, which are then
+rolling averaged to a shared batch normalization layer. The inconsistency leads to inaccurate batch
+normalization statistics in a layer-by-layer propagating manner. Note that these batch normalization
+statistics (moving averaged means and variances) are only used during testing, in training the means
+and variances of the current mini-batch are used.
+
+### Examples
+
+Dropping filters requires finer control over the scheduling process, since we want to drop a different set
+of filters every `n` training iterations/steps (i.e. mini-batches), whereas we usually make such decisions
+at the epoch boundary.
+
+1. [plain20_cifar_dropfilter_training.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/plain20_cifar_dropfilter_training.yaml)
+
+ In this example we train a Plain20 model with increasing levels of dropped filters,
+ starting at 5% drop and going as high as 50% drop. Our aim is to make the network more robust to filter-pruning.<br>
+ * The network is trained from scratch.
+ * We spend a few epochs just training, so that we start from weights that are somewhat trained.
+ * We use AGP to control the schedule of the slow increase in the percentage of dropped filters.
+ * To choose which filters to drop we use a Bernoulli probability function (see the illustrative sketch after the references).
+
+ | Model | Drop Rate | Top1 | Baseline Top1
+ | --- | :---: | ---: | ---: |
+ | Plain-20 | 5-50% | 89.61 | 90.18
+
+2. [plain20_cifar_dropfilter_training_regularization.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/plain20_cifar_dropfilter_training_regularization.yaml)
+
+ In this example we use DropFilter for regularization.
+ * The network is trained from scratch.
+ * To choose which filters to drop we use a Bernoulli probability function.
+
+ | Model | Drop Rate | Top1 | Baseline Top1
+ | --- | :---: | ---: | ---: |
+ | Plain-20 | 10% | 90.88 | 90.18
+
+3. [resnet20_cifar_randomlevel_training.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/resnet20_cifar_randomlevel_training.yaml)
+
+ In this example we randomly choose a percentage of filters to prune (level), and then use L1-norm ranking to
+ choose which filters to drop.
+
+ | Model | Drop Rate | Top1 | Baseline Top1
+ | --- | :---: | ---: | ---: |
+ | ResNet-20 | 10-20% | 90.80 | 90.78
+
+### References
+[1] Zhengsu Chen, Jianwei Niu, Qi Tian.
+ DropFilter: Dropout for Convolutions.
+ https://arxiv.org/abs/1810.09849
+
+[2] Hengyue Pan, Hui Jiang, Xin Niu, Yong Dou.
+ DropFilter: A Novel Regularization Method for Learning Convolutional Neural Networks.
+ https://arxiv.org/abs/1811.06783
+
+[3] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, Thomas Huang.
+ Slimmable Neural Networks. In ICLR 2019.
+ https://arxiv.org/abs/1812.08928
\ No newline at end of file
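For readers who want a concrete picture of the operation these schedules control, below is a minimal, illustrative sketch of Bernoulli filter-dropping in PyTorch. This is not Distiller's implementation: the `drop_filters` function is made up for illustration, and the inverted-dropout rescaling by `1/keep_prob` is an assumption borrowed from standard Dropout rather than taken from the schedules above.

```python
import torch

def drop_filters(weight: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Illustrative sketch (not Distiller code): zero out entire convolutional
    filters at random, the way Dropout zeroes individual activations.

    `weight` has shape (out_channels, in_channels, kH, kW); each output filter
    is kept with probability 1 - drop_rate.
    """
    keep_prob = 1.0 - drop_rate
    # One Bernoulli draw per filter, broadcast over that filter's weights.
    mask = torch.bernoulli(
        torch.full((weight.shape[0], 1, 1, 1), keep_prob, device=weight.device))
    # Inverted-dropout rescaling (an assumption here) keeps the expected
    # magnitude of the layer's output unchanged.
    return weight * mask / keep_prob

# Example: one training step of a 16-filter 3x3 convolution with a 30% drop rate.
w = torch.randn(16, 8, 3, 3)
w_dropped = drop_filters(w, drop_rate=0.3)
dropped = (w_dropped.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{dropped} of 16 filters dropped this step")
```

In the examples above the drop rate is not fixed: it is ramped (e.g., from 5% to 50%) by an AGP schedule, and a new set of dropped filters is chosen every `n` mini-batches.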