Commit 178c8c49 authored by Neta Zmora

Documentation refactoring

- Moved the Language model and struct pruning tutorials from the Wiki to
the HTML documentation.  Love the ease of Wiki, but GitHub doesn't let
Google crawl these pages, and users can't open PRs on Wiki pages.

- Updated the pruning algorithms documentation
parent ec9a3bf1
Showing 677 additions and 72 deletions
# Weights pruning algorithms
# Weights Pruning Algorithms
<center>![algorithms](imgs/algorithms-pruning.png)</center>
<center>![algorithms](imgs/algorithms.png)</center>
## Magnitude pruner
## Magnitude Pruner
This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor.<br>
Because the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.
......@@ -11,7 +10,7 @@ Because the threshold is applied on individual elements, this pruner belongs to
\right\rbrace \\]
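For illustration, here is a minimal PyTorch-style sketch of this element-wise thresholding (a simplification, not Distiller's actual implementation; the function name is hypothetical):

```
import torch

def magnitude_prune(weights: torch.Tensor, threshold: float) -> torch.Tensor:
    # Zero every element whose absolute value does not exceed the threshold.
    mask = (weights.abs() > threshold).float()
    return weights * mask
```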
## Sensitivity pruner
## Sensitivity Pruner
Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully-connected layers exhibit a Gaussian distribution with a mean value of roughly zero, and avoid using a direct threshold based on the values of each specific tensor.
<br>
The diagram below shows the distributions of the weights tensors of the first convolutional layer and the first fully-connected layer in TorchVision's pre-trained AlexNet model. You can see that they have an approximate Gaussian distribution.<br>
......@@ -36,7 +35,7 @@ In [Learning both Weights and Connections for Efficient Neural Networks](https:/
So the results of running a pruning sensitivity analysis on the tensor give us a good starting guess for \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time honing in on the exact multiplier value.
### Method of operation
### Method of Operation
1. Start by running a pruning sensitivity analysis on the model.
2. Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold, use a sensitivity parameter which is multiplied by the standard deviation of the initial weight-tensor's distribution.
......@@ -70,7 +69,7 @@ policies:
frequency: 2
```
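The thresholding described above boils down to deriving the threshold \\(\lambda\\) from the tensor's standard deviation. A rough sketch of the idea (illustrative only, not Distiller's ```SensitivityPruner``` code):

```
import torch

def sensitivity_prune(weights: torch.Tensor, sensitivity: float) -> torch.Tensor:
    # The threshold is the sensitivity multiplier s times the standard
    # deviation of the (initial) weight tensor's distribution.
    threshold = sensitivity * weights.std().item()
    return weights * (weights.abs() > threshold).float()
```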
## Level pruner
## Level Pruner
Class ```SparsityLevelParameterPruner``` uses a similar method to avoid specifying explicit threshold magnitudes.
Instead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially, this pruner also uses a pruning criterion based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.<br>
......@@ -78,13 +77,17 @@ This pruner is much more stable compared to ```SensitivityPruner``` because the
To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each layer.
### Method of operation
### Method of Operation
1. Sort the weights in the specified layer by their absolute values. <br>
2. Mask to zero the smallest magnitude weights until the desired sparsity level is reached.
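A minimal sketch of these two steps (illustrative only, not the ```SparsityLevelParameterPruner``` implementation):

```
import torch

def prune_to_level(weights: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Number of elements to zero out to reach the target sparsity level.
    num_prune = int(sparsity * weights.numel())
    if num_prune == 0:
        return weights
    # The pruning threshold is the magnitude of the num_prune-th smallest element.
    threshold = weights.abs().flatten().kthvalue(num_prune).values
    return weights * (weights.abs() > threshold).float()
```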
## Automated gradual pruner (AGP)
## Splicing Pruner
In [Dynamic Network Surgery for Efficient DNNs](https://arxiv.org/abs/1608.04493), Guo et al. propose that network pruning and splicing work in tandem. A ```SplicingPruner``` is a pruner that both prunes and splices connections, and it works best with a Dynamic Network Surgery schedule which, for example, configures the ```PruningPolicy``` to mask weights only during the forward pass.
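The prune-and-splice idea can be sketched as a mask update with two thresholds (a simplification of the Dynamic Network Surgery scheme, not Distiller's ```SplicingPruner``` code; the threshold names are hypothetical):

```
import torch

def update_splicing_mask(weights, mask, low_thresh, high_thresh):
    # Prune weights that have become small, splice back (re-enable) weights
    # that have grown large, and leave the rest of the mask unchanged.
    # The dense weights keep receiving gradient updates, so a pruned
    # connection gets a chance to recover.
    new_mask = mask.clone()
    new_mask[weights.abs() < low_thresh] = 0.0
    new_mask[weights.abs() > high_thresh] = 1.0
    return new_mask
```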
## Automated Gradual Pruner (AGP)
In [To prune, or not to prune: exploring the efficacy of pruning for model compression](https://arxiv.org/abs/1710.01878), authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in ```AutomatedGradualPruner```.
<center>![agp formula](imgs/agp_formula.png)</center>
> "We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a final sparsity value \\(s_f\\) over a span of n pruning steps.
......@@ -102,23 +105,47 @@ The authors describe AGP:
- Shown to perform well across different models
- Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.
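The quoted schedule increases sparsity along a cubic curve from \\(s_i\\) to \\(s_f\\). Based on the paper's formula, here is a small sketch of the schedule (argument names are illustrative):

```
def agp_sparsity(step, start_step, delta_t, n_steps, initial_sparsity, final_sparsity):
    # Cubic schedule from Zhu & Gupta: prune aggressively at first, when
    # redundant weights are abundant, and taper off toward the final target.
    span = n_steps * delta_t
    if span == 0:
        return final_sparsity
    t = min(max(step - start_step, 0), span)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - t / span) ** 3
```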
## RNN pruner
## RNN Pruner
The authors of [Exploring Sparsity in Recurrent Neural Networks](https://arxiv.org/abs/1704.05119), Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, "propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network." They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers.
Distiller's ```distiller.pruning.BaiduRNNPruner``` class implements this pruning algorithm.
<center>![Baidu RNN Pruning](imgs/baidu_rnn_pruning.png)</center>
# Structure pruners
# Structure Pruners
Element-wise pruning can create very sparse models which can be compressed to reduce their memory footprint and bandwidth requirements, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners remove entire "structures", such as kernels, filters, and even entire feature-maps.
## Ranked structure pruner
The ```L1RankedStructureParameterPruner``` pruner calculates the magnitude of some "structure", orders all of the structures based on some magnitude function and the *m* lowest ranking structures are pruned away. Currently this pruner only performs ranking of filters (3D structures) and it uses the mean of the absolute value of the tensor as the representative of the filter magnitude. The absolute mean does not depend on the size of the filter, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.
## Structure Ranking Pruners
Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below).
In [Pruning Filters for Efficient ConvNets](https://arxiv.org/abs/1608.08710) the authors use filter ranking, with **one-shot pruning** followed by fine-tuning. The authors of [Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6288897) also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:
> First, after sweeping through the full training set several times the weights become relatively stable — they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)
### L1RankedStructureParameterPruner
The ```L1RankedStructureParameterPruner``` pruner calculates the magnitude of some "structure", orders all of the structures based on some magnitude function, and prunes away the *m* lowest-ranking structures. This pruner ranks structures using the mean of the absolute value of the structure as the representative of the structure's magnitude. The absolute mean does not depend on the size of the structure, so it is easier to use than the \\(L_1\\)-norm of the structure, while remaining a good proxy for it. Basically, you can think of ```mean(abs(t))``` as a normalization of the structure's L1-norm by the structure's length. ```L1RankedStructureParameterPruner``` currently prunes weight filters, channels, and rows (for linear layers).
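For intuition, ranking and pruning filters by ```mean(abs(t))``` might look roughly like this (an illustrative sketch, not the actual ```L1RankedStructureParameterPruner``` code):

```
import torch

def prune_lowest_ranking_filters(weights: torch.Tensor, fraction: float) -> torch.Tensor:
    # weights has shape (num_filters, channels, k, k).
    num_prune = int(fraction * weights.size(0))
    if num_prune == 0:
        return weights
    # mean(abs(.)) per filter: an L1-norm normalized by the filter's length.
    scores = weights.abs().mean(dim=(1, 2, 3))
    _, prune_idx = scores.topk(num_prune, largest=False)
    mask = torch.ones_like(weights)
    mask[prune_idx] = 0.0
    return weights * mask
```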
### ActivationAPoZRankedFilterPruner
The ```ActivationAPoZRankedFilterPruner``` pruner uses the mean APoZ (average percentage of zeros) of activation channels to rank weight filters, and prunes a specified percentage of filters.
This method, called *Network Trimming*, comes from the research paper:
"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures",
Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016
https://arxiv.org/abs/1607.03250
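APoZ itself is simple to compute from a batch of post-activation (e.g. post-ReLU) feature-maps. A sketch of the criterion only (not the pruner's implementation):

```
import torch

def apoz_per_channel(activations: torch.Tensor) -> torch.Tensor:
    # activations: a batch of post-ReLU outputs shaped (N, C, H, W).
    # Returns one APoZ value per channel; channels with a high average
    # percentage of zeros are considered less important.
    return (activations == 0).float().mean(dim=(0, 2, 3))
```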
### GradientRankedFilterPruner
The ```GradientRankedFilterPruner``` tries to assess the importance of weight filters using the product of their gradients and the filter value.
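One way to read this criterion is a per-filter sum of the element-wise product of weights and gradients; an illustrative sketch of that interpretation (not necessarily the exact implementation):

```
import torch

def gradient_filter_scores(weights: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    # weights and grads share the shape (num_filters, channels, k, k).
    # A larger |weight * gradient| mass suggests a more important filter.
    return (weights * grads).abs().sum(dim=(1, 2, 3))
```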
### RandomRankedFilterPruner
For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking. The ```RandomRankedFilterPruner``` pruner can be used for this purpose.
## Automated Gradual Pruner (AGP) for Structures
The idea of having a mathematical formula control the growth of the sparsity level is very useful, and ```StructuredAGP``` extends the implementation to structured pruning.
### Pruner Compositions
Pruners can be combined to create new pruning schemes. Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions. For each of these, we use AGP to decide how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:
- ```L1RankedStructureParameterPruner_AGP```
- ```ActivationAPoZRankedFilterPruner_AGP```
- ```GradientRankedFilterPruner_AGP```
- ```RandomRankedFilterPruner_AGP```
## Activation-influenced pruner
The motivation for this pruner, is that if a feature-map produces very small activations, then this feature-map is not very important, and can be pruned away.
- <b>Status: not implemented</b><br>
## Hybrid Pruning
In a single schedule we can mix different pruning techniques. For example, we might mix pruning and regularization, or structured pruning and element-wise pruning. We can even apply different methods to the same tensor: for example, we might want to perform filter pruning for a few epochs, then perform *thinning*, and continue with element-wise pruning of the smaller network's tensors. We call this technique of mixing different methods Hybrid Pruning, and Distiller has a few example schedules.
docs-src/docs/imgs/algorithms-pruning.png (78.4 KiB)
docs-src/docs/imgs/algorithms.png (79.2 KiB)
# Pruning Filters & Channels
## Introduction
Channel and filter pruning are examples of structured pruning, which creates compressed models that do not require special hardware to execute. This latter fact makes this form of structured pruning particularly interesting and popular.
In networks that have serial data dependencies, it is pretty straightforward to understand and define how to prune channels and filters. However, in more complex models with parallel data dependencies (paths), such as ResNets (skip connections) and GoogLeNet (Inception layers), things become increasingly complex and require a deeper understanding of the data flow in the model in order to define the pruning schedule.
This post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures. The details of the implementation are left for a separate post.
Before we dive into pruning, let’s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature. This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller. I’ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I’ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows.
PyTorch describes [```torch.nn.Conv2d```]( https://pytorch.org/docs/stable/nn.html#conv2d) as applying “a 2D convolution over an input signal composed of several input planes.” We call each of these input planes a **feature-map** (or FM, for short). Another name is **input channel**, as in the R/G/B channels of an image. Some people refer to feature-maps as **activations** (i.e. the activation of neurons), although I think strictly speaking **activations** are the output of an activation layer that was fed a group of feature-maps. Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use **activations** to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps).
In the PyTorch documentation Convolution outputs have shape (N, C<sub>out</sub>, H<sub>out</sub>, W<sub>out</sub>) where N is a batch size, C<sub>out</sub> denotes a number of output channels, H<sub>out</sub> is a height of output planes in pixels, and W<sub>out</sub> is width in pixels. We won’t be paying much attention to the batch-size since it’s not important to our discussion, so without loss of generality we can set N=1. I’m also assuming the most common Convolutions having ```groups==1```.
Convolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity). A **kernel** is a 2D matrix (K, K) that is part of a 3D feature detector. This feature detector is called a **filter** and it is basically a stack of 2D **kernels**. Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C<sub>in</sub> channels in the input, then there are C<sub>in</sub> kernels in a filter (C == C<sub>in</sub>). Each filter is convolved with the entire input to create a single output channel (i.e. feature-map). If there are C<sub>out</sub> output channels, then there are C<sub>out</sub> filters (F == C<sub>out</sub>).
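To make the shapes concrete, here is a quick PyTorch check (an added illustration, not part of the original text):

```
import torch
import torch.nn as nn

# C_in=3 input channels, F=C_out=10 filters, 5x5 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5)
print(conv.weight.shape)       # torch.Size([10, 3, 5, 5]): 10 filters, each a stack of 3 5x5 kernels

x = torch.randn(1, 3, 32, 32)  # N=1, C_in=3, H=32, W=32
print(conv(x).shape)           # torch.Size([1, 10, 28, 28]): one output feature-map per filter
```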
## Filter Pruning
Filter pruning and channel pruning are very similar, and I’ll expand on that similarity later on – but for now let’s focus on filter pruning.
In filter pruning we use some criterion to determine which filters are **important** and which are not. Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples. Regardless of how we choose the filters to prune, let's imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the "*" designates a Convolution operation).
Since we have two fewer filters operating on the input, we must have two fewer output feature-maps. So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its ```out_channels```) and the following Convolution layer (change its ```in_channels```). And finally, because the next layer's input is now smaller (has fewer channels), we should also shrink the next layer's weight tensors by removing the channels corresponding to the filters we pruned. We say that there is a **data-dependency** between the two Convolution layers. I didn't make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input.
There are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won’t discuss here, because they are implementation details.
<center>![Example 1](imgs/pruning_structs_ex1.png)</center>
The scheduler YAML syntax for this example is pasted below. We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning). The Convolution layers are conveniently named ```conv1``` and ```conv2``` in this example.
```
pruners:
example_pruner:
class: L1RankedStructureParameterPruner_AGP
initial_sparsity : 0.10
final_sparsity: 0.50
group_type: Filters
weights: [module.conv1.weight]
```
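To illustrate the data-dependency described above, here is a hypothetical helper that physically removes filters from ```conv1``` and the matching input channels from ```conv2```. This is only a sketch (it ignores stride, padding, and optimizer state), not Distiller's actual code:

```
import torch
import torch.nn as nn

def remove_filters(conv1: nn.Conv2d, conv2: nn.Conv2d, filters_to_keep):
    keep = torch.tensor(filters_to_keep, dtype=torch.long)

    # Shrink conv1: fewer filters means fewer output channels.
    new_conv1 = nn.Conv2d(conv1.in_channels, len(keep), conv1.kernel_size,
                          bias=conv1.bias is not None)
    new_conv1.weight.data = conv1.weight.data[keep].clone()
    if conv1.bias is not None:
        new_conv1.bias.data = conv1.bias.data[keep].clone()

    # Shrink conv2: its input now has fewer channels, so drop the
    # corresponding kernels (dimension 1 of the weight tensor).
    new_conv2 = nn.Conv2d(len(keep), conv2.out_channels, conv2.kernel_size,
                          bias=conv2.bias is not None)
    new_conv2.weight.data = conv2.weight.data[:, keep].clone()
    if conv2.bias is not None:
        new_conv2.bias.data = conv2.bias.data.clone()
    return new_conv1, new_conv2
```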
Now let’s add a Batch Normalization layer between the two convolutions:
<center>![Example 2](imgs/pruning_structs_ex2.png)</center>
The Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift). Because our Convolution produces fewer output FMs, which are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer, and we also need to physically shrink its scale and shift tensors, which are coefficients in the BN input transformation. Moreover, the scale and shift coefficients that we remove from the tensors must correspond to the filters (or output feature-map channels) that we removed from the Convolution weight tensors. This small nuance will prove to be a large pain, but we'll get to that in later examples.
The presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change. Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically.
Let’s look at another example, with non-serial data-dependencies. Here, the output of ```conv1``` is the input for ```conv2``` and ```conv3```. This is an example of parallel data-dependency, since both ```conv2``` and ```conv3``` depend on ```conv1```.
<center>![Example 3](imgs/pruning_structs_ex3.png)</center>
Note that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of ```conv1```. The weight channels of ```conv2``` and ```conv3``` are pruned implicitly by Distiller in a process called “Thinning” (on which I will expand in a different post).
Next, let’s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution.
In this example ```conv3``` is dependent on both ```conv1``` and ```conv2```, and there are two implications to this dependency. The first, and more obvious implication, is that we need to prune the same number of filters from both ```conv1``` and ```conv2```. Since we apply element-wise addition on the outputs of ```conv1``` and ```conv2```, they must have the same shape - and they can only have the same shape if ```conv1``` and ```conv2``` prune the same number of filters. The second implication of this triangular data-dependency is that both ```conv1``` and ```conv2``` must prune the **same** filters! Let’s imagine for a moment, that we ignore this second constraint. The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of ```conv3```? Obviously, we can’t.
<center>![Example 4](imgs/pruning_structs_ex4.png)</center>
We must apply the second constraint, and that means we now need to be proactive: we need to decide whether to prune ```conv1``` and ```conv2``` according to the filter-pruning choices of ```conv1``` or of ```conv2```. The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of ```conv1```.
<center>![Example 5](imgs/pruning_structs_ex5.png)</center>
The YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above. First we need to tell the Filter Pruner that there is a dependency of type **Leader**. This means that all of the tensors listed in the ```weights``` field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed. In the example below ```module.conv1.weight``` and ```module.conv2.weight``` are pruned together according to the pruning choices for ```module.conv1.weight```.
```
pruners:
example_pruner:
class: L1RankedStructureParameterPruner_AGP
initial_sparsity : 0.10
final_sparsity: 0.50
group_type: Filters
group_dependency: Leader
weights: [module.conv1.weight, module.conv2.weight]
```
When we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections. If you don’t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception. The exception does not explain the specification error and this needs to be improved.
## Channel Pruning
Channel pruning is very similar to Filter pruning with all the details of dependencies reversed. Look again at example #1, but this time imagine that we’ve changed our schedule to prune the **channels** of ```module.conv2.weight```.
```
pruners:
example_pruner:
class: L1RankedStructureParameterPruner_AGP
initial_sparsity : 0.10
final_sparsity: 0.50
group_type: Channels
weights: [module.conv2.weight]
```
As the diagram shows, ```conv1``` is now dependent on ```conv2```, and its weight filters will be implicitly pruned according to the channels removed from the weights of ```conv2```.
<center>![Example 1](imgs/pruning_structs_ex1.png)</center>
Geek On.
......@@ -16,8 +16,8 @@ pages:
- Home: index.md
- Installation: install.md
- Usage: usage.md
- Compression scheduling: schedule.md
- Compressing models:
- Compression Scheduling: schedule.md
- Compressing Models:
- Pruning: pruning.md
- Regularization: regularization.md
- Quantization: quantization.md
......@@ -28,5 +28,8 @@ pages:
- Quantization: algo_quantization.md
- Early Exit: algo_earlyexit.md
- Model Zoo: model_zoo.md
- Jupyter notebooks: jupyter.md
- Jupyter Notebooks: jupyter.md
- Design: design.md
- Tutorials:
- Pruning Filters and Channels: tutorial-struct_pruning.md
- Pruning a Language Model: tutorial-lang_model.md