# Weights pruning algorithms


## Magnitude pruner

This is the most basic pruner: it applies a thresholding function, \(thresh(\cdot)\), to each element, \(w_i\), of a weights tensor. A different threshold can be used for each layer's weights tensor.
Because the threshold is applied to individual elements, this pruner belongs to the family of element-wise pruning algorithms.

\[ thresh(w_i)=\begin{cases} w_i & \text{if } |w_i| \gt \lambda \\ 0 & \text{if } |w_i| \leq \lambda \end{cases} \]
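For concreteness, here is a minimal PyTorch sketch of this element-wise thresholding. The function name and its arguments are illustrative, not Distiller's API:

```python
import torch

def magnitude_threshold(weights: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero every element whose absolute value is <= threshold (element-wise pruning)."""
    mask = weights.abs().gt(threshold)   # True where |w_i| > lambda
    return weights * mask                # elements failing the test become 0

# Example: prune a layer's weights with a per-layer threshold of 0.02
pruned = magnitude_threshold(torch.randn(64, 128), threshold=0.02)
```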

## Sensitivity pruner

Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. To avoid using a direct threshold based on the values of each specific tensor, we can take advantage of the fact that the weights of convolutional and fully-connected layers exhibit a Gaussian distribution with a mean of roughly zero.
The diagram below shows the distributions of the weights tensors of the first convolutional layer and the first fully-connected layer in TorchVision's pre-trained AlexNet model. You can see that they have an approximately Gaussian distribution.

*The weight distributions of AlexNet's conv1 and fc1 layers*

We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\(\sigma\)) of the tensor. Thus, if we set the threshold to \(s*\sigma\), then basically we are thresholding \(s * 68\%\) of the tensor elements.

\[ thresh(w_i)=\begin{cases} w_i & \text{if } |w_i| \gt \lambda \\ 0 & \text{if } |w_i| \leq \lambda \end{cases} \]

\[ \lambda = s * \sigma_l \;\;\; \text{where } \sigma_l \text{ is the std of layer } l \text{ as measured on the dense model} \]
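A minimal PyTorch sketch of this rule, assuming a user-chosen multiplier `s` (the helper name is illustrative, not Distiller's API). For a roughly standard-Normal tensor, `s = 1.0` zeroes about 68% of the elements:

```python
import torch

def sensitivity_threshold(weights: torch.Tensor, s: float) -> torch.Tensor:
    """Prune with lambda = s * std(weights), where std is measured on the dense tensor."""
    lam = s * weights.std().item()       # lambda = s * sigma_l
    return weights * weights.abs().gt(lam)

w = torch.randn(1000, 1000)              # approximately Normal(0, 1)
pruned = sensitivity_threshold(w, s=1.0)
sparsity = (pruned == 0).float().mean()  # roughly 0.68 for s = 1.0
```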

How do we choose this \(s\) multiplier?

In *Learning both Weights and Connections for Efficient Neural Networks*, the authors write:

"We used the sensitivity results to find each layer’s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer’s weights

So the results of running a pruning sensitivity analysis on the tensor give us a good starting guess for \(s\). Sensitivity analysis is an empirical method, and we still have to spend time homing in on the exact multiplier value.

### Method of operation

  1. Start by running a pruning sensitivity analysis on the model.
  2. Then use the results to set and tune the threshold of each layer: instead of using a direct threshold, use a sensitivity parameter which is multiplied by the standard deviation of the initial weight tensor's distribution (see the sketch below).

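A sketch of the per-layer bookkeeping is shown below; the layer names and multiplier values are hypothetical, chosen only to illustrate how sensitivity-analysis results translate into per-layer thresholds:

```python
import torch.nn as nn

# Hypothetical multipliers chosen after inspecting a sensitivity analysis:
# the most sensitive layer (e.g. the first conv layer) gets the smallest s.
layer_multipliers = {"conv1": 0.5, "fc1": 1.5}

def per_layer_thresholds(model: nn.Module, multipliers: dict) -> dict:
    """Compute lambda = s * sigma for each listed layer, from its dense weights."""
    thresholds = {}
    for name, module in model.named_modules():
        if name in multipliers and hasattr(module, "weight"):
            sigma = module.weight.data.std().item()
            thresholds[name] = multipliers[name] * sigma
    return thresholds
```
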
### Schedule

In their paper, Song Han et al. use iterative pruning and change the value of the \(s\) multiplier at each pruning step. Distiller's SensitivityPruner works differently: the threshold is set once, based on a one-time calculation of the standard deviation of the dense tensor (the first time we prune), and the pruner relies on the fact that as the tensor is pruned, more elements are "pulled" toward the center of the distribution and thus more elements get pruned.
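The idea can be illustrated with the sketch below (an illustration only, not Distiller's SensitivityPruner source): the standard deviation is measured on the dense tensor the first time we prune, and the resulting threshold is reused unchanged at every later pruning step:

```python
import torch

class OneShotSigmaPruner:
    """Illustrative sketch: sigma is measured once, on the dense tensor,
    and the resulting threshold is reused at every subsequent pruning step."""

    def __init__(self, s: float):
        self.s = s
        self.lam = None                           # lambda, fixed after the first prune

    def prune(self, weights: torch.Tensor) -> torch.Tensor:
        if self.lam is None:                      # one-time std calculation
            self.lam = self.s * weights.std().item()
        return weights * weights.abs().gt(self.lam)

# As fine-tuning pulls the surviving weights toward the center of the
# distribution, repeated calls with this fixed threshold prune a growing
# fraction of the tensor.
```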