diff --git a/.gitignore b/.gitignore index 8f641a310621fb58fd6d9b2cade1474b9f6e6588..69edb25f12fb33f2910fde05a6a4c92bb567ac14 100644 --- a/.gitignore +++ b/.gitignore @@ -9,3 +9,4 @@ env/ .env/ .idea/ logs/ +.DS_Store diff --git a/docs-src/docs/algo_quantization.md b/docs-src/docs/algo_quantization.md index 05f6654ae8eef51d0b89befa673308c7c468b8f8..964d90fd9409a7d29ff847e7b7d211eef0c32bd3 100644 --- a/docs-src/docs/algo_quantization.md +++ b/docs-src/docs/algo_quantization.md @@ -1,5 +1,55 @@ # Quantization Algorithms +The following quantization methods are currently implemented in Distiller: + +## DoReFa + +(As proposed in [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160)) + +In this method, we first define the quantization function \(quantize_k\), which takes a real value \(a_f \in [0, 1]\) and outputs a discrete-valued \(a_q \in \left\{ \frac{0}{2^k-1}, \frac{1}{2^k-1}, ... , \frac{2^k-1}{2^k-1} \right\}\), where \(k\) is the number of bits used for quantization. + +\[a_q = quantize_k(a_f) = \frac{1}{2^k-1} round \left( \left(2^k - 1 \right) a_f \right)\] + +Activations are clipped to the \([0, 1]\) range and then quantized as follows: + +\[x_q = quantize_k(x_f)\] + +For weights, we define the following function \(f\), which takes an unbounded real valued input and outputs a real value in \([0, 1]\): + +\[f(w) = \frac{tanh(w)}{2 max(|tanh(w)|)} + \frac{1}{2} \] + +Now we can use \(quantize_k\) to get quantized weight values, as follows: + +\[w_q = 2 quantize_k \left( f(w_f) \right) - 1\] + +This method requires training the model with quantization, as discussed [here](quantization.md#training-with-quantization). Use the `DorefaQuantizer` class to transform an existing model to a model suitable for training with quantization using DoReFa. + +### Notes: + +- Gradients quantization as proposed in the paper is not supported yet. 
+- The paper defines special handling for binary weights which isn't supported in Distiller yet. + +## WRPN + +(As proposed in [WRPN: Wide Reduced-Precision Networks](https://arxiv.org/abs/1709.01134)) + +In this method, activations are clipped to \([0, 1]\) and quantized as follows (\(k\) is the number of bits used for quantization): + +\[x_q = \frac{1}{2^k-1} round \left( \left(2^k - 1 \right) x_f \right)\] + +Weights are clipped to \([-1, 1]\) and quantized as follows: + +\[w_q = \frac{1}{2^{k-1}-1} round \left( \left(2^{k-1} - 1 \right)w_f \right)\] + +Note that \(k-1\) bits are used to quantize weights, leaving one bit for sign. + +This method requires training the model with quantization, as discussed [here](quantization.md#training-with-quantization). Use the `WRPNQuantizer` class to transform an existing model to a model suitable for training with quantization using WRPN. + +### Notes: + +- The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of `WRPNQuantizer` at the moment. To experiment with this, modify your model implementation to have wider layers. +- The paper defines special handling for binary weights which isn't supported in Distiller yet. + ## Symmetric Linear Quantization In this method, a float value is quantized by multiplying with a numeric constant (the **scale factor**), hence it is **Linear**. We use a signed integer to represent the quantized range, with no quantization bias (or "offset") used. As a result, the floating-point range considered for quantization is **symmetric** with respect to zero. 
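To make the scale-factor idea concrete, here is a minimal pure-Python sketch of symmetric linear quantization. It is an illustration only, not Distiller's implementation: the helper names (`symmetric_scale`, `quantize`, `dequantize`) are hypothetical, and both the choice of scale factor (mapping the largest absolute value to the largest signed integer, \(2^{n-1}-1\)) and the round-half-up tie-breaking are assumptions, since the scale definition is not shown in this hunk.

```python
import math


def symmetric_scale(values, num_bits):
    """Return a scale factor mapping max(|values|) to 2^(num_bits-1) - 1.

    Assumption: this is one common way to pick the scale factor; it is not
    taken from Distiller's code.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 1.0  # all-zero input: any scale works, so pick 1
    return (2 ** (num_bits - 1) - 1) / max_abs


def quantize(values, scale):
    """Multiply by the scale factor and round to the nearest integer.

    Ties are rounded up (an assumption; other tie-breaking rules exist).
    """
    return [math.floor(v * scale + 0.5) for v in values]


def dequantize(q_values, scale):
    """Map the integers back to approximate floating-point values."""
    return [q / scale for q in q_values]


x_f = [-1.0, -0.25, 0.0, 0.5, 1.0]
q_x = symmetric_scale(x_f, num_bits=8)
x_q = quantize(x_f, q_x)
x_hat = dequantize(x_q, q_x)
max_err = max(abs(a - b) for a, b in zip(x_f, x_hat))

print(q_x)      # 127.0
print(x_q)      # [-127, -32, 0, 64, 127]
print(max_err)  # bounded by half a quantization step, 0.5 / q_x
```

Because no offset is used, zero maps exactly to zero and the integer range is symmetric: \([-127, 127]\) for 8 bits in this sketch.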
@@ -10,14 +60,16 @@ Let us denote the original floating-point tensor by \(x_f\), the quantized tenso (The \(round\) operation is round-to-nearest-integer) Let's see how a **convolution** or **fully-connected (FC)** layer is quantized using this method: (we denote input, output, weights and bias with \(x, y, w\) and \(b\) respectively) -\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)}\] -\[y_q = round(q_y y_f) = round(\frac{q_y}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)})\] +\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \left( \sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q \right)\] +\[y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w} \left( \sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q \right) \right) \] Note how the bias has to be re-scaled to match the scale of the summation. ### Implementation + We've implemented **convolution** and **FC** using this method. -- They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. +- They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the `RangeLinearQuantParamLayerWrapper` class. - All other layers are unaffected and are executed using their original FP32 implementation. +- To automatically transform an existing model to a quantized model using this method, use the `SymmetricLinearQuantizer` class. - For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online"). 
- **Important note:** Currently, this method is implemented as **inference only**, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with \(n < 8\) is likely to lead to severe accuracy degradation for any non-trivial workload. \ No newline at end of file diff --git a/docs-src/docs/design.md b/docs-src/docs/design.md index 0fbf94acfb986d10ca610ee0f9e3f0957c8a1e4c..e7a3a7ce06ed0143c8457c72d3992c1f6f5a87c9 100755 --- a/docs-src/docs/design.md +++ b/docs-src/docs/design.md @@ -47,20 +47,36 @@ A quantized model is obtained by replacing existing operations with quantized ve In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided. -We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. The high-level flow is as follows: +We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the `Quantizer` class. `Quantizer` should be sub-classed for each quantization method. -- Define a **mapping** between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. +### Model Transformation + +The high-level flow is as follows: + +- Define a **mapping** between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the `replacement_factory` attribute of the `Quantizer` class. - Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it. 
-- Replace the existing module with the module returned by the function. +- Replace the existing module with the module returned by the function. It is important to note that the **name** of the module **does not** change, as that could break the `forward` function of the parent module. -Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different "strategies" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different **mapping** will likely be defined. +Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different "strategies" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different **mapping** will likely be defined. +Each sub-class of `Quantizer` should populate the `replacement_factory` dictionary attribute with the appropriate mapping. -This mechanism is exposed by the `Quantizer` class: +### Flexible Bit-Widths -- `Quantizer` should be sub-classed for each quantization method. - Each instance of `Quantizer` is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the `bits_activations` and `bits_weights` parameters in `Quantizer`'s constructor. Sub-classes may define bit-widths for other tensor types as needed. - We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. 
However, many models are comprised of building blocks ("container" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern. - So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using their exact name, or a group of layers via a regular expression. This mapping is passed via the `bits_overrides` parameter in the constructor. +### Weights Quantization + +The `Quantizer` class also provides an API to quantize the weights of all layers at once. To use it, the `param_quantization_fn` attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the `Quantizer` class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the `quantize_params` function can be called, which will iterate over all parameters and quantize them using `param_quantization_fn`. + +### Training with Quantization + +The `Quantizer` class supports training with quantization in the loop, as described [here](quantization.md#training-with-quantization). This is enabled by setting `train_with_fp_copy=True` in the `Quantizer` constructor. At model transformation, in each module that has parameters that should be quantized, a new `torch.nn.Parameter` is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module **is not** created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to do this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following "hack": + +1. The existing `torch.nn.Parameter`, e.g. 
`weights`, is replaced by a `torch.nn.Parameter` named `float_weight`. +2. To maintain the existing functionality of the module, we then register a `buffer` in the module with the original name - `weights`. +3. During training, `float_weight` will be passed to `param_quantization_fn` and the result will be stored in `weights`. + The base `Quantizer` class is implemented in `distiller/quantization/quantizer.py`. -For a simple sub-class implementing symmetric linear quantization, see `SymmetricLinearQuantizer` in `distiller/quantization/range_linear.py`. +For a simple sub-class implementing symmetric linear quantization, see `SymmetricLinearQuantizer` in `distiller/quantization/range_linear.py`. For examples of lower-precision methods using training with quantization, see `DorefaQuantizer` and `WRPNQuantizer` in `distiller/quantization/clipped_linear.py`. diff --git a/docs-src/docs/imgs/training_quant_flow.png b/docs-src/docs/imgs/training_quant_flow.png new file mode 100644 index 0000000000000000000000000000000000000000..5c91c1d44075b40aea10926d28320465be02c6b3 Binary files /dev/null and b/docs-src/docs/imgs/training_quant_flow.png differ diff --git a/docs-src/docs/imgs/use-flow.png b/docs-src/docs/imgs/use-flow.png old mode 100755 new mode 100644 index 4b1ff29980eb5a0c5cec1e417d7df35bc1a1f466..de4a3bb64413cdaa89659d78379eaffbbc4ab066 Binary files a/docs-src/docs/imgs/use-flow.png and b/docs-src/docs/imgs/use-flow.png differ diff --git a/docs-src/docs/quantization.md b/docs-src/docs/quantization.md index 738be35fe30a057e4bd92cd28ad7a4c33127900c..366d12cac5eb00059a13b04a364d291d6331e7c9 100644 --- a/docs-src/docs/quantization.md +++ b/docs-src/docs/quantization.md @@ -35,27 +35,39 @@ The result of multiplying two \(n\)-bit integers is, at most, a \(2n\)-bit numbe ## "Conservative" Quantization: INT8 In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may 
not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy ([Gysel et al., 2018](#gysel-et-al-2018)). -As mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor (. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained "online" during inference, or "offline". +As mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained "online" during inference, or "offline". - **Offline** means gathering activations statistics before deploying the model, either during training or by running a few "calibration" batches on the trained FP32 model. Based on these gathered statistics, the scale factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation. - **Online** means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur; however, the added computation resources required to calculate the min/max values at runtime might be prohibitive. -It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. 
These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ([Migacz, 2017](#migacz-2017)) +It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistical outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ([Migacz, 2017](#migacz-2017)). -Another possible optimization point is **scale-factor scope**. The most common way is use a single scale-factor per-layer +Another possible optimization point is **scale-factor scope**. The most common way is to use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels. ## "Aggressive" Quantization: INT4 and Lower Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy: -- **Training / Re-Training**: For INT4 and lower, training is required in order to obtain reasonable accuracy. This means training with quantization of weights and activations "baked" into the training procedure. This is not straight forward, since quantization operations are usually not differentiable. 
This is usually worked-around by using "straight-through estimator" ([Bengio, 2013](#bengio-et-al-2013)) to approximate the gradient of these operations. -[Zhou S et al., 2016](#zhou-et-al-2016) have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods *require* a trained FP32 model, either as a starting point ([Zhou A et al., 2017](#zhou-et-al-2017)), or as a teacher network in a student-teacher training setup ([Mishra and Marr, 2018](#mishra-and-marr-2018)). +- **Training / Re-Training**: For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the [next section](#training-with-quantization). +[Zhou S et al., 2016](#zhou-et-al-2016) have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods *require* a trained FP32 model, either as a starting point ([Zhou A et al., 2017](#zhou-et-al-2017)), or as a teacher network in a knowledge distillation training setup ([Mishra and Marr, 2018](#mishra-and-marr-2018)). - **Replacing the activation function**: The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ([Zhou S et al., 2016](#zhou-et-al-2016), [Mishra et al., 2018](#mishra-et-al-2018)). Another method learns the clipping value per layer, with better results ([Choi et al., 2018](#choi-et-al-2018)). 
Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above). - **Modifying network structure**: [Mishra et al., 2018](#mishra-et-al-2018) try to compensate for the loss of information due to quantization by using wider layers (more channels). [Lin et al., 2017](#lin-et-al-2017) proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different "base", covering a larger dynamic range overall. - **First and last layer**: Many methods do not quantize the first and last layer of the model. It has been observed by [Han et al., 2015](#han-et-al-2015) that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically ([Zhou S et al., 2016](#zhou-et-al-2016), [Choi et al., 2018](#choi-et-al-2018)). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ([Rastegari et al., 2016](#rastegari-et-al-2016)). Most methods keep the first and last layers at FP32. However, [Choi et al., 2018](#choi-et-al-2018) showed that "conservative" quantization of these layers, e.g. to INT8, does not reduce accuracy. - **Iterative quantization**: Most methods quantize the entire model at once. [Zhou A et al., 2017](#zhou-et-al-2017) employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at a time, followed by several epochs of re-training to recover the accuracy loss from quantization. - **Mixed Weights and Activations Precision**: It has been observed that activations are more sensitive to quantization than weights ([Zhou S et al., 2016](#zhou-et-al-2016)). 
Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ([Li et al., 2016](#li-et-al-2016), [Zhu et al., 2016](#zhu-et-al-2016)). +## Training with Quantization + +As mentioned above, in order to minimize the loss of accuracy from "aggressive" quantization, many methods that target INT4 and lower involve training the model in a way that considers the quantization. This means training with quantization of weights and activations "baked" into the training procedure. The training graph usually looks like this: + + +A full precision copy of the weights is maintained throughout the training process ("weights_fp" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference. +In the diagram we show "layer N" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within "layer N" can still run in full precision, with the "quantize" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called "simulated quantization". + +### Straight-Through Estimator +An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. 
An approximation commonly used to overcome this issue is the "straight-through estimator" (STE) ([Hinton et al., 2012](#hinton-et-al-2012), [Bengio, 2013](#bengio-et-al-2013)), which simply passes the gradient through these functions as-is. + ## References <div id="dally-2015"></div> **William Dally**. High-Performance Hardware for Machine Learning. [Tutorial, NIPS, 2015](https://media.nips.cc/Conferences/2015/tutorialslides/Dally-NIPS-Tutorial-2015.pdf) @@ -72,9 +84,6 @@ Naively quantizing a FP32 model to INT4 and lower usually incurs significant acc <div id="migacz-2017"></div> **Szymon Migacz**. 8-bit Inference with TensorRT. [GTC San Jose, 2017](http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf) -<div id="bengio-et-al-2013"></div> -**Yoshua Bengio, Nicholas Leonard and Aaron Courville**. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. [arxiv:1308.3432, 2013](https://arxiv.org/abs/1308.3432) - <div id="zhou-et-al-2016"></div> **Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou**. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [arxiv:1606.06160](https://arxiv.org/abs/1606.06160) @@ -101,3 +110,9 @@ Naively quantizing a FP32 model to INT4 and lower usually incurs significant acc <div id="zhu-et-al-2016"></div> **Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally**. Trained Ternary Quantization. [arxiv:1612.01064](https://arxiv.org/abs/1612.01064) + +<div id="bengio-et-al-2013"></div> +**Yoshua Bengio, Nicholas Leonard and Aaron Courville**. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. [arxiv:1308.3432, 2013](https://arxiv.org/abs/1308.3432) + +<div id="hinton-et-al-2012"></div> +**Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed**. Neural Networks for Machine Learning. 
[Coursera, video lectures, 2012](https://www.coursera.org/learn/neural-networks) diff --git a/docs-src/docs/schedule.md b/docs-src/docs/schedule.md index 18e97cb7470283117749dc13afafedcdea6e7a43..008927798ba2abfc6d448cebffebe7026822b714 100755 --- a/docs-src/docs/schedule.md +++ b/docs-src/docs/schedule.md @@ -1,17 +1,16 @@ # Compression scheduler -In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of ```CompressionScheduler```: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and (later) quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. +In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of ```CompressionScheduler```: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. 
Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. ## High level overview -Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, LR-scheduler and Policies. +Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies. - - Pruners and Regularizers are very similar: they implement either a Pruning algorithm or a Regularization algorithm. + - Pruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. - An LR-scheduler specifies the LR-decay algorithm. These define the **what** part of the schedule. -The Policies define the **when** part of the schedule: at which epoch to start applying the Pruner/Regularizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/LR-decay it is managing. -<br> -The CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners and Regularizers from code. +The Policies define the **when** part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing. +The CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code. 
## Syntax through example We'll use ```alexnet.schedule_agp.yaml``` to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet. @@ -53,9 +52,9 @@ There is only one version of the YAML syntax, and the version number is not veri ``` version: 1 ``` -In the ```pruners``` section, we define the instances of pruners we want the scheduler to instantiate and use.<br> -We define a single pruner instance, named ```my_pruner``` of algorithm ```SensitivityPruner```. We will refer to this instance in the ```Policies``` section.<br> -Then we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.<br> +In the ```pruners``` section, we define the instances of pruners we want the scheduler to instantiate and use. +We define a single pruner instance, named ```my_pruner```, of algorithm ```SensitivityPruner```. We will refer to this instance in the ```Policies``` section. +Then we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. You may list as many Pruners as you want in this section, as long as each has a unique name. You can use several types of pruners in one schedule. ``` @@ -73,7 +72,7 @@ pruners: 'classifier.6.weight': 0.6 ``` -Next, we want to specify the learning-rate decay scheduling in the ```lr_schedulers``` section. We assign a name to this instance: ```pruning_lr```. As in the ```pruners``` section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. You can use any LR-scheduler class that ```torch.optim.lr_scheduler``` supports and pass their arguments. The keyword arguments (kwargs) are passed directly to the constructor of the subclasses of [_LRScheduler](http://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html), so that as new LR-schedulers are added to ```torch.optim.lr_scheduler```, they can be used without changing the application code. 
+Next, we want to specify the learning-rate decay scheduling in the ```lr_schedulers``` section. We assign a name to this instance: ```pruning_lr```. As in the ```pruners``` section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. The LR-scheduler must be a subclass of PyTorch's [\_LRScheduler](http://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html). You can use any of the schedulers defined in ```torch.optim.lr_scheduler``` (see [here](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)). In addition, we've implemented some additional schedulers in Distiller (see [here](https://github.com/NervanaSystems/distiller/blob/master/distiller/learning_rate.py)). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to ```torch.optim.lr_scheduler```, they can be used without changing the application code. ``` lr_schedulers: @@ -82,7 +81,7 @@ lr_schedulers: gamma: 0.9 ``` -Finally, we define the ```policies``` section which defines the actual scheduling. A ```Policy``` manages an instance of a ```Pruner```, ```Regularizer```, or ```LRSchedule```, by naming the instance. In the example below, a ```PruningPolicy``` uses the pruner instance named ```my_pruner```: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. +Finally, we define the ```policies``` section which defines the actual scheduling. A ```Policy``` manages an instance of a ```Pruner```, ```Regularizer```, `Quantizer`, or ```LRScheduler```, by naming the instance. In the example below, a ```PruningPolicy``` uses the pruner instance named ```my_pruner```: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. 
``` policies: - pruner: @@ -238,3 +237,43 @@ policies: frequency: 1 ``` + +## Quantization + +Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the `Quantizer` class (see details [here](design.md#quantization)). +Let's see an example: + +``` +quantizers: + dorefa_quantizer: + class: DorefaQuantizer + bits_activations: 8 + bits_weights: 4 + bits_overrides: + conv1: + wts: null + acts: null + relu1: + wts: null + acts: null + final_relu: + wts: null + acts: null + fc: + wts: null + acts: null +``` + +- The specific quantization method we're instantiating here is `DorefaQuantizer`. +- Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. +- Then, we define the `bits_overrides` mapping. In this case, we choose not to quantize the first and last layer of the model. In the case of `DorefaQuantizer`, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters `conv1`, the first activation layer `relu1`, the last activation layer `final_relu` and the last layer with parameters `fc`. +- Specifying `null` means "do not quantize". +- Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers. +- We can also reference **groups of layers** in the `bits_overrides` mapping. This is done using regular expressions. Suppose we have a sub-module in our model named `block1`, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named `conv1`, `conv2` and so on. 
In that case we would define the following: + +``` +bits_overrides: + block1.conv*: + wts: 2 + acts: null +``` diff --git a/docs-src/docs/usage.md b/docs-src/docs/usage.md index 507fc6ffd884533420ab25f5d0105dc024281878..b659472befcc523d7193229257ed7e73407eee14 100755 --- a/docs-src/docs/usage.md +++ b/docs-src/docs/usage.md @@ -107,14 +107,20 @@ For more details on the example schedules, you can refer to the coverage of the - Filter-wise pruning sensitivity-analysis: - ResNet20 (CIFAR10) - ResNet56 (CIFAR10) +<br><br> * **examples/sensitivity-pruning**: - AlexNet sensitivity pruning with Iterative Pruning - AlexNet sensitivity pruning with One-Shot Pruning - +<br><br> * **examples/ssl**: - ResNet20 baseline training (CIFAR10 dataset) - Structured Sparsity Learning (SSL) with layer removal on ResNet20 - SSL with channels removal on ResNet20 +<br><br> +* **examples/quantization**: + - AlexNet w. Batch-Norm (base FP32 + DoReFa) + - Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa) + - Pre-activation ResNet18 on ImageNet (base FP32 + DoReFa) ## Experiment reproducibility @@ -135,8 +141,8 @@ The ```sense``` command-line argument can be set to either ```element``` or ```f There is also a [Jupyter notebook](http://localhost:8888/notebooks/sensitivity_analysis.ipynb) with example invocations, outputs and explanations. -## Quantization -Currently Distiller support 8-bit quantization only (quantization of lower precision data types will follow shortly) which does not require training, so any model (whether pruned or not) can be quantized.<br> +## "Direct" Quantization Without Training +Distiller supports 8-bit quantization of trained modules without re-training (using [Symmetric Linear Quantization](algo_quantization.md#symmetric-linear-quantization)). So, any model (whether pruned or not) can be quantized. Use the ```--quantize``` command-line flag, together with ```--evaluate``` to evaluate the accuracy of your model after quantization.
The following example quantizes ResNet18 for ImageNet: ``` $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize --evaluate diff --git a/docs/algo_quantization/index.html b/docs/algo_quantization/index.html index b14f5a1b9d5ff79c0aef3d24004ab7942de8a6fc..238d0537f965df6389563c1fb7ca93660140cb45 100644 --- a/docs/algo_quantization/index.html +++ b/docs/algo_quantization/index.html @@ -104,6 +104,10 @@ <ul> + <li><a class="toctree-l4" href="#dorefa">DoReFa</a></li> + + <li><a class="toctree-l4" href="#wrpn">WRPN</a></li> + <li><a class="toctree-l4" href="#symmetric-linear-quantization">Symmetric Linear Quantization</a></li> </ul> @@ -166,6 +170,48 @@ <div class="section"> <h1 id="quantization-algorithms">Quantization Algorithms</h1> +<p>The following quantization methods are currently implemented in Distiller:</p> +<h2 id="dorefa">DoReFa</h2> +<p>(As proposed in <a href="https://arxiv.org/abs/1606.06160">DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients</a>) </p> +<p>In this method, we first define the quantization function <script type="math/tex">quantize_k</script>, which takes a real value <script type="math/tex">a_f \in [0, 1]</script> and outputs a discrete-valued <script type="math/tex">a_q \in \left\{ \frac{0}{2^k-1}, \frac{1}{2^k-1}, ...
, \frac{2^k-1}{2^k-1} \right\}</script>, where <script type="math/tex">k</script> is the number of bits used for quantization.</p> +<p> +<script type="math/tex; mode=display">a_q = quantize_k(a_f) = \frac{1}{2^k-1} round \left( \left(2^k - 1 \right) a_f \right)</script> +</p> +<p>Activations are clipped to the <script type="math/tex">[0, 1]</script> range and then quantized as follows:</p> +<p> +<script type="math/tex; mode=display">x_q = quantize_k(x_f)</script> +</p> +<p>For weights, we define the following function <script type="math/tex">f</script>, which takes an unbounded real valued input and outputs a real value in <script type="math/tex">[0, 1]</script>:</p> +<p> +<script type="math/tex; mode=display">f(w) = \frac{tanh(w)}{2 max(|tanh(w)|)} + \frac{1}{2} </script> +</p> +<p>Now we can use <script type="math/tex">quantize_k</script> to get quantized weight values, as follows:</p> +<p> +<script type="math/tex; mode=display">w_q = 2 quantize_k \left( f(w_f) \right) - 1</script> +</p> +<p>This method requires training the model with quantization, as discussed <a href="../quantization/index.html#training-with-quantization">here</a>. 
Use the <code>DorefaQuantizer</code> class to transform an existing model to a model suitable for training with quantization using DoReFa.</p> +<h3 id="notes">Notes:</h3> +<ul> +<li>Gradients quantization as proposed in the paper is not supported yet.</li> +<li>The paper defines special handling for binary weights which isn't supported in Distiller yet.</li> +</ul> +<h2 id="wrpn">WRPN</h2> +<p>(As proposed in <a href="https://arxiv.org/abs/1709.01134">WRPN: Wide Reduced-Precision Networks</a>) </p> +<p>In this method, activations are clipped to <script type="math/tex">[0, 1]</script> and quantized as follows (<script type="math/tex">k</script> is the number of bits used for quantization):</p> +<p> +<script type="math/tex; mode=display">x_q = \frac{1}{2^k-1} round \left( \left(2^k - 1 \right) x_f \right)</script> +</p> +<p>Weights are clipped to <script type="math/tex">[-1, 1]</script> and quantized as follows:</p> +<p> +<script type="math/tex; mode=display">w_q = \frac{1}{2^{k-1}-1} round \left( \left(2^{k-1} - 1 \right)w_f \right)</script> +</p> +<p>Note that <script type="math/tex">k-1</script> bits are used to quantize weights, leaving one bit for sign.</p> +<p>This method requires training the model with quantization, as discussed <a href="../quantization/#training-with-quantization">here</a>. Use the <code>WRPNQuantizer</code> class to transform an existing model to a model suitable for training with quantization using WRPN.</p> +<h3 id="notes_1">Notes:</h3> +<ul> +<li>The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of <code>WRPNQuantizer</code> at the moment. 
To experiment with this, modify your model implementation to have wider layers.</li> +<li>The paper defines special handling for binary weights which isn't supported in Distiller yet.</li> +</ul> <h2 id="symmetric-linear-quantization">Symmetric Linear Quantization</h2> <p>In this method, a float value is quantized by multiplying with a numeric constant (the <strong>scale factor</strong>), hence it is <strong>Linear</strong>. We use a signed integer to represent the quantized range, with no quantization bias (or "offset") used. As a result, the floating-point range considered for quantization is <strong>symmetric</strong> with respect to zero.<br /> In the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).<br /> @@ -174,14 +220,15 @@ Let us denote the original floating-point tensor by <script type="math/tex">x_f< <script type="math/tex; mode=display">x_q = round(q_x x_f)</script> (The <script type="math/tex">round</script> operation is round-to-nearest-integer) </p> <p>Let's see how a <strong>convolution</strong> or <strong>fully-connected (FC)</strong> layer is quantized using this method: (we denote input, output, weights and bias with <script type="math/tex">x, y, w</script> and <script type="math/tex">b</script> respectively) -<script type="math/tex; mode=display">y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)}</script> -<script type="math/tex; mode=display">y_q = round(q_y y_f) = round(\frac{q_y}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)})</script> +<script type="math/tex; mode=display">y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \sum{ \left( x_q w_q + \frac{q_b}{q_x q_w}b_q \right) }</script> +<script type="math/tex; mode=display">y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w} \sum{ 
\left( x_q w_q + \frac{q_b}{q_x q_w}b_q \right) } \right) </script> Note how the bias has to be re-scaled to match the scale of the summation.</p> <h3 id="implementation">Implementation</h3> <p>We've implemented <strong>convolution</strong> and <strong>FC</strong> using this method. </p> <ul> -<li>They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. </li> +<li>They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the <code>RangeLinearQuantParamLayerWrapper</code> class. </li> <li>All other layers are unaffected and are executed using their original FP32 implementation. </li> +<li>To automatically transform an existing model to a quantized model using this method, use the <code>SymmetricLinearQuantizer</code> class.</li> <li>For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online"). </li> <li><strong>Important note:</strong> Currently, this method is implemented as <strong>inference only</strong>, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. 
As such, using it with <script type="math/tex">n < 8</script> is likely to lead to severe accuracy degradation for any non-trivial workload.</li> </ul> diff --git a/docs/design/index.html b/docs/design/index.html index aeff42b2de56dda01fe3e532deb74072b04525e2..bf89613a3fb16e6f3f9fb25c29dde092b4b63746 100644 --- a/docs/design/index.html +++ b/docs/design/index.html @@ -208,22 +208,33 @@ train(): <h2 id="quantization">Quantization</h2> <p>A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.</p> <p>In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided.</p> -<p>We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. The high-level flow is as follows:</p> +<p>We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the <code>Quantizer</code> class. <code>Quantizer</code> should be sub-classed for each quantization method.</p> +<h3 id="model-transformation">Model Transformation</h3> +<p>The high-level flow is as follows:</p> <ul> -<li>Define a <strong>mapping</strong> between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module.</li> +<li>Define a <strong>mapping</strong> between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. 
The mapping is defined in the <code>replacement_factory</code> attribute of the <code>Quantizer</code> class.</li> <li>Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.</li> -<li>Replace the existing module with the module returned by the function.</li> +<li>Replace the existing module with the module returned by the function. It is important to note that the <strong>name</strong> of the module <strong>does not</strong> change, as that could break the <code>forward</code> function of the parent module.</li> </ul> -<p>Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different "strategies" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different <strong>mapping</strong> will likely be defined.</p> -<p>This mechanism is exposed by the <code>Quantizer</code> class:</p> +<p>Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different "strategies" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different <strong>mapping</strong> will likely be defined.<br /> +Each sub-class of <code>Quantizer</code> should populate the <code>replacement_factory</code> dictionary attribute with the appropriate mapping.</p> +<h3 id="flexible-bit-widths">Flexible Bit-Widths</h3> <ul> -<li><code>Quantizer</code> should be sub-classed for each quantization method.</li> <li>Each instance of <code>Quantizer</code> is parameterized by the number of bits to be used for quantization of different tensor types. 
The default ones are activations and weights. These are the <code>bits_activations</code> and <code>bits_weights</code> parameters in <code>Quantizer</code>'s constructor. Sub-classes may define bit-widths for other tensor types as needed.</li> <li>We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks ("container" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.</li> <li>So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using their exact name, or a group of layers via a regular expression. This mapping is passed via the <code>bits_overrides</code> parameter in the constructor.</li> </ul> +<h3 id="weights-quantization">Weights Quantization</h3> +<p>The <code>Quantizer</code> class also provides an API to quantize the weights of all layers at once. To use it, the <code>param_quantization_fn</code> attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the <code>Quantizer</code> class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the <code>quantize_params</code> function can be called, which will iterate over all parameters and quantize them using <code>param_quantization_fn</code>.</p> +<h3 id="training-with-quantization">Training with Quantization</h3> +<p>The <code>Quantizer</code> class supports training with quantization in the loop, as described <a href="../quantization/index.html#training-with-quantization">here</a>.
This is enabled by setting <code>train_with_fp_copy=True</code> in the <code>Quantizer</code> constructor. At model transformation, in each module that has parameters that should be quantized, a new <code>torch.nn.Parameter</code> is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module <strong>is not</strong> created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to do this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following "hack":</p> +<ol> +<li>The existing <code>torch.nn.Parameter</code>, e.g. <code>weight</code>, is replaced by a <code>torch.nn.Parameter</code> named <code>float_weight</code>.</li> +<li>To maintain the existing functionality of the module, we then register a <code>buffer</code> in the module with the original name - <code>weight</code>.</li> +<li>During training, <code>float_weight</code> will be passed to <code>param_quantization_fn</code> and the result will be stored in <code>weight</code>.</li> +</ol> <p>The base <code>Quantizer</code> class is implemented in <code>distiller/quantization/quantizer.py</code>.<br /> -For a simple sub-class implementing symmetric linear quantization, see <code>SymmetricLinearQuantizer</code> in <code>distiller/quantization/range_linear.py</code>.</p> +For a simple sub-class implementing symmetric linear quantization, see <code>SymmetricLinearQuantizer</code> in <code>distiller/quantization/range_linear.py</code>.
For examples of lower-precision methods using training with quantization see <code>DorefaQuantizer</code> and <code>WRPNQuantizer</code> in <code>distiller/quantization/clipped_linear.py</code></p> </div> </div> diff --git a/docs/imgs/baidu_rnn_pruning.png b/docs/imgs/baidu_rnn_pruning.png new file mode 100644 index 0000000000000000000000000000000000000000..ab4a3960a54354b7335967c4b39709a7f57cdea5 Binary files /dev/null and b/docs/imgs/baidu_rnn_pruning.png differ diff --git a/docs/imgs/training_quant_flow.png b/docs/imgs/training_quant_flow.png new file mode 100644 index 0000000000000000000000000000000000000000..5c91c1d44075b40aea10926d28320465be02c6b3 Binary files /dev/null and b/docs/imgs/training_quant_flow.png differ diff --git a/docs/imgs/use-flow.png b/docs/imgs/use-flow.png index 4b1ff29980eb5a0c5cec1e417d7df35bc1a1f466..de4a3bb64413cdaa89659d78379eaffbbc4ab066 100644 Binary files a/docs/imgs/use-flow.png and b/docs/imgs/use-flow.png differ diff --git a/docs/index.html b/docs/index.html index 1a4036280f7ae675a033b15213926ecbea2009e0..1f3be6c7a87b4acfe65617d4d4a92666a75aae7c 100644 --- a/docs/index.html +++ b/docs/index.html @@ -246,5 +246,5 @@ And of course, if we used a sparse or compressed representation, then we are red <!-- MkDocs version : 0.17.2 -Build Date UTC : 2018-06-14 11:48:24 +Build Date UTC : 2018-06-21 23:06:46 --> diff --git a/docs/quantization/index.html b/docs/quantization/index.html index b2b475d920047fce1e512cf42f1f1505b2d15e98..472633992779a549e1db2400767d8d8acc77167c 100644 --- a/docs/quantization/index.html +++ b/docs/quantization/index.html @@ -97,6 +97,8 @@ <li><a class="toctree-l4" href="#aggressive-quantization-int4-and-lower">"Aggressive" Quantization: INT4 and Lower</a></li> + <li><a class="toctree-l4" href="#training-with-quantization">Training with Quantization</a></li> + <li><a class="toctree-l4" href="#references">References</a></li> </ul> @@ -214,24 +216,31 @@ Note that this scale factor is, in most cases, a floating-point 
number. Hence, e The result of multiplying two <script type="math/tex">n</script>-bit integers is, at most, a <script type="math/tex">2n</script>-bit number. In convolution layers, such multiplications are accumulated <script type="math/tex">c\cdot k^2</script> times, where <script type="math/tex">c</script> is the number of input channels and <script type="math/tex">k</script> is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be <script type="math/tex">2n + M</script>-bits wide, where M is at least <script type="math/tex">log_2(c\cdot k^2)</script>. In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.</p> <h2 id="conservative-quantization-int8">"Conservative" Quantization: INT8</h2> <p>In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy (<a href="#gysel-et-al-2018">Gysel et al., 2018</a>).<br /> -As mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor (. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained "online" during inference, or "offline".</p> +As mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format.
For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained "online" during inference, or "offline".</p> <ul> <li><strong>Offline</strong> means gathering activations statistics before deploying the model, either during training or by running a few "calibration" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.</li> <li><strong>Online</strong> means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.</li> </ul> -<p>It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible (<a href="#migacz-2017">Migacz, 2017</a>) </p> -<p>Another possible optimization point is <strong>scale-factor scope</strong>. The most common way is use a single scale-factor per-layer</p> +<p>It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. 
These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible (<a href="#migacz-2017">Migacz, 2017</a>). </p> +<p>Another possible optimization point is <strong>scale-factor scope</strong>. The most common way is to use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels.</p> <h2 id="aggressive-quantization-int4-and-lower">"Aggressive" Quantization: INT4 and Lower</h2> <p>Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:</p> <ul> -<li><strong>Training / Re-Training</strong>: For INT4 and lower, training is required in order to obtain reasonable accuracy. This means training with quantization of weights and activations "baked" into the training procedure. This is not straight forward, since quantization operations are usually not differentiable. This is usually worked-around by using "straight-through estimator" (<a href="#bengio-et-al-2013">Bengio, 2013</a>) to approximate the gradient of these operations.<br /> -<a href="#zhou-et-al-2016">Zhou S et al., 2016</a> have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch.
Other methods <em>require</em> a trained FP32 model, either as a starting point (<a href="#zhou-et-al-2017">Zhou A et al., 2017</a>), or as a teacher network in a student-teacher training setup (<a href="#mishra-and-marr-2018">Mishra and Marr, 2018</a>).</li> +<li><strong>Training / Re-Training</strong>: For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the <a href="#training-with-quantization">next section</a>.<br /> +<a href="#zhou-et-al-2016">Zhou S et al., 2016</a> have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods <em>require</em> a trained FP32 model, either as a starting point (<a href="#zhou-et-al-2017">Zhou A et al., 2017</a>), or as a teacher network in a knowledge distillation training setup (<a href="#mishra-and-marr-2018">Mishra and Marr, 2018</a>).</li> <li><strong>Replacing the activation function</strong>: The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used (<a href="#zhou-et-al-2016">Zhou S et al., 2016</a>, <a href="#mishra-et-al-2018">Mishra et al., 2018</a>). Another method learns the clipping value per layer, with better results (<a href="#choi-et-al-2018">Choi et al., 2018</a>). 
Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).</li> <li><strong>Modifying network structure</strong>: <a href="#mishra-et-al-2018">Mishra et al., 2018</a> try to compensate for the loss of information due to quantization by using wider layers (more channels). <a href="#lin-et-al-2017">Lin et al., 2017</a> proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different "base", covering a larger dynamic range overall.</li> <li><strong>First and last layer</strong>: Many methods do not quantize the first and last layer of the model. It has been observed by <a href="#han-et-al-2015">Han et al., 2015</a> that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically (<a href="#zhou-et-al-2016">Zhou S et al., 2016</a>, <a href="#choi-et-al-2018">Choi et al., 2018</a>). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them (<a href="#rastegari-et-al-2016">Rastegari et al., 2016</a>). Most methods keep the first and last layers at FP32. However, <a href="#choi-et-al-2018">Choi et al., 2018</a> showed that "conservative" quantization of these layers, e.g. to INT8, does not reduce accuracy.</li> <li><strong>Iterative quantization</strong>: Most methods quantize the entire model at once. 
<a href="#zhou-et-al-2017">Zhou A et al., 2017</a> employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.</li> <li><strong>Mixed Weights and Activations Precision</strong>: It has been observed that activations are more sensitive to quantization than weights (<a href="#zhou-et-al-2016">Zhou S et al., 2016</a>). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 (<a href="#li-et-al-2016">Li et al., 2016</a>, <a href="#zhu-et-al-2016">Zhu et al., 2016</a>).</li> </ul> +<h2 id="training-with-quantization">Training with Quantization</h2> +<p>As mentioned above, in order to minimize the loss of accuracy from "aggressive" quantization, many methods that target INT4 and lower involve training the model in a way that considers the quantization. This means training with quantization of weights and activations "baked" into the training procedure. The training graph usually looks like this:</p> +<p><img alt="Training with Quantization" src="../imgs/training_quant_flow.png" /></p> +<p>A full precision copy of the weights is maintained throughout the training process ("weights_fp" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference.<br /> +In the diagram we show "layer N" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. 
During training, the operations within "layer N" can still run in full precision, with the "quantize" operations at the boundaries ensuring discrete-valued weights and activations. This is sometimes called "simulated quantization". </p> +<h3 id="straight-through-estimator">Straight-Through Estimator</h3> +<p>An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the "straight-through estimator" (STE) (<a href="#hinton-et-al-2012">Hinton et al., 2012</a>, <a href="#bengio-et-al-2013">Bengio, 2013</a>), which simply passes the gradient through these functions as-is. </p> <h2 id="references">References</h2> <p><div id="dally-2015"></div> <strong>William Dally</strong>. High-Performance Hardware for Machine Learning. <a href="https://media.nips.cc/Conferences/2015/tutorialslides/Dally-NIPS-Tutorial-2015.pdf">Tutorial, NIPS, 2015</a></p> @@ -247,9 +256,6 @@ As mentioned above, a scale factor is used to adapt the dynamic range of the ten <div id="migacz-2017"></div> <p><strong>Szymon Migacz</strong>. 8-bit Inference with TensorRT. <a href="http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf">GTC San Jose, 2017</a></p> -<div id="bengio-et-al-2013"></div> - -<p><strong>Yoshua Bengio, Nicholas Leonard and Aaron Courville</strong>. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. <a href="https://arxiv.org/abs/1308.3432">arxiv:1308.3432, 2013</a></p> <div id="zhou-et-al-2016"></div> <p><strong>Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou</strong>. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.
<a href="https://arxiv.org/abs/1606.06160">arxiv:1606.06160</a></p> @@ -277,6 +283,12 @@ As mentioned above, a scale factor is used to adapt the dynamic range of the ten <div id="zhu-et-al-2016"></div> <p><strong>Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally</strong>. Trained Ternary Quantization. <a href="https://arxiv.org/abs/1612.01064">arxiv:1612.01064</a></p> +<div id="bengio-et-al-2013"></div> + +<p><strong>Yoshua Bengio, Nicholas Leonard and Aaron Courville</strong>. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. <a href="https://arxiv.org/abs/1308.3432">arxiv:1308.3432, 2013</a></p> +<div id="hinton-et-al-2012"></div> + +<p><strong>Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed</strong>. Neural Networks for Machine Learning. <a href="https://www.coursera.org/learn/neural-networks">Coursera, video lectures, 2012</a></p> </div> </div> diff --git a/docs/schedule/index.html b/docs/schedule/index.html index 8d0af64f2e95c0c979e705faecf596186ca4ad5b..2bf4ccccc64f1d99fac9d05f4ff06ff9ab281791 100644 --- a/docs/schedule/index.html +++ b/docs/schedule/index.html @@ -80,6 +80,8 @@ <li><a class="toctree-l3" href="#mixing-it-up">Mixing it up</a></li> + <li><a class="toctree-l3" href="#quantization">Quantization</a></li> + </ul> @@ -168,17 +170,16 @@ <div class="section"> <h1 id="compression-scheduler">Compression scheduler</h1> -<p>In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of <code>CompressionScheduler</code>: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and (later) quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. 
We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. </p> +<p>In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of <code>CompressionScheduler</code>: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. </p> <h2 id="high-level-overview">High level overview</h2> -<p>Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, LR-scheduler and Policies.</p> +<p>Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.</p> <ul> -<li>Pruners and Regularizers are very similar: they implement either a Pruning algorithm or a Regularization algorithm. </li> +<li>Pruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. 
</li> <li>An LR-scheduler specifies the LR-decay algorithm. </li> </ul> <p>These define the <strong>what</strong> part of the schedule. </p> -<p>The Policies define the <strong>when</strong> part of the schedule: at which epoch to start applying the Pruner/Regularizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/LR-decay it is managing. -<br> -The CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners and Regularizers from code.</p> +<p>The Policies define the <strong>when</strong> part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing.<br /> +The CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.</p> <h2 id="syntax-through-example">Syntax through example</h2> <p>We'll use <code>alexnet.schedule_agp.yaml</code> to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.</p> <pre><code>version: 1 @@ -218,9 +219,9 @@ policies: <pre><code>version: 1 </code></pre> -<p>In the <code>pruners</code> section, we define the instances of pruners we want the scheduler to instantiate and use.<br> -We define a single pruner instance, named <code>my_pruner</code> of algorithm <code>SensitivityPruner</code>. 
We will refer to this instance in the <code>Policies</code> section.<br> -Then we list the sensitivity multipliers, \(s\), of each of the weight tensors.<br> +<p>In the <code>pruners</code> section, we define the instances of pruners we want the scheduler to instantiate and use.<br /> +We define a single pruner instance, named <code>my_pruner</code>, of algorithm <code>SensitivityPruner</code>. We will refer to this instance in the <code>Policies</code> section.<br /> +Then we list the sensitivity multipliers, \(s\), of each of the weight tensors.<br /> You may list as many Pruners as you want in this section, as long as each has a unique name. You can several types of pruners in one schedule.</p> <pre><code>pruners: my_pruner: @@ -236,14 +237,14 @@ You may list as many Pruners as you want in this section, as long as each has a 'classifier.6.weight': 0.6 </code></pre> -<p>Next, we want to specify the learning-rate decay scheduling in the <code>lr_schedulers</code> section. We assign a name to this instance: <code>pruning_lr</code>. As in the <code>pruners</code> section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. You can use any LR-scheduler class that <code>torch.optim.lr_scheduler</code> supports and pass their arguments. The keyword arguments (kwargs) are passed directly to the constructor of the subclasses of <a href="http://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html">_LRScheduler</a>, so that as new LR-schedulers are added to <code>torch.optim.lr_scheduler</code>, they can be used without changing the application code.</p> +<p>Next, we want to specify the learning-rate decay scheduling in the <code>lr_schedulers</code> section. We assign a name to this instance: <code>pruning_lr</code>. As in the <code>pruners</code> section, you may use any name, as long as all LR-schedulers have a unique name. 
At the moment, only one instance of LR-scheduler is allowed. The LR-scheduler must be a subclass of PyTorch's <a href="http://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html">_LRScheduler</a>. You can use any of the schedulers defined in <code>torch.optim.lr_scheduler</code> (see <a href="https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate">here</a>). In addition, we've implemented some additional schedulers in Distiller (see <a href="https://github.com/NervanaSystems/distiller/blob/master/distiller/learning_rate.py">here</a>). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to <code>torch.optim.lr_scheduler</code>, they can be used without changing the application code.</p> <pre><code>lr_schedulers: pruning_lr: class: ExponentialLR gamma: 0.9 </code></pre> -<p>Finally, we define the <code>policies</code> section which defines the actual scheduling. A <code>Policy</code> manages an instance of a <code>Pruner</code>, <code>Regularizer</code>, or <code>LRSchedule</code>, by naming the instance. In the example below, a <code>PruningPolicy</code> uses the pruner instance named <code>my_pruner</code>: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. </p> +<p>Finally, we define the <code>policies</code> section which defines the actual scheduling. A <code>Policy</code> manages an instance of a <code>Pruner</code>, <code>Regularizer</code>, <code>Quantizer</code>, or <code>LRScheduler</code>, by naming the instance. In the example below, a <code>PruningPolicy</code> uses the pruner instance named <code>my_pruner</code>: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. 
</p> <pre><code>policies: - pruner: instance_name : 'my_pruner' @@ -397,6 +398,43 @@ policies: ending_epoch: 200 frequency: 1 +</code></pre> + +<h2 id="quantization">Quantization</h2> +<p>Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the <code>Quantizer</code> class (see details <a href="../design/index.html#quantization">here</a>). +Let's see an example:</p> +<pre><code>quantizers: + dorefa_quantizer: + class: DorefaQuantizer + bits_activations: 8 + bits_weights: 4 + bits_overrides: + conv1: + wts: null + acts: null + relu1: + wts: null + acts: null + final_relu: + wts: null + acts: null + fc: + wts: null + acts: null +</code></pre> + +<ul> +<li>The specific quantization method we're instantiating here is <code>DorefaQuantizer</code>.</li> +<li>Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. </li> +<li>Then, we define the <code>bits_overrides</code> mapping. In this case, we choose not to quantize the first and last layer of the model. In the case of <code>DorefaQuantizer</code>, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters <code>conv1</code>, the first activation layer <code>relu1</code>, the last activation layer <code>final_relu</code> and the last layer with parameters <code>fc</code>.</li> +<li>Specifying <code>null</code> means "do not quantize".</li> +<li>Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.</li> +<li>We can also reference <strong>groups of layers</strong> in the <code>bits_overrides</code> mapping. 
This is done using regular expressions. Suppose we have a sub-module in our model named <code>block1</code>, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named <code>conv1</code>, <code>conv2</code> and so on. In that case we would define the following:</li> +</ul> +<pre><code>bits_overrides: + block1.conv*: + wts: 2 + acts: null </code></pre> </div> diff --git a/docs/search/search_index.json b/docs/search/search_index.json index 6e6344cd64ce82898d2a5d60e9e56cafc511b5c9..2181117daa54116c90c7c6cbf426c1eb164a6350 100644 --- a/docs/search/search_index.json +++ b/docs/search/search_index.json @@ -1,509 +1,564 @@ { "docs": [ { - "location": "/index.html", - "text": "Distiller Documentation\n\n\nWhat is Distiller\n\n\nDistiller\n is an open-source Python package for neural network compression research.\n\n\nNetwork compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a \nPyTorch\n environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic.\n\n\nDistiller contains:\n\n\n\n\nA framework for integrating pruning, regularization and quantization algorithms.\n\n\nA set of tools for analyzing and evaluating compression performance.\n\n\nExample implementations of state-of-the-art compression algorithms.\n\n\n\n\nMotivation\n\n\nA sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros. A sparse neural network performs computations using some sparse tensors (preferably many). These tensors can be parameters (weights and biases) or activations (feature maps).\n\n\nWhy do we care about sparsity?\n\nPresent day neural networks tend to be deep, with millions of weights and activations. 
Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction. We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition. So inference latency is often something we want to minimize.\n\n\nLarge models are also memory-intensive with millions of parameters. Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment. Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics. In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt).\n\n\nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times.\n\n\nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required. Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency. 
\n\n\nSmaller\n\n\nSparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros. The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed). The compute hardware needs to support the compressions formats, for representation compression to be meaningful. Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses. Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse. In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size. The best we can do is bring the compressed representation as close as possible to the compute engine.\n\nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation). This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines. Therefore, there is a third class of representations, which take advantage of specific hardware characteristics. For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).\n\n\nFaster\n\n\nMany of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations. 
Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations.\n\nReducing the bandwidth required by these layers, will immediately speed them up.\n\nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy. Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.\n\n\nMore energy efficient\n\n\nBecause we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy. Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and off course reduce power consumption.\n\nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.", + "location": "/index.html", + "text": "Distiller Documentation\n\n\nWhat is Distiller\n\n\nDistiller\n is an open-source Python package for neural network compression research.\n\n\nNetwork compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a \nPyTorch\n environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic.\n\n\nDistiller contains:\n\n\n\n\nA framework for integrating pruning, regularization and quantization algorithms.\n\n\nA set of tools for analyzing and evaluating compression performance.\n\n\nExample implementations of state-of-the-art compression algorithms.\n\n\n\n\nMotivation\n\n\nA sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros. 
A sparse neural network performs computations using some sparse tensors (preferably many). These tensors can be parameters (weights and biases) or activations (feature maps).\n\n\nWhy do we care about sparsity?\n\nPresent day neural networks tend to be deep, with millions of weights and activations. Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction. We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition. So inference latency is often something we want to minimize.\n\n\nLarge models are also memory-intensive with millions of parameters. Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment. Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics. 
In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt).\n\n\nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times.\n\n\nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required. Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency. \n\n\nSmaller\n\n\nSparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros. The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed). The compute hardware needs to support the compressions formats, for representation compression to be meaningful. Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses. Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse. In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size. 
The best we can do is bring the compressed representation as close as possible to the compute engine.\n\nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation). This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines. Therefore, there is a third class of representations, which take advantage of specific hardware characteristics. For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).\n\n\nFaster\n\n\nMany of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations. Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations.\n\nReducing the bandwidth required by these layers, will immediately speed them up.\n\nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy. Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.\n\n\nMore energy efficient\n\n\nBecause we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy. 
Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and off course reduce power consumption.\n\nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.", "title": "Home" - }, + }, { - "location": "/index.html#distiller-documentation", - "text": "", + "location": "/index.html#distiller-documentation", + "text": "", "title": "Distiller Documentation" - }, + }, { - "location": "/index.html#what-is-distiller", - "text": "Distiller is an open-source Python package for neural network compression research. Network compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a PyTorch environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic. Distiller contains: A framework for integrating pruning, regularization and quantization algorithms. A set of tools for analyzing and evaluating compression performance. Example implementations of state-of-the-art compression algorithms.", + "location": "/index.html#what-is-distiller", + "text": "Distiller is an open-source Python package for neural network compression research. Network compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a PyTorch environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic. Distiller contains: A framework for integrating pruning, regularization and quantization algorithms. A set of tools for analyzing and evaluating compression performance. 
Example implementations of state-of-the-art compression algorithms.", "title": "What is Distiller" - }, + }, { - "location": "/index.html#motivation", - "text": "A sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros. A sparse neural network performs computations using some sparse tensors (preferably many). These tensors can be parameters (weights and biases) or activations (feature maps). Why do we care about sparsity? \nPresent day neural networks tend to be deep, with millions of weights and activations. Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction. We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition. So inference latency is often something we want to minimize. \nLarge models are also memory-intensive with millions of parameters. Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment. Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics. 
In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt). \nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times. \nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required. Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency.", + "location": "/index.html#motivation", + "text": "A sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros. A sparse neural network performs computations using some sparse tensors (preferably many). These tensors can be parameters (weights and biases) or activations (feature maps). Why do we care about sparsity? \nPresent day neural networks tend to be deep, with millions of weights and activations. Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction. We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition. 
So inference latency is often something we want to minimize. \nLarge models are also memory-intensive with millions of parameters. Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment. Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics. In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt). \nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times. \nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required. Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency.", "title": "Motivation" - }, + }, { - "location": "/index.html#smaller", - "text": "Sparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros. The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed). The compute hardware needs to support the compressions formats, for representation compression to be meaningful. Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses. 
Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse. In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size. The best we can do is bring the compressed representation as close as possible to the compute engine. \nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation). This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines. Therefore, there is a third class of representations, which take advantage of specific hardware characteristics. For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).", + "location": "/index.html#smaller", + "text": "Sparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros. The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed). The compute hardware needs to support the compressions formats, for representation compression to be meaningful. Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses. Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse. In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size. The best we can do is bring the compressed representation as close as possible to the compute engine. 
\nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation). This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines. Therefore, there is a third class of representations, which take advantage of specific hardware characteristics. For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).", "title": "Smaller" - }, + }, { - "location": "/index.html#faster", - "text": "Many of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations. Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations. \nReducing the bandwidth required by these layers, will immediately speed them up. \nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy. Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.", + "location": "/index.html#faster", + "text": "Many of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations. Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations. \nReducing the bandwidth required by these layers, will immediately speed them up. 
\nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy. Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.", "title": "Faster" - }, + }, { - "location": "/index.html#more-energy-efficient", - "text": "Because we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy. Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and of course reduce power consumption. \nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.", + "location": "/index.html#more-energy-efficient", + "text": "Because we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy. Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and of course reduce power consumption. 
\nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.", "title": "More energy efficient" - }, + }, { - "location": "/install/index.html", - "text": "Distiller Installation\n\n\nThese instructions will help get Distiller up and running on your local machine.\n\n\nYou may also want to refer to these resources:\n\n\n\n\nDataset installation\n instructions.\n\n\nJupyter installation\n instructions.\n\n\n\n\nNotes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.\n\n\nClone Distiller\n\n\nClone the Distiller code repository from github:\n\n\n$ git clone https://github.com/NervanaSystems/distiller.git\n\n\n\n\nThe rest of the documentation that follows, assumes that you have cloned your repository to a directory called \ndistiller\n. \n\n\nCreate a Python virtual environment\n\n\nWe recommend using a \nPython virtual environment\n, but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness.\n\nBefore creating the virtual environment, make sure you are located in directory \ndistiller\n. 
After creating the environment, you should see a directory called \ndistiller/env\n.\n\n\n\nUsing virtualenv\n\n\nIf you don't have virtualenv installed, you can find the installation instructions \nhere\n.\n\n\nTo create the environment, execute:\n\n\n$ python3 -m virtualenv env\n\n\n\n\nThis creates a subdirectory named \nenv\n where the python virtual environment is stored, and configures the current shell to use it as the default python environment.\n\n\nUsing venv\n\n\nIf you prefer to use \nvenv\n, then begin by installing it:\n\n\n$ sudo apt-get install python3-venv\n\n\n\n\nThen create the environment:\n\n\n$ python3 -m venv env\n\n\n\n\nAs with virtualenv, this creates a directory called \ndistiller/env\n.\n\n\nActivate the environment\n\n\nThe environment activation and deactivation commands for \nvenv\n and \nvirtualenv\n are the same.\n\n\n!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:\n\n\n$ source env/bin/activate\n\n\n\n\nInstall dependencies\n\n\nFinally, install Distiller's dependency packages using \npip3\n:\n\n\n$ pip3 install -r requirements.txt\n\n\n\n\nPyTorch is included in the \nrequirements.txt\n file, and will currently download PyTorch version 3.1 for CUDA 8.0. 
This is the setup we've used for testing Distiller.", + "location": "/install/index.html", + "text": "Distiller Installation\n\n\nThese instructions will help get Distiller up and running on your local machine.\n\n\nYou may also want to refer to these resources:\n\n\n\n\nDataset installation\n instructions.\n\n\nJupyter installation\n instructions.\n\n\n\n\nNotes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.\n\n\nClone Distiller\n\n\nClone the Distiller code repository from github:\n\n\n$ git clone https://github.com/NervanaSystems/distiller.git\n\n\n\n\nThe rest of the documentation that follows, assumes that you have cloned your repository to a directory called \ndistiller\n. \n\n\nCreate a Python virtual environment\n\n\nWe recommend using a \nPython virtual environment\n, but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness.\n\nBefore creating the virtual environment, make sure you are located in directory \ndistiller\n. 
After creating the environment, you should see a directory called \ndistiller/env\n.\n\n\n\nUsing virtualenv\n\n\nIf you don't have virtualenv installed, you can find the installation instructions \nhere\n.\n\n\nTo create the environment, execute:\n\n\n$ python3 -m virtualenv env\n\n\n\n\nThis creates a subdirectory named \nenv\n where the python virtual environment is stored, and configures the current shell to use it as the default python environment.\n\n\nUsing venv\n\n\nIf you prefer to use \nvenv\n, then begin by installing it:\n\n\n$ sudo apt-get install python3-venv\n\n\n\n\nThen create the environment:\n\n\n$ python3 -m venv env\n\n\n\n\nAs with virtualenv, this creates a directory called \ndistiller/env\n.\n\n\nActivate the environment\n\n\nThe environment activation and deactivation commands for \nvenv\n and \nvirtualenv\n are the same.\n\n\n!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:\n\n\n$ source env/bin/activate\n\n\n\n\nInstall dependencies\n\n\nFinally, install Distiller's dependency packages using \npip3\n:\n\n\n$ pip3 install -r requirements.txt\n\n\n\n\nPyTorch is included in the \nrequirements.txt\n file, and will currently download PyTorch version 3.1 for CUDA 8.0. This is the setup we've used for testing Distiller.", "title": "Installation" - }, + }, { - "location": "/install/index.html#distiller-installation", - "text": "These instructions will help get Distiller up and running on your local machine. You may also want to refer to these resources: Dataset installation instructions. Jupyter installation instructions. Notes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.", + "location": "/install/index.html#distiller-installation", + "text": "These instructions will help get Distiller up and running on your local machine. 
You may also want to refer to these resources: Dataset installation instructions. Jupyter installation instructions. Notes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.", "title": "Distiller Installation" - }, + }, { - "location": "/install/index.html#clone-distiller", - "text": "Clone the Distiller code repository from github: $ git clone https://github.com/NervanaSystems/distiller.git The rest of the documentation that follows, assumes that you have cloned your repository to a directory called distiller .", + "location": "/install/index.html#clone-distiller", + "text": "Clone the Distiller code repository from github: $ git clone https://github.com/NervanaSystems/distiller.git The rest of the documentation that follows, assumes that you have cloned your repository to a directory called distiller .", "title": "Clone Distiller" - }, + }, { - "location": "/install/index.html#create-a-python-virtual-environment", - "text": "We recommend using a Python virtual environment , but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness. \nBefore creating the virtual environment, make sure you are located in directory distiller . After creating the environment, you should see a directory called distiller/env .", + "location": "/install/index.html#create-a-python-virtual-environment", + "text": "We recommend using a Python virtual environment , but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness. \nBefore creating the virtual environment, make sure you are located in directory distiller . 
After creating the environment, you should see a directory called distiller/env .", "title": "Create a Python virtual environment" - }, + }, { - "location": "/install/index.html#using-virtualenv", - "text": "If you don't have virtualenv installed, you can find the installation instructions here . To create the environment, execute: $ python3 -m virtualenv env This creates a subdirectory named env where the python virtual environment is stored, and configures the current shell to use it as the default python environment.", + "location": "/install/index.html#using-virtualenv", + "text": "If you don't have virtualenv installed, you can find the installation instructions here . To create the environment, execute: $ python3 -m virtualenv env This creates a subdirectory named env where the python virtual environment is stored, and configures the current shell to use it as the default python environment.", "title": "Using virtualenv" - }, + }, { - "location": "/install/index.html#using-venv", - "text": "If you prefer to use venv , then begin by installing it: $ sudo apt-get install python3-venv Then create the environment: $ python3 -m venv env As with virtualenv, this creates a directory called distiller/env .", + "location": "/install/index.html#using-venv", + "text": "If you prefer to use venv , then begin by installing it: $ sudo apt-get install python3-venv Then create the environment: $ python3 -m venv env As with virtualenv, this creates a directory called distiller/env .", "title": "Using venv" - }, + }, { - "location": "/install/index.html#activate-the-environment", - "text": "The environment activation and deactivation commands for venv and virtualenv are the same. 
!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages: $ source env/bin/activate", + "location": "/install/index.html#activate-the-environment", + "text": "The environment activation and deactivation commands for venv and virtualenv are the same. !NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages: $ source env/bin/activate", "title": "Activate the environment" - }, + }, { - "location": "/install/index.html#install-dependencies", - "text": "Finally, install Distiller's dependency packages using pip3 : $ pip3 install -r requirements.txt PyTorch is included in the requirements.txt file, and will currently download PyTorch version 3.1 for CUDA 8.0. This is the setup we've used for testing Distiller.", + "location": "/install/index.html#install-dependencies", + "text": "Finally, install Distiller's dependency packages using pip3 : $ pip3 install -r requirements.txt PyTorch is included in the requirements.txt file, and will currently download PyTorch version 3.1 for CUDA 8.0. This is the setup we've used for testing Distiller.", "title": "Install dependencies" - }, + }, { - "location": "/usage/index.html", - "text": "Using the sample application\n\n\nThe Distiller repository contains a sample application, \ndistiller/examples/classifier_compression/compress_classifier.py\n, and a set of scheduling files which demonstrate Distiller's features. 
Following is a brief discussion of how to use this application and the accompanying schedules.\n\n\nYou might also want to refer to the following resources:\n\n\n\n\nAn \nexplanation\n of the scheduler file format.\n\n\nAn in-depth \ndiscussion\n of how we used these schedule files to implement several state-of-the-art DNN compression research papers.\n\n\n\n\nThe sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application. The code is documented and should be considered the best source of documentation, but we provide some elaboration here.\n\n\nThis diagram shows how where \ncompress_classifier.py\n fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.\n\n\n\nCommand line arguments\n\n\nTo get help on the command line arguments, invoke:\n\n\n$ python3 compress_classifier.py --help\n\n\n\n\nFor example:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 
0.04703 | -0.00250 | 0.02289 |\n | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][ 50/ 500] Loss 2.175988 Top1 51.289063 Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][ 100/ 500] Loss 2.171564 Top1 51.175781 Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][ 150/ 500] Loss 2.159347 Top1 51.546875 Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][ 200/ 500] Loss 2.156857 Top1 51.585938 Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][ 250/ 500] Loss 2.152790 Top1 51.707813 Top5 74.681250\n 2018-04-04 
21:31:22,195 - Epoch: [89][ 300/ 500] Loss 2.149962 Top1 51.791667 Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][ 350/ 500] Loss 2.150936 Top1 51.827009 Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][ 400/ 500] Loss 2.150853 Top1 51.781250 Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][ 450/ 500] Loss 2.150156 Top1 51.828125 Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][ 500/ 500] Loss 2.150417 Top1 51.833594 Top5 74.817187\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [ 50/ 195] Loss 1.487607 Top1 63.273438 Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [ 100/ 195] Loss 1.638043 Top1 60.636719 Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [ 150/ 195] Loss 1.833214 Top1 57.619792 Top5 80.447917\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893\n\n\n\n\nLet's look at the command line again:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\n\n\nIn this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration:\n\n\n\n\nLearning-rate of 0.005\n\n\nPrint progress every 50 mini-batches.\n\n\nUse 44 worker threads to load data (make sure to use something suitable for your machine).\n\n\nRun for 90 epochs. Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0. When you train and prune your own networks, the last training epoch is saved as a metadata with the model. 
Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch.\n\n\nThe pruning schedule is provided in \nalexnet.schedule_sensitivity.yaml\n\n\nLog files are written to directory \nlogs\n.\n\n\n\n\nExamples\n\n\nDistiller comes with several example schedules which can be used together with \ncompress_classifier.py\n.\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization. The results usually contain a table showing the sparsity of each of the model parameters, together with the validation and test top1, top5 and loss scores.\n\n\nFor more details on the example schedules, you can refer to the coverage of the \nModel Zoo\n.\n\n\n\n\nexamples/agp-pruning\n:\n\n\nAutomated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset)\n\n\n\n\n\n\n\nexamples/hybrid\n:\n\n\nAlexNet AGP with 2D (kernel) regularization (ImageNet dataset)\n\n\nAlexNet sensitivity pruning with 2D regularization\n\n\n\n\n\n\n\nexamples/network_slimming\n:\n\n\nResNet20 Network Slimming (this is work-in-progress)\n\n\n\n\n\n\n\nexamples/pruning_filters_for_efficient_convnets\n:\n\n\nResNet56 baseline training (CIFAR10 dataset)\n\n\nResNet56 filter removal using filter ranking\n\n\n\n\n\n\n\nexamples/sensitivity_analysis\n:\n\n\nElement-wise pruning sensitivity-analysis:\n\n\nAlexNet (ImageNet)\n\n\nMobileNet (ImageNet)\n\n\nResNet18 (ImageNet)\n\n\nResNet20 (CIFAR10)\n\n\nResNet34 (ImageNet)\n\n\nFilter-wise pruning sensitivity-analysis:\n\n\nResNet20 (CIFAR10)\n\n\nResNet56 (CIFAR10)\n\n\n\n\n\n\n\n\nexamples/sensitivity-pruning\n:\n\n\n\n\nAlexNet sensitivity pruning with Iterative Pruning\n\n\nAlexNet sensitivity pruning with One-Shot Pruning\n\n\n\n\n\n\n\n\nexamples/ssl\n:\n\n\n\n\nResNet20 baseline training (CIFAR10 dataset)\n\n\nStructured Sparsity Learning (SSL) with 
layer removal on ResNet20\n\n\nSSL with channels removal on ResNet20\n\n\n\n\n\n\n\n\nExperiment reproducibility\n\n\nExperiment reproducibility is sometimes important. Pete Warden recently expounded about this in his \nblog\n.\n\nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs. Using the \n--deterministic\n command-line flag and setting \nj=1\n will produce reproducible results (for the same PyTorch version).\n\n\nPerforming pruning sensitivity analysis\n\n\nDistiller supports element-wise and filter-wise pruning sensitivity analysis. In both cases, L1-norm is used to rank which elements or filters to prune. For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero. \n\nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor. Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results.\n\nResults are output as a CSV file (\nsensitivity.csv\n) and PNG file (\nsensitivity.png\n). 
The implementation is in \ndistiller/sensitivity.py\n and it contains further details about process and the format of the CSV file.\n\n\nThe example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10:\n\n\n$ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element\n\n\n\n\nThe \nsense\n command-line argument can be set to either \nelement\n or \nfilter\n, depending on the type of analysis you want done.\n\n\nThere is also a \nJupyter notebook\n with example invocations, outputs and explanations.\n\n\nQuantization\n\n\nCurrently Distiller support 8-bit quantization only (quantization of lower precision data types will follow shortly) which does not require training, so any model (whether pruned or not) can be quantized.\n\nUse the \n--quantize\n command-line flag, together with \n--evaluate\n to evaluate the accuracy of your model after quantization. The following example qunatizes ResNet18 for ImageNet:\n\n\n$ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize --evaluate\n\n\n\n\nGenerates:\n\n\nPreparing model for quantization\n--- test ---------------------\n50000 samples (256 per mini-batch)\nTest: [ 10/ 195] Loss 0.856354 Top1 79.257812 Top5 92.500000\nTest: [ 20/ 195] Loss 0.923131 Top1 76.953125 Top5 92.246094\nTest: [ 30/ 195] Loss 0.885186 Top1 77.955729 Top5 92.486979\nTest: [ 40/ 195] Loss 0.930263 Top1 76.181641 Top5 92.597656\nTest: [ 50/ 195] Loss 0.931062 Top1 75.726562 Top5 92.906250\nTest: [ 60/ 195] Loss 0.932019 Top1 75.651042 Top5 93.151042\nTest: [ 70/ 195] Loss 0.921287 Top1 76.060268 Top5 93.270089\nTest: [ 80/ 195] Loss 0.932539 Top1 75.986328 Top5 93.100586\nTest: [ 90/ 195] Loss 0.996000 Top1 74.700521 Top5 92.330729\nTest: [ 100/ 195] Loss 1.066699 Top1 73.289062 Top5 91.437500\nTest: [ 110/ 195] Loss 1.100970 Top1 72.574574 Top5 91.001420\nTest: [ 120/ 195] Loss 1.122376 
Top1 72.268880 Top5 90.696615\nTest: [ 130/ 195] Loss 1.171726 Top1 71.198918 Top5 90.120192\nTest: [ 140/ 195] Loss 1.191500 Top1 70.797991 Top5 89.902344\nTest: [ 150/ 195] Loss 1.219954 Top1 70.210938 Top5 89.453125\nTest: [ 160/ 195] Loss 1.240942 Top1 69.855957 Top5 89.162598\nTest: [ 170/ 195] Loss 1.265741 Top1 69.342831 Top5 88.807445\nTest: [ 180/ 195] Loss 1.281185 Top1 69.051649 Top5 88.589410\nTest: [ 190/ 195] Loss 1.279682 Top1 69.019326 Top5 88.632812\n==> Top1: 69.130 Top5: 88.732 Loss: 1.276\n\n\n\n\nSummaries\n\n\nYou can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-).\n\n\n$ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute\n\n\n\n\nGenerates:\n\n\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n| | Name | Type | Attrs | IFM | IFM volume | OFM | OFM volume | Weights volume | MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n| 0 | module.conv1 | Conv2d | k=(3, 3) | (1, 3, 32, 32) | 3072 | (1, 16, 32, 32) | 16384 | 432 | 442368 |\n| 1 | module.layer1.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 2 | module.layer1.0.conv2 | Conv2d | k=(3, 3) | (1, 16, 
32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 3 | module.layer1.1.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 4 | module.layer1.1.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 5 | module.layer1.2.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 6 | module.layer1.2.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 7 | module.layer2.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 4608 | 1179648 |\n| 8 | module.layer2.0.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 512 | 131072 |\n| 10 | module.layer2.1.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 11 | module.layer2.1.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 12 | module.layer2.2.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 13 | module.layer2.2.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 14 | module.layer3.0.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 18432 | 1179648 |\n| 15 | module.layer3.0.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 2048 | 131072 |\n| 17 | module.layer3.1.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 18 | module.layer3.1.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 19 | module.layer3.2.conv1 | 
Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 20 | module.layer3.2.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 21 | module.fc | Linear | | (1, 64) | 64 | (1, 10) | 10 | 640 | 640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184\n\n\n\n\nUsing TensorBoard\n\n\nGoogle's \nTensorBoard\n is an excellent tool for visualizing the progress of DNN training. Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow).\n\nTo view the graphs, invoke the TensorBoard server. For example:\n\n\n$ tensorboard --logdir=logs\n\n\n\n\nDistillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the \nTensorFlow installation instructions\n.\n\n\nCollecting feature-maps statistics\n\n\nIn CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). \n\nYou can collect activation statistics using the \n--act_stats\n command-line flag.\n\n\nUsing the Jupyter notebooks\n\n\nThe Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller. 
They are explained in a separate page.\n\n\nGenerating this documentation\n\n\nInstall mkdocs and the required packages by executing:\n\n\n$ pip3 install -r doc-requirements.txt\n\n\n\n\nTo build the project documentation run:\n\n\n$ cd distiller/docs-src\n$ mkdocs build --clean\n\n\n\n\nThis will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.", + "location": "/usage/index.html", + "text": "Using the sample application\n\n\nThe Distiller repository contains a sample application, \ndistiller/examples/classifier_compression/compress_classifier.py\n, and a set of scheduling files which demonstrate Distiller's features. Following is a brief discussion of how to use this application and the accompanying schedules.\n\n\nYou might also want to refer to the following resources:\n\n\n\n\nAn \nexplanation\n of the scheduler file format.\n\n\nAn in-depth \ndiscussion\n of how we used these schedule files to implement several state-of-the-art DNN compression research papers.\n\n\n\n\nThe sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application. 
The code is documented and should be considered the best source of documentation, but we provide some elaboration here.\n\n\nThis diagram shows how where \ncompress_classifier.py\n fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.\n\n\n\nCommand line arguments\n\n\nTo get help on the command line arguments, invoke:\n\n\n$ python3 compress_classifier.py --help\n\n\n\n\nFor example:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n | 5 | 
classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][ 50/ 500] Loss 2.175988 Top1 51.289063 Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][ 100/ 500] Loss 2.171564 Top1 51.175781 Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][ 150/ 500] Loss 2.159347 Top1 51.546875 Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][ 200/ 500] Loss 2.156857 Top1 51.585938 Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][ 250/ 500] Loss 2.152790 Top1 51.707813 Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][ 300/ 500] Loss 2.149962 Top1 51.791667 Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][ 350/ 500] Loss 2.150936 Top1 51.827009 Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][ 400/ 500] Loss 2.150853 Top1 51.781250 Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][ 450/ 500] Loss 2.150156 Top1 51.828125 Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][ 500/ 500] Loss 2.150417 Top1 51.833594 Top5 74.817187\n 2018-04-04 21:31:35,357 - ==\n Top1: 51.838 Top5: 74.817 Loss: 
2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [ 50/ 195] Loss 1.487607 Top1 63.273438 Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [ 100/ 195] Loss 1.638043 Top1 60.636719 Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [ 150/ 195] Loss 1.833214 Top1 57.619792 Top5 80.447917\n 2018-04-04 21:32:01,274 - ==\n Top1: 56.606 Top5: 79.446 Loss: 1.893\n\n\n\n\nLet's look at the command line again:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\n\n\nIn this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration:\n\n\n\n\nLearning-rate of 0.005\n\n\nPrint progress every 50 mini-batches.\n\n\nUse 44 worker threads to load data (make sure to use something suitable for your machine).\n\n\nRun for 90 epochs. Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0. When you train and prune your own networks, the last training epoch is saved as a metadata with the model. Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch.\n\n\nThe pruning schedule is provided in \nalexnet.schedule_sensitivity.yaml\n\n\nLog files are written to directory \nlogs\n.\n\n\n\n\nExamples\n\n\nDistiller comes with several example schedules which can be used together with \ncompress_classifier.py\n.\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization. 
The results usually contain a table showing the sparsity of each of the model parameters, together with the validation and test top1, top5 and loss scores.\n\n\nFor more details on the example schedules, you can refer to the coverage of the \nModel Zoo\n.\n\n\n\n\nexamples/agp-pruning\n:\n\n\nAutomated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset)\n\n\n\n\n\n\n\nexamples/hybrid\n:\n\n\nAlexNet AGP with 2D (kernel) regularization (ImageNet dataset)\n\n\nAlexNet sensitivity pruning with 2D regularization\n\n\n\n\n\n\n\nexamples/network_slimming\n:\n\n\nResNet20 Network Slimming (this is work-in-progress)\n\n\n\n\n\n\n\nexamples/pruning_filters_for_efficient_convnets\n:\n\n\nResNet56 baseline training (CIFAR10 dataset)\n\n\nResNet56 filter removal using filter ranking\n\n\n\n\n\n\n\nexamples/sensitivity_analysis\n:\n\n\nElement-wise pruning sensitivity-analysis:\n\n\nAlexNet (ImageNet)\n\n\nMobileNet (ImageNet)\n\n\nResNet18 (ImageNet)\n\n\nResNet20 (CIFAR10)\n\n\nResNet34 (ImageNet)\n\n\nFilter-wise pruning sensitivity-analysis:\n\n\nResNet20 (CIFAR10)\n\n\nResNet56 (CIFAR10)\n\n\n\n\n\n\n\nexamples/sensitivity-pruning\n:\n\n\nAlexNet sensitivity pruning with Iterative Pruning\n\n\nAlexNet sensitivity pruning with One-Shot Pruning\n\n\n\n\n\n\n\nexamples/ssl\n:\n\n\nResNet20 baseline training (CIFAR10 dataset)\n\n\nStructured Sparsity Learning (SSL) with layer removal on ResNet20\n\n\nSSL with channels removal on ResNet20\n\n\n\n\n\n\n\nexamples/quantization\n:\n\n\nAlexNet w. Batch-Norm (base FP32 + DoReFa)\n\n\nPre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa)\n\n\nPre-activation ResNet18 on ImageNEt (base FP32 + DoReFa)\n\n\n\n\n\n\n\n\nExperiment reproducibility\n\n\nExperiment reproducibility is sometimes important. 
Pete Warden recently expounded about this in his \nblog\n.\n\nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs. Using the \n--deterministic\n command-line flag and setting \nj=1\n will produce reproducible results (for the same PyTorch version).\n\n\nPerforming pruning sensitivity analysis\n\n\nDistiller supports element-wise and filter-wise pruning sensitivity analysis. In both cases, L1-norm is used to rank which elements or filters to prune. For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero. \n\nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor. Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results.\n\nResults are output as a CSV file (\nsensitivity.csv\n) and PNG file (\nsensitivity.png\n). 
The implementation is in \ndistiller/sensitivity.py\n and it contains further details about process and the format of the CSV file.\n\n\nThe example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10:\n\n\n$ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element\n\n\n\n\nThe \nsense\n command-line argument can be set to either \nelement\n or \nfilter\n, depending on the type of analysis you want done.\n\n\nThere is also a \nJupyter notebook\n with example invocations, outputs and explanations.\n\n\n\"Direct\" Quantization Without Training\n\n\nDistiller supports 8-bit quantization of trained modules without re-training (using \nSymmetric Linear Quantization\n). So, any model (whether pruned or not) can be quantized.\n\nUse the \n--quantize\n command-line flag, together with \n--evaluate\n to evaluate the accuracy of your model after quantization. The following example qunatizes ResNet18 for ImageNet:\n\n\n$ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize --evaluate\n\n\n\n\nGenerates:\n\n\nPreparing model for quantization\n--- test ---------------------\n50000 samples (256 per mini-batch)\nTest: [ 10/ 195] Loss 0.856354 Top1 79.257812 Top5 92.500000\nTest: [ 20/ 195] Loss 0.923131 Top1 76.953125 Top5 92.246094\nTest: [ 30/ 195] Loss 0.885186 Top1 77.955729 Top5 92.486979\nTest: [ 40/ 195] Loss 0.930263 Top1 76.181641 Top5 92.597656\nTest: [ 50/ 195] Loss 0.931062 Top1 75.726562 Top5 92.906250\nTest: [ 60/ 195] Loss 0.932019 Top1 75.651042 Top5 93.151042\nTest: [ 70/ 195] Loss 0.921287 Top1 76.060268 Top5 93.270089\nTest: [ 80/ 195] Loss 0.932539 Top1 75.986328 Top5 93.100586\nTest: [ 90/ 195] Loss 0.996000 Top1 74.700521 Top5 92.330729\nTest: [ 100/ 195] Loss 1.066699 Top1 73.289062 Top5 91.437500\nTest: [ 110/ 195] Loss 1.100970 Top1 72.574574 Top5 91.001420\nTest: [ 120/ 195] Loss 1.122376 Top1 
72.268880 Top5 90.696615\nTest: [ 130/ 195] Loss 1.171726 Top1 71.198918 Top5 90.120192\nTest: [ 140/ 195] Loss 1.191500 Top1 70.797991 Top5 89.902344\nTest: [ 150/ 195] Loss 1.219954 Top1 70.210938 Top5 89.453125\nTest: [ 160/ 195] Loss 1.240942 Top1 69.855957 Top5 89.162598\nTest: [ 170/ 195] Loss 1.265741 Top1 69.342831 Top5 88.807445\nTest: [ 180/ 195] Loss 1.281185 Top1 69.051649 Top5 88.589410\nTest: [ 190/ 195] Loss 1.279682 Top1 69.019326 Top5 88.632812\n==\n Top1: 69.130 Top5: 88.732 Loss: 1.276\n\n\n\n\nSummaries\n\n\nYou can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-).\n\n\n$ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute\n\n\n\n\nGenerates:\n\n\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n| | Name | Type | Attrs | IFM | IFM volume | OFM | OFM volume | Weights volume | MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n| 0 | module.conv1 | Conv2d | k=(3, 3) | (1, 3, 32, 32) | 3072 | (1, 16, 32, 32) | 16384 | 432 | 442368 |\n| 1 | module.layer1.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 2 | module.layer1.0.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 
32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 3 | module.layer1.1.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 4 | module.layer1.1.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 5 | module.layer1.2.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 6 | module.layer1.2.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 7 | module.layer2.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 4608 | 1179648 |\n| 8 | module.layer2.0.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 512 | 131072 |\n| 10 | module.layer2.1.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 11 | module.layer2.1.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 12 | module.layer2.2.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 13 | module.layer2.2.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 14 | module.layer3.0.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 18432 | 1179648 |\n| 15 | module.layer3.0.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 2048 | 131072 |\n| 17 | module.layer3.1.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 18 | module.layer3.1.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 19 | module.layer3.2.conv1 | 
Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 20 | module.layer3.2.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 21 | module.fc | Linear | | (1, 64) | 64 | (1, 10) | 10 | 640 | 640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184\n\n\n\n\nUsing TensorBoard\n\n\nGoogle's \nTensorBoard\n is an excellent tool for visualizing the progress of DNN training. Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow).\n\nTo view the graphs, invoke the TensorBoard server. For example:\n\n\n$ tensorboard --logdir=logs\n\n\n\n\nDistillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the \nTensorFlow installation instructions\n.\n\n\nCollecting feature-maps statistics\n\n\nIn CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). \n\nYou can collect activation statistics using the \n--act_stats\n command-line flag.\n\n\nUsing the Jupyter notebooks\n\n\nThe Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller. 
They are explained in a separate page.\n\n\nGenerating this documentation\n\n\nInstall mkdocs and the required packages by executing:\n\n\n$ pip3 install -r doc-requirements.txt\n\n\n\n\nTo build the project documentation run:\n\n\n$ cd distiller/docs-src\n$ mkdocs build --clean\n\n\n\n\nThis will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.", "title": "Usage" - }, + }, { - "location": "/usage/index.html#using-the-sample-application", - "text": "The Distiller repository contains a sample application, distiller/examples/classifier_compression/compress_classifier.py , and a set of scheduling files which demonstrate Distiller's features. Following is a brief discussion of how to use this application and the accompanying schedules. You might also want to refer to the following resources: An explanation of the scheduler file format. An in-depth discussion of how we used these schedule files to implement several state-of-the-art DNN compression research papers. The sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application. The code is documented and should be considered the best source of documentation, but we provide some elaboration here. This diagram shows how where compress_classifier.py fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.", + "location": "/usage/index.html#using-the-sample-application", + "text": "The Distiller repository contains a sample application, distiller/examples/classifier_compression/compress_classifier.py , and a set of scheduling files which demonstrate Distiller's features. Following is a brief discussion of how to use this application and the accompanying schedules. You might also want to refer to the following resources: An explanation of the scheduler file format. 
An in-depth discussion of how we used these schedule files to implement several state-of-the-art DNN compression research papers. The sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application. The code is documented and should be considered the best source of documentation, but we provide some elaboration here. This diagram shows how where compress_classifier.py fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.", "title": "Using the sample application" - }, + }, { - "location": "/usage/index.html#command-line-arguments", - "text": "To get help on the command line arguments, invoke: $ python3 compress_classifier.py --help For example: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 
| -0.00184 | 0.01803 |\n | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][ 50/ 500] Loss 2.175988 Top1 51.289063 Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][ 100/ 500] Loss 2.171564 Top1 51.175781 Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][ 150/ 500] Loss 2.159347 Top1 51.546875 Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][ 200/ 500] Loss 2.156857 Top1 51.585938 Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][ 250/ 500] Loss 2.152790 Top1 51.707813 Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][ 300/ 500] Loss 2.149962 Top1 51.791667 Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][ 350/ 500] Loss 2.150936 Top1 51.827009 Top5 
74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][ 400/ 500] Loss 2.150853 Top1 51.781250 Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][ 450/ 500] Loss 2.150156 Top1 51.828125 Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][ 500/ 500] Loss 2.150417 Top1 51.833594 Top5 74.817187\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [ 50/ 195] Loss 1.487607 Top1 63.273438 Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [ 100/ 195] Loss 1.638043 Top1 60.636719 Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [ 150/ 195] Loss 1.833214 Top1 57.619792 Top5 80.447917\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893 Let's look at the command line again: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml In this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration: Learning-rate of 0.005 Print progress every 50 mini-batches. Use 44 worker threads to load data (make sure to use something suitable for your machine). Run for 90 epochs. Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0. When you train and prune your own networks, the last training epoch is saved as a metadata with the model. Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch. 
The pruning schedule is provided in alexnet.schedule_sensitivity.yaml Log files are written to directory logs .", + "location": "/usage/index.html#command-line-arguments", + "text": "To get help on the command line arguments, invoke: $ python3 compress_classifier.py --help For example: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n | 6 | 
classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][ 50/ 500] Loss 2.175988 Top1 51.289063 Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][ 100/ 500] Loss 2.171564 Top1 51.175781 Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][ 150/ 500] Loss 2.159347 Top1 51.546875 Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][ 200/ 500] Loss 2.156857 Top1 51.585938 Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][ 250/ 500] Loss 2.152790 Top1 51.707813 Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][ 300/ 500] Loss 2.149962 Top1 51.791667 Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][ 350/ 500] Loss 2.150936 Top1 51.827009 Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][ 400/ 500] Loss 2.150853 Top1 51.781250 Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][ 450/ 500] Loss 2.150156 Top1 51.828125 Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][ 500/ 500] Loss 2.150417 Top1 51.833594 Top5 74.817187\n 2018-04-04 21:31:35,357 - == Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per 
mini-batch)\n 2018-04-04 21:31:51,512 - Test: [ 50/ 195] Loss 1.487607 Top1 63.273438 Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [ 100/ 195] Loss 1.638043 Top1 60.636719 Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [ 150/ 195] Loss 1.833214 Top1 57.619792 Top5 80.447917\n 2018-04-04 21:32:01,274 - == Top1: 56.606 Top5: 79.446 Loss: 1.893 Let's look at the command line again: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml In this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration: Learning-rate of 0.005 Print progress every 50 mini-batches. Use 44 worker threads to load data (make sure to use something suitable for your machine). Run for 90 epochs. Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0. When you train and prune your own networks, the last training epoch is saved as a metadata with the model. Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch. The pruning schedule is provided in alexnet.schedule_sensitivity.yaml Log files are written to directory logs .", "title": "Command line arguments" - }, + }, { - "location": "/usage/index.html#examples", - "text": "Distiller comes with several example schedules which can be used together with compress_classifier.py .\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization. The results usually contain a table showing the sparsity of each of the model parameters, together with the validation and test top1, top5 and loss scores. For more details on the example schedules, you can refer to the coverage of the Model Zoo . 
examples/agp-pruning : Automated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset) examples/hybrid : AlexNet AGP with 2D (kernel) regularization (ImageNet dataset) AlexNet sensitivity pruning with 2D regularization examples/network_slimming : ResNet20 Network Slimming (this is work-in-progress) examples/pruning_filters_for_efficient_convnets : ResNet56 baseline training (CIFAR10 dataset) ResNet56 filter removal using filter ranking examples/sensitivity_analysis : Element-wise pruning sensitivity-analysis: AlexNet (ImageNet) MobileNet (ImageNet) ResNet18 (ImageNet) ResNet20 (CIFAR10) ResNet34 (ImageNet) Filter-wise pruning sensitivity-analysis: ResNet20 (CIFAR10) ResNet56 (CIFAR10) examples/sensitivity-pruning : AlexNet sensitivity pruning with Iterative Pruning AlexNet sensitivity pruning with One-Shot Pruning examples/ssl : ResNet20 baseline training (CIFAR10 dataset) Structured Sparsity Learning (SSL) with layer removal on ResNet20 SSL with channels removal on ResNet20", + "location": "/usage/index.html#examples", + "text": "Distiller comes with several example schedules which can be used together with compress_classifier.py .\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization. The results usually contain a table showing the sparsity of each of the model parameters, together with the validation and test top1, top5 and loss scores. For more details on the example schedules, you can refer to the coverage of the Model Zoo . 
examples/agp-pruning : Automated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset) examples/hybrid : AlexNet AGP with 2D (kernel) regularization (ImageNet dataset) AlexNet sensitivity pruning with 2D regularization examples/network_slimming : ResNet20 Network Slimming (this is work-in-progress) examples/pruning_filters_for_efficient_convnets : ResNet56 baseline training (CIFAR10 dataset) ResNet56 filter removal using filter ranking examples/sensitivity_analysis : Element-wise pruning sensitivity-analysis: AlexNet (ImageNet) MobileNet (ImageNet) ResNet18 (ImageNet) ResNet20 (CIFAR10) ResNet34 (ImageNet) Filter-wise pruning sensitivity-analysis: ResNet20 (CIFAR10) ResNet56 (CIFAR10) examples/sensitivity-pruning : AlexNet sensitivity pruning with Iterative Pruning AlexNet sensitivity pruning with One-Shot Pruning examples/ssl : ResNet20 baseline training (CIFAR10 dataset) Structured Sparsity Learning (SSL) with layer removal on ResNet20 SSL with channels removal on ResNet20 examples/quantization : AlexNet w. Batch-Norm (base FP32 + DoReFa) Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa) Pre-activation ResNet18 on ImageNEt (base FP32 + DoReFa)", "title": "Examples" - }, + }, { - "location": "/usage/index.html#experiment-reproducibility", - "text": "Experiment reproducibility is sometimes important. Pete Warden recently expounded about this in his blog . \nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs. Using the --deterministic command-line flag and setting j=1 will produce reproducible results (for the same PyTorch version).", + "location": "/usage/index.html#experiment-reproducibility", + "text": "Experiment reproducibility is sometimes important. Pete Warden recently expounded about this in his blog . 
\nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs. Using the --deterministic command-line flag and setting j=1 will produce reproducible results (for the same PyTorch version).", "title": "Experiment reproducibility" - }, + }, { - "location": "/usage/index.html#performing-pruning-sensitivity-analysis", - "text": "Distiller supports element-wise and filter-wise pruning sensitivity analysis. In both cases, L1-norm is used to rank which elements or filters to prune. For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero. \nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor. Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results. \nResults are output as a CSV file ( sensitivity.csv ) and PNG file ( sensitivity.png ). The implementation is in distiller/sensitivity.py and it contains further details about process and the format of the CSV file. The example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10: $ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element The sense command-line argument can be set to either element or filter , depending on the type of analysis you want done. There is also a Jupyter notebook with example invocations, outputs and explanations.", + "location": "/usage/index.html#performing-pruning-sensitivity-analysis", + "text": "Distiller supports element-wise and filter-wise pruning sensitivity analysis. 
In both cases, L1-norm is used to rank which elements or filters to prune. For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero. \nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor. Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results. \nResults are output as a CSV file ( sensitivity.csv ) and PNG file ( sensitivity.png ). The implementation is in distiller/sensitivity.py and it contains further details about process and the format of the CSV file. The example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10: $ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element The sense command-line argument can be set to either element or filter , depending on the type of analysis you want done. There is also a Jupyter notebook with example invocations, outputs and explanations.", "title": "Performing pruning sensitivity analysis" - }, + }, { - "location": "/usage/index.html#quantization", - "text": "Currently Distiller support 8-bit quantization only (quantization of lower precision data types will follow shortly) which does not require training, so any model (whether pruned or not) can be quantized. \nUse the --quantize command-line flag, together with --evaluate to evaluate the accuracy of your model after quantization. 
The following example qunatizes ResNet18 for ImageNet: $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize --evaluate Generates: Preparing model for quantization\n--- test ---------------------\n50000 samples (256 per mini-batch)\nTest: [ 10/ 195] Loss 0.856354 Top1 79.257812 Top5 92.500000\nTest: [ 20/ 195] Loss 0.923131 Top1 76.953125 Top5 92.246094\nTest: [ 30/ 195] Loss 0.885186 Top1 77.955729 Top5 92.486979\nTest: [ 40/ 195] Loss 0.930263 Top1 76.181641 Top5 92.597656\nTest: [ 50/ 195] Loss 0.931062 Top1 75.726562 Top5 92.906250\nTest: [ 60/ 195] Loss 0.932019 Top1 75.651042 Top5 93.151042\nTest: [ 70/ 195] Loss 0.921287 Top1 76.060268 Top5 93.270089\nTest: [ 80/ 195] Loss 0.932539 Top1 75.986328 Top5 93.100586\nTest: [ 90/ 195] Loss 0.996000 Top1 74.700521 Top5 92.330729\nTest: [ 100/ 195] Loss 1.066699 Top1 73.289062 Top5 91.437500\nTest: [ 110/ 195] Loss 1.100970 Top1 72.574574 Top5 91.001420\nTest: [ 120/ 195] Loss 1.122376 Top1 72.268880 Top5 90.696615\nTest: [ 130/ 195] Loss 1.171726 Top1 71.198918 Top5 90.120192\nTest: [ 140/ 195] Loss 1.191500 Top1 70.797991 Top5 89.902344\nTest: [ 150/ 195] Loss 1.219954 Top1 70.210938 Top5 89.453125\nTest: [ 160/ 195] Loss 1.240942 Top1 69.855957 Top5 89.162598\nTest: [ 170/ 195] Loss 1.265741 Top1 69.342831 Top5 88.807445\nTest: [ 180/ 195] Loss 1.281185 Top1 69.051649 Top5 88.589410\nTest: [ 190/ 195] Loss 1.279682 Top1 69.019326 Top5 88.632812\n==> Top1: 69.130 Top5: 88.732 Loss: 1.276", - "title": "Quantization" - }, + "location": "/usage/index.html#direct-quantization-without-training", + "text": "Distiller supports 8-bit quantization of trained modules without re-training (using Symmetric Linear Quantization ). So, any model (whether pruned or not) can be quantized. \nUse the --quantize command-line flag, together with --evaluate to evaluate the accuracy of your model after quantization. 
The following example qunatizes ResNet18 for ImageNet: $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize --evaluate Generates: Preparing model for quantization\n--- test ---------------------\n50000 samples (256 per mini-batch)\nTest: [ 10/ 195] Loss 0.856354 Top1 79.257812 Top5 92.500000\nTest: [ 20/ 195] Loss 0.923131 Top1 76.953125 Top5 92.246094\nTest: [ 30/ 195] Loss 0.885186 Top1 77.955729 Top5 92.486979\nTest: [ 40/ 195] Loss 0.930263 Top1 76.181641 Top5 92.597656\nTest: [ 50/ 195] Loss 0.931062 Top1 75.726562 Top5 92.906250\nTest: [ 60/ 195] Loss 0.932019 Top1 75.651042 Top5 93.151042\nTest: [ 70/ 195] Loss 0.921287 Top1 76.060268 Top5 93.270089\nTest: [ 80/ 195] Loss 0.932539 Top1 75.986328 Top5 93.100586\nTest: [ 90/ 195] Loss 0.996000 Top1 74.700521 Top5 92.330729\nTest: [ 100/ 195] Loss 1.066699 Top1 73.289062 Top5 91.437500\nTest: [ 110/ 195] Loss 1.100970 Top1 72.574574 Top5 91.001420\nTest: [ 120/ 195] Loss 1.122376 Top1 72.268880 Top5 90.696615\nTest: [ 130/ 195] Loss 1.171726 Top1 71.198918 Top5 90.120192\nTest: [ 140/ 195] Loss 1.191500 Top1 70.797991 Top5 89.902344\nTest: [ 150/ 195] Loss 1.219954 Top1 70.210938 Top5 89.453125\nTest: [ 160/ 195] Loss 1.240942 Top1 69.855957 Top5 89.162598\nTest: [ 170/ 195] Loss 1.265741 Top1 69.342831 Top5 88.807445\nTest: [ 180/ 195] Loss 1.281185 Top1 69.051649 Top5 88.589410\nTest: [ 190/ 195] Loss 1.279682 Top1 69.019326 Top5 88.632812\n== Top1: 69.130 Top5: 88.732 Loss: 1.276", + "title": "\"Direct\" Quantization Without Training" + }, { - "location": "/usage/index.html#summaries", - "text": "You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on 
features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-). $ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute Generates: +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n| | Name | Type | Attrs | IFM | IFM volume | OFM | OFM volume | Weights volume | MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n| 0 | module.conv1 | Conv2d | k=(3, 3) | (1, 3, 32, 32) | 3072 | (1, 16, 32, 32) | 16384 | 432 | 442368 |\n| 1 | module.layer1.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 2 | module.layer1.0.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 3 | module.layer1.1.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 4 | module.layer1.1.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 5 | module.layer1.2.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 6 | module.layer1.2.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 7 | module.layer2.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 4608 | 1179648 |\n| 8 | module.layer2.0.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 512 | 131072 |\n| 10 | module.layer2.1.conv1 | 
Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 11 | module.layer2.1.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 12 | module.layer2.2.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 13 | module.layer2.2.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 14 | module.layer3.0.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 18432 | 1179648 |\n| 15 | module.layer3.0.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 2048 | 131072 |\n| 17 | module.layer3.1.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 18 | module.layer3.1.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 19 | module.layer3.2.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 20 | module.layer3.2.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 21 | module.fc | Linear | | (1, 64) | 64 | (1, 10) | 10 | 640 | 640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184", + "location": "/usage/index.html#summaries", + "text": "You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 
and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-). $ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute Generates: +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n| | Name | Type | Attrs | IFM | IFM volume | OFM | OFM volume | Weights volume | MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n| 0 | module.conv1 | Conv2d | k=(3, 3) | (1, 3, 32, 32) | 3072 | (1, 16, 32, 32) | 16384 | 432 | 442368 |\n| 1 | module.layer1.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 2 | module.layer1.0.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 3 | module.layer1.1.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 4 | module.layer1.1.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 5 | module.layer1.2.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 6 | module.layer1.2.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 |\n| 7 | module.layer2.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 4608 | 1179648 |\n| 8 | module.layer2.0.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 512 | 131072 |\n| 10 | module.layer2.1.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 
32, 16, 16) | 8192 | 9216 | 2359296 |\n| 11 | module.layer2.1.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 12 | module.layer2.2.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 13 | module.layer2.2.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 |\n| 14 | module.layer3.0.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 18432 | 1179648 |\n| 15 | module.layer3.0.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 2048 | 131072 |\n| 17 | module.layer3.1.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 18 | module.layer3.1.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 19 | module.layer3.2.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 20 | module.layer3.2.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 |\n| 21 | module.fc | Linear | | (1, 64) | 64 | (1, 10) | 10 | 640 | 640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184", "title": "Summaries" - }, + }, { - "location": "/usage/index.html#using-tensorboard", - "text": "Google's TensorBoard is an excellent tool for visualizing the progress of DNN training. Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow). \nTo view the graphs, invoke the TensorBoard server. 
For example: $ tensorboard --logdir=logs Distillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the TensorFlow installation instructions .", + "location": "/usage/index.html#using-tensorboard", + "text": "Google's TensorBoard is an excellent tool for visualizing the progress of DNN training. Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow). \nTo view the graphs, invoke the TensorBoard server. For example: $ tensorboard --logdir=logs Distiller's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the TensorFlow installation instructions .", "title": "Using TensorBoard" - }, + }, { - "location": "/usage/index.html#collecting-feature-maps-statistics", - "text": "In CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). \nYou can collect activation statistics using the --act_stats command-line flag.", + "location": "/usage/index.html#collecting-feature-maps-statistics", + "text": "In CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). \nYou can collect activation statistics using the --act_stats command-line flag.", "title": "Collecting feature-maps statistics" - }, + }, { - "location": "/usage/index.html#using-the-jupyter-notebooks", - "text": "The Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller. They are explained in a separate page.", + "location": "/usage/index.html#using-the-jupyter-notebooks", + "text": "The Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller. 
They are explained in a separate page.", "title": "Using the Jupyter notebooks" - }, + }, { - "location": "/usage/index.html#generating-this-documentation", - "text": "Install mkdocs and the required packages by executing: $ pip3 install -r doc-requirements.txt To build the project documentation run: $ cd distiller/docs-src\n$ mkdocs build --clean This will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.", + "location": "/usage/index.html#generating-this-documentation", + "text": "Install mkdocs and the required packages by executing: $ pip3 install -r doc-requirements.txt To build the project documentation run: $ cd distiller/docs-src\n$ mkdocs build --clean This will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.", "title": "Generating this documentation" - }, + }, { - "location": "/schedule/index.html", - "text": "Compression scheduler\n\n\nIn iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of \nCompressionScheduler\n: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and (later) quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. 
\n\n\nHigh level overview\n\n\nLet's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, LR-scheduler and Policies.\n\n\n\n\nPruners and Regularizers are very similar: they implement either a Pruning algorithm or a Regularization algorithm. \n\n\nAn LR-scheduler specifies the LR-decay algorithm. \n\n\n\n\nThese define the \nwhat\n part of the schedule. \n\n\nThe Policies define the \nwhen\n part of the schedule: at which epoch to start applying the Pruner/Regularizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/LR-decay it is managing.\n\n\nThe CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners and Regularizers from code.\n\n\nSyntax through example\n\n\nWe'll use \nalexnet.schedule_agp.yaml\n to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.\n\n\nversion: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nlr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1\n\n\n\n\nThere is only one version of the YAML syntax, and the version number is not verified at the moment. 
However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.\n\n\nversion: 1\n\n\n\n\nIn the \npruners\n section, we define the instances of pruners we want the scheduler to instantiate and use.\n\nWe define a single pruner instance, named \nmy_pruner\n of algorithm \nSensitivityPruner\n. We will refer to this instance in the \nPolicies\n section.\n\nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.\n\nYou may list as many Pruners as you want in this section, as long as each has a unique name. You can several types of pruners in one schedule.\n\n\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.6\n\n\n\n\nNext, we want to specify the learning-rate decay scheduling in the \nlr_schedulers\n section. We assign a name to this instance: \npruning_lr\n. As in the \npruners\n section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. You can use any LR-scheduler class that \ntorch.optim.lr_scheduler\n supports and pass their arguments. The keyword arguments (kwargs) are passed directly to the constructor of the subclasses of \n_LRScheduler\n, so that as new LR-schedulers are added to \ntorch.optim.lr_scheduler\n, they can be used without changing the application code.\n\n\nlr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\n\n\n\nFinally, we define the \npolicies\n section which defines the actual scheduling. A \nPolicy\n manages an instance of a \nPruner\n, \nRegularizer\n, or \nLRSchedule\n, by naming the instance. 
In the example below, a \nPruningPolicy\n uses the pruner instance named \nmy_pruner\n: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. \n\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1\n\n\n\n\nThis is \niterative pruning\n:\n\n\n\n\n\n\nTrain Connectivity\n\n\n\n\n\n\nPrune Connections\n\n\n\n\n\n\nRetrain Weights\n\n\n\n\n\n\nGoto 2\n\n\n\n\n\n\nIt is described in \nLearning both Weights and Connections for Efficient Neural Networks\n:\n\n\n\n\n\"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. 
The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"\n\n\n\n\nRegularization\n\n\nYou can also define and schedule regularization.\n\n\nL1 regularization\n\n\nFormat (this is an informal specification, not a valid \nABNF\n specification):\n\n\nregularizers:\n <REGULARIZER_NAME_STR>:\n class: L1Regularizer\n reg_regims:\n <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n ...\n <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n my_L1_reg:\n class: L1Regularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': 0.000002\n 'module.layer3.1.conv2.weight': 0.000002\n 'module.layer3.1.conv3.weight': 0.000002\n 'module.layer3.2.conv1.weight': 0.000002\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_L1_reg\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1\n\n\n\n\nGroup regularization\n\n\nFormat (informal specification):\n\n\nFormat:\n regularizers:\n <REGULARIZER_NAME_STR>:\n class: L1Regularizer\n reg_regims:\n <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n my_filter_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': [0.00005, '3D']\n 'module.layer3.1.conv2.weight': [0.00005, '3D']\n 'module.layer3.1.conv3.weight': [0.00005, '3D']\n 'module.layer3.2.conv1.weight': [0.00005, '3D']\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_filter_regularizer\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1\n\n\n\n\nMixing it up\n\n\nYou can mix pruning and regularization.\n\n\nversion: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 
'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nregularizers:\n 2d_groups_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'features.module.0.weight': [0.000012, '2D']\n 'features.module.3.weight': [0.000012, '2D']\n 'features.module.6.weight': [0.000012, '2D']\n 'features.module.8.weight': [0.000012, '2D']\n 'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n # Learning rate decay scheduler\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - regularizer:\n instance_name: '2d_groups_regularizer'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 1\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1", + "location": "/schedule/index.html", + "text": "Compression scheduler\n\n\nIn iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of \nCompressionScheduler\n: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. 
\n\n\nHigh level overview\n\n\nLet's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.\n\n\n\n\nPruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. \n\n\nAn LR-scheduler specifies the LR-decay algorithm. \n\n\n\n\nThese define the \nwhat\n part of the schedule. \n\n\nThe Policies define the \nwhen\n part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing.\n\nThe CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.\n\n\nSyntax through example\n\n\nWe'll use \nalexnet.schedule_agp.yaml\n to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.\n\n\nversion: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nlr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1\n\n\n\n\nThere is only one version of the YAML syntax, and the version number is not verified at the moment. 
However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.\n\n\nversion: 1\n\n\n\n\nIn the \npruners\n section, we define the instances of pruners we want the scheduler to instantiate and use.\n\nWe define a single pruner instance, named \nmy_pruner\n, of algorithm \nSensitivityPruner\n. We will refer to this instance in the \nPolicies\n section.\n\nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.\n\nYou may list as many Pruners as you want in this section, as long as each has a unique name. You can use several types of pruners in one schedule.\n\n\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.6\n\n\n\n\nNext, we want to specify the learning-rate decay scheduling in the \nlr_schedulers\n section. We assign a name to this instance: \npruning_lr\n. As in the \npruners\n section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. The LR-scheduler must be a subclass of PyTorch's \n_LRScheduler\n. You can use any of the schedulers defined in \ntorch.optim.lr_scheduler\n (see \nhere\n). In addition, we've implemented some additional schedulers in Distiller (see \nhere\n). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to \ntorch.optim.lr_scheduler\n, they can be used without changing the application code.\n\n\nlr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\n\n\n\nFinally, we define the \npolicies\n section which defines the actual scheduling. 
A \nPolicy\n manages an instance of a \nPruner\n, \nRegularizer\n, \nQuantizer\n, or \nLRScheduler\n, by naming the instance. In the example below, a \nPruningPolicy\n uses the pruner instance named \nmy_pruner\n: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. \n\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1\n\n\n\n\nThis is \niterative pruning\n:\n\n\n\n\n\n\nTrain Connectivity\n\n\n\n\n\n\nPrune Connections\n\n\n\n\n\n\nRetrain Weights\n\n\n\n\n\n\nGoto 2\n\n\n\n\n\n\nIt is described in \nLearning both Weights and Connections for Efficient Neural Networks\n:\n\n\n\n\n\"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. 
The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"\n\n\n\n\nRegularization\n\n\nYou can also define and schedule regularization.\n\n\nL1 regularization\n\n\nFormat (this is an informal specification, not a valid \nABNF\n specification):\n\n\nregularizers:\n \nREGULARIZER_NAME_STR\n:\n class: L1Regularizer\n reg_regims:\n \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n ...\n \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n my_L1_reg:\n class: L1Regularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': 0.000002\n 'module.layer3.1.conv2.weight': 0.000002\n 'module.layer3.1.conv3.weight': 0.000002\n 'module.layer3.2.conv1.weight': 0.000002\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_L1_reg\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1\n\n\n\n\nGroup regularization\n\n\nFormat (informal specification):\n\n\nFormat:\n regularizers:\n \nREGULARIZER_NAME_STR\n:\n class: L1Regularizer\n reg_regims:\n \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n my_filter_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': [0.00005, '3D']\n 'module.layer3.1.conv2.weight': [0.00005, '3D']\n 'module.layer3.1.conv3.weight': [0.00005, '3D']\n 'module.layer3.2.conv1.weight': [0.00005, '3D']\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_filter_regularizer\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1\n\n\n\n\nMixing it up\n\n\nYou can mix pruning and regularization.\n\n\nversion: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 
'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nregularizers:\n 2d_groups_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'features.module.0.weight': [0.000012, '2D']\n 'features.module.3.weight': [0.000012, '2D']\n 'features.module.6.weight': [0.000012, '2D']\n 'features.module.8.weight': [0.000012, '2D']\n 'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n # Learning rate decay scheduler\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - regularizer:\n instance_name: '2d_groups_regularizer'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 1\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1\n\n\n\n\n\nQuantization\n\n\nSimilarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the \nQuantizer\n class (see details \nhere\n).\nLet's see an example:\n\n\nquantizers:\n dorefa_quantizer:\n class: DorefaQuantizer\n bits_activations: 8\n bits_weights: 4\n bits_overrides:\n conv1:\n wts: null\n acts: null\n relu1:\n wts: null\n acts: null\n final_relu:\n wts: null\n acts: null\n fc:\n wts: null\n acts: null\n\n\n\n\n\n\nThe specific quantization method we're instantiating here is \nDorefaQuantizer\n.\n\n\nThen we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. \n\n\nThen, we define the \nbits_overrides\n mapping. In this case, we choose not to quantize the first and last layer of the model. 
In the case of \nDorefaQuantizer\n, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters \nconv1\n, the first activation layer \nrelu1\n, the last activation layer \nfinal_relu\n and the last layer with parameters \nfc\n.\n\n\nSpecifying \nnull\n means \"do not quantize\".\n\n\nNote that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.\n\n\nWe can also reference \ngroups of layers\n in the \nbits_overrides\n mapping. This is done using regular expressions. Suppose we have a sub-module in our model named \nblock1\n, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named \nconv1\n, \nconv2\n and so on. In that case we would define the following:\n\n\n\n\nbits_overrides:\n block1.conv*:\n wts: 2\n acts: null", "title": "Compression scheduling" - }, + }, { - "location": "/schedule/index.html#compression-scheduler", - "text": "In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of CompressionScheduler : it needed to be part of the training loop, and to be able to make and implement pruning, regularization and (later) quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. 
Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.", + "location": "/schedule/index.html#compression-scheduler", + "text": "In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of CompressionScheduler : it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.", "title": "Compression scheduler" - }, + }, { - "location": "/schedule/index.html#high-level-overview", - "text": "Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, LR-scheduler and Policies. Pruners and Regularizers are very similar: they implement either a Pruning algorithm or a Regularization algorithm. An LR-scheduler specifies the LR-decay algorithm. These define the what part of the schedule. The Policies define the when part of the schedule: at which epoch to start applying the Pruner/Regularizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/LR-decay it is managing. 
\nThe CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners and Regularizers from code.", + "location": "/schedule/index.html#high-level-overview", + "text": "Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies. Pruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. An LR-scheduler specifies the LR-decay algorithm. These define the what part of the schedule. The Policies define the when part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing. \nThe CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.", "title": "High level overview" - }, + }, { - "location": "/schedule/index.html#syntax-through-example", - "text": "We'll use alexnet.schedule_agp.yaml to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet. 
version: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nlr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1 There is only one version of the YAML syntax, and the version number is not verified at the moment. However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2. version: 1 In the pruners section, we define the instances of pruners we want the scheduler to instantiate and use. \nWe define a single pruner instance, named my_pruner of algorithm SensitivityPruner . We will refer to this instance in the Policies section. \nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. \nYou may list as many Pruners as you want in this section, as long as each has a unique name. You can several types of pruners in one schedule. pruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.6 Next, we want to specify the learning-rate decay scheduling in the lr_schedulers section. We assign a name to this instance: pruning_lr . As in the pruners section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. 
You can use any LR-scheduler class that torch.optim.lr_scheduler supports and pass their arguments. The keyword arguments (kwargs) are passed directly to the constructor of the subclasses of _LRScheduler , so that as new LR-schedulers are added to torch.optim.lr_scheduler , they can be used without changing the application code. lr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9 Finally, we define the policies section which defines the actual scheduling. A Policy manages an instance of a Pruner , Regularizer , or LRSchedule , by naming the instance. In the example below, a PruningPolicy uses the pruner instance named my_pruner : it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. policies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1 This is iterative pruning : Train Connectivity Prune Connections Retrain Weights Goto 2 It is described in Learning both Weights and Connections for Efficient Neural Networks : \"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. 
The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"", + "location": "/schedule/index.html#syntax-through-example", + "text": "We'll use alexnet.schedule_agp.yaml to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet. version: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nlr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1 There is only one version of the YAML syntax, and the version number is not verified at the moment. However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2. version: 1 In the pruners section, we define the instances of pruners we want the scheduler to instantiate and use. \nWe define a single pruner instance, named my_pruner , of algorithm SensitivityPruner . We will refer to this instance in the Policies section. \nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. \nYou may list as many Pruners as you want in this section, as long as each has a unique name. You can use several types of pruners in one schedule. 
pruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.6 Next, we want to specify the learning-rate decay scheduling in the lr_schedulers section. We assign a name to this instance: pruning_lr . As in the pruners section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. The LR-scheduler must be a subclass of PyTorch's _LRScheduler . You can use any of the schedulers defined in torch.optim.lr_scheduler (see here ). In addition, we've implemented some additional schedulers in Distiller (see here ). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to torch.optim.lr_scheduler , they can be used without changing the application code. lr_schedulers:\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9 Finally, we define the policies section which defines the actual scheduling. A Policy manages an instance of a Pruner , Regularizer , Quantizer , or LRScheduler , by naming the instance. In the example below, a PruningPolicy uses the pruner instance named my_pruner : it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. policies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1 This is iterative pruning : Train Connectivity Prune Connections Retrain Weights Goto 2 It is described in Learning both Weights and Connections for Efficient Neural Networks : \"Our method prunes redundant connections using a three-step method. 
First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"", "title": "Syntax through example" - }, + }, { - "location": "/schedule/index.html#regularization", - "text": "You can also define and schedule regularization.", + "location": "/schedule/index.html#regularization", + "text": "You can also define and schedule regularization.", "title": "Regularization" - }, + }, { - "location": "/schedule/index.html#l1-regularization", - "text": "Format (this is an informal specification, not a valid ABNF specification): regularizers:\n <REGULARIZER_NAME_STR>:\n class: L1Regularizer\n reg_regims:\n <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n ...\n <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n threshold_criteria: [Mean_Abs | Max] For example: version: 1\n\nregularizers:\n my_L1_reg:\n class: L1Regularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': 0.000002\n 'module.layer3.1.conv2.weight': 0.000002\n 'module.layer3.1.conv3.weight': 0.000002\n 'module.layer3.2.conv1.weight': 0.000002\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_L1_reg\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1", + "location": "/schedule/index.html#l1-regularization", + "text": "Format (this is an informal specification, not a valid ABNF specification): regularizers:\n 
REGULARIZER_NAME_STR :\n class: L1Regularizer\n reg_regims:\n PYTORCH_PARAM_NAME_STR : STRENGTH_FLOAT \n ...\n PYTORCH_PARAM_NAME_STR : STRENGTH_FLOAT \n threshold_criteria: [Mean_Abs | Max] For example: version: 1\n\nregularizers:\n my_L1_reg:\n class: L1Regularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': 0.000002\n 'module.layer3.1.conv2.weight': 0.000002\n 'module.layer3.1.conv3.weight': 0.000002\n 'module.layer3.2.conv1.weight': 0.000002\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_L1_reg\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1", "title": "L1 regularization" - }, + }, { - "location": "/schedule/index.html#group-regularization", - "text": "Format (informal specification): Format:\n regularizers:\n <REGULARIZER_NAME_STR>:\n class: L1Regularizer\n reg_regims:\n <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n threshold_criteria: [Mean_Abs | Max] For example: version: 1\n\nregularizers:\n my_filter_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': [0.00005, '3D']\n 'module.layer3.1.conv2.weight': [0.00005, '3D']\n 'module.layer3.1.conv3.weight': [0.00005, '3D']\n 'module.layer3.2.conv1.weight': [0.00005, '3D']\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_filter_regularizer\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1", + "location": "/schedule/index.html#group-regularization", + "text": "Format (informal specification): Format:\n regularizers:\n REGULARIZER_NAME_STR :\n class: L1Regularizer\n reg_regims:\n PYTORCH_PARAM_NAME_STR : [ STRENGTH_FLOAT , '2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows' ]\n PYTORCH_PARAM_NAME_STR : [ STRENGTH_FLOAT , '2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows' ]\n threshold_criteria: [Mean_Abs | Max] For example: version: 
1\n\nregularizers:\n my_filter_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'module.layer3.1.conv1.weight': [0.00005, '3D']\n 'module.layer3.1.conv2.weight': [0.00005, '3D']\n 'module.layer3.1.conv3.weight': [0.00005, '3D']\n 'module.layer3.2.conv1.weight': [0.00005, '3D']\n threshold_criteria: Mean_Abs\n\npolicies:\n - regularizer:\n instance_name: my_filter_regularizer\n starting_epoch: 0\n ending_epoch: 60\n frequency: 1", "title": "Group regularization" - }, + }, { - "location": "/schedule/index.html#mixing-it-up", - "text": "You can mix pruning and regularization. version: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nregularizers:\n 2d_groups_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'features.module.0.weight': [0.000012, '2D']\n 'features.module.3.weight': [0.000012, '2D']\n 'features.module.6.weight': [0.000012, '2D']\n 'features.module.8.weight': [0.000012, '2D']\n 'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n # Learning rate decay scheduler\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - regularizer:\n instance_name: '2d_groups_regularizer'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 1\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1", + "location": "/schedule/index.html#mixing-it-up", + "text": "You can mix pruning and regularization. 
version: 1\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\nregularizers:\n 2d_groups_regularizer:\n class: GroupLassoRegularizer\n reg_regims:\n 'features.module.0.weight': [0.000012, '2D']\n 'features.module.3.weight': [0.000012, '2D']\n 'features.module.6.weight': [0.000012, '2D']\n 'features.module.8.weight': [0.000012, '2D']\n 'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n # Learning rate decay scheduler\n pruning_lr:\n class: ExponentialLR\n gamma: 0.9\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n - regularizer:\n instance_name: '2d_groups_regularizer'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 1\n\n - lr_scheduler:\n instance_name: pruning_lr\n starting_epoch: 24\n ending_epoch: 200\n frequency: 1", "title": "Mixing it up" - }, + }, + { + "location": "/schedule/index.html#quantization", + "text": "Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the Quantizer class (see details here ).\nLet's see an example: quantizers:\n dorefa_quantizer:\n class: DorefaQuantizer\n bits_activations: 8\n bits_weights: 4\n bits_overrides:\n conv1:\n wts: null\n acts: null\n relu1:\n wts: null\n acts: null\n final_relu:\n wts: null\n acts: null\n fc:\n wts: null\n acts: null The specific quantization method we're instantiating here is DorefaQuantizer . Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. Then, we define the bits_overrides mapping. In this case, we choose not to quantize the first and last layer of the model. 
In the case of DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters conv1 , the first activation layer relu1 , the last activation layer final_relu and the last layer with parameters fc . Specifying null means \"do not quantize\". Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers. We can also reference groups of layers in the bits_overrides mapping. This is done using regular expressions. Suppose we have a sub-module in our model named block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named conv1 , conv2 and so on. In that case we would define the following: bits_overrides:\n block1.conv*:\n wts: 2\n acts: null", + "title": "Quantization" + }, { - "location": "/pruning/index.html", - "text": "Pruning\n\n\nA common methodology for inducing sparsity in weights and activations is called \npruning\n. Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero. Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process.\n\n\nWe can prune weights, biases, and activations. Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them. We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)). 
Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.\n\n\n\nLet's define sparsity\n\n\nSparsity is a a measure of how many elements in a tensor are exact zeros, relative to the tensor size. A tensor is considered sparse if \"most\" of its elements are zero. How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-)\n\nThe \n\\(l_0\\)-\"norm\" function\n measures how many zero-elements are in a tensor \nx\n:\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\). Anything but an exact zero contributes a value of 1 - that's pretty cool.\n\nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use \ndistiller.sparsity\n and \ndistiller.density\n to query a PyTorch tensor's sparsity and density.\n\n\nWhat is weights pruning?\n\n\nWeights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights. In general, the term 'parameters' refers to both weights and bias tensors of a model. Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble.\n\n\nPruning requires a criteria for choosing which elements to prune - this is called the \npruning criteria\n. The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) . This is implemented by the \ndistiller.MagnitudeParameterPruner\n class. 
The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed.\n\n\nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features. Therefore, some of these redundancies can be removed by setting their weights to zero.\n\n\nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model). Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to use find these sparse solutions. \n\n\n\n\nPruning schedule\n\n\nThe most straight-forward to prune is to take a trained model and prune it once; also called \none-shot pruning\n. In \nLearning both Weights and Connections for Efficient Neural Networks\n Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped. The surprise is what they call the \"free lunch\" effect: \n\"reducing 2x the connections without losing accuracy even without retraining.\"\n\nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss). This is called \niterative pruning\n, and the retraining that follows pruning is often referred to as \nfine-tuning\n. 
How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the \npruning schedule\n.\n\n\nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights. At each iteration, we prune more weights.\n\nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm. For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level. And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved.\n\n\nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).\n\n\nPruning granularity\n\n\nPruning individual weight elements is called \nelement-wise pruning\n, and it is also sometimes referred to as \nfine-grained\n pruning.\n\n\nCoarse-grained pruning\n - also referred to as \nstructured pruning\n, \ngroup pruning\n, or \nblock pruning\n - is pruning entire groups of elements which have some significance. Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.\n\n\nSensitivity analysis\n\n\nThe hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors. Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning. \n\nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score. 
We do this for all of the parameterized layers, and for each layer we examine several sparsity levels. This should teach us about the \"sensitivity\" of each of the layers to pruning.\n\n\nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor.\n\n\nMuch as we can prune structures, we can also perform sensitivity analysis on structures. Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters.\n\n\n\nThe authors of \nPruning Filters for Efficient ConvNets\n describe how they do sensitivity analysis:\n\n\n\n\n\"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. 
For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\"\n\n\n\n\nThe diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distillers's \nperform_sensitivity_analysis\n utility function.\n\n\nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are. The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.\n\n\n\n\nReferences\n\n\n \nSong Han, Jeff Pool, John Tran, William J. Dally\n.\n \nLearning both Weights and Connections for Efficient Neural Networks\n,\n arXiv:1607.04381v2,\n 2015.\n\n\n\n\n\nHao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf\n.\n \nPruning Filters for Efficient ConvNets\n,\n arXiv:1608.08710v3,\n 2017.", + "location": "/pruning/index.html", + "text": "Pruning\n\n\nA common methodology for inducing sparsity in weights and activations is called \npruning\n. Pruning is the application of a binary criterion to decide which weights to prune: weights which match the pruning criterion are assigned a value of zero. Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process.\n\n\nWe can prune weights, biases, and activations. Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them. We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)). Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.\n\n\n\nLet's define sparsity\n\n\nSparsity is a measure of how many elements in a tensor are exact zeros, relative to the tensor size. 
A tensor is considered sparse if \"most\" of its elements are zero. How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-)\n\nThe \n\\(l_0\\)-\"norm\" function\n measures how many non-zero elements are in a tensor \nx\n:\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\). Anything but an exact zero contributes a value of 1 - that's pretty cool.\n\nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use \ndistiller.sparsity\n and \ndistiller.density\n to query a PyTorch tensor's sparsity and density.\n\n\nWhat is weights pruning?\n\n\nWeights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights. In general, the term 'parameters' refers to both weights and bias tensors of a model. Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble.\n\n\nPruning requires a criteria for choosing which elements to prune - this is called the \npruning criteria\n. The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) . This is implemented by the \ndistiller.MagnitudeParameterPruner\n class. The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed.\n\n\nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features. 
Therefore, some of these redundancies can be removed by setting their weights to zero.\n\n\nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model). Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to find these sparse solutions. \n\n\n\n\nPruning schedule\n\n\nThe most straightforward way to prune is to take a trained model and prune it once; also called \none-shot pruning\n. In \nLearning both Weights and Connections for Efficient Neural Networks\n Song Han et al. show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped. The surprise is what they call the \"free lunch\" effect: \n\"reducing 2x the connections without losing accuracy even without retraining.\"\n\nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss). This is called \niterative pruning\n, and the retraining that follows pruning is often referred to as \nfine-tuning\n. How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the \npruning schedule\n.\n\n\nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights. At each iteration, we prune more weights.\n\nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm. 
For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level. And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved.\n\n\nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).\n\n\nPruning granularity\n\n\nPruning individual weight elements is called \nelement-wise pruning\n, and it is also sometimes referred to as \nfine-grained\n pruning.\n\n\nCoarse-grained pruning\n - also referred to as \nstructured pruning\n, \ngroup pruning\n, or \nblock pruning\n - is pruning entire groups of elements which have some significance. Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.\n\n\nSensitivity analysis\n\n\nThe hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors. Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning. \n\nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score. We do this for all of the parameterized layers, and for each layer we examine several sparsity levels. This should teach us about the \"sensitivity\" of each of the layers to pruning.\n\n\nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor.\n\n\nMuch as we can prune structures, we can also perform sensitivity analysis on structures. 
Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters.\n\n\n\nThe authors of \nPruning Filters for Efficient ConvNets\n describe how they do sensitivity analysis:\n\n\n\n\n\"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\"\n\n\n\n\nThe diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distiller's \nperform_sensitivity_analysis\n utility function.\n\n\nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are. The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.\n\n\n\n\nReferences\n\n\n \nSong Han, Jeff Pool, John Tran, William J. 
Dally\n.\n \nLearning both Weights and Connections for Efficient Neural Networks\n,\n arXiv:1607.04381v2,\n 2015.\n\n\n\n\n\nHao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf\n.\n \nPruning Filters for Efficient ConvNets\n,\n arXiv:1608.08710v3,\n 2017.", "title": "Pruning" - }, + }, { - "location": "/pruning/index.html#pruning", - "text": "A common methodology for inducing sparsity in weights and activations is called pruning . Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero. Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process. We can prune weights, biases, and activations. Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them. We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)). Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.", + "location": "/pruning/index.html#pruning", + "text": "A common methodology for inducing sparsity in weights and activations is called pruning . Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero. Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process. We can prune weights, biases, and activations. Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them. We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)). 
Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.", "title": "Pruning" - }, + }, { - "location": "/pruning/index.html#lets-define-sparsity", - "text": "Sparsity is a measure of how many elements in a tensor are exact zeros, relative to the tensor size. A tensor is considered sparse if \"most\" of its elements are zero. How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-) \nThe \\(l_0\\)-\"norm\" function measures how many non-zero elements are in a tensor x :\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\). Anything but an exact zero contributes a value of 1 - that's pretty cool. \nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use distiller.sparsity and distiller.density to query a PyTorch tensor's sparsity and density.", + "location": "/pruning/index.html#lets-define-sparsity", + "text": "Sparsity is a measure of how many elements in a tensor are exact zeros, relative to the tensor size. A tensor is considered sparse if \"most\" of its elements are zero. How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-) \nThe \\(l_0\\)-\"norm\" function measures how many non-zero elements are in a tensor x :\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\). Anything but an exact zero contributes a value of 1 - that's pretty cool. 
\nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use distiller.sparsity and distiller.density to query a PyTorch tensor's sparsity and density.", "title": "Let's define sparsity" - }, + }, { - "location": "/pruning/index.html#what-is-weights-pruning", - "text": "Weights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights. In general, the term 'parameters' refers to both weights and bias tensors of a model. Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble. \nPruning requires a criteria for choosing which elements to prune - this is called the pruning criteria . The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) . This is implemented by the distiller.MagnitudeParameterPruner class. The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed. \nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features. Therefore, some of these redundancies can be removed by setting their weights to zero. \nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model). 
Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to find these sparse solutions.", + "location": "/pruning/index.html#what-is-weights-pruning", + "text": "Weights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights. In general, the term 'parameters' refers to both weights and bias tensors of a model. Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble. \nPruning requires a criteria for choosing which elements to prune - this is called the pruning criteria . The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) . This is implemented by the distiller.MagnitudeParameterPruner class. The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed. \nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features. Therefore, some of these redundancies can be removed by setting their weights to zero. \nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model). Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to find these sparse solutions.", "title": "What is weights pruning?" 
- }, + }, { - "location": "/pruning/index.html#pruning-schedule", - "text": "The most straight-forward to prune is to take a trained model and prune it once; also called one-shot pruning . In Learning both Weights and Connections for Efficient Neural Networks Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped. The surprise is what they call the \"free lunch\" effect: \"reducing 2x the connections without losing accuracy even without retraining.\" \nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss). This is called iterative pruning , and the retraining that follows pruning is often referred to as fine-tuning . How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the pruning schedule . \nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights. At each iteration, we prune more weights. \nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm. For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level. And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved. \nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).", + "location": "/pruning/index.html#pruning-schedule", + "text": "The most straight-forward to prune is to take a trained model and prune it once; also called one-shot pruning . 
In Learning both Weights and Connections for Efficient Neural Networks Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped. The surprise is what they call the \"free lunch\" effect: \"reducing 2x the connections without losing accuracy even without retraining.\" \nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss). This is called iterative pruning , and the retraining that follows pruning is often referred to as fine-tuning . How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the pruning schedule . \nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights. At each iteration, we prune more weights. \nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm. For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level. And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved. \nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).", "title": "Pruning schedule" - }, + }, { - "location": "/pruning/index.html#pruning-granularity", - "text": "Pruning individual weight elements is called element-wise pruning , and it is also sometimes referred to as fine-grained pruning. 
Coarse-grained pruning - also referred to as structured pruning , group pruning , or block pruning - is pruning entire groups of elements which have some significance. Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.", + "location": "/pruning/index.html#pruning-granularity", + "text": "Pruning individual weight elements is called element-wise pruning , and it is also sometimes referred to as fine-grained pruning. Coarse-grained pruning - also referred to as structured pruning , group pruning , or block pruning - is pruning entire groups of elements which have some significance. Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.", "title": "Pruning granularity" - }, + }, { - "location": "/pruning/index.html#sensitivity-analysis", - "text": "The hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors. Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning. \nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score. We do this for all of the parameterized layers, and for each layer we examine several sparsity levels. This should teach us about the \"sensitivity\" of each of the layers to pruning. \nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor. \nMuch as we can prune structures, we can also perform sensitivity analysis on structures. 
Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters. The authors of Pruning Filters for Efficient ConvNets describe how they do sensitivity analysis: \"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\" The diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distiller's perform_sensitivity_analysis utility function. \nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are. The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.", + "location": "/pruning/index.html#sensitivity-analysis", + "text": "The hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors. Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning. 
\nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score. We do this for all of the parameterized layers, and for each layer we examine several sparsity levels. This should teach us about the \"sensitivity\" of each of the layers to pruning. \nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor. \nMuch as we can prune structures, we can also perform sensitivity analysis on structures. Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters. The authors of Pruning Filters for Efficient ConvNets describe how they do sensitivity analysis: \"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. 
For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\" The diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distiller's perform_sensitivity_analysis utility function. \nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are. The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.", "title": "Sensitivity analysis" - }, + }, { - "location": "/pruning/index.html#references", - "text": "Song Han, Jeff Pool, John Tran, William J. Dally .\n Learning both Weights and Connections for Efficient Neural Networks ,\n arXiv:1607.04381v2,\n 2015. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf .\n Pruning Filters for Efficient ConvNets ,\n arXiv:1608.08710v3,\n 2017.", + "location": "/pruning/index.html#references", + "text": "Song Han, Jeff Pool, John Tran, William J. Dally .\n Learning both Weights and Connections for Efficient Neural Networks ,\n arXiv:1607.04381v2,\n 2015. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf .\n Pruning Filters for Efficient ConvNets ,\n arXiv:1608.08710v3,\n 2017.", "title": "References" - }, + }, { - "location": "/regularization/index.html", - "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. 
reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error. In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation above uses 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm are explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. 
Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. 
The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers. 
Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n \nDeep Learning\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n arXiv:1705.08922v3,\n 2017.\n\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n \nStructured pruning of deep convolutional neural networks\n,\n arXiv:1512.08571,\n 2015", + "location": "/regularization/index.html", + "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. 
the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error. In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; the notation above uses 1-based numbering for simplicity.\n\n\nThe qualitative differences between the \\(l_2\\)-norm and the squared \\(l_2\\)-norm are explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity. 
To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. 
It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n \nDeep Learning\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. 
Dally\n.\n \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n arXiv:1705.08922v3,\n 2017.\n\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n \nStructured pruning of deep convolutional neural networks\n,\n arXiv:1512.08571,\n 2015", "title": "Regularization" - }, + }, { - "location": "/regularization/index.html#regularization", - "text": "In their book Deep Learning Ian Goodfellow et al. define regularization as \"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\" PyTorch's optimizers use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance). In general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or criterion in the Distiller sample image classifier compression application). optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step() \\(\\lambda_R\\) is a scalar called the regularization strength , and it balances the data error and the regularization error. In PyTorch, this is the weight_decay argument. 
\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a magnitude , or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\] \\(L\\) is the number of layers in the network; the notation above uses 1-based numbering for simplicity. The qualitative differences between the \\(l_2\\)-norm and the squared \\(l_2\\)-norm are explained in Deep Learning .", + "location": "/regularization/index.html#regularization", + "text": "In their book Deep Learning Ian Goodfellow et al. define regularization as \"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\" PyTorch's optimizers use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance). In general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or criterion in the Distiller sample image classifier compression application). optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step() \\(\\lambda_R\\) is a scalar called the regularization strength , and it balances the data error and the regularization error. In PyTorch, this is the weight_decay argument. 
\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a magnitude , or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\] \\(L\\) is the number of layers in the network; the notation above uses 1-based numbering for simplicity. The qualitative differences between the \\(l_2\\)-norm and the squared \\(l_2\\)-norm are explained in Deep Learning .", "title": "Regularization" - }, + }, { - "location": "/regularization/index.html#sparsity-and-regularization", - "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. 
If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()", + "location": "/regularization/index.html#sparsity-and-regularization", + "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. 
If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()", "title": "Sparsity and Regularization" - }, + }, { - "location": "/regularization/index.html#group-regularization", - "text": "In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\] Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\). \\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\). \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups). 
\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed. Huizi-et-al-2017 provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even intra kernel strided sparsity can also be used. distiller.GroupLassoRegularizer currently implements most of these groups, and you can easily add new groups.", + "location": "/regularization/index.html#group-regularization", + "text": "In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\] Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\). \\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\). \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups). 
\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed. Huizi-et-al-2017 provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even intra kernel strided sparsity can also be used. distiller.GroupLassoRegularizer currently implements most of these groups, and you can easily add new groups.", "title": "Group Regularization" - }, + }, { - "location": "/regularization/index.html#references", - "text": "Ian Goodfellow and Yoshua Bengio and Aaron Courville .\n Deep Learning ,\n arXiv:1607.04381v2,\n 2017. Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally .\n DSD: Dense-Sparse-Dense Training for Deep Neural Networks ,\n arXiv:1607.04381v2,\n 2017. Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally .\n Exploring the Regularity of Sparse Structure in Convolutional Neural Networks ,\n arXiv:1705.08922v3,\n 2017. Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung .\n Structured pruning of deep convolutional neural networks ,\n arXiv:1512.08571,\n 2015", + "location": "/regularization/index.html#references", + "text": "Ian Goodfellow and Yoshua Bengio and Aaron Courville .\n Deep Learning ,\n arXiv:1607.04381v2,\n 2017. Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally .\n DSD: Dense-Sparse-Dense Training for Deep Neural Networks ,\n arXiv:1607.04381v2,\n 2017. 
Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally .\n Exploring the Regularity of Sparse Structure in Convolutional Neural Networks ,\n arXiv:1705.08922v3,\n 2017. Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung .\n Structured pruning of deep convolutional neural networks ,\n arXiv:1512.08571,\n 2015", "title": "References" - }, + }, { - "location": "/quantization/index.html", - "text": "Quantization\n\n\nQuantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress.\n\n\nNote that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.\n\n\nMotivation: Overall Efficiency\n\n\nThe more obvious benefit from quantization is \nsignificantly reduced bandwidth and storage\n. For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32.\n\nAdditionally integer compute is \nfaster\n than floating point compute. It is also much more \narea and energy efficient\n: \n\n\n\n\n\n\n\n\nINT8 Operation\n\n\nEnergy Saving vs FP32\n\n\nArea Saving vs FP32\n\n\n\n\n\n\n\n\n\n\nAdd\n\n\n30x\n\n\n116x\n\n\n\n\n\n\nMultiply\n\n\n18.5x\n\n\n27x\n\n\n\n\n\n\n\n\n(\nDally, 2015\n)\n\n\nNote that very aggressive quantization can yield even more efficiency. 
If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations (\nRastegari et al., 2016\n).\n\n\nInteger vs. FP32\n\n\nThere are two main attributes when discussing a numerical format. The first is \ndynamic range\n, which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the \nprecision / resolution\n of the format (the distance between two numbers).\n\nFor all integer formats, the dynamic range is \n[-2^{n-1} .. 2^{n-1}-1]\n, where \nn\n is the number of bits. So for INT8 the range is \n[-128 .. 127]\n, and for INT4 it is \n[-8 .. 7]\n (we're limiting ourselves to signed integers for now). The number of representable values is \n2^n\n.\nContrast that with FP32, where the dynamic range is \n\\pm 3.4\\ x\\ 10^{38}\n, and approximately \n4.2\\ x\\ 10^9\n values can be represented.\n\nWe can immediately see that FP32 is much more \nversatile\n, in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model.\n\nIn order to be able to represent these different distributions with an integer format, a \nscale factor\n is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution.\n\nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. 
\nCourbariaux et al., 2014\n scale using only shifts, eliminating the floating point operation. In \nGEMMLOWP\n, the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.\n\n\nAvoiding Overflows\n\n\nConvolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we used the same bit-width for the weights and activations, \nand\n for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths.\n\nThe result of multiplying two \nn\n-bit integers is, at most, a \n2n\n-bit number. In convolution layers, such multiplications are accumulated \nc\\cdot k^2\n times, where \nc\n is the number of input channels and \nk\n is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be \n2n + M\n-bits wide, where M is at least \nlog_2(c\\cdot k^2)\n. In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use fewer than 32 bits, depending on the expected use cases and layer widths.\n\n\n\"Conservative\" Quantization: INT8\n\n\nIn many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy (\nGysel et al., 2018\n).\n\nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. 
For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\".\n\n\n\n\nOffline\n means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.\n\n\nOnline\n means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.\n\n\n\n\nIt is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible (\nMigacz, 2017\n) \n\n\nAnother possible optimization point is \nscale-factor scope\n. The most common way is use a single scale-factor per-layer\n\n\n\"Aggressive\" Quantization: INT4 and Lower\n\n\nNaively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:\n\n\n\n\nTraining / Re-Training\n: For INT4 and lower, training is required in order to obtain reasonable accuracy. This means training with quantization of weights and activations \"baked\" into the training procedure. 
This is not straight forward, since quantization operations are usually not differentiable. This is usually worked-around by using \"straight-through estimator\" (\nBengio, 2013\n) to approximate the gradient of these operations.\n\n\nZhou S et al., 2016\n have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods \nrequire\n a trained FP32 model, either as a starting point (\nZhou A et al., 2017\n), or as a teacher network in a student-teacher training setup (\nMishra and Marr, 2018\n).\n\n\nReplacing the activation function\n: The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used (\nZhou S et al., 2016\n, \nMishra et al., 2018\n). Another method learns the clipping value per layer, with better results (\nChoi et al., 2018\n). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).\n\n\nModifying network structure\n: \nMishra et al., 2018\n try to compensate for the loss of information due to quantization by using wider layers (more channels). \nLin et al., 2017\n proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall.\n\n\nFirst and last layer\n: Many methods do not quantize the first and last layer of the model. 
It has been observed by \nHan et al., 2015\n that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically (\nZhou S et al., 2016\n, \nChoi et al., 2018\n). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them (\nRastegari et al., 2016\n). Most methods keep the first and last layers at FP32. However, \nChoi et al., 2018\n showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy.\n\n\nIterative quantization\n: Most methods quantize the entire model at once. \nZhou A et al., 2017\n employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.\n\n\nMixed Weights and Activations Precision\n: It has been observed that activations are more sensitive to quantization than weights (\nZhou S et al., 2016\n). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 (\nLi et al., 2016\n, \nZhu et al., 2016\n).\n\n\n\n\nReferences\n\n\n\n\nWilliam Dally\n. High-Performance Hardware for Machine Learning. \nTutorial, NIPS, 2015\n\n\n\n\n\nMohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi\n. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. \nECCV, 2016\n\n\n\n\n\nMatthieu Courbariaux, Yoshua Bengio and Jean-Pierre David\n. Training deep neural networks with low precision multiplications. \narxiv:1412.7024\n\n\n\n\n\nPhilipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi\n. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. 
\nIEEE Transactions on Neural Networks and Learning Systems, 2018\n\n\n\n\n\nSzymon Migacz\n. 8-bit Inference with TensorRT. \nGTC San Jose, 2017\n\n\n\n\n\nYoshua Bengio, Nicholas Leonard and Aaron Courville\n. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. \narxiv:1308.3432, 2013\n\n\n\n\n\nShuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou\n. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. \narxiv:1606.06160\n\n\n\n\n\nAojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen\n. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. \nICLR, 2017\n\n\n\n\n\nAsit Mishra and Debbie Marr\n. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. \nICLR, 2018\n\n\n\n\n\nAsit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr\n. WRPN: Wide Reduced-Precision Networks. \nICLR, 2018\n\n\n\n\n\nJungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan\n. PACT: Parameterized Clipping Activation for Quantized Neural Networks. \n2018\n\n\n\n\n\nXiaofan Lin, Cong Zhao and Wei Pan\n. Towards Accurate Binary Convolutional Neural Network. \nNIPS, 2017\n\n\n\n\n\nSong Han, Jeff Pool, John Tran and William Dally\n. Learning both Weights and Connections for Efficient Neural Network. \nNIPS, 2015\n\n\n\n\n\nFengfu Li, Bo Zhang and Bin Liu\n. Ternary Weight Networks. \narxiv:1605.04711\n\n\n\n\n\nChenzhuo Zhu, Song Han, Huizi Mao and William J. Dally\n. Trained Ternary Quantization. \narxiv:1612.01064", + "location": "/quantization/index.html", + "text": "Quantization\n\n\nQuantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. 
However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress.\n\n\nNote that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.\n\n\nMotivation: Overall Efficiency\n\n\nThe more obvious benefit from quantization is \nsignificantly reduced bandwidth and storage\n. For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32.\n\nAdditionally integer compute is \nfaster\n than floating point compute. It is also much more \narea and energy efficient\n: \n\n\n\n\n\n\n\n\nINT8 Operation\n\n\nEnergy Saving vs FP32\n\n\nArea Saving vs FP32\n\n\n\n\n\n\n\n\n\n\nAdd\n\n\n30x\n\n\n116x\n\n\n\n\n\n\nMultiply\n\n\n18.5x\n\n\n27x\n\n\n\n\n\n\n\n\n(\nDally, 2015\n)\n\n\nNote that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations (\nRastegari et al., 2016\n).\n\n\nInteger vs. FP32\n\n\nThere are two main attributes when discussing a numerical format. The first is \ndynamic range\n, which refers to the range of representable numbers. 
The second one is how many values can be represented within the dynamic range, which in turn determines the \nprecision / resolution\n of the format (the distance between two numbers).\n\nFor all integer formats, the dynamic range is \n[-2^{n-1} .. 2^{n-1}-1]\n, where \nn\n is the number of bits. So for INT8 the range is \n[-128 .. 127]\n, and for INT4 it is \n[-8 .. 7]\n (we're limiting ourselves to signed integers for now). The number of representable values is \n2^n\n.\nContrast that with FP32, where the dynamic range is \n\\pm 3.4\\ x\\ 10^{38}\n, and approximately \n4.2\\ x\\ 10^9\n values can be represented.\n\nWe can immediately see that FP32 is much more \nversatile\n, in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model.\n\nIn order to be able to represent these different distributions with an integer format, a \nscale factor\n is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution.\n\nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. \nCourbariaux et al., 2014\n scale using only shifts, eliminating the floating point operation. In \nGEMMLOWP\n, the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.\n\n\nAvoiding Overflows\n\n\nConvolution and fully connected layers involve the storing of intermediate results in accumulators. 
Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation, \nand\n for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths.\n\nThe result of multiplying two \nn\n-bit integers is, at most, a \n2n\n-bit number. In convolution layers, such multiplications are accumulated \nc\\cdot k^2\n times, where \nc\n is the number of input channels and \nk\n is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be \n2n + M\n-bits wide, where M is at least \nlog_2(c\\cdot k^2)\n. In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.\n\n\n\"Conservative\" Quantization: INT8\n\n\nIn many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy (\nGysel et al., 2018\n).\n\nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\".\n\n\n\n\nOffline\n means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scale factors are calculated and are fixed once the model is deployed. 
This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.\n\n\nOnline\n means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.\n\n\n\n\nIt is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible (\nMigacz, 2017\n). \n\n\nAnother possible optimization point is \nscale-factor scope\n. The most common way is to use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels.\n\n\n\"Aggressive\" Quantization: INT4 and Lower\n\n\nNaively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:\n\n\n\n\nTraining / Re-Training\n: For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the \nnext section\n.\n\n\nZhou S et al., 2016\n have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. 
Other methods \nrequire\n a trained FP32 model, either as a starting point (\nZhou A et al., 2017\n), or as a teacher network in a knowledge distillation training setup (\nMishra and Marr, 2018\n).\n\n\nReplacing the activation function\n: The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used (\nZhou S et al., 2016\n, \nMishra et al., 2018\n). Another method learns the clipping value per layer, with better results (\nChoi et al., 2018\n). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).\n\n\nModifying network structure\n: \nMishra et al., 2018\n try to compensate for the loss of information due to quantization by using wider layers (more channels). \nLin et al., 2017\n proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall.\n\n\nFirst and last layer\n: Many methods do not quantize the first and last layer of the model. It has been observed by \nHan et al., 2015\n that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically (\nZhou S et al., 2016\n, \nChoi et al., 2018\n). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them (\nRastegari et al., 2016\n). Most methods keep the first and last layers at FP32. 
However, \nChoi et al., 2018\n showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy.\n\n\nIterative quantization\n: Most methods quantize the entire model at once. \nZhou A et al., 2017\n employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.\n\n\nMixed Weights and Activations Precision\n: It has been observed that activations are more sensitive to quantization than weights (\nZhou S et al., 2016\n). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 (\nLi et al., 2016\n, \nZhu et al., 2016\n).\n\n\n\n\nTraining with Quantization\n\n\nAs mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this:\n\n\n\n\nA full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference.\n\nIn the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. 
During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\". \n\n\nStraight-Through Estimator\n\n\nAn important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) (\nHinton et al., 2012\n, \nBengio, 2013\n), which simply passes the gradient through these functions as-is. \n\n\nReferences\n\n\n\n\nWilliam Dally\n. High-Performance Hardware for Machine Learning. \nTutorial, NIPS, 2015\n\n\n\n\n\nMohammad Rastegari, Vicente Ordonez, Joseph Redmon and Ali Farhadi\n. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. \nECCV, 2016\n\n\n\n\n\nMatthieu Courbariaux, Yoshua Bengio and Jean-Pierre David\n. Training deep neural networks with low precision multiplications. \narxiv:1412.7024\n\n\n\n\n\nPhilipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi\n. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. \nIEEE Transactions on Neural Networks and Learning Systems, 2018\n\n\n\n\n\nSzymon Migacz\n. 8-bit Inference with TensorRT. \nGTC San Jose, 2017\n\n\n\n\n\nShuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou\n. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. \narxiv:1606.06160\n\n\n\n\n\nAojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen\n. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. \nICLR, 2017\n\n\n\n\n\nAsit Mishra and Debbie Marr\n. 
Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. \nICLR, 2018\n\n\n\n\n\nAsit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr\n. WRPN: Wide Reduced-Precision Networks. \nICLR, 2018\n\n\n\n\n\nJungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan\n. PACT: Parameterized Clipping Activation for Quantized Neural Networks. \n2018\n\n\n\n\n\nXiaofan Lin, Cong Zhao and Wei Pan\n. Towards Accurate Binary Convolutional Neural Network. \nNIPS, 2017\n\n\n\n\n\nSong Han, Jeff Pool, John Tran and William Dally\n. Learning both Weights and Connections for Efficient Neural Network. \nNIPS, 2015\n\n\n\n\n\nFengfu Li, Bo Zhang and Bin Liu\n. Ternary Weight Networks. \narxiv:1605.04711\n\n\n\n\n\nChenzhuo Zhu, Song Han, Huizi Mao and William J. Dally\n. Trained Ternary Quantization. \narxiv:1612.01064\n\n\n\n\n\nYoshua Bengio, Nicholas Leonard and Aaron Courville\n. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. \narxiv:1308.3432, 2013\n\n\n\n\n\nGeoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed\n. Neural Networks for Machine Learning. \nCoursera, video lectures, 2012", "title": "Quantization" - }, + }, { - "location": "/quantization/index.html#quantization", - "text": "Quantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. 
The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress. Note that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.", + "location": "/quantization/index.html#quantization", + "text": "Quantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress. Note that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.", "title": "Quantization" - }, + }, { - "location": "/quantization/index.html#motivation-overall-efficiency", - "text": "The more obvious benefit from quantization is significantly reduced bandwidth and storage . For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32. \nAdditionally integer compute is faster than floating point compute. It is also much more area and energy efficient : INT8 Operation Energy Saving vs FP32 Area Saving vs FP32 Add 30x 116x Multiply 18.5x 27x ( Dally, 2015 ) Note that very aggressive quantization can yield even more efficiency. 
If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations ( Rastegari et al., 2016 ).", + "location": "/quantization/index.html#motivation-overall-efficiency", + "text": "The more obvious benefit from quantization is significantly reduced bandwidth and storage . For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32. \nAdditionally integer compute is faster than floating point compute. It is also much more area and energy efficient : INT8 Operation Energy Saving vs FP32 Area Saving vs FP32 Add 30x 116x Multiply 18.5x 27x ( Dally, 2015 ) Note that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations ( Rastegari et al., 2016 ).", "title": "Motivation: Overall Efficiency" - }, + }, { - "location": "/quantization/index.html#integer-vs-fp32", - "text": "There are two main attributes when discussing a numerical format. The first is dynamic range , which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the precision / resolution of the format (the distance between two numbers). \nFor all integer formats, the dynamic range is [-2^{n-1} .. 2^{n-1}-1] , where n is the number of bits. So for INT8 the range is [-128 .. 127] , and for INT4 it is [-16 .. 15] (we're limiting ourselves to signed integers for now). 
The number of representable values is 2^n .\nContrast that with FP32, where the dynamic range is \\pm 3.4\\ x\\ 10^{38} , and approximately 4.2\\ x\\ 10^9 values can be represented. \nWe can immediately see that FP32 is much more versatile , in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model. \nIn order to be able to represent these different distributions with an integer format, a scale factor is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution. \nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. Courbariaux et al., 2014 scale using only shifts, eliminating the floating point operation. In GEMMLWOP , the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.", + "location": "/quantization/index.html#integer-vs-fp32", + "text": "There are two main attributes when discussing a numerical format. The first is dynamic range , which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the precision / resolution of the format (the distance between two numbers). \nFor all integer formats, the dynamic range is [-2^{n-1} .. 2^{n-1}-1] , where n is the number of bits. So for INT8 the range is [-128 .. 127] , and for INT4 it is [-8 .. 7] (we're limiting ourselves to signed integers for now). 
The number of representable values is 2^n .\nContrast that with FP32, where the dynamic range is \\pm 3.4\\ x\\ 10^{38} , and approximately 4.2\\ x\\ 10^9 values can be represented. \nWe can immediately see that FP32 is much more versatile , in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model. \nIn order to be able to represent these different distributions with an integer format, a scale factor is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution. \nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. Courbariaux et al., 2014 scale using only shifts, eliminating the floating point operation. In GEMMLOWP , the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.", "title": "Integer vs. FP32" - }, + }, { - "location": "/quantization/index.html#avoiding-overflows", - "text": "Convolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation, and for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths. \nThe result of multiplying two n -bit integers is, at most, a 2n -bit number. 
In convolution layers, such multiplications are accumulated c\\cdot k^2 times, where c is the number of input channels and k is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be 2n + M -bits wide, where M is at least log_2(c\\cdot k^2) . In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.", + "location": "/quantization/index.html#avoiding-overflows", + "text": "Convolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation, and for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths. \nThe result of multiplying two n -bit integers is, at most, a 2n -bit number. In convolution layers, such multiplications are accumulated c\\cdot k^2 times, where c is the number of input channels and k is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be 2n + M -bits wide, where M is at least log_2(c\\cdot k^2) . In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.", "title": "Avoiding Overflows" - }, + }, { - "location": "/quantization/index.html#conservative-quantization-int8", - "text": "In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy ( Gysel at al., 2018 ). \nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. 
This scale factor needs to be calculated per-layer per-tensor (. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\". Offline means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation. Online means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive. It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ( Migacz, 2017 ) Another possible optimization point is scale-factor scope . The most common way is use a single scale-factor per-layer", + "location": "/quantization/index.html#conservative-quantization-int8", + "text": "In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). 
Some fine-tuning can further improve the accuracy ( Gysel et al., 2018 ). \nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\". Offline means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scale factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation. Online means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive. It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. Statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ( Migacz, 2017 ). Another possible optimization point is scale-factor scope . The most common way is to use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. 
This can be beneficial if the weight distributions vary greatly between channels.", "title": "\"Conservative\" Quantization: INT8" - }, + }, { - "location": "/quantization/index.html#aggressive-quantization-int4-and-lower", - "text": "Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy: Training / Re-Training : For INT4 and lower, training is required in order to obtain reasonable accuracy. This means training with quantization of weights and activations \"baked\" into the training procedure. This is not straight forward, since quantization operations are usually not differentiable. This is usually worked-around by using \"straight-through estimator\" ( Bengio, 2013 ) to approximate the gradient of these operations. Zhou S et al., 2016 have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods require a trained FP32 model, either as a starting point ( Zhou A et al., 2017 ), or as a teacher network in a student-teacher training setup ( Mishra and Marr, 2018 ). Replacing the activation function : The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ( Zhou S et al., 2016 , Mishra et al., 2018 ). Another method learns the clipping value per layer, with better results ( Choi et al., 2018 ). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above). 
Modifying network structure : Mishra et al., 2018 try to compensate for the loss of information due to quantization by using wider layers (more channels). Lin et al., 2017 proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall. First and last layer : Many methods do not quantize the first and last layer of the model. It has been observed by Han et al., 2015 that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically ( Zhou S et al., 2016 , Choi et al., 2018 ). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ( Rastegari et al., 2016 ). Most methods keep the first and last layers at FP32. However, Choi et al., 2018 showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy. Iterative quantization : Most methods quantize the entire model at once. Zhou A et al., 2017 employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization. Mixed Weights and Activations Precision : It has been observed that activations are more sensitive to quantization than weights ( Zhou S et al., 2016 ). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ( Li et al., 2016 , Zhu et al., 2016 ).", + "location": "/quantization/index.html#aggressive-quantization-int4-and-lower", + "text": "Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. 
Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy: Training / Re-Training : For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the next section . Zhou S et al., 2016 have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods require a trained FP32 model, either as a starting point ( Zhou A et al., 2017 ), or as a teacher network in a knowledge distillation training setup ( Mishra and Marr, 2018 ). Replacing the activation function : The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ( Zhou S et al., 2016 , Mishra et al., 2018 ). Another method learns the clipping value per layer, with better results ( Choi et al., 2018 ). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above). Modifying network structure : Mishra et al., 2018 try to compensate for the loss of information due to quantization by using wider layers (more channels). Lin et al., 2017 proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall. First and last layer : Many methods do not quantize the first and last layer of the model. 
It has been observed by Han et al., 2015 that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically ( Zhou S et al., 2016 , Choi et al., 2018 ). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ( Rastegari et al., 2016 ). Most methods keep the first and last layers at FP32. However, Choi et al., 2018 showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy. Iterative quantization : Most methods quantize the entire model at once. Zhou A et al., 2017 employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization. Mixed Weights and Activations Precision : It has been observed that activations are more sensitive to quantization than weights ( Zhou S et al., 2016 ). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ( Li et al., 2016 , Zhu et al., 2016 ).", "title": "\"Aggressive\" Quantization: INT4 and Lower" - }, + }, { - "location": "/quantization/index.html#references", - "text": "William Dally . High-Performance Hardware for Machine Learning. Tutorial, NIPS, 2015 Mohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi . XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV, 2016 Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David . Training deep neural networks with low precision multiplications. arxiv:1412.7024 Philipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi . 
Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2018 Szymon Migacz . 8-bit Inference with TensorRT. GTC San Jose, 2017 Yoshua Bengio, Nicholas Leonard and Aaron Courville . Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arxiv:1308.3432, 2013 Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou . DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arxiv:1606.06160 Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen . Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. ICLR, 2017 Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. ICLR, 2018 Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr . WRPN: Wide Reduced-Precision Networks. ICLR, 2018 Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan . PACT: Parameterized Clipping Activation for Quantized Neural Networks. 2018 Xiaofan Lin, Cong Zhao and Wei Pan . Towards Accurate Binary Convolutional Neural Network. NIPS, 2017 Song Han, Jeff Pool, John Tran and William Dally . Learning both Weights and Connections for Efficient Neural Network. NIPS, 2015 Fengfu Li, Bo Zhang and Bin Liu . Ternary Weight Networks. arxiv:1605.04711 Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally . Trained Ternary Quantization. arxiv:1612.01064", + "location": "/quantization/index.html#training-with-quantization", + "text": "As mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. 
The training graph usually looks like this: A full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference. \nIn the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\".", + "title": "Training with Quantization" + }, + { + "location": "/quantization/index.html#straight-through-estimator", + "text": "An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) ( Hinton et al., 2012 , Bengio, 2013 ), which simply passes the gradient through these functions as-is.", + "title": "Straight-Through Estimator" + }, + { + "location": "/quantization/index.html#references", + "text": "William Dally . High-Performance Hardware for Machine Learning. Tutorial, NIPS, 2015 Mohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi . XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV, 2016 Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David . Training deep neural networks with low precision multiplications. 
arxiv:1412.7024 Philipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi . Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2018 Szymon Migacz . 8-bit Inference with TensorRT. GTC San Jose, 2017 Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou . DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arxiv:1606.06160 Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen . Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. ICLR, 2017 Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. ICLR, 2018 Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr . WRPN: Wide Reduced-Precision Networks. ICLR, 2018 Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan . PACT: Parameterized Clipping Activation for Quantized Neural Networks. 2018 Xiaofan Lin, Cong Zhao and Wei Pan . Towards Accurate Binary Convolutional Neural Network. NIPS, 2017 Song Han, Jeff Pool, John Tran and William Dally . Learning both Weights and Connections for Efficient Neural Network. NIPS, 2015 Fengfu Li, Bo Zhang and Bin Liu . Ternary Weight Networks. arxiv:1605.04711 Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally . Trained Ternary Quantization. arxiv:1612.01064 Yoshua Bengio, Nicholas Leonard and Aaron Courville . Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arxiv:1308.3432, 2013 Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed . Neural Networks for Machine Learning. 
Coursera, video lectures, 2012", "title": "References" - }, + }, { - "location": "/algo_pruning/index.html", - "text": "Weights pruning algorithms\n\n\n\n\nMagnitude pruner\n\n\nThis is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor.\n\nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.\n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\nSensitivity pruner\n\n\nFinding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor.\n\n\nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model. You can see that they have an approximate Gaussian distribution.\n\n\n \n\n\nThe distributions of Alexnet conv1 and fc1 layers\n\n\nWe use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor. Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements. 
\n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\n\\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]\n\n\nHow do we choose this \\(s\\) multiplier?\n\n\nIn \nLearning both Weights and Connections for Efficient Neural Networks\n the authors write:\n\n\n\n\n\"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\n\n\n\n\nSo the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.\n\n\nMethod of operation\n\n\n\n\nStart by running a pruning sensitivity analysis on the model. \n\n\nThen use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.\n\n\n\n\nSchedule\n\n\nIn their \npaper\n Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step. Distiller's \nSensitivityPruner\n works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.\n\n\nThis actually works quite well as we can see in the diagram below. 
This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.\n\n\n\nWe use a simple iterative-pruning schedule such as: \nPrune every second epoch starting at epoch 0, and ending at epoch 38.\n This excerpt from \nalexnet.schedule_sensitivity.yaml\n shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:\n\n\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n\n\n\nLevel pruner\n\n\nClass \nSparsityLevelParameterPruner\n uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.\n\nThis pruner is much more stable compared to \nSensitivityPruner\n because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's \nSensitivityPruner\n is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far). 
\n\n\nTo set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each\n\n\nMethod of operation\n\n\n\n\nSort the weights in the specified layer by their absolute values. \n\n\nMask to zero the smallest magnitude weights until the desired sparsity level is reached.\n\n\n\n\nAutomated gradual pruner (AGP)\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n, authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in \nAutomatedGradualPruner\n.\n\n\n\n\n\n\"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"\n\n\n\n\n\n\nYou can play with the scheduling parameters in the \nagp_schedule.ipynb notebook\n.\n\n\nThe authors describe AGP:\n\n\n\n\n\n\nOur automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.\n\n\nDoesn't require much hyper-parameter tuning\n\n\nShown to perform well across different models\n\n\nDoes not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.\n\n\n\n\n\n\nRNN pruner\n\n\nThe authors of \nExploring Sparsity in Recurrent Neural Networks\n, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\" They use a gradual pruning schedule which is reminiscent 
of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers.\n\n\nDistiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.\n\n\n\n\nStructure pruners\n\n\nElement-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.\n\n\nRanked structure pruner\n\n\nThe \nL1RankedStructureParameterPruner\n pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the \nm\n lowest ranking structures are pruned away. Currently this pruner only performs ranking of filters (3D structures) and it uses the mean of the absolute value of the tensor as the representative of the filter magnitude. The absolute mean does not depend on the size of the filter, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.\n\n\nIn \nPruning Filters for Efficient ConvNets\n the authors use filter ranking, with \none-shot pruning\n followed by fine-tuning. The authors of \nExploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition\n also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:\n\n\n\n\nFirst, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. 
Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)\n\n\n\n\nActivation-influenced pruner\n\n\nThe motivation for this pruner, is that if a feature-map produces very small activations, then this feature-map is not very important, and can be pruned away.\n- \nStatus: not implemented", + "location": "/algo_pruning/index.html", + "text": "Weights pruning algorithms\n\n\n\n\nMagnitude pruner\n\n\nThis is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor.\n\nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.\n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\nSensitivity pruner\n\n\nFinding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor.\n\n\nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model. 
You can see that they have an approximate Gaussian distribution.\n\n\n \n\n\nThe distributions of Alexnet conv1 and fc1 layers\n\n\nWe use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor. Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements. \n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\n\\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]\n\n\nHow do we choose this \\(s\\) multiplier?\n\n\nIn \nLearning both Weights and Connections for Efficient Neural Networks\n the authors write:\n\n\n\n\n\"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\n\n\n\n\nSo the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.\n\n\nMethod of operation\n\n\n\n\nStart by running a pruning sensitivity analysis on the model. \n\n\nThen use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.\n\n\n\n\nSchedule\n\n\nIn their \npaper\n Song Han et al. 
use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step. Distiller's \nSensitivityPruner\n works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.\n\n\nThis actually works quite well as we can see in the diagram below. This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.\n\n\n\nWe use a simple iterative-pruning schedule such as: \nPrune every second epoch starting at epoch 0, and ending at epoch 38.\n This excerpt from \nalexnet.schedule_sensitivity.yaml\n shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:\n\n\npruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2\n\n\n\n\nLevel pruner\n\n\nClass \nSparsityLevelParameterPruner\n uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). 
Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.\n\nThis pruner is much more stable compared to \nSensitivityPruner\n because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's \nSensitivityPruner\n is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far). \n\n\nTo set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each\n\n\nMethod of operation\n\n\n\n\nSort the weights in the specified layer by their absolute values. 
\n\n\nMask to zero the smallest magnitude weights until the desired sparsity level is reached.\n\n\n\n\nAutomated gradual pruner (AGP)\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n, authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in \nAutomatedGradualPruner\n.\n\n\n\n\n\n\"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"\n\n\n\n\n\n\nYou can play with the scheduling parameters in the \nagp_schedule.ipynb notebook\n.\n\n\nThe authors describe AGP:\n\n\n\n\n\n\nOur automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.\n\n\nDoesn't require much hyper-parameter tuning\n\n\nShown to perform well across different models\n\n\nDoes not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.\n\n\n\n\n\n\nRNN pruner\n\n\nThe authors of \nExploring Sparsity in Recurrent Neural Networks\n, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\" They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. 
They show pruning of RNN, GRU, LSTM and embedding layers.\n\n\nDistiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.\n\n\n\n\nStructure pruners\n\n\nElement-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.\n\n\nRanked structure pruner\n\n\nThe \nL1RankedStructureParameterPruner\n pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the \nm\n lowest ranking structures are pruned away. Currently this pruner only performs ranking of filters (3D structures) and it uses the mean of the absolute value of the tensor as the representative of the filter magnitude. The absolute mean does not depend on the size of the filter, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.\n\n\nIn \nPruning Filters for Efficient ConvNets\n the authors use filter ranking, with \none-shot pruning\n followed by fine-tuning. The authors of \nExploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition\n also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:\n\n\n\n\nFirst, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. 
Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)\n\n\n\n\nActivation-influenced pruner\n\n\nThe motivation for this pruner, is that if a feature-map produces very small activations, then this feature-map is not very important, and can be pruned away.\n- \nStatus: not implemented", "title": "Pruning" - }, + }, { - "location": "/algo_pruning/index.html#weights-pruning-algorithms", - "text": "", + "location": "/algo_pruning/index.html#weights-pruning-algorithms", + "text": "", "title": "Weights pruning algorithms" - }, + }, { - "location": "/algo_pruning/index.html#magnitude-pruner", - "text": "This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor. \nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family. \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]", + "location": "/algo_pruning/index.html#magnitude-pruner", + "text": "This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor. \nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family. 
\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]", "title": "Magnitude pruner" - }, + }, { - "location": "/algo_pruning/index.html#sensitivity-pruner", - "text": "Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. \nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model. You can see that they have an approximate Gaussian distribution. The distributions of Alexnet conv1 and fc1 layers We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor. Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements. \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\] \\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\] How do we choose this \\(s\\) multiplier? In Learning both Weights and Connections for Efficient Neural Networks the authors write: \"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... 
The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights So the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.", + "location": "/algo_pruning/index.html#sensitivity-pruner", + "text": "Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. \nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model. You can see that they have an approximate Gaussian distribution. The distributions of Alexnet conv1 and fc1 layers We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor. Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements. \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\] \\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\] How do we choose this \\(s\\) multiplier? 
In Learning both Weights and Connections for Efficient Neural Networks the authors write: \"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights So the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.", "title": "Sensitivity pruner" - }, + }, { - "location": "/algo_pruning/index.html#method-of-operation", - "text": "Start by running a pruning sensitivity analysis on the model. Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.", + "location": "/algo_pruning/index.html#method-of-operation", + "text": "Start by running a pruning sensitivity analysis on the model. Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.", "title": "Method of operation" - }, + }, { - "location": "/algo_pruning/index.html#schedule", - "text": "In their paper Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step. Distiller's SensitivityPruner works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned. 
This actually works quite well as we can see in the diagram below. This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate. We use a simple iterative-pruning schedule such as: Prune every second epoch starting at epoch 0, and ending at epoch 38. This excerpt from alexnet.schedule_sensitivity.yaml shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML: pruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2", + "location": "/algo_pruning/index.html#schedule", + "text": "In their paper Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step. Distiller's SensitivityPruner works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned. This actually works quite well as we can see in the diagram below. This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate. We use a simple iterative-pruning schedule such as: Prune every second epoch starting at epoch 0, and ending at epoch 38. 
This excerpt from alexnet.schedule_sensitivity.yaml shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML: pruners:\n my_pruner:\n class: 'SensitivityPruner'\n sensitivities:\n 'features.module.0.weight': 0.25\n 'features.module.3.weight': 0.35\n 'features.module.6.weight': 0.40\n 'features.module.8.weight': 0.45\n 'features.module.10.weight': 0.55\n 'classifier.1.weight': 0.875\n 'classifier.4.weight': 0.875\n 'classifier.6.weight': 0.625\n\npolicies:\n - pruner:\n instance_name : 'my_pruner'\n starting_epoch: 0\n ending_epoch: 38\n frequency: 2", "title": "Schedule" - }, + }, { - "location": "/algo_pruning/index.html#level-pruner", - "text": "Class SparsityLevelParameterPruner uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level. \nThis pruner is much more stable compared to SensitivityPruner because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's SensitivityPruner is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far). 
To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each", + "location": "/algo_pruning/index.html#level-pruner", + "text": "Class SparsityLevelParameterPruner uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level. \nThis pruner is much more stable compared to SensitivityPruner because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's SensitivityPruner is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far). To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each", "title": "Level pruner" - }, + }, { - "location": "/algo_pruning/index.html#method-of-operation_1", - "text": "Sort the weights in the specified layer by their absolute values. Mask to zero the smallest magnitude weights until the desired sparsity level is reached.", + "location": "/algo_pruning/index.html#method-of-operation_1", + "text": "Sort the weights in the specified layer by their absolute values. 
Mask to zero the smallest magnitude weights until the desired sparsity level is reached.", "title": "Method of operation" - }, + }, { - "location": "/algo_pruning/index.html#automated-gradual-pruner-agp", - "text": "In To prune, or not to prune: exploring the efficacy of pruning for model compression , authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in AutomatedGradualPruner . \"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\" You can play with the scheduling parameters in the agp_schedule.ipynb notebook . The authors describe AGP: Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity. Doesn't require much hyper-parameter tuning Shown to perform well across different models Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.", + "location": "/algo_pruning/index.html#automated-gradual-pruner-agp", + "text": "In To prune, or not to prune: exploring the efficacy of pruning for model compression , authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in AutomatedGradualPruner . 
\"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\" You can play with the scheduling parameters in the agp_schedule.ipynb notebook . The authors describe AGP: Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity. Doesn't require much hyper-parameter tuning Shown to perform well across different models Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.", "title": "Automated gradual pruner (AGP)" - }, + }, { - "location": "/algo_pruning/index.html#rnn-pruner", - "text": "The authors of Exploring Sparsity in Recurrent Neural Networks , Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\" They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers. 
Distiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.", + "location": "/algo_pruning/index.html#rnn-pruner", + "text": "The authors of Exploring Sparsity in Recurrent Neural Networks , Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\" They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers. Distiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.", "title": "RNN pruner" - }, + }, { - "location": "/algo_pruning/index.html#structure-pruners", - "text": "Element-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.", + "location": "/algo_pruning/index.html#structure-pruners", + "text": "Element-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.", "title": "Structure pruners" - }, + }, { - "location": "/algo_pruning/index.html#ranked-structure-pruner", - "text": "The L1RankedStructureParameterPruner pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the m lowest ranking structures are pruned away. 
Currently this pruner only performs ranking of filters (3D structures) and it uses the mean of the absolute value of the tensor as the representative of the filter magnitude. The absolute mean does not depend on the size of the filter, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm. In Pruning Filters for Efficient ConvNets the authors use filter ranking, with one-shot pruning followed by fine-tuning. The authors of Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation: First, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)", + "location": "/algo_pruning/index.html#ranked-structure-pruner", + "text": "The L1RankedStructureParameterPruner pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the m lowest ranking structures are pruned away. Currently this pruner only performs ranking of filters (3D structures) and it uses the mean of the absolute value of the tensor as the representative of the filter magnitude. The absolute mean does not depend on the size of the filter, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm. 
In Pruning Filters for Efficient ConvNets the authors use filter ranking, with one-shot pruning followed by fine-tuning. The authors of Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation: First, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)", "title": "Ranked structure pruner" - }, + }, { - "location": "/algo_pruning/index.html#activation-influenced-pruner", - "text": "The motivation for this pruner, is that if a feature-map produces very small activations, then this feature-map is not very important, and can be pruned away.\n- Status: not implemented", + "location": "/algo_pruning/index.html#activation-influenced-pruner", + "text": "The motivation for this pruner, is that if a feature-map produces very small activations, then this feature-map is not very important, and can be pruned away.\n- Status: not implemented", "title": "Activation-influenced pruner" - }, + }, { - "location": "/algo_quantization/index.html", - "text": "Quantization Algorithms\n\n\nSymmetric Linear Quantization\n\n\nIn this method, a float value is quantized by multiplying with a numeric constant (the \nscale factor\n), hence it is \nLinear\n. We use a signed integer to represent the quantized range, with no quantization bias (or \"offset\") used. 
As a result, the floating-point range considered for quantization is \nsymmetric\n with respect to zero.\n\nIn the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).\n\nLet us denote the original floating-point tensor by \nx_f\n, the quantized tensor by \nx_q\n, the scale factor by \nq_x\n and the number of bits used for quantization by \nn\n. Then, we get:\n\nq_x = \\frac{2^{n-1}-1}{\\max|x|}\n\n\nx_q = round(q_x x_f)\n\n(The \nround\n operation is round-to-nearest-integer) \n\n\nLet's see how a \nconvolution\n or \nfully-connected (FC)\n layer is quantized using this method: (we denote input, output, weights and bias with \nx, y, w\n and \nb\n respectively)\n\ny_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\sum{(x_q w_q + \\frac{q_b}{q_x q_w}b_q)}\n\n\ny_q = round(q_y y_f) = round(\\frac{q_y}{q_x q_w} \\sum{(x_q w_q + \\frac{q_b}{q_x q_w}b_q)})\n\nNote how the bias has to be re-scaled to match the scale of the summation.\n\n\nImplementation\n\n\nWe've implemented \nconvolution\n and \nFC\n using this method. \n\n\n\n\nThey are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. \n\n\nAll other layers are unaffected and are executed using their original FP32 implementation. \n\n\nFor weights and bias the scale factor is determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). \n\n\nImportant note:\n Currently, this method is implemented as \ninference only\n, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. 
As such, using it with \nn < 8\n is likely to lead to severe accuracy degradation for any non-trivial workload.", + "location": "/algo_quantization/index.html", + "text": "Quantization Algorithms\n\n\nThe following quantization methods are currently implemented in Distiller:\n\n\nDoReFa\n\n\n(As proposed in \nDoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients\n) \n\n\nIn this method, we first define the quantization function \nquantize_k\n, which takes a real value \na_f \\in [0, 1]\n and outputs a discrete-valued \na_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\}\n, where \nk\n is the number of bits used for quantization.\n\n\n\n\na_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right)\n\n\n\n\nActivations are clipped to the \n[0, 1]\n range and then quantized as follows:\n\n\n\n\nx_q = quantize_k(x_f)\n\n\n\n\nFor weights, we define the following function \nf\n, which takes an unbounded real valued input and outputs a real value in \n[0, 1]\n:\n\n\n\n\nf(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2} \n\n\n\n\nNow we can use \nquantize_k\n to get quantized weight values, as follows:\n\n\n\n\nw_q = 2 quantize_k \\left( f(w_f) \\right) - 1\n\n\n\n\nThis method requires training the model with quantization, as discussed \nhere\n. 
Use the \nDorefaQuantizer\n class to transform an existing model to a model suitable for training with quantization using DoReFa.\n\n\nNotes:\n\n\n\n\nGradients quantization as proposed in the paper is not supported yet.\n\n\nThe paper defines special handling for binary weights which isn't supported in Distiller yet.\n\n\n\n\nWRPN\n\n\n(As proposed in \nWRPN: Wide Reduced-Precision Networks\n) \n\n\nIn this method, activations are clipped to \n[0, 1]\n and quantized as follows (\nk\n is the number of bits used for quantization):\n\n\n\n\nx_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right)\n\n\n\n\nWeights are clipped to \n[-1, 1]\n and quantized as follows:\n\n\n\n\nw_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right)\n\n\n\n\nNote that \nk-1\n bits are used to quantize weights, leaving one bit for sign.\n\n\nThis method requires training the model with quantization, as discussed \nhere\n. Use the \nWRPNQuantizer\n class to transform an existing model to a model suitable for training with quantization using WRPN.\n\n\nNotes:\n\n\n\n\nThe paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of \nWRPNQuantizer\n at the moment. To experiment with this, modify your model implementation to have wider layers.\n\n\nThe paper defines special handling for binary weights which isn't supported in Distiller yet.\n\n\n\n\nSymmetric Linear Quantization\n\n\nIn this method, a float value is quantized by multiplying with a numeric constant (the \nscale factor\n), hence it is \nLinear\n. We use a signed integer to represent the quantized range, with no quantization bias (or \"offset\") used. 
As a result, the floating-point range considered for quantization is \nsymmetric\n with respect to zero.\n\nIn the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).\n\nLet us denote the original floating-point tensor by \nx_f\n, the quantized tensor by \nx_q\n, the scale factor by \nq_x\n and the number of bits used for quantization by \nn\n. Then, we get:\n\nq_x = \\frac{2^{n-1}-1}{\\max|x|}\n\n\nx_q = round(q_x x_f)\n\n(The \nround\n operation is round-to-nearest-integer) \n\n\nLet's see how a \nconvolution\n or \nfully-connected (FC)\n layer is quantized using this method: (we denote input, output, weights and bias with \nx, y, w\n and \nb\n respectively)\n\ny_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\sum{ \\left( x_q w_q + \\frac{q_b}{q_x q_w}b_q \\right) }\n\n\ny_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\sum{ \\left( x_q w_q + \\frac{q_b}{q_x q_w}b_q \\right) } \\right) \n\nNote how the bias has to be re-scaled to match the scale of the summation.\n\n\nImplementation\n\n\nWe've implemented \nconvolution\n and \nFC\n using this method. \n\n\n\n\nThey are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the \nRangeLinearQuantParamLayerWrapper\n class. \n\n\nAll other layers are unaffected and are executed using their original FP32 implementation. \n\n\nTo automatically transform an existing model to a quantized model using this method, use the \nSymmetricLinearQuantizer\n class.\n\n\nFor weights and bias the scale factor is determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). 
\n\n\nImportant note:\n Currently, this method is implemented as \ninference only\n, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with \nn < 8\n is likely to lead to severe accuracy degradation for any non-trivial workload.", "title": "Quantization" - }, + }, { - "location": "/algo_quantization/index.html#quantization-algorithms", - "text": "", + "location": "/algo_quantization/index.html#quantization-algorithms", + "text": "The following quantization methods are currently implemented in Distiller:", "title": "Quantization Algorithms" - }, - { - "location": "/algo_quantization/index.html#symmetric-linear-quantization", - "text": "In this method, a float value is quantized by multiplying with a numeric constant (the scale factor ), hence it is Linear . We use a signed integer to represent the quantized range, with no quantization bias (or \"offset\") used. As a result, the floating-point range considered for quantization is symmetric with respect to zero. \nIn the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers). \nLet us denote the original floating-point tensor by x_f , the quantized tensor by x_q , the scale factor by q_x and the number of bits used for quantization by n . 
Then, we get: q_x = \\frac{2^{n-1}-1}{\\max|x|} x_q = round(q_x x_f) \n(The round operation is round-to-nearest-integer) Let's see how a convolution or fully-connected (FC) layer is quantized using this method: (we denote input, output, weights and bias with x, y, w and b respectively) y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\sum{(x_q w_q + \\frac{q_b}{q_x q_w}b_q)} y_q = round(q_y y_f) = round(\\frac{q_y}{q_x q_w} \\sum{(x_q w_q + \\frac{q_b}{q_x q_w}b_q)}) \nNote how the bias has to be re-scaled to match the scale of the summation.", + }, + { + "location": "/algo_quantization/index.html#dorefa", + "text": "(As proposed in DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients ) In this method, we first define the quantization function quantize_k , which takes a real value a_f \\in [0, 1] and outputs a discrete-valued a_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\} , where k is the number of bits used for quantization. a_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right) Activations are clipped to the [0, 1] range and then quantized as follows: x_q = quantize_k(x_f) For weights, we define the following function f , which takes an unbounded real valued input and outputs a real value in [0, 1] : f(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2} Now we can use quantize_k to get quantized weight values, as follows: w_q = 2 quantize_k \\left( f(w_f) \\right) - 1 This method requires training the model with quantization, as discussed here . Use the DorefaQuantizer class to transform an existing model to a model suitable for training with quantization using DoReFa.", + "title": "DoReFa" + }, + { + "location": "/algo_quantization/index.html#notes", + "text": "Gradients quantization as proposed in the paper is not supported yet. 
The paper defines special handling for binary weights which isn't supported in Distiller yet.", + "title": "Notes:" + }, + { + "location": "/algo_quantization/index.html#wrpn", + "text": "(As proposed in WRPN: Wide Reduced-Precision Networks ) In this method, activations are clipped to [0, 1] and quantized as follows ( k is the number of bits used for quantization): x_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right) Weights are clipped to [-1, 1] and quantized as follows: w_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right) Note that k-1 bits are used to quantize weights, leaving one bit for sign. This method requires training the model with quantization, as discussed here . Use the WRPNQuantizer class to transform an existing model to a model suitable for training with quantization using WRPN.", + "title": "WRPN" + }, + { + "location": "/algo_quantization/index.html#notes_1", + "text": "The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of WRPNQuantizer at the moment. To experiment with this, modify your model implementation to have wider layers. The paper defines special handling for binary weights which isn't supported in Distiller yet.", + "title": "Notes:" + }, + { + "location": "/algo_quantization/index.html#symmetric-linear-quantization", + "text": "In this method, a float value is quantized by multiplying with a numeric constant (the scale factor ), hence it is Linear . We use a signed integer to represent the quantized range, with no quantization bias (or \"offset\") used. As a result, the floating-point range considered for quantization is symmetric with respect to zero. \nIn the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers). 
\nLet us denote the original floating-point tensor by x_f , the quantized tensor by x_q , the scale factor by q_x and the number of bits used for quantization by n . Then, we get: q_x = \\frac{2^{n-1}-1}{\\max|x|} x_q = round(q_x x_f) \n(The round operation is round-to-nearest-integer) Let's see how a convolution or fully-connected (FC) layer is quantized using this method: (we denote input, output, weights and bias with x, y, w and b respectively) y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\sum{ \\left( x_q w_q + \\frac{q_b}{q_x q_w}b_q \\right) } y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\sum{ \\left( x_q w_q + \\frac{q_b}{q_x q_w}b_q \\right) } \\right) \nNote how the bias has to be re-scaled to match the scale of the summation.", "title": "Symmetric Linear Quantization" - }, + }, { - "location": "/algo_quantization/index.html#implementation", - "text": "We've implemented convolution and FC using this method. They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. All other layers are unaffected and are executed using their original FP32 implementation. For weights and bias the scale factor is determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). Important note: Currently, this method is implemented as inference only , with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with n < 8 is likely to lead to severe accuracy degradation for any non-trivial workload.", + "location": "/algo_quantization/index.html#implementation", + "text": "We've implemented convolution and FC using this method. 
They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the RangeLinearQuantParamLayerWrapper class. All other layers are unaffected and are executed using their original FP32 implementation. To automatically transform an existing model to a quantized model using this method, use the SymmetricLinearQuantizer class. For weights and bias the scale factor is determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). Important note: Currently, this method is implemented as inference only , with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with n < 8 is likely to lead to severe accuracy degradation for any non-trivial workload.", "title": "Implementation" - }, + }, { - "location": "/model_zoo/index.html", - "text": "Distiller Model Zoo\n\n\nHow to contribute models to the Model Zoo\n\n\nWe encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. To assure that models and algorithms shared with others are high-quality, please commit your models with the following:\n\n\n\n\nCommand-line arguments\n\n\nLog files\n\n\nPyTorch model\n\n\n\n\nContents\n\n\nThe Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models. Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers. 
These are meant to serve as examples of how Distiller can be used.\n\n\nEach model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.\n\n\n\n\ntable, th, td {\n border: 1px solid black;\n}\n\n\n\n\n \n\n \nPaper\n\n \nDataset\n\n \nNetwork\n\n \nMethod & Granularity\n\n \nSchedule\n\n \nFeatures\n\n \n\n \n\n \nLearning both Weights and Connections for Efficient Neural Networks\n\n \nImageNet\n\n \nAlexnet\n\n \nElement-wise pruning\n\n \nIterative; Manual\n\n \nMagnitude thresholding based on a sensitivity quantifier.\nElement-wise sparsity sensitivity analysis\n\n \n\n \n\n \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n \nImageNet\n\n \nMobileNet\n\n \nElement-wise pruning\n\n \nAutomated gradual; Iterative\n\n \nMagnitude thresholding based on target level\n\n \n\n \n\n \nLearning Structured Sparsity in Deep Neural Networks\n\n \nCIFAR10\n\n \nResNet20\n\n \nGroup regularization\n\n \n1.Train with group-lasso\n2.Remove zero groups and fine-tune\n\n \nGroup Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols)\n\n \n\n \n\n \nPruning Filters for Efficient ConvNets\n\n \nCIFAR10\n\n \nResNet56\n\n \nFilter ranking; guided by sensitivity analysis\n\n \n1.Rank filters\n2. Remove filters and channels\n3.Fine-tune\n\n \nOne-shot ranking and pruning of filters; with network thinning\n \n\n\n\n\nLearning both Weights and Connections for Efficient Neural Networks\n\n\nThis schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: \nEfficient Methods and Hardware for Deep Learning\n and in his paper \nLearning both Weights and Connections for Efficient Neural Networks\n. 
\n\n\nThe Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\". Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further. In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.\n\n\nNote that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In his PhD dissertation, Song Han describes a growing threshold, at each iteration. This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights. Thus, we can use less hyper-parameters and achieve the same results.\n\n\n\n\nDistiller schedule: \ndistiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\nCheckpoint file: \nalexnet.checkpoint.89.pth.tar\n\n\n\n\nResults\n\n\nOur reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09. 
We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.\n\n\nParameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n| 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n| 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n| 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n| 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n| 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n| 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n| 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n| 8 | Total 
sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893\n\n\n\n\nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n\nIn their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"\n\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. 
We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nImageNet files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResNet18 files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/resnet18.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResults\n\n\nAs our baseline we used a \npretrained PyTorch MobileNet model\n (width=1) which has Top1=68.848 and Top5=88.740.\n\nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy. We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656). We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper. \n\n\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | module.model.0.0.weight | (32, 3, 3, 3) | 864 | 864 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.14466 | 0.00103 | 0.06508 |\n| 1 | module.model.1.0.weight | (32, 1, 3, 3) | 288 | 288 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.32146 | 0.01020 | 0.12932 |\n| 2 | module.model.1.3.weight | (64, 32, 1, 1) | 2048 | 2048 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11942 | 0.00024 | 0.03627 |\n| 3 | module.model.2.0.weight | 
(64, 1, 3, 3) | 576 | 576 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.15809 | 0.00543 | 0.11513 |\n| 4 | module.model.2.3.weight | (128, 64, 1, 1) | 8192 | 8192 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08442 | -0.00031 | 0.04182 |\n| 5 | module.model.3.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.16780 | 0.00125 | 0.10545 |\n| 6 | module.model.3.3.weight | (128, 128, 1, 1) | 16384 | 16384 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07126 | -0.00197 | 0.04123 |\n| 7 | module.model.4.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.10182 | 0.00171 | 0.08719 |\n| 8 | module.model.4.3.weight | (256, 128, 1, 1) | 32768 | 13108 | 0.00000 | 0.00000 | 10.15625 | 59.99756 | 12.50000 | 59.99756 | 0.05543 | -0.00002 | 0.02760 |\n| 9 | module.model.5.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.12516 | -0.00288 | 0.08058 |\n| 10 | module.model.5.3.weight | (256, 256, 1, 1) | 65536 | 26215 | 0.00000 | 0.00000 | 12.50000 | 59.99908 | 23.82812 | 59.99908 | 0.04453 | 0.00002 | 0.02271 |\n| 11 | module.model.6.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08024 | 0.00252 | 0.06377 |\n| 12 | module.model.6.3.weight | (512, 256, 1, 1) | 131072 | 52429 | 0.00000 | 0.00000 | 23.82812 | 59.99985 | 14.25781 | 59.99985 | 0.03561 | -0.00057 | 0.01779 |\n| 13 | module.model.7.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11008 | -0.00018 | 0.06829 |\n| 14 | module.model.7.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 14.25781 | 59.99985 | 21.28906 | 59.99985 | 0.02944 | -0.00060 | 0.01515 |\n| 15 | module.model.8.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08258 | 
0.00370 | 0.04905 |\n| 16 | module.model.8.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 21.28906 | 59.99985 | 28.51562 | 59.99985 | 0.02865 | -0.00046 | 0.01465 |\n| 17 | module.model.9.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07578 | 0.00468 | 0.04201 |\n| 18 | module.model.9.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 28.51562 | 59.99985 | 23.43750 | 59.99985 | 0.02939 | -0.00044 | 0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07091 | 0.00014 | 0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 24.60938 | 59.99985 | 20.89844 | 59.99985 | 0.03095 | -0.00059 | 0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.05729 | -0.00518 | 0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 20.89844 | 59.99985 | 17.57812 | 59.99985 | 0.03229 | -0.00044 | 0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.04981 | -0.00136 | 0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1) | 524288 | 209716 | 0.00000 | 0.00000 | 16.01562 | 59.99985 | 44.23828 | 59.99985 | 0.02514 | -0.00106 | 0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3) | 9216 | 9216 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.02396 | -0.00949 | 0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) | 1048576 | 419431 | 0.00000 | 0.00000 | 44.72656 | 59.99994 | 1.46484 | 59.99994 | 0.01801 | -0.00017 | 0.00931 |\n| 27 | module.fc.weight | (1000, 1024) | 1024000 | 409600 | 1.46484 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.05078 | 0.00271 | 0.02734 |\n| 28 | Total sparsity: | - | 4209088 | 
1726917 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.97171 | 0.00000 | 0.00000 | 0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==> Top1: 65.337 Top5: 84.984 Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==> Top1: 68.810 Top5: 88.626 Loss: 1.282\n\n\n\n\n\nLearning Structured Sparsity in Deep Neural Networks\n\n\nThis research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"\n\n\nNote that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group. We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength. At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit. Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value). 
\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/ssl/resnet20_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ndistiller/examples/ssl/checkpoints/\n\n\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic\n\n\n\n\nRegularization\n\n\nThen we started training from scratch again, but this time we used Group Lasso regularization on entire layers:\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_4L_training.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic\n\n\n\n\nThe diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10 baseline (in red). You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss (\nReg Loss\n), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.\n\n\n\nThis \nregularization\n yields 5 layers with zeroed weight tensors. We load this model, remove the 5 layers, and start the fine tuning of the weights. 
This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path. When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated. \n\n\nWe managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time. It's not bad, but we probably could have done better.\n\n\nFine-tuning\n\n\nDuring the \nfine-tuning\n process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network.\n\nWe copy the checkpoint file of the regularized model to \ncheckpoint_trained_4D_regularized_5Lremoved.pth.tar\n.\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_finetuning.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml -j=1 --deterministic\n\n\n\n\nResults\n\n\nOur baseline results for ResNet20 Cifar are: Top1=91.450 and Top5=99.750\n\n\nWe used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies.\n\nThe regularized model exhibits really poor classification abilities: \n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=> loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 
[layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n training=45000\n validation=5000\n test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==> Top1: 22.290 Top5: 68.940 Loss: 5.172\n\n\n\n\nHowever, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670\n\n\nWe didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).\n\n\nPruning Filters for Efficient ConvNets\n\n\nQuoting the authors directly:\n\n\n\n\nWe present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.\n\n\n\n\nThe implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".\n\n\nAfter performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level. 
\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml\n\n\nCheckpoint files: \ncheckpoint_finetuned.pth.tar\n\n\n\n\nThe excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner. This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning).\n\n\npruners:\n filter_pruner:\n class: 'L1RankedStructureParameterPruner'\n reg_regims:\n 'module.layer1.0.conv1.weight': [0.6, '3D']\n 'module.layer1.1.conv1.weight': [0.6, '3D']\n 'module.layer1.2.conv1.weight': [0.6, '3D']\n 'module.layer1.3.conv1.weight': [0.6, '3D']\n\n\n\n\nIn the policy, we specify that we want to invoke this pruner once, at epoch 180. Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.\n\n\npolicies:\n - pruner:\n instance_name: filter_pruner\n epochs: [180]\n\n\n\n\n\nFollowing the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors. When we remove filters from Convolution layer \nn\n we need to perform several changes to the network:\n1. Shrink layer \nn\n's weights tensor, leaving only the \"important\" filters.\n2. Configure layer \nn\n's \n.out_channels\n member to its new, smaller, value.\n3. If a BN layer follows layer \nn\n, then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights.\n\n\nAll of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180. 
We call this process \"network thinning\".\n\n\nextensions:\n net_thinner:\n class: 'ResnetCifarFilterRemover'\n thinning_func_str: resnet_cifar_remove_filters\n\n\n\n\n\nNetwork thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this. On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider.\n\nOur current implementation is specific to certain layers in ResNet and is a bit fragile. We will continue to improve and generalize this.\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ncheckpoint.resnet56_cifar_baseline.pth.tar\n\n\n\n\nResults\n\n\nWe trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.\n\n\nWe used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760", + "location": "/model_zoo/index.html", + "text": "Distiller Model Zoo\n\n\nHow to contribute models to the Model Zoo\n\n\nWe encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. To assure that models and algorithms shared with others are high-quality, please commit your models with the following:\n\n\n\n\nCommand-line arguments\n\n\nLog files\n\n\nPyTorch model\n\n\n\n\nContents\n\n\nThe Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models. 
Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers. These are meant to serve as examples of how Distiller can be used.\n\n\nEach model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.\n\n\n\n\ntable, th, td {\n border: 1px solid black;\n}\n\n\n\n\n \n\n \nPaper\n\n \nDataset\n\n \nNetwork\n\n \nMethod \n Granularity\n\n \nSchedule\n\n \nFeatures\n\n \n\n \n\n \nLearning both Weights and Connections for Efficient Neural Networks\n\n \nImageNet\n\n \nAlexnet\n\n \nElement-wise pruning\n\n \nIterative; Manual\n\n \nMagnitude thresholding based on a sensitivity quantifier.\nElement-wise sparsity sensitivity analysis\n\n \n\n \n\n \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n \nImageNet\n\n \nMobileNet\n\n \nElement-wise pruning\n\n \nAutomated gradual; Iterative\n\n \nMagnitude thresholding based on target level\n\n \n\n \n\n \nLearning Structured Sparsity in Deep Neural Networks\n\n \nCIFAR10\n\n \nResNet20\n\n \nGroup regularization\n\n \n1.Train with group-lasso\n2.Remove zero groups and fine-tune\n\n \nGroup Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols)\n\n \n\n \n\n \nPruning Filters for Efficient ConvNets\n\n \nCIFAR10\n\n \nResNet56\n\n \nFilter ranking; guided by sensitivity analysis\n\n \n1.Rank filters\n2. Remove filters and channels\n3.Fine-tune\n\n \nOne-shot ranking and pruning of filters; with network thinning\n \n\n\n\n\nLearning both Weights and Connections for Efficient Neural Networks\n\n\nThis schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: \nEfficient Methods and Hardware for Deep Learning\n and in his paper \nLearning both Weights and Connections for Efficient Neural Networks\n. 
\n\n\nThe Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\". Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further. In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.\n\n\nNote that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In his PhD dissertation, Song Han describes a growing threshold, at each iteration. This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights. Thus, we can use less hyper-parameters and achieve the same results.\n\n\n\n\nDistiller schedule: \ndistiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\nCheckpoint file: \nalexnet.checkpoint.89.pth.tar\n\n\n\n\nResults\n\n\nOur reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09. 
We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.\n\n\nParameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n| 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n| 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n| 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n| 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n| 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n| 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n| 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n| 8 | Total 
sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==\n Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==\n Top1: 56.606 Top5: 79.446 Loss: 1.893\n\n\n\n\nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n\nIn their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"\n\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. 
We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nImageNet files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResNet18 files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/resnet18.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResults\n\n\nAs our baseline we used a \npretrained PyTorch MobileNet model\n (width=1) which has Top1=68.848 and Top5=88.740.\n\nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy. We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656). We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper. \n\n\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | module.model.0.0.weight | (32, 3, 3, 3) | 864 | 864 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.14466 | 0.00103 | 0.06508 |\n| 1 | module.model.1.0.weight | (32, 1, 3, 3) | 288 | 288 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.32146 | 0.01020 | 0.12932 |\n| 2 | module.model.1.3.weight | (64, 32, 1, 1) | 2048 | 2048 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11942 | 0.00024 | 0.03627 |\n| 3 | module.model.2.0.weight | 
(64, 1, 3, 3) | 576 | 576 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.15809 | 0.00543 | 0.11513 |\n| 4 | module.model.2.3.weight | (128, 64, 1, 1) | 8192 | 8192 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08442 | -0.00031 | 0.04182 |\n| 5 | module.model.3.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.16780 | 0.00125 | 0.10545 |\n| 6 | module.model.3.3.weight | (128, 128, 1, 1) | 16384 | 16384 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07126 | -0.00197 | 0.04123 |\n| 7 | module.model.4.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.10182 | 0.00171 | 0.08719 |\n| 8 | module.model.4.3.weight | (256, 128, 1, 1) | 32768 | 13108 | 0.00000 | 0.00000 | 10.15625 | 59.99756 | 12.50000 | 59.99756 | 0.05543 | -0.00002 | 0.02760 |\n| 9 | module.model.5.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.12516 | -0.00288 | 0.08058 |\n| 10 | module.model.5.3.weight | (256, 256, 1, 1) | 65536 | 26215 | 0.00000 | 0.00000 | 12.50000 | 59.99908 | 23.82812 | 59.99908 | 0.04453 | 0.00002 | 0.02271 |\n| 11 | module.model.6.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08024 | 0.00252 | 0.06377 |\n| 12 | module.model.6.3.weight | (512, 256, 1, 1) | 131072 | 52429 | 0.00000 | 0.00000 | 23.82812 | 59.99985 | 14.25781 | 59.99985 | 0.03561 | -0.00057 | 0.01779 |\n| 13 | module.model.7.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11008 | -0.00018 | 0.06829 |\n| 14 | module.model.7.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 14.25781 | 59.99985 | 21.28906 | 59.99985 | 0.02944 | -0.00060 | 0.01515 |\n| 15 | module.model.8.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08258 | 
0.00370 | 0.04905 |\n| 16 | module.model.8.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 21.28906 | 59.99985 | 28.51562 | 59.99985 | 0.02865 | -0.00046 | 0.01465 |\n| 17 | module.model.9.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07578 | 0.00468 | 0.04201 |\n| 18 | module.model.9.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 28.51562 | 59.99985 | 23.43750 | 59.99985 | 0.02939 | -0.00044 | 0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07091 | 0.00014 | 0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 24.60938 | 59.99985 | 20.89844 | 59.99985 | 0.03095 | -0.00059 | 0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.05729 | -0.00518 | 0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 20.89844 | 59.99985 | 17.57812 | 59.99985 | 0.03229 | -0.00044 | 0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.04981 | -0.00136 | 0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1) | 524288 | 209716 | 0.00000 | 0.00000 | 16.01562 | 59.99985 | 44.23828 | 59.99985 | 0.02514 | -0.00106 | 0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3) | 9216 | 9216 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.02396 | -0.00949 | 0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) | 1048576 | 419431 | 0.00000 | 0.00000 | 44.72656 | 59.99994 | 1.46484 | 59.99994 | 0.01801 | -0.00017 | 0.00931 |\n| 27 | module.fc.weight | (1000, 1024) | 1024000 | 409600 | 1.46484 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.05078 | 0.00271 | 0.02734 |\n| 28 | Total sparsity: | - | 4209088 | 
1726917 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.97171 | 0.00000 | 0.00000 | 0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==\n Top1: 65.337 Top5: 84.984 Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==\n Top1: 68.810 Top5: 88.626 Loss: 1.282\n\n\n\n\n\nLearning Structured Sparsity in Deep Neural Networks\n\n\nThis research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"\n\n\nNote that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group. We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength. At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit. Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value). 
\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/ssl/resnet20_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ndistiller/examples/ssl/checkpoints/\n\n\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic\n\n\n\n\nRegularization\n\n\nThen we started training from scratch again, but this time we used Group Lasso regularization on entire layers:\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_4L_training.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic\n\n\n\n\nThe diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10 baseline (in red). You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss (\nReg Loss\n), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.\n\n\n\nThis \nregularization\n yields 5 layers with zeroed weight tensors. We load this model, remove the 5 layers, and start the fine tuning of the weights. 
This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path. When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated. \n\n\nWe managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time. It's not bad, but we probably could have done better.\n\n\nFine-tuning\n\n\nDuring the \nfine-tuning\n process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network.\n\nWe copy the checkpoint file of the regularized model to \ncheckpoint_trained_4D_regularized_5Lremoved.pth.tar\n.\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_finetuning.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml -j=1 --deterministic\n\n\n\n\nResults\n\n\nOur baseline results for ResNet20 Cifar are: Top1=91.450 and Top5=99.750\n\n\nWe used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies.\n\nThe regularized model exhibits really poor classification abilities: \n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=\n loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 
[layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n training=45000\n validation=5000\n test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==\n Top1: 22.290 Top5: 68.940 Loss: 5.172\n\n\n\n\nHowever, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670\n\n\nWe didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).\n\n\nPruning Filters for Efficient ConvNets\n\n\nQuoting the authors directly:\n\n\n\n\nWe present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.\n\n\n\n\nThe implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".\n\n\nAfter performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level. 
\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml\n\n\nCheckpoint files: \ncheckpoint_finetuned.pth.tar\n\n\n\n\nThe excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner. This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning).\n\n\npruners:\n filter_pruner:\n class: 'L1RankedStructureParameterPruner'\n reg_regims:\n 'module.layer1.0.conv1.weight': [0.6, '3D']\n 'module.layer1.1.conv1.weight': [0.6, '3D']\n 'module.layer1.2.conv1.weight': [0.6, '3D']\n 'module.layer1.3.conv1.weight': [0.6, '3D']\n\n\n\n\nIn the policy, we specify that we want to invoke this pruner once, at epoch 180. Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.\n\n\npolicies:\n - pruner:\n instance_name: filter_pruner\n epochs: [180]\n\n\n\n\n\nFollowing the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors. When we remove filters from Convolution layer \nn\n we need to perform several changes to the network:\n1. Shrink layer \nn\n's weights tensor, leaving only the \"important\" filters.\n2. Configure layer \nn\n's \n.out_channels\n member to its new, smaller, value.\n3. If a BN layer follows layer \nn\n, then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights.\n\n\nAll of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180. 
We call this process \"network thinning\".\n\n\nextensions:\n net_thinner:\n class: 'ResnetCifarFilterRemover'\n thinning_func_str: resnet_cifar_remove_filters\n\n\n\n\n\nNetwork thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this. On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider.\n\nOur current implementation is specific to certain layers in ResNet and is a bit fragile. We will continue to improve and generalize this.\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ncheckpoint.resnet56_cifar_baseline.pth.tar\n\n\n\n\nResults\n\n\nWe trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.\n\n\nWe used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760", "title": "Model Zoo" - }, + }, { - "location": "/model_zoo/index.html#distiller-model-zoo", - "text": "", + "location": "/model_zoo/index.html#distiller-model-zoo", + "text": "", "title": "Distiller Model Zoo" - }, + }, { - "location": "/model_zoo/index.html#how-to-contribute-models-to-the-model-zoo", - "text": "We encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. 
To assure that models and algorithms shared with others are high-quality, please commit your models with the following: Command-line arguments Log files PyTorch model", + "location": "/model_zoo/index.html#how-to-contribute-models-to-the-model-zoo", + "text": "We encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. To assure that models and algorithms shared with others are high-quality, please commit your models with the following: Command-line arguments Log files PyTorch model", "title": "How to contribute models to the Model Zoo" - }, + }, { - "location": "/model_zoo/index.html#contents", - "text": "The Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models. Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers. These are meant to serve as examples of how Distiller can be used. Each model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs. \ntable, th, td {\n border: 1px solid black;\n} \n \n Paper \n Dataset \n Network \n Method & Granularity \n Schedule \n Features \n \n \n Learning both Weights and Connections for Efficient Neural Networks \n ImageNet \n Alexnet \n Element-wise pruning \n Iterative; Manual \n Magnitude thresholding based on a sensitivity quantifier. Element-wise sparsity sensitivity analysis \n \n \n To prune, or not to prune: exploring the efficacy of pruning for model compression \n ImageNet \n MobileNet \n Element-wise pruning \n Automated gradual; Iterative \n Magnitude thresholding based on target level \n \n \n Learning Structured Sparsity in Deep Neural Networks \n CIFAR10 \n ResNet20 \n Group regularization \n 1.Train with group-lasso 2.Remove zero groups and fine-tune \n Group Lasso regularization. 
Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) \n \n \n Pruning Filters for Efficient ConvNets \n CIFAR10 \n ResNet56 \n Filter ranking; guided by sensitivity analysis \n 1.Rank filters 2. Remove filters and channels 3.Fine-tune \n One-shot ranking and pruning of filters; with network thinning", + "location": "/model_zoo/index.html#contents", + "text": "The Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models. Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers. These are meant to serve as examples of how Distiller can be used. Each model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs. \ntable, th, td {\n border: 1px solid black;\n} \n \n Paper \n Dataset \n Network \n Method Granularity \n Schedule \n Features \n \n \n Learning both Weights and Connections for Efficient Neural Networks \n ImageNet \n Alexnet \n Element-wise pruning \n Iterative; Manual \n Magnitude thresholding based on a sensitivity quantifier. Element-wise sparsity sensitivity analysis \n \n \n To prune, or not to prune: exploring the efficacy of pruning for model compression \n ImageNet \n MobileNet \n Element-wise pruning \n Automated gradual; Iterative \n Magnitude thresholding based on target level \n \n \n Learning Structured Sparsity in Deep Neural Networks \n CIFAR10 \n ResNet20 \n Group regularization \n 1.Train with group-lasso 2.Remove zero groups and fine-tune \n Group Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) \n \n \n Pruning Filters for Efficient ConvNets \n CIFAR10 \n ResNet56 \n Filter ranking; guided by sensitivity analysis \n 1.Rank filters 2. 
Remove filters and channels 3.Fine-tune \n One-shot ranking and pruning of filters; with network thinning", "title": "Contents" - }, + }, { - "location": "/model_zoo/index.html#learning-both-weights-and-connections-for-efficient-neural-networks", - "text": "This schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: Efficient Methods and Hardware for Deep Learning and in his paper Learning both Weights and Connections for Efficient Neural Networks . The Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\". Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further. In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer. Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In his PhD dissertation, Song Han describes a growing threshold, at each iteration. This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights. Thus, we can use less hyper-parameters and achieve the same results. 
Distiller schedule: distiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml Checkpoint file: alexnet.checkpoint.89.pth.tar", + "location": "/model_zoo/index.html#learning-both-weights-and-connections-for-efficient-neural-networks", + "text": "This schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: Efficient Methods and Hardware for Deep Learning and in his paper Learning both Weights and Connections for Efficient Neural Networks . The Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\". Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further. In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer. Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In his PhD dissertation, Song Han describes a growing threshold, at each iteration. This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights. Thus, we can use less hyper-parameters and achieve the same results. 
Distiller schedule: distiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml Checkpoint file: alexnet.checkpoint.89.pth.tar", "title": "Learning both Weights and Connections for Efficient Neural Networks" - }, + }, { - "location": "/model_zoo/index.html#results", - "text": "Our reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09. We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results. Parameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n| 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n| 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n| 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n| 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n| 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 
0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n| 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n| 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n| 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893", + "location": "/model_zoo/index.html#results", + "text": "Our reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09. We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results. 
Parameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 |\n| 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 |\n| 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 |\n| 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 |\n| 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 |\n| 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 |\n| 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 |\n| 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 |\n| 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 
|\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - == Top1: 51.838 Top5: 74.817 Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - == Top1: 56.606 Top5: 79.446 Loss: 1.893", "title": "Results" - }, + }, { - "location": "/model_zoo/index.html#to-prune-or-not-to-prune-exploring-the-efficacy-of-pruning-for-model-compression", - "text": "In their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\" This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size. 
ImageNet files: Distiller schedule: distiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar ResNet18 files: Distiller schedule: distiller/examples/agp-pruning/resnet18.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar", + "location": "/model_zoo/index.html#to-prune-or-not-to-prune-exploring-the-efficacy-of-pruning-for-model-compression", + "text": "In their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\" This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size. ImageNet files: Distiller schedule: distiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar ResNet18 files: Distiller schedule: distiller/examples/agp-pruning/resnet18.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar", "title": "To prune, or not to prune: exploring the efficacy of pruning for model compression" - }, + }, { - "location": "/model_zoo/index.html#results_1", - "text": "As our baseline we used a pretrained PyTorch MobileNet model (width=1) which has Top1=68.848 and Top5=88.740. 
\nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy. We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656). We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper. +----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | module.model.0.0.weight | (32, 3, 3, 3) | 864 | 864 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.14466 | 0.00103 | 0.06508 |\n| 1 | module.model.1.0.weight | (32, 1, 3, 3) | 288 | 288 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.32146 | 0.01020 | 0.12932 |\n| 2 | module.model.1.3.weight | (64, 32, 1, 1) | 2048 | 2048 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11942 | 0.00024 | 0.03627 |\n| 3 | module.model.2.0.weight | (64, 1, 3, 3) | 576 | 576 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.15809 | 0.00543 | 0.11513 |\n| 4 | module.model.2.3.weight | (128, 64, 1, 1) | 8192 | 8192 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08442 | -0.00031 | 0.04182 |\n| 5 | module.model.3.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.16780 | 0.00125 | 0.10545 |\n| 6 | module.model.3.3.weight | (128, 128, 1, 1) | 16384 | 16384 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07126 | -0.00197 | 0.04123 |\n| 7 | 
module.model.4.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.10182 | 0.00171 | 0.08719 |\n| 8 | module.model.4.3.weight | (256, 128, 1, 1) | 32768 | 13108 | 0.00000 | 0.00000 | 10.15625 | 59.99756 | 12.50000 | 59.99756 | 0.05543 | -0.00002 | 0.02760 |\n| 9 | module.model.5.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.12516 | -0.00288 | 0.08058 |\n| 10 | module.model.5.3.weight | (256, 256, 1, 1) | 65536 | 26215 | 0.00000 | 0.00000 | 12.50000 | 59.99908 | 23.82812 | 59.99908 | 0.04453 | 0.00002 | 0.02271 |\n| 11 | module.model.6.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08024 | 0.00252 | 0.06377 |\n| 12 | module.model.6.3.weight | (512, 256, 1, 1) | 131072 | 52429 | 0.00000 | 0.00000 | 23.82812 | 59.99985 | 14.25781 | 59.99985 | 0.03561 | -0.00057 | 0.01779 |\n| 13 | module.model.7.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11008 | -0.00018 | 0.06829 |\n| 14 | module.model.7.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 14.25781 | 59.99985 | 21.28906 | 59.99985 | 0.02944 | -0.00060 | 0.01515 |\n| 15 | module.model.8.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08258 | 0.00370 | 0.04905 |\n| 16 | module.model.8.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 21.28906 | 59.99985 | 28.51562 | 59.99985 | 0.02865 | -0.00046 | 0.01465 |\n| 17 | module.model.9.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07578 | 0.00468 | 0.04201 |\n| 18 | module.model.9.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 28.51562 | 59.99985 | 23.43750 | 59.99985 | 0.02939 | -0.00044 | 0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 
0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07091 | 0.00014 | 0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 24.60938 | 59.99985 | 20.89844 | 59.99985 | 0.03095 | -0.00059 | 0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.05729 | -0.00518 | 0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 20.89844 | 59.99985 | 17.57812 | 59.99985 | 0.03229 | -0.00044 | 0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.04981 | -0.00136 | 0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1) | 524288 | 209716 | 0.00000 | 0.00000 | 16.01562 | 59.99985 | 44.23828 | 59.99985 | 0.02514 | -0.00106 | 0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3) | 9216 | 9216 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.02396 | -0.00949 | 0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) | 1048576 | 419431 | 0.00000 | 0.00000 | 44.72656 | 59.99994 | 1.46484 | 59.99994 | 0.01801 | -0.00017 | 0.00931 |\n| 27 | module.fc.weight | (1000, 1024) | 1024000 | 409600 | 1.46484 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.05078 | 0.00271 | 0.02734 |\n| 28 | Total sparsity: | - | 4209088 | 1726917 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.97171 | 0.00000 | 0.00000 | 0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==> Top1: 65.337 Top5: 84.984 Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==> Top1: 68.810 Top5: 88.626 Loss: 1.282", + "location": 
"/model_zoo/index.html#results_1", + "text": "As our baseline we used a pretrained PyTorch MobileNet model (width=1) which has Top1=68.848 and Top5=88.740. \nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy. We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656). We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper. +----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n| | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0 | module.model.0.0.weight | (32, 3, 3, 3) | 864 | 864 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.14466 | 0.00103 | 0.06508 |\n| 1 | module.model.1.0.weight | (32, 1, 3, 3) | 288 | 288 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.32146 | 0.01020 | 0.12932 |\n| 2 | module.model.1.3.weight | (64, 32, 1, 1) | 2048 | 2048 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11942 | 0.00024 | 0.03627 |\n| 3 | module.model.2.0.weight | (64, 1, 3, 3) | 576 | 576 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.15809 | 0.00543 | 0.11513 |\n| 4 | module.model.2.3.weight | (128, 64, 1, 1) | 8192 | 8192 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08442 | -0.00031 | 0.04182 |\n| 5 | module.model.3.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.16780 | 0.00125 | 0.10545 |\n| 6 | 
module.model.3.3.weight | (128, 128, 1, 1) | 16384 | 16384 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07126 | -0.00197 | 0.04123 |\n| 7 | module.model.4.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.10182 | 0.00171 | 0.08719 |\n| 8 | module.model.4.3.weight | (256, 128, 1, 1) | 32768 | 13108 | 0.00000 | 0.00000 | 10.15625 | 59.99756 | 12.50000 | 59.99756 | 0.05543 | -0.00002 | 0.02760 |\n| 9 | module.model.5.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.12516 | -0.00288 | 0.08058 |\n| 10 | module.model.5.3.weight | (256, 256, 1, 1) | 65536 | 26215 | 0.00000 | 0.00000 | 12.50000 | 59.99908 | 23.82812 | 59.99908 | 0.04453 | 0.00002 | 0.02271 |\n| 11 | module.model.6.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08024 | 0.00252 | 0.06377 |\n| 12 | module.model.6.3.weight | (512, 256, 1, 1) | 131072 | 52429 | 0.00000 | 0.00000 | 23.82812 | 59.99985 | 14.25781 | 59.99985 | 0.03561 | -0.00057 | 0.01779 |\n| 13 | module.model.7.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11008 | -0.00018 | 0.06829 |\n| 14 | module.model.7.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 14.25781 | 59.99985 | 21.28906 | 59.99985 | 0.02944 | -0.00060 | 0.01515 |\n| 15 | module.model.8.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08258 | 0.00370 | 0.04905 |\n| 16 | module.model.8.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 21.28906 | 59.99985 | 28.51562 | 59.99985 | 0.02865 | -0.00046 | 0.01465 |\n| 17 | module.model.9.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07578 | 0.00468 | 0.04201 |\n| 18 | module.model.9.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 
28.51562 | 59.99985 | 23.43750 | 59.99985 | 0.02939 | -0.00044 | 0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07091 | 0.00014 | 0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 24.60938 | 59.99985 | 20.89844 | 59.99985 | 0.03095 | -0.00059 | 0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.05729 | -0.00518 | 0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 20.89844 | 59.99985 | 17.57812 | 59.99985 | 0.03229 | -0.00044 | 0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.04981 | -0.00136 | 0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1) | 524288 | 209716 | 0.00000 | 0.00000 | 16.01562 | 59.99985 | 44.23828 | 59.99985 | 0.02514 | -0.00106 | 0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3) | 9216 | 9216 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.02396 | -0.00949 | 0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) | 1048576 | 419431 | 0.00000 | 0.00000 | 44.72656 | 59.99994 | 1.46484 | 59.99994 | 0.01801 | -0.00017 | 0.00931 |\n| 27 | module.fc.weight | (1000, 1024) | 1024000 | 409600 | 1.46484 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.05078 | 0.00271 | 0.02734 |\n| 28 | Total sparsity: | - | 4209088 | 1726917 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.97171 | 0.00000 | 0.00000 | 0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n== Top1: 65.337 Top5: 84.984 Loss: 
1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n== Top1: 68.810 Top5: 88.626 Loss: 1.282", "title": "Results" - }, + }, { - "location": "/model_zoo/index.html#learning-structured-sparsity-in-deep-neural-networks", - "text": "This research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\" Note that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group. We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength. At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit. Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).", + "location": "/model_zoo/index.html#learning-structured-sparsity-in-deep-neural-networks", + "text": "This research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\" Note that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group. We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength. 
At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit. Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).", "title": "Learning Structured Sparsity in Deep Neural Networks" - }, + }, { - "location": "/model_zoo/index.html#baseline-training", - "text": "We started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model. Distiller schedule: distiller/examples/ssl/resnet20_cifar_baseline_training.yaml Checkpoint files: distiller/examples/ssl/checkpoints/ $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic", + "location": "/model_zoo/index.html#baseline-training", + "text": "We started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model. Distiller schedule: distiller/examples/ssl/resnet20_cifar_baseline_training.yaml Checkpoint files: distiller/examples/ssl/checkpoints/ $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic", "title": "Baseline training" - }, + }, { - "location": "/model_zoo/index.html#regularization", - "text": "Then we started training from scratch again, but this time we used Group Lasso regularization on entire layers: \nDistiller schedule: distiller/examples/ssl/ssl_4D-removal_4L_training.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic The diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10 baseline (in red). 
You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss ( Reg Loss ), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping. This regularization yields 5 layers with zeroed weight tensors. We load this model, remove the 5 layers, and start the fine tuning of the weights. This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path. When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated. We managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time. It's not bad, but we probably could have done better.", + "location": "/model_zoo/index.html#regularization", + "text": "Then we started training from scratch again, but this time we used Group Lasso regularization on entire layers: \nDistiller schedule: distiller/examples/ssl/ssl_4D-removal_4L_training.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic The diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10 baseline (in red). You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. 
The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss ( Reg Loss ), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping. This regularization yields 5 layers with zeroed weight tensors. We load this model, remove the 5 layers, and start the fine tuning of the weights. This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path. When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated. We managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time. It's not bad, but we probably could have done better.", "title": "Regularization" - }, + }, { - "location": "/model_zoo/index.html#fine-tuning", - "text": "During the fine-tuning process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network. \nWe copy the checkpoint file of the regularized model to checkpoint_trained_4D_regularized_5Lremoved.pth.tar . 
\nDistiller schedule: distiller/examples/ssl/ssl_4D-removal_finetuning.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml -j=1 --deterministic", + "location": "/model_zoo/index.html#fine-tuning", + "text": "During the fine-tuning process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network. \nWe copy the checkpoint file of the regularized model to checkpoint_trained_4D_regularized_5Lremoved.pth.tar . \nDistiller schedule: distiller/examples/ssl/ssl_4D-removal_finetuning.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml -j=1 --deterministic", "title": "Fine-tuning" - }, + }, { - "location": "/model_zoo/index.html#results_2", - "text": "Our baseline results for ResNet20 Cifar are: Top1=91.450 and Top5=99.750 We used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies. 
\nThe regularized model exhibits really poor classification abilities: $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=> loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n training=45000\n validation=5000\n test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==> Top1: 22.290 Top5: 68.940 Loss: 5.172 However, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670 We didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).", + "location": "/model_zoo/index.html#results_2", + "text": "Our baseline results for ResNet20 Cifar are: Top1=91.450 and Top5=99.750 We used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies. 
\nThe regularized model exhibits really poor classification abilities: $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n= loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n training=45000\n validation=5000\n test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n== Top1: 22.290 Top5: 68.940 Loss: 5.172 However, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670 We didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).", "title": "Results" - }, + }, { - "location": "/model_zoo/index.html#pruning-filters-for-efficient-convnets", - "text": "Quoting the authors directly: We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. 
Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications. The implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\". After performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml Checkpoint files: checkpoint_finetuned.pth.tar The excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner. This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning). pruners:\n filter_pruner:\n class: 'L1RankedStructureParameterPruner'\n reg_regims:\n 'module.layer1.0.conv1.weight': [0.6, '3D']\n 'module.layer1.1.conv1.weight': [0.6, '3D']\n 'module.layer1.2.conv1.weight': [0.6, '3D']\n 'module.layer1.3.conv1.weight': [0.6, '3D'] In the policy, we specify that we want to invoke this pruner once, at epoch 180. Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule. policies:\n - pruner:\n instance_name: filter_pruner\n epochs: [180] Following the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors. When we remove filters from Convolution layer n we need to perform several changes to the network:\n1. 
Shrink layer n 's weights tensor, leaving only the \"important\" filters.\n2. Configure layer n 's .out_channels member to its new, smaller, value.\n3. If a BN layer follows layer n , then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights. All of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180. We call this process \"network thinning\". extensions:\n net_thinner:\n class: 'ResnetCifarFilterRemover'\n thinning_func_str: resnet_cifar_remove_filters Network thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this. On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider. \nOur current implementation is specific to certain layers in ResNet and is a bit fragile. We will continue to improve and generalize this.", + "location": "/model_zoo/index.html#pruning-filters-for-efficient-convnets", + "text": "Quoting the authors directly: We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications. The implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\". 
After performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml Checkpoint files: checkpoint_finetuned.pth.tar The excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner. This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning). pruners:\n filter_pruner:\n class: 'L1RankedStructureParameterPruner'\n reg_regims:\n 'module.layer1.0.conv1.weight': [0.6, '3D']\n 'module.layer1.1.conv1.weight': [0.6, '3D']\n 'module.layer1.2.conv1.weight': [0.6, '3D']\n 'module.layer1.3.conv1.weight': [0.6, '3D'] In the policy, we specify that we want to invoke this pruner once, at epoch 180. Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule. policies:\n - pruner:\n instance_name: filter_pruner\n epochs: [180] Following the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors. When we remove filters from Convolution layer n we need to perform several changes to the network:\n1. Shrink layer n 's weights tensor, leaving only the \"important\" filters.\n2. Configure layer n 's .out_channels member to its new, smaller, value.\n3. If a BN layer follows layer n , then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. 
If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights. All of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180. We call this process \"network thinning\". extensions:\n net_thinner:\n class: 'ResnetCifarFilterRemover'\n thinning_func_str: resnet_cifar_remove_filters Network thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this. On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider. \nOur current implementation is specific to certain layers in ResNet and is a bit fragile. We will continue to improve and generalize this.", "title": "Pruning Filters for Efficient ConvNets" - }, + }, { - "location": "/model_zoo/index.html#baseline-training_1", - "text": "We started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml Checkpoint files: checkpoint.resnet56_cifar_baseline.pth.tar", + "location": "/model_zoo/index.html#baseline-training_1", + "text": "We started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml Checkpoint files: checkpoint.resnet56_cifar_baseline.pth.tar", "title": "Baseline training" - }, + }, { - "location": "/model_zoo/index.html#results_3", - "text": "We trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740. 
We used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760", + "location": "/model_zoo/index.html#results_3", + "text": "We trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740. We used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760", "title": "Results" - }, + }, { - "location": "/jupyter/index.html", - "text": "Jupyter environment\n\n\nThe Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results.\n\n\nEach notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.\n\n\nInstallation\n\n\nJupyter and its dependencies are included as part of the main \nrequirements.txt\n file, so there is no need for a dedicated installation step.\n\nHowever, to use the ipywidgets extension, you will need to enable it:\n\n\n$ jupyter nbextension enable --py widgetsnbextension --sys-prefix\n\n\n\n\nYou may want to refer to the \nipywidgets extension installation documentation\n.\n\n\nAnother extension which requires special installation handling is \nQgrid\n. Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering. To enable Qgrid:\n\n\n$ jupyter nbextension enable --py --sys-prefix qgrid\n\n\n\n\nLaunching the Jupyter server\n\n\nThere are all kinds of options to use when launching Jupyter which you can use. 
The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want.\n\nConsult the \nuser's guide\n for more details.\n\n\n$ jupyter-notebook --ip=* --no-browser\n\n\n\n\nUsing the Distiller notebooks\n\n\nThe Distiller Jupyter notebooks are located in the \ndistiller/jupyter\n directory.\n\nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter.\n\n\nRoughly, the notebooks can be divided into three categories.\n\n\nTheory\n\n\n\n\njupyter/L1-regularization.ipynb\n: Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity.\n\n\njupyter/alexnet_insights.ipynb\n: This notebook reviews and compares a couple of pruning sessions on Alexnet. We compare distributions, performance, statistics and show some visualizations of the weights tensors.\n\n\n\n\nPreparation for compression\n\n\n\n\njupyter/model_summary.ipynb\n: Begin by getting familiar with your model. Examine the sizes and properties of layers and connections. Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model.\n\n\njupyter/sensitivity_analysis.ipynb\n: If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave.\n\n\njupyter/interactive_lr_scheduler.ipynb\n: The learning rate decay policy affects pruning results, perhaps as much as it affects training results. 
Graph a few LR-decay policies to see how they behave.\n\n\njupyter/jupyter/agp_schedule.ipynb\n: If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.\n\n\n\n\nReviewing experiment results\n\n\n\n\njupyter/compare_executions.ipynb\n: This is a simple notebook to help you graphically compare the results of executions of several experiments.\n\n\njupyter/compression_insights.ipynb\n: This notebook is packed with code, tables and graphs to us understand the results of a compression session. Distiller provides \nsummaries\n, which are Pandas dataframes, which contain statistical information about you model. We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.", + "location": "/jupyter/index.html", + "text": "Jupyter environment\n\n\nThe Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results.\n\n\nEach notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.\n\n\nInstallation\n\n\nJupyter and its dependencies are included as part of the main \nrequirements.txt\n file, so there is no need for a dedicated installation step.\n\nHowever, to use the ipywidgets extension, you will need to enable it:\n\n\n$ jupyter nbextension enable --py widgetsnbextension --sys-prefix\n\n\n\n\nYou may want to refer to the \nipywidgets extension installation documentation\n.\n\n\nAnother extension which requires special installation handling is \nQgrid\n. Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering. To enable Qgrid:\n\n\n$ jupyter nbextension enable --py --sys-prefix qgrid\n\n\n\n\nLaunching the Jupyter server\n\n\nThere are all kinds of options to use when launching Jupyter which you can use. 
The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want.\n\nConsult the \nuser's guide\n for more details.\n\n\n$ jupyter-notebook --ip=* --no-browser\n\n\n\n\nUsing the Distiller notebooks\n\n\nThe Distiller Jupyter notebooks are located in the \ndistiller/jupyter\n directory.\n\nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter.\n\n\nRoughly, the notebooks can be divided into three categories.\n\n\nTheory\n\n\n\n\njupyter/L1-regularization.ipynb\n: Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity.\n\n\njupyter/alexnet_insights.ipynb\n: This notebook reviews and compares a couple of pruning sessions on Alexnet. We compare distributions, performance, statistics and show some visualizations of the weights tensors.\n\n\n\n\nPreparation for compression\n\n\n\n\njupyter/model_summary.ipynb\n: Begin by getting familiar with your model. Examine the sizes and properties of layers and connections. Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model.\n\n\njupyter/sensitivity_analysis.ipynb\n: If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave.\n\n\njupyter/interactive_lr_scheduler.ipynb\n: The learning rate decay policy affects pruning results, perhaps as much as it affects training results. 
Graph a few LR-decay policies to see how they behave.\n\n\njupyter/jupyter/agp_schedule.ipynb\n: If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.\n\n\n\n\nReviewing experiment results\n\n\n\n\njupyter/compare_executions.ipynb\n: This is a simple notebook to help you graphically compare the results of executions of several experiments.\n\n\njupyter/compression_insights.ipynb\n: This notebook is packed with code, tables and graphs to us understand the results of a compression session. Distiller provides \nsummaries\n, which are Pandas dataframes, which contain statistical information about you model. We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.", "title": "Jupyter notebooks" - }, + }, { - "location": "/jupyter/index.html#jupyter-environment", - "text": "The Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results. Each notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.", + "location": "/jupyter/index.html#jupyter-environment", + "text": "The Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results. Each notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.", "title": "Jupyter environment" - }, + }, { - "location": "/jupyter/index.html#installation", - "text": "Jupyter and its dependencies are included as part of the main requirements.txt file, so there is no need for a dedicated installation step. \nHowever, to use the ipywidgets extension, you will need to enable it: $ jupyter nbextension enable --py widgetsnbextension --sys-prefix You may want to refer to the ipywidgets extension installation documentation . 
Another extension which requires special installation handling is Qgrid . Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering. To enable Qgrid: $ jupyter nbextension enable --py --sys-prefix qgrid", + "location": "/jupyter/index.html#installation", + "text": "Jupyter and its dependencies are included as part of the main requirements.txt file, so there is no need for a dedicated installation step. \nHowever, to use the ipywidgets extension, you will need to enable it: $ jupyter nbextension enable --py widgetsnbextension --sys-prefix You may want to refer to the ipywidgets extension installation documentation . Another extension which requires special installation handling is Qgrid . Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering. To enable Qgrid: $ jupyter nbextension enable --py --sys-prefix qgrid", "title": "Installation" - }, + }, { - "location": "/jupyter/index.html#launching-the-jupyter-server", - "text": "There are all kinds of options to use when launching Jupyter which you can use. The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want. \nConsult the user's guide for more details. $ jupyter-notebook --ip=* --no-browser", + "location": "/jupyter/index.html#launching-the-jupyter-server", + "text": "There are all kinds of options to use when launching Jupyter which you can use. The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want. \nConsult the user's guide for more details. 
$ jupyter-notebook --ip=* --no-browser", "title": "Launching the Jupyter server" - }, + }, { - "location": "/jupyter/index.html#using-the-distiller-notebooks", - "text": "The Distiller Jupyter notebooks are located in the distiller/jupyter directory. \nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter. Roughly, the notebooks can be divided into three categories.", + "location": "/jupyter/index.html#using-the-distiller-notebooks", + "text": "The Distiller Jupyter notebooks are located in the distiller/jupyter directory. \nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter. Roughly, the notebooks can be divided into three categories.", "title": "Using the Distiller notebooks" - }, + }, { - "location": "/jupyter/index.html#theory", - "text": "jupyter/L1-regularization.ipynb : Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity. jupyter/alexnet_insights.ipynb : This notebook reviews and compares a couple of pruning sessions on Alexnet. We compare distributions, performance, statistics and show some visualizations of the weights tensors.", + "location": "/jupyter/index.html#theory", + "text": "jupyter/L1-regularization.ipynb : Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity. jupyter/alexnet_insights.ipynb : This notebook reviews and compares a couple of pruning sessions on Alexnet. 
We compare distributions, performance, statistics and show some visualizations of the weights tensors.", "title": "Theory" - }, + }, { - "location": "/jupyter/index.html#preparation-for-compression", - "text": "jupyter/model_summary.ipynb : Begin by getting familiar with your model. Examine the sizes and properties of layers and connections. Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model. jupyter/sensitivity_analysis.ipynb : If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave. jupyter/interactive_lr_scheduler.ipynb : The learning rate decay policy affects pruning results, perhaps as much as it affects training results. Graph a few LR-decay policies to see how they behave. jupyter/jupyter/agp_schedule.ipynb : If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.", + "location": "/jupyter/index.html#preparation-for-compression", + "text": "jupyter/model_summary.ipynb : Begin by getting familiar with your model. Examine the sizes and properties of layers and connections. Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model. jupyter/sensitivity_analysis.ipynb : If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave. jupyter/interactive_lr_scheduler.ipynb : The learning rate decay policy affects pruning results, perhaps as much as it affects training results. Graph a few LR-decay policies to see how they behave. 
jupyter/jupyter/agp_schedule.ipynb : If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.", "title": "Preparation for compression" - }, + }, { - "location": "/jupyter/index.html#reviewing-experiment-results", - "text": "jupyter/compare_executions.ipynb : This is a simple notebook to help you graphically compare the results of executions of several experiments. jupyter/compression_insights.ipynb : This notebook is packed with code, tables and graphs to us understand the results of a compression session. Distiller provides summaries , which are Pandas dataframes, which contain statistical information about you model. We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.", + "location": "/jupyter/index.html#reviewing-experiment-results", + "text": "jupyter/compare_executions.ipynb : This is a simple notebook to help you graphically compare the results of executions of several experiments. jupyter/compression_insights.ipynb : This notebook is packed with code, tables and graphs to us understand the results of a compression session. Distiller provides summaries , which are Pandas dataframes, which contain statistical information about you model. We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.", "title": "Reviewing experiment results" - }, + }, { - "location": "/design/index.html", - "text": "Distiller design\n\n\nDistiller is designed to be easily integrated into your own PyTorch research applications.\n\nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models (\ncompress_classifier.py\n).\n\n\nThe application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). 
We tried to keep it similar, in order to make it familiar and easy to understand.\n\n\nIntegrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training. The training skeleton looks like the pseudo code below. The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.\n\n\nFor each epoch:\n compression_scheduler.on_epoch_begin(epoch)\n train()\n validate()\n save_checkpoint()\n compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n For each training step:\n compression_scheduler.on_minibatch_begin(epoch)\n output = model(input_var)\n loss = criterion(output, target_var)\n compression_scheduler.before_backward_pass(epoch)\n loss.backward()\n optimizer.step()\n compression_scheduler.on_minibatch_end(epoch)\n\n\n\n\nThese callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's \nScheduler\n, which invokes the correct algorithm. The application also uses Distiller services to collect statistics in \nSummaries\n and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.\n\n\n\n\nSparsification and fine-tuning\n\n\n\n\nThe application sets up a model as normally done in PyTorch.\n\n\nAnd then instantiates a Scheduler and configures it:\n\n\nScheduler configuration is defined in a YAML file\n\n\nThe configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.\n\n\nSome types of algorithms control the actual sparsification of the model. 
Such types are \"pruner\" and \"regularizer\".\n\n\nSome algorithms control some parameter of the training process, such as the learning-rate decay scheduler (\nlr_scheduler\n).\n\n\nThe parameters of each algorithm are also specified in the configuration.\n\n\n\n\n\n\n\n\n\n\nIn addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.\n\n\nThe Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.\n\n\nThese callbacks are placed the training loop.\n\n\n\n\nQuantization\n\n\nA quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.\n\n\nIn Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided.\n\n\nWe also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. The high-level flow is as follows:\n\n\n\n\nDefine a \nmapping\n between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module.\n\n\nIterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.\n\n\nReplace the existing module with the module returned by the function.\n\n\n\n\nDifferent quantization methods may, obviously, use different quantized operations. 
In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different \nmapping\n will likely be defined.\n\n\nThis mechanism is exposed by the \nQuantizer\n class:\n\n\n\n\nQuantizer\n should be sub-classed for each quantization method.\n\n\nEach instance of \nQuantizer\n is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the \nbits_activations\n and \nbits_weights\n parameters in \nQuantizer\n's constructor. Sub-classes may define bit-widths for other tensor types as needed.\n\n\nWe also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.\n\n\nSo, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. 
This mapping is passed via the \nbits_overrides\n parameter in the constructor.\n\n\n\n\nThe base \nQuantizer\n class is implemented in \ndistiller/quantization/quantizer.py\n.\n\nFor a simple sub-class implementing symmetric linear quantization, see \nSymmetricLinearQuantizer\n in \ndistiller/quantization/range_linear.py\n.", + "location": "/design/index.html", + "text": "Distiller design\n\n\nDistiller is designed to be easily integrated into your own PyTorch research applications.\n\nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models (\ncompress_classifier.py\n).\n\n\nThe application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.\n\n\nIntegrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training. The training skeleton looks like the pseudo code below. The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.\n\n\nFor each epoch:\n compression_scheduler.on_epoch_begin(epoch)\n train()\n validate()\n save_checkpoint()\n compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n For each training step:\n compression_scheduler.on_minibatch_begin(epoch)\n output = model(input_var)\n loss = criterion(output, target_var)\n compression_scheduler.before_backward_pass(epoch)\n loss.backward()\n optimizer.step()\n compression_scheduler.on_minibatch_end(epoch)\n\n\n\n\nThese callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's \nScheduler\n, which invokes the correct algorithm. 
The application also uses Distiller services to collect statistics in \nSummaries\n and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.\n\n\n\n\nSparsification and fine-tuning\n\n\n\n\nThe application sets up a model as normally done in PyTorch.\n\n\nAnd then instantiates a Scheduler and configures it:\n\n\nScheduler configuration is defined in a YAML file\n\n\nThe configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.\n\n\nSome types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".\n\n\nSome algorithms control some parameter of the training process, such as the learning-rate decay scheduler (\nlr_scheduler\n).\n\n\nThe parameters of each algorithm are also specified in the configuration.\n\n\n\n\n\n\n\n\n\n\nIn addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.\n\n\nThe Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.\n\n\nThese callbacks are placed the training loop.\n\n\n\n\nQuantization\n\n\nA quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.\n\n\nIn Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. 
The user can write a quantized model from scratch, using the quantized operations provided.\n\n\nWe also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the \nQuantizer\n class. \nQuantizer\n should be sub-classed for each quantization method.\n\n\nModel Transformation\n\n\nThe high-level flow is as follows:\n\n\n\n\nDefine a \nmapping\n between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the \nreplacement_factory\n attribute of the \nQuantizer\n class.\n\n\nIterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.\n\n\nReplace the existing module with the module returned by the function. It is important to note that the \nname\n of the module \ndoes not\n change, as that could break the \nforward\n function of the parent module.\n\n\n\n\nDifferent quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different \nmapping\n will likely be defined.\n\nEach sub-class of \nQuantizer\n should populate the \nreplacement_factory\n dictionary attribute with the appropriate mapping.\n\n\nFlexible Bit-Widths\n\n\n\n\nEach instance of \nQuantizer\n is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the \nbits_activations\n and \nbits_weights\n parameters in \nQuantizer\n's constructor. 
Sub-classes may define bit-widths for other tensor types as needed.\n\n\nWe also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.\n\n\nSo, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the \nbits_overrides\n parameter in the constructor.\n\n\n\n\nWeights Quantization\n\n\nThe \nQuantizer\n class also provides an API to quantize the weights of all layers at once. To use it, the \nparam_quantization_fn\n attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the \nQuantizer\n class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the \nquantize_params\n function can be called, which will iterate over all parameters and quantize them using \nparams_quantization_fn\n.\n\n\nTraining with Quantization\n\n\nThe \nQuantizer\n class supports training with quantization in the loop, as described \nhere\n. This is enabled by setting \ntrain_with_fp_copy=True\n in the \nQuantizer\n constructor. At model transformation, in each module that has parameters that should be quantized, a new \ntorch.nn.Parameter\n is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module \nis not\n created. We preferred not to sub-class the existing PyTorch modules for this purpose. 
In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":\n\n\n\n\nThe existing \ntorch.nn.Parameter\n, e.g. \nweights\n, is replaced by a \ntorch.nn.Parameter\n named \nfloat_weight\n.\n\n\nTo maintain the existing functionality of the module, we then register a \nbuffer\n in the module with the original name - \nweights\n.\n\n\nDuring training, \nfloat_weight\n will be passed to \nparam_quantization_fn\n and the result will be stored in \nweight\n.\n\n\n\n\nThe base \nQuantizer\n class is implemented in \ndistiller/quantization/quantizer.py\n.\n\nFor a simple sub-class implementing symmetric linear quantization, see \nSymmetricLinearQuantizer\n in \ndistiller/quantization/range_linear.py\n. For examples of lower-precision methods using training with quantization see \nDorefaQuantizer\n and \nWRPNQuantizer\n in \ndistiller/quantization/clipped_linear.py", "title": "Design" - }, + }, { - "location": "/design/index.html#distiller-design", - "text": "Distiller is designed to be easily integrated into your own PyTorch research applications. \nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models ( compress_classifier.py ). The application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand. Integrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training. The training skeleton looks like the pseudo code below. The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler. 
For each epoch:\n compression_scheduler.on_epoch_begin(epoch)\n train()\n validate()\n save_checkpoint()\n compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n For each training step:\n compression_scheduler.on_minibatch_begin(epoch)\n output = model(input_var)\n loss = criterion(output, target_var)\n compression_scheduler.before_backward_pass(epoch)\n loss.backward()\n optimizer.step()\n compression_scheduler.on_minibatch_end(epoch) These callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's Scheduler , which invokes the correct algorithm. The application also uses Distiller services to collect statistics in Summaries and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.", + "location": "/design/index.html#distiller-design", + "text": "Distiller is designed to be easily integrated into your own PyTorch research applications. \nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models ( compress_classifier.py ). The application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand. Integrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training. The training skeleton looks like the pseudo code below. The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler. 
For each epoch:\n compression_scheduler.on_epoch_begin(epoch)\n train()\n validate()\n save_checkpoint()\n compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n For each training step:\n compression_scheduler.on_minibatch_begin(epoch)\n output = model(input_var)\n loss = criterion(output, target_var)\n compression_scheduler.before_backward_pass(epoch)\n loss.backward()\n optimizer.step()\n compression_scheduler.on_minibatch_end(epoch) These callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's Scheduler , which invokes the correct algorithm. The application also uses Distiller services to collect statistics in Summaries and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.", "title": "Distiller design" - }, + }, { - "location": "/design/index.html#sparsification-and-fine-tuning", - "text": "The application sets up a model as normally done in PyTorch. And then instantiates a Scheduler and configures it: Scheduler configuration is defined in a YAML file The configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training. Some types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\". Some algorithms control some parameter of the training process, such as the learning-rate decay scheduler ( lr_scheduler ). The parameters of each algorithm are also specified in the configuration. In addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency. The Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined. 
These callbacks are placed the training loop.", + "location": "/design/index.html#sparsification-and-fine-tuning", + "text": "The application sets up a model as normally done in PyTorch. And then instantiates a Scheduler and configures it: Scheduler configuration is defined in a YAML file The configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training. Some types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\". Some algorithms control some parameter of the training process, such as the learning-rate decay scheduler ( lr_scheduler ). The parameters of each algorithm are also specified in the configuration. In addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency. The Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined. These callbacks are placed the training loop.", "title": "Sparsification and fine-tuning" - }, + }, { - "location": "/design/index.html#quantization", - "text": "A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary. In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided. We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. 
The high-level flow is as follows: Define a mapping between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it. Replace the existing module with the module returned by the function. Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different mapping will likely be defined. This mechanism is exposed by the Quantizer class: Quantizer should be sub-classed for each quantization method. Each instance of Quantizer is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the bits_activations and bits_weights parameters in Quantizer 's constructor. Sub-classes may define bit-widths for other tensor types as needed. We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern. So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. 
This mapping is passed via the bits_overrides parameter in the constructor. The base Quantizer class is implemented in distiller/quantization/quantizer.py . \nFor a simple sub-class implementing symmetric linear quantization, see SymmetricLinearQuantizer in distiller/quantization/range_linear.py .", + "location": "/design/index.html#quantization", + "text": "A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary. In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided. We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the Quantizer class. Quantizer should be sub-classed for each quantization method.", "title": "Quantization" + }, + { + "location": "/design/index.html#model-transformation", + "text": "The high-level flow is as follows: Define a mapping between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the replacement_factory attribute of the Quantizer class. Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it. Replace the existing module with the module returned by the function. It is important to note that the name of the module does not change, as that could break the forward function of the parent module. Different quantization methods may, obviously, use different quantized operations. 
In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different mapping will likely be defined. \nEach sub-class of Quantizer should populate the replacement_factory dictionary attribute with the appropriate mapping.", + "title": "Model Transformation" + }, + { + "location": "/design/index.html#flexible-bit-widths", + "text": "Each instance of Quantizer is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the bits_activations and bits_weights parameters in Quantizer 's constructor. Sub-classes may define bit-widths for other tensor types as needed. We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern. So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using their exact name, or a group of layers via a regular expression. This mapping is passed via the bits_overrides parameter in the constructor.", + "title": "Flexible Bit-Widths" + }, + { + "location": "/design/index.html#weights-quantization", + "text": "The Quantizer class also provides an API to quantize the weights of all layers at once. To use it, the param_quantization_fn attribute needs to point to a function that accepts a tensor and the number of bits.
During model transformation, the Quantizer class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the quantize_params function can be called, which will iterate over all parameters and quantize them using param_quantization_fn .", + "title": "Weights Quantization" + }, + { + "location": "/design/index.html#training-with-quantization", + "text": "The Quantizer class supports training with quantization in the loop, as described here . This is enabled by setting train_with_fp_copy=True in the Quantizer constructor. At model transformation, in each module that has parameters that should be quantized, a new torch.nn.Parameter is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module is not created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to do this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\": The existing torch.nn.Parameter , e.g. weight , is replaced by a torch.nn.Parameter named float_weight . To maintain the existing functionality of the module, we then register a buffer in the module with the original name - weight . During training, float_weight will be passed to param_quantization_fn and the result will be stored in weight . The base Quantizer class is implemented in distiller/quantization/quantizer.py . \nFor a simple sub-class implementing symmetric linear quantization, see SymmetricLinearQuantizer in distiller/quantization/range_linear.py .
For examples of lower-precision methods using training with quantization see DorefaQuantizer and WRPNQuantizer in distiller/quantization/clipped_linear.py", + "title": "Training with Quantization" } ] } \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 2a7ed7825a5c6372a8fa6b5c519125699e5b35e4..2b0ecb38ecdc225b313d58ea361475e38e6a07a6 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,7 +4,7 @@ <url> <loc>/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -12,7 +12,7 @@ <url> <loc>/install/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -20,7 +20,7 @@ <url> <loc>/usage/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -28,7 +28,7 @@ <url> <loc>/schedule/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -37,19 +37,19 @@ <url> <loc>/pruning/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>/regularization/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>/quantization/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -59,13 +59,13 @@ <url> <loc>/algo_pruning/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>/algo_quantization/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -74,7 +74,7 @@ <url> <loc>/model_zoo/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -82,7 +82,7 @@ <url> 
<loc>/jupyter/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> @@ -90,7 +90,7 @@ <url> <loc>/design/index.html</loc> - <lastmod>2018-06-14</lastmod> + <lastmod>2018-06-22</lastmod> <changefreq>daily</changefreq> </url> diff --git a/docs/usage/index.html b/docs/usage/index.html index 11497b5f841ba1bfb81b4a705945ad74637f29c0..87a0ad604b0559ed4dbecca594835db088e27e87 100644 --- a/docs/usage/index.html +++ b/docs/usage/index.html @@ -75,7 +75,7 @@ <li><a class="toctree-l3" href="#performing-pruning-sensitivity-analysis">Performing pruning sensitivity analysis</a></li> - <li><a class="toctree-l3" href="#quantization">Quantization</a></li> + <li><a class="toctree-l3" href="#direct-quantization-without-training">"Direct" Quantization Without Training</a></li> <li><a class="toctree-l3" href="#summaries">Summaries</a></li> @@ -287,22 +287,27 @@ These example schedules (YAML) files, contain the command line that is used in o <li>ResNet34 (ImageNet)</li> <li>Filter-wise pruning sensitivity-analysis:</li> <li>ResNet20 (CIFAR10)</li> -<li>ResNet56 (CIFAR10)</li> +<li>ResNet56 (CIFAR10) +<br><br></li> </ul> </li> -<li> -<p><strong>examples/sensitivity-pruning</strong>:</p> -<ul> +<li><strong>examples/sensitivity-pruning</strong>:<ul> <li>AlexNet sensitivity pruning with Iterative Pruning</li> -<li>AlexNet sensitivity pruning with One-Shot Pruning</li> +<li>AlexNet sensitivity pruning with One-Shot Pruning +<br><br></li> </ul> </li> -<li> -<p><strong>examples/ssl</strong>:</p> -<ul> +<li><strong>examples/ssl</strong>:<ul> <li>ResNet20 baseline training (CIFAR10 dataset)</li> <li>Structured Sparsity Learning (SSL) with layer removal on ResNet20</li> -<li>SSL with channels removal on ResNet20</li> +<li>SSL with channels removal on ResNet20 +<br><br></li> +</ul> +</li> +<li><strong>examples/quantization</strong>:<ul> +<li>AlexNet w. 
Batch-Norm (base FP32 + DoReFa)</li> +<li>Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa)</li> +<li>Pre-activation ResNet18 on ImageNet (base FP32 + DoReFa)</li> </ul> </li> </ul> @@ -319,8 +324,8 @@ Results are output as a CSV file (<code>sensitivity.csv</code>) and PNG file (<c <p>The <code>sense</code> command-line argument can be set to either <code>element</code> or <code>filter</code>, depending on the type of analysis you want done.<br></p> <p>There is also a <a href="http://localhost:8888/notebooks/sensitivity_analysis.ipynb">Jupyter notebook</a> with example invocations, outputs and explanations.</p> -<h2 id="quantization">Quantization</h2> -<p>Currently Distiller support 8-bit quantization only (quantization of lower precision data types will follow shortly) which does not require training, so any model (whether pruned or not) can be quantized.<br> +<h2 id="direct-quantization-without-training">"Direct" Quantization Without Training</h2> +<p>Distiller supports 8-bit quantization of trained modules without re-training (using <a href="../algo_quantization/index.html#symmetric-linear-quantization">Symmetric Linear Quantization</a>). So, any model (whether pruned or not) can be quantized.<br /> Use the <code>--quantize</code> command-line flag, together with <code>--evaluate</code> to evaluate the accuracy of your model after quantization. The following example quantizes ResNet18 for ImageNet:</p> <pre><code>$ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize --evaluate </code></pre>
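For intuition about what "direct" 8-bit quantization does to a tensor, the idea behind Symmetric Linear Quantization can be sketched in plain Python. This is only an illustrative sketch, not Distiller's implementation: the function names `symmetric_linear_quantize` and `dequantize` are hypothetical, and the choice of deriving the scale factor from the largest absolute value and clamping to [-127, 127] is an assumption made for the example.

```python
def symmetric_linear_quantize(values, num_bits=8):
    """Quantize floats to signed integers with one scale factor and no offset.

    Hypothetical helper for illustration; not part of Distiller's API.
    """
    # Largest magnitude of the signed integer range used here, e.g. 127 for 8 bits.
    q_max = 2 ** (num_bits - 1) - 1
    # The float range is symmetric around zero, so only the largest magnitude matters.
    max_abs = max(abs(v) for v in values)
    # Assumption: the scale factor maps max_abs onto q_max; fall back to 1.0 for all-zero input.
    scale = q_max / max_abs if max_abs > 0 else 1.0
    # Multiply by the scale factor, round, and clamp to the symmetric integer range.
    quantized = [max(-q_max, min(q_max, round(v * scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.25, 1.0]
    q, scale = symmetric_linear_quantize(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

Because no offset is used, zero always maps exactly to zero, and the round trip through `dequantize` loses at most half a quantization step per value.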