From 8bcaaa53d123e36cfdec651060739d88df48bb59 Mon Sep 17 00:00:00 2001
From: Guy Jacob <guy.jacob@intel.com>
Date: Tue, 11 Dec 2018 10:50:11 +0200
Subject: [PATCH] Updated early-exit docs (from @haim-barad)

---
 docs-src/docs/algo_earlyexit.md   |   46 +-
 docs/algo_earlyexit/index.html    |   37 +-
 docs/algo_quantization/index.html |    2 +-
 docs/index.html                   |    2 +-
 docs/search/search_index.json     | 1006 +++++++++++++++--------------
 docs/sitemap.xml                  |   34 +-
 6 files changed, 592 insertions(+), 535 deletions(-)

diff --git a/docs-src/docs/algo_earlyexit.md b/docs-src/docs/algo_earlyexit.md
index 226d45e..b0cbf41 100755
--- a/docs-src/docs/algo_earlyexit.md
+++ b/docs-src/docs/algo_earlyexit.md
@@ -1,7 +1,9 @@
 # Early Exit Inference
+
 While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in [Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition](#panda) points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in [BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks](#branchynet) look at a selective approach to exit placement and criteria for exiting early.
 
 ## Why Does Early Exit Work?
+
 Early Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming weâ€™re confident of avoiding over-fitting the data), itâ€™s also clear that much of the data can be properly classified with even the simplest of classification boundaries.
 
 ![Figure !fig(boundaries): Simple and more expressive classification boundaries](imgs/decision_boundary.png)
@@ -9,32 +11,60 @@ Early Exit is a strategy with a straightforward and easy to understand concept F
 Data points far from the boundary can be considered "easy to classify" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is "difficult to classify" and require the full expressiveness of the neural network to accurately classify it.
 
 ## Example code for Early Exit
-Both CIFAR10 and ImageNet code comes directly from publically available examples from Pytorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.
 
+Both CIFAR10 and ImageNet code comes directly from publicly available examples from PyTorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.
+
+**Note:** the sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits.
 Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.
 
 Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.
 
+### Example command lines
+
+We have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is:
+
+```bash
+python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \
+    --lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \
+    --out-dir /home/ -n earlyexit /home/pcifar10
+```
+
+And an example command line for ImageNet is:
+
+```bash
+python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \
+    --lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \
+    -j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/
+```
+
 ### Heuristics
-The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more agressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.
 
-There are other benefits to adding exits in that training the modified network now has backpropagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.
+The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more aggressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.
+
+There are other benefits to adding exits in that training the modified network now has back-propagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.
+
+### Early Exit Hyper-Parameters
 
-### Early Exit Hyperparameters
 There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:
 
-1. **--earlyexit_thresholds** defines the
-thresholds for each of the early exits. The cross entropy measure must be **less than** the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify "--earlyexit_thresholds 0.9 1.2" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.
+1. **--earlyexit_thresholds** defines the thresholds for each of the early exits. The cross entropy measure must be **less than** the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify "--earlyexit_thresholds 0.9 1.2" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.
+
+12 **--earlyexit_lossweights** provide the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of "--earlyexit_lossweights 0.2 0.3" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.
+
+### Output Stats
 
-1. **--earlyexit_lossweights** provide the weights for the linear combination of losses during training to compute a signle, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of "--earlyexit_lossweights 0.2 0.3" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.
+The example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy should all of the data be forced out that exit (in order to compute the loss at that exit). During inference (i.e. validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit because the calculated entropy at that exit was lower than the specified threshold for that exit.
 
 ### CIFAR10
+
 In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.
 
 ### ImageNet
-This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic resnet code and could be used with other size resnets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.
+
+This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic ResNet code and could be used with other size ResNets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.
 
 ## References
+
 <div id="panda"></div> **Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy**.
     [*Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition*](https://arxiv.org/abs/1509.08971v6), arXiv:1509.08971v6, 2017.
 
diff --git a/docs/algo_earlyexit/index.html b/docs/algo_earlyexit/index.html
index a27fc57..4cbd2c5 100644
--- a/docs/algo_earlyexit/index.html
+++ b/docs/algo_earlyexit/index.html
@@ -203,29 +203,46 @@
 <p><img alt="Figure !fig(boundaries): Simple and more expressive classification boundaries" src="../imgs/decision_boundary.png" /></p>
 <p>Data points far from the boundary can be considered "easy to classify" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is "difficult to classify" and require the full expressiveness of the neural network to accurately classify it.</p>
 <h2 id="example-code-for-early-exit">Example code for Early Exit</h2>
-<p>Both CIFAR10 and ImageNet code comes directly from publically available examples from Pytorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.</p>
-<p>Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.</p>
+<p>Both CIFAR10 and ImageNet code comes directly from publicly available examples from PyTorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.</p>
+<p><strong>Note:</strong> the sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits.
+Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.</p>
 <p>Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.</p>
+<h3 id="example-command-lines">Example command lines</h3>
+<p>We have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is:</p>
+<pre><code class="bash">python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \
+    --lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \
+    --out-dir /home/ -n earlyexit /home/pcifar10
+</code></pre>
+
+<p>And an example command line for ImageNet is:</p>
+<pre><code class="bash">python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \
+    --lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \
+    -j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/
+</code></pre>
+
 <h3 id="heuristics">Heuristics</h3>
-<p>The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more agressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.</p>
-<p>There are other benefits to adding exits in that training the modified network now has backpropagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.</p>
-<h3 id="early-exit-hyperparameters">Early Exit Hyperparameters</h3>
+<p>The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more aggressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.</p>
+<p>There are other benefits to adding exits in that training the modified network now has back-propagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.</p>
+<h3 id="early-exit-hyper-parameters">Early Exit Hyper-Parameters</h3>
 <p>There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:</p>
 <ol>
 <li>
-<p><strong>--earlyexit_thresholds</strong> defines the
-thresholds for each of the early exits. The cross entropy measure must be <strong>less than</strong> the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify "--earlyexit_thresholds 0.9 1.2" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.</p>
+<p><strong>--earlyexit_thresholds</strong> defines the thresholds for each of the early exits. The cross entropy measure must be <strong>less than</strong> the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify "--earlyexit_thresholds 0.9 1.2" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.</p>
 </li>
 <li>
-<p><strong>--earlyexit_lossweights</strong> provide the weights for the linear combination of losses during training to compute a signle, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of "--earlyexit_lossweights 0.2 0.3" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.</p>
+<p><strong>--earlyexit_lossweights</strong> provide the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of "--earlyexit_lossweights 0.2 0.3" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.</p>
 </li>
 </ol>
+<h3 id="output-stats">Output Stats</h3>
+<p>The example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy should all of the data be forced out that exit (in order to compute the loss at that exit). During inference (i.e. validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit because the calculated entropy at that exit was lower than the specified threshold for that exit.</p>
 <h3 id="cifar10">CIFAR10</h3>
 <p>In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.</p>
 <h3 id="imagenet">ImageNet</h3>
-<p>This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic resnet code and could be used with other size resnets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.</p>
+<p>This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic ResNet code and could be used with other size ResNets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.</p>
 <h2 id="references">References</h2>
-<p><div id="panda"></div> <strong>Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy</strong>.
+<div id="panda"></div>
+
+<p><strong>Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy</strong>.
     <a href="https://arxiv.org/abs/1509.08971v6"><em>Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition</em></a>, arXiv:1509.08971v6, 2017.</p>
 <div id="branchynet"></div>
 
diff --git a/docs/algo_quantization/index.html b/docs/algo_quantization/index.html
index bc0acc4..c6192cc 100644
--- a/docs/algo_quantization/index.html
+++ b/docs/algo_quantization/index.html
@@ -205,7 +205,7 @@ For any of the methods below that require quantization-aware training, please se
 <p>Let's break down the terminology we use here:</p>
 <ul>
 <li><strong>Linear:</strong> Means a float value is quantized by multiplying with a numeric constant (the <strong>scale factor</strong>).</li>
-<li><strong>Range-Based:</strong>: Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call <strong>clipping-based</strong>, as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).</li>
+<li><strong>Range-Based:</strong> Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call <strong>clipping-based</strong>, as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).</li>
 </ul>
 <h3 id="asymmetric-vs-symmetric">Asymmetric vs. Symmetric</h3>
 <p>In this method we can use two modes - <strong>asymmetric</strong> and <strong>symmetric</strong>.</p>
diff --git a/docs/index.html b/docs/index.html
index 543fd3b..e6b76cc 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -273,5 +273,5 @@ And of course, if we used a sparse or compressed representation, then we are red
 
 <!--
 MkDocs version : 0.17.2
-Build Date UTC : 2018-12-06 14:40:20
+Build Date UTC : 2018-12-11 08:46:53
 -->
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
index 7aa6b12..303dffb 100644
--- a/docs/search/search_index.json
+++ b/docs/search/search_index.json
@@ -1,833 +1,843 @@
 {
     "docs": [
         {
-            "location": "/index.html",
-            "text": "Distiller Documentation\n\n\nWhat is Distiller\n\n\nDistiller\n is an open-source Python package for neural network compression research.\n\n\nNetwork compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a \nPyTorch\n environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic.\n\n\nDistiller contains:\n\n\n\n\nA framework for integrating pruning, regularization and quantization algorithms.\n\n\nA set of tools for analyzing and evaluating compression performance.\n\n\nExample implementations of state-of-the-art compression algorithms.\n\n\n\n\nMotivation\n\n\nA sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros.  A sparse neural network performs computations using some sparse tensors (preferably many).  These tensors can be parameters (weights and biases) or activations (feature maps).\n\n\nWhy do we care about sparsity?\n\nPresent day neural networks tend to be deep, with millions of weights and activations.  Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time.  You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction.  We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition.  So inference latency is often something we want to minimize.\n\n\nLarge models are also memory-intensive with millions of parameters.  Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment.  Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics.  In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt).\n\n\nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times.\n\n\nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required.  Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency.  \n\n\nSmaller\n\n\nSparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros.  The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed).  The compute hardware needs to support the compressions formats, for representation compression to be meaningful.  Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses.  Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse.  In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size.  The best we can do is bring the compressed representation as close as possible to the compute engine.\n\nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation).  This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines.  Therefore, there is a third class of representations, which take advantage of specific hardware characteristics.  For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).\n\n\nFaster\n\n\nMany of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations.  Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations.\n\nReducing the bandwidth required by these layers, will immediately speed them up.\n\nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy.  Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.\n\n\nMore energy efficient\n\n\nBecause we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy.  Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and off course reduce power consumption.\n\nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.",
+            "location": "/index.html", 
+            "text": "Distiller Documentation\n\n\nWhat is Distiller\n\n\nDistiller\n is an open-source Python package for neural network compression research.\n\n\nNetwork compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a \nPyTorch\n environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic.\n\n\nDistiller contains:\n\n\n\n\nA framework for integrating pruning, regularization and quantization algorithms.\n\n\nA set of tools for analyzing and evaluating compression performance.\n\n\nExample implementations of state-of-the-art compression algorithms.\n\n\n\n\nMotivation\n\n\nA sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros.  A sparse neural network performs computations using some sparse tensors (preferably many).  These tensors can be parameters (weights and biases) or activations (feature maps).\n\n\nWhy do we care about sparsity?\n\nPresent day neural networks tend to be deep, with millions of weights and activations.  Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time.  You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction.  We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition.  So inference latency is often something we want to minimize.\n\n\nLarge models are also memory-intensive with millions of parameters.  Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment.  Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics.  In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt).\n\n\nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times.\n\n\nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required.  Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency.  \n\n\nSmaller\n\n\nSparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros.  The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed).  The compute hardware needs to support the compressions formats, for representation compression to be meaningful.  Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses.  Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse.  In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size.  The best we can do is bring the compressed representation as close as possible to the compute engine.\n\nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation).  This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines.  Therefore, there is a third class of representations, which take advantage of specific hardware characteristics.  For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).\n\n\nFaster\n\n\nMany of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations.  Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations.\n\nReducing the bandwidth required by these layers, will immediately speed them up.\n\nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy.  Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.\n\n\nMore energy efficient\n\n\nBecause we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy.  Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and off course reduce power consumption.\n\nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.", 
             "title": "Home"
-        },
+        }, 
         {
-            "location": "/index.html#distiller-documentation",
-            "text": "",
+            "location": "/index.html#distiller-documentation", 
+            "text": "", 
             "title": "Distiller Documentation"
-        },
+        }, 
         {
-            "location": "/index.html#what-is-distiller",
-            "text": "Distiller  is an open-source Python package for neural network compression research.  Network compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a  PyTorch  environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic.  Distiller contains:   A framework for integrating pruning, regularization and quantization algorithms.  A set of tools for analyzing and evaluating compression performance.  Example implementations of state-of-the-art compression algorithms.",
+            "location": "/index.html#what-is-distiller", 
+            "text": "Distiller  is an open-source Python package for neural network compression research.  Network compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a  PyTorch  environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic.  Distiller contains:   A framework for integrating pruning, regularization and quantization algorithms.  A set of tools for analyzing and evaluating compression performance.  Example implementations of state-of-the-art compression algorithms.", 
             "title": "What is Distiller"
-        },
+        }, 
         {
-            "location": "/index.html#motivation",
-            "text": "A sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros.  A sparse neural network performs computations using some sparse tensors (preferably many).  These tensors can be parameters (weights and biases) or activations (feature maps).  Why do we care about sparsity? \nPresent day neural networks tend to be deep, with millions of weights and activations.  Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time.  You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction.  We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition.  So inference latency is often something we want to minimize. \nLarge models are also memory-intensive with millions of parameters.  Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment.  Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics.  In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt). \nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times. \nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required.  Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency.",
+            "location": "/index.html#motivation", 
+            "text": "A sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros.  A sparse neural network performs computations using some sparse tensors (preferably many).  These tensors can be parameters (weights and biases) or activations (feature maps).  Why do we care about sparsity? \nPresent day neural networks tend to be deep, with millions of weights and activations.  Refer to GoogLeNet or ResNet50, for a couple of examples.\nThese large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time.  You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction.  We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition.  So inference latency is often something we want to minimize. \nLarge models are also memory-intensive with millions of parameters.  Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment.  Data center server-racks are limited by their power-envelope and their ToC (total cost of ownership) is correlated to their power consumption and thermal characteristics.  In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery.\nInference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt). \nThe storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times. \nFor these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required.  Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method).\nSparse neural networks hold the promise of speed, small size, and energy efficiency.", 
             "title": "Motivation"
-        },
+        }, 
         {
-            "location": "/index.html#smaller",
-            "text": "Sparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros.  The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed).  The compute hardware needs to support the compressions formats, for representation compression to be meaningful.  Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses.  Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse.  In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size.  The best we can do is bring the compressed representation as close as possible to the compute engine. \nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation).  This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines.  Therefore, there is a third class of representations, which take advantage of specific hardware characteristics.  For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).",
+            "location": "/index.html#smaller", 
+            "text": "Sparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros.  The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed).  The compute hardware needs to support the compressions formats, for representation compression to be meaningful.  Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses.  Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse.  In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size.  The best we can do is bring the compressed representation as close as possible to the compute engine. \nSparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation).  This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines.  Therefore, there is a third class of representations, which take advantage of specific hardware characteristics.  For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).", 
             "title": "Smaller"
-        },
+        }, 
         {
-            "location": "/index.html#faster",
-            "text": "Many of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations.  Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations. \nReducing the bandwidth required by these layers, will immediately speed them up. \nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy.  Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.",
+            "location": "/index.html#faster", 
+            "text": "Many of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations.  Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations. \nReducing the bandwidth required by these layers, will immediately speed them up. \nSome pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy.  Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.", 
             "title": "Faster"
-        },
+        }, 
         {
-            "location": "/index.html#more-energy-efficient",
-            "text": "Because we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy.  Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and off course reduce power consumption. \nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.",
+            "location": "/index.html#more-energy-efficient", 
+            "text": "Because we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy.  Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and off course reduce power consumption. \nAnd of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.", 
             "title": "More energy efficient"
-        },
+        }, 
         {
-            "location": "/install/index.html",
-            "text": "Distiller Installation\n\n\nThese instructions will help get Distiller up and running on your local machine.\n\n\nYou may also want to refer to these resources:\n\n\n\n\nDataset installation\n instructions.\n\n\nJupyter installation\n instructions.\n\n\n\n\nNotes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.\n\n\nClone Distiller\n\n\nClone the Distiller code repository from github:\n\n\n$ git clone https://github.com/NervanaSystems/distiller.git\n\n\n\n\nThe rest of the documentation that follows, assumes that you have cloned your repository to a directory called \ndistiller\n. \n\n\nCreate a Python virtual environment\n\n\nWe recommend using a \nPython virtual environment\n, but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness.\n\nBefore creating the virtual environment, make sure you are located in directory \ndistiller\n.  After creating the environment, you should see a directory called \ndistiller/env\n.\n\n\n\nUsing virtualenv\n\n\nIf you don't have virtualenv installed, you can find the installation instructions \nhere\n.\n\n\nTo create the environment, execute:\n\n\n$ python3 -m virtualenv env\n\n\n\n\nThis creates a subdirectory named \nenv\n where the python virtual environment is stored, and configures the current shell to use it as the default python environment.\n\n\nUsing venv\n\n\nIf you prefer to use \nvenv\n, then begin by installing it:\n\n\n$ sudo apt-get install python3-venv\n\n\n\n\nThen create the environment:\n\n\n$ python3 -m venv env\n\n\n\n\nAs with virtualenv, this creates a directory called \ndistiller/env\n.\n\n\nActivate the environment\n\n\nThe environment activation and deactivation commands for \nvenv\n and \nvirtualenv\n are the same.\n\n\n!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:\n\n\n$ source env/bin/activate\n\n\n\n\nInstall dependencies\n\n\nFinally, install Distiller's dependency packages using \npip3\n:\n\n\n$ pip3 install -r requirements.txt\n\n\n\n\nPyTorch is included in the \nrequirements.txt\n file, and will currently download PyTorch version 3.1 for CUDA 8.0.  This is the setup we've used for testing Distiller.",
+            "location": "/install/index.html", 
+            "text": "Distiller Installation\n\n\nThese instructions will help get Distiller up and running on your local machine.\n\n\nYou may also want to refer to these resources:\n\n\n\n\nDataset installation\n instructions.\n\n\nJupyter installation\n instructions.\n\n\n\n\nNotes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.\n\n\nClone Distiller\n\n\nClone the Distiller code repository from github:\n\n\n$ git clone https://github.com/NervanaSystems/distiller.git\n\n\n\n\nThe rest of the documentation that follows, assumes that you have cloned your repository to a directory called \ndistiller\n. \n\n\nCreate a Python virtual environment\n\n\nWe recommend using a \nPython virtual environment\n, but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness.\n\nBefore creating the virtual environment, make sure you are located in directory \ndistiller\n.  After creating the environment, you should see a directory called \ndistiller/env\n.\n\n\n\nUsing virtualenv\n\n\nIf you don't have virtualenv installed, you can find the installation instructions \nhere\n.\n\n\nTo create the environment, execute:\n\n\n$ python3 -m virtualenv env\n\n\n\n\nThis creates a subdirectory named \nenv\n where the python virtual environment is stored, and configures the current shell to use it as the default python environment.\n\n\nUsing venv\n\n\nIf you prefer to use \nvenv\n, then begin by installing it:\n\n\n$ sudo apt-get install python3-venv\n\n\n\n\nThen create the environment:\n\n\n$ python3 -m venv env\n\n\n\n\nAs with virtualenv, this creates a directory called \ndistiller/env\n.\n\n\nActivate the environment\n\n\nThe environment activation and deactivation commands for \nvenv\n and \nvirtualenv\n are the same.\n\n\n!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:\n\n\n$ source env/bin/activate\n\n\n\n\nInstall dependencies\n\n\nFinally, install Distiller's dependency packages using \npip3\n:\n\n\n$ pip3 install -r requirements.txt\n\n\n\n\nPyTorch is included in the \nrequirements.txt\n file, and will currently download PyTorch version 3.1 for CUDA 8.0.  This is the setup we've used for testing Distiller.", 
             "title": "Installation"
-        },
+        }, 
         {
-            "location": "/install/index.html#distiller-installation",
-            "text": "These instructions will help get Distiller up and running on your local machine.  You may also want to refer to these resources:   Dataset installation  instructions.  Jupyter installation  instructions.   Notes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.",
+            "location": "/install/index.html#distiller-installation", 
+            "text": "These instructions will help get Distiller up and running on your local machine.  You may also want to refer to these resources:   Dataset installation  instructions.  Jupyter installation  instructions.   Notes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.", 
             "title": "Distiller Installation"
-        },
+        }, 
         {
-            "location": "/install/index.html#clone-distiller",
-            "text": "Clone the Distiller code repository from github:  $ git clone https://github.com/NervanaSystems/distiller.git  The rest of the documentation that follows, assumes that you have cloned your repository to a directory called  distiller .",
+            "location": "/install/index.html#clone-distiller", 
+            "text": "Clone the Distiller code repository from github:  $ git clone https://github.com/NervanaSystems/distiller.git  The rest of the documentation that follows, assumes that you have cloned your repository to a directory called  distiller .", 
             "title": "Clone Distiller"
-        },
+        }, 
         {
-            "location": "/install/index.html#create-a-python-virtual-environment",
-            "text": "We recommend using a  Python virtual environment , but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness. \nBefore creating the virtual environment, make sure you are located in directory  distiller .  After creating the environment, you should see a directory called  distiller/env .",
+            "location": "/install/index.html#create-a-python-virtual-environment", 
+            "text": "We recommend using a  Python virtual environment , but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness. \nBefore creating the virtual environment, make sure you are located in directory  distiller .  After creating the environment, you should see a directory called  distiller/env .", 
             "title": "Create a Python virtual environment"
-        },
+        }, 
         {
-            "location": "/install/index.html#using-virtualenv",
-            "text": "If you don't have virtualenv installed, you can find the installation instructions  here .  To create the environment, execute:  $ python3 -m virtualenv env  This creates a subdirectory named  env  where the python virtual environment is stored, and configures the current shell to use it as the default python environment.",
+            "location": "/install/index.html#using-virtualenv", 
+            "text": "If you don't have virtualenv installed, you can find the installation instructions  here .  To create the environment, execute:  $ python3 -m virtualenv env  This creates a subdirectory named  env  where the python virtual environment is stored, and configures the current shell to use it as the default python environment.", 
             "title": "Using virtualenv"
-        },
+        }, 
         {
-            "location": "/install/index.html#using-venv",
-            "text": "If you prefer to use  venv , then begin by installing it:  $ sudo apt-get install python3-venv  Then create the environment:  $ python3 -m venv env  As with virtualenv, this creates a directory called  distiller/env .",
+            "location": "/install/index.html#using-venv", 
+            "text": "If you prefer to use  venv , then begin by installing it:  $ sudo apt-get install python3-venv  Then create the environment:  $ python3 -m venv env  As with virtualenv, this creates a directory called  distiller/env .", 
             "title": "Using venv"
-        },
+        }, 
         {
-            "location": "/install/index.html#activate-the-environment",
-            "text": "The environment activation and deactivation commands for  venv  and  virtualenv  are the same.  !NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:  $ source env/bin/activate",
+            "location": "/install/index.html#activate-the-environment", 
+            "text": "The environment activation and deactivation commands for  venv  and  virtualenv  are the same.  !NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:  $ source env/bin/activate", 
             "title": "Activate the environment"
-        },
+        }, 
         {
-            "location": "/install/index.html#install-dependencies",
-            "text": "Finally, install Distiller's dependency packages using  pip3 :  $ pip3 install -r requirements.txt  PyTorch is included in the  requirements.txt  file, and will currently download PyTorch version 3.1 for CUDA 8.0.  This is the setup we've used for testing Distiller.",
+            "location": "/install/index.html#install-dependencies", 
+            "text": "Finally, install Distiller's dependency packages using  pip3 :  $ pip3 install -r requirements.txt  PyTorch is included in the  requirements.txt  file, and will currently download PyTorch version 3.1 for CUDA 8.0.  This is the setup we've used for testing Distiller.", 
             "title": "Install dependencies"
-        },
+        }, 
         {
-            "location": "/usage/index.html",
-            "text": "Using the sample application\n\n\nThe Distiller repository contains a sample application, \ndistiller/examples/classifier_compression/compress_classifier.py\n, and a set of scheduling files which demonstrate Distiller's features.  Following is a brief discussion of how to use this application and the accompanying schedules.\n\n\nYou might also want to refer to the following resources:\n\n\n\n\nAn \nexplanation\n of the scheduler file format.\n\n\nAn in-depth \ndiscussion\n of how we used these schedule files to implement several state-of-the-art DNN compression research papers.\n\n\n\n\nThe sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application.  The code is documented and should be considered the best source of documentation, but we provide some elaboration here.\n\n\nThis diagram shows how where \ncompress_classifier.py\n fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.\n\n\n\nCommand line arguments\n\n\nTo get help on the command line arguments, invoke:\n\n\n$ python3 compress_classifier.py --help\n\n\n\n\nFor example:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n |    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n |  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n |  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n |  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n |  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n |  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n |  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n |  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n |  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n |  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][   50/  500]    Loss 2.175988    Top1 51.289063    Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][  100/  500]    Loss 2.171564    Top1 51.175781    Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][  150/  500]    Loss 2.159347    Top1 51.546875    Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][  200/  500]    Loss 2.156857    Top1 51.585938    Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][  250/  500]    Loss 2.152790    Top1 51.707813    Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][  300/  500]    Loss 2.149962    Top1 51.791667    Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][  350/  500]    Loss 2.150936    Top1 51.827009    Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][  400/  500]    Loss 2.150853    Top1 51.781250    Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][  450/  500]    Loss 2.150156    Top1 51.828125    Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][  500/  500]    Loss 2.150417    Top1 51.833594    Top5 74.817187\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [   50/  195]    Loss 1.487607    Top1 63.273438    Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [  100/  195]    Loss 1.638043    Top1 60.636719    Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [  150/  195]    Loss 1.833214    Top1 57.619792    Top5 80.447917\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606    Top5: 79.446    Loss: 1.893\n\n\n\n\nLet's look at the command line again:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\n\n\nIn this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration:\n\n\n\n\nLearning-rate of 0.005\n\n\nPrint progress every 50 mini-batches.\n\n\nUse 44 worker threads to load data (make sure to use something suitable for your machine).\n\n\nRun for 90 epochs.  Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0.  When you train and prune your own networks, the last training epoch is saved as a metadata with the model.  Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch.\n\n\nThe pruning schedule is provided in \nalexnet.schedule_sensitivity.yaml\n\n\nLog files are written to directory \nlogs\n.\n\n\n\n\nExamples\n\n\nDistiller comes with several example schedules which can be used together with \ncompress_classifier.py\n.\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization.  The results usually contain a table showing the sparsity of  each of the model parameters, together with the validation and test top1, top5 and loss scores.\n\n\nFor more details on the example schedules, you can refer to the coverage of the \nModel Zoo\n.\n\n\n\n\nexamples/agp-pruning\n:\n\n\nAutomated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset)\n\n\n\n\n\n\n\nexamples/hybrid\n:\n\n\nAlexNet AGP with 2D (kernel) regularization (ImageNet dataset)\n\n\nAlexNet sensitivity pruning with 2D regularization\n\n\n\n\n\n\n\nexamples/network_slimming\n:\n\n\nResNet20 Network Slimming (this is work-in-progress)\n\n\n\n\n\n\n\nexamples/pruning_filters_for_efficient_convnets\n:\n\n\nResNet56 baseline training (CIFAR10 dataset)\n\n\nResNet56 filter removal using filter ranking\n\n\n\n\n\n\n\nexamples/sensitivity_analysis\n:\n\n\nElement-wise pruning sensitivity-analysis:\n\n\nAlexNet (ImageNet)\n\n\nMobileNet (ImageNet)\n\n\nResNet18 (ImageNet)\n\n\nResNet20 (CIFAR10)\n\n\nResNet34 (ImageNet)\n\n\nFilter-wise pruning sensitivity-analysis:\n\n\nResNet20 (CIFAR10)\n\n\nResNet56 (CIFAR10)\n\n\n\n\n\n\n\nexamples/sensitivity-pruning\n:\n\n\nAlexNet sensitivity pruning with Iterative Pruning\n\n\nAlexNet sensitivity pruning with One-Shot Pruning\n\n\n\n\n\n\n\nexamples/ssl\n:\n\n\nResNet20 baseline training (CIFAR10 dataset)\n\n\nStructured Sparsity Learning (SSL) with layer removal on ResNet20\n\n\nSSL with channels removal on ResNet20\n\n\n\n\n\n\n\nexamples/quantization\n:\n\n\nAlexNet w. Batch-Norm (base FP32 + DoReFa)\n\n\nPre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa)\n\n\nPre-activation ResNet18 on ImageNEt (base FP32 + DoReFa)\n\n\n\n\n\n\n\n\nExperiment reproducibility\n\n\nExperiment reproducibility is sometimes important.  Pete Warden recently expounded about this in his \nblog\n.\n\nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs.  Using the \n--deterministic\n command-line flag and setting \nj=1\n will produce reproducible results (for the same PyTorch version).\n\n\nPerforming pruning sensitivity analysis\n\n\nDistiller supports element-wise and filter-wise pruning sensitivity analysis.  In both cases, L1-norm is used to rank which elements or filters to prune.  For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero.  \n\nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor.  Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results.\n\nResults are output as a CSV file (\nsensitivity.csv\n) and PNG file (\nsensitivity.png\n).  The implementation is in \ndistiller/sensitivity.py\n and it contains further details about process and the format of the CSV file.\n\n\nThe example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10:\n\n\n$ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element\n\n\n\n\nThe \nsense\n command-line argument can be set to either \nelement\n or \nfilter\n, depending on the type of analysis you want done.\n\n\nThere is also a \nJupyter notebook\n with example invocations, outputs and explanations.\n\n\nPost-Training Quantization\n\n\nDistiller supports post-training quantization of trained modules without re-training (using \nRange-Based Linear Quantization\n). So, any model (whether pruned or not) can be quantized. To invoke post-training quantization, use \n--quantize-eval\n along with \n--evaluate\n. Additional arguments are available to control parameters of the quantization:\n\n\nArguments controlling quantization at evaluation time(\"post-training quantization\"):\n  --quantize-eval, --qe\n                        Apply linear quantization to model before evaluation.\n                        Applicable only if --evaluate is also set\n  --qe-mode QE_MODE, --qem QE_MODE\n                        Linear quantization mode. Choices: asym_s | asym_u |\n                        sym\n  --qe-bits-acts NUM_BITS, --qeba NUM_BITS\n                        Number of bits for quantization of activations\n  --qe-bits-wts NUM_BITS, --qebw NUM_BITS\n                        Number of bits for quantization of weights\n  --qe-bits-accum NUM_BITS\n                        Number of bits for quantization of the accumulator\n  --qe-clip-acts, --qeca\n                        Enable clipping of activations using min/max values\n                        averaging over batch\n  --qe-no-clip-layers LAYER_NAME [LAYER_NAME ...], --qencl LAYER_NAME [LAYER_NAME ...]\n                        List of fully-qualified layer names for which not to\n                        clip activations. Applicable only if --qe-clip-acts is\n                        also set\n  --qe-per-channel, --qepc\n                        Enable per-channel quantization of weights (per output channel)\n\n\n\n\n\nThe following example qunatizes ResNet18 for ImageNet:\n\n\n$ python3 compress_classifier.py -a resnet18 ../../../data.imagenet  --pretrained --quantize-eval --evaluate\n\n\n\n\nA checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are stored as well in each quantized layer.\n\n\nFor more examples of post-training quantization see \nhere\n\n\nSummaries\n\n\nYou can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-).\n\n\n$ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute\n\n\n\n\nGenerates:\n\n\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n|    | Name                         | Type   | Attrs    | IFM             |   IFM volume | OFM             |   OFM volume |   Weights volume |    MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n|  0 | module.conv1                 | Conv2d | k=(3, 3) | (1, 3, 32, 32)  |         3072 | (1, 16, 32, 32) |        16384 |              432 |  442368 |\n|  1 | module.layer1.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  2 | module.layer1.0.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  3 | module.layer1.1.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  4 | module.layer1.1.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  5 | module.layer1.2.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  6 | module.layer1.2.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  7 | module.layer2.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |             4608 | 1179648 |\n|  8 | module.layer2.0.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n|  9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |              512 |  131072 |\n| 10 | module.layer2.1.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 11 | module.layer2.1.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 12 | module.layer2.2.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 13 | module.layer2.2.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 14 | module.layer3.0.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |            18432 | 1179648 |\n| 15 | module.layer3.0.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |             2048 |  131072 |\n| 17 | module.layer3.1.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 18 | module.layer3.1.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 19 | module.layer3.2.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 20 | module.layer3.2.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 21 | module.fc                    | Linear |          | (1, 64)         |           64 | (1, 10)         |           10 |              640 |     640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184\n\n\n\n\nUsing TensorBoard\n\n\nGoogle's \nTensorBoard\n is an excellent tool for visualizing the progress of DNN training.  Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow).\n\nTo view the graphs, invoke the TensorBoard server.  For example:\n\n\n$ tensorboard --logdir=logs\n\n\n\n\nDistillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the \nTensorFlow installation instructions\n.\n\n\nCollecting activations statistics\n\n\nIn CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). \n\nYou can collect activation statistics using the \n--act_stats\n command-line flag.\n\nFor example:\n\n\n$ python3 compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10  --resume=checkpoint.resnet56_cifar_baseline.pth.tar --act-stats=test -e\n\n\n\n\nThe \ntest\n parameter indicates that, in this example, we want to collect activation statistics during the \ntest\n phase.  Note that we also used the \n-e\n command-line argument to indicate that we want to run a \ntest\n phase. The other two legal parameter values are \ntrain\n and \nvalid\n which collect activation statistics during the \ntraining\n and \nvalidation\n phases, respectively.  \n\n\nCollectors and their collaterals\n\n\nAn instance of a subclass of \nActivationStatsCollector\n can be used to collect activation statistics.  Currently, \nActivationStatsCollector\n has two types of subclasses: \nSummaryActivationStatsCollector\n and \nRecordsActivationStatsCollector\n.\n\nInstances of \nSummaryActivationStatsCollector\n compute the mean of some statistic of the activation.  It is rather\nlight-weight and quicker than collecting a record per activation.  The statistic function is configured in the constructor.\n\nIn the sample compression application, \ncompress_classifier.py\n, we create a dictionary of collectors.  For example:\n\n\nSummaryActivationStatsCollector(model,\n                                \"sparsity\",\n                                lambda t: 100 * distiller.utils.sparsity(t))\n\n\n\n\nThe lambda expression is invoked per activation encountered during forward passes, and the value it returns (in this case, the sparsity of the activation tensors, multiplied by 100) is stored in \nmodule.sparsity\n (\n\"sparsity\"\n is this collector's name).  To access the statistics, you can invoke \ncollector.value()\n, or you can access each module's data directly.\n\n\nAnother type of collector is \nRecordsActivationStatsCollector\n which computes a hard-coded set of activations statistics and collects a\n\nrecord per activation\n.  For obvious reasons, this is slower than instances of \nSummaryActivationStatsCollector\n.\nActivationStatsCollector\n default to collecting activations statistics only on the output activations of ReLU layers, but we can choose any layer type we want.  In the example below we collect statistics from outputs of \ntorch.nn.Conv2d\n layers.\n\n\nRecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])\n\n\n\n\nCollectors can write their data to Excel workbooks (which are named using the collector's name), by invoking \ncollector.to_xlsx(path_to_workbook)\n.  In \ncompress_classifier.py\n we currently create four different collectors which you can selectively disable.  You can also add other statistics collectors and use a different function to compute your new statistic.\n\n\ncollectors = missingdict({\n    \"sparsity\":      SummaryActivationStatsCollector(model, \"sparsity\",\n                                                     lambda t: 100 * distiller.utils.sparsity(t)),\n    \"l1_channels\":   SummaryActivationStatsCollector(model, \"l1_channels\",\n                                                     distiller.utils.activation_channels_l1),\n    \"apoz_channels\": SummaryActivationStatsCollector(model, \"apoz_channels\",\n                                                     distiller.utils.activation_channels_apoz),\n    \"records\":       RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])})\n\n\n\n\nBy default, these Collectors write their data to files in the active log directory.\n\n\nYou can use a utility function, \ndistiller.log_activation_statsitics\n, to log the data of an \nActivationStatsCollector\n instance to one of the backend-loggers.  For an example, the code below logs the \n\"sparsity\"\n collector to a TensorBoard log file.\n\n\ndistiller.log_activation_statsitics(epoch, \"train\", loggers=[tflogger],\n                                    collector=collectors[\"sparsity\"])\n\n\n\n\nCaveats\n\n\nDistiller collects activations statistics using PyTorch's forward-hooks mechanism.  Collectors iteratively register the modules' forward-hooks, and collectors are called during the forward traversal and get exposed to activation data.  Registering for forward callbacks is performed like this:\n\n\nmodule.register_forward_hook\n\n\n\n\nThis makes apparent two limitations of this mechanism:\n\n\n\n\nWe can only register on PyTorch modules.  This means that we can't register on the forward hook of a functionals such as \ntorch.nn.functional.relu\n and \ntorch.nn.functional.max_pool2d\n.\n\n   Therefore, you may need to replace functionals with their module alternative.  For example:  \n\n\n\n\nclass MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n\n    def forward(self, x):\n        x = F.relu(self.conv1(x))\n        return x\n\n\n\n\nCan be changed to:  \n\n\nclass MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n        self.relu = nn.ReLU(inplace=True)\n\n    def forward(self, x):\n        x = self.relu(self.conv1(x))\n        return x\n\n\n\n\n\n\nWe can only use a module instance once in our models.  If we use the same module several times, then we can't determine which node in the graph has invoked the callback, because the PyTorch callback signature \ndef hook(module, input, output)\n doesn't provide enough contextual information.\n\nTorchVision's \nResNet\n is an example of a model that uses the same instance of nn.ReLU multiple times:  \n\n\n\n\nclass BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu(out)                    # <================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu(out)                    # <================\n        return out\n\n\n\n\nIn Distiller we changed \nResNet\n to use multiple instances of nn.ReLU, and each instance is used only once:  \n\n\nclass BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu1 = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.relu2 = nn.ReLU(inplace=True)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu1(out)                   # <================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu2(out)                   # <================\n        return out\n\n\n\n\nUsing the Jupyter notebooks\n\n\nThe Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller.  They are explained in a separate page.\n\n\nGenerating this documentation\n\n\nInstall mkdocs and the required packages by executing:\n\n\n$ pip3 install -r doc-requirements.txt\n\n\n\n\nTo build the project documentation run:\n\n\n$ cd distiller/docs-src\n$ mkdocs build --clean\n\n\n\n\nThis will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.",
+            "location": "/usage/index.html", 
+            "text": "Using the sample application\n\n\nThe Distiller repository contains a sample application, \ndistiller/examples/classifier_compression/compress_classifier.py\n, and a set of scheduling files which demonstrate Distiller's features.  Following is a brief discussion of how to use this application and the accompanying schedules.\n\n\nYou might also want to refer to the following resources:\n\n\n\n\nAn \nexplanation\n of the scheduler file format.\n\n\nAn in-depth \ndiscussion\n of how we used these schedule files to implement several state-of-the-art DNN compression research papers.\n\n\n\n\nThe sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application.  The code is documented and should be considered the best source of documentation, but we provide some elaboration here.\n\n\nThis diagram shows how where \ncompress_classifier.py\n fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.\n\n\n\nCommand line arguments\n\n\nTo get help on the command line arguments, invoke:\n\n\n$ python3 compress_classifier.py --help\n\n\n\n\nFor example:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n |    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n |  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n |  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n |  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n |  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n |  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n |  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n |  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n |  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n |  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][   50/  500]    Loss 2.175988    Top1 51.289063    Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][  100/  500]    Loss 2.171564    Top1 51.175781    Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][  150/  500]    Loss 2.159347    Top1 51.546875    Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][  200/  500]    Loss 2.156857    Top1 51.585938    Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][  250/  500]    Loss 2.152790    Top1 51.707813    Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][  300/  500]    Loss 2.149962    Top1 51.791667    Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][  350/  500]    Loss 2.150936    Top1 51.827009    Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][  400/  500]    Loss 2.150853    Top1 51.781250    Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][  450/  500]    Loss 2.150156    Top1 51.828125    Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][  500/  500]    Loss 2.150417    Top1 51.833594    Top5 74.817187\n 2018-04-04 21:31:35,357 - ==\n Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [   50/  195]    Loss 1.487607    Top1 63.273438    Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [  100/  195]    Loss 1.638043    Top1 60.636719    Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [  150/  195]    Loss 1.833214    Top1 57.619792    Top5 80.447917\n 2018-04-04 21:32:01,274 - ==\n Top1: 56.606    Top5: 79.446    Loss: 1.893\n\n\n\n\nLet's look at the command line again:\n\n\n$ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\n\n\nIn this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration:\n\n\n\n\nLearning-rate of 0.005\n\n\nPrint progress every 50 mini-batches.\n\n\nUse 44 worker threads to load data (make sure to use something suitable for your machine).\n\n\nRun for 90 epochs.  Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0.  When you train and prune your own networks, the last training epoch is saved as a metadata with the model.  Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch.\n\n\nThe pruning schedule is provided in \nalexnet.schedule_sensitivity.yaml\n\n\nLog files are written to directory \nlogs\n.\n\n\n\n\nExamples\n\n\nDistiller comes with several example schedules which can be used together with \ncompress_classifier.py\n.\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization.  The results usually contain a table showing the sparsity of  each of the model parameters, together with the validation and test top1, top5 and loss scores.\n\n\nFor more details on the example schedules, you can refer to the coverage of the \nModel Zoo\n.\n\n\n\n\nexamples/agp-pruning\n:\n\n\nAutomated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset)\n\n\n\n\n\n\n\nexamples/hybrid\n:\n\n\nAlexNet AGP with 2D (kernel) regularization (ImageNet dataset)\n\n\nAlexNet sensitivity pruning with 2D regularization\n\n\n\n\n\n\n\nexamples/network_slimming\n:\n\n\nResNet20 Network Slimming (this is work-in-progress)\n\n\n\n\n\n\n\nexamples/pruning_filters_for_efficient_convnets\n:\n\n\nResNet56 baseline training (CIFAR10 dataset)\n\n\nResNet56 filter removal using filter ranking\n\n\n\n\n\n\n\nexamples/sensitivity_analysis\n:\n\n\nElement-wise pruning sensitivity-analysis:\n\n\nAlexNet (ImageNet)\n\n\nMobileNet (ImageNet)\n\n\nResNet18 (ImageNet)\n\n\nResNet20 (CIFAR10)\n\n\nResNet34 (ImageNet)\n\n\nFilter-wise pruning sensitivity-analysis:\n\n\nResNet20 (CIFAR10)\n\n\nResNet56 (CIFAR10)\n\n\n\n\n\n\n\nexamples/sensitivity-pruning\n:\n\n\nAlexNet sensitivity pruning with Iterative Pruning\n\n\nAlexNet sensitivity pruning with One-Shot Pruning\n\n\n\n\n\n\n\nexamples/ssl\n:\n\n\nResNet20 baseline training (CIFAR10 dataset)\n\n\nStructured Sparsity Learning (SSL) with layer removal on ResNet20\n\n\nSSL with channels removal on ResNet20\n\n\n\n\n\n\n\nexamples/quantization\n:\n\n\nAlexNet w. Batch-Norm (base FP32 + DoReFa)\n\n\nPre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa)\n\n\nPre-activation ResNet18 on ImageNEt (base FP32 + DoReFa)\n\n\n\n\n\n\n\n\nExperiment reproducibility\n\n\nExperiment reproducibility is sometimes important.  Pete Warden recently expounded about this in his \nblog\n.\n\nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs.  Using the \n--deterministic\n command-line flag and setting \nj=1\n will produce reproducible results (for the same PyTorch version).\n\n\nPerforming pruning sensitivity analysis\n\n\nDistiller supports element-wise and filter-wise pruning sensitivity analysis.  In both cases, L1-norm is used to rank which elements or filters to prune.  For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero.  \n\nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor.  Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results.\n\nResults are output as a CSV file (\nsensitivity.csv\n) and PNG file (\nsensitivity.png\n).  The implementation is in \ndistiller/sensitivity.py\n and it contains further details about process and the format of the CSV file.\n\n\nThe example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10:\n\n\n$ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element\n\n\n\n\nThe \nsense\n command-line argument can be set to either \nelement\n or \nfilter\n, depending on the type of analysis you want done.\n\n\nThere is also a \nJupyter notebook\n with example invocations, outputs and explanations.\n\n\nPost-Training Quantization\n\n\nDistiller supports post-training quantization of trained modules without re-training (using \nRange-Based Linear Quantization\n). So, any model (whether pruned or not) can be quantized. To invoke post-training quantization, use \n--quantize-eval\n along with \n--evaluate\n. Additional arguments are available to control parameters of the quantization:\n\n\nArguments controlling quantization at evaluation time(\npost-training quantization\n):\n  --quantize-eval, --qe\n                        Apply linear quantization to model before evaluation.\n                        Applicable only if --evaluate is also set\n  --qe-mode QE_MODE, --qem QE_MODE\n                        Linear quantization mode. Choices: asym_s | asym_u |\n                        sym\n  --qe-bits-acts NUM_BITS, --qeba NUM_BITS\n                        Number of bits for quantization of activations\n  --qe-bits-wts NUM_BITS, --qebw NUM_BITS\n                        Number of bits for quantization of weights\n  --qe-bits-accum NUM_BITS\n                        Number of bits for quantization of the accumulator\n  --qe-clip-acts, --qeca\n                        Enable clipping of activations using min/max values\n                        averaging over batch\n  --qe-no-clip-layers LAYER_NAME [LAYER_NAME ...], --qencl LAYER_NAME [LAYER_NAME ...]\n                        List of fully-qualified layer names for which not to\n                        clip activations. Applicable only if --qe-clip-acts is\n                        also set\n  --qe-per-channel, --qepc\n                        Enable per-channel quantization of weights (per output channel)\n\n\n\n\n\nThe following example qunatizes ResNet18 for ImageNet:\n\n\n$ python3 compress_classifier.py -a resnet18 ../../../data.imagenet  --pretrained --quantize-eval --evaluate\n\n\n\n\nA checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are stored as well in each quantized layer.\n\n\nFor more examples of post-training quantization see \nhere\n\n\nSummaries\n\n\nYou can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-).\n\n\n$ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute\n\n\n\n\nGenerates:\n\n\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n|    | Name                         | Type   | Attrs    | IFM             |   IFM volume | OFM             |   OFM volume |   Weights volume |    MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n|  0 | module.conv1                 | Conv2d | k=(3, 3) | (1, 3, 32, 32)  |         3072 | (1, 16, 32, 32) |        16384 |              432 |  442368 |\n|  1 | module.layer1.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  2 | module.layer1.0.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  3 | module.layer1.1.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  4 | module.layer1.1.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  5 | module.layer1.2.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  6 | module.layer1.2.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  7 | module.layer2.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |             4608 | 1179648 |\n|  8 | module.layer2.0.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n|  9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |              512 |  131072 |\n| 10 | module.layer2.1.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 11 | module.layer2.1.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 12 | module.layer2.2.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 13 | module.layer2.2.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 14 | module.layer3.0.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |            18432 | 1179648 |\n| 15 | module.layer3.0.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |             2048 |  131072 |\n| 17 | module.layer3.1.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 18 | module.layer3.1.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 19 | module.layer3.2.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 20 | module.layer3.2.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 21 | module.fc                    | Linear |          | (1, 64)         |           64 | (1, 10)         |           10 |              640 |     640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184\n\n\n\n\nUsing TensorBoard\n\n\nGoogle's \nTensorBoard\n is an excellent tool for visualizing the progress of DNN training.  Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow).\n\nTo view the graphs, invoke the TensorBoard server.  For example:\n\n\n$ tensorboard --logdir=logs\n\n\n\n\nDistillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the \nTensorFlow installation instructions\n.\n\n\nCollecting activations statistics\n\n\nIn CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). \n\nYou can collect activation statistics using the \n--act_stats\n command-line flag.\n\nFor example:\n\n\n$ python3 compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10  --resume=checkpoint.resnet56_cifar_baseline.pth.tar --act-stats=test -e\n\n\n\n\nThe \ntest\n parameter indicates that, in this example, we want to collect activation statistics during the \ntest\n phase.  Note that we also used the \n-e\n command-line argument to indicate that we want to run a \ntest\n phase. The other two legal parameter values are \ntrain\n and \nvalid\n which collect activation statistics during the \ntraining\n and \nvalidation\n phases, respectively.  \n\n\nCollectors and their collaterals\n\n\nAn instance of a subclass of \nActivationStatsCollector\n can be used to collect activation statistics.  Currently, \nActivationStatsCollector\n has two types of subclasses: \nSummaryActivationStatsCollector\n and \nRecordsActivationStatsCollector\n.\n\nInstances of \nSummaryActivationStatsCollector\n compute the mean of some statistic of the activation.  It is rather\nlight-weight and quicker than collecting a record per activation.  The statistic function is configured in the constructor.\n\nIn the sample compression application, \ncompress_classifier.py\n, we create a dictionary of collectors.  For example:\n\n\nSummaryActivationStatsCollector(model,\n                                \nsparsity\n,\n                                lambda t: 100 * distiller.utils.sparsity(t))\n\n\n\n\nThe lambda expression is invoked per activation encountered during forward passes, and the value it returns (in this case, the sparsity of the activation tensors, multiplied by 100) is stored in \nmodule.sparsity\n (\n\"sparsity\"\n is this collector's name).  To access the statistics, you can invoke \ncollector.value()\n, or you can access each module's data directly.\n\n\nAnother type of collector is \nRecordsActivationStatsCollector\n which computes a hard-coded set of activations statistics and collects a\n\nrecord per activation\n.  For obvious reasons, this is slower than instances of \nSummaryActivationStatsCollector\n.\nActivationStatsCollector\n default to collecting activations statistics only on the output activations of ReLU layers, but we can choose any layer type we want.  In the example below we collect statistics from outputs of \ntorch.nn.Conv2d\n layers.\n\n\nRecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])\n\n\n\n\nCollectors can write their data to Excel workbooks (which are named using the collector's name), by invoking \ncollector.to_xlsx(path_to_workbook)\n.  In \ncompress_classifier.py\n we currently create four different collectors which you can selectively disable.  You can also add other statistics collectors and use a different function to compute your new statistic.\n\n\ncollectors = missingdict({\n    \nsparsity\n:      SummaryActivationStatsCollector(model, \nsparsity\n,\n                                                     lambda t: 100 * distiller.utils.sparsity(t)),\n    \nl1_channels\n:   SummaryActivationStatsCollector(model, \nl1_channels\n,\n                                                     distiller.utils.activation_channels_l1),\n    \napoz_channels\n: SummaryActivationStatsCollector(model, \napoz_channels\n,\n                                                     distiller.utils.activation_channels_apoz),\n    \nrecords\n:       RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])})\n\n\n\n\nBy default, these Collectors write their data to files in the active log directory.\n\n\nYou can use a utility function, \ndistiller.log_activation_statsitics\n, to log the data of an \nActivationStatsCollector\n instance to one of the backend-loggers.  For an example, the code below logs the \n\"sparsity\"\n collector to a TensorBoard log file.\n\n\ndistiller.log_activation_statsitics(epoch, \ntrain\n, loggers=[tflogger],\n                                    collector=collectors[\nsparsity\n])\n\n\n\n\nCaveats\n\n\nDistiller collects activations statistics using PyTorch's forward-hooks mechanism.  Collectors iteratively register the modules' forward-hooks, and collectors are called during the forward traversal and get exposed to activation data.  Registering for forward callbacks is performed like this:\n\n\nmodule.register_forward_hook\n\n\n\n\nThis makes apparent two limitations of this mechanism:\n\n\n\n\nWe can only register on PyTorch modules.  This means that we can't register on the forward hook of a functionals such as \ntorch.nn.functional.relu\n and \ntorch.nn.functional.max_pool2d\n.\n\n   Therefore, you may need to replace functionals with their module alternative.  For example:  \n\n\n\n\nclass MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n\n    def forward(self, x):\n        x = F.relu(self.conv1(x))\n        return x\n\n\n\n\nCan be changed to:  \n\n\nclass MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n        self.relu = nn.ReLU(inplace=True)\n\n    def forward(self, x):\n        x = self.relu(self.conv1(x))\n        return x\n\n\n\n\n\n\nWe can only use a module instance once in our models.  If we use the same module several times, then we can't determine which node in the graph has invoked the callback, because the PyTorch callback signature \ndef hook(module, input, output)\n doesn't provide enough contextual information.\n\nTorchVision's \nResNet\n is an example of a model that uses the same instance of nn.ReLU multiple times:  \n\n\n\n\nclass BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu(out)                    # \n================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu(out)                    # \n================\n        return out\n\n\n\n\nIn Distiller we changed \nResNet\n to use multiple instances of nn.ReLU, and each instance is used only once:  \n\n\nclass BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu1 = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.relu2 = nn.ReLU(inplace=True)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu1(out)                   # \n================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu2(out)                   # \n================\n        return out\n\n\n\n\nUsing the Jupyter notebooks\n\n\nThe Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller.  They are explained in a separate page.\n\n\nGenerating this documentation\n\n\nInstall mkdocs and the required packages by executing:\n\n\n$ pip3 install -r doc-requirements.txt\n\n\n\n\nTo build the project documentation run:\n\n\n$ cd distiller/docs-src\n$ mkdocs build --clean\n\n\n\n\nThis will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.", 
             "title": "Usage"
-        },
+        }, 
         {
-            "location": "/usage/index.html#using-the-sample-application",
-            "text": "The Distiller repository contains a sample application,  distiller/examples/classifier_compression/compress_classifier.py , and a set of scheduling files which demonstrate Distiller's features.  Following is a brief discussion of how to use this application and the accompanying schedules.  You might also want to refer to the following resources:   An  explanation  of the scheduler file format.  An in-depth  discussion  of how we used these schedule files to implement several state-of-the-art DNN compression research papers.   The sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application.  The code is documented and should be considered the best source of documentation, but we provide some elaboration here.  This diagram shows how where  compress_classifier.py  fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.",
+            "location": "/usage/index.html#using-the-sample-application", 
+            "text": "The Distiller repository contains a sample application,  distiller/examples/classifier_compression/compress_classifier.py , and a set of scheduling files which demonstrate Distiller's features.  Following is a brief discussion of how to use this application and the accompanying schedules.  You might also want to refer to the following resources:   An  explanation  of the scheduler file format.  An in-depth  discussion  of how we used these schedule files to implement several state-of-the-art DNN compression research papers.   The sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application.  The code is documented and should be considered the best source of documentation, but we provide some elaboration here.  This diagram shows how where  compress_classifier.py  fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.", 
             "title": "Using the sample application"
-        },
+        }, 
         {
-            "location": "/usage/index.html#command-line-arguments",
-            "text": "To get help on the command line arguments, invoke:  $ python3 compress_classifier.py --help  For example:  $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n |    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n |  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n |  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n |  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n |  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n |  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n |  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n |  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n |  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n |  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][   50/  500]    Loss 2.175988    Top1 51.289063    Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][  100/  500]    Loss 2.171564    Top1 51.175781    Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][  150/  500]    Loss 2.159347    Top1 51.546875    Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][  200/  500]    Loss 2.156857    Top1 51.585938    Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][  250/  500]    Loss 2.152790    Top1 51.707813    Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][  300/  500]    Loss 2.149962    Top1 51.791667    Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][  350/  500]    Loss 2.150936    Top1 51.827009    Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][  400/  500]    Loss 2.150853    Top1 51.781250    Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][  450/  500]    Loss 2.150156    Top1 51.828125    Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][  500/  500]    Loss 2.150417    Top1 51.833594    Top5 74.817187\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [   50/  195]    Loss 1.487607    Top1 63.273438    Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [  100/  195]    Loss 1.638043    Top1 60.636719    Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [  150/  195]    Loss 1.833214    Top1 57.619792    Top5 80.447917\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606    Top5: 79.446    Loss: 1.893  Let's look at the command line again:  $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml  In this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration:   Learning-rate of 0.005  Print progress every 50 mini-batches.  Use 44 worker threads to load data (make sure to use something suitable for your machine).  Run for 90 epochs.  Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0.  When you train and prune your own networks, the last training epoch is saved as a metadata with the model.  Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch.  The pruning schedule is provided in  alexnet.schedule_sensitivity.yaml  Log files are written to directory  logs .",
+            "location": "/usage/index.html#command-line-arguments", 
+            "text": "To get help on the command line arguments, invoke:  $ python3 compress_classifier.py --help  For example:  $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\nParameters:\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n |    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n |  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n |  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n |  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n |  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n |  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n |  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n |  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n |  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n |  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:04,646 - Epoch: [89][   50/  500]    Loss 2.175988    Top1 51.289063    Top5 74.023438\n 2018-04-04 21:31:06,427 - Epoch: [89][  100/  500]    Loss 2.171564    Top1 51.175781    Top5 74.308594\n 2018-04-04 21:31:11,432 - Epoch: [89][  150/  500]    Loss 2.159347    Top1 51.546875    Top5 74.473958\n 2018-04-04 21:31:14,364 - Epoch: [89][  200/  500]    Loss 2.156857    Top1 51.585938    Top5 74.568359\n 2018-04-04 21:31:18,381 - Epoch: [89][  250/  500]    Loss 2.152790    Top1 51.707813    Top5 74.681250\n 2018-04-04 21:31:22,195 - Epoch: [89][  300/  500]    Loss 2.149962    Top1 51.791667    Top5 74.755208\n 2018-04-04 21:31:25,508 - Epoch: [89][  350/  500]    Loss 2.150936    Top1 51.827009    Top5 74.767857\n 2018-04-04 21:31:29,538 - Epoch: [89][  400/  500]    Loss 2.150853    Top1 51.781250    Top5 74.763672\n 2018-04-04 21:31:32,842 - Epoch: [89][  450/  500]    Loss 2.150156    Top1 51.828125    Top5 74.821181\n 2018-04-04 21:31:35,338 - Epoch: [89][  500/  500]    Loss 2.150417    Top1 51.833594    Top5 74.817187\n 2018-04-04 21:31:35,357 - ==  Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:35,364 - Saving checkpoint\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:31:51,512 - Test: [   50/  195]    Loss 1.487607    Top1 63.273438    Top5 85.695312\n 2018-04-04 21:31:55,015 - Test: [  100/  195]    Loss 1.638043    Top1 60.636719    Top5 83.664062\n 2018-04-04 21:31:58,732 - Test: [  150/  195]    Loss 1.833214    Top1 57.619792    Top5 80.447917\n 2018-04-04 21:32:01,274 - ==  Top1: 56.606    Top5: 79.446    Loss: 1.893  Let's look at the command line again:  $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml  In this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration:   Learning-rate of 0.005  Print progress every 50 mini-batches.  Use 44 worker threads to load data (make sure to use something suitable for your machine).  Run for 90 epochs.  Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0.  When you train and prune your own networks, the last training epoch is saved as a metadata with the model.  Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch.  The pruning schedule is provided in  alexnet.schedule_sensitivity.yaml  Log files are written to directory  logs .", 
             "title": "Command line arguments"
-        },
+        }, 
         {
-            "location": "/usage/index.html#examples",
-            "text": "Distiller comes with several example schedules which can be used together with  compress_classifier.py .\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization.  The results usually contain a table showing the sparsity of  each of the model parameters, together with the validation and test top1, top5 and loss scores.  For more details on the example schedules, you can refer to the coverage of the  Model Zoo .   examples/agp-pruning :  Automated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset)    examples/hybrid :  AlexNet AGP with 2D (kernel) regularization (ImageNet dataset)  AlexNet sensitivity pruning with 2D regularization    examples/network_slimming :  ResNet20 Network Slimming (this is work-in-progress)    examples/pruning_filters_for_efficient_convnets :  ResNet56 baseline training (CIFAR10 dataset)  ResNet56 filter removal using filter ranking    examples/sensitivity_analysis :  Element-wise pruning sensitivity-analysis:  AlexNet (ImageNet)  MobileNet (ImageNet)  ResNet18 (ImageNet)  ResNet20 (CIFAR10)  ResNet34 (ImageNet)  Filter-wise pruning sensitivity-analysis:  ResNet20 (CIFAR10)  ResNet56 (CIFAR10)    examples/sensitivity-pruning :  AlexNet sensitivity pruning with Iterative Pruning  AlexNet sensitivity pruning with One-Shot Pruning    examples/ssl :  ResNet20 baseline training (CIFAR10 dataset)  Structured Sparsity Learning (SSL) with layer removal on ResNet20  SSL with channels removal on ResNet20    examples/quantization :  AlexNet w. Batch-Norm (base FP32 + DoReFa)  Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa)  Pre-activation ResNet18 on ImageNEt (base FP32 + DoReFa)",
+            "location": "/usage/index.html#examples", 
+            "text": "Distiller comes with several example schedules which can be used together with  compress_classifier.py .\nThese example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization.  The results usually contain a table showing the sparsity of  each of the model parameters, together with the validation and test top1, top5 and loss scores.  For more details on the example schedules, you can refer to the coverage of the  Model Zoo .   examples/agp-pruning :  Automated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset)    examples/hybrid :  AlexNet AGP with 2D (kernel) regularization (ImageNet dataset)  AlexNet sensitivity pruning with 2D regularization    examples/network_slimming :  ResNet20 Network Slimming (this is work-in-progress)    examples/pruning_filters_for_efficient_convnets :  ResNet56 baseline training (CIFAR10 dataset)  ResNet56 filter removal using filter ranking    examples/sensitivity_analysis :  Element-wise pruning sensitivity-analysis:  AlexNet (ImageNet)  MobileNet (ImageNet)  ResNet18 (ImageNet)  ResNet20 (CIFAR10)  ResNet34 (ImageNet)  Filter-wise pruning sensitivity-analysis:  ResNet20 (CIFAR10)  ResNet56 (CIFAR10)    examples/sensitivity-pruning :  AlexNet sensitivity pruning with Iterative Pruning  AlexNet sensitivity pruning with One-Shot Pruning    examples/ssl :  ResNet20 baseline training (CIFAR10 dataset)  Structured Sparsity Learning (SSL) with layer removal on ResNet20  SSL with channels removal on ResNet20    examples/quantization :  AlexNet w. Batch-Norm (base FP32 + DoReFa)  Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa)  Pre-activation ResNet18 on ImageNEt (base FP32 + DoReFa)", 
             "title": "Examples"
-        },
+        }, 
         {
-            "location": "/usage/index.html#experiment-reproducibility",
-            "text": "Experiment reproducibility is sometimes important.  Pete Warden recently expounded about this in his  blog . \nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs.  Using the  --deterministic  command-line flag and setting  j=1  will produce reproducible results (for the same PyTorch version).",
+            "location": "/usage/index.html#experiment-reproducibility", 
+            "text": "Experiment reproducibility is sometimes important.  Pete Warden recently expounded about this in his  blog . \nPyTorch's support for deterministic execution requires us to use only one thread for loading data (other wise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs.  Using the  --deterministic  command-line flag and setting  j=1  will produce reproducible results (for the same PyTorch version).", 
             "title": "Experiment reproducibility"
-        },
+        }, 
         {
-            "location": "/usage/index.html#performing-pruning-sensitivity-analysis",
-            "text": "Distiller supports element-wise and filter-wise pruning sensitivity analysis.  In both cases, L1-norm is used to rank which elements or filters to prune.  For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero.   \nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor.  Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results. \nResults are output as a CSV file ( sensitivity.csv ) and PNG file ( sensitivity.png ).  The implementation is in  distiller/sensitivity.py  and it contains further details about process and the format of the CSV file.  The example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10:  $ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element  The  sense  command-line argument can be set to either  element  or  filter , depending on the type of analysis you want done.  There is also a  Jupyter notebook  with example invocations, outputs and explanations.",
+            "location": "/usage/index.html#performing-pruning-sensitivity-analysis", 
+            "text": "Distiller supports element-wise and filter-wise pruning sensitivity analysis.  In both cases, L1-norm is used to rank which elements or filters to prune.  For example, when running filter-pruning sensitivity analysis, the L1-norm of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero.   \nThe analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor.  Using a small dataset for this would save much time and we plan on assessing if this will provide sufficient results. \nResults are output as a CSV file ( sensitivity.csv ) and PNG file ( sensitivity.png ).  The implementation is in  distiller/sensitivity.py  and it contains further details about process and the format of the CSV file.  The example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10:  $ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element  The  sense  command-line argument can be set to either  element  or  filter , depending on the type of analysis you want done.  There is also a  Jupyter notebook  with example invocations, outputs and explanations.", 
             "title": "Performing pruning sensitivity analysis"
-        },
+        }, 
         {
-            "location": "/usage/index.html#post-training-quantization",
-            "text": "Distiller supports post-training quantization of trained modules without re-training (using  Range-Based Linear Quantization ). So, any model (whether pruned or not) can be quantized. To invoke post-training quantization, use  --quantize-eval  along with  --evaluate . Additional arguments are available to control parameters of the quantization:  Arguments controlling quantization at evaluation time(\"post-training quantization\"):\n  --quantize-eval, --qe\n                        Apply linear quantization to model before evaluation.\n                        Applicable only if --evaluate is also set\n  --qe-mode QE_MODE, --qem QE_MODE\n                        Linear quantization mode. Choices: asym_s | asym_u |\n                        sym\n  --qe-bits-acts NUM_BITS, --qeba NUM_BITS\n                        Number of bits for quantization of activations\n  --qe-bits-wts NUM_BITS, --qebw NUM_BITS\n                        Number of bits for quantization of weights\n  --qe-bits-accum NUM_BITS\n                        Number of bits for quantization of the accumulator\n  --qe-clip-acts, --qeca\n                        Enable clipping of activations using min/max values\n                        averaging over batch\n  --qe-no-clip-layers LAYER_NAME [LAYER_NAME ...], --qencl LAYER_NAME [LAYER_NAME ...]\n                        List of fully-qualified layer names for which not to\n                        clip activations. Applicable only if --qe-clip-acts is\n                        also set\n  --qe-per-channel, --qepc\n                        Enable per-channel quantization of weights (per output channel)  The following example qunatizes ResNet18 for ImageNet:  $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet  --pretrained --quantize-eval --evaluate  A checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are stored as well in each quantized layer.  For more examples of post-training quantization see  here",
+            "location": "/usage/index.html#post-training-quantization", 
+            "text": "Distiller supports post-training quantization of trained modules without re-training (using  Range-Based Linear Quantization ). So, any model (whether pruned or not) can be quantized. To invoke post-training quantization, use  --quantize-eval  along with  --evaluate . Additional arguments are available to control parameters of the quantization:  Arguments controlling quantization at evaluation time( post-training quantization ):\n  --quantize-eval, --qe\n                        Apply linear quantization to model before evaluation.\n                        Applicable only if --evaluate is also set\n  --qe-mode QE_MODE, --qem QE_MODE\n                        Linear quantization mode. Choices: asym_s | asym_u |\n                        sym\n  --qe-bits-acts NUM_BITS, --qeba NUM_BITS\n                        Number of bits for quantization of activations\n  --qe-bits-wts NUM_BITS, --qebw NUM_BITS\n                        Number of bits for quantization of weights\n  --qe-bits-accum NUM_BITS\n                        Number of bits for quantization of the accumulator\n  --qe-clip-acts, --qeca\n                        Enable clipping of activations using min/max values\n                        averaging over batch\n  --qe-no-clip-layers LAYER_NAME [LAYER_NAME ...], --qencl LAYER_NAME [LAYER_NAME ...]\n                        List of fully-qualified layer names for which not to\n                        clip activations. Applicable only if --qe-clip-acts is\n                        also set\n  --qe-per-channel, --qepc\n                        Enable per-channel quantization of weights (per output channel)  The following example qunatizes ResNet18 for ImageNet:  $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet  --pretrained --quantize-eval --evaluate  A checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are stored as well in each quantized layer.  For more examples of post-training quantization see  here", 
             "title": "Post-Training Quantization"
-        },
+        }, 
         {
-            "location": "/usage/index.html#summaries",
-            "text": "You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-).  $ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute  Generates:  +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n|    | Name                         | Type   | Attrs    | IFM             |   IFM volume | OFM             |   OFM volume |   Weights volume |    MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n|  0 | module.conv1                 | Conv2d | k=(3, 3) | (1, 3, 32, 32)  |         3072 | (1, 16, 32, 32) |        16384 |              432 |  442368 |\n|  1 | module.layer1.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  2 | module.layer1.0.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  3 | module.layer1.1.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  4 | module.layer1.1.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  5 | module.layer1.2.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  6 | module.layer1.2.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  7 | module.layer2.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |             4608 | 1179648 |\n|  8 | module.layer2.0.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n|  9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |              512 |  131072 |\n| 10 | module.layer2.1.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 11 | module.layer2.1.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 12 | module.layer2.2.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 13 | module.layer2.2.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 14 | module.layer3.0.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |            18432 | 1179648 |\n| 15 | module.layer3.0.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |             2048 |  131072 |\n| 17 | module.layer3.1.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 18 | module.layer3.1.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 19 | module.layer3.2.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 20 | module.layer3.2.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 21 | module.fc                    | Linear |          | (1, 64)         |           64 | (1, 10)         |           10 |              640 |     640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184",
+            "location": "/usage/index.html#summaries", 
+            "text": "You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).\nYou can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.\nCreating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-).  $ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute  Generates:  +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\n|    | Name                         | Type   | Attrs    | IFM             |   IFM volume | OFM             |   OFM volume |   Weights volume |    MACs |\n|----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------|\n|  0 | module.conv1                 | Conv2d | k=(3, 3) | (1, 3, 32, 32)  |         3072 | (1, 16, 32, 32) |        16384 |              432 |  442368 |\n|  1 | module.layer1.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  2 | module.layer1.0.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  3 | module.layer1.1.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  4 | module.layer1.1.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  5 | module.layer1.2.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  6 | module.layer1.2.conv2        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 16, 32, 32) |        16384 |             2304 | 2359296 |\n|  7 | module.layer2.0.conv1        | Conv2d | k=(3, 3) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |             4608 | 1179648 |\n|  8 | module.layer2.0.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n|  9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) |        16384 | (1, 32, 16, 16) |         8192 |              512 |  131072 |\n| 10 | module.layer2.1.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 11 | module.layer2.1.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 12 | module.layer2.2.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 13 | module.layer2.2.conv2        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 32, 16, 16) |         8192 |             9216 | 2359296 |\n| 14 | module.layer3.0.conv1        | Conv2d | k=(3, 3) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |            18432 | 1179648 |\n| 15 | module.layer3.0.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) |         8192 | (1, 64, 8, 8)   |         4096 |             2048 |  131072 |\n| 17 | module.layer3.1.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 18 | module.layer3.1.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 19 | module.layer3.2.conv1        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 20 | module.layer3.2.conv2        | Conv2d | k=(3, 3) | (1, 64, 8, 8)   |         4096 | (1, 64, 8, 8)   |         4096 |            36864 | 2359296 |\n| 21 | module.fc                    | Linear |          | (1, 64)         |           64 | (1, 10)         |           10 |              640 |     640 |\n+----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+\nTotal MACs: 40,813,184", 
             "title": "Summaries"
-        },
+        }, 
         {
-            "location": "/usage/index.html#using-tensorboard",
-            "text": "Google's  TensorBoard  is an excellent tool for visualizing the progress of DNN training.  Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow). \nTo view the graphs, invoke the TensorBoard server.  For example:  $ tensorboard --logdir=logs  Distillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the  TensorFlow installation instructions .",
+            "location": "/usage/index.html#using-tensorboard", 
+            "text": "Google's  TensorBoard  is an excellent tool for visualizing the progress of DNN training.  Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow). \nTo view the graphs, invoke the TensorBoard server.  For example:  $ tensorboard --logdir=logs  Distillers's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the  TensorFlow installation instructions .", 
             "title": "Using TensorBoard"
-        },
+        }, 
         {
-            "location": "/usage/index.html#collecting-activations-statistics",
-            "text": "In CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical).  \nYou can collect activation statistics using the  --act_stats  command-line flag. \nFor example:  $ python3 compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10  --resume=checkpoint.resnet56_cifar_baseline.pth.tar --act-stats=test -e  The  test  parameter indicates that, in this example, we want to collect activation statistics during the  test  phase.  Note that we also used the  -e  command-line argument to indicate that we want to run a  test  phase. The other two legal parameter values are  train  and  valid  which collect activation statistics during the  training  and  validation  phases, respectively.",
+            "location": "/usage/index.html#collecting-activations-statistics", 
+            "text": "In CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical).  \nYou can collect activation statistics using the  --act_stats  command-line flag. \nFor example:  $ python3 compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10  --resume=checkpoint.resnet56_cifar_baseline.pth.tar --act-stats=test -e  The  test  parameter indicates that, in this example, we want to collect activation statistics during the  test  phase.  Note that we also used the  -e  command-line argument to indicate that we want to run a  test  phase. The other two legal parameter values are  train  and  valid  which collect activation statistics during the  training  and  validation  phases, respectively.", 
             "title": "Collecting activations statistics"
-        },
+        }, 
         {
-            "location": "/usage/index.html#collectors-and-their-collaterals",
-            "text": "An instance of a subclass of  ActivationStatsCollector  can be used to collect activation statistics.  Currently,  ActivationStatsCollector  has two types of subclasses:  SummaryActivationStatsCollector  and  RecordsActivationStatsCollector . \nInstances of  SummaryActivationStatsCollector  compute the mean of some statistic of the activation.  It is rather\nlight-weight and quicker than collecting a record per activation.  The statistic function is configured in the constructor. \nIn the sample compression application,  compress_classifier.py , we create a dictionary of collectors.  For example:  SummaryActivationStatsCollector(model,\n                                \"sparsity\",\n                                lambda t: 100 * distiller.utils.sparsity(t))  The lambda expression is invoked per activation encountered during forward passes, and the value it returns (in this case, the sparsity of the activation tensors, multiplied by 100) is stored in  module.sparsity  ( \"sparsity\"  is this collector's name).  To access the statistics, you can invoke  collector.value() , or you can access each module's data directly.  Another type of collector is  RecordsActivationStatsCollector  which computes a hard-coded set of activations statistics and collects a record per activation .  For obvious reasons, this is slower than instances of  SummaryActivationStatsCollector . ActivationStatsCollector  default to collecting activations statistics only on the output activations of ReLU layers, but we can choose any layer type we want.  In the example below we collect statistics from outputs of  torch.nn.Conv2d  layers.  RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])  Collectors can write their data to Excel workbooks (which are named using the collector's name), by invoking  collector.to_xlsx(path_to_workbook) .  In  compress_classifier.py  we currently create four different collectors which you can selectively disable.  You can also add other statistics collectors and use a different function to compute your new statistic.  collectors = missingdict({\n    \"sparsity\":      SummaryActivationStatsCollector(model, \"sparsity\",\n                                                     lambda t: 100 * distiller.utils.sparsity(t)),\n    \"l1_channels\":   SummaryActivationStatsCollector(model, \"l1_channels\",\n                                                     distiller.utils.activation_channels_l1),\n    \"apoz_channels\": SummaryActivationStatsCollector(model, \"apoz_channels\",\n                                                     distiller.utils.activation_channels_apoz),\n    \"records\":       RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])})  By default, these Collectors write their data to files in the active log directory.  You can use a utility function,  distiller.log_activation_statsitics , to log the data of an  ActivationStatsCollector  instance to one of the backend-loggers.  For an example, the code below logs the  \"sparsity\"  collector to a TensorBoard log file.  distiller.log_activation_statsitics(epoch, \"train\", loggers=[tflogger],\n                                    collector=collectors[\"sparsity\"])",
+            "location": "/usage/index.html#collectors-and-their-collaterals", 
+            "text": "An instance of a subclass of  ActivationStatsCollector  can be used to collect activation statistics.  Currently,  ActivationStatsCollector  has two types of subclasses:  SummaryActivationStatsCollector  and  RecordsActivationStatsCollector . \nInstances of  SummaryActivationStatsCollector  compute the mean of some statistic of the activation.  It is rather\nlight-weight and quicker than collecting a record per activation.  The statistic function is configured in the constructor. \nIn the sample compression application,  compress_classifier.py , we create a dictionary of collectors.  For example:  SummaryActivationStatsCollector(model,\n                                 sparsity ,\n                                lambda t: 100 * distiller.utils.sparsity(t))  The lambda expression is invoked per activation encountered during forward passes, and the value it returns (in this case, the sparsity of the activation tensors, multiplied by 100) is stored in  module.sparsity  ( \"sparsity\"  is this collector's name).  To access the statistics, you can invoke  collector.value() , or you can access each module's data directly.  Another type of collector is  RecordsActivationStatsCollector  which computes a hard-coded set of activations statistics and collects a record per activation .  For obvious reasons, this is slower than instances of  SummaryActivationStatsCollector . ActivationStatsCollector  default to collecting activations statistics only on the output activations of ReLU layers, but we can choose any layer type we want.  In the example below we collect statistics from outputs of  torch.nn.Conv2d  layers.  RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])  Collectors can write their data to Excel workbooks (which are named using the collector's name), by invoking  collector.to_xlsx(path_to_workbook) .  In  compress_classifier.py  we currently create four different collectors which you can selectively disable.  You can also add other statistics collectors and use a different function to compute your new statistic.  collectors = missingdict({\n     sparsity :      SummaryActivationStatsCollector(model,  sparsity ,\n                                                     lambda t: 100 * distiller.utils.sparsity(t)),\n     l1_channels :   SummaryActivationStatsCollector(model,  l1_channels ,\n                                                     distiller.utils.activation_channels_l1),\n     apoz_channels : SummaryActivationStatsCollector(model,  apoz_channels ,\n                                                     distiller.utils.activation_channels_apoz),\n     records :       RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])})  By default, these Collectors write their data to files in the active log directory.  You can use a utility function,  distiller.log_activation_statsitics , to log the data of an  ActivationStatsCollector  instance to one of the backend-loggers.  For an example, the code below logs the  \"sparsity\"  collector to a TensorBoard log file.  distiller.log_activation_statsitics(epoch,  train , loggers=[tflogger],\n                                    collector=collectors[ sparsity ])", 
             "title": "Collectors and their collaterals"
-        },
+        }, 
         {
-            "location": "/usage/index.html#caveats",
-            "text": "Distiller collects activations statistics using PyTorch's forward-hooks mechanism.  Collectors iteratively register the modules' forward-hooks, and collectors are called during the forward traversal and get exposed to activation data.  Registering for forward callbacks is performed like this:  module.register_forward_hook  This makes apparent two limitations of this mechanism:   We can only register on PyTorch modules.  This means that we can't register on the forward hook of a functionals such as  torch.nn.functional.relu  and  torch.nn.functional.max_pool2d . \n   Therefore, you may need to replace functionals with their module alternative.  For example:     class MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n\n    def forward(self, x):\n        x = F.relu(self.conv1(x))\n        return x  Can be changed to:    class MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n        self.relu = nn.ReLU(inplace=True)\n\n    def forward(self, x):\n        x = self.relu(self.conv1(x))\n        return x   We can only use a module instance once in our models.  If we use the same module several times, then we can't determine which node in the graph has invoked the callback, because the PyTorch callback signature  def hook(module, input, output)  doesn't provide enough contextual information. \nTorchVision's  ResNet  is an example of a model that uses the same instance of nn.ReLU multiple times:     class BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu(out)                    # <================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu(out)                    # <================\n        return out  In Distiller we changed  ResNet  to use multiple instances of nn.ReLU, and each instance is used only once:    class BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu1 = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.relu2 = nn.ReLU(inplace=True)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu1(out)                   # <================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu2(out)                   # <================\n        return out",
+            "location": "/usage/index.html#caveats", 
+            "text": "Distiller collects activations statistics using PyTorch's forward-hooks mechanism.  Collectors iteratively register the modules' forward-hooks, and collectors are called during the forward traversal and get exposed to activation data.  Registering for forward callbacks is performed like this:  module.register_forward_hook  This makes apparent two limitations of this mechanism:   We can only register on PyTorch modules.  This means that we can't register on the forward hook of a functionals such as  torch.nn.functional.relu  and  torch.nn.functional.max_pool2d . \n   Therefore, you may need to replace functionals with their module alternative.  For example:     class MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n\n    def forward(self, x):\n        x = F.relu(self.conv1(x))\n        return x  Can be changed to:    class MadeUpNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 6, 5)\n        self.relu = nn.ReLU(inplace=True)\n\n    def forward(self, x):\n        x = self.relu(self.conv1(x))\n        return x   We can only use a module instance once in our models.  If we use the same module several times, then we can't determine which node in the graph has invoked the callback, because the PyTorch callback signature  def hook(module, input, output)  doesn't provide enough contextual information. \nTorchVision's  ResNet  is an example of a model that uses the same instance of nn.ReLU multiple times:     class BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu(out)                    #  ================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu(out)                    #  ================\n        return out  In Distiller we changed  ResNet  to use multiple instances of nn.ReLU, and each instance is used only once:    class BasicBlock(nn.Module):\n    expansion = 1\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(BasicBlock, self).__init__()\n        self.conv1 = conv3x3(inplanes, planes, stride)\n        self.bn1 = nn.BatchNorm2d(planes)\n        self.relu1 = nn.ReLU(inplace=True)\n        self.conv2 = conv3x3(planes, planes)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.relu2 = nn.ReLU(inplace=True)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu1(out)                   #  ================\n        out = self.conv2(out)\n        out = self.bn2(out)\n        if self.downsample is not None:\n            residual = self.downsample(x)\n        out += residual\n        out = self.relu2(out)                   #  ================\n        return out", 
             "title": "Caveats"
-        },
+        }, 
         {
-            "location": "/usage/index.html#using-the-jupyter-notebooks",
-            "text": "The Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller.  They are explained in a separate page.",
+            "location": "/usage/index.html#using-the-jupyter-notebooks", 
+            "text": "The Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller.  They are explained in a separate page.", 
             "title": "Using the Jupyter notebooks"
-        },
+        }, 
         {
-            "location": "/usage/index.html#generating-this-documentation",
-            "text": "Install mkdocs and the required packages by executing:  $ pip3 install -r doc-requirements.txt  To build the project documentation run:  $ cd distiller/docs-src\n$ mkdocs build --clean  This will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.",
+            "location": "/usage/index.html#generating-this-documentation", 
+            "text": "Install mkdocs and the required packages by executing:  $ pip3 install -r doc-requirements.txt  To build the project documentation run:  $ cd distiller/docs-src\n$ mkdocs build --clean  This will create a folder named 'site' which contains the documentation website.\nOpen distiller/docs/site/index.html to view the documentation home page.", 
             "title": "Generating this documentation"
-        },
+        }, 
         {
-            "location": "/schedule/index.html",
-            "text": "Compression scheduler\n\n\nIn iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of \nCompressionScheduler\n: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions.  We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification.  We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base.  Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.  \n\n\nHigh level overview\n\n\nLet's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.\n\n\n\n\nPruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. \n\n\nAn LR-scheduler specifies the LR-decay algorithm.  \n\n\n\n\nThese define the \nwhat\n part of the schedule.  \n\n\nThe Policies define the \nwhen\n part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application).  A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing.\n\nThe \nCompressionScheduler\n is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.\n\n\nSyntax through example\n\n\nWe'll use \nalexnet.schedule_agp.yaml\n to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThere is only one version of the YAML syntax, and the version number is not verified at the moment.  However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.\n\n\nversion: 1\n\n\n\n\nIn the \npruners\n section, we define the instances of pruners we want the scheduler to instantiate and use.\n\nWe define a single pruner instance, named \nmy_pruner\n, of algorithm \nSensitivityPruner\n.  We will refer to this instance in the \nPolicies\n section.\n\nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.\n\nYou may list as many Pruners as you want in this section, as long as each has a unique name.  You can several types of pruners in one schedule.\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.6\n\n\n\n\nNext, we want to specify the learning-rate decay scheduling in the \nlr_schedulers\n section.  We assign a name to this instance: \npruning_lr\n.  As in the \npruners\n section, you may use any name, as long as all LR-schedulers have a unique name.  At the moment, only one instance of LR-scheduler is allowed.  The LR-scheduler must be a subclass of PyTorch's \n_LRScheduler\n.  You can use any of the schedulers defined in \ntorch.optim.lr_scheduler\n (see \nhere\n).  In addition, we've implemented some additional schedulers in Distiller (see \nhere\n). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to \ntorch.optim.lr_scheduler\n, they can be used without changing the application code.\n\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\n\n\n\nFinally, we define the \npolicies\n section which defines the actual scheduling.  A \nPolicy\n manages an instance of a \nPruner\n, \nRegularizer\n, \nQuantizer\n, or \nLRScheduler\n, by naming the instance.  In the example below, a \nPruningPolicy\n uses the pruner instance named \nmy_pruner\n: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38.  \n\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThis is \niterative pruning\n:\n\n\n\n\n\n\nTrain Connectivity\n\n\n\n\n\n\nPrune Connections\n\n\n\n\n\n\nRetrain Weights\n\n\n\n\n\n\nGoto 2\n\n\n\n\n\n\nIt is described  in \nLearning both Weights and Connections for Efficient Neural Networks\n:\n\n\n\n\n\"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"\n\n\n\n\nRegularization\n\n\nYou can also define and schedule regularization.\n\n\nL1 regularization\n\n\nFormat (this is an informal specification, not a valid \nABNF\n specification):\n\n\nregularizers:\n  <REGULARIZER_NAME_STR>:\n    class: L1Regularizer\n    reg_regims:\n      <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n      ...\n      <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n    threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_L1_reg:\n    class: L1Regularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': 0.000002\n      'module.layer3.1.conv2.weight': 0.000002\n      'module.layer3.1.conv3.weight': 0.000002\n      'module.layer3.2.conv1.weight': 0.000002\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_L1_reg\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nGroup regularization\n\n\nFormat (informal specification):\n\n\nFormat:\n  regularizers:\n    <REGULARIZER_NAME_STR>:\n      class: L1Regularizer\n      reg_regims:\n        <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n        <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n      threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_filter_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': [0.00005, '3D']\n      'module.layer3.1.conv2.weight': [0.00005, '3D']\n      'module.layer3.1.conv3.weight': [0.00005, '3D']\n      'module.layer3.2.conv1.weight': [0.00005, '3D']\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_filter_regularizer\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nMixing it up\n\n\nYou can mix pruning and regularization.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nregularizers:\n  2d_groups_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'features.module.0.weight': [0.000012, '2D']\n      'features.module.3.weight': [0.000012, '2D']\n      'features.module.6.weight': [0.000012, '2D']\n      'features.module.8.weight': [0.000012, '2D']\n      'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n  # Learning rate decay scheduler\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - regularizer:\n      instance_name: '2d_groups_regularizer'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 1\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\n\nQuantization\n\n\nSimilarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the \nQuantizer\n class (see details \nhere\n). \nNote\n that only a single quantizer instance may be defined per YAML.\n\n\nLet's see an example:\n\n\nquantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null\n\n\n\n\n\n\nThe specific quantization method we're instantiating here is \nDorefaQuantizer\n.\n\n\nThen we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. \n\n\nThen, we define the \nbits_overrides\n mapping. In the example above, we choose not to quantize the first and last layer of the model. In the case of \nDorefaQuantizer\n, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters \nconv1\n, the first activation layer \nrelu1\n, the last activation layer \nfinal_relu\n and the last layer with parameters \nfc\n.\n\n\nSpecifying \nnull\n means \"do not quantize\".\n\n\nNote that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.\n\n\n\n\nDefining overrides for \ngroups of layers\n using regular expressions\n\n\nSuppose we have a sub-module in our model named \nblock1\n, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named \nconv1\n, \nconv2\n and so on. In that case we would define the following:\n\n\nbits_overrides:\n  'block1\\.conv*':\n    wts: 2\n    acts: null\n\n\n\n\n\n\nRegEx Note\n: Remember that the dot (\n.\n) is a meta-character (i.e. a reserved character) in regular expressions. So, to match the actual dot characters which separate sub-modules in PyTorch module names, we need to escape it: \n\\.\n\n\n\n\nOverlapping patterns\n are also possible, which allows to define some override for a groups of layers and also \"single-out\" specific layers for different overrides. For example, let's take the last example and configure a different override for \nblock1.conv1\n:\n\n\nbits_overrides:\n  'block1\\.conv1':\n    wts: 4\n    acts: null\n  'block1\\.conv*':\n    wts: 2\n    acts: null\n\n\n\n\n\n\nImportant Note\n: The patterns are evaluated eagerly - first match wins. So, to properly quantize a model using \"broad\" patterns and more \"specific\" patterns as just shown, make sure the specific pattern is listed \nbefore\n the broad one.\n\n\n\n\nThe \nQuantizationPolicy\n, which controls the quantization procedure during training, is actually quite simplistic. All it does is call the \nprepare_model()\n function of the \nQuantizer\n when it's initialized, followed by the first call to \nquantize_params()\n. Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the \nquantize_params()\n function again.\n\n\npolicies:\n    - quantizer:\n        instance_name: dorefa_quantizer\n      starting_epoch: 0\n      ending_epoch: 200\n      frequency: 1\n\n\n\n\nImportant Note\n: As mentioned \nhere\n, since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to \nprepare_model()\n must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.\n\n\nKnowledge Distillation\n\n\nKnowledge distillation (see \nhere\n) is also implemented as a \nPolicy\n, which should be added to the scheduler. However, with the current implementation, it cannot be defined within the YAML file like the rest of the policies described above.\n\n\nTo make the integration of this method into applications a bit easier, a helper function can be used that will add a set of command-line arguments related to knowledge distillation:\n\n\nimport argparse\nimport distiller\n\nparser = argparse.ArgumentParser()\ndistiller.knowledge_distillation.add_distillation_args(parser)\n\n\n\n\n(The \nadd_distillation_args\n function accepts some optional arguments, see its implementation at \ndistiller/knowledge_distillation.py\n for details)\n\n\nThese are the command line arguments exposed by this function:\n\n\nKnowledge Distillation Training Arguments:\n  --kd-teacher ARCH     Model architecture for teacher model\n  --kd-pretrained       Use pre-trained model for teacher\n  --kd-resume PATH      Path to checkpoint from which to load teacher weights\n  --kd-temperature TEMP, --kd-temp TEMP\n                        Knowledge distillation softmax temperature\n  --kd-distill-wt WEIGHT, --kd-dw WEIGHT\n                        Weight for distillation loss (student vs. teacher soft\n                        targets)\n  --kd-student-wt WEIGHT, --kd-sw WEIGHT\n                        Weight for student vs. labels loss\n  --kd-teacher-wt WEIGHT, --kd-tw WEIGHT\n                        Weight for teacher vs. labels loss\n  --kd-start-epoch EPOCH_NUM\n                        Epoch from which to enable distillation\n\n\n\n\n\nOnce arguments have been parsed, some initialization code is required, similar to the following:\n\n\n# Assuming:\n# \"args\" variable holds command line arguments\n# \"model\" variable holds the model we're going to train, that is - the student model\n# \"compression_scheduler\" variable holds a CompressionScheduler instance\n\nargs.kd_policy = None\nif args.kd_teacher:\n    # Create teacher model - replace this with your model creation code\n    teacher = create_model(args.kd_pretrained, args.dataset, args.kd_teacher, device_ids=args.gpus)\n    if args.kd_resume:\n        teacher, _, _ = apputils.load_checkpoint(teacher, chkpt_file=args.kd_resume)\n\n    # Create policy and add to scheduler\n    dlw = distiller.DistillationLossWeights(args.kd_distill_wt, args.kd_student_wt, args.kd_teacher_wt)\n    args.kd_policy = distiller.KnowledgeDistillationPolicy(model, teacher, args.kd_temp, dlw)\n    compression_scheduler.add_policy(args.kd_policy, starting_epoch=args.kd_start_epoch, ending_epoch=args.epochs,\n                                     frequency=1)\n\n\n\n\nFinally, during the training loop, we need to perform forward propagation through the teacher model as well. The \nKnowledgeDistillationPolicy\n class keeps a reference to both the student and teacher models, and exposes a \nforward\n function that performs forward propagation on both of them. Since this is not one of the standard policy callbacks, we need to call this function manually from our training loop, as follows:\n\n\nif args.kd_policy is None:\n    # Revert to a \"normal\" forward-prop call if no knowledge distillation policy is present\n    output = model(input_var)\nelse:\n    output = args.kd_policy.forward(input_var)\n\n\n\n\nTo see this integration in action, take a look at the image classification sample at \nexamples/classifier_compression/compress_classifier.py\n.",
+            "location": "/schedule/index.html", 
+            "text": "Compression scheduler\n\n\nIn iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of \nCompressionScheduler\n: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions.  We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification.  We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base.  Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.  \n\n\nHigh level overview\n\n\nLet's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.\n\n\n\n\nPruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. \n\n\nAn LR-scheduler specifies the LR-decay algorithm.  \n\n\n\n\nThese define the \nwhat\n part of the schedule.  \n\n\nThe Policies define the \nwhen\n part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application).  A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing.\n\nThe \nCompressionScheduler\n is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.\n\n\nSyntax through example\n\n\nWe'll use \nalexnet.schedule_agp.yaml\n to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThere is only one version of the YAML syntax, and the version number is not verified at the moment.  However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.\n\n\nversion: 1\n\n\n\n\nIn the \npruners\n section, we define the instances of pruners we want the scheduler to instantiate and use.\n\nWe define a single pruner instance, named \nmy_pruner\n, of algorithm \nSensitivityPruner\n.  We will refer to this instance in the \nPolicies\n section.\n\nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.\n\nYou may list as many Pruners as you want in this section, as long as each has a unique name.  You can several types of pruners in one schedule.\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.6\n\n\n\n\nNext, we want to specify the learning-rate decay scheduling in the \nlr_schedulers\n section.  We assign a name to this instance: \npruning_lr\n.  As in the \npruners\n section, you may use any name, as long as all LR-schedulers have a unique name.  At the moment, only one instance of LR-scheduler is allowed.  The LR-scheduler must be a subclass of PyTorch's \n_LRScheduler\n.  You can use any of the schedulers defined in \ntorch.optim.lr_scheduler\n (see \nhere\n).  In addition, we've implemented some additional schedulers in Distiller (see \nhere\n). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to \ntorch.optim.lr_scheduler\n, they can be used without changing the application code.\n\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\n\n\n\nFinally, we define the \npolicies\n section which defines the actual scheduling.  A \nPolicy\n manages an instance of a \nPruner\n, \nRegularizer\n, \nQuantizer\n, or \nLRScheduler\n, by naming the instance.  In the example below, a \nPruningPolicy\n uses the pruner instance named \nmy_pruner\n: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38.  \n\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThis is \niterative pruning\n:\n\n\n\n\n\n\nTrain Connectivity\n\n\n\n\n\n\nPrune Connections\n\n\n\n\n\n\nRetrain Weights\n\n\n\n\n\n\nGoto 2\n\n\n\n\n\n\nIt is described  in \nLearning both Weights and Connections for Efficient Neural Networks\n:\n\n\n\n\n\"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"\n\n\n\n\nRegularization\n\n\nYou can also define and schedule regularization.\n\n\nL1 regularization\n\n\nFormat (this is an informal specification, not a valid \nABNF\n specification):\n\n\nregularizers:\n  \nREGULARIZER_NAME_STR\n:\n    class: L1Regularizer\n    reg_regims:\n      \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n      ...\n      \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n    threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_L1_reg:\n    class: L1Regularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': 0.000002\n      'module.layer3.1.conv2.weight': 0.000002\n      'module.layer3.1.conv3.weight': 0.000002\n      'module.layer3.2.conv1.weight': 0.000002\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_L1_reg\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nGroup regularization\n\n\nFormat (informal specification):\n\n\nFormat:\n  regularizers:\n    \nREGULARIZER_NAME_STR\n:\n      class: L1Regularizer\n      reg_regims:\n        \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n        \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n      threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_filter_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': [0.00005, '3D']\n      'module.layer3.1.conv2.weight': [0.00005, '3D']\n      'module.layer3.1.conv3.weight': [0.00005, '3D']\n      'module.layer3.2.conv1.weight': [0.00005, '3D']\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_filter_regularizer\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nMixing it up\n\n\nYou can mix pruning and regularization.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nregularizers:\n  2d_groups_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'features.module.0.weight': [0.000012, '2D']\n      'features.module.3.weight': [0.000012, '2D']\n      'features.module.6.weight': [0.000012, '2D']\n      'features.module.8.weight': [0.000012, '2D']\n      'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n  # Learning rate decay scheduler\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - regularizer:\n      instance_name: '2d_groups_regularizer'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 1\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\n\nQuantization\n\n\nSimilarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the \nQuantizer\n class (see details \nhere\n). \nNote\n that only a single quantizer instance may be defined per YAML.\n\n\nLet's see an example:\n\n\nquantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null\n\n\n\n\n\n\nThe specific quantization method we're instantiating here is \nDorefaQuantizer\n.\n\n\nThen we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. \n\n\nThen, we define the \nbits_overrides\n mapping. In the example above, we choose not to quantize the first and last layer of the model. In the case of \nDorefaQuantizer\n, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters \nconv1\n, the first activation layer \nrelu1\n, the last activation layer \nfinal_relu\n and the last layer with parameters \nfc\n.\n\n\nSpecifying \nnull\n means \"do not quantize\".\n\n\nNote that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.\n\n\n\n\nDefining overrides for \ngroups of layers\n using regular expressions\n\n\nSuppose we have a sub-module in our model named \nblock1\n, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named \nconv1\n, \nconv2\n and so on. In that case we would define the following:\n\n\nbits_overrides:\n  'block1\\.conv*':\n    wts: 2\n    acts: null\n\n\n\n\n\n\nRegEx Note\n: Remember that the dot (\n.\n) is a meta-character (i.e. a reserved character) in regular expressions. So, to match the actual dot characters which separate sub-modules in PyTorch module names, we need to escape it: \n\\.\n\n\n\n\nOverlapping patterns\n are also possible, which allows to define some override for a groups of layers and also \"single-out\" specific layers for different overrides. For example, let's take the last example and configure a different override for \nblock1.conv1\n:\n\n\nbits_overrides:\n  'block1\\.conv1':\n    wts: 4\n    acts: null\n  'block1\\.conv*':\n    wts: 2\n    acts: null\n\n\n\n\n\n\nImportant Note\n: The patterns are evaluated eagerly - first match wins. So, to properly quantize a model using \"broad\" patterns and more \"specific\" patterns as just shown, make sure the specific pattern is listed \nbefore\n the broad one.\n\n\n\n\nThe \nQuantizationPolicy\n, which controls the quantization procedure during training, is actually quite simplistic. All it does is call the \nprepare_model()\n function of the \nQuantizer\n when it's initialized, followed by the first call to \nquantize_params()\n. Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the \nquantize_params()\n function again.\n\n\npolicies:\n    - quantizer:\n        instance_name: dorefa_quantizer\n      starting_epoch: 0\n      ending_epoch: 200\n      frequency: 1\n\n\n\n\nImportant Note\n: As mentioned \nhere\n, since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to \nprepare_model()\n must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.\n\n\nKnowledge Distillation\n\n\nKnowledge distillation (see \nhere\n) is also implemented as a \nPolicy\n, which should be added to the scheduler. However, with the current implementation, it cannot be defined within the YAML file like the rest of the policies described above.\n\n\nTo make the integration of this method into applications a bit easier, a helper function can be used that will add a set of command-line arguments related to knowledge distillation:\n\n\nimport argparse\nimport distiller\n\nparser = argparse.ArgumentParser()\ndistiller.knowledge_distillation.add_distillation_args(parser)\n\n\n\n\n(The \nadd_distillation_args\n function accepts some optional arguments, see its implementation at \ndistiller/knowledge_distillation.py\n for details)\n\n\nThese are the command line arguments exposed by this function:\n\n\nKnowledge Distillation Training Arguments:\n  --kd-teacher ARCH     Model architecture for teacher model\n  --kd-pretrained       Use pre-trained model for teacher\n  --kd-resume PATH      Path to checkpoint from which to load teacher weights\n  --kd-temperature TEMP, --kd-temp TEMP\n                        Knowledge distillation softmax temperature\n  --kd-distill-wt WEIGHT, --kd-dw WEIGHT\n                        Weight for distillation loss (student vs. teacher soft\n                        targets)\n  --kd-student-wt WEIGHT, --kd-sw WEIGHT\n                        Weight for student vs. labels loss\n  --kd-teacher-wt WEIGHT, --kd-tw WEIGHT\n                        Weight for teacher vs. labels loss\n  --kd-start-epoch EPOCH_NUM\n                        Epoch from which to enable distillation\n\n\n\n\n\nOnce arguments have been parsed, some initialization code is required, similar to the following:\n\n\n# Assuming:\n# \nargs\n variable holds command line arguments\n# \nmodel\n variable holds the model we're going to train, that is - the student model\n# \ncompression_scheduler\n variable holds a CompressionScheduler instance\n\nargs.kd_policy = None\nif args.kd_teacher:\n    # Create teacher model - replace this with your model creation code\n    teacher = create_model(args.kd_pretrained, args.dataset, args.kd_teacher, device_ids=args.gpus)\n    if args.kd_resume:\n        teacher, _, _ = apputils.load_checkpoint(teacher, chkpt_file=args.kd_resume)\n\n    # Create policy and add to scheduler\n    dlw = distiller.DistillationLossWeights(args.kd_distill_wt, args.kd_student_wt, args.kd_teacher_wt)\n    args.kd_policy = distiller.KnowledgeDistillationPolicy(model, teacher, args.kd_temp, dlw)\n    compression_scheduler.add_policy(args.kd_policy, starting_epoch=args.kd_start_epoch, ending_epoch=args.epochs,\n                                     frequency=1)\n\n\n\n\nFinally, during the training loop, we need to perform forward propagation through the teacher model as well. The \nKnowledgeDistillationPolicy\n class keeps a reference to both the student and teacher models, and exposes a \nforward\n function that performs forward propagation on both of them. Since this is not one of the standard policy callbacks, we need to call this function manually from our training loop, as follows:\n\n\nif args.kd_policy is None:\n    # Revert to a \nnormal\n forward-prop call if no knowledge distillation policy is present\n    output = model(input_var)\nelse:\n    output = args.kd_policy.forward(input_var)\n\n\n\n\nTo see this integration in action, take a look at the image classification sample at \nexamples/classifier_compression/compress_classifier.py\n.", 
             "title": "Compression Scheduling"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#compression-scheduler",
-            "text": "In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of  CompressionScheduler : it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions.  We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification.  We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base.  Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.",
+            "location": "/schedule/index.html#compression-scheduler", 
+            "text": "In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of  CompressionScheduler : it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions.  We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification.  We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base.  Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.", 
             "title": "Compression scheduler"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#high-level-overview",
-            "text": "Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.   Pruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively.   An LR-scheduler specifies the LR-decay algorithm.     These define the  what  part of the schedule.    The Policies define the  when  part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application).  A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing. \nThe  CompressionScheduler  is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.",
+            "location": "/schedule/index.html#high-level-overview", 
+            "text": "Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.   Pruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively.   An LR-scheduler specifies the LR-decay algorithm.     These define the  what  part of the schedule.    The Policies define the  when  part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application).  A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing. \nThe  CompressionScheduler  is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.", 
             "title": "High level overview"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#syntax-through-example",
-            "text": "We'll use  alexnet.schedule_agp.yaml  to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.  version: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1  There is only one version of the YAML syntax, and the version number is not verified at the moment.  However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.  version: 1  In the  pruners  section, we define the instances of pruners we want the scheduler to instantiate and use. \nWe define a single pruner instance, named  my_pruner , of algorithm  SensitivityPruner .  We will refer to this instance in the  Policies  section. \nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. \nYou may list as many Pruners as you want in this section, as long as each has a unique name.  You can several types of pruners in one schedule.  pruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.6  Next, we want to specify the learning-rate decay scheduling in the  lr_schedulers  section.  We assign a name to this instance:  pruning_lr .  As in the  pruners  section, you may use any name, as long as all LR-schedulers have a unique name.  At the moment, only one instance of LR-scheduler is allowed.  The LR-scheduler must be a subclass of PyTorch's  _LRScheduler .  You can use any of the schedulers defined in  torch.optim.lr_scheduler  (see  here ).  In addition, we've implemented some additional schedulers in Distiller (see  here ). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to  torch.optim.lr_scheduler , they can be used without changing the application code.  lr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9  Finally, we define the  policies  section which defines the actual scheduling.  A  Policy  manages an instance of a  Pruner ,  Regularizer ,  Quantizer , or  LRScheduler , by naming the instance.  In the example below, a  PruningPolicy  uses the pruner instance named  my_pruner : it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38.    policies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1  This is  iterative pruning :    Train Connectivity    Prune Connections    Retrain Weights    Goto 2    It is described  in  Learning both Weights and Connections for Efficient Neural Networks :   \"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"",
+            "location": "/schedule/index.html#syntax-through-example", 
+            "text": "We'll use  alexnet.schedule_agp.yaml  to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.  version: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1  There is only one version of the YAML syntax, and the version number is not verified at the moment.  However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.  version: 1  In the  pruners  section, we define the instances of pruners we want the scheduler to instantiate and use. \nWe define a single pruner instance, named  my_pruner , of algorithm  SensitivityPruner .  We will refer to this instance in the  Policies  section. \nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. \nYou may list as many Pruners as you want in this section, as long as each has a unique name.  You can several types of pruners in one schedule.  pruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.6  Next, we want to specify the learning-rate decay scheduling in the  lr_schedulers  section.  We assign a name to this instance:  pruning_lr .  As in the  pruners  section, you may use any name, as long as all LR-schedulers have a unique name.  At the moment, only one instance of LR-scheduler is allowed.  The LR-scheduler must be a subclass of PyTorch's  _LRScheduler .  You can use any of the schedulers defined in  torch.optim.lr_scheduler  (see  here ).  In addition, we've implemented some additional schedulers in Distiller (see  here ). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to  torch.optim.lr_scheduler , they can be used without changing the application code.  lr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9  Finally, we define the  policies  section which defines the actual scheduling.  A  Policy  manages an instance of a  Pruner ,  Regularizer ,  Quantizer , or  LRScheduler , by naming the instance.  In the example below, a  PruningPolicy  uses the pruner instance named  my_pruner : it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38.    policies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1  This is  iterative pruning :    Train Connectivity    Prune Connections    Retrain Weights    Goto 2    It is described  in  Learning both Weights and Connections for Efficient Neural Networks :   \"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"", 
             "title": "Syntax through example"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#regularization",
-            "text": "You can also define and schedule regularization.",
+            "location": "/schedule/index.html#regularization", 
+            "text": "You can also define and schedule regularization.", 
             "title": "Regularization"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#l1-regularization",
-            "text": "Format (this is an informal specification, not a valid  ABNF  specification):  regularizers:\n  <REGULARIZER_NAME_STR>:\n    class: L1Regularizer\n    reg_regims:\n      <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n      ...\n      <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT>\n    threshold_criteria: [Mean_Abs | Max]  For example:  version: 1\n\nregularizers:\n  my_L1_reg:\n    class: L1Regularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': 0.000002\n      'module.layer3.1.conv2.weight': 0.000002\n      'module.layer3.1.conv3.weight': 0.000002\n      'module.layer3.2.conv1.weight': 0.000002\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_L1_reg\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1",
+            "location": "/schedule/index.html#l1-regularization", 
+            "text": "Format (this is an informal specification, not a valid  ABNF  specification):  regularizers:\n   REGULARIZER_NAME_STR :\n    class: L1Regularizer\n    reg_regims:\n       PYTORCH_PARAM_NAME_STR :  STRENGTH_FLOAT \n      ...\n       PYTORCH_PARAM_NAME_STR :  STRENGTH_FLOAT \n    threshold_criteria: [Mean_Abs | Max]  For example:  version: 1\n\nregularizers:\n  my_L1_reg:\n    class: L1Regularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': 0.000002\n      'module.layer3.1.conv2.weight': 0.000002\n      'module.layer3.1.conv3.weight': 0.000002\n      'module.layer3.2.conv1.weight': 0.000002\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_L1_reg\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1", 
             "title": "L1 regularization"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#group-regularization",
-            "text": "Format (informal specification):  Format:\n  regularizers:\n    <REGULARIZER_NAME_STR>:\n      class: L1Regularizer\n      reg_regims:\n        <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n        <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>]\n      threshold_criteria: [Mean_Abs | Max]  For example:  version: 1\n\nregularizers:\n  my_filter_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': [0.00005, '3D']\n      'module.layer3.1.conv2.weight': [0.00005, '3D']\n      'module.layer3.1.conv3.weight': [0.00005, '3D']\n      'module.layer3.2.conv1.weight': [0.00005, '3D']\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_filter_regularizer\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1",
+            "location": "/schedule/index.html#group-regularization", 
+            "text": "Format (informal specification):  Format:\n  regularizers:\n     REGULARIZER_NAME_STR :\n      class: L1Regularizer\n      reg_regims:\n         PYTORCH_PARAM_NAME_STR : [ STRENGTH_FLOAT ,  '2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows' ]\n         PYTORCH_PARAM_NAME_STR : [ STRENGTH_FLOAT ,  '2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows' ]\n      threshold_criteria: [Mean_Abs | Max]  For example:  version: 1\n\nregularizers:\n  my_filter_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': [0.00005, '3D']\n      'module.layer3.1.conv2.weight': [0.00005, '3D']\n      'module.layer3.1.conv3.weight': [0.00005, '3D']\n      'module.layer3.2.conv1.weight': [0.00005, '3D']\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_filter_regularizer\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1", 
             "title": "Group regularization"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#mixing-it-up",
-            "text": "You can mix pruning and regularization.  version: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nregularizers:\n  2d_groups_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'features.module.0.weight': [0.000012, '2D']\n      'features.module.3.weight': [0.000012, '2D']\n      'features.module.6.weight': [0.000012, '2D']\n      'features.module.8.weight': [0.000012, '2D']\n      'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n  # Learning rate decay scheduler\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - regularizer:\n      instance_name: '2d_groups_regularizer'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 1\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1",
+            "location": "/schedule/index.html#mixing-it-up", 
+            "text": "You can mix pruning and regularization.  version: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nregularizers:\n  2d_groups_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'features.module.0.weight': [0.000012, '2D']\n      'features.module.3.weight': [0.000012, '2D']\n      'features.module.6.weight': [0.000012, '2D']\n      'features.module.8.weight': [0.000012, '2D']\n      'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n  # Learning rate decay scheduler\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - regularizer:\n      instance_name: '2d_groups_regularizer'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 1\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1", 
             "title": "Mixing it up"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#quantization",
-            "text": "Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the  Quantizer  class (see details  here ).  Note  that only a single quantizer instance may be defined per YAML.  Let's see an example:  quantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null   The specific quantization method we're instantiating here is  DorefaQuantizer .  Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively.   Then, we define the  bits_overrides  mapping. In the example above, we choose not to quantize the first and last layer of the model. In the case of  DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters  conv1 , the first activation layer  relu1 , the last activation layer  final_relu  and the last layer with parameters  fc .  Specifying  null  means \"do not quantize\".  Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.",
+            "location": "/schedule/index.html#quantization", 
+            "text": "Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the  Quantizer  class (see details  here ).  Note  that only a single quantizer instance may be defined per YAML.  Let's see an example:  quantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null   The specific quantization method we're instantiating here is  DorefaQuantizer .  Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively.   Then, we define the  bits_overrides  mapping. In the example above, we choose not to quantize the first and last layer of the model. In the case of  DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters  conv1 , the first activation layer  relu1 , the last activation layer  final_relu  and the last layer with parameters  fc .  Specifying  null  means \"do not quantize\".  Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.", 
             "title": "Quantization"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#defining-overrides-for-groups-of-layers-using-regular-expressions",
-            "text": "Suppose we have a sub-module in our model named  block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named  conv1 ,  conv2  and so on. In that case we would define the following:  bits_overrides:\n  'block1\\.conv*':\n    wts: 2\n    acts: null   RegEx Note : Remember that the dot ( . ) is a meta-character (i.e. a reserved character) in regular expressions. So, to match the actual dot characters which separate sub-modules in PyTorch module names, we need to escape it:  \\.   Overlapping patterns  are also possible, which allows to define some override for a groups of layers and also \"single-out\" specific layers for different overrides. For example, let's take the last example and configure a different override for  block1.conv1 :  bits_overrides:\n  'block1\\.conv1':\n    wts: 4\n    acts: null\n  'block1\\.conv*':\n    wts: 2\n    acts: null   Important Note : The patterns are evaluated eagerly - first match wins. So, to properly quantize a model using \"broad\" patterns and more \"specific\" patterns as just shown, make sure the specific pattern is listed  before  the broad one.   The  QuantizationPolicy , which controls the quantization procedure during training, is actually quite simplistic. All it does is call the  prepare_model()  function of the  Quantizer  when it's initialized, followed by the first call to  quantize_params() . Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the  quantize_params()  function again.  policies:\n    - quantizer:\n        instance_name: dorefa_quantizer\n      starting_epoch: 0\n      ending_epoch: 200\n      frequency: 1  Important Note : As mentioned  here , since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to  prepare_model()  must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.",
+            "location": "/schedule/index.html#defining-overrides-for-groups-of-layers-using-regular-expressions", 
+            "text": "Suppose we have a sub-module in our model named  block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named  conv1 ,  conv2  and so on. In that case we would define the following:  bits_overrides:\n  'block1\\.conv*':\n    wts: 2\n    acts: null   RegEx Note : Remember that the dot ( . ) is a meta-character (i.e. a reserved character) in regular expressions. So, to match the actual dot characters which separate sub-modules in PyTorch module names, we need to escape it:  \\.   Overlapping patterns  are also possible, which allows to define some override for a groups of layers and also \"single-out\" specific layers for different overrides. For example, let's take the last example and configure a different override for  block1.conv1 :  bits_overrides:\n  'block1\\.conv1':\n    wts: 4\n    acts: null\n  'block1\\.conv*':\n    wts: 2\n    acts: null   Important Note : The patterns are evaluated eagerly - first match wins. So, to properly quantize a model using \"broad\" patterns and more \"specific\" patterns as just shown, make sure the specific pattern is listed  before  the broad one.   The  QuantizationPolicy , which controls the quantization procedure during training, is actually quite simplistic. All it does is call the  prepare_model()  function of the  Quantizer  when it's initialized, followed by the first call to  quantize_params() . Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the  quantize_params()  function again.  policies:\n    - quantizer:\n        instance_name: dorefa_quantizer\n      starting_epoch: 0\n      ending_epoch: 200\n      frequency: 1  Important Note : As mentioned  here , since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to  prepare_model()  must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.", 
             "title": "Defining overrides for groups of layers using regular expressions"
-        },
+        }, 
         {
-            "location": "/schedule/index.html#knowledge-distillation",
-            "text": "Knowledge distillation (see  here ) is also implemented as a  Policy , which should be added to the scheduler. However, with the current implementation, it cannot be defined within the YAML file like the rest of the policies described above.  To make the integration of this method into applications a bit easier, a helper function can be used that will add a set of command-line arguments related to knowledge distillation:  import argparse\nimport distiller\n\nparser = argparse.ArgumentParser()\ndistiller.knowledge_distillation.add_distillation_args(parser)  (The  add_distillation_args  function accepts some optional arguments, see its implementation at  distiller/knowledge_distillation.py  for details)  These are the command line arguments exposed by this function:  Knowledge Distillation Training Arguments:\n  --kd-teacher ARCH     Model architecture for teacher model\n  --kd-pretrained       Use pre-trained model for teacher\n  --kd-resume PATH      Path to checkpoint from which to load teacher weights\n  --kd-temperature TEMP, --kd-temp TEMP\n                        Knowledge distillation softmax temperature\n  --kd-distill-wt WEIGHT, --kd-dw WEIGHT\n                        Weight for distillation loss (student vs. teacher soft\n                        targets)\n  --kd-student-wt WEIGHT, --kd-sw WEIGHT\n                        Weight for student vs. labels loss\n  --kd-teacher-wt WEIGHT, --kd-tw WEIGHT\n                        Weight for teacher vs. labels loss\n  --kd-start-epoch EPOCH_NUM\n                        Epoch from which to enable distillation  Once arguments have been parsed, some initialization code is required, similar to the following:  # Assuming:\n# \"args\" variable holds command line arguments\n# \"model\" variable holds the model we're going to train, that is - the student model\n# \"compression_scheduler\" variable holds a CompressionScheduler instance\n\nargs.kd_policy = None\nif args.kd_teacher:\n    # Create teacher model - replace this with your model creation code\n    teacher = create_model(args.kd_pretrained, args.dataset, args.kd_teacher, device_ids=args.gpus)\n    if args.kd_resume:\n        teacher, _, _ = apputils.load_checkpoint(teacher, chkpt_file=args.kd_resume)\n\n    # Create policy and add to scheduler\n    dlw = distiller.DistillationLossWeights(args.kd_distill_wt, args.kd_student_wt, args.kd_teacher_wt)\n    args.kd_policy = distiller.KnowledgeDistillationPolicy(model, teacher, args.kd_temp, dlw)\n    compression_scheduler.add_policy(args.kd_policy, starting_epoch=args.kd_start_epoch, ending_epoch=args.epochs,\n                                     frequency=1)  Finally, during the training loop, we need to perform forward propagation through the teacher model as well. The  KnowledgeDistillationPolicy  class keeps a reference to both the student and teacher models, and exposes a  forward  function that performs forward propagation on both of them. Since this is not one of the standard policy callbacks, we need to call this function manually from our training loop, as follows:  if args.kd_policy is None:\n    # Revert to a \"normal\" forward-prop call if no knowledge distillation policy is present\n    output = model(input_var)\nelse:\n    output = args.kd_policy.forward(input_var)  To see this integration in action, take a look at the image classification sample at  examples/classifier_compression/compress_classifier.py .",
+            "location": "/schedule/index.html#knowledge-distillation", 
+            "text": "Knowledge distillation (see  here ) is also implemented as a  Policy , which should be added to the scheduler. However, with the current implementation, it cannot be defined within the YAML file like the rest of the policies described above.  To make the integration of this method into applications a bit easier, a helper function can be used that will add a set of command-line arguments related to knowledge distillation:  import argparse\nimport distiller\n\nparser = argparse.ArgumentParser()\ndistiller.knowledge_distillation.add_distillation_args(parser)  (The  add_distillation_args  function accepts some optional arguments, see its implementation at  distiller/knowledge_distillation.py  for details)  These are the command line arguments exposed by this function:  Knowledge Distillation Training Arguments:\n  --kd-teacher ARCH     Model architecture for teacher model\n  --kd-pretrained       Use pre-trained model for teacher\n  --kd-resume PATH      Path to checkpoint from which to load teacher weights\n  --kd-temperature TEMP, --kd-temp TEMP\n                        Knowledge distillation softmax temperature\n  --kd-distill-wt WEIGHT, --kd-dw WEIGHT\n                        Weight for distillation loss (student vs. teacher soft\n                        targets)\n  --kd-student-wt WEIGHT, --kd-sw WEIGHT\n                        Weight for student vs. labels loss\n  --kd-teacher-wt WEIGHT, --kd-tw WEIGHT\n                        Weight for teacher vs. labels loss\n  --kd-start-epoch EPOCH_NUM\n                        Epoch from which to enable distillation  Once arguments have been parsed, some initialization code is required, similar to the following:  # Assuming:\n#  args  variable holds command line arguments\n#  model  variable holds the model we're going to train, that is - the student model\n#  compression_scheduler  variable holds a CompressionScheduler instance\n\nargs.kd_policy = None\nif args.kd_teacher:\n    # Create teacher model - replace this with your model creation code\n    teacher = create_model(args.kd_pretrained, args.dataset, args.kd_teacher, device_ids=args.gpus)\n    if args.kd_resume:\n        teacher, _, _ = apputils.load_checkpoint(teacher, chkpt_file=args.kd_resume)\n\n    # Create policy and add to scheduler\n    dlw = distiller.DistillationLossWeights(args.kd_distill_wt, args.kd_student_wt, args.kd_teacher_wt)\n    args.kd_policy = distiller.KnowledgeDistillationPolicy(model, teacher, args.kd_temp, dlw)\n    compression_scheduler.add_policy(args.kd_policy, starting_epoch=args.kd_start_epoch, ending_epoch=args.epochs,\n                                     frequency=1)  Finally, during the training loop, we need to perform forward propagation through the teacher model as well. The  KnowledgeDistillationPolicy  class keeps a reference to both the student and teacher models, and exposes a  forward  function that performs forward propagation on both of them. Since this is not one of the standard policy callbacks, we need to call this function manually from our training loop, as follows:  if args.kd_policy is None:\n    # Revert to a  normal  forward-prop call if no knowledge distillation policy is present\n    output = model(input_var)\nelse:\n    output = args.kd_policy.forward(input_var)  To see this integration in action, take a look at the image classification sample at  examples/classifier_compression/compress_classifier.py .", 
             "title": "Knowledge Distillation"
-        },
+        }, 
         {
-            "location": "/pruning/index.html",
-            "text": "Pruning\n\n\nA common methodology for inducing sparsity in weights and activations is called \npruning\n.  Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero.  Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process.\n\n\nWe can prune weights, biases, and activations.  Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them.  We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)).   Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.\n\n\n\nLet's define sparsity\n\n\nSparsity is a a measure of how many elements in a tensor are exact zeros, relative to the tensor size.  A tensor is considered sparse if \"most\" of its elements are zero.  How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-)\n\nThe \n\\(l_0\\)-\"norm\" function\n measures how many zero-elements are in a tensor \nx\n:\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\).  Anything but an exact zero contributes a value of 1 - that's pretty cool.\n\nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use \ndistiller.sparsity\n and \ndistiller.density\n to query a PyTorch tensor's sparsity and density.\n\n\nWhat is weights pruning?\n\n\nWeights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights.  In general, the term 'parameters' refers to both weights and bias tensors of a model.  Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble.\n\n\nPruning requires a criteria for choosing which elements to prune - this is called the \npruning criteria\n.  The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) .  This is implemented by the \ndistiller.MagnitudeParameterPruner\n class.  The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed.\n\n\nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features.  Therefore, some of these redundancies can be removed by setting their weights to zero.\n\n\nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model).  Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to use find these sparse solutions. \n\n\n\n\nPruning schedule\n\n\nThe most straight-forward to prune is to take a trained model and prune it once; also called \none-shot pruning\n.  In \nLearning both Weights and Connections for Efficient Neural Networks\n Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped.  The surprise is what they call the \"free lunch\" effect: \n\"reducing 2x the connections without losing accuracy even without retraining.\"\n\nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss).  This is called \niterative pruning\n, and the retraining that follows pruning is often referred to as \nfine-tuning\n. How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the \npruning schedule\n.\n\n\nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights.  At each iteration, we prune more weights.\n\nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm.  For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level.  And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved.\n\n\nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).\n\n\nPruning granularity\n\n\nPruning individual weight elements is called \nelement-wise pruning\n, and it is also sometimes referred to as \nfine-grained\n pruning.\n\n\nCoarse-grained pruning\n - also referred to as \nstructured pruning\n, \ngroup pruning\n, or \nblock pruning\n - is pruning entire groups of elements which have some significance.  Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.\n\n\nSensitivity analysis\n\n\nThe hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors.  Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning.  \n\nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score.  We do this for all of the parameterized layers, and for each layer we examine several sparsity levels.  This should teach us about the \"sensitivity\" of each of the layers to pruning.\n\n\nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor.\n\n\nMuch as we can prune structures, we can also perform sensitivity analysis on structures.  Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters.\n\n\n\nThe authors of \nPruning Filters for Efficient ConvNets\n describe how they do sensitivity analysis:\n\n\n\n\n\"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\"\n\n\n\n\nThe diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distillers's \nperform_sensitivity_analysis\n utility function.\n\n\nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are.  The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.\n\n\n\n\nReferences\n\n\n \nSong Han, Jeff Pool, John Tran, William J. Dally\n.\n    \nLearning both Weights and Connections for Efficient Neural Networks\n,\n     arXiv:1607.04381v2,\n    2015.\n\n\n\n\n\nHao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf\n.\n    \nPruning Filters for Efficient ConvNets\n,\n     arXiv:1608.08710v3,\n    2017.",
+            "location": "/pruning/index.html", 
+            "text": "Pruning\n\n\nA common methodology for inducing sparsity in weights and activations is called \npruning\n.  Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero.  Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process.\n\n\nWe can prune weights, biases, and activations.  Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them.  We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)).   Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.\n\n\n\nLet's define sparsity\n\n\nSparsity is a a measure of how many elements in a tensor are exact zeros, relative to the tensor size.  A tensor is considered sparse if \"most\" of its elements are zero.  How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-)\n\nThe \n\\(l_0\\)-\"norm\" function\n measures how many zero-elements are in a tensor \nx\n:\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\).  Anything but an exact zero contributes a value of 1 - that's pretty cool.\n\nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use \ndistiller.sparsity\n and \ndistiller.density\n to query a PyTorch tensor's sparsity and density.\n\n\nWhat is weights pruning?\n\n\nWeights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights.  In general, the term 'parameters' refers to both weights and bias tensors of a model.  Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble.\n\n\nPruning requires a criteria for choosing which elements to prune - this is called the \npruning criteria\n.  The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) .  This is implemented by the \ndistiller.MagnitudeParameterPruner\n class.  The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed.\n\n\nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features.  Therefore, some of these redundancies can be removed by setting their weights to zero.\n\n\nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model).  Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to use find these sparse solutions. \n\n\n\n\nPruning schedule\n\n\nThe most straight-forward to prune is to take a trained model and prune it once; also called \none-shot pruning\n.  In \nLearning both Weights and Connections for Efficient Neural Networks\n Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped.  The surprise is what they call the \"free lunch\" effect: \n\"reducing 2x the connections without losing accuracy even without retraining.\"\n\nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss).  This is called \niterative pruning\n, and the retraining that follows pruning is often referred to as \nfine-tuning\n. How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the \npruning schedule\n.\n\n\nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights.  At each iteration, we prune more weights.\n\nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm.  For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level.  And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved.\n\n\nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).\n\n\nPruning granularity\n\n\nPruning individual weight elements is called \nelement-wise pruning\n, and it is also sometimes referred to as \nfine-grained\n pruning.\n\n\nCoarse-grained pruning\n - also referred to as \nstructured pruning\n, \ngroup pruning\n, or \nblock pruning\n - is pruning entire groups of elements which have some significance.  Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.\n\n\nSensitivity analysis\n\n\nThe hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors.  Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning.  \n\nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score.  We do this for all of the parameterized layers, and for each layer we examine several sparsity levels.  This should teach us about the \"sensitivity\" of each of the layers to pruning.\n\n\nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor.\n\n\nMuch as we can prune structures, we can also perform sensitivity analysis on structures.  Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters.\n\n\n\nThe authors of \nPruning Filters for Efficient ConvNets\n describe how they do sensitivity analysis:\n\n\n\n\n\"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\"\n\n\n\n\nThe diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distillers's \nperform_sensitivity_analysis\n utility function.\n\n\nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are.  The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.\n\n\n\n\nReferences\n\n\n \nSong Han, Jeff Pool, John Tran, William J. Dally\n.\n    \nLearning both Weights and Connections for Efficient Neural Networks\n,\n     arXiv:1607.04381v2,\n    2015.\n\n\n\n\n\nHao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf\n.\n    \nPruning Filters for Efficient ConvNets\n,\n     arXiv:1608.08710v3,\n    2017.", 
             "title": "Pruning"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#pruning",
-            "text": "A common methodology for inducing sparsity in weights and activations is called  pruning .  Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero.  Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process.  We can prune weights, biases, and activations.  Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them.  We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)).   Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.",
+            "location": "/pruning/index.html#pruning", 
+            "text": "A common methodology for inducing sparsity in weights and activations is called  pruning .  Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero.  Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process.  We can prune weights, biases, and activations.  Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them.  We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)).   Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.", 
             "title": "Pruning"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#lets-define-sparsity",
-            "text": "Sparsity is a a measure of how many elements in a tensor are exact zeros, relative to the tensor size.  A tensor is considered sparse if \"most\" of its elements are zero.  How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-) \nThe  \\(l_0\\)-\"norm\" function  measures how many zero-elements are in a tensor  x :\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\).  Anything but an exact zero contributes a value of 1 - that's pretty cool. \nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use  distiller.sparsity  and  distiller.density  to query a PyTorch tensor's sparsity and density.",
+            "location": "/pruning/index.html#lets-define-sparsity", 
+            "text": "Sparsity is a a measure of how many elements in a tensor are exact zeros, relative to the tensor size.  A tensor is considered sparse if \"most\" of its elements are zero.  How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-) \nThe  \\(l_0\\)-\"norm\" function  measures how many zero-elements are in a tensor  x :\n\\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\]\nIn other words, an element contributes either a value of 1 or 0 to \\(l_0\\).  Anything but an exact zero contributes a value of 1 - that's pretty cool. \nSometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement:\n\\[\ndensity = 1 - sparsity\n\\]\nYou can use  distiller.sparsity  and  distiller.density  to query a PyTorch tensor's sparsity and density.", 
             "title": "Let's define sparsity"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#what-is-weights-pruning",
-            "text": "Weights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights.  In general, the term 'parameters' refers to both weights and bias tensors of a model.  Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble. \nPruning requires a criteria for choosing which elements to prune - this is called the  pruning criteria .  The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) .  This is implemented by the  distiller.MagnitudeParameterPruner  class.  The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed. \nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features.  Therefore, some of these redundancies can be removed by setting their weights to zero. \nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model).  Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to use find these sparse solutions.",
+            "location": "/pruning/index.html#what-is-weights-pruning", 
+            "text": "Weights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights.  In general, the term 'parameters' refers to both weights and bias tensors of a model.  Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble. \nPruning requires a criteria for choosing which elements to prune - this is called the  pruning criteria .  The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) .  This is implemented by the  distiller.MagnitudeParameterPruner  class.  The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed. \nA related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features.  Therefore, some of these redundancies can be removed by setting their weights to zero. \nAnd yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model).  Another way to look at it, is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to use find these sparse solutions.", 
             "title": "What is weights pruning?"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#pruning-schedule",
-            "text": "The most straight-forward to prune is to take a trained model and prune it once; also called  one-shot pruning .  In  Learning both Weights and Connections for Efficient Neural Networks  Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped.  The surprise is what they call the \"free lunch\" effect:  \"reducing 2x the connections without losing accuracy even without retraining.\" \nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss).  This is called  iterative pruning , and the retraining that follows pruning is often referred to as  fine-tuning . How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the  pruning schedule . \nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights.  At each iteration, we prune more weights. \nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm.  For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level.  And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved. \nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).",
+            "location": "/pruning/index.html#pruning-schedule", 
+            "text": "The most straight-forward to prune is to take a trained model and prune it once; also called  one-shot pruning .  In  Learning both Weights and Connections for Efficient Neural Networks  Song Han et. al show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped.  The surprise is what they call the \"free lunch\" effect:  \"reducing 2x the connections without losing accuracy even without retraining.\" \nHowever, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss).  This is called  iterative pruning , and the retraining that follows pruning is often referred to as  fine-tuning . How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the  pruning schedule . \nWe can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights.  At each iteration, we prune more weights. \nThe decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm.  For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level.  And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved. \nDistiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).", 
             "title": "Pruning schedule"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#pruning-granularity",
-            "text": "Pruning individual weight elements is called  element-wise pruning , and it is also sometimes referred to as  fine-grained  pruning.  Coarse-grained pruning  - also referred to as  structured pruning ,  group pruning , or  block pruning  - is pruning entire groups of elements which have some significance.  Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.",
+            "location": "/pruning/index.html#pruning-granularity", 
+            "text": "Pruning individual weight elements is called  element-wise pruning , and it is also sometimes referred to as  fine-grained  pruning.  Coarse-grained pruning  - also referred to as  structured pruning ,  group pruning , or  block pruning  - is pruning entire groups of elements which have some significance.  Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.", 
             "title": "Pruning granularity"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#sensitivity-analysis",
-            "text": "The hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors.  Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning.   \nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score.  We do this for all of the parameterized layers, and for each layer we examine several sparsity levels.  This should teach us about the \"sensitivity\" of each of the layers to pruning. \nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor. \nMuch as we can prune structures, we can also perform sensitivity analysis on structures.  Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters.  The authors of  Pruning Filters for Efficient ConvNets  describe how they do sensitivity analysis:   \"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\"   The diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distillers's  perform_sensitivity_analysis  utility function. \nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are.  The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.",
+            "location": "/pruning/index.html#sensitivity-analysis", 
+            "text": "The hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors.  Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning.   \nThe idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score.  We do this for all of the parameterized layers, and for each layer we examine several sparsity levels.  This should teach us about the \"sensitivity\" of each of the layers to pruning. \nThe evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor. \nMuch as we can prune structures, we can also perform sensitivity analysis on structures.  Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters.  The authors of  Pruning Filters for Efficient ConvNets  describe how they do sensitivity analysis:   \"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\"   The diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distillers's  perform_sensitivity_analysis  utility function. \nAs reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are.  The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.", 
             "title": "Sensitivity analysis"
-        },
+        }, 
         {
-            "location": "/pruning/index.html#references",
-            "text": "Song Han, Jeff Pool, John Tran, William J. Dally .\n     Learning both Weights and Connections for Efficient Neural Networks ,\n     arXiv:1607.04381v2,\n    2015.   Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf .\n     Pruning Filters for Efficient ConvNets ,\n     arXiv:1608.08710v3,\n    2017.",
+            "location": "/pruning/index.html#references", 
+            "text": "Song Han, Jeff Pool, John Tran, William J. Dally .\n     Learning both Weights and Connections for Efficient Neural Networks ,\n     arXiv:1607.04381v2,\n    2015.   Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf .\n     Pruning Filters for Efficient ConvNets ,\n     arXiv:1608.08710v3,\n    2017.", 
             "title": "References"
-        },
+        }, 
         {
-            "location": "/regularization/index.html",
-            "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n    optimizer.zero_grad()\n    output = model(input)\n    loss = criterion(output, target)\n    loss.backward()\n    optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error.  In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L}  \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity.  To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero.  \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler.  This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2  + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements.  Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not.  The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty.  We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers.  It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer.  Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity).  Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers.  Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n    \nDeep Learning\n,\n     arXiv:1607.04381v2,\n    2017.\n\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n    \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n     arXiv:1607.04381v2,\n    2017.\n\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n    \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n    arXiv:1705.08922v3,\n    2017.\n\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n    \nStructured pruning of deep convolutional neural networks\n,\n    arXiv:1512.08571,\n    2015",
+            "location": "/regularization/index.html", 
+            "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n    optimizer.zero_grad()\n    output = model(input)\n    loss = criterion(output, target)\n    loss.backward()\n    optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error.  In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L}  \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity.  To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero.  \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler.  This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2  + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements.  Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not.  The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty.  We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers.  It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer.  Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity).  Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers.  Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n    \nDeep Learning\n,\n     arXiv:1607.04381v2,\n    2017.\n\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n    \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n     arXiv:1607.04381v2,\n    2017.\n\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n    \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n    arXiv:1705.08922v3,\n    2017.\n\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n    \nStructured pruning of deep convolutional neural networks\n,\n    arXiv:1512.08571,\n    2015", 
             "title": "Regularization"
-        },
+        }, 
         {
-            "location": "/regularization/index.html#regularization",
-            "text": "In their book  Deep Learning  Ian Goodfellow et al. define regularization as   \"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"   PyTorch's  optimizers  use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).  In general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or  criterion  in the Distiller sample image classifier compression application).  optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n    optimizer.zero_grad()\n    output = model(input)\n    loss = criterion(output, target)\n    loss.backward()\n    optimizer.step()  \\(\\lambda_R\\) is a scalar called the  regularization strength , and it balances the data error and the regularization error.  In PyTorch, this is the  weight_decay  argument.  \\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a  magnitude , or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L}  \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]  \\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.  The qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in  Deep Learning .",
+            "location": "/regularization/index.html#regularization", 
+            "text": "In their book  Deep Learning  Ian Goodfellow et al. define regularization as   \"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"   PyTorch's  optimizers  use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).  In general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or  criterion  in the Distiller sample image classifier compression application).  optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n    optimizer.zero_grad()\n    output = model(input)\n    loss = criterion(output, target)\n    loss.backward()\n    optimizer.step()  \\(\\lambda_R\\) is a scalar called the  regularization strength , and it balances the data error and the regularization error.  In PyTorch, this is the  weight_decay  argument.  \\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a  magnitude , or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L}  \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]  \\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.  The qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in  Deep Learning .", 
             "title": "Regularization"
-        },
+        }, 
         {
-            "location": "/regularization/index.html#sparsity-and-regularization",
-            "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.  In  Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy:   \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"   Regularization can also be used to induce sparsity.  To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]  \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero.  \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler.  This is sometimes referred to as  feature selection  and gives us another interpretation of pruning.  One  of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.  If we configure  weight_decay  to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2  + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]  Class  distiller.L1Regularizer  implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.  l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()",
+            "location": "/regularization/index.html#sparsity-and-regularization", 
+            "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.  In  Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy:   \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"   Regularization can also be used to induce sparsity.  To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]  \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero.  \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler.  This is sometimes referred to as  feature selection  and gives us another interpretation of pruning.  One  of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.  If we configure  weight_decay  to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2  + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]  Class  distiller.L1Regularizer  implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.  l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()", 
             "title": "Sparsity and Regularization"
-        },
+        }, 
         {
-            "location": "/regularization/index.html#group-regularization",
-            "text": "In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements.  Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not.  The group structures have to be pre-defined.  To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty.  We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers.  It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]  Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).  \\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).  \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer.  Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).  \nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity).  Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.  Huizi-et-al-2017  provides an overview of some of the different groups: kernel, channel, filter, layers.  Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even  intra kernel strided sparsity  can also be used.  distiller.GroupLassoRegularizer  currently implements most of these groups, and you can easily add new groups.",
+            "location": "/regularization/index.html#group-regularization", 
+            "text": "In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements.  Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not.  The group structures have to be pre-defined.  To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty.  We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers.  It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]  Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).  \\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).  \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer.  Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).  \nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity).  Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.  Huizi-et-al-2017  provides an overview of some of the different groups: kernel, channel, filter, layers.  Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even  intra kernel strided sparsity  can also be used.  distiller.GroupLassoRegularizer  currently implements most of these groups, and you can easily add new groups.", 
             "title": "Group Regularization"
-        },
+        }, 
         {
-            "location": "/regularization/index.html#references",
-            "text": "Ian Goodfellow and Yoshua Bengio and Aaron Courville .\n     Deep Learning ,\n     arXiv:1607.04381v2,\n    2017.   Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally .\n     DSD: Dense-Sparse-Dense Training for Deep Neural Networks ,\n     arXiv:1607.04381v2,\n    2017.   Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally .\n     Exploring the Regularity of Sparse Structure in Convolutional Neural Networks ,\n    arXiv:1705.08922v3,\n    2017.   Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung .\n     Structured pruning of deep convolutional neural networks ,\n    arXiv:1512.08571,\n    2015",
+            "location": "/regularization/index.html#references", 
+            "text": "Ian Goodfellow and Yoshua Bengio and Aaron Courville .\n     Deep Learning ,\n     arXiv:1607.04381v2,\n    2017.   Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally .\n     DSD: Dense-Sparse-Dense Training for Deep Neural Networks ,\n     arXiv:1607.04381v2,\n    2017.   Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally .\n     Exploring the Regularity of Sparse Structure in Convolutional Neural Networks ,\n    arXiv:1705.08922v3,\n    2017.   Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung .\n     Structured pruning of deep convolutional neural networks ,\n    arXiv:1512.08571,\n    2015", 
             "title": "References"
-        },
+        }, 
         {
-            "location": "/quantization/index.html",
-            "text": "Quantization\n\n\nQuantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress.\n\n\nNote that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.\n\n\nMotivation: Overall Efficiency\n\n\nThe more obvious benefit from quantization is \nsignificantly reduced bandwidth and storage\n. For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32.\n\nAdditionally integer compute is \nfaster\n than floating point compute. It is also much more \narea and energy efficient\n: \n\n\n\n\n\n\n\n\nINT8 Operation\n\n\nEnergy Saving vs FP32\n\n\nArea Saving vs FP32\n\n\n\n\n\n\n\n\n\n\nAdd\n\n\n30x\n\n\n116x\n\n\n\n\n\n\nMultiply\n\n\n18.5x\n\n\n27x\n\n\n\n\n\n\n\n\n(\nDally, 2015\n)\n\n\nNote that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations (\nRastegari et al., 2016\n).\n\n\nInteger vs. FP32\n\n\nThere are two main attributes when discussing a numerical format. The first is \ndynamic range\n, which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the \nprecision / resolution\n of the format (the distance between two numbers).\n\nFor all integer formats, the dynamic range is \n[-2^{n-1} .. 2^{n-1}-1]\n, where \nn\n is the number of bits. So for INT8 the range is \n[-128 .. 127]\n, and for INT4 it is \n[-16 .. 15]\n (we're limiting ourselves to signed integers for now). The number of representable values is \n2^n\n.\nContrast that with FP32, where the dynamic range is \n\\pm 3.4\\ x\\ 10^{38}\n, and approximately \n4.2\\ x\\ 10^9\n values can be represented.\n\nWe can immediately see that FP32 is much more \nversatile\n, in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model.\n\nIn order to be able to represent these different distributions with an integer format, a \nscale factor\n is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution.\n\nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. \nCourbariaux et al., 2014\n scale using only shifts, eliminating the floating point operation. In \nGEMMLWOP\n, the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.\n\n\nAvoiding Overflows\n\n\nConvolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation, \nand\n for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths.\n\nThe result of multiplying two \nn\n-bit integers is, at most, a \n2n\n-bit number. In convolution layers, such multiplications are accumulated \nc\\cdot k^2\n times, where \nc\n is the number of input channels and \nk\n is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be \n2n + M\n-bits wide, where M is at least \nlog_2(c\\cdot k^2)\n. In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.\n\n\n\"Conservative\" Quantization: INT8\n\n\nIn many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy (\nGysel at al., 2018\n).\n\nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\".\n\n\n\n\nOffline\n means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.\n\n\nOnline\n means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.\n\n\n\n\n\n\n\nIt is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. A simple method which can yield nice results is to simply use an average of the observed min/max values instead of the actual values. Alternatively, statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible (\nMigacz, 2017\n). Going further, \nBanner et al., 2018\n have proposed a method for analytically computing the clipping value under certain conditions.\n\n\nAnother possible optimization point is \nscale-factor scope\n. The most common way is use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels.\n\n\nWhen used to directly quantize a model without re-training, as described so far, this method is commonly referred to as \npost-training quantization\n. However, recent publications have shown that there are cases where post-training quantization to INT8 doesn't preserve accuracy (\nBenoit et al., 2018\n, \nKrishnamoorthi, 2018\n). Namely, smaller models such as MobileNet seem to not respond as well to post-training quantization, presumabley due to their smaller representational capacity. In such cases, \nquantization-aware training\n is used.\n\n\n\"Aggressive\" Quantization: INT4 and Lower\n\n\nNaively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:\n\n\n\n\nTraining / Re-Training\n: For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the \nnext section\n.\n\n\nZhou S et al., 2016\n have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods \nrequire\n a trained FP32 model, either as a starting point (\nZhou A et al., 2017\n), or as a teacher network in a knowledge distillation training setup (see \nhere\n).\n\n\nReplacing the activation function\n: The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used (\nZhou S et al., 2016\n, \nMishra et al., 2018\n). Another method learns the clipping value per layer, with better results (\nChoi et al., 2018\n). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).\n\n\nModifying network structure\n: \nMishra et al., 2018\n try to compensate for the loss of information due to quantization by using wider layers (more channels). \nLin et al., 2017\n proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall.\n\n\nFirst and last layer\n: Many methods do not quantize the first and last layer of the model. It has been observed by \nHan et al., 2015\n that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically (\nZhou S et al., 2016\n, \nChoi et al., 2018\n). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them (\nRastegari et al., 2016\n). Most methods keep the first and last layers at FP32. However, \nChoi et al., 2018\n showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy.\n\n\nIterative quantization\n: Most methods quantize the entire model at once. \nZhou A et al., 2017\n employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.\n\n\nMixed Weights and Activations Precision\n: It has been observed that activations are more sensitive to quantization than weights (\nZhou S et al., 2016\n). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 (\nLi et al., 2016\n, \nZhu et al., 2016\n).\n\n\n\n\nQuantization-Aware Training\n\n\nAs mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this:\n\n\n\n\nA full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference.\n\nIn the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\".  \n\n\nStraight-Through Estimator\n\n\nAn important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) (\nHinton et al., 2012\n, \nBengio, 2013\n), which simply passes the gradient through these functions as-is.  \n\n\nReferences\n\n\n\n\nWilliam Dally\n. High-Performance Hardware for Machine Learning. \nTutorial, NIPS, 2015\n\n\n\n\n\nMohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi\n. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. \nECCV, 2016\n\n\n\n\n\nMatthieu Courbariaux, Yoshua Bengio and Jean-Pierre David\n. Training deep neural networks with low precision multiplications. \narxiv:1412.7024\n\n\n\n\n\nPhilipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi\n. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. \nIEEE Transactions on Neural Networks and Learning Systems, 2018\n\n\n\n\n\nSzymon Migacz\n. 8-bit Inference with TensorRT. \nGTC San Jose, 2017\n\n\n\n\n\nShuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou\n. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. \narxiv:1606.06160\n\n\n\n\n\nAojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen\n. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. \nICLR, 2017\n\n\n\n\n\nAsit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr\n. WRPN: Wide Reduced-Precision Networks. \nICLR, 2018\n\n\n\n\n\nJungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan\n. PACT: Parameterized Clipping Activation for Quantized Neural Networks. \narxiv:1805.06085\n\n\n\n\n\nXiaofan Lin, Cong Zhao and Wei Pan\n. Towards Accurate Binary Convolutional Neural Network. \nNIPS, 2017\n\n\n\n\n\nSong Han, Jeff Pool, John Tran and William Dally\n. Learning both Weights and Connections for Efficient Neural Network. \nNIPS, 2015\n\n\n\n\n\nFengfu Li, Bo Zhang and Bin Liu\n. Ternary Weight Networks. \narxiv:1605.04711\n\n\n\n\n\nChenzhuo Zhu, Song Han, Huizi Mao and William J. Dally\n. Trained Ternary Quantization. \narxiv:1612.01064\n\n\n\n\n\nYoshua Bengio, Nicholas Leonard and Aaron Courville\n. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. \narxiv:1308.3432\n\n\n\n\n\nGeoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed\n. Neural Networks for Machine Learning. \nCoursera, video lectures, 2012\n\n\n\n\n\nBenoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko\n. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. \nECCV, 2018\n\n\n\n\n\nRaghuraman Krishnamoorthi\n. Quantizing deep convolutional networks for efficient inference: A whitepaper \narxiv:1806.08342\n\n\n\n\n\nRon Banner, Yury Nahshan, Elad Hoffer and Daniel Soudry\n. ACIQ: Analytical Clipping for Integer Quantization of neural networks \narxiv:1810.05723",
+            "location": "/quantization/index.html", 
+            "text": "Quantization\n\n\nQuantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress.\n\n\nNote that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.\n\n\nMotivation: Overall Efficiency\n\n\nThe more obvious benefit from quantization is \nsignificantly reduced bandwidth and storage\n. For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32.\n\nAdditionally integer compute is \nfaster\n than floating point compute. It is also much more \narea and energy efficient\n: \n\n\n\n\n\n\n\n\nINT8 Operation\n\n\nEnergy Saving vs FP32\n\n\nArea Saving vs FP32\n\n\n\n\n\n\n\n\n\n\nAdd\n\n\n30x\n\n\n116x\n\n\n\n\n\n\nMultiply\n\n\n18.5x\n\n\n27x\n\n\n\n\n\n\n\n\n(\nDally, 2015\n)\n\n\nNote that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations (\nRastegari et al., 2016\n).\n\n\nInteger vs. FP32\n\n\nThere are two main attributes when discussing a numerical format. The first is \ndynamic range\n, which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the \nprecision / resolution\n of the format (the distance between two numbers).\n\nFor all integer formats, the dynamic range is \n[-2^{n-1} .. 2^{n-1}-1]\n, where \nn\n is the number of bits. So for INT8 the range is \n[-128 .. 127]\n, and for INT4 it is \n[-16 .. 15]\n (we're limiting ourselves to signed integers for now). The number of representable values is \n2^n\n.\nContrast that with FP32, where the dynamic range is \n\\pm 3.4\\ x\\ 10^{38}\n, and approximately \n4.2\\ x\\ 10^9\n values can be represented.\n\nWe can immediately see that FP32 is much more \nversatile\n, in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model.\n\nIn order to be able to represent these different distributions with an integer format, a \nscale factor\n is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution.\n\nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. \nCourbariaux et al., 2014\n scale using only shifts, eliminating the floating point operation. In \nGEMMLWOP\n, the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.\n\n\nAvoiding Overflows\n\n\nConvolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation, \nand\n for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths.\n\nThe result of multiplying two \nn\n-bit integers is, at most, a \n2n\n-bit number. In convolution layers, such multiplications are accumulated \nc\\cdot k^2\n times, where \nc\n is the number of input channels and \nk\n is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be \n2n + M\n-bits wide, where M is at least \nlog_2(c\\cdot k^2)\n. In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.\n\n\n\"Conservative\" Quantization: INT8\n\n\nIn many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy (\nGysel at al., 2018\n).\n\nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\".\n\n\n\n\nOffline\n means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.\n\n\nOnline\n means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.\n\n\n\n\n\n\n\nIt is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. A simple method which can yield nice results is to simply use an average of the observed min/max values instead of the actual values. Alternatively, statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible (\nMigacz, 2017\n). Going further, \nBanner et al., 2018\n have proposed a method for analytically computing the clipping value under certain conditions.\n\n\nAnother possible optimization point is \nscale-factor scope\n. The most common way is use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels.\n\n\nWhen used to directly quantize a model without re-training, as described so far, this method is commonly referred to as \npost-training quantization\n. However, recent publications have shown that there are cases where post-training quantization to INT8 doesn't preserve accuracy (\nBenoit et al., 2018\n, \nKrishnamoorthi, 2018\n). Namely, smaller models such as MobileNet seem to not respond as well to post-training quantization, presumabley due to their smaller representational capacity. In such cases, \nquantization-aware training\n is used.\n\n\n\"Aggressive\" Quantization: INT4 and Lower\n\n\nNaively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:\n\n\n\n\nTraining / Re-Training\n: For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the \nnext section\n.\n\n\nZhou S et al., 2016\n have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods \nrequire\n a trained FP32 model, either as a starting point (\nZhou A et al., 2017\n), or as a teacher network in a knowledge distillation training setup (see \nhere\n).\n\n\nReplacing the activation function\n: The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used (\nZhou S et al., 2016\n, \nMishra et al., 2018\n). Another method learns the clipping value per layer, with better results (\nChoi et al., 2018\n). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).\n\n\nModifying network structure\n: \nMishra et al., 2018\n try to compensate for the loss of information due to quantization by using wider layers (more channels). \nLin et al., 2017\n proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall.\n\n\nFirst and last layer\n: Many methods do not quantize the first and last layer of the model. It has been observed by \nHan et al., 2015\n that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically (\nZhou S et al., 2016\n, \nChoi et al., 2018\n). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them (\nRastegari et al., 2016\n). Most methods keep the first and last layers at FP32. However, \nChoi et al., 2018\n showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy.\n\n\nIterative quantization\n: Most methods quantize the entire model at once. \nZhou A et al., 2017\n employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.\n\n\nMixed Weights and Activations Precision\n: It has been observed that activations are more sensitive to quantization than weights (\nZhou S et al., 2016\n). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 (\nLi et al., 2016\n, \nZhu et al., 2016\n).\n\n\n\n\nQuantization-Aware Training\n\n\nAs mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this:\n\n\n\n\nA full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference.\n\nIn the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\".  \n\n\nStraight-Through Estimator\n\n\nAn important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) (\nHinton et al., 2012\n, \nBengio, 2013\n), which simply passes the gradient through these functions as-is.  \n\n\nReferences\n\n\n\n\nWilliam Dally\n. High-Performance Hardware for Machine Learning. \nTutorial, NIPS, 2015\n\n\n\n\n\nMohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi\n. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. \nECCV, 2016\n\n\n\n\n\nMatthieu Courbariaux, Yoshua Bengio and Jean-Pierre David\n. Training deep neural networks with low precision multiplications. \narxiv:1412.7024\n\n\n\n\n\nPhilipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi\n. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. \nIEEE Transactions on Neural Networks and Learning Systems, 2018\n\n\n\n\n\nSzymon Migacz\n. 8-bit Inference with TensorRT. \nGTC San Jose, 2017\n\n\n\n\n\nShuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou\n. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. \narxiv:1606.06160\n\n\n\n\n\nAojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen\n. Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. \nICLR, 2017\n\n\n\n\n\nAsit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr\n. WRPN: Wide Reduced-Precision Networks. \nICLR, 2018\n\n\n\n\n\nJungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan\n. PACT: Parameterized Clipping Activation for Quantized Neural Networks. \narxiv:1805.06085\n\n\n\n\n\nXiaofan Lin, Cong Zhao and Wei Pan\n. Towards Accurate Binary Convolutional Neural Network. \nNIPS, 2017\n\n\n\n\n\nSong Han, Jeff Pool, John Tran and William Dally\n. Learning both Weights and Connections for Efficient Neural Network. \nNIPS, 2015\n\n\n\n\n\nFengfu Li, Bo Zhang and Bin Liu\n. Ternary Weight Networks. \narxiv:1605.04711\n\n\n\n\n\nChenzhuo Zhu, Song Han, Huizi Mao and William J. Dally\n. Trained Ternary Quantization. \narxiv:1612.01064\n\n\n\n\n\nYoshua Bengio, Nicholas Leonard and Aaron Courville\n. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. \narxiv:1308.3432\n\n\n\n\n\nGeoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed\n. Neural Networks for Machine Learning. \nCoursera, video lectures, 2012\n\n\n\n\n\nBenoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko\n. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. \nECCV, 2018\n\n\n\n\n\nRaghuraman Krishnamoorthi\n. Quantizing deep convolutional networks for efficient inference: A whitepaper \narxiv:1806.08342\n\n\n\n\n\nRon Banner, Yury Nahshan, Elad Hoffer and Daniel Soudry\n. ACIQ: Analytical Clipping for Integer Quantization of neural networks \narxiv:1810.05723", 
             "title": "Quantization"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#quantization",
-            "text": "Quantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress.  Note that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.",
+            "location": "/quantization/index.html#quantization", 
+            "text": "Quantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress.  Note that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.", 
             "title": "Quantization"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#motivation-overall-efficiency",
-            "text": "The more obvious benefit from quantization is  significantly reduced bandwidth and storage . For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32. \nAdditionally integer compute is  faster  than floating point compute. It is also much more  area and energy efficient :      INT8 Operation  Energy Saving vs FP32  Area Saving vs FP32      Add  30x  116x    Multiply  18.5x  27x     ( Dally, 2015 )  Note that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations ( Rastegari et al., 2016 ).",
+            "location": "/quantization/index.html#motivation-overall-efficiency", 
+            "text": "The more obvious benefit from quantization is  significantly reduced bandwidth and storage . For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32. \nAdditionally integer compute is  faster  than floating point compute. It is also much more  area and energy efficient :      INT8 Operation  Energy Saving vs FP32  Area Saving vs FP32      Add  30x  116x    Multiply  18.5x  27x     ( Dally, 2015 )  Note that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations ( Rastegari et al., 2016 ).", 
             "title": "Motivation: Overall Efficiency"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#integer-vs-fp32",
-            "text": "There are two main attributes when discussing a numerical format. The first is  dynamic range , which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the  precision / resolution  of the format (the distance between two numbers). \nFor all integer formats, the dynamic range is  [-2^{n-1} .. 2^{n-1}-1] , where  n  is the number of bits. So for INT8 the range is  [-128 .. 127] , and for INT4 it is  [-16 .. 15]  (we're limiting ourselves to signed integers for now). The number of representable values is  2^n .\nContrast that with FP32, where the dynamic range is  \\pm 3.4\\ x\\ 10^{38} , and approximately  4.2\\ x\\ 10^9  values can be represented. \nWe can immediately see that FP32 is much more  versatile , in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model. \nIn order to be able to represent these different distributions with an integer format, a  scale factor  is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution. \nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain.  Courbariaux et al., 2014  scale using only shifts, eliminating the floating point operation. In  GEMMLWOP , the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.",
+            "location": "/quantization/index.html#integer-vs-fp32", 
+            "text": "There are two main attributes when discussing a numerical format. The first is  dynamic range , which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the  precision / resolution  of the format (the distance between two numbers). \nFor all integer formats, the dynamic range is  [-2^{n-1} .. 2^{n-1}-1] , where  n  is the number of bits. So for INT8 the range is  [-128 .. 127] , and for INT4 it is  [-16 .. 15]  (we're limiting ourselves to signed integers for now). The number of representable values is  2^n .\nContrast that with FP32, where the dynamic range is  \\pm 3.4\\ x\\ 10^{38} , and approximately  4.2\\ x\\ 10^9  values can be represented. \nWe can immediately see that FP32 is much more  versatile , in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model. \nIn order to be able to represent these different distributions with an integer format, a  scale factor  is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution. \nNote that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain.  Courbariaux et al., 2014  scale using only shifts, eliminating the floating point operation. In  GEMMLWOP , the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.", 
             "title": "Integer vs. FP32"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#avoiding-overflows",
-            "text": "Convolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation,  and  for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths. \nThe result of multiplying two  n -bit integers is, at most, a  2n -bit number. In convolution layers, such multiplications are accumulated  c\\cdot k^2  times, where  c  is the number of input channels and  k  is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be  2n + M -bits wide, where M is at least  log_2(c\\cdot k^2) . In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.",
+            "location": "/quantization/index.html#avoiding-overflows", 
+            "text": "Convolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation,  and  for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths. \nThe result of multiplying two  n -bit integers is, at most, a  2n -bit number. In convolution layers, such multiplications are accumulated  c\\cdot k^2  times, where  c  is the number of input channels and  k  is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be  2n + M -bits wide, where M is at least  log_2(c\\cdot k^2) . In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths.", 
             "title": "Avoiding Overflows"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#conservative-quantization-int8",
-            "text": "In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy ( Gysel at al., 2018 ). \nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\".   Offline  means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.  Online  means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.    It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. A simple method which can yield nice results is to simply use an average of the observed min/max values instead of the actual values. Alternatively, statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ( Migacz, 2017 ). Going further,  Banner et al., 2018  have proposed a method for analytically computing the clipping value under certain conditions.  Another possible optimization point is  scale-factor scope . The most common way is use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels.  When used to directly quantize a model without re-training, as described so far, this method is commonly referred to as  post-training quantization . However, recent publications have shown that there are cases where post-training quantization to INT8 doesn't preserve accuracy ( Benoit et al., 2018 ,  Krishnamoorthi, 2018 ). Namely, smaller models such as MobileNet seem to not respond as well to post-training quantization, presumabley due to their smaller representational capacity. In such cases,  quantization-aware training  is used.",
+            "location": "/quantization/index.html#conservative-quantization-int8", 
+            "text": "In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy ( Gysel at al., 2018 ). \nAs mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\".   Offline  means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scaled factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation.  Online  means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive.    It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. A simple method which can yield nice results is to simply use an average of the observed min/max values instead of the actual values. Alternatively, statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ( Migacz, 2017 ). Going further,  Banner et al., 2018  have proposed a method for analytically computing the clipping value under certain conditions.  Another possible optimization point is  scale-factor scope . The most common way is use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels.  When used to directly quantize a model without re-training, as described so far, this method is commonly referred to as  post-training quantization . However, recent publications have shown that there are cases where post-training quantization to INT8 doesn't preserve accuracy ( Benoit et al., 2018 ,  Krishnamoorthi, 2018 ). Namely, smaller models such as MobileNet seem to not respond as well to post-training quantization, presumabley due to their smaller representational capacity. In such cases,  quantization-aware training  is used.", 
             "title": "\"Conservative\" Quantization: INT8"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#aggressive-quantization-int4-and-lower",
-            "text": "Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:   Training / Re-Training : For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the  next section .  Zhou S et al., 2016  have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods  require  a trained FP32 model, either as a starting point ( Zhou A et al., 2017 ), or as a teacher network in a knowledge distillation training setup (see  here ).  Replacing the activation function : The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ( Zhou S et al., 2016 ,  Mishra et al., 2018 ). Another method learns the clipping value per layer, with better results ( Choi et al., 2018 ). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).  Modifying network structure :  Mishra et al., 2018  try to compensate for the loss of information due to quantization by using wider layers (more channels).  Lin et al., 2017  proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall.  First and last layer : Many methods do not quantize the first and last layer of the model. It has been observed by  Han et al., 2015  that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically ( Zhou S et al., 2016 ,  Choi et al., 2018 ). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ( Rastegari et al., 2016 ). Most methods keep the first and last layers at FP32. However,  Choi et al., 2018  showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy.  Iterative quantization : Most methods quantize the entire model at once.  Zhou A et al., 2017  employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.  Mixed Weights and Activations Precision : It has been observed that activations are more sensitive to quantization than weights ( Zhou S et al., 2016 ). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ( Li et al., 2016 ,  Zhu et al., 2016 ).",
+            "location": "/quantization/index.html#aggressive-quantization-int4-and-lower", 
+            "text": "Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy:   Training / Re-Training : For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the  next section .  Zhou S et al., 2016  have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods  require  a trained FP32 model, either as a starting point ( Zhou A et al., 2017 ), or as a teacher network in a knowledge distillation training setup (see  here ).  Replacing the activation function : The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ( Zhou S et al., 2016 ,  Mishra et al., 2018 ). Another method learns the clipping value per layer, with better results ( Choi et al., 2018 ). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above).  Modifying network structure :  Mishra et al., 2018  try to compensate for the loss of information due to quantization by using wider layers (more channels).  Lin et al., 2017  proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall.  First and last layer : Many methods do not quantize the first and last layer of the model. It has been observed by  Han et al., 2015  that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically ( Zhou S et al., 2016 ,  Choi et al., 2018 ). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ( Rastegari et al., 2016 ). Most methods keep the first and last layers at FP32. However,  Choi et al., 2018  showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy.  Iterative quantization : Most methods quantize the entire model at once.  Zhou A et al., 2017  employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at the time followed by several epochs of re-training to recover the accuracy loss from quantization.  Mixed Weights and Activations Precision : It has been observed that activations are more sensitive to quantization than weights ( Zhou S et al., 2016 ). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ( Li et al., 2016 ,  Zhu et al., 2016 ).", 
             "title": "\"Aggressive\" Quantization: INT4 and Lower"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#quantization-aware-training",
-            "text": "As mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this:   A full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference. \nIn the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\".",
+            "location": "/quantization/index.html#quantization-aware-training", 
+            "text": "As mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this:   A full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference. \nIn the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\".", 
             "title": "Quantization-Aware Training"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#straight-through-estimator",
-            "text": "An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) ( Hinton et al., 2012 ,  Bengio, 2013 ), which simply passes the gradient through these functions as-is.",
+            "location": "/quantization/index.html#straight-through-estimator", 
+            "text": "An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) ( Hinton et al., 2012 ,  Bengio, 2013 ), which simply passes the gradient through these functions as-is.", 
             "title": "Straight-Through Estimator"
-        },
+        }, 
         {
-            "location": "/quantization/index.html#references",
-            "text": "William Dally . High-Performance Hardware for Machine Learning.  Tutorial, NIPS, 2015   Mohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi . XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.  ECCV, 2016   Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David . Training deep neural networks with low precision multiplications.  arxiv:1412.7024   Philipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi . Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks.  IEEE Transactions on Neural Networks and Learning Systems, 2018   Szymon Migacz . 8-bit Inference with TensorRT.  GTC San Jose, 2017   Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou . DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.  arxiv:1606.06160   Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen . Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights.  ICLR, 2017   Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr . WRPN: Wide Reduced-Precision Networks.  ICLR, 2018   Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan . PACT: Parameterized Clipping Activation for Quantized Neural Networks.  arxiv:1805.06085   Xiaofan Lin, Cong Zhao and Wei Pan . Towards Accurate Binary Convolutional Neural Network.  NIPS, 2017   Song Han, Jeff Pool, John Tran and William Dally . Learning both Weights and Connections for Efficient Neural Network.  NIPS, 2015   Fengfu Li, Bo Zhang and Bin Liu . Ternary Weight Networks.  arxiv:1605.04711   Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally . Trained Ternary Quantization.  arxiv:1612.01064   Yoshua Bengio, Nicholas Leonard and Aaron Courville . Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.  arxiv:1308.3432   Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed . Neural Networks for Machine Learning.  Coursera, video lectures, 2012   Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko . Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.  ECCV, 2018   Raghuraman Krishnamoorthi . Quantizing deep convolutional networks for efficient inference: A whitepaper  arxiv:1806.08342   Ron Banner, Yury Nahshan, Elad Hoffer and Daniel Soudry . ACIQ: Analytical Clipping for Integer Quantization of neural networks  arxiv:1810.05723",
+            "location": "/quantization/index.html#references", 
+            "text": "William Dally . High-Performance Hardware for Machine Learning.  Tutorial, NIPS, 2015   Mohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi . XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.  ECCV, 2016   Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David . Training deep neural networks with low precision multiplications.  arxiv:1412.7024   Philipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi . Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks.  IEEE Transactions on Neural Networks and Learning Systems, 2018   Szymon Migacz . 8-bit Inference with TensorRT.  GTC San Jose, 2017   Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou . DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.  arxiv:1606.06160   Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen . Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights.  ICLR, 2017   Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr . WRPN: Wide Reduced-Precision Networks.  ICLR, 2018   Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan . PACT: Parameterized Clipping Activation for Quantized Neural Networks.  arxiv:1805.06085   Xiaofan Lin, Cong Zhao and Wei Pan . Towards Accurate Binary Convolutional Neural Network.  NIPS, 2017   Song Han, Jeff Pool, John Tran and William Dally . Learning both Weights and Connections for Efficient Neural Network.  NIPS, 2015   Fengfu Li, Bo Zhang and Bin Liu . Ternary Weight Networks.  arxiv:1605.04711   Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally . Trained Ternary Quantization.  arxiv:1612.01064   Yoshua Bengio, Nicholas Leonard and Aaron Courville . Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.  arxiv:1308.3432   Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed . Neural Networks for Machine Learning.  Coursera, video lectures, 2012   Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko . Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.  ECCV, 2018   Raghuraman Krishnamoorthi . Quantizing deep convolutional networks for efficient inference: A whitepaper  arxiv:1806.08342   Ron Banner, Yury Nahshan, Elad Hoffer and Daniel Soudry . ACIQ: Analytical Clipping for Integer Quantization of neural networks  arxiv:1810.05723", 
             "title": "References"
-        },
+        }, 
         {
-            "location": "/knowledge_distillation/index.html",
-            "text": "Knowledge Distillation\n\n\n(For details on how to train a model with knowledge distillation in Distiller, see \nhere\n)\n\n\nKnowledge distillation is model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models). This training setting is sometimes referred to as \"teacher-student\", where the large model is the teacher and the small model is the student (we'll be using these terms interchangeably).\n\n\nThe method was first proposed by \nBucila et al., 2006\n and generalized by \nHinton et al., 2015\n. The implementation in Distiller is based on the latter publication. Here we'll provide a summary of the method. For more information the reader may refer to the paper (a \nvideo lecture\n with \nslides\n is also available).\n\n\nIn distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits. However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset. To tackle this issue, \nHinton et al., 2015\n introduced the concept of \"softmax temperature\". The probability \np_i\n of class \ni\n is calculated from the logits \nz\n as:\n\n\n\n\np_i = \\frac{exp\\left(\\frac{z_i}{T}\\right)}{\\sum_{j} \\exp\\left(\\frac{z_j}{T}\\right)}\n\n\n\n\nwhere \nT\n is the temperature parameter. When \nT=1\n we get the standard softmax function. As \nT\n grows, the probability distribution generated by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. Hinton calls this the \"dark knowledge\" embedded in the teacher model, and it is this dark knowledge that we are transferring to the student model in the distillation process. When computing the loss function vs. the teacher's soft targets, we use the same value of \nT\n to compute the softmax on the student's logits. We call this loss the \"distillation loss\".\n\n\nHinton et al., 2015\n found that it is also beneficial to train the distilled model to produce the correct labels (based on the ground truth) in addition to the teacher's soft-labels. Hence, we also calculate the \"standard\" loss between the student's predicted class probabilities and the ground-truth labels (also called \"hard labels/targets\"). We dub this loss the \"student loss\". When calculating the class probabilities for the student loss we use \nT = 1\n. \n\n\nThe overall loss function, incorporating both distillation and student losses, is calculated as:\n\n\n\n\n\\mathcal{L}(x;W) = \\alpha * \\mathcal{H}(y, \\sigma(z_s; T=1)) + \\beta * \\mathcal{H}(\\sigma(z_t; T=\\tau), \\sigma(z_s, T=\\tau))\n\n\n\n\nwhere \nx\n is the input, \nW\n are the student model parameters, \ny\n is the ground truth label, \n\\mathcal{H}\n is the cross-entropy loss function, \n\\sigma\n is the softmax function parameterized by the temperature \nT\n, and \n\\alpha\n and \n\\beta\n are coefficients. \nz_s\n and \nz_t\n are the logits of the student and teacher respectively.\n\n\n\n\nNew Hyper-Parameters\n\n\nIn general \n\\tau\n, \n\\alpha\n and \n\\beta\n are hyper parameters.\n\n\nIn their experiments, \nHinton et al., 2015\n use temperature values ranging from 1 to 20. They note that empirically, when the student model is very small compared to the teacher model, lower temperatures work better. This makes sense if we consider that as we raise the temperature, the resulting soft-labels distribution becomes richer in information, and a very small model might not be able to capture all of this information. However, there's no clear way to predict up front what kind of capacity for information the student model will have.\n\n\nWith regards to \n\\alpha\n and \n\\beta\n, \nHinton et al., 2015\n use a weighted average between the distillation loss and the student loss. That is, \n\\beta = 1 - \\alpha\n. They note that in general, they obtained the best results when setting \n\\alpha\n to be much smaller than \n\\beta\n (although in one of their experiments they use \n\\alpha = \\beta = 0.5\n).  Other works which utilize knowledge distillation don't use a weighted average. Some set \n\\alpha = 1\n while leaving \n\\beta\n tunable, while others don't set any constraints.\n\n\nCombining with Other Model Compression Techniques\n\n\nIn the \"basic\" scenario, the smaller (student) model is a pre-defined architecture which just has a smaller number of parameters compared to the teacher model. For example, we could train ResNet-18 by distilling knowledge from ResNet-34. But, a model with smaller capacity can also be obtained by other model compression techniques - sparsification and/or quantization. So, for example, we could train a 4-bit ResNet-18 model with some method using quantization-aware training, and use a distillation loss function as described above. In that case, the teacher model can even be a FP32 ResNet-18 model. Same goes for pruning and regularization.\n\n\nTann et al., 2017\n, \nMishra and Marr, 2018\n and \nPolino et al., 2018\n are some works that combine knowledge distillation with \nquantization\n. \nTheis et al., 2018\n and \nAshok et al., 2018\n combine distillation with \npruning\n.\n\n\nReferences\n\n\n\n\nCristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil\n. Model Compression. \nKDD, 2006\n\n\n\n\n\nGeoffrey Hinton, Oriol Vinyals and Jeff Dean\n. Distilling the Knowledge in a Neural Network. \narxiv:1503.02531\n\n\n\n\n\nHokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda\n. Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks. \nDAC, 2017\n\n\n\n\n\nAsit Mishra and Debbie Marr\n. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. \nICLR, 2018\n\n\n\n\n\nAntonio Polino, Razvan Pascanu and Dan Alistarh\n. Model compression via distillation and quantization. \nICLR, 2018\n\n\n\n\n\nAnubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani\n. N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning. \nICLR, 2018\n\n\n\n\n\nLucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r\n. Faster gaze prediction with dense networks and Fisher pruning. \narxiv:1801.05787",
+            "location": "/knowledge_distillation/index.html", 
+            "text": "Knowledge Distillation\n\n\n(For details on how to train a model with knowledge distillation in Distiller, see \nhere\n)\n\n\nKnowledge distillation is model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models). This training setting is sometimes referred to as \"teacher-student\", where the large model is the teacher and the small model is the student (we'll be using these terms interchangeably).\n\n\nThe method was first proposed by \nBucila et al., 2006\n and generalized by \nHinton et al., 2015\n. The implementation in Distiller is based on the latter publication. Here we'll provide a summary of the method. For more information the reader may refer to the paper (a \nvideo lecture\n with \nslides\n is also available).\n\n\nIn distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits. However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset. To tackle this issue, \nHinton et al., 2015\n introduced the concept of \"softmax temperature\". The probability \np_i\n of class \ni\n is calculated from the logits \nz\n as:\n\n\n\n\np_i = \\frac{exp\\left(\\frac{z_i}{T}\\right)}{\\sum_{j} \\exp\\left(\\frac{z_j}{T}\\right)}\n\n\n\n\nwhere \nT\n is the temperature parameter. When \nT=1\n we get the standard softmax function. As \nT\n grows, the probability distribution generated by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. Hinton calls this the \"dark knowledge\" embedded in the teacher model, and it is this dark knowledge that we are transferring to the student model in the distillation process. When computing the loss function vs. the teacher's soft targets, we use the same value of \nT\n to compute the softmax on the student's logits. We call this loss the \"distillation loss\".\n\n\nHinton et al., 2015\n found that it is also beneficial to train the distilled model to produce the correct labels (based on the ground truth) in addition to the teacher's soft-labels. Hence, we also calculate the \"standard\" loss between the student's predicted class probabilities and the ground-truth labels (also called \"hard labels/targets\"). We dub this loss the \"student loss\". When calculating the class probabilities for the student loss we use \nT = 1\n. \n\n\nThe overall loss function, incorporating both distillation and student losses, is calculated as:\n\n\n\n\n\\mathcal{L}(x;W) = \\alpha * \\mathcal{H}(y, \\sigma(z_s; T=1)) + \\beta * \\mathcal{H}(\\sigma(z_t; T=\\tau), \\sigma(z_s, T=\\tau))\n\n\n\n\nwhere \nx\n is the input, \nW\n are the student model parameters, \ny\n is the ground truth label, \n\\mathcal{H}\n is the cross-entropy loss function, \n\\sigma\n is the softmax function parameterized by the temperature \nT\n, and \n\\alpha\n and \n\\beta\n are coefficients. \nz_s\n and \nz_t\n are the logits of the student and teacher respectively.\n\n\n\n\nNew Hyper-Parameters\n\n\nIn general \n\\tau\n, \n\\alpha\n and \n\\beta\n are hyper parameters.\n\n\nIn their experiments, \nHinton et al., 2015\n use temperature values ranging from 1 to 20. They note that empirically, when the student model is very small compared to the teacher model, lower temperatures work better. This makes sense if we consider that as we raise the temperature, the resulting soft-labels distribution becomes richer in information, and a very small model might not be able to capture all of this information. However, there's no clear way to predict up front what kind of capacity for information the student model will have.\n\n\nWith regards to \n\\alpha\n and \n\\beta\n, \nHinton et al., 2015\n use a weighted average between the distillation loss and the student loss. That is, \n\\beta = 1 - \\alpha\n. They note that in general, they obtained the best results when setting \n\\alpha\n to be much smaller than \n\\beta\n (although in one of their experiments they use \n\\alpha = \\beta = 0.5\n).  Other works which utilize knowledge distillation don't use a weighted average. Some set \n\\alpha = 1\n while leaving \n\\beta\n tunable, while others don't set any constraints.\n\n\nCombining with Other Model Compression Techniques\n\n\nIn the \"basic\" scenario, the smaller (student) model is a pre-defined architecture which just has a smaller number of parameters compared to the teacher model. For example, we could train ResNet-18 by distilling knowledge from ResNet-34. But, a model with smaller capacity can also be obtained by other model compression techniques - sparsification and/or quantization. So, for example, we could train a 4-bit ResNet-18 model with some method using quantization-aware training, and use a distillation loss function as described above. In that case, the teacher model can even be a FP32 ResNet-18 model. Same goes for pruning and regularization.\n\n\nTann et al., 2017\n, \nMishra and Marr, 2018\n and \nPolino et al., 2018\n are some works that combine knowledge distillation with \nquantization\n. \nTheis et al., 2018\n and \nAshok et al., 2018\n combine distillation with \npruning\n.\n\n\nReferences\n\n\n\n\nCristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil\n. Model Compression. \nKDD, 2006\n\n\n\n\n\nGeoffrey Hinton, Oriol Vinyals and Jeff Dean\n. Distilling the Knowledge in a Neural Network. \narxiv:1503.02531\n\n\n\n\n\nHokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda\n. Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks. \nDAC, 2017\n\n\n\n\n\nAsit Mishra and Debbie Marr\n. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. \nICLR, 2018\n\n\n\n\n\nAntonio Polino, Razvan Pascanu and Dan Alistarh\n. Model compression via distillation and quantization. \nICLR, 2018\n\n\n\n\n\nAnubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani\n. N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning. \nICLR, 2018\n\n\n\n\n\nLucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r\n. Faster gaze prediction with dense networks and Fisher pruning. \narxiv:1801.05787", 
             "title": "Knowledge Distillation"
-        },
+        }, 
         {
-            "location": "/knowledge_distillation/index.html#knowledge-distillation",
-            "text": "(For details on how to train a model with knowledge distillation in Distiller, see  here )  Knowledge distillation is model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models). This training setting is sometimes referred to as \"teacher-student\", where the large model is the teacher and the small model is the student (we'll be using these terms interchangeably).  The method was first proposed by  Bucila et al., 2006  and generalized by  Hinton et al., 2015 . The implementation in Distiller is based on the latter publication. Here we'll provide a summary of the method. For more information the reader may refer to the paper (a  video lecture  with  slides  is also available).  In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits. However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset. To tackle this issue,  Hinton et al., 2015  introduced the concept of \"softmax temperature\". The probability  p_i  of class  i  is calculated from the logits  z  as:   p_i = \\frac{exp\\left(\\frac{z_i}{T}\\right)}{\\sum_{j} \\exp\\left(\\frac{z_j}{T}\\right)}   where  T  is the temperature parameter. When  T=1  we get the standard softmax function. As  T  grows, the probability distribution generated by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. Hinton calls this the \"dark knowledge\" embedded in the teacher model, and it is this dark knowledge that we are transferring to the student model in the distillation process. When computing the loss function vs. the teacher's soft targets, we use the same value of  T  to compute the softmax on the student's logits. We call this loss the \"distillation loss\".  Hinton et al., 2015  found that it is also beneficial to train the distilled model to produce the correct labels (based on the ground truth) in addition to the teacher's soft-labels. Hence, we also calculate the \"standard\" loss between the student's predicted class probabilities and the ground-truth labels (also called \"hard labels/targets\"). We dub this loss the \"student loss\". When calculating the class probabilities for the student loss we use  T = 1 .   The overall loss function, incorporating both distillation and student losses, is calculated as:   \\mathcal{L}(x;W) = \\alpha * \\mathcal{H}(y, \\sigma(z_s; T=1)) + \\beta * \\mathcal{H}(\\sigma(z_t; T=\\tau), \\sigma(z_s, T=\\tau))   where  x  is the input,  W  are the student model parameters,  y  is the ground truth label,  \\mathcal{H}  is the cross-entropy loss function,  \\sigma  is the softmax function parameterized by the temperature  T , and  \\alpha  and  \\beta  are coefficients.  z_s  and  z_t  are the logits of the student and teacher respectively.",
+            "location": "/knowledge_distillation/index.html#knowledge-distillation", 
+            "text": "(For details on how to train a model with knowledge distillation in Distiller, see  here )  Knowledge distillation is model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models). This training setting is sometimes referred to as \"teacher-student\", where the large model is the teacher and the small model is the student (we'll be using these terms interchangeably).  The method was first proposed by  Bucila et al., 2006  and generalized by  Hinton et al., 2015 . The implementation in Distiller is based on the latter publication. Here we'll provide a summary of the method. For more information the reader may refer to the paper (a  video lecture  with  slides  is also available).  In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits. However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset. To tackle this issue,  Hinton et al., 2015  introduced the concept of \"softmax temperature\". The probability  p_i  of class  i  is calculated from the logits  z  as:   p_i = \\frac{exp\\left(\\frac{z_i}{T}\\right)}{\\sum_{j} \\exp\\left(\\frac{z_j}{T}\\right)}   where  T  is the temperature parameter. When  T=1  we get the standard softmax function. As  T  grows, the probability distribution generated by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. Hinton calls this the \"dark knowledge\" embedded in the teacher model, and it is this dark knowledge that we are transferring to the student model in the distillation process. When computing the loss function vs. the teacher's soft targets, we use the same value of  T  to compute the softmax on the student's logits. We call this loss the \"distillation loss\".  Hinton et al., 2015  found that it is also beneficial to train the distilled model to produce the correct labels (based on the ground truth) in addition to the teacher's soft-labels. Hence, we also calculate the \"standard\" loss between the student's predicted class probabilities and the ground-truth labels (also called \"hard labels/targets\"). We dub this loss the \"student loss\". When calculating the class probabilities for the student loss we use  T = 1 .   The overall loss function, incorporating both distillation and student losses, is calculated as:   \\mathcal{L}(x;W) = \\alpha * \\mathcal{H}(y, \\sigma(z_s; T=1)) + \\beta * \\mathcal{H}(\\sigma(z_t; T=\\tau), \\sigma(z_s, T=\\tau))   where  x  is the input,  W  are the student model parameters,  y  is the ground truth label,  \\mathcal{H}  is the cross-entropy loss function,  \\sigma  is the softmax function parameterized by the temperature  T , and  \\alpha  and  \\beta  are coefficients.  z_s  and  z_t  are the logits of the student and teacher respectively.", 
             "title": "Knowledge Distillation"
-        },
+        }, 
         {
-            "location": "/knowledge_distillation/index.html#new-hyper-parameters",
-            "text": "In general  \\tau ,  \\alpha  and  \\beta  are hyper parameters.  In their experiments,  Hinton et al., 2015  use temperature values ranging from 1 to 20. They note that empirically, when the student model is very small compared to the teacher model, lower temperatures work better. This makes sense if we consider that as we raise the temperature, the resulting soft-labels distribution becomes richer in information, and a very small model might not be able to capture all of this information. However, there's no clear way to predict up front what kind of capacity for information the student model will have.  With regards to  \\alpha  and  \\beta ,  Hinton et al., 2015  use a weighted average between the distillation loss and the student loss. That is,  \\beta = 1 - \\alpha . They note that in general, they obtained the best results when setting  \\alpha  to be much smaller than  \\beta  (although in one of their experiments they use  \\alpha = \\beta = 0.5 ).  Other works which utilize knowledge distillation don't use a weighted average. Some set  \\alpha = 1  while leaving  \\beta  tunable, while others don't set any constraints.",
+            "location": "/knowledge_distillation/index.html#new-hyper-parameters", 
+            "text": "In general  \\tau ,  \\alpha  and  \\beta  are hyper parameters.  In their experiments,  Hinton et al., 2015  use temperature values ranging from 1 to 20. They note that empirically, when the student model is very small compared to the teacher model, lower temperatures work better. This makes sense if we consider that as we raise the temperature, the resulting soft-labels distribution becomes richer in information, and a very small model might not be able to capture all of this information. However, there's no clear way to predict up front what kind of capacity for information the student model will have.  With regards to  \\alpha  and  \\beta ,  Hinton et al., 2015  use a weighted average between the distillation loss and the student loss. That is,  \\beta = 1 - \\alpha . They note that in general, they obtained the best results when setting  \\alpha  to be much smaller than  \\beta  (although in one of their experiments they use  \\alpha = \\beta = 0.5 ).  Other works which utilize knowledge distillation don't use a weighted average. Some set  \\alpha = 1  while leaving  \\beta  tunable, while others don't set any constraints.", 
             "title": "New Hyper-Parameters"
-        },
+        }, 
         {
-            "location": "/knowledge_distillation/index.html#references",
-            "text": "Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil . Model Compression.  KDD, 2006   Geoffrey Hinton, Oriol Vinyals and Jeff Dean . Distilling the Knowledge in a Neural Network.  arxiv:1503.02531   Hokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda . Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks.  DAC, 2017   Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy.  ICLR, 2018   Antonio Polino, Razvan Pascanu and Dan Alistarh . Model compression via distillation and quantization.  ICLR, 2018   Anubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani . N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning.  ICLR, 2018   Lucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r . Faster gaze prediction with dense networks and Fisher pruning.  arxiv:1801.05787",
+            "location": "/knowledge_distillation/index.html#references", 
+            "text": "Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil . Model Compression.  KDD, 2006   Geoffrey Hinton, Oriol Vinyals and Jeff Dean . Distilling the Knowledge in a Neural Network.  arxiv:1503.02531   Hokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda . Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks.  DAC, 2017   Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy.  ICLR, 2018   Antonio Polino, Razvan Pascanu and Dan Alistarh . Model compression via distillation and quantization.  ICLR, 2018   Anubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani . N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning.  ICLR, 2018   Lucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r . Faster gaze prediction with dense networks and Fisher pruning.  arxiv:1801.05787", 
             "title": "References"
-        },
+        }, 
         {
-            "location": "/conditional_computation/index.html",
-            "text": "Conditional Computation\n\n\nConditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced.\nTo quote \nBengio et. al\n\n\n\n\n\"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\"\n\n\n\n\nAs usual, there are several approaches to implement Conditional Computation:\n\n\n\n\nSun et. al\n use several expert CNN, each trained on a different task, and combine them to one large network.\n\n\nZheng et. al\n use cascading, an idea which may be familiar to you from Viola-Jones face detection.\n\n\nTheodorakopoulos et. al\n add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module).\n\n\nIoannou et. al\n introduce Conditional Networks: that \"can be thought of as: i) decision trees augmented with data transformation\noperators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\"\n\n\nBolukbasi et. al\n \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\"\n\n\n\n\nConditional Computation is especially useful for real-time, latency-sensitive applicative.\n\nIn Distiller we currently have implemented a variant of Early Exit.\n\n\nReferences\n\n\n \nEmmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup.\n\n    \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n, arXiv:1511.06297v2, 2016.\n\n\n\n\n\nY. Sun, X.Wang, and X. Tang.\n\n    \nDeep Convolutional Network Cascade for Facial Point Detection\n. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014\n\n\n\n\n\nX. Zheng, W.Ouyang, and X.Wang.\n \nMulti-Stage Contextual Deep Learning for Pedestrian Detection.\n In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014.\n\n\n\n\n\nI. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis1.\n \nParsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules.\n Irida Labs S.A, January 2017\n\n\n\n\n\nTolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama\n \nAdaptive Neural Networks for Efficient Inference\n.  Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017.\n\n\n\n\n\nYani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi\n.\n    \nDecision Forests, Convolutional Networks and the Models in-Between\n, arXiv:1511.06297v2, 2016.",
+            "location": "/conditional_computation/index.html", 
+            "text": "Conditional Computation\n\n\nConditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced.\nTo quote \nBengio et. al\n\n\n\n\n\"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\"\n\n\n\n\nAs usual, there are several approaches to implement Conditional Computation:\n\n\n\n\nSun et. al\n use several expert CNN, each trained on a different task, and combine them to one large network.\n\n\nZheng et. al\n use cascading, an idea which may be familiar to you from Viola-Jones face detection.\n\n\nTheodorakopoulos et. al\n add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module).\n\n\nIoannou et. al\n introduce Conditional Networks: that \"can be thought of as: i) decision trees augmented with data transformation\noperators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\"\n\n\nBolukbasi et. al\n \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\"\n\n\n\n\nConditional Computation is especially useful for real-time, latency-sensitive applicative.\n\nIn Distiller we currently have implemented a variant of Early Exit.\n\n\nReferences\n\n\n \nEmmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup.\n\n    \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n, arXiv:1511.06297v2, 2016.\n\n\n\n\n\nY. Sun, X.Wang, and X. Tang.\n\n    \nDeep Convolutional Network Cascade for Facial Point Detection\n. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014\n\n\n\n\n\nX. Zheng, W.Ouyang, and X.Wang.\n \nMulti-Stage Contextual Deep Learning for Pedestrian Detection.\n In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014.\n\n\n\n\n\nI. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis1.\n \nParsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules.\n Irida Labs S.A, January 2017\n\n\n\n\n\nTolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama\n \nAdaptive Neural Networks for Efficient Inference\n.  Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017.\n\n\n\n\n\nYani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi\n.\n    \nDecision Forests, Convolutional Networks and the Models in-Between\n, arXiv:1511.06297v2, 2016.", 
             "title": "Conditional Computation"
-        },
+        }, 
         {
-            "location": "/conditional_computation/index.html#conditional-computation",
-            "text": "Conditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced.\nTo quote  Bengio et. al   \"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\"   As usual, there are several approaches to implement Conditional Computation:   Sun et. al  use several expert CNN, each trained on a different task, and combine them to one large network.  Zheng et. al  use cascading, an idea which may be familiar to you from Viola-Jones face detection.  Theodorakopoulos et. al  add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module).  Ioannou et. al  introduce Conditional Networks: that \"can be thought of as: i) decision trees augmented with data transformation\noperators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\"  Bolukbasi et. al  \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\"   Conditional Computation is especially useful for real-time, latency-sensitive applicative. \nIn Distiller we currently have implemented a variant of Early Exit.",
+            "location": "/conditional_computation/index.html#conditional-computation", 
+            "text": "Conditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced.\nTo quote  Bengio et. al   \"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\"   As usual, there are several approaches to implement Conditional Computation:   Sun et. al  use several expert CNN, each trained on a different task, and combine them to one large network.  Zheng et. al  use cascading, an idea which may be familiar to you from Viola-Jones face detection.  Theodorakopoulos et. al  add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module).  Ioannou et. al  introduce Conditional Networks: that \"can be thought of as: i) decision trees augmented with data transformation\noperators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\"  Bolukbasi et. al  \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\"   Conditional Computation is especially useful for real-time, latency-sensitive applicative. \nIn Distiller we currently have implemented a variant of Early Exit.", 
             "title": "Conditional Computation"
-        },
+        }, 
         {
-            "location": "/conditional_computation/index.html#references",
-            "text": "Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup. \n     Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1511.06297v2, 2016.   Y. Sun, X.Wang, and X. Tang. \n     Deep Convolutional Network Cascade for Facial Point Detection . In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014   X. Zheng, W.Ouyang, and X.Wang.   Multi-Stage Contextual Deep Learning for Pedestrian Detection.  In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014.   I. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis1.   Parsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules.  Irida Labs S.A, January 2017   Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama   Adaptive Neural Networks for Efficient Inference .  Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017.   Yani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi .\n     Decision Forests, Convolutional Networks and the Models in-Between , arXiv:1511.06297v2, 2016.",
+            "location": "/conditional_computation/index.html#references", 
+            "text": "Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup. \n     Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1511.06297v2, 2016.   Y. Sun, X.Wang, and X. Tang. \n     Deep Convolutional Network Cascade for Facial Point Detection . In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014   X. Zheng, W.Ouyang, and X.Wang.   Multi-Stage Contextual Deep Learning for Pedestrian Detection.  In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014.   I. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis1.   Parsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules.  Irida Labs S.A, January 2017   Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama   Adaptive Neural Networks for Efficient Inference .  Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017.   Yani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi .\n     Decision Forests, Convolutional Networks and the Models in-Between , arXiv:1511.06297v2, 2016.", 
             "title": "References"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html",
-            "text": "Weights Pruning Algorithms\n\n\n\n\nMagnitude Pruner\n\n\nThis is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor.  A different threshold can be used for each layer's weights tensor.\n\nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.\n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\nSensitivity Pruner\n\n\nFinding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values.  We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor.\n\n\nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model.  You can see that they have an approximate Gaussian distribution.\n\n\n \n\n\nThe distributions of Alexnet conv1 and fc1 layers\n\n\nWe use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors.  For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor.  Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements.  \n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\n\\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]\n\n\nHow do we choose this \\(s\\) multiplier?\n\n\nIn \nLearning both Weights and Connections for Efficient Neural Networks\n the authors write:\n\n\n\n\n\"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\n\n\n\n\nSo the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\).  Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.\n\n\nMethod of Operation\n\n\n\n\nStart by running a pruning sensitivity analysis on the model.  \n\n\nThen use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.\n\n\n\n\nSchedule\n\n\nIn their \npaper\n Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step.  Distiller's \nSensitivityPruner\n works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.\n\n\nThis actually works quite well as we can see in the diagram below.  This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.\n\n\n\nWe use a simple iterative-pruning schedule such as: \nPrune every second epoch starting at epoch 0, and ending at epoch 38.\n  This excerpt from \nalexnet.schedule_sensitivity.yaml\n shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n\n\n\nLevel Pruner\n\n\nClass \nSparsityLevelParameterPruner\n uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity).  Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.\n\nThis pruner is much more stable compared to \nSensitivityPruner\n because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's \nSensitivityPruner\n is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution.  Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far).  \n\n\nTo set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each\n\n\nMethod of Operation\n\n\n\n\nSort the weights in the specified layer by their absolute values. \n\n\nMask to zero the smallest magnitude weights until the desired sparsity level is reached.\n\n\n\n\nSplicing Pruner\n\n\nIn \nDynamic Network Surgery for Efficient DNNs\n Guo et. al propose that network pruning and splicing work in tandem.  A \nSpilicingPruner\n is a pruner that both prunes and splices connections and works best with a Dynamic Network Surgery schedule, which, for example, configures the \nPruningPolicy\n to mask weights only during the forward pass.\n\n\nAutomated Gradual Pruner (AGP)\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n, authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in \nAutomatedGradualPruner\n.\n\n\n\n\n\n\"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1)  is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"\n\n\n\n\n\n\nYou can play with the scheduling parameters in the \nagp_schedule.ipynb notebook\n.\n\n\nThe authors describe AGP:\n\n\n\n\n\n\nOur automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.\n\n\nDoesn't require much hyper-parameter tuning\n\n\nShown to perform well across different models\n\n\nDoes not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.\n\n\n\n\n\n\nRNN Pruner\n\n\nThe authors of \nExploring Sparsity in Recurrent Neural Networks\n, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\"  They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training.  They show pruning of RNN, GRU, LSTM and embedding layers.\n\n\nDistiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.\n\n\n\n\nStructure Pruners\n\n\nElement-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation.  Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.\n\n\nStructure Ranking Pruners\n\n\nRanking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below).\nIn \nPruning Filters for Efficient ConvNets\n the authors use filter ranking, with \none-shot pruning\n followed by fine-tuning.  The authors of \nExploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition\n also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:\n\n\n\n\nFirst, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)\n\n\n\n\nL1RankedStructureParameterPruner\n\n\nThe \nL1RankedStructureParameterPruner\n pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the \nm\n lowest ranking structures are pruned away.  This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude.  The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.  Basically, you can think of \nmean(abs(t))\n as a form of normalization of the structure L1-norm by the length of the structure.  \nL1RankedStructureParameterPruner\n currently prunes weight filters, channels, and rows (for linear layers).\n\n\nActivationAPoZRankedFilterPruner\n\n\nThe \nActivationAPoZRankedFilterPruner\n pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters.\nThis method is called \nNetwork Trimming\n from the research paper:\n\"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures\",\n    Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016\n    https://arxiv.org/abs/1607.03250  \n\n\nGradientRankedFilterPruner\n\n\nThe \nGradientRankedFilterPruner\n tries to asses the importance of weight filters using the product of their gradients and the filter value.  \n\n\nRandomRankedFilterPruner\n\n\nFor research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking.  The \nRandomRankedFilterPruner\n pruner can be used for this purpose.\n\n\nAutomated Gradual Pruner (AGP) for Structures\n\n\nThe idea of a mathematical formula controlling the sparsity level growth is very useful and \nStructuredAGP\n extends the implementation to structured pruning.\n\n\nPruner Compositions\n\n\nPruners can be combined to create new pruning schemes.  Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions.  For each of these, we use AGP to decided how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:\n\n\n\n\nL1RankedStructureParameterPruner_AGP\n\n\nActivationAPoZRankedFilterPruner_AGP\n\n\nGradientRankedFilterPruner_AGP\n\n\nRandomRankedFilterPruner_AGP\n\n\n\n\nHybrid Pruning\n\n\nIn a single schedule we can mix different pruning techniques.  For example, we might mix pruning and regularization.  Or structured pruning and element-wise pruning.  We can even apply different methods on the same tensor.  For example, we might want to perform filter pruning for a few epochs, then perform \nthinning\n and continue with element-wise pruning of the smaller network tensors.  This technique of mixing different methods we call Hybrid Pruning, and Distiller has a few example schedules.",
+            "location": "/algo_pruning/index.html", 
+            "text": "Weights Pruning Algorithms\n\n\n\n\nMagnitude Pruner\n\n\nThis is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor.  A different threshold can be used for each layer's weights tensor.\n\nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.\n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\nSensitivity Pruner\n\n\nFinding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values.  We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor.\n\n\nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model.  You can see that they have an approximate Gaussian distribution.\n\n\n \n\n\nThe distributions of Alexnet conv1 and fc1 layers\n\n\nWe use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors.  For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor.  Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements.  \n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\n\\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]\n\n\nHow do we choose this \\(s\\) multiplier?\n\n\nIn \nLearning both Weights and Connections for Efficient Neural Networks\n the authors write:\n\n\n\n\n\"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\n\n\n\n\nSo the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\).  Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.\n\n\nMethod of Operation\n\n\n\n\nStart by running a pruning sensitivity analysis on the model.  \n\n\nThen use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.\n\n\n\n\nSchedule\n\n\nIn their \npaper\n Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step.  Distiller's \nSensitivityPruner\n works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.\n\n\nThis actually works quite well as we can see in the diagram below.  This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.\n\n\n\nWe use a simple iterative-pruning schedule such as: \nPrune every second epoch starting at epoch 0, and ending at epoch 38.\n  This excerpt from \nalexnet.schedule_sensitivity.yaml\n shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n\n\n\nLevel Pruner\n\n\nClass \nSparsityLevelParameterPruner\n uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity).  Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.\n\nThis pruner is much more stable compared to \nSensitivityPruner\n because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's \nSensitivityPruner\n is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution.  Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far).  \n\n\nTo set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each\n\n\nMethod of Operation\n\n\n\n\nSort the weights in the specified layer by their absolute values. \n\n\nMask to zero the smallest magnitude weights until the desired sparsity level is reached.\n\n\n\n\nSplicing Pruner\n\n\nIn \nDynamic Network Surgery for Efficient DNNs\n Guo et. al propose that network pruning and splicing work in tandem.  A \nSpilicingPruner\n is a pruner that both prunes and splices connections and works best with a Dynamic Network Surgery schedule, which, for example, configures the \nPruningPolicy\n to mask weights only during the forward pass.\n\n\nAutomated Gradual Pruner (AGP)\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n, authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in \nAutomatedGradualPruner\n.\n\n\n\n\n\n\"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1)  is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"\n\n\n\n\n\n\nYou can play with the scheduling parameters in the \nagp_schedule.ipynb notebook\n.\n\n\nThe authors describe AGP:\n\n\n\n\n\n\nOur automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.\n\n\nDoesn't require much hyper-parameter tuning\n\n\nShown to perform well across different models\n\n\nDoes not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.\n\n\n\n\n\n\nRNN Pruner\n\n\nThe authors of \nExploring Sparsity in Recurrent Neural Networks\n, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\"  They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training.  They show pruning of RNN, GRU, LSTM and embedding layers.\n\n\nDistiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.\n\n\n\n\nStructure Pruners\n\n\nElement-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation.  Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.\n\n\nStructure Ranking Pruners\n\n\nRanking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below).\nIn \nPruning Filters for Efficient ConvNets\n the authors use filter ranking, with \none-shot pruning\n followed by fine-tuning.  The authors of \nExploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition\n also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:\n\n\n\n\nFirst, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)\n\n\n\n\nL1RankedStructureParameterPruner\n\n\nThe \nL1RankedStructureParameterPruner\n pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the \nm\n lowest ranking structures are pruned away.  This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude.  The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.  Basically, you can think of \nmean(abs(t))\n as a form of normalization of the structure L1-norm by the length of the structure.  \nL1RankedStructureParameterPruner\n currently prunes weight filters, channels, and rows (for linear layers).\n\n\nActivationAPoZRankedFilterPruner\n\n\nThe \nActivationAPoZRankedFilterPruner\n pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters.\nThis method is called \nNetwork Trimming\n from the research paper:\n\"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures\",\n    Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016\n    https://arxiv.org/abs/1607.03250  \n\n\nGradientRankedFilterPruner\n\n\nThe \nGradientRankedFilterPruner\n tries to asses the importance of weight filters using the product of their gradients and the filter value.  \n\n\nRandomRankedFilterPruner\n\n\nFor research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking.  The \nRandomRankedFilterPruner\n pruner can be used for this purpose.\n\n\nAutomated Gradual Pruner (AGP) for Structures\n\n\nThe idea of a mathematical formula controlling the sparsity level growth is very useful and \nStructuredAGP\n extends the implementation to structured pruning.\n\n\nPruner Compositions\n\n\nPruners can be combined to create new pruning schemes.  Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions.  For each of these, we use AGP to decided how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:\n\n\n\n\nL1RankedStructureParameterPruner_AGP\n\n\nActivationAPoZRankedFilterPruner_AGP\n\n\nGradientRankedFilterPruner_AGP\n\n\nRandomRankedFilterPruner_AGP\n\n\n\n\nHybrid Pruning\n\n\nIn a single schedule we can mix different pruning techniques.  For example, we might mix pruning and regularization.  Or structured pruning and element-wise pruning.  We can even apply different methods on the same tensor.  For example, we might want to perform filter pruning for a few epochs, then perform \nthinning\n and continue with element-wise pruning of the smaller network tensors.  This technique of mixing different methods we call Hybrid Pruning, and Distiller has a few example schedules.", 
             "title": "Pruning"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#weights-pruning-algorithms",
-            "text": "",
+            "location": "/algo_pruning/index.html#weights-pruning-algorithms", 
+            "text": "", 
             "title": "Weights Pruning Algorithms"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#magnitude-pruner",
-            "text": "This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor.  A different threshold can be used for each layer's weights tensor. \nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.  \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]",
+            "location": "/algo_pruning/index.html#magnitude-pruner", 
+            "text": "This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor.  A different threshold can be used for each layer's weights tensor. \nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.  \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]", 
             "title": "Magnitude Pruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#sensitivity-pruner",
-            "text": "Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values.  We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. \nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model.  You can see that they have an approximate Gaussian distribution.     The distributions of Alexnet conv1 and fc1 layers  We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors.  For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor.  Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements.    \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]  \\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]  How do we choose this \\(s\\) multiplier?  In  Learning both Weights and Connections for Efficient Neural Networks  the authors write:   \"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights   So the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\).  Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.",
+            "location": "/algo_pruning/index.html#sensitivity-pruner", 
+            "text": "Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values.  We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. \nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model.  You can see that they have an approximate Gaussian distribution.     The distributions of Alexnet conv1 and fc1 layers  We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors.  For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor.  Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements.    \\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]  \\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]  How do we choose this \\(s\\) multiplier?  In  Learning both Weights and Connections for Efficient Neural Networks  the authors write:   \"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights   So the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\).  Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.", 
             "title": "Sensitivity Pruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#method-of-operation",
-            "text": "Start by running a pruning sensitivity analysis on the model.    Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.",
+            "location": "/algo_pruning/index.html#method-of-operation", 
+            "text": "Start by running a pruning sensitivity analysis on the model.    Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.", 
             "title": "Method of Operation"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#schedule",
-            "text": "In their  paper  Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step.  Distiller's  SensitivityPruner  works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.  This actually works quite well as we can see in the diagram below.  This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.  We use a simple iterative-pruning schedule such as:  Prune every second epoch starting at epoch 0, and ending at epoch 38.   This excerpt from  alexnet.schedule_sensitivity.yaml  shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:  pruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2",
+            "location": "/algo_pruning/index.html#schedule", 
+            "text": "In their  paper  Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step.  Distiller's  SensitivityPruner  works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.  This actually works quite well as we can see in the diagram below.  This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.  We use a simple iterative-pruning schedule such as:  Prune every second epoch starting at epoch 0, and ending at epoch 38.   This excerpt from  alexnet.schedule_sensitivity.yaml  shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:  pruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2", 
             "title": "Schedule"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#level-pruner",
-            "text": "Class  SparsityLevelParameterPruner  uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity).  Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level. \nThis pruner is much more stable compared to  SensitivityPruner  because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's  SensitivityPruner  is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution.  Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far).    To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each",
+            "location": "/algo_pruning/index.html#level-pruner", 
+            "text": "Class  SparsityLevelParameterPruner  uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity).  Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level. \nThis pruner is much more stable compared to  SensitivityPruner  because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's  SensitivityPruner  is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution.  Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far).    To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each", 
             "title": "Level Pruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#method-of-operation_1",
-            "text": "Sort the weights in the specified layer by their absolute values.   Mask to zero the smallest magnitude weights until the desired sparsity level is reached.",
+            "location": "/algo_pruning/index.html#method-of-operation_1", 
+            "text": "Sort the weights in the specified layer by their absolute values.   Mask to zero the smallest magnitude weights until the desired sparsity level is reached.", 
             "title": "Method of Operation"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#splicing-pruner",
-            "text": "In  Dynamic Network Surgery for Efficient DNNs  Guo et. al propose that network pruning and splicing work in tandem.  A  SpilicingPruner  is a pruner that both prunes and splices connections and works best with a Dynamic Network Surgery schedule, which, for example, configures the  PruningPolicy  to mask weights only during the forward pass.",
+            "location": "/algo_pruning/index.html#splicing-pruner", 
+            "text": "In  Dynamic Network Surgery for Efficient DNNs  Guo et. al propose that network pruning and splicing work in tandem.  A  SpilicingPruner  is a pruner that both prunes and splices connections and works best with a Dynamic Network Surgery schedule, which, for example, configures the  PruningPolicy  to mask weights only during the forward pass.", 
             "title": "Splicing Pruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#automated-gradual-pruner-agp",
-            "text": "In  To prune, or not to prune: exploring the efficacy of pruning for model compression , authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in  AutomatedGradualPruner .   \"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1)  is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"    You can play with the scheduling parameters in the  agp_schedule.ipynb notebook .  The authors describe AGP:    Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.  Doesn't require much hyper-parameter tuning  Shown to perform well across different models  Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.",
+            "location": "/algo_pruning/index.html#automated-gradual-pruner-agp", 
+            "text": "In  To prune, or not to prune: exploring the efficacy of pruning for model compression , authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in  AutomatedGradualPruner .   \"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1)  is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"    You can play with the scheduling parameters in the  agp_schedule.ipynb notebook .  The authors describe AGP:    Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.  Doesn't require much hyper-parameter tuning  Shown to perform well across different models  Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.", 
             "title": "Automated Gradual Pruner (AGP)"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#rnn-pruner",
-            "text": "The authors of  Exploring Sparsity in Recurrent Neural Networks , Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\"  They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training.  They show pruning of RNN, GRU, LSTM and embedding layers.  Distiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.",
+            "location": "/algo_pruning/index.html#rnn-pruner", 
+            "text": "The authors of  Exploring Sparsity in Recurrent Neural Networks , Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\"  They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training.  They show pruning of RNN, GRU, LSTM and embedding layers.  Distiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.", 
             "title": "RNN Pruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#structure-pruners",
-            "text": "Element-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation.  Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.",
+            "location": "/algo_pruning/index.html#structure-pruners", 
+            "text": "Element-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation.  Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.", 
             "title": "Structure Pruners"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#structure-ranking-pruners",
-            "text": "Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below).\nIn  Pruning Filters for Efficient ConvNets  the authors use filter ranking, with  one-shot pruning  followed by fine-tuning.  The authors of  Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition  also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:   First, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)",
+            "location": "/algo_pruning/index.html#structure-ranking-pruners", 
+            "text": "Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below).\nIn  Pruning Filters for Efficient ConvNets  the authors use filter ranking, with  one-shot pruning  followed by fine-tuning.  The authors of  Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition  also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:   First, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)", 
             "title": "Structure Ranking Pruners"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#l1rankedstructureparameterpruner",
-            "text": "The  L1RankedStructureParameterPruner  pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the  m  lowest ranking structures are pruned away.  This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude.  The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.  Basically, you can think of  mean(abs(t))  as a form of normalization of the structure L1-norm by the length of the structure.   L1RankedStructureParameterPruner  currently prunes weight filters, channels, and rows (for linear layers).",
+            "location": "/algo_pruning/index.html#l1rankedstructureparameterpruner", 
+            "text": "The  L1RankedStructureParameterPruner  pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the  m  lowest ranking structures are pruned away.  This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude.  The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.  Basically, you can think of  mean(abs(t))  as a form of normalization of the structure L1-norm by the length of the structure.   L1RankedStructureParameterPruner  currently prunes weight filters, channels, and rows (for linear layers).", 
             "title": "L1RankedStructureParameterPruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#activationapozrankedfilterpruner",
-            "text": "The  ActivationAPoZRankedFilterPruner  pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters.\nThis method is called  Network Trimming  from the research paper:\n\"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures\",\n    Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016\n    https://arxiv.org/abs/1607.03250",
+            "location": "/algo_pruning/index.html#activationapozrankedfilterpruner", 
+            "text": "The  ActivationAPoZRankedFilterPruner  pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters.\nThis method is called  Network Trimming  from the research paper:\n\"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures\",\n    Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016\n    https://arxiv.org/abs/1607.03250", 
             "title": "ActivationAPoZRankedFilterPruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#gradientrankedfilterpruner",
-            "text": "The  GradientRankedFilterPruner  tries to asses the importance of weight filters using the product of their gradients and the filter value.",
+            "location": "/algo_pruning/index.html#gradientrankedfilterpruner", 
+            "text": "The  GradientRankedFilterPruner  tries to asses the importance of weight filters using the product of their gradients and the filter value.", 
             "title": "GradientRankedFilterPruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#randomrankedfilterpruner",
-            "text": "For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking.  The  RandomRankedFilterPruner  pruner can be used for this purpose.",
+            "location": "/algo_pruning/index.html#randomrankedfilterpruner", 
+            "text": "For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking.  The  RandomRankedFilterPruner  pruner can be used for this purpose.", 
             "title": "RandomRankedFilterPruner"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#automated-gradual-pruner-agp-for-structures",
-            "text": "The idea of a mathematical formula controlling the sparsity level growth is very useful and  StructuredAGP  extends the implementation to structured pruning.",
+            "location": "/algo_pruning/index.html#automated-gradual-pruner-agp-for-structures", 
+            "text": "The idea of a mathematical formula controlling the sparsity level growth is very useful and  StructuredAGP  extends the implementation to structured pruning.", 
             "title": "Automated Gradual Pruner (AGP) for Structures"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#pruner-compositions",
-            "text": "Pruners can be combined to create new pruning schemes.  Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions.  For each of these, we use AGP to decided how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:   L1RankedStructureParameterPruner_AGP  ActivationAPoZRankedFilterPruner_AGP  GradientRankedFilterPruner_AGP  RandomRankedFilterPruner_AGP",
+            "location": "/algo_pruning/index.html#pruner-compositions", 
+            "text": "Pruners can be combined to create new pruning schemes.  Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions.  For each of these, we use AGP to decided how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:   L1RankedStructureParameterPruner_AGP  ActivationAPoZRankedFilterPruner_AGP  GradientRankedFilterPruner_AGP  RandomRankedFilterPruner_AGP", 
             "title": "Pruner Compositions"
-        },
+        }, 
         {
-            "location": "/algo_pruning/index.html#hybrid-pruning",
-            "text": "In a single schedule we can mix different pruning techniques.  For example, we might mix pruning and regularization.  Or structured pruning and element-wise pruning.  We can even apply different methods on the same tensor.  For example, we might want to perform filter pruning for a few epochs, then perform  thinning  and continue with element-wise pruning of the smaller network tensors.  This technique of mixing different methods we call Hybrid Pruning, and Distiller has a few example schedules.",
+            "location": "/algo_pruning/index.html#hybrid-pruning", 
+            "text": "In a single schedule we can mix different pruning techniques.  For example, we might mix pruning and regularization.  Or structured pruning and element-wise pruning.  We can even apply different methods on the same tensor.  For example, we might want to perform filter pruning for a few epochs, then perform  thinning  and continue with element-wise pruning of the smaller network tensors.  This technique of mixing different methods we call Hybrid Pruning, and Distiller has a few example schedules.", 
             "title": "Hybrid Pruning"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html",
-            "text": "Quantization Algorithms\n\n\nNote:\n\nFor any of the methods below that require quantization-aware training, please see \nhere\n for details on how to invoke it using Distiller's scheduling mechanism.\n\n\nRange-Based Linear Quantization\n\n\nLet's break down the terminology we use here:\n\n\n\n\nLinear:\n Means a float value is quantized by multiplying with a numeric constant (the \nscale factor\n).\n\n\nRange-Based:\n: Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call \nclipping-based\n, as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).\n\n\n\n\nAsymmetric vs. Symmetric\n\n\nIn this method we can use two modes - \nasymmetric\n and \nsymmetric\n.\n\n\nAsymmetric Mode\n\n\n\n    \n\n\n\n\n\nIn \nasymmetric\n mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a \nzero-point\n (also called \nquantization bias\n, or \noffset\n) in addition to the scale factor.\n\n\nLet us denote the original floating-point tensor by \nx_f\n, the quantized tensor by \nx_q\n, the scale factor by \nq_x\n, the zero-point by \nzp_x\n and the number of bits used for quantization by \nn\n. Then, we get:\n\n\n\n\nx_q = round\\left ((x_f - min_{x_f})\\underbrace{\\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \\right) = round(q_x x_f - \\underbrace{min_{x_f}q_x)}_{zp_x} = round(q_x x_f - zp_x)\n\n\n\n\nIn practice, we actually use \nzp_x = round(min_{x_f}q_x)\n. This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively \"nudge\" the min/max values in the float range a little bit, in order to gain this exact quantization of zero.\n\n\nNote that in the derivation above we use unsigned integer to represent the quantized range. That is, \nx_q \\in [0, 2^n-1]\n. One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting \n2^{n-1}\n.\n\n\nLet's see how a \nconvolution\n or \nfully-connected (FC)\n layer is quantized in asymmetric mode: (we denote input, output, weights and bias with  \nx, y, w\n and \nb\n respectively)\n\n\n\n\ny_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q + zp_x}{q_x} \\frac{w_q + zp_w}{q_w}} + \\frac{b_q + zp_b}{q_b} =\n\n\n = \\frac{1}{q_x q_w} \\left( \\sum { (x_q + zp_x) (w_q + zp_w) + \\frac{q_x q_w}{q_b}(b_q + zp_b) } \\right)\n\n\n\n\nTherefore:\n\n\n\n\ny_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { (x_q+zp_x) (w_q+zp_w) + \\frac{q_x q_w}{q_b}(b_q+zp_b) } \\right) \\right) \n\n\n\n\nNotes:\n\n\n\n\nWe can see that the bias has to be re-scaled to match the scale of the summation.\n\n\nIn a proper integer-only HW pipeline, we would like our main accumulation term to simply be \n\\sum{x_q w_q}\n. In order to achieve this, one needs to further develop the expression we derived above. For further details please refer to the \ngemmlowp documentation\n\n\n\n\nSymmetric Mode\n\n\n\n    \n\n\n\n\n\nIn \nsymmetric\n mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.\n\n\nUsing the same notations as above, we get:\n\n\n\n\nx_q = round\\left (x_f \\underbrace{\\frac{2^{n-1} - 1}{\\max|x_f|}}_{q_x} \\right) = round(q_x x_f)\n\n\n\n\nAgain, let's see how a \nconvolution\n or \nfully-connected (FC)\n layer is quantized, this time in symmetric mode:\n\n\n\n\ny_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right)\n\n\n\n\nTherefore:\n\n\n\n\ny_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) \\right) \n\n\n\n\nComparing the Two Modes\n\n\nThe main trade-off between these two modes is simplicity vs. utilization of the quantized range.\n\n\n\n\nWhen using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Using symmetric mode, if the float range is biased towards one side, could result in a quantized range where significant dynamic range is dedicated to values that we'll never see. The most extreme example of this is after ReLU, where the entire tensor is positive. Quantizing it in symmetric mode means we're effectively losing 1 bit.\n\n\nOn the other hand, if we look at the derviations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation.\n\n\n\n\nOther Features\n\n\n\n\nRemoving Outliers:\n As discussed \nhere\n, in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability ro represent the values we actually care about accurately.\n   \n\n       \n\n   \n\n  Currently, Distiller supports clipping of activations with averaging during post-training quantization. That is - for each batch, instead of calculating global min/max values, an average of the min/max values of each sample in the batch.\n\n\nScale factor scope:\n For weight tensors, Distiller supports per-channel quantization (per output channel).\n\n\n\n\nImplementation in Distiller\n\n\nPost-Training\n\n\nFor post-training quantization, currently \nconvolution\n and \nFC\n are supported using this method.  \n\n\n\n\nThey are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the \nRangeLinearQuantParamLayerWrapper\n class.  \n\n\nAll other layers are unaffected and are executed using their original FP32 implementation.  \n\n\nTo automatically transform an existing model to a quantized model using this method, use the \nPostTrainLinearQuantizer\n class. For an example of how to do this, see the \ncompress_classifier.py\n. This sample also exposes command line arguments to invoke post-training quantization. For details see \nhere\n.\n\n\nFor weights and bias the scale factor and zero-point are determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). The calculated quantization parameters are store as buffers within the module, so they are automatically serialized when the model checkpoint is saved.\n\n\nAs this is post-training, using it with number of bits < 8 is likely to lead to severe accuracy degradation for any non-trivial workload.\n\n\n\n\nQuantization-Aware Training\n\n\nTo apply range-based linear quantization in training, use the \nQuantAwareTrainRangeLinearQuantizer\n class. As it is now, it will apply weights quantization to convolution and FC modules. For activations quantization, it will insert instances \nFakeLinearQuantization\n module after ReLUs. This module follows the methodology described in \nBenoit et al., 2018\n and uses exponential moving averages to track activation ranges.\n\n\nSimilarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created.\n\n\nNote that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required.\n\n\nDoReFa\n\n\n(As proposed in \nDoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients\n)  \n\n\nIn this method, we first define the quantization function \nquantize_k\n, which takes a real value \na_f \\in [0, 1]\n and outputs a discrete-valued \na_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\}\n, where \nk\n is the number of bits used for quantization.\n\n\n\n\na_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right)\n\n\n\n\nActivations are clipped to the \n[0, 1]\n range and then quantized as follows:\n\n\n\n\nx_q = quantize_k(x_f)\n\n\n\n\nFor weights, we define the following function \nf\n, which takes an unbounded real valued input and outputs a real value in \n[0, 1]\n:\n\n\n\n\nf(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2} \n\n\n\n\nNow we can use \nquantize_k\n to get quantized weight values, as follows:\n\n\n\n\nw_q = 2 quantize_k \\left( f(w_f) \\right) - 1\n\n\n\n\nThis method requires training the model with quantization-aware training, as discussed \nhere\n. Use the \nDorefaQuantizer\n class to transform an existing model to a model suitable for training with quantization using DoReFa.\n\n\nNotes:\n\n\n\n\nGradients quantization as proposed in the paper is not supported yet.\n\n\nThe paper defines special handling for binary weights which isn't supported in Distiller yet.\n\n\n\n\nPACT\n\n\n(As proposed in \nPACT: Parameterized Clipping Activation for Quantized Neural Networks\n)\n\n\nThis method is similar to DoReFa, but the upper clipping values, \n\\alpha\n, of the activation functions are learned parameters instead of hard coded to 1. Note that per the paper's recommendation, \n\\alpha\n is shared per layer.\n\n\nThis method requires training the model with quantization-aware training, as discussed \nhere\n. Use the \nPACTQuantizer\n class to transform an existing model to a model suitable for training with quantization using PACT.\n\n\nWRPN\n\n\n(As proposed in \nWRPN: Wide Reduced-Precision Networks\n)  \n\n\nIn this method, activations are clipped to \n[0, 1]\n and quantized as follows (\nk\n is the number of bits used for quantization):\n\n\n\n\nx_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right)\n\n\n\n\nWeights are clipped to \n[-1, 1]\n and quantized as follows:\n\n\n\n\nw_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right)\n\n\n\n\nNote that \nk-1\n bits are used to quantize weights, leaving one bit for sign.\n\n\nThis method requires training the model with quantization-aware training, as discussed \nhere\n. Use the \nWRPNQuantizer\n class to transform an existing model to a model suitable for training with quantization using WRPN.\n\n\nNotes:\n\n\n\n\nThe paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of \nWRPNQuantizer\n at the moment. To experiment with this, modify your model implementation to have wider layers.\n\n\nThe paper defines special handling for binary weights which isn't supported in Distiller yet.",
+            "location": "/algo_quantization/index.html", 
+            "text": "Quantization Algorithms\n\n\nNote:\n\nFor any of the methods below that require quantization-aware training, please see \nhere\n for details on how to invoke it using Distiller's scheduling mechanism.\n\n\nRange-Based Linear Quantization\n\n\nLet's break down the terminology we use here:\n\n\n\n\nLinear:\n Means a float value is quantized by multiplying with a numeric constant (the \nscale factor\n).\n\n\nRange-Based:\n Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call \nclipping-based\n, as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).\n\n\n\n\nAsymmetric vs. Symmetric\n\n\nIn this method we can use two modes - \nasymmetric\n and \nsymmetric\n.\n\n\nAsymmetric Mode\n\n\n\n    \n\n\n\n\n\nIn \nasymmetric\n mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a \nzero-point\n (also called \nquantization bias\n, or \noffset\n) in addition to the scale factor.\n\n\nLet us denote the original floating-point tensor by \nx_f\n, the quantized tensor by \nx_q\n, the scale factor by \nq_x\n, the zero-point by \nzp_x\n and the number of bits used for quantization by \nn\n. Then, we get:\n\n\n\n\nx_q = round\\left ((x_f - min_{x_f})\\underbrace{\\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \\right) = round(q_x x_f - \\underbrace{min_{x_f}q_x)}_{zp_x} = round(q_x x_f - zp_x)\n\n\n\n\nIn practice, we actually use \nzp_x = round(min_{x_f}q_x)\n. This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively \"nudge\" the min/max values in the float range a little bit, in order to gain this exact quantization of zero.\n\n\nNote that in the derivation above we use unsigned integer to represent the quantized range. That is, \nx_q \\in [0, 2^n-1]\n. One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting \n2^{n-1}\n.\n\n\nLet's see how a \nconvolution\n or \nfully-connected (FC)\n layer is quantized in asymmetric mode: (we denote input, output, weights and bias with  \nx, y, w\n and \nb\n respectively)\n\n\n\n\ny_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q + zp_x}{q_x} \\frac{w_q + zp_w}{q_w}} + \\frac{b_q + zp_b}{q_b} =\n\n\n = \\frac{1}{q_x q_w} \\left( \\sum { (x_q + zp_x) (w_q + zp_w) + \\frac{q_x q_w}{q_b}(b_q + zp_b) } \\right)\n\n\n\n\nTherefore:\n\n\n\n\ny_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { (x_q+zp_x) (w_q+zp_w) + \\frac{q_x q_w}{q_b}(b_q+zp_b) } \\right) \\right) \n\n\n\n\nNotes:\n\n\n\n\nWe can see that the bias has to be re-scaled to match the scale of the summation.\n\n\nIn a proper integer-only HW pipeline, we would like our main accumulation term to simply be \n\\sum{x_q w_q}\n. In order to achieve this, one needs to further develop the expression we derived above. For further details please refer to the \ngemmlowp documentation\n\n\n\n\nSymmetric Mode\n\n\n\n    \n\n\n\n\n\nIn \nsymmetric\n mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.\n\n\nUsing the same notations as above, we get:\n\n\n\n\nx_q = round\\left (x_f \\underbrace{\\frac{2^{n-1} - 1}{\\max|x_f|}}_{q_x} \\right) = round(q_x x_f)\n\n\n\n\nAgain, let's see how a \nconvolution\n or \nfully-connected (FC)\n layer is quantized, this time in symmetric mode:\n\n\n\n\ny_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right)\n\n\n\n\nTherefore:\n\n\n\n\ny_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) \\right) \n\n\n\n\nComparing the Two Modes\n\n\nThe main trade-off between these two modes is simplicity vs. utilization of the quantized range.\n\n\n\n\nWhen using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Using symmetric mode, if the float range is biased towards one side, could result in a quantized range where significant dynamic range is dedicated to values that we'll never see. The most extreme example of this is after ReLU, where the entire tensor is positive. Quantizing it in symmetric mode means we're effectively losing 1 bit.\n\n\nOn the other hand, if we look at the derviations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation.\n\n\n\n\nOther Features\n\n\n\n\nRemoving Outliers:\n As discussed \nhere\n, in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability ro represent the values we actually care about accurately.\n   \n\n       \n\n   \n\n  Currently, Distiller supports clipping of activations with averaging during post-training quantization. That is - for each batch, instead of calculating global min/max values, an average of the min/max values of each sample in the batch.\n\n\nScale factor scope:\n For weight tensors, Distiller supports per-channel quantization (per output channel).\n\n\n\n\nImplementation in Distiller\n\n\nPost-Training\n\n\nFor post-training quantization, currently \nconvolution\n and \nFC\n are supported using this method.  \n\n\n\n\nThey are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the \nRangeLinearQuantParamLayerWrapper\n class.  \n\n\nAll other layers are unaffected and are executed using their original FP32 implementation.  \n\n\nTo automatically transform an existing model to a quantized model using this method, use the \nPostTrainLinearQuantizer\n class. For an example of how to do this, see the \ncompress_classifier.py\n. This sample also exposes command line arguments to invoke post-training quantization. For details see \nhere\n.\n\n\nFor weights and bias the scale factor and zero-point are determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). The calculated quantization parameters are store as buffers within the module, so they are automatically serialized when the model checkpoint is saved.\n\n\nAs this is post-training, using it with number of bits \n 8 is likely to lead to severe accuracy degradation for any non-trivial workload.\n\n\n\n\nQuantization-Aware Training\n\n\nTo apply range-based linear quantization in training, use the \nQuantAwareTrainRangeLinearQuantizer\n class. As it is now, it will apply weights quantization to convolution and FC modules. For activations quantization, it will insert instances \nFakeLinearQuantization\n module after ReLUs. This module follows the methodology described in \nBenoit et al., 2018\n and uses exponential moving averages to track activation ranges.\n\n\nSimilarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created.\n\n\nNote that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required.\n\n\nDoReFa\n\n\n(As proposed in \nDoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients\n)  \n\n\nIn this method, we first define the quantization function \nquantize_k\n, which takes a real value \na_f \\in [0, 1]\n and outputs a discrete-valued \na_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\}\n, where \nk\n is the number of bits used for quantization.\n\n\n\n\na_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right)\n\n\n\n\nActivations are clipped to the \n[0, 1]\n range and then quantized as follows:\n\n\n\n\nx_q = quantize_k(x_f)\n\n\n\n\nFor weights, we define the following function \nf\n, which takes an unbounded real valued input and outputs a real value in \n[0, 1]\n:\n\n\n\n\nf(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2} \n\n\n\n\nNow we can use \nquantize_k\n to get quantized weight values, as follows:\n\n\n\n\nw_q = 2 quantize_k \\left( f(w_f) \\right) - 1\n\n\n\n\nThis method requires training the model with quantization-aware training, as discussed \nhere\n. Use the \nDorefaQuantizer\n class to transform an existing model to a model suitable for training with quantization using DoReFa.\n\n\nNotes:\n\n\n\n\nGradients quantization as proposed in the paper is not supported yet.\n\n\nThe paper defines special handling for binary weights which isn't supported in Distiller yet.\n\n\n\n\nPACT\n\n\n(As proposed in \nPACT: Parameterized Clipping Activation for Quantized Neural Networks\n)\n\n\nThis method is similar to DoReFa, but the upper clipping values, \n\\alpha\n, of the activation functions are learned parameters instead of hard coded to 1. Note that per the paper's recommendation, \n\\alpha\n is shared per layer.\n\n\nThis method requires training the model with quantization-aware training, as discussed \nhere\n. Use the \nPACTQuantizer\n class to transform an existing model to a model suitable for training with quantization using PACT.\n\n\nWRPN\n\n\n(As proposed in \nWRPN: Wide Reduced-Precision Networks\n)  \n\n\nIn this method, activations are clipped to \n[0, 1]\n and quantized as follows (\nk\n is the number of bits used for quantization):\n\n\n\n\nx_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right)\n\n\n\n\nWeights are clipped to \n[-1, 1]\n and quantized as follows:\n\n\n\n\nw_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right)\n\n\n\n\nNote that \nk-1\n bits are used to quantize weights, leaving one bit for sign.\n\n\nThis method requires training the model with quantization-aware training, as discussed \nhere\n. Use the \nWRPNQuantizer\n class to transform an existing model to a model suitable for training with quantization using WRPN.\n\n\nNotes:\n\n\n\n\nThe paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of \nWRPNQuantizer\n at the moment. To experiment with this, modify your model implementation to have wider layers.\n\n\nThe paper defines special handling for binary weights which isn't supported in Distiller yet.", 
             "title": "Quantization"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#quantization-algorithms",
-            "text": "Note: \nFor any of the methods below that require quantization-aware training, please see  here  for details on how to invoke it using Distiller's scheduling mechanism.",
+            "location": "/algo_quantization/index.html#quantization-algorithms", 
+            "text": "Note: \nFor any of the methods below that require quantization-aware training, please see  here  for details on how to invoke it using Distiller's scheduling mechanism.", 
             "title": "Quantization Algorithms"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#range-based-linear-quantization",
-            "text": "Let's break down the terminology we use here:   Linear:  Means a float value is quantized by multiplying with a numeric constant (the  scale factor ).  Range-Based: : Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call  clipping-based , as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).",
+            "location": "/algo_quantization/index.html#range-based-linear-quantization", 
+            "text": "Let's break down the terminology we use here:   Linear:  Means a float value is quantized by multiplying with a numeric constant (the  scale factor ).  Range-Based:  Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call  clipping-based , as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).", 
             "title": "Range-Based Linear Quantization"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#asymmetric-vs-symmetric",
-            "text": "In this method we can use two modes -  asymmetric  and  symmetric .",
+            "location": "/algo_quantization/index.html#asymmetric-vs-symmetric", 
+            "text": "In this method we can use two modes -  asymmetric  and  symmetric .", 
             "title": "Asymmetric vs. Symmetric"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#asymmetric-mode",
-            "text": "In  asymmetric  mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a  zero-point  (also called  quantization bias , or  offset ) in addition to the scale factor.  Let us denote the original floating-point tensor by  x_f , the quantized tensor by  x_q , the scale factor by  q_x , the zero-point by  zp_x  and the number of bits used for quantization by  n . Then, we get:   x_q = round\\left ((x_f - min_{x_f})\\underbrace{\\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \\right) = round(q_x x_f - \\underbrace{min_{x_f}q_x)}_{zp_x} = round(q_x x_f - zp_x)   In practice, we actually use  zp_x = round(min_{x_f}q_x) . This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively \"nudge\" the min/max values in the float range a little bit, in order to gain this exact quantization of zero.  Note that in the derivation above we use unsigned integer to represent the quantized range. That is,  x_q \\in [0, 2^n-1] . One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting  2^{n-1} .  Let's see how a  convolution  or  fully-connected (FC)  layer is quantized in asymmetric mode: (we denote input, output, weights and bias with   x, y, w  and  b  respectively)   y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q + zp_x}{q_x} \\frac{w_q + zp_w}{q_w}} + \\frac{b_q + zp_b}{q_b} =   = \\frac{1}{q_x q_w} \\left( \\sum { (x_q + zp_x) (w_q + zp_w) + \\frac{q_x q_w}{q_b}(b_q + zp_b) } \\right)   Therefore:   y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { (x_q+zp_x) (w_q+zp_w) + \\frac{q_x q_w}{q_b}(b_q+zp_b) } \\right) \\right)    Notes:   We can see that the bias has to be re-scaled to match the scale of the summation.  In a proper integer-only HW pipeline, we would like our main accumulation term to simply be  \\sum{x_q w_q} . In order to achieve this, one needs to further develop the expression we derived above. For further details please refer to the  gemmlowp documentation",
+            "location": "/algo_quantization/index.html#asymmetric-mode", 
+            "text": "In  asymmetric  mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a  zero-point  (also called  quantization bias , or  offset ) in addition to the scale factor.  Let us denote the original floating-point tensor by  x_f , the quantized tensor by  x_q , the scale factor by  q_x , the zero-point by  zp_x  and the number of bits used for quantization by  n . Then, we get:   x_q = round\\left ((x_f - min_{x_f})\\underbrace{\\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \\right) = round(q_x x_f - \\underbrace{min_{x_f}q_x)}_{zp_x} = round(q_x x_f - zp_x)   In practice, we actually use  zp_x = round(min_{x_f}q_x) . This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively \"nudge\" the min/max values in the float range a little bit, in order to gain this exact quantization of zero.  Note that in the derivation above we use unsigned integer to represent the quantized range. That is,  x_q \\in [0, 2^n-1] . One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting  2^{n-1} .  Let's see how a  convolution  or  fully-connected (FC)  layer is quantized in asymmetric mode: (we denote input, output, weights and bias with   x, y, w  and  b  respectively)   y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q + zp_x}{q_x} \\frac{w_q + zp_w}{q_w}} + \\frac{b_q + zp_b}{q_b} =   = \\frac{1}{q_x q_w} \\left( \\sum { (x_q + zp_x) (w_q + zp_w) + \\frac{q_x q_w}{q_b}(b_q + zp_b) } \\right)   Therefore:   y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { (x_q+zp_x) (w_q+zp_w) + \\frac{q_x q_w}{q_b}(b_q+zp_b) } \\right) \\right)    Notes:   We can see that the bias has to be re-scaled to match the scale of the summation.  In a proper integer-only HW pipeline, we would like our main accumulation term to simply be  \\sum{x_q w_q} . In order to achieve this, one needs to further develop the expression we derived above. For further details please refer to the  gemmlowp documentation", 
             "title": "Asymmetric Mode"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#symmetric-mode",
-            "text": "In  symmetric  mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.  Using the same notations as above, we get:   x_q = round\\left (x_f \\underbrace{\\frac{2^{n-1} - 1}{\\max|x_f|}}_{q_x} \\right) = round(q_x x_f)   Again, let's see how a  convolution  or  fully-connected (FC)  layer is quantized, this time in symmetric mode:   y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right)   Therefore:   y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) \\right)",
+            "location": "/algo_quantization/index.html#symmetric-mode", 
+            "text": "In  symmetric  mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.  Using the same notations as above, we get:   x_q = round\\left (x_f \\underbrace{\\frac{2^{n-1} - 1}{\\max|x_f|}}_{q_x} \\right) = round(q_x x_f)   Again, let's see how a  convolution  or  fully-connected (FC)  layer is quantized, this time in symmetric mode:   y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right)   Therefore:   y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) \\right)", 
             "title": "Symmetric Mode"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#comparing-the-two-modes",
-            "text": "The main trade-off between these two modes is simplicity vs. utilization of the quantized range.   When using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Using symmetric mode, if the float range is biased towards one side, could result in a quantized range where significant dynamic range is dedicated to values that we'll never see. The most extreme example of this is after ReLU, where the entire tensor is positive. Quantizing it in symmetric mode means we're effectively losing 1 bit.  On the other hand, if we look at the derviations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation.",
+            "location": "/algo_quantization/index.html#comparing-the-two-modes", 
+            "text": "The main trade-off between these two modes is simplicity vs. utilization of the quantized range.   When using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Using symmetric mode, if the float range is biased towards one side, could result in a quantized range where significant dynamic range is dedicated to values that we'll never see. The most extreme example of this is after ReLU, where the entire tensor is positive. Quantizing it in symmetric mode means we're effectively losing 1 bit.  On the other hand, if we look at the derviations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation.", 
             "title": "Comparing the Two Modes"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#other-features",
-            "text": "Removing Outliers:  As discussed  here , in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability ro represent the values we actually care about accurately.\n    \n        \n    \n  Currently, Distiller supports clipping of activations with averaging during post-training quantization. That is - for each batch, instead of calculating global min/max values, an average of the min/max values of each sample in the batch.  Scale factor scope:  For weight tensors, Distiller supports per-channel quantization (per output channel).",
+            "location": "/algo_quantization/index.html#other-features", 
+            "text": "Removing Outliers:  As discussed  here , in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability ro represent the values we actually care about accurately.\n    \n        \n    \n  Currently, Distiller supports clipping of activations with averaging during post-training quantization. That is - for each batch, instead of calculating global min/max values, an average of the min/max values of each sample in the batch.  Scale factor scope:  For weight tensors, Distiller supports per-channel quantization (per output channel).", 
             "title": "Other Features"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#implementation-in-distiller",
-            "text": "",
+            "location": "/algo_quantization/index.html#implementation-in-distiller", 
+            "text": "", 
             "title": "Implementation in Distiller"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#post-training",
-            "text": "For post-training quantization, currently  convolution  and  FC  are supported using this method.     They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the  RangeLinearQuantParamLayerWrapper  class.    All other layers are unaffected and are executed using their original FP32 implementation.    To automatically transform an existing model to a quantized model using this method, use the  PostTrainLinearQuantizer  class. For an example of how to do this, see the  compress_classifier.py . This sample also exposes command line arguments to invoke post-training quantization. For details see  here .  For weights and bias the scale factor and zero-point are determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). The calculated quantization parameters are store as buffers within the module, so they are automatically serialized when the model checkpoint is saved.  As this is post-training, using it with number of bits < 8 is likely to lead to severe accuracy degradation for any non-trivial workload.",
+            "location": "/algo_quantization/index.html#post-training", 
+            "text": "For post-training quantization, currently  convolution  and  FC  are supported using this method.     They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the  RangeLinearQuantParamLayerWrapper  class.    All other layers are unaffected and are executed using their original FP32 implementation.    To automatically transform an existing model to a quantized model using this method, use the  PostTrainLinearQuantizer  class. For an example of how to do this, see the  compress_classifier.py . This sample also exposes command line arguments to invoke post-training quantization. For details see  here .  For weights and bias the scale factor and zero-point are determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\"). The calculated quantization parameters are store as buffers within the module, so they are automatically serialized when the model checkpoint is saved.  As this is post-training, using it with number of bits   8 is likely to lead to severe accuracy degradation for any non-trivial workload.", 
             "title": "Post-Training"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#quantization-aware-training",
-            "text": "To apply range-based linear quantization in training, use the  QuantAwareTrainRangeLinearQuantizer  class. As it is now, it will apply weights quantization to convolution and FC modules. For activations quantization, it will insert instances  FakeLinearQuantization  module after ReLUs. This module follows the methodology described in  Benoit et al., 2018  and uses exponential moving averages to track activation ranges.  Similarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created.  Note that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required.",
+            "location": "/algo_quantization/index.html#quantization-aware-training", 
+            "text": "To apply range-based linear quantization in training, use the  QuantAwareTrainRangeLinearQuantizer  class. As it is now, it will apply weights quantization to convolution and FC modules. For activations quantization, it will insert instances  FakeLinearQuantization  module after ReLUs. This module follows the methodology described in  Benoit et al., 2018  and uses exponential moving averages to track activation ranges.  Similarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created.  Note that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required.", 
             "title": "Quantization-Aware Training"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#dorefa",
-            "text": "(As proposed in  DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients )    In this method, we first define the quantization function  quantize_k , which takes a real value  a_f \\in [0, 1]  and outputs a discrete-valued  a_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\} , where  k  is the number of bits used for quantization.   a_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right)   Activations are clipped to the  [0, 1]  range and then quantized as follows:   x_q = quantize_k(x_f)   For weights, we define the following function  f , which takes an unbounded real valued input and outputs a real value in  [0, 1] :   f(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2}    Now we can use  quantize_k  to get quantized weight values, as follows:   w_q = 2 quantize_k \\left( f(w_f) \\right) - 1   This method requires training the model with quantization-aware training, as discussed  here . Use the  DorefaQuantizer  class to transform an existing model to a model suitable for training with quantization using DoReFa.",
+            "location": "/algo_quantization/index.html#dorefa", 
+            "text": "(As proposed in  DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients )    In this method, we first define the quantization function  quantize_k , which takes a real value  a_f \\in [0, 1]  and outputs a discrete-valued  a_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\} , where  k  is the number of bits used for quantization.   a_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right)   Activations are clipped to the  [0, 1]  range and then quantized as follows:   x_q = quantize_k(x_f)   For weights, we define the following function  f , which takes an unbounded real valued input and outputs a real value in  [0, 1] :   f(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2}    Now we can use  quantize_k  to get quantized weight values, as follows:   w_q = 2 quantize_k \\left( f(w_f) \\right) - 1   This method requires training the model with quantization-aware training, as discussed  here . Use the  DorefaQuantizer  class to transform an existing model to a model suitable for training with quantization using DoReFa.", 
             "title": "DoReFa"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#notes",
-            "text": "Gradients quantization as proposed in the paper is not supported yet.  The paper defines special handling for binary weights which isn't supported in Distiller yet.",
+            "location": "/algo_quantization/index.html#notes", 
+            "text": "Gradients quantization as proposed in the paper is not supported yet.  The paper defines special handling for binary weights which isn't supported in Distiller yet.", 
             "title": "Notes:"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#pact",
-            "text": "(As proposed in  PACT: Parameterized Clipping Activation for Quantized Neural Networks )  This method is similar to DoReFa, but the upper clipping values,  \\alpha , of the activation functions are learned parameters instead of hard coded to 1. Note that per the paper's recommendation,  \\alpha  is shared per layer.  This method requires training the model with quantization-aware training, as discussed  here . Use the  PACTQuantizer  class to transform an existing model to a model suitable for training with quantization using PACT.",
+            "location": "/algo_quantization/index.html#pact", 
+            "text": "(As proposed in  PACT: Parameterized Clipping Activation for Quantized Neural Networks )  This method is similar to DoReFa, but the upper clipping values,  \\alpha , of the activation functions are learned parameters instead of hard coded to 1. Note that per the paper's recommendation,  \\alpha  is shared per layer.  This method requires training the model with quantization-aware training, as discussed  here . Use the  PACTQuantizer  class to transform an existing model to a model suitable for training with quantization using PACT.", 
             "title": "PACT"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#wrpn",
-            "text": "(As proposed in  WRPN: Wide Reduced-Precision Networks )    In this method, activations are clipped to  [0, 1]  and quantized as follows ( k  is the number of bits used for quantization):   x_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right)   Weights are clipped to  [-1, 1]  and quantized as follows:   w_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right)   Note that  k-1  bits are used to quantize weights, leaving one bit for sign.  This method requires training the model with quantization-aware training, as discussed  here . Use the  WRPNQuantizer  class to transform an existing model to a model suitable for training with quantization using WRPN.",
+            "location": "/algo_quantization/index.html#wrpn", 
+            "text": "(As proposed in  WRPN: Wide Reduced-Precision Networks )    In this method, activations are clipped to  [0, 1]  and quantized as follows ( k  is the number of bits used for quantization):   x_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right)   Weights are clipped to  [-1, 1]  and quantized as follows:   w_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right)   Note that  k-1  bits are used to quantize weights, leaving one bit for sign.  This method requires training the model with quantization-aware training, as discussed  here . Use the  WRPNQuantizer  class to transform an existing model to a model suitable for training with quantization using WRPN.", 
             "title": "WRPN"
-        },
+        }, 
         {
-            "location": "/algo_quantization/index.html#notes_1",
-            "text": "The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of  WRPNQuantizer  at the moment. To experiment with this, modify your model implementation to have wider layers.  The paper defines special handling for binary weights which isn't supported in Distiller yet.",
+            "location": "/algo_quantization/index.html#notes_1", 
+            "text": "The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of  WRPNQuantizer  at the moment. To experiment with this, modify your model implementation to have wider layers.  The paper defines special handling for binary weights which isn't supported in Distiller yet.", 
             "title": "Notes:"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html",
-            "text": "Early Exit Inference\n\n\nWhile Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in \nBranchyNet: Fast Inference via Early Exiting from Deep Neural Networks\n look at a selective approach to exit placement and criteria for exiting early.\n\n\nWhy Does Early Exit Work?\n\n\nEarly Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries.\n\n\n\n\nData points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and require the full expressiveness of the neural network to accurately classify it.\n\n\nExample code for Early Exit\n\n\nBoth CIFAR10 and ImageNet code comes directly from publically available examples from Pytorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.\n\n\nDeeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.\n\n\nNote that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.\n\n\nHeuristics\n\n\nThe insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more agressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.\n\n\nThere are other benefits to adding exits in that training the modified network now has backpropagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.\n\n\nEarly Exit Hyperparameters\n\n\nThere are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:\n\n\n\n\n\n\n--earlyexit_thresholds\n defines the\nthresholds for each of the early exits. The cross entropy measure must be \nless than\n the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.\n\n\n\n\n\n\n--earlyexit_lossweights\n provide the weights for the linear combination of losses during training to compute a signle, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.\n\n\n\n\n\n\nCIFAR10\n\n\nIn the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.\n\n\nImageNet\n\n\nThis supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic resnet code and could be used with other size resnets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.\n\n\nReferences\n\n\n \nPriyadarshini Panda, Abhronil Sengupta, Kaushik Roy\n.\n    \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n, arXiv:1509.08971v6, 2017.\n\n\n\n\n\nSurat Teerapittayanon, Bradley McDanel, H. T. Kung\n.\n    \nBranchyNet: Fast Inference via Early Exiting from Deep Neural Networks\n, arXiv:1709.01686, 2017.",
+            "location": "/algo_earlyexit/index.html", 
+            "text": "Early Exit Inference\n\n\nWhile Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in \nBranchyNet: Fast Inference via Early Exiting from Deep Neural Networks\n look at a selective approach to exit placement and criteria for exiting early.\n\n\nWhy Does Early Exit Work?\n\n\nEarly Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries.\n\n\n\n\nData points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and require the full expressiveness of the neural network to accurately classify it.\n\n\nExample code for Early Exit\n\n\nBoth CIFAR10 and ImageNet code comes directly from publicly available examples from PyTorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.\n\n\nNote:\n the sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits.\nDeeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.\n\n\nNote that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.\n\n\nExample command lines\n\n\nWe have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is:\n\n\npython compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \\\n    --lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \\\n    --out-dir /home/ -n earlyexit /home/pcifar10\n\n\n\n\nAnd an example command line for ImageNet is:\n\n\npython compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \\\n    --lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \\\n    -j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/\n\n\n\n\nHeuristics\n\n\nThe insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more aggressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.\n\n\nThere are other benefits to adding exits in that training the modified network now has back-propagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.\n\n\nEarly Exit Hyper-Parameters\n\n\nThere are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:\n\n\n\n\n\n\n--earlyexit_thresholds\n defines the thresholds for each of the early exits. The cross entropy measure must be \nless than\n the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.\n\n\n\n\n\n\n--earlyexit_lossweights\n provide the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.\n\n\n\n\n\n\nOutput Stats\n\n\nThe example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy should all of the data be forced out that exit (in order to compute the loss at that exit). During inference (i.e. validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit because the calculated entropy at that exit was lower than the specified threshold for that exit.\n\n\nCIFAR10\n\n\nIn the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.\n\n\nImageNet\n\n\nThis supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic ResNet code and could be used with other size ResNets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.\n\n\nReferences\n\n\n\n\n\nPriyadarshini Panda, Abhronil Sengupta, Kaushik Roy\n.\n    \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n, arXiv:1509.08971v6, 2017.\n\n\n\n\n\nSurat Teerapittayanon, Bradley McDanel, H. T. Kung\n.\n    \nBranchyNet: Fast Inference via Early Exiting from Deep Neural Networks\n, arXiv:1709.01686, 2017.", 
             "title": "Early Exit"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#early-exit-inference",
-            "text": "While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in  Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition  points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in  BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks  look at a selective approach to exit placement and criteria for exiting early.",
+            "location": "/algo_earlyexit/index.html#early-exit-inference", 
+            "text": "While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in  Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition  points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in  BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks  look at a selective approach to exit placement and criteria for exiting early.", 
             "title": "Early Exit Inference"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#why-does-early-exit-work",
-            "text": "Early Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries.   Data points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and require the full expressiveness of the neural network to accurately classify it.",
+            "location": "/algo_earlyexit/index.html#why-does-early-exit-work", 
+            "text": "Early Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries.   Data points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and require the full expressiveness of the neural network to accurately classify it.", 
             "title": "Why Does Early Exit Work?"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#example-code-for-early-exit",
-            "text": "Both CIFAR10 and ImageNet code comes directly from publically available examples from Pytorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.  Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.  Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.",
+            "location": "/algo_earlyexit/index.html#example-code-for-early-exit", 
+            "text": "Both CIFAR10 and ImageNet code comes directly from publicly available examples from PyTorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.  Note:  the sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits.\nDeeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.  Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.", 
             "title": "Example code for Early Exit"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#heuristics",
-            "text": "The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more agressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.  There are other benefits to adding exits in that training the modified network now has backpropagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.",
+            "location": "/algo_earlyexit/index.html#example-command-lines", 
+            "text": "We have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is:  python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \\\n    --lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \\\n    --out-dir /home/ -n earlyexit /home/pcifar10  And an example command line for ImageNet is:  python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \\\n    --lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \\\n    -j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/", 
+            "title": "Example command lines"
+        }, 
+        {
+            "location": "/algo_earlyexit/index.html#heuristics", 
+            "text": "The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more aggressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.  There are other benefits to adding exits in that training the modified network now has back-propagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.", 
             "title": "Heuristics"
-        },
+        }, 
+        {
+            "location": "/algo_earlyexit/index.html#early-exit-hyper-parameters", 
+            "text": "There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:    --earlyexit_thresholds  defines the thresholds for each of the early exits. The cross entropy measure must be  less than  the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.    --earlyexit_lossweights  provide the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.", 
+            "title": "Early Exit Hyper-Parameters"
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#early-exit-hyperparameters",
-            "text": "There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:    --earlyexit_thresholds  defines the\nthresholds for each of the early exits. The cross entropy measure must be  less than  the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.    --earlyexit_lossweights  provide the weights for the linear combination of losses during training to compute a signle, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.",
-            "title": "Early Exit Hyperparameters"
-        },
+            "location": "/algo_earlyexit/index.html#output-stats", 
+            "text": "The example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy should all of the data be forced out that exit (in order to compute the loss at that exit). During inference (i.e. validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit because the calculated entropy at that exit was lower than the specified threshold for that exit.", 
+            "title": "Output Stats"
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#cifar10",
-            "text": "In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.",
+            "location": "/algo_earlyexit/index.html#cifar10", 
+            "text": "In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.", 
             "title": "CIFAR10"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#imagenet",
-            "text": "This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic resnet code and could be used with other size resnets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.",
+            "location": "/algo_earlyexit/index.html#imagenet", 
+            "text": "This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic ResNet code and could be used with other size ResNets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.", 
             "title": "ImageNet"
-        },
+        }, 
         {
-            "location": "/algo_earlyexit/index.html#references",
-            "text": "Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy .\n     Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1509.08971v6, 2017.   Surat Teerapittayanon, Bradley McDanel, H. T. Kung .\n     BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks , arXiv:1709.01686, 2017.",
+            "location": "/algo_earlyexit/index.html#references", 
+            "text": "Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy .\n     Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1509.08971v6, 2017.   Surat Teerapittayanon, Bradley McDanel, H. T. Kung .\n     BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks , arXiv:1709.01686, 2017.", 
             "title": "References"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html",
-            "text": "Distiller Model Zoo\n\n\nHow to contribute models to the Model Zoo\n\n\nWe encourage you to contribute new models to the Model Zoo.  We welcome implementations of published papers or of your own work.  To assure that models and algorithms shared with others are high-quality, please commit your models with the following:\n\n\n\n\nCommand-line arguments\n\n\nLog files\n\n\nPyTorch model\n\n\n\n\nContents\n\n\nThe Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models.  Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers.  These are meant to serve as examples of how Distiller can be used.\n\n\nEach model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.\n\n\n\n\ntable, th, td {\n    border: 1px solid black;\n}\n\n\n\n\n  \n\n    \nPaper\n\n    \nDataset\n\n    \nNetwork\n\n    \nMethod & Granularity\n\n    \nSchedule\n\n    \nFeatures\n\n  \n\n  \n\n    \nLearning both Weights and Connections for Efficient Neural Networks\n\n    \nImageNet\n\n    \nAlexnet\n\n    \nElement-wise pruning\n\n    \nIterative; Manual\n\n    \nMagnitude thresholding based on a sensitivity quantifier.\nElement-wise sparsity sensitivity analysis\n\n  \n\n  \n\n    \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n    \nImageNet\n\n    \nMobileNet\n\n    \nElement-wise pruning\n\n    \nAutomated gradual; Iterative\n\n    \nMagnitude thresholding based on target level\n\n  \n\n  \n\n    \nLearning Structured Sparsity in Deep Neural Networks\n\n    \nCIFAR10\n\n    \nResNet20\n\n    \nGroup regularization\n\n    \n1.Train with group-lasso\n2.Remove zero groups and fine-tune\n\n    \nGroup Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols)\n\n  \n\n  \n\n    \nPruning Filters for Efficient ConvNets\n\n    \nCIFAR10\n\n    \nResNet56\n\n    \nFilter ranking; guided by sensitivity analysis\n\n    \n1.Rank filters\n2. Remove filters and channels\n3.Fine-tune\n\n    \nOne-shot ranking and pruning of filters; with network thinning\n  \n\n\n\n\nLearning both Weights and Connections for Efficient Neural Networks\n\n\nThis schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: \nEfficient Methods and Hardware for Deep Learning\n and in his paper \nLearning both Weights and Connections for Efficient Neural Networks\n.  \n\n\nThe Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\".  Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further.  In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis.  Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.\n\n\nNote that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once.  In his PhD dissertation, Song Han describes a growing threshold, at each iteration.  This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration.  Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights.  Thus, we can use less hyper-parameters and achieve the same results.\n\n\n\n\nDistiller schedule: \ndistiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\nCheckpoint file: \nalexnet.checkpoint.89.pth.tar\n\n\n\n\nResults\n\n\nOur reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09.  We prune away 88.44% of the parameters and achieve  Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.\n\n\nParameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n|  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n|  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n|  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n|  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n|  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n|  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n|  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n|  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606    Top5: 79.446    Loss: 1.893\n\n\n\n\nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n\nIn their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"\n\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps.  Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper.  The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs.  We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nImageNet files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResNet18 files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/resnet18.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResults\n\n\nAs our baseline we used a \npretrained PyTorch MobileNet model\n (width=1) which has Top1=68.848 and Top5=88.740.\n\nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy.  We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656).  We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper.  \n\n\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                     | Shape              |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | module.model.0.0.weight  | (32, 3, 3, 3)      |           864 |            864 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.14466 |  0.00103 |    0.06508 |\n|  1 | module.model.1.0.weight  | (32, 1, 3, 3)      |           288 |            288 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.32146 |  0.01020 |    0.12932 |\n|  2 | module.model.1.3.weight  | (64, 32, 1, 1)     |          2048 |           2048 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11942 |  0.00024 |    0.03627 |\n|  3 | module.model.2.0.weight  | (64, 1, 3, 3)      |           576 |            576 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.15809 |  0.00543 |    0.11513 |\n|  4 | module.model.2.3.weight  | (128, 64, 1, 1)    |          8192 |           8192 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08442 | -0.00031 |    0.04182 |\n|  5 | module.model.3.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.16780 |  0.00125 |    0.10545 |\n|  6 | module.model.3.3.weight  | (128, 128, 1, 1)   |         16384 |          16384 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07126 | -0.00197 |    0.04123 |\n|  7 | module.model.4.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.10182 |  0.00171 |    0.08719 |\n|  8 | module.model.4.3.weight  | (256, 128, 1, 1)   |         32768 |          13108 |    0.00000 |    0.00000 | 10.15625 | 59.99756 | 12.50000 |   59.99756 | 0.05543 | -0.00002 |    0.02760 |\n|  9 | module.model.5.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.12516 | -0.00288 |    0.08058 |\n| 10 | module.model.5.3.weight  | (256, 256, 1, 1)   |         65536 |          26215 |    0.00000 |    0.00000 | 12.50000 | 59.99908 | 23.82812 |   59.99908 | 0.04453 |  0.00002 |    0.02271 |\n| 11 | module.model.6.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08024 |  0.00252 |    0.06377 |\n| 12 | module.model.6.3.weight  | (512, 256, 1, 1)   |        131072 |          52429 |    0.00000 |    0.00000 | 23.82812 | 59.99985 | 14.25781 |   59.99985 | 0.03561 | -0.00057 |    0.01779 |\n| 13 | module.model.7.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11008 | -0.00018 |    0.06829 |\n| 14 | module.model.7.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 14.25781 | 59.99985 | 21.28906 |   59.99985 | 0.02944 | -0.00060 |    0.01515 |\n| 15 | module.model.8.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08258 |  0.00370 |    0.04905 |\n| 16 | module.model.8.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 21.28906 | 59.99985 | 28.51562 |   59.99985 | 0.02865 | -0.00046 |    0.01465 |\n| 17 | module.model.9.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07578 |  0.00468 |    0.04201 |\n| 18 | module.model.9.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 28.51562 | 59.99985 | 23.43750 |   59.99985 | 0.02939 | -0.00044 |    0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07091 |  0.00014 |    0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 24.60938 | 59.99985 | 20.89844 |   59.99985 | 0.03095 | -0.00059 |    0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.05729 | -0.00518 |    0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 20.89844 | 59.99985 | 17.57812 |   59.99985 | 0.03229 | -0.00044 |    0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.04981 | -0.00136 |    0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1)  |        524288 |         209716 |    0.00000 |    0.00000 | 16.01562 | 59.99985 | 44.23828 |   59.99985 | 0.02514 | -0.00106 |    0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3)    |          9216 |           9216 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.02396 | -0.00949 |    0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) |       1048576 |         419431 |    0.00000 |    0.00000 | 44.72656 | 59.99994 |  1.46484 |   59.99994 | 0.01801 | -0.00017 |    0.00931 |\n| 27 | module.fc.weight         | (1000, 1024)       |       1024000 |         409600 |    1.46484 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   60.00000 | 0.05078 |  0.00271 |    0.02734 |\n| 28 | Total sparsity:          | -                  |       4209088 |        1726917 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   58.97171 | 0.00000 |  0.00000 |    0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==> Top1: 65.337    Top5: 84.984    Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==> Top1: 68.810    Top5: 88.626    Loss: 1.282\n\n\n\n\n\nLearning Structured Sparsity in Deep Neural Networks\n\n\nThis research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"\n\n\nNote that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group.  We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength.  At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit.  Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).    \n\n\nBaseline training\n\n\nWe started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/ssl/resnet20_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ndistiller/examples/ssl/checkpoints/\n\n\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic\n\n\n\n\nRegularization\n\n\nThen we started training from scratch again, but this time we used Group Lasso regularization on entire layers:\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_4L_training.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic\n\n\n\n\nThe diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10  baseline (in red).  You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss (\nReg Loss\n), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.\n\n\n\nThis \nregularization\n yields 5 layers with zeroed weight tensors.  We load this model, remove the 5 layers, and start the fine tuning of the weights.  This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path.  When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated.  \n\n\nWe managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time.  It's not bad, but we probably could have done better.\n\n\nFine-tuning\n\n\nDuring the \nfine-tuning\n process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network.\n\nWe copy the checkpoint file of the regularized model to \ncheckpoint_trained_4D_regularized_5Lremoved.pth.tar\n.\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_finetuning.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml  -j=1 --deterministic\n\n\n\n\nResults\n\n\nOur baseline results for ResNet20 Cifar are: Top1=91.450 and  Top5=99.750\n\n\nWe used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies.\n\nThe regularized model exhibits really poor classification abilities: \n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=> loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n   best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n        training=45000\n        validation=5000\n        test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==> Top1: 22.290    Top5: 68.940    Loss: 5.172\n\n\n\n\nHowever, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670\n\n\nWe didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).\n\n\nPruning Filters for Efficient ConvNets\n\n\nQuoting the authors directly:\n\n\n\n\nWe present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.\n\n\n\n\nThe implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".\n\n\nAfter performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level.  \n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml\n\n\nCheckpoint files: \ncheckpoint_finetuned.pth.tar\n\n\n\n\nThe excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner.  This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning).\n\n\npruners:\n  filter_pruner:\n    class: 'L1RankedStructureParameterPruner'\n    reg_regims:\n      'module.layer1.0.conv1.weight': [0.6, '3D']\n      'module.layer1.1.conv1.weight': [0.6, '3D']\n      'module.layer1.2.conv1.weight': [0.6, '3D']\n      'module.layer1.3.conv1.weight': [0.6, '3D']\n\n\n\n\nIn the policy, we specify that we want to invoke this pruner once, at epoch 180.  Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.\n\n\npolicies:\n  - pruner:\n      instance_name: filter_pruner\n    epochs: [180]\n\n\n\n\n\nFollowing the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors.  When we remove filters from Convolution layer \nn\n we need to perform several changes to the network:\n1. Shrink layer \nn\n's weights tensor, leaving only the \"important\" filters.\n2. Configure layer \nn\n's \n.out_channels\n member to its new, smaller, value.\n3. If a BN layer follows layer \nn\n, then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights.\n\n\nAll of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180.  We call this process \"network thinning\".\n\n\nextensions:\n  net_thinner:\n      class: 'FilterRemover'\n      thinning_func_str: remove_filters\n      arch: 'resnet56_cifar'\n      dataset: 'cifar10'\n\n\n\n\nNetwork thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this.  On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider.\n\nOur current implementation is specific to certain layers in ResNet and is a bit fragile.  We will continue to improve and generalize this.\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ncheckpoint.resnet56_cifar_baseline.pth.tar\n\n\n\n\nResults\n\n\nWe trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.\n\n\nWe used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760",
+            "location": "/model_zoo/index.html", 
+            "text": "Distiller Model Zoo\n\n\nHow to contribute models to the Model Zoo\n\n\nWe encourage you to contribute new models to the Model Zoo.  We welcome implementations of published papers or of your own work.  To assure that models and algorithms shared with others are high-quality, please commit your models with the following:\n\n\n\n\nCommand-line arguments\n\n\nLog files\n\n\nPyTorch model\n\n\n\n\nContents\n\n\nThe Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models.  Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers.  These are meant to serve as examples of how Distiller can be used.\n\n\nEach model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.\n\n\n\n\ntable, th, td {\n    border: 1px solid black;\n}\n\n\n\n\n  \n\n    \nPaper\n\n    \nDataset\n\n    \nNetwork\n\n    \nMethod \n Granularity\n\n    \nSchedule\n\n    \nFeatures\n\n  \n\n  \n\n    \nLearning both Weights and Connections for Efficient Neural Networks\n\n    \nImageNet\n\n    \nAlexnet\n\n    \nElement-wise pruning\n\n    \nIterative; Manual\n\n    \nMagnitude thresholding based on a sensitivity quantifier.\nElement-wise sparsity sensitivity analysis\n\n  \n\n  \n\n    \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n    \nImageNet\n\n    \nMobileNet\n\n    \nElement-wise pruning\n\n    \nAutomated gradual; Iterative\n\n    \nMagnitude thresholding based on target level\n\n  \n\n  \n\n    \nLearning Structured Sparsity in Deep Neural Networks\n\n    \nCIFAR10\n\n    \nResNet20\n\n    \nGroup regularization\n\n    \n1.Train with group-lasso\n2.Remove zero groups and fine-tune\n\n    \nGroup Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols)\n\n  \n\n  \n\n    \nPruning Filters for Efficient ConvNets\n\n    \nCIFAR10\n\n    \nResNet56\n\n    \nFilter ranking; guided by sensitivity analysis\n\n    \n1.Rank filters\n2. Remove filters and channels\n3.Fine-tune\n\n    \nOne-shot ranking and pruning of filters; with network thinning\n  \n\n\n\n\nLearning both Weights and Connections for Efficient Neural Networks\n\n\nThis schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: \nEfficient Methods and Hardware for Deep Learning\n and in his paper \nLearning both Weights and Connections for Efficient Neural Networks\n.  \n\n\nThe Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\".  Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further.  In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis.  Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.\n\n\nNote that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once.  In his PhD dissertation, Song Han describes a growing threshold, at each iteration.  This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration.  Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights.  Thus, we can use less hyper-parameters and achieve the same results.\n\n\n\n\nDistiller schedule: \ndistiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\nCheckpoint file: \nalexnet.checkpoint.89.pth.tar\n\n\n\n\nResults\n\n\nOur reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09.  We prune away 88.44% of the parameters and achieve  Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.\n\n\nParameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n|  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n|  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n|  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n|  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n|  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n|  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n|  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n|  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==\n Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==\n Top1: 56.606    Top5: 79.446    Loss: 1.893\n\n\n\n\nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n\nIn their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"\n\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps.  Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper.  The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs.  We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nImageNet files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResNet18 files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/resnet18.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResults\n\n\nAs our baseline we used a \npretrained PyTorch MobileNet model\n (width=1) which has Top1=68.848 and Top5=88.740.\n\nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy.  We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656).  We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper.  \n\n\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                     | Shape              |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | module.model.0.0.weight  | (32, 3, 3, 3)      |           864 |            864 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.14466 |  0.00103 |    0.06508 |\n|  1 | module.model.1.0.weight  | (32, 1, 3, 3)      |           288 |            288 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.32146 |  0.01020 |    0.12932 |\n|  2 | module.model.1.3.weight  | (64, 32, 1, 1)     |          2048 |           2048 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11942 |  0.00024 |    0.03627 |\n|  3 | module.model.2.0.weight  | (64, 1, 3, 3)      |           576 |            576 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.15809 |  0.00543 |    0.11513 |\n|  4 | module.model.2.3.weight  | (128, 64, 1, 1)    |          8192 |           8192 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08442 | -0.00031 |    0.04182 |\n|  5 | module.model.3.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.16780 |  0.00125 |    0.10545 |\n|  6 | module.model.3.3.weight  | (128, 128, 1, 1)   |         16384 |          16384 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07126 | -0.00197 |    0.04123 |\n|  7 | module.model.4.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.10182 |  0.00171 |    0.08719 |\n|  8 | module.model.4.3.weight  | (256, 128, 1, 1)   |         32768 |          13108 |    0.00000 |    0.00000 | 10.15625 | 59.99756 | 12.50000 |   59.99756 | 0.05543 | -0.00002 |    0.02760 |\n|  9 | module.model.5.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.12516 | -0.00288 |    0.08058 |\n| 10 | module.model.5.3.weight  | (256, 256, 1, 1)   |         65536 |          26215 |    0.00000 |    0.00000 | 12.50000 | 59.99908 | 23.82812 |   59.99908 | 0.04453 |  0.00002 |    0.02271 |\n| 11 | module.model.6.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08024 |  0.00252 |    0.06377 |\n| 12 | module.model.6.3.weight  | (512, 256, 1, 1)   |        131072 |          52429 |    0.00000 |    0.00000 | 23.82812 | 59.99985 | 14.25781 |   59.99985 | 0.03561 | -0.00057 |    0.01779 |\n| 13 | module.model.7.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11008 | -0.00018 |    0.06829 |\n| 14 | module.model.7.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 14.25781 | 59.99985 | 21.28906 |   59.99985 | 0.02944 | -0.00060 |    0.01515 |\n| 15 | module.model.8.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08258 |  0.00370 |    0.04905 |\n| 16 | module.model.8.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 21.28906 | 59.99985 | 28.51562 |   59.99985 | 0.02865 | -0.00046 |    0.01465 |\n| 17 | module.model.9.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07578 |  0.00468 |    0.04201 |\n| 18 | module.model.9.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 28.51562 | 59.99985 | 23.43750 |   59.99985 | 0.02939 | -0.00044 |    0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07091 |  0.00014 |    0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 24.60938 | 59.99985 | 20.89844 |   59.99985 | 0.03095 | -0.00059 |    0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.05729 | -0.00518 |    0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 20.89844 | 59.99985 | 17.57812 |   59.99985 | 0.03229 | -0.00044 |    0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.04981 | -0.00136 |    0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1)  |        524288 |         209716 |    0.00000 |    0.00000 | 16.01562 | 59.99985 | 44.23828 |   59.99985 | 0.02514 | -0.00106 |    0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3)    |          9216 |           9216 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.02396 | -0.00949 |    0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) |       1048576 |         419431 |    0.00000 |    0.00000 | 44.72656 | 59.99994 |  1.46484 |   59.99994 | 0.01801 | -0.00017 |    0.00931 |\n| 27 | module.fc.weight         | (1000, 1024)       |       1024000 |         409600 |    1.46484 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   60.00000 | 0.05078 |  0.00271 |    0.02734 |\n| 28 | Total sparsity:          | -                  |       4209088 |        1726917 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   58.97171 | 0.00000 |  0.00000 |    0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==\n Top1: 65.337    Top5: 84.984    Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==\n Top1: 68.810    Top5: 88.626    Loss: 1.282\n\n\n\n\n\nLearning Structured Sparsity in Deep Neural Networks\n\n\nThis research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"\n\n\nNote that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group.  We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength.  At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit.  Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).    \n\n\nBaseline training\n\n\nWe started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/ssl/resnet20_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ndistiller/examples/ssl/checkpoints/\n\n\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic\n\n\n\n\nRegularization\n\n\nThen we started training from scratch again, but this time we used Group Lasso regularization on entire layers:\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_4L_training.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic\n\n\n\n\nThe diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10  baseline (in red).  You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss (\nReg Loss\n), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.\n\n\n\nThis \nregularization\n yields 5 layers with zeroed weight tensors.  We load this model, remove the 5 layers, and start the fine tuning of the weights.  This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path.  When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated.  \n\n\nWe managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time.  It's not bad, but we probably could have done better.\n\n\nFine-tuning\n\n\nDuring the \nfine-tuning\n process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network.\n\nWe copy the checkpoint file of the regularized model to \ncheckpoint_trained_4D_regularized_5Lremoved.pth.tar\n.\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_finetuning.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml  -j=1 --deterministic\n\n\n\n\nResults\n\n\nOur baseline results for ResNet20 Cifar are: Top1=91.450 and  Top5=99.750\n\n\nWe used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies.\n\nThe regularized model exhibits really poor classification abilities: \n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=\n loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n   best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n        training=45000\n        validation=5000\n        test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==\n Top1: 22.290    Top5: 68.940    Loss: 5.172\n\n\n\n\nHowever, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670\n\n\nWe didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).\n\n\nPruning Filters for Efficient ConvNets\n\n\nQuoting the authors directly:\n\n\n\n\nWe present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.\n\n\n\n\nThe implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".\n\n\nAfter performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level.  \n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml\n\n\nCheckpoint files: \ncheckpoint_finetuned.pth.tar\n\n\n\n\nThe excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner.  This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning).\n\n\npruners:\n  filter_pruner:\n    class: 'L1RankedStructureParameterPruner'\n    reg_regims:\n      'module.layer1.0.conv1.weight': [0.6, '3D']\n      'module.layer1.1.conv1.weight': [0.6, '3D']\n      'module.layer1.2.conv1.weight': [0.6, '3D']\n      'module.layer1.3.conv1.weight': [0.6, '3D']\n\n\n\n\nIn the policy, we specify that we want to invoke this pruner once, at epoch 180.  Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.\n\n\npolicies:\n  - pruner:\n      instance_name: filter_pruner\n    epochs: [180]\n\n\n\n\n\nFollowing the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors.  When we remove filters from Convolution layer \nn\n we need to perform several changes to the network:\n1. Shrink layer \nn\n's weights tensor, leaving only the \"important\" filters.\n2. Configure layer \nn\n's \n.out_channels\n member to its new, smaller, value.\n3. If a BN layer follows layer \nn\n, then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights.\n\n\nAll of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180.  We call this process \"network thinning\".\n\n\nextensions:\n  net_thinner:\n      class: 'FilterRemover'\n      thinning_func_str: remove_filters\n      arch: 'resnet56_cifar'\n      dataset: 'cifar10'\n\n\n\n\nNetwork thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this.  On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider.\n\nOur current implementation is specific to certain layers in ResNet and is a bit fragile.  We will continue to improve and generalize this.\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ncheckpoint.resnet56_cifar_baseline.pth.tar\n\n\n\n\nResults\n\n\nWe trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.\n\n\nWe used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760", 
             "title": "Model Zoo"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#distiller-model-zoo",
-            "text": "",
+            "location": "/model_zoo/index.html#distiller-model-zoo", 
+            "text": "", 
             "title": "Distiller Model Zoo"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#how-to-contribute-models-to-the-model-zoo",
-            "text": "We encourage you to contribute new models to the Model Zoo.  We welcome implementations of published papers or of your own work.  To assure that models and algorithms shared with others are high-quality, please commit your models with the following:   Command-line arguments  Log files  PyTorch model",
+            "location": "/model_zoo/index.html#how-to-contribute-models-to-the-model-zoo", 
+            "text": "We encourage you to contribute new models to the Model Zoo.  We welcome implementations of published papers or of your own work.  To assure that models and algorithms shared with others are high-quality, please commit your models with the following:   Command-line arguments  Log files  PyTorch model", 
             "title": "How to contribute models to the Model Zoo"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#contents",
-            "text": "The Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models.  Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers.  These are meant to serve as examples of how Distiller can be used.  Each model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.  \ntable, th, td {\n    border: 1px solid black;\n}  \n   \n     Paper \n     Dataset \n     Network \n     Method & Granularity \n     Schedule \n     Features \n   \n   \n     Learning both Weights and Connections for Efficient Neural Networks \n     ImageNet \n     Alexnet \n     Element-wise pruning \n     Iterative; Manual \n     Magnitude thresholding based on a sensitivity quantifier. Element-wise sparsity sensitivity analysis \n   \n   \n     To prune, or not to prune: exploring the efficacy of pruning for model compression \n     ImageNet \n     MobileNet \n     Element-wise pruning \n     Automated gradual; Iterative \n     Magnitude thresholding based on target level \n   \n   \n     Learning Structured Sparsity in Deep Neural Networks \n     CIFAR10 \n     ResNet20 \n     Group regularization \n     1.Train with group-lasso 2.Remove zero groups and fine-tune \n     Group Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) \n   \n   \n     Pruning Filters for Efficient ConvNets \n     CIFAR10 \n     ResNet56 \n     Filter ranking; guided by sensitivity analysis \n     1.Rank filters 2. Remove filters and channels 3.Fine-tune \n     One-shot ranking and pruning of filters; with network thinning",
+            "location": "/model_zoo/index.html#contents", 
+            "text": "The Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models.  Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers.  These are meant to serve as examples of how Distiller can be used.  Each model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.  \ntable, th, td {\n    border: 1px solid black;\n}  \n   \n     Paper \n     Dataset \n     Network \n     Method   Granularity \n     Schedule \n     Features \n   \n   \n     Learning both Weights and Connections for Efficient Neural Networks \n     ImageNet \n     Alexnet \n     Element-wise pruning \n     Iterative; Manual \n     Magnitude thresholding based on a sensitivity quantifier. Element-wise sparsity sensitivity analysis \n   \n   \n     To prune, or not to prune: exploring the efficacy of pruning for model compression \n     ImageNet \n     MobileNet \n     Element-wise pruning \n     Automated gradual; Iterative \n     Magnitude thresholding based on target level \n   \n   \n     Learning Structured Sparsity in Deep Neural Networks \n     CIFAR10 \n     ResNet20 \n     Group regularization \n     1.Train with group-lasso 2.Remove zero groups and fine-tune \n     Group Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) \n   \n   \n     Pruning Filters for Efficient ConvNets \n     CIFAR10 \n     ResNet56 \n     Filter ranking; guided by sensitivity analysis \n     1.Rank filters 2. Remove filters and channels 3.Fine-tune \n     One-shot ranking and pruning of filters; with network thinning", 
             "title": "Contents"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#learning-both-weights-and-connections-for-efficient-neural-networks",
-            "text": "This schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation:  Efficient Methods and Hardware for Deep Learning  and in his paper  Learning both Weights and Connections for Efficient Neural Networks .    The Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\".  Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further.  In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis.  Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.  Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once.  In his PhD dissertation, Song Han describes a growing threshold, at each iteration.  This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration.  Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights.  Thus, we can use less hyper-parameters and achieve the same results.   Distiller schedule:  distiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml  Checkpoint file:  alexnet.checkpoint.89.pth.tar",
+            "location": "/model_zoo/index.html#learning-both-weights-and-connections-for-efficient-neural-networks", 
+            "text": "This schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation:  Efficient Methods and Hardware for Deep Learning  and in his paper  Learning both Weights and Connections for Efficient Neural Networks .    The Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\".  Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further.  In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis.  Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.  Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once.  In his PhD dissertation, Song Han describes a growing threshold, at each iteration.  This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration.  Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights.  Thus, we can use less hyper-parameters and achieve the same results.   Distiller schedule:  distiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml  Checkpoint file:  alexnet.checkpoint.89.pth.tar", 
             "title": "Learning both Weights and Connections for Efficient Neural Networks"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#results",
-            "text": "Our reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09.  We prune away 88.44% of the parameters and achieve  Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.  Parameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n|  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n|  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n|  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n|  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n|  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n|  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n|  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n|  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606    Top5: 79.446    Loss: 1.893",
+            "location": "/model_zoo/index.html#results", 
+            "text": "Our reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09.  We prune away 88.44% of the parameters and achieve  Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.  Parameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n|  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n|  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n|  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n|  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n|  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n|  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n|  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n|  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 - --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==  Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==  Top1: 56.606    Top5: 79.446    Loss: 1.893", 
             "title": "Results"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#to-prune-or-not-to-prune-exploring-the-efficacy-of-pruning-for-model-compression",
-            "text": "In their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"  This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps.  Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper.  The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs.  We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.  ImageNet files:   Distiller schedule:  distiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml  Checkpoint file:  checkpoint.pth.tar   ResNet18 files:   Distiller schedule:  distiller/examples/agp-pruning/resnet18.schedule_agp.yaml  Checkpoint file:  checkpoint.pth.tar",
+            "location": "/model_zoo/index.html#to-prune-or-not-to-prune-exploring-the-efficacy-of-pruning-for-model-compression", 
+            "text": "In their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"  This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps.  Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper.  The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs.  We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.  ImageNet files:   Distiller schedule:  distiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml  Checkpoint file:  checkpoint.pth.tar   ResNet18 files:   Distiller schedule:  distiller/examples/agp-pruning/resnet18.schedule_agp.yaml  Checkpoint file:  checkpoint.pth.tar", 
             "title": "To prune, or not to prune: exploring the efficacy of pruning for model compression"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#results_1",
-            "text": "As our baseline we used a  pretrained PyTorch MobileNet model  (width=1) which has Top1=68.848 and Top5=88.740. \nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy.  We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656).  We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper.    +----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                     | Shape              |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | module.model.0.0.weight  | (32, 3, 3, 3)      |           864 |            864 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.14466 |  0.00103 |    0.06508 |\n|  1 | module.model.1.0.weight  | (32, 1, 3, 3)      |           288 |            288 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.32146 |  0.01020 |    0.12932 |\n|  2 | module.model.1.3.weight  | (64, 32, 1, 1)     |          2048 |           2048 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11942 |  0.00024 |    0.03627 |\n|  3 | module.model.2.0.weight  | (64, 1, 3, 3)      |           576 |            576 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.15809 |  0.00543 |    0.11513 |\n|  4 | module.model.2.3.weight  | (128, 64, 1, 1)    |          8192 |           8192 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08442 | -0.00031 |    0.04182 |\n|  5 | module.model.3.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.16780 |  0.00125 |    0.10545 |\n|  6 | module.model.3.3.weight  | (128, 128, 1, 1)   |         16384 |          16384 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07126 | -0.00197 |    0.04123 |\n|  7 | module.model.4.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.10182 |  0.00171 |    0.08719 |\n|  8 | module.model.4.3.weight  | (256, 128, 1, 1)   |         32768 |          13108 |    0.00000 |    0.00000 | 10.15625 | 59.99756 | 12.50000 |   59.99756 | 0.05543 | -0.00002 |    0.02760 |\n|  9 | module.model.5.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.12516 | -0.00288 |    0.08058 |\n| 10 | module.model.5.3.weight  | (256, 256, 1, 1)   |         65536 |          26215 |    0.00000 |    0.00000 | 12.50000 | 59.99908 | 23.82812 |   59.99908 | 0.04453 |  0.00002 |    0.02271 |\n| 11 | module.model.6.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08024 |  0.00252 |    0.06377 |\n| 12 | module.model.6.3.weight  | (512, 256, 1, 1)   |        131072 |          52429 |    0.00000 |    0.00000 | 23.82812 | 59.99985 | 14.25781 |   59.99985 | 0.03561 | -0.00057 |    0.01779 |\n| 13 | module.model.7.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11008 | -0.00018 |    0.06829 |\n| 14 | module.model.7.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 14.25781 | 59.99985 | 21.28906 |   59.99985 | 0.02944 | -0.00060 |    0.01515 |\n| 15 | module.model.8.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08258 |  0.00370 |    0.04905 |\n| 16 | module.model.8.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 21.28906 | 59.99985 | 28.51562 |   59.99985 | 0.02865 | -0.00046 |    0.01465 |\n| 17 | module.model.9.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07578 |  0.00468 |    0.04201 |\n| 18 | module.model.9.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 28.51562 | 59.99985 | 23.43750 |   59.99985 | 0.02939 | -0.00044 |    0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07091 |  0.00014 |    0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 24.60938 | 59.99985 | 20.89844 |   59.99985 | 0.03095 | -0.00059 |    0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.05729 | -0.00518 |    0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 20.89844 | 59.99985 | 17.57812 |   59.99985 | 0.03229 | -0.00044 |    0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.04981 | -0.00136 |    0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1)  |        524288 |         209716 |    0.00000 |    0.00000 | 16.01562 | 59.99985 | 44.23828 |   59.99985 | 0.02514 | -0.00106 |    0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3)    |          9216 |           9216 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.02396 | -0.00949 |    0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) |       1048576 |         419431 |    0.00000 |    0.00000 | 44.72656 | 59.99994 |  1.46484 |   59.99994 | 0.01801 | -0.00017 |    0.00931 |\n| 27 | module.fc.weight         | (1000, 1024)       |       1024000 |         409600 |    1.46484 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   60.00000 | 0.05078 |  0.00271 |    0.02734 |\n| 28 | Total sparsity:          | -                  |       4209088 |        1726917 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   58.97171 | 0.00000 |  0.00000 |    0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==> Top1: 65.337    Top5: 84.984    Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==> Top1: 68.810    Top5: 88.626    Loss: 1.282",
+            "location": "/model_zoo/index.html#results_1", 
+            "text": "As our baseline we used a  pretrained PyTorch MobileNet model  (width=1) which has Top1=68.848 and Top5=88.740. \nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy.  We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656).  We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper.    +----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                     | Shape              |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | module.model.0.0.weight  | (32, 3, 3, 3)      |           864 |            864 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.14466 |  0.00103 |    0.06508 |\n|  1 | module.model.1.0.weight  | (32, 1, 3, 3)      |           288 |            288 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.32146 |  0.01020 |    0.12932 |\n|  2 | module.model.1.3.weight  | (64, 32, 1, 1)     |          2048 |           2048 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11942 |  0.00024 |    0.03627 |\n|  3 | module.model.2.0.weight  | (64, 1, 3, 3)      |           576 |            576 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.15809 |  0.00543 |    0.11513 |\n|  4 | module.model.2.3.weight  | (128, 64, 1, 1)    |          8192 |           8192 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08442 | -0.00031 |    0.04182 |\n|  5 | module.model.3.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.16780 |  0.00125 |    0.10545 |\n|  6 | module.model.3.3.weight  | (128, 128, 1, 1)   |         16384 |          16384 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07126 | -0.00197 |    0.04123 |\n|  7 | module.model.4.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.10182 |  0.00171 |    0.08719 |\n|  8 | module.model.4.3.weight  | (256, 128, 1, 1)   |         32768 |          13108 |    0.00000 |    0.00000 | 10.15625 | 59.99756 | 12.50000 |   59.99756 | 0.05543 | -0.00002 |    0.02760 |\n|  9 | module.model.5.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.12516 | -0.00288 |    0.08058 |\n| 10 | module.model.5.3.weight  | (256, 256, 1, 1)   |         65536 |          26215 |    0.00000 |    0.00000 | 12.50000 | 59.99908 | 23.82812 |   59.99908 | 0.04453 |  0.00002 |    0.02271 |\n| 11 | module.model.6.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08024 |  0.00252 |    0.06377 |\n| 12 | module.model.6.3.weight  | (512, 256, 1, 1)   |        131072 |          52429 |    0.00000 |    0.00000 | 23.82812 | 59.99985 | 14.25781 |   59.99985 | 0.03561 | -0.00057 |    0.01779 |\n| 13 | module.model.7.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11008 | -0.00018 |    0.06829 |\n| 14 | module.model.7.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 14.25781 | 59.99985 | 21.28906 |   59.99985 | 0.02944 | -0.00060 |    0.01515 |\n| 15 | module.model.8.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08258 |  0.00370 |    0.04905 |\n| 16 | module.model.8.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 21.28906 | 59.99985 | 28.51562 |   59.99985 | 0.02865 | -0.00046 |    0.01465 |\n| 17 | module.model.9.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07578 |  0.00468 |    0.04201 |\n| 18 | module.model.9.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 28.51562 | 59.99985 | 23.43750 |   59.99985 | 0.02939 | -0.00044 |    0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07091 |  0.00014 |    0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 24.60938 | 59.99985 | 20.89844 |   59.99985 | 0.03095 | -0.00059 |    0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.05729 | -0.00518 |    0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 20.89844 | 59.99985 | 17.57812 |   59.99985 | 0.03229 | -0.00044 |    0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.04981 | -0.00136 |    0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1)  |        524288 |         209716 |    0.00000 |    0.00000 | 16.01562 | 59.99985 | 44.23828 |   59.99985 | 0.02514 | -0.00106 |    0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3)    |          9216 |           9216 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.02396 | -0.00949 |    0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) |       1048576 |         419431 |    0.00000 |    0.00000 | 44.72656 | 59.99994 |  1.46484 |   59.99994 | 0.01801 | -0.00017 |    0.00931 |\n| 27 | module.fc.weight         | (1000, 1024)       |       1024000 |         409600 |    1.46484 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   60.00000 | 0.05078 |  0.00271 |    0.02734 |\n| 28 | Total sparsity:          | -                  |       4209088 |        1726917 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   58.97171 | 0.00000 |  0.00000 |    0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==  Top1: 65.337    Top5: 84.984    Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==  Top1: 68.810    Top5: 88.626    Loss: 1.282", 
             "title": "Results"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#learning-structured-sparsity-in-deep-neural-networks",
-            "text": "This research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"  Note that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group.  We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength.  At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit.  Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).",
+            "location": "/model_zoo/index.html#learning-structured-sparsity-in-deep-neural-networks", 
+            "text": "This research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"  Note that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group.  We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength.  At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit.  Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).", 
             "title": "Learning Structured Sparsity in Deep Neural Networks"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#baseline-training",
-            "text": "We started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.   Distiller schedule:  distiller/examples/ssl/resnet20_cifar_baseline_training.yaml  Checkpoint files:  distiller/examples/ssl/checkpoints/   $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic",
+            "location": "/model_zoo/index.html#baseline-training", 
+            "text": "We started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.   Distiller schedule:  distiller/examples/ssl/resnet20_cifar_baseline_training.yaml  Checkpoint files:  distiller/examples/ssl/checkpoints/   $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic", 
             "title": "Baseline training"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#regularization",
-            "text": "Then we started training from scratch again, but this time we used Group Lasso regularization on entire layers: \nDistiller schedule:  distiller/examples/ssl/ssl_4D-removal_4L_training.yaml  $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic  The diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10  baseline (in red).  You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss ( Reg Loss ), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.  This  regularization  yields 5 layers with zeroed weight tensors.  We load this model, remove the 5 layers, and start the fine tuning of the weights.  This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path.  When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated.    We managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time.  It's not bad, but we probably could have done better.",
+            "location": "/model_zoo/index.html#regularization", 
+            "text": "Then we started training from scratch again, but this time we used Group Lasso regularization on entire layers: \nDistiller schedule:  distiller/examples/ssl/ssl_4D-removal_4L_training.yaml  $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic  The diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10  baseline (in red).  You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss ( Reg Loss ), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.  This  regularization  yields 5 layers with zeroed weight tensors.  We load this model, remove the 5 layers, and start the fine tuning of the weights.  This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path.  When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated.    We managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time.  It's not bad, but we probably could have done better.", 
             "title": "Regularization"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#fine-tuning",
-            "text": "During the  fine-tuning  process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network. \nWe copy the checkpoint file of the regularized model to  checkpoint_trained_4D_regularized_5Lremoved.pth.tar . \nDistiller schedule:  distiller/examples/ssl/ssl_4D-removal_finetuning.yaml  $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml  -j=1 --deterministic",
+            "location": "/model_zoo/index.html#fine-tuning", 
+            "text": "During the  fine-tuning  process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropogated: therefore they are completely disconnected from the network. \nWe copy the checkpoint file of the regularized model to  checkpoint_trained_4D_regularized_5Lremoved.pth.tar . \nDistiller schedule:  distiller/examples/ssl/ssl_4D-removal_finetuning.yaml  $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml  -j=1 --deterministic", 
             "title": "Fine-tuning"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#results_2",
-            "text": "Our baseline results for ResNet20 Cifar are: Top1=91.450 and  Top5=99.750  We used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies. \nThe regularized model exhibits really poor classification abilities:   $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=> loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n   best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n        training=45000\n        validation=5000\n        test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==> Top1: 22.290    Top5: 68.940    Loss: 5.172  However, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670  We didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).",
+            "location": "/model_zoo/index.html#results_2", 
+            "text": "Our baseline results for ResNet20 Cifar are: Top1=91.450 and  Top5=99.750  We used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies. \nThe regularized model exhibits really poor classification abilities:   $ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=  loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n   best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n        training=45000\n        validation=5000\n        test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==  Top1: 22.290    Top5: 68.940    Loss: 5.172  However, after fine-tuning, we recovered most of the accuracies loss, but not quite all of it: Top1=91.020 and Top5=99.670  We didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).", 
             "title": "Results"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#pruning-filters-for-efficient-convnets",
-            "text": "Quoting the authors directly:   We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.   The implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".  After performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level.     Distiller schedule:  distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml  Checkpoint files:  checkpoint_finetuned.pth.tar   The excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner.  This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning).  pruners:\n  filter_pruner:\n    class: 'L1RankedStructureParameterPruner'\n    reg_regims:\n      'module.layer1.0.conv1.weight': [0.6, '3D']\n      'module.layer1.1.conv1.weight': [0.6, '3D']\n      'module.layer1.2.conv1.weight': [0.6, '3D']\n      'module.layer1.3.conv1.weight': [0.6, '3D']  In the policy, we specify that we want to invoke this pruner once, at epoch 180.  Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.  policies:\n  - pruner:\n      instance_name: filter_pruner\n    epochs: [180]  Following the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors.  When we remove filters from Convolution layer  n  we need to perform several changes to the network:\n1. Shrink layer  n 's weights tensor, leaving only the \"important\" filters.\n2. Configure layer  n 's  .out_channels  member to its new, smaller, value.\n3. If a BN layer follows layer  n , then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights.  All of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180.  We call this process \"network thinning\".  extensions:\n  net_thinner:\n      class: 'FilterRemover'\n      thinning_func_str: remove_filters\n      arch: 'resnet56_cifar'\n      dataset: 'cifar10'  Network thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this.  On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider. \nOur current implementation is specific to certain layers in ResNet and is a bit fragile.  We will continue to improve and generalize this.",
+            "location": "/model_zoo/index.html#pruning-filters-for-efficient-convnets", 
+            "text": "Quoting the authors directly:   We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.   The implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".  After performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level.     Distiller schedule:  distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml  Checkpoint files:  checkpoint_finetuned.pth.tar   The excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner.  This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning).  pruners:\n  filter_pruner:\n    class: 'L1RankedStructureParameterPruner'\n    reg_regims:\n      'module.layer1.0.conv1.weight': [0.6, '3D']\n      'module.layer1.1.conv1.weight': [0.6, '3D']\n      'module.layer1.2.conv1.weight': [0.6, '3D']\n      'module.layer1.3.conv1.weight': [0.6, '3D']  In the policy, we specify that we want to invoke this pruner once, at epoch 180.  Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.  policies:\n  - pruner:\n      instance_name: filter_pruner\n    epochs: [180]  Following the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors.  When we remove filters from Convolution layer  n  we need to perform several changes to the network:\n1. Shrink layer  n 's weights tensor, leaving only the \"important\" filters.\n2. Configure layer  n 's  .out_channels  member to its new, smaller, value.\n3. If a BN layer follows layer  n , then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have less input channels which requires reconfiguration and shrinking of its weights.  All of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180.  We call this process \"network thinning\".  extensions:\n  net_thinner:\n      class: 'FilterRemover'\n      thinning_func_str: remove_filters\n      arch: 'resnet56_cifar'\n      dataset: 'cifar10'  Network thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this.  On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there is extra details to consider. \nOur current implementation is specific to certain layers in ResNet and is a bit fragile.  We will continue to improve and generalize this.", 
             "title": "Pruning Filters for Efficient ConvNets"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#baseline-training_1",
-            "text": "We started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.   Distiller schedule:  distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml  Checkpoint files:  checkpoint.resnet56_cifar_baseline.pth.tar",
+            "location": "/model_zoo/index.html#baseline-training_1", 
+            "text": "We started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.   Distiller schedule:  distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml  Checkpoint files:  checkpoint.resnet56_cifar_baseline.pth.tar", 
             "title": "Baseline training"
-        },
+        }, 
         {
-            "location": "/model_zoo/index.html#results_3",
-            "text": "We trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.  We used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760",
+            "location": "/model_zoo/index.html#results_3", 
+            "text": "We trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.  We used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760", 
             "title": "Results"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html",
-            "text": "Jupyter environment\n\n\nThe Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results.\n\n\nEach notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.\n\n\nInstallation\n\n\nJupyter and its dependencies are included as part of the main \nrequirements.txt\n file, so there is no need for a dedicated installation step.\n\nHowever, to use the ipywidgets extension, you will need to enable it:\n\n\n$ jupyter nbextension enable --py widgetsnbextension --sys-prefix\n\n\n\n\nYou may want to refer to the \nipywidgets extension installation documentation\n.\n\n\nAnother extension which requires special installation handling is \nQgrid\n.  Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering.  To enable Qgrid:\n\n\n$ jupyter nbextension enable --py --sys-prefix qgrid\n\n\n\n\nLaunching the Jupyter server\n\n\nThere are all kinds of options to use when launching Jupyter which you can use.  The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want.\n\nConsult the \nuser's guide\n for more details.\n\n\n$ jupyter-notebook --ip=0.0.0.0 --no-browser\n\n\n\n\nUsing the Distiller notebooks\n\n\nThe Distiller Jupyter notebooks are located in the \ndistiller/jupyter\n directory.\n\nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter.\n\n\nRoughly, the notebooks can be divided into three categories.\n\n\nTheory\n\n\n\n\njupyter/L1-regularization.ipynb\n: Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity.\n\n\njupyter/alexnet_insights.ipynb\n: This notebook reviews and compares a couple of pruning sessions on Alexnet.  We compare distributions, performance, statistics and show some visualizations of the weights tensors.\n\n\n\n\nPreparation for compression\n\n\n\n\njupyter/model_summary.ipynb\n: Begin by getting familiar with your model.  Examine the sizes and properties of layers and connections.  Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model.\n\n\njupyter/sensitivity_analysis.ipynb\n: If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave.\n\n\njupyter/interactive_lr_scheduler.ipynb\n: The learning rate decay policy affects pruning results, perhaps as much as it affects training results.  Graph a few LR-decay policies to see how they behave.\n\n\njupyter/jupyter/agp_schedule.ipynb\n: If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.\n\n\n\n\nReviewing experiment results\n\n\n\n\njupyter/compare_executions.ipynb\n: This is a simple notebook to help you graphically compare the results of executions of several experiments.\n\n\njupyter/compression_insights.ipynb\n: This notebook is packed with code, tables and graphs to us understand the results of a compression session.  Distiller provides \nsummaries\n, which are Pandas dataframes, which contain statistical information about you model.  We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.",
+            "location": "/jupyter/index.html", 
+            "text": "Jupyter environment\n\n\nThe Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results.\n\n\nEach notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.\n\n\nInstallation\n\n\nJupyter and its dependencies are included as part of the main \nrequirements.txt\n file, so there is no need for a dedicated installation step.\n\nHowever, to use the ipywidgets extension, you will need to enable it:\n\n\n$ jupyter nbextension enable --py widgetsnbextension --sys-prefix\n\n\n\n\nYou may want to refer to the \nipywidgets extension installation documentation\n.\n\n\nAnother extension which requires special installation handling is \nQgrid\n.  Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering.  To enable Qgrid:\n\n\n$ jupyter nbextension enable --py --sys-prefix qgrid\n\n\n\n\nLaunching the Jupyter server\n\n\nThere are all kinds of options to use when launching Jupyter which you can use.  The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want.\n\nConsult the \nuser's guide\n for more details.\n\n\n$ jupyter-notebook --ip=0.0.0.0 --no-browser\n\n\n\n\nUsing the Distiller notebooks\n\n\nThe Distiller Jupyter notebooks are located in the \ndistiller/jupyter\n directory.\n\nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter.\n\n\nRoughly, the notebooks can be divided into three categories.\n\n\nTheory\n\n\n\n\njupyter/L1-regularization.ipynb\n: Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity.\n\n\njupyter/alexnet_insights.ipynb\n: This notebook reviews and compares a couple of pruning sessions on Alexnet.  We compare distributions, performance, statistics and show some visualizations of the weights tensors.\n\n\n\n\nPreparation for compression\n\n\n\n\njupyter/model_summary.ipynb\n: Begin by getting familiar with your model.  Examine the sizes and properties of layers and connections.  Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model.\n\n\njupyter/sensitivity_analysis.ipynb\n: If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave.\n\n\njupyter/interactive_lr_scheduler.ipynb\n: The learning rate decay policy affects pruning results, perhaps as much as it affects training results.  Graph a few LR-decay policies to see how they behave.\n\n\njupyter/jupyter/agp_schedule.ipynb\n: If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.\n\n\n\n\nReviewing experiment results\n\n\n\n\njupyter/compare_executions.ipynb\n: This is a simple notebook to help you graphically compare the results of executions of several experiments.\n\n\njupyter/compression_insights.ipynb\n: This notebook is packed with code, tables and graphs to us understand the results of a compression session.  Distiller provides \nsummaries\n, which are Pandas dataframes, which contain statistical information about you model.  We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.", 
             "title": "Jupyter Notebooks"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#jupyter-environment",
-            "text": "The Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results.  Each notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.",
+            "location": "/jupyter/index.html#jupyter-environment", 
+            "text": "The Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results.  Each notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.", 
             "title": "Jupyter environment"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#installation",
-            "text": "Jupyter and its dependencies are included as part of the main  requirements.txt  file, so there is no need for a dedicated installation step. \nHowever, to use the ipywidgets extension, you will need to enable it:  $ jupyter nbextension enable --py widgetsnbextension --sys-prefix  You may want to refer to the  ipywidgets extension installation documentation .  Another extension which requires special installation handling is  Qgrid .  Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering.  To enable Qgrid:  $ jupyter nbextension enable --py --sys-prefix qgrid",
+            "location": "/jupyter/index.html#installation", 
+            "text": "Jupyter and its dependencies are included as part of the main  requirements.txt  file, so there is no need for a dedicated installation step. \nHowever, to use the ipywidgets extension, you will need to enable it:  $ jupyter nbextension enable --py widgetsnbextension --sys-prefix  You may want to refer to the  ipywidgets extension installation documentation .  Another extension which requires special installation handling is  Qgrid .  Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Panadas DataFrames rendering.  To enable Qgrid:  $ jupyter nbextension enable --py --sys-prefix qgrid", 
             "title": "Installation"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#launching-the-jupyter-server",
-            "text": "There are all kinds of options to use when launching Jupyter which you can use.  The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want. \nConsult the  user's guide  for more details.  $ jupyter-notebook --ip=0.0.0.0 --no-browser",
+            "location": "/jupyter/index.html#launching-the-jupyter-server", 
+            "text": "There are all kinds of options to use when launching Jupyter which you can use.  The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want. \nConsult the  user's guide  for more details.  $ jupyter-notebook --ip=0.0.0.0 --no-browser", 
             "title": "Launching the Jupyter server"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#using-the-distiller-notebooks",
-            "text": "The Distiller Jupyter notebooks are located in the  distiller/jupyter  directory. \nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter.  Roughly, the notebooks can be divided into three categories.",
+            "location": "/jupyter/index.html#using-the-distiller-notebooks", 
+            "text": "The Distiller Jupyter notebooks are located in the  distiller/jupyter  directory. \nThey are provided as tools that you can use to prepare your compression experiments and study their results.\nWe welcome new ideas and implementations of Jupyter.  Roughly, the notebooks can be divided into three categories.", 
             "title": "Using the Distiller notebooks"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#theory",
-            "text": "jupyter/L1-regularization.ipynb : Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity.  jupyter/alexnet_insights.ipynb : This notebook reviews and compares a couple of pruning sessions on Alexnet.  We compare distributions, performance, statistics and show some visualizations of the weights tensors.",
+            "location": "/jupyter/index.html#theory", 
+            "text": "jupyter/L1-regularization.ipynb : Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity.  jupyter/alexnet_insights.ipynb : This notebook reviews and compares a couple of pruning sessions on Alexnet.  We compare distributions, performance, statistics and show some visualizations of the weights tensors.", 
             "title": "Theory"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#preparation-for-compression",
-            "text": "jupyter/model_summary.ipynb : Begin by getting familiar with your model.  Examine the sizes and properties of layers and connections.  Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model.  jupyter/sensitivity_analysis.ipynb : If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave.  jupyter/interactive_lr_scheduler.ipynb : The learning rate decay policy affects pruning results, perhaps as much as it affects training results.  Graph a few LR-decay policies to see how they behave.  jupyter/jupyter/agp_schedule.ipynb : If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.",
+            "location": "/jupyter/index.html#preparation-for-compression", 
+            "text": "jupyter/model_summary.ipynb : Begin by getting familiar with your model.  Examine the sizes and properties of layers and connections.  Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model.  jupyter/sensitivity_analysis.ipynb : If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave.  jupyter/interactive_lr_scheduler.ipynb : The learning rate decay policy affects pruning results, perhaps as much as it affects training results.  Graph a few LR-decay policies to see how they behave.  jupyter/jupyter/agp_schedule.ipynb : If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.", 
             "title": "Preparation for compression"
-        },
+        }, 
         {
-            "location": "/jupyter/index.html#reviewing-experiment-results",
-            "text": "jupyter/compare_executions.ipynb : This is a simple notebook to help you graphically compare the results of executions of several experiments.  jupyter/compression_insights.ipynb : This notebook is packed with code, tables and graphs to us understand the results of a compression session.  Distiller provides  summaries , which are Pandas dataframes, which contain statistical information about you model.  We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.",
+            "location": "/jupyter/index.html#reviewing-experiment-results", 
+            "text": "jupyter/compare_executions.ipynb : This is a simple notebook to help you graphically compare the results of executions of several experiments.  jupyter/compression_insights.ipynb : This notebook is packed with code, tables and graphs to us understand the results of a compression session.  Distiller provides  summaries , which are Pandas dataframes, which contain statistical information about you model.  We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.", 
             "title": "Reviewing experiment results"
-        },
+        }, 
         {
-            "location": "/design/index.html",
-            "text": "Distiller design\n\n\nDistiller is designed to be easily integrated into your own PyTorch research applications.\n\nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models (\ncompress_classifier.py\n).\n\n\nThe application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.\n\n\nIntegrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training.  The training skeleton looks like the pseudo code below.  The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.\n\n\nFor each epoch:\n    compression_scheduler.on_epoch_begin(epoch)\n    train()\n    validate()\n    save_checkpoint()\n    compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n    For each training step:\n        compression_scheduler.on_minibatch_begin(epoch)\n        output = model(input_var)\n        loss = criterion(output, target_var)\n        compression_scheduler.before_backward_pass(epoch)\n        loss.backward()\n        optimizer.step()\n        compression_scheduler.on_minibatch_end(epoch)\n\n\n\n\nThese callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's \nScheduler\n, which invokes the correct algorithm.  The application also uses Distiller services to collect statistics in \nSummaries\n and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.\n\n\n\n\nSparsification and fine-tuning\n\n\n\n\nThe application sets up a model as normally done in PyTorch.\n\n\nAnd then instantiates a Scheduler and configures it:\n\n\nScheduler configuration is defined in a YAML file\n\n\nThe configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.\n\n\nSome types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".\n\n\nSome algorithms control some parameter of the training process, such as the learning-rate decay scheduler (\nlr_scheduler\n).\n\n\nThe parameters of each algorithm are also specified in the configuration.\n\n\n\n\n\n\n\n\n\n\nIn addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.\n\n\nThe Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.\n\n\nThese callbacks are placed the training loop.\n\n\n\n\nQuantization\n\n\nA quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.\n\n\nIn Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided.\n\n\nWe also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the \nQuantizer\n class. \nQuantizer\n should be sub-classed for each quantization method.\n\n\nModel Transformation\n\n\nThe high-level flow is as follows:\n\n\n\n\nDefine a \nmapping\n between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the \nreplacement_factory\n attribute of the \nQuantizer\n class.\n\n\nIterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.\n\n\nReplace the existing module with the module returned by the function. It is important to note that the \nname\n of the module \ndoes not\n change, as that could break the \nforward\n function of the parent module.\n\n\n\n\nDifferent quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different \nmapping\n will likely be defined.\n\nEach sub-class of \nQuantizer\n should populate the \nreplacement_factory\n dictionary attribute with the appropriate mapping.\n\nTo execute the model transformation, call the \nprepare_model\n function of the \nQuantizer\n instance.\n\n\nFlexible Bit-Widths\n\n\n\n\nEach instance of \nQuantizer\n is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the \nbits_activations\n and \nbits_weights\n parameters in \nQuantizer\n's constructor. Sub-classes may define bit-widths for other tensor types as needed.\n\n\nWe also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.\n\n\nSo, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the \nbits_overrides\n parameter in the constructor.\n\n\nThe \nbits_overrides\n mapping is required to be an instance of \ncollections.OrderedDict\n (as opposed to just a simple Python \ndict\n). This is done in order to enable handling of overlapping name patterns.\n\n     So, for example, one could define certain override parameters for a group of layers, e.g. 'conv*', but also define different parameters for specific layers in that group, e.g. 'conv1'.\n\n     The patterns are evaluated eagerly - the first match wins. Therefore, the more specific patterns must come before the broad patterns.\n\n\n\n\nWeights Quantization\n\n\nThe \nQuantizer\n class also provides an API to quantize the weights of all layers at once. To use it, the \nparam_quantization_fn\n attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the \nQuantizer\n class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the \nquantize_params\n function can be called, which will iterate over all parameters and quantize them using \nparams_quantization_fn\n.\n\n\nQuantization-Aware Training\n\n\nThe \nQuantizer\n class supports quantization-aware training, that is - training with quantization in the loop. This requires handling of a couple of flows / scenarios:\n\n\n\n\n\n\nMaintaining a full precision copy of the weights, as described \nhere\n. This is enabled by setting \ntrain_with_fp_copy=True\n in the \nQuantizer\n constructor. At model transformation, in each module that has parameters that should be quantized, a new \ntorch.nn.Parameter\n is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module \nis not\n created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\": \n\n\n\n\nThe existing \ntorch.nn.Parameter\n, e.g. \nweights\n, is replaced by a \ntorch.nn.Parameter\n named \nfloat_weight\n.\n\n\nTo maintain the existing functionality of the module, we then register a \nbuffer\n in the module with the original name - \nweights\n.\n\n\nDuring training, \nfloat_weight\n will be passed to \nparam_quantization_fn\n and the result will be stored in \nweight\n.\n\n\n\n\n\n\n\n\nIn addition, some quantization methods may introduce additional learned parameters to the model. For example, in the \nPACT\n method, acitvations are clipped to a value \n\\alpha\n, which is a learned parameter per-layer\n\n\n\n\n\n\nTo support these two cases, the \nQuantizer\n class also accepts an instance of a \ntorch.optim.Optimizer\n (normally this would be one an instance of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters.   \n\n\n\n\nOptimizing New Parameters\n\n\nIn cases where new parameters are required by the scheme, it is likely that they'll need to be optimized separately from the main model parameters. In that case, the sub-class for the speicifc method should override \nQuantizer._get_updated_optimizer_params_groups()\n, and return the proper groups plus any desired hyper-parameter overrides.\n\n\n\n\nExamples\n\n\nThe base \nQuantizer\n class is implemented in \ndistiller/quantization/quantizer.py\n.\n\nFor a simple sub-class implementing symmetric linear quantization, see \nSymmetricLinearQuantizer\n in \ndistiller/quantization/range_linear.py\n.\n\nIn \ndistiller/quantization/clipped_linear.py\n there are examples of lower-precision methods which use training with quantization. Specifically, see \nPACTQuantizer\n for an example of overriding \nQuantizer._get_updated_optimizer_params_groups()\n.",
+            "location": "/design/index.html", 
+            "text": "Distiller design\n\n\nDistiller is designed to be easily integrated into your own PyTorch research applications.\n\nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models (\ncompress_classifier.py\n).\n\n\nThe application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.\n\n\nIntegrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training.  The training skeleton looks like the pseudo code below.  The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.\n\n\nFor each epoch:\n    compression_scheduler.on_epoch_begin(epoch)\n    train()\n    validate()\n    save_checkpoint()\n    compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n    For each training step:\n        compression_scheduler.on_minibatch_begin(epoch)\n        output = model(input_var)\n        loss = criterion(output, target_var)\n        compression_scheduler.before_backward_pass(epoch)\n        loss.backward()\n        optimizer.step()\n        compression_scheduler.on_minibatch_end(epoch)\n\n\n\n\nThese callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's \nScheduler\n, which invokes the correct algorithm.  The application also uses Distiller services to collect statistics in \nSummaries\n and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.\n\n\n\n\nSparsification and fine-tuning\n\n\n\n\nThe application sets up a model as normally done in PyTorch.\n\n\nAnd then instantiates a Scheduler and configures it:\n\n\nScheduler configuration is defined in a YAML file\n\n\nThe configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.\n\n\nSome types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".\n\n\nSome algorithms control some parameter of the training process, such as the learning-rate decay scheduler (\nlr_scheduler\n).\n\n\nThe parameters of each algorithm are also specified in the configuration.\n\n\n\n\n\n\n\n\n\n\nIn addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.\n\n\nThe Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.\n\n\nThese callbacks are placed the training loop.\n\n\n\n\nQuantization\n\n\nA quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.\n\n\nIn Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided.\n\n\nWe also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the \nQuantizer\n class. \nQuantizer\n should be sub-classed for each quantization method.\n\n\nModel Transformation\n\n\nThe high-level flow is as follows:\n\n\n\n\nDefine a \nmapping\n between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the \nreplacement_factory\n attribute of the \nQuantizer\n class.\n\n\nIterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.\n\n\nReplace the existing module with the module returned by the function. It is important to note that the \nname\n of the module \ndoes not\n change, as that could break the \nforward\n function of the parent module.\n\n\n\n\nDifferent quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different \nmapping\n will likely be defined.\n\nEach sub-class of \nQuantizer\n should populate the \nreplacement_factory\n dictionary attribute with the appropriate mapping.\n\nTo execute the model transformation, call the \nprepare_model\n function of the \nQuantizer\n instance.\n\n\nFlexible Bit-Widths\n\n\n\n\nEach instance of \nQuantizer\n is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the \nbits_activations\n and \nbits_weights\n parameters in \nQuantizer\n's constructor. Sub-classes may define bit-widths for other tensor types as needed.\n\n\nWe also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.\n\n\nSo, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the \nbits_overrides\n parameter in the constructor.\n\n\nThe \nbits_overrides\n mapping is required to be an instance of \ncollections.OrderedDict\n (as opposed to just a simple Python \ndict\n). This is done in order to enable handling of overlapping name patterns.\n\n     So, for example, one could define certain override parameters for a group of layers, e.g. 'conv*', but also define different parameters for specific layers in that group, e.g. 'conv1'.\n\n     The patterns are evaluated eagerly - the first match wins. Therefore, the more specific patterns must come before the broad patterns.\n\n\n\n\nWeights Quantization\n\n\nThe \nQuantizer\n class also provides an API to quantize the weights of all layers at once. To use it, the \nparam_quantization_fn\n attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the \nQuantizer\n class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the \nquantize_params\n function can be called, which will iterate over all parameters and quantize them using \nparams_quantization_fn\n.\n\n\nQuantization-Aware Training\n\n\nThe \nQuantizer\n class supports quantization-aware training, that is - training with quantization in the loop. This requires handling of a couple of flows / scenarios:\n\n\n\n\n\n\nMaintaining a full precision copy of the weights, as described \nhere\n. This is enabled by setting \ntrain_with_fp_copy=True\n in the \nQuantizer\n constructor. At model transformation, in each module that has parameters that should be quantized, a new \ntorch.nn.Parameter\n is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module \nis not\n created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\": \n\n\n\n\nThe existing \ntorch.nn.Parameter\n, e.g. \nweights\n, is replaced by a \ntorch.nn.Parameter\n named \nfloat_weight\n.\n\n\nTo maintain the existing functionality of the module, we then register a \nbuffer\n in the module with the original name - \nweights\n.\n\n\nDuring training, \nfloat_weight\n will be passed to \nparam_quantization_fn\n and the result will be stored in \nweight\n.\n\n\n\n\n\n\n\n\nIn addition, some quantization methods may introduce additional learned parameters to the model. For example, in the \nPACT\n method, acitvations are clipped to a value \n\\alpha\n, which is a learned parameter per-layer\n\n\n\n\n\n\nTo support these two cases, the \nQuantizer\n class also accepts an instance of a \ntorch.optim.Optimizer\n (normally this would be one an instance of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters.   \n\n\n\n\nOptimizing New Parameters\n\n\nIn cases where new parameters are required by the scheme, it is likely that they'll need to be optimized separately from the main model parameters. In that case, the sub-class for the speicifc method should override \nQuantizer._get_updated_optimizer_params_groups()\n, and return the proper groups plus any desired hyper-parameter overrides.\n\n\n\n\nExamples\n\n\nThe base \nQuantizer\n class is implemented in \ndistiller/quantization/quantizer.py\n.\n\nFor a simple sub-class implementing symmetric linear quantization, see \nSymmetricLinearQuantizer\n in \ndistiller/quantization/range_linear.py\n.\n\nIn \ndistiller/quantization/clipped_linear.py\n there are examples of lower-precision methods which use training with quantization. Specifically, see \nPACTQuantizer\n for an example of overriding \nQuantizer._get_updated_optimizer_params_groups()\n.", 
             "title": "Design"
-        },
+        }, 
         {
-            "location": "/design/index.html#distiller-design",
-            "text": "Distiller is designed to be easily integrated into your own PyTorch research applications. \nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models ( compress_classifier.py ).  The application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.  Integrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training.  The training skeleton looks like the pseudo code below.  The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.  For each epoch:\n    compression_scheduler.on_epoch_begin(epoch)\n    train()\n    validate()\n    save_checkpoint()\n    compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n    For each training step:\n        compression_scheduler.on_minibatch_begin(epoch)\n        output = model(input_var)\n        loss = criterion(output, target_var)\n        compression_scheduler.before_backward_pass(epoch)\n        loss.backward()\n        optimizer.step()\n        compression_scheduler.on_minibatch_end(epoch)  These callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's  Scheduler , which invokes the correct algorithm.  The application also uses Distiller services to collect statistics in  Summaries  and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.",
+            "location": "/design/index.html#distiller-design", 
+            "text": "Distiller is designed to be easily integrated into your own PyTorch research applications. \nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models ( compress_classifier.py ).  The application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.  Integrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training.  The training skeleton looks like the pseudo code below.  The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.  For each epoch:\n    compression_scheduler.on_epoch_begin(epoch)\n    train()\n    validate()\n    save_checkpoint()\n    compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n    For each training step:\n        compression_scheduler.on_minibatch_begin(epoch)\n        output = model(input_var)\n        loss = criterion(output, target_var)\n        compression_scheduler.before_backward_pass(epoch)\n        loss.backward()\n        optimizer.step()\n        compression_scheduler.on_minibatch_end(epoch)  These callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's  Scheduler , which invokes the correct algorithm.  The application also uses Distiller services to collect statistics in  Summaries  and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.", 
             "title": "Distiller design"
-        },
+        }, 
         {
-            "location": "/design/index.html#sparsification-and-fine-tuning",
-            "text": "The application sets up a model as normally done in PyTorch.  And then instantiates a Scheduler and configures it:  Scheduler configuration is defined in a YAML file  The configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.  Some types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".  Some algorithms control some parameter of the training process, such as the learning-rate decay scheduler ( lr_scheduler ).  The parameters of each algorithm are also specified in the configuration.      In addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.  The Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.  These callbacks are placed the training loop.",
+            "location": "/design/index.html#sparsification-and-fine-tuning", 
+            "text": "The application sets up a model as normally done in PyTorch.  And then instantiates a Scheduler and configures it:  Scheduler configuration is defined in a YAML file  The configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.  Some types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".  Some algorithms control some parameter of the training process, such as the learning-rate decay scheduler ( lr_scheduler ).  The parameters of each algorithm are also specified in the configuration.      In addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.  The Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.  These callbacks are placed the training loop.", 
             "title": "Sparsification and fine-tuning"
-        },
+        }, 
         {
-            "location": "/design/index.html#quantization",
-            "text": "A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.  In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided.  We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the  Quantizer  class.  Quantizer  should be sub-classed for each quantization method.",
+            "location": "/design/index.html#quantization", 
+            "text": "A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.  In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided.  We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the  Quantizer  class.  Quantizer  should be sub-classed for each quantization method.", 
             "title": "Quantization"
-        },
+        }, 
         {
-            "location": "/design/index.html#model-transformation",
-            "text": "The high-level flow is as follows:   Define a  mapping  between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the  replacement_factory  attribute of the  Quantizer  class.  Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.  Replace the existing module with the module returned by the function. It is important to note that the  name  of the module  does not  change, as that could break the  forward  function of the parent module.   Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different  mapping  will likely be defined. \nEach sub-class of  Quantizer  should populate the  replacement_factory  dictionary attribute with the appropriate mapping. \nTo execute the model transformation, call the  prepare_model  function of the  Quantizer  instance.",
+            "location": "/design/index.html#model-transformation", 
+            "text": "The high-level flow is as follows:   Define a  mapping  between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the  replacement_factory  attribute of the  Quantizer  class.  Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.  Replace the existing module with the module returned by the function. It is important to note that the  name  of the module  does not  change, as that could break the  forward  function of the parent module.   Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different  mapping  will likely be defined. \nEach sub-class of  Quantizer  should populate the  replacement_factory  dictionary attribute with the appropriate mapping. \nTo execute the model transformation, call the  prepare_model  function of the  Quantizer  instance.", 
             "title": "Model Transformation"
-        },
+        }, 
         {
-            "location": "/design/index.html#flexible-bit-widths",
-            "text": "Each instance of  Quantizer  is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the  bits_activations  and  bits_weights  parameters in  Quantizer 's constructor. Sub-classes may define bit-widths for other tensor types as needed.  We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.  So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the  bits_overrides  parameter in the constructor.  The  bits_overrides  mapping is required to be an instance of  collections.OrderedDict  (as opposed to just a simple Python  dict ). This is done in order to enable handling of overlapping name patterns. \n     So, for example, one could define certain override parameters for a group of layers, e.g. 'conv*', but also define different parameters for specific layers in that group, e.g. 'conv1'. \n     The patterns are evaluated eagerly - the first match wins. Therefore, the more specific patterns must come before the broad patterns.",
+            "location": "/design/index.html#flexible-bit-widths", 
+            "text": "Each instance of  Quantizer  is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the  bits_activations  and  bits_weights  parameters in  Quantizer 's constructor. Sub-classes may define bit-widths for other tensor types as needed.  We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.  So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the  bits_overrides  parameter in the constructor.  The  bits_overrides  mapping is required to be an instance of  collections.OrderedDict  (as opposed to just a simple Python  dict ). This is done in order to enable handling of overlapping name patterns. \n     So, for example, one could define certain override parameters for a group of layers, e.g. 'conv*', but also define different parameters for specific layers in that group, e.g. 'conv1'. \n     The patterns are evaluated eagerly - the first match wins. Therefore, the more specific patterns must come before the broad patterns.", 
             "title": "Flexible Bit-Widths"
-        },
+        }, 
         {
-            "location": "/design/index.html#weights-quantization",
-            "text": "The  Quantizer  class also provides an API to quantize the weights of all layers at once. To use it, the  param_quantization_fn  attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the  Quantizer  class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the  quantize_params  function can be called, which will iterate over all parameters and quantize them using  params_quantization_fn .",
+            "location": "/design/index.html#weights-quantization", 
+            "text": "The  Quantizer  class also provides an API to quantize the weights of all layers at once. To use it, the  param_quantization_fn  attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the  Quantizer  class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the  quantize_params  function can be called, which will iterate over all parameters and quantize them using  params_quantization_fn .", 
             "title": "Weights Quantization"
-        },
+        }, 
         {
-            "location": "/design/index.html#quantization-aware-training",
-            "text": "The  Quantizer  class supports quantization-aware training, that is - training with quantization in the loop. This requires handling of a couple of flows / scenarios:    Maintaining a full precision copy of the weights, as described  here . This is enabled by setting  train_with_fp_copy=True  in the  Quantizer  constructor. At model transformation, in each module that has parameters that should be quantized, a new  torch.nn.Parameter  is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module  is not  created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":    The existing  torch.nn.Parameter , e.g.  weights , is replaced by a  torch.nn.Parameter  named  float_weight .  To maintain the existing functionality of the module, we then register a  buffer  in the module with the original name -  weights .  During training,  float_weight  will be passed to  param_quantization_fn  and the result will be stored in  weight .     In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the  PACT  method, acitvations are clipped to a value  \\alpha , which is a learned parameter per-layer    To support these two cases, the  Quantizer  class also accepts an instance of a  torch.optim.Optimizer  (normally this would be one an instance of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters.      Optimizing New Parameters  In cases where new parameters are required by the scheme, it is likely that they'll need to be optimized separately from the main model parameters. In that case, the sub-class for the speicifc method should override  Quantizer._get_updated_optimizer_params_groups() , and return the proper groups plus any desired hyper-parameter overrides.",
+            "location": "/design/index.html#quantization-aware-training", 
+            "text": "The  Quantizer  class supports quantization-aware training, that is - training with quantization in the loop. This requires handling of a couple of flows / scenarios:    Maintaining a full precision copy of the weights, as described  here . This is enabled by setting  train_with_fp_copy=True  in the  Quantizer  constructor. At model transformation, in each module that has parameters that should be quantized, a new  torch.nn.Parameter  is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module  is not  created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":    The existing  torch.nn.Parameter , e.g.  weights , is replaced by a  torch.nn.Parameter  named  float_weight .  To maintain the existing functionality of the module, we then register a  buffer  in the module with the original name -  weights .  During training,  float_weight  will be passed to  param_quantization_fn  and the result will be stored in  weight .     In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the  PACT  method, acitvations are clipped to a value  \\alpha , which is a learned parameter per-layer    To support these two cases, the  Quantizer  class also accepts an instance of a  torch.optim.Optimizer  (normally this would be one an instance of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters.      Optimizing New Parameters  In cases where new parameters are required by the scheme, it is likely that they'll need to be optimized separately from the main model parameters. In that case, the sub-class for the speicifc method should override  Quantizer._get_updated_optimizer_params_groups() , and return the proper groups plus any desired hyper-parameter overrides.", 
             "title": "Quantization-Aware Training"
-        },
+        }, 
         {
-            "location": "/design/index.html#examples",
-            "text": "The base  Quantizer  class is implemented in  distiller/quantization/quantizer.py . \nFor a simple sub-class implementing symmetric linear quantization, see  SymmetricLinearQuantizer  in  distiller/quantization/range_linear.py . \nIn  distiller/quantization/clipped_linear.py  there are examples of lower-precision methods which use training with quantization. Specifically, see  PACTQuantizer  for an example of overriding  Quantizer._get_updated_optimizer_params_groups() .",
+            "location": "/design/index.html#examples", 
+            "text": "The base  Quantizer  class is implemented in  distiller/quantization/quantizer.py . \nFor a simple sub-class implementing symmetric linear quantization, see  SymmetricLinearQuantizer  in  distiller/quantization/range_linear.py . \nIn  distiller/quantization/clipped_linear.py  there are examples of lower-precision methods which use training with quantization. Specifically, see  PACTQuantizer  for an example of overriding  Quantizer._get_updated_optimizer_params_groups() .", 
             "title": "Examples"
-        },
+        }, 
         {
-            "location": "/tutorial-struct_pruning/index.html",
-            "text": "Pruning Filters & Channels\n\n\nIntroduction\n\n\nChannel and filter pruning are examples of structured-pruning which create compressed models that do not require special hardware to execute.  This latter fact makes this form of structured pruning particularly interesting and popular.\nIn networks that have serial data dependencies, it is pretty straight-forward to understand and define how to prune channels and filters.  However, in more complex models,  with parallel-data dependencies (paths) - such as ResNets (skip connections) and GoogLeNet (Inception layers) \u2013 things become increasingly more complex and require a deeper understanding of the data flow in the model, in order to define the pruning schedule.\n\nThis post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures.  The details of the implementation are left for a separate post.\n\n\nBefore we dive into pruning, let\u2019s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature.  This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller.  I\u2019ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I\u2019ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows.\nPyTorch describes \ntorch.nn.Conv2d\n as applying \u201ca 2D convolution over an input signal composed of several input planes.\u201d  We call each of these input planes a \nfeature-map\n (or FM, for short).  Another name is \ninput channel\n, as in the R/G/B channels of an image.  Some people refer to feature-maps as \nactivations\n (i.e. the activation of neurons), although I think strictly speaking \nactivations\n are the output of an activation layer that was fed a group of feature-maps.  Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use \nactivations\n to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps).\n\n\nIn the PyTorch documentation Convolution outputs have shape (N, C\nout\n, H\nout\n, W\nout\n) where N is a batch size, C\nout\n denotes a number of output channels, H\nout\n is a height of output planes in pixels, and W\nout\n is width in pixels.  We won\u2019t be paying much attention to the batch-size since it\u2019s not important to our discussion, so without loss of generality we can set N=1.  I\u2019m also assuming the most common Convolutions having \ngroups==1\n.\nConvolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity).  A \nkernel\n is a 2D matrix (K, K) that is part of a 3D feature detector.  This feature detector is called a \nfilter\n and it is basically a stack of 2D \nkernels\n.  Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C\nin\n channels in the input, then there are C\nin\n kernels in a filter (C == C\nin\n).  Each filter is convolved with the entire input to create a single output channel (i.e. feature-map).  If there are C\nout\n output channels, then there are C\nout\n filters (F == C\nout\n).\n\n\nFilter Pruning\n\n\nFilter pruning and channel pruning are very similar, and I\u2019ll expand on that similarity later on \u2013 but for now let\u2019s focus on filter pruning.\n\nIn filter pruning we use some criterion to determine which filters are \nimportant\n and which are not.  Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples.  Disregarding how we chose the filters to prune, let\u2019s imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the \u201c*\u201d designates a Convolution operation).\n\n\nSince we have two less filters operating on the input, we must have two less output feature-maps.  So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its \nout_channels\n) and the following Convolution layer (change its \nin_channels\n).  And finally, because the next layer\u2019s input is now smaller (has fewer channels),  we should also shrink the next layer\u2019s weights tensors, by removing the channels corresponding to the filters we pruned.  We say that there is a \ndata-dependency\n between the two Convolution layers.  I didn\u2019t make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input.\nThere are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won\u2019t discuss here, because they are implementation details.\n\n\n\nThe scheduler YAML syntax for this example is pasted below.  We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning).  The Convolution layers are conveniently named \nconv1\n and \nconv2\n in this example.\n\n\npruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    weights: [module.conv1.weight]\n\n\n\n\nNow let\u2019s add a Batch Normalization layer between the two convolutions:\n\n\n\nThe Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift).  Because our Convolution produces less output FMs, and these are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer.  And we also need to physically shrink the Batch Normalization layer\u2019s scale and shift tensors, which are coefficients in the BN input transformation.  Moreover, the scale and shift coefficients that we remove from the tensors, must correspond to the filters (or output feature-maps channels) that we removed from the Convolution weight tensors.  This small nuance will prove to be a large pain, but we\u2019ll get to that in later examples.\nThe presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change.  Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically.\n\n\nLet\u2019s look at another example, with non-serial data-dependencies.  Here, the output of \nconv1\n is the input for \nconv2\n and \nconv3\n.  This is an example of parallel data-dependency, since both \nconv2\n and \nconv3\n depend on \nconv1\n.\n\n\n\nNote that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of \nconv1\n.  The weight channels of \nconv2\n and \nconv3\n are pruned implicitly by Distiller in a process called \u201cThinning\u201d (on which I will expand in a different post).\n\n\nNext, let\u2019s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution.\nIn this example \nconv3\n is dependent on both \nconv1\n and \nconv2\n, and there are two implications to this dependency.  The first, and more obvious implication, is that we need to prune the same number of filters from both \nconv1\n and \nconv2\n.  Since we apply element-wise addition on the outputs of \nconv1\n and \nconv2\n, they must have the same shape - and they can only have the same shape if \nconv1\n and \nconv2\n prune the same number of filters.  The second implication of this triangular data-dependency is that both \nconv1\n and \nconv2\n must prune the \nsame\n filters!  Let\u2019s imagine for a moment, that we ignore this second constraint.  The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of \nconv3\n?  Obviously, we can\u2019t.\n\n\n\nWe must apply the second constraint \u2013 and that means that we now need to be proactive: we need to decide whether to use the prune \nconv1\n and \nconv2\n according to the filter-pruning choices of \nconv1\n or of \nconv2\n.  The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of \nconv1\n.\n\n\n\nThe YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above.  First we need to tell the Filter Pruner that we there is a dependency of type \nLeader\n.  This means that all of the tensors listed in the \nweights\n field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed.  In the example below \nmodule.conv1.weight\n and \nmodule.conv2.weight\n are pruned together according to the pruning choices for \nmodule.conv1.weight\n.\n\n\npruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    group_dependency: Leader\n    weights: [module.conv1.weight, module.conv2.weight]\n\n\n\n\nWhen we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections.  If you don\u2019t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception.  The exception does not explain the specification error and this needs to be improved.\n\n\nChannel Pruning\n\n\nChannel pruning is very similar to Filter pruning with all the details of dependencies reversed.  Look again at example #1, but this time imagine that we\u2019ve changed our schedule to prune the \nchannels\n of \nmodule.conv2.weight\n.\n\n\npruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Channels\n    weights: [module.conv2.weight]\n\n\n\n\nAs the diagram shows, \nconv1\n is now dependent on \nconv2\n and its weights filters will be implicitly pruned according to the channels removed from the weights of \nconv2\n.\n\n\n\nGeek On.",
+            "location": "/tutorial-struct_pruning/index.html", 
+            "text": "Pruning Filters \n Channels\n\n\nIntroduction\n\n\nChannel and filter pruning are examples of structured-pruning which create compressed models that do not require special hardware to execute.  This latter fact makes this form of structured pruning particularly interesting and popular.\nIn networks that have serial data dependencies, it is pretty straight-forward to understand and define how to prune channels and filters.  However, in more complex models,  with parallel-data dependencies (paths) - such as ResNets (skip connections) and GoogLeNet (Inception layers) \u2013 things become increasingly more complex and require a deeper understanding of the data flow in the model, in order to define the pruning schedule.\n\nThis post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures.  The details of the implementation are left for a separate post.\n\n\nBefore we dive into pruning, let\u2019s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature.  This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller.  I\u2019ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I\u2019ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows.\nPyTorch describes \ntorch.nn.Conv2d\n as applying \u201ca 2D convolution over an input signal composed of several input planes.\u201d  We call each of these input planes a \nfeature-map\n (or FM, for short).  Another name is \ninput channel\n, as in the R/G/B channels of an image.  Some people refer to feature-maps as \nactivations\n (i.e. the activation of neurons), although I think strictly speaking \nactivations\n are the output of an activation layer that was fed a group of feature-maps.  Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use \nactivations\n to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps).\n\n\nIn the PyTorch documentation Convolution outputs have shape (N, C\nout\n, H\nout\n, W\nout\n) where N is a batch size, C\nout\n denotes a number of output channels, H\nout\n is a height of output planes in pixels, and W\nout\n is width in pixels.  We won\u2019t be paying much attention to the batch-size since it\u2019s not important to our discussion, so without loss of generality we can set N=1.  I\u2019m also assuming the most common Convolutions having \ngroups==1\n.\nConvolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity).  A \nkernel\n is a 2D matrix (K, K) that is part of a 3D feature detector.  This feature detector is called a \nfilter\n and it is basically a stack of 2D \nkernels\n.  Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C\nin\n channels in the input, then there are C\nin\n kernels in a filter (C == C\nin\n).  Each filter is convolved with the entire input to create a single output channel (i.e. feature-map).  If there are C\nout\n output channels, then there are C\nout\n filters (F == C\nout\n).\n\n\nFilter Pruning\n\n\nFilter pruning and channel pruning are very similar, and I\u2019ll expand on that similarity later on \u2013 but for now let\u2019s focus on filter pruning.\n\nIn filter pruning we use some criterion to determine which filters are \nimportant\n and which are not.  Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples.  Disregarding how we chose the filters to prune, let\u2019s imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the \u201c*\u201d designates a Convolution operation).\n\n\nSince we have two less filters operating on the input, we must have two less output feature-maps.  So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its \nout_channels\n) and the following Convolution layer (change its \nin_channels\n).  And finally, because the next layer\u2019s input is now smaller (has fewer channels),  we should also shrink the next layer\u2019s weights tensors, by removing the channels corresponding to the filters we pruned.  We say that there is a \ndata-dependency\n between the two Convolution layers.  I didn\u2019t make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input.\nThere are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won\u2019t discuss here, because they are implementation details.\n\n\n\nThe scheduler YAML syntax for this example is pasted below.  We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning).  The Convolution layers are conveniently named \nconv1\n and \nconv2\n in this example.\n\n\npruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    weights: [module.conv1.weight]\n\n\n\n\nNow let\u2019s add a Batch Normalization layer between the two convolutions:\n\n\n\nThe Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift).  Because our Convolution produces less output FMs, and these are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer.  And we also need to physically shrink the Batch Normalization layer\u2019s scale and shift tensors, which are coefficients in the BN input transformation.  Moreover, the scale and shift coefficients that we remove from the tensors, must correspond to the filters (or output feature-maps channels) that we removed from the Convolution weight tensors.  This small nuance will prove to be a large pain, but we\u2019ll get to that in later examples.\nThe presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change.  Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically.\n\n\nLet\u2019s look at another example, with non-serial data-dependencies.  Here, the output of \nconv1\n is the input for \nconv2\n and \nconv3\n.  This is an example of parallel data-dependency, since both \nconv2\n and \nconv3\n depend on \nconv1\n.\n\n\n\nNote that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of \nconv1\n.  The weight channels of \nconv2\n and \nconv3\n are pruned implicitly by Distiller in a process called \u201cThinning\u201d (on which I will expand in a different post).\n\n\nNext, let\u2019s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution.\nIn this example \nconv3\n is dependent on both \nconv1\n and \nconv2\n, and there are two implications to this dependency.  The first, and more obvious implication, is that we need to prune the same number of filters from both \nconv1\n and \nconv2\n.  Since we apply element-wise addition on the outputs of \nconv1\n and \nconv2\n, they must have the same shape - and they can only have the same shape if \nconv1\n and \nconv2\n prune the same number of filters.  The second implication of this triangular data-dependency is that both \nconv1\n and \nconv2\n must prune the \nsame\n filters!  Let\u2019s imagine for a moment, that we ignore this second constraint.  The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of \nconv3\n?  Obviously, we can\u2019t.\n\n\n\nWe must apply the second constraint \u2013 and that means that we now need to be proactive: we need to decide whether to use the prune \nconv1\n and \nconv2\n according to the filter-pruning choices of \nconv1\n or of \nconv2\n.  The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of \nconv1\n.\n\n\n\nThe YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above.  First we need to tell the Filter Pruner that we there is a dependency of type \nLeader\n.  This means that all of the tensors listed in the \nweights\n field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed.  In the example below \nmodule.conv1.weight\n and \nmodule.conv2.weight\n are pruned together according to the pruning choices for \nmodule.conv1.weight\n.\n\n\npruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    group_dependency: Leader\n    weights: [module.conv1.weight, module.conv2.weight]\n\n\n\n\nWhen we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections.  If you don\u2019t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception.  The exception does not explain the specification error and this needs to be improved.\n\n\nChannel Pruning\n\n\nChannel pruning is very similar to Filter pruning with all the details of dependencies reversed.  Look again at example #1, but this time imagine that we\u2019ve changed our schedule to prune the \nchannels\n of \nmodule.conv2.weight\n.\n\n\npruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Channels\n    weights: [module.conv2.weight]\n\n\n\n\nAs the diagram shows, \nconv1\n is now dependent on \nconv2\n and its weights filters will be implicitly pruned according to the channels removed from the weights of \nconv2\n.\n\n\n\nGeek On.", 
             "title": "Pruning Filters and Channels"
-        },
+        }, 
         {
-            "location": "/tutorial-struct_pruning/index.html#pruning-filters-channels",
-            "text": "",
+            "location": "/tutorial-struct_pruning/index.html#pruning-filters-channels", 
+            "text": "", 
             "title": "Pruning Filters &amp; Channels"
-        },
+        }, 
         {
-            "location": "/tutorial-struct_pruning/index.html#introduction",
-            "text": "Channel and filter pruning are examples of structured-pruning which create compressed models that do not require special hardware to execute.  This latter fact makes this form of structured pruning particularly interesting and popular.\nIn networks that have serial data dependencies, it is pretty straight-forward to understand and define how to prune channels and filters.  However, in more complex models,  with parallel-data dependencies (paths) - such as ResNets (skip connections) and GoogLeNet (Inception layers) \u2013 things become increasingly more complex and require a deeper understanding of the data flow in the model, in order to define the pruning schedule. \nThis post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures.  The details of the implementation are left for a separate post.  Before we dive into pruning, let\u2019s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature.  This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller.  I\u2019ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I\u2019ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows.\nPyTorch describes  torch.nn.Conv2d  as applying \u201ca 2D convolution over an input signal composed of several input planes.\u201d  We call each of these input planes a  feature-map  (or FM, for short).  Another name is  input channel , as in the R/G/B channels of an image.  Some people refer to feature-maps as  activations  (i.e. the activation of neurons), although I think strictly speaking  activations  are the output of an activation layer that was fed a group of feature-maps.  Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use  activations  to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps).  In the PyTorch documentation Convolution outputs have shape (N, C out , H out , W out ) where N is a batch size, C out  denotes a number of output channels, H out  is a height of output planes in pixels, and W out  is width in pixels.  We won\u2019t be paying much attention to the batch-size since it\u2019s not important to our discussion, so without loss of generality we can set N=1.  I\u2019m also assuming the most common Convolutions having  groups==1 .\nConvolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity).  A  kernel  is a 2D matrix (K, K) that is part of a 3D feature detector.  This feature detector is called a  filter  and it is basically a stack of 2D  kernels .  Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C in  channels in the input, then there are C in  kernels in a filter (C == C in ).  Each filter is convolved with the entire input to create a single output channel (i.e. feature-map).  If there are C out  output channels, then there are C out  filters (F == C out ).",
+            "location": "/tutorial-struct_pruning/index.html#introduction", 
+            "text": "Channel and filter pruning are examples of structured-pruning which create compressed models that do not require special hardware to execute.  This latter fact makes this form of structured pruning particularly interesting and popular.\nIn networks that have serial data dependencies, it is pretty straight-forward to understand and define how to prune channels and filters.  However, in more complex models,  with parallel-data dependencies (paths) - such as ResNets (skip connections) and GoogLeNet (Inception layers) \u2013 things become increasingly more complex and require a deeper understanding of the data flow in the model, in order to define the pruning schedule. \nThis post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures.  The details of the implementation are left for a separate post.  Before we dive into pruning, let\u2019s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature.  This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller.  I\u2019ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I\u2019ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows.\nPyTorch describes  torch.nn.Conv2d  as applying \u201ca 2D convolution over an input signal composed of several input planes.\u201d  We call each of these input planes a  feature-map  (or FM, for short).  Another name is  input channel , as in the R/G/B channels of an image.  Some people refer to feature-maps as  activations  (i.e. the activation of neurons), although I think strictly speaking  activations  are the output of an activation layer that was fed a group of feature-maps.  Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use  activations  to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps).  In the PyTorch documentation Convolution outputs have shape (N, C out , H out , W out ) where N is a batch size, C out  denotes a number of output channels, H out  is a height of output planes in pixels, and W out  is width in pixels.  We won\u2019t be paying much attention to the batch-size since it\u2019s not important to our discussion, so without loss of generality we can set N=1.  I\u2019m also assuming the most common Convolutions having  groups==1 .\nConvolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity).  A  kernel  is a 2D matrix (K, K) that is part of a 3D feature detector.  This feature detector is called a  filter  and it is basically a stack of 2D  kernels .  Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C in  channels in the input, then there are C in  kernels in a filter (C == C in ).  Each filter is convolved with the entire input to create a single output channel (i.e. feature-map).  If there are C out  output channels, then there are C out  filters (F == C out ).", 
             "title": "Introduction"
-        },
+        }, 
         {
-            "location": "/tutorial-struct_pruning/index.html#filter-pruning",
-            "text": "Filter pruning and channel pruning are very similar, and I\u2019ll expand on that similarity later on \u2013 but for now let\u2019s focus on filter pruning. \nIn filter pruning we use some criterion to determine which filters are  important  and which are not.  Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples.  Disregarding how we chose the filters to prune, let\u2019s imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the \u201c*\u201d designates a Convolution operation).  Since we have two less filters operating on the input, we must have two less output feature-maps.  So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its  out_channels ) and the following Convolution layer (change its  in_channels ).  And finally, because the next layer\u2019s input is now smaller (has fewer channels),  we should also shrink the next layer\u2019s weights tensors, by removing the channels corresponding to the filters we pruned.  We say that there is a  data-dependency  between the two Convolution layers.  I didn\u2019t make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input.\nThere are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won\u2019t discuss here, because they are implementation details.  The scheduler YAML syntax for this example is pasted below.  We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning).  The Convolution layers are conveniently named  conv1  and  conv2  in this example.  pruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    weights: [module.conv1.weight]  Now let\u2019s add a Batch Normalization layer between the two convolutions:  The Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift).  Because our Convolution produces less output FMs, and these are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer.  And we also need to physically shrink the Batch Normalization layer\u2019s scale and shift tensors, which are coefficients in the BN input transformation.  Moreover, the scale and shift coefficients that we remove from the tensors, must correspond to the filters (or output feature-maps channels) that we removed from the Convolution weight tensors.  This small nuance will prove to be a large pain, but we\u2019ll get to that in later examples.\nThe presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change.  Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically.  Let\u2019s look at another example, with non-serial data-dependencies.  Here, the output of  conv1  is the input for  conv2  and  conv3 .  This is an example of parallel data-dependency, since both  conv2  and  conv3  depend on  conv1 .  Note that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of  conv1 .  The weight channels of  conv2  and  conv3  are pruned implicitly by Distiller in a process called \u201cThinning\u201d (on which I will expand in a different post).  Next, let\u2019s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution.\nIn this example  conv3  is dependent on both  conv1  and  conv2 , and there are two implications to this dependency.  The first, and more obvious implication, is that we need to prune the same number of filters from both  conv1  and  conv2 .  Since we apply element-wise addition on the outputs of  conv1  and  conv2 , they must have the same shape - and they can only have the same shape if  conv1  and  conv2  prune the same number of filters.  The second implication of this triangular data-dependency is that both  conv1  and  conv2  must prune the  same  filters!  Let\u2019s imagine for a moment, that we ignore this second constraint.  The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of  conv3 ?  Obviously, we can\u2019t.  We must apply the second constraint \u2013 and that means that we now need to be proactive: we need to decide whether to use the prune  conv1  and  conv2  according to the filter-pruning choices of  conv1  or of  conv2 .  The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of  conv1 .  The YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above.  First we need to tell the Filter Pruner that we there is a dependency of type  Leader .  This means that all of the tensors listed in the  weights  field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed.  In the example below  module.conv1.weight  and  module.conv2.weight  are pruned together according to the pruning choices for  module.conv1.weight .  pruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    group_dependency: Leader\n    weights: [module.conv1.weight, module.conv2.weight]  When we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections.  If you don\u2019t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception.  The exception does not explain the specification error and this needs to be improved.",
+            "location": "/tutorial-struct_pruning/index.html#filter-pruning", 
+            "text": "Filter pruning and channel pruning are very similar, and I\u2019ll expand on that similarity later on \u2013 but for now let\u2019s focus on filter pruning. \nIn filter pruning we use some criterion to determine which filters are  important  and which are not.  Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples.  Disregarding how we chose the filters to prune, let\u2019s imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the \u201c*\u201d designates a Convolution operation).  Since we have two less filters operating on the input, we must have two less output feature-maps.  So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its  out_channels ) and the following Convolution layer (change its  in_channels ).  And finally, because the next layer\u2019s input is now smaller (has fewer channels),  we should also shrink the next layer\u2019s weights tensors, by removing the channels corresponding to the filters we pruned.  We say that there is a  data-dependency  between the two Convolution layers.  I didn\u2019t make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input.\nThere are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won\u2019t discuss here, because they are implementation details.  The scheduler YAML syntax for this example is pasted below.  We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning).  The Convolution layers are conveniently named  conv1  and  conv2  in this example.  pruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    weights: [module.conv1.weight]  Now let\u2019s add a Batch Normalization layer between the two convolutions:  The Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift).  Because our Convolution produces less output FMs, and these are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer.  And we also need to physically shrink the Batch Normalization layer\u2019s scale and shift tensors, which are coefficients in the BN input transformation.  Moreover, the scale and shift coefficients that we remove from the tensors, must correspond to the filters (or output feature-maps channels) that we removed from the Convolution weight tensors.  This small nuance will prove to be a large pain, but we\u2019ll get to that in later examples.\nThe presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change.  Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically.  Let\u2019s look at another example, with non-serial data-dependencies.  Here, the output of  conv1  is the input for  conv2  and  conv3 .  This is an example of parallel data-dependency, since both  conv2  and  conv3  depend on  conv1 .  Note that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of  conv1 .  The weight channels of  conv2  and  conv3  are pruned implicitly by Distiller in a process called \u201cThinning\u201d (on which I will expand in a different post).  Next, let\u2019s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution.\nIn this example  conv3  is dependent on both  conv1  and  conv2 , and there are two implications to this dependency.  The first, and more obvious implication, is that we need to prune the same number of filters from both  conv1  and  conv2 .  Since we apply element-wise addition on the outputs of  conv1  and  conv2 , they must have the same shape - and they can only have the same shape if  conv1  and  conv2  prune the same number of filters.  The second implication of this triangular data-dependency is that both  conv1  and  conv2  must prune the  same  filters!  Let\u2019s imagine for a moment, that we ignore this second constraint.  The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of  conv3 ?  Obviously, we can\u2019t.  We must apply the second constraint \u2013 and that means that we now need to be proactive: we need to decide whether to use the prune  conv1  and  conv2  according to the filter-pruning choices of  conv1  or of  conv2 .  The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of  conv1 .  The YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above.  First we need to tell the Filter Pruner that we there is a dependency of type  Leader .  This means that all of the tensors listed in the  weights  field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed.  In the example below  module.conv1.weight  and  module.conv2.weight  are pruned together according to the pruning choices for  module.conv1.weight .  pruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Filters\n    group_dependency: Leader\n    weights: [module.conv1.weight, module.conv2.weight]  When we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections.  If you don\u2019t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception.  The exception does not explain the specification error and this needs to be improved.", 
             "title": "Filter Pruning"
-        },
+        }, 
         {
-            "location": "/tutorial-struct_pruning/index.html#channel-pruning",
-            "text": "Channel pruning is very similar to Filter pruning with all the details of dependencies reversed.  Look again at example #1, but this time imagine that we\u2019ve changed our schedule to prune the  channels  of  module.conv2.weight .  pruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Channels\n    weights: [module.conv2.weight]  As the diagram shows,  conv1  is now dependent on  conv2  and its weights filters will be implicitly pruned according to the channels removed from the weights of  conv2 .  Geek On.",
+            "location": "/tutorial-struct_pruning/index.html#channel-pruning", 
+            "text": "Channel pruning is very similar to Filter pruning with all the details of dependencies reversed.  Look again at example #1, but this time imagine that we\u2019ve changed our schedule to prune the  channels  of  module.conv2.weight .  pruners:\n  example_pruner:\n    class: L1RankedStructureParameterPruner_AGP\n    initial_sparsity : 0.10\n    final_sparsity: 0.50\n    group_type: Channels\n    weights: [module.conv2.weight]  As the diagram shows,  conv1  is now dependent on  conv2  and its weights filters will be implicitly pruned according to the channels removed from the weights of  conv2 .  Geek On.", 
             "title": "Channel Pruning"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html",
-            "text": "Using Distiller to prune a PyTorch language model\n\n\nContents\n\n\n\n\nIntroduction\n\n\nSetup\n\n\nPreparing the code\n\n\nTraining-loop\n\n\nCreating compression baselines\n\n\nCompressing the language model\n\n\nWhat are we compressing?\n\n\nHow are we compressing?\n\n\nWhen are we compressing?\n\n\nUntil next time\n\n\n\n\nIntroduction\n\n\nIn this tutorial I'll show you how to compress a word-level language model using \nDistiller\n.  Specifically, we use PyTorch\u2019s \nword-level language model sample code\n as the code-base of our example, weave in some Distiller code, and show how we compress the model using two different element-wise pruning algorithms.  To make things manageable, I've divided the tutorial to two parts: in the first we will setup the sample application and prune using \nAGP\n.  In the second part I'll show how I've added Baidu's RNN pruning algorithm and then use it to prune the same word-level language model.  The completed code is available \nhere\n.\n\n\nThe results are displayed below and the code is available \nhere\n.\nNote that we can improve the results by training longer, since the loss curves are usually still decreasing at the end of epoch 40.  However, for demonstration purposes we don\u2019t need to do this.\n\n\n\n\n\n\n\n\nType\n\n\nSparsity\n\n\nNNZ\n\n\nValidation\n\n\nTest\n\n\nCommand line\n\n\n\n\n\n\n\n\n\n\nSmall\n\n\n0%\n\n\n7,135,600\n\n\n101.13\n\n\n96.29\n\n\ntime python3 main.py --cuda --epochs 40 --tied --wd=1e-6\n\n\n\n\n\n\nMedium\n\n\n0%\n\n\n28,390,700\n\n\n88.17\n\n\n84.21\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied,--wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n0%\n\n\n85,917,000\n\n\n87.49\n\n\n83.85\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n70%\n\n\n25,487,550\n\n\n90.67\n\n\n85.96\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml\n\n\n\n\n\n\nLarge\n\n\n70%\n\n\n25,487,550\n\n\n90.59\n\n\n85.84\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n70%\n\n\n25,487,550\n\n\n87.40\n\n\n82.93\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70B.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n80.4%\n\n\n16,847,550\n\n\n89.31\n\n\n83.64\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_80.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n90%\n\n\n8,591,700\n\n\n90.70\n\n\n85.67\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_90.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n95%\n\n\n4,295,850\n\n\n98.42\n\n\n92.79\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_95.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\n\n\nTable 1: AGP language model pruning results. \nNNZ stands for number of non-zero coefficients (embeddings are counted once, because they are tied).\n\n\n\n\n\n\n  \nFigure 1: Perplexity vs model size (lower perplexity is better).\n\n\n\n\nThe model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding.  The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings.  We used the WikiText2 dataset (twice as large as PTB).\n\n\nWe compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M; 136M) \u2013 reported as (#parameters net/tied; #parameters gross).\nThe results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow \u201ctrue\u201d pseudo-randomness.  We limited our tests to 40 epochs, even though validation perplexity was still trending down.\n\n\nEssentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:\n\n \u201cWe see that sparse models are able to outperform dense models which have significantly more parameters.\u201d\n\n The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters.  It also outperform the dense large model, which exemplifies how pruning can act as a regularizer.\n* \u201cOur results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.\u201d\n\n\nSetup\n\n\nWe start by cloning Pytorch\u2019s example \nrepository\n. I\u2019ve copied the language model code to distiller\u2019s examples/word_language_model directory, so I\u2019ll use that for the rest of the tutorial.\nNext, let\u2019s create and activate a virtual environment, as explained in Distiller's \nREADME\n file.\nNow we can turn our attention to \nmain.py\n, which contains the training application.\n\n\nPreparing the code\n\n\nWe begin by adding code to invoke Distiller in file \nmain.py\n.  This involves a bit of mechanics, because we did not \npip install\n Distiller in our environment (we don\u2019t have a \nsetup.py\n script for Distiller as of yet).  To make Distiller library functions accessible from \nmain.py\n, we modify \nsys.path\n to include the distiller root directory by taking the current directory and pointing two directories up.  This is very specific to the location of this example code, and it will break if you\u2019ve placed the code elsewhere \u2013 so be aware.\n\n\nimport os\nimport sys\nscript_dir = os.path.dirname(__file__)\nmodule_path = os.path.abspath(os.path.join(script_dir, '..', '..'))\nif module_path not in sys.path:\n    sys.path.append(module_path)\nimport distiller\nimport apputils\nfrom distiller.data_loggers import TensorBoardLogger, PythonLogger\n\n\n\n\nNext, we augment the application arguments with two Distiller-specific arguments.  The first, \n--summary\n, gives us the ability to do simple compression instrumentation (e.g. log sparsity statistics).  The second argument, \n--compress\n, is how we tell the application where the compression scheduling file is located.\nWe also add two arguments - momentum and weight-decay - for the SGD optimizer.  As I explain later, I replaced the original code's optimizer with SGD, so we need these extra arguments.\n\n\n# Distiller-related arguments\nSUMMARY_CHOICES = ['sparsity', 'model', 'modules', 'png', 'percentile']\nparser.add_argument('--summary', type=str, choices=SUMMARY_CHOICES,\n                    help='print a summary of the model, and exit - options: ' +\n                    ' | '.join(SUMMARY_CHOICES))\nparser.add_argument('--compress', dest='compress', type=str, nargs='?', action='store',\n                    help='configuration file for pruning the model (default is to use hard-coded schedule)')\nparser.add_argument('--momentum', default=0., type=float, metavar='M',\n                    help='momentum')\nparser.add_argument('--weight-decay', '--wd', default=0., type=float,\n                    metavar='W', help='weight decay (default: 1e-4)')\n\n\n\n\nWe add code to handle the \n--summary\n application argument.  It can be as simple as forwarding to \ndistiller.model_summary\n or more complex, as in the Distiller sample.\n\n\nif args.summary:\n    distiller.model_summary(model, None, args.summary, 'wikitext2')\n    exit(0)\n\n\n\n\nSimilarly, we add code to handle the \n--compress\n argument, which creates a CompressionScheduler and configures it from a YAML schedule file:\n\n\nif args.compress:\n    source = args.compress\n    compression_scheduler = distiller.CompressionScheduler(model)\n    distiller.config.fileConfig(model, None, compression_scheduler, args.compress, msglogger)\n\n\n\n\nWe also create the optimizer, and the learning-rate decay policy scheduler.  The original PyTorch example manually manages the optimization and LR decay process, but I think that having a standard optimizer and LR-decay schedule gives us the flexibility to experiment with these during the training process.  Using an \nSGD optimizer\n configured with \nmomentum=0\n and \nweight_decay=0\n, and a \nReduceLROnPlateau LR-decay policy\n with \npatience=0\n and \nfactor=0.5\n will give the same behavior as in the original PyTorch example.  From there, we can experiment with the optimizer and LR-decay configuration.\n\n\noptimizer = torch.optim.SGD(model.parameters(), args.lr,\n                            momentum=args.momentum,\n                            weight_decay=args.weight_decay)\nlr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',\n                                                          patience=0, verbose=True, factor=0.5)\n\n\n\n\nNext, we add code to setup the logging backends: a Python logger backend which reads its configuration from file and logs messages to the console and log file (\npylogger\n); and a TensorBoard backend logger which logs statistics to a TensorBoard data file (\ntflogger\n).  I configured the TensorBoard backend to log gradients because RNNs suffer from vanishing and exploding gradients, so we might want to take a look in case the training experiences a sudden failure.\nThis code is not strictly required, but it is quite useful to be able to log the session progress, and to export logs to TensorBoard for realtime visualization of the training progress.\n\n\n# Distiller loggers\nmsglogger = apputils.config_pylogger('logging.conf', None)\ntflogger = TensorBoardLogger(msglogger.logdir)\ntflogger.log_gradients = True\npylogger = PythonLogger(msglogger)\n\n\n\n\nTraining loop\n\n\nNow we scroll down all the way to the train() function.  We'll change its signature to include the \nepoch\n, \noptimizer\n, and \ncompression_schdule\n.   We'll soon see why we need these.\n\n\ndef train(epoch, optimizer, compression_scheduler=None)\n\n\n\n\nFunction \ntrain()\n is responsible for training the network in batches for one epoch, and in its epoch loop we want to perform compression.   The \nCompressionScheduler\n invokes \nScheduledTrainingPolicy\n instances per the scheduling specification that was programmed in the \nCompressionScheduler\n instance.  There are four main \nSchedulingPolicy\n types: \nPruningPolicy\n, \nRegularizationPolicy\n, \nLRPolicy\n, and \nQuantizationPolicy\n.  We'll be using \nPruningPolicy\n, which is triggered \non_epoch_begin\n (to invoke the \nPruners\n, and \non_minibatch_begin\n (to mask the weights).   Later we will create a YAML scheduling file, and specify the schedule of \nAutomatedGradualPruner\n instances.  \n\n\nBecause we are writing a single application, which can be used with various Policies in the future (e.g. group-lasso regularization), we should add code to invoke all of the \nCompressionScheduler\n's callbacks, not just the mandatory \non_epoch_begin\n callback.    We invoke \non_minibatch_begin\n before running the forward-pass, \nbefore_backward_pass\n after computing the loss, and \non_minibatch_end\n after completing the backward-pass.\n\n\n\ndef train(epoch, optimizer, compression_scheduler=None):\n    ...\n\n    # The line below was fixed as per: https://github.com/pytorch/examples/issues/214\n    for batch, i in enumerate(range(0, train_data.size(0), args.bptt)):\n        data, targets = get_batch(train_data, i)\n        # Starting each batch, we detach the hidden state from how it was previously produced.\n        # If we didn't, the model would try backpropagating all the way to start of the dataset.\n        hidden = repackage_hidden(hidden)\n\n        \nif compression_scheduler:\n            compression_scheduler.on_minibatch_begin(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch)\n\n        output, hidden = model(data, hidden)\n        loss = criterion(output.view(-1, ntokens), targets)\n\n        \nif compression_scheduler:\n            compression_scheduler.before_backward_pass(epoch, minibatch_id=batch,\n                                                       minibatches_per_epoch=steps_per_epoch,\n                                                       loss=loss)\n\n        optimizer.zero_grad()\n        loss.backward()\n\n        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.\n        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)\n        optimizer.step()\n\n        total_loss += loss.item()\n\n        \nif compression_scheduler:\n            compression_scheduler.on_minibatch_end(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch)\n\n\n\n\n\nThe rest of the code could stay as in the original PyTorch sample, but I wanted to use an SGD optimizer, so I replaced:\n\n\nfor p in model.parameters():\n    p.data.add_(-lr, p.grad.data)\n\n\n\n\nwith:\n\n\noptimizer.step()\n\n\n\n\nThe rest of the code in function \ntrain()\n logs to a text file and a \nTensorBoard\n backend.  Again, such code is not mandatory, but a few lines give us a lot of visibility: we have training progress information saved to log, and we can monitor the training progress in realtime on TensorBoard.  That's a lot for a few lines of code ;-)\n\n\n\nif batch % args.log_interval == 0 and batch > 0:\n    cur_loss = total_loss / args.log_interval\n    elapsed = time.time() - start_time\n    lr = optimizer.param_groups[0]['lr']\n    msglogger.info(\n            '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.4f} | ms/batch {:5.2f} '\n            '| loss {:5.2f} | ppl {:8.2f}'.format(\n        epoch, batch, len(train_data) // args.bptt, lr,\n        elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))\n    total_loss = 0\n    start_time = time.time()\n    stats = ('Peformance/Training/',\n        OrderedDict([\n            ('Loss', cur_loss),\n            ('Perplexity', math.exp(cur_loss)),\n            ('LR', lr),\n            ('Batch Time', elapsed * 1000)])\n        )\n    steps_completed = batch + 1\n    distiller.log_training_progress(stats, model.named_parameters(), epoch, steps_completed,\n                                    steps_per_epoch, args.log_interval, [tflogger])\n\n\n\n\nFinally we get to the outer training-loop which loops on \nargs.epochs\n.  We add the two final \nCompressionScheduler\n callbacks: \non_epoch_begin\n, at the start of the loop, and \non_epoch_end\n after running \nevaluate\n on the model and updating the learning-rate.\n\n\n\ntry:\n    for epoch in range(0, args.epochs):\n        epoch_start_time = time.time()\n        \nif compression_scheduler:\n            compression_scheduler.on_epoch_begin(epoch)\n\n\n        train(epoch, optimizer, compression_scheduler)\n        val_loss = evaluate(val_data)\n        lr_scheduler.step(val_loss)\n\n        \nif compression_scheduler:\n            compression_scheduler.on_epoch_end(epoch)\n\n\n\n\n\nAnd that's it!  The language model sample is ready for compression.  \n\n\nCreating compression baselines\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\"\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner (AGP), which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nBefore we start compressing stuff ;-), we need to create baselines so we have something to benchmark against.  Let's prepare small, medium, and large baseline models, like Table 3 of \nTo prune, or Not to Prune\n.  These will provide baseline perplexity results that we'll compare the compressed models against.  \n\nI chose to use tied input/output embeddings, and constrained the training to 40 epochs.  The table below shows the model sizes, where we are interested in the tied version (biases are ignored due to their small size and because we don't prune them).\n\n\n\n\n\n\n\n\nSize\n\n\nNumber of Weights (untied)\n\n\nNumber of Weights (tied)\n\n\n\n\n\n\n\n\n\n\nSmall\n\n\n13,951,200\n\n\n7,295,600\n\n\n\n\n\n\nMedium\n\n\n50,021,400\n\n\n28,390,700\n\n\n\n\n\n\nLarge\n\n\n135,834,000\n\n\n85,917,000\n\n\n\n\n\n\n\n\nI started experimenting with the optimizer setup like in the PyTorch example, but I added some L2 regularization when I noticed that the training was overfitting.  The two right columns show the perplexity results (lower is better) of each of the models with no L2 regularization and with 1e-5 and 1e-6.\nIn all three model sizes using the smaller L2 regularization (1e-6) gave the best results.  BTW, I'm not showing here experiments with even lower regularization because that did not help.\n\n\n\n\n\n\n\n\nType\n\n\nCommand line\n\n\nValidation\n\n\nTest\n\n\n\n\n\n\n\n\n\n\nSmall\n\n\ntime python3 main.py --cuda --epochs 40 --tied\n\n\n105.23\n\n\n99.53\n\n\n\n\n\n\nSmall\n\n\ntime python3 main.py --cuda --epochs 40 --tied --wd=1e-6\n\n\n101.13\n\n\n96.29\n\n\n\n\n\n\nSmall\n\n\ntime python3 main.py --cuda --epochs 40 --tied --wd=1e-5\n\n\n109.49\n\n\n103.53\n\n\n\n\n\n\nMedium\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied\n\n\n90.93\n\n\n86.20\n\n\n\n\n\n\nMedium\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6\n\n\n88.17\n\n\n84.21\n\n\n\n\n\n\nMedium\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-5\n\n\n97.75\n\n\n93.06\n\n\n\n\n\n\nLarge\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied\n\n\n88.23\n\n\n84.21\n\n\n\n\n\n\nLarge\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6\n\n\n87.49\n\n\n83.85\n\n\n\n\n\n\nLarge\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-5\n\n\n99.22\n\n\n94.28\n\n\n\n\n\n\n\n\nCompressing the language model\n\n\nOK, so now let's recreate the results of the language model experiment from section 4.2 of paper.  We're using PyTorch's sample, so the language model we implement is not exactly like the one in the AGP paper (and uses a different dataset), but it's close enough, so if everything goes well, we should see similar compression results.\n\n\nWhat are we compressing?\n\n\nTo gain insight about the model parameters, we can use the command-line to produce a weights-sparsity table:\n\n\n$ python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --summary=sparsity\n\nParameters:\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|         | Name             | Shape         |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0.00000 | encoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 1.00000 | rnn.weight_ih_l0 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 |  0.00001 |    0.01291 |\n| 2.00000 | rnn.weight_hh_l0 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01491 |  0.00000 |    0.01291 |\n| 3.00000 | rnn.weight_ih_l1 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01490 | -0.00000 |    0.01291 |\n| 4.00000 | rnn.weight_hh_l1 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 | -0.00000 |    0.01291 |\n| 5.00000 | decoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 6.00000 | Total sparsity:  | -             |     135834000 |      135833996 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.00000 |  0.00000 |    0.00000 |\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 0.00\n\n\n\n\nSo what's going on here?\n\nencoder.weight\n and \ndecoder.weight\n are the input and output embeddings, respectively.  Remember that in the configuration I chose for the three model sizes these embeddings are tied, which means that we only have one copy of parameters, that is shared between the encoder and decoder.\nWe also have two pairs of RNN (LSTM really) parameters.  There is a pair because the model uses the command-line argument \nargs.nlayers\n to decide how many instances of RNN (or LSTM or GRU) cells to use, and it defaults to 2.  The recurrent cells are LSTM cells, because this is the default of \nargs.model\n, which is used in the initialization of \nRNNModel\n.  Let's look at the parameters of the first RNN: \nrnn.weight_ih_l0\n and \nrnn.weight_hh_l0\n: what are these?\n\nRecall the \nLSTM equations\n that PyTorch implements.  In the equations, there are 8 instances of vector-matrix multiplication (when batch=1).  These can be combined into a single matrix-matrix multiplication (GEMM), but PyTorch groups these into two GEMM operations: one GEMM multiplies the inputs (\nrnn.weight_ih_l0\n), and the other multiplies the hidden-state (\nrnn.weight_hh_l0\n).  \n\n\nHow are we compressing?\n\n\nLet's turn to the configurations of the Large language model compression schedule to 70%, 80%, 90% and 95% sparsity. Using AGP it is easy to configure the pruning schedule to produce an exact sparsity of the compressed model.  I'll use the \n70% schedule\n to show a concrete example.\n\n\nThe YAML file has two sections: \npruners\n and \npolicies\n.  Section \npruners\n defines instances of \nParameterPruner\n - in our case we define three instances of \nAutomatedGradualPruner\n: for the weights of the first RNN (\nl0_rnn_pruner\n), the second RNN (\nl1_rnn_pruner\n) and the embedding layer (\nembedding_pruner\n).  These names are arbitrary, and serve are name-handles which bind Policies to Pruners - so you can use whatever names you want.\nEach \nAutomatedGradualPruner\n is configured with an \ninitial_sparsity\n and \nfinal_sparsity\n.  For examples, the \nl0_rnn_pruner\n below is configured to prune 5% of the weights as soon as it starts working, and finish when 70% of the weights have been pruned.  The \nweights\n parameter tells the Pruner which weight tensors to prune.\n\n\npruners:\n  l0_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l0, rnn.weight_hh_l0]\n\n  l1_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l1, rnn.weight_hh_l1]\n\n  embedding_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [encoder.weight]\n\n\n\n\nWhen are we compressing?\n\n\nIf the \npruners\n section defines \"what-to-do\", the \npolicies\n section defines \"when-to-do\".  This part is harder, because we define the pruning schedule, which requires us to try a few different schedules until we understand which schedule works best.\nBelow we define three \nPruningPolicy\n instances.  The first two instances start operating at epoch 2 (\nstarting_epoch\n), end at epoch 20 (\nending_epoch\n), and operate once every epoch (\nfrequency\n; as I explained above, Distiller's Pruning scheduling operates only at \non_epoch_begin\n).  In between pruning operations, the pruned model is fine-tuned.\n\n\npolicies:\n  - pruner:\n      instance_name : l0_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20  \n    frequency: 1\n\n  - pruner:\n      instance_name : l1_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20\n    frequency: 1\n\n  - pruner:\n      instance_name : embedding_pruner\n    starting_epoch: 3\n    ending_epoch: 21\n    frequency: 1\n\n\n\n\nWe invoke the compression as follows:\n\n\n$ time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml\n\n\n\n\nTable 1\n above shows that we can make a negligible improvement when adding L2 regularization.  I did some experimenting with the sparsity distribution between the layers, and the scheduling frequency and noticed that the embedding layers are much less sensitive to pruning than the RNN cells.  I didn't notice any difference between the RNN cells, but I also didn't invest in this exploration.\nA new \n70% sparsity schedule\n, prunes the RNNs only to 50% sparsity, but prunes the embedding to 85% sparsity, and achieves almost a 3 points improvement in the test perplexity results.\n\n\nWe provide \nsimilar pruning schedules\n for the other compression rates.\n\n\nUntil next time\n\n\nThis concludes the first part of the tutorial on pruning a PyTorch language model.\n\nIn the next installment, I'll explain how we added an implementation of Baidu Research's \nExploring Sparsity in Recurrent Neural Networks\n paper, and applied to this language model.\n\n\nGeek On.",
+            "location": "/tutorial-lang_model/index.html", 
+            "text": "Using Distiller to prune a PyTorch language model\n\n\nContents\n\n\n\n\nIntroduction\n\n\nSetup\n\n\nPreparing the code\n\n\nTraining-loop\n\n\nCreating compression baselines\n\n\nCompressing the language model\n\n\nWhat are we compressing?\n\n\nHow are we compressing?\n\n\nWhen are we compressing?\n\n\nUntil next time\n\n\n\n\nIntroduction\n\n\nIn this tutorial I'll show you how to compress a word-level language model using \nDistiller\n.  Specifically, we use PyTorch\u2019s \nword-level language model sample code\n as the code-base of our example, weave in some Distiller code, and show how we compress the model using two different element-wise pruning algorithms.  To make things manageable, I've divided the tutorial to two parts: in the first we will setup the sample application and prune using \nAGP\n.  In the second part I'll show how I've added Baidu's RNN pruning algorithm and then use it to prune the same word-level language model.  The completed code is available \nhere\n.\n\n\nThe results are displayed below and the code is available \nhere\n.\nNote that we can improve the results by training longer, since the loss curves are usually still decreasing at the end of epoch 40.  However, for demonstration purposes we don\u2019t need to do this.\n\n\n\n\n\n\n\n\nType\n\n\nSparsity\n\n\nNNZ\n\n\nValidation\n\n\nTest\n\n\nCommand line\n\n\n\n\n\n\n\n\n\n\nSmall\n\n\n0%\n\n\n7,135,600\n\n\n101.13\n\n\n96.29\n\n\ntime python3 main.py --cuda --epochs 40 --tied --wd=1e-6\n\n\n\n\n\n\nMedium\n\n\n0%\n\n\n28,390,700\n\n\n88.17\n\n\n84.21\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied,--wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n0%\n\n\n85,917,000\n\n\n87.49\n\n\n83.85\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n70%\n\n\n25,487,550\n\n\n90.67\n\n\n85.96\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml\n\n\n\n\n\n\nLarge\n\n\n70%\n\n\n25,487,550\n\n\n90.59\n\n\n85.84\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n70%\n\n\n25,487,550\n\n\n87.40\n\n\n82.93\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70B.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n80.4%\n\n\n16,847,550\n\n\n89.31\n\n\n83.64\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_80.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n90%\n\n\n8,591,700\n\n\n90.70\n\n\n85.67\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_90.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\nLarge\n\n\n95%\n\n\n4,295,850\n\n\n98.42\n\n\n92.79\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_95.schedule_agp.yaml --wd=1e-6\n\n\n\n\n\n\n\n\nTable 1: AGP language model pruning results. \nNNZ stands for number of non-zero coefficients (embeddings are counted once, because they are tied).\n\n\n\n\n\n\n  \nFigure 1: Perplexity vs model size (lower perplexity is better).\n\n\n\n\nThe model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding.  The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings.  We used the WikiText2 dataset (twice as large as PTB).\n\n\nWe compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M; 136M) \u2013 reported as (#parameters net/tied; #parameters gross).\nThe results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow \u201ctrue\u201d pseudo-randomness.  We limited our tests to 40 epochs, even though validation perplexity was still trending down.\n\n\nEssentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:\n\n \u201cWe see that sparse models are able to outperform dense models which have significantly more parameters.\u201d\n\n The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters.  It also outperform the dense large model, which exemplifies how pruning can act as a regularizer.\n* \u201cOur results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.\u201d\n\n\nSetup\n\n\nWe start by cloning Pytorch\u2019s example \nrepository\n. I\u2019ve copied the language model code to distiller\u2019s examples/word_language_model directory, so I\u2019ll use that for the rest of the tutorial.\nNext, let\u2019s create and activate a virtual environment, as explained in Distiller's \nREADME\n file.\nNow we can turn our attention to \nmain.py\n, which contains the training application.\n\n\nPreparing the code\n\n\nWe begin by adding code to invoke Distiller in file \nmain.py\n.  This involves a bit of mechanics, because we did not \npip install\n Distiller in our environment (we don\u2019t have a \nsetup.py\n script for Distiller as of yet).  To make Distiller library functions accessible from \nmain.py\n, we modify \nsys.path\n to include the distiller root directory by taking the current directory and pointing two directories up.  This is very specific to the location of this example code, and it will break if you\u2019ve placed the code elsewhere \u2013 so be aware.\n\n\nimport os\nimport sys\nscript_dir = os.path.dirname(__file__)\nmodule_path = os.path.abspath(os.path.join(script_dir, '..', '..'))\nif module_path not in sys.path:\n    sys.path.append(module_path)\nimport distiller\nimport apputils\nfrom distiller.data_loggers import TensorBoardLogger, PythonLogger\n\n\n\n\nNext, we augment the application arguments with two Distiller-specific arguments.  The first, \n--summary\n, gives us the ability to do simple compression instrumentation (e.g. log sparsity statistics).  The second argument, \n--compress\n, is how we tell the application where the compression scheduling file is located.\nWe also add two arguments - momentum and weight-decay - for the SGD optimizer.  As I explain later, I replaced the original code's optimizer with SGD, so we need these extra arguments.\n\n\n# Distiller-related arguments\nSUMMARY_CHOICES = ['sparsity', 'model', 'modules', 'png', 'percentile']\nparser.add_argument('--summary', type=str, choices=SUMMARY_CHOICES,\n                    help='print a summary of the model, and exit - options: ' +\n                    ' | '.join(SUMMARY_CHOICES))\nparser.add_argument('--compress', dest='compress', type=str, nargs='?', action='store',\n                    help='configuration file for pruning the model (default is to use hard-coded schedule)')\nparser.add_argument('--momentum', default=0., type=float, metavar='M',\n                    help='momentum')\nparser.add_argument('--weight-decay', '--wd', default=0., type=float,\n                    metavar='W', help='weight decay (default: 1e-4)')\n\n\n\n\nWe add code to handle the \n--summary\n application argument.  It can be as simple as forwarding to \ndistiller.model_summary\n or more complex, as in the Distiller sample.\n\n\nif args.summary:\n    distiller.model_summary(model, None, args.summary, 'wikitext2')\n    exit(0)\n\n\n\n\nSimilarly, we add code to handle the \n--compress\n argument, which creates a CompressionScheduler and configures it from a YAML schedule file:\n\n\nif args.compress:\n    source = args.compress\n    compression_scheduler = distiller.CompressionScheduler(model)\n    distiller.config.fileConfig(model, None, compression_scheduler, args.compress, msglogger)\n\n\n\n\nWe also create the optimizer, and the learning-rate decay policy scheduler.  The original PyTorch example manually manages the optimization and LR decay process, but I think that having a standard optimizer and LR-decay schedule gives us the flexibility to experiment with these during the training process.  Using an \nSGD optimizer\n configured with \nmomentum=0\n and \nweight_decay=0\n, and a \nReduceLROnPlateau LR-decay policy\n with \npatience=0\n and \nfactor=0.5\n will give the same behavior as in the original PyTorch example.  From there, we can experiment with the optimizer and LR-decay configuration.\n\n\noptimizer = torch.optim.SGD(model.parameters(), args.lr,\n                            momentum=args.momentum,\n                            weight_decay=args.weight_decay)\nlr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',\n                                                          patience=0, verbose=True, factor=0.5)\n\n\n\n\nNext, we add code to setup the logging backends: a Python logger backend which reads its configuration from file and logs messages to the console and log file (\npylogger\n); and a TensorBoard backend logger which logs statistics to a TensorBoard data file (\ntflogger\n).  I configured the TensorBoard backend to log gradients because RNNs suffer from vanishing and exploding gradients, so we might want to take a look in case the training experiences a sudden failure.\nThis code is not strictly required, but it is quite useful to be able to log the session progress, and to export logs to TensorBoard for realtime visualization of the training progress.\n\n\n# Distiller loggers\nmsglogger = apputils.config_pylogger('logging.conf', None)\ntflogger = TensorBoardLogger(msglogger.logdir)\ntflogger.log_gradients = True\npylogger = PythonLogger(msglogger)\n\n\n\n\nTraining loop\n\n\nNow we scroll down all the way to the train() function.  We'll change its signature to include the \nepoch\n, \noptimizer\n, and \ncompression_schdule\n.   We'll soon see why we need these.\n\n\ndef train(epoch, optimizer, compression_scheduler=None)\n\n\n\n\nFunction \ntrain()\n is responsible for training the network in batches for one epoch, and in its epoch loop we want to perform compression.   The \nCompressionScheduler\n invokes \nScheduledTrainingPolicy\n instances per the scheduling specification that was programmed in the \nCompressionScheduler\n instance.  There are four main \nSchedulingPolicy\n types: \nPruningPolicy\n, \nRegularizationPolicy\n, \nLRPolicy\n, and \nQuantizationPolicy\n.  We'll be using \nPruningPolicy\n, which is triggered \non_epoch_begin\n (to invoke the \nPruners\n, and \non_minibatch_begin\n (to mask the weights).   Later we will create a YAML scheduling file, and specify the schedule of \nAutomatedGradualPruner\n instances.  \n\n\nBecause we are writing a single application, which can be used with various Policies in the future (e.g. group-lasso regularization), we should add code to invoke all of the \nCompressionScheduler\n's callbacks, not just the mandatory \non_epoch_begin\n callback.    We invoke \non_minibatch_begin\n before running the forward-pass, \nbefore_backward_pass\n after computing the loss, and \non_minibatch_end\n after completing the backward-pass.\n\n\n\ndef train(epoch, optimizer, compression_scheduler=None):\n    ...\n\n    # The line below was fixed as per: https://github.com/pytorch/examples/issues/214\n    for batch, i in enumerate(range(0, train_data.size(0), args.bptt)):\n        data, targets = get_batch(train_data, i)\n        # Starting each batch, we detach the hidden state from how it was previously produced.\n        # If we didn't, the model would try backpropagating all the way to start of the dataset.\n        hidden = repackage_hidden(hidden)\n\n        \nif compression_scheduler:\n            compression_scheduler.on_minibatch_begin(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch)\n\n        output, hidden = model(data, hidden)\n        loss = criterion(output.view(-1, ntokens), targets)\n\n        \nif compression_scheduler:\n            compression_scheduler.before_backward_pass(epoch, minibatch_id=batch,\n                                                       minibatches_per_epoch=steps_per_epoch,\n                                                       loss=loss)\n\n        optimizer.zero_grad()\n        loss.backward()\n\n        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.\n        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)\n        optimizer.step()\n\n        total_loss += loss.item()\n\n        \nif compression_scheduler:\n            compression_scheduler.on_minibatch_end(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch)\n\n\n\n\n\nThe rest of the code could stay as in the original PyTorch sample, but I wanted to use an SGD optimizer, so I replaced:\n\n\nfor p in model.parameters():\n    p.data.add_(-lr, p.grad.data)\n\n\n\n\nwith:\n\n\noptimizer.step()\n\n\n\n\nThe rest of the code in function \ntrain()\n logs to a text file and a \nTensorBoard\n backend.  Again, such code is not mandatory, but a few lines give us a lot of visibility: we have training progress information saved to log, and we can monitor the training progress in realtime on TensorBoard.  That's a lot for a few lines of code ;-)\n\n\n\nif batch % args.log_interval == 0 and batch > 0:\n    cur_loss = total_loss / args.log_interval\n    elapsed = time.time() - start_time\n    lr = optimizer.param_groups[0]['lr']\n    msglogger.info(\n            '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.4f} | ms/batch {:5.2f} '\n            '| loss {:5.2f} | ppl {:8.2f}'.format(\n        epoch, batch, len(train_data) // args.bptt, lr,\n        elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))\n    total_loss = 0\n    start_time = time.time()\n    stats = ('Peformance/Training/',\n        OrderedDict([\n            ('Loss', cur_loss),\n            ('Perplexity', math.exp(cur_loss)),\n            ('LR', lr),\n            ('Batch Time', elapsed * 1000)])\n        )\n    steps_completed = batch + 1\n    distiller.log_training_progress(stats, model.named_parameters(), epoch, steps_completed,\n                                    steps_per_epoch, args.log_interval, [tflogger])\n\n\n\n\nFinally we get to the outer training-loop which loops on \nargs.epochs\n.  We add the two final \nCompressionScheduler\n callbacks: \non_epoch_begin\n, at the start of the loop, and \non_epoch_end\n after running \nevaluate\n on the model and updating the learning-rate.\n\n\n\ntry:\n    for epoch in range(0, args.epochs):\n        epoch_start_time = time.time()\n        \nif compression_scheduler:\n            compression_scheduler.on_epoch_begin(epoch)\n\n\n        train(epoch, optimizer, compression_scheduler)\n        val_loss = evaluate(val_data)\n        lr_scheduler.step(val_loss)\n\n        \nif compression_scheduler:\n            compression_scheduler.on_epoch_end(epoch)\n\n\n\n\n\nAnd that's it!  The language model sample is ready for compression.  \n\n\nCreating compression baselines\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\"\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner (AGP), which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nBefore we start compressing stuff ;-), we need to create baselines so we have something to benchmark against.  Let's prepare small, medium, and large baseline models, like Table 3 of \nTo prune, or Not to Prune\n.  These will provide baseline perplexity results that we'll compare the compressed models against.  \n\nI chose to use tied input/output embeddings, and constrained the training to 40 epochs.  The table below shows the model sizes, where we are interested in the tied version (biases are ignored due to their small size and because we don't prune them).\n\n\n\n\n\n\n\n\nSize\n\n\nNumber of Weights (untied)\n\n\nNumber of Weights (tied)\n\n\n\n\n\n\n\n\n\n\nSmall\n\n\n13,951,200\n\n\n7,295,600\n\n\n\n\n\n\nMedium\n\n\n50,021,400\n\n\n28,390,700\n\n\n\n\n\n\nLarge\n\n\n135,834,000\n\n\n85,917,000\n\n\n\n\n\n\n\n\nI started experimenting with the optimizer setup like in the PyTorch example, but I added some L2 regularization when I noticed that the training was overfitting.  The two right columns show the perplexity results (lower is better) of each of the models with no L2 regularization and with 1e-5 and 1e-6.\nIn all three model sizes using the smaller L2 regularization (1e-6) gave the best results.  BTW, I'm not showing here experiments with even lower regularization because that did not help.\n\n\n\n\n\n\n\n\nType\n\n\nCommand line\n\n\nValidation\n\n\nTest\n\n\n\n\n\n\n\n\n\n\nSmall\n\n\ntime python3 main.py --cuda --epochs 40 --tied\n\n\n105.23\n\n\n99.53\n\n\n\n\n\n\nSmall\n\n\ntime python3 main.py --cuda --epochs 40 --tied --wd=1e-6\n\n\n101.13\n\n\n96.29\n\n\n\n\n\n\nSmall\n\n\ntime python3 main.py --cuda --epochs 40 --tied --wd=1e-5\n\n\n109.49\n\n\n103.53\n\n\n\n\n\n\nMedium\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied\n\n\n90.93\n\n\n86.20\n\n\n\n\n\n\nMedium\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6\n\n\n88.17\n\n\n84.21\n\n\n\n\n\n\nMedium\n\n\ntime python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-5\n\n\n97.75\n\n\n93.06\n\n\n\n\n\n\nLarge\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied\n\n\n88.23\n\n\n84.21\n\n\n\n\n\n\nLarge\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6\n\n\n87.49\n\n\n83.85\n\n\n\n\n\n\nLarge\n\n\ntime python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-5\n\n\n99.22\n\n\n94.28\n\n\n\n\n\n\n\n\nCompressing the language model\n\n\nOK, so now let's recreate the results of the language model experiment from section 4.2 of paper.  We're using PyTorch's sample, so the language model we implement is not exactly like the one in the AGP paper (and uses a different dataset), but it's close enough, so if everything goes well, we should see similar compression results.\n\n\nWhat are we compressing?\n\n\nTo gain insight about the model parameters, we can use the command-line to produce a weights-sparsity table:\n\n\n$ python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --summary=sparsity\n\nParameters:\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|         | Name             | Shape         |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0.00000 | encoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 1.00000 | rnn.weight_ih_l0 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 |  0.00001 |    0.01291 |\n| 2.00000 | rnn.weight_hh_l0 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01491 |  0.00000 |    0.01291 |\n| 3.00000 | rnn.weight_ih_l1 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01490 | -0.00000 |    0.01291 |\n| 4.00000 | rnn.weight_hh_l1 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 | -0.00000 |    0.01291 |\n| 5.00000 | decoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 6.00000 | Total sparsity:  | -             |     135834000 |      135833996 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.00000 |  0.00000 |    0.00000 |\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 0.00\n\n\n\n\nSo what's going on here?\n\nencoder.weight\n and \ndecoder.weight\n are the input and output embeddings, respectively.  Remember that in the configuration I chose for the three model sizes these embeddings are tied, which means that we only have one copy of parameters, that is shared between the encoder and decoder.\nWe also have two pairs of RNN (LSTM really) parameters.  There is a pair because the model uses the command-line argument \nargs.nlayers\n to decide how many instances of RNN (or LSTM or GRU) cells to use, and it defaults to 2.  The recurrent cells are LSTM cells, because this is the default of \nargs.model\n, which is used in the initialization of \nRNNModel\n.  Let's look at the parameters of the first RNN: \nrnn.weight_ih_l0\n and \nrnn.weight_hh_l0\n: what are these?\n\nRecall the \nLSTM equations\n that PyTorch implements.  In the equations, there are 8 instances of vector-matrix multiplication (when batch=1).  These can be combined into a single matrix-matrix multiplication (GEMM), but PyTorch groups these into two GEMM operations: one GEMM multiplies the inputs (\nrnn.weight_ih_l0\n), and the other multiplies the hidden-state (\nrnn.weight_hh_l0\n).  \n\n\nHow are we compressing?\n\n\nLet's turn to the configurations of the Large language model compression schedule to 70%, 80%, 90% and 95% sparsity. Using AGP it is easy to configure the pruning schedule to produce an exact sparsity of the compressed model.  I'll use the \n70% schedule\n to show a concrete example.\n\n\nThe YAML file has two sections: \npruners\n and \npolicies\n.  Section \npruners\n defines instances of \nParameterPruner\n - in our case we define three instances of \nAutomatedGradualPruner\n: for the weights of the first RNN (\nl0_rnn_pruner\n), the second RNN (\nl1_rnn_pruner\n) and the embedding layer (\nembedding_pruner\n).  These names are arbitrary, and serve are name-handles which bind Policies to Pruners - so you can use whatever names you want.\nEach \nAutomatedGradualPruner\n is configured with an \ninitial_sparsity\n and \nfinal_sparsity\n.  For examples, the \nl0_rnn_pruner\n below is configured to prune 5% of the weights as soon as it starts working, and finish when 70% of the weights have been pruned.  The \nweights\n parameter tells the Pruner which weight tensors to prune.\n\n\npruners:\n  l0_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l0, rnn.weight_hh_l0]\n\n  l1_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l1, rnn.weight_hh_l1]\n\n  embedding_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [encoder.weight]\n\n\n\n\nWhen are we compressing?\n\n\nIf the \npruners\n section defines \"what-to-do\", the \npolicies\n section defines \"when-to-do\".  This part is harder, because we define the pruning schedule, which requires us to try a few different schedules until we understand which schedule works best.\nBelow we define three \nPruningPolicy\n instances.  The first two instances start operating at epoch 2 (\nstarting_epoch\n), end at epoch 20 (\nending_epoch\n), and operate once every epoch (\nfrequency\n; as I explained above, Distiller's Pruning scheduling operates only at \non_epoch_begin\n).  In between pruning operations, the pruned model is fine-tuned.\n\n\npolicies:\n  - pruner:\n      instance_name : l0_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20  \n    frequency: 1\n\n  - pruner:\n      instance_name : l1_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20\n    frequency: 1\n\n  - pruner:\n      instance_name : embedding_pruner\n    starting_epoch: 3\n    ending_epoch: 21\n    frequency: 1\n\n\n\n\nWe invoke the compression as follows:\n\n\n$ time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml\n\n\n\n\nTable 1\n above shows that we can make a negligible improvement when adding L2 regularization.  I did some experimenting with the sparsity distribution between the layers, and the scheduling frequency and noticed that the embedding layers are much less sensitive to pruning than the RNN cells.  I didn't notice any difference between the RNN cells, but I also didn't invest in this exploration.\nA new \n70% sparsity schedule\n, prunes the RNNs only to 50% sparsity, but prunes the embedding to 85% sparsity, and achieves almost a 3 points improvement in the test perplexity results.\n\n\nWe provide \nsimilar pruning schedules\n for the other compression rates.\n\n\nUntil next time\n\n\nThis concludes the first part of the tutorial on pruning a PyTorch language model.\n\nIn the next installment, I'll explain how we added an implementation of Baidu Research's \nExploring Sparsity in Recurrent Neural Networks\n paper, and applied to this language model.\n\n\nGeek On.", 
             "title": "Pruning a Language Model"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#using-distiller-to-prune-a-pytorch-language-model",
-            "text": "",
+            "location": "/tutorial-lang_model/index.html#using-distiller-to-prune-a-pytorch-language-model", 
+            "text": "", 
             "title": "Using Distiller to prune a PyTorch language model"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#contents",
-            "text": "Introduction  Setup  Preparing the code  Training-loop  Creating compression baselines  Compressing the language model  What are we compressing?  How are we compressing?  When are we compressing?  Until next time",
+            "location": "/tutorial-lang_model/index.html#contents", 
+            "text": "Introduction  Setup  Preparing the code  Training-loop  Creating compression baselines  Compressing the language model  What are we compressing?  How are we compressing?  When are we compressing?  Until next time", 
             "title": "Contents"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#introduction",
-            "text": "In this tutorial I'll show you how to compress a word-level language model using  Distiller .  Specifically, we use PyTorch\u2019s  word-level language model sample code  as the code-base of our example, weave in some Distiller code, and show how we compress the model using two different element-wise pruning algorithms.  To make things manageable, I've divided the tutorial to two parts: in the first we will setup the sample application and prune using  AGP .  In the second part I'll show how I've added Baidu's RNN pruning algorithm and then use it to prune the same word-level language model.  The completed code is available  here .  The results are displayed below and the code is available  here .\nNote that we can improve the results by training longer, since the loss curves are usually still decreasing at the end of epoch 40.  However, for demonstration purposes we don\u2019t need to do this.     Type  Sparsity  NNZ  Validation  Test  Command line      Small  0%  7,135,600  101.13  96.29  time python3 main.py --cuda --epochs 40 --tied --wd=1e-6    Medium  0%  28,390,700  88.17  84.21  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied,--wd=1e-6    Large  0%  85,917,000  87.49  83.85  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6    Large  70%  25,487,550  90.67  85.96  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml    Large  70%  25,487,550  90.59  85.84  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml --wd=1e-6    Large  70%  25,487,550  87.40  82.93  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70B.schedule_agp.yaml --wd=1e-6    Large  80.4%  16,847,550  89.31  83.64  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_80.schedule_agp.yaml --wd=1e-6    Large  90%  8,591,700  90.70  85.67  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_90.schedule_agp.yaml --wd=1e-6    Large  95%  4,295,850  98.42  92.79  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_95.schedule_agp.yaml --wd=1e-6     Table 1: AGP language model pruning results.  NNZ stands for number of non-zero coefficients (embeddings are counted once, because they are tied).   \n   Figure 1: Perplexity vs model size (lower perplexity is better).   The model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding.  The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings.  We used the WikiText2 dataset (twice as large as PTB).  We compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M; 136M) \u2013 reported as (#parameters net/tied; #parameters gross).\nThe results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow \u201ctrue\u201d pseudo-randomness.  We limited our tests to 40 epochs, even though validation perplexity was still trending down.  Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:  \u201cWe see that sparse models are able to outperform dense models which have significantly more parameters.\u201d  The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters.  It also outperform the dense large model, which exemplifies how pruning can act as a regularizer.\n* \u201cOur results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.\u201d",
+            "location": "/tutorial-lang_model/index.html#introduction", 
+            "text": "In this tutorial I'll show you how to compress a word-level language model using  Distiller .  Specifically, we use PyTorch\u2019s  word-level language model sample code  as the code-base of our example, weave in some Distiller code, and show how we compress the model using two different element-wise pruning algorithms.  To make things manageable, I've divided the tutorial to two parts: in the first we will setup the sample application and prune using  AGP .  In the second part I'll show how I've added Baidu's RNN pruning algorithm and then use it to prune the same word-level language model.  The completed code is available  here .  The results are displayed below and the code is available  here .\nNote that we can improve the results by training longer, since the loss curves are usually still decreasing at the end of epoch 40.  However, for demonstration purposes we don\u2019t need to do this.     Type  Sparsity  NNZ  Validation  Test  Command line      Small  0%  7,135,600  101.13  96.29  time python3 main.py --cuda --epochs 40 --tied --wd=1e-6    Medium  0%  28,390,700  88.17  84.21  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied,--wd=1e-6    Large  0%  85,917,000  87.49  83.85  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6    Large  70%  25,487,550  90.67  85.96  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml    Large  70%  25,487,550  90.59  85.84  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml --wd=1e-6    Large  70%  25,487,550  87.40  82.93  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70B.schedule_agp.yaml --wd=1e-6    Large  80.4%  16,847,550  89.31  83.64  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_80.schedule_agp.yaml --wd=1e-6    Large  90%  8,591,700  90.70  85.67  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_90.schedule_agp.yaml --wd=1e-6    Large  95%  4,295,850  98.42  92.79  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_95.schedule_agp.yaml --wd=1e-6     Table 1: AGP language model pruning results.  NNZ stands for number of non-zero coefficients (embeddings are counted once, because they are tied).   \n   Figure 1: Perplexity vs model size (lower perplexity is better).   The model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding.  The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings.  We used the WikiText2 dataset (twice as large as PTB).  We compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M; 136M) \u2013 reported as (#parameters net/tied; #parameters gross).\nThe results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow \u201ctrue\u201d pseudo-randomness.  We limited our tests to 40 epochs, even though validation perplexity was still trending down.  Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:  \u201cWe see that sparse models are able to outperform dense models which have significantly more parameters.\u201d  The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters.  It also outperform the dense large model, which exemplifies how pruning can act as a regularizer.\n* \u201cOur results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.\u201d", 
             "title": "Introduction"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#setup",
-            "text": "We start by cloning Pytorch\u2019s example  repository . I\u2019ve copied the language model code to distiller\u2019s examples/word_language_model directory, so I\u2019ll use that for the rest of the tutorial.\nNext, let\u2019s create and activate a virtual environment, as explained in Distiller's  README  file.\nNow we can turn our attention to  main.py , which contains the training application.",
+            "location": "/tutorial-lang_model/index.html#setup", 
+            "text": "We start by cloning Pytorch\u2019s example  repository . I\u2019ve copied the language model code to distiller\u2019s examples/word_language_model directory, so I\u2019ll use that for the rest of the tutorial.\nNext, let\u2019s create and activate a virtual environment, as explained in Distiller's  README  file.\nNow we can turn our attention to  main.py , which contains the training application.", 
             "title": "Setup"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#preparing-the-code",
-            "text": "We begin by adding code to invoke Distiller in file  main.py .  This involves a bit of mechanics, because we did not  pip install  Distiller in our environment (we don\u2019t have a  setup.py  script for Distiller as of yet).  To make Distiller library functions accessible from  main.py , we modify  sys.path  to include the distiller root directory by taking the current directory and pointing two directories up.  This is very specific to the location of this example code, and it will break if you\u2019ve placed the code elsewhere \u2013 so be aware.  import os\nimport sys\nscript_dir = os.path.dirname(__file__)\nmodule_path = os.path.abspath(os.path.join(script_dir, '..', '..'))\nif module_path not in sys.path:\n    sys.path.append(module_path)\nimport distiller\nimport apputils\nfrom distiller.data_loggers import TensorBoardLogger, PythonLogger  Next, we augment the application arguments with two Distiller-specific arguments.  The first,  --summary , gives us the ability to do simple compression instrumentation (e.g. log sparsity statistics).  The second argument,  --compress , is how we tell the application where the compression scheduling file is located.\nWe also add two arguments - momentum and weight-decay - for the SGD optimizer.  As I explain later, I replaced the original code's optimizer with SGD, so we need these extra arguments.  # Distiller-related arguments\nSUMMARY_CHOICES = ['sparsity', 'model', 'modules', 'png', 'percentile']\nparser.add_argument('--summary', type=str, choices=SUMMARY_CHOICES,\n                    help='print a summary of the model, and exit - options: ' +\n                    ' | '.join(SUMMARY_CHOICES))\nparser.add_argument('--compress', dest='compress', type=str, nargs='?', action='store',\n                    help='configuration file for pruning the model (default is to use hard-coded schedule)')\nparser.add_argument('--momentum', default=0., type=float, metavar='M',\n                    help='momentum')\nparser.add_argument('--weight-decay', '--wd', default=0., type=float,\n                    metavar='W', help='weight decay (default: 1e-4)')  We add code to handle the  --summary  application argument.  It can be as simple as forwarding to  distiller.model_summary  or more complex, as in the Distiller sample.  if args.summary:\n    distiller.model_summary(model, None, args.summary, 'wikitext2')\n    exit(0)  Similarly, we add code to handle the  --compress  argument, which creates a CompressionScheduler and configures it from a YAML schedule file:  if args.compress:\n    source = args.compress\n    compression_scheduler = distiller.CompressionScheduler(model)\n    distiller.config.fileConfig(model, None, compression_scheduler, args.compress, msglogger)  We also create the optimizer, and the learning-rate decay policy scheduler.  The original PyTorch example manually manages the optimization and LR decay process, but I think that having a standard optimizer and LR-decay schedule gives us the flexibility to experiment with these during the training process.  Using an  SGD optimizer  configured with  momentum=0  and  weight_decay=0 , and a  ReduceLROnPlateau LR-decay policy  with  patience=0  and  factor=0.5  will give the same behavior as in the original PyTorch example.  From there, we can experiment with the optimizer and LR-decay configuration.  optimizer = torch.optim.SGD(model.parameters(), args.lr,\n                            momentum=args.momentum,\n                            weight_decay=args.weight_decay)\nlr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',\n                                                          patience=0, verbose=True, factor=0.5)  Next, we add code to setup the logging backends: a Python logger backend which reads its configuration from file and logs messages to the console and log file ( pylogger ); and a TensorBoard backend logger which logs statistics to a TensorBoard data file ( tflogger ).  I configured the TensorBoard backend to log gradients because RNNs suffer from vanishing and exploding gradients, so we might want to take a look in case the training experiences a sudden failure.\nThis code is not strictly required, but it is quite useful to be able to log the session progress, and to export logs to TensorBoard for realtime visualization of the training progress.  # Distiller loggers\nmsglogger = apputils.config_pylogger('logging.conf', None)\ntflogger = TensorBoardLogger(msglogger.logdir)\ntflogger.log_gradients = True\npylogger = PythonLogger(msglogger)",
+            "location": "/tutorial-lang_model/index.html#preparing-the-code", 
+            "text": "We begin by adding code to invoke Distiller in file  main.py .  This involves a bit of mechanics, because we did not  pip install  Distiller in our environment (we don\u2019t have a  setup.py  script for Distiller as of yet).  To make Distiller library functions accessible from  main.py , we modify  sys.path  to include the distiller root directory by taking the current directory and pointing two directories up.  This is very specific to the location of this example code, and it will break if you\u2019ve placed the code elsewhere \u2013 so be aware.  import os\nimport sys\nscript_dir = os.path.dirname(__file__)\nmodule_path = os.path.abspath(os.path.join(script_dir, '..', '..'))\nif module_path not in sys.path:\n    sys.path.append(module_path)\nimport distiller\nimport apputils\nfrom distiller.data_loggers import TensorBoardLogger, PythonLogger  Next, we augment the application arguments with two Distiller-specific arguments.  The first,  --summary , gives us the ability to do simple compression instrumentation (e.g. log sparsity statistics).  The second argument,  --compress , is how we tell the application where the compression scheduling file is located.\nWe also add two arguments - momentum and weight-decay - for the SGD optimizer.  As I explain later, I replaced the original code's optimizer with SGD, so we need these extra arguments.  # Distiller-related arguments\nSUMMARY_CHOICES = ['sparsity', 'model', 'modules', 'png', 'percentile']\nparser.add_argument('--summary', type=str, choices=SUMMARY_CHOICES,\n                    help='print a summary of the model, and exit - options: ' +\n                    ' | '.join(SUMMARY_CHOICES))\nparser.add_argument('--compress', dest='compress', type=str, nargs='?', action='store',\n                    help='configuration file for pruning the model (default is to use hard-coded schedule)')\nparser.add_argument('--momentum', default=0., type=float, metavar='M',\n                    help='momentum')\nparser.add_argument('--weight-decay', '--wd', default=0., type=float,\n                    metavar='W', help='weight decay (default: 1e-4)')  We add code to handle the  --summary  application argument.  It can be as simple as forwarding to  distiller.model_summary  or more complex, as in the Distiller sample.  if args.summary:\n    distiller.model_summary(model, None, args.summary, 'wikitext2')\n    exit(0)  Similarly, we add code to handle the  --compress  argument, which creates a CompressionScheduler and configures it from a YAML schedule file:  if args.compress:\n    source = args.compress\n    compression_scheduler = distiller.CompressionScheduler(model)\n    distiller.config.fileConfig(model, None, compression_scheduler, args.compress, msglogger)  We also create the optimizer, and the learning-rate decay policy scheduler.  The original PyTorch example manually manages the optimization and LR decay process, but I think that having a standard optimizer and LR-decay schedule gives us the flexibility to experiment with these during the training process.  Using an  SGD optimizer  configured with  momentum=0  and  weight_decay=0 , and a  ReduceLROnPlateau LR-decay policy  with  patience=0  and  factor=0.5  will give the same behavior as in the original PyTorch example.  From there, we can experiment with the optimizer and LR-decay configuration.  optimizer = torch.optim.SGD(model.parameters(), args.lr,\n                            momentum=args.momentum,\n                            weight_decay=args.weight_decay)\nlr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',\n                                                          patience=0, verbose=True, factor=0.5)  Next, we add code to setup the logging backends: a Python logger backend which reads its configuration from file and logs messages to the console and log file ( pylogger ); and a TensorBoard backend logger which logs statistics to a TensorBoard data file ( tflogger ).  I configured the TensorBoard backend to log gradients because RNNs suffer from vanishing and exploding gradients, so we might want to take a look in case the training experiences a sudden failure.\nThis code is not strictly required, but it is quite useful to be able to log the session progress, and to export logs to TensorBoard for realtime visualization of the training progress.  # Distiller loggers\nmsglogger = apputils.config_pylogger('logging.conf', None)\ntflogger = TensorBoardLogger(msglogger.logdir)\ntflogger.log_gradients = True\npylogger = PythonLogger(msglogger)", 
             "title": "Preparing the code"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#training-loop",
-            "text": "Now we scroll down all the way to the train() function.  We'll change its signature to include the  epoch ,  optimizer , and  compression_schdule .   We'll soon see why we need these.  def train(epoch, optimizer, compression_scheduler=None)  Function  train()  is responsible for training the network in batches for one epoch, and in its epoch loop we want to perform compression.   The  CompressionScheduler  invokes  ScheduledTrainingPolicy  instances per the scheduling specification that was programmed in the  CompressionScheduler  instance.  There are four main  SchedulingPolicy  types:  PruningPolicy ,  RegularizationPolicy ,  LRPolicy , and  QuantizationPolicy .  We'll be using  PruningPolicy , which is triggered  on_epoch_begin  (to invoke the  Pruners , and  on_minibatch_begin  (to mask the weights).   Later we will create a YAML scheduling file, and specify the schedule of  AutomatedGradualPruner  instances.    Because we are writing a single application, which can be used with various Policies in the future (e.g. group-lasso regularization), we should add code to invoke all of the  CompressionScheduler 's callbacks, not just the mandatory  on_epoch_begin  callback.    We invoke  on_minibatch_begin  before running the forward-pass,  before_backward_pass  after computing the loss, and  on_minibatch_end  after completing the backward-pass.  \ndef train(epoch, optimizer, compression_scheduler=None):\n    ...\n\n    # The line below was fixed as per: https://github.com/pytorch/examples/issues/214\n    for batch, i in enumerate(range(0, train_data.size(0), args.bptt)):\n        data, targets = get_batch(train_data, i)\n        # Starting each batch, we detach the hidden state from how it was previously produced.\n        # If we didn't, the model would try backpropagating all the way to start of the dataset.\n        hidden = repackage_hidden(hidden)\n\n         if compression_scheduler:\n            compression_scheduler.on_minibatch_begin(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch) \n        output, hidden = model(data, hidden)\n        loss = criterion(output.view(-1, ntokens), targets)\n\n         if compression_scheduler:\n            compression_scheduler.before_backward_pass(epoch, minibatch_id=batch,\n                                                       minibatches_per_epoch=steps_per_epoch,\n                                                       loss=loss) \n        optimizer.zero_grad()\n        loss.backward()\n\n        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.\n        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)\n        optimizer.step()\n\n        total_loss += loss.item()\n\n         if compression_scheduler:\n            compression_scheduler.on_minibatch_end(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch)   The rest of the code could stay as in the original PyTorch sample, but I wanted to use an SGD optimizer, so I replaced:  for p in model.parameters():\n    p.data.add_(-lr, p.grad.data)  with:  optimizer.step()  The rest of the code in function  train()  logs to a text file and a  TensorBoard  backend.  Again, such code is not mandatory, but a few lines give us a lot of visibility: we have training progress information saved to log, and we can monitor the training progress in realtime on TensorBoard.  That's a lot for a few lines of code ;-)  \nif batch % args.log_interval == 0 and batch > 0:\n    cur_loss = total_loss / args.log_interval\n    elapsed = time.time() - start_time\n    lr = optimizer.param_groups[0]['lr']\n    msglogger.info(\n            '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.4f} | ms/batch {:5.2f} '\n            '| loss {:5.2f} | ppl {:8.2f}'.format(\n        epoch, batch, len(train_data) // args.bptt, lr,\n        elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))\n    total_loss = 0\n    start_time = time.time()\n    stats = ('Peformance/Training/',\n        OrderedDict([\n            ('Loss', cur_loss),\n            ('Perplexity', math.exp(cur_loss)),\n            ('LR', lr),\n            ('Batch Time', elapsed * 1000)])\n        )\n    steps_completed = batch + 1\n    distiller.log_training_progress(stats, model.named_parameters(), epoch, steps_completed,\n                                    steps_per_epoch, args.log_interval, [tflogger])  Finally we get to the outer training-loop which loops on  args.epochs .  We add the two final  CompressionScheduler  callbacks:  on_epoch_begin , at the start of the loop, and  on_epoch_end  after running  evaluate  on the model and updating the learning-rate.  \ntry:\n    for epoch in range(0, args.epochs):\n        epoch_start_time = time.time()\n         if compression_scheduler:\n            compression_scheduler.on_epoch_begin(epoch) \n\n        train(epoch, optimizer, compression_scheduler)\n        val_loss = evaluate(val_data)\n        lr_scheduler.step(val_loss)\n\n         if compression_scheduler:\n            compression_scheduler.on_epoch_end(epoch)   And that's it!  The language model sample is ready for compression.",
+            "location": "/tutorial-lang_model/index.html#training-loop", 
+            "text": "Now we scroll down all the way to the train() function.  We'll change its signature to include the  epoch ,  optimizer , and  compression_schdule .   We'll soon see why we need these.  def train(epoch, optimizer, compression_scheduler=None)  Function  train()  is responsible for training the network in batches for one epoch, and in its epoch loop we want to perform compression.   The  CompressionScheduler  invokes  ScheduledTrainingPolicy  instances per the scheduling specification that was programmed in the  CompressionScheduler  instance.  There are four main  SchedulingPolicy  types:  PruningPolicy ,  RegularizationPolicy ,  LRPolicy , and  QuantizationPolicy .  We'll be using  PruningPolicy , which is triggered  on_epoch_begin  (to invoke the  Pruners , and  on_minibatch_begin  (to mask the weights).   Later we will create a YAML scheduling file, and specify the schedule of  AutomatedGradualPruner  instances.    Because we are writing a single application, which can be used with various Policies in the future (e.g. group-lasso regularization), we should add code to invoke all of the  CompressionScheduler 's callbacks, not just the mandatory  on_epoch_begin  callback.    We invoke  on_minibatch_begin  before running the forward-pass,  before_backward_pass  after computing the loss, and  on_minibatch_end  after completing the backward-pass.  \ndef train(epoch, optimizer, compression_scheduler=None):\n    ...\n\n    # The line below was fixed as per: https://github.com/pytorch/examples/issues/214\n    for batch, i in enumerate(range(0, train_data.size(0), args.bptt)):\n        data, targets = get_batch(train_data, i)\n        # Starting each batch, we detach the hidden state from how it was previously produced.\n        # If we didn't, the model would try backpropagating all the way to start of the dataset.\n        hidden = repackage_hidden(hidden)\n\n         if compression_scheduler:\n            compression_scheduler.on_minibatch_begin(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch) \n        output, hidden = model(data, hidden)\n        loss = criterion(output.view(-1, ntokens), targets)\n\n         if compression_scheduler:\n            compression_scheduler.before_backward_pass(epoch, minibatch_id=batch,\n                                                       minibatches_per_epoch=steps_per_epoch,\n                                                       loss=loss) \n        optimizer.zero_grad()\n        loss.backward()\n\n        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.\n        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)\n        optimizer.step()\n\n        total_loss += loss.item()\n\n         if compression_scheduler:\n            compression_scheduler.on_minibatch_end(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch)   The rest of the code could stay as in the original PyTorch sample, but I wanted to use an SGD optimizer, so I replaced:  for p in model.parameters():\n    p.data.add_(-lr, p.grad.data)  with:  optimizer.step()  The rest of the code in function  train()  logs to a text file and a  TensorBoard  backend.  Again, such code is not mandatory, but a few lines give us a lot of visibility: we have training progress information saved to log, and we can monitor the training progress in realtime on TensorBoard.  That's a lot for a few lines of code ;-)  \nif batch % args.log_interval == 0 and batch > 0:\n    cur_loss = total_loss / args.log_interval\n    elapsed = time.time() - start_time\n    lr = optimizer.param_groups[0]['lr']\n    msglogger.info(\n            '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.4f} | ms/batch {:5.2f} '\n            '| loss {:5.2f} | ppl {:8.2f}'.format(\n        epoch, batch, len(train_data) // args.bptt, lr,\n        elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))\n    total_loss = 0\n    start_time = time.time()\n    stats = ('Peformance/Training/',\n        OrderedDict([\n            ('Loss', cur_loss),\n            ('Perplexity', math.exp(cur_loss)),\n            ('LR', lr),\n            ('Batch Time', elapsed * 1000)])\n        )\n    steps_completed = batch + 1\n    distiller.log_training_progress(stats, model.named_parameters(), epoch, steps_completed,\n                                    steps_per_epoch, args.log_interval, [tflogger])  Finally we get to the outer training-loop which loops on  args.epochs .  We add the two final  CompressionScheduler  callbacks:  on_epoch_begin , at the start of the loop, and  on_epoch_end  after running  evaluate  on the model and updating the learning-rate.  \ntry:\n    for epoch in range(0, args.epochs):\n        epoch_start_time = time.time()\n         if compression_scheduler:\n            compression_scheduler.on_epoch_begin(epoch) \n\n        train(epoch, optimizer, compression_scheduler)\n        val_loss = evaluate(val_data)\n        lr_scheduler.step(val_loss)\n\n         if compression_scheduler:\n            compression_scheduler.on_epoch_end(epoch)   And that's it!  The language model sample is ready for compression.", 
             "title": "Training loop"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#creating-compression-baselines",
-            "text": "In  To prune, or not to prune: exploring the efficacy of pruning for model compression  Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\" \nThis pruning schedule is implemented by distiller.AutomatedGradualPruner (AGP), which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.  Before we start compressing stuff ;-), we need to create baselines so we have something to benchmark against.  Let's prepare small, medium, and large baseline models, like Table 3 of  To prune, or Not to Prune .  These will provide baseline perplexity results that we'll compare the compressed models against.   \nI chose to use tied input/output embeddings, and constrained the training to 40 epochs.  The table below shows the model sizes, where we are interested in the tied version (biases are ignored due to their small size and because we don't prune them).     Size  Number of Weights (untied)  Number of Weights (tied)      Small  13,951,200  7,295,600    Medium  50,021,400  28,390,700    Large  135,834,000  85,917,000     I started experimenting with the optimizer setup like in the PyTorch example, but I added some L2 regularization when I noticed that the training was overfitting.  The two right columns show the perplexity results (lower is better) of each of the models with no L2 regularization and with 1e-5 and 1e-6.\nIn all three model sizes using the smaller L2 regularization (1e-6) gave the best results.  BTW, I'm not showing here experiments with even lower regularization because that did not help.     Type  Command line  Validation  Test      Small  time python3 main.py --cuda --epochs 40 --tied  105.23  99.53    Small  time python3 main.py --cuda --epochs 40 --tied --wd=1e-6  101.13  96.29    Small  time python3 main.py --cuda --epochs 40 --tied --wd=1e-5  109.49  103.53    Medium  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied  90.93  86.20    Medium  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6  88.17  84.21    Medium  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-5  97.75  93.06    Large  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied  88.23  84.21    Large  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6  87.49  83.85    Large  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-5  99.22  94.28",
+            "location": "/tutorial-lang_model/index.html#creating-compression-baselines", 
+            "text": "In  To prune, or not to prune: exploring the efficacy of pruning for model compression  Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\" \nThis pruning schedule is implemented by distiller.AutomatedGradualPruner (AGP), which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.  Before we start compressing stuff ;-), we need to create baselines so we have something to benchmark against.  Let's prepare small, medium, and large baseline models, like Table 3 of  To prune, or Not to Prune .  These will provide baseline perplexity results that we'll compare the compressed models against.   \nI chose to use tied input/output embeddings, and constrained the training to 40 epochs.  The table below shows the model sizes, where we are interested in the tied version (biases are ignored due to their small size and because we don't prune them).     Size  Number of Weights (untied)  Number of Weights (tied)      Small  13,951,200  7,295,600    Medium  50,021,400  28,390,700    Large  135,834,000  85,917,000     I started experimenting with the optimizer setup like in the PyTorch example, but I added some L2 regularization when I noticed that the training was overfitting.  The two right columns show the perplexity results (lower is better) of each of the models with no L2 regularization and with 1e-5 and 1e-6.\nIn all three model sizes using the smaller L2 regularization (1e-6) gave the best results.  BTW, I'm not showing here experiments with even lower regularization because that did not help.     Type  Command line  Validation  Test      Small  time python3 main.py --cuda --epochs 40 --tied  105.23  99.53    Small  time python3 main.py --cuda --epochs 40 --tied --wd=1e-6  101.13  96.29    Small  time python3 main.py --cuda --epochs 40 --tied --wd=1e-5  109.49  103.53    Medium  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied  90.93  86.20    Medium  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6  88.17  84.21    Medium  time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-5  97.75  93.06    Large  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied  88.23  84.21    Large  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6  87.49  83.85    Large  time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-5  99.22  94.28", 
             "title": "Creating compression baselines"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#compressing-the-language-model",
-            "text": "OK, so now let's recreate the results of the language model experiment from section 4.2 of paper.  We're using PyTorch's sample, so the language model we implement is not exactly like the one in the AGP paper (and uses a different dataset), but it's close enough, so if everything goes well, we should see similar compression results.",
+            "location": "/tutorial-lang_model/index.html#compressing-the-language-model", 
+            "text": "OK, so now let's recreate the results of the language model experiment from section 4.2 of paper.  We're using PyTorch's sample, so the language model we implement is not exactly like the one in the AGP paper (and uses a different dataset), but it's close enough, so if everything goes well, we should see similar compression results.", 
             "title": "Compressing the language model"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#what-are-we-compressing",
-            "text": "To gain insight about the model parameters, we can use the command-line to produce a weights-sparsity table:  $ python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --summary=sparsity\n\nParameters:\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|         | Name             | Shape         |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0.00000 | encoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 1.00000 | rnn.weight_ih_l0 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 |  0.00001 |    0.01291 |\n| 2.00000 | rnn.weight_hh_l0 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01491 |  0.00000 |    0.01291 |\n| 3.00000 | rnn.weight_ih_l1 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01490 | -0.00000 |    0.01291 |\n| 4.00000 | rnn.weight_hh_l1 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 | -0.00000 |    0.01291 |\n| 5.00000 | decoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 6.00000 | Total sparsity:  | -             |     135834000 |      135833996 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.00000 |  0.00000 |    0.00000 |\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 0.00  So what's going on here? encoder.weight  and  decoder.weight  are the input and output embeddings, respectively.  Remember that in the configuration I chose for the three model sizes these embeddings are tied, which means that we only have one copy of parameters, that is shared between the encoder and decoder.\nWe also have two pairs of RNN (LSTM really) parameters.  There is a pair because the model uses the command-line argument  args.nlayers  to decide how many instances of RNN (or LSTM or GRU) cells to use, and it defaults to 2.  The recurrent cells are LSTM cells, because this is the default of  args.model , which is used in the initialization of  RNNModel .  Let's look at the parameters of the first RNN:  rnn.weight_ih_l0  and  rnn.weight_hh_l0 : what are these? \nRecall the  LSTM equations  that PyTorch implements.  In the equations, there are 8 instances of vector-matrix multiplication (when batch=1).  These can be combined into a single matrix-matrix multiplication (GEMM), but PyTorch groups these into two GEMM operations: one GEMM multiplies the inputs ( rnn.weight_ih_l0 ), and the other multiplies the hidden-state ( rnn.weight_hh_l0 ).",
+            "location": "/tutorial-lang_model/index.html#what-are-we-compressing", 
+            "text": "To gain insight about the model parameters, we can use the command-line to produce a weights-sparsity table:  $ python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --summary=sparsity\n\nParameters:\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|         | Name             | Shape         |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n| 0.00000 | encoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 1.00000 | rnn.weight_ih_l0 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 |  0.00001 |    0.01291 |\n| 2.00000 | rnn.weight_hh_l0 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01491 |  0.00000 |    0.01291 |\n| 3.00000 | rnn.weight_ih_l1 | (6000, 1500)  |       9000000 |        8999999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00001 | 0.01490 | -0.00000 |    0.01291 |\n| 4.00000 | rnn.weight_hh_l1 | (6000, 1500)  |       9000000 |        9000000 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.01491 | -0.00000 |    0.01291 |\n| 5.00000 | decoder.weight   | (33278, 1500) |      49917000 |       49916999 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.05773 | -0.00000 |    0.05000 |\n| 6.00000 | Total sparsity:  | -             |     135834000 |      135833996 |    0.00000 |    0.00000 |        0 |  0.00000 |        0 |    0.00000 | 0.00000 |  0.00000 |    0.00000 |\n+---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 0.00  So what's going on here? encoder.weight  and  decoder.weight  are the input and output embeddings, respectively.  Remember that in the configuration I chose for the three model sizes these embeddings are tied, which means that we only have one copy of parameters, that is shared between the encoder and decoder.\nWe also have two pairs of RNN (LSTM really) parameters.  There is a pair because the model uses the command-line argument  args.nlayers  to decide how many instances of RNN (or LSTM or GRU) cells to use, and it defaults to 2.  The recurrent cells are LSTM cells, because this is the default of  args.model , which is used in the initialization of  RNNModel .  Let's look at the parameters of the first RNN:  rnn.weight_ih_l0  and  rnn.weight_hh_l0 : what are these? \nRecall the  LSTM equations  that PyTorch implements.  In the equations, there are 8 instances of vector-matrix multiplication (when batch=1).  These can be combined into a single matrix-matrix multiplication (GEMM), but PyTorch groups these into two GEMM operations: one GEMM multiplies the inputs ( rnn.weight_ih_l0 ), and the other multiplies the hidden-state ( rnn.weight_hh_l0 ).", 
             "title": "What are we compressing?"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#how-are-we-compressing",
-            "text": "Let's turn to the configurations of the Large language model compression schedule to 70%, 80%, 90% and 95% sparsity. Using AGP it is easy to configure the pruning schedule to produce an exact sparsity of the compressed model.  I'll use the  70% schedule  to show a concrete example.  The YAML file has two sections:  pruners  and  policies .  Section  pruners  defines instances of  ParameterPruner  - in our case we define three instances of  AutomatedGradualPruner : for the weights of the first RNN ( l0_rnn_pruner ), the second RNN ( l1_rnn_pruner ) and the embedding layer ( embedding_pruner ).  These names are arbitrary, and serve are name-handles which bind Policies to Pruners - so you can use whatever names you want.\nEach  AutomatedGradualPruner  is configured with an  initial_sparsity  and  final_sparsity .  For examples, the  l0_rnn_pruner  below is configured to prune 5% of the weights as soon as it starts working, and finish when 70% of the weights have been pruned.  The  weights  parameter tells the Pruner which weight tensors to prune.  pruners:\n  l0_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l0, rnn.weight_hh_l0]\n\n  l1_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l1, rnn.weight_hh_l1]\n\n  embedding_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [encoder.weight]",
+            "location": "/tutorial-lang_model/index.html#how-are-we-compressing", 
+            "text": "Let's turn to the configurations of the Large language model compression schedule to 70%, 80%, 90% and 95% sparsity. Using AGP it is easy to configure the pruning schedule to produce an exact sparsity of the compressed model.  I'll use the  70% schedule  to show a concrete example.  The YAML file has two sections:  pruners  and  policies .  Section  pruners  defines instances of  ParameterPruner  - in our case we define three instances of  AutomatedGradualPruner : for the weights of the first RNN ( l0_rnn_pruner ), the second RNN ( l1_rnn_pruner ) and the embedding layer ( embedding_pruner ).  These names are arbitrary, and serve are name-handles which bind Policies to Pruners - so you can use whatever names you want.\nEach  AutomatedGradualPruner  is configured with an  initial_sparsity  and  final_sparsity .  For examples, the  l0_rnn_pruner  below is configured to prune 5% of the weights as soon as it starts working, and finish when 70% of the weights have been pruned.  The  weights  parameter tells the Pruner which weight tensors to prune.  pruners:\n  l0_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l0, rnn.weight_hh_l0]\n\n  l1_rnn_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [rnn.weight_ih_l1, rnn.weight_hh_l1]\n\n  embedding_pruner:\n    class: AutomatedGradualPruner\n    initial_sparsity : 0.05\n    final_sparsity: 0.70\n    weights: [encoder.weight]", 
             "title": "How are we compressing?"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#when-are-we-compressing",
-            "text": "If the  pruners  section defines \"what-to-do\", the  policies  section defines \"when-to-do\".  This part is harder, because we define the pruning schedule, which requires us to try a few different schedules until we understand which schedule works best.\nBelow we define three  PruningPolicy  instances.  The first two instances start operating at epoch 2 ( starting_epoch ), end at epoch 20 ( ending_epoch ), and operate once every epoch ( frequency ; as I explained above, Distiller's Pruning scheduling operates only at  on_epoch_begin ).  In between pruning operations, the pruned model is fine-tuned.  policies:\n  - pruner:\n      instance_name : l0_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20  \n    frequency: 1\n\n  - pruner:\n      instance_name : l1_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20\n    frequency: 1\n\n  - pruner:\n      instance_name : embedding_pruner\n    starting_epoch: 3\n    ending_epoch: 21\n    frequency: 1  We invoke the compression as follows:  $ time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml  Table 1  above shows that we can make a negligible improvement when adding L2 regularization.  I did some experimenting with the sparsity distribution between the layers, and the scheduling frequency and noticed that the embedding layers are much less sensitive to pruning than the RNN cells.  I didn't notice any difference between the RNN cells, but I also didn't invest in this exploration.\nA new  70% sparsity schedule , prunes the RNNs only to 50% sparsity, but prunes the embedding to 85% sparsity, and achieves almost a 3 points improvement in the test perplexity results.  We provide  similar pruning schedules  for the other compression rates.",
+            "location": "/tutorial-lang_model/index.html#when-are-we-compressing", 
+            "text": "If the  pruners  section defines \"what-to-do\", the  policies  section defines \"when-to-do\".  This part is harder, because we define the pruning schedule, which requires us to try a few different schedules until we understand which schedule works best.\nBelow we define three  PruningPolicy  instances.  The first two instances start operating at epoch 2 ( starting_epoch ), end at epoch 20 ( ending_epoch ), and operate once every epoch ( frequency ; as I explained above, Distiller's Pruning scheduling operates only at  on_epoch_begin ).  In between pruning operations, the pruned model is fine-tuned.  policies:\n  - pruner:\n      instance_name : l0_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20  \n    frequency: 1\n\n  - pruner:\n      instance_name : l1_rnn_pruner\n    starting_epoch: 2\n    ending_epoch: 20\n    frequency: 1\n\n  - pruner:\n      instance_name : embedding_pruner\n    starting_epoch: 3\n    ending_epoch: 21\n    frequency: 1  We invoke the compression as follows:  $ time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml  Table 1  above shows that we can make a negligible improvement when adding L2 regularization.  I did some experimenting with the sparsity distribution between the layers, and the scheduling frequency and noticed that the embedding layers are much less sensitive to pruning than the RNN cells.  I didn't notice any difference between the RNN cells, but I also didn't invest in this exploration.\nA new  70% sparsity schedule , prunes the RNNs only to 50% sparsity, but prunes the embedding to 85% sparsity, and achieves almost a 3 points improvement in the test perplexity results.  We provide  similar pruning schedules  for the other compression rates.", 
             "title": "When are we compressing?"
-        },
+        }, 
         {
-            "location": "/tutorial-lang_model/index.html#until-next-time",
-            "text": "This concludes the first part of the tutorial on pruning a PyTorch language model. \nIn the next installment, I'll explain how we added an implementation of Baidu Research's  Exploring Sparsity in Recurrent Neural Networks  paper, and applied to this language model.  Geek On.",
+            "location": "/tutorial-lang_model/index.html#until-next-time", 
+            "text": "This concludes the first part of the tutorial on pruning a PyTorch language model. \nIn the next installment, I'll explain how we added an implementation of Baidu Research's  Exploring Sparsity in Recurrent Neural Networks  paper, and applied to this language model.  Geek On.", 
             "title": "Until next time"
         }
     ]
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index d0900fc..e15d9bf 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
     
     <url>
      <loc>/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -12,7 +12,7 @@
     
     <url>
      <loc>/install/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -20,7 +20,7 @@
     
     <url>
      <loc>/usage/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -28,7 +28,7 @@
     
     <url>
      <loc>/schedule/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -37,31 +37,31 @@
         
     <url>
      <loc>/pruning/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/regularization/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/quantization/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/knowledge_distillation/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/conditional_computation/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -71,19 +71,19 @@
         
     <url>
      <loc>/algo_pruning/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/algo_quantization/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/algo_earlyexit/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -92,7 +92,7 @@
     
     <url>
      <loc>/model_zoo/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -100,7 +100,7 @@
     
     <url>
      <loc>/jupyter/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -108,7 +108,7 @@
     
     <url>
      <loc>/design/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -117,13 +117,13 @@
         
     <url>
      <loc>/tutorial-struct_pruning/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/tutorial-lang_model/index.html</loc>
-     <lastmod>2018-12-06</lastmod>
+     <lastmod>2018-12-11</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
-- 
GitLab