diff --git a/distiller/thresholding.py b/distiller/thresholding.py
index 38c9923908bcf54bb6d9cf441b9c48032a52f3bf..18eeae4ebdb3c7877ed0fcb04ae50a9964e342b9 100755
--- a/distiller/thresholding.py
+++ b/distiller/thresholding.py
@@ -67,6 +67,9 @@ def group_threshold_mask(param, group_type, threshold, threshold_criteria):
         #    elements in each channel as the threshold filter.
         # 3. Apply the threshold filter
         binary_map = threshold_policy(view_2d, thresholds, threshold_criteria)
+
+        # TODO: stash binary_map here so that it can be reused later.
+
         # 3. Finally, expand the thresholds and view as a 4D tensor
         a = binary_map.expand(param.size(2) * param.size(3),
                               param.size(0) * param.size(1)).t()
diff --git a/docs-src/docs/algo_earlyexit.md b/docs-src/docs/algo_earlyexit.md
new file mode 100755
index 0000000000000000000000000000000000000000..226d45e6ef54430dc67ac99dd5eba757818b9730
--- /dev/null
+++ b/docs-src/docs/algo_earlyexit.md
@@ -0,0 +1,42 @@
+# Early Exit Inference
+While Deep Neural Networks benefit from a large number of layers, in classification tasks many data points can often be classified accurately with much less work. Several recent studies have examined the idea of exiting before the normal endpoint of the neural network. Panda et al. in [Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition](#panda) point out that many data points can be classified easily and require less processing than more difficult points, and they frame this in terms of power savings. Teerapittayanon et al. in [BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks](#branchynet) look at a selective approach to exit placement and the criteria for exiting early.
+
+## Why Does Early Exit Work?
+Early Exit is a strategy with a straightforward and easy-to-understand concept. Figure !fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we’re confident of avoiding over-fitting the data), it’s also clear that much of the data can be properly classified with even the simplest of classification boundaries.
+
+![Figure !fig(boundaries): Simple and more expressive classification boundaries](imgs/decision_boundary.png)
+
+Data points far from the boundary can be considered "easy to classify" and achieve a high degree of confidence more quickly than data points close to the boundary. In fact, we can think of the area between the outer straight lines as the region that is "difficult to classify" and requires the full expressiveness of the neural network to classify accurately.
+
+## Example code for Early Exit
+Both the CIFAR10 and ImageNet code come directly from publicly available examples from PyTorch. The only edits are the exits, which are inserted in a methodology similar to the BranchyNet work.
+
+Deeper networks can benefit from multiple exits. Our examples illustrate a single early exit for CIFAR10 and a pair of early exits for ImageNet.
+
+Note that this code does not actually take exits. Instead, it computes statistics of loss and accuracy assuming exits had been taken whenever the exit criteria are met. Actually taking exits at inference time can be tricky and architecture-dependent, and we plan to address these issues.
+
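+The bookkeeping described above can be sketched as follows. This is an illustration only, not the actual Distiller implementation; the per-sample logits from every exit, the target label and the exit thresholds are assumed to be provided by the surrounding evaluation loop.
+
+```python
+import torch.nn.functional as F
+
+# Sketch only: decide which exit a single sample *would* have taken and
+# return the statistics for that exit.
+# exit_outputs: list of logits tensors, one per exit (final exit last), each of shape (1, num_classes)
+# target:       tensor of shape (1,) holding the ground-truth label
+# thresholds:   cross-entropy thresholds for the early exits (e.g. from --earlyexit_thresholds)
+def tally_exit(exit_outputs, target, thresholds):
+    for exit_id, logits in enumerate(exit_outputs[:-1]):
+        loss = F.cross_entropy(logits, target)
+        if loss.item() < thresholds[exit_id]:
+            correct = (logits.argmax(dim=1) == target).item()
+            return exit_id, loss.item(), correct          # sample "exits" here
+    logits = exit_outputs[-1]
+    loss = F.cross_entropy(logits, target)
+    correct = (logits.argmax(dim=1) == target).item()
+    return len(exit_outputs) - 1, loss.item(), correct    # falls through to the final exit
+```
+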
+### Heuristics
+The insertion of the exits is ad hoc, but there are some heuristic principles guiding their placement and parameters. The earlier an exit is placed, the more aggressive it is, as it essentially prunes the rest of the network at a very early stage and thus saves a lot of work. However, to preserve accuracy, a diminishing percentage of the data can be directed through such an exit.
+
+There are other benefits to adding exits: when training the modified network, backpropagated losses from the early exits affect the earlier layers more substantially than the loss from the final exit alone. This effect mitigates problems such as vanishing gradients.
+
+### Early Exit Hyperparameters
+There are two parameters that are required to enable Early Exit. Leave them undefined if you are not enabling Early Exit:
+
+1. **--earlyexit_thresholds** defines the thresholds for each of the early exits. The cross-entropy loss measured at an exit must be **less than** the specified threshold for that exit to be taken; otherwise the data continues along the regular path. For example, specifying "--earlyexit_thresholds 0.9 1.2" implies two early exits with thresholds of 0.9 and 1.2, respectively.
+
+1. **--earlyexit_lossweights** provides the weights for the linear combination of losses during training that forms a single, overall loss. We only specify weights for the early exits and assume that the sum of all the weights (including the final exit) is equal to 1.0. So, for example, "--earlyexit_lossweights 0.2 0.3" implies two early exits weighted with values of 0.2 and 0.3, respectively, and a final exit with a weight of 1.0 - (0.2 + 0.3) = 0.5. Studies have shown that weighting the early exits more heavily creates more aggressive early exits, but perhaps with a slight negative effect on accuracy. A minimal sketch of this weighted-loss computation is shown below.
+
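+To make the weighting concrete, here is a minimal sketch of how such a combined loss could be computed during training. It is not the actual Distiller implementation; `exit_losses` (early-exit losses first, final-exit loss last) and `loss_weights` (the values passed to --earlyexit_lossweights) are assumed to come from the surrounding training loop.
+
+```python
+# Sketch only: combine per-exit losses into one overall training loss.
+# exit_losses  = [loss_exit0, loss_exit1, ..., loss_final]  (floats or torch tensors)
+# loss_weights = weights for the early exits only, e.g. [0.2, 0.3]
+def combined_loss(exit_losses, loss_weights):
+    early_losses, final_loss = exit_losses[:-1], exit_losses[-1]
+    assert len(early_losses) == len(loss_weights)
+    final_weight = 1.0 - sum(loss_weights)   # e.g. 1.0 - (0.2 + 0.3) = 0.5
+    total = final_weight * final_loss
+    for weight, loss in zip(loss_weights, early_losses):
+        total = total + weight * loss
+    return total
+```
+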
+### CIFAR10
+In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself include a convolutional layer and a fully-connected layer. If you move the exit, be sure to match the proper sizes for the inputs and outputs of the exit layers.
+
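+As an illustration only, an exit branch of this general shape might look like the following sketch. The channel count, pooled spatial size and number of classes are assumptions that must be matched to the point in the backbone where the exit is attached; this is not the exit definition used in the Distiller examples.
+
+```python
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Hypothetical exit branch: one convolutional layer followed by a fully-connected classifier.
+class ExitBranch(nn.Module):
+    def __init__(self, in_channels=16, num_classes=10):
+        super(ExitBranch, self).__init__()
+        self.conv = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
+        self.pool = nn.AdaptiveAvgPool2d((4, 4))   # fix the spatial size feeding the classifier
+        self.fc = nn.Linear(32 * 4 * 4, num_classes)
+
+    def forward(self, x):
+        x = self.pool(F.relu(self.conv(x)))
+        return self.fc(x.view(x.size(0), -1))
+```
+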
+### ImageNet
+This example supports training and inference on the ImageNet dataset via several well-known deep architectures. ResNet-50 is the architecture of interest in this study; however, the exit is defined in the generic ResNet code and could be used with ResNets of other sizes. Two exits are inserted in this example. Again, the exit layers must have their sizes matched properly.
+
+## References
+<div id="panda"></div> **Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy**.
+    [*Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition*](https://arxiv.org/abs/1509.08971v6), arXiv:1509.08971v6, 2017.
+
+<div id="branchynet"></div> **Surat Teerapittayanon, Bradley McDanel, H. T. Kung**.
+    [*BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks*](http://arxiv.org/abs/1709.01686), arXiv:1709.01686, 2017.
diff --git a/docs-src/mkdocs.yml b/docs-src/mkdocs.yml
index 13c369994ebbe3b7bb38a84e798d442584d56c88..95d167cfa2b5771e22820bb1d3be34d12a269192 100755
--- a/docs-src/mkdocs.yml
+++ b/docs-src/mkdocs.yml
@@ -18,13 +18,15 @@ pages:
   - Usage: usage.md
   - Compression scheduling: schedule.md
   - Compressing models:
-    - 'Pruning': 'pruning.md'
-    - 'Regularization': 'regularization.md'
-    - 'Quantization': 'quantization.md'
-    - 'Knowledge Distillation': 'knowledge_distillation.md'
+    - Pruning: pruning.md
+    - Regularization: regularization.md
+    - Quantization: quantization.md
+    - Knowledge Distillation: knowledge_distillation.md
+    - Conditional Computation: conditional_computation.md
   - Algorithms:
     - Pruning: algo_pruning.md
     - Quantization: algo_quantization.md
+    - Early Exit: algo_earlyexit.md
   - Model Zoo: model_zoo.md
   - Jupyter notebooks: jupyter.md
   - Design: design.md
diff --git a/docs/404.html b/docs/404.html
index 7ad708f6021fa413f2c7a43c515bbe9d0671e86e..77abc7a3bbb341c1a6d94f35ac21a5a0e88df0b3 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -81,6 +81,10 @@
                     
     <a class="" href="/knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="/conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -96,6 +100,10 @@
                     
     <a class="" href="/algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="/algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/algo_pruning/index.html b/docs/algo_pruning/index.html
index aad639ee3ccae99cc80b3aae6095d91aa1e1c8ce..b5ec67ec5a072c39ec473bf623a6894213b7c057 100644
--- a/docs/algo_pruning/index.html
+++ b/docs/algo_pruning/index.html
@@ -88,6 +88,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -134,6 +138,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
@@ -300,7 +308,7 @@ abundant and gradually reduce the number of weights being pruned each time as th
         <a href="../algo_quantization/index.html" class="btn btn-neutral float-right" title="Quantization">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
-        <a href="../knowledge_distillation/index.html" class="btn btn-neutral" title="Knowledge Distillation"><span class="icon icon-circle-arrow-left"></span> Previous</a>
+        <a href="../conditional_computation/index.html" class="btn btn-neutral" title="Conditional Computation"><span class="icon icon-circle-arrow-left"></span> Previous</a>
       
     </div>
   
@@ -326,7 +334,7 @@ abundant and gradually reduce the number of weights being pruned each time as th
     <span class="rst-current-version" data-toggle="rst-current-version">
       
       
-        <span><a href="../knowledge_distillation/index.html" style="color: #fcfcfc;">&laquo; Previous</a></span>
+        <span><a href="../conditional_computation/index.html" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
         <span style="margin-left: 15px"><a href="../algo_quantization/index.html" style="color: #fcfcfc">Next &raquo;</a></span>
diff --git a/docs/algo_quantization/index.html b/docs/algo_quantization/index.html
index 5b13861658bb99da2d0ac73123bb429097d22a15..55c748c0704501bb567193eed498de5ad8bd2d8e 100644
--- a/docs/algo_quantization/index.html
+++ b/docs/algo_quantization/index.html
@@ -88,6 +88,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -121,6 +125,10 @@
 
     </ul>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
@@ -249,7 +257,7 @@ Note how the bias has to be re-scaled to match the scale of the summation.</p>
   
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="../model_zoo/index.html" class="btn btn-neutral float-right" title="Model Zoo">Next <span class="icon icon-circle-arrow-right"></span></a>
+        <a href="../algo_earlyexit/index.html" class="btn btn-neutral float-right" title="Early Exit">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
         <a href="../algo_pruning/index.html" class="btn btn-neutral" title="Pruning"><span class="icon icon-circle-arrow-left"></span> Previous</a>
@@ -281,7 +289,7 @@ Note how the bias has to be re-scaled to match the scale of the summation.</p>
         <span><a href="../algo_pruning/index.html" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
-        <span style="margin-left: 15px"><a href="../model_zoo/index.html" style="color: #fcfcfc">Next &raquo;</a></span>
+        <span style="margin-left: 15px"><a href="../algo_earlyexit/index.html" style="color: #fcfcfc">Next &raquo;</a></span>
       
     </span>
 </div>
diff --git a/docs/design/index.html b/docs/design/index.html
index ee6e7135fceae4fb3ff4156b1a0b9330d1ee0fce..4db0ce874a32e136bc9961b839a7c1553d5456a5 100644
--- a/docs/design/index.html
+++ b/docs/design/index.html
@@ -88,6 +88,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -103,6 +107,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/imgs/decision_boundary.png b/docs/imgs/decision_boundary.png
index a22c4c42c20cd31df791354bbc012655359d74d9..54c6da7e295985ff1096a3e27e21946f9efd606b 100644
Binary files a/docs/imgs/decision_boundary.png and b/docs/imgs/decision_boundary.png differ
diff --git a/docs/index.html b/docs/index.html
index dc3ad73a991337a7b800fe35cbc5d35634b95b2b..31cac2a2830ad085d66bbb3f28f400f592a358ee 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -102,6 +102,10 @@
                     
     <a class="" href="knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -117,6 +121,10 @@
                     
     <a class="" href="algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
@@ -250,5 +258,5 @@ And of course, if we used a sparse or compressed representation, then we are red
 
 <!--
 MkDocs version : 0.17.2
-Build Date UTC : 2018-11-04 12:28:21
+Build Date UTC : 2018-11-07 17:59:47
 -->
diff --git a/docs/install/index.html b/docs/install/index.html
index 494d64262f201c34487f91e28ed747a1d51fb2e1..685ff9aeb487472428fd27707c0b89003b2e8c91 100644
--- a/docs/install/index.html
+++ b/docs/install/index.html
@@ -104,6 +104,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -119,6 +123,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/jupyter/index.html b/docs/jupyter/index.html
index 35cbe13246aa3f1ce964236b09fa345c0834f0e7..fa51a5023f11bb043aeaaaa6c6ec8d490dd821e0 100644
--- a/docs/jupyter/index.html
+++ b/docs/jupyter/index.html
@@ -88,6 +88,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -103,6 +107,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/knowledge_distillation/index.html b/docs/knowledge_distillation/index.html
index b140d8cdc4ff0a64430c9adcd7fed0d4eaf81231..28d1ea69c61a927d9e440022fc3e51f9841e63fb 100644
--- a/docs/knowledge_distillation/index.html
+++ b/docs/knowledge_distillation/index.html
@@ -102,6 +102,10 @@
 
     </ul>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -117,6 +121,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
@@ -222,7 +230,7 @@
   
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="../algo_pruning/index.html" class="btn btn-neutral float-right" title="Pruning">Next <span class="icon icon-circle-arrow-right"></span></a>
+        <a href="../conditional_computation/index.html" class="btn btn-neutral float-right" title="Conditional Computation">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
         <a href="../quantization/index.html" class="btn btn-neutral" title="Quantization"><span class="icon icon-circle-arrow-left"></span> Previous</a>
@@ -254,7 +262,7 @@
         <span><a href="../quantization/index.html" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
-        <span style="margin-left: 15px"><a href="../algo_pruning/index.html" style="color: #fcfcfc">Next &raquo;</a></span>
+        <span style="margin-left: 15px"><a href="../conditional_computation/index.html" style="color: #fcfcfc">Next &raquo;</a></span>
       
     </span>
 </div>
diff --git a/docs/model_zoo/index.html b/docs/model_zoo/index.html
index 55ad7b86187716c9958a267c0933a2dc4c53625c..3c710cb370205e316932d98a41e3141ab04e830e 100644
--- a/docs/model_zoo/index.html
+++ b/docs/model_zoo/index.html
@@ -88,6 +88,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -103,6 +107,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
@@ -462,7 +470,7 @@ Top1: 92.830 and Top5: 99.760</p>
         <a href="../jupyter/index.html" class="btn btn-neutral float-right" title="Jupyter notebooks">Next <span class="icon icon-circle-arrow-right"></span></a>
       
       
-        <a href="../algo_quantization/index.html" class="btn btn-neutral" title="Quantization"><span class="icon icon-circle-arrow-left"></span> Previous</a>
+        <a href="../algo_earlyexit/index.html" class="btn btn-neutral" title="Early Exit"><span class="icon icon-circle-arrow-left"></span> Previous</a>
       
     </div>
   
@@ -488,7 +496,7 @@ Top1: 92.830 and Top5: 99.760</p>
     <span class="rst-current-version" data-toggle="rst-current-version">
       
       
-        <span><a href="../algo_quantization/index.html" style="color: #fcfcfc;">&laquo; Previous</a></span>
+        <span><a href="../algo_earlyexit/index.html" style="color: #fcfcfc;">&laquo; Previous</a></span>
       
       
         <span style="margin-left: 15px"><a href="../jupyter/index.html" style="color: #fcfcfc">Next &raquo;</a></span>
diff --git a/docs/pruning/index.html b/docs/pruning/index.html
index 00052fb0db3645c64ed779cbfd6a038b91477773..f97f44393174a0a2e4bcf08b0e105cf3ac5724cf 100644
--- a/docs/pruning/index.html
+++ b/docs/pruning/index.html
@@ -110,6 +110,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -125,6 +129,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/quantization/index.html b/docs/quantization/index.html
index 99b66221b3c1392fbb5c8bbe5553914c5ee1f157..0fd0d831968baa6b1963d9541da21df21ec8e8ec 100644
--- a/docs/quantization/index.html
+++ b/docs/quantization/index.html
@@ -110,6 +110,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -125,6 +129,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/regularization/index.html b/docs/regularization/index.html
index e6bdf1b9a85987f9553e1e69c1febfcb9bd1f342..1db9ffcfce61da0dc257673df771f9e6c64be6d7 100644
--- a/docs/regularization/index.html
+++ b/docs/regularization/index.html
@@ -104,6 +104,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -119,6 +123,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/schedule/index.html b/docs/schedule/index.html
index 8634d178c72cde3cf3a4de07d07f4dac704bb32d..e08da6cafb69e587fd0e4f076f6bd0bb83a41695 100644
--- a/docs/schedule/index.html
+++ b/docs/schedule/index.html
@@ -110,6 +110,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -125,6 +129,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/search.html b/docs/search.html
index 60998fdbfa0cddbf439d40e91f33db197c0e656e..92b365c29957da3093e85ad15d3ea8feed031f1c 100644
--- a/docs/search.html
+++ b/docs/search.html
@@ -81,6 +81,10 @@
                     
     <a class="" href="knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -96,6 +100,10 @@
                     
     <a class="" href="algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>
           
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
index 04ede462541c8edec2d14fceb4fc4a87df5c59dc..ef121f6e0e5ef88f86c8063fb29b646223a819e3 100644
--- a/docs/search/search_index.json
+++ b/docs/search/search_index.json
@@ -325,6 +325,21 @@
             "text": "Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil . Model Compression.  KDD, 2006   Geoffrey Hinton, Oriol Vinyals and Jeff Dean . Distilling the Knowledge in a Neural Network.  arxiv:1503.02531   Hokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda . Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks.  DAC, 2017   Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy.  ICLR, 2018   Antonio Polino, Razvan Pascanu and Dan Alistarh . Model compression via distillation and quantization.  ICLR, 2018   Anubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani . N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning.  ICLR, 2018   Lucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r . Faster gaze prediction with dense networks and Fisher pruning.  arxiv:1801.05787",
             "title": "References"
         },
+        {
+            "location": "/conditional_computation/index.html",
+            "text": "Conditional Computation\n\n\nConditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced.\nTo quote \nBengio et. al\n\n\n\n\n\"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\"\n\n\n\n\nAs usual, there are several approaches to implement Conditional Computation:\n\n\n\n\nSun et. al\n use several expert CNN, each trained on a different task, and combine them to one large network.\n\n\nZheng et. al\n use cascading, an idea which may be familiar to you from Viola-Jones face detection.\n\n\nTheodorakopoulos et. al\n add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module).\n\n\nIoannou et. al\n introduce Conditional Networks: that \"can be thought of as: i) decision trees augmented with data transformation\noperators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\"\n\n\nBolukbasi et. al\n \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\"\n\n\n\n\nConditional Computation is especially useful for real-time, latency-sensitive applicative.\n\nIn Distiller we currently have implemented a variant of Early Exit.\n\n\nReferences\n\n\n \nEmmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup.\n\n    \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n, arXiv:1511.06297v2, 2016.\n\n\n\n\n\nY. Sun, X.Wang, and X. Tang.\n\n    \nDeep Convolutional Network Cascade for Facial Point Detection\n. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014\n\n\n\n\n\nX. Zheng, W.Ouyang, and X.Wang.\n \nMulti-Stage Contextual Deep Learning for Pedestrian Detection.\n In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014.\n\n\n\n\n\nI. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis1.\n \nParsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules.\n Irida Labs S.A, January 2017\n\n\n\n\n\nTolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama\n \nAdaptive Neural Networks for Efficient Inference\n.  Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017.\n\n\n\n\n\nYani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi\n.\n    \nDecision Forests, Convolutional Networks and the Models in-Between\n, arXiv:1511.06297v2, 2016.",
+            "title": "Conditional Computation"
+        },
+        {
+            "location": "/conditional_computation/index.html#conditional-computation",
+            "text": "Conditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced.\nTo quote  Bengio et. al   \"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\"   As usual, there are several approaches to implement Conditional Computation:   Sun et. al  use several expert CNN, each trained on a different task, and combine them to one large network.  Zheng et. al  use cascading, an idea which may be familiar to you from Viola-Jones face detection.  Theodorakopoulos et. al  add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module).  Ioannou et. al  introduce Conditional Networks: that \"can be thought of as: i) decision trees augmented with data transformation\noperators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\"  Bolukbasi et. al  \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\"   Conditional Computation is especially useful for real-time, latency-sensitive applicative. \nIn Distiller we currently have implemented a variant of Early Exit.",
+            "title": "Conditional Computation"
+        },
+        {
+            "location": "/conditional_computation/index.html#references",
+            "text": "Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup. \n     Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1511.06297v2, 2016.   Y. Sun, X.Wang, and X. Tang. \n     Deep Convolutional Network Cascade for Facial Point Detection . In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014   X. Zheng, W.Ouyang, and X.Wang.   Multi-Stage Contextual Deep Learning for Pedestrian Detection.  In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014.   I. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis1.   Parsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules.  Irida Labs S.A, January 2017   Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama   Adaptive Neural Networks for Efficient Inference .  Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017.   Yani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi .\n     Decision Forests, Convolutional Networks and the Models in-Between , arXiv:1511.06297v2, 2016.",
+            "title": "References"
+        },
         {
             "location": "/algo_pruning/index.html",
             "text": "Weights pruning algorithms\n\n\n\n\nMagnitude pruner\n\n\nThis is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor.  A different threshold can be used for each layer's weights tensor.\n\nBecause the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.\n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\nSensitivity pruner\n\n\nFinding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values.  We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor.\n\n\nThe diagram below shows the distribution the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model.  You can see that they have an approximate Gaussian distribution.\n\n\n \n\n\nThe distributions of Alexnet conv1 and fc1 layers\n\n\nWe use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors.  For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor.  Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements.  \n\n\n\\[ thresh(w_i)=\\left\\lbrace\n\\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} }\n\\right\\rbrace \\]\n\n\n\\[\n\\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model\n\\]\n\n\nHow do we choose this \\(s\\) multiplier?\n\n\nIn \nLearning both Weights and Connections for Efficient Neural Networks\n the authors write:\n\n\n\n\n\"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\n\n\n\n\nSo the results of executing pruning sensitivity analysis on the tensor, gives us a good starting guess at \\(s\\).  Sensitivity analysis is an empirical method, and we still have to spend time to hone in on the exact multiplier value.\n\n\nMethod of operation\n\n\n\n\nStart by running a pruning sensitivity analysis on the model.  \n\n\nThen use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.\n\n\n\n\nSchedule\n\n\nIn their \npaper\n Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step.  
Distiller's \nSensitivityPruner\n works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements gets pruned.\n\n\nThis actually works quite well as we can see in the diagram below.  This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.\n\n\n\nWe use a simple iterative-pruning schedule such as: \nPrune every second epoch starting at epoch 0, and ending at epoch 38.\n  This excerpt from \nalexnet.schedule_sensitivity.yaml\n shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n\n\n\nLevel pruner\n\n\nClass \nSparsityLevelParameterPruner\n uses a similar method to go around specifying specific thresholding magnitudes.\nInstead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity).  Essentially this pruner also uses a pruning criteria based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.\n\nThis pruner is much more stable compared to \nSensitivityPruner\n because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's \nSensitivityPruner\n is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution.  Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires much more hyper-parameters (this is the reason we have not implemented it thus far).  \n\n\nTo set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each\n\n\nMethod of operation\n\n\n\n\nSort the weights in the specified layer by their absolute values. 
\n\n\nMask to zero the smallest magnitude weights until the desired sparsity level is reached.\n\n\n\n\nAutomated gradual pruner (AGP)\n\n\nIn \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n, authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in \nAutomatedGradualPruner\n.\n\n\n\n\n\n\"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a \ufb01nal sparsity value \\(s_f\\) over a span of n pruning steps.\nThe intuition behind this sparsity function in equation (1)  is to prune the network rapidly in the initial phase when the redundant connections are\nabundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\"\"\n\n\n\n\n\n\nYou can play with the scheduling parameters in the \nagp_schedule.ipynb notebook\n.\n\n\nThe authors describe AGP:\n\n\n\n\n\n\nOur automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.\n\n\nDoesn't require much hyper-parameter tuning\n\n\nShown to perform well across different models\n\n\nDoes not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.\n\n\n\n\n\n\nRNN pruner\n\n\nThe authors of \nExploring Sparsity in Recurrent Neural Networks\n, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\"  They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training.  They show pruning of RNN, GRU, LSTM and embedding layers.\n\n\nDistiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.\n\n\n\n\nStructure pruners\n\n\nElement-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation.  Structure pruners, remove entire \"structures\", such as kernels, filters, and even entire feature-maps.\n\n\nRanked structure pruner\n\n\nThe \nL1RankedStructureParameterPruner\n pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the \nm\n lowest ranking structures are pruned away.  Currently this pruner only performs ranking of filters (3D structures) and it uses the mean of the absolute value of the tensor as the representative of the filter magnitude.  The absolute mean does not depend on the size of the filter, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm.\n\n\nIn \nPruning Filters for Efficient ConvNets\n the authors use filter ranking, with \none-shot pruning\n followed by fine-tuning.  The authors of \nExploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition\n also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:\n\n\n\n\nFirst, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. 
Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)\n\n\n\n\nActivation-influenced pruner\n\n\nThe motivation for this pruner, is that if a feature-map produces very small activations, then this feature-map is not very important, and can be pruned away.\n- \nStatus: not implemented",
@@ -435,6 +450,51 @@
             "text": "We've implemented  convolution  and  FC  using this method.     They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the  RangeLinearQuantParamLayerWrapper  class.    All other layers are unaffected and are executed using their original FP32 implementation.    To automatically transform an existing model to a quantized model using this method, use the  SymmetricLinearQuantizer  class.  For weights and bias the scale factor is determined once at quantization setup (\"offline\"), and for activations it is determined dynamically at runtime (\"online\").    Important note:  Currently, this method is implemented as  inference only , with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with  n < 8  is likely to lead to severe accuracy degradation for any non-trivial workload.",
             "title": "Implementation"
         },
+        {
+            "location": "/algo_earlyexit/index.html",
+            "text": "Early Exit Inference\n\n\nWhile Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in \nBranchyNet: Fast Inference via Early Exiting from Deep Neural Networks\n look at a selective approach to exit placement and criteria for exiting early.\n\n\nWhy Does Early Exit Work?\n\n\nEarly Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries.\n\n\n\n\nData points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and require the full expressiveness of the neural network to accurately classify it.\n\n\nExample code for Early Exit\n\n\nBoth CIFAR10 and ImageNet code comes directly from publically available examples from Pytorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.\n\n\nDeeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.\n\n\nNote that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.\n\n\nHeuristics\n\n\nThe insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more agressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.\n\n\nThere are other benefits to adding exits in that training the modified network now has backpropagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.\n\n\nEarly Exit Hyperparameters\n\n\nThere are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:\n\n\n\n\n\n\n--earlyexit_thresholds\n defines the\nthresholds for each of the early exits. The cross entropy measure must be \nless than\n the specified threshold to take a specific exit, otherwise the data continues along the regular path. 
For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.\n\n\n\n\n\n\n--earlyexit_lossweights\n provide the weights for the linear combination of losses during training to compute a signle, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.\n\n\n\n\n\n\nCIFAR10\n\n\nIn the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.\n\n\nImageNet\n\n\nThis supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic resnet code and could be used with other size resnets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.\n\n\nReferences\n\n\n \nPriyadarshini Panda, Abhronil Sengupta, Kaushik Roy\n.\n    \nConditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition\n, arXiv:1509.08971v6, 2017.\n\n\n\n\n\nSurat Teerapittayanon, Bradley McDanel, H. T. Kung\n.\n    \nBranchyNet: Fast Inference via Early Exiting from Deep Neural Networks\n, arXiv:1709.01686, 2017.",
+            "title": "Early Exit"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#early-exit-inference",
+            "text": "While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al in  Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition  points out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Surat et al in  BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks  look at a selective approach to exit placement and criteria for exiting early.",
+            "title": "Early Exit Inference"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#why-does-early-exit-work",
+            "text": "Early Exit is a strategy with a straightforward and easy to understand concept Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries.   Data points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and require the full expressiveness of the neural network to accurately classify it.",
+            "title": "Why Does Early Exit Work?"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#example-code-for-early-exit",
+            "text": "Both CIFAR10 and ImageNet code comes directly from publically available examples from Pytorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work.  Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively.  Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.",
+            "title": "Example code for Early Exit"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#heuristics",
+            "text": "The insertion of the exits are ad-hoc, but there are some heuristic principals guiding their placement and parameters. The earlier exits are placed, the more agressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy.  There are other benefits to adding exits in that training the modified network now has backpropagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.",
+            "title": "Heuristics"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#early-exit-hyperparameters",
+            "text": "There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit:    --earlyexit_thresholds  defines the\nthresholds for each of the early exits. The cross entropy measure must be  less than  the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits.    --earlyexit_lossweights  provide the weights for the linear combination of losses during training to compute a signle, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) are equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more agressive early exits, but perhaps with a slight negative effect on accuracy.",
+            "title": "Early Exit Hyperparameters"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#cifar10",
+            "text": "In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself includes a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.",
+            "title": "CIFAR10"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#imagenet",
+            "text": "This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study, however the exit is defined in the generic resnet code and could be used with other size resnets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.",
+            "title": "ImageNet"
+        },
+        {
+            "location": "/algo_earlyexit/index.html#references",
+            "text": "Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy .\n     Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1509.08971v6, 2017.   Surat Teerapittayanon, Bradley McDanel, H. T. Kung .\n     BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks , arXiv:1709.01686, 2017.",
+            "title": "References"
+        },
         {
             "location": "/model_zoo/index.html",
             "text": "Distiller Model Zoo\n\n\nHow to contribute models to the Model Zoo\n\n\nWe encourage you to contribute new models to the Model Zoo.  We welcome implementations of published papers or of your own work.  To assure that models and algorithms shared with others are high-quality, please commit your models with the following:\n\n\n\n\nCommand-line arguments\n\n\nLog files\n\n\nPyTorch model\n\n\n\n\nContents\n\n\nThe Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models.  Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers.  These are meant to serve as examples of how Distiller can be used.\n\n\nEach model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs.\n\n\n\n\ntable, th, td {\n    border: 1px solid black;\n}\n\n\n\n\n  \n\n    \nPaper\n\n    \nDataset\n\n    \nNetwork\n\n    \nMethod & Granularity\n\n    \nSchedule\n\n    \nFeatures\n\n  \n\n  \n\n    \nLearning both Weights and Connections for Efficient Neural Networks\n\n    \nImageNet\n\n    \nAlexnet\n\n    \nElement-wise pruning\n\n    \nIterative; Manual\n\n    \nMagnitude thresholding based on a sensitivity quantifier.\nElement-wise sparsity sensitivity analysis\n\n  \n\n  \n\n    \nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n    \nImageNet\n\n    \nMobileNet\n\n    \nElement-wise pruning\n\n    \nAutomated gradual; Iterative\n\n    \nMagnitude thresholding based on target level\n\n  \n\n  \n\n    \nLearning Structured Sparsity in Deep Neural Networks\n\n    \nCIFAR10\n\n    \nResNet20\n\n    \nGroup regularization\n\n    \n1.Train with group-lasso\n2.Remove zero groups and fine-tune\n\n    \nGroup Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols)\n\n  \n\n  \n\n    \nPruning Filters for Efficient ConvNets\n\n    \nCIFAR10\n\n    \nResNet56\n\n    \nFilter ranking; guided by sensitivity analysis\n\n    \n1.Rank filters\n2. Remove filters and channels\n3.Fine-tune\n\n    \nOne-shot ranking and pruning of filters; with network thinning\n  \n\n\n\n\nLearning both Weights and Connections for Efficient Neural Networks\n\n\nThis schedule is an example of \"Iterative Pruning\" for Alexnet/Imagent, as described in chapter 3 of Song Han's PhD dissertation: \nEfficient Methods and Hardware for Deep Learning\n and in his paper \nLearning both Weights and Connections for Efficient Neural Networks\n.  \n\n\nThe Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\".  Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layers weights,\" and this is not explained much further.  In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and\nis based on the values learned from performing sensitivity analysis.  Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weights tensors are distributed normally, the standard deviation acts as a threshold normalizer.\n\n\nNote that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once.  
In his PhD dissertation, Song Han describes a growing threshold, at each iteration.  This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration.  Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights.  Thus, we can use less hyper-parameters and achieve the same results.\n\n\n\n\nDistiller schedule: \ndistiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml\n\n\nCheckpoint file: \nalexnet.checkpoint.89.pth.tar\n\n\n\n\nResults\n\n\nOur reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09.  We prune away 88.44% of the parameters and achieve  Top1=56.61 and Top5=79.45.\nSong Han prunes 89% of the parameters, which is slightly better than our results.\n\n\nParameters:\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                      | Shape            |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean\n|----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | features.module.0.weight  | (64, 3, 11, 11)  |         23232 |          13411 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   42.27359 | 0.14391 | -0.00002 |    0.08805 |\n|  1 | features.module.3.weight  | (192, 64, 5, 5)  |        307200 |         115560 |    0.00000 |    0.00000 |  0.00000 |  1.91243 |  0.00000 |   62.38281 | 0.04703 | -0.00250 |    0.02289 |\n|  2 | features.module.6.weight  | (384, 192, 3, 3) |        663552 |         256565 |    0.00000 |    0.00000 |  0.00000 |  6.18490 |  0.00000 |   61.33445 | 0.03354 | -0.00184 |    0.01803 |\n|  3 | features.module.8.weight  | (256, 384, 3, 3) |        884736 |         315065 |    0.00000 |    0.00000 |  0.00000 |  6.96411 |  0.00000 |   64.38881 | 0.02646 | -0.00168 |    0.01422 |\n|  4 | features.module.10.weight | (256, 256, 3, 3) |        589824 |         186938 |    0.00000 |    0.00000 |  0.00000 | 15.49225 |  0.00000 |   68.30614 | 0.02714 | -0.00246 |    0.01409 |\n|  5 | classifier.1.weight       | (4096, 9216)     |      37748736 |        3398881 |    0.00000 |    0.21973 |  0.00000 |  0.21973 |  0.00000 |   90.99604 | 0.00589 | -0.00020 |    0.00168 |\n|  6 | classifier.4.weight       | (4096, 4096)     |      16777216 |        1782769 |    0.21973 |    3.46680 |  0.00000 |  3.46680 |  0.00000 |   89.37387 | 0.00849 | -0.00066 |    0.00263 |\n|  7 | classifier.6.weight       | (1000, 4096)     |       4096000 |         994738 |    3.36914 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   75.71440 | 0.01718 |  0.00030 |    0.00778 |\n|  8 | Total sparsity:           | -                |      61090496 |        7063928 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   88.43694 | 0.00000 |  0.00000 |    0.00000 |\n+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n 2018-04-04 21:30:52,499 - Total sparsity: 88.44\n\n 2018-04-04 21:30:52,499 
- --- validate (epoch=89)-----------\n 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch)\n 2018-04-04 21:31:35,357 - ==> Top1: 51.838    Top5: 74.817    Loss: 2.150\n\n 2018-04-04 21:31:39,251 - --- test ---------------------\n 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch)\n 2018-04-04 21:32:01,274 - ==> Top1: 56.606    Top5: 79.446    Loss: 1.893\n\n\n\n\nTo prune, or not to prune: exploring the efficacy of pruning for model compression\n\n\nIn their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their\nsmaller, but dense (small-dense) counterparts with identical memory footprint.\"\nThey also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with\nminimal tuning.\"\n\n\nThis pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps.  Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper.  The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs.  We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size.\n\n\nImageNet files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResNet18 files:\n\n\n\n\nDistiller schedule: \ndistiller/examples/agp-pruning/resnet18.schedule_agp.yaml\n\n\nCheckpoint file: \ncheckpoint.pth.tar\n\n\n\n\nResults\n\n\nAs our baseline we used a \npretrained PyTorch MobileNet model\n (width=1) which has Top1=68.848 and Top5=88.740.\n\nIn their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy.  We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656).  We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper.  
\n\n\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\n|    | Name                     | Shape              |   NNZ (dense) |   NNZ (sparse) |   Cols (%) |   Rows (%) |   Ch (%) |   2D (%) |   3D (%) |   Fine (%) |     Std |     Mean |   Abs-Mean |\n|----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------|\n|  0 | module.model.0.0.weight  | (32, 3, 3, 3)      |           864 |            864 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.14466 |  0.00103 |    0.06508 |\n|  1 | module.model.1.0.weight  | (32, 1, 3, 3)      |           288 |            288 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.32146 |  0.01020 |    0.12932 |\n|  2 | module.model.1.3.weight  | (64, 32, 1, 1)     |          2048 |           2048 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11942 |  0.00024 |    0.03627 |\n|  3 | module.model.2.0.weight  | (64, 1, 3, 3)      |           576 |            576 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.15809 |  0.00543 |    0.11513 |\n|  4 | module.model.2.3.weight  | (128, 64, 1, 1)    |          8192 |           8192 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08442 | -0.00031 |    0.04182 |\n|  5 | module.model.3.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.16780 |  0.00125 |    0.10545 |\n|  6 | module.model.3.3.weight  | (128, 128, 1, 1)   |         16384 |          16384 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07126 | -0.00197 |    0.04123 |\n|  7 | module.model.4.0.weight  | (128, 1, 3, 3)     |          1152 |           1152 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.10182 |  0.00171 |    0.08719 |\n|  8 | module.model.4.3.weight  | (256, 128, 1, 1)   |         32768 |          13108 |    0.00000 |    0.00000 | 10.15625 | 59.99756 | 12.50000 |   59.99756 | 0.05543 | -0.00002 |    0.02760 |\n|  9 | module.model.5.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.12516 | -0.00288 |    0.08058 |\n| 10 | module.model.5.3.weight  | (256, 256, 1, 1)   |         65536 |          26215 |    0.00000 |    0.00000 | 12.50000 | 59.99908 | 23.82812 |   59.99908 | 0.04453 |  0.00002 |    0.02271 |\n| 11 | module.model.6.0.weight  | (256, 1, 3, 3)     |          2304 |           2304 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08024 |  0.00252 |    0.06377 |\n| 12 | module.model.6.3.weight  | (512, 256, 1, 1)   |        131072 |          52429 |    0.00000 |    0.00000 | 23.82812 | 59.99985 | 14.25781 |   59.99985 | 0.03561 | -0.00057 |    0.01779 |\n| 13 | module.model.7.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.11008 | -0.00018 |    0.06829 |\n| 14 | module.model.7.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 14.25781 | 59.99985 | 21.28906 |   59.99985 | 0.02944 | -0.00060 |    0.01515 |\n| 15 | module.model.8.0.weight  | 
(512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.08258 |  0.00370 |    0.04905 |\n| 16 | module.model.8.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 21.28906 | 59.99985 | 28.51562 |   59.99985 | 0.02865 | -0.00046 |    0.01465 |\n| 17 | module.model.9.0.weight  | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07578 |  0.00468 |    0.04201 |\n| 18 | module.model.9.3.weight  | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 28.51562 | 59.99985 | 23.43750 |   59.99985 | 0.02939 | -0.00044 |    0.01511 |\n| 19 | module.model.10.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.07091 |  0.00014 |    0.04306 |\n| 20 | module.model.10.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 24.60938 | 59.99985 | 20.89844 |   59.99985 | 0.03095 | -0.00059 |    0.01672 |\n| 21 | module.model.11.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.05729 | -0.00518 |    0.04267 |\n| 22 | module.model.11.3.weight | (512, 512, 1, 1)   |        262144 |         104858 |    0.00000 |    0.00000 | 20.89844 | 59.99985 | 17.57812 |   59.99985 | 0.03229 | -0.00044 |    0.01797 |\n| 23 | module.model.12.0.weight | (512, 1, 3, 3)     |          4608 |           4608 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.04981 | -0.00136 |    0.03967 |\n| 24 | module.model.12.3.weight | (1024, 512, 1, 1)  |        524288 |         209716 |    0.00000 |    0.00000 | 16.01562 | 59.99985 | 44.23828 |   59.99985 | 0.02514 | -0.00106 |    0.01278 |\n| 25 | module.model.13.0.weight | (1024, 1, 3, 3)    |          9216 |           9216 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |    0.00000 | 0.02396 | -0.00949 |    0.01549 |\n| 26 | module.model.13.3.weight | (1024, 1024, 1, 1) |       1048576 |         419431 |    0.00000 |    0.00000 | 44.72656 | 59.99994 |  1.46484 |   59.99994 | 0.01801 | -0.00017 |    0.00931 |\n| 27 | module.fc.weight         | (1000, 1024)       |       1024000 |         409600 |    1.46484 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   60.00000 | 0.05078 |  0.00271 |    0.02734 |\n| 28 | Total sparsity:          | -                  |       4209088 |        1726917 |    0.00000 |    0.00000 |  0.00000 |  0.00000 |  0.00000 |   58.97171 | 0.00000 |  0.00000 |    0.00000 |\n+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+\nTotal sparsity: 58.97\n\n--- validate (epoch=199)-----------\n128116 samples (256 per mini-batch)\n==> Top1: 65.337    Top5: 84.984    Loss: 1.494\n\n--- test ---------------------\n50000 samples (256 per mini-batch)\n==> Top1: 68.810    Top5: 88.626    Loss: 1.282\n\n\n\n\n\nLearning Structured Sparsity in Deep Neural Networks\n\n\nThis research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. 
SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\"\n\n\nNote that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group.  We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength.  At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit.  Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).    \n\n\nBaseline training\n\n\nWe started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/ssl/resnet20_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ndistiller/examples/ssl/checkpoints/\n\n\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic\n\n\n\n\nRegularization\n\n\nThen we started training from scratch again, but this time we used Group Lasso regularization on entire layers:\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_4L_training.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic\n\n\n\n\nThe diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10  baseline (in red).  You may notice several interesting things:\n1. The LR-decay policy is the same, but the two sessions start with different initial LR values.\n2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge.\n3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better.\n4. In the top right corner we see the behavior of the regularization loss (\nReg Loss\n), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping.\n\n\n\nThis \nregularization\n yields 5 layers with zeroed weight tensors.  We load this model, remove the 5 layers, and start the fine tuning of the weights.  This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path.  When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated.  \n\n\nWe managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time.  
It's not bad, but we probably could have done better.\n\n\nFine-tuning\n\n\nDuring the \nfine-tuning\n process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropagated: therefore they are completely disconnected from the network.\n\nWe copy the checkpoint file of the regularized model to \ncheckpoint_trained_4D_regularized_5Lremoved.pth.tar\n.\n\nDistiller schedule: \ndistiller/examples/ssl/ssl_4D-removal_finetuning.yaml\n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml  -j=1 --deterministic\n\n\n\n\nResults\n\n\nOur baseline results for ResNet20 Cifar are: Top1=91.450 and  Top5=99.750\n\n\nWe used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with no degradation of the accuracies.\n\nThe regularized model exhibits really poor classification abilities: \n\n\n$ time python3 compress_classifier.py --arch resnet20_cifar  ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate\n\n=> loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar\n   best top@1: 90.620\nLoaded compression schedule from checkpoint (epoch 179)\nRemoving layer: module.layer1.0.conv1 [layer=0 block=0 conv=0]\nRemoving layer: module.layer1.0.conv2 [layer=0 block=0 conv=1]\nRemoving layer: module.layer1.1.conv1 [layer=0 block=1 conv=0]\nRemoving layer: module.layer1.1.conv2 [layer=0 block=1 conv=1]\nRemoving layer: module.layer2.2.conv2 [layer=1 block=2 conv=1]\nFiles already downloaded and verified\nFiles already downloaded and verified\nDataset sizes:\n        training=45000\n        validation=5000\n        test=10000\n--- test ---------------------\n10000 samples (256 per mini-batch)\n==> Top1: 22.290    Top5: 68.940    Loss: 5.172\n\n\n\n\nHowever, after fine-tuning, we recovered most of the accuracy loss, but not quite all of it: Top1=91.020 and Top5=99.670\n\n\nWe didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).\n\n\nPruning Filters for Efficient ConvNets\n\n\nQuoting the authors directly:\n\n\n\n\nWe present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly.\nIn contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.\n\n\n\n\nThe implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\".\n\n\nAfter performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune to the schedule-prescribed sparsity level.  
\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml\n\n\nCheckpoint files: \ncheckpoint_finetuned.pth.tar\n\n\n\n\nThe excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner.  This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level and the structure type ('3D' is filter-wise pruning).\n\n\npruners:\n  filter_pruner:\n    class: 'L1RankedStructureParameterPruner'\n    reg_regims:\n      'module.layer1.0.conv1.weight': [0.6, '3D']\n      'module.layer1.1.conv1.weight': [0.6, '3D']\n      'module.layer1.2.conv1.weight': [0.6, '3D']\n      'module.layer1.3.conv1.weight': [0.6, '3D']\n\n\n\n\nIn the policy, we specify that we want to invoke this pruner once, at epoch 180.  Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule.\n\n\npolicies:\n  - pruner:\n      instance_name: filter_pruner\n    epochs: [180]\n\n\n\n\n\nFollowing the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors.  When we remove filters from Convolution layer \nn\n we need to perform several changes to the network:\n1. Shrink layer \nn\n's weights tensor, leaving only the \"important\" filters.\n2. Configure layer \nn\n's \n.out_channels\n member to its new, smaller value.\n3. If a BN layer follows layer \nn\n, then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk.\n4. If a Convolution layer follows the BN layer, then it will have fewer input channels, which requires reconfiguration and shrinking of its weights.\n\n\nAll of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180.  We call this process \"network thinning\".\n\n\nextensions:\n  net_thinner:\n      class: 'FilterRemover'\n      thinning_func_str: remove_filters\n      arch: 'resnet56_cifar'\n      dataset: 'cifar10'\n\n\n\n\nNetwork thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this.  On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there are extra details to consider.\n\nOur current implementation is specific to certain layers in ResNet and is a bit fragile.  We will continue to improve and generalize this.\n\n\nBaseline training\n\n\nWe started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model.\n\n\n\n\nDistiller schedule: \ndistiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml\n\n\nCheckpoint files: \ncheckpoint.resnet56_cifar_baseline.pth.tar\n\n\n\n\nResults\n\n\nWe trained a ResNet56-Cifar10 network and achieved accuracy results which are on-par with published results:\nTop1: 92.970 and Top5: 99.740.\n\n\nWe used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline:\nTop1: 92.830 and Top5: 99.760",
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 71406c1ebb9a885491fd33b54affc83d416dfd71..f870492024fe54efcd38f9354d9b50fbde0c709f 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
     
     <url>
      <loc>/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -12,7 +12,7 @@
     
     <url>
      <loc>/install/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -20,7 +20,7 @@
     
     <url>
      <loc>/usage/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -28,7 +28,7 @@
     
     <url>
      <loc>/schedule/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -37,25 +37,31 @@
         
     <url>
      <loc>/pruning/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/regularization/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/quantization/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/knowledge_distillation/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
+     <changefreq>daily</changefreq>
+    </url>
+        
+    <url>
+     <loc>/conditional_computation/index.html</loc>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -65,13 +71,19 @@
         
     <url>
      <loc>/algo_pruning/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
     <url>
      <loc>/algo_quantization/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
+     <changefreq>daily</changefreq>
+    </url>
+        
+    <url>
+     <loc>/algo_earlyexit/index.html</loc>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
         
@@ -80,7 +92,7 @@
     
     <url>
      <loc>/model_zoo/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -88,7 +100,7 @@
     
     <url>
      <loc>/jupyter/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
@@ -96,7 +108,7 @@
     
     <url>
      <loc>/design/index.html</loc>
-     <lastmod>2018-11-04</lastmod>
+     <lastmod>2018-11-07</lastmod>
      <changefreq>daily</changefreq>
     </url>
     
diff --git a/docs/usage/index.html b/docs/usage/index.html
index edb0b610a47cc584df7218073cc7028e5358a10d..e62a0a31eb2dd0d4adaf14335312fd281d1d8d6e 100644
--- a/docs/usage/index.html
+++ b/docs/usage/index.html
@@ -120,6 +120,10 @@
                     
     <a class="" href="../knowledge_distillation/index.html">Knowledge Distillation</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../conditional_computation/index.html">Conditional Computation</a>
+                </li>
     </ul>
 	    </li>
           
@@ -135,6 +139,10 @@
                     
     <a class="" href="../algo_quantization/index.html">Quantization</a>
                 </li>
+                <li class="">
+                    
+    <a class="" href="../algo_earlyexit/index.html">Early Exit</a>
+                </li>
     </ul>
 	    </li>