Commit 8bcaaa53 authored by Guy Jacob

Updated early-exit docs (from @haim-barad)

parent 51880a22
# Early Exit Inference
While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several recent studies on the idea of exiting before the normal endpoint of the neural network. Panda et al. in [Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition](#panda) point out that many data points can be classified easily, requiring less processing than more difficult points, and they view this in terms of power savings. Surat et al. in [BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks](#branchynet) look at a selective approach to exit placement and criteria for exiting early.
## Why Does Early Exit Work?
Early Exit is a strategy with a straightforward and easy-to-understand concept. Figure !fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we're confident of avoiding over-fitting the data), it's also clear that much of the data can be properly classified with even the simplest of classification boundaries.
![Figure !fig(boundaries): Simple and more expressive classification boundaries](imgs/decision_boundary.png)
Data points far from the boundary can be considered "easy to classify" and achieve a high degree of confidence more quickly than data points close to the boundary. In fact, we can think of the area between the outer straight lines as the region that is "difficult to classify" and requires the full expressiveness of the neural network to classify accurately.
## Example code for Early Exit
Both the CIFAR10 and ImageNet code come directly from publicly available PyTorch examples. The only edits are the exits, which are inserted using a methodology similar to the BranchyNet work.
**Note:** The sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits.
Deeper networks can benefit from multiple exits. Our examples illustrate a single early exit for CIFAR10 and a pair of early exits for ImageNet.
Note that this code does not actually take exits. Instead, it computes loss and accuracy statistics assuming an exit is taken whenever its criteria are met. Actually implementing exits can be tricky and architecture-dependent, and we plan to address these issues.
### Example command lines
We have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is:
```bash
python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \
--lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \
--out-dir /home/ -n earlyexit /home/pcifar10
```
And an example command line for ImageNet is:
```bash
python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \
--lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \
-j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/
```
### Heuristics
The insertion of the exits is ad hoc, but there are some heuristic principles guiding their placement and parameters. The earlier an exit is placed, the more aggressive it is, as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, the earlier the exit, the smaller the percentage of data that can be directed through it if we are to preserve accuracy.

There are other benefits to adding exits: training the modified network now has back-propagation losses coming from the exits, which affect the earlier layers more substantially than the loss from the final exit alone. This effect mitigates problems such as vanishing gradients.
### Early Exit Hyperparameters
There are two parameters required to enable early exit. Leave them undefined if you are not enabling Early Exit:

1. **--earlyexit_thresholds** defines the thresholds for each of the early exits. The cross-entropy measure must be **less than** the specified threshold to take a specific exit; otherwise the data continues along the regular path. For example, specifying "--earlyexit_thresholds 0.9 1.2" implies two early exits with thresholds of 0.9 and 1.2, respectively, for taking those exits.

2. **--earlyexit_lossweights** provides the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including the final exit) equals 1.0. So, for example, "--earlyexit_lossweights 0.2 0.3" implies two early exits weighted with values of 0.2 and 0.3, respectively, with the final exit weighted at 1.0 - (0.2 + 0.3) = 0.5. Studies have shown that weighting the early exits more heavily creates more aggressive early exits, but perhaps with a slight negative effect on accuracy. (A minimal sketch of this weighted loss appears after this list.)
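To make the loss combination concrete, here is a minimal sketch of the weighting scheme described above, assuming PyTorch and hypothetical names (`combined_loss`, `exit_outputs`); it illustrates the arithmetic only and is not the actual training code in this repository:

```python
import torch.nn.functional as F

def combined_loss(exit_outputs, target, earlyexit_lossweights):
    """Linearly combine per-exit cross-entropy losses into one training loss.

    exit_outputs: list of logits tensors, early exits first, final exit last.
    earlyexit_lossweights: weights for the early exits only (e.g. [0.2, 0.3]);
        the final exit implicitly gets 1.0 - sum(earlyexit_lossweights).
    """
    assert len(exit_outputs) == len(earlyexit_lossweights) + 1
    weights = list(earlyexit_lossweights) + [1.0 - sum(earlyexit_lossweights)]
    # Weighted sum of the cross-entropy loss computed at every exit.
    return sum(w * F.cross_entropy(logits, target)
               for w, logits in zip(weights, exit_outputs))
```

With `--earlyexit_lossweights 0.2 0.3`, this assigns the final exit a weight of 0.5, matching the example above.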
### Output Stats
The example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy that would result if all of the data were forced out that exit (this is needed to compute the loss at that exit). During inference (i.e. the validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit, because the calculated entropy at that exit was lower than the specified threshold for that exit.
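As a rough illustration of the exit criterion used for these statistics, the sketch below computes a per-sample confidence measure at an exit — here the entropy of the exit's softmax output, in the spirit of BranchyNet — and flags which samples would be allowed to leave at that exit. The function name and exact measure are assumptions for illustration; the repository's statistics code may differ in detail:

```python
import torch

def exit_mask(exit_logits, threshold):
    """Boolean mask of samples confident enough to take this exit.

    A sample may exit if the entropy of its predicted class distribution
    at this exit is below the exit's threshold.
    """
    probs = torch.softmax(exit_logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)
    return entropy < threshold
```

During validation, the Top1/Top5 numbers reported for an exit would then be computed only over the samples selected by such a mask.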
### CIFAR10
In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself include a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.
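For illustration, an exit branch of the kind described here (one convolutional layer followed by a fully connected layer) might look like the sketch below. The channel count, pooled size, and class count are assumptions and must be matched to wherever the exit is attached in the backbone:

```python
import torch.nn as nn

class ExitBranch(nn.Module):
    """Minimal early-exit branch: conv -> ReLU -> pool -> fully connected."""
    def __init__(self, in_channels, num_classes=10, pooled_size=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)
        self.fc = nn.Linear(32 * pooled_size * pooled_size, num_classes)

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))
        return self.fc(x.flatten(1))   # logits for this exit
```

If the exit is moved, `in_channels` (and the flattened size feeding `self.fc`) must be updated to match the feature map at the new attachment point.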
### ImageNet
This supports training and inference on the ImageNet dataset via several well-known deep architectures. ResNet-50 is the architecture of interest in this study; however, the exit is defined in the generic ResNet code and could be used with ResNets of other sizes. There are two exits inserted in this example. Again, the exit layers must have their sizes matched properly.
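As a sketch of how a backbone with two early exits might expose all of its outputs (a toy example with hypothetical stage names, not the repository's ResNet-50 code), the forward pass can return the early-exit logits together with the final logits, so that the loss combination and statistics described above can be applied:

```python
import torch.nn as nn

class TwoExitNet(nn.Module):
    """Toy three-stage backbone with two early exits (for illustration only)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        self.exit0 = ExitBranch(64, num_classes)    # sized for stage1's output
        self.exit1 = ExitBranch(128, num_classes)   # sized for stage2's output
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.stage1(x)
        e0 = self.exit0(x)        # first early exit
        x = self.stage2(x)
        e1 = self.exit1(x)        # second early exit
        x = self.stage3(x)
        final = self.fc(self.pool(x).flatten(1))
        return [e0, e1, final]    # early exits first, final exit last
```

This reuses the `ExitBranch` sketch from the CIFAR10 section; the point is simply that each exit's input size must match the stage it is attached to.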
## References
<div id="panda"></div> **Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy**. <div id="panda"></div> **Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy**.
[*Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition*](https://arxiv.org/abs/1509.08971v6), arXiv:1509.08971v6, 2017.