Commit b9207bf7 authored by Guy Jacob

Docs: Fix broken images and links

parent 158602c5
......@@ -17,7 +17,7 @@ In this method we can use two modes - **asymmetric** and **symmetric**.
#### Asymmetric Mode
<p align="center">
<img src="../imgs/quant_asym.png"/>
<img src="imgs/quant_asym.png"/>
</p>
In **asymmetric** mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a **zero-point** (also called *quantization bias*, or *offset*) in addition to the scale factor.
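
For illustration, here is a minimal sketch of asymmetric quantization (not Distiller's exact implementation), mapping a tensor's min/max onto the unsigned integer range \([0, 2^n - 1]\) with a scale factor and a zero-point:

```python
import torch

def asymmetric_quantize(x_f, num_bits=8):
    # Map [min(x_f), max(x_f)] onto [0, 2^n - 1] using a scale factor and a zero-point
    qmax = 2 ** num_bits - 1
    scale = qmax / (x_f.max() - x_f.min())
    zero_point = torch.round(-x_f.min() * scale)
    x_q = torch.clamp(torch.round(x_f * scale + zero_point), 0, qmax)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    # Approximate reconstruction of the original float values
    return (x_q - zero_point) / scale
```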
......@@ -47,7 +47,7 @@ Notes:
#### Symmetric Mode
<p align="center">
<img src="../imgs/quant_sym.png"/>
<img src="imgs/quant_sym.png"/>
</p>
In **symmetric** mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.
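
A corresponding sketch for symmetric mode, again only illustrative rather than Distiller's exact code:

```python
import torch

def symmetric_quantize(x_f, num_bits=8):
    # The float range is taken as [-max(|x_f|), max(|x_f|)], so no zero-point is needed
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = qmax / x_f.abs().max()
    x_q = torch.clamp(torch.round(x_f * scale), -qmax, qmax)
    return x_q, scale

def symmetric_dequantize(x_q, scale):
    return x_q / scale
```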
......@@ -75,7 +75,7 @@ The main trade-off between these two modes is simplicity vs. utilization of the
- **Removing Outliers:** As discussed [here](quantization.md#outliers-removal), in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability to accurately represent the values we actually care about.
<p align="center">
<img src="../imgs/quant_clipped.png"/>
<img src="imgs/quant_clipped.png"/>
</p>
Currently, Distiller supports clipping of activations with averaging during post-training quantization. That is, for each batch, instead of calculating global min/max values, we use an average of the min/max values of each sample in the batch (a rough sketch of this averaging follows this list).
- **Scale factor scope:** For weight tensors, Distiller supports per-channel quantization (per output channel).
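
As referenced in the clipping bullet above, this is an assumed sketch (not Distiller's exact code) of replacing the global min/max with per-sample min/max values averaged over the batch:

```python
import torch

def averaged_min_max(acts):
    # acts: a batch of activations with shape (N, ...)
    per_sample = acts.reshape(acts.size(0), -1)
    avg_min = per_sample.min(dim=1)[0].mean()   # average of per-sample minima
    avg_max = per_sample.max(dim=1)[0].mean()   # average of per-sample maxima
    return avg_min, avg_max
```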
......@@ -95,7 +95,7 @@ For post-training quantization, this method is implemented by wrapping existing
- Embedding
- All other layers are unaffected and are executed using their original FP32 implementation.
- To automatically transform an existing model to a quantized model using this method, use the `PostTrainLinearQuantizer` class. For details on ways to invoke the quantizer see [here](schedule.md#post-training-quantization) (a hedged usage sketch follows this list).
- The transform performed by the Quantizer only works on sub-classes of `torch.nn.Module`. But operations such as element-wise addition / multiplication and concatenation do not have associated Modules in PyTorch. They are either overloaded operators, or simple functions in the `torch` namespace. To be able to quantize these operations, we've implemented very simple modules that wrap these operations [here](https://github.com/NervanaSystems/distiller/blob/master/distiller/distiller/modules). It is necessary to manually modify your model and replace any existing operator with a corresponding module. For an example, see our slightly modified [ResNet implementation](https://github.com/NervanaSystems/distiller/blob/quantization_updates/models/imagenet/resnet.py).
- The transform performed by the Quantizer only works on sub-classes of `torch.nn.Module`. But operations such as element-wise addition / multiplication and concatenation do not have associated Modules in PyTorch. They are either overloaded operators, or simple functions in the `torch` namespace. To be able to quantize these operations, we've implemented very simple modules that wrap these operations [here](https://github.com/NervanaSystems/distiller/blob/master/distiller/modules). It is necessary to manually modify your model and replace any existing operator with a corresponding module. For an example, see our slightly modified [ResNet implementation](https://github.com/NervanaSystems/distiller/blob/master/distiller/models/imagenet/resnet.py).
- For weights and bias the scale factor and zero-point are determined once at quantization setup ("offline" / "static"). For activations, both "static" and "dynamic" quantization are supported. Static quantization of activations requires that statistics be collected beforehand. See details on how to do that [here](schedule.md#collecting-statistics-for-quantization).
- The calculated quantization parameters are stored as buffers within the module, so they are automatically serialized when the model checkpoint is saved.
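
To tie the points in this list together, here is a hedged usage sketch of `PostTrainLinearQuantizer`. The keyword argument names are assumptions based on the options described in this document and may differ from the actual constructor signature:

```python
import torchvision.models as models
from distiller.quantization import PostTrainLinearQuantizer, LinearQuantMode

model = models.resnet18(pretrained=True)  # any trained FP32 model
model.eval()

# Keyword names below are assumptions (bit-widths, quantization mode, per-channel weights)
quantizer = PostTrainLinearQuantizer(model,
                                     bits_activations=8,
                                     bits_parameters=8,
                                     mode=LinearQuantMode.ASYMMETRIC_UNSIGNED,
                                     per_channel_wts=True)
quantizer.prepare_model()  # wraps supported modules with range-based linear implementations

# Note: element-wise ops stay FP32 unless the model is modified to use distiller.modules,
# as described above. Evaluate `model` as usual; the calculated scale / zero-point values
# are stored as buffers and serialized with the model checkpoint.
```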
......@@ -141,7 +141,7 @@ This method requires training the model with quantization-aware training, as dis
This method is similar to DoReFa, but the upper clipping values, \(\alpha\), of the activation functions are learned parameters instead of being hard-coded to 1. Note that per the paper's recommendation, \(\alpha\) is shared per layer.
This method requires training the model with quantization-aware training, as discussed [here](quantization/#quantization-aware-training). Use the `PACTQuantizer` class to transform an existing model to a model suitable for training with quantization using PACT.
This method requires training the model with quantization-aware training, as discussed [here](quantization.md#quantization-aware-training). Use the `PACTQuantizer` class to transform an existing model to a model suitable for training with quantization using PACT.
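
The following hypothetical module (a sketch, not Distiller's `PACTQuantizer`) illustrates the core PACT idea: clip activations to a learned \(\alpha\), then linearly quantize the clipped range:

```python
import torch
import torch.nn as nn

class PACTActivation(nn.Module):
    """Illustrative PACT-style activation; not Distiller's implementation."""
    def __init__(self, num_bits=4, init_alpha=6.0):
        super().__init__()
        self.num_bits = num_bits
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learned clipping value, shared per layer

    def forward(self, x):
        # Clip to [0, alpha] using the differentiable form from the PACT paper:
        # y = 0.5 * (|x| - |x - alpha| + alpha)
        y = 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
        # Linearly quantize [0, alpha] to 2^k - 1 levels. A real implementation would
        # use a straight-through estimator for the non-differentiable rounding.
        scale = (2 ** self.num_bits - 1) / self.alpha
        return torch.round(y * scale) / scale
```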
## WRPN
......@@ -157,7 +157,7 @@ Weights are clipped to \([-1, 1]\) and quantized as follows:
Note that \(k-1\) bits are used to quantize weights, leaving one bit for sign.
This method requires training the model with quantization-aware training, as discussed [here](quantization/#quantization-aware-training). Use the `WRPNQuantizer` class to transform an existing model to a model suitable for training with quantization using WRPN.
This method requires training the model with quantization-aware training, as discussed [here](quantization.md#quantization-aware-training). Use the `WRPNQuantizer` class to transform an existing model to a model suitable for training with quantization using WRPN.
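
As a sketch (not `WRPNQuantizer`'s code), the two quantization rules can be written as follows; the weight formula matches the one shown above, and the activation formula is the standard \(k\)-bit version from the paper:

```python
import torch

def wrpn_quantize_activations(a_f, k):
    # Activations are clipped to [0, 1] and quantized to k bits
    a_f = torch.clamp(a_f, 0.0, 1.0)
    n = 2 ** k - 1
    return torch.round(a_f * n) / n

def wrpn_quantize_weights(w_f, k):
    # Weights are clipped to [-1, 1]; k-1 bits for magnitude, one bit for sign
    w_f = torch.clamp(w_f, -1.0, 1.0)
    n = 2 ** (k - 1) - 1
    return torch.round(w_f * n) / n
```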
### Notes:
......
......@@ -84,7 +84,7 @@ The `Quantizer` class supports quantization-aware training, that is - training w
2. To maintain the existing functionality of the module, we then register a `buffer` in the module with the original name - `weight`.
3. During training, `float_weight` will be passed to `param_quantization_fn` and the result will be stored in `weight` (a rough sketch of this mechanism appears below).
2. In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the [PACT](algo_quantization.md#PACT) method, activations are clipped to a value \(\alpha\), which is a learned per-layer parameter.
2. In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the [PACT](algo_quantization.md#pact) method, activations are clipped to a value \(\alpha\), which is a learned per-layer parameter.
To support these two cases, the `Quantizer` class also accepts an instance of a `torch.optim.Optimizer` (normally this would be an instance of one of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters.
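
A minimal sketch of the parameter re-registration described in the steps above, using a single layer and a placeholder quantization function (illustrative only, not Distiller's actual mechanism):

```python
import torch
import torch.nn as nn

def param_quantization_fn(w):
    return torch.round(w)  # placeholder; a real method would apply its own quantization

module = nn.Linear(10, 10)

# 1. Keep the FP32 values as a learned parameter under a new name
float_weight = nn.Parameter(module.weight.data.clone())
delattr(module, 'weight')
module.register_parameter('float_weight', float_weight)

# 2. Register a buffer with the original name so the module keeps working as before
module.register_buffer('weight', float_weight.data.clone())

# 3. Each training iteration, quantize the FP32 copy into the buffer
with torch.no_grad():
    module.weight.copy_(param_quantization_fn(module.float_weight))
```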
......
......@@ -393,7 +393,7 @@ if args.quantize_eval:
Note that the command-line arguments don't expose the `overrides` parameter of the quantizer, which allows fine-grained control over how each layer is quantized. To utilize this functionality, configure with a YAML file.
To see integration of these command line arguments in use, see the [image classification example](https://github.com/NervanaSystems/distiller/blob/master/examples/classifier_compression/compress_classifier.py).
For example invocations of post-training quantization see [here](https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_training_quant).
For example invocations of post-training quantization see [here](https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_train_quant).
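
Since `overrides` is only reachable from code or a YAML file, here is a hedged sketch of passing it programmatically. The override keys and constructor arguments shown are assumptions based on this document rather than a verified API reference:

```python
from collections import OrderedDict
from distiller.quantization import PostTrainLinearQuantizer

# `model` as created in the earlier usage sketch.
# Assumed structure: regex patterns matching layer names, mapped to per-layer settings.
overrides = OrderedDict([
    ('conv1', {'bits_activations': None, 'bits_weights': None}),  # leave the first layer in FP32
    ('fc.*',  {'bits_activations': 8, 'bits_weights': 4}),        # 4-bit weights for FC layers
])

quantizer = PostTrainLinearQuantizer(model, bits_activations=8, bits_parameters=8,
                                     overrides=overrides)
quantizer.prepare_model()
```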
### Collecting Statistics for Quantization
......@@ -411,7 +411,7 @@ if args.qe_calibration:
collector.save(yaml_path)
```
The generated YAML stats file can then be provided using the `--qe-stats-file` argument. An example of a generated stats file can be found [here](https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_training_quant/stats/resnet18_quant_stats.yaml).
The generated YAML stats file can then be provided using the `--qe-stats-file` argument. An example of a generated stats file can be found [here](https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_train_quant/stats/resnet18_quant_stats.yaml).
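
When invoking the quantizer from code instead of the command line, the same stats file can be passed to the post-training quantizer to enable static quantization of activations. A hedged sketch (the `model_activation_stats` keyword name is an assumption):

```python
from distiller.quantization import PostTrainLinearQuantizer

# `model` and `yaml_path` as in the snippets above; keyword name is an assumption
quantizer = PostTrainLinearQuantizer(model, bits_activations=8, bits_parameters=8,
                                     model_activation_stats=yaml_path)
quantizer.prepare_model()
```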
## Pruning Fine-Control
......
......@@ -43,6 +43,7 @@ We compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M
The results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow “true” pseudo-randomness. We limited our tests to 40 epochs, even though validation perplexity was still trending down.
Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:
* “We see that sparse models are able to outperform dense models which have significantly more parameters.”
* The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium model (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters. It also outperforms the dense large model, which exemplifies how pruning can act as a regularizer.
* “Our results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.”
......
......@@ -153,7 +153,7 @@ See [here](schedule.md#post-training-quantization) for more details on how to in
A checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are stored as well in each quantized layer.
For more examples of post-training quantization see [here](https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_training_quant).
For more examples of post-training quantization see [here](https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_train_quant).
## Summaries
You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).
......
......@@ -212,7 +212,7 @@ For any of the methods below that require quantization-aware training, please se
<p>In this method we can use two modes - <strong>asymmetric</strong> and <strong>symmetric</strong>.</p>
<h4 id="asymmetric-mode">Asymmetric Mode</h4>
<p align="center">
<img src="../imgs/quant_asym.png"/>
<img src="imgs/quant_asym.png"/>
</p>
<p>In <strong>asymmetric</strong> mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a <strong>zero-point</strong> (also called <em>quantization bias</em>, or <em>offset</em>) in addition to the scale factor.</p>
......@@ -238,7 +238,7 @@ For any of the methods below that require quantization-aware training, please se
</ul>
<h4 id="symmetric-mode">Symmetric Mode</h4>
<p align="center">
<img src="../imgs/quant_sym.png"/>
<img src="imgs/quant_sym.png"/>
</p>
<p>In <strong>symmetric</strong> mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.</p>
......@@ -264,7 +264,7 @@ For any of the methods below that require quantization-aware training, please se
<ul>
<li><strong>Removing Outliers:</strong> As discussed <a href="quantization.html#outliers-removal">here</a>, in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability to accurately represent the values we actually care about.
<p align="center">
<img src="../imgs/quant_clipped.png"/>
<img src="imgs/quant_clipped.png"/>
</p>
Currently, Distiller supports clipping of activations with averaging during post-training quantization. That is, for each batch, instead of calculating global min/max values, we use an average of the min/max values of each sample in the batch.</li>
<li><strong>Scale factor scope:</strong> For weight tensors, Distiller supports per-channel quantization (per output channel).</li>
......@@ -284,7 +284,7 @@ For any of the methods below that require quantization-aware training, please se
</li>
<li>All other layers are unaffected and are executed using their original FP32 implementation.</li>
<li>To automatically transform an existing model to a quantized model using this method, use the <code>PostTrainLinearQuantizer</code> class. For details on ways to invoke the quantizer see <a href="schedule.html#post-training-quantization">here</a>.</li>
<li>The transform performed by the Quantizer only works on sub-classes of <code>torch.nn.Module</code>. But operations such as element-wise addition / multiplication and concatenation do not have associated Modules in PyTorch. They are either overloaded operators, or simple functions in the <code>torch</code> namespace. To be able to quantize these operations, we've implemented very simple modules that wrap these operations <a href="https://github.com/NervanaSystems/distiller/blob/master/distiller/distiller/modules">here</a>. It is necessary to manually modify your model and replace any existing operator with a corresponding module. For an example, see our slightly modified <a href="https://github.com/NervanaSystems/distiller/blob/quantization_updates/models/imagenet/resnet.py">ResNet implementation</a>.</li>
<li>The transform performed by the Quantizer only works on sub-classes of <code>torch.nn.Module</code>. But operations such as element-wise addition / multiplication and concatenation do not have associated Modules in PyTorch. They are either overloaded operators, or simple functions in the <code>torch</code> namespace. To be able to quantize these operations, we've implemented very simple modules that wrap these operations <a href="https://github.com/NervanaSystems/distiller/blob/master/distiller/modules">here</a>. It is necessary to manually modify your model and replace any existing operator with a corresponding module. For an example, see our slightly modified <a href="https://github.com/NervanaSystems/distiller/blob/master/distiller/models/imagenet/resnet.py">ResNet implementation</a>.</li>
<li>For weights and bias the scale factor and zero-point are determined once at quantization setup ("offline" / "static"). For activations, both "static" and "dynamic" quantization are supported. Static quantization of activations requires that statistics be collected beforehand. See details on how to do that <a href="schedule.html#collecting-statistics-for-quantization">here</a>.</li>
<li>The calculated quantization parameters are stored as buffers within the module, so they are automatically serialized when the model checkpoint is saved.</li>
</ul>
......@@ -320,7 +320,7 @@ Note that the current implementation of <code>QuantAwareTrainRangeLinearQuantize
<h2 id="pact">PACT</h2>
<p>(As proposed in <a href="https://arxiv.org/abs/1805.06085">PACT: Parameterized Clipping Activation for Quantized Neural Networks</a>)</p>
<p>This method is similar to DoReFa, but the upper clipping values, <script type="math/tex">\alpha</script>, of the activation functions are learned parameters instead of being hard-coded to 1. Note that per the paper's recommendation, <script type="math/tex">\alpha</script> is shared per layer.</p>
<p>This method requires training the model with quantization-aware training, as discussed <a href="quantization/#quantization-aware-training">here</a>. Use the <code>PACTQuantizer</code> class to transform an existing model to a model suitable for training with quantization using PACT.</p>
<p>This method requires training the model with quantization-aware training, as discussed <a href="quantization.html#quantization-aware-training">here</a>. Use the <code>PACTQuantizer</code> class to transform an existing model to a model suitable for training with quantization using PACT.</p>
<h2 id="wrpn">WRPN</h2>
<p>(As proposed in <a href="https://arxiv.org/abs/1709.01134">WRPN: Wide Reduced-Precision Networks</a>) </p>
<p>In this method, activations are clipped to <script type="math/tex">[0, 1]</script> and quantized as follows (<script type="math/tex">k</script> is the number of bits used for quantization):</p>
......@@ -332,7 +332,7 @@ Note that the current implementation of <code>QuantAwareTrainRangeLinearQuantize
<script type="math/tex; mode=display">w_q = \frac{1}{2^{k-1}-1} round \left( \left(2^{k-1} - 1 \right)w_f \right)</script>
</p>
<p>Note that <script type="math/tex">k-1</script> bits are used to quantize weights, leaving one bit for sign.</p>
<p>This method requires training the model with quantization-aware training, as discussed <a href="quantization/#quantization-aware-training">here</a>. Use the <code>WRPNQuantizer</code> class to transform an existing model to a model suitable for training with quantization using WRPN.</p>
<p>This method requires training the model with quantization-aware training, as discussed <a href="quantization.html#quantization-aware-training">here</a>. Use the <code>WRPNQuantizer</code> class to transform an existing model to a model suitable for training with quantization using WRPN.</p>
<h3 id="notes_1">Notes:</h3>
<ul>
<li>The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of <code>WRPNQuantizer</code> at the moment. To experiment with this, modify your model implementation to have wider layers.</li>
......
......@@ -270,7 +270,7 @@ To execute the model transformation, call the <code>prepare_model</code> functio
</ol>
</li>
<li>
<p>In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the <a href="algo_quantization.html#PACT">PACT</a> method, activations are clipped to a value <script type="math/tex">\alpha</script>, which is a learned per-layer parameter.</p>
<p>In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the <a href="algo_quantization.html#pact">PACT</a> method, activations are clipped to a value <script type="math/tex">\alpha</script>, which is a learned per-layer parameter.</p>
</li>
</ol>
<p>To support these two cases, the <code>Quantizer</code> class also accepts an instance of a <code>torch.optim.Optimizer</code> (normally this would be an instance of one of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters.</p>
......
......@@ -273,5 +273,5 @@ And of course, if we used a sparse or compressed representation, then we are red
<!--
MkDocs version : 1.0.4
Build Date UTC : 2019-04-08 12:31:43
Build Date UTC : 2019-04-14 11:38:57
-->
......@@ -567,7 +567,7 @@ args = parser.parse_args()
<p>Note that the command-line arguments don't expose the <code>overrides</code> parameter of the quantizer, which allows fine-grained control over how each layer is quantized. To utilize this functionality, configure with a YAML file.</p>
<p>To see integration of these command line arguments in use, see the <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/classifier_compression/compress_classifier.py">image classification example</a>.
For example invocations of post-training quantization see <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_training_quant">here</a>.</p>
For example invocations of post-training quantization see <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_train_quant">here</a>.</p>
<h3 id="collecting-statistics-for-quantization">Collecting Statistics for Quantization</h3>
<p>To generate statistics that can be used for static quantization of activations, do the following (shown here assuming the command line argument <code>--qe-calibration</code> shown above is used, which specifies the number of batches to use for statistics generation):</p>
<pre><code class="python">if args.qe_calibration:
......@@ -581,7 +581,7 @@ For examples invocations of post-training quantization see <a href="https://gith
collector.save(yaml_path)
</code></pre>
<p>The generated YAML stats file can then be provided using the <code>--qe-stats-file</code> argument. An example of a generated stats file can be found <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_training_quant/stats/resnet18_quant_stats.yaml">here</a>.</p>
<p>The generated YAML stats file can then be provided using the <code>--qe-stats-file</code> argument. An example of a generated stats file can be found <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_train_quant/stats/resnet18_quant_stats.yaml">here</a>.</p>
<h2 id="pruning-fine-control">Pruning Fine-Control</h2>
<p>Sometimes the default pruning process doesn't satisfy our needs and we require finer control over the pruning process (e.g. over masking, gradient handling, and weight updates). Below we will explain the math and nuances of fine-control configuration.</p>
<h3 id="setting-up-the-problem">Setting up the problem</h3>
......
Source diff could not be displayed: it is too large.
......@@ -2,87 +2,87 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>None</loc>
<lastmod>2019-04-08</lastmod>
<lastmod>2019-04-14</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>
\ No newline at end of file
......@@ -316,10 +316,12 @@ Note that we can improve the results by training longer, since the loss curves a
<p>The model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding. The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings. We used the WikiText2 dataset (twice as large as PTB).</p>
<p>We compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M; 136M) – reported as (#parameters net/tied; #parameters gross).
The results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow “true” pseudo-randomness. We limited our tests to 40 epochs, even though validation perplexity was still trending down.</p>
<p>Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:
* “We see that sparse models are able to outperform dense models which have significantly more parameters.”
* The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium model (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters. It also outperforms the dense large model, which exemplifies how pruning can act as a regularizer.
* “Our results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.”</p>
<p>Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions:</p>
<ul>
<li>“We see that sparse models are able to outperform dense models which have significantly more parameters.”</li>
<li>The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium model (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters. It also outperforms the dense large model, which exemplifies how pruning can act as a regularizer.</li>
<li>“Our results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.”</li>
</ul>
<h2 id="setup">Setup</h2>
<p>We start by cloning Pytorch’s example <a href="https://github.com/pytorch/examples/tree/master">repository</a>. I’ve copied the language model code to distiller’s examples/word_language_model directory, so I’ll use that for the rest of the tutorial.
Next, let’s create and activate a virtual environment, as explained in Distiller's <a href="https://github.com/NervanaSystems/distiller#create-a-python-virtual-environment">README</a> file.
......
......@@ -359,7 +359,7 @@ Results are output as a CSV file (<code>sensitivity.csv</code>) and PNG file (<c
<p>See <a href="schedule.html#post-training-quantization">here</a> for more details on how to invoke post-training quantization from the command line.</p>
<p>A checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are stored as well in each quantized layer.</p>
<p>For more examples of post-training quantization see <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_training_quant">here</a>.</p>
<p>For more examples of post-training quantization see <a href="https://github.com/NervanaSystems/distiller/blob/master/examples/quantization/post_train_quant">here</a>.</p>
<h2 id="summaries">Summaries</h2>
<p>You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).
You can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN.
......