From 02f7871b56856beb16b654a84aed2fcbf5cb6f2c Mon Sep 17 00:00:00 2001
From: Guy Jacob <guy.jacob@intel.com>
Date: Fri, 22 Jun 2018 03:04:52 +0300
Subject: [PATCH] Minor additions to docs

---
 docs-src/docs/design.md       | 22 +++++++++++++++++++++-
 docs-src/docs/schedule.md     | 27 ++++++++++++++++++++++++++-
 docs/design/index.html        |  4 +++-
 docs/index.html               |  2 +-
 docs/schedule/index.html      | 14 +++++++++++++-
 docs/search/search_index.json | 10 +++++-----
 6 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/docs-src/docs/design.md b/docs-src/docs/design.md
index e7a3a7c..7ff309e 100755
--- a/docs-src/docs/design.md
+++ b/docs-src/docs/design.md
@@ -58,7 +58,15 @@ The high-level flow is as follows:
 - Replace the existing module with the module returned by the function. It is important to note that the **name** of the module **does not** change, as that could break the `forward` function of the parent module.
 
 Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different "strategies" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different **mapping** will likely be defined.  
-Each sub-class of `Quantizer` should populate the `replacement_factory` dictionary attribute with the appropriate mapping.
+Each sub-class of `Quantizer` should populate the `replacement_factory` dictionary attribute with the appropriate mapping.  
+To execute the model transformation, call the `prepare_model` function of the `Quantizer` instance.
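+
+As a rough sketch (assuming the `DorefaQuantizer` sub-class mentioned below, with example bit-width values; `model` is an existing PyTorch model):
+
+```
+quantizer = DorefaQuantizer(model, bits_activations=8, bits_weights=4)
+quantizer.prepare_model()  # replaces / wraps modules according to replacement_factory
+```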
 
 ### Flexible Bit-Widths
 
@@ -78,5 +86,17 @@ The `Quantizer` class supports training with quantization in the loop, as descri
 2. To maintain the existing functionality of the module, we then register a `buffer` in the module with the original name - `weights`.
 3. During training, `float_weight` will be passed to `param_quantization_fn` and the result will be stored in `weight`.
 
+**Important Note**: Since this process modifies the model's parameters, it must be done **before** a PyTorch `Optimizer` is created (this refers to any of the sub-classes defined [here](https://pytorch.org/docs/stable/optim.html#algorithms)).
+
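+A minimal sketch of the correct ordering (again, the specific quantizer sub-class and optimizer settings are only examples):
+
+```
+quantizer = DorefaQuantizer(model, bits_activations=8, bits_weights=4)
+quantizer.prepare_model()  # modifies the model's parameters
+
+# Create the optimizer only afterwards, so it sees the modified parameters
+optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
+```
+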
 The base `Quantizer` class is implemented in `distiller/quantization/quantizer.py`.  
 For a simple sub-class implementing symmetric linear quantization, see `SymmetricLinearQuantizer` in `distiller/quantization/range_linear.py`. For examples of lower-precision methods using training with quantization see `DorefaQuantizer` and `WRPNQuantizer` in `distiller/quantization/clipped_linear.py`
diff --git a/docs-src/docs/schedule.md b/docs-src/docs/schedule.md
index 0089277..1f4c184 100755
--- a/docs-src/docs/schedule.md
+++ b/docs-src/docs/schedule.md
@@ -240,7 +240,8 @@ policies:
 
 ## Quantization
 
-Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the `Quantizer` class (see details [here](design.md#quantization)).
+Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the `Quantizer` class (see details [here](design.md#quantization)).  
+**Note**: Only a single quantizer instance may be defined.  
 Let's see an example:
 
 ```
@@ -277,3 +278,27 @@
     wts: 2
     acts: null
 ```
+
+The `QuantizationPolicy`, which controls the quantization procedure during training, is actually quite simple. When it is initialized, it calls the `prepare_model()` function of the `Quantizer`, followed by the first call to `quantize_params()`. Then, at the end of each epoch, after the float copy of the weights has been updated, it calls `quantize_params()` again.
+
+```
+policies:
+    - quantizer:
+        instance_name: dorefa_quantizer
+      starting_epoch: 0
+      ending_epoch: 200
+      frequency: 1
+```
+
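+In pseudo-code, a simplified sketch of what this policy does (not the actual implementation):
+
+```
+# When the policy is initialized:
+quantizer.prepare_model()
+quantizer.quantize_params()
+
+# At the end of each epoch, after the float copy of the weights was updated:
+quantizer.quantize_params()
+```
+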
+**Important Note**: As mentioned [here](design.md#training-with-quantization), since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to `prepare_model()` must be performed before the optimizer is created. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a "warm-startup" (or "boot-strapping") - training for a few epochs with full precision and only then starting to quantize - the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and then execute a second run which resumes from the checkpoint containing the boot-strapped weights.
diff --git a/docs/design/index.html b/docs/design/index.html
index bf89613..991b270 100644
--- a/docs/design/index.html
+++ b/docs/design/index.html
@@ -217,7 +217,8 @@ train():
 <li>Replace the existing module with the module returned by the function. It is important to note that the <strong>name</strong> of the module <strong>does not</strong> change, as that could break the <code>forward</code> function of the parent module.</li>
 </ul>
 <p>Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different "strategies" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different <strong>mapping</strong> will likely be defined.<br />
-Each sub-class of <code>Quantizer</code> should populate the <code>replacement_factory</code> dictionary attribute with the appropriate mapping.</p>
+Each sub-class of <code>Quantizer</code> should populate the <code>replacement_factory</code> dictionary attribute with the appropriate mapping.<br />
+To execute the model transformation, call the <code>prepare_model</code> function of the <code>Quantizer</code> instance.</p>
 <h3 id="flexible-bit-widths">Flexible Bit-Widths</h3>
 <ul>
 <li>Each instance of <code>Quantizer</code> is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the <code>bits_activations</code> and <code>bits_weights</code> parameters in <code>Quantizer</code>'s constructor. Sub-classes may define bit-widths for other tensor types as needed.</li>
@@ -233,6 +234,7 @@ Each sub-class of <code>Quantizer</code> should populate the <code>replacement_f
 <li>To maintain the existing functionality of the module, we then register a <code>buffer</code> in the module with the original name - <code>weights</code>.</li>
 <li>During training, <code>float_weight</code> will be passed to <code>param_quantization_fn</code> and the result will be stored in <code>weight</code>.</li>
 </ol>
+<p><strong>Important Note</strong>: Since this process modifies the model's parameters, it must be done <strong>before</strong> a PyTorch <code>Optimizer</code> is created (this refers to any of the sub-classes defined <a href="https://pytorch.org/docs/stable/optim.html#algorithms">here</a>).</p>
 <p>The base <code>Quantizer</code> class is implemented in <code>distiller/quantization/quantizer.py</code>.<br />
 For a simple sub-class implementing symmetric linear quantization, see <code>SymmetricLinearQuantizer</code> in <code>distiller/quantization/range_linear.py</code>. For examples of lower-precision methods using training with quantization see <code>DorefaQuantizer</code> and <code>WRPNQuantizer</code> in <code>distiller/quantization/clipped_linear.py</code></p>
               
diff --git a/docs/index.html b/docs/index.html
index 1f3be6c..0a1d63d 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -246,5 +246,5 @@ And of course, if we used a sparse or compressed representation, then we are red
 
 <!--
 MkDocs version : 0.17.2
-Build Date UTC : 2018-06-21 23:06:46
+Build Date UTC : 2018-06-22 00:04:22
 -->
diff --git a/docs/schedule/index.html b/docs/schedule/index.html
index 2bf4ccc..ee6e523 100644
--- a/docs/schedule/index.html
+++ b/docs/schedule/index.html
@@ -401,7 +401,8 @@ policies:
 </code></pre>
 
 <h2 id="quantization">Quantization</h2>
-<p>Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the <code>Quantizer</code> class (see details <a href="../design/index.html#quantization">here</a>).
+<p>Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the <code>Quantizer</code> class (see details <a href="../design/index.html#quantization">here</a>).<br />
+<strong>Note</strong>: Only a single quantizer instance may be defined.<br />
 Let's see an example:</p>
 <pre><code>quantizers:
   dorefa_quantizer:
@@ -436,6 +437,17 @@ Let's see an example:</p>
     wts: 2
     acts: null
 </code></pre>
+
+<p>The <code>QuantizationPolicy</code>, which controls the quantization procedure during training, is actually quite simple. When it is initialized, it calls the <code>prepare_model()</code> function of the <code>Quantizer</code>, followed by the first call to <code>quantize_params()</code>. Then, at the end of each epoch, after the float copy of the weights has been updated, it calls <code>quantize_params()</code> again.</p>
+<pre><code>policies:
+    - quantizer:
+        instance_name: dorefa_quantizer
+      starting_epoch: 0
+      ending_epoch: 200
+      frequency: 1
+</code></pre>
+
+<p><strong>Important Note</strong>: As mentioned <a href="../design/index.html#training-with-quantization">here</a>, since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to <code>prepare_model()</code> must be performed before the optimizer is created. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a "warm-startup" (or "boot-strapping") - training for a few epochs with full precision and only then starting to quantize - the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and then execute a second run which resumes from the checkpoint containing the boot-strapped weights.</p>
               
             </div>
           </div>
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
index 2181117..6b9865e 100644
--- a/docs/search/search_index.json
+++ b/docs/search/search_index.json
@@ -137,7 +137,7 @@
         }, 
         {
             "location": "/schedule/index.html", 
-            "text": "Compression scheduler\n\n\nIn iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of \nCompressionScheduler\n: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions.  We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification.  We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base.  Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.  \n\n\nHigh level overview\n\n\nLet's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.\n\n\n\n\nPruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. \n\n\nAn LR-scheduler specifies the LR-decay algorithm.  \n\n\n\n\nThese define the \nwhat\n part of the schedule.  \n\n\nThe Policies define the \nwhen\n part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application).  A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing.\n\nThe CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.\n\n\nSyntax through example\n\n\nWe'll use \nalexnet.schedule_agp.yaml\n to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThere is only one version of the YAML syntax, and the version number is not verified at the moment.  However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.\n\n\nversion: 1\n\n\n\n\nIn the \npruners\n section, we define the instances of pruners we want the scheduler to instantiate and use.\n\nWe define a single pruner instance, named \nmy_pruner\n, of algorithm \nSensitivityPruner\n.  We will refer to this instance in the \nPolicies\n section.\n\nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.\n\nYou may list as many Pruners as you want in this section, as long as each has a unique name.  
You can several types of pruners in one schedule.\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.6\n\n\n\n\nNext, we want to specify the learning-rate decay scheduling in the \nlr_schedulers\n section.  We assign a name to this instance: \npruning_lr\n.  As in the \npruners\n section, you may use any name, as long as all LR-schedulers have a unique name.  At the moment, only one instance of LR-scheduler is allowed.  The LR-scheduler must be a subclass of PyTorch's \n_LRScheduler\n.  You can use any of the schedulers defined in \ntorch.optim.lr_scheduler\n (see \nhere\n).  In addition, we've implemented some additional schedulers in Distiller (see \nhere\n). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to \ntorch.optim.lr_scheduler\n, they can be used without changing the application code.\n\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\n\n\n\nFinally, we define the \npolicies\n section which defines the actual scheduling.  A \nPolicy\n manages an instance of a \nPruner\n, \nRegularizer\n, \nQuantizer\n, or \nLRScheduler\n, by naming the instance.  In the example below, a \nPruningPolicy\n uses the pruner instance named \nmy_pruner\n: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38.  \n\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThis is \niterative pruning\n:\n\n\n\n\n\n\nTrain Connectivity\n\n\n\n\n\n\nPrune Connections\n\n\n\n\n\n\nRetrain Weights\n\n\n\n\n\n\nGoto 2\n\n\n\n\n\n\nIt is described  in \nLearning both Weights and Connections for Efficient Neural Networks\n:\n\n\n\n\n\"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. 
The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"\n\n\n\n\nRegularization\n\n\nYou can also define and schedule regularization.\n\n\nL1 regularization\n\n\nFormat (this is an informal specification, not a valid \nABNF\n specification):\n\n\nregularizers:\n  \nREGULARIZER_NAME_STR\n:\n    class: L1Regularizer\n    reg_regims:\n      \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n      ...\n      \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n    threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_L1_reg:\n    class: L1Regularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': 0.000002\n      'module.layer3.1.conv2.weight': 0.000002\n      'module.layer3.1.conv3.weight': 0.000002\n      'module.layer3.2.conv1.weight': 0.000002\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_L1_reg\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nGroup regularization\n\n\nFormat (informal specification):\n\n\nFormat:\n  regularizers:\n    \nREGULARIZER_NAME_STR\n:\n      class: L1Regularizer\n      reg_regims:\n        \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n        \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n      threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_filter_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': [0.00005, '3D']\n      'module.layer3.1.conv2.weight': [0.00005, '3D']\n      'module.layer3.1.conv3.weight': [0.00005, '3D']\n      'module.layer3.2.conv1.weight': [0.00005, '3D']\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_filter_regularizer\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nMixing it up\n\n\nYou can mix pruning and regularization.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nregularizers:\n  2d_groups_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'features.module.0.weight': [0.000012, '2D']\n      'features.module.3.weight': [0.000012, '2D']\n      'features.module.6.weight': [0.000012, '2D']\n      'features.module.8.weight': [0.000012, '2D']\n      'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n  # Learning rate decay scheduler\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - regularizer:\n      instance_name: '2d_groups_regularizer'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 1\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\n\nQuantization\n\n\nSimilarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the \nQuantizer\n class (see details \nhere\n).\nLet's see an example:\n\n\nquantizers:\n  dorefa_quantizer:\n    
class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null\n\n\n\n\n\n\nThe specific quantization method we're instantiating here is \nDorefaQuantizer\n.\n\n\nThen we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. \n\n\nThen, we define the \nbits_overrides\n mapping. In this case, we choose not to quantize the first and last layer of the model. In the case of \nDorefaQuantizer\n, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters \nconv1\n, the first activation layer \nrelu1\n, the last activation layer \nfinal_relu\n and the last layer with parameters \nfc\n.\n\n\nSpecifying \nnull\n means \"do not quantize\".\n\n\nNote that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.\n\n\nWe can also reference \ngroups of layers\n in the \nbits_overrides\n mapping. This is done using regular expressions. Suppose we have a sub-module in our model named \nblock1\n, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named \nconv1\n, \nconv2\n and so on. In that case we would define the following:\n\n\n\n\nbits_overrides:\n  block1.conv*:\n    wts: 2\n    acts: null", 
+            "text": "Compression scheduler\n\n\nIn iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of \nCompressionScheduler\n: it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions.  We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification.  We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base.  Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.  \n\n\nHigh level overview\n\n\nLet's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies.\n\n\n\n\nPruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. \n\n\nAn LR-scheduler specifies the LR-decay algorithm.  \n\n\n\n\nThese define the \nwhat\n part of the schedule.  \n\n\nThe Policies define the \nwhen\n part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application).  A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing.\n\nThe CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.\n\n\nSyntax through example\n\n\nWe'll use \nalexnet.schedule_agp.yaml\n to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThere is only one version of the YAML syntax, and the version number is not verified at the moment.  However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2.\n\n\nversion: 1\n\n\n\n\nIn the \npruners\n section, we define the instances of pruners we want the scheduler to instantiate and use.\n\nWe define a single pruner instance, named \nmy_pruner\n, of algorithm \nSensitivityPruner\n.  We will refer to this instance in the \nPolicies\n section.\n\nThen we list the sensitivity multipliers, \\(s\\), of each of the weight tensors.\n\nYou may list as many Pruners as you want in this section, as long as each has a unique name.  
You can several types of pruners in one schedule.\n\n\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.6\n\n\n\n\nNext, we want to specify the learning-rate decay scheduling in the \nlr_schedulers\n section.  We assign a name to this instance: \npruning_lr\n.  As in the \npruners\n section, you may use any name, as long as all LR-schedulers have a unique name.  At the moment, only one instance of LR-scheduler is allowed.  The LR-scheduler must be a subclass of PyTorch's \n_LRScheduler\n.  You can use any of the schedulers defined in \ntorch.optim.lr_scheduler\n (see \nhere\n).  In addition, we've implemented some additional schedulers in Distiller (see \nhere\n). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to \ntorch.optim.lr_scheduler\n, they can be used without changing the application code.\n\n\nlr_schedulers:\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\n\n\n\nFinally, we define the \npolicies\n section which defines the actual scheduling.  A \nPolicy\n manages an instance of a \nPruner\n, \nRegularizer\n, \nQuantizer\n, or \nLRScheduler\n, by naming the instance.  In the example below, a \nPruningPolicy\n uses the pruner instance named \nmy_pruner\n: it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38.  \n\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\nThis is \niterative pruning\n:\n\n\n\n\n\n\nTrain Connectivity\n\n\n\n\n\n\nPrune Connections\n\n\n\n\n\n\nRetrain Weights\n\n\n\n\n\n\nGoto 2\n\n\n\n\n\n\nIt is described  in \nLearning both Weights and Connections for Efficient Neural Networks\n:\n\n\n\n\n\"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. 
The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"\n\n\n\n\nRegularization\n\n\nYou can also define and schedule regularization.\n\n\nL1 regularization\n\n\nFormat (this is an informal specification, not a valid \nABNF\n specification):\n\n\nregularizers:\n  \nREGULARIZER_NAME_STR\n:\n    class: L1Regularizer\n    reg_regims:\n      \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n      ...\n      \nPYTORCH_PARAM_NAME_STR\n: \nSTRENGTH_FLOAT\n\n    threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_L1_reg:\n    class: L1Regularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': 0.000002\n      'module.layer3.1.conv2.weight': 0.000002\n      'module.layer3.1.conv3.weight': 0.000002\n      'module.layer3.2.conv1.weight': 0.000002\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_L1_reg\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nGroup regularization\n\n\nFormat (informal specification):\n\n\nFormat:\n  regularizers:\n    \nREGULARIZER_NAME_STR\n:\n      class: L1Regularizer\n      reg_regims:\n        \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n        \nPYTORCH_PARAM_NAME_STR\n: [\nSTRENGTH_FLOAT\n, \n'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'\n]\n      threshold_criteria: [Mean_Abs | Max]\n\n\n\n\nFor example:\n\n\nversion: 1\n\nregularizers:\n  my_filter_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'module.layer3.1.conv1.weight': [0.00005, '3D']\n      'module.layer3.1.conv2.weight': [0.00005, '3D']\n      'module.layer3.1.conv3.weight': [0.00005, '3D']\n      'module.layer3.2.conv1.weight': [0.00005, '3D']\n    threshold_criteria: Mean_Abs\n\npolicies:\n  - regularizer:\n      instance_name: my_filter_regularizer\n    starting_epoch: 0\n    ending_epoch: 60\n    frequency: 1\n\n\n\n\nMixing it up\n\n\nYou can mix pruning and regularization.\n\n\nversion: 1\npruners:\n  my_pruner:\n    class: 'SensitivityPruner'\n    sensitivities:\n      'features.module.0.weight': 0.25\n      'features.module.3.weight': 0.35\n      'features.module.6.weight': 0.40\n      'features.module.8.weight': 0.45\n      'features.module.10.weight': 0.55\n      'classifier.1.weight': 0.875\n      'classifier.4.weight': 0.875\n      'classifier.6.weight': 0.625\n\nregularizers:\n  2d_groups_regularizer:\n    class: GroupLassoRegularizer\n    reg_regims:\n      'features.module.0.weight': [0.000012, '2D']\n      'features.module.3.weight': [0.000012, '2D']\n      'features.module.6.weight': [0.000012, '2D']\n      'features.module.8.weight': [0.000012, '2D']\n      'features.module.10.weight': [0.000012, '2D']\n\n\nlr_schedulers:\n  # Learning rate decay scheduler\n   pruning_lr:\n     class: ExponentialLR\n     gamma: 0.9\n\npolicies:\n  - pruner:\n      instance_name : 'my_pruner'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 2\n\n  - regularizer:\n      instance_name: '2d_groups_regularizer'\n    starting_epoch: 0\n    ending_epoch: 38\n    frequency: 1\n\n  - lr_scheduler:\n      instance_name: pruning_lr\n    starting_epoch: 24\n    ending_epoch: 200\n    frequency: 1\n\n\n\n\n\nQuantization\n\n\nSimilarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the \nQuantizer\n class (see details \nhere\n).\n\n\nNotes\n: Only a single quantizer instance may be 
defined.\n\nLet's see an example:\n\n\nquantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null\n\n\n\n\n\n\nThe specific quantization method we're instantiating here is \nDorefaQuantizer\n.\n\n\nThen we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. \n\n\nThen, we define the \nbits_overrides\n mapping. In this case, we choose not to quantize the first and last layer of the model. In the case of \nDorefaQuantizer\n, the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters \nconv1\n, the first activation layer \nrelu1\n, the last activation layer \nfinal_relu\n and the last layer with parameters \nfc\n.\n\n\nSpecifying \nnull\n means \"do not quantize\".\n\n\nNote that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.\n\n\nWe can also reference \ngroups of layers\n in the \nbits_overrides\n mapping. This is done using regular expressions. Suppose we have a sub-module in our model named \nblock1\n, which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named \nconv1\n, \nconv2\n and so on. In that case we would define the following:\n\n\n\n\nbits_overrides:\n  block1.conv*:\n    wts: 2\n    acts: null\n\n\n\n\nThe \nQuantizationPolicy\n, which controls the quantization procedure during training, is actually quite simplistic. All it does is call the \nprepare_model()\n function of the \nQuantizer\n when it's initialized, followed by the first call to \nquantize_params()\n. Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the \nquantize_params()\n function again. \n\n\npolicies:\n    - quantizer:\n        instance_name: dorefa_quantizer\n      starting_epoch: 0\n      ending_epoch: 200\n      frequency: 1\n\n\n\n\nImportant Note\n: As mentioned \nhere\n, since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to \nprepare_model()\n must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.", 
             "title": "Compression scheduling"
         }, 
         {
@@ -177,7 +177,7 @@
         }, 
         {
             "location": "/schedule/index.html#quantization", 
-            "text": "Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the  Quantizer  class (see details  here ).\nLet's see an example:  quantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null   The specific quantization method we're instantiating here is  DorefaQuantizer .  Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively.   Then, we define the  bits_overrides  mapping. In this case, we choose not to quantize the first and last layer of the model. In the case of  DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters  conv1 , the first activation layer  relu1 , the last activation layer  final_relu  and the last layer with parameters  fc .  Specifying  null  means \"do not quantize\".  Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.  We can also reference  groups of layers  in the  bits_overrides  mapping. This is done using regular expressions. Suppose we have a sub-module in our model named  block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named  conv1 ,  conv2  and so on. In that case we would define the following:   bits_overrides:\n  block1.conv*:\n    wts: 2\n    acts: null", 
+            "text": "Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the  Quantizer  class (see details  here ).  Notes : Only a single quantizer instance may be defined. \nLet's see an example:  quantizers:\n  dorefa_quantizer:\n    class: DorefaQuantizer\n    bits_activations: 8\n    bits_weights: 4\n    bits_overrides:\n      conv1:\n        wts: null\n        acts: null\n      relu1:\n        wts: null\n        acts: null\n      final_relu:\n        wts: null\n        acts: null\n      fc:\n        wts: null\n        acts: null   The specific quantization method we're instantiating here is  DorefaQuantizer .  Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively.   Then, we define the  bits_overrides  mapping. In this case, we choose not to quantize the first and last layer of the model. In the case of  DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters  conv1 , the first activation layer  relu1 , the last activation layer  final_relu  and the last layer with parameters  fc .  Specifying  null  means \"do not quantize\".  Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.  We can also reference  groups of layers  in the  bits_overrides  mapping. This is done using regular expressions. Suppose we have a sub-module in our model named  block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named  conv1 ,  conv2  and so on. In that case we would define the following:   bits_overrides:\n  block1.conv*:\n    wts: 2\n    acts: null  The  QuantizationPolicy , which controls the quantization procedure during training, is actually quite simplistic. All it does is call the  prepare_model()  function of the  Quantizer  when it's initialized, followed by the first call to  quantize_params() . Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the  quantize_params()  function again.   policies:\n    - quantizer:\n        instance_name: dorefa_quantizer\n      starting_epoch: 0\n      ending_epoch: 200\n      frequency: 1  Important Note : As mentioned  here , since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to  prepare_model()  must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.", 
             "title": "Quantization"
         }, 
         {
@@ -522,7 +522,7 @@
         }, 
         {
             "location": "/design/index.html", 
-            "text": "Distiller design\n\n\nDistiller is designed to be easily integrated into your own PyTorch research applications.\n\nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models (\ncompress_classifier.py\n).\n\n\nThe application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.\n\n\nIntegrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training.  The training skeleton looks like the pseudo code below.  The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.\n\n\nFor each epoch:\n    compression_scheduler.on_epoch_begin(epoch)\n    train()\n    validate()\n    save_checkpoint()\n    compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n    For each training step:\n        compression_scheduler.on_minibatch_begin(epoch)\n        output = model(input_var)\n        loss = criterion(output, target_var)\n        compression_scheduler.before_backward_pass(epoch)\n        loss.backward()\n        optimizer.step()\n        compression_scheduler.on_minibatch_end(epoch)\n\n\n\n\nThese callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's \nScheduler\n, which invokes the correct algorithm.  The application also uses Distiller services to collect statistics in \nSummaries\n and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.\n\n\n\n\nSparsification and fine-tuning\n\n\n\n\nThe application sets up a model as normally done in PyTorch.\n\n\nAnd then instantiates a Scheduler and configures it:\n\n\nScheduler configuration is defined in a YAML file\n\n\nThe configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.\n\n\nSome types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".\n\n\nSome algorithms control some parameter of the training process, such as the learning-rate decay scheduler (\nlr_scheduler\n).\n\n\nThe parameters of each algorithm are also specified in the configuration.\n\n\n\n\n\n\n\n\n\n\nIn addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.\n\n\nThe Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.\n\n\nThese callbacks are placed the training loop.\n\n\n\n\nQuantization\n\n\nA quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.\n\n\nIn Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. 
The user can write a quantized model from scratch, using the quantized operations provided.\n\n\nWe also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the \nQuantizer\n class. \nQuantizer\n should be sub-classed for each quantization method.\n\n\nModel Transformation\n\n\nThe high-level flow is as follows:\n\n\n\n\nDefine a \nmapping\n between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the \nreplacement_factory\n attribute of the \nQuantizer\n class.\n\n\nIterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.\n\n\nReplace the existing module with the module returned by the function. It is important to note that the \nname\n of the module \ndoes not\n change, as that could break the \nforward\n function of the parent module.\n\n\n\n\nDifferent quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different \nmapping\n will likely be defined.\n\nEach sub-class of \nQuantizer\n should populate the \nreplacement_factory\n dictionary attribute with the appropriate mapping.\n\n\nFlexible Bit-Widths\n\n\n\n\nEach instance of \nQuantizer\n is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the \nbits_activations\n and \nbits_weights\n parameters in \nQuantizer\n's constructor. Sub-classes may define bit-widths for other tensor types as needed.\n\n\nWe also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.\n\n\nSo, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the \nbits_overrides\n parameter in the constructor.\n\n\n\n\nWeights Quantization\n\n\nThe \nQuantizer\n class also provides an API to quantize the weights of all layers at once. To use it, the \nparam_quantization_fn\n attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the \nQuantizer\n class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the \nquantize_params\n function can be called, which will iterate over all parameters and quantize them using \nparams_quantization_fn\n.\n\n\nTraining with Quantization\n\n\nThe \nQuantizer\n class supports training with quantization in the loop, as described \nhere\n. This is enabled by setting \ntrain_with_fp_copy=True\n in the \nQuantizer\n constructor. 
At model transformation, in each module that has parameters that should be quantized, a new \ntorch.nn.Parameter\n is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module \nis not\n created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":\n\n\n\n\nThe existing \ntorch.nn.Parameter\n, e.g. \nweights\n, is replaced by a \ntorch.nn.Parameter\n named \nfloat_weight\n.\n\n\nTo maintain the existing functionality of the module, we then register a \nbuffer\n in the module with the original name - \nweights\n.\n\n\nDuring training, \nfloat_weight\n will be passed to \nparam_quantization_fn\n and the result will be stored in \nweight\n.\n\n\n\n\nThe base \nQuantizer\n class is implemented in \ndistiller/quantization/quantizer.py\n.\n\nFor a simple sub-class implementing symmetric linear quantization, see \nSymmetricLinearQuantizer\n in \ndistiller/quantization/range_linear.py\n. For examples of lower-precision methods using training with quantization see \nDorefaQuantizer\n and \nWRPNQuantizer\n in \ndistiller/quantization/clipped_linear.py", 
+            "text": "Distiller design\n\n\nDistiller is designed to be easily integrated into your own PyTorch research applications.\n\nIt is easiest to understand this integration by examining the code of the sample application for compressing image classification models (\ncompress_classifier.py\n).\n\n\nThe application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.\n\n\nIntegrating compression is very simple: simply add invocations of the appropriate compression_scheduler callbacks, for each stage in the training.  The training skeleton looks like the pseudo code below.  The boiler-plate Pytorch classification training is speckled with invocations of CompressionScheduler.\n\n\nFor each epoch:\n    compression_scheduler.on_epoch_begin(epoch)\n    train()\n    validate()\n    save_checkpoint()\n    compression_scheduler.on_epoch_end(epoch)\n\ntrain():\n    For each training step:\n        compression_scheduler.on_minibatch_begin(epoch)\n        output = model(input_var)\n        loss = criterion(output, target_var)\n        compression_scheduler.before_backward_pass(epoch)\n        loss.backward()\n        optimizer.step()\n        compression_scheduler.on_minibatch_end(epoch)\n\n\n\n\nThese callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's \nScheduler\n, which invokes the correct algorithm.  The application also uses Distiller services to collect statistics in \nSummaries\n and logs files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.\n\n\n\n\nSparsification and fine-tuning\n\n\n\n\nThe application sets up a model as normally done in PyTorch.\n\n\nAnd then instantiates a Scheduler and configures it:\n\n\nScheduler configuration is defined in a YAML file\n\n\nThe configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training.\n\n\nSome types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\".\n\n\nSome algorithms control some parameter of the training process, such as the learning-rate decay scheduler (\nlr_scheduler\n).\n\n\nThe parameters of each algorithm are also specified in the configuration.\n\n\n\n\n\n\n\n\n\n\nIn addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency.\n\n\nThe Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined according the schedule that was defined.\n\n\nThese callbacks are placed the training loop.\n\n\n\n\nQuantization\n\n\nA quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary.\n\n\nIn Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. 
The user can write a quantized model from scratch, using the quantized operations provided.\n\n\nWe also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the \nQuantizer\n class. \nQuantizer\n should be sub-classed for each quantization method.\n\n\nModel Transformation\n\n\nThe high-level flow is as follows:\n\n\n\n\nDefine a \nmapping\n between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the \nreplacement_factory\n attribute of the \nQuantizer\n class.\n\n\nIterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.\n\n\nReplace the existing module with the module returned by the function. It is important to note that the \nname\n of the module \ndoes not\n change, as that could break the \nforward\n function of the parent module.\n\n\n\n\nDifferent quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different \nmapping\n will likely be defined.\n\nEach sub-class of \nQuantizer\n should populate the \nreplacement_factory\n dictionary attribute with the appropriate mapping.\n\nTo execute the model transformation, call the \nprepare_model\n function of the \nQuantizer\n instance.\n\n\nFlexible Bit-Widths\n\n\n\n\nEach instance of \nQuantizer\n is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the \nbits_activations\n and \nbits_weights\n parameters in \nQuantizer\n's constructor. Sub-classes may define bit-widths for other tensor types as needed.\n\n\nWe also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern.\n\n\nSo, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using they're exact name, or a group of layers via a regular expression. This mapping is passed via the \nbits_overrides\n parameter in the constructor.\n\n\n\n\nWeights Quantization\n\n\nThe \nQuantizer\n class also provides an API to quantize the weights of all layers at once. To use it, the \nparam_quantization_fn\n attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the \nQuantizer\n class will build a list of all model parameters that need to be quantized along with their bit-width. 
Then, the \nquantize_params\n function can be called, which will iterate over all parameters and quantize them using \nparams_quantization_fn\n.\n\n\nTraining with Quantization\n\n\nThe \nQuantizer\n class supports training with quantization in the loop, as described \nhere\n. This is enabled by setting \ntrain_with_fp_copy=True\n in the \nQuantizer\n constructor. At model transformation, in each module that has parameters that should be quantized, a new \ntorch.nn.Parameter\n is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module \nis not\n created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":\n\n\n\n\nThe existing \ntorch.nn.Parameter\n, e.g. \nweights\n, is replaced by a \ntorch.nn.Parameter\n named \nfloat_weight\n.\n\n\nTo maintain the existing functionality of the module, we then register a \nbuffer\n in the module with the original name - \nweights\n.\n\n\nDuring training, \nfloat_weight\n will be passed to \nparam_quantization_fn\n and the result will be stored in \nweight\n.\n\n\n\n\nImportant Note\n: Since this process modifies the model's parameters, it must be done \nbefore\n an PyTorch \nOptimizer\n is created (this refers to any of the sub-classes defined \nhere\n).\n\n\nThe base \nQuantizer\n class is implemented in \ndistiller/quantization/quantizer.py\n.\n\nFor a simple sub-class implementing symmetric linear quantization, see \nSymmetricLinearQuantizer\n in \ndistiller/quantization/range_linear.py\n. For examples of lower-precision methods using training with quantization see \nDorefaQuantizer\n and \nWRPNQuantizer\n in \ndistiller/quantization/clipped_linear.py", 
             "title": "Design"
         }, 
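To make the flow described in this entry concrete, here is a minimal sketch of a `Quantizer` sub-class and how it would be applied. This is illustrative only: `MyQuantizer`, `linear_quantize`, and the replacement-function signature are assumptions for this sketch, not the exact Distiller API, whose details may differ between versions.

```
import torch.nn as nn
from distiller.quantization import Quantizer

def linear_quantize(tensor, num_bits):
    # Naive symmetric linear quantization, as a stand-in for a real
    # param_quantization_fn (illustration only)
    scale = (2 ** (num_bits - 1) - 1) / tensor.abs().max()
    return tensor.mul(scale).round().div(scale)

class MyQuantizer(Quantizer):
    def __init__(self, model, bits_activations=8, bits_weights=8):
        super(MyQuantizer, self).__init__(model,
                                          bits_activations=bits_activations,
                                          bits_weights=bits_weights)
        # Map module types to replacement-generating functions; the exact
        # arguments the base class passes here are assumed for this sketch
        self.replacement_factory[nn.ReLU] = \
            lambda module, *args: nn.ReLU6(inplace=module.inplace)
        self.param_quantization_fn = linear_quantize

quantizer = MyQuantizer(model)   # 'model' is an existing torch.nn.Module
quantizer.prepare_model()        # perform the model transformation
quantizer.quantize_params()      # quantize all registered weights at once
```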
         {
@@ -542,7 +542,7 @@
         }, 
         {
             "location": "/design/index.html#model-transformation", 
-            "text": "The high-level flow is as follows:   Define a  mapping  between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the  replacement_factory  attribute of the  Quantizer  class.  Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.  Replace the existing module with the module returned by the function. It is important to note that the  name  of the module  does not  change, as that could break the  forward  function of the parent module.   Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different  mapping  will likely be defined. \nEach sub-class of  Quantizer  should populate the  replacement_factory  dictionary attribute with the appropriate mapping.", 
+            "text": "The high-level flow is as follows:   Define a  mapping  between the module types to be replaced (e.g. Conv2D, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the  replacement_factory  attribute of the  Quantizer  class.  Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it.  Replace the existing module with the module returned by the function. It is important to note that the  name  of the module  does not  change, as that could break the  forward  function of the parent module.   Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different  mapping  will likely be defined. \nEach sub-class of  Quantizer  should populate the  replacement_factory  dictionary attribute with the appropriate mapping. \nTo execute the model transformation, call the  prepare_model  function of the  Quantizer  instance.", 
             "title": "Model Transformation"
         }, 
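For readers who want to see the iterate-and-replace step itself, the following is a minimal, self-contained sketch in plain PyTorch of what the transformation does conceptually; the real implementation in `distiller/quantization/quantizer.py` handles more cases, and `replace_modules` / `factory` here are names invented for this sketch:

```
import torch.nn as nn

def replace_modules(container, factory):
    # Walk the model recursively; any module whose type appears in the
    # mapping is replaced, keeping the attribute name unchanged so the
    # parent's forward() still finds it
    for name, module in container.named_children():
        replace_modules(module, factory)
        if type(module) in factory:
            setattr(container, name, factory[type(module)](module))

# Example mapping: swap every ReLU for the hard-clipped ReLU6
factory = {nn.ReLU: lambda m: nn.ReLU6(inplace=m.inplace)}
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Linear(16, 10))
replace_modules(model, factory)
```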
         {
@@ -557,7 +557,7 @@
         }, 
         {
             "location": "/design/index.html#training-with-quantization", 
-            "text": "The  Quantizer  class supports training with quantization in the loop, as described  here . This is enabled by setting  train_with_fp_copy=True  in the  Quantizer  constructor. At model transformation, in each module that has parameters that should be quantized, a new  torch.nn.Parameter  is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module  is not  created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":   The existing  torch.nn.Parameter , e.g.  weights , is replaced by a  torch.nn.Parameter  named  float_weight .  To maintain the existing functionality of the module, we then register a  buffer  in the module with the original name -  weights .  During training,  float_weight  will be passed to  param_quantization_fn  and the result will be stored in  weight .   The base  Quantizer  class is implemented in  distiller/quantization/quantizer.py . \nFor a simple sub-class implementing symmetric linear quantization, see  SymmetricLinearQuantizer  in  distiller/quantization/range_linear.py . For examples of lower-precision methods using training with quantization see  DorefaQuantizer  and  WRPNQuantizer  in  distiller/quantization/clipped_linear.py", 
+            "text": "The  Quantizer  class supports training with quantization in the loop, as described  here . This is enabled by setting  train_with_fp_copy=True  in the  Quantizer  constructor. At model transformation, in each module that has parameters that should be quantized, a new  torch.nn.Parameter  is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module  is not  created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\":   The existing  torch.nn.Parameter , e.g.  weights , is replaced by a  torch.nn.Parameter  named  float_weight .  To maintain the existing functionality of the module, we then register a  buffer  in the module with the original name -  weights .  During training,  float_weight  will be passed to  param_quantization_fn  and the result will be stored in  weight .   Important Note : Since this process modifies the model's parameters, it must be done  before  an PyTorch  Optimizer  is created (this refers to any of the sub-classes defined  here ).  The base  Quantizer  class is implemented in  distiller/quantization/quantizer.py . \nFor a simple sub-class implementing symmetric linear quantization, see  SymmetricLinearQuantizer  in  distiller/quantization/range_linear.py . For examples of lower-precision methods using training with quantization see  DorefaQuantizer  and  WRPNQuantizer  in  distiller/quantization/clipped_linear.py", 
             "title": "Training with Quantization"
         }
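As a rough illustration of the fp-copy "hack" described in this entry, the following sketch reproduces the parameter/buffer swap on a single layer. It is a simplified assumption of the mechanism, not the actual code (which lives in `distiller/quantization/quantizer.py`), and the quantization step is a placeholder:

```
import torch
import torch.nn as nn

module = nn.Linear(10, 10)

# 1. Replace the 'weight' Parameter with one named 'float_weight'
fp_weight = module.weight.data
del module._parameters['weight']
module.register_parameter('float_weight', nn.Parameter(fp_weight))

# 2. Re-register 'weight' as a buffer so forward() keeps working
module.register_buffer('weight', torch.zeros_like(fp_weight))

# 3. Each iteration: quantize the fp copy into the buffer
#    (placeholder arithmetic; the real flow routes this through the
#    quantization function so gradients reach float_weight)
module.weight = module.float_weight.data.mul(16).round().div(16)
```

Since `float_weight` is a brand-new `Parameter`, an `Optimizer` built before this swap would still hold references to the old tensors - hence the ordering requirement in the "Important Note" above.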
     ]
-- 
GitLab