diff --git a/docs-src/docs/regularization.md b/docs-src/docs/regularization.md index f0b000fa71262e98d5bd3a567898d1714970e899..2881cad468990eaa754ad41065921a1646e75ae3 100755 --- a/docs-src/docs/regularization.md +++ b/docs-src/docs/regularization.md @@ -45,12 +45,12 @@ In [Dense-Sparse-Dense (DSD)](#han-et-al-2017), Song Han et al. use pruning as a Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\lVert W \rVert_1\\). \\[ -l_1(W) = \sum_{i=1}^{|W|} |w_i| +\lVert W \rVert_1 = l_1(W) = \sum_{i=1}^{|W|} |w_i| \\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as *feature selection* and gives us another interpretation of pruning. -[One](http://localhost:8888/notebooks/L1-regularization.ipynb) of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. +[One](https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb) of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. If we configure ```weight_decay``` to zero and use \\(l_1\\)-norm regularization, then we have: @@ -71,7 +71,7 @@ loss = criterion(output, target) + lambda * l1_regularizer() ## Group Regularization -In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The groups structures have to be pre-defined. +In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated: \\[ loss(W;x;y) = loss_D(W;x;y) + \lambda_R R(W) + \lambda_g \sum_{l=1}^{L} R_g(W_l^{(G)}) \\] @@ -91,7 +91,7 @@ where \\(w^{(g)} \in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements Group Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore it can be beneficial to improve inference speed. -[Huizi-et-al-2017](#huizi-et-al-2017) provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even [intra kernel strided sparsity](#anwar-et-al-2015). +[Huizi-et-al-2017](#huizi-et-al-2017) provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even [intra kernel strided sparsity](#anwar-et-al-2015) can also be used.
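To make the group penalty concrete, here is a minimal plain-PyTorch sketch of a filter-wise Group Lasso term for a single convolution layer: each output filter is treated as one group, and the per-group magnitudes (\\(l_2\\) norms) of all groups are summed. The toy layer, the choice of filters as the groups, and the strength ```lambda_g``` are assumptions made only for this example; it illustrates the math above and is not Distiller's ```GroupLassoRegularizer``` API.

```python
import torch
import torch.nn as nn

# Toy convolution layer; its filters (output-channel slices) are the groups.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

def filter_group_lasso(weight):
    # weight: (out_channels, in_channels, kH, kW) -> one row per filter,
    # then sum the per-group l2 magnitudes (standard Group Lasso).
    groups = weight.view(weight.size(0), -1)
    return groups.norm(p=2, dim=1).sum()

lambda_g = 1e-4                      # group regularization strength (assumed)
data_loss = torch.tensor(0.0)        # stands in for loss_D(W;x;y)
loss = data_loss + lambda_g * filter_group_lasso(conv.weight)
loss.backward()                      # gradients now include the group penalty
```

Because the penalty is the sum of whole-filter magnitudes, gradient descent pushes entire filters toward zero together, which is exactly the structured (coarse-grained) sparsity described above.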
```distiller.GroupLassoRegularizer``` currently implements most of these groups, and you can easily add new groups. diff --git a/docs/index.html b/docs/index.html index ccf8d34c93f26138e87205376cc3af23767d9ea7..43dff83062f61ec37fc7c8ac86360204d539cc12 100644 --- a/docs/index.html +++ b/docs/index.html @@ -236,5 +236,5 @@ And of course, if we used a sparse or compressed representation, then we are red <!-- MkDocs version : 0.17.2 -Build Date UTC : 2018-04-24 15:37:02 +Build Date UTC : 2018-04-24 17:18:11 --> diff --git a/docs/regularization/index.html b/docs/regularization/index.html index 03df80e40ed8c762800940ddf89a4b32c6879027..36b8b6d55bc7a582268a8a7af4294cdf31fa0f48 100644 --- a/docs/regularization/index.html +++ b/docs/regularization/index.html @@ -200,10 +200,10 @@ for input, target in dataset: </blockquote> <p>Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \(l_1\)-norm, \(\lVert W \rVert_1\). \[ -l_1(W) = \sum_{i=1}^{|W|} |w_i| +\lVert W \rVert_1 = l_1(W) = \sum_{i=1}^{|W|} |w_i| \]</p> <p>\(l_2\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \(l_1\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as <em>feature selection</em> and gives us another interpretation of pruning.</p> -<p><a href="http://localhost:8888/notebooks/L1-regularization.ipynb">One</a> of Distiller's Jupiter notebooks explains how the \(l_1\)-norm regularizer induces sparsity, and how it interacts with \(l_2\)-norm regularization.</p> +<p><a href="https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb">One</a> of Distiller's Jupyter notebooks explains how the \(l_1\)-norm regularizer induces sparsity, and how it interacts with \(l_2\)-norm regularization.</p> <p>If we configure <code>weight_decay</code> to zero and use \(l_1\)-norm regularization, then we have: \[ loss(W;x;y) = loss_D(W;x;y) + \lambda_R \lVert W \rVert_1 @@ -219,7 +219,7 @@ loss = criterion(output, target) + lambda * l1_regularizer() </code></pre> <h2 id="group-regularization">Group Regularization</h2> -<p>In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The groups structures have to be pre-defined.</p> +<p>In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined.</p> <p>To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \(l\) as \( W_l^{(G)} \), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated: \[ loss(W;x;y) = loss_D(W;x;y) + \lambda_R R(W) + \lambda_g \sum_{l=1}^{L} R_g(W_l^{(G)}) \] @@ -233,7 +233,7 @@ where \(w^{(g)} \in w^{(l)} \) and \( |w^{(g)}| \) is the number of elements in <br> Group Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity).
Group sparsity exhibits regularity (i.e. its shape is regular), and therefore it can be beneficial to improve inference speed.</p> -<p><a href="#huizi-et-al-2017">Huizi-et-al-2017</a> provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even <a href="#anwar-et-al-2015">intra kernel strided sparsity</a>.</p> +<p><a href="#huizi-et-al-2017">Huizi-et-al-2017</a> provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even <a href="#anwar-et-al-2015">intra kernel strided sparsity</a> can also be used.</p> <p><code>distiller.GroupLassoRegularizer</code> currently implements most of these groups, and you can easily add new groups.</p> <h2 id="references">References</h2> <p><div id="deep-learning"></div> <strong>Ian Goodfellow and Yoshua Bengio and Aaron Courville</strong>. diff --git a/docs/search/search_index.json b/docs/search/search_index.json index 1a5e9e603381e6d58f82e60444c7b9a8d4d07776..a11dad1dca803c02ad02bae9801f7c4c52171d76 100644 --- a/docs/search/search_index.json +++ b/docs/search/search_index.json @@ -212,7 +212,7 @@ }, { "location": "/regularization/index.html", - "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error. In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. 
use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\nl_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The groups structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers. 
Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n \nDeep Learning\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n arXiv:1705.08922v3,\n 2017.\n\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n \nStructured pruning of deep convolutional neural networks\n,\n arXiv:1512.08571,\n 2015", + "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error. In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. 
Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers. 
Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n \nDeep Learning\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n arXiv:1705.08922v3,\n 2017.\n\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n \nStructured pruning of deep convolutional neural networks\n,\n arXiv:1512.08571,\n 2015", "title": "Regularization" }, { @@ -222,12 +222,12 @@ }, { "location": "/regularization/index.html#sparsity-and-regularization", - "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\nl_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()", + "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. 
Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()", "title": "Sparsity and Regularization" }, { "location": "/regularization/index.html#group-regularization", - "text": "In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The groups structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\] Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\). \\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\). \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups). \nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed. Huizi-et-al-2017 provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even intra kernel strided sparsity . 
distiller.GroupLassoRegularizer currently implements most of these groups, and you can easily add new groups.", + "text": "In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\] Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\). \\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\). \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups). \nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed. Huizi-et-al-2017 provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even intra kernel strided sparsity can also be used. distiller.GroupLassoRegularizer currently implements most of these groups, and you can easily add new groups.", "title": "Group Regularization" }, {
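To close, here is a self-contained plain-PyTorch sketch of the element-wise \\(l_1\\) penalty discussed in this page, with ```weight_decay``` set to zero so that only the explicit \\(l_1\\) term regularizes the weights, mirroring the training-loop snippet shown earlier. The tiny model, the fake batch, and the strength ```lambda_r``` are assumptions for illustration (the name ```lambda_r``` is used because ```lambda``` is a reserved word in Python); this is a sketch of the formula \\(loss = loss_D + \lambda_R \lVert W \rVert_1\\), not Distiller's ```L1Regularizer``` class.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)             # toy model standing in for a real network
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0)
criterion = nn.CrossEntropyLoss()
lambda_r = 1e-4                      # l1 regularization strength (assumed)

def l1_penalty(params):
    # ||W||_1 = sum of |w_i| over all weight elements in the network
    return sum(p.abs().sum() for p in params)

input = torch.randn(8, 10)           # fake batch of 8 samples
target = torch.randint(0, 2, (8,))   # fake labels

optimizer.zero_grad()
output = model(input)
loss = criterion(output, target) + lambda_r * l1_penalty(model.parameters())
loss.backward()
optimizer.step()
```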