Commit 67e5a230 authored by Neta Zmora

fix documentation links

parent 0ecd205a
@@ -45,12 +45,12 @@ In [Dense-Sparse-Dense (DSD)](#han-et-al-2017), Song Han et al. use pruning as a
Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\lVert W \rVert_1\\).
\\[
\lVert W \rVert_1 = l_1(W) = \sum_{i=1}^{|W|} |w_i|
\\]
\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as *feature selection* and gives us another interpretation of pruning.
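As a small illustration (not part of the original text), the element-wise \\(l_1\\)-norm above maps directly onto a layer's weight tensor, and the resulting element-wise sparsity can be measured as the fraction of weights that are exactly zero:

```
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
W = conv.weight

l1_norm = W.abs().sum()              # \lVert W \rVert_1 = sum of |w_i| over all elements
sparsity = (W == 0).float().mean()   # fraction of exactly-zero elements (element-wise sparsity)
print(f"L1 norm: {l1_norm.item():.2f}  sparsity: {sparsity.item():.2%}")
```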
[One](https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb) of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.
If we configure ```weight_decay``` to zero and use \\(l_1\\)-norm regularization, then we have:
@@ -71,7 +71,7 @@ loss = criterion(output, target) + lambda * l1_regularizer()
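The ```l1_regularizer()``` helper referenced in the hunk context above is not expanded here; a minimal sketch of the same pattern in plain PyTorch could look like the following (the helper name, model, and ```lambda_l1``` value are illustrative assumptions, not Distiller's actual API):

```
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
criterion = nn.CrossEntropyLoss()
# weight_decay=0 disables the optimizer's built-in L2 penalty
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0)
lambda_l1 = 1e-4  # illustrative regularization strength

def l1_regularizer():
    # sum of |w_i| over all model parameters
    return sum(p.abs().sum() for p in model.parameters())

# one illustrative training step
inputs, targets = torch.randn(32, 784), torch.randint(0, 10, (32,))
optimizer.zero_grad()
output = model(inputs)
loss = criterion(output, targets) + lambda_l1 * l1_regularizer()
loss.backward()
optimizer.step()
```

Because ```weight_decay``` is zero, the explicit \\(l_1\\) term is the only regularization acting on the weights.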
## Group Regularization
In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined.
To the data loss, and the element-wise regularization (if any), we can add a group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:
\\[
@@ -91,7 +91,7 @@ where \\(w^{(g)} \in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements
Group Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore it can be beneficial for improving inference speed.
[Huizi-et-al-2017](#huizi-et-al-2017) provides an overview of some of the different groups: kernel, channel, filter, and layer. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even [intra kernel strided sparsity](#anwar-et-al-2015) can also be used.
```distiller.GroupLassoRegularizer``` currently implements most of these groups, and you can easily add new groups.
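To make the group penalty concrete, here is a generic sketch in which each group is one convolution filter. It illustrates the standard group-lasso form, \\( \sum_g \sqrt{|w^{(g)}|} \, \lVert w^{(g)} \rVert_2 \\), and is not the actual implementation or API of ```distiller.GroupLassoRegularizer```:

```
import math
import torch
import torch.nn as nn

def filter_group_lasso(conv: nn.Conv2d) -> torch.Tensor:
    # Each group is one filter: an [in_channels, kH, kW] slice of the weight tensor.
    W = conv.weight                  # shape: [out_channels, in_channels, kH, kW]
    group_size = W[0].numel()        # |w^(g)|: number of elements per filter
    # l2 norm of each filter, scaled by sqrt(group size), summed over all filters
    return math.sqrt(group_size) * W.flatten(1).norm(p=2, dim=1).sum()

conv = nn.Conv2d(64, 128, kernel_size=3)
lambda_g = 1e-4                      # illustrative group-regularization strength
penalty = lambda_g * filter_group_lasso(conv)
```

Filters whose group norm is driven to zero during training can be removed entirely, which is where the inference-speed benefit of structured sparsity comes from.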