From ebb89126c92acba78e47c01a4bc06daefc17245c Mon Sep 17 00:00:00 2001
From: Neta Zmora <neta.zmora@intel.com>
Date: Sat, 28 Apr 2018 21:02:49 +0300
Subject: [PATCH] fix typo: Jupyter spelled as Jupiter

---
 README.md | 6 +++---
 docs-src/docs/install.md | 2 +-
 docs-src/docs/regularization.md | 2 +-
 docs/index.html | 2 +-
 docs/install/index.html | 2 +-
 docs/regularization/index.html | 2 +-
 docs/search/search_index.json | 8 ++++----
 docs/sitemap.xml | 22 +++++++++++-----------
 8 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 3779df2..a0533f6 100755
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ Network compression can reduce the memory footprint of a neural network, increas
   + [Install dependencies](#install-dependencies)
 * [Getting Started](#getting-started)
   + [Example invocations of the sample application](#example-invocations-of-the-sample-application)
-  + [Explore the sample Jupiter notebooks](#explore-the-sample-jupiter-notebooks)
+  + [Explore the sample Jupyter notebooks](#explore-the-sample-jupyter-notebooks)
 * [Set up the classification datasets](#set-up-the-classification-datasets)
 * [Running the tests](#running-the-tests)
 * [Generating the HTML documentation site](#generating-the-html-documentation-site)
@@ -137,7 +137,7 @@ You can jump head-first into some limited examples of network compression, to ge

 Distiller comes with a sample application for compressing image classification DNNs, ```compress_classifier.py``` located at ```distiller/examples/classifier_compression```.

-We'll show you how to use it for some simple use-cases, and will point you to some ready-to-go Jupiter notebooks.
+We'll show you how to use it for some simple use-cases, and will point you to some ready-to-go Jupyter notebooks.

 For more details, there are some other resources you can refer to:
 + [Model zoo](https://nervanasystems.github.io/distiller/model_zoo/index.html)
@@ -196,7 +196,7 @@ This example performs 8-bit quantization of ResNet20 for CIFAR10. We've include
 $ python3 compress_classifier.py -a resnet20-cifar ../../../data.cifar10 --resume ../examples/ssl/checkpoints/checkpoint_trained_dense.pth.tar --quantize --evaluate
 ```

-### Explore the sample Jupiter notebooks
+### Explore the sample Jupyter notebooks
 The set of notebooks that come with Distiller is described [here](https://nervanasystems.github.io/distiller/jupyter/index.html#using-the-distiller-notebooks), which also explains the steps to install the Jupyter notebook server.<br>
 After installing and running the server, take a look at the [notebook](https://github.com/NervanaSystems/distiller/blob/master/jupyter/sensitivity_analysis.ipynb) covering pruning sensitivity analysis.

diff --git a/docs-src/docs/install.md b/docs-src/docs/install.md
index 06da113..3231710 100755
--- a/docs-src/docs/install.md
+++ b/docs-src/docs/install.md
@@ -5,7 +5,7 @@ These instructions will help get Distiller up and running on your local machine.
 You may also want to refer to these resources:

 * [Dataset installation](https://github.com/NervanaSystems/distiller#set-up-the-classification-datasets) instructions.
-* [Jupiter installation](https://nervanasystems.github.io/distiller/jupyter/index.html#installation) instructions.
+* [Jupyter installation](https://nervanasystems.github.io/distiller/jupyter/index.html#installation) instructions.

 Notes:
 - Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.
diff --git a/docs-src/docs/regularization.md b/docs-src/docs/regularization.md
index 2881cad..6062ee2 100755
--- a/docs-src/docs/regularization.md
+++ b/docs-src/docs/regularization.md
@@ -50,7 +50,7 @@ Regularization can also be used to induce sparsity. To induce element-wise spar

 \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as *feature selection* and gives us another interpretation of pruning.

-[One](https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb) of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.
+[One](https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb) of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.

 If we configure ```weight_decay``` to zero and use \\(l_1\\)-norm regularization, then we have:

diff --git a/docs/index.html b/docs/index.html
index 134c354..d2061f3 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -236,5 +236,5 @@ And of course, if we used a sparse or compressed representation, then we are red

 <!--
 MkDocs version : 0.17.2
-Build Date UTC : 2018-04-24 23:01:45
+Build Date UTC : 2018-04-28 18:01:09
-->
diff --git a/docs/install/index.html b/docs/install/index.html
index ac6cb29..71347a3 100644
--- a/docs/install/index.html
+++ b/docs/install/index.html
@@ -160,7 +160,7 @@
 <p>You may also want to refer to these resources:</p>
 <ul>
 <li><a href="https://github.com/NervanaSystems/distiller#set-up-the-classification-datasets">Dataset installation</a> instructions.</li>
-<li><a href="https://nervanasystems.github.io/distiller/jupyter/index.html#installation">Jupiter installation</a> instructions.</li>
+<li><a href="https://nervanasystems.github.io/distiller/jupyter/index.html#installation">Jupyter installation</a> instructions.</li>
 </ul>
 <p>Notes:
 - Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.
diff --git a/docs/regularization/index.html b/docs/regularization/index.html
index 36b8b6d..c7fc349 100644
--- a/docs/regularization/index.html
+++ b/docs/regularization/index.html
@@ -203,7 +203,7 @@ for input, target in dataset:
 \lVert W \rVert_1 = l_1(W) = \sum_{i=1}^{|W|} |w_i|
 \]</p>
 <p>\(l_2\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \(l_1\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as <em>feature selection</em> and gives us another interpretation of pruning.</p>
-<p><a href="https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb">One</a> of Distiller's Jupiter notebooks explains how the \(l_1\)-norm regularizer induces sparsity, and how it interacts with \(l_2\)-norm regularization.</p>
+<p><a href="https://github.com/NervanaSystems/distiller/blob/master/jupyter/L1-regularization.ipynb">One</a> of Distiller's Jupyter notebooks explains how the \(l_1\)-norm regularizer induces sparsity, and how it interacts with \(l_2\)-norm regularization.</p>
 <p>If we configure <code>weight_decay</code> to zero and use \(l_1\)-norm regularization, then we have:
 \[
 loss(W;x;y) = loss_D(W;x;y) + \lambda_R \lVert W \rVert_1
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
index a8ead5f..746a63c 100644
--- a/docs/search/search_index.json
+++ b/docs/search/search_index.json
@@ -37,12 +37,12 @@
 },
 {
 "location": "/install/index.html",
- "text": "Distiller Installation\n\n\nThese instructions will help get Distiller up and running on your local machine.\n\n\nYou may also want to refer to these resources:\n\n\n\n\nDataset installation\n instructions.\n\n\nJupiter installation\n instructions.\n\n\n\n\nNotes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.\n\n\nClone Distiller\n\n\nClone the Distiller code repository from github:\n\n\n$ git clone https://github.com/NervanaSystems/distiller.git\n\n\n\n\nThe rest of the documentation that follows, assumes that you have cloned your repository to a directory called \ndistiller\n. \n\n\nCreate a Python virtual environment\n\n\nWe recommend using a \nPython virtual environment\n, but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness.\n\nBefore creating the virtual environment, make sure you are located in directory \ndistiller\n. After creating the environment, you should see a directory called \ndistiller/env\n.\n\n\n\nUsing virtualenv\n\n\nIf you don't have virtualenv installed, you can find the installation instructions \nhere\n.\n\n\nTo create the environment, execute:\n\n\n$ python3 -m virtualenv env\n\n\n\n\nThis creates a subdirectory named \nenv\n where the python virtual environment is stored, and configures the current shell to use it as the default python environment.\n\n\nUsing venv\n\n\nIf you prefer to use \nvenv\n, then begin by installing it:\n\n\n$ sudo apt-get install python3-venv\n\n\n\n\nThen create the environment:\n\n\n$ python3 -m venv env\n\n\n\n\nAs with virtualenv, this creates a directory called \ndistiller/env\n.\n\n\nActivate the environment\n\n\nThe environment activation and deactivation commands for \nvenv\n and \nvirtualenv\n are the same.\n\n\n!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:\n\n\n$ source env/bin/activate\n\n\n\n\nInstall dependencies\n\n\nFinally, install Distiller's dependency packages using \npip3\n:\n\n\n$ pip3 install -r requirements.txt\n\n\n\n\nPyTorch is included in the \nrequirements.txt\n file, and will currently download PyTorch version 3.1 for CUDA 8.0. This is the setup we've used for testing Distiller.",
+ "text": "Distiller Installation\n\n\nThese instructions will help get Distiller up and running on your local machine.\n\n\nYou may also want to refer to these resources:\n\n\n\n\nDataset installation\n instructions.\n\n\nJupyter installation\n instructions.\n\n\n\n\nNotes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.\n\n\nClone Distiller\n\n\nClone the Distiller code repository from github:\n\n\n$ git clone https://github.com/NervanaSystems/distiller.git\n\n\n\n\nThe rest of the documentation that follows, assumes that you have cloned your repository to a directory called \ndistiller\n. \n\n\nCreate a Python virtual environment\n\n\nWe recommend using a \nPython virtual environment\n, but that of course, is up to you.\nThere's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness.\n\nBefore creating the virtual environment, make sure you are located in directory \ndistiller\n. After creating the environment, you should see a directory called \ndistiller/env\n.\n\n\n\nUsing virtualenv\n\n\nIf you don't have virtualenv installed, you can find the installation instructions \nhere\n.\n\n\nTo create the environment, execute:\n\n\n$ python3 -m virtualenv env\n\n\n\n\nThis creates a subdirectory named \nenv\n where the python virtual environment is stored, and configures the current shell to use it as the default python environment.\n\n\nUsing venv\n\n\nIf you prefer to use \nvenv\n, then begin by installing it:\n\n\n$ sudo apt-get install python3-venv\n\n\n\n\nThen create the environment:\n\n\n$ python3 -m venv env\n\n\n\n\nAs with virtualenv, this creates a directory called \ndistiller/env\n.\n\n\nActivate the environment\n\n\nThe environment activation and deactivation commands for \nvenv\n and \nvirtualenv\n are the same.\n\n\n!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages:\n\n\n$ source env/bin/activate\n\n\n\n\nInstall dependencies\n\n\nFinally, install Distiller's dependency packages using \npip3\n:\n\n\n$ pip3 install -r requirements.txt\n\n\n\n\nPyTorch is included in the \nrequirements.txt\n file, and will currently download PyTorch version 3.1 for CUDA 8.0. This is the setup we've used for testing Distiller.",
 "title": "Installation"
 },
 {
 "location": "/install/index.html#distiller-installation",
- "text": "These instructions will help get Distiller up and running on your local machine. You may also want to refer to these resources: Dataset installation instructions. Jupiter installation instructions. Notes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.",
+ "text": "These instructions will help get Distiller up and running on your local machine. You may also want to refer to these resources: Dataset installation instructions. Jupyter installation instructions. Notes:\n- Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.\n- If you are not using a GPU, you might need to make small adjustments to the code.",
 "title": "Distiller Installation"
 },
 {
@@ -217,7 +217,7 @@
 },
 {
 "location": "/regularization/index.html",
- "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error. In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n \nDeep Learning\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n arXiv:1705.08922v3,\n 2017.\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n \nStructured pruning of deep convolutional neural networks\n,\n arXiv:1512.08571,\n 2015",
+ "text": "Regularization\n\n\nIn their book \nDeep Learning\n Ian Goodfellow et al. define regularization as\n\n\n\n\"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\"\n\n\n\n\nPyTorch's \noptimizers\n use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance).\n\n\nIn general, we can write this as:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W)\n\\]\nAnd specifically,\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2\n\\]\nWhere W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or \ncriterion\n in the Distiller sample image classifier compression application).\n\n\noptimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001)\ncriterion = nn.CrossEntropyLoss()\n...\nfor input, target in dataset:\n optimizer.zero_grad()\n output = model(input)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n\n\n\n\\(\\lambda_R\\) is a scalar called the \nregularization strength\n, and it balances the data error and the regularization error. In PyTorch, this is the \nweight_decay\n argument.\n\n\n\\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a \nmagnitude\n, or sizing, of the weights tensor.\n\\[\n\\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l)\n\\]\n\n\n\\(L\\) is the number of layers in the network; and the notation about used 1-based numbering to simplify the notation.\n\n\nThe qualitative differences between the \\(l_2\\)-norm, and the squared \\(l_2\\)-norm is explained in \nDeep Learning\n.\n\n\nSparsity and Regularization\n\n\nWe mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.\n\n\nIn \nDense-Sparse-Dense (DSD)\n, Song Han et al. use pruning as a regularizer to improve a model's accuracy:\n\n\n\n\n\"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\"\n\n\n\n\nRegularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\]\n\n\n\\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as \nfeature selection\n and gives us another interpretation of pruning.\n\n\nOne\n of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization.\n\n\nIf we configure \nweight_decay\n to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\]\n\n\nClass \ndistiller.L1Regularizer\n implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization.\n\n\nl1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()\n\n\n\n\nGroup Regularization\n\n\nIn Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined.\n\n\nTo the data loss, and the element-wise regularization (if any), we can add group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)})\n\\]\n\n\nLet's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\).\n\n\n\\[\nR_g(w^{(g)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_g = \\sum_{g=1}^{G} \\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2\n\\]\nwhere \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\).\n\n\n\\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).\n\n\n\nGroup Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and therefore\nit can be beneficial to improve inference speed.\n\n\nHuizi-et-al-2017\n provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even \nintra kernel strided sparsity\n can also be used.\n\n\ndistiller.GroupLassoRegularizer\n currently implements most of these groups, and you can easily add new groups.\n\n\nReferences\n\n\n \nIan Goodfellow and Yoshua Bengio and Aaron Courville\n.\n \nDeep Learning\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\nSong Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally\n.\n \nDSD: Dense-Sparse-Dense Training for Deep Neural Networks\n,\n arXiv:1607.04381v2,\n 2017.\n\n\n\n\nHuizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally\n.\n \nExploring the Regularity of Sparse Structure in Convolutional Neural Networks\n,\n arXiv:1705.08922v3,\n 2017.\n\n\n\n\nSajid Anwar, Kyuyeon Hwang, and Wonyong Sung\n.\n \nStructured pruning of deep convolutional neural networks\n,\n arXiv:1512.08571,\n 2015",
 "title": "Regularization"
 },
 {
@@ -227,7 +227,7 @@
 },
 {
 "location": "/regularization/index.html#sparsity-and-regularization",
- "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupiter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()",
+ "text": "We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\).\n\\[\n\\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i|\n\\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1\n\\]\nIf we use both regularizers, we have:\n\\[\nloss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1\n\\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters())\n...\nloss = criterion(output, target) + lambda * l1_regularizer()",
 "title": "Sparsity and Regularization"
 },
 {
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index fac22d9..1d161d2 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@

 <url>
  <loc>/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -12,7 +12,7 @@

 <url>
  <loc>/install/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -20,7 +20,7 @@

 <url>
  <loc>/usage/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -28,7 +28,7 @@

 <url>
  <loc>/schedule/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -37,19 +37,19 @@

 <url>
  <loc>/pruning/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

 <url>
  <loc>/regularization/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

 <url>
  <loc>/quantization/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -58,7 +58,7 @@

 <url>
  <loc>/algorithms/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -66,7 +66,7 @@

 <url>
  <loc>/model_zoo/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -74,7 +74,7 @@

 <url>
  <loc>/jupyter/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

@@ -82,7 +82,7 @@

 <url>
  <loc>/design/index.html</loc>
- <lastmod>2018-04-25</lastmod>
+ <lastmod>2018-04-28</lastmod>
  <changefreq>daily</changefreq>
 </url>

--
GitLab
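The regularization entries touched by this patch describe \(l_1\)-norm regularization and show the shorthand ```loss = criterion(output, target) + lambda * l1_regularizer()```. For readers who want to try the recipe outside Distiller, here is a minimal self-contained sketch of that training loop in plain PyTorch. The ```l1_penalty``` helper, ```lambda_l1```, and the toy ```dataset``` are illustrative names invented for this sketch, not Distiller APIs; in Distiller itself the ```L1Regularizer``` class mentioned in the docs plays the equivalent role.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def l1_penalty(parameters):
    """Return ||W||_1: the sum of |w_i| over all weight elements."""
    return sum(p.abs().sum() for p in parameters)

model = nn.Linear(10, 2)           # stand-in for any network
criterion = nn.CrossEntropyLoss()
# weight_decay=0 disables the optimizer's built-in l2 term, so the explicit
# l1 term below is the only regularizer (the docs' "weight_decay to zero" case)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0)
lambda_l1 = 1e-4                   # regularization strength, lambda_R

# toy batches of (input, target), standing in for a real data loader
dataset = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(8)]

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    # loss(W;x;y) = loss_D(W;x;y) + lambda_R * ||W||_1
    loss = criterion(output, target) + lambda_l1 * l1_penalty(model.parameters())
    loss.backward()
    optimizer.step()
```

Driving ```lambda_l1``` higher pushes more weight elements toward exactly zero, which is the sparsity-inducing behavior the docs contrast with \(l_2\) weight decay.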