diff --git a/examples/agp-pruning/README.md b/examples/agp-pruning/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ad2846a115391b7e39cdb385f839e2cc1560e3f7 --- /dev/null +++ b/examples/agp-pruning/README.md @@ -0,0 +1,59 @@
+## Automated Gradual Pruner (AGP) Pruning Examples
+
+### Introduction
+In [To prune, or not to prune: exploring the efficacy of pruning for model compression](https://arxiv.org/abs/1710.01878),
+authors Michael Zhu and Suyog Gupta provide an algorithm for scheduling iterative level pruning.
+
+> We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value (usually 0) to a final sparsity value over a span of n pruning steps.
+The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are
+abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.
+
+The authors describe AGP as follows:
+- Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.
+- It doesn't require much hyper-parameter tuning.
+- It is shown to perform well across different models.
+- It does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.
+
+### Distiller
+* The original AGP paper described the application of AGP to fine-grained pruning; in Distiller we also implement AGP for structured pruning.
+* We also provide examples of applying AGP to pruning language models. The results and
+methodology are discussed at length in the [documentation](https://nervanasystems.github.io/distiller/tutorial-lang_model.html).
+
+### Examples
+
+The tables below provide the results of the experimental pruning schedules that
+appear in this directory. Each example YAML schedule-file contains the command-line
+used to execute the experiment, and further details.
+
+#### Element-wise sparsity
+| Model | Granularity | Sparsity (%) | Top1 | Baseline Top1
+| --- | :--- | ---: | ---: | ---: |
+| AlexNet | Fine | 88.3 | 56.528 | 56.55
+| MobileNet v1 (width=1) | Fine | 51.6 | 68.8 | 68.9
+| ResNeXt-101-32x4d | Fine | 75.0 | 78.66 | 78.19
+| ResNet-18 | Fine | 59.9 | 69.87 | 69.76
+| ResNet-50 | Fine | 26.0 | 76.54 | 76.15
+| ResNet-50 | Fine | 80.0 | 75.99 | 76.15
+| ResNet-50 | Fine | 84.6 | 75.66 | 76.15
+
+#### Block sparsity
+| Model | Granularity | Sparsity (%) | Top1 | Baseline Top1
+| --- | :--- | ---: | ---: | ---: |
+| ResNet-50 | 1x1x8 | 36.7 | 76.36 | 76.15
+
+#### Filter pruning with thinning
+
+Our objective here is to minimize compute by performing thinning. Therefore,
+sparsity is often at 0%, but the number of parameters (and the compute) is reduced as
+entire filters are removed; the short worked example below illustrates why.
+
+In the table that follows, we look for a <b>lower</b> value of `Parameters Kept (%)` and, more importantly,
+`Compute Kept (%)`.
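To make the connection between removed filters and these two columns concrete, here is a small, self-contained sketch (plain Python; it is illustrative only, not Distiller code, and the layer sizes are made up rather than taken from ResNet-50). The key point is that removing filters from one convolution also removes the matching input channels of the next convolution, so both parameter count and MACs drop even though the surviving weights stay dense:

```python
# Illustrative only: removing entire filters shrinks both the parameter count
# and the MACs, even though the surviving weights remain fully dense (0% sparsity).

def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return c_out * c_in * k * k

def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one forward pass of that convolution."""
    return conv_params(c_in, c_out, k) * h_out * w_out

# Two back-to-back 3x3 convolutions on 56x56 feature maps (made-up sizes).
params_before = conv_params(64, 128, 3) + conv_params(128, 128, 3)
macs_before = conv_macs(64, 128, 3, 56, 56) + conv_macs(128, 128, 3, 56, 56)

# Thinning: physically remove 50% of the first layer's filters.  The second
# layer then loses the matching input channels, so it shrinks as well.
params_after = conv_params(64, 64, 3) + conv_params(64, 128, 3)
macs_after = conv_macs(64, 64, 3, 56, 56) + conv_macs(64, 128, 3, 56, 56)

print(f"Parameters kept: {100.0 * params_after / params_before:.1f}%")  # 50.0%
print(f"Compute kept:    {100.0 * macs_after / macs_before:.1f}%")      # 50.0%
```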
+ +| Model | Granularity | Sparsity (%) | Parameters Kept (%) | Compute Kept (%)| Top1 | Baseline Top1 +| --- | :--- | ---: | ---: | ---: | ---: | ---: | +| ResNet-50 | Filters| 0.0 | 43.37 | 44.56 | 74.47 | 76.15 +| ResNet-50 (2) | Filters| 0.0 | 49.69 | 49.82 | 74.78 | 76.15 +| ResNet-50 (3) | Filters| 0.0 | 67.95 | 67.33 | 75.75 | 76.15 +| ResNet-50 (w/ FC) | Filters| 11.6 | 42.74 | 44.56 | 74.56 | 76.15 \ No newline at end of file diff --git a/examples/agp-pruning/prune_resnext101_fine75.yaml b/examples/agp-pruning/prune_resnext101_fine75.yaml index 4a23d1fb83e48c248caf80f28da4f673f58cc337..f135508e1080b2134c28465286e0932437b65821 100755 --- a/examples/agp-pruning/prune_resnext101_fine75.yaml +++ b/examples/agp-pruning/prune_resnext101_fine75.yaml @@ -1,7 +1,7 @@ # python3 ${DISTILLER_PATH}/examples/classifier_compression/compress_classifier.py --arch=resnext101_32x4d --pretrained --compress=${THIS} --epochs=81 --lr 0.01 ${IMAGENET_PATH} --vs=0 # Parameters: -# 2019-03-18 13:15:41,611 - +-----+-------------------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ +# +-----+-------------------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ # | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | # |-----+-------------------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| # | 0 | module.features.4.0.0.0.0.0.weight | (128, 64, 1, 1) | 8192 | 2048 | 0.00000 | 0.00000 | 20.31250 | 75.00000 | 37.50000 | 75.00000 | 0.04950 | -0.00209 | 0.01767 | diff --git a/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml b/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml index 4b35651067a8f09dbb7801100372d0750da73af6..2c206a2371b9782b1fdd973a0b696baf492f8b35 100755 --- a/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml +++ b/examples/agp-pruning/resnet50_pruning_for_accuracy.schedule_agp.yaml @@ -2,7 +2,7 @@ # of ResNet50 using the ImageNet dataset. # Top1 is 76.538 (=23.462 error rate) vs the published Top1: 76.15 (https://pytorch.org/docs/stable/torchvision/models.html) # -# I ran this for 80 epochs, but it can probably run for a much shorter time and prodcue the same results (50 epochs?) +# I ran this for 80 epochs, but it can probably run for a much shorter time and produce the same results (50 epochs?) 
#
# time python3 compress_classifier.py -a=resnet50 --pretrained -p=50 ../../../data.imagenet/ -j=22 --epochs=80 --lr=0.001 --compress=resnet50.schedule_agp.yaml
#
diff --git a/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml b/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml index 3489c2654a276db72f5d7de0ac77415d4a402297..5a231c72fb0c022aac0c86c370c445e2ecbb1859 100755 --- a/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml +++ b/examples/baseline_networks/cifar/plain20_cifar_baseline_training.yaml @@ -31,7 +31,7 @@
# time python3 compress_classifier.py --arch=plain20_cifar ../../../data.cifar -p=50 --lr=0.1 --epochs=180 --batch=128 --compress=../baseline_networks/cifar/plain20_cifar_baseline_training.yaml --gpu=0 -j=1 --deterministic
#
# Results:
-# Top1 = 90.18 - which is 0.3% lower than ower goal.
+# Top1 = 90.18 - which is 0.3% lower than our goal.
# *For better results, with much shorter training, see the explanation after the tables below.
#
# Parameters:
diff --git a/examples/drop_filter/README.md b/examples/drop_filter/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7afc45a2aeec6d9c2ca8a2047a783d5f61a58792 --- /dev/null +++ b/examples/drop_filter/README.md @@ -0,0 +1,71 @@
+## DropFilter
+DropFilter is a regularization method similar to Dropout, which drops entire convolutional
+filters instead of individual neurons.
+However, unlike the original intent of DropFilter (to act as a regularizer and reduce the generalization error
+of the network), here we employ higher rates of filter-dropping (rates are increased over time by following an AGP
+schedule) in order to make the network more robust to filter-pruning. We test this robustness using sensitivity
+analysis.
+
+
+A relevant quote from [3]:
+> To train slimmable neural networks, we begin with a naive approach, where we directly train a
+shared neural network with different width configurations. The training framework is similar to the
+one of our final approach, as shown in Algorithm 1. The training is stable, however, the network
+obtains extremely low top-1 testing accuracy around 0.1% on 1000-class ImageNet classification.
+Error curves of the naive approach are shown in Figure 2. We conjecture the major problem in
+the naive approach is that: for a single channel in a layer, different numbers of input channels in
+previous layer result in different means and variances of the aggregated feature, which are then
+rolling averaged to a shared batch normalization layer. The inconsistency leads to inaccurate batch
+normalization statistics in a layer-by-layer propagating manner. Note that these batch normalization
+statistics (moving averaged means and variances) are only used during testing, in training the means
+and variances of the current mini-batch are used.
+
+### Examples
+
+Dropping filters requires finer control over the scheduling process, since we want to drop a different set
+of filters every `n` training iterations/steps (i.e. mini-batches), whereas we usually make such decisions
+at the epoch boundary.
+
+1. [plain20_cifar_dropfilter_training.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/plain20_cifar_dropfilter_training.yaml)
+
+ In this example we train a Plain20 model with increasing levels of dropped filters,
+ starting at 5% drop and going as high as 50% drop. Our aim is to make the network more robust to filter-pruning.<br>
+ * The network is trained from scratch.
+ * We spend a few epochs just training, so that we start from weights that are somewhat trained.
+ * We use AGP to control the schedule of the slow increase in the percentage of dropped filters.
+ * To choose which filters to drop we use a Bernoulli probability function (see the illustrative sketch after the references).
+
+ | Model | Drop Rate | Top1 | Baseline Top1
+ | --- | :---: | ---: | ---: |
+ | Plain-20 | 5-50% | 89.61 | 90.18
+
+2. [plain20_cifar_dropfilter_training_regularization.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/plain20_cifar_dropfilter_training_regularization.yaml)
+
+ In this example we use DropFilter for regularization.
+ * The network is trained from scratch.
+ * To choose which filters to drop we use a Bernoulli probability function.
+
+ | Model | Drop Rate | Top1 | Baseline Top1
+ | --- | :---: | ---: | ---: |
+ | Plain-20 | 10% | 90.88 | 90.18
+
+3. [resnet20_cifar_randomlevel_training.yaml](https://github.com/NervanaSystems/distiller/blob/master/examples/drop_filter/resnet20_cifar_randomlevel_training.yaml)
+
+ In this example we randomly choose a percentage of filters to prune (level), and then use L1-norm ranking to
+ choose which filters to drop.
+
+ | Model | Drop Rate | Top1 | Baseline Top1
+ | --- | :---: | ---: | ---: |
+ | ResNet-20 | 10-20% | 90.80 | 90.78
+
+### References
+[1] Zhengsu Chen, Jianwei Niu, Qi Tian.
+ DropFilter: Dropout for Convolutions.
+ https://arxiv.org/abs/1810.09849
+
+[2] Hengyue Pan, Hui Jiang, Xin Niu, Yong Dou.
+ DropFilter: A Novel Regularization Method for Learning Convolutional Neural Networks.
+ https://arxiv.org/abs/1811.06783
+
+[3] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, Thomas Huang.
+ Slimmable Neural Networks. In ICLR 2019.
+ https://arxiv.org/abs/1812.08928
\ No newline at end of file
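For readers who want a concrete picture of the operation these schedules control, below is a minimal, illustrative sketch of Bernoulli filter-dropping in PyTorch. This is not Distiller's implementation: the `drop_filters` function is made up for illustration, and the inverted-dropout rescaling by `1/keep_prob` is an assumption borrowed from standard Dropout rather than taken from the schedules above.

```python
import torch

def drop_filters(weight: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Illustrative sketch (not Distiller code): zero out entire convolutional
    filters at random, the way Dropout zeroes individual activations.

    `weight` has shape (out_channels, in_channels, kH, kW); each output filter
    is kept with probability 1 - drop_rate.
    """
    keep_prob = 1.0 - drop_rate
    # One Bernoulli draw per filter, broadcast over that filter's weights.
    mask = torch.bernoulli(
        torch.full((weight.shape[0], 1, 1, 1), keep_prob, device=weight.device))
    # Inverted-dropout rescaling (an assumption here) keeps the expected
    # magnitude of the layer's output unchanged.
    return weight * mask / keep_prob

# Example: one training step of a 16-filter 3x3 convolution with a 30% drop rate.
w = torch.randn(16, 8, 3, 3)
w_dropped = drop_filters(w, drop_rate=0.3)
dropped = (w_dropped.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{dropped} of 16 filters dropped this step")
```

In the examples above the drop rate is not fixed: it is ramped (e.g., from 5% to 50%) by an AGP schedule, and a new set of dropped filters is chosen every `n` mini-batches.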