  1. May 20, 2019
    • Neta Zmora's avatar
      Thinning: added support for group-wise convolutions · 6b832025
      Neta Zmora authored
      Group-wise convolutions with num_groups == num_in_channels, as
      configured in MobileNet for example, create attribute and shape dependency
      chains that are longer than convolutions with num_groups == 1.
      For example in the sequence below, changing the number of filters of the
      weights of Conv1, triggers changes in BN1, Conv2, BN2, and Conv3
      (g indicates the number of groups):
      
      Conv1(g=1) => BN1 ==> Conv2(g=32) ==> BN2 ==> Conv3(g=1)
      
      Changing the number of filters used in Conv1 affects the parameters and
      attributes of BN1 and Conv2: the number of input channels of Conv2 is
      changed, as explained in
      https://nervanasystems.github.io/distiller/tutorial-struct_pruning.html.
      
      However, since Conv2 has num_groups == num_in_channels, we have to change
      num_groups, which triggers a change in num_out_channels.  This is akin to
      changing the number of filters of Conv2, which triggers a change in BN2
      and Conv3.
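
      The dependency chain above can be sketched with plain attribute bookkeeping (a hypothetical illustration, not Distiller's thinning code; the layer dicts and `thin_conv1_filters` are invented for this sketch):

```python
# Hypothetical sketch of the dependency chain described above; each dict
# holds the attributes that thinning must keep consistent.
conv1 = {"in": 3,  "out": 32, "groups": 1}
bn1   = {"features": 32}
conv2 = {"in": 32, "out": 32, "groups": 32}  # depthwise: groups == in-channels
bn2   = {"features": 32}
conv3 = {"in": 32, "out": 64, "groups": 1}

def thin_conv1_filters(new_out):
    conv1["out"] = new_out        # remove filters from Conv1...
    bn1["features"] = new_out     # ...BN1 must follow...
    conv2["in"] = new_out         # ...and so must Conv2's input channels.
    # Conv2 is depthwise (groups == in-channels), so its groups -- and
    # therefore its out-channels -- must change as well:
    conv2["groups"] = new_out
    conv2["out"] = new_out
    # That is akin to removing filters from Conv2, so the change
    # propagates onward, to BN2 and Conv3:
    bn2["features"] = new_out
    conv3["in"] = new_out

thin_conv1_filters(16)
```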
      
      models/mobilenet.py:
      Changed the code that flattens the output of the feature extractor
      and prepares it as input to the classifier.
      The code was written using hard-coded shape values, which made it
      impossible to use in thinned models (where dimensions are changed).
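
      The shape-agnostic pattern looks roughly like this (a sketch of the idea; `flatten_features` is an illustrative name, and the exact MobileNet code differs):

```python
import torch

# Hardcoded flattening breaks once thinning changes channel counts:
#     x = x.view(-1, 1024)
# Deriving the size from the tensor itself survives thinning:
def flatten_features(x):
    return x.view(x.size(0), -1)  # keep the batch dim, infer the rest at runtime
```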
      
      tests/test_pruning.py:
      Added test for thinning MobileNet (grouped convolutions)
      6b832025
  2. May 16, 2019
  3. Apr 08, 2019
  4. Apr 01, 2019
    • Bar's avatar
      Load optimizer from checkpoint (BREAKING - see details) (#182) · 992291cf
      Bar authored
      Load optimizer from checkpoint (BREAKING - see details) (#182)
      
      * Fixes issues #70, #145 and replaces PR #74
      * checkpoint.py
        * save_checkpoint will now save the optimizer type in addition to
          its state
        * load_checkpoint will now instantiate an optimizer based on the
          saved type and load its state
      * config.py: file/dict_config now accept the resumed epoch to pass to
        LR schedulers
      * policy.py: LRPolicy now passes the current epoch to the LR scheduler
      * Classifier compression sample
        * New flag '--resume-from' for properly resuming a saved training
          session, inc. optimizer state and epoch #
        * Flag '--reset-optimizer' added to allow discarding of a loaded
          optimizer.
        * BREAKING:
          * Previous flag '--resume' is deprecated and is mapped to
            '--resume-from' + '--reset-optimizer'. 
          * However, the old resuming behavior had an inconsistency: the epoch
            count would continue from the saved epoch, but the LR scheduler
            was set up as if we were starting from epoch 0.
          * Using '--resume-from' + '--reset-optimizer' now will simply
            RESET the epoch count to 0 for the whole environment.
          * This means that scheduling configurations (in YAML or code)
            which assumed use of '--resume' might need to be changed to
            reflect the fact that the epoch count now starts from 0
          * All relevant YAML files under 'examples' modified to reflect
            this change
      * Initial support for ReduceLROnPlateau (#161):
        * Allow passing **kwargs to policies via the scheduler
        * Image classification now passes the validation loss to the
          scheduler, to be used by ReduceLROnPlateau
        * The current implementation is experimental and subject to change
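
      The type-plus-state scheme can be sketched as follows (illustrative function names, not Distiller's actual checkpoint format):

```python
import torch
import torch.nn as nn

def make_checkpoint(model, optimizer, epoch):
    # Save the optimizer's type in addition to its state
    return {
        "epoch": epoch,
        "state_dict": model.state_dict(),
        "optimizer_type": type(optimizer),
        "optimizer_state_dict": optimizer.state_dict(),
    }

def restore_checkpoint(checkpoint, model):
    model.load_state_dict(checkpoint["state_dict"])
    # Instantiate an optimizer of the saved type, then load its state
    # (the lr passed here is a placeholder; load_state_dict restores it):
    optimizer = checkpoint["optimizer_type"](model.parameters(), lr=0.1)
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return model, optimizer, checkpoint["epoch"]
```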
      992291cf
  5. Feb 26, 2019
  6. Dec 01, 2018
    • Neta Zmora's avatar
      Important changes to pruning channels and filters (#93) · a0bf2a8f
      Neta Zmora authored
      This commit contains the main fix for issue #85.  It contains a couple of changes to the YAML structure pruning API, with examples.
      I urge you to read the documentation in the Wiki (https://github.com/NervanaSystems/distiller/wiki/Pruning-Filters-&-Channels).
      
      New syntax for defining Structured AGP.  I tried to make the syntax similar to fine-grained
      (i.e. element-wise) pruning.  All you need to do is add: ```group_type: Filters```.
      ```
        low_pruner:
          class: L1RankedStructureParameterPruner_AGP
          initial_sparsity : 0.10
          final_sparsity: 0.50
          group_type: Filters
          weights: [module.layer3.0.conv2.weight,
                    module.layer3.0.downsample.0.weight,
                    module.layer3.1.conv2.weight,
                    module.layer3.2.conv2.weight]
      ```
      
      If you want to define “leader-based” pruning dependencies, add ```group_dependency: Leader```:
      ```
        low_pruner:
          class: L1RankedStructureParameterPruner_AGP
          initial_sparsity : 0.10
          final_sparsity: 0.50
          group_type: Filters
          group_dependency: Leader
          weights: [module.layer3.0.conv2.weight,
                    module.layer3.0.downsample.0.weight,
                    module.layer3.1.conv2.weight,
                    module.layer3.2.conv2.weight]
      ```
      
      Retired the old ```reg_regims``` API for describing one-shot structured-pruning.
      
      The new YAML API is very similar to AGP structured-pruning, which is much better
      than before.
      The new API also allows us to describe data-dependencies when doing one-shot
      structured pruning, just like AGP structured-pruning.
      
      This commit also includes further code refactoring.
      
      Old API:
      ```
        filter_pruner:
           class: 'L1RankedStructureParameterPruner'
           reg_regims:
             'module.layer1.0.conv1.weight': [0.6, '3D']
             'module.layer1.1.conv1.weight': [0.6, '3D']
      ```
      
      New API:
      ```
       filter_pruner:
          class: 'L1RankedStructureParameterPruner'
          group_type: Filters
          desired_sparsity: 0.6
          weights: [
            module.layer1.0.conv1.weight,
            module.layer1.1.conv1.weight]
      ```
      
      thresholding.py – separate the generation of the binary_map from the pruning_mask so that we
      can cache the binary map and share it between several modules.
      
      pruning/automated_gradual_pruner.py – major refactoring to support “leader-based”
      sub-graph pruning dependencies.  The concept is explained in issue #85.
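
      A rough sketch of the binary-map idea (hypothetical function names, L1 filter ranking only; the real thresholding code covers more structure types):

```python
import torch

def filter_binary_map(weights, fraction_to_prune):
    # Rank filters (dim 0) by their L1 norm and keep the strongest ones
    scores = weights.abs().sum(dim=(1, 2, 3))
    num_prune = int(fraction_to_prune * scores.numel())
    threshold = scores.sort().values[num_prune]
    return (scores >= threshold).float()

def mask_from_binary_map(weights, binary_map):
    # Expand the cached per-filter map to this module's weights shape,
    # so a leader's map can be shared by its followers
    return binary_map.view(-1, 1, 1, 1).expand_as(weights)
```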
      
      
      agp-pruning/resnet20_filters.schedule_agp.yaml
      agp-pruning/resnet20_filters.schedule_agp_2.yaml
      agp-pruning/resnet20_filters.schedule_agp_3.yaml
      network_trimming/resnet56_cifar_activation_apoz.yaml
      network_trimming/resnet56_cifar_activation_apoz_v2.yaml
      a0bf2a8f
  7. Jul 22, 2018
  8. Jul 21, 2018
  9. Jul 15, 2018
    • Neta Zmora's avatar
      Thinning: bug fixes · b48908c3
      Neta Zmora authored
      There are two different “namespaces” referring to module names:
      normalized and de-normalized.
      Normalized module names are module names that have the same
      format for both data-parallel and data-serial models.
      De-normalized module names are the “raw” PyTorch module names
      that reflect the full model graph.  So if there is a container module
      such as nn.DataParallel in the model, then a sub-module’s name
      will have the “module” substring somewhere in it.
      
      SummaryGraph operates by converting the PyTorch model to ONNX, and
      I’ve had issues handling nn.DataParallel in this process.
      Therefore, SummaryGraph uses only normalized names internally.
      
      PruningRecipe, on the other hand, uses de-normalized names
      because it needs to operate on the model itself.
      
      This is a sticky situation that can create really annoying bugs and makes
      for some ugly code.  Nonetheless, this is the best I can do right now,
      and I’ll probably revisit this soon to make it nicer.
      For now, I’m pushing this commit that fixes the distinction between the
      two namespaces, and fixes related bugs – in the hope that it is not too
      brittle.
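
      Simplified versions of the two mappings look like this (a sketch only; Distiller's actual helpers handle more cases):

```python
def normalize_module_name(name):
    # Strip the "module." component that nn.DataParallel containers insert,
    # yielding a name that is identical for parallel and serial models
    return name.replace("module.", "", 1) if "module." in name else name

def denormalize_module_name(model_module_names, normalized_name):
    # Recover the "raw" PyTorch name, which is what the model itself uses
    for raw_name in model_module_names:
        if normalize_module_name(raw_name) == normalized_name:
            return raw_name
    return normalized_name
```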
      
      append_module_directive – now uses denormalize_module_name to
      ensure recipe module names are denormalized.
      
      append_param_directive – because we are dealing with parameters,
      I can’t use denormalize_module_name as easily as in append_module_directive.
      The clean solution is kept for later :-(
      b48908c3
  10. Jul 13, 2018
    • Neta Zmora's avatar
      ADC (Automatic Deep Compression) example + features, tests, bug fixes (#28) · 718f777b
      Neta Zmora authored
      This is a merge of the ADC branch and master.
      ADC (using a DDPG RL agent to compress image classifiers) is still WiP and requires
      an unreleased version of Coach (https://github.com/NervanaSystems/coach).
      
      Small features in this commit:
      - Added model_find_module() - find a module object given its name
      - Add channel ranking and pruning: pruning/ranked_structures_pruner.py
      - Add a CIFAR10 VGG16 model: models/cifar10/vgg_cifar.py
      - Thinning: change the level of some log messages – some of the messages were
      moved to ‘debug’ level because they are not usually interesting.
      - Add a function to print nicely formatted integers - distiller/utils.py
      - Sensitivity analysis for channels-removal
      - compress_classifier.py – handle keyboard interrupts
      - compress_classifier.py – fix re-raise of exceptions, so they maintain call-stack
      
      - Added tests:
      -- test_summarygraph.py: test_simplenet() - Added a regression test to target a bug that occurs when taking the predecessor of the first node in a graph
      -- test_ranking.py - test_ch_ranking, test_ranked_channel_pruning
      -- test_model_summary.py - test_png_generation, test_summary (sparsity/ compute/model/modules)
      
      - Bug fixes in this commit:
      -- Thinning bug fix: handle zero-sized 'indices' tensor
      During the thinning process, the 'indices' tensor can become zero-sized,
      and will have an undefined length. Therefore, we need to check for this
      situation when assessing the number of elements in 'indices'.
      -- Language model: adjust main.py to new distiller.model_summary API
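
      The guard amounts to something like this (a sketch of the idea; `num_indices` is an illustrative name and the fix in the thinning code differs in detail):

```python
import torch

def num_indices(indices):
    # len() is undefined for a zero-dim tensor, so check the
    # dimensionality before counting the elements in 'indices'
    if indices.dim() == 0:
        return 1
    return len(indices)
```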
      718f777b
  11. Jul 11, 2018
    • Neta Zmora's avatar
      Extend pruning tests to parallel models · e3e41ba6
      Neta Zmora authored
      e3e41ba6
    • Neta Zmora's avatar
      More robust handling of data-parallel/serial graphs (#27) · b64be690
      Neta Zmora authored
      Remove the complicated logic trying to handle data-parallel models as
      serially-processed models, and vice versa.
      
      *Function distiller.utils.make_non_parallel_copy() does the heavy lifting of
      replacing  all instances of nn.DataParallel in a model with instances of
      DoNothingModuleWrapper.
      The DoNothingModuleWrapper wrapper does nothing but forward to the
      wrapped module.  This is a trick we use to transform a data-parallel model
      to a serial-processed model.
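
      The wrapper trick can be sketched as follows (simplified relative to the actual distiller.utils.make_non_parallel_copy()):

```python
import copy
import torch
import torch.nn as nn

class DoNothingModuleWrapper(nn.Module):
    """Does nothing but forward to the wrapped module."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)

def make_non_parallel_copy(model):
    # Work on a copy; replace every nn.DataParallel with the wrapper.
    # Keeping the attribute name 'module' preserves submodule names.
    new_model = copy.deepcopy(model)

    def replace_data_parallel(container):
        for name, child in container.named_children():
            if isinstance(child, nn.DataParallel):
                setattr(container, name, DoNothingModuleWrapper(child.module))
                replace_data_parallel(child.module)
            else:
                replace_data_parallel(child)

    replace_data_parallel(new_model)
    if isinstance(new_model, nn.DataParallel):
        new_model = DoNothingModuleWrapper(new_model.module)
    return new_model
```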
      
      *SummaryGraph uses a copy of the model after the model is processed by
      distiller.make_non_parallel_copy() which renders the model non-data-parallel.
      
      *The same goes for model_performance_summary()
      
      *Model inputs are explicitly placed on the Cuda device, since now all models are
      executed on the CPU.  Previously, if a model was not created using
      nn.DataParallel, then the model was not explicitly placed on the Cuda device.
      
      *The logic in distiller.CompressionScheduler that attempted to load a
      model parallel model and process it serially, or load a serial model and
      process it data-parallel, was removed.  This removes a lot of fuzziness and makes
      the code more robust: we do not needlessly try to be heroes.
      
      * model summaries - remove pytorch 0.4 warning
      
      * create_model: remove redundant .cuda() call
      
      * Tests: support both parallel and serial tests
      b64be690
  12. Jun 30, 2018
    • Neta Zmora's avatar
      Bug fix: add support for thinning the optimizer · b21f449b
      Neta Zmora authored
      You no longer need to use --momentum=0 when removing structures
      dynamically.
      The SGD momentum update (velocity) is dependent on the weights, which
      PyTorch optimizers cache internally.  This caching is not a problem for
      filter/channel removal (thinning) because although we dynamically
      change the shapes of the weights tensors, we don’t change the weights
      tensors themselves.
      PyTorch’s SGD creates tensors to store the momentum updates, and these
      tensors have the same shape as the weights tensors.  When we change the
      weights tensors, we need to make the appropriate changes in the Optimizer,
      or disable the momentum.
      We added a new function - thinning.optimizer_thinning() - to do this.
      This function is brittle as it is tested only on optim.SGD and relies on the
      internal representation of the SGD optimizer, which can change w/o notice.
      For example, optim.Adam uses state['exp_avg'] and state['exp_avg_sq'],
      which also depend on the shape of the weight tensors.
      We needed to pass the Optimizer instance to Thinning policies
      (ChannelRemover, FilterRemover) via the callbacks, which required us
      to change the callback interface.
      In the future we plan a bigger change to the callback API, to allow
      passing of arbitrary context from the training environment to Distiller.
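
      The essence of the new function can be sketched like so (illustrative; `thin_sgd_momentum` is an invented name, and the real thinning.optimizer_thinning() relies on optim.SGD internals and covers more cases):

```python
import torch
import torch.nn as nn

def thin_sgd_momentum(optimizer, param, dim, keep_indices):
    # When filters/channels are removed from 'param' along 'dim', shrink
    # the cached momentum buffer the same way so shapes stay consistent
    state = optimizer.state.get(param)
    if state and state.get("momentum_buffer") is not None:
        buf = state["momentum_buffer"]
        state["momentum_buffer"] = torch.index_select(buf, dim, keep_indices)
```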
      
      Also in this commit:
      * compress_classifier.py had special handling for resnet layer-removal, which
      is used in examples/ssl/ssl_4D-removal_training.yaml.
      This is a brittle and ugly hack.  Until we have a more elegant solution, I’m
      removing support for layer-removal.
      * Added to the tests invocation of forward and backward passes over a model.
      This tests more of the real flows, which use the optimizer and construct
      gradient tensors.
      * Added a test of a special case of convolution filter-pruning which occurs
      when the next layer is fully-connected (linear)
      b21f449b
  13. Jun 26, 2018
  14. Jun 14, 2018
    • Neta Zmora's avatar
      Fixed a couple of bugs and added tests · 1d62f96f
      Neta Zmora authored
      When removing channels and thinning, the number of filters of the
      next layer was not set correctly.
      
      When loading a model that has already been thinned (e.g when loading
      a model, thinning, saving, loading), don’t crash on wrong tensor sizes.
      
      Cache the thinning recipe in the model when loading from checkpoint.
      Without this, a loaded thin model will lose its recipes when saved to
      checkpoint.
      1d62f96f
  15. Jun 13, 2018