    Non-channel/filter block pruning (#119) · b9d53ff8
    Bar authored
    Block pruning: support specifying the block shape from the YAML file
    
    Block pruning refers to pruning 4-D structures of a specific shape.  This
    is why it is sometimes called structure-pruning or group-pruning
    (confusing, I know).
    A specific example of block pruning is filter or channel pruning, each of
    which has a highly regular block shape.
    This commit adds support for pruning blocks/groups/structures with
    irregular shapes that accelerate inference on a specific
    hardware platform.  You can read more about the regularity of shapes in
    [Exploring the Regularity of Sparse Structure in
    Convolutional Neural Networks](https://arxiv.org/pdf/1705.08922.pdf).
    
    When we want to introduce sparsity in order to reduce the compute load
    of a certain layer, we need to understand how the HW and SW perform
    the layer's operation, and how this operation is vectorized.  Then we can
    induce sparsity to match the vector shape.
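    As a concrete illustration (a minimal NumPy sketch, not this repository's
    implementation), block pruning along the input dimension with a block size
    that matches the vector width might look like the following; the function
    name and the 50% sparsity target are only for illustration:

         import numpy as np

         def prune_blocks(weights, block_size=16, sparsity=0.5):
             """Zero the fraction `sparsity` of 1 x block_size blocks with the smallest L1 norm."""
             out_ch, in_ch = weights.shape
             assert in_ch % block_size == 0, "input dim must divide evenly into blocks"
             blocks = weights.reshape(out_ch, in_ch // block_size, block_size)
             scores = np.abs(blocks).sum(axis=-1)            # one saliency score per block
             threshold = np.quantile(scores, sparsity)       # blocks scoring below this are dropped
             mask = (scores >= threshold)[..., np.newaxis]   # broadcast the keep/drop decision
             return (blocks * mask).reshape(out_ch, in_ch)

         w = np.random.randn(64, 128).astype(np.float32)
         w_pruned = prune_blocks(w, block_size=16, sparsity=0.5)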
    
    For example, Intel AVX-512 is a set of SIMD instructions that apply the
    same instruction (Single Instruction) to a vector of inputs (Multiple
    Data).  The following single instruction performs an element-wise
    multiplication of two vectors of sixteen 32-bit elements each:

         __m512i result = _mm512_mullo_epi32(vec_a, vec_b);
    
    If either vec_a or vec_b is only partially sparse, we still need to perform
    the multiplication, and the sparsity does not help reduce the
    cost (power, latency) of the computation.  However, if either vec_a or vec_b
    contains only zeros, then we can eliminate the instruction entirely.  In this
    case, we say that we would like to have group sparsity of 16 elements,
    i.e. the HW/SW benefits from sparsity induced in blocks of 16 elements.
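    To make that benefit concrete, here is a rough software analogue (an
    illustrative sketch only): a blocked dot product that skips a 16-element
    block, i.e. the equivalent of one SIMD multiply, only when that block is
    entirely zero.  Partially-zero blocks still pay the full cost:

         import numpy as np

         def blocked_dot(a, b, block_size=16):
             acc = 0.0
             for start in range(0, a.size, block_size):
                 blk_a = a[start:start + block_size]
                 if not blk_a.any():      # the whole block is zero: skip the "instruction"
                     continue
                 acc += float(np.dot(blk_a, b[start:start + block_size]))
             return acc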
    
    Things are a bit more involved because we also need to understand how the
    software maps layer operations to hardware.  For example, a 3x3
    convolution can be computed as a direct convolution, as a matrix-multiply
    operation, or as a Winograd convolution (to name a few ways of
    computation).  These low-level operations are then mapped to SIMD
    instructions.
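    For instance, the matrix-multiply mapping is commonly implemented as an
    im2col transform followed by a GEMM.  A minimal NumPy sketch for a single
    image, 3x3 kernels, stride 1 and no padding (the function name and shapes
    are illustrative, not tied to any particular library):

         import numpy as np

         def conv3x3_as_gemm(x, w):
             """x: (C, H, W) input, w: (K, C, 3, 3) filters -> (K, H-2, W-2) output."""
             C, H, W = x.shape
             K = w.shape[0]
             out_h, out_w = H - 2, W - 2
             # im2col: each output position becomes one column holding its C*3*3 receptive field
             cols = np.empty((C * 9, out_h * out_w), dtype=x.dtype)
             idx = 0
             for i in range(out_h):
                 for j in range(out_w):
                     cols[:, idx] = x[:, i:i + 3, j:j + 3].reshape(-1)
                     idx += 1
             # GEMM: (K, C*9) @ (C*9, out_h*out_w) -- this is the matrix multiply
             # that ultimately gets vectorized with SIMD instructions
             return (w.reshape(K, -1) @ cols).reshape(K, out_h, out_w)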
    
    Finally, the low-level SW needs to support a block-sparse storage format
    for weight tensors (see, for example:
    http://www.netlib.org/linalg/html_templates/node90.html).
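    One widely used example of such a format is Block Sparse Row (BSR), which
    stores only the non-zero blocks plus their block indices.  SciPy exposes it
    as scipy.sparse.bsr_matrix; the small sketch below (illustrative shapes only)
    shows that fully-zero 1x16 blocks consume no storage:

         import numpy as np
         from scipy import sparse

         dense = np.zeros((64, 128), dtype=np.float32)
         dense[0, :16] = 1.0          # one surviving 1x16 block
         dense[3, 32:48] = 2.0        # another surviving block

         bsr = sparse.bsr_matrix(dense, blocksize=(1, 16))
         print(bsr.data.shape)        # (2, 1, 16): only the two non-zero blocks are stored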