    Non-channel/filter block pruning (#119) · b9d53ff8
    Bar authored
    Block pruning: support specifying the block shape from the YAML file
    
    Block pruning refers to pruning 4-D structures of a specific shape.  This
    is why it is sometimes called structure-pruning or group-pruning
    (confusing, I know).
    A specific example of block pruning is filter or channel pruning, each of
    which has a highly regular block shape.
    This commit adds support for pruning blocks/groups/structures with
    irregular shapes that accelerate inference on a specific
    hardware platform.  You can read more about the regularity of shapes in
    [Exploring the Regularity of Sparse Structure in
    Convolutional Neural Networks](https://arxiv.org/pdf/1705.08922.pdf).
    
    When we want to introduce sparsity in order to reduce the compute load
    of a certain layer, we need to understand how the HW and SW perform
    the layer's operation, and how this operation is vectorized.  Then we can
    induce sparsity to match the vector shape.
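    As a concrete illustration (a minimal NumPy sketch, not this repository's
    implementation), block pruning along the input dimension with a block size
    that matches the vector width might look like the following; the function
    name and the 50% sparsity target are only for illustration:

         import numpy as np

         def prune_blocks(weights, block_size=16, sparsity=0.5):
             """Zero the fraction `sparsity` of 1 x block_size blocks with the smallest L1 norm."""
             out_ch, in_ch = weights.shape
             assert in_ch % block_size == 0, "input dim must divide evenly into blocks"
             blocks = weights.reshape(out_ch, in_ch // block_size, block_size)
             scores = np.abs(blocks).sum(axis=-1)            # one saliency score per block
             threshold = np.quantile(scores, sparsity)       # blocks scoring below this are dropped
             mask = (scores >= threshold)[..., np.newaxis]   # broadcast the keep/drop decision
             return (blocks * mask).reshape(out_ch, in_ch)

         w = np.random.randn(64, 128).astype(np.float32)
         w_pruned = prune_blocks(w, block_size=16, sparsity=0.5)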
    
    For example, Intel AVX-512 is a set of SIMD instructions that apply the
    same instruction (Single Instruction) to a vector of inputs (Multiple
    Data).  The following single instruction performs an element-wise
    multiplication of two vectors of sixteen 32-bit elements each:

         __m512i result = _mm512_mullo_epi32(vec_a, vec_b);
    
    If either vec_a or vec_b is only partially sparse, we still need to perform
    the multiplication, and the sparsity does not help reduce the
    cost (power, latency) of the computation.  However, if either vec_a or vec_b
    contains only zeros, then we can eliminate the instruction entirely.  In this
    case, we say that we would like to have group sparsity of 16 elements,
    i.e. the HW/SW benefits from sparsity induced in blocks of 16 elements.
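    To make that benefit concrete, here is a rough software analogue (an
    illustrative sketch only): a blocked dot product that skips a 16-element
    block, i.e. the equivalent of one SIMD multiply, only when that block is
    entirely zero.  Partially-zero blocks still pay the full cost:

         import numpy as np

         def blocked_dot(a, b, block_size=16):
             acc = 0.0
             for start in range(0, a.size, block_size):
                 blk_a = a[start:start + block_size]
                 if not blk_a.any():      # the whole block is zero: skip the "instruction"
                     continue
                 acc += float(np.dot(blk_a, b[start:start + block_size]))
             return acc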
    
    Things are a bit more involved because we also need to understand how the
    software maps layer operations to hardware.  For example, a 3x3
    convolution can be computed as a direct convolution, as a matrix-multiply
    operation, or as a Winograd convolution (to name a few ways of
    computation).  These low-level operations are then mapped to SIMD
    instructions.
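    For instance, the matrix-multiply mapping is commonly implemented as an
    im2col transform followed by a GEMM.  A minimal NumPy sketch for a single
    image, 3x3 kernels, stride 1 and no padding (the function name and shapes
    are illustrative, not tied to any particular library):

         import numpy as np

         def conv3x3_as_gemm(x, w):
             """x: (C, H, W) input, w: (K, C, 3, 3) filters -> (K, H-2, W-2) output."""
             C, H, W = x.shape
             K = w.shape[0]
             out_h, out_w = H - 2, W - 2
             # im2col: each output position becomes one column holding its C*3*3 receptive field
             cols = np.empty((C * 9, out_h * out_w), dtype=x.dtype)
             idx = 0
             for i in range(out_h):
                 for j in range(out_w):
                     cols[:, idx] = x[:, i:i + 3, j:j + 3].reshape(-1)
                     idx += 1
             # GEMM: (K, C*9) @ (C*9, out_h*out_w) -- this is the matrix multiply
             # that ultimately gets vectorized with SIMD instructions
             return (w.reshape(K, -1) @ cols).reshape(K, out_h, out_w)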
    
    Finally, the low-level SW needs to support a block-sparse storage format
    for weight tensors (see, for example:
    http://www.netlib.org/linalg/html_templates/node90.html).
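    One widely used example of such a format is Block Sparse Row (BSR), which
    stores only the non-zero blocks plus their block indices.  SciPy exposes it
    as scipy.sparse.bsr_matrix; the small sketch below (illustrative shapes only)
    shows that fully-zero 1x16 blocks consume no storage:

         import numpy as np
         from scipy import sparse

         dense = np.zeros((64, 128), dtype=np.float32)
         dense[0, :16] = 1.0          # one surviving 1x16 block
         dense[3, 32:48] = 2.0        # another surviving block

         bsr = sparse.bsr_matrix(dense, blocksize=(1, 16))
         print(bsr.data.shape)        # (2, 1, 16): only the two non-zero blocks are stored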