# Quantization Algorithms

The following quantization methods are currently implemented in Distiller:

## DoReFa

(As proposed in [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160))  
  
In this method, we first define the quantization function \(quantize_k\), which takes a real value \(a_f \in [0, 1]\) and outputs a discrete-valued \(a_q \in \left\{ \frac{0}{2^k-1}, \frac{1}{2^k-1}, ... , \frac{2^k-1}{2^k-1} \right\}\), where \(k\) is the number of bits used for quantization.

\[a_q = quantize_k(a_f) = \frac{1}{2^k-1} round \left( \left(2^k - 1 \right) a_f \right)\]
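
The function maps onto an evenly spaced grid of \(2^k\) values in \([0, 1]\). As a minimal illustration (not Distiller's internal code), \(quantize_k\) can be written in PyTorch as:

```python
import torch

def quantize_k(a_f: torch.Tensor, k: int) -> torch.Tensor:
    """Quantize a tensor with values in [0, 1] onto the k-bit grid {0/n, 1/n, ..., n/n}."""
    n = 2 ** k - 1                   # n = 2^k - 1 quantization steps
    return torch.round(a_f * n) / n
```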

Activations are clipped to the \([0, 1]\) range and then quantized as follows:

\[x_q = quantize_k(x_f)\]

For weights, we define the following function \(f\), which takes an unbounded real-valued input and outputs a real value in \([0, 1]\):

\[f(w) = \frac{tanh(w)}{2 max(|tanh(w)|)} + \frac{1}{2} \]

Now we can use \(quantize_k\) to get quantized weight values, as follows:

\[w_q = 2 quantize_k \left( f(w_f) \right) - 1\]
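
Putting the two steps together, a minimal sketch of the DoReFa weight quantizer (forward pass only, not Distiller's implementation), reusing `quantize_k` from above:

```python
def dorefa_quantize_weights(w_f: torch.Tensor, k: int) -> torch.Tensor:
    """Quantize full-precision weights to k bits, DoReFa-style (forward pass only)."""
    w_tanh = torch.tanh(w_f)
    w01 = w_tanh / (2 * w_tanh.abs().max()) + 0.5  # f(w): map the weights into [0, 1]
    return 2 * quantize_k(w01, k) - 1              # map the k-bit grid back to [-1, 1]
```

During training, the rounding operation is paired with a straight-through estimator so that gradients can flow through it.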

This method requires training the model with quantization, as discussed [here](quantization.md#training-with-quantization). Use the `DorefaQuantizer` class to transform an existing model to a model suitable for training with quantization using DoReFa.
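
For reference, applying the quantizer typically looks like the sketch below. The keyword names and exact constructor arguments are assumptions and may differ between Distiller versions, so consult the `DorefaQuantizer` documentation for the precise signature.

```python
import distiller
import torchvision.models as models

model = models.resnet18()

# Bit-width keyword names are assumed; check the DorefaQuantizer docstring.
quantizer = distiller.quantization.DorefaQuantizer(model,
                                                   bits_activations=8,
                                                   bits_weights=4)
quantizer.prepare_model()  # replaces / wraps the relevant layers in-place

# ... continue with a regular PyTorch training loop on the modified model ...
```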

### Notes:

- Gradient quantization as proposed in the paper is not supported yet.
- The paper defines special handling for binary weights, which isn't supported in Distiller yet.

## WRPN

(As proposed in [WRPN: Wide Reduced-Precision Networks](https://arxiv.org/abs/1709.01134))  

In this method, activations are clipped to \([0, 1]\) and quantized as follows (\(k\) is the number of bits used for quantization):

\[x_q = \frac{1}{2^k-1} round \left( \left(2^k - 1 \right) x_f \right)\]

Weights are clipped to \([-1, 1]\) and quantized as follows:

\[w_q = \frac{1}{2^{k-1}-1} round \left( \left(2^{k-1} - 1 \right)w_f \right)\]

Note that \(k-1\) bits are used to quantize weights, leaving one bit for sign.
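
As a minimal sketch (illustrative only, not Distiller's implementation), the two operations can be written as:

```python
import torch

def wrpn_quantize_activations(x_f: torch.Tensor, k: int) -> torch.Tensor:
    """Clip activations to [0, 1] and quantize them to k bits."""
    n = 2 ** k - 1
    return torch.round(x_f.clamp(0, 1) * n) / n

def wrpn_quantize_weights(w_f: torch.Tensor, k: int) -> torch.Tensor:
    """Clip weights to [-1, 1] and quantize them to k-1 bits (one bit is left for the sign)."""
    n = 2 ** (k - 1) - 1
    return torch.round(w_f.clamp(-1, 1) * n) / n
```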

This method requires training the model with quantization, as discussed [here](quantization.md#training-with-quantization). Use the `WRPNQuantizer` class to transform an existing model to a model suitable for training with quantization using WRPN.

### Notes:

- The paper proposes widening layers as a means to reduce accuracy loss. This isn't implemented as part of `WRPNQuantizer` at the moment. To experiment with this, modify your model implementation to use wider layers.
- The paper defines special handling for binary weights, which isn't supported in Distiller yet.

## Symmetric Linear Quantization

In this method, a float value is quantized by multiplying it by a numeric constant (the **scale factor**), hence it is **Linear**. We use a signed integer to represent the quantized range, with no quantization bias (or "offset") used. As a result, the floating-point range considered for quantization is **symmetric** with respect to zero.  
In the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).  
Let us denote the original floating-point tensor by \(x_f\), the quantized tensor by \(x_q\), the scale factor by \(q_x\) and the number of bits used for quantization by \(n\). Then, we get:
\[q_x = \frac{2^{n-1}-1}{\max|x_f|}\]
\[x_q = round(q_x x_f)\]
(The \(round\) operation is round-to-nearest-integer)  
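
In code, this amounts to the following sketch (illustrative only, not Distiller's implementation):

```python
import torch

def symmetric_linear_quantize(x_f: torch.Tensor, n: int):
    """Return the scale factor q_x and the integer-valued quantized tensor x_q."""
    q_x = (2 ** (n - 1) - 1) / x_f.abs().max()  # scale chosen to cover the full range
    x_q = torch.round(q_x * x_f)                # integers in [-(2^(n-1)-1), 2^(n-1)-1]
    return q_x, x_q
```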
  
Let's see how a **convolution** or **fully-connected (FC)** layer is quantized using this method (we denote the input, output, weights and bias by \(x, y, w\) and \(b\) respectively):
\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \left( \sum { x_q w_q } + \frac{q_x q_w}{q_b}b_q \right)\]
\[y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w} \left( \sum { x_q w_q } + \frac{q_x q_w}{q_b}b_q \right) \right) \]
Note how the bias has to be re-scaled to match the scale of the summation.
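
The sketch below simulates this flow for an FC layer using `symmetric_linear_quantize` from above: the weights and bias get their own scale factors, the bias is re-scaled to the accumulator scale \(q_x q_w\), and the result is de-quantized and then re-quantized with the output scale \(q_y\). This is a floating-point simulation, not Distiller's wrapper code.

```python
def quantized_fc(x_f, w_f, b_f, n_acts=8, n_params=8):
    """Simulate a quantized FC layer computing y_f = x_f @ w_f.t() + b_f."""
    q_x, x_q = symmetric_linear_quantize(x_f, n_acts)    # activations: "online"
    q_w, w_q = symmetric_linear_quantize(w_f, n_params)  # weights: "offline"
    q_b, b_q = symmetric_linear_quantize(b_f, n_params)  # bias: "offline"

    # Accumulate the integer-valued products; re-scale the bias to q_x * q_w
    acc = x_q @ w_q.t() + (q_x * q_w / q_b) * b_q

    # De-quantize, then re-quantize the output with its own scale factor
    y_f = acc / (q_x * q_w)
    q_y, y_q = symmetric_linear_quantize(y_f, n_acts)
    return y_f, q_y, y_q
```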

### Implementation

We've implemented **convolution** and **FC** using this method.  

- They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is, the computation is done on floating-point tensors, but the values themselves are restricted to integer values. The wrapper is implemented in the `RangeLinearQuantParamLayerWrapper` class.  
- All other layers are unaffected and are executed using their original FP32 implementation.  
- To automatically transform an existing model to a quantized model using this method, use the `SymmetricLinearQuantizer` class (see the usage sketch after this list).
- For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online").  
- **Important note:** Currently, this method is implemented as **inference only**, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with \(n < 8\) is likely to lead to severe accuracy degradation for any non-trivial workload.
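
For reference, the usage sketch mentioned above might look like the following. The constructor arguments are assumptions and may differ between Distiller versions, so check the `SymmetricLinearQuantizer` documentation.

```python
import distiller

# Argument order (8-bit activations, 8-bit parameters) is assumed here;
# check the SymmetricLinearQuantizer docstring for the exact signature.
quantizer = distiller.quantization.SymmetricLinearQuantizer(model, 8, 8)
quantizer.prepare_model()

# The transformed model can now be used for evaluation only (no re-training).
```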