Google's Neural Machine Translation
In this example we apply post-training quantization to MLPerf's implementation of GNMT and compare different quantization configurations to see which achieves the highest accuracy.
Note that this folder contains only the code required to run evaluation. All training code was removed. A link to a pre-trained model is provided below.
For a summary of the quantization results, see below.
Running the Example
This example is implemented as a Jupyter notebook.
Install Requirements
pip install -r requirements.txt
(This will install sacrebleu)
Get the Dataset
Download the data using the following command:
bash download_dataset.sh
Verify data with:
bash verify_dataset.sh
Download the Pre-trained Model
wget https://zenodo.org/record/2581623/files/model_best.pth
Run the Example
jupyter notebook
Then open the quantize_gnmt.ipynb notebook.
Summary of Quantization Results
What is Quantized
The following operations / modules are fully quantized:
- Linear (fully-connected)
- Embedding
- Element-wise addition
- Element-wise multiplication
- MatMul / Batch MatMul
- Concat
The following operations do not have a quantized implementation. They run in FP32, with quantize + de-quantize applied at the op boundary (inputs and outputs):
- Softmax
- Tanh
- Sigmoid
- Division by norm in the attention block. That is, in pseudo code:
```
x = quant_dequant(x)
y = x / norm(x)
y = quant_dequant(y)
```
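For illustration, here is a minimal sketch of what such a fake-quantization boundary might look like in PyTorch (the `quant_dequant` helper and the wrapped op are hypothetical and not the code used in the notebook):

```python
import torch

def quant_dequant(x, num_bits=8):
    # Illustrative asymmetric fake-quantization: map the tensor's dynamic
    # range onto an unsigned integer grid, round, then map back to FP32.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = ((x.max() - x.min()) / (qmax - qmin)).clamp(min=1e-8)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

def softmax_with_quant_boundary(x, dim=-1):
    # Softmax itself runs in FP32; only its input and output pass
    # through quantize + de-quantize.
    x = quant_dequant(x)
    y = torch.softmax(x, dim=dim)
    return quant_dequant(y)
```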
Results
| Precision | Mode | Per-Channel | Clip Activations | BLEU Score |
|---|---|---|---|---|
| FP32 | N/A | N/A | N/A | 22.16 |
| INT8 | Symmetric | No | No | 18.05 |
| INT8 | Asymmetric | No | No | 18.52 |
| INT8 | Asymmetric | Yes | AVG in all layers | 9.63 |
| INT8 | Asymmetric | Yes | AVG in all layers except attention block | 16.94 |
| INT8 | Asymmetric | Yes | AVG in all layers except attention block and final classifier | 21.49 |
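To make the table's knobs concrete, below is a rough sketch of symmetric vs. asymmetric INT8 weight quantization with optional per-channel ranges (illustrative only; this is not the quantizer used in the notebook, and the function name is made up):

```python
import torch

def quantize_dequantize_weights(w, mode='asymmetric', per_channel=False, num_bits=8):
    # Compute quantization parameters either per-tensor or per output
    # channel (dim 0), then fake-quantize the weights.
    dims = tuple(range(1, w.dim())) if per_channel else None

    def _min(t):
        return t.amin(dim=dims, keepdim=True) if dims else t.min()

    def _max(t):
        return t.amax(dim=dims, keepdim=True) if dims else t.max()

    if mode == 'symmetric':
        # Symmetric: zero-point fixed at 0, range is +/- max(|w|).
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        scale = (_max(w.abs()) / qmax).clamp(min=1e-8)
        zero_point = torch.zeros_like(scale)
    else:
        # Asymmetric: full [min, max] range mapped onto [0, 2^num_bits - 1].
        qmin, qmax = 0, 2 ** num_bits - 1
        w_min, w_max = _min(w), _max(w)
        scale = ((w_max - w_min) / (qmax - qmin)).clamp(min=1e-8)
        zero_point = qmin - torch.round(w_min / scale)

    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

w = torch.randn(1024, 1024)
w_sym = quantize_dequantize_weights(w, mode='symmetric')
w_asym_pc = quantize_dequantize_weights(w, mode='asymmetric', per_channel=True)
```

The "Clip Activations: AVG" column refers to narrowing the activation range using an average of observed min/max values instead of the absolute min/max; as the results above show, this only improves accuracy when the attention block and the final classifier are left un-clipped.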
Dataset / Environment
Publication / Attribution
We use WMT16 English-German for training.
Data preprocessing
The script uses the subword-nmt package to segment text into subword units (BPE); by default it builds a shared vocabulary of 32,000 tokens. Preprocessing removes all pairs of sentences that can't be decoded by the latin-1 encoder.
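As an illustration of the subword segmentation step, the subword-nmt package exposes a small Python API for applying learned BPE codes (a minimal sketch; the codes file name below is hypothetical):

```python
from subword_nmt.apply_bpe import BPE

# Load BPE merge operations previously learned on the training corpus
# (e.g. with `subword-nmt learn-bpe -s 32000`); the path is hypothetical.
with open('bpe.32000', encoding='utf-8') as codes_file:
    bpe = BPE(codes_file)

# Rare words are split into subword units; splits are marked with '@@ '.
print(bpe.process_line("Machine translation segments rare words into subword units."))
```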
Model
Publication / Attribution
The implemented model is similar to the one from the paper Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
The most important difference is in the attention mechanism. This repository implements gnmt_v2 attention: the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with the inputs to all subsequent LSTM layers in the decoder at the current timestep.
The same attention mechanism is also implemented in the default GNMT-like models from tensorflow/nmt and NVIDIA/OpenSeq2Seq.
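A rough sketch of that data flow for a single decoder timestep is shown below (illustrative PyTorch with hypothetical module names; nn.MultiheadAttention stands in for the normalized Bahdanau attention, and residual connections are omitted):

```python
import torch
import torch.nn as nn

class Gnmtv2DecoderStep(nn.Module):
    # Illustrative gnmt_v2 decoder step: attention is queried with the output
    # of the first LSTM layer, and the resulting context vector is concatenated
    # with the input of every subsequent LSTM layer at the same timestep.
    def __init__(self, hidden_size=1024, num_layers=4):
        super().__init__()
        self.first_layer = nn.LSTMCell(hidden_size, hidden_size)
        self.attention = nn.MultiheadAttention(hidden_size, num_heads=1, batch_first=True)
        self.upper_layers = nn.ModuleList(
            nn.LSTMCell(2 * hidden_size, hidden_size) for _ in range(num_layers - 1)
        )

    def forward(self, x, states, encoder_out):
        # x: (batch, hidden) input at the current timestep
        # states: list of (h, c) tuples, one per LSTM layer
        # encoder_out: (batch, src_len, hidden) encoder outputs
        h, c = self.first_layer(x, states[0])
        states[0] = (h, c)

        # Attention over the encoder outputs, queried by the first layer's output.
        context, _ = self.attention(h.unsqueeze(1), encoder_out, encoder_out)
        context = context.squeeze(1)

        out = h
        for i, layer in enumerate(self.upper_layers, start=1):
            # The re-weighted context is concatenated with the layer input.
            h, c = layer(torch.cat([out, context], dim=-1), states[i])
            states[i] = (h, c)
            out = h
        return out, states
```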
Structure
- general:
- encoder and decoder are using shared embeddings
- data-parallel multi-gpu training
- dynamic loss scaling with backoff for Tensor Cores (mixed precision) training
- trained with label smoothing loss (smoothing factor 0.1)
- encoder:
- 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are unidirectional
- with residual connections starting from 3rd layer
- uses standard LSTM layer (accelerated by cudnn)
- decoder:
- 4-layer unidirectional LSTM with hidden size 1024 and fully-connected classifier
- with residual connections starting from 3rd layer
- uses standard LSTM layer (accelerated by cudnn)
- attention:
- normalized Bahdanau attention
- model uses gnmt_v2 attention mechanism: output from the first LSTM layer of the decoder goes into attention, then the re-weighted context is concatenated with the input to all subsequent LSTM layers in the decoder at the current timestep
- inference:
- beam search with default beam size 5
- with coverage penalty and length normalization
- BLEU computed by sacrebleu
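For reference, BLEU with sacrebleu can be computed along these lines (a minimal sketch with made-up sentences; the notebook evaluates the full test set):

```python
import sacrebleu

# Made-up detokenized outputs and references, just to show the API.
hypotheses = ["The cat sat on the mat.", "He went to the market yesterday."]
references = ["The cat sat on the mat.", "Yesterday he went to the market."]

# corpus_bleu takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```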