Google's Neural Machine Translation
In this example we apply post-training quantization to MLPerf's implementation of GNMT and compare different quantization configurations to see which achieves the highest accuracy.
Note that this folder contains only the code required to run evaluation. All training code was removed. A link to a pre-trained model is provided below.
For a summary of the quantization results, see below.
Running the Example
This example is implemented as a Jupyter notebook.
Install Requirements
pip install -r requirements.txt
(This will install sacrebleu)
Get the Dataset
Download the data using the following command:
bash download_dataset.sh
Verify data with:
bash verify_dataset.sh
Download the Pre-trained Model
wget https://zenodo.org/record/2581623/files/model_best.pth
Run the Example
jupyter notebook
Then open the quantize_gnmt.ipynb notebook.
Summary of Quantization Results
What is Quantized
The following operations / modules are fully quantized:
- Linear (fully-connected)
- Embedding
- Element-wise addition
- Element-wise multiplication
- MatMul / Batch MatMul
- Concat
The following operations do not have a quantized implementation. They run in FP32, with quantize + de-quantize applied at the op boundary (inputs and outputs):
- Softmax
- Tanh
- Sigmoid
- Division by norm in the attention block. That is, in pseudo code:
```
x = quant_dequant(x)
y = x / norm(x)
y = quant_dequant(y)
```
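For illustration, here is a minimal sketch of what such a fake-quantization boundary might look like in PyTorch (the `quant_dequant` helper and the wrapped op are hypothetical and not the code used in the notebook):

```python
import torch

def quant_dequant(x, num_bits=8):
    # Illustrative asymmetric fake-quantization: map the tensor's dynamic
    # range onto an unsigned integer grid, round, then map back to FP32.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = ((x.max() - x.min()) / (qmax - qmin)).clamp(min=1e-8)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

def softmax_with_quant_boundary(x, dim=-1):
    # Softmax itself runs in FP32; only its input and output pass
    # through quantize + de-quantize.
    x = quant_dequant(x)
    y = torch.softmax(x, dim=dim)
    return quant_dequant(y)
```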
Results
| Precision | Mode | Per-Channel | Clip Activations | BLEU Score |
|---|---|---|---|---|
| FP32 | N/A | N/A | N/A | 22.16 |
| INT8 | Symmetric | No | No | 18.05 |
| INT8 | Asymmetric | No | No | 18.52 |
| INT8 | Asymmetric | Yes | AVG in all layers | 9.63 |
| INT8 | Asymmetric | Yes | AVG in all layers except attention block | 16.94 |
| INT8 | Asymmetric | Yes | AVG in all layers except attention block and final classifier | 21.49 |
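To make the table's knobs concrete, below is a rough sketch of symmetric vs. asymmetric INT8 weight quantization with optional per-channel ranges (illustrative only; this is not the quantizer used in the notebook, and the function name is made up):

```python
import torch

def quantize_dequantize_weights(w, mode='asymmetric', per_channel=False, num_bits=8):
    # Compute quantization parameters either per-tensor or per output
    # channel (dim 0), then fake-quantize the weights.
    dims = tuple(range(1, w.dim())) if per_channel else None

    def _min(t):
        return t.amin(dim=dims, keepdim=True) if dims else t.min()

    def _max(t):
        return t.amax(dim=dims, keepdim=True) if dims else t.max()

    if mode == 'symmetric':
        # Symmetric: zero-point fixed at 0, range is +/- max(|w|).
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        scale = (_max(w.abs()) / qmax).clamp(min=1e-8)
        zero_point = torch.zeros_like(scale)
    else:
        # Asymmetric: full [min, max] range mapped onto [0, 2^num_bits - 1].
        qmin, qmax = 0, 2 ** num_bits - 1
        w_min, w_max = _min(w), _max(w)
        scale = ((w_max - w_min) / (qmax - qmin)).clamp(min=1e-8)
        zero_point = qmin - torch.round(w_min / scale)

    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

w = torch.randn(1024, 1024)
w_sym = quantize_dequantize_weights(w, mode='symmetric')
w_asym_pc = quantize_dequantize_weights(w, mode='asymmetric', per_channel=True)
```

The "Clip Activations: AVG" column refers to narrowing the activation range using an average of observed min/max values instead of the absolute min/max; as the results above show, this only improves accuracy when the attention block and the final classifier are left un-clipped.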
Dataset / Environment
Publication / Attribution
We use WMT16 English-German for training.
Data preprocessing
The script uses the subword-nmt package to segment text into subword units (BPE); by default it builds a shared vocabulary of 32,000 tokens. Preprocessing removes all pairs of sentences that can't be decoded by the latin-1 encoder.
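As an illustration of the subword segmentation step, the subword-nmt package exposes a small Python API for applying learned BPE codes (a minimal sketch; the codes file name below is hypothetical):

```python
from subword_nmt.apply_bpe import BPE

# Load BPE merge operations previously learned on the training corpus
# (e.g. with `subword-nmt learn-bpe -s 32000`); the path is hypothetical.
with open('bpe.32000', encoding='utf-8') as codes_file:
    bpe = BPE(codes_file)

# Rare words are split into subword units; splits are marked with '@@ '.
print(bpe.process_line("Machine translation segments rare words into subword units."))
```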
Model
Publication / Attribution
The implemented model is similar to the one from the paper Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
The most important difference is in the attention mechanism. This repository implements gnmt_v2 attention: the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with the inputs to all subsequent LSTM layers in the decoder at the current timestep.
The same attention mechanism is also implemented in the default GNMT-like models from tensorflow/nmt and NVIDIA/OpenSeq2Seq.
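A rough sketch of that data flow for a single decoder timestep is shown below (illustrative PyTorch with hypothetical module names; nn.MultiheadAttention stands in for the normalized Bahdanau attention, and residual connections are omitted):

```python
import torch
import torch.nn as nn

class Gnmtv2DecoderStep(nn.Module):
    # Illustrative gnmt_v2 decoder step: attention is queried with the output
    # of the first LSTM layer, and the resulting context vector is concatenated
    # with the input of every subsequent LSTM layer at the same timestep.
    def __init__(self, hidden_size=1024, num_layers=4):
        super().__init__()
        self.first_layer = nn.LSTMCell(hidden_size, hidden_size)
        self.attention = nn.MultiheadAttention(hidden_size, num_heads=1, batch_first=True)
        self.upper_layers = nn.ModuleList(
            nn.LSTMCell(2 * hidden_size, hidden_size) for _ in range(num_layers - 1)
        )

    def forward(self, x, states, encoder_out):
        # x: (batch, hidden) input at the current timestep
        # states: list of (h, c) tuples, one per LSTM layer
        # encoder_out: (batch, src_len, hidden) encoder outputs
        h, c = self.first_layer(x, states[0])
        states[0] = (h, c)

        # Attention over the encoder outputs, queried by the first layer's output.
        context, _ = self.attention(h.unsqueeze(1), encoder_out, encoder_out)
        context = context.squeeze(1)

        out = h
        for i, layer in enumerate(self.upper_layers, start=1):
            # The re-weighted context is concatenated with the layer input.
            h, c = layer(torch.cat([out, context], dim=-1), states[i])
            states[i] = (h, c)
            out = h
        return out, states
```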
Structure
- general:
- encoder and decoder are using shared embeddings
- data-parallel multi-gpu training
- dynamic loss scaling with backoff for Tensor Cores (mixed precision) training
- trained with label smoothing loss (smoothing factor 0.1)
- encoder:
- 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are unidirectional
- with residual connections starting from 3rd layer
- uses standard LSTM layer (accelerated by cudnn)
- decoder:
- 4-layer unidirectional LSTM with hidden size 1024 and fully-connected classifier
- with residual connections starting from 3rd layer
- uses standard LSTM layer (accelerated by cudnn)
- attention:
- normalized Bahdanau attention
- model uses gnmt_v2 attention mechanism: output from the first LSTM layer of the decoder goes into attention, then the re-weighted context is concatenated with the input to all subsequent LSTM layers in the decoder at the current timestep
- inference:
- beam search with default beam size 5
- with coverage penalty and length normalization
- BLEU computed by sacrebleu
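For reference, BLEU with sacrebleu can be computed along these lines (a minimal sketch with made-up sentences; the notebook evaluates the full test set):

```python
import sacrebleu

# Made-up detokenized outputs and references, just to show the API.
hypotheses = ["The cat sat on the mat.", "He went to the market yesterday."]
references = ["The cat sat on the mat.", "Yesterday he went to the market."]

# corpus_bleu takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```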