Approximate Algorithm Implementations
=========================================
This release includes two approximations, namely perforation and sampling, both of which apply to convolution operations. The knobs for these approximations are described in :doc:`configuration-format`. We include implementations of both approximations on GPU and CPU, described below.
Perforated Convolutions
-----------------------
Overview
^^^^^^^^^
The core idea of perforated convolutions is to compute a subset of the output tensor elements and interpolate the missing elements. Specifically, we include an implementation that skips entire output rows or columns (configurable through knobs) and interpolates the missing rows/columns through neighbor averaging.
Since the approximation reduces the number of output elements that need to be computed, it reduces the multiply-accumulate (MAC) operations and the memory bandwidth usage (only a subset of the data is loaded), resulting in both speedups and energy reductions. Our implementation performs the perforation at fixed strides (e.g., skipping 1 out of every 3 rows), and the rate of perforation is a configurable knob. The type of perforation (row/column) is also configurable. Another knob is the starting offset: the tensor row/column index at which the perforation (i.e., skipping) starts.
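As a concrete illustration, the minimal sketch below (not the runtime's actual code; the knob names ``skip_every`` and ``offset`` are illustrative) shows how a perforation stride and starting offset select which output rows are skipped:

.. code-block:: cpp

   #include <cstdio>
   #include <vector>

   // Mark every skip_every-th row as perforated, starting at `offset`.
   // Rows before the offset are always computed.
   std::vector<bool> perforated_rows(int num_rows, int skip_every, int offset) {
     std::vector<bool> skipped(num_rows, false);
     for (int i = offset; i < num_rows; i += skip_every)
       skipped[i] = true; // not computed; interpolated later
     return skipped;
   }

   int main() {
     // skip_every = 3, offset = 1: rows 1, 4, 7 are skipped out of 10.
     auto mask = perforated_rows(10, 3, 1);
     for (int i = 0; i < 10; ++i)
       std::printf("row %d: %s\n", i, mask[i] ? "skipped" : "computed");
     return 0;
   }

With ``skip_every = 3``, one out of every three rows is skipped, i.e. a perforation rate of roughly 33%.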
Description
^^^^^^^^^^^
Our implementation for perforated convolution is a three-step process:
* **Patch matrix creation:** Based on the indices of the rows/columns to be perforated, the input tensor elements needed for the remaining (computed) outputs are gathered into a new matrix called an input-patch matrix. The input-patch matrix is laid out in memory such that the convolution reduces to a simple matrix multiplication operation. This approach is similar to the one described in this `paper <https://dl.acm.org/doi/abs/10.1145/2964284.2967243>`_.
* **Dense matrix multiplication:** This step performs a matrix multiplication in a manner very similar to the one described in this `paper <https://arxiv.org/pdf/1704.04428.pdf>`_. Note that these matrices are dense (there is no sparsity in the tensor representation).
* **Interpolation of missing values:** A new tensor is created with the dimensions the final output is expected to have. The perforated output (after the matrix multiplication) is copied into the corresponding rows/columns, and the skipped rows/columns are filled in by interpolation, taking the arithmetic mean of the neighboring elements. For column perforation, these neighbors are the computed elements to the left and right of the skipped element; for row perforation, the computed elements above and below are used. The output of this step is the approximate (perforated) convolution result (see the sketch after this list).
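The following is a hedged sketch of the interpolation step for row perforation on a single 2D output plane. It is a simplification of the actual runtime code, the helper names are illustrative, and it assumes ``skip_every >= 2`` so every skipped row has computed neighbors:

.. code-block:: cpp

   #include <vector>

   using Plane = std::vector<std::vector<float>>;

   // `computed` holds the dense matmul output with the skipped rows absent
   // (one entry per `false` in `skipped`); `skipped` marks which of the
   // `rows` output rows were perforated.
   Plane interpolate_rows(const Plane &computed,
                          const std::vector<bool> &skipped, int rows, int cols) {
     Plane out(rows, std::vector<float>(cols, 0.0f));
     // Scatter the computed rows back to their original positions.
     int src = 0;
     for (int i = 0; i < rows; ++i)
       if (!skipped[i])
         out[i] = computed[src++];
     // Fill each skipped row with the arithmetic mean of the rows above
     // and below; at the borders, copy the single available neighbor.
     for (int i = 0; i < rows; ++i) {
       if (!skipped[i])
         continue;
       for (int j = 0; j < cols; ++j) {
         if (i == 0)
           out[i][j] = out[i + 1][j];
         else if (i == rows - 1)
           out[i][j] = out[i - 1][j];
         else
           out[i][j] = 0.5f * (out[i + 1][j] + out[i - 1][j]);
       }
     }
     return out;
   }

Column perforation is symmetric: each skipped column is averaged from its left and right computed neighbors.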
Filter Sampling
---------------
Overview
^^^^^^^^^
The core idea of filter sampling is to use a subset of convolution filter elements and inputs to compute the (full) tensor output. Filter sampling is a variant of "input sampling", while perforation is a variant of "output sampling". Similar to perforation, filter sampling also reduces MAC operations and memory bandwidth usage. The filter elements (and corresponding input tensor elements) are skipped at a regular stride. The start offset is a tunable knob (exposed to our autotuner) which controls the filter tensor index at which skipping starts.
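The sketch below illustrates how a stride and start-offset knob pair selects and rescales the kept filter elements (the rate / (rate - 1) rescaling is described under Description below). The knob names and the flat filter layout are illustrative assumptions, not the runtime's API:

.. code-block:: cpp

   #include <vector>

   // Drop every skip_every-th element starting at `offset`, and scale the
   // remaining elements by skip_every / (skip_every - 1) to compensate for
   // the dropped weight mass. Requires skip_every >= 2.
   std::vector<float> sample_filter(const std::vector<float> &filter,
                                    int skip_every, int offset) {
     const float scale = static_cast<float>(skip_every) /
                         static_cast<float>(skip_every - 1);
     std::vector<float> sampled;
     sampled.reserve(filter.size());
     for (int i = 0; i < static_cast<int>(filter.size()); ++i) {
       const bool skip = (i >= offset) && ((i - offset) % skip_every == 0);
       if (!skip)
         sampled.push_back(filter[i] * scale);
     }
     return sampled;
   }

For instance, with ``skip_every = 2`` every other element is dropped and the survivors are scaled by a factor of 2.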
Description
^^^^^^^^^^^
Our filter sampling implementation involves the following steps:
* **Creation of sampled filter:** This step creates a new sampled filter (with fewer elements) whose size is determined by the sampling rate (and offset). Since filter elements are skipped, the remaining elements are scaled up by a factor of rate / (rate - 1) and copied to the newly allocated sampled filter; for example, at a sampling rate of 2, every other element is skipped and the surviving elements are doubled. Scaling up the filter elements helps make up for the accuracy lost to sampling (a finding from our empirical study).
* **Patch matrix creation:** Based on filter element indices used in the construction of the sampled filter, the corresponding input tensor elements are used to create a new matrix, called an input-patch matrix. The input-patch matrix is a matrix laid out in memory in such a way that convolution is transformed to a simple matrix multiplication operation.
* **Dense matrix multiplication:** This step performs a matrix multiplication on the (sampled) filter and input-patch matrices. The result of this matrix multiplication is the approximate convolution output (see the sketch after this list).
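The sketch below illustrates the last two steps for a single-channel, stride-1, "valid" convolution. For brevity, the patch-matrix gather and the matrix multiplication are fused into one dot-product loop, and all names are illustrative rather than taken from the runtime:

.. code-block:: cpp

   #include <vector>

   // `kept_taps` lists the flat indices (ky * K + kx) of the K-by-K filter
   // positions that survived sampling; `sampled_filter` holds their (scaled)
   // weights in the same order. Each output pixel is the dot product of the
   // sampled filter with the matching column of the input-patch matrix,
   // gathered on the fly from the H-by-W input plane `in`.
   std::vector<float> sampled_conv2d(const std::vector<float> &in, int H, int W,
                                     const std::vector<float> &sampled_filter,
                                     const std::vector<int> &kept_taps, int K) {
     const int outH = H - K + 1, outW = W - K + 1;
     std::vector<float> out(outH * outW, 0.0f);
     for (int oy = 0; oy < outH; ++oy) {
       for (int ox = 0; ox < outW; ++ox) {
         float acc = 0.0f;
         for (std::size_t t = 0; t < kept_taps.size(); ++t) {
           const int ky = kept_taps[t] / K, kx = kept_taps[t] % K;
           acc += sampled_filter[t] * in[(oy + ky) * W + (ox + kx)];
         }
         out[oy * outW + ox] = acc;
       }
     }
     return out;
   }

Only the sampled filter taps touch the input, so both the MAC count and the number of loaded input elements shrink in proportion to the sampling rate.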
Sources
^^^^^^^^
The GPU implementations of perforation and sampling are in `hpvm/projects/hpvm-tensor-rt/tensor_runtime/src/approx_techniques.cu`.
Relevant routines:
* `tensorConvApprox` (FP32)
* `tensorConvApproxHalf2` (FP16)
The CPU implementations are in `projects/hpvm-tensor-rt/tensor_runtime/src/tensor_cpu_runtime.cc`; the relevant routine is `tensorConvApproxCPU`. Note that this single routine supports the baseline (no approximation), perforation, and sampling knobs. All supported knobs are detailed in :doc:`configuration-format`.