
DNN Forward Propagation Using GEMM

Directory Structure:

  • data -- Contains all the test files.
  • gemm_cmake -- Project configured for cmake compilation.
  • gemm_vs2015 -- Project configured for Visual Studio 2015. HDF5 v1.10.0 (included in this repo) is expected to be in the default install location (C:\Program Files (x86)\HDF_Group\HDF5\1.10.0).

How To Compile

Visual Studio 2015: Open the project by double-clicking the file "gemm.sln" under gemm_vs2015. CUDA and HDF5 must be installed. If the "Solution Platform" on the top toolbar displays "x64", change it to "x86". To compile, select "Build -> Build Solution" from the menu bar.

CMake: Install HDF5 yourself using the package manager on your Linux/OS X system, then run cmake to compile. See https://github.com/webgpu/ece408project for detailed instructions.

Notes on Optimizations

One of the major optimizations I did was to lower the occupancy. The main idea is to reuse as much data as possible. This matters most for the matrix multiplication: the convolution filters are the same for every input, so loading them into shared memory again for every input is far from ideal. The first step, then, is to reduce the number of times the filters are loaded.

Take the matrix multiplication after the second unroll as an example: input matrix A is 64 × 800, input matrix B is 800 × 64, and output matrix C is 64 × 64. If you follow the common approach, you will probably end up with a 2 × 2 grid of 32 × 32 blocks, so that each thread writes one output element of C. But this means there is one matrix B for every input, and for every matrix B the kernel has to load the whole matrix A again. So, coming back to reducing the number of loads of the convolution filters (i.e. matrix A): instead of loading only one input matrix B, I now load 2, which cuts the number of loads of matrix A in half. Also, instead of a 2 × 2 grid of 32 × 32 blocks, I now use a 1D grid of 16 × 16 blocks, and the shared memory tile sizes become 64 × 32 for A and 32 × 64 for each B. Every thread therefore loads 8 numbers from A and 16 numbers from B in each of the 800 / 32 = 25 tile phases, and writes 32 outputs to C.

The arithmetic shows why this works. Before the optimization, each thread loads one number from A and one from B in each of the 25 phases, i.e. 2 × 25 = 50 numbers per output element of C, so the input-to-output ratio is 50 to 1, and it takes 64 × 64 threads to compute one output matrix C. After the optimization, each thread loads 24 × 25 = 600 numbers from A and B and writes 32 outputs to C, so the ratio drops to 18.75, and only 16 × 16 threads are needed for every two output matrices C. A sketch of this tiling scheme is given below.
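The kernel below is a minimal sketch of that tiling scheme, with hard-coded dimensions and made-up names (gemm_two_inputs, etc.); it is not the project's actual kernel, only an illustration of the layout described above: one 16 × 16 block consumes two B matrices, keeps its 32 partial results in registers, and loads each A tile only once for both.

```
#define M 64      // rows of A and C
#define K 800     // cols of A, rows of B
#define N 64      // cols of B and C
#define TILE_K 32 // shared-memory tile depth (800 / 32 = 25 phases)

// One 16 x 16 block multiplies A (64 x 800) with TWO unrolled inputs
// B[b0], B[b1] (each 800 x 64) and produces C[b0], C[b1] (each 64 x 64).
__global__ void gemm_two_inputs(const float* __restrict__ A,  // 64 x 800
                                const float* __restrict__ B,  // num_inputs x 800 x 64
                                float* __restrict__ C)        // num_inputs x 64 x 64
{
    const int b0 = 2 * blockIdx.x;        // first input handled by this block
    const int b1 = b0 + 1;                // second input handled by this block

    __shared__ float As [M][TILE_K];      // 64 x 32 tile of A
    __shared__ float Bs0[TILE_K][N];      // 32 x 64 tile of B[b0]
    __shared__ float Bs1[TILE_K][N];      // 32 x 64 tile of B[b1]

    const int tx = threadIdx.x, ty = threadIdx.y;   // 16 x 16 threads
    const int tid = ty * 16 + tx;                   // 0..255

    // Each thread owns a 4 x 4 sub-tile of each of the two C matrices:
    // 16 + 16 = 32 outputs, all kept in registers.
    float acc0[4][4] = {0.0f}, acc1[4][4] = {0.0f};

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // 256 threads load 64*32 = 2048 elements of A (8 per thread)
        // and 2 * 32*64 = 4096 elements of B (16 per thread).
        for (int i = tid; i < M * TILE_K; i += 256) {
            int r = i / TILE_K, c = i % TILE_K;
            As[r][c] = A[r * K + k0 + c];
        }
        for (int i = tid; i < TILE_K * N; i += 256) {
            int r = i / N, c = i % N;
            Bs0[r][c] = B[(size_t)b0 * K * N + (k0 + r) * N + c];
            Bs1[r][c] = B[(size_t)b1 * K * N + (k0 + r) * N + c];
        }
        __syncthreads();

        for (int k = 0; k < TILE_K; ++k) {
            float a[4];
            for (int i = 0; i < 4; ++i) a[i] = As[ty * 4 + i][k];  // staged in registers
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j) {
                    acc0[i][j] += a[i] * Bs0[k][tx * 4 + j];
                    acc1[i][j] += a[i] * Bs1[k][tx * 4 + j];
                }
        }
        __syncthreads();
    }

    // Write the 32 results.
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            C[(size_t)b0 * M * N + (ty * 4 + i) * N + tx * 4 + j] = acc0[i][j];
            C[(size_t)b1 * M * N + (ty * 4 + i) * N + tx * 4 + j] = acc1[i][j];
        }
}
```

Assuming an even number of inputs, a launch would look like gemm_two_inputs<<<num_inputs / 2, dim3(16, 16)>>>(d_A, d_B, d_C), i.e. a 1D grid with one block per pair of inputs.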

The other thing that had a big effect on speed is to use registers as much as possible: even though shared memory is faster than global memory, registers are faster still. A small comparison is sketched below.
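As an illustration of the register point (assumed code, not taken from the project), compare an inner product that re-reads shared memory for every multiply-add with a register-blocked version that stages a value of A in a register and reuses it for several outputs:

```
// Two ways to write the inner loop of a tiled GEMM (illustration only).

// (a) Re-reads shared memory for every multiply-add: 2 shared loads per FMA.
__device__ void dot_shared_heavy(const float (&As)[64][32],
                                 const float (&Bs)[32][64],
                                 int row, int col, float &acc)
{
    for (int k = 0; k < 32; ++k)
        acc += As[row][k] * Bs[k][col];
}

// (b) Stages As[row][k] in a register and reuses it for 4 outputs:
//     roughly 1.25 shared loads per FMA, at the cost of more registers.
__device__ void dot_register_blocked(const float (&As)[64][32],
                                     const float (&Bs)[32][64],
                                     int row, int col, float (&acc)[4])
{
    for (int k = 0; k < 32; ++k) {
        float a = As[row][k];              // kept in a register
        for (int j = 0; j < 4; ++j)
            acc[j] += a * Bs[k][col + j];  // 'a' reused from the register
    }
}
```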

Reference: http://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf

Other Notes

The ~100 μs speed was achieved by using asynchronous kernel calls (i.e. streams). Just using streams is not enough, though: you have to make sure that every kernel only uses data produced by the previous kernel in the same stream. For example, if you have two kernels "unroll1" and "matrix_multiplication1", then "matrix_multiplication1" may only use data produced by "unroll1", and the two must be launched in the same stream. If you manage to do this for every kernel, you will not need any synchronization calls between kernel launches (e.g. cudaDeviceSynchronize, cudaStreamSynchronize, etc.). You only need to call cudaDeviceSynchronize once, just before the data from the very last kernel is actually used; in our case that is the argmax in the main function. So essentially, by streaming everything perfectly and moving the cudaDeviceSynchronize outside the wrapper function, the timer is only timing the function call and return time.

The reason this is useful in general is that the host code normally has something else to do, and you would be wasting resources by just waiting for the GPU to finish its job. It is not that useful in this particular project because there is nothing for the CPU to do other than wait for the GPU (you could split the input data and give the CPU a small portion to process, but that would not help much). The reason I still think this is a legitimate way to do things is, as just said, that there is no reason to wait for the GPU when the CPU can do other work, in our case getting the end time; and since the data is only used for checking correctness, the synchronization should happen just before that.
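A minimal, self-contained sketch of this streaming pattern is shown below; the kernel bodies are placeholders standing in for the project's unroll and matrix-multiplication kernels, and all names and sizes here are illustrative:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real "unroll" and
// "matrix multiplication" kernels; only the stream structure matters here.
__global__ void unroll1(const float *in, float *unrolled, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) unrolled[i] = in[i];
}
__global__ void matrix_multiplication1(const float *unrolled, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * unrolled[i];
}

int main() {
    const int kStreams = 4, n = 1 << 16;
    cudaStream_t streams[kStreams];
    float *d_in[kStreams], *d_unrolled[kStreams], *d_out[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&d_in[s], n * sizeof(float));
        cudaMalloc((void **)&d_unrolled[s], n * sizeof(float));
        cudaMalloc((void **)&d_out[s], n * sizeof(float));
    }

    // Every kernel only reads data produced by the previous kernel in the
    // SAME stream, so no synchronization is needed between these launches.
    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (int s = 0; s < kStreams; ++s) {
        unroll1<<<grid, block, 0, streams[s]>>>(d_in[s], d_unrolled[s], n);
        matrix_multiplication1<<<grid, block, 0, streams[s]>>>(d_unrolled[s], d_out[s], n);
    }

    // All launches above return immediately; synchronize exactly once, right
    // before the results are actually needed (the argmax in the project).
    cudaDeviceSynchronize();
    printf("all streams finished\n");

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(d_in[s]); cudaFree(d_unrolled[s]); cudaFree(d_out[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

If a consumer kernel were instead launched in a different stream than its producer, you would need an event or a cudaStreamSynchronize between them, which is exactly what this per-stream chaining avoids.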