\begin{abstract}
%
Heterogeneous computing is widely used in the System-on-chip (SoC) processors
that power modern mobile devices in order to
reduce power consumption through specialization.
However, programming such systems can be extremely complex as a single
SoC combines multiple different
parallelism models, instruction sets, and memory hierarchies, and different
SoCs use \emph{different combinations} of these features.
We propose \NAME{}, a new Virtual Instruction Set Architecture (ISA) that aims to
address both functional portability and performance portability across
mobile heterogeneous SoCs by capturing the wide range of different
parallelism models expected to be available on future SoCs.
Our virtual ISA design uses only two parallelism models to achieve this goal:
\emph{a hierarchical dataflow graph with side effects} and
\emph{parametric vector instructions}.
\NAME{} is more general than existing virtual ISAs that focus heavily on GPUs,
such as PTX, HSAIL and SPIR, e.g., it can capture both streaming pipelined
parallelism and general dataflow parallelism found in many custom and
semi-custom (programmable) accelerators.
We present a compilation strategy to generate code for a diverse range
of target hardware components from the common virtual ISA.
As a first prototype, we have implemented backends for
GPUs that use nVidia's PTX,
vector hardware using Intel's AVX, and
host code running on X86 processors.
Experimental results show that code generated for vectors and GPUs
from a single virtual ISA representation achieves
performance that is within about a factor of 2x of separately hand-tuned code,
and much closer in most cases.
We further demonstrate qualitatively using a realistic example
that our virtual ISA abstractions are also suited for capturing pipelining and
streaming parallelism.
%
\end{abstract}
define {float*, i64} @laplacian(float* in %I, i64 %sizeI, float* in %B, i64 %sizeB, i32 %dimX, i32 %dimY) {
; Create dataflow nodes in child graph
%erode_node = call i8* @llvm.visc.createNode(@erode)
%dilate_node = call i8* @llvm.visc.createNode(@dilate)
%lincomb_node = call i8* @llvm.visc.createNode2D(@lincomb, i32 %dimX, i32 %dimY)
; Bind inputs of parent node Laplacian with child nodes Dilate, Erode and lincomb
call void @llvm.visc.bind.input(i8* %dilate_node, i32 0, i32 0)
call void @llvm.visc.bind.input(i8* %dilate_node, i32 1, i32 1)
call void @llvm.visc.bind.input(i8* %dilate_node, i32 2, i32 2)
call void @llvm.visc.bind.input(i8* %dilate_node, i32 3, i32 3)
call void @llvm.visc.bind.input(i8* %erode_node, i32 0, i32 0)
call void @llvm.visc.bind.input(i8* %erode_node, i32 1, i32 1)
call void @llvm.visc.bind.input(i8* %erode_node, i32 2, i32 2)
call void @llvm.visc.bind.input(i8* %erode_node, i32 3, i32 3)
call void @llvm.visc.bind.input(i8* %lincomb_node, i32 0, i32 0)
call void @llvm.visc.bind.input(i8* %lincomb_node, i32 1, i32 1)
call void @llvm.visc.bind.input(i8* %lincomb_node, i32 2, i32 6)
call void @llvm.visc.bind.input(i8* %lincomb_node, i32 3, i32 7)
; Create edges between child nodes for sending output of Erode and Dilate to lincomb node
call void @llvm.visc.createEdge(i8* %dilate_node, i8* %lincomb_node, i32 0, i32 2)
call void @llvm.visc.createEdge(i8* %dilate_node, i8* %lincomb_node, i32 1, i32 3)
call void @llvm.visc.createEdge(i8* %erode_node, i8* %lincomb_node, i32 0, i32 4)
call void @llvm.visc.createEdge(i8* %erode_node, i8* %lincomb_node, i32 1, i32 5)
; Bind output of lincomb node with output of parent node Laplacian
call void @llvm.visc.bind.output(i8* %lincomb_node, i32 0, i32 0)
call void @llvm.visc.bind.output(i8* %lincomb_node, i32 1, i32 1)
ret {float*, i64} zeroinitializer
}
define void @lincomb(double* I, double* Ie, double* Id) {
%N = call i8* @llvm.visc.getNode()
%nidx = call i32 @llvm.visc.getNodeInstanceID.x(i8* %N)
%nidy = call i32 @llvm.visc.getNodeInstanceID.y(i8* %N)
%vl = call i32 @llvm.visc.getVectorLength(i32 8)
;Index and base address calculation using %nidx, %nidy
;for I, Id, Ie, L (not shown)
%pixel_I = load <%vl x double>* I_base
%pixel_Id = load <%vl x double>* Id_base
%pixel_Ie = load <%vl x double>* Ie_base
; Broadcast the scalar constant 2.0 into a vector of length %vl
%tmp = insertelement <%vl x double> undef, double 2.0, i32 0
%vec2 = shufflevector <%vl x double> %tmp, <%vl x double> undef,
                      <%vl x i32> zeroinitializer
; Compute L = (Id + Ie) - 2*I on %vl-wide vectors
%mul = fmul <%vl x double> %vec2, %pixel_I
%add = fadd <%vl x double> %pixel_Id, %pixel_Ie
%res = fsub <%vl x double> %add, %mul
store <%vl x double> %res, <%vl x double>* L_base
ret void
}
/***************************************************************************
*cr
*cr (C) Copyright 2010 The Board of Trustees of the
*cr University of Illinois
*cr All Rights Reserved
*cr
***************************************************************************/
/*
* Main entry of dense matrix-matrix multiplication kernel
*/
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <malloc.h>
#include <vector>
#include <iostream>
#include <cassert>
#include <CL/cl.h>
#include <parboil.h>
// I/O routines
extern bool readColMajorMatrixFile(const char *fn, int &nr_row, int &nr_col, std::vector<float>&v);
extern bool writeColMajorMatrixFile(const char *fn, int, int, std::vector<float>&);
extern char* readFile(const char*);
// Parameters of tile sizes
#define TILE_SZ 16
#define CHECK_ERROR(errorMessage) \
if(clStatus != CL_SUCCESS) \
{ \
std::cout<< errorMessage <<": "<< clStatus <<" Error!\n"; \
std::cout<<"Line: "<<__LINE__<<"\n"; \
exit(1); \
}
void basicSgemm( int m, int n, cl_mem A, cl_mem B, cl_mem C, cl_kernel clKernel, cl_command_queue clCommandQueue )
{
// In this code we assume the matrix sizes are multiple of tile size
if ((m%TILE_SZ) || (n%TILE_SZ)) {
std::cerr << "unsupported size of matrix. m should be multiple of " << TILE_SZ
<< "; n should be multiple of " << TILE_SZ << std::endl;
}
//#ifdef ROWM
//size_t db = m;
//size_t dg = (m*n);
//#else
//size_t db = n;
//size_t dg = (m*n);
//#endif
#ifdef ROWM
size_t dg = m;
#else
size_t dg = n;
#endif
cl_int clStatus;
//std::cout << "Block dim = " << db << ", Group dim = " << dg/db << "\n";
clStatus = clSetKernelArg(clKernel,0,sizeof(cl_mem),(void*)&A);
clStatus = clSetKernelArg(clKernel,1,sizeof(cl_mem),(void*)&B);
clStatus = clSetKernelArg(clKernel,2,sizeof(cl_mem),(void*)&C);
clStatus = clSetKernelArg(clKernel,3,sizeof(int),(void*)&m);
clStatus = clSetKernelArg(clKernel,4,sizeof(int),(void*)&n);
CHECK_ERROR("clSetKernelArg")
//clStatus = clEnqueueNDRangeKernel(clCommandQueue,clKernel,1,NULL,&dg,&db,0,NULL,NULL);
clStatus = clEnqueueNDRangeKernel(clCommandQueue,clKernel,1,NULL,&dg,NULL,0,NULL,NULL);
CHECK_ERROR("clEnqueueNDRangeKernel")
clStatus = clFinish(clCommandQueue);
CHECK_ERROR("clFinish")
}
int main (int argc, char *argv[]) {
struct pb_Parameters *params;
struct pb_TimerSet timers;
size_t A_sz, B_sz, C_sz;
int matArow, matAcol;
int matBrow, matBcol;
std::vector<float> matA, matB;
/* Read command line. Expect 3 inputs: A, B and B^T
in column-major layout*/
params = pb_ReadParameters(&argc, argv);
if ((params->inpFiles[0] == NULL)
|| (params->inpFiles[1] == NULL)
|| (params->inpFiles[2] != NULL))
{
fprintf(stderr, "Expecting three input filenames\n");
exit(-1);
}
/* Read in data */
// load A
readColMajorMatrixFile(params->inpFiles[0],
matArow, matAcol, matA);
// load B^T
readColMajorMatrixFile(params->inpFiles[1],
matBrow, matBcol, matB);
assert(matArow == matBrow && matAcol == matBcol && "Dimensions of two input matrices should match");
pb_InitializeTimerSet(&timers);
pb_SwitchToTimer(&timers, visc_TimerID_SETUP);
cl_int clStatus;
cl_platform_id clPlatform;
clStatus = clGetPlatformIDs(1,&clPlatform,NULL);
CHECK_ERROR("clGetPlatformIDs")
cl_context_properties clCps[3] = {CL_CONTEXT_PLATFORM,(cl_context_properties)clPlatform,0};
cl_context clContext = clCreateContextFromType(clCps,CL_DEVICE_TYPE_GPU,NULL,NULL,&clStatus);
CHECK_ERROR("clCreateContextFromType")
cl_device_id clDevice;
clStatus = clGetDeviceIDs(clPlatform,CL_DEVICE_TYPE_GPU,1,&clDevice,NULL);
CHECK_ERROR("clGetDeviceIDs")
cl_command_queue clCommandQueue = clCreateCommandQueue(clContext,clDevice,CL_QUEUE_PROFILING_ENABLE,&clStatus);
CHECK_ERROR("clCreateCommandQueue")
pb_SetOpenCL(&clContext, &clCommandQueue);
// const char* clSource[] = {readFile("src/opencl_base/kernel_offline.nvptx.s")};
// cl_program clProgram = clCreateProgramWithSource(clContext,1,clSource,NULL,&clStatus);
cl_kernel clKernel;
cl_program clProgram;
pb_CreateAndBuildKernelFromBinary("build/opencl_default/kernel_offline.nvptx.s", "matAdd", &clContext, &clDevice, &clProgram, &clKernel);
//cl_program clProgram = clCreateProgramWithSource(clContext,1,clSource,NULL,&clStatus);
//CHECK_ERROR("clCreateProgramWithSource")
//char clOptions[50];
//sprintf(clOptions,"");
//clStatus = clBuildProgram(clProgram,1,&clDevice,clOptions,NULL,NULL);
//CHECK_ERROR("clBuildProgram")
//cl_kernel clKernel = clCreateKernel(clProgram,"mysgemmNT",&clStatus);
//CHECK_ERROR("clCreateKernel")
pb_SwitchToTimer( &timers, pb_TimerID_COMPUTE );
// copy A to device memory
A_sz = matArow*matAcol*sizeof(float);
B_sz = matBrow*matBcol*sizeof(float);
// allocate space for C
C_sz = matArow*matBcol*sizeof(float);
// OpenCL memory allocation
std::vector<float> matC(matArow*matBcol);
pb_SwitchToTimer( &timers, pb_TimerID_COPY );
cl_mem dA = clCreateBuffer(clContext,CL_MEM_READ_ONLY,A_sz,NULL,&clStatus);
CHECK_ERROR("clCreateBuffer")
cl_mem dB = clCreateBuffer(clContext,CL_MEM_READ_ONLY,B_sz,NULL,&clStatus);
CHECK_ERROR("clCreateBuffer")
cl_mem dC = clCreateBuffer(clContext,CL_MEM_WRITE_ONLY,C_sz,NULL,&clStatus);
CHECK_ERROR("clCreateBuffer")
// Copy A and B^T into device memory
clStatus = clEnqueueWriteBuffer(clCommandQueue,dA,CL_FALSE,0,A_sz,&matA.front(),0,NULL,NULL);
CHECK_ERROR("clEnqueueWriteBuffer")
clStatus = clEnqueueWriteBuffer(clCommandQueue,dB,CL_FALSE,0,B_sz,&matB.front(),0,NULL,NULL);
CHECK_ERROR("clEnqueueWriteBuffer")
clStatus = clEnqueueWriteBuffer(clCommandQueue,dC,CL_TRUE,0,C_sz,&matC.front(),0,NULL,NULL);
CHECK_ERROR("clEnqueueWriteBuffer")
pb_SwitchToTimer( &timers, pb_TimerID_KERNEL );
// Use standard sgemm interface
basicSgemm(matArow, matAcol, dA, dB, dC, clKernel, clCommandQueue);
pb_SwitchToTimer( &timers, pb_TimerID_COPY );
clEnqueueReadBuffer(clCommandQueue,dC,CL_TRUE,0,C_sz,&matC.front(),0,NULL,NULL);
pb_SwitchToTimer( &timers, visc_TimerID_SETUP);
clStatus = clReleaseKernel(clKernel);
clStatus = clReleaseProgram(clProgram);
clStatus = clReleaseMemObject(dA);
clStatus = clReleaseMemObject(dB);
clStatus = clReleaseMemObject(dC);
clStatus = clReleaseCommandQueue(clCommandQueue);
clStatus = clReleaseContext(clContext);
pb_SwitchToTimer(&timers, pb_TimerID_NONE);
pb_PrintTimerSet(&timers);
if (params->outFile) {
/* Write C to file */
//pb_SwitchToTimer(&timers, pb_TimerID_IO);
writeColMajorMatrixFile(params->outFile,
matArow, matBcol, matC);
}
double GPUtime = pb_GetElapsedTime(&(timers.timers[pb_TimerID_KERNEL]));
std::cout<< "GFLOPs = " << 2.* matArow * matBcol * matAcol/GPUtime/1e9 << std::endl;
pb_FreeParameters(params);
//free((void*)clSource[0]);
return 0;
}
__kernel void matAdd( __global float *A, __global float *B, __global float* C, int m, int n ) {
int j = get_global_id(0);
for(int i=0; i < m; i++)
C[i*n+j] = A[i*n+j] + B[i*n+j];
}
void basicSgemm( int m, int n, cl_mem A, cl_mem B, cl_mem C, cl_kernel clKernel, cl_command_queue clCommandQueue )
{
size_t global_work_group = n;
clSetKernelArg(clKernel,0,sizeof(cl_mem),(void*)&A);
clSetKernelArg(clKernel,1,sizeof(cl_mem),(void*)&B);
clSetKernelArg(clKernel,2,sizeof(cl_mem),(void*)&C);
clSetKernelArg(clKernel,3,sizeof(int),(void*)&m);
clSetKernelArg(clKernel,4,sizeof(int),(void*)&n);
clEnqueueNDRangeKernel(clCommandQueue, clKernel, 1, NULL, &global_work_group, NULL, 0, NULL, NULL);
clFinish(clCommandQueue);
}
__kernel void matAdd( __global float *A, __global float *B, __global float* C, int m, int n ) {
int i = get_global_id(0);
for(int j=0; j < n; j++)
C[i*n+j] = A[i*n+j] + B[i*n+j];
}
void basicSgemm( int m, int n, cl_mem A, cl_mem B, cl_mem C, cl_kernel clKernel, cl_command_queue clCommandQueue )
{
size_t global_work_group = m;
clSetKernelArg(clKernel,0,sizeof(cl_mem),(void*)&A);
clSetKernelArg(clKernel,1,sizeof(cl_mem),(void*)&B);
clSetKernelArg(clKernel,2,sizeof(cl_mem),(void*)&C);
clSetKernelArg(clKernel,3,sizeof(int),(void*)&m);
clSetKernelArg(clKernel,4,sizeof(int),(void*)&n);
clEnqueueNDRangeKernel(clCommandQueue, clKernel, 1, NULL,
&global_work_group, NULL, 0, NULL, NULL);
clFinish(clCommandQueue);
}
%------------------------------------------------------------------------------
\section{Compilation Strategy}
\label{sec:compiler}
%------------------------------------------------------------------------------
The goal of our compilation strategy is to generate native code from a single
virtual ISA format, allowing parts of an application to map flexibly to
different compute units. Our goal, in this paper, is not to develop new
optimization techniques on this virtual ISA; we are developing
those techniques in our ongoing research.
In this paper, we show how the virtual ISA design lends itself to
be compiled piecewise to different hardware compute units.
%\begin{center}
%\begin{figure}[hbt]
%\centering
%%\hspace*{4ex}
%\includegraphics[height=4cm]{Figures/compilation-short.png}
%\caption{\footnotesize{Compilation Flow of virtual ISA program}}
%\label{fig:compilation}
%\end{figure}
%\end{center}
We use simple annotations on the node functions to specify to which compute
unit a given graph node should be translated, e.g., the annotation may specify
one or more of \{{\tt GPU, Vector, None}\}.
%
Typically, the annotations would be chosen by a language front-end,
a programmer, or (in future) a run-time scheduler that decided when a
new version of native code was needed for a given subgraph.
%
If an entire hierarchical graph will be compiled as a single kernel mapped
to a single compute unit, then only the parent node of that graph needs to
be annotated.
%
The compiler will generate code for each such graph using the compilation
flow described below.
Device-specific ``translators'' use this information to generate native
code for a particular compute unit. Once nodes have been mapped to
different hardware components, the compiler generates the code that
transfers data between the corresponding components.
%
In the future, virtual ISA compilers can allow more flexible mapping
by generating native code for multiple targets for the same subgraph,
relying on the runtime and scheduler to perform data transfers once the mapping of the
source and destination nodes of a dataflow edge is known at run time.
Our current compilation strategy does not support cycles in a dataflow graph,
although loops within leaf nodes present no problems.
%
Outer-level cycles must be expressed in the host code outside the dataflow
graphs, as we do for iterative algorithms (like {\tt stencil}) and streaming
computations (like the image processing pipeline described in Section~\ref{sec:evaluation:streaming}).
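For concreteness, the following sketch shows how such an outer-level cycle might look in host code; \texttt{GraphHandle}, \texttt{launchGraph}, and \texttt{waitGraph} are hypothetical stand-ins for the code generated for the {\tt launch} and {\tt wait} intrinsics described later, not part of our actual runtime API.
\begin{lstlisting}[language=C++]
// Illustrative sketch: an iterative algorithm keeps its outer loop in host
// code and launches the (acyclic) dataflow graph once per iteration.
struct GraphHandle;                      // opaque runtime handle (assumed)
GraphHandle *launchGraph(void *args);    // asynchronous graph launch (assumed)
void waitGraph(GraphHandle *h);          // block until completion (assumed)

void runStencil(void *graphArgs, int iterations) {
  for (int t = 0; t < iterations; ++t) { // outer-level cycle stays on the host
    GraphHandle *h = launchGraph(graphArgs);
    waitGraph(h);                        // the graph itself remains acyclic
  }
}
\end{lstlisting}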
%------------------------------------------------------------------------------
\subsection{Compilation Flow}
\label{sec:compiler:flow}
%------------------------------------------------------------------------------
The compilation flow for a virtual ISA program can be divided into three
phases:
%
(1) Mapping and code generation of distinct subgraphs to hardware
accelerators, specifically, compute code for the annotated nodes.
%
(2) Calls to the run-time library for data movement for the DFG edges.
%
(3) Generating sequential code for the remaining unmapped parts of the graph.
%
%What about {\tt launch/wait} intrinsic code generation.
%
The latter phase -- sequential code -- is straightforward and is only
briefly described in Section~\ref{sec:compiler:impl}.
%
The other two phases are described below.
The translation to native code is carried out for one annotated node
at a time.
The compilation requires traversal of the
dataflow graph to find the annotated nodes and to translate each of them into
native code for the selected compute unit.
%
We use Algorithm~\ref{algo:traversal} to traverse the hierarchical graph
and find the annotated nodes.
%
This algorithm is a simple depth-first traversal of the graph, translating
each annotated node as it is encountered, as described below.
%
The edges between nodes belonging to the same child graph in the
hierarchical graph are dataflow edges that
require run-time support for the data transfers.
\begin{algorithm}
\caption{Hierarchical Dataflow Graph Traversal}
\label{algo:traversal}
\begin{algorithmic}[1]
\Procedure{Visit}{Node $N$}
\If{$N$ was visited before}
\State \Return
\EndIf
\If{$N$ is an annotated node}
\State \textit{NN} = CollapseToLeaf($N$)
\State Translate(\textit{NN})
\Else \Comment{$N$ is an internal node}
\State $G \gets$ child graph of node $N$
\State $L \gets$ list of all nodes of $G$ in topological order
\While {$L$ is non-empty}
\State remove a node $n$ from $L$
\State $\textsc{Visit}(n)$
\EndWhile
\EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
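For concreteness, the sketch below restates Algorithm~\ref{algo:traversal} in C++; the \texttt{Node} type and the \texttt{CollapseToLeaf} and \texttt{Translate} stubs are illustrative placeholders, not our compiler's actual data structures.
\begin{lstlisting}[language=C++]
#include <set>
#include <vector>

// Minimal C++ sketch of Algorithm 1 (simplified placeholders).
struct Node {
  bool annotated = false;
  std::vector<Node*> childTopoOrder;  // child graph in topological order (empty for leaves)
};

Node *CollapseToLeaf(Node *n) { return n; }  // stub: collapse child graph into one leaf
void  Translate(Node *)       {}             // stub: generate native code for the leaf

void Visit(Node *n, std::set<Node*> &visited) {
  if (!visited.insert(n).second)
    return;                                  // N was visited before
  if (n->annotated) {
    Translate(CollapseToLeaf(n));            // translate the whole annotated subgraph
  } else {                                   // internal, unannotated node
    for (Node *child : n->childTopoOrder)
      Visit(child, visited);                 // visit children in topological order
  }
}
\end{lstlisting}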
%------------------------------------------------------------------------------
\subsubsection{Mapping Subgraphs to Accelerators}
\label{sec:compiler:mapping}
%------------------------------------------------------------------------------
The annotations described earlier identify distinct subgraphs that should be
mapped to specific compute units.
For example, the subgraph containing the
{\tt Laplacian} node in Figure~\ref{fig:designexample} expresses parallelism
well suited for a GPU, and assuming it is marked as such,
the GPU translator would translate it for execution on an available GPU.
%
It would first collapse the hierarchical graph at the node, $N$, into a single
leaf node, \textit{NN}, and then translate node \textit{NN} to the specified
compute unit.
%
Collapsing a graph into a single node is conceptually straightforward, though
it involves many steps, and the details are omitted here.
%
To translate the leaf node, the translator isolates the functions
associated with the node into a separate LLVM module and generates native
code for it.
%
The specific details of the translation are implementation specific, and are
described below in Section~\ref{sec:compiler:impl}.
%
The final result of this phase is a new
graph where all leaf nodes have been translated for execution on
individual compute units.
%------------------------------------------------------------------------------
\subsubsection{Data Movement and Internal Nodes' Code Generation}
\label{sec:compiler:datamovement}
%------------------------------------------------------------------------------
The input to this phase is a graph where all leaf nodes have been mapped to
hardware accelerators and contain target specific code. The compiler performs
code generation of all the internal nodes of this graph, and for dataflow edges
between nodes. The child graph of each internal node is traversed in
topological order, and calls to the functions of the corresponding leaf
nodes are inserted.
For CPU code (e.g., targeting vector hardware),
loops are inserted around a function call if a static child node maps to
multiple instances in the dynamic dataflow graph.
For dataflow edges where the
source and destination nodes execute on the same compute unit, or
on two different compute units that share memory,
passing a pointer between the nodes is sufficient.
Such pointer passing is safe even with copy semantics because
a dataflow edge implies that the source node must have
\emph{completed} execution before the sink node can begin, so the
source node will not overwrite the data after the sink node begins execution.
%
However, several
accelerators today have separate memory hierarchies, and data must be
explicitly brought into the accelerator memory before execution starts. In
such cases, explicit data-copy instructions are generated using calls to the
accelerator API. For example, we use OpenCL API calls to move data to
and from the GPU.
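The following minimal sketch illustrates this per-edge decision; the types, the target set, and the helper names are illustrative and do not mirror the compiler's actual interfaces.
\begin{lstlisting}[language=C++]
#include <string>

// Sketch of the edge-lowering decision described above.
enum class Target { Host, Vector, GPU };

struct NodeInfo { Target target; };
struct EdgeInfo { const NodeInfo *source, *sink; };

// In our prototype, the host and the AVX path share memory; a discrete GPU does not.
static bool sharesHostMemory(Target t) { return t != Target::GPU; }

std::string lowerEdge(const EdgeInfo &e) {
  if (sharesHostMemory(e.source->target) && sharesHostMemory(e.sink->target))
    return "pass-pointer";      // safe: the source node completes before the sink starts
  return "emit-explicit-copy";  // e.g., OpenCL clEnqueueWriteBuffer/clEnqueueReadBuffer calls
}
\end{lstlisting}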
%------------------------------------------------------------------------------
\subsection{Implementation}
\label{sec:compiler:impl}
%------------------------------------------------------------------------------
Our current compiler has functional translators
for compiling the \NAME{} virtual ISA to PTX,
AVX and host code for x86-64 (host code should also work for other
architectures for which an LLVM backend and the OpenCL run-time are available).
To reduce implementation effort
for our prototype, we leverage existing backends in the mainline LLVM
infrastructure for PTX (the open source NVPTX back end) and for AVX
(the LLVM-to-SPIR back-end with Intel's OpenCL SPIR-to-AVX translator).
Our implementation then mainly has to translate our
virtual ISA to the input code expected by each of these back-ends.
%------------------------------------------------------------------------------
\subsubsection{Translators}
\label{sec:compiler:translators}
%------------------------------------------------------------------------------
Our PTX translator takes a subgraph
in which an internal node has a single leaf node in its child graph,
replicated into several dynamic instances, and generates NVVM
IR~\cite{NVVM:URL} for the leaf node.
NVVM IR is a subset of the LLVM IR, together with a set of intrinsic functions,
which the open source NVPTX backend can translate
into PTX~\cite{PTX2.3Manual:URL} assembly.
For the internal node, our translator generates code
to load and run the PTX assembly of the leaf node on
the target nVidia GPU using the
nVidia OpenCL runtime to execute the internal node.
In a similar fashion, our
AVX translator generates SPIR~\cite{SPIRKhronosSpec} code for the leaf node and
uses the Intel OpenCL~\cite{IntelOpenCL:URL} runtime to execute it on
multicore CPUs supporting AVX extensions.
The Intel SPIR translator to AVX has significant autovectorization capabilities
that take advantage of the independence of SPIR kernel instances to produce
vector code.
Note that it is reasonable for us to reuse Intel's vectorizer instead of
writing our own because our goal is \emph{not} to invent new vectorization and
vector code generation technology: rather, our goal is to
show that the \NAME{} virtual ISA is a suitable
input code representation for enabling effective vectorization, which we can
accomplish by feeding Intel's SPIR translator from our virtual ISA.
OpenCL does not allow dynamic memory allocation inside the kernel. As a result,
dataflow nodes which perform dynamic memory allocation cannot be compiled for
GPUs. For nodes that produce a data array as output, pointers to pre-allocated
arrays are instead passed as arguments to the node. Thus, a pointer argument to a node
may refer to either an input or an output data array. The general idiom we use to pass
arrays is to provide a pointer to the array and the array size as arguments.
To differentiate between pointers to
input/output data arrays, we add attributes {\tt in}, {\tt out}, and {\tt inout}
to node arguments as shown for input pointer $I$ in Listing~\ref{lst:laplacian}.
These attributes enable us to avoid extra memory copies when executing on GPUs.
For example, in the iterative \texttt{stencil} benchmark, the main kernel is
executed a fairly large number of times; each time, only one of the two arrays it
operates on needs to be copied back to the host, while the other is
copied from the host to the GPU.
By marking one of the array arguments as {\tt in} and the other one as
{\tt out}, we avoid the extra copy in each direction.
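The sketch below illustrates how these attributes can drive copy elision around a GPU node launch; the \texttt{ArrayArg} structure and \texttt{launchNode} helper are hypothetical, and kernel arguments are assumed to have been set already via {\tt clSetKernelArg}.
\begin{lstlisting}[language=C++]
#include <CL/cl.h>
#include <vector>

// Illustrative sketch: copy only what the in/out attributes require.
enum class ArgAttr { In, Out, InOut };

struct ArrayArg {
  void   *hostPtr;
  cl_mem  devBuf;
  size_t  size;
  ArgAttr attr;
};

void launchNode(cl_command_queue q, cl_kernel k,
                std::vector<ArrayArg> &args, size_t globalSize) {
  for (auto &a : args)
    if (a.attr != ArgAttr::Out)  // arrays marked only 'out' need no host-to-device copy
      clEnqueueWriteBuffer(q, a.devBuf, CL_FALSE, 0, a.size, a.hostPtr, 0, NULL, NULL);
  clEnqueueNDRangeKernel(q, k, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
  for (auto &a : args)
    if (a.attr != ArgAttr::In)   // arrays marked only 'in' need no device-to-host copy
      clEnqueueReadBuffer(q, a.devBuf, CL_TRUE, 0, a.size, a.hostPtr, 0, NULL, NULL);
}
\end{lstlisting}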
%------------------------------------------------------------------------------
\subsubsection{{Launch/Wait} Intrinsic Code Generation}
\label{sec:compiler:launchwait}
%------------------------------------------------------------------------------
The {\tt launch} intrinsic is used to asynchronously start a dataflow
graph execution from host code. The {\tt wait} intrinsic blocks until the
dataflow graph execution is complete. The compiler replaces the {\tt launch}
intrinsic with a runtime API call to start the dataflow graph execution in a
new thread, using the Posix pthreads library.
The {\tt wait} intrinsic is implemented using {\tt pthread\_join}.
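A minimal sketch of this lowering is shown below; \texttt{executeDataflowGraph}, \texttt{launchGraph}, and \texttt{waitGraph} are illustrative names (the opaque graph handle of the earlier sketch is simply a thread ID here).
\begin{lstlisting}[language=C++]
#include <pthread.h>

// Illustrative stand-in for the runtime routine that evaluates the
// dataflow graph's internal nodes and invokes translated leaf-node code.
static void *executeDataflowGraph(void *graphArgs) {
  // ... evaluate internal nodes and invoke translated leaf kernels ...
  return NULL;
}

// The launch intrinsic: start graph execution asynchronously in a new thread.
pthread_t launchGraph(void *graphArgs) {
  pthread_t tid;
  pthread_create(&tid, NULL, executeDataflowGraph, graphArgs);
  return tid;
}

// The wait intrinsic: block until the dataflow graph execution completes.
void waitGraph(pthread_t tid) {
  pthread_join(tid, NULL);
}
\end{lstlisting}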
%------------------------------------------------------------------------------
\subsubsection{\NAME{} Runtime}
\label{sec:compiler:runtime}
%------------------------------------------------------------------------------
Previous subsections describe the static code generation of key features of the
virtual ISA. Two specific features, however, require runtime support.
First, the virtual ISA design allows a leaf node to issue node-instance and dimension
queries for any ancestor. When such a query can be answered from hardware
registers, the query intrinsic is replaced by the corresponding accelerator API
call. When it is not, the runtime maintains a stack that keeps
track of the instance ID and dimension limits of the dynamic instances of the
ancestors, and answers queries from that stack.
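A minimal sketch of this bookkeeping is shown below, for the two-dimensional case; the frame layout and method names are illustrative.
\begin{lstlisting}[language=C++]
#include <cstdint>
#include <vector>

// One frame per ancestor of the currently executing leaf node (leaf on top).
struct NodeFrame {
  uint32_t idX, idY;    // instance ID of this dynamic node instance
  uint32_t dimX, dimY;  // number of instances along each dimension
};

class NodeQueryStack {
  std::vector<NodeFrame> frames;
public:
  void enter(const NodeFrame &f) { frames.push_back(f); }
  void leave()                   { frames.pop_back(); }

  // Answer a getNodeInstanceID.x-style query for the ancestor that lies
  // `levelsUp` levels above the currently executing leaf node.
  uint32_t instanceIdX(size_t levelsUp) const {
    return frames[frames.size() - 1 - levelsUp].idX;
  }
  uint32_t numInstancesX(size_t levelsUp) const {
    return frames[frames.size() - 1 - levelsUp].dimX;
  }
};
\end{lstlisting}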
Second, the dataflow graph semantics of the virtual ISA assumes a globally
addressable memory model. However, in their present form, many
accelerators in an SoC do not support this model. For example, many of
today's GPUs cannot address CPU memory directly (although this capability is
emerging and may be more common in future).
In such a scenario, the data has to be
explicitly transferred to the accelerator memory before one initiates
computation on the accelerator. To perform these data transfers, the
translator inserts static API
calls to the accelerator runtime in the generated native binary. These
data copies are expensive and critical to application performance. Such a copy
may be unnecessary because the data is already present on
the device, having been brought into device memory
by a prior node executing there. Thus, as an optimization, the \NAME{}
runtime incorporates a feature we call the ``memory tracker,''
which keeps track of where the latest copy of each data array resides,
avoiding unnecessary copies to and from the accelerator.
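A minimal sketch of the memory tracker's bookkeeping is shown below; the interface is illustrative, and the real runtime also tracks the device buffer handle and array size.
\begin{lstlisting}[language=C++]
#include <map>

// Where the most recent copy of an array currently lives.
enum class Latest { Host, Device, Both };

class MemoryTracker {
  std::map<void*, Latest> table;   // keyed by the host pointer of each array
public:
  // Before a node runs on the device: copy only if the device copy is stale.
  bool needsCopyToDevice(void *hostPtr) {
    auto it = table.find(hostPtr);
    return it == table.end() || it->second == Latest::Host;
  }
  // Before host code reads: copy back only if the host copy is stale.
  bool needsCopyToHost(void *hostPtr) {
    auto it = table.find(hostPtr);
    return it != table.end() && it->second == Latest::Device;
  }
  void noteDeviceWrite(void *p) { table[p] = Latest::Device; }
  void noteHostWrite(void *p)   { table[p] = Latest::Host; }
  void noteCopy(void *p)        { table[p] = Latest::Both; }
};
\end{lstlisting}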
%\label{sec:compiler:impl} We implement the compilation strategy as a
%series of LLVM passes. We describe a few key passes in the current compiler.
%\todo[inline]{Optimizations??}
%The compilation process preserves some key invariants:
%\begin{enumerate}
%\item Every kernel is generated from a distinct subgraph in the dataflow graph hierarchy.
%\item After Codegen:
%\begin{itemize}
%\item Every leaf node in the hierarchy is represented by one or more kernels
%\item An internal node never becomes a kernel; instead, it is evaluated at runtime to
%instantiate the child subgraphs.
%\end{itemize}
%\end{enumerate}
%The \NAME{} compilation flow can be broken down into the following steps -
%\begin{enumerate}
%\item The front end would parse the source files and generate the hierarchical
%dataflow graph (called DFG, represented through intrinsics in LLVM IR) along
%with LLVM IR for the leaf nodes.
%\item The Graph Builder Pass would construct the internal representation for the
%static dataflow flow graph by parsing
%the DFG intrinsics.
%\item Other passes operate on and optimize the LLVM IR.
%\item The code generator uses the static dataflow graph to
%\begin{itemize}
%\item identify subgraphs which can be mapped to one or more available compute units
%(GPUs, DSPs, vectors) efficiently.
%\item generate code to assist the scheduler in dispatching these kernels to the
%appropriate compute units. One subgraph may be dispatched to multiple compute units.
%\item The backends generate the target specific code for the appropriate compute units.
%\item Finally, DFG runtime calls perform runtime
%scheduling and execute the target-specific kernel codes.
%\end{itemize}
%\end{enumerate}
%Currenlty, we use modified Clang to generate vitual ISA binary from OpenCL
%source code. We have implemented compiler backends for translation of virtual
%ISA code to native code for selected target backends.
%\begin{enumerate}
%\item DFG2LLVM\_NVPTX backend pass: This backend pass identifies the subgraph suitable
%for computation on an nVidia GPU and produces NVVM IR, which the LLVM NVPTX backend
%can translate to PTX assembly.
%\item DFG2LLVM\_SPIR backend pass: This pass is similar to the DFG2LLVM\_NVPTX pass,
%except that instead of the NVVM IR, it produces the SPIR binary. It generates
%code to invoke the Intel OpenCL runtime run the SPIR binary using Intel AVX SIMD
%instructions.
%\item DFG2LLVM\_X86 backend pass: This pass can generate single threaded x86
%source code for any remaining node in the dataflow graph and also generates code
%for launching the dataflow graph from host.
%\end{enumerate}
\section{Conclusion}
\label{sec:conclusion}
We present \NAME{}, a new Virtual ISA that aims to address the functional and
performance portability challenges arising in today's SoCs. \NAME{} is designed
as a hierarchical dataflow graph with side effects and parametric vector
instructions. We argue that these two models of parallelism exposed by \NAME{}
successfully capture the diverse forms of parallelism offered by a wide range of
parallel hardware. We also presented a compilation strategy
that uses a single object code to target a wide range of parallel hardware, and
implemented backend translators for nVidia's GPUs targeting PTX, vector
hardware using Intel's AVX, and host code for X86 processors.
We evaluate our design by (a) using a single \NAME{} representation of four
applications from the Parboil Benchmark Suite to generate code for both nVidia's
GPUs and vector hardware, and comparing with baselines that are each separately
tuned for their respective target device. The achieved performance is within a
factor of 2x in the worst case, demonstrating performance
portability from a single \NAME{} representation, and (b) demonstrating that
\NAME{} can naturally capture streaming parallelism due to its dataflow
representation.
%------------------------------------------------------------------------------
\section{Virtual ISA Design Goals}
\label{section:goals}
%------------------------------------------------------------------------------
Previous work~\cite{PTX2.3Manual:URL,VectorLLVA:VEE06} has shown that the approach of a
virtual ISA can achieve both high performance and be commercially viable.
In this work, we aim to design a virtual ISA for the wide range of parallel
hardware configurations found in current and future mobile SoCs.
We briefly summarize the primary design goals of our virtual ISA:
%
\begin{description}
%
\item{\bf Object code portability with as good performance as possible:}
%
The key goal of our virtual ISA design is to enable the \emph{virtual object
code} to be portable across a wide range of different configurations of
heterogeneous parallel SoCs, while obtaining as good performance as possible
on each compute unit.
%
We emphasize that we do not necessarily aim to match manually tuned
code for individual compute units because such tuning usually comes at the
cost of portability, or at the cost of hurting performance on other compute
units.
%
Object code portability is an absolute requirement for modern
applications running on mobile hardware.
%
Applications that absolutely require hand-tuned performance can already use
conditionally compiled code or \emph{fat binaries} or both to achieve such
performance, at the cost of significantly greater programming and maintenance
effort.
\item{\bf Language independence:} The virtual ISA should be able to support a
wide range of parallel programming languages, such as OpenCL, Renderscript,
and OpenMP 4.0 accelerator features.
In particular, the virtual ISA is \emph{not} intended as a source-level
programming language, but the parallelism abstractions must be easy to
reason about by programmers.
\item{\bf Machine independence:} The virtual ISA should be able to support a
wide range of hardware instruction sets and application binary interfaces (ABIs).
\item{\bf As few abstractions of parallelism as possible:} The virtual ISA must
use as few parallelism models as possible to capture the wide range of parallel
hardware on a modern SoC. This is important so that programmers can design
and tune algorithms without having to become experts in a wide range of
different kinds of parallelism.
%
These few abstractions must be able to map down effectively to today's
parallel hardware, such as multicore CPUs, GPUs and vectors, and also to
emerging parallel hardware, especially semi-custom, programmable accelerators.
%
(Custom, fixed-function accelerators may have high degrees of internal
parallelism but are usually programmed via fairly straightforward library
interfaces, which do not require exposing the detailed internal parallelism
features.)
\item{\bf Coarse-grain parallelism across compute units:} The virtual ISA
must capture relatively large-grain parallelism mapped to different compute
units, while compiling down to use as efficient data transfer mechanisms
as possible between those compute units.
\item{\bf Coarse- and fine-grain parallelism within compute units:} The
virtual ISA must also capture both coarse- and fine-grain parallelism that
can be mapped to a single compute unit, in order to achieve the highest
possible parallel performance for each compute kernel on a wide range of
compute units.
\item{\bf Representation of both explicit and implicit communication:} It must
be possible to represent both explicit data copies between compute units, e.g.,
between a CPU and an accelerator, and implicit data transfers through shared
memory, e.g., for a shared-memory multicore system or for emerging GPU
hardware that allows direct sharing between CPU and GPU. Both kinds of memory
transfers need to be under careful control of the programmer because memory
accesses are often the primary determining factor in program performance.
\item{\bf Flexible scheduling support across compute units:} It must be
possible to compile kernels represented in the virtual ISA to multiple
different compute units, so that a run-time scheduler can flexibly map a given
kernel to different compute units, depending on availability constraints
and battery conservation goals.
\item{\bf Offline compilation model:} To minimize energy consumption and
perceived application startup time, it should be possible to compile the
virtual ISA ahead-of-time (AOT) to native machine code. For example, this
was one major change from Android's Dalvik virtual machine, which uses
just-in-time (JIT) compilation every time an application is loaded, to the
ART system, which uses AOT compilation once at install time.
\end{description}
%------------------------------------------------------------------------------
\section{Evaluation}
\label{sec:evaluation}
%------------------------------------------------------------------------------
In our experiments, we evaluate the suitability of the virtual ISA design
in two ways. (1) The virtual ISA design should be portable. For this, we use
the same virtual ISA binary of an application to compile to different compute
units.
(2) When compared to current heterogeneous programming technologies such as
OpenCL, CUDA, and others, the virtual ISA design should be able to capture the
parallelism expressed using these languages, and thus achieve reasonable
performance when compiled to target architectures for these source-level
languages.
%------------------------------------------------------------------------------
\subsection{Experimental Setup and Benchmarks}
\label{sec:evaluation:setup}
%------------------------------------------------------------------------------
We modified the OpenCL front-end in the Clang compiler to generate the
virtual ISA for OpenCL applications.
%
We use annotations as hints to identify the subgraphs in the virtual ISA
that are suitable for accelerators.
%
We then used the compilation strategy described in Section~\ref{sec:compiler}
to translate the virtual ISA to two different target units:
the AVX instruction set in an Intel Xeon E5 core i7 and
a discrete nVidia GeForce GTX 680 GPU card with 2GB of memory.
The Intel Xeon also served as the host processor, running
at 3.6 GHz, with 16 GB RAM.
For our experimental evaluation, we used four applications from the
Parboil~\cite{Parboil} benchmark suite:
Sparse Matrix-Vector Multiplication (spmv),
Single-precision Matrix Multiply (sgemm),
Stencil PDE solver (stencil), and
a Lattice-Boltzmann solver (lbm).
In the GPU experiments, our baseline for comparison is the best available
OpenCL implementation
in Parboil that does not use local memory (since our virtual ISA does not
yet support local memory). For spmv and lbm, that is the Parboil version
labeled {\tt opencl\_nvidia}, which
has been hand-tuned for nVidia Tesla GPUs~\cite{Liwen:Personal}. For
sgemm, the hand-tuned version uses local memory, which prevented us
from using it. Instead, using that version as a starting point, we implemented
a version that is identical except that accesses to local
memory were replaced by accesses to global GPU memory, and we
tuned the work-group sizes to achieve the
best performance. Finally, for stencil, we use the basic version since following
the same practice did not improve the execution time.
All the applications are compiled using nVidia's proprietary OpenCL
compiler.
In the vector experiments, our baselines are the same OpenCL implementations that
we chose as GPU baselines,
but compiled using the Intel OpenCL compiler, as we found
that these versions also achieved the best performance among the
available OpenCL versions on vector hardware.
The \NAME{} binaries were also generated using the same versions of OpenCL.
We use two input
sizes for each benchmark, labeled `Small' and `Large' below.
Each data point we report is an average of ten runs for
the small test cases and an average of five runs for the large test cases;
we repeated the experiments multiple times to verify their stability.
%------------------------------------------------------------------------------
\subsection{Experimental Results}
\label{sec:evaluation:results}
%------------------------------------------------------------------------------
Figures~\ref{fig:gpusmall} and~\ref{fig:gpularge} show the normalized execution
time of these applications against the GPU baseline for each of the two input sizes.
Similarly, Figures~\ref{fig:cpusmall} and~\ref{fig:cpularge} compare the
performance of \NAME{} programs with the vector baseline. The execution times are
broken down into segments corresponding to time spent in the compute kernel of the
application (kernel), copying data (copy), and the remaining time spent on the host
side. The total execution time for the baseline is shown on the
corresponding bar to give an indication of the actual numbers.
Compared with the GPU baseline, \NAME{} achieves near
hand-tuned OpenCL performance for almost all of these benchmarks, except spmv on
the `Small' dataset, where it is within a factor of $1.2$. This gap is due to the
small total execution time of $0.076s$ for spmv on the `Small' dataset. For the `Large'
dataset, the \NAME{} code performance is on par with the OpenCL implementation:
because the total running time is larger, the effect of
constant overheads on the total execution time is minimal.
In the vector case, the performance of \NAME{} is within 25\% of the baseline in the
worst case. The kernel execution time of lbm is 25\% higher for the
\NAME{} implementation than for OpenCL. This is because the Intel OpenCL runtime,
which is used by the \NAME{} runtime, keeps one thread idle when it observes that an
extra thread has been created by the application; we have to create this thread
to execute the \NAME{} dataflow graph asynchronously. We expect this overhead to
go away with an improved OpenCL runtime implementation.
%Comparing \NAME{} code with the GPU baseline, the performance is within about
%25\% of the baseline in most cases and within a factor of
%$1.8$ in the worst case.
%We see that the \NAME{}
%application spends more time in the kernel execution relative to the GPU
%baseline. However, inspection of the generated PTX files generated by nVidia
%OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
%has shown that they are almost identical, with the only difference being a minor
%number of instructions being reordered. Also, we notice increased, sometimes to
%a significant factor, data copy times, despite the fact the data copied in both
%applications are similar and that the \NAME{} runtime makes use of a memory
%tracking mechanism to avoid unnecessary data copies. We are working on getting
%a
%clear picture of the overheads that the \NAME{} representation or compilation may
%be imposing on the program execution.
%In the vector case, we see that the performance of \NAME{} is within about
%30\% in all cases, and within a factor of 1.6x in the worst case.
%We again
%observe the same inefficiencies in kernel and copy time, albeit less pronounced
%due to the fact that the total running times are generally larger, which
%minimizes the effect of constant overheads to the total execution time.
Finally, we note that none of our benchmarks made use of vector code at the leaf
dataflow nodes. This choice was made after comparing the performance of two \NAME{}
versions: (a) the \NAME{} object code as generated from the modified Clang
frontend, and (b) the \NAME{} code after altering the number of dynamic instances
of the leaf nodes, as well as their code, so that each instance performs more
computation and vectorization can be applied. This transformation sometimes
improved performance for one of the two targets, but it never
achieved reasonable performance on both. This is due to the competing
code structure required to achieve good performance on GPUs and vector units.
In the GPU case, the code executed by a thread should perform carefully strided
memory accesses so that the memory requests of multiple threads can be
coalesced, while vector instructions are serialized by the hardware and thus
yield no performance gain. In the vector case, a thread aims to
access consecutive locations so as to perform vectorized memory operations and
computations. Thus, simple code in which all threads perform independent
operations and access consecutive locations has the potential to achieve good
performance on both targets, by allowing memory coalescing on the GPU side and
vectorization across work items in the vector case. To conclude, for simple
benchmarks where vectorization across work items can be achieved automatically,
our experiments show that explicit vector instructions do not improve
performance on both targets simultaneously. We expect vector instructions to yield
performance gains for more complicated kernels where automatic vectorization
is not effective.
\begin{figure*}[hbt]
\begin{minipage}{0.48\textwidth}
\begin{center}
\includegraphics[height=4cm]{Figures/gpusmall.png}
\caption{\footnotesize{GPU Experiments - Small Test Normalized Execution
Time}}
\label{fig:gpusmall}
\end{center}
\end{minipage}~~~~\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/gpularge.png}
\caption{\footnotesize{GPU Experiments - Large Test Normalized Execution
Time}}
\label{fig:gpularge}
\end{center}
\end{minipage}
\end{figure*}
\begin{figure*}[hbt]
\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/cpusmall.png}
\caption{\footnotesize{Vector Experiments - Small Test Normalized Execution
Time}}
\label{fig:cpusmall}
\end{center}
\end{minipage}~~~~\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/cpularge.png}
\caption{\footnotesize{Vector Experiments - Large Test Normalized Execution
Time}}
\label{fig:cpularge}
\end{center}
\end{minipage}
\end{figure*}
%------------------------------------------------------------------------------
\subsection{Expressing parallelism beyond GPUs}
\label{sec:evaluation:streaming}
%------------------------------------------------------------------------------
\NAME{} aims to be extensible beyond the devices that are most common
in today's systems and to represent the parallelism models of a broad class of
available hardware. Apart from data parallelism, many accelerators expose a
streaming parallelism model and would benefit greatly from a representation that
can capture this feature. \NAME{} offers the unique advantage of representing a
program as a dataflow graph, which is a natural way of representing the
communication between producers and consumers, as well as describing the
repeated transfer of multiple data items via streaming edges. This section uses
an image processing pipeline to demonstrate the benefits of expressing a
streaming application in \NAME{}.
\begin{center}
\begin{figure*}[hbt]
\centering
%\hspace*{4ex}
\includegraphics[height=6cm]{Figures/pipeline.png}
\caption{Edge detection in grayscale images in \NAME{}}
\label{fig:pipeline}
\end{figure*}
\vspace*{-1.5\baselineskip}
\end{center}
Figure~\ref{fig:pipeline} presents an application for edge detection in
grayscale images in \NAME{}. At a high level, this application is a dataflow node
that accepts a grayscale image $I$ and a binary structuring element $B$ and
computes a binary image $E$ that represents the edges of $I$. The application
begins by computing an estimate of the Laplacian $L$ of $I$, as depicted in
Figure~\ref{fig:pipeline}, and proceeds by computing its zero crossings,
i.e., points of sign change in $L$. A different dataflow node computes the
gradient $G$ of $I$, an operation that can proceed in parallel with the remaining
computations. The final dataflow node uses the outputs of the Gradient and
ZeroCrossings nodes to perform a thresholding operation that rejects
small variations in the brightness of the image and detects only the more significant
variations that actually constitute edges.
We implemented this pipeline using the OpenCV computer vision library.
We used the C++ thread library to create a thread for each top-level node in this
example, and implemented fixed-size
circular buffers for each streaming edge between these nodes to pass data
between them, as sketched below. The pipeline, streaming, and dataflow parallelism expressed in this
example are easy to capture in \NAME{}: the streaming edges and dataflow nodes
map directly to key features of \NAME{}. Our current implementation of \NAME{} is
only missing the implementation of circular buffers for streaming edges, and
thus we do not yet have a working \NAME{} version of this example.
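The sketch below shows a simplified version of such a circular buffer connecting a producer and a consumer node; \texttt{Frame} is a placeholder for the image type actually passed along the edge.
\begin{lstlisting}[language=C++]
#include <condition_variable>
#include <mutex>
#include <vector>

// Simplified fixed-size circular buffer used for one streaming edge.
struct Frame { std::vector<float> pixels; };

class StreamEdge {
  std::vector<Frame> buf;
  size_t head = 0, tail = 0, count = 0;
  std::mutex m;
  std::condition_variable notFull, notEmpty;
public:
  explicit StreamEdge(size_t capacity) : buf(capacity) {}

  void push(Frame f) {                        // producer side of the edge
    std::unique_lock<std::mutex> lk(m);
    notFull.wait(lk, [&]{ return count < buf.size(); });
    buf[tail] = std::move(f);
    tail = (tail + 1) % buf.size();
    ++count;
    notEmpty.notify_one();
  }

  Frame pop() {                               // consumer side of the edge
    std::unique_lock<std::mutex> lk(m);
    notEmpty.wait(lk, [&]{ return count > 0; });
    Frame f = std::move(buf[head]);
    head = (head + 1) % buf.size();
    --count;
    notFull.notify_one();
    return f;
  }
};
\end{lstlisting}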
In contrast, mapping pipeline and streaming parallelism onto the SPIR and HSAIL
parallelism model of a single kernel replicated across several cores is
non-intuitive and difficult to achieve. OpenCL supports concurrent execution of
kernels running in two different streams, but expressing concurrency across kernels
working on different image sections would require complex synchronization and an
implementation of programmer-managed scheduling of nodes. This is a tedious and
error-prone task, which is unlikely to scale to bigger and more complex
pipelines.
Expressing this example in \NAME{} would have the added advantage of flexibly
mapping computationally heavy parts of the pipeline to accelerators. The Laplacian node is the pipeline
bottleneck: mapping the Laplacian to the GPU achieves a 2x speedup, as it balances the
two branches of the pipeline, whereas mapping both the Laplacian and the Gradient to the
GPU achieves only a modest 1.1x speedup. This further shows the advantage of flexible
mapping, which allows the programmer or an auto-tuner to easily tune an application.
%\begin{center}
%\lstinputlisting[float=*, language=llvm]{Code/lincomb.ll}
%\end{center}