%------------------------------------------------------------------------------
\section{Introduction}
\label{sec:intro}
%------------------------------------------------------------------------------
In computing contexts where energy is an important consideration, such as in
mobile devices like smartphones, tablets, and e-book readers, or where power and
heat dissipation are important, such as in data centers, traditional homogeneous
multicore processors can be quite inefficient. These contexts are increasingly
seeing the advent of heterogeneous computing systems, which use specialized
computing elements that can deliver much greater efficiency in
performance-per-Joule or performance-per-Watt. For example, the ``application
processor'' on a modern smartphone or tablet is a heterogeneous System-on-chip
(SoC) that often includes not just a multicore host CPU, but also a GPU, a DSP,
and several more specialized processors for tasks such as audio and video
decoding, image processing, digital photography, and speech recognition.
Programming applications for hardware that uses such diverse combinations of
computing elements is extremely challenging. The challenges include developing
portable algorithms, writing efficient yet portable source-level programs,
producing portable object code, and tuning the programs. At a more fundamental
level, these challenges arise from three root causes: (1) diverse parallelism
models; (2) diverse memory architectures; and (3) diverse hardware instruction
sets. To make use of the full range of available hardware to maximize
performance and energy efficiency, the programming environment needs to provide
common abstractions for all the available hardware compute units in
heterogeneous systems. These abstractions are needed not only at the
source-code level but also at the object-code level, so that object code
remains portable across devices from the same or different manufacturers,
allowing application vendors to ship a single software version across a broad
range of devices.
%\begin{center}
%\begin{figure}[hbt]
%\centering\hspace*{4ex}\includegraphics[height=6.5cm]{Figures/visc.pdf}
%\caption{\footnotesize{System Organization for Virtual Instruction Set Computing
%in a Heterogeneous System}}
%\label{fig:visc}
%\end{figure}
%\end{center}
We believe that these issues are best addressed using a
virtual instruction set layer that abstracts away most of the low-level
details of different hardware components, but
provides a small number of abstractions of parallelism that can be mapped
down (or ``translated'') effectively to all the different kinds of parallel
hardware on a wide range of SoCs.
The (virtual) object code is translated down to specific hardware components
available on a particular device, at install time, load time or run-time.
This general approach, which we call Virtual Instruction Set Computing (VISC),
has been used very successfully for GPGPU computing, e.g., through the
PTX virtual ISA for several generations of nVidia GPUs, and more recently
HSAIL~\cite{HSAIL} and SPIR~\cite{SPIRKhronosSpec} for other classes of hardware.
Although HSAIL and SPIR can be mapped down to non-GPU hardware, their designs
have been heavily influenced by the SIMT parallelism model of GPUs; that model
maps well to GPU and vector hardware, but it limits their effectiveness for
other kinds of parallelism.
This is discussed in more detail in Sections~\ref{sec:evaluation:streaming}
and~\ref{sec:related}.
%
%The key point is that the only software components that can "see" the hardware
%details are the translators (i.e., compiler back ends), system-level and
%application-level schedulers, a minimal set of other low-level OS components and
%some device drivers. The rest of the software stack, including source-level
%language implementations, application libraries, and middleware, lives above the
%virtual ISA and is portable across different heterogeneous system
%configurations. Unlike previous VISC systems, our virtual instruction set design
%abstracts away and unifies the diverse forms of parallelism in hardware (using a
%combination of only two models of parallelism). It also provides abstractions
%for memory and communication, allowing back-end translators to generate code for
%efficient data movement across compute units. These abstractions enable
%programmers to write efficient software applications that are portable across a
%diverse range of hardware configurations. Moreover, we are exploiting the
%flexible translator-hardware communication in VISC systems to enable novel
%memory system designs that are more energy-efficient and higher performance than
%current designs.
In this paper, we propose a virtual ISA design that abstracts away the
wide range of parallelism models and the disparate instruction sets used
within and across SoCs.
(In this work, we do not consider the different memory hierarchy architectures
used across compute units or devices; addressing them is a subject of our
ongoing work.)
In fact, we can represent these different parallelism models using only
\emph{two abstractions of parallelism}:
\begin{itemize}
\item Hierarchical dataflow graphs with side effects, and
\item Short-vector SIMD (Single Instruction Multiple Data) instructions.
\end{itemize}
%
Dataflow graphs are a very general model of data parallelism and, when
extended to allow shared memory accesses (side effects), can capture
many forms of parallel computing over data elements, including
vector SIMD parallelism,
the SIMT (Single Instruction Multiple Threads) parallelism model used in
general-purpose GPUs,
streaming or pipelined-dataflow parallelism, and
fine-grained data parallelism, which may be synchronous or asynchronous.
%
Although dataflow graphs can capture vector parallelism too, vector
instructions, when applicable, provide a representation that is far more
compact and efficient, and much easier to reason about and transform;
for this reason, we include explicit vector instructions in our model.
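To make the compactness argument concrete at the source level, consider the
following sketch, which is ordinary C++ with AVX intrinsics shown purely for
illustration (it is not \NAME{} virtual ISA code): the same element-wise
addition can be expressed either as replicated scalar work or as a handful of
8-wide vector operations.
\begin{verbatim}
#include <immintrin.h>   // AVX intrinsics

// The same element-wise addition, first as conceptually replicated scalar
// work, then as explicit 8-wide short-vector (AVX) operations.
void addScalar(const float *a, const float *b, float *c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];            // one scalar instance per element
}

void addAVX(const float *a, const float *b, float *c, int n) {
  int i = 0;
  for (; i + 8 <= n; i += 8) {     // one vector add covers eight elements
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
  }
  for (; i < n; ++i)               // scalar remainder loop
    c[i] = a[i] + b[i];
}
\end{verbatim}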
We make the dataflow graphs hierarchical to express multiple
granularities of parallelism in a natural manner, e.g.,
coarse-grain parallelism across different compute units vs. fine-grain
parallelism within a single compute unit.
%
In particular, a dataflow graph node is either an \emph{internal node} or a
\emph{leaf node}.
%
An internal node itself contains another dataflow graph within it.
%
A leaf node contains executable code that is some mixture of scalar and
vector instructions.
%
Each leaf node in a dataflow graph includes a parameter value, $N$, which
specifies that the node should be \emph{replicated} $N$ times for
independent parallel execution; the value of $N$ may be computed at
run-time.
%
This allows the graph to capture fine-grain parallelism, and is
similar to how a GPU kernel in CUDA, OpenCL or PTX is replicated across
the threads of a GPU device.
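For concreteness, the following C++ sketch gives one possible in-memory
encoding of such a hierarchy; it is purely illustrative (the class and field
names are ours, not part of the \NAME{} virtual ISA), but it captures the
distinction between internal nodes, leaf nodes, and the run-time replication
factor $N$.
\begin{verbatim}
#include <functional>
#include <vector>

// Illustrative encoding of a hierarchical dataflow graph node
// (not the actual virtual ISA data structures).
struct DFNode {
  struct Edge { int producer, consumer; bool streaming; };

  // Internal node: 'children' holds the nested dataflow graph's nodes and
  // 'edges' its producer->consumer connections (indices into 'children').
  std::vector<DFNode> children;
  std::vector<Edge> edges;

  // Leaf node (children.empty()): executable scalar/vector code, invoked
  // once per replicated instance, plus the replication factor N, which
  // may be computed at run-time.
  std::function<void(int instance, int numInstances)> leafCode;
  std::function<int()> replication;   // returns N

  bool isLeaf() const { return children.empty(); }
};
\end{verbatim}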
One final feature of our representation is that a dataflow graph edge may
be either an ordinary edge or a ``streaming'' edge.
%
An ordinary edge represents a one-time data transfer from a producer node
to a consumer node; implicitly the two nodes connected by the edge are
executed only once.
%
A streaming edge specifies that the producer and consumer nodes execute
repeatedly, transferring data items continuously with the semantics of
a bounded FIFO buffer.
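The intended semantics of a streaming edge can be pictured as a bounded,
blocking FIFO between the two nodes, as in the C++ sketch below (again an
illustration, not part of \NAME{}): the producer blocks when the buffer is
full and the consumer blocks when it is empty, so repeated executions of the
two nodes overlap in a pipelined fashion.
\begin{verbatim}
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Illustrative bounded FIFO modelling a streaming dataflow edge.
template <typename T>
class StreamingEdge {
  std::queue<T> buf_;
  std::size_t capacity_;
  std::mutex m_;
  std::condition_variable notFull_, notEmpty_;
public:
  explicit StreamingEdge(std::size_t capacity) : capacity_(capacity) {}

  void push(T item) {                 // called by the producer node
    std::unique_lock<std::mutex> lk(m_);
    notFull_.wait(lk, [&] { return buf_.size() < capacity_; });
    buf_.push(std::move(item));
    notEmpty_.notify_one();
  }

  T pop() {                           // called by the consumer node
    std::unique_lock<std::mutex> lk(m_);
    notEmpty_.wait(lk, [&] { return !buf_.empty(); });
    T item = std::move(buf_.front());
    buf_.pop();
    notFull_.notify_one();
    return item;
  }
};
\end{verbatim}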
This code representation can be mapped down and executed effectively on
the full range of parallel hardware on a modern SoC, including GPUs,
vector hardware, multicore host processors, digital signal processors (DSPs),
and semi-custom hardware accelerators.
%
In this work, we describe a first prototype system that translates a single
virtual object code program to nVidia GPUs (using PTX), Intel's AVX vector
instructions, and X86 host processors.
%
We present preliminary experimental results comparing the performance of the
generated code for a set of benchmarks to hand-tuned code written using
OpenCL for the GPU and hand-vectorized for AVX.
%
Our results show that the code generated by \NAME{} is close in performance
to the hand-tuned code in many cases, and within about 2x in all cases.
%
These results were obtained with relatively little compiler optimization
for either GPU or vector hardware, which gives us confidence that \NAME{} can
provide object code portability with relatively low performance cost.
We also present a detailed description of a pipelined streaming benchmark
and how it is represented in \NAME{}.
%
Representing this benchmark in PTX, HSAIL or SPIR would be extremely awkward:
it would require manually written tiling and buffering, with complicated
synchronization to achieve concurrent execution of different pipeline
stages.
%
Although we have not yet implemented the buffered message passing required
for streaming parallelism, the example shows that \NAME{} can naturally
express a broader class of parallelism than can be expressed with the
existing virtual ISAs.
%
We also briefly discuss an example class of programmable, custom accelerators
for machine learning algorithms, which can be naturally targeted using the
parallelism models in \NAME{}, although capturing all the details of the
hardware is a subject of future work.
%%
%One key limitation of our current work is that we do \emph{not} yet provide
%portable abstractions of the varying memory hierarchies used in different
%hardware components.
%%
%Although we have a preliminary design for such abstractions, implementing the
%design fully and evaluating it are beyond the scope of this work.
The next section describes the high-level design goals of \NAME{}.
%
Section~\ref{sec:design} then presents the detailed design of the
\NAME{} virtual ISA, and its implementation as an extension of the LLVM
instruction set~\cite{LLVM:CGO04}.
%
Section~\ref{sec:compiler} describes our general compilation strategy,
and our prototype translators for PTX, AVX, and X86.
%
Section~\ref{sec:evaluation} presents our experimental results, a
qualitative discussion of the pipelined benchmark, and our future work
on the machine learning accelerator.
%
Section~\ref{sec:related} compares our work with the state of the art,
and Section~\ref{sec:conclusion} concludes.