Commit 7d010b81 authored by Prakalp Srivastava's avatar Prakalp Srivastava

minor changes to compilation

parent cd642cb7
compute elements on a single chip. These computing elements use different
parallelism models, instruction sets, and memory hierarchies, making it difficult
to achieve performance and code portability on heterogeneous systems.
Application programming for such systems would be greatly simplified if a single
object code representation could be used to generate code for different compute
units in a heterogeneous system. Previous efforts aiming to address the source
and object code portability challenges arising in such systems, such as OpenCL,
CUDA, SPIR, PTX and HSAIL, focus heavily on GPUs, which makes them insufficient
for today's SoCs.
Virtual Instruction Set Computing (VISC) is a powerful approach to better
portability. We propose to use it to address the code and performance
portability problem for heterogeneous mobile SoCs. In this paper we focus on the
crux of the VISC approach and present a novel virtual ISA design which adds
dataflow graph abstractions to LLVM IR to capture the diverse parallelism
models exposed by today's SoCs. We also present a compilation strategy to
generate code for AVX, PTX and X86 backends from a single virtual ISA
representation of a program. Through a set of experiments we show that code
generated for CPUs and GPUs from a single virtual ISA representation achieves
acceptable performance compared with hand-tuned code. We further demonstrate
that these virtual ISA abstractions are also suited for capturing pipelining and
streaming parallelism.
\end{abstract}
be compiled piecewise to different hardware compute units.
\end{center}
The VISC virtual ISA, at the top level, is neatly separated into component
dataflow nodes, which in practice can be used to represent computational
chunks of an application that are large enough to be executed on different
hardware components. Communication between these nodes is captured with
dataflow edges, which encode information about the size and type of data moving
from one component to another. Device-specific backend translators leverage
this feature to identify the nodes or subgraphs that can be mapped to a
particular hardware component and generate native code for that component.
Once nodes have been mapped to different hardware components, code for
transferring data between the corresponding hardware components is generated.
More sophisticated virtual ISA compilers could, in the future, allow flexible
mapping by generating native code for multiple backends for the same subgraph
and relying on the runtime and scheduler to perform data transfers when the
mapping of the source and destination nodes of a dataflow edge is only known
at runtime.
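To make the node and edge abstractions concrete, the sketch below (our own
illustration in C++, not the actual VISC data structures) shows roughly what a
dataflow node and a dataflow edge need to record: the computation a node wraps,
the target it is eventually mapped to, and the type and size of the data
carried by an edge.
\begin{verbatim}
// Illustrative sketch only; the names and fields are ours, not VISC's.
#include <cstddef>
#include <string>
#include <vector>

struct DFNode {
  std::string funcName;          // computational chunk wrapped by this node
  std::string mappedTarget;      // e.g. "PTX", "AVX", "x86" once mapped
  std::vector<DFNode*> children; // empty for leaf nodes
};

struct DFEdge {
  DFNode* src;        // producer node
  DFNode* dst;        // consumer node
  std::string type;   // element type of the data moving along the edge
  std::size_t bytes;  // size of the transfer the edge implies
};
\end{verbatim}
A backend translator can walk such a graph to select the nodes it can map and
use the edge information to generate the data-transfer code between components.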
\subsection{Compilation Flow}
Figure~\ref{fig:compilation} shows the compilation flow of a virtual ISA program.
\todo{Redo the figure to show two phases of compilation} Our current compiler
has functional backends for compiling virtual ISA components to PTX, AVX, and
host code for x86-64 processors. These backends are implemented as LLVM passes,
the details of which are in section~\ref{sec:CompilerImplementation}. The
compilation flow can be divided into three phases: (1) mapping and code
generation of distinct subgraphs to hardware accelerators, (2) generating
sequential code for the remaining unmapped parts of the graph and for data
movement across components mapped to different accelerators, and (3) {\tt
launch/wait} intrinsic code generation. These three phases are described below.
\todo{Graph traversal}
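As a structural sketch of how such a backend might be organized as an LLVM
pass (a minimal sketch assuming the legacy pass manager; the pass name and
registration string are ours, and the phase bodies are elided), consider:
\begin{verbatim}
// Minimal sketch of a device backend written as a legacy LLVM ModulePass.
// Illustrative only; this is not the actual VISC implementation.
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct GenDeviceBackend : public ModulePass {
  static char ID;
  GenDeviceBackend() : ModulePass(ID) {}

  bool runOnModule(Module &M) override {
    // Phase (1) would go here: identify subgraphs suited to this device
    // and generate native code (e.g. PTX or AVX) for them.
    // Phases (2) and (3) -- sequential/data-movement code and launch/wait
    // intrinsic code generation -- would run as later passes.
    return true; // the module is modified
  }
};
} // end anonymous namespace

char GenDeviceBackend::ID = 0;
static RegisterPass<GenDeviceBackend>
    X("gen-device-backend", "Illustrative VISC device backend");
\end{verbatim}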
\subsubsection{Mapping Subgraphs to Accelerators} Backends for different
hardware accelerators identify distinct subgraphs ideally suited to the
corresponding accelerator. For example, the subgraph containing the Laplacian
node in Figure~\ref{fig:subgraph}\todo{Insert ref to pipeline example figure}
expresses parallelism well suited for a GPU, and the GPU backend would isolate
the functions associated with the subgraph into a separate LLVM module and
generate native code for it. It would also replace this subgraph component in
the original graph with a leaf node containing PTX instructions. The final
result of this phase is a new graph in which all leaf nodes have been mapped
to hardware accelerators.
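The graph transformation itself is simple once the backend has generated code
for the subgraph: the subgraph collapses into a single, already-mapped leaf.
The following is a minimal sketch (with an illustrative node type of our own,
not the VISC IR) of that step and of the invariant this phase establishes.
\begin{verbatim}
// Illustrative sketch: collapse a device-mapped subgraph into a leaf node.
#include <string>
#include <utility>
#include <vector>

struct DFNode {
  std::string mappedTarget;      // empty until a backend maps this node
  std::string targetCode;        // e.g. the PTX emitted for the subgraph
  std::vector<DFNode*> children; // empty => leaf node
};

// Called by the GPU backend after generating PTX for the subgraph rooted
// at `root`; later phases then see a single mapped leaf node.
void collapseToLeaf(DFNode &root, std::string ptx) {
  root.children.clear();
  root.mappedTarget = "PTX";
  root.targetCode = std::move(ptx);
}

// Invariant at the end of this phase: every leaf node has been mapped.
bool allLeavesMapped(const DFNode &n) {
  if (n.children.empty()) return !n.mappedTarget.empty();
  for (const DFNode *c : n.children)
    if (!allLeavesMapped(*c)) return false;
  return true;
}
\end{verbatim}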
\subsubsection{Data Movement and Internal Nodes' Code Generation}
The input to this phase is a graph where all leaf nodes have been mapped to
call. However, when it is not supported, the runtime maintains a stack to keep
track of the instance ID, and dimension limit of the dynamic instance of the
ancestors and responds when a query arrives.
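A minimal sketch of such a runtime stack (our own illustration; the names are
not those of the actual VISC runtime) is shown below: the runtime pushes a
frame per ancestor node instance and answers queries by indexing into the
stack.
\begin{verbatim}
// Illustrative sketch of the runtime stack used to answer node-instance
// queries when the target accelerator cannot answer them natively.
#include <cstddef>
#include <cstdint>
#include <vector>

struct NodeFrame {
  std::uint64_t instanceID; // dynamic instance ID of this ancestor node
  std::uint64_t dimLimit;   // number of dynamic instances in this dimension
};

class InstanceStack {
  std::vector<NodeFrame> frames;
public:
  void push(std::uint64_t id, std::uint64_t limit) {
    frames.push_back({id, limit});
  }
  void pop() { frames.pop_back(); }

  // Answer a query about the ancestor `levelsUp` levels above the current
  // node (0 = the currently executing node).
  NodeFrame query(std::size_t levelsUp) const {
    return frames[frames.size() - 1 - levelsUp];
  }
};
\end{verbatim}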
\todo{Rephrase this para}The dataflow graph semantics of the virtual ISA assume
a globally addressable memory model. However, at present, the majority of
accelerators present in a SoC do not support this model. For example, GPUs
cannot directly address CPU memory. In such a scenario, the data has to be
explicitly transferred to the accelerator memory before computation is
initiated on the accelerator. To initiate these data transfers, static API
calls to accelerator runtimes are inserted in the generated native binary.
However, such a copy may be unnecessary because the previous node already
executed on the accelerator itself. Thus, the VISC runtime needs to keep track
of the latest copy of data arrays to avoid unnecessary copies to and from
accelerators.
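The bookkeeping this requires can be sketched as follows (an assumed structure
for illustration, not the actual VISC runtime API): the runtime remembers where
the latest copy of each data array lives and only issues a transfer when that
copy is not already on the target.
\begin{verbatim}
// Illustrative sketch: track the latest copy of each data array so that
// host<->accelerator transfers are only issued when actually needed.
#include <unordered_map>

enum class Location { Host, Accelerator };

class MemTracker {
  std::unordered_map<void *, Location> latest; // host pointer -> latest copy
public:
  // Called before launching a node on the accelerator.
  bool needsCopyToDevice(void *hostPtr) const {
    auto it = latest.find(hostPtr);
    return it == latest.end() || it->second != Location::Accelerator;
  }
  // Called after a transfer completes, or after a node writes the array.
  void noteLocation(void *hostPtr, Location loc) { latest[hostPtr] = loc; }
};

// Usage sketch:
//   if (tracker.needsCopyToDevice(buf)) { /* insert runtime copy call */ }
//   tracker.noteLocation(buf, Location::Accelerator);
\end{verbatim}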
%\label{sec:CompilerImplementation} We implement the compilation strategy as a