llvm / hpvm-release

Commit 7d010b81
authored 9 years ago by Prakalp Srivastava
minor changes to compilation
parent cd642cb7
Showing 2 changed files with 54 additions and 51 deletions:
paper/Abstract.tex     +13 −12
paper/Compilation.tex  +41 −39
paper/Abstract.tex  +13 −12
@@ -6,22 +6,23 @@ compute elements on a single chip. These computing elements use different
 parallelism models, instruction sets, and memory hierarchy, making it difficult
 to achieve performance and code portability on heterogeneous systems.
 Application programming for such systems would be greatly simplified if a single
-<<<<<<< HEAD
-object code representation could be used to generate code for different compute
-units in a heteroegenous system. Previous efforts aiming to address the source
+object code representation could be used to generate code for different compute
+units in a heterogeneous system. Previous efforts aiming to address the source
 and object code portability challenges arising in such systems, such as OpenCL,
-CUDA, SPIR, PTX and HSAIL focus heavily on GPUs, which makes them insufficent
+CUDA, SPIR, PTX and HSAIL focus heavily on GPUs, which makes them insufficient
 for today's SoCs.
-We propose VISC, a framework for programming heterogeneous systems. In this
-paper we focus on the crux of VISC, a novel virtual ISA design which adds
-dataflow graph abstractions to LLVM IR, to capture the diverse forms of
-parallelism models exposed by today's SoCs. We also present a compilation
-strategy to generate code for AVX, PTX and X86 backends from single virtual ISA
+Virtual Instruction Set Computing (VISC) is a powerful approach to better
+portability. We propose to use it to address the code and performance
+portability problem for heterogeneous mobile SoCs. In this paper we focus on
+the crux of the VISC approach and present a novel virtual ISA design which adds
+dataflow graph abstractions to LLVM IR, to capture the diverse forms of
+parallelism exposed by today's SoCs. We also present a compilation strategy to
+generate code for AVX, PTX and X86 backends from a single virtual ISA
 representation of a program. Through a set of experiments we show that code
 generated for CPUs and GPUs from a single virtual ISA representation achieves
-performance (within 1 to 1.6x) with hand-tuned code\todo{What numbers to quote
-here?}. We further demonstrate that these virtual ISA abstractions are also
-suited for capturing pipelining and streaming parallelism.
+acceptable performance, compared with hand-tuned code. We further demonstrate
+that these virtual ISA abstractions are also suited for capturing pipelining
+and streaming parallelism.
 \end{abstract}
paper/Compilation.tex  +41 −39
@@ -18,42 +18,43 @@ be compiled piecewise to different hardware compute units.
 \end{center}
 The VISC virtual ISA, at the top-level, is neatly separated into component
(rewrapped only; text unchanged by this commit:)
dataflow nodes, which in practice can be used to represent computational
chunks in an application that are big enough to be executed on different
hardware components. Communication between these nodes is captured with
dataflow edges, which encode information about the size and type of data
moving from one component to another. Device-specific backend translators
leverage this feature to identify the nodes/subgraphs that can be mapped to a
particular hardware component and to generate native code for that component.
Once the mapping of nodes to hardware components is done, the code for
transferring data between the corresponding components is generated.
Sophisticated virtual ISA compilers may in the future also allow flexible
mapping, by generating native code for multiple backends for the same subgraph
and relying on the runtime and scheduler to perform data transfers once the
mapping of the source and destination nodes of a dataflow edge is known at
runtime.
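To make the abstraction above concrete, here is a minimal C++ sketch of how a backend-facing dataflow graph could be represented. All type and field names are hypothetical illustrations, not the actual VISC implementation:

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: one dataflow node, coarse enough to be mapped
// whole to a single hardware component.
struct DFNode {
  std::string funcName;          // LLVM function implementing the node
  enum class Target { None, X86, AVX, PTX } mapped = Target::None;
  std::vector<size_t> inEdges;   // indices into DFGraph::edges
  std::vector<size_t> outEdges;
};

// A dataflow edge records the size and element type of the data moving
// between two nodes, which is what backends use to generate transfer code.
struct DFEdge {
  size_t srcNode, dstNode;       // indices into DFGraph::nodes
  std::string elemType;          // e.g. "float"
  size_t numElems;               // size of the transferred buffer
};

struct DFGraph {
  std::vector<DFNode> nodes;
  std::vector<DFEdge> edges;
};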
 \subsection{Compilation Flow}
 Figure~\ref{fig:compilation} shows the compilation flow of a virtual ISA
 program. \todo{Redo the figure to show two phases of compilation} Our current
 compiler has functional backends for compiling virtual ISA components to PTX,
 AVX, and host code for the x86-64 processor. These backends are implemented
 as LLVM passes, details of which are in
 section~\ref{sec:CompilerImplementation}. The compilation flow can be divided
-into two phases, (1) mapping and code generation of distinct subgraphs to
-hardware accelerators, and (2) generating sequential code for remaining
-unmapped parts of the graph, and for data movement across components mapped
-to different accelerators.
+into three phases: (1) mapping and code generation of distinct subgraphs to
+hardware accelerators, (2) generating sequential code for the remaining
+unmapped parts of the graph and for data movement across components mapped to
+different accelerators, and (3) {\tt launch/wait} intrinsic code generation.
+These three phases are described below.
 \todo{Graph traversal}
 \subsubsection{Mapping Subgraphs to Accelerators}
(rewrapped only; text unchanged by this commit:)
Backends for different hardware accelerators identify distinct subgraphs
ideally suited to the corresponding accelerator. For example, the subgraph
containing the Laplacian node in Figure~\ref{fig:subgraph}\todo{Insert ref to
pipeline example figure} expresses parallelism well suited for a GPU, and the
GPU backend would isolate the functions associated with the subgraph into a
separate LLVM module and generate native code for it. It would also replace
this subgraph component in the original graph with a leaf node containing PTX
instructions. The final result of this phase is a new graph where all leaf
nodes have been mapped to hardware accelerators.
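As an illustration of the graph rewrite this phase performs, here is a hedged sketch using the types from the earlier DFGraph sketch (the function is invented, not the actual backend): edges crossing the subgraph boundary are re-attached to a new, already-mapped leaf node, while edges internal to the subgraph are absorbed into the generated kernel.

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical: collapse a GPU-suited subgraph into one PTX leaf node.
size_t collapseToPTXLeaf(DFGraph &g, const std::vector<size_t> &subgraph,
                         const std::string &kernelName) {
  auto inside = [&](size_t n) {
    return std::find(subgraph.begin(), subgraph.end(), n) != subgraph.end();
  };
  DFNode leaf;
  leaf.funcName = kernelName;         // name of the generated PTX kernel
  leaf.mapped = DFNode::Target::PTX;  // already code-generated
  size_t leafId = g.nodes.size();     // index the new leaf will occupy
  for (size_t e = 0; e < g.edges.size(); ++e) {
    bool s = inside(g.edges[e].srcNode), d = inside(g.edges[e].dstNode);
    if (!s && d) { g.edges[e].dstNode = leafId; leaf.inEdges.push_back(e); }
    if (s && !d) { g.edges[e].srcNode = leafId; leaf.outEdges.push_back(e); }
    // Edges with s && d are internal: the PTX kernel implements them.
  }
  g.nodes.push_back(leaf);  // original subgraph nodes are now unreferenced
  return leafId;
}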
 \subsubsection{Data Movement and Internal Nodes' Code Generation}
 The input to this phase is a graph where all leaf nodes have been mapped to
...
@@ -91,16 +92,17 @@ call. However, when it is not supported, the runtime maintains a stack to keep
 track of the instance ID and dimension limit of the dynamic instances of the
 ancestors, and responds when a query arrives.
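One possible shape for that fallback path, as a hedged C++ sketch (names invented): the runtime pushes a record as each dynamic node instance begins and pops it on exit, so an ancestor query becomes a stack lookup.

#include <cstddef>
#include <vector>

struct InstanceRecord {
  size_t instanceID;  // which dynamic instance of the node this is
  size_t dimLimit;    // total number of instances along this dimension
};

class AncestorStack {
  std::vector<InstanceRecord> stack_;
public:
  void enterNode(size_t id, size_t limit) { stack_.push_back({id, limit}); }
  void exitNode() { stack_.pop_back(); }
  // Query the dynamic instance of the i-th enclosing ancestor
  // (0 = currently executing node); assumes i < current nesting depth.
  InstanceRecord query(size_t ancestorDepth) const {
    return stack_[stack_.size() - 1 - ancestorDepth];
  }
};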
(rewrapped only; text unchanged by this commit:)
\todo{Rephrase this para} The dataflow graph semantics of the virtual ISA
assumes a globally addressable memory model. However, in its present form, the
majority of accelerators present in a SoC do not support this model. For
example, GPUs cannot directly address CPU memory. In such a scenario, the data
has to be explicitly transferred to the accelerator memory before one
initiates computation on the accelerator. To initiate these data transfers,
static API calls to accelerator runtimes are inserted into the generated
native binary. However, it may happen that such a copy is unnecessary because
the previous node executed on the accelerator itself. Thus, the VISC runtime
needs to keep track of the latest copy of data arrays to avoid unnecessary
copies to and from accelerators.
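One plausible shape for that bookkeeping, sketched in C++ (the interface is invented for illustration, and it simplifies by tracking a single authoritative location per array): record which device holds the latest version of each array, and only emit a transfer when the consuming device is stale.

#include <unordered_map>

enum class Location { Host, GPU };

class MemoryTracker {
  std::unordered_map<const void *, Location> latest_;
public:
  // Called before a node runs on `dev` with input array `ptr`.
  // Returns true if an explicit transfer must be issued first.
  bool needsCopy(const void *ptr, Location dev) const {
    auto it = latest_.find(ptr);
    // Untracked arrays are assumed to start in host memory.
    Location cur = (it == latest_.end()) ? Location::Host : it->second;
    return cur != dev;
  }
  // Called after a node writes `ptr` while running on `dev`.
  void noteWrite(const void *ptr, Location dev) { latest_[ptr] = dev; }
};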
%\label{sec:CompilerImplementation} We implement the compilation strategy as a
...