Commit 7d010b81 authored by Prakalp Srivastava's avatar Prakalp Srivastava

minor changes to compilation

parent cd642cb7
compute elements on a single chip. These computing elements use different
parallelism models, instruction sets, and memory hierarchies, making it difficult
to achieve performance and code portability on heterogeneous systems.
Application programming for such systems would be greatly simplified if a single
object code representation could be used to generate code for different compute
units in a heterogeneous system. Previous efforts aiming to address the source
and object code portability challenges arising in such systems, such as OpenCL,
CUDA, SPIR, PTX and HSAIL, focus heavily on GPUs, which makes them insufficient
for today's SoCs.
Virtual Instruction Set Computing (VISC) is a powerful approach to better
portability. We propose to use it to address the code and performance
portability problem for heterogeneous mobile SoCs. In this paper we focus on the
crux of the VISC approach and present a novel virtual ISA design which adds
dataflow graph abstractions to LLVM IR to capture the diverse parallelism
models exposed by today's SoCs. We also present a compilation strategy to
generate code for AVX, PTX and X86 backends from a single virtual ISA
representation of a program. Through a set of experiments we show that code
generated for CPUs and GPUs from a single virtual ISA representation achieves
acceptable performance compared with hand-tuned code. We further demonstrate
that these virtual ISA abstractions are also suited for capturing pipelining and
streaming parallelism.
\end{abstract}
be compiled piecewise to different hardware compute units.
\end{center}
The VISC virtual ISA, at the top level, is neatly separated into component
dataflow nodes, which in practice can be used to represent computational
chunks of an application that are large enough to be executed on different
hardware components. Communication between these nodes is captured with
dataflow edges, which encode information about the size and type of data moving
from one component to another. Device-specific backend translators leverage
this feature to identify the nodes or subgraphs that can be mapped to a
particular hardware component and generate native code for that component.
Once nodes have been mapped to different hardware components, code for
transferring data between the corresponding hardware components is generated.
More sophisticated virtual ISA compilers could, in the future, allow flexible
mapping by generating native code for multiple backends for the same subgraph
and relying on the runtime and scheduler to perform data transfers when the
mapping of the source and destination nodes of a dataflow edge is only known
at runtime.
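To make the node and edge abstractions concrete, the sketch below (our own
illustration in C++, not the actual VISC data structures) shows roughly what a
dataflow node and a dataflow edge need to record: the computation a node wraps,
the target it is eventually mapped to, and the type and size of the data
carried by an edge.
\begin{verbatim}
// Illustrative sketch only; the names and fields are ours, not VISC's.
#include <cstddef>
#include <string>
#include <vector>

struct DFNode {
  std::string funcName;          // computational chunk wrapped by this node
  std::string mappedTarget;      // e.g. "PTX", "AVX", "x86" once mapped
  std::vector<DFNode*> children; // empty for leaf nodes
};

struct DFEdge {
  DFNode* src;        // producer node
  DFNode* dst;        // consumer node
  std::string type;   // element type of the data moving along the edge
  std::size_t bytes;  // size of the transfer the edge implies
};
\end{verbatim}
A backend translator can walk such a graph to select the nodes it can map and
use the edge information to generate the data-transfer code between components.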
\subsection{Compilation Flow}
Figure~\ref{fig:compilation} shows the compilation flow of a virtual ISA program.
\todo{Redo the figure to show two phases of compilation} Our current compiler
has functional backends for compiling virtual ISA components to PTX, AVX, and
host code for x86-64 processors. These backends are implemented as LLVM passes,
the details of which are in section~\ref{sec:CompilerImplementation}. The
compilation flow can be divided into three phases: (1) mapping and code
generation of distinct subgraphs to hardware accelerators, (2) generating
sequential code for the remaining unmapped parts of the graph and for data
movement across components mapped to different accelerators, and (3) {\tt
launch/wait} intrinsic code generation. These three phases are described below.
\todo{Graph traversal}
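As a structural sketch of how such a backend might be organized as an LLVM
pass (a minimal sketch assuming the legacy pass manager; the pass name and
registration string are ours, and the phase bodies are elided), consider:
\begin{verbatim}
// Minimal sketch of a device backend written as a legacy LLVM ModulePass.
// Illustrative only; this is not the actual VISC implementation.
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct GenDeviceBackend : public ModulePass {
  static char ID;
  GenDeviceBackend() : ModulePass(ID) {}

  bool runOnModule(Module &M) override {
    // Phase (1) would go here: identify subgraphs suited to this device
    // and generate native code (e.g. PTX or AVX) for them.
    // Phases (2) and (3) -- sequential/data-movement code and launch/wait
    // intrinsic code generation -- would run as later passes.
    return true; // the module is modified
  }
};
} // end anonymous namespace

char GenDeviceBackend::ID = 0;
static RegisterPass<GenDeviceBackend>
    X("gen-device-backend", "Illustrative VISC device backend");
\end{verbatim}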
\subsubsection{Mapping Subgraphs to Accelerators} Backends for different
hardware accelerators identify distinct subgraphs ideally suited to the
corresponding accelerator. For example, the subgraph containing the Laplacian
node in Figure~\ref{fig:subgraph}\todo{Insert ref to pipeline example figure}
expresses parallelism well suited for a GPU, and the GPU backend would isolate
the functions associated with the subgraph into a separate LLVM module and
generate native code for it. It would also replace this subgraph component in
the original graph with a leaf node containing PTX instructions. The final
result of this phase is a new graph in which all leaf nodes have been mapped
to hardware accelerators.
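The graph transformation itself is simple once the backend has generated code
for the subgraph: the subgraph collapses into a single, already-mapped leaf.
The following is a minimal sketch (with an illustrative node type of our own,
not the VISC IR) of that step and of the invariant this phase establishes.
\begin{verbatim}
// Illustrative sketch: collapse a device-mapped subgraph into a leaf node.
#include <string>
#include <utility>
#include <vector>

struct DFNode {
  std::string mappedTarget;      // empty until a backend maps this node
  std::string targetCode;        // e.g. the PTX emitted for the subgraph
  std::vector<DFNode*> children; // empty => leaf node
};

// Called by the GPU backend after generating PTX for the subgraph rooted
// at `root`; later phases then see a single mapped leaf node.
void collapseToLeaf(DFNode &root, std::string ptx) {
  root.children.clear();
  root.mappedTarget = "PTX";
  root.targetCode = std::move(ptx);
}

// Invariant at the end of this phase: every leaf node has been mapped.
bool allLeavesMapped(const DFNode &n) {
  if (n.children.empty()) return !n.mappedTarget.empty();
  for (const DFNode *c : n.children)
    if (!allLeavesMapped(*c)) return false;
  return true;
}
\end{verbatim}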
\subsubsection{Data Movement and Internal Nodes' Code Generation}
The input to this phase is a graph where all leaf nodes have been mapped to
call. However, when it is not supported, the runtime maintains a stack to keep
track of the instance ID, and dimension limit of the dynamic instance of the
ancestors and responds when a query arrives.
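A minimal sketch of such a runtime stack (our own illustration; the names are
not those of the actual VISC runtime) is shown below: the runtime pushes a
frame per ancestor node instance and answers queries by indexing into the
stack.
\begin{verbatim}
// Illustrative sketch of the runtime stack used to answer node-instance
// queries when the target accelerator cannot answer them natively.
#include <cstddef>
#include <cstdint>
#include <vector>

struct NodeFrame {
  std::uint64_t instanceID; // dynamic instance ID of this ancestor node
  std::uint64_t dimLimit;   // number of dynamic instances in this dimension
};

class InstanceStack {
  std::vector<NodeFrame> frames;
public:
  void push(std::uint64_t id, std::uint64_t limit) {
    frames.push_back({id, limit});
  }
  void pop() { frames.pop_back(); }

  // Answer a query about the ancestor `levelsUp` levels above the current
  // node (0 = the currently executing node).
  NodeFrame query(std::size_t levelsUp) const {
    return frames[frames.size() - 1 - levelsUp];
  }
};
\end{verbatim}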
\todo{Rephrase this para}The dataflow graph semantics of the virtual ISA assume
a globally addressable memory model. However, at present, the majority of
accelerators present in a SoC do not support this model. For example, GPUs
cannot directly address CPU memory. In such a scenario, the data has to be
explicitly transferred to the accelerator memory before computation is
initiated on the accelerator. To initiate these data transfers, static API
calls to accelerator runtimes are inserted in the generated native binary.
However, such a copy may be unnecessary because the previous node already
executed on the accelerator itself. Thus, the VISC runtime needs to keep track
of the latest copy of data arrays to avoid unnecessary copies to and from
accelerators.
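The bookkeeping this requires can be sketched as follows (an assumed structure
for illustration, not the actual VISC runtime API): the runtime remembers where
the latest copy of each data array lives and only issues a transfer when that
copy is not already on the target.
\begin{verbatim}
// Illustrative sketch: track the latest copy of each data array so that
// host<->accelerator transfers are only issued when actually needed.
#include <unordered_map>

enum class Location { Host, Accelerator };

class MemTracker {
  std::unordered_map<void *, Location> latest; // host pointer -> latest copy
public:
  // Called before launching a node on the accelerator.
  bool needsCopyToDevice(void *hostPtr) const {
    auto it = latest.find(hostPtr);
    return it == latest.end() || it->second != Location::Accelerator;
  }
  // Called after a transfer completes, or after a node writes the array.
  void noteLocation(void *hostPtr, Location loc) { latest[hostPtr] = loc; }
};

// Usage sketch:
//   if (tracker.needsCopyToDevice(buf)) { /* insert runtime copy call */ }
//   tracker.noteLocation(buf, Location::Accelerator);
\end{verbatim}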
%\label{sec:CompilerImplementation} We implement the compilation strategy as a