Commit a50eb59b authored by Vikram Adve

Restructure.

parent d17d3e3d
@@ -13,6 +13,48 @@ parallelism expressed using these languages, and thus achieve reasonable
performance when compiled to target architectures for these source-level
languages.
\begin{figure*}[hbt]
\begin{minipage}{0.48\textwidth}
\begin{center}
\includegraphics[height=4cm]{Figures/gpusmall.png}
\caption{\footnotesize{GPU Experiments - Small Test Normalized Execution
Time}}
\label{fig:gpusmall}
\end{center}
\end{minipage}~~~~\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/gpularge.png}
\caption{\footnotesize{GPU Experiments - Large Test Normalized Execution
Time}}
\label{fig:gpularge}
\end{center}
\end{minipage}
\end{figure*}
\begin{figure*}[hbt]
\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/cpusmall.png}
\caption{\footnotesize{Vector Experiments - Small Test Normalized Execution
Time}}
\label{fig:cpusmall}
\end{center}
\end{minipage}~~~~\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/cpularge.png}
\caption{\footnotesize{Vector Experiments - Large Test Normalized Execution
Time}}
\label{fig:cpularge}
\end{center}
\end{minipage}
\end{figure*}
%------------------------------------------------------------------------------
\subsection{Experimental Setup and Benchmarks}
\label{sec:evaluation:setup}
@@ -60,7 +102,7 @@ but compiled using the Intel OpenCL compiler, as we found
that these versions achieved the best performance compared to the other
available OpenCL versions on vector hardware as well.
The \NAME{} binaries were also generated using the same versions of OpenCL.
We use two input
sizes for each benchmark, labeled `Small' and `Large' below.
@@ -76,26 +118,37 @@ we repeated the experiments multiple times to verify their stability.
Figures~\ref{fig:gpusmall} and~\ref{fig:gpularge} show the normalized execution
time of these applications against the GPU baseline for each of the two sizes.
Similarly, Figures~\ref{fig:cpusmall} and~\ref{fig:cpularge} compare the
performance of \NAME{} programs with the vector baseline. The execution times are
broken down into segments corresponding to the time spent in the compute kernel
of the application (kernel), copying data (copy), and the remaining time spent
on the host side. The total execution time for the baseline is shown on the
corresponding bar to give an indication of the absolute numbers.
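For reference, writing $T_{\mathrm{kernel}}$, $T_{\mathrm{copy}}$, and
$T_{\mathrm{host}}$ for the three components (this notation is ours and does
not appear in the figures), each bar reports
\begin{equation*}
T_{\mathrm{norm}} \;=\; \frac{T_{\mathrm{kernel}} + T_{\mathrm{copy}} + T_{\mathrm{host}}}{T^{\mathrm{baseline}}_{\mathrm{total}}},
\end{equation*}
so the baseline bar is $1$ by construction.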
\begin{figure*}[hbt]
\centering
\vspace*{-2\baselineskip}
\includegraphics[height=6cm]{Figures/pipeline.png}
\caption{Edge Detection in grayscale images in \NAME{}}
\label{fig:pipeline}
\vspace*{-1.5\baselineskip}
\end{figure*}
Compared with the GPU baseline, the performance of \NAME{} code is within about
25\% of the baseline in most cases, and within a factor of
$1.8$ in the worst case.
We see that the \NAME{}
application spends more time in kernel execution relative to the GPU
baseline. However, inspecting the PTX files generated by the NVIDIA
OpenCL compiler for the OpenCL applications and by the \NAME{} compiler for the
\NAME{} applications shows that they are almost identical, the only difference
being a small number of reordered instructions. We also notice increased data
copy times, sometimes by a significant factor, despite the fact that the data
copied in both applications is similar and that the \NAME{} runtime makes use
of a memory tracking mechanism to avoid unnecessary data copies. We are working
on getting a clearer picture of the overheads that the \NAME{} representation
or compilation may be imposing on program execution.
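To illustrate the idea behind such a tracking mechanism, the sketch below shows
how a runtime can record buffer residency and skip transfers that are already up
to date. This is a minimal illustration of our own: the names
(\texttt{MemTracker}, \texttt{ensure\_on\_device}, and so on) are hypothetical
and do not correspond to the actual \NAME{} runtime API.
\begin{verbatim}
#include <cstddef>
#include <unordered_map>

// Minimal sketch of a runtime-side memory tracker that skips
// host-to-device copies when the device copy is already current.
// All names here are hypothetical, not the actual runtime API.
enum class Residency { HostOnly, DeviceUpToDate };

class MemTracker {
public:
  // Called before launching a kernel that reads 'ptr'.
  void ensure_on_device(void* ptr, std::size_t bytes) {
    auto it = state_.find(ptr);
    if (it == state_.end() || it->second != Residency::DeviceUpToDate) {
      copy_to_device(ptr, bytes);              // real transfer happens here
      state_[ptr] = Residency::DeviceUpToDate;
    }                                          // otherwise: copy avoided
  }
  // Called whenever the host writes 'ptr', staling the device copy.
  void note_host_write(void* ptr) { state_[ptr] = Residency::HostOnly; }

private:
  void copy_to_device(void*, std::size_t) { /* e.g. clEnqueueWriteBuffer */ }
  std::unordered_map<void*, Residency> state_;
};
\end{verbatim}
A real runtime would additionally track device-to-host state and multiple
devices; the point here is only the copy-avoidance check.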
In the vector case, we see that the performance of \NAME{} is within about
@@ -106,9 +159,9 @@ due to the fact that the total running times are generally larger, which
minimizes the effect of constant overheads on the total execution time.
Finally, we note that none of our benchmarks made use of vector code at the leaf
dataflow nodes. This choice was made after comparing the performance of two \NAME{}
versions: (a) the \NAME{} object code as generated by the modified Clang
frontend, and (b) the \NAME{} code after altering the number of dynamic instances
of the leaf nodes, as well as their code, so that each instance performs more
computation and vectorization can be achieved. This transformation may have
improved performance in some cases for one of the two targets, but it never
@@ -130,48 +183,6 @@ performance gains for more complicated kernels where automatic vectorization
will not be effective.
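To make the leaf-node transformation described above concrete, the following
sketch (a C++ illustration of our own; the function names and the width
\texttt{W} are hypothetical, and this is not actual \NAME{} object code) shows
how reducing the number of dynamic instances while widening the work per
instance exposes a vectorizable inner loop:
\begin{verbatim}
// Before: one dynamic leaf-node instance per output element;
// each instance does too little work for vector code to pay off.
void leaf(const float* a, const float* b, float* c, int i) {
  c[i] = a[i] * b[i];
}

// After: 1/W as many dynamic instances, each covering W consecutive
// elements. The inner loop is a candidate for vectorization.
// W and the function names are illustrative choices only.
const int W = 8;
void leaf_widened(const float* a, const float* b, float* c, int i) {
  for (int k = 0; k < W; ++k)
    c[i * W + k] = a[i * W + k] * b[i * W + k];
}
\end{verbatim}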
%------------------------------------------------------------------------------
\subsection{Expressing parallelism beyond GPUs}
\label{sec:evaluation:streaming}
@@ -180,28 +191,17 @@
\NAME~aims to be extensible beyond the devices most commonly found
in today's accelerators and to represent the parallelism models of a broad class
of available hardware. Apart from data parallelism, many accelerators expose a
streaming parallelism model and would benefit greatly from a representation that
can capture this feature. The dataflow graph in \NAME{}
provides a natural way of representing the
communication between producers and consumers, as well as describing the
repeated transfer of multiple data items via streaming edges. This section uses
an image processing pipeline to demonstrate the benefits of expressing a
streaming application in \NAME.
Figure~\ref{fig:pipeline} presents an application for Edge Detection in
grayscale images in \NAME. At a high level, this application is a dataflow node
that accepts a grayscale image $I$ and a binary structuring element $B$ and
computes a binary image $E$ that represents the edges of $I$. The application
begins by computing an estimate of the Laplacian $L$ of $I$, as depicted in
Figure~\ref{fig:pipeline}, and proceeds by computing its zero crossings,
@@ -212,7 +212,7 @@ ZeroCrossings to perform a thresholding operation that will allow it to reject
small variations in the brightness of the image and only detect more significant
variations that actually constitute edges.
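Summarizing the above in our own notation ($\mathrm{ZC}$, $\Delta B$, and
$\theta$ are symbols we introduce here; they do not appear in the application
code), the computed edge map is approximately
\begin{equation*}
E(x,y) \;=\; \mathrm{ZC}(L)(x,y) \,\wedge\, \big(\Delta B(x,y) > \theta\big),
\end{equation*}
where $\mathrm{ZC}(L)$ marks the zero crossings of the Laplacian estimate $L$,
$\Delta B(x,y)$ is the local brightness variation, and $\theta$ is the
threshold used to reject insignificant variations.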
We implemented this pipeline using the OpenCV~\cite{OpenCV} computer vision library.
We used the C++ thread library to create a thread for each top-level node in this
example, and implemented fixed-size
circular buffers for each streaming edge between these nodes to pass data
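Each such edge can be realized as a bounded buffer on which the producing
thread blocks when the buffer is full and the consuming thread blocks when it
is empty. The following is a minimal sketch of our own (class and member names
are hypothetical, and the actual implementation may differ):
\begin{verbatim}
#include <array>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Sketch of one streaming edge: a fixed-size circular buffer
// connecting a producer thread to a consumer thread. Names and
// structure are our own illustration, not the paper's exact code.
template <typename T, std::size_t N>
class StreamEdge {
public:
  void push(const T& item) {                // producer blocks when full
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [this] { return count_ < N; });
    buf_[tail_] = item;
    tail_ = (tail_ + 1) % N;
    ++count_;
    not_empty_.notify_one();
  }
  T pop() {                                 // consumer blocks when empty
    std::unique_lock<std::mutex> lk(m_);
    not_empty_.wait(lk, [this] { return count_ > 0; });
    T item = buf_[head_];
    head_ = (head_ + 1) % N;
    --count_;
    not_full_.notify_one();
    return item;
  }
private:
  std::array<T, N> buf_;
  std::size_t head_ = 0, tail_ = 0, count_ = 0;
  std::mutex m_;
  std::condition_variable not_full_, not_empty_;
};
\end{verbatim}
The fixed capacity \texttt{N} bounds the memory used per edge and provides
backpressure between pipeline stages.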