Commit a50eb59b authored by Vikram Adve

Restructure.

parent d17d3e3d
@@ -13,6 +13,48 @@ parallelism expressed using these languages, and thus achieve reasonable
performance when compiled to target architectures for these source-level
languages.
\begin{figure*}[hbt]
\begin{minipage}{0.48\textwidth}
\begin{center}
\includegraphics[height=4cm]{Figures/gpusmall.png}
\caption{\footnotesize{GPU Experiments - Small Test Normalized Execution
Time}}
\label{fig:gpusmall}
\end{center}
\end{minipage}~~~~\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/gpularge.png}
\caption{\footnotesize{GPU Experiments - Large Test Normalized Execution
Time}}
\label{fig:gpularge}
\end{center}
\end{minipage}
\end{figure*}
\begin{figure*}[hbt]
\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/cpusmall.png}
\caption{\footnotesize{Vector Experiments - Small Test Normalized Execution
Time}}
\label{fig:cpusmall}
\end{center}
\end{minipage}~~~~\begin{minipage}{0.48\textwidth}
\begin{center}
\centering
%\hspace*{4ex}
\includegraphics[height=4cm]{Figures/cpularge.png}
\caption{\footnotesize{Vector Experiments - Large Test Normalized Execution
Time}}
\label{fig:cpularge}
\end{center}
\end{minipage}
\end{figure*}
%------------------------------------------------------------------------------
\subsection{Experimental Setup and Benchmarks}
\label{sec:evaluation:setup}
@@ -60,7 +102,7 @@ but compiled using the Intel OpenCL compiler, as we found
that these versions achieved the best performance compared to the other
available OpenCL versions on vector hardware as well.
The \NAME{} binaries were also generated using the same versions of OpenCL.
We use two input
sizes for each benchmark, labeled `Small' and `Large' below.
@@ -76,26 +118,37 @@ we repeated the experiments multiple times to verify their stability.
Figures~\ref{fig:gpusmall} and~\ref{fig:gpularge} show the normalized execution
time of these applications against the GPU baseline for each of the two sizes.
Similarly, Figures~\ref{fig:cpusmall} and~\ref{fig:cpularge} compare the
performance of \NAME{} programs with the vector baseline. The execution times are
broken down into segments corresponding to the time spent in the compute kernel
of the application (kernel), copying data (copy), and the remaining time spent
on the host side. The total execution time for the baseline is shown on the
corresponding bar to give an indication of the absolute numbers.
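For reference, writing $T_{\mathrm{kernel}}$, $T_{\mathrm{copy}}$, and
$T_{\mathrm{host}}$ for the three components (this notation is ours and does
not appear in the figures), each bar reports
\begin{equation*}
T_{\mathrm{norm}} \;=\; \frac{T_{\mathrm{kernel}} + T_{\mathrm{copy}} + T_{\mathrm{host}}}{T^{\mathrm{baseline}}_{\mathrm{total}}},
\end{equation*}
so the baseline bar is $1$ by construction.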
\begin{figure*}[hbt]
\centering
\vspace*{-2\baselineskip}
\includegraphics[height=6cm]{Figures/pipeline.png}
\caption{Edge Detection in grayscale images in \NAME{}}
\label{fig:pipeline}
\vspace*{-1.5\baselineskip}
\end{figure*}
Compared with the GPU baseline, the performance of \NAME{} code is within about
25\% of the baseline in most cases, and within a factor of
$1.8$ in the worst case.
We see that the \NAME{}
application spends more time in kernel execution relative to the GPU
baseline. However, inspecting the PTX files generated by the NVIDIA
OpenCL compiler for the OpenCL applications and by the \NAME{} compiler for the
\NAME{} applications shows that they are almost identical, the only difference
being a small number of reordered instructions. We also notice increased data
copy times, sometimes by a significant factor, despite the fact that the data
copied in both applications is similar and that the \NAME{} runtime makes use
of a memory tracking mechanism to avoid unnecessary data copies. We are working
on getting a clearer picture of the overheads that the \NAME{} representation
or compilation may be imposing on program execution.
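To illustrate the idea behind such a tracking mechanism, the sketch below shows
how a runtime can record buffer residency and skip transfers that are already up
to date. This is a minimal illustration of our own: the names
(\texttt{MemTracker}, \texttt{ensure\_on\_device}, and so on) are hypothetical
and do not correspond to the actual \NAME{} runtime API.
\begin{verbatim}
#include <cstddef>
#include <unordered_map>

// Minimal sketch of a runtime-side memory tracker that skips
// host-to-device copies when the device copy is already current.
// All names here are hypothetical, not the actual runtime API.
enum class Residency { HostOnly, DeviceUpToDate };

class MemTracker {
public:
  // Called before launching a kernel that reads 'ptr'.
  void ensure_on_device(void* ptr, std::size_t bytes) {
    auto it = state_.find(ptr);
    if (it == state_.end() || it->second != Residency::DeviceUpToDate) {
      copy_to_device(ptr, bytes);              // real transfer happens here
      state_[ptr] = Residency::DeviceUpToDate;
    }                                          // otherwise: copy avoided
  }
  // Called whenever the host writes 'ptr', staling the device copy.
  void note_host_write(void* ptr) { state_[ptr] = Residency::HostOnly; }

private:
  void copy_to_device(void*, std::size_t) { /* e.g. clEnqueueWriteBuffer */ }
  std::unordered_map<void*, Residency> state_;
};
\end{verbatim}
A real runtime would additionally track device-to-host state and multiple
devices; the point here is only the copy-avoidance check.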
In the vector case, we see that the performance of \NAME{} is within about
@@ -106,9 +159,9 @@ due to the fact that the total running times are generally larger, which
minimizes the effect of constant overheads on the total execution time.
Finally, we note that none of our benchmarks made use of vector code at the leaf
dataflow nodes. This choice was made after comparing the performance of two \NAME{}
versions: (a) the \NAME{} object code as generated by the modified Clang
frontend, and (b) the \NAME{} code after altering the number of dynamic instances
of the leaf nodes, as well as their code, so that each instance performs more
computation and vectorization can be achieved. This transformation may have
improved performance in some cases for one of the two targets, but it never
@@ -130,48 +183,6 @@ performance gains for more complicated kernels where automatic vectorization
will not be effective.
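To make the leaf-node transformation described above concrete, the following
sketch (a C++ illustration of our own; the function names and the width
\texttt{W} are hypothetical, and this is not actual \NAME{} object code) shows
how reducing the number of dynamic instances while widening the work per
instance exposes a vectorizable inner loop:
\begin{verbatim}
// Before: one dynamic leaf-node instance per output element;
// each instance does too little work for vector code to pay off.
void leaf(const float* a, const float* b, float* c, int i) {
  c[i] = a[i] * b[i];
}

// After: 1/W as many dynamic instances, each covering W consecutive
// elements. The inner loop is a candidate for vectorization.
// W and the function names are illustrative choices only.
const int W = 8;
void leaf_widened(const float* a, const float* b, float* c, int i) {
  for (int k = 0; k < W; ++k)
    c[i * W + k] = a[i * W + k] * b[i * W + k];
}
\end{verbatim}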
%------------------------------------------------------------------------------
\subsection{Expressing parallelism beyond GPUs}
\label{sec:evaluation:streaming}
@@ -180,28 +191,17 @@
\NAME~aims to be extensible beyond the devices most commonly found
in today's accelerators and to represent the parallelism models of a broad class
of available hardware. Apart from data parallelism, many accelerators expose a
streaming parallelism model and would benefit greatly from a representation that
can capture this feature. The dataflow graph in \NAME{}
provides a natural way of representing the
communication between producers and consumers, as well as describing the
repeated transfer of multiple data items via streaming edges. This section uses
an image processing pipeline to demonstrate the benefits of expressing a
streaming application in \NAME.
Figure~\ref{fig:pipeline} presents an application for Edge Detection in
grayscale images in \NAME. At a high level, this application is a dataflow node
that accepts a grayscale image $I$ and a binary structuring element $B$ and
computes a binary image $E$ that represents the edges of $I$. The application
begins by computing an estimate of the Laplacian $L$ of $I$, as depicted in
Figure~\ref{fig:pipeline}, and proceeds by computing its zero crossings,
@@ -212,7 +212,7 @@ ZeroCrossings to perform a thresholding operation that will allow it to reject
small variations in the brightness of the image and only detect more significant
variations that actually constitute edges.
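Summarizing the above in our own notation ($\mathrm{ZC}$, $\Delta B$, and
$\theta$ are symbols we introduce here; they do not appear in the application
code), the computed edge map is approximately
\begin{equation*}
E(x,y) \;=\; \mathrm{ZC}(L)(x,y) \,\wedge\, \big(\Delta B(x,y) > \theta\big),
\end{equation*}
where $\mathrm{ZC}(L)$ marks the zero crossings of the Laplacian estimate $L$,
$\Delta B(x,y)$ is the local brightness variation, and $\theta$ is the
threshold used to reject insignificant variations.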
We implemented this pipeline using the OpenCV~\cite{OpenCV} computer vision library.
We used the C++ thread library to create a thread for each top-level node in this
example, and implemented fixed-size
circular buffers for each streaming edge between these nodes to pass data
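Each such edge can be realized as a bounded buffer on which the producing
thread blocks when the buffer is full and the consuming thread blocks when it
is empty. The following is a minimal sketch of our own (class and member names
are hypothetical, and the actual implementation may differ):
\begin{verbatim}
#include <array>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Sketch of one streaming edge: a fixed-size circular buffer
// connecting a producer thread to a consumer thread. Names and
// structure are our own illustration, not the paper's exact code.
template <typename T, std::size_t N>
class StreamEdge {
public:
  void push(const T& item) {                // producer blocks when full
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [this] { return count_ < N; });
    buf_[tail_] = item;
    tail_ = (tail_ + 1) % N;
    ++count_;
    not_empty_.notify_one();
  }
  T pop() {                                 // consumer blocks when empty
    std::unique_lock<std::mutex> lk(m_);
    not_empty_.wait(lk, [this] { return count_ > 0; });
    T item = buf_[head_];
    head_ = (head_ + 1) % N;
    --count_;
    not_full_.notify_one();
    return item;
  }
private:
  std::array<T, N> buf_;
  std::size_t head_ = 0, tail_ = 0, count_ = 0;
  std::mutex m_;
  std::condition_variable not_full_, not_empty_;
};
\end{verbatim}
The fixed capacity \texttt{N} bounds the memory used per edge and provides
backpressure between pipeline stages.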