llvm / hpvm-release / Commits / a50eb59b

Commit a50eb59b, authored 9 years ago by Vikram Adve
Restructure.

parent d17d3e3d
Showing 1 changed file: paper/Evaluation.tex, with 68 additions and 68 deletions
@@ -13,6 +13,48 @@ parallelism expressed using these languages, and thus achieve reasonable
 performance when compiled to target architectures for these source-level
 languages.
+\begin{figure*}[hbt]
+\begin{minipage}{0.48\textwidth}
+\begin{center}
+\includegraphics[height=4cm]{Figures/gpusmall.png}
+\caption{\footnotesize{GPU Experiments - Small Test Normalized Execution Time}}
+\label{fig:gpusmall}
+\end{center}
+\end{minipage}
+~~~~
+\begin{minipage}{0.48\textwidth}
+\begin{center}
+\centering
+%\hspace*{4ex}
+\includegraphics[height=4cm]{Figures/gpularge.png}
+\caption{\footnotesize{GPU Experiments - Large Test Normalized Execution Time}}
+\label{fig:gpularge}
+\end{center}
+\end{minipage}
+\end{figure*}
+\begin{figure*}[hbt]
+\begin{minipage}{0.48\textwidth}
+\begin{center}
+\centering
+%\hspace*{4ex}
+\includegraphics[height=4cm]{Figures/cpusmall.png}
+\caption{\footnotesize{Vector Experiments - Small Test Normalized Execution Time}}
+\label{fig:cpusmall}
+\end{center}
+\end{minipage}
+~~~~
+\begin{minipage}{0.48\textwidth}
+\begin{center}
+\centering
+%\hspace*{4ex}
+\includegraphics[height=4cm]{Figures/cpularge.png}
+\caption{\footnotesize{Vector Experiments - Large Test Normalized Execution Time}}
+\label{fig:cpularge}
+\end{center}
+\end{minipage}
+\end{figure*}
 %------------------------------------------------------------------------------
 \subsection{Experimental Setup and Benchmarks}
 \label{sec:evaluation:setup}
@@ -60,7 +102,7 @@ but compiled using the Intel OpenCL compiler, as we found
 that these versions achieved the best performance compared to the other
 available OpenCL versions on vector hardware as well.
-The VISC binaries were also generated using the same versions of OpenCL.
+The \NAME{} binaries were also generated using the same versions of OpenCL.
 We use two input
 sizes for each benchmark, labeled 'Small' and 'Large' below.
@@ -76,26 +118,37 @@ we repeated the experiments multiple times to verify their stability.
 Figures~\ref{fig:gpusmall} and~\ref{fig:gpularge} show the normalized execution
 time of these applications against GPU baseline for each of the two sizes.
 Similarly, figures~\ref{fig:cpusmall} and~\ref{fig:cpularge} compare the
-performance of VISC programs with the vector baseline. The execution times are
+performance of \NAME{} programs with the vector baseline. The execution times are
 broken down to segments corresponding to time spent in the compute kernel of the
 application (kernel), copying data (copy) and remaining time spent on the host
 side. The total execution time for the baseline is depicted on the
 corresponding bar to give an indication of the actual numbers.
-Comparing VISC code with the GPU baseline, the performance is within about
+\begin{center}
+\begin{figure*}[hbt]
+\centering
+\vspace*{-2\baselineskip}
+\includegraphics[height=6cm]{Figures/pipeline.png}
+\caption{Edge Detection in gray scale images in \NAME{}}
+\label{fig:pipeline}
+\end{figure*}
+\vspace*{-1.5\baselineskip}
+\end{center}
+Comparing \NAME{} code with the GPU baseline, the performance is within about
 25\% of the baseline in most cases and within a factor of $1.8$ in the worst case.
-We see that the VISC
+We see that the \NAME{}
 application spends more time in the kernel execution relative to the GPU
 baseline. However, inspection of the generated PTX files generated by nVidia
-OpenCL compiler for OpenCL applications and VISC compiler for VISC applications
+OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
 has shown that they are almost identical, with the only difference being a minor
 number of instructions being reordered. Also, we notice increased, sometimes to
 a significant factor, data copy times, despite the fact the data copied in both
-applications are similar and that the VISC runtime makes use of a memory
+applications are similar and that the \NAME{} runtime makes use of a memory
 tracking mechanism to avoid unnecessary data copies. We are working on getting a
-clear picture of the overheads that the VISC representation or compilation may
+clear picture of the overheads that the \NAME{} representation or compilation may
 be imposing on the program execution.
 In the vector case, we see that the performance of \NAME{} is within about
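
Editor's note: the kernel/copy/host breakdown discussed in this hunk is not accompanied by any measurement code in the diff. One standard way to obtain the kernel and copy segments for OpenCL binaries is event profiling; the sketch below assumes a command queue created with CL_QUEUE_PROFILING_ENABLE, and the queue, kernel, and buffer names in the usage comment are hypothetical, not taken from the paper.

#include <CL/cl.h>

// Elapsed device time of one enqueued command, in nanoseconds.
cl_ulong elapsed_ns(cl_event ev) {
  cl_ulong start = 0, end = 0;
  clWaitForEvents(1, &ev);
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                          sizeof(start), &start, NULL);
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                          sizeof(end), &end, NULL);
  return end - start;
}

// Usage sketch (hypothetical names):
//   cl_event copy_ev, kernel_ev;
//   clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr,
//                        0, NULL, &copy_ev);
//   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
//                          0, NULL, &kernel_ev);
//   copy_ns   += elapsed_ns(copy_ev);
//   kernel_ns += elapsed_ns(kernel_ev);
//   // "host" time is total wall-clock time minus (copy + kernel).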
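
Editor's note: the memory-tracking mechanism credited above with avoiding unnecessary data copies is named but not shown. Below is a minimal sketch of the general technique, per-buffer host/device validity flags so that a transfer is issued only when the destination copy is stale; all types and names are ours, not the runtime's.

#include <cstddef>
#include <unordered_map>

struct BufferState {
  void* device_ptr = nullptr;
  bool  host_valid = true;     // host copy is up to date
  bool  device_valid = false;  // device copy is up to date
};

class MemoryTracker {
  std::unordered_map<void*, BufferState> table_;  // keyed by host pointer
public:
  // Before a kernel launch: transfer host->device only if stale.
  void ensure_on_device(void* host_ptr, size_t bytes) {
    BufferState& s = table_[host_ptr];
    if (!s.device_valid) {
      // e.g. clEnqueueWriteBuffer(queue, ..., bytes, host_ptr, ...);
      s.device_valid = true;
    }
  }
  // After a kernel writes the buffer: the host copy becomes stale.
  void mark_device_dirty(void* host_ptr) {
    table_[host_ptr].host_valid = false;
  }
  // Before the host reads: transfer device->host only if stale.
  void ensure_on_host(void* host_ptr, size_t bytes) {
    BufferState& s = table_[host_ptr];
    if (!s.host_valid) {
      // e.g. clEnqueueReadBuffer(queue, ..., bytes, host_ptr, ...);
      s.host_valid = true;
    }
  }
};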
@@ -106,9 +159,9 @@ due to the fact that the total running times are generally larger, which
 minimizes the effect of constant overheads to the total execution time.
 Finally, we note that none of our benchmarks made use of vector code at the leaf
-dataflow nodes. This choice was made after comparing the performance of two VISC
-versions: (a) the VISC object code as generated from the modified Clang
-frontend, and (b) the VISC code after altering the number of dynamic instances
+dataflow nodes. This choice was made after comparing the performance of two \NAME{}
+versions: (a) the \NAME{} object code as generated from the modified Clang
+frontend, and (b) the \NAME{} code after altering the number of dynamic instances
 of the leaf nodes as well as their code, in order to perform a bigger amount of
 computation so that vectorization can be achieved. This transformation may have
 improved the performance in some cases for one of the two targets, but it never
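
Editor's note: variant (b) in this hunk is only described. As an illustration of the kind of rewrite it implies (fewer dynamic instances, each doing a vectorizable block of work), here is a hypothetical element-wise leaf node; the function names and the squaring computation are ours, not the paper's.

// (a) one output element per dynamic leaf-node instance
void leaf_node(const float* in, float* out, int id) {
  out[id] = in[id] * in[id];
}

// (b) coarsened: 1/W as many instances, each covering W contiguous
// elements, so the body is a loop an auto-vectorizer can handle.
enum { W = 8 };
void leaf_node_coarsened(const float* in, float* out, int id) {
  for (int i = id * W; i < (id + 1) * W; ++i)
    out[i] = in[i] * in[i];
}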
@@ -130,48 +183,6 @@ performance gains for more complicated kernels where automatic vectorization
 will not be effective.
-\begin{figure*}[hbt]
-\begin{minipage}{0.48\textwidth}
-\begin{center}
-\includegraphics[height=4cm]{Figures/gpusmall.png}
-\caption{\footnotesize{GPU Experiments - Small Test Normalized Execution Time}}
-\label{fig:gpusmall}
-\end{center}
-\end{minipage}
-~~~~
-\begin{minipage}{0.48\textwidth}
-\begin{center}
-\centering
-%\hspace*{4ex}
-\includegraphics[height=4cm]{Figures/gpularge.png}
-\caption{\footnotesize{GPU Experiments - Large Test Normalized Execution Time}}
-\label{fig:gpularge}
-\end{center}
-\end{minipage}
-\end{figure*}
-\begin{figure*}[hbt]
-\begin{minipage}{0.48\textwidth}
-\begin{center}
-\centering
-%\hspace*{4ex}
-\includegraphics[height=4cm]{Figures/cpusmall.png}
-\caption{\footnotesize{Vector Experiments - Small Test Normalized Execution Time}}
-\label{fig:cpusmall}
-\end{center}
-\end{minipage}
-~~~~
-\begin{minipage}{0.48\textwidth}
-\begin{center}
-\centering
-%\hspace*{4ex}
-\includegraphics[height=4cm]{Figures/cpularge.png}
-\caption{\footnotesize{Vector Experiments - Large Test Normalized Execution Time}}
-\label{fig:cpularge}
-\end{center}
-\end{minipage}
-\end{figure*}
 %------------------------------------------------------------------------------
 \subsection{Expressing parallelism beyond GPUs}
 \label{sec:evaluation:streaming}
@@ -180,28 +191,17 @@ will not be effective.
 \NAME~is aimed to be extensible beyond the devices that are most commonly found
 in today's accelerators and represent parallelism models in a broad class of
 available hardware. Apart from data parallelism, many accelerators expose a
-streaming paallelism model and would benefit greatly by a representation that
-can capture this feature. \NAME~presents the unique advantages of representing a
-program as a dataflow graph, which is a natural way of representing the
+streaming parallelism model and would benefit greatly by a representation that
+can capture this feature. The dataflow graph in \NAME{} provides a natural way of representing the
 communication between producers and consumers, as well as describing the
 repeated transfer of multiple data items via streaming edges. This section uses
 an image processing pipeline to demonstrate the benefits of expressing a
 streaming application in \NAME.
-\begin{center}
-\begin{figure*}[hbt]
-\centering
-%\hspace*{4ex}
-\includegraphics[height=6cm]{Figures/pipeline.png}
-\caption{Edge Detection in gray scale images in \NAME{}}
-\label{fig:pipeline}
-\end{figure*}
-\vspace*{-1.5\baselineskip}
-\end{center}
 Figure~\ref{fig:pipeline} presents an application for Edge Detection in
 gray scale images in \NAME. At a high level, this application is a dataflow node
-that acceps a greyscale image $I$ and a binary structuring element $B$ and
+that accepts a greyscale image $I$ and a binary structuring element $B$ and
 computes a binary image $E$ that represents the edges of $I$. The application
 begins by computing an estimate of the Laplacian $L$ of $I$, as depicted in
 figure~\ref{fig:pipeline}, and proceeds by computing its zerocrossings,
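
Editor's note: the pipeline stages named in this hunk (a Laplacian estimate of $I$ using $B$, zero crossings, then thresholding) are not spelled out in the diff. Since $B$ is a binary structuring element, one conventional reading is a morphological Laplacian; the OpenCV sketch below follows that reading and is our illustration, not the paper's implementation.

#include <opencv2/opencv.hpp>

// Edge detection via morphological Laplacian + zero crossings.
// gray: 8-bit greyscale image I; B: binary structuring element.
cv::Mat detect_edges(const cv::Mat& gray, const cv::Mat& B, int thresh) {
  cv::Mat I;
  gray.convertTo(I, CV_16S);      // signed, so L can go negative
  cv::Mat dil, ero;
  cv::dilate(I, dil, B);
  cv::erode(I, ero, B);
  cv::Mat L = dil + ero - 2 * I;  // morphological Laplacian estimate

  // Zero crossings: L changes sign within a pixel's neighborhood
  // iff the local minimum is negative and the local maximum positive.
  cv::Mat Lmax, Lmin;
  cv::dilate(L, Lmax, B);
  cv::erode(L, Lmin, B);
  cv::Mat zc = (Lmin < 0) & (Lmax > 0);

  // Threshold the local brightness range to reject small variations
  // and keep only significant ones (the binary edge image E).
  cv::Mat range = dil - ero;
  return zc & (range > thresh);
}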
@@ -212,7 +212,7 @@ ZeroCrossings to perform a thresholding operation that will allow it to reject
 small variations in the brightness of the image and only detect more significant
 variations that actually constitute edges.
-We implemented this pipeline using OpenCV computer vision library.
+We implemented this pipeline using OpenCV~\ref{OpenCV} computer vision library.
 We used C++ thread library to create threads for each top level node in this
 example, and implemented fixed size
 circular buffers for each streaming edge between these nodes to pass data
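
Editor's note: the fixed-size circular buffers and per-node threads are described but not shown. Below is a minimal single-producer/single-consumer sketch using the C++ thread library, assuming blocking behaviour when a buffer is full or empty (the paper does not say how its buffers behave at the boundaries); all names in the usage comment are hypothetical.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

template <typename T>
class CircularBuffer {
  std::vector<T> buf_;
  size_t head_ = 0, tail_ = 0, count_ = 0;
  std::mutex m_;
  std::condition_variable not_full_, not_empty_;
public:
  explicit CircularBuffer(size_t capacity) : buf_(capacity) {}

  void push(T item) {  // blocks while the buffer is full
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [&] { return count_ < buf_.size(); });
    buf_[tail_] = std::move(item);
    tail_ = (tail_ + 1) % buf_.size();
    ++count_;
    not_empty_.notify_one();
  }

  T pop() {  // blocks while the buffer is empty
    std::unique_lock<std::mutex> lk(m_);
    not_empty_.wait(lk, [&] { return count_ > 0; });
    T item = std::move(buf_[head_]);
    head_ = (head_ + 1) % buf_.size();
    --count_;
    not_full_.notify_one();
    return item;
  }
};

// Usage sketch: one streaming edge between two top-level nodes
// (Frame, grab, more, and process are hypothetical):
//   CircularBuffer<Frame> edge(8);
//   std::thread producer([&] { while (more()) edge.push(grab()); });
//   std::thread consumer([&] { for (;;) process(edge.pop()); });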