Commit 795bfa8b authored by Prakalp Srivastava

Updated evaluation for new results

parent e80ad564
@@ -59,11 +59,10 @@ we chose as GPU baselines,
but compiled using the Intel OpenCL compiler, as we found
that these versions achieved the best performance compared to the other
available OpenCL versions on vector hardware as well.
The \NAME{} binaries were also generated using the same versions of OpenCL.
We use two input
-sizes for each benchmark, labeled 'Small' and 'Large' below.
+sizes for each benchmark, labeled `Small' and `Large' below.
Each data point we report is an average of ten runs for
the small test cases and an average of five runs for the large test cases;
we repeated the experiments multiple times to verify their stability.
@@ -82,28 +81,44 @@ application (kernel), copying data (copy) and remaining time spent on the host
side. The total execution time for the baseline is depicted on the
corresponding bar to give an indication of the actual numbers.
Comparing \NAME{} code with the GPU baseline, the performance is within about
25\% of the baseline in most cases and within a factor of
$1.8$ in the worst case.
We see that the \NAME{}
application spends more time in the kernel execution relative to the GPU
baseline. However, inspection of the generated PTX files generated by nVidia
OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
has shown that they are almost identical, with the only difference being a minor
number of instructions being reordered. Also, we notice increased, sometimes to
a significant factor, data copy times, despite the fact the data copied in both
applications are similar and that the \NAME{} runtime makes use of a memory
tracking mechanism to avoid unnecessary data copies. We are working on getting
a
clear picture of the overheads that the \NAME{} representation or compilation may
be imposing on the program execution.
In the vector case, we see that the performance of \NAME{} is within about
30\% in all cases, and within a factor of 1.6x in the worst case.
We again
observe the same inefficiencies in kernel and copy time, albeit less pronounced
due to the fact that the total running times are generally larger, which
minimizes the effect of constant overheads to the total execution time.
When comparing \NAME{} code with the GPU baseline, \NAME{} achieves near
hand-tuned OpenCL performance on almost all of these benchmarks; the one
exception is spmv on the `Small' dataset, where \NAME{} is within a factor
of $1.2$ of the baseline. This gap is due to the small total execution time
of $0.076s$ for spmv on the `Small' dataset, which magnifies the impact of
constant overheads. For the `Large' dataset, the \NAME{} code performs on
par with the OpenCL implementation: the larger total running time minimizes
the effect of constant overheads on the total execution time.
In the vector case, we see that the performance of \NAME{} is within 25\% of
the baseline in the worst case. The kernel execution time of lbm is 25\%
higher for the \NAME{} implementation than for OpenCL. This is because the
Intel OpenCL runtime, which the \NAME{} runtime uses, keeps one of its worker
threads idle when it observes that the application has created an extra
thread. The \NAME{} runtime must create this thread to execute the \NAME{}
dataflow graph asynchronously. We expect this overhead to disappear with an
improved OpenCL runtime implementation.
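As a minimal sketch of this pattern (not taken from the \NAME{} codebase, and
using a hypothetical \texttt{run\_graph} placeholder for the runtime's graph
launch path), the extra host thread mentioned above is simply a worker spawned
to drive the dataflow graph while the main thread continues:
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

/* Hypothetical placeholder for the runtime call that enqueues and
 * executes the dataflow graph on the OpenCL device. */
static void *run_graph(void *arg) {
    printf("dataflow graph executing asynchronously\n");
    return NULL;
}

int main(void) {
    pthread_t graph_thread;

    /* The extra host thread: it drives the graph asynchronously,
     * while the main thread is free to do other host-side work. */
    pthread_create(&graph_thread, NULL, run_graph, NULL);

    /* ... other host-side work would happen here ... */

    pthread_join(graph_thread, NULL);
    return 0;
}
\end{verbatim}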
%Comparing \NAME{} code with the GPU baseline, the performance is within about
%25\% of the baseline in most cases and within a factor of
%$1.8$ in the worst case.
%We see that the \NAME{}
%application spends more time in the kernel execution relative to the GPU
%baseline. However, inspection of the generated PTX files generated by nVidia
%OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
%has shown that they are almost identical, with the only difference being a minor
%number of instructions being reordered. Also, we notice increased, sometimes to
%a significant factor, data copy times, despite the fact the data copied in both
%applications are similar and that the \NAME{} runtime makes use of a memory
%tracking mechanism to avoid unnecessary data copies. We are working on getting
%a
%clear picture of the overheads that the \NAME{} representation or compilation may
%be imposing on the program execution.
%In the vector case, we see that the performance of \NAME{} is within about
%30\% in all cases, and within a factor of 1.6x in the worst case.
%We again
%observe the same inefficiencies in kernel and copy time, albeit less pronounced
%due to the fact that the total running times are generally larger, which
%minimizes the effect of constant overheads to the total execution time.
Finally, we note that none of our benchmarks made use of vector code at the leaf
dataflow nodes. This choice was made after comparing the performance of two \NAME{}
......
paper/Figures/cpularge.png (updated: 46.8 KiB → 47.8 KiB)
paper/Figures/cpusmall.png (updated: 49.4 KiB → 49.2 KiB)
paper/Figures/gpularge.png (updated: 46.5 KiB → 47.8 KiB)
paper/Figures/gpusmall.png (updated: 45.2 KiB → 48.2 KiB)