Commit 795bfa8b authored by Prakalp Srivastava

Updated evaluation for new results

parent e80ad564
@@ -59,11 +59,10 @@ we chose as GPU baselines,
but compiled using the Intel OpenCL compiler, as we found
that these versions achieved the best performance compared to the other
available OpenCL versions on vector hardware as well.
The \NAME{} binaries were also generated using the same versions of OpenCL.
We use two input
sizes for each benchmark, labeled `Small' and `Large' below.
Each data point we report is an average of ten runs for
the small test cases and an average of five runs for the large test cases;
we repeated the experiments multiple times to verify their stability.
@@ -82,28 +81,44 @@ application (kernel), copying data (copy) and remaining time spent on the host
side. The total execution time for the baseline is depicted on the
corresponding bar to give an indication of the actual numbers.
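To make this breakdown concrete, the following minimal sketch shows one common way to separate copy and kernel time using OpenCL event profiling. It is an illustration only, not the instrumentation used for these experiments; the helper names (\texttt{eventMs}, \texttt{timeOneIteration}) and the queue, kernel, and buffer variables are placeholders, and the queue is assumed to be created with \texttt{CL\_QUEUE\_PROFILING\_ENABLE}.
\begin{verbatim}
// Sketch: separating copy time from kernel time with OpenCL events.
#include <CL/cl.h>
#include <cstdio>

static double eventMs(cl_event e) {
  cl_ulong start = 0, end = 0;
  clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_START,
                          sizeof(start), &start, NULL);
  clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_END,
                          sizeof(end), &end, NULL);
  return (end - start) * 1e-6;  // nanoseconds -> milliseconds
}

void timeOneIteration(cl_command_queue queue, cl_kernel kernel, cl_mem buf,
                      void *host_ptr, size_t bytes,
                      size_t global, size_t local) {
  cl_event copy_in, run, copy_out;
  clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr,
                       0, NULL, &copy_in);
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                         0, NULL, &run);
  clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr,
                      0, NULL, &copy_out);
  clFinish(queue);
  printf("copy: %.3f ms  kernel: %.3f ms\n",
         eventMs(copy_in) + eventMs(copy_out), eventMs(run));
}
\end{verbatim}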
When comparing \NAME{} code with the GPU baseline, \NAME{} achieves near
hand-tuned OpenCL performance for almost all of these benchmarks, except spmv on
the `Small' dataset, where it is within a factor of $1.2$. This is because of the
small total execution time of $0.076s$ for spmv on the `Small' dataset. For the
`Large' dataset, the \NAME{} code performance is on par with the OpenCL
implementation: because the total running time is larger, the effect of
constant overheads on the total execution time is minimal.

In the vector case, we see that the performance of \NAME{} is within 25\% of the
baseline in the worst case. We observe that the kernel execution time in lbm is
25\% higher for the \NAME{} implementation than for OpenCL. This is because the
Intel OpenCL runtime, which is used by the \NAME{} runtime, keeps one thread
idle when it observes that an application has created an extra thread. We have
to create this thread to execute the \NAME{} dataflow graph asynchronously. We
expect this overhead to go away with an improved OpenCL runtime implementation.
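As a rough sketch of this mechanism (not the actual \NAME{} runtime code), running the dataflow graph asynchronously amounts to handing the graph-launch call to a separate host thread; \texttt{executeDataflowGraph} below is a hypothetical stand-in for that call.
\begin{verbatim}
#include <cstdio>
#include <thread>

// Hypothetical stand-in for the runtime's dataflow-graph launch call.
void executeDataflowGraph() { std::puts("dataflow graph running"); }

int main() {
  // The extra host thread mentioned above: the graph executes
  // asynchronously while the main thread stays free for other host work.
  std::thread graphThread(executeDataflowGraph);

  // ... other host-side work could overlap here ...

  graphThread.join();  // wait for the graph to finish
  return 0;
}
\end{verbatim}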
%Comparing \NAME{} code with the GPU baseline, the performance is within about
%25\% of the baseline in most cases and within a factor of
%$1.8$ in the worst case.
%We see that the \NAME{}
%application spends more time in the kernel execution relative to the GPU
%baseline. However, inspection of the generated PTX files generated by nVidia
%OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
%has shown that they are almost identical, with the only difference being a minor
%number of instructions being reordered. Also, we notice increased, sometimes to
%a significant factor, data copy times, despite the fact the data copied in both
%applications are similar and that the \NAME{} runtime makes use of a memory
%tracking mechanism to avoid unnecessary data copies. We are working on getting
%a
%clear picture of the overheads that the \NAME{} representation or compilation may
%be imposing on the program execution.
%In the vector case, we see that the performance of \NAME{} is within about
%30\% in all cases, and within a factor of 1.6x in the worst case.
%We again
%observe the same inefficiencies in kernel and copy time, albeit less pronounced
%due to the fact that the total running times are generally larger, which
%minimizes the effect of constant overheads to the total execution time.
Finally, we note that none of our benchmarks made use of vector code at the leaf
dataflow nodes. This choice was made after comparing the performance of two \NAME{}
...
Updated figure images: paper/Figures/cpularge.png, paper/Figures/cpusmall.png, paper/Figures/gpularge.png, paper/Figures/gpusmall.png