Commit 795bfa8b authored by Prakalp Srivastava

Updated evaluation for new results

parent e80ad564
@@ -59,11 +59,10 @@ we chose as GPU baselines,
but compiled using the Intel OpenCL compiler, as we found
that these versions achieved the best performance compared to the other
available OpenCL versions on vector hardware as well.
The \NAME{} binaries were also generated using the same versions of OpenCL.
We use two input
-sizes for each benchmark, labeled 'Small' and 'Large' below.
+sizes for each benchmark, labeled `Small' and `Large' below.
Each data point we report is an average of ten runs for
the small test cases and an average of five runs for the large test cases;
we repeated the experiments multiple times to verify their stability.
@@ -82,28 +81,44 @@ application (kernel), copying data (copy) and remaining time spent on the host
side. The total execution time for the baseline is depicted on the
corresponding bar to give an indication of the actual numbers.
Comparing \NAME{} code with the GPU baseline, the performance is within about
25\% of the baseline in most cases and within a factor of
$1.8$ in the worst case.
We see that the \NAME{}
application spends more time in the kernel execution relative to the GPU
baseline. However, inspection of the generated PTX files generated by nVidia
OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
has shown that they are almost identical, with the only difference being a minor
number of instructions being reordered. Also, we notice increased, sometimes to
a significant factor, data copy times, despite the fact the data copied in both
applications are similar and that the \NAME{} runtime makes use of a memory
tracking mechanism to avoid unnecessary data copies. We are working on getting
a
clear picture of the overheads that the \NAME{} representation or compilation may
be imposing on the program execution.
In the vector case, we see that the performance of \NAME{} is within about
30\% in all cases, and within a factor of 1.6x in the worst case.
We again
observe the same inefficiencies in kernel and copy time, albeit less pronounced
due to the fact that the total running times are generally larger, which
minimizes the effect of constant overheads to the total execution time.
When comparing \NAME{} code with the GPU baseline, \NAME{} achieves near
hand-tuned OpenCL performance on almost all of these benchmarks; the one
exception is spmv on the `Small' dataset, where \NAME{} is within a factor
of $1.2$ of the baseline. This gap is due to the small total execution time
of $0.076s$ for spmv on the `Small' dataset, which magnifies the impact of
constant overheads. For the `Large' dataset, the \NAME{} code performs on
par with the OpenCL implementation: the larger total running time minimizes
the effect of constant overheads on the total execution time.
In the vector case, we see that the performance of \NAME{} is within 25\% of
the baseline in the worst case. The kernel execution time of lbm is 25\%
higher for the \NAME{} implementation than for OpenCL. This is because the
Intel OpenCL runtime, which the \NAME{} runtime uses, keeps one of its worker
threads idle when it observes that the application has created an extra
thread. The \NAME{} runtime must create this thread to execute the \NAME{}
dataflow graph asynchronously. We expect this overhead to disappear with an
improved OpenCL runtime implementation.
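As a minimal sketch of this pattern (not taken from the \NAME{} codebase, and
using a hypothetical \texttt{run\_graph} placeholder for the runtime's graph
launch path), the extra host thread mentioned above is simply a worker spawned
to drive the dataflow graph while the main thread continues:
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

/* Hypothetical placeholder for the runtime call that enqueues and
 * executes the dataflow graph on the OpenCL device. */
static void *run_graph(void *arg) {
    printf("dataflow graph executing asynchronously\n");
    return NULL;
}

int main(void) {
    pthread_t graph_thread;

    /* The extra host thread: it drives the graph asynchronously,
     * while the main thread is free to do other host-side work. */
    pthread_create(&graph_thread, NULL, run_graph, NULL);

    /* ... other host-side work would happen here ... */

    pthread_join(graph_thread, NULL);
    return 0;
}
\end{verbatim}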
%Comparing \NAME{} code with the GPU baseline, the performance is within about
%25\% of the baseline in most cases and within a factor of
%$1.8$ in the worst case.
%We see that the \NAME{}
%application spends more time in the kernel execution relative to the GPU
%baseline. However, inspection of the generated PTX files generated by nVidia
%OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
%has shown that they are almost identical, with the only difference being a minor
%number of instructions being reordered. Also, we notice increased, sometimes to
%a significant factor, data copy times, despite the fact the data copied in both
%applications are similar and that the \NAME{} runtime makes use of a memory
%tracking mechanism to avoid unnecessary data copies. We are working on getting
%a
%clear picture of the overheads that the \NAME{} representation or compilation may
%be imposing on the program execution.
%In the vector case, we see that the performance of \NAME{} is within about
%30\% in all cases, and within a factor of 1.6x in the worst case.
%We again
%observe the same inefficiencies in kernel and copy time, albeit less pronounced
%due to the fact that the total running times are generally larger, which
%minimizes the effect of constant overheads to the total execution time.
Finally, we note that none of our benchmarks made use of vector code at the leaf
dataflow nodes. This choice was made after comparing the performance of two \NAME{}
......
paper/Figures/cpularge.png (updated: 46.8 KiB → 47.8 KiB)
paper/Figures/cpusmall.png (updated: 49.4 KiB → 49.2 KiB)
paper/Figures/gpularge.png (updated: 46.5 KiB → 47.8 KiB)
paper/Figures/gpusmall.png (updated: 45.2 KiB → 48.2 KiB)