diff --git a/paper/Evaluation.tex b/paper/Evaluation.tex
index 7d4a539d4341b353307ec60a8283f0fbf895f4ed..18e5284fc2453dcf9109d421e7de49caac3718a1 100644
--- a/paper/Evaluation.tex
+++ b/paper/Evaluation.tex
@@ -59,11 +59,10 @@
 we chose as GPU baselines, but compiled using the Intel OpenCL compiler, as we
 found that these versions achieved the best performance compared to the other
 available OpenCL versions on vector hardware as well.
-
 The \NAME{} binaries were also generated
 using the same versions of OpenCL.
 We use two input
-sizes for each benchmark, labeled 'Small' and 'Large' below.
+sizes for each benchmark, labeled `Small' and `Large' below.
 Each data point we report is an average of ten runs for the small test cases
 and an average of five runs for the large test cases; we repeated the
 experiments multiple times to verify their stability.
@@ -82,28 +81,44 @@
 application (kernel), copying data (copy) and remaining time spent on the host
 side. The total execution time for the baseline is depicted on the
 corresponding bar to give an indication of the actual numbers.
-Comparing \NAME{} code with the GPU baseline, the performance is within about
-25\% of the baseline in most cases and within a factor of
-$1.8$ in the worst case.
-We see that the \NAME{}
-application spends more time in the kernel execution relative to the GPU
-baseline. However, inspection of the generated PTX files generated by nVidia
-OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
-has shown that they are almost identical, with the only difference being a minor
-number of instructions being reordered. Also, we notice increased, sometimes to
-a significant factor, data copy times, despite the fact the data copied in both
-applications are similar and that the \NAME{} runtime makes use of a memory
-tracking mechanism to avoid unnecessary data copies. We are working on getting
-a
-clear picture of the overheads that the \NAME{} representation or compilation may
-be imposing on the program execution.
-
-In the vector case, we see that the performance of \NAME{} is within about
-30\% in all cases, and within a factor of 1.6x in the worst case.
-We again
-observe the same inefficiencies in kernel and copy time, albeit less pronounced
-due to the fact that the total running times are generally larger, which
-minimizes the effect of constant overheads to the total execution time.
+Comparing \NAME{} code with the GPU baseline, \NAME{} comes close to hand-tuned
+OpenCL performance on almost all benchmarks; the exception is spmv on the
+`Small' dataset, where \NAME{} is within a factor of $1.2$. The gap is due to
+the small total execution time of $0.076s$ for this run, which magnifies
+constant overheads. On the `Large' dataset, \NAME{} is on par with the OpenCL
+implementation: the larger total running time makes the effect of constant
+overheads on the total execution time minimal.
+
+In the vector case, the performance of \NAME{} is within 25\% of the baseline
+in the worst case. The kernel execution time of lbm is 25\% higher for the
+\NAME{} implementation than for OpenCL. This is because the Intel OpenCL
+runtime, which the \NAME{} runtime uses, keeps one thread idle when it
+observes that the application has created an extra thread; \NAME{} must create
+this extra thread to execute its dataflow graph asynchronously. We expect this
+overhead to go away with improved OpenCL runtime implementations.
+
+%Comparing \NAME{} code with the GPU baseline, the performance is within about
+%25\% of the baseline in most cases and within a factor of
+%$1.8$ in the worst case.
+%We see that the \NAME{}
+%application spends more time in the kernel execution relative to the GPU
+%baseline. However, inspection of the generated PTX files generated by nVidia
+%OpenCL compiler for OpenCL applications and \NAME{} compiler for \NAME{} applications
+%has shown that they are almost identical, with the only difference being a minor
+%number of instructions being reordered. Also, we notice increased, sometimes to
+%a significant factor, data copy times, despite the fact the data copied in both
+%applications are similar and that the \NAME{} runtime makes use of a memory
+%tracking mechanism to avoid unnecessary data copies. We are working on getting
+%a
+%clear picture of the overheads that the \NAME{} representation or compilation may
+%be imposing on the program execution.
+
+%In the vector case, we see that the performance of \NAME{} is within about
+%30\% in all cases, and within a factor of 1.6x in the worst case.
+%We again
+%observe the same inefficiencies in kernel and copy time, albeit less pronounced
+%due to the fact that the total running times are generally larger, which
+%minimizes the effect of constant overheads to the total execution time.
 Finally, we note that none of our benchmarks made use of vector code at the
 leaf dataflow nodes. This choice was made after comparing the performance of
 two \NAME{}
diff --git a/paper/Figures/cpularge.png b/paper/Figures/cpularge.png
index cee89cc44a2fa05a73128d49fad08ed24b5053b6..f9463ec0d9fb525ee0806cf07e6f8274ef0ae4da 100644
Binary files a/paper/Figures/cpularge.png and b/paper/Figures/cpularge.png differ
diff --git a/paper/Figures/cpusmall.png b/paper/Figures/cpusmall.png
index a70cb4ed7dcace70c57564a18547cdb58320e811..3b96c6c16ce9158948a85c7f58fd2800622b42c8 100644
Binary files a/paper/Figures/cpusmall.png and b/paper/Figures/cpusmall.png differ
diff --git a/paper/Figures/gpularge.png b/paper/Figures/gpularge.png
index 532e5ca26b4b198a7e0b2a4fefbf54c6e8c01819..eb09aeed6e2d90325ca89b23a7fb2813fe0b7d2a 100644
Binary files a/paper/Figures/gpularge.png and b/paper/Figures/gpularge.png differ
diff --git a/paper/Figures/gpusmall.png b/paper/Figures/gpusmall.png
index b74c3aec8fabdeb4b4323ca37048b576505e0802..8186414ca10490b84a15f2f72aa7cff5b314497e 100644
Binary files a/paper/Figures/gpusmall.png and b/paper/Figures/gpusmall.png differ