Commit 0607cf11 authored by Praneet Rathi

comms

parent 7421e35e
Merge request !115: GPU backend
@@ -38,10 +38,37 @@ pub fn gpu_codegen<W: Write>(
* largest element
* - Summation types must be aligned to their largest element
*
* Notes on GPU parallelization strategy and tips for IR transformations:
* - The top-level block Fork and any lower thread Forks require a known Fork
* size. Thus, for an otherwise parallelizable Fork of unknown size, consider
* splitting it into two Forks, one of known size (see the grid-stride sketch
* after this comment). At the block level, the known-size Fork must be the
* (only) top-most Fork.
* - The thread-level strategy is determined by starting at the most deeply
* nested Forks and working outwards greedily, with caps imposed by the GPU
* spec. Thus, to ensure some outer Fork is parallelized, ensure the inner
* parallelizable Forks aren't too large, or consider removing their
* schedule annotations.
* - Tight-Associative reductions can only be implemented efficiently if
* different Hercules ThreadIDs correspond to consecutive CUDA threads (see
* the warp-shuffle sketch after this comment). But this prevents nested
* parallelization, since each parallel group must always be a contiguous
* tile of threads. When this causes a conflict between a Fork and its
* subtree, we use the heuristic of choosing the larger factor, but that
* choice may not be optimal.
* - A given Fork (as opposed to its children) can only be parallelized if
* all of its Reduces are Parallel-Reduce or Tight-Associative. So if the
* Fork contains expensive parallelizable operations, ensure all of its
* reductions are parallelizable, or, failing that, try pulling those
* operations out into a different Fork.
* - We do nothing to mitigate intra-warp divergence. To avoid it, the IR
* should, for example, ensure the innermost parallelizable Forks either
* have factor >= warp size (32) or have their Fork/Reduce schedule
* annotations removed (see the divergence sketch after this comment).
*
* Main TODOs:
* - Fix dynamic shared memory allocation to reuse old shmem. The main case
* for improvement is when we have serialized forks with unused intermediate
* values from previous iterations (a sketch of the intended reuse follows
* this comment).
* - Add mapping from Region node to Fork node if there's a Reduce whose
* control is a Region, not a Join.
* - Matmul/Conv detection
* - Add float8, float16, and bfloat16 dtypes if/when they arrive
*/
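The known-size requirement above comes from CUDA itself: grid and block dimensions are fixed at kernel launch, so a Fork of unknown size cannot map directly onto blocks or threads. A minimal sketch of the standard workaround that the fork-splitting tip suggests, pairing a known launch size with a serial remainder loop (the kernel name and body are illustrative, not generated code):

```cuda
// A fixed launch (the known-size Fork) covers an unknown extent n by
// having each thread stride serially over the leftover iterations.
__global__ void scale_by_two(float* data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        data[i] *= 2.0f;  // stand-in for the Fork body
    }
}
```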
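On the Tight-Associative point: consecutive CUDA threads can cooperate through warp shuffles, which is why Hercules ThreadIDs must land on a contiguous tile of threads. A minimal sketch assuming a full warp of 32 active lanes (the helper name is hypothetical, not part of the generated code):

```cuda
// Tree reduction within one warp: at each step, lane i adds the value
// held by lane i + offset. This is only correct when the parallel group
// occupies 32 consecutive threads of a single warp, which is exactly the
// contiguity constraint described above.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;  // lane 0 ends up holding the warp's total
}
```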
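On intra-warp divergence: when an innermost parallel Fork has a factor below the warp size, lanes within one warp disagree on their branch and the hardware serializes both arms. A hypothetical kernel showing the effect (names are illustrative):

```cuda
// With fork_factor < 32, part of each warp takes the "body" branch and
// the rest takes the "else" branch, so the warp executes both in turn.
// With fork_factor >= 32, every warp takes a single path.
__global__ void divergence_demo(const float* in, float* out, int fork_factor) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < fork_factor) {
        out[tid] = in[tid] * 2.0f;  // active "fork body" lanes
    } else {
        out[tid] = 0.0f;  // idle lanes still occupy the warp's issue slots
    }
}
```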
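On the shared-memory TODO: a sketch, with hypothetical buffer names, of the reuse being described. Two serialized forks can alias the same dynamically allocated region once the first fork's intermediate values are dead:

```cuda
extern __shared__ char shmem[];  // dynamic shared memory, sized at launch

__global__ void serialized_forks(const float* in, float* out, int n) {
    int t = threadIdx.x;

    // Fork 1 writes its intermediates into shared memory.
    float* buf_a = reinterpret_cast<float*>(shmem);
    if (t < n) buf_a[t] = in[t] * 2.0f;
    __syncthreads();
    float carried = (t < n) ? buf_a[t] : 0.0f;
    __syncthreads();  // buf_a is dead past this point

    // Fork 2 can alias the same bytes instead of getting a fresh
    // allocation, shrinking the kernel's dynamic shmem footprint.
    float* buf_b = reinterpret_cast<float*>(shmem);
    if (t < n) buf_b[t] = carried + 1.0f;
    __syncthreads();
    if (t < n) out[t] = buf_b[t];
}
```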