Commit 0607cf11 authored by Praneet Rathi

comms

parent 7421e35e
Merge request !115: GPU backend
@@ -38,10 +38,37 @@ pub fn gpu_codegen<W: Write>(
* largest element
* - Summation types must be aligned to their largest element
*
* Notes on GPU parallelization strategy and tips for IR transformations:
* - The top-level block Fork and any lower thread Forks require a known Fork
* size. Thus, for an otherwise parallelizable Fork of unknown size, consider
* splitting it into two Forks, one of known size (see the grid-stride sketch
* after this comment). At the block level, the known-size Fork must be the
* (only) top-most Fork.
* - The thread-level strategy is determined by starting at the most deeply
* nested Forks and working outwards greedily, with caps imposed by the GPU
* spec. Thus, to ensure some outer Fork is parallelized, ensure the inner
* parallelizable Forks aren't too large, or consider removing their
* schedule annotations.
* - Tight-Associative reductions can only be implemented efficiently if
* different Hercules ThreadIDs correspond to consecutive CUDA threads (see
* the warp-shuffle sketch after this comment). But this prevents nested
* parallelization, since each parallel group must always be a contiguous
* tile of threads. When this causes a conflict between a Fork and its
* subtree, we use the heuristic of choosing the larger factor, but that
* choice may not be optimal.
* - A given Fork (as opposed to its children) can only be parallelized if
* all of its Reduces are Parallel-Reduce or Tight-Associative. So if the
* Fork contains expensive parallelizable operations, ensure all of its
* reductions are parallelizable, or, failing that, try pulling those
* operations out into a different Fork.
* - We do nothing to mitigate intra-warp divergence. To avoid it, the IR
* should, for example, ensure the innermost parallelizable Forks either
* have factor >= warp size (32) or have their Fork/Reduce schedule
* annotations removed (see the divergence sketch after this comment).
*
* Main TODOs:
* - Fix dynamic shared memory allocation to reuse old shmem. The main case
* for improvement is when we have serialized forks with unused intermediate
* values from previous iterations (a sketch of the intended reuse follows
* this comment).
* - Add mapping from Region node to Fork node if there's a Reduce whose
* control is a Region, not a Join.
* - Matmul/Conv detection
* - Add float8, float16, and bfloat16 dtypes if/when they arrive
*/
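The known-size requirement above comes from CUDA itself: grid and block dimensions are fixed at kernel launch, so a Fork of unknown size cannot map directly onto blocks or threads. A minimal sketch of the standard workaround that the fork-splitting tip suggests, pairing a known launch size with a serial remainder loop (the kernel name and body are illustrative, not generated code):

```cuda
// A fixed launch (the known-size Fork) covers an unknown extent n by
// having each thread stride serially over the leftover iterations.
__global__ void scale_by_two(float* data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        data[i] *= 2.0f;  // stand-in for the Fork body
    }
}
```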
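On the Tight-Associative point: consecutive CUDA threads can cooperate through warp shuffles, which is why Hercules ThreadIDs must land on a contiguous tile of threads. A minimal sketch assuming a full warp of 32 active lanes (the helper name is hypothetical, not part of the generated code):

```cuda
// Tree reduction within one warp: at each step, lane i adds the value
// held by lane i + offset. This is only correct when the parallel group
// occupies 32 consecutive threads of a single warp, which is exactly the
// contiguity constraint described above.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;  // lane 0 ends up holding the warp's total
}
```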
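On intra-warp divergence: when an innermost parallel Fork has a factor below the warp size, lanes within one warp disagree on their branch and the hardware serializes both arms. A hypothetical kernel showing the effect (names are illustrative):

```cuda
// With fork_factor < 32, part of each warp takes the "body" branch and
// the rest takes the "else" branch, so the warp executes both in turn.
// With fork_factor >= 32, every warp takes a single path.
__global__ void divergence_demo(const float* in, float* out, int fork_factor) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < fork_factor) {
        out[tid] = in[tid] * 2.0f;  // active "fork body" lanes
    } else {
        out[tid] = 0.0f;  // idle lanes still occupy the warp's issue slots
    }
}
```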
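On the shared-memory TODO: a sketch, with hypothetical buffer names, of the reuse being described. Two serialized forks can alias the same dynamically allocated region once the first fork's intermediate values are dead:

```cuda
extern __shared__ char shmem[];  // dynamic shared memory, sized at launch

__global__ void serialized_forks(const float* in, float* out, int n) {
    int t = threadIdx.x;

    // Fork 1 writes its intermediates into shared memory.
    float* buf_a = reinterpret_cast<float*>(shmem);
    if (t < n) buf_a[t] = in[t] * 2.0f;
    __syncthreads();
    float carried = (t < n) ? buf_a[t] : 0.0f;
    __syncthreads();  // buf_a is dead past this point

    // Fork 2 can alias the same bytes instead of getting a fresh
    // allocation, shrinking the kernel's dynamic shmem footprint.
    float* buf_b = reinterpret_cast<float*>(shmem);
    if (t < n) buf_b[t] = carried + 1.0f;
    __syncthreads();
    if (t < n) out[t] = buf_b[t];
}
```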