Fork tiling non-divisible sizes
I've made enough progress on cava to get it running on the GPU (yay!), and the GPU backend is actually coping with the code pretty well. This includes an add reduction inside gamut: when that loop is tiled by 2, the GPU backend generates an add reduction using the cooperative groups API, which corresponds pretty much exactly to what we want! The issue is that the size of this loop in cava is 3702 (= 2 * 1851), so 2 is the largest power of two that divides it. To extract more parallelism in gamut, we'd need to implement fork tiling with leftover iteration generation.
Currently, we invoke fork-tile as follows:
fork-tile[2, 0, false, true](fuse4@channel_loop);
in cava. The arguments are the tile factor, tiled dimension, whether to emit leftover iterations, and what direction to tile in. I'd like to use:
fork-tile[512, 0, true, true](fuse4@channel_loop);
which would peel off a fork-join of size dc_param_2 - 512 * floor(dc_param_2 / 512) (the floor is implicitly represented in DC division), which would be size 118 given dc_param_2 = 3702 (3702 - 512 * 7 = 118).
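The arithmetic behind the peeled-off fork-join can be checked with a small sketch. This is plain Python, not the scheduler or the fork-tile pass itself. The helper names `largest_pow2_divisor` and `split_for_tiling` are hypothetical and only illustrate the sizes involved, assuming the loop bound is 3702 as stated above.

```python
def largest_pow2_divisor(n: int) -> int:
    """Largest power of two that evenly divides n (n > 0)."""
    # n & -n isolates the lowest set bit of n.
    return n & -n


def split_for_tiling(n: int, tile: int) -> tuple:
    """Split n iterations into full tiles of size `tile` plus a leftover.

    Hypothetical helper: returns (number of full tiles, leftover size),
    where leftover = n - tile * floor(n / tile).
    """
    full_tiles = n // tile              # integer division acts as the floor
    leftover = n - tile * full_tiles
    return full_tiles, leftover


if __name__ == "__main__":
    print(largest_pow2_divisor(3702))   # 2
    print(split_for_tiling(3702, 512))  # (7, 118)
```

So a tile factor of 512 would cover 7 * 512 = 3584 iterations in the tiled fork-join and leave 118 for the peeled one.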