Fork tiling non-divisible sizes

One way to implement this would be as a fork peeling optimization, which might be useful in other situations as well. We could also consider an optimization that expands the range of a fork and predicates the body, basically creating an if thread_id < n { ... } guard, which seems to be a pretty common pattern in CUDA. That may be preferable in some cases, though for good reductions we'd need something that can also add identity operations, for instance adding 0 in each thread that is out of bounds.
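To make the two options concrete, here is a minimal sketch in plain Python that models a fork as a loop nest over a sum reduction. The function names (tiled_sum_peeled, tiled_sum_predicated) and the identity parameter are hypothetical and only for illustration; this is not the compiler's IR or the transformation as implemented in the merge request.

```python
def tiled_sum_peeled(xs, tile):
    """Peeling: tile the divisible part, run the leftover iterations separately."""
    n = len(xs)
    full = (n // tile) * tile  # largest multiple of `tile` not exceeding n
    total = 0
    # Main tiled nest: every tile has exactly `tile` iterations.
    for base in range(0, full, tile):
        for lane in range(tile):
            total += xs[base + lane]
    # Peeled remainder: the n % tile iterations that do not fill a tile.
    for i in range(full, n):
        total += xs[i]
    return total


def tiled_sum_predicated(xs, tile, identity=0):
    """Predication: round the range up to a multiple of `tile` and guard the body."""
    n = len(xs)
    padded = -(-n // tile) * tile  # ceiling division, then scale back up
    total = identity
    for base in range(0, padded, tile):
        for lane in range(tile):
            i = base + lane
            # The guard plays the role of `if thread_id < n`; out-of-bounds
            # iterations contribute the reduction's identity (0 for a sum).
            total += xs[i] if i < n else identity
    return total
```

With 10 elements and a tile size of 4, the peeled version runs two full tiles plus a 2-iteration remainder, while the predicated version runs three full tiles with the last two iterations contributing 0. Peeling keeps the main nest free of branches; predication keeps a single uniform nest at the cost of a guard per iteration.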

mentioned in merge request !198 (merged)

mentioned in commit 52efbe6e

closed with merge request !198 (merged)
