Color functions per device in the Graphviz visualization.
Fix bugs in GCM and loop nesting so that forks reach codegen.
Add GPU schedules for dot and matmul tests that contain forks.
The GPU schedules are currently commented out in the build scripts. To use them when building with --features=cuda, switch which schedule_in_src line is commented out in both build scripts.