
active

make rigid-body objects hold device pointers

branch rb-device: have RBType objects hold device pointers for raw RigidBodyGrids
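
a minimal sketch of the rb-device idea (the stand-in struct fields and copyGridToDevice are assumptions, not the code's API; it deep-copies val the way Configuration.cpp copies BaseGrid data):

  #include <cuda_runtime.h>

  // Stand-in for the real RigidBodyGrid (the real class has more members).
  struct RigidBodyGrid {
      int nx, ny, nz;
      float* val;                       // grid values, length nx*ny*nz
  };

  // Sketch: RB type holds a device pointer to a raw RigidBodyGrid.
  struct RBType {
      RigidBodyGrid* grid_d = nullptr;  // device pointer (branch rb-device)

      void copyGridToDevice(const RigidBodyGrid& g) {
          // deep-copy the value array first...
          RigidBodyGrid tmp = g;
          size_t sz = sizeof(float) * g.nx * g.ny * g.nz;
          cudaMalloc(&tmp.val, sz);
          cudaMemcpy(tmp.val, g.val, sz, cudaMemcpyHostToDevice);
          // ...then the struct itself, whose val now points to device memory
          cudaMalloc(&grid_d, sizeof(RigidBodyGrid));
          cudaMemcpy(grid_d, &tmp, sizeof(RigidBodyGrid), cudaMemcpyHostToDevice);
      }
  };

kernels can then take RigidBodyGrid* arguments directly, as the rho/u pointers in ComputeGridGrid.cuh suggest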

eventually

multiple “PMF” grids associated with each RB type

read restart file

throw exception if an RB type uses a grid key multiple times

improve efficiency of force evaluation kernel

each RB gets temperature from grid

optionally don’t eval forces every ts

organization

RigidBodyController (RBC) holds device pointers, manages force evaluation and integration
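
roughly (all member names below are assumptions; only the responsibilities come from the note):

  #include <cuda_runtime.h>

  struct RigidBodyGrid;             // from RigidBodyGrid.h

  // Sketch: RBC owns the device pointers and drives each step.
  class RigidBodyController {
  public:
      void evalForces();            // launch grid-grid kernels, reduce force/torque
      void integrate(float dt);     // update RB positions and orientations
  private:
      RigidBodyGrid** grids_d;      // device pointers to raw grids (rb-device)
      cudaStream_t stream;          // hypothetical per-controller stream
  };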

Opportunities for memory bandwidth savings

each block should (ideally) contain a compact set of density grid points

cache (automatically!?) potential grid lookups and coefficients!

each block should have same transformation matrices applied to each grid point?!

each block could have the same inverse transformation matrix applied to each grid point (see the sketch after this list)

how well this performs will depend on the number

but also simplifies reductions
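
sketch of the per-block transform idea referenced above (Vector3/Matrix3 here are minimal stand-ins for the code's types; kernel name and layout are assumptions):

  // Minimal stand-ins for the code's Vector3/Matrix3.
  struct Vector3 { float x, y, z; };
  __host__ __device__ inline Vector3 operator-(Vector3 a, Vector3 b) {
      return {a.x - b.x, a.y - b.y, a.z - b.z};
  }
  struct Matrix3 {
      float m[9];
      __host__ __device__ Vector3 transform(Vector3 v) const {
          return { m[0]*v.x + m[1]*v.y + m[2]*v.z,
                   m[3]*v.x + m[4]*v.y + m[5]*v.z,
                   m[6]*v.x + m[7]*v.y + m[8]*v.z };
      }
  };

  // One inverse transform per block: load it into shared memory once,
  // reuse it for every grid point the block handles.
  __global__ void transformPoints(const Vector3* p, Vector3* out, int n,
                                  const Matrix3* basisInv, const Vector3* origin) {
      __shared__ Matrix3 B;
      __shared__ Vector3 o;
      if (threadIdx.x == 0) { B = *basisInv; o = *origin; }
      __syncthreads();
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = B.transform(p[i] - o);   // one global load of B per block
  }

the per-thread results are then already block-local, which is what simplifies the reductions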

new data structure for grids?

each grid contains blocks of data of a size optimized for the device

Each thread can operate on multiple data points if needed (any advantage?)
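
one concrete answer is the grid-stride loop (also flagged by an RBTODO in ComputeGridGrid.cuh; see the grep output below); the scaleGrid kernel here is hypothetical:

  // Grid-stride loop: each thread handles several points, and correctness
  // no longer depends on the launch configuration.
  __global__ void scaleGrid(float* val, float s, int n) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += blockDim.x * gridDim.x)
          val[i] *= s;
  }

  // e.g. launch with fewer blocks than points; each thread loops:
  //   scaleGrid<<<64, 256>>>(val_d, 2.0f, n);

the advantage is mostly launch-size flexibility and amortized index arithmetic, not raw speed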

questions

Q: overhead of dynamic parallelism?

Where does it make sense to have a kernel call subkernels? A: at least to synchronize blocks (see the sketch after these questions)

Q: overhead for classes?

Q: could algorithm use shared memory through persistent threads?
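
on the first question, the pattern being weighed looks roughly like this (toy kernels; dynamic parallelism needs sm_35+ and nvcc -rdc=true); launching children into one stream gives an implicit grid-wide barrier between phases, which a single kernel cannot get from __syncthreads():

  __global__ void phase1(float* x, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] += 1.0f;
  }
  __global__ void phase2(float* x, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= 2.0f;              // must see all of phase1's writes
  }
  __global__ void parent(float* x, int n) {
      // device-side launches from one thread share a stream, so they run
      // in order: phase2 starts only after phase1's whole grid finishes
      phase1<<<(n + 255) / 256, 256>>>(x, n);
      phase2<<<(n + 255) / 256, 256>>>(x, n);
  }
  // parent<<<1, 1>>>(x_d, n);

the overhead question then becomes the cost of device-side launches vs returning to the host between phases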

bring in rigid body integrator

optimize algorithms for CUDA (balance floating-point calculation with memory transfers)

share pieces of memory that are used repeatedly

RigidBodyGrid: methods pulled from BaseGrid, but transforms passed to functions

no wrapping
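
i.e. roughly this shape, reusing the RigidBodyGrid/Vector3/Matrix3 stand-ins from the sketches above (signature assumed; for brevity this does trilinear interpolation where the real code is cubic):

  // Transform passed per call (BaseGrid stores basis/origin as members),
  // and out-of-range reads return 0 instead of wrapping periodically.
  __device__ float interpolate(const RigidBodyGrid& g, Vector3 pos,
                               const Matrix3& basisInv, const Vector3& origin) {
      Vector3 l = basisInv.transform(pos - origin);
      int hx = (int) floorf(l.x), hy = (int) floorf(l.y), hz = (int) floorf(l.z);
      float wx = l.x - hx, wy = l.y - hy, wz = l.z - hz;
      float v = 0.0f;
      for (int ix = 0; ix < 2; ix++)
          for (int iy = 0; iy < 2; iy++)
              for (int iz = 0; iz < 2; iz++) {
                  int jx = hx + ix, jy = hy + iy, jz = hz + iz;
                  if (jx < 0 || jy < 0 || jz < 0 ||
                      jx >= g.nx || jy >= g.ny || jz >= g.nz)
                      continue;             // no wrapping: zero outside the grid
                  float w = (ix ? wx : 1.0f - wx) * (iy ? wy : 1.0f - wy)
                          * (iz ? wz : 1.0f - wz);
                  v += w * g.val[jx*g.ny*g.nz + jy*g.nz + jz];
              }
      return v;
  }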

where exactly to parallelize:

Easy to break up the calculation over density grid points, but each thread needs to perform its own lookup of the potential grid; only O(10) values per thread

neighboring density grid points will need to look up a similar region of the potential grid, BUT the regions could be difficult to determine a priori

possible to move potential grid data into shared-memory tiles, e.g. blocks of 10x10x10; this will be more important for global memory. Better to use cudaCache?
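
sketch of the tile idea (tile origin and names assumed; a 10x10x10 float tile is ~4 KB of shared memory):

  #define TILE 10
  __global__ void tiledLookup(const float* u, int nx, int ny, int nz,
                              int ox, int oy, int oz) {   // tile origin
      __shared__ float tile[TILE][TILE][TILE];
      // cooperative load: threads stride over the tile's 1000 values
      for (int i = threadIdx.x; i < TILE*TILE*TILE; i += blockDim.x) {
          int tz = i % TILE, ty = (i / TILE) % TILE, tx = i / (TILE*TILE);
          int jx = ox + tx, jy = oy + ty, jz = oz + tz;
          tile[tx][ty][tz] = (jx < nx && jy < ny && jz < nz)
                                 ? u[jx*ny*nz + jy*nz + jz] : 0.0f;
      }
      __syncthreads();
      // ... this block's interpolation reads now hit `tile` instead of
      //     global memory ...
  }

the "cudaCache" alternative presumably means preferring L1 over shared memory, e.g. cudaFuncSetCacheConfig(tiledLookup, cudaFuncCachePreferL1), trading explicit tiling for hardware caching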

pairlists for rigid bodies

maybe for grids, depending on parallel structure of code

other ideas

interpolate density grid?

RBTODO markers (grep output):

BaseGrid.h: // RBTODO Fix?
BaseGrid.h- BaseGrid(); // cmaffeo2 (2015) moved this out of protected, cause I wanted BaseGrid in a struct
BaseGrid.h- // The most obvious of constructors.
BaseGrid.h- BaseGrid(Matrix3 basis0, Vector3 origin0, int nx0, int ny0, int nz0);
BaseGrid.h-
--
BaseGrid.h- float a0, a1, a2, a3;
BaseGrid.h-
BaseGrid.h: // RBTODO parallelize loops?
BaseGrid.h-
BaseGrid.h- // Mix along x, taking the derivative.
BaseGrid.h- float g2[4][4];
BaseGrid.h- for (int iy = 0; iy < 4; iy++) {
--
BaseGrid.h- }
BaseGrid.h-
BaseGrid.h: // RBTODO overload with optimized algorithm
BaseGrid.h- // skip transforms (assume identity basis)
BaseGrid.h- HOST DEVICE inline float interpolatePotential(Vector3 pos) const {
BaseGrid.h- // Find the home node.
BaseGrid.h- Vector3 l = basisInv.transform(pos - origin);
--
BaseGrid.h-
BaseGrid.h- // out of grid? return 0
BaseGrid.h: // RBTODO
BaseGrid.h-
BaseGrid.h- // Get the array jumps.
BaseGrid.h- int jump[3];
BaseGrid.h- jump[0] = nz*ny;
--
BaseGrid.h- // Find the values at the neighbors.
BaseGrid.h- float g1[4][4][4];
BaseGrid.h: //RBTODO parallelize?
BaseGrid.h- for (int ix = 0; ix < 4; ix++) {
BaseGrid.h- for (int iy = 0; iy < 4; iy++) {
BaseGrid.h- for (int iz = 0; iz < 4; iz++) {
BaseGrid.h- // Wrap around the periodic boundaries.
--
ComputeGridGrid.cuh- Matrix3 basis_rho, Vector3 origin_rho,
ComputeGridGrid.cuh- Matrix3 basis_u, Vector3 origin_u) {
ComputeGridGrid.cuh: // RBTODO http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
ComputeGridGrid.cuh- const unsigned int r_id = blockIdx.x * blockDim.x + threadIdx.x;
ComputeGridGrid.cuh-
ComputeGridGrid.cuh: // RBTODO parallelize transform
ComputeGridGrid.cuh- if (r_id > rho->size) // skip threads with no data
ComputeGridGrid.cuh- return;
ComputeGridGrid.cuh-
ComputeGridGrid.cuh- // Maybe: Tile grid data into shared memory
ComputeGridGrid.cuh: // RBTODO: think about localizing regions of grid data
ComputeGridGrid.cuh- Vector3 p = rho->getPosition(r_id, basis, origin);
ComputeGridGrid.cuh- float val = rho->val[r_id];
ComputeGridGrid.cuh-
ComputeGridGrid.cuh: // RBTODO reduce forces and torques
ComputeGridGrid.cuh- // http://www.cuvilib.com/Reduction.pdf
ComputeGridGrid.cuh-
ComputeGridGrid.cuh: // RBTODO combine interp methods and reduce repetition!
ComputeGridGrid.cuh- float energy = u->interpolatePotential(p);
ComputeGridGrid.cuh- Vector3 f = u->interpolateForceD(p);
ComputeGridGrid.cuh- Vector3 t = cross(p,f); // test if sign is correct!
ComputeGridGrid.cuh-
ComputeGridGrid.cuh: // RBTODO 3rd-law forces + torques
ComputeGridGrid.cuh-}
--
Configuration.cpp- cudaMemcpyHostToDevice));
Configuration.cpp- }
Configuration.cpp: // RBTODO: moved this out of preceding loop; was that correct?
Configuration.cpp- gpuErrchk(cudaMemcpyAsync(part_d, part_addr, sizeof(BrownianParticleType*) * numParts,
Configuration.cpp- cudaMemcpyHostToDevice));
Configuration.cpp-
Configuration.cpp-
--
Configuration.cpp- sz = sizeof(float) * len;
Configuration.cpp- gpuErrchk(cudaMemcpy( tmpData, g->val, sz, cudaMemcpyHostToDevice));
Configuration.cpp: // RBTODO: why can't this be deleted?
Configuration.cpp- // delete[] tmpData;
Configuration.cpp- }
Configuration.cpp- }
Configuration.cpp-
--
Configuration.cpp- sz = sizeof(float) * len;
Configuration.cpp- gpuErrchk(cudaMemcpy( tmpData, g->val, sz, cudaMemcpyHostToDevice));
Configuration.cpp: // RBTODO: why can't this be deleted?
Configuration.cpp- // delete[] tmpData;
Configuration.cpp- }
Configuration.cpp-
Configuration.cpp- }
--
RigidBodyGrid.h- \===============================*/
RigidBodyGrid.h-
RigidBodyGrid.h: // RBTODO Fix?
RigidBodyGrid.h- RigidBodyGrid(); // cmaffeo2 (2015) moved this out of protected, cause I wanted RigidBodyGrid in a struct
RigidBodyGrid.h- // The most obvious of constructors.
RigidBodyGrid.h- RigidBodyGrid(int nx0, int ny0, int nz0);
RigidBodyGrid.h-
--
RigidBodyGrid.h- float a0, a1, a2, a3;
RigidBodyGrid.h-
RigidBodyGrid.h: // RBTODO further parallelize loops? unlikely?
RigidBodyGrid.h-
RigidBodyGrid.h- // Mix along x, taking the derivative.
RigidBodyGrid.h- float g2[4][4];
RigidBodyGrid.h- for (int iy = 0; iy < 4; iy++) {
--
RigidBodyGrid.h-
RigidBodyGrid.h- // out of grid? return 0
RigidBodyGrid.h: // RBTODO
RigidBodyGrid.h-
RigidBodyGrid.h- // Get the array jumps.
RigidBodyGrid.h- int jump[3];
RigidBodyGrid.h- jump[0] = nz*ny;
--
RigidBodyGrid.h- w[2] = l.z - homeZ;
RigidBodyGrid.h- // Find the values at the neighbors.
RigidBodyGrid.h: float g1[4][4][4]; // RBTODO: inefficient for my algorithm?
RigidBodyGrid.h- for (int ix = 0; ix < 4; ix++) {
RigidBodyGrid.h- int jx = ix-1 + home[0];
RigidBodyGrid.h- for (int iy = 0; iy < 4; iy++) {
RigidBodyGrid.h- int jy = iy-1 + home[1];
--
RigidBodyGrid.h- // Assume zero value at edges
RigidBodyGrid.h- int jz = iz-1 + home[2];
RigidBodyGrid.h: // RBTODO: possible branch divergence in warp?
RigidBodyGrid.h- if (jx < 0 || jy < 0 || jz < 0 ||
RigidBodyGrid.h- jx >= nx || jz >= nz || jz >= nz) {
RigidBodyGrid.h- g1[ix][iy][iz] = 0;
RigidBodyGrid.h- } else {
--
RigidBodyGrid.h- // Find the values at the neighbors.
RigidBodyGrid.h- float g1[4][4][4];
RigidBodyGrid.h: //RBTODO parallelize?
RigidBodyGrid.h- for (int ix = 0; ix < 4; ix++) {
RigidBodyGrid.h- for (int iy = 0; iy < 4; iy++) {
RigidBodyGrid.h- for (int iz = 0; iz < 4; iz++) {
RigidBodyGrid.h- // Wrap around the periodic boundaries.
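
on the "RBTODO reduce forces and torques" item above, a minimal block-level sketch in the spirit of the linked reduction slides (shuffle-based; __shfl_down_sync needs CUDA 9+; names assumed):

  // Warp-shuffle sum, then one atomicAdd per warp.
  __inline__ __device__ float warpSum(float v) {
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
          v += __shfl_down_sync(0xffffffff, v, offset);
      return v;
  }

  __global__ void reduceForceX(const float* fx, float* total, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float v = (i < n) ? fx[i] : 0.0f;          // inactive lanes contribute 0
      v = warpSum(v);
      if ((threadIdx.x & (warpSize - 1)) == 0)   // lane 0 of each warp
          atomicAdd(total, v);
  }

applied componentwise, the same pattern sums the three force and three torque components per rigid body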