Compilation of an HPVM program involves the following steps:
1. `clang` takes the HPVM-C program and generates the corresponding LLVM IR (`main.ll`).
2. `opt` takes the generated LLVM IR (`main.ll`) and invokes the GenHPVM pass on it, which converts the HPVM-C function calls to HPVM intrinsics. This generates the HPVM textual representation (`main.hpvm.ll`).
3. `opt` takes the HPVM textual representation (`main.hpvm.ll`) and invokes the following passes in sequence:
* BuildDFG: Converts the textual representation to the internal HPVM representation.
* LocalMem and DFG2LLVM_OpenCL: Invoked only when GPU target is selected. Generates the kernel module (`main.kernels.ll`) and the portion of the host code that invokes the kernel into the host module (`main.host.ll`).
* DFG2LLVM_CPU: Generates either all, or the remainder of the host module (`main.host.ll`) depending on the chosen target.
* ClearDFG: Deletes the internal HPVM representation from memory.
4. `clang` is used to compile any remaining project files that will later be linked with the host module.
5. `llvm-link` takes the host module and all the other generated `.ll` files and links them with the HPVM runtime module (`hpvm-rt.bc`) to generate the linked host module (`main.host.linked.ll`).
6. Generate the executable code from the generated `.ll` files for all parts of the program:
* GPU target: `llvm-cbe` takes the kernel module (`main.kernels.ll`) and generates an OpenCL representation of the kernels that will be invoked by the host.
* CPU target: `clang` takes the linked host module (`main.host.linked.ll`) and generates the CPU binary.
An HPVM program is a combination of host code and one or more data flow graphs (DFGs) at the IR level. We provide C function declarations representing the HPVM intrinsics that allow creating, querying, and interacting with the DFGs. More details about the HPVM IR intrinsics can be found in [the HPVM IR Specification](/hpvm/docs/hpvm-specification.md).
An HPVM-C program contains both the host and the DFG code. Each HPVM kernel, represented by a leaf node in the DFG, can be compiled to multiple different targets (e.g. CPU and GPU) as described below.
This document describes all the API calls that can be used in an HPVM-C program.
## Host API
...
...
Atomically computes the bitwise XOR of ```v``` and the value stored at the specified memory location, and stores the result back into that location.
```void __hpvm__barrier()```
Local synchronization barrier across dynamic instances of current leaf node.
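As an illustration, the fragment below sketches a hypothetical leaf node in which each dynamic instance folds its element into a shared checksum word with the atomic XOR described above, then uses ```__hpvm__barrier()``` so that every instance can read the combined value. It is a sketch only: the atomic signature is assumed to be ```__hpvm__atomic_xor(int*, int)```, the query calls (```__hpvm__getNode```, ```__hpvm__getNodeInstanceID_x```) are the usual HPVM-C node-query functions, and the kernel and variable names are invented.
```c
#include <hpvm.h>
#include <stddef.h>

// Hypothetical leaf node: each dynamic instance XORs its input element into a
// shared checksum word, then synchronizes before reading the combined result.
void checksum_leaf(int *data, size_t bytesData, int *check, size_t bytesCheck) {
  __hpvm__hint(hpvm::GPU_TARGET);               // target hint (constant from hpvm.h)
  __hpvm__attributes(2, data, check, 1, check); // data, check read; check written
  long i = __hpvm__getNodeInstanceID_x(__hpvm__getNode());
  __hpvm__atomic_xor(check, data[i]);           // atomic w.r.t. the other dynamic instances
  __hpvm__barrier();                            // all instances have folded their element
  int combined = *check;                        // every instance now sees the same checksum
  (void)combined;                               // e.g., use the reduced value here
}
```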
# Porting a Program from C to HPVM-C
The following are the required steps to port a regular C program to an HPVM program using HPVM-C; a condensed sketch follows this list. These steps are described at a high level; for more detail, please see [hpvm-cava](/hpvm/test/benchmarks/hpvm-cava) provided in [benchmarks](/hpvm/test/benchmarks).
* Separate the computation that will become a kernel into its own (leaf node) function and add the attributes and target hint.
* Create a level 1 wrapper node function that will describe the thread-level parallelism (for the GPU). The node will:
* Use the ```createNode[ND]()``` method to create a kernel node and specify how many threads will execute it.
* Bind its arguments to the kernel arguments.
* If desired, create a level 2 wrapper node function which will describe the threadblock-level parallelism (for the GPU). This node will:
* Use the ```createNode[ND]()``` method to create a level 1 wrapper node and specify how many threadblocks will execute it.
* Bind its arguments to its child node's arguments.
* Create a root node function that creates all the top-level wrapper nodes, binds their arguments, and connects their edges.
* Each root node represents a DFG.
* All the above node functions have the combined arguments of all the kernels that are nested at each level.
* The host code will have to include the following:
* Initialize the HPVM runtime using the ```init()``` method.
* Create an argument struct for each DFG and assign its member variables.
* Add all the memory that is required by the kernel into the memory tracker.
* Launch the DFG by calling the ```launch()``` method on the root node function, and passing the corresponding argument struct.
* Wait for the DFG to complete execution.
* Read out any generated memory using the ```request_mem()``` method.
* Remove all the tracked memory from the memory tracker.
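For concreteness, the sketch below condenses these steps for a hypothetical vector-addition kernel. All names (```vadd_leaf```, ```RootArgs```, and so on) are invented, the optional level 2 wrapper is omitted, the hint placement is illustrative, and the HPVM-C and runtime calls (```__hpvm__createNodeND```, ```__hpvm__bindIn```, ```__hpvm__launch```, ```llvm_hpvm_track_mem```, etc.) are assumed to follow the API summarized in this document and used by the hpvm-cava benchmark. Treat it as an outline rather than a drop-in program.
```c
#include <hpvm.h>
#include <stdlib.h>

// Leaf node: one dynamic instance per element (thread-level parallelism).
void vadd_leaf(float *A, size_t bytesA, float *B, size_t bytesB,
               float *C, size_t bytesC) {
  __hpvm__hint(hpvm::GPU_TARGET);      // target hint for this kernel
  __hpvm__attributes(2, A, B, 1, C);   // A, B are read; C is written
  long i = __hpvm__getNodeInstanceID_x(__hpvm__getNode());
  C[i] = A[i] + B[i];
}

// Level 1 wrapper: creates N dynamic instances of the leaf and forwards its arguments.
void vadd_wrapper(float *A, size_t bytesA, float *B, size_t bytesB,
                  float *C, size_t bytesC, size_t N) {
  __hpvm__hint(hpvm::GPU_TARGET);
  __hpvm__attributes(2, A, B, 1, C);
  void *leaf = __hpvm__createNodeND(1, vadd_leaf, N);  // 1-D grid of N instances
  __hpvm__bindIn(leaf, 0, 0, 0);   // A      (wrapper arg -> leaf arg, non-streaming)
  __hpvm__bindIn(leaf, 1, 1, 0);   // bytesA
  __hpvm__bindIn(leaf, 2, 2, 0);   // B
  __hpvm__bindIn(leaf, 3, 3, 0);   // bytesB
  __hpvm__bindIn(leaf, 4, 4, 0);   // C
  __hpvm__bindIn(leaf, 5, 5, 0);   // bytesC
}

// Root node (one DFG): creates the top-level wrapper node and binds its arguments.
void vadd_root(float *A, size_t bytesA, float *B, size_t bytesB,
               float *C, size_t bytesC, size_t N) {
  __hpvm__hint(hpvm::CPU_TARGET);
  __hpvm__attributes(2, A, B, 1, C);
  void *w = __hpvm__createNodeND(0, vadd_wrapper);     // single instance
  __hpvm__bindIn(w, 0, 0, 0);
  __hpvm__bindIn(w, 1, 1, 0);
  __hpvm__bindIn(w, 2, 2, 0);
  __hpvm__bindIn(w, 3, 3, 0);
  __hpvm__bindIn(w, 4, 4, 0);
  __hpvm__bindIn(w, 5, 5, 0);
  __hpvm__bindIn(w, 6, 6, 0);      // N
}

// Argument struct for the root node: one field per root-node argument, in order.
typedef struct {
  float *A; size_t bytesA;
  float *B; size_t bytesB;
  float *C; size_t bytesC;
  size_t N;
} RootArgs;

int main() {
  const size_t N = 1024;
  size_t bytes = N * sizeof(float);
  float *A = (float *)malloc(bytes);
  float *B = (float *)malloc(bytes);
  float *C = (float *)malloc(bytes);

  __hpvm__init();                        // initialize the HPVM runtime
  llvm_hpvm_track_mem(A, bytes);         // register all memory used by the kernel
  llvm_hpvm_track_mem(B, bytes);
  llvm_hpvm_track_mem(C, bytes);

  RootArgs args = { A, bytes, B, bytes, C, bytes, N };
  void *dfg = __hpvm__launch(0, vadd_root, (void *)&args);  // non-streaming launch
  __hpvm__wait(dfg);                     // block until the DFG completes

  llvm_hpvm_request_mem(C, bytes);       // make the result visible to the host
  llvm_hpvm_untrack_mem(A);
  llvm_hpvm_untrack_mem(B);
  llvm_hpvm_untrack_mem(C);
  __hpvm__cleanup();
  return 0;
}
```
Each piece maps directly onto the steps above: the leaf holds the computation, the wrapper expresses thread-level parallelism and binds its arguments downward, the root builds the DFG, and the host code handles initialization, memory tracking, launch, wait, and readback.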
* [Intrinsics for Memory Allocation and Synchronization](#memory)
* [Intrinsics for Graph Interaction](#interaction)
* [Implementation Limitations](#limitations)
<aname="abstraction"></a>
# HPVM Abstraction
An HPVM program is a combination of host code plus a set of one or more distinct dataflow graphs. Each dataflow graph (DFG) is a hierarchical graph with side effects. The DFG must be acyclic. Nodes represent units of execution, and edges between nodes describe the explicit data transfer requirements. A node can begin execution once a data item becomes available on every one of its input edges. Repeated transfer of data items between nodes (if more inputs are provided) yields a pipelined execution of different nodes in the graph. The execution of a DFG is initiated and terminated by host code that launches the graph. Nodes may access globally shared memory through load and store instructions (side-effects).
<aname="node"></a>
## Dataflow Node
A *dataflow node* represents a unit of computation in the DFG. A node can begin execution once a data item becomes available on every one of its input edges.
A single static dataflow node represents multiple dynamic instances of the node, each executing the same computation with different index values used to uniquely identify each dynamic instance w.r.t. the others. The dynamic instances of a node may be executed concurrently, and any required synchronization must be imposed using HPVM synchronization operations.
...
...
Leaf nodes contain code expressing actual computations. Leaf nodes may contain instructions to query the structure of the underlying DFG.
Note that the graph is fully interpreted at compile-time and cannot be modified at runtime except for the number of dynamic instances, which can be data dependent.
<aname="edge"></a>
## Dataflow Edge
A *dataflow edge* from the output ```out``` of a source dataflow node ```Src``` to the input ```in``` of a sink dataflow node ```Dst``` describes the explicit data transfer requirements. ```Src``` and ```Dst``` nodes must belong to the same child graph, i.e. must be children of the same internal node.
An edge from source to sink has the semantics of copying the specified data from the source to the sink after the source node has completed execution. The pairs ```(Src, out)``` and ```(Dst, in)```, representing source and sink respectively, must be unique w.r.t. every other edge in the same child graph, i.e. two dataflow edges in the same child graph cannot have the same source or destination.
...
...
An edge can be instantiated at runtime using one of two replication mechanisms:
- *All-to-all*, where all dynamic instances of the source node are connected to all dynamic instances of the sink node, thus expressing a synchronization barrier between the two groups of nodes, or
- *One-to-one*, where each dynamic instance of the source node is connected to a single corresponding instance of the sink node. One-to-one replication requires that the grid structure (number of dimensions and the extents in each dimension) of the source and sink nodes be identical.
<aname="bind"></a>
## Input and Output Bind
An internal node is responsible for mapping its inputs, provided by incoming dataflow edges, to the inputs to one or more nodes of the child graph.
An internal node binds its input ```ip``` to input ```ic``` of its child node ```Dst``` using an *input bind*.
...
...
Conversely, an internal node binds output ```oc``` of its child node ```Src``` to its own output ```op``` using an *output bind*.
A bind is always ***all-to-all***.
<aname="host"></a>
## Host Code
In an HPVM program, the host code is responsible for setting up a DFG, initiating its execution, and blocking for its completion. The host can interact with the DFG to sustain a streaming computation by sending all data required for, and receiving all data produced by, one execution of the DFG. The list of actions that can be performed by the host is described below:
- **Initialization and Cleanup**:
...
...
The host code blocks for completion of the specified DFG.
- For a non-streaming DFG, the data produced by the DFG are ready to be read by the host.
- For a streaming DFG, no more data may be provided for processing by the DFG.
<aname="implementation"></a>
# HPVM Implementation
This section describes the implementation of HPVM on top of LLVM IR.
...
...
We represent nodes with opaque handles (pointers of LLVM type i8\*). We represent input edges of a node as integer indices into the list of arguments of its node function, and output edges of a node as integer indices into the return struct type of its node function.
Pointer arguments of node functions are required to be annotated with attributes ```in``` and/or ```out```, depending on their expected use (read only, write only, read write).
<aname="describing"></a>
## Intrinsics for Describing Graphs
The intrinsics for describing graphs can only be used by internal nodes. Also, internal nodes are only allowed to have these intrinsics as part of their node function, with the exception of a return statement of the appropriate type, in order to return the values that flow out on the node's outgoing dataflow edges.
...
...
Bind input ```ip``` of current node to input ```ic``` of child node ```N```. Argument ```ip``` of the current node function and argument ```ic``` of ```N```'s node function must have matching types. ```isStream``` chooses a streaming (1) or non streaming (0) bind.
```void llvm.hpvm.bind.output(i8* N, i32 oc, i32 op, i1 isStream)```
Bind output ```oc``` of child node ```N``` to output ```op``` of current node. Field ```oc``` of the return struct in ```N```'s node function and field ```op``` of the return struct in the current node function must have matching types. ```isStream``` chooses a streaming (1) or non streaming (0) bind.
<a name="querying"></a>
## Intrinsics for Querying Graphs
The following intrinsics are used to query the structure of the DFG. They can only be used by leaf nodes.
...
...
Get index of current dynamic node instance of node ```N``` in dimension x, y or z respectively.
Get number of dynamic instances of node ```N``` in dimension x, y or z respectively. The dimension must be one of the dimensions in which the node is replicated.
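At the HPVM-C level, which GenHPVM lowers to these intrinsics, a leaf node might use the queries as sketched below. The C-level names assumed here are ```__hpvm__getNode```, ```__hpvm__getNodeInstanceID_x```/```_y``` and ```__hpvm__getNumNodeInstances_x```, and the kernel itself is hypothetical.
```c
#include <hpvm.h>
#include <stddef.h>

// Hypothetical leaf node replicated over a 2-D grid: each dynamic instance
// computes its own flat index and scales one element of the image.
void scale_leaf(float *img, size_t bytesImg, float factor) {
  __hpvm__hint(hpvm::GPU_TARGET);
  __hpvm__attributes(1, img, 1, img);
  void *self = __hpvm__getNode();                 // handle of the current leaf node
  long x = __hpvm__getNodeInstanceID_x(self);     // index of this instance in x
  long y = __hpvm__getNodeInstanceID_y(self);     // index of this instance in y
  long w = __hpvm__getNumNodeInstances_x(self);   // extent of the instance grid in x
  img[y * w + x] *= factor;                       // one element per dynamic instance
}
```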
<a name="memory"></a>
## Intrinsics for Memory Allocation and Synchronization
The following intrinsics are used for memory allocation and synchronization. They can only be used by leaf nodes.
...
...
Atomically computes the bitwise XOR of ```v``` and the value stored at the specified memory location, and stores the result back into that location.
```void llvm.hpvm.barrier()```
Local synchronization barrier across dynamic instances of current leaf node.
<a name="interaction"></a>
## Intrinsics for Graph Interaction
The following intrinsics are for graph initialization/termination and interaction with the host code, and can be used only by the host code.
...
...
Push set of input data ```args``` (same as type included in launch) to streaming DFG with handle ```GraphID```.
```i8* llvm.hpvm.pop(i8* GraphID)```
Pop and return data from streaming DFG with handle ```GraphID```. The return type is a struct containing a field for every output of DFG.
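The host-side sketch below shows how a streaming DFG might be driven, using the HPVM-C counterparts assumed for these intrinsics (```__hpvm__launch``` with ```isStream``` set to 1, ```__hpvm__push```, ```__hpvm__pop```, ```__hpvm__wait```). The root node, the ```FrameArgs``` struct and all names are hypothetical, and memory tracking of the frame buffers is omitted for brevity.
```c
#include <hpvm.h>
#include <stddef.h>

void stream_root(float *frame, size_t bytesFrame);  // hypothetical root node (not shown)

// One field per root-node argument, i.e. the type pushed on each graph execution.
typedef struct { float *frame; size_t bytesFrame; } FrameArgs;

void process_stream(float **frames, size_t bytesFrame, int numFrames) {
  FrameArgs args = { frames[0], bytesFrame };
  void *dfg = __hpvm__launch(1, stream_root, (void *)&args);  // streaming launch
  for (int i = 0; i < numFrames; i++) {
    args.frame = frames[i];             // next set of inputs for one graph execution
    __hpvm__push(dfg, (void *)&args);   // push inputs to the streaming DFG
    void *out = __hpvm__pop(dfg);       // block for the matching set of outputs
    (void)out;                          // struct with one field per DFG output
  }
  __hpvm__wait(dfg);                    // no more data; block until the DFG finishes
}
```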
<aname="limitations"></a>
## Implementation Limitations
Due to limitations of our current prototype implementation, the following restrictions are imposed: