Commit 5f55483f authored by Yifan Zhao

Merge branch 'hpvm-release/docs' into hpvm-release-exp

parents 99424f70 1b418b04
......@@ -83,8 +83,9 @@ Supported/tested GPU architectures for Tensor Backend:
* Nvidia Jetson TX2
* Nvidia GeForce GTX 1080
HPVM has not been tested but might work on other CPUs supported by LLVM Backend,
and GPUs supported by OpenCL such as Intel, AMD, etc.
HPVM has not been tested on other architectures,
but it is expected to work on CPUs supported by the LLVM Backend
and GPUs supported by OpenCL (Intel, AMD, etc.).
**NOTE**: Approximations are tuned for Jetson TX2, and the same speedups may not be achieved on other architectures.
......
......@@ -3,16 +3,11 @@ Components
HPVM consists of a few relatively independent key components.
* Frontends (Keras/PyTorch): code generators in Python for lowering Keras and PyTorch
DNN models into HPVM-C format.
* Patched LLVM: provides HPVM IR and a compilation infrastructure, including ``clang`` and ``opt``.
* HPVM code generator: a few ``opt`` passes that lower HPVM IR to LLVM IR,
which is then compiled into object code and binary.
:doc:`Compilation process of HPVM </specifications/hpvm-spec>`
shows how these 2 components work together.
In addition, there are:
* Frontends (Keras/PyTorch): code generators in Python for lowering Keras and PyTorch
DNN models into HPVM-C format.
* Predictive tuner: an autotuner library in Python for finding approximation choices (configurations)
with best performance gain within some loss of Quality of Service (QoS, such as accuracy).
* HPVM profiler: an API in Python for measuring real performance of configurations.
......
......@@ -9,17 +9,17 @@ Perforated Convolutions
Overview
^^^^^^^^^
The core idea of perforated convolutions is to compute a subset of the output tensor elements and interpolate the missing elements. Specifically, we include an implementation that skips entire output rows or columns (configurable through knobs), and intepolates the missing rows/columns through neighbour averaging.
Since the approximation reduces the number of output elemennts that need to be computed, it reduces the multiply-accumulate (MAC) operations, and reduces memory bandwidth usage (loads subset of data), hence resulting in both speedups and energy reductions. Our implementation performs the perforation at fixed strides (e.g. skip 1 out of every 3 rows) and the rate of perforation is a configurable knob. The type of perforation (row/col) is also configurable. Another knob is the starting offset - the tensor row/column index to start the perforation (i.e., skipping) from.
The core idea of perforated convolutions is to compute a subset of the output tensor elements and interpolate the missing elements. Specifically, we include an implementation that skips entire output rows or columns (configurable through knobs), and interpolates the missing rows/columns through neighbour averaging.
Since the approximation reduces the number of output elements that need to be computed, it reduces the multiply-accumulate (MAC) operations, and reduces memory bandwidth usage (loads subset of data), hence resulting in both speedups and energy reductions. Our implementation performs the perforation at fixed strides (e.g. skip 1 out of every 3 rows) and the rate of perforation is a configurable knob. The type of perforation (row/col) is also configurable. Another knob is the starting offset - the tensor row/column index to start the perforation (i.e., skipping) from.
Description
^^^^^^^^^^^
Our implementation for perforated convolution is a three-step process:
* **Patch matrix creation:** Based on indices of the rows/columns to be perforated, the corresponding elements of the input tensor are used to create a new matrix called an input-patch matrix. The input-patch matrix is a matrix laid out in memory such that convolution is reduced to a simple matrix multiplication operation. This approach is similar to one described in this `paper <https://dl.acm.org/doi/abs/10.1145/2964284.2967243>`_.
* **Patch matrix creation:** Based on indices of the rows/columns to be perforated, the corresponding elements of the input tensor are used to create a new matrix called an input-patch matrix. The input-patch matrix is a matrix laid out in memory such that convolution is reduced to a simple matrix multiplication operation. This approach is similar to one described in this `paper <https://dl.acm.org/doi/abs/10.1145/2964284.2967243>`__.
* **Dense matrix multiplication:** This step involves performing a matrix multiplication in a manner very similar to described in this `paper <https://arxiv.org/pdf/1704.04428.pdf>`_. Note that these matrices are dense (no sparsity in the tensor representation).
* **Dense matrix multiplication:** This step involves performing a matrix multiplication in a manner very similar to described in this `paper <https://arxiv.org/pdf/1704.04428.pdf>`__. Note that these matrices are dense (no sparsity in the tensor representation).
* **Interpolation of missing values:** A new tensor is created with the dimensions the output tensor is expected to have. The perforated tensor output (after matrix multiplication) is copied to the corresponding rows/columns, and for the skipped rows/columns, the output values are populated by interpolation, using the arithmetic mean of the neighbouring elements. For column perforation, these neighbouring elements are the right and left elements of the skipped element. For row perforation, the top and bottom neighbouring elements are used for interpolation. The output of this step is the approximate (perforated) convolution result.
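The interpolation step for row perforation can be illustrated with a minimal CPU sketch. It is not the ``hpvm-tensor-rt`` implementation: the function names, the rule for which rows are skipped (every ``rate``-th row starting at ``offset``), and the handling of border rows with a single neighbour are all assumptions made for illustration.

```c
#include <stddef.h>

/* Assumed skipping rule (hypothetical): row r is skipped when it lies at
 * or after `offset` and is a multiple of `rate` rows away from it.
 * `rate` must be >= 2 so that skipped rows are never adjacent. */
static int row_is_skipped(size_t r, size_t rate, size_t offset) {
    return r >= offset && (r - offset) % rate == 0;
}

/* Expand `compact` (only the computed rows, row-major, `cols` wide) into
 * `full` (rows x cols), filling each skipped row with the arithmetic mean
 * of the rows directly above and below it. */
void interpolate_rows(const float *compact, float *full,
                      size_t rows, size_t cols,
                      size_t rate, size_t offset) {
    size_t src = 0;

    /* Pass 1: copy every computed row to its final position. */
    for (size_t r = 0; r < rows; ++r) {
        if (row_is_skipped(r, rate, offset)) continue;
        for (size_t c = 0; c < cols; ++c)
            full[r * cols + c] = compact[src * cols + c];
        ++src;
    }

    /* Pass 2: fill skipped rows from their top and bottom neighbours.
     * A border row with one neighbour copies it (an assumption). */
    for (size_t r = 0; r < rows; ++r) {
        if (!row_is_skipped(r, rate, offset)) continue;
        int has_up = r > 0;
        int has_down = r + 1 < rows;
        for (size_t c = 0; c < cols; ++c) {
            float up = has_up ? full[(r - 1) * cols + c] : 0.0f;
            float down = has_down ? full[(r + 1) * cols + c] : 0.0f;
            if (has_up && has_down)
                full[r * cols + c] = 0.5f * (up + down);
            else
                full[r * cols + c] = has_up ? up : down;
        }
    }
}
```

Column perforation would follow the same pattern with the left and right neighbours in place of the top and bottom ones.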
......
......@@ -197,7 +197,7 @@ Consider the following leaf node function which performs a tensor convolution:
}
3. The rest of the pass can be viewed as a dictionary mapping from ApproxHPVM intrinsics representing tensor operations such as convolutions to their corresponding CuDNN functions.
3. The rest of the pass can be viewed as a dictionary mapping from HPVM intrinsics representing tensor operations such as convolutions to their corresponding CuDNN functions.
.. code-block:: c
......@@ -235,7 +235,7 @@ codeGen(DFLeafNode* )
---------------------
While the pass is generic, we only support `TENSOR_TARGET` (this hint implies HPVM nodes with tensor operations) nodes for fusion.
Additionally each leaf node is first identified as being a valid HPVM tensor node (i.e. contains ApproxHPVM intrinsics as the first intrinsic).
Additionally each leaf node is first identified as being a valid HPVM tensor node (i.e. contains HPVM intrinsics as the first intrinsic).
Consider the following consecutive leaf nodes:
......@@ -380,7 +380,7 @@ Let’s consider the end result of the `FuseHPVMTensorNodes` example:
void *r2 = __hpvm__tensor_add(r1, t3);
void *r3 = __hpvm__tensor_relu(r2);
void *r4 = __hpvm__tensor_pool_max(r3, 3, 3, 0, 0, 2, 2);
__hpvm__return(2, r3, (size_t)0);
__hpvm__return(2, r4, (size_t)0);
}
Similar to the FuseHPVMTensorNodes example, the DFG2LLVM_WrapperAPI pass also has fusion patterns. However in this pass, the tensor operations are within a single node.
......@@ -405,6 +405,8 @@ codeGen(DFLeafNode* )
void all_fused_wrapper_api(void *t1, size_t bytes_t1, void *t2,size_t bytes_t2, void *t3, size_t bytes_t3) {
__hpvm_request_tensor(t1, /* GPU */ 1);
__hpvm_request_tensor(t2, /* GPU */ 1);
__hpvm_request_tensor(t3, /* GPU */ 1);
......@@ -412,7 +414,7 @@ codeGen(DFLeafNode* )
void *r2 = __hpvm__tensor_add(r1, t3);
void *r3 = __hpvm__tensor_relu(r2);
void *r4 = __hpvm__tensor_pool_max(r3, 3, 3, 0, 0, 2, 2);
__hpvm__return(2, r3, (size_t)0);
__hpvm__return(2, r4, (size_t)0);
}
......@@ -423,16 +425,17 @@ codeGen(DFLeafNode* )
void all_fused_wrapper_api(void *t1, size_t bytes_t1, void *t2, size_t bytes_t2, void *t3, size_t bytes_t3) {
__hpvm_request_tensor(t1, /* GPU */ 1);
__hpvm_request_tensor(t2, /* GPU */ 1);
__hpvm_request_tensor(t3, /* GPU */ 1);
void* w1 = wrapper_ConvLayer2(all_fused_wra...”, …);
void* w1 = wrapper_ConvLayer2("all_fused_wra..." /* , ... */); // some arguments omitted
void *r1 = __hpvm__tensor_convolution(t1, t2, 2, 2, 4, 4);
void *r2 = __hpvm__tensor_add(r1, t3);
void *r3 = __hpvm__tensor_relu(r2);
void *r4 = __hpvm__tensor_pool_max(r3, 3, 3, 0, 0, 2, 2);
__hpvm__return(2, r3, (size_t)0);
__hpvm__return(2, r4, (size_t)0);
}
3. The remaining arguments of the wrapper_convLayer2 call are taken from the arguments passed to the individual tensor operations from which the fused call is made.
......@@ -442,15 +445,16 @@ codeGen(DFLeafNode* )
void all_fused_wrapper_api(void *t1, size_t bytes_t1, void *t2, size_t bytes_t2, void *t3, size_t bytes_t3) {
__hpvm_request_tensor(t1, /* GPU */ 1);
__hpvm_request_tensor(t2, /* GPU */ 1);
__hpvm_request_tensor(t3, /* GPU */ 1);
void* w1 = wrapper_ConvLayer2(all_fused_wra..., t1, t2, t3, 2, 2, 4, 4, 0, 3, 3, 0, 0, 2, 2, 0.0, 0.0);
void* w1 = wrapper_ConvLayer2("all_fused_wra...", t1, t2, t3, 2, 2, 4, 4, 0, 3, 3, 0, 0, 2, 2, 0.0, 0.0);
void *r1 = __hpvm__tensor_convolution(t1, t2, 2, 2, 4, 4);
void *r2 = __hpvm__tensor_add(r1, t3);
void *r3 = __hpvm__tensor_relu(r2);
void *r4 = __hpvm__tensor_pool_max(r3, 3, 3, 0, 0, 2, 2);
__hpvm__return(2, r3, (size_t)0);
__hpvm__return(2, r4, (size_t)0);
}
4. Finally, the original operations are removed and the final values uses are replaced with the wrapper function call result.
......
Frequently Asked Questions
==========================
#. **Is Python3.6 a strict requirement for installation?**
**Q1. Is Python3.6 a strict requirement for installation**?
Yes, our HPVM python packages require python version = 3.6.
If you don't have a Python3.6 on your system, we encourage using the provided ``env.yaml`` conda environment.
Yes, our HPVM python packages require python version = 3.6. If you don't have a Python3.6 on your system, we encourage using the provided `env.yaml` conda environment.
#. **What is a "target device" or the "profiling stage"?
Why does the tutorial seem to suggest building HPVM** :ref:`on a second device<target-profiling>`?
**Q2. What to do when running into out of memory errors**?
HPVM is capable of *predictive approximation tuning* which, due to its computational cost,
is often done on a powerful computer, like a server,
but the selected approximations are usually used to speed up your application
on a less powerful device (the *target device*, such as an edge device).
The profiling stage (using `hpvm-profiler`) is necessary so that the real speedup of approximations is measured,
and this is also done on the target device.
See our `ApproxTuner paper <https://dl.acm.org/doi/10.1145/3437801.3446108>`_ for more details on this.
Users can configure the batch size through Keras/PyTorch frontends. Users are encouraged to reduce batch size when encountering out of memory errors.
Currently, HPVM must be built on both the server and the target device for this purpose.
We will achieve better server/edge separation of HPVM in the following releases,
so that only the necessary parts of the code are built on each device.
**Q3. Should I expect speedups with approximations on my hardware system**?
#. **What are the expected speedups with approximations on my target device?**
The approximation implementations in HPVM are currently optimized for the Nvidia Tegra Tx2 edge device. The routines are not expected to provide speedups across other hardware devices - though systems with similar hardware specifications may exhibit similar performance. We are working on providing speedups across a wider range of devices.
The approximation implementations in HPVM are currently only optimized for
`Nvidia Tegra TX2 <https://developer.nvidia.com/embedded/jetson-tx2>`_.
The routines may not provide the same speedup on other hardware devices --
though systems with similar hardware specifications may exhibit similar performance.
We are working on providing speedups across a wider range of devices.
**Q4. How many autotuning iterations should I use with `predtuner` package in HPVM**?
#. **Why doesn't the conda environment / Python packages installation work on Jetson boards?**
The number of tuning iterations required to achieve good results varies across benchmarks. Users must tune this on a per-benchmark basis. For the included 10 CNNs, we recommmend using atleast 10K iterations.
You may be seeing errors like
**Q5. How can I extend HPVM to include new custom approximations**?
.. code-block:: text
Users can update the `hpvm-tensor-rt` in HPVM to include new custom approximations that are targeted by the compiler.
ResolvePackageNotFound:
pytorch==1.6.0
Alternatively developers can update the HPVM backends to compile to external libraries with support for custom approximations. The HPVM backends are documented in detail in [TODO : Add link to Backends Doc]
or other errors indicating ``pytorch``, ``torchvision`` or other packages cannot be installed,
because these packages are not prebuilt for ARM CPU on `PyPI <https://pypi.org/>`_.
The `predtuner` in HPVM is flexible to include more approximation knobs. [TODO: Yifan should add more details on how to add more knobs]
The simplest solution is not to install HPVM frontends and autotuner;
see :ref:`this <skip-pypkg>` for how to do so.
The job of these packages is best left to a server machine.
**Q6. Does this release support combining HPVM tensor and non-tensor operations in a single program**?
#. **What to do when running into "CUDA out of memory" errors?**
Currently we do not support tensor and non-tensor code in the same application. We will support this feature in the next release.
When the Keras/PyTorch frontends generate code, they accept a "batch size" parameter,
which decides the batch size at which the DNN inference runs.
You may need to reduce batch size when encountering out of memory errors.
**Q7. Does this release support object detection models?**
#. **How many autotuning iterations should I use with PredTuner package in HPVM?**
Currrently, HPVM doesn't support object detection models. Support will be added in future releases.
The number of tuning iterations required to achieve good results varies across benchmarks
and should be figured out on a per-benchmark basis.
For the included 10 CNNs, we recommend using at least 10K iterations.
#. **Does this release support combining HPVM tensor and non-tensor operations in a single program?**
Currently we do not support tensor and non-tensor code in the same application.
We will support this feature in the next release.
#. **Does this release support object detection models?**
Currently, HPVM doesn't support object detection models,
due to the limited number of operators supported in the tensor library `hpvm-tensor-rt`.
We will add support for more operators in the next release.
......@@ -16,14 +16,14 @@ Audience
--------
The intended audience for HPVM includes researchers and developers working in the areas of
compilers,programming languages, approximate computing, software optimization,
compilers, programming languages, approximate computing, software optimization,
static and dynamic program analysis, and systems for machine learning.
`HPVM <https://dl.acm.org/doi/pdf/10.1145/3200691.3178493>`_
is a retargetable compiler infrastructure that targets CPUs, CPUs, and accelerators
includes a retargetable compiler infrastructure that targets CPUs, GPUs, and accelerators
(this release does not include accelerator support)
and uses a portable compiler IR that explicitly represents data flow at the IR level,
and supports task, data, and pipelined parallelism.
It supports task, data, and pipelined parallelism.
HPVM provides an extensible platform that compiler and programming languages
researchers can use as part of their work.
......
.. role:: raw-html-m2r(raw)
:format: html
.. |br| raw:: html
<br/>
HPVM-C Language Specification
=============================
An HPVM program is a combination of host code and one or more data flow graphs (DFG) at the IR level. We provide C function declarations representing the HPVM intrinsics that allow creating, querying, and interacting with the DFGs. More details about the HPVM IR intrinsics can be found in `the HPVM IR Specification <hpvm-specification.html>`_.
An HPVM program is a combination of host code and one or more data flow graphs (DFG) at the IR level.
We provide C function declarations representing the HPVM intrinsics that allow
creating, querying, and interacting with the DFGs.
More details about the HPVM IR intrinsics can be found in the
:doc:`hpvm-spec`.
An HPVM-C program contains both the host and the DFG code. Each HPVM kernel, represented by a leaf node in the DFG, can be compiled to multiple different targets (e.g. CPU and GPU) as described below.
......@@ -14,106 +18,106 @@ This document describes all the API calls that can be used in an HPVM-C program.
Host API
--------
``void __hpvm__init()``:raw-html-m2r:`<br>`
``void __hpvm__init()`` |br|
Used before all other HPVM calls to initialize the HPVM runtime.
``void __hpvm__cleanup()``:raw-html-m2r:`<br>`
``void __hpvm__cleanup()`` |br|
Used at the end of HPVM program to clean up all remaining runtime-created HPVM objects.
``void llvm_hpvm_track_mem(void* ptr, size_t sz)``:raw-html-m2r:`<br>`
``void llvm_hpvm_track_mem(void* ptr, size_t sz)`` |br|
Insert memory starting at ``ptr`` of size ``sz`` in the memory tracker of HPVM runtime.
``void llvm_hpvm_untrack_mem(void* ptr)``:raw-html-m2r:`<br>`
``void llvm_hpvm_untrack_mem(void* ptr)`` |br|
Stop tracking the memory object identified by ``ptr``.
``void llvm_hpvm_request_mem(void* ptr, size_t sz)``:raw-html-m2r:`<br>`
``void llvm_hpvm_request_mem(void* ptr, size_t sz)`` |br|
If the memory object identified by ``ptr`` is not in host memory, copy it to host memory.
``void* __hpvm__launch(unsigned isStream, void* rootGraph, void* args)``:raw-html-m2r:`<br>`
``void* __hpvm__launch(unsigned isStream, void* rootGraph, void* args)`` |br|
Launches the execution of the dataflow graph with node function ``rootGraph``. ``args`` is a pointer to a packed struct, containing one field per argument of the RootGraph function, consecutively. For non-streaming DFGs with a non empty result type, ``args`` must contain an additional field of the type ``RootGraph.returnTy``, where the result of the graph will be returned. ``isStream`` chooses between a non streaming (0) or streaming (1) graph execution. Returns a handle to the executing graph.
``void __hpvm__wait(void* G)``:raw-html-m2r:`<br>`
``void __hpvm__wait(void* G)`` |br|
Waits for completion of execution of the dataflow graph with handle ``G``.
``void __hpvm__push(void* G, void* args)``:raw-html-m2r:`<br>`
``void __hpvm__push(void* G, void* args)`` |br|
Push set of input data items, ``args``, (same as type included in launch) to streaming DFG with handle ``G``.
``void* __hpvm__pop(void* G)``:raw-html-m2r:`<br>`
``void* __hpvm__pop(void* G)`` |br|
Pop and return data produced from one execution of streaming DFG with handle ``G``. The return type is a struct containing a field for every output of DFG.
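As an illustration of the packed ``args`` struct that ``__hpvm__launch`` expects, the sketch below lays out the arguments of a hypothetical root node function ``void root(float *in, size_t bytes_in, float *out, size_t bytes_out)`` followed by a field for the graph result. The type names ``RootRet`` and ``RootArgs``, the helper ``make_root_args``, and the shape of the result type are assumptions made for this example; ``__attribute__((packed))`` is the GCC/Clang spelling of a packed struct.

```c
#include <stddef.h>

/* Assumed result type of the hypothetical root graph (one size_t). */
typedef struct __attribute__((packed)) {
    size_t bytes_result;
} RootRet;

/* One field per root-function argument, in declaration order, followed
 * by the extra result field required for non-streaming DFGs with a
 * non-empty result type. */
typedef struct __attribute__((packed)) {
    float *in;
    size_t bytes_in;
    float *out;
    size_t bytes_out;
    RootRet ret; /* where the result of the graph will be returned */
} RootArgs;

/* Hypothetical helper: fill the struct whose address would be passed
 * as the `args` parameter of __hpvm__launch. */
RootArgs make_root_args(float *in, size_t n_in, float *out, size_t n_out) {
    RootArgs a = { in, n_in * sizeof(float),
                   out, n_out * sizeof(float),
                   { 0 } };
    return a;
}
```

Host code would then pass a pointer to such a struct to ``__hpvm__launch`` and read the ``ret`` field after ``__hpvm__wait`` returns.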
Internal Node API
-----------------
``void* __hpvm__createNodeND(unsigned dims, void* F, ...)``:raw-html-m2r:`<br>`
``void* __hpvm__createNodeND(unsigned dims, void* F, ...)`` |br|
Creates a static dataflow node replicated in ``dims`` dimensions (0 to 3), each executing node function ``F``. The arguments following ``F`` are the size of each dimension, respectively, passed in as a ``size_t``. Returns a handle to the created dataflow node.
``void* __hpvm__edge(void* src, void* dst, unsigned replType, unsigned sp, unsigned dp, unsigned isStream)``:raw-html-m2r:`<br>`
``void* __hpvm__edge(void* src, void* dst, unsigned replType, unsigned sp, unsigned dp, unsigned isStream)`` |br|
Creates an edge from output ``sp`` of node ``src`` to input ``dp`` of node ``dst``. If ``replType`` is 0, the edge is a one-to-one edge, otherwise it is an all-to-all edge. ``isStream`` defines whether or not the edge is streaming. Returns a handle to the created edge.
``void __hpvm__bindIn(void* N, unsigned ip, unsigned ic, unsigned isStream)``:raw-html-m2r:`<br>`
``void __hpvm__bindIn(void* N, unsigned ip, unsigned ic, unsigned isStream)`` |br|
Binds the input ``ip`` of the current node to input ``ic`` of child node function ``N``. ``isStream`` defines whether or not the input bind is streaming.
``void __hpvm__bindOut(void* N, unsigned op, unsigned oc, unsigned isStream)``:raw-html-m2r:`<br>`
``void __hpvm__bindOut(void* N, unsigned op, unsigned oc, unsigned isStream)`` |br|
Binds the output ``op`` of the current node to output ``oc`` of child node function ``N``. ``isStream`` defines whether or not the output bind is streaming.
``void __hpvm__hint(enum Target target)`` (C):raw-html-m2r:`<br>`
``void __hpvm__hint(hpvm::Target target)`` (C++):raw-html-m2r:`<br>`
``void __hpvm__hint(enum Target target)`` (C) |br|
``void __hpvm__hint(hpvm::Target target)`` (C++) |br|
Must be called once in each node function. Indicates which hardware target the current function should run in.
``void __hpvm__attributes(unsigned ni, , unsigned no, …)``:raw-html-m2r:`<br>`
``void __hpvm__attributes(unsigned ni, ..., unsigned no, ...)`` |br|
Must be called once at the beginning of each node function. Defines the properties of the pointer arguments to the current function. ``ni`` represents the number of input arguments, and ``no`` the number of output arguments. The arguments following ``ni`` are the input arguments, and the arguments following ``no`` are the output arguments. Arguments can be marked as both input and output. All pointer arguments must be included.
Leaf Node API
-------------
``void __hpvm__hint(enum Target target)`` (C):raw-html-m2r:`<br>`
``void __hpvm__hint(hpvm::Target target)`` (C++):raw-html-m2r:`<br>`
``void __hpvm__hint(enum Target target)`` (C) |br|
``void __hpvm__hint(hpvm::Target target)`` (C++) |br|
As described in internal node API.
``void __hpvm__attributes(unsigned ni, , unsigned no, …)``:raw-html-m2r:`<br>`
``void __hpvm__attributes(unsigned ni, ..., unsigned no, ...)`` |br|
As described in internal node API.
``void __hpvm__return(unsigned n, ...)``:raw-html-m2r:`<br>`
``void __hpvm__return(unsigned n, ...)`` |br|
Returns ``n`` values from a leaf node function. The remaining arguments are the values to be returned. All ``__hpvm__return`` statements within the same function must return the same number of values.
``void* __hpvm__getNode()``:raw-html-m2r:`<br>`
``void* __hpvm__getNode()`` |br|
Returns a handle to the current leaf node.
``void* __hpvm__getParentNode(void* N)``:raw-html-m2r:`<br>`
``void* __hpvm__getParentNode(void* N)`` |br|
Returns a handle to the parent node of node ``N``.
``long __hpvm__getNodeInstanceID_{x,y,z}(void* N)``:raw-html-m2r:`<br>`
``long __hpvm__getNodeInstanceID_{x,y,z}(void* N)`` |br|
Returns the dynamic ID of the current instance of node ``N`` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
``long __hpvm__getNumNodeInstances_{x,y,z}(void* N)``:raw-html-m2r:`<br>`
``long __hpvm__getNumNodeInstances_{x,y,z}(void* N)`` |br|
Returns the number of dynamic instances of node ``N`` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
``void* __hpvm__malloc(long nBytes)``:raw-html-m2r:`<br>`
``void* __hpvm__malloc(long nBytes)`` |br|
Allocates a block of memory of size ``nBytes`` and returns a pointer to it. The allocated object can be shared by all nodes. *Note that the returned pointer must somehow be communicated explicitly for use by other nodes.*
``int __hpvm__atomic_add(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_add(int* m, int v)`` |br|
Atomically adds ``v`` to the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_sub(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_sub(int* m, int v)`` |br|
Atomically subtracts ``v`` from the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_min(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_min(int* m, int v)`` |br|
Atomically computes the min of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_max(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_max(int* m, int v)`` |br|
Atomically computes the max of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_xchg(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_xchg(int* m, int v)`` |br|
Atomically swaps ``v`` with the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_and(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_and(int* m, int v)`` |br|
Atomically computes the bitwise AND of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_or(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_or(int* m, int v)`` |br|
Atomically computes the bitwise OR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``int __hpvm__atomic_xor(int* m, int v)``:raw-html-m2r:`<br>`
``int __hpvm__atomic_xor(int* m, int v)`` |br|
Atomically computes the bitwise XOR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
``void __hpvm__barrier()``:raw-html-m2r:`<br>`
``void __hpvm__barrier()`` |br|
Local synchronization barrier across dynamic instances of current leaf node.
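The "returns the value previously stored" semantics of the atomic intrinsics above match the fetch-and-modify operations of C11 ``<stdatomic.h>``. The sketch below is a host-side analogue for intuition only, not how HPVM lowers these intrinsics: the ``sketch_`` function names are hypothetical, and it uses ``atomic_int`` where the HPVM-C declarations take a plain ``int*``.

```c
#include <stdatomic.h>

/* Analogue of __hpvm__atomic_add: adds v and returns the old value. */
int sketch_atomic_add(atomic_int *m, int v) {
    return atomic_fetch_add(m, v);
}

/* Analogue of __hpvm__atomic_xchg: stores v and returns the old value. */
int sketch_atomic_xchg(atomic_int *m, int v) {
    return atomic_exchange(m, v);
}

/* Analogue of __hpvm__atomic_min. C11 has no fetch-min, so this uses a
 * compare-and-swap loop: retry until the stored value is already <= v
 * or our store of v succeeds. Returns the value seen before the update. */
int sketch_atomic_min(atomic_int *m, int v) {
    int old = atomic_load(m);
    while (old > v && !atomic_compare_exchange_weak(m, &old, v)) {
        /* On failure, `old` has been reloaded with the current value. */
    }
    return old;
}
```

The sub, max, and, or, and xor variants follow the same shape, using ``atomic_fetch_sub``, a mirrored compare-and-swap loop, ``atomic_fetch_and``, ``atomic_fetch_or`` and ``atomic_fetch_xor`` respectively.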
......@@ -22,7 +22,7 @@ Getting Started
Let's look at an example that uses DNNs and weights pre-shipped with HPVM.
This is found at ``hpvm/test/dnn_benchmarks/pytorch/test_frontend.py``.
*Note* that below we'll be working under directory ``hpvm/test/dnn_benchmarks/pytorch``.
**Note** that below we'll be working under directory ``hpvm/test/dnn_benchmarks/pytorch``.
We'll be generating ResNet-18 into an HPVM-compiled binary.
First, prepare 2 datasets for autotuning and testing.
......@@ -40,12 +40,11 @@ First, prepare 2 datasets for autotuning and testing.
``BinDataset`` is a dataset created over files of ApproxHPVM dataset format.
Any instance of ``torch.utils.data.Dataset`` can be used here.
*Note* that each ``module`` is bound to 2 datasets: a "tune" and a "test" set.
**Note** that each ``module`` is bound to 2 datasets: a "tune" and a "test" set.
The generated binary accepts an argument to be either the string "tune" or "test",
and performs inference over a dataset accordingly.
This is because the dataset can contain arbitrary Python code which cannot yet be exported into HPVM-C;
instead the frontend has to export some predefined datasets for the model to use.
See TODOs (1).
Create a DNN ``module`` and load the checkpoint:
......@@ -81,7 +80,7 @@ Now we are ready to export the model. The main functioning class of ``torch2hpvm
and path to the compiled binary respectively.
``batch_size`` is the batch size the binary uses during inference.
*Note* that ``conf_file`` is the path to an HPVM approximation configuration file.
**Note** that ``conf_file`` is the path to an HPVM approximation configuration file.
This file decides what approximation the binary will use during inference.
This path is hardcoded into the binary and is only read when the binary starts,
so it's fine to have ``conf_file`` point to a non-existing path.
......@@ -134,7 +133,6 @@ This choice of operators is largely constrained by backend (tensor_runtime) supp
TODOs
-----
#. Optionally insert a Python-C interface in the generated binary to
call back into a Dataset class and read the data.
......
......@@ -63,7 +63,7 @@ The following targets runs these tests respectively:
* ``make -j check-hpvm-dnn`` runs all 20 DNN benchmarks under ``dnn_benchmarks/hpvm-c``
(10 DNNs x 2 versions) and validates their accuracy.
*Note* that this can take quite long due to the size of DNNs and datasets.
**Note** that this can take quite long due to the size of DNNs and datasets.
Depending on your hardware capability, this test can take 5-30 minutes.
Also, this is set to run sequentially out of GPU memory concerns.
......