diff --git a/hpvm/docs/build-hpvm.rst b/hpvm/docs/build-hpvm.rst
index 4e31ad8c7f4e7f43f998c17493d80f9c9cfeecd1..43f6e0f67fc58ba7e7b2e7386e8c55191eb801eb 100644
--- a/hpvm/docs/build-hpvm.rst
+++ b/hpvm/docs/build-hpvm.rst
@@ -83,8 +83,9 @@ Supported/tested GPU architectures for Tensor Backend:
 * Nvidia Jetson TX2
 * Nvidia GeForce GTX 1080
 
-HPVM has not been tested but might work on other CPUs supported by LLVM Backend,
-and GPUs supported by OpenCL such as Intel, AMD, etc.
+HPVM has not been tested on other architectures,
+but it is expected to work on CPUs supported by the LLVM Backend
+and GPUs supported by OpenCL (Intel, AMD, etc.).
 
 **NOTE**: Approximations are tuned for Jetson TX2 and the same speedups may not
 exist for other architectures.
diff --git a/hpvm/docs/components/index.rst b/hpvm/docs/components/index.rst
index 96fb35bea6304b6fb1930b99439e1501fabbc90a..a61fb6303e783df8e21fe79ba2c24be868c0d0c2 100644
--- a/hpvm/docs/components/index.rst
+++ b/hpvm/docs/components/index.rst
@@ -3,16 +3,11 @@ Components
 
 HPVM consists of a few relatively independent key components.
 
+* Frontends (Keras/PyTorch): code generators in Python for lowering Keras and PyTorch
+  DNN models into HPVM-C format.
 * Patched LLVM: provides HPVM IR and a compilation infrastructure, including
   ``clang`` and ``opt``.
 * HPVM code generator: a few ``opt`` passes that lower HPVM IR to LLVM IR,
   which is then compiled into object code and binary.
-
-:doc:`Compilation process of HPVM </specifications/hpvm-spec>`
-shows how these 2 components work together.
-In addition, there are:
-
-* Frontends (Keras/PyTorch): code generators in Python for lowering Keras and PyTorch
-  DNN models into HPVM-C format.
 * Predictive tuner: an autotuner library in Python for finding approximation choices
   (configurations) with best performance gain within some loss of Quality of Service
   (QoS, such as accuracy).
 * HPVM profiler: an API in Python for measuring real performance of configurations.
diff --git a/hpvm/docs/developerdocs/approximation-implementation.rst b/hpvm/docs/developerdocs/approximation-implementation.rst
index 7513c2213af050f4c42eb553164a0c1762c4de5a..584f20ee63d2a61f8c22e2709259042bd72a175d 100644
--- a/hpvm/docs/developerdocs/approximation-implementation.rst
+++ b/hpvm/docs/developerdocs/approximation-implementation.rst
@@ -9,17 +9,17 @@ Perforated Convolutions
 
 Overview
 ^^^^^^^^^
 
-The core idea of perforated convolutions is to compute a subset of the output tensor elements and interpolate the missing elements. Specifically, we include an implementation that skips entire output rows or columns (configurable through knobs), and intepolates the missing rows/columns through neighbour averaging.
-Since the approximation reduces the number of output elemennts that need to be computed, it reduces the multiply-accumulate (MAC) operations, and reduces memory bandwidth usage (loads subset of data), hence resulting in both speedups and energy reductions. Our implementation performs the perforation at fixed strides (e.g. skip 1 out of every 3 rows) and the rate of perforation is a configurable knob. The type of perforation (row/col) is also configurable. Another knob is the starting offset - the tensor row/column index to start the perforation (i.e., skipping) from.
+The core idea of perforated convolutions is to compute a subset of the output tensor elements and interpolate the missing elements.
+Specifically, we include an implementation that skips entire output rows or columns (configurable through knobs), and interpolates the missing rows/columns through neighbour averaging.
+Since the approximation reduces the number of output elements that need to be computed, it reduces the multiply-accumulate (MAC) operations, and reduces memory bandwidth usage (loads a subset of the data), hence resulting in both speedups and energy reductions. Our implementation performs the perforation at fixed strides (e.g. skip 1 out of every 3 rows) and the rate of perforation is a configurable knob. The type of perforation (row/col) is also configurable. Another knob is the starting offset - the tensor row/column index to start the perforation (i.e., skipping) from.
 
 Description
 ^^^^^^^^^^^
 
 Our implementation for perforated convolution is a three-step process:
 
-* **Patch matrix creation:** Based on indices of the rows/columns to be perforated, the corresponding elements of the input tensor are used to create a new matrix called an input-patch matrix. The input-patch matrix is a matrix laid out in memory such that convolution is reduced to a simple matrix multiplication operation. This approach is similar to one described in this `paper <https://dl.acm.org/doi/abs/10.1145/2964284.2967243>`_.
+* **Patch matrix creation:** Based on indices of the rows/columns to be perforated, the corresponding elements of the input tensor are used to create a new matrix called an input-patch matrix. The input-patch matrix is a matrix laid out in memory such that convolution is reduced to a simple matrix multiplication operation. This approach is similar to one described in this `paper <https://dl.acm.org/doi/abs/10.1145/2964284.2967243>`__.
 
-* **Dense matrix multiplication:** This step involves performing a matrix multiplication in a manner very similar to described in this `paper <https://arxiv.org/pdf/1704.04428.pdf>`_. Note that these matrices are dense (no sparsity in the tensor representation).
+* **Dense matrix multiplication:** This step involves performing a matrix multiplication in a manner very similar to the one described in this `paper <https://arxiv.org/pdf/1704.04428.pdf>`__. Note that these matrices are dense (no sparsity in the tensor representation).
 
 * **Interpolation of missing values:** A new tensor is created with the dimensions the output tensor is expected to have. The perforated tensor output (after matrix multiplication) is copied to the corresponding rows/columns, and for the skipped rows/columns, the output values are populated using interpolation: the arithmetic mean of the neighboring elements. For column perforation, these neighboring elements are the right and left elements of the skipped element. For row perforation, the top and bottom neighboring elements are used for interpolation. The output of this step is the approximate (perforated) convolution result.
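+
+For intuition, the following self-contained C sketch mirrors the skip-then-interpolate logic for row perforation on a single-channel, unit-stride convolution. It is an illustration written for this document, not the tensor_runtime implementation (which builds the patch matrix and runs a dense matrix multiplication as described above); the function name and layout are hypothetical, and it assumes a perforation stride of at least 2 and at least two output rows.
+
+.. code-block:: c
+
+   /* Row-perforated KxK convolution over an HxW input (valid padding).
+    * Output rows offset, offset+stride, ... are skipped during compute
+    * and then filled by averaging the neighbouring computed rows. */
+   void perforated_conv_rows(const float *in, const float *filt, float *out,
+                             int H, int W, int K, int stride, int offset) {
+     int OH = H - K + 1, OW = W - K + 1;
+     for (int r = 0; r < OH; r++) {
+       if (r >= offset && (r - offset) % stride == 0)
+         continue; /* perforated row: all of its MACs are skipped */
+       for (int c = 0; c < OW; c++) {
+         float acc = 0.0f;
+         for (int i = 0; i < K; i++)
+           for (int j = 0; j < K; j++)
+             acc += in[(r + i) * W + (c + j)] * filt[i * K + j];
+         out[r * OW + c] = acc;
+       }
+     }
+     /* Interpolate each skipped row as the mean of its vertical neighbours;
+      * at the borders, fall back to the single computed neighbour. */
+     for (int r = offset; r < OH; r += stride) {
+       for (int c = 0; c < OW; c++) {
+         float up   = (r > 0)      ? out[(r - 1) * OW + c] : out[(r + 1) * OW + c];
+         float down = (r + 1 < OH) ? out[(r + 1) * OW + c] : out[(r - 1) * OW + c];
+         out[r * OW + c] = 0.5f * (up + down);
+       }
+     }
+   }
+
+Column perforation is symmetric, skipping and averaging along the other axis.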
diff --git a/hpvm/docs/developerdocs/backend-passes.rst b/hpvm/docs/developerdocs/backend-passes.rst
index 7cfff37e6f3c15dae8ba3bb37227af53c842eaf1..ceea1be303310b6e6640c4dac12c82aafaabd421 100644
--- a/hpvm/docs/developerdocs/backend-passes.rst
+++ b/hpvm/docs/developerdocs/backend-passes.rst
@@ -428,7 +428,7 @@ codeGen(DFLeafNode* )
 
     __hpvm_request_tensor(t3, /* GPU */ 1);
 
-    void* w1 = wrapper_ConvLayer2( “all_fused_wra...”, …);
+    void* w1 = wrapper_ConvLayer2("all_fused_wra..." /* , ... */); // some arguments omitted
 
     void *r1 = __hpvm__tensor_convolution(t1, t2, 2, 2, 4, 4);
@@ -448,7 +448,7 @@ codeGen(DFLeafNode* )
 
     __hpvm_request_tensor(t3, /* GPU */ 1);
 
-    void* w1 = wrapper_ConvLayer2( “all_fused_wra...”, t1, t2, t3, 2, 2, 4, 4, 0, 3, 3, 0, 0, 2, 2, 0.0, 0.0);
+    void* w1 = wrapper_ConvLayer2("all_fused_wra...", t1, t2, t3, 2, 2, 4, 4, 0, 3, 3, 0, 0, 2, 2, 0.0, 0.0);
 
     void *r1 = __hpvm__tensor_convolution(t1, t2, 2, 2, 4, 4);
     void *r2 = __hpvm__tensor_add(r1, t3);
diff --git a/hpvm/docs/index.rst b/hpvm/docs/index.rst
index 92b32cdfb06c7cbf9d18adf71ce5fd752453eaf7..0ff3f36815b4892724417d394add8e23cddfe910 100644
--- a/hpvm/docs/index.rst
+++ b/hpvm/docs/index.rst
@@ -20,10 +20,10 @@
 compilers, programming languages, approximate computing, software optimization,
 static and dynamic program analysis, and systems for machine learning.
 
 `HPVM <https://dl.acm.org/doi/pdf/10.1145/3200691.3178493>`_
-includes a retargetable compiler infrastructure that targets CPUs, CPUs, and accelerators
+includes a retargetable compiler infrastructure that targets CPUs, GPUs, and accelerators
 (this release does not include accelerator support)
-and uses a portable compiler IR that explicitly represents data flow at the IR level,
-and supports task, data, and pipelined parallelism.
+and uses a portable compiler IR that explicitly represents data flow at the IR level.
+It supports task, data, and pipelined parallelism.
 
 HPVM provides an extensible platform that compiler and programming languages
 researchers can use as part of their work.
diff --git a/hpvm/docs/specifications/hpvm-c-spec.rst b/hpvm/docs/specifications/hpvm-c-spec.rst
index ee2b55a4961c2eb2eda947e9a0c6bce6a7a292ee..b0907345a76c0ee345ffc45fa92f6bf0f90433ba 100644
--- a/hpvm/docs/specifications/hpvm-c-spec.rst
+++ b/hpvm/docs/specifications/hpvm-c-spec.rst
@@ -1,11 +1,15 @@
-.. role:: raw-html-m2r(raw)
-   :format: html
+.. |br| raw:: html
+   <br/>
 
 HPVM-C Language Specification
 =============================
 
-An HPVM program is a combination of host code and one or more data flow graphs (DFG) at the IR level. We provide C function declarations representing the HPVM intrinsics that allow creating, querying, and interacting with the DFGs. More details about the HPVM IR intrinsics can be found in `the HPVM IR Specification <hpvm-specification.html>`_.
+An HPVM program is a combination of host code and one or more data flow graphs (DFG) at the IR level.
+We provide C function declarations representing the HPVM intrinsics that allow
+creating, querying, and interacting with the DFGs.
+More details about the HPVM IR intrinsics can be found in the
+:doc:`hpvm-spec`.
 
 An HPVM-C program contains both the host and the DFG code. Each HPVM kernel, represented by a leaf node in the DFG, can be compiled to multiple different targets (e.g. CPU and GPU) as described below.
@@ -14,106 +18,106 @@
 This document describes all the API calls that can be used in an HPVM-C program.
 
 Host API
 --------
 
-``void __hpvm__init()``:raw-html-m2r:`<br>`
+``void __hpvm__init()`` |br|
 Used before all other HPVM calls to initialize the HPVM runtime.
 
-``void __hpvm__cleanup()``:raw-html-m2r:`<br>`
+``void __hpvm__cleanup()`` |br|
 Used at the end of an HPVM program to clean up all remaining runtime-created HPVM objects.
 
-``void llvm_hpvm_track_mem(void* ptr, size_t sz)``:raw-html-m2r:`<br>`
+``void llvm_hpvm_track_mem(void* ptr, size_t sz)`` |br|
 Inserts memory starting at ``ptr`` of size ``sz`` into the memory tracker of the HPVM runtime.
 
-``void llvm_hpvm_untrack_mem(void* ptr)``:raw-html-m2r:`<br>`
+``void llvm_hpvm_untrack_mem(void* ptr)`` |br|
 Stop tracking the memory object identified by ``ptr``.
 
-``void llvm_hpvm_request_mem(void* ptr, size_t sz)``:raw-html-m2r:`<br>`
+``void llvm_hpvm_request_mem(void* ptr, size_t sz)`` |br|
 If the memory object identified by ``ptr`` is not in host memory, copy it to host memory.
 
-``void* __hpvm__launch(unsigned isStream, void* rootGraph, void* args)``:raw-html-m2r:`<br>`
+``void* __hpvm__launch(unsigned isStream, void* rootGraph, void* args)`` |br|
 Launches the execution of the dataflow graph with node function ``rootGraph``. ``args`` is a pointer to a packed struct, containing one field per argument of the RootGraph function, consecutively. For non-streaming DFGs with a non-empty result type, ``args`` must contain an additional field of the type ``RootGraph.returnTy``, where the result of the graph will be returned. ``isStream`` chooses between non-streaming (0) and streaming (1) graph execution. Returns a handle to the executing graph.
 
-``void __hpvm__wait(void* G)``:raw-html-m2r:`<br>`
+``void __hpvm__wait(void* G)`` |br|
 Waits for completion of execution of the dataflow graph with handle ``G``.
 
-``void __hpvm__push(void* G, void* args)``:raw-html-m2r:`<br>`
+``void __hpvm__push(void* G, void* args)`` |br|
 Push a set of input data items, ``args`` (same type as included in launch), to the streaming DFG with handle ``G``.
 
-``void* __hpvm__pop(void* G)``:raw-html-m2r:`<br>`
+``void* __hpvm__pop(void* G)`` |br|
 Pop and return data produced from one execution of the streaming DFG with handle ``G``. The return type is a struct containing a field for every output of the DFG.
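+
+The host-side calling sequence these functions imply is illustrated below. This is a sketch written for this document rather than code from the HPVM sources: the root node function ``myRoot``, its argument struct, and the buffer sizes are hypothetical, and the HPVM-C declarations are assumed to be in scope (e.g. via the ``hpvm.h`` header used by the HPVM tests).
+
+.. code-block:: c
+
+   #include <stddef.h>  /* size_t; HPVM-C declarations assumed in scope */
+
+   /* Hypothetical root node function of the DFG. */
+   void myRoot(float *in, size_t bytes_in, float *out, size_t bytes_out);
+
+   /* Hypothetical packed argument struct: one field per argument of myRoot,
+    * in order. (A non-streaming graph with a non-empty result type would
+    * need one more field of type myRoot.returnTy.) */
+   typedef struct {
+     float *in;  size_t bytes_in;
+     float *out; size_t bytes_out;
+   } RootIn;
+
+   float in[1024], out[1024];
+
+   int main(void) {
+     RootIn args = { in, sizeof in, out, sizeof out };
+
+     __hpvm__init();                          /* before any other HPVM call */
+     llvm_hpvm_track_mem(in, sizeof in);      /* register buffers with the tracker */
+     llvm_hpvm_track_mem(out, sizeof out);
+
+     void *dfg = __hpvm__launch(0, (void *)myRoot, (void *)&args); /* 0: non-streaming */
+     __hpvm__wait(dfg);                       /* block until the DFG finishes */
+
+     llvm_hpvm_request_mem(out, sizeof out);  /* make results visible on the host */
+     llvm_hpvm_untrack_mem(in);
+     llvm_hpvm_untrack_mem(out);
+     __hpvm__cleanup();
+     return 0;
+   }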
 
 Internal Node API
 -----------------
 
-``void* __hpvm__createNodeND(unsigned dims, void* F, ...)``:raw-html-m2r:`<br>`
+``void* __hpvm__createNodeND(unsigned dims, void* F, ...)`` |br|
 Creates a static dataflow node replicated in ``dims`` dimensions (0 to 3), each executing node function ``F``. The arguments following ``F`` are the size of each dimension, respectively, passed in as a ``size_t``. Returns a handle to the created dataflow node.
 
-``void* __hpvm__edge(void* src, void* dst, unsigned replType, unsigned sp, unsigned dp, unsigned isStream)``:raw-html-m2r:`<br>`
+``void* __hpvm__edge(void* src, void* dst, unsigned replType, unsigned sp, unsigned dp, unsigned isStream)`` |br|
 Creates an edge from output ``sp`` of node ``src`` to input ``dp`` of node ``dst``. If ``replType`` is 0, the edge is a one-to-one edge, otherwise it is an all-to-all edge. ``isStream`` defines whether or not the edge is streaming. Returns a handle to the created edge.
 
-``void __hpvm__bindIn(void* N, unsigned ip, unsigned ic, unsigned isStream)``:raw-html-m2r:`<br>`
+``void __hpvm__bindIn(void* N, unsigned ip, unsigned ic, unsigned isStream)`` |br|
 Binds the input ``ip`` of the current node to input ``ic`` of child node function ``N``. ``isStream`` defines whether or not the input bind is streaming.
 
-``void __hpvm__bindOut(void* N, unsigned op, unsigned oc, unsigned isStream)``:raw-html-m2r:`<br>`
+``void __hpvm__bindOut(void* N, unsigned op, unsigned oc, unsigned isStream)`` |br|
 Binds the output ``op`` of the current node to output ``oc`` of child node function ``N``. ``isStream`` defines whether or not the output bind is streaming.
 
-``void __hpvm__hint(enum Target target)`` (C):raw-html-m2r:`<br>`
-``void __hpvm__hint(hpvm::Target target)`` (C++):raw-html-m2r:`<br>`
+``void __hpvm__hint(enum Target target)`` (C) |br|
+``void __hpvm__hint(hpvm::Target target)`` (C++) |br|
 Must be called once in each node function. Indicates which hardware target the current function should run on.
 
-``void __hpvm__attributes(unsigned ni, …, unsigned no, …)``:raw-html-m2r:`<br>`
+``void __hpvm__attributes(unsigned ni, ..., unsigned no, ...)`` |br|
 Must be called once at the beginning of each node function. Defines the properties of the pointer arguments to the current function. ``ni`` represents the number of input arguments, and ``no`` the number of output arguments. The arguments following ``ni`` are the input arguments, and the arguments following ``no`` are the output arguments. Arguments can be marked as both input and output. All pointer arguments must be included.
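+
+As an illustration only, an internal node that builds a two-stage pipeline from these calls could look like the sketch below. The child functions, the ``Out`` struct, and the ``CPU_TARGET`` enumerator are hypothetical (node functions return a struct with one field per node output, as implied by ``RootGraph.returnTy`` above).
+
+.. code-block:: c
+
+   /* Hypothetical output struct and child-node prototypes. */
+   typedef struct { float *buf; size_t bytes; } Out;
+   Out myLeaf1(float *in, size_t bytes_in);
+   Out myLeaf2(float *buf, size_t bytes);
+
+   Out myPipelineRoot(float *in, size_t bytes_in) {
+     __hpvm__hint(CPU_TARGET);           /* placeholder enum Target value */
+     __hpvm__attributes(1, in, 0);       /* 1 input pointer (in), 0 outputs */
+
+     /* Two 1-D child nodes, 64 dynamic instances each. */
+     void *n1 = __hpvm__createNodeND(1, (void *)myLeaf1, (size_t)64);
+     void *n2 = __hpvm__createNodeND(1, (void *)myLeaf2, (size_t)64);
+
+     __hpvm__bindIn(n1, 0, 0, 0);        /* this node's input 0 -> n1 input 0 (in) */
+     __hpvm__bindIn(n1, 1, 1, 0);        /* input 1 (bytes_in) -> n1 input 1 */
+
+     __hpvm__edge(n1, n2, 1, 0, 0, 0);   /* n1 output 0 -> n2 input 0, all-to-all */
+     __hpvm__edge(n1, n2, 1, 1, 1, 0);   /* n1 output 1 -> n2 input 1 */
+
+     __hpvm__bindOut(n2, 0, 0, 0);       /* n2 output 0 -> this node's output 0 */
+     __hpvm__bindOut(n2, 1, 1, 0);       /* n2 output 1 -> this node's output 1 */
+   }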
 
 Leaf Node API
 -------------
 
-``void __hpvm__hint(enum Target target)`` (C):raw-html-m2r:`<br>`
-``void __hpvm__hint(hpvm::Target target)`` (C++):raw-html-m2r:`<br>`
+``void __hpvm__hint(enum Target target)`` (C) |br|
+``void __hpvm__hint(hpvm::Target target)`` (C++) |br|
 As described in internal node API.
 
-``void __hpvm__attributes(unsigned ni, …, unsigned no, …)``:raw-html-m2r:`<br>`
+``void __hpvm__attributes(unsigned ni, ..., unsigned no, ...)`` |br|
 As described in internal node API.
 
-``void __hpvm__return(unsigned n, ...)``:raw-html-m2r:`<br>`
+``void __hpvm__return(unsigned n, ...)`` |br|
 Returns ``n`` values from a leaf node function. The remaining arguments are the values to be returned. All ``__hpvm__return`` statements within the same function must return the same number of values.
 
-``void* __hpvm__getNode()``:raw-html-m2r:`<br>`
+``void* __hpvm__getNode()`` |br|
 Returns a handle to the current leaf node.
 
-``void* __hpvm__getParentNode(void* N)``:raw-html-m2r:`<br>`
+``void* __hpvm__getParentNode(void* N)`` |br|
 Returns a handle to the parent node of node ``N``.
 
-``long __hpvm__getNodeInstanceID_{x,y,z}(void* N)``:raw-html-m2r:`<br>`
+``long __hpvm__getNodeInstanceID_{x,y,z}(void* N)`` |br|
 Returns the dynamic ID of the current instance of node ``N`` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
 
-``long __hpvm__getNumNodeInstances_{x,y,z}(void* N)``:raw-html-m2r:`<br>`
+``long __hpvm__getNumNodeInstances_{x,y,z}(void* N)`` |br|
 Returns the number of dynamic instances of node ``N`` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
 
-``void* __hpvm__malloc(long nBytes)``:raw-html-m2r:`<br>`
+``void* __hpvm__malloc(long nBytes)`` |br|
 Allocates a block of memory of size ``nBytes`` and returns a pointer to it. The allocated object can be shared by all nodes. *Note that the returned pointer must somehow be communicated explicitly for use by other nodes.*
 
-``int __hpvm__atomic_add(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_add(int* m, int v)`` |br|
 Atomically adds ``v`` to the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_sub(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_sub(int* m, int v)`` |br|
 Atomically subtracts ``v`` from the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_min(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_min(int* m, int v)`` |br|
 Atomically computes the min of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_max(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_max(int* m, int v)`` |br|
 Atomically computes the max of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_xchg(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_xchg(int* m, int v)`` |br|
 Atomically swaps ``v`` with the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_and(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_and(int* m, int v)`` |br|
 Atomically computes the bitwise AND of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_or(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_or(int* m, int v)`` |br|
 Atomically computes the bitwise OR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``int __hpvm__atomic_xor(int* m, int v)``:raw-html-m2r:`<br>`
+``int __hpvm__atomic_xor(int* m, int v)`` |br|
 Atomically computes the bitwise XOR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
 
-``void __hpvm__barrier()``:raw-html-m2r:`<br>`
+``void __hpvm__barrier()`` |br|
 Local synchronization barrier across dynamic instances of the current leaf node.
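+
+Combining the calls above, a complete leaf node function might look like the following sketch (written for this document; ``vecAddLeaf`` and ``CPU_TARGET`` are placeholders, the latter for whatever ``enum Target`` value the build provides):
+
+.. code-block:: c
+
+   /* Each dynamic instance computes one element: c[i] = a[i] + b[i]. */
+   void vecAddLeaf(float *a, size_t bytes_a, float *b, size_t bytes_b,
+                   float *c, size_t bytes_c) {
+     __hpvm__hint(CPU_TARGET);           /* placeholder target enumerator */
+     __hpvm__attributes(2, a, b, 1, c);  /* ni = 2 inputs (a, b); no = 1 output (c) */
+
+     void *self = __hpvm__getNode();
+     long i = __hpvm__getNodeInstanceID_x(self); /* my index in the x dimension */
+     c[i] = a[i] + b[i];
+   }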
diff --git a/hpvm/projects/torch2hpvm/README.rst b/hpvm/projects/torch2hpvm/README.rst
index 928aa2e19f1d8efdcbb33268925c7c3d6ba0394f..e6ac559e7df8fa12b891e5930478d627ed46cb55 100644
--- a/hpvm/projects/torch2hpvm/README.rst
+++ b/hpvm/projects/torch2hpvm/README.rst
@@ -22,7 +22,7 @@ Getting Started
 
 Let's look at an example that uses DNNs and weights pre-shipped with HPVM.
 This is found at ``hpvm/test/dnn_benchmarks/pytorch/test_frontend.py``.
-*Note* that below we'll be working under directory ``hpvm/test/dnn_benchmarks/pytorch``.
+**Note** that below we'll be working under directory ``hpvm/test/dnn_benchmarks/pytorch``.
 
 We'll be generating ResNet-18 into an HPVM-compiled binary.
 First, prepare 2 datasets for autotuning and testing.
@@ -40,12 +40,11 @@
 ``BinDataset`` is a dataset created over files of ApproxHPVM dataset format.
 Any instance of ``torch.utils.data.Dataset`` can be used here.
-*Note* that each ``module`` is bound to 2 datasets: a "tune" and a "test" set.
+**Note** that each ``module`` is bound to 2 datasets: a "tune" and a "test" set.
 The generated binary accepts an argument to be either the string "tune" or "test",
 and performs inference over a dataset accordingly.
 This is because the dataset can contain arbitrary Python code which cannot yet be exported into HPVM-C;
 instead the frontend has to export some predefined datasets for the model to use.
-See TODOs (1).
 
 Create a DNN ``module`` and load the checkpoint:
@@ -81,7 +80,7 @@ Now we are ready to export the model. The main functioning class of ``torch2hpvm``
 and path to the compiled binary respectively.
 ``batch_size`` is the batch size the binary uses during inference.
 
-*Note* that ``conf_file`` is the path to an HPVM approximation configuration file.
+**Note** that ``conf_file`` is the path to an HPVM approximation configuration file.
 This file decides what approximation the binary will use during inference.
 This path is hardcoded into the binary and is only read when the binary starts,
 so it's fine to have ``conf_file`` point to a non-existing path.
@@ -134,7 +133,6 @@ This choice of operators is largely constrained by backend (tensor_runtime) supp
 
 TODOs
 -----
-
 #. Optionally insert a Python-C interface in the generated binary to call back into
    a Dataset class and read the data.
diff --git a/hpvm/test/README.rst b/hpvm/test/README.rst
index 796699c068b512b1029ffd21aef2a7d1f27d25fe..2b970d54f4e45c6581aab8280f06eb098ffe10db 100644
--- a/hpvm/test/README.rst
+++ b/hpvm/test/README.rst
@@ -63,7 +63,7 @@ The following targets runs these tests respectively:
 * ``make -j check-hpvm-dnn`` runs all 20 DNN benchmarks under
   ``dnn_benchmarks/hpvm-c`` (10 DNNs x 2 versions) and validates their accuracy.
 
-  *Note* that this can take quite long due to the size of DNNs and datasets.
+  **Note** that this can take quite a while due to the size of the DNNs and datasets.
   Depending on your hardware capability, this test can take 5-30 minutes.
   Also, this is set to run sequentially out of GPU memory concerns.