diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
new file mode 100644
index 0000000000000000000000000000000000000000..2b2f46e1ced98f77ad21037be26cd976cf5dd4c8
--- /dev/null
+++ b/.gitlab-ci.yml
@@ -0,0 +1,12 @@
+image: python:3.7-alpine
+
+build_sphinx:
+  stage: build
+  tags:
+    - hpvm
+  script:
+    - pip install -U sphinx sphinx-autodoc-typehints sphinx-rtd-theme numpydoc
+    - sphinx-build -b html hpvm/docs/ public/
+  artifacts:
+    paths:
+    - public
diff --git a/README.md b/README.md
index d1a28fae2ee717a765ab80ee9eef34b1cc0cb738..c902d71c5f18e63757c68a7c56c3050bd6e2cfd7 100644
--- a/README.md
+++ b/README.md
@@ -2,167 +2,26 @@
 
 This repository contains the source code and documentation for the HPVM Compiler Infrastructure.
 
-The README briefly describes how to get started with building and installing HPVM. It also provides a
-benchmark suite to test the compiler infrastructure.
+HPVM is a compiler for heterogeneous parallel systems.
+For more about what HPVM is, see [our website](https://publish.illinois.edu/hpvm-project/)
+and publications:
+[PPoPP'18 paper](https://dl.acm.org/doi/pdf/10.1145/3200691.3178493),
+[OOPSLA'19 paper](https://dl.acm.org/doi/10.1145/3360612),
+[PPoPP'21 paper](https://dl.acm.org/doi/10.1145/3437801.3446108).
 
-HPVM is currently at **version 1.0**. For more about what HPVM is, see [our website](https://publish.illinois.edu/hpvm-project/).
+HPVM is currently at **version 1.0**.
 
-## Papers
+For instructions on how to build and install HPVM, see [here](/hpvm/docs/install.rst);
+for how to use HPVM, see [here](/hpvm/docs/getting-started.rst).
 
-[PPoPP'18 paper](https://dl.acm.org/doi/pdf/10.1145/3200691.3178493)
-
-[OOPSLA'19 paper](https://dl.acm.org/doi/10.1145/3360612)
-
-[PPoPP'21 paper](https://dl.acm.org/doi/10.1145/3437801.3446108)
-
-## Resources
-
-[HPVM IR Specification](/hpvm/docs/hpvm-specification.md)
-
-[HPVM-C Language Specification](/hpvm/docs/hpvm-c.md)
-
-[HPVM Compilation Process](/hpvm/docs/compilation.md)
-
-## Dependencies
-
-The following components are required to be installed on your machine to build HPVM.
-
-* GCC (>=5.1)
-  * In addition, each version of CUDA-nvcc requires GCC to be not newer than a certain version.
-    See [here](https://gist.github.com/ax3l/9489132) for the support matrix.
-* CMake (>=3.17)
-* GNU Make (>=3.79)
-* OpenCL (>=1.0.0)
-* CUDA (>=9.1)
-* Python (==3.6) with pip (>=20)
-
-Python must be strictly 3.6 (any subversion between 3.6.0~3.6.13).
-Alternatively, if you use Anaconda for package management,
-we provide a conda environment file that covers all Python and Python package requirements:
-
-```bash
-conda env create -n hpvm -f hpvm/env.yaml
-```
-
-## Supported Targets
-
-Supported/tested CPU architectures:
-
-* Intel Xeon E5-2640
-* Intel Xeon W-2135
-* ARM Cortex A-57
-
-Supported/tested GPU architectures for OpenCL backend:
-
-* Nvidia Quadro P1000
-* Nvidia GeForce GTX 1080
-
-Supported/tested GPU architectures for Tensor Backend:
-
-* Nvidia Jetson TX2
-* Nvidia GeForce GTX 1080
-
-HPVM has not been tested but might work on other CPUs supported by LLVM Backend, and GPUs supported by OpenCL such as Intel, AMD, etc.
-
-**NOTE**: Approximations are tuned for Jetson TX2 and same speedups may not exist for other architectures.
-
-## Getting Started
-
-### Getting source code and setting up environment
-
-Checkout HPVM and go to directory `./hpvm` under project root:
-
-```shell
-git clone --recursive -b approx_hpvm_reorg --single-branch https://gitlab.engr.illinois.edu/llvm/hpvm.git
-cd hpvm/
-```
-
-HPVM needs to be able to find CUDA.
-If CUDA is installed in your system's $PATH (e.g. if it was installed at the default location),
-HPVM can find CUDA automatically.
-Otherwise, some environment variables are required:
-
-* `CUDA_TOOLKIT_PATH` --- Path to the CUDA toolkit
-* `CUDA_INCLUDE_PATH` --- Path to the CUDA headers
-* `CUDA_LIB_PATH` --- Path to CUDA libraries
-
-`set_paths.sh` can be used for this.
-Modify the values of these variables in `set_paths.sh` according to your system, and source the script:
-
-```shell
-source set_paths.sh
-```
-
-HPVM installer script can be used to download, configure and build HPVM along with LLVM and Clang.
-
-```shell
-bash install.sh
-```
-
-On launch, the installer asks whether it should also build HPVM.
-If HPVM is to be built, the installer asks the number of threads to be used.
-The default number of threads used to build HPVM is two (2).
-
-If you use this automatic build, skip the next section.
-
-* Specifically, the HPVM installer downloads LLVM, and Clang, copies HPVM source into
-  llvm/tools and builds the entire tree. It also builds a modified LLVM C-Backend,
-  based on the one maintained by [Julia Computing](https://github.com/JuliaComputing/llvm-cbe),
-  as a part of HPVM and is currently used to generate OpenCL kernels for GPUs.
-
-### Manually Build HPVM
-
-Alternatively, you can manually build HPVM with CMake.
-Please note that in this case,
-the installer script still *must* be executed to obtain some required components,
-but without the build step.
-
-In current directory (`hpvm/`), do
-
-```shell
-mkdir build
-cd build
-cmake ../llvm [options]
-export PATH=$(realpath ./bin):$PATH
-```
-
-Some common options that can be used with CMake are:
-
-* -DCMAKE_INSTALL_PREFIX=directory --- Specify for directory the full pathname of where you want the HPVM tools and libraries to be installed.
-* -DCMAKE_BUILD_TYPE=type --- Valid options for type are Debug, Release, RelWithDebInfo, and MinSizeRel. Default is Debug.
-* -DLLVM_ENABLE_ASSERTIONS=On --- Compile with assertion checks enabled (default is Yes for Debug builds, No for all other build types).
-
-**Note** that if the installer script was not used,
-you must _manually add `build/bin` directory to your $PATH variable_ as absolute path (as shown above).
-
-Now, compile the HPVM Compilation Tool `approxhpvm.py` using:
-
-```shell
-make -j<number of threads> approxhpvm.py
-```
-
-With all the aforementioned steps, HPVM should be built, installed, tested and ready to use.
-In particular, `approxhpvm.py` should be an executable command from your command line.
-
-When not using the installer, you may want to run the regression tests using this script (outside of build directory):
-
-```shell
-cd ..
-bash scripts/automate_tests.sh
-```
-
-## Benchmarks and Tests
-
-We are providing the following [HPVM benchmarks](/hpvm/test/benchmarks):
-
-* Select benchmarks from the [Parboil](http://impact.crhc.illinois.edu/parboil/parboil.aspx) benchmark suite, located under [test/benchmarks/parboil](/hpvm/test/benchmarks/parboil).
-* An edge detection pipeline benchmark, located under [test/benchmarks/pipeline](/hpvm/test/benchmarks/pipeline).
-* A Camera ISP pipeline, located under [test/benchmarks/hpvm-cava](/hpvm/test/benchmarks/hpvm-cava), adapted from C code provided from our collaborators at [Harvard](http://vlsiarch.eecs.harvard.edu).
+## Support
 
-Benchmark descriptions and instructions on how to compile and run them are [here](/hpvm/test/benchmarks).
+All questions can be directed to [hpvm-dev@lists.cs.illinois.edu](mailto:hpvm-dev@lists.cs.illinois.edu).
 
-We are also providing [unit tests](/hpvm/test/unitTests) and [regression tests](/hpvm/test/regressionTests).
+## References
 
-## Support
+Some documents on the technical details and internal workings of HPVM:
 
-All questions can be directed to [hpvm-dev@lists.cs.illinois.edu](mailto:hpvm-dev@lists.cs.illinois.edu).
+* [HPVM IR Specification](/hpvm/docs/references/hpvm-specification.md)
+* [HPVM-C Language Specification](/hpvm/docs/references/hpvm-c.md)
+* [HPVM Compilation Process](/hpvm/docs/references/compilation-process.rst)
diff --git a/hpvm/docs/KerasFrontend.md b/hpvm/docs/KerasFrontend.md
deleted file mode 100644
index 3225b112ad4f1e8c03b69af1b330a2dbced24ab1..0000000000000000000000000000000000000000
--- a/hpvm/docs/KerasFrontend.md
+++ /dev/null
@@ -1,191 +0,0 @@
-# Keras Frontend 
-
-Install Keras Frontend after moving to directory `/hpvm/hpvm/projects/keras`
-
-## Requirements 
-
-* python == 3.6.x
-* pip >= 18
-
-If your system uses a different Python version, we recommend using the conda environment `keras_python36.yml`. Install this using:
-
-```
-conda env create -f keras_python36.yml --name keras_python36
-```
-
-Activate the conda environment before installing the pip package (below) using:
-
-```
-conda activate keras_python36
-```
-
-**NOTE:** This step must be performed each time (for each shell process) the frontend is to be used.
-
-
-## Installing the Keras Frontend Package
-
-At the root of this project (`/projects/keras/`) install the Keras frontend pip package as:
-
-```
-pip3 install -e ./
-```
-
-**NOTE:** If you are using the conda environment, activate it prior to this step.
-
-## Suppported Operations
-
-List of supported operations and limitations are documented [here](../projects/keras/docs/Support.md) 
-
-
-
-# Keras Benchmarks
-
-Run the Keras benchmarks under `hpvm/hpvm/test/dnn_benchmarks/keras`
-
-## Download CNN Model Files 
-
-Prior to running the benchmarks, ensure you download the CNN model data (inputs and weights) if not done in automatic build script.
-
-```
-wget https://databank.illinois.edu/datafiles/o3izd/download -O model_params.tar.gz
-tar -xf  model_params.tar.gz
-```
-
-Move extracted `model_params` directory to `/test/dnn_benchmarks/model_params` (Benchmarks expect data at this location)
-
-
-## Running Benchmaks
-
-List of benchmarks and the expected accuracies:
-
-| Benchmark       | Accuracy    |
-| ----------- | ----------- |
-| alexnet.py      | 79.28       |
-| alexnet2.py   | 84.98        |
-| alexnet_imagenet.py | 56.30 |
-| lenet.py | 98.70 | 
-| mobilenet_cifar10.py | 84.42 |
-| resnet18_cifar10.py | 89.56 |
-| resnet50_imagenet.py | 75.10 |
-| vgg16_cifar10.py | 89.96 |
-| vgg16_cifar100.py | 66.50 |
-| vgg16_imagenet.py | 69.46 |
-
-
-### Synopsis
-
-```
-python3 ${BENCH_NAME}.py  [hpvm_reload|keras_reload]  [frontend] [compile]
-
-```
-
-
-**Command-line Parameters**
-
-`hpvm_reload` : Reloads HPVM weights (`.bin` binary format used by HPVM weights - present in `model_params` download directory) from directory path specified in the `reload_dir` parameter set in code - this is described in "Parameters to Change in Code" (below).
-
-`keras_reload`: Alternatively, reload weights in Keras `.h5` file format with path to file specified in `keras_model_file` described in "Parameters to Change in Code" (below).
-
-`frontend`: Invokes the HPVM frontend and dumps weights (in HPVM `.bin` format) in the output directory specified. The parameters that control where data and source files are dumped are specified by parameters `data_dir` and `src_dir`, respectively. These are described below.
-
-`compile`: Optional Parameter. When specified, it compiles the HPVM-C code generated by the frontend into an HPVM binary under the directory specified by `src_dir` (described below). If `src_dir` path exists, a unique directory (which appends a unique ID) is created. 
-The binary is built with the name `HPVM_binary`. 
-
-**NOTE:** Before running `HPVM_binary` necessary to set CUDA and CUDNN paths with:
-
-```
-source ${PATH_TO_YOUR_HPVM_ROOT}/hpvm/set_paths.sh
-```
-
-**Parameters to Change in Code** 
-
-The AlexNet source is commented with explanations on how to use the Keras frontend interface. AlexNet source is [here](https://gitlab.engr.illinois.edu/llvm/hpvm/-/blob/approx_hpvm_reorg_keras/hpvm/projects/keras/src/alexnet.py).
-
-* `NAME`: Benchmark Name - Can be set to any desired value
-
-* `reload_dir`: Path to directory from where to reload weights in HPVM format. This directory is used to reload weights if `hpvm_reload` command-line option is used.
-
-* `keras_model_file`: Path to Keras .h5 model file to reload weigths from. Either of `reload_dir` or `keras_model_file` can be used. 
-`keras_model_file` is used when `keras_reload` commad-line parameter is used with the Benchmark script.
-
-* `data_dir`: Directory to dump weights specified specified in [constructor](https://gitlab.engr.illinois.edu/llvm/hpvm/-/blob/approx_hpvm_reorg_keras/hpvm/projects/keras/src/Benchmark.py#L21)
- 
-* `src_dir`: Directory to dump ApproxHPVM sources in HPVM-C (C with HPVM compiler intrinsics) specified in [constructor](https://gitlab.engr.illinois.edu/llvm/hpvm/-/blob/approx_hpvm_reorg_keras/hpvm/projects/keras/src/Benchmark.py#L22) 
-
-* `num_classes`: number of output classes - dependent on the dataset used. For CIFAR10, `num_classes` is 10, CIFAR100 has 100 classes,
- for ImageNet, number of classes is 1000.
-
-* `batch_size`: This parameter controls the size of each batch that is processed in HPVM. The batch size should be kept as large as the GPU memory 
-can support. This parameter should be adapted according to the memory size of the deployed device.
-
-
-
-### Using the Frontend with Custom (New) Benchmarks 
-
-Any new benchmarks must inherit from the commom parent `Benchmark` class 
-and override the virtual functions for building the model, training, 
-and data preprocessing. These methods are described below:
-        
-    
-`def buildModel(self)`:
-Constructs and returns a keras model
-
-`def data_preprocess(self)`:
-returns X_train, y_train, X_test, y_test, X_tuner, and y_tuner data (in that order): 
-These are described here:
-
-* `X_train:` Training data (fp32) in NCHW format
-* `y_train:` Training labels (int32)
-
-* `X_test:` Testing/Evaluation data in NCHW format
-* `y_test:` Testing/Evaluation labels
-
-* `X_tuner:` Data to be used for autotuning 
-* `y_tuner:` Labels corresponding to tuning data
-
-
-`def trainModel(self, model, X_train, y_train, X_test, y_test)`:
-Trains the Keras model constructed in `buildModel` and is expected to return the 
-trained keras model - training parameters should be tuned here.
-
-### Directly using Keras Frontend API
-
-Alternate to extending the `Benchmark` class, users may directly invoke the Keras Frontend API. This can be done as:
-
-```python
-
-from keras_frontend.approxhpvm_translator import translate_to_approxhpvm
-
-# Construct and train your Keras Model (or load pre-trained weights)
-
-translate_to_approxhpvm(model, data_dir, src_dir, test_data, test_labels, tune_data, tune_labels, batch_size, num_classes)
-
-```
-
-## Running HPVM Binary 
-
-Run the `HPVM_binary` generated under the directory specified by `src_dir` (described above). Usage: 
-
-```
-./HPVM_binary -t {test|tune} -c ${config_file_path}
-```
-
-`test|tune`: Runs with either tune (autotuning data) or test set (for evaluation)
-
-`config_file_path`: Path to an HPVM tensor configuration file (includes approximation settings)
-
-**NOTE:** The accuracy of the bennchmarks is dumped into a file named `final_accuracy` in the current working directory - this includes accuracy averaged across batches
-
-## Automated Tests 
-
-`scripts/test_benchmarks.py` is an automated test script that evaluates the accuracy of each Benchmark in Keras and HPVM (after comilation using HPVM Compiler) and compares the accuracy of each binary to the known correct accuracy. Run from root of `/test/dnn_benchmarks/keras`:
-
-```
-python test_benchmarks.py
-```
-
-
-
-
-
-
diff --git a/hpvm/docs/Makefile b/hpvm/docs/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..b7000de4853b1c4a9f75cdc1ced9f351ab02284d
--- /dev/null
+++ b/hpvm/docs/Makefile
@@ -0,0 +1,109 @@
+# Makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+PAPER         =
+
+# Internal variables.
+PAPEROPT_a4     = -D latex_paper_size=a4
+PAPEROPT_letter = -D latex_paper_size=letter
+ALLSPHINXOPTS   = -d build/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
+
+.PHONY: help clean html dirhtml pickle json htmlhelp qthelp latex changes linkcheck doctest epub latexpdf dist
+
+help:
+	@echo "Please use \`make <target>' where <target> is one of"
+	@echo "  html      to make standalone HTML files"
+	@echo "  dirhtml   to make HTML files named index.html in directories"
+	@echo "  pickle    to make pickle files"
+	@echo "  epub      to make an epub"
+	@echo "  json      to make JSON files"
+	@echo "  htmlhelp  to make HTML files and a HTML help project"
+	@echo "  qthelp    to make HTML files and a qthelp project"
+	@echo "  latex     to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
+	@echo "  changes   to make an overview of all changed/added/deprecated items"
+	@echo "  linkcheck to check all external links for integrity"
+	@echo "  doctest   to run all doctests embedded in the documentation (if enabled)"
+
+
+clean:
+	-rm -rf build/*
+
+dist: html
+	test -d build/latex || make latex
+	make -C build/latex all-pdf
+	-rm -rf build/dist
+	(cd build/html; cp -r . ../../build/dist)
+	(cd build/dist && tar czf ../dist.tar.gz .)
+
+html:
+	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) build/html
+	@echo
+	@echo "Build finished. The HTML pages are in build/html."
+
+dirhtml:
+	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) build/dirhtml
+	@echo
+	@echo "Build finished. The HTML pages are in build/dirhtml."
+
+pickle:
+	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) build/pickle
+	@echo
+	@echo "Build finished; now you can process the pickle files."
+
+json:
+	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) build/json
+	@echo
+	@echo "Build finished; now you can process the JSON files."
+
+htmlhelp:
+	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) build/htmlhelp
+	@echo
+	@echo "Build finished; now you can run HTML Help Workshop with the" \
+	      ".hhp project file in build/htmlhelp."
+
+qthelp:
+	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) build/qthelp
+	@echo
+	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
+	      ".qhcp project file in build/qthelp, like this:"
+	@echo "# qcollectiongenerator build/qthelp/test.qhcp"
+	@echo "To view the help file:"
+	@echo "# assistant -collectionFile build/qthelp/test.qhc"
+
+epub:
+	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) build/epub
+	@echo
+	@echo "Build finished. The epub file is in build/epub."
+
+
+latex:
+	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) build/latex
+	@echo
+	@echo "Build finished; the LaTeX files are in build/latex."
+	@echo "Run \`make all-pdf' or \`make all-ps' in that directory to" \
+	      "run these through (pdf)latex."
+
+changes:
+	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) build/changes
+	@echo
+	@echo "The overview file is in build/changes."
+
+linkcheck:
+	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) build/linkcheck
+	@echo
+	@echo "Link check complete; look for any errors in the above output " \
+	      "or in build/linkcheck/output.txt."
+
+doctest:
+	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) build/doctest
+	@echo "Testing of doctests in the sources finished, look at the " \
+	      "results in build/doctest/output.txt."
+
+latexpdf: latex
+	@echo "Running LaTeX files through latexmk..."
+	$(MAKE) -C build/latex all-pdf
+	@echo "latexmk finished; the PDF files are in build/latex."
diff --git a/hpvm/docs/README.md b/hpvm/docs/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c93b7ed1517061d4417b0924a51c62b3726a7ca9
--- /dev/null
+++ b/hpvm/docs/README.md
@@ -0,0 +1,28 @@
+# Building docs
+
+We use Sphinx for generating the API and reference documentation.
+
+## Instructions
+
+Install the Python packages needed to build the documentation:
+
+```bash
+pip3 install sphinx sphinx-autodoc-typehints sphinx-rtd-theme numpydoc
+```
+
+To build the HTML documentation, enter:
+
+```bash
+make html
+```
+
+in the `docs/` directory. This will generate a `build/html` subdirectory
+containing the built documentation.
+
+To build the PDF documentation, enter:
+
+```bash
+make latexpdf
+```
+
+You will need to have LaTeX installed for this.
diff --git a/hpvm/docs/compilation.md b/hpvm/docs/compilation.md
deleted file mode 100644
index 6381fec7d856c79fdd2ed31bc23fe02990c9e38d..0000000000000000000000000000000000000000
--- a/hpvm/docs/compilation.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# HPVM Compilation Process
-Compilation of an HPVM program involves the following steps:
-
-1. `clang` takes an HPVM-C/C++ program (e.g. `main.c`) and produces an LLVM IR (`main.ll`) file that contains the HPVM-C function calls. The declarations of these functions are defined in `test/benchmark/include/hpvm.h`, which must be included in the program.
-2. `opt` takes (`main.ll`) and invoke the GenHPVM pass on it, which converts the HPVM-C function calls to HPVM intrinsics. This generates the HPVM textual representation (`main.hpvm.ll`).
-3. `opt` takes the HPVM textual representation (`main.hpvm.ll`) and invokes the following passes in sequence: 
-    * BuildDFG: Converts the textual representation to the internal HPVM representation.
-    * LocalMem and DFG2LLVM_OpenCL: Invoked only when GPU target is selected. Generates the kernel module (`main.kernels.ll`) and the portion of the host code that invokes the kernel into the host module (`main.host.ll`).
-    * DFG2LLVM_CPU: Generates either all, or the remainder of the host module (`main.host.ll`) depending on the chosen target.
-    * ClearDFG: Deletes the internal HPVM representation from memory.
-4. `clang` is used to to compile any remaining project files that would be later linked with the host module.
-5. `llvm-link` takes the host module and all the other generate `ll` files, and links them with the HPVM runtime module (`hpvm-rt.bc`), to generate the linked host module (`main.host.linked.ll`). 
-6. Generate the executable code from the generated `ll` files for all parts of the program:
-    * GPU target: `llvm-cbe` takes the kernel module (`main.kernels.ll`) and generates an OpenCL representation of the kernels that will be invoked by the host.
-    * CPU target: `clang` takes the linked  host module (`main.host.linked.ll`) and generates the CPU binary.
diff --git a/hpvm/docs/components/hpvm-profiler.rst b/hpvm/docs/components/hpvm-profiler.rst
new file mode 100644
index 0000000000000000000000000000000000000000..8a0e6603d3b7111d2735a86b5db26d7aa834ebb6
--- /dev/null
+++ b/hpvm/docs/components/hpvm-profiler.rst
@@ -0,0 +1,6 @@
+HPVM Profiler API
+======================
+
+.. autofunction:: hpvm_profiler.profile_configs
+
+.. autofunction:: hpvm_profiler.plot_hpvm_configs
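+
+A hypothetical usage sketch is shown below; the argument names and file paths
+are assumptions for illustration only - see the generated signatures above for
+the authoritative API.
+
+.. code-block:: python
+
+   from hpvm_profiler import profile_configs, plot_hpvm_configs
+
+   # Measure the real speed and QoS of each configuration in a tuner-emitted
+   # config file by running the HPVM binary, writing an updated config file.
+   profile_configs("./HPVM_binary", "tuner_confs.txt", "profiled_confs.txt")
+
+   # Plot the profiled QoS/speedup tradeoff of the configurations to an image.
+   plot_hpvm_configs("profiled_confs.txt", "configs.png")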
diff --git a/hpvm/docs/components/index.rst b/hpvm/docs/components/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..8f9ab42a8dbf6ad461ea93867c6e6537ce79762b
--- /dev/null
+++ b/hpvm/docs/components/index.rst
@@ -0,0 +1,33 @@
+Components
+================================
+
+HPVM consists of a few relatively independent key components.
+
+* Patched LLVM: provides HPVM IR and a compilation infrastructure, including ``clang`` and ``opt``.
+* HPVM code generator: a few ``opt`` passes that lower HPVM IR to LLVM IR,
+  which is then compiled into object code and binary.
+
+:doc:`Compilation process of HPVM </references/compilation-process>`
+shows how these two components work together.
+In addition, there are:
+
+* Frontends (Keras/PyTorch): code generators in Python for lowering Keras and PyTorch
+  DNN models into HPVM-C format.
+* Predictive tuner: an autotuner library in Python for finding approximation choices (configurations)
+  with the best performance gain within a given loss of Quality of Service (QoS, such as accuracy).
+* HPVM profiler: an API in Python for measuring the real performance of configurations.
+* Tensor runtime: a backend that provides implementations of common tensor operators
+  (such as convolution) into which HPVM-C functions can be lowered.
+
+The documentation for these components is listed below;
+it explains their roles, usage, and other details.
+
+.. toctree::
+   :maxdepth: 1
+
+   keras-frontend
+   keras-support
+   keras-benchmarks
+   torch2hpvm
+   predtuner
+   hpvm-profiler
diff --git a/hpvm/docs/components/keras-benchmarks.rst b/hpvm/docs/components/keras-benchmarks.rst
new file mode 100644
index 0000000000000000000000000000000000000000..fe6a6414e24a70092667ca986c21b3ecaa56cd87
--- /dev/null
+++ b/hpvm/docs/components/keras-benchmarks.rst
@@ -0,0 +1,177 @@
+Keras Benchmarks
+================
+
+TODO: some of this belongs to `test/`.
+
+Run the Keras benchmarks under ``hpvm/hpvm/test/dnn_benchmarks/keras``.
+
+Download CNN Model Files
+------------------------
+
+Prior to running the benchmarks, download the CNN model data (inputs and weights) if this was not already done by the automatic build script:
+
+.. code-block::
+
+   wget https://databank.illinois.edu/datafiles/o3izd/download -O model_params.tar.gz
+   tar -xf  model_params.tar.gz
+
+Move the extracted ``model_params`` directory to ``/test/dnn_benchmarks/model_params`` (the benchmarks expect the data at this location).
+
+Running Benchmarks
+------------------
+
+List of benchmarks and the expected accuracies:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Benchmark
+     - Accuracy
+   * - alexnet.py
+     - 79.28
+   * - alexnet2.py
+     - 84.98
+   * - alexnet_imagenet.py
+     - 56.30
+   * - lenet.py
+     - 98.70
+   * - mobilenet_cifar10.py
+     - 84.42
+   * - resnet18_cifar10.py
+     - 89.56
+   * - resnet50_imagenet.py
+     - 75.10
+   * - vgg16_cifar10.py
+     - 89.96
+   * - vgg16_cifar100.py
+     - 66.50
+   * - vgg16_imagenet.py
+     - 69.46
+
+
+Synopsis
+^^^^^^^^
+
+.. code-block::
+
+   python3 ${BENCH_NAME}.py  [hpvm_reload|keras_reload]  [frontend] [compile]
+
+**Command-line Parameters**
+
+``hpvm_reload``: Reloads HPVM weights (the ``.bin`` binary format used by HPVM, present in the ``model_params`` download directory) from the directory path specified in the ``reload_dir`` parameter set in code - this is described in "Parameters to Change in Code" (below).
+
+``keras_reload``: Alternatively, reloads weights in the Keras ``.h5`` file format, with the path to the file specified in ``keras_model_file``, described in "Parameters to Change in Code" (below).
+
+``frontend``: Invokes the HPVM frontend and dumps weights (in HPVM ``.bin`` format) in the output directory specified. The parameters that control where data and source files are dumped are specified by parameters ``data_dir`` and ``src_dir``, respectively. These are described below.
+
+``compile``: Optional parameter. When specified, it compiles the HPVM-C code generated by the frontend into an HPVM binary under the directory specified by ``src_dir`` (described below). If the ``src_dir`` path exists, a unique directory (which appends a unique ID) is created.
+The binary is built with the name ``HPVM_binary``.
+
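+For example, to reload Keras weights, run the frontend, and compile the
+generated code for the AlexNet benchmark, one would run:
+
+.. code-block::
+
+   python3 alexnet.py keras_reload frontend compile
+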
+**NOTE:** Before running ``HPVM_binary``, it is necessary to set the CUDA and cuDNN paths with:
+
+.. code-block::
+
+   source ${PATH_TO_YOUR_HPVM_ROOT}/hpvm/set_paths.sh
+
+**Parameters to Change in Code** 
+
+The AlexNet source is commented with explanations on how to use the Keras frontend interface; it can be found `here <https://gitlab.engr.illinois.edu/llvm/hpvm/-/blob/approx_hpvm_reorg_keras/hpvm/projects/keras/src/alexnet.py>`_.
+
+
+* ``NAME``: Benchmark name - can be set to any desired value.
+
+* ``reload_dir``: Path to the directory from which to reload weights in HPVM format. This directory is used to reload weights if the ``hpvm_reload`` command-line option is used.
+
+* ``keras_model_file``: Path to the Keras ``.h5`` model file to reload weights from. Either ``reload_dir`` or ``keras_model_file`` can be used.
+  ``keras_model_file`` is used when the ``keras_reload`` command-line parameter is passed to the benchmark script.
+
+* ``data_dir``: Directory to dump weights to, specified in the
+  `constructor <https://gitlab.engr.illinois.edu/llvm/hpvm/-/blob/approx_hpvm_reorg_keras/hpvm/projects/keras/src/Benchmark.py#L21>`_.
+
+* ``src_dir``: Directory to dump ApproxHPVM sources in HPVM-C (C with HPVM compiler intrinsics) to, specified in the
+  `constructor <https://gitlab.engr.illinois.edu/llvm/hpvm/-/blob/approx_hpvm_reorg_keras/hpvm/projects/keras/src/Benchmark.py#L22>`_.
+
+* ``num_classes``: Number of output classes - dependent on the dataset used. For CIFAR10, ``num_classes`` is 10; for CIFAR100, 100; for ImageNet, 1000.
+
+* ``batch_size``: Controls the size of each batch that is processed in HPVM. The batch size should be kept as large as the GPU memory
+  can support, and should be adapted to the memory size of the deployed device.
+
+Using the Frontend with Custom (New) Benchmarks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Any new benchmark must inherit from the common parent ``Benchmark`` class
+and override the virtual functions for building the model, training,
+and data preprocessing. These methods are described below, followed by a
+minimal sketch of a custom benchmark.
+
+``def buildModel(self)``:
+Constructs and returns a Keras model.
+
+``def data_preprocess(self)``:
+Returns ``X_train``, ``y_train``, ``X_test``, ``y_test``, ``X_tuner``, and ``y_tuner`` data (in that order):
+
+
+* ``X_train``: Training data (fp32) in NCHW format
+* ``y_train``: Training labels (int32)
+* ``X_test``: Testing/evaluation data in NCHW format
+* ``y_test``: Testing/evaluation labels
+* ``X_tuner``: Data to be used for autotuning
+* ``y_tuner``: Labels corresponding to the tuning data
+
+``def trainModel(self, model, X_train, y_train, X_test, y_test)``:
+Trains the Keras model constructed in ``buildModel`` and is expected to return the
+trained Keras model - training parameters should be tuned here.
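+
+A minimal sketch of a custom benchmark is shown below. The class name, dataset
+handling, and import path of ``Benchmark`` are hypothetical; only the three
+overridden methods follow the interface described above.
+
+.. code-block:: python
+
+   from keras.datasets import cifar10
+   from keras.models import Sequential
+   from keras.layers import Conv2D, Flatten, Dense
+   from Benchmark import Benchmark  # import path is an assumption
+
+   class MyNet(Benchmark):  # hypothetical benchmark
+       def buildModel(self):
+           # A toy CNN; input is NCHW (3x32x32 for CIFAR10).
+           model = Sequential()
+           model.add(Conv2D(32, (3, 3), activation="relu",
+                            input_shape=(3, 32, 32),
+                            data_format="channels_first"))
+           model.add(Flatten())
+           model.add(Dense(10, activation="softmax"))
+           return model
+
+       def data_preprocess(self):
+           (X_train, y_train), (X_test, y_test) = cifar10.load_data()
+           # Convert NHWC -> NCHW and normalize to [0, 1].
+           X_train = X_train.transpose(0, 3, 1, 2).astype("float32") / 255
+           X_test = X_test.transpose(0, 3, 1, 2).astype("float32") / 255
+           # Hold out part of the test set for autotuning (an arbitrary split).
+           X_tuner, y_tuner = X_test[:5000], y_test[:5000]
+           return X_train, y_train, X_test, y_test, X_tuner, y_tuner
+
+       def trainModel(self, model, X_train, y_train, X_test, y_test):
+           model.compile(optimizer="adam",
+                         loss="sparse_categorical_crossentropy",
+                         metrics=["accuracy"])
+           model.fit(X_train, y_train, batch_size=128, epochs=10,
+                     validation_data=(X_test, y_test))
+           return model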
+
+Directly using Keras Frontend API
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Alternate to extending the ``Benchmark`` class, users may directly invoke the Keras Frontend API. This can be done as:
+
+.. code-block:: python
+
+
+   from keras_frontend.approxhpvm_translator import translate_to_approxhpvm
+
+   # Construct and train your Keras Model (or load pre-trained weights)
+
+   translate_to_approxhpvm(model, data_dir, src_dir, test_data, test_labels, tune_data, tune_labels, batch_size, num_classes)
+
+Running HPVM Binary
+-------------------
+
+Run the ``HPVM_binary`` generated under the directory specified by ``src_dir`` (described above). Usage: 
+
+.. code-block::
+
+   ./HPVM_binary -t {test|tune} -c ${config_file_path}
+
+``test|tune``: Runs with either the tune set (autotuning data) or the test set (for evaluation).
+
+``config_file_path``: Path to an HPVM tensor configuration file (which includes approximation settings).
+
+**NOTE:** The accuracy of the benchmarks is dumped into a file named ``final_accuracy`` in the current working directory - this is the accuracy averaged across batches.
+
+Automated Tests
+---------------
+
+``scripts/test_benchmarks.py`` is an automated test script that evaluates the accuracy of each benchmark in Keras and HPVM (after compilation using the HPVM compiler) and compares the accuracy of each binary to the known correct accuracy. Run from the root of ``/test/dnn_benchmarks/keras``:
+
+.. code-block::
+
+   python3 test_benchmarks.py
diff --git a/hpvm/docs/components/keras-frontend.rst b/hpvm/docs/components/keras-frontend.rst
new file mode 100644
index 0000000000000000000000000000000000000000..d67adff4f2befd0b4c57468fbffb6d472940bc6d
--- /dev/null
+++ b/hpvm/docs/components/keras-frontend.rst
@@ -0,0 +1,43 @@
+
+Keras Frontend
+==============
+
+Install the Keras frontend after moving to the directory ``/hpvm/hpvm/projects/keras``.
+
+Requirements
+------------
+
+
+* python == 3.6.x
+* pip >= 18
+
+If your system uses a different Python version, we recommend using the conda environment ``keras_python36.yml``. Install it using:
+
+.. code-block::
+
+   conda env create -f keras_python36.yml --name keras_python36
+
+Activate the conda environment before installing the pip package (below) using:
+
+.. code-block::
+
+   conda activate keras_python36
+
+**NOTE:** This step must be performed each time (for each shell process) the frontend is to be used.
+
+Installing the Keras Frontend Package
+-------------------------------------
+
+At the root of this project (``/projects/keras/``), install the Keras frontend pip package as:
+
+.. code-block::
+
+   pip3 install -e ./
+
+**NOTE:** If you are using the conda environment, activate it prior to this step.
+
+Supported Operations
+---------------------
+
+The list of supported operations and limitations is documented
+:doc:`here <keras-support>`.
diff --git a/hpvm/docs/components/keras-support.rst b/hpvm/docs/components/keras-support.rst
new file mode 120000
index 0000000000000000000000000000000000000000..0b77e04dc1e11c1c148b0c74bbdda191560165b6
--- /dev/null
+++ b/hpvm/docs/components/keras-support.rst
@@ -0,0 +1 @@
+../../projects/keras/docs/Support.rst
\ No newline at end of file
diff --git a/hpvm/docs/components/predtuner.rst b/hpvm/docs/components/predtuner.rst
new file mode 120000
index 0000000000000000000000000000000000000000..cdda0ab7b31c03473ad9d5646b36dd4bade13b2c
--- /dev/null
+++ b/hpvm/docs/components/predtuner.rst
@@ -0,0 +1 @@
+../../projects/predtuner/README.rst
\ No newline at end of file
diff --git a/hpvm/docs/components/torch2hpvm.rst b/hpvm/docs/components/torch2hpvm.rst
new file mode 120000
index 0000000000000000000000000000000000000000..54d5dbb5804b350fc1b7896812d1e77e0305368e
--- /dev/null
+++ b/hpvm/docs/components/torch2hpvm.rst
@@ -0,0 +1 @@
+../../projects/torch2hpvm/README.rst
\ No newline at end of file
diff --git a/hpvm/docs/conf.py b/hpvm/docs/conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e65a2b3582d69e409820921dc0fe3a37ebaf3d5
--- /dev/null
+++ b/hpvm/docs/conf.py
@@ -0,0 +1,156 @@
+from datetime import date
+import sphinx_rtd_theme
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+
+sys.path.insert(0, os.path.abspath(".."))
+
+# General configuration
+# ---------------------
+
+# Add any Sphinx extension module names here, as strings. They can be extensions
+# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
+extensions = [
+    "sphinx.ext.autosummary",
+    "sphinx.ext.autodoc",
+    "sphinx_autodoc_typehints",
+    "sphinx.ext.coverage",
+    "sphinx.ext.doctest",
+    "sphinx.ext.intersphinx",
+    "sphinx.ext.mathjax",
+    "sphinx.ext.todo",
+    "sphinx.ext.viewcode",
+    "numpydoc",
+]
+always_document_param_types = True
+
+# generate autosummary pages
+autosummary_generate = True
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ["_templates"]
+
+# The suffix of source filenames.
+source_suffix = ".rst"
+
+# The encoding of source files.
+source_encoding = "utf-8"
+
+# The master toctree document.
+master_doc = "index"
+
+# General substitutions.
+project = "HPVM"
+copyright = f"2020-{date.today().year}, University of Illinois"
+
+# There are two options for replacing |today|: either, you set today to some
+# non-false value, then it is used:
+# today = ''
+# Else, today_fmt is used as the format for a strftime call.
+# today_fmt = '%B %d, %Y'
+
+# List of documents that shouldn't be included in the build.
+# unused_docs = ['']
+
+# If true, '()' will be appended to :func: etc. cross-reference text.
+# add_function_parentheses = True
+
+# If true, the current module name will be prepended to all description
+# unit titles (such as .. function::).
+add_module_names = False
+
+# show_authors = True
+
+# The name of the Pygments (syntax highlighting) style to use.
+# pygments_style = 'friendly'
+pygments_style = "sphinx"
+
+# A list of prefixes that are ignored when creating the module index. (new in Sphinx 0.6)
+
+# Options for HTML output
+# -----------------------
+
+
+html_theme = "sphinx_rtd_theme"
+html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
+
+html_theme_options = {
+    "navigation_depth": 3,
+    "logo_only": True,
+}
+
+
+# The style sheet to use for HTML and HTML Help pages. A file of that name
+# must exist either in Sphinx' static/ path, or in one of the custom paths
+# given in html_static_path.
+# html_style = ''
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = []
+
+# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
+# using the given strftime format.
+html_last_updated_fmt = "%b %d, %Y"
+
+# If true, SmartyPants will be used to convert quotes and dashes to
+# typographically correct entities.
+# html_use_smartypants = True
+
+# Content template for the index page.
+# html_index = 'index.html'
+
+# Custom sidebar templates, maps page names to templates.
+# html_sidebars = {}
+
+# Additional templates that should be rendered to pages, maps page names to
+# templates.
+# html_additional_pages = {'': ''}
+
+# If true, the reST sources are included in the HTML build as _sources/<name>.
+html_copy_source = False
+
+# Options for LaTeX output
+# ------------------------
+
+# Use a latex engine that allows for unicode characters in docstrings
+latex_engine = "xelatex"
+# The paper size ('letter' or 'a4').
+latex_paper_size = "letter"
+
+# The font size ('10pt', '11pt' or '12pt').
+# latex_font_size = '10pt'
+
+latex_appendices = []
+
+# Intersphinx mapping
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3/", None),
+    "numpy": ("https://numpy.org/doc/stable/", None),
+    "matplotlib": ("https://matplotlib.org", None),
+    "scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
+    "pandas": ("https://pandas.pydata.org/pandas-docs/stable", None),
+    "pytorch": ("https://pytorch.org/docs/stable", None),
+}
+
+# The reST default role (used for this markup: `text`) to use for all
+# documents.
+default_role = "obj"
+
+numpydoc_show_class_members = False
+
+
+def setup(app):
+    app.add_css_file("custom.css")
+    app.add_js_file("copybutton.js")
diff --git a/hpvm/docs/developerdocs/approximation-implementation.rst b/hpvm/docs/developerdocs/approximation-implementation.rst
new file mode 100644
index 0000000000000000000000000000000000000000..bb9f5de2d908a269fa2e1d648c5525bed7fb9c95
--- /dev/null
+++ b/hpvm/docs/developerdocs/approximation-implementation.rst
@@ -0,0 +1,40 @@
+Approximate Algorithm Implementations
+=========================================
+
+
+Perforated Convolutions
+-----------------------
+
+Overview
+^^^^^^^^^
+
+Perforation approximates the convolution operation by skipping the computation of selected rows or columns of the output tensor, then interpolating the skipped values from neighboring computed values to recover the accuracy and shape of the resulting tensor. This reduces the number of MAC operations performed while improving cache and memory bandwidth usage. Perforation is applied at a uniform rate, which is the percentage of rows/columns perforated relative to the total number of rows/columns. Perforation begins at an offset, which is the index of the row/column from which perforation is applied; rows/columns before that index remain unapproximated.
+
+Description
+^^^^^^^^^^^
+
+The algorithm for perforated convolution can be broken down into three major steps (a short code sketch follows the list):
+
+* **Patch matrix creation:** Based on the indices of the rows/columns to be perforated, the corresponding elements of the input tensor are used to create a new matrix called an input-patch matrix. The input-patch matrix is laid out in memory in such a way that convolution is then reduced to a simple matrix multiplication operation. This approach is similar to one described in this paper.
+
+* **Dense matrix multiplication:** This step performs a matrix multiplication in a manner very similar to that described in this paper. It is important to note that it is performed on reduced, dense matrices.
+
+* **Interpolation of missing values:** This step allocates a new tensor to which the computed elements of the reduced, dense tensor are copied; the elements whose computation was skipped are then interpolated as the arithmetic mean of the neighboring elements. These neighbors are the computed elements to the left and right of the skipped element in the case of column perforation, and the computed elements above and below the skipped element in the case of row perforation.
+
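+A minimal sketch of the skip-and-interpolate idea on a 2D output is shown below,
+assuming the perforation "rate" is expressed as a stride (every ``rate``-th row
+skipped, with ``rate >= 2`` so skipped rows are isolated); ``compute_row`` is a
+hypothetical stand-in for the dense computation of one output row.
+
+.. code-block:: python
+
+   import numpy as np
+
+   def perforated_rows(compute_row, n_rows, n_cols, offset, rate):
+       # Rows at offset, offset+rate, ... are skipped (assumed stride semantics).
+       skipped = set(range(offset, n_rows, rate))
+       out = np.zeros((n_rows, n_cols), dtype=np.float32)
+       for r in range(n_rows):
+           if r not in skipped:
+               out[r] = compute_row(r)  # dense computation of a kept row
+       for r in skipped:
+           # Interpolate a skipped row as the mean of its computed neighbors;
+           # fall back to the single neighbor at the tensor boundary.
+           above = out[r - 1] if r > 0 else out[r + 1]
+           below = out[r + 1] if r + 1 < n_rows else out[r - 1]
+           out[r] = (above + below) / 2
+       return out
+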
+
+Filter Sampling
+---------------
+
+Overview
+^^^^^^^^^
+Convolution with the filter sampling approximation performs the convolution operation using a sampled filter, i.e. a filter with missing elements. This reduces the number of MAC operations performed while improving cache and memory bandwidth usage by reducing the overall filter size. Filter sampling is performed at a rate, which is the percentage of elements sampled relative to the total number of elements in the tensor. Sampling starts from an offset into the filter, which is the index of the filter element at which sampling begins; filter elements before this index are not skipped/sampled.
+
+Description
+^^^^^^^^^^^
+The algorithm for convolution using a reduced filter is implemented in three major steps (a short code sketch follows the list):
+
+* **Creation of sampled filter:** This step allocates a new sampled filter whose size is based on the sampling rate and offset. The appropriate elements of the original filter are scaled up by a factor of rate / (rate - 1) and copied to the newly allocated reduced filter. Scaling up the filter elements helps compensate for the accuracy lost by sampling the filter.
+
+* **Patch matrix creation:** Based on the indices of the elements of the original filter that make up the sampled filter, the corresponding elements of the input tensor are used to create a new matrix called an input-patch matrix. The input-patch matrix is laid out in memory in such a way that convolution is then reduced to a simple matrix multiplication operation.
+
+* **Dense matrix multiplication:** This step performs a matrix multiplication on the (sampled) filter and input-patch matrices.
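+
+A minimal sketch of the sampling step is shown below, assuming the "rate" is
+expressed as a stride over a flattened filter (every ``rate``-th element from
+``offset`` is dropped); the kept indices then determine which input-patch rows
+are built for the dense matrix multiplication.
+
+.. code-block:: python
+
+   import numpy as np
+
+   def sample_filter(filt, offset, rate):
+       flat = filt.ravel()
+       # Keep every element except those at offset, offset+rate, ...
+       kept = [i for i in range(flat.size)
+               if i < offset or (i - offset) % rate != 0]
+       # Scale kept elements by rate / (rate - 1) to compensate for the
+       # dropped ones.
+       return flat[kept] * (rate / (rate - 1.0)), kept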
diff --git a/hpvm/docs/developerdocs/cnn-models.rst b/hpvm/docs/developerdocs/cnn-models.rst
new file mode 100644
index 0000000000000000000000000000000000000000..de0847031a21ad62169f72205bea8e7debbb1e7c
--- /dev/null
+++ b/hpvm/docs/developerdocs/cnn-models.rst
@@ -0,0 +1,16 @@
+
+CNN Model Weights
+===================
+
+The CNN weights (and input) files can be downloaded from here: https://databank.illinois.edu/datasets/IDB-6565690
+
+The extracted ``model_params`` directory is to be copied to ``hpvm/hpvm/test/dnn_benchmarks/model_params`` - the CNN benchmarks expect the model weights at this specific location. The automatic HPVM installer (``install.sh``) handles the data download, extraction, and copying.
+
+We support CNN weights in three formats:
+
+* ``.h5`` file format: The entire CNN model is stored as a single ``.h5`` file. The Keras models are shipped as ``.h5`` files. These can be found under ``model_params/keras``.
+
+* ``.pth.tar`` file format: The PyTorch models are shipped as ``.pth.tar`` files and can be found under ``model_params/pytorch``.
+
+* ``.bin`` serialized binary file format: This format is used by the HPVM binaries. The Convolution, Dense, and BatchNormalization parameters for each layer are stored as individual files. The weights are serialized FP32 values laid out serially in ``NCHW`` format. Our frontends (Keras and PyTorch) convert ``.h5`` and ``.pth.tar`` files into ``.bin`` files in the frontend translation phase. The ``.bin`` weights can be found under the respective subdirectory for each benchmark under ``model_params/``.
+
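+Since the ``.bin`` files are just raw FP32 values, they can be read back with a
+few lines of Python; the file name and tensor shape below are hypothetical
+examples for illustration:
+
+.. code-block:: python
+
+   import numpy as np
+
+   # Read a serialized weight file and restore its NCHW shape
+   # (e.g. 64 filters, 3 channels, 3x3 kernel - an assumed shape).
+   w = np.fromfile("model_params/alexnet/conv1.bin", dtype=np.float32)
+   w = w.reshape(64, 3, 3, 3)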
diff --git a/hpvm/docs/developerdocs/index.rst b/hpvm/docs/developerdocs/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..77083aa7d100550d2f564a089ae20a886b6bdbf2
--- /dev/null
+++ b/hpvm/docs/developerdocs/index.rst
@@ -0,0 +1,8 @@
+Developer Documents
+================================
+
+.. toctree::
+   :maxdepth: 1
+
+   approximation-implementation
+   cnn-models
diff --git a/hpvm/docs/getting-started.rst b/hpvm/docs/getting-started.rst
new file mode 100644
index 0000000000000000000000000000000000000000..f2f0c8452597b963f282b7767ce06894e32f897e
--- /dev/null
+++ b/hpvm/docs/getting-started.rst
@@ -0,0 +1,4 @@
+Getting Started
+===============
+
+TODO: this is the system-wide tour Sasa was suggesting. Finish this.
diff --git a/hpvm/docs/hpvm-c.md b/hpvm/docs/hpvm-c.md
deleted file mode 100644
index 76cfde58c0e406896606bc9703e88d8a9bf27fa7..0000000000000000000000000000000000000000
--- a/hpvm/docs/hpvm-c.md
+++ /dev/null
@@ -1,131 +0,0 @@
-# HPVM-C Language Specification
-An HPVM program is a combination of host code and one or more data flow graphs (DFG) at the IR level. We provide C function declarations representing the HPVM intrinsics that allow creating, querying, and interacting with the DFGs. More details about the HPVM IR intrinsics can be found in [the HPVM IR Specification.](/hpvm/docs/hpvm-specification.md).
-
-An HPVM-C program contains both the host and the DFG code. Each HPVM kernel, represented by a leaf node in the DFG, can be compiled to multiple different targets (e.g. CPU and GPU) as described below. 
-
-This document describes all the API calls that can be used in an HPVM-C program.
-
-## Host API
-
-```void __hpvm__init()```  
-Used before all other HPVM calls to initialize the HPVM runtime.
-
-```void __hpvm__cleanup()```  
-Used at the end of HPVM program to clean up all remaining runtime-created HPVM objects.
-
-```void llvm_hpvm_track_mem(void* ptr, size_t sz)```  
-Insert memory starting at ```ptr``` of size ```sz``` in the memory tracker of HPVM runtime.
-
-```void llvm_hpvm_untrack_mem(void* ptr)```  
-Stop tracking the memory object identified by ```ptr```.
-
-```void llvm_hpvm_request_mem(void* ptr, size_t sz)```  
-If the memory object identified by ```ptr``` is not in host memory, copy it to host memory.
-
-```void* __hpvm__launch(unsigned isStream, void* rootGraph, void* args)```  
-Launches the execution of the dataflow graph with node function ```rootGraph```. ```args``` is a pointer to a packed struct, containing one field per argument of the RootGraph function, consecutively. For non-streaming DFGs with a non empty result type, ```args``` must contain an additional field of the type ```RootGraph.returnTy```, where the result of the graph will be returned. ```isStream``` chooses between a non streaming (0) or streaming (1) graph execution. Returns a handle to the executing graph.
-
-```void __hpvm__wait(void* G)```  
-Waits for completion of execution of the dataflow graph with handle ```G```.
-
-```void __hpvm__push(void* G, void* args)```  
-Push set of input data items, ```args```, (same as type included in launch) to streaming DFG with handle ```G```.
-
-```void* __hpvm__pop(void* G)```  
-Pop and return data produced from one execution of streaming DFG with handle ```G```. The return type is a struct containing a field for every output of DFG. 
-
-## Internal Node API
-
-```void* __hpvm__createNodeND(unsigned dims, void* F, ...)```  
-Creates a static dataflow node replicated in ```dims``` dimensions (0 to 3), each executing node function ```F```. The arguments following ```F``` are the size of each dimension, respectively, passed in as a ```size_t```. Returns a handle to the created dataflow node.
-
-```void* __hpvm__edge(void* src, void* dst, unsigned replType, unsigned sp, unsigned dp, unsigned isStream)```  
-Creates an edge from output ```sp``` of node ```src``` to input ```dp``` of node ```dst```. If ```replType``` is 0, the edge is a one-to-one edge, otherwise it is an all-to-all edge. ```isStream``` defines whether or not the edge is streaming. Returns a handle to the created edge.
-
-```void __hpvm__bindIn(void* N, unsigned ip, unsigned ic, unsigned isStream)```  
-Binds the input ```ip``` of the current node to input ```ic``` of child node function ```N```. ```isStream``` defines whether or not the input bind is streaming.
-
-```void __hpvm__bindOut(void* N, unsigned op, unsigned oc, unsigned isStream)```  
-Binds the output ```op``` of the current node to output ```oc``` of child node function ```N```. ```isStream``` defines whether or not the output bind is streaming.
-
-```void __hpvm__hint(enum Target target)``` (C\)  
-```void __hpvm__hint(hpvm::Target target)``` (C++)  
-Must be called once in each node function. Indicates which hardware target the current function should run in.
-
-```void __hpvm__attributes(unsigned ni, …, unsigned no, …)```  
-Must be called once at the beginning of each node function. Defines the properties of the pointer arguments to the current function. ```ni``` represents the number of input arguments, and ```no``` the number of output arguments. The arguments following ```ni``` are the input arguments, and the arguments following ```no``` are the output arguments. Arguments can be marked as both input and output. All pointer arguments must be included.
-
-## Leaf Node API
-```void __hpvm__hint(enum Target target)``` (C\)  
-```void __hpvm__hint(hpvm::Target target)``` (C++)  
-As described in internal node API.
-
-```void __hpvm__attributes(unsigned ni, …, unsigned no, …)```  
-As described in internal node API.
-
-```void __hpvm__return(unsigned n, ...)```  
-Returns ```n``` values from a leaf node function. The remaining arguments are the values to be returned. All ```__hpvm__return``` statements within the same function must return the same number of values.
-
-```void* __hpvm__getNode()```  
-Returns a handle to the current leaf node.
-
-```void* __hpvm__getParentNode(void* N)```  
-Returns a handle to the parent node of node ```N```.
-
-```long __hpvm__getNodeInstanceID_{x,y,z}(void* N)```  
-Returns the dynamic ID of the current instance of node ```N``` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
-
-```long __hpvm__getNumNodeInstances_{x,y,z}(void* N)```  
-Returns the number of dynamic instances of node ```N``` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
-
-```void* __hpvm__malloc(long nBytes)```  
-Allocate a block of memory of size ```nBytes``` and returns a pointer to it. The allocated object can be shared by all nodes. *Note that the returned pointer must somehow be communicated explicitly for use by other nodes.*
-
-```int __hpvm__atomic_add(int* m, int v)```  
-Atomically adds ```v``` to the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_sub(int* m, int v)```  
-Atomically subtracts ```v``` from the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_min(int* m, int v)```  
-Atomically computes the min of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_max(int* m, int v)```  
-Atomically computes the max of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_xchg(int* m, int v)```  
-Atomically swaps ```v``` with the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_and(int* m, int v)```  
-Atomically computes the bitwise AND of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_or(int* m, int v)```  
-Atomically computes the bitwise OR of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```int __hpvm__atomic_xor(int* m, int v)```  
-Atomically computes the bitwise XOR of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```void __hpvm__barrier()```  
-Local synchronization barrier across dynamic instances of current leaf node.
-
-# Porting a Program from C to HPVM-C
-
-The following represents the required steps to port a regular C program into an HPVM program with HPVM-C. These steps are described at a high level; for more detail, please see [hpvm-cava](/hpvm/test/benchmarks/hpvm-cava) provided in [benchmarks](/hpvm/test/benchmarks).
-* Separate the computation that will become a kernel into its own (leaf node) function and add the attributes and target hint.
-* Create a level 1 wrapper node function that will describe the thread-level parallelism (for the GPU). The node will:
-    * Use the ```createNode[ND]()``` method to create a kernel node and specify how many threads will execute it.
-    * Bind its arguments to the kernel arguments.
-* If desired, create a level 2 wrapper node function which will describe the threadblock-level parallalism (for the GPU). This node will:
-    * Use the ```createNode[ND]()``` method to create a level 1 wrapper node and specify how many threadblocks will execute it.
-    * Bind its arguments to its child node's arguments.
-* A root node function that creates all the top-level wrapper nodes, binds their arguments, and connects their edges.
-    * Each root node represents a DFG.
-* All the above node functions have the combined arguments of all the kernels that are nested at each level. 
-* The host code will have to include the following:
-    * Initialize the HPVM runtime using the ```init()``` method.
-    * Create an argument struct for each DFG and assign its member variables.
-    * Add all the memory that is required by the kernel into the memory tracker.
-    * Launch the DFG by calling the ```launch()``` method on the root node function, and passing the corresponding argument struct.
-    * Wait for the DFG to complete execution.
-    * Read out any generated memory using the ```request_mem()``` method.
-    * Remove all the tracked memory from the memory tracker.
diff --git a/hpvm/docs/hpvm-specification.md b/hpvm/docs/hpvm-specification.md
deleted file mode 100644
index 54023fc9eddc4d9f317ac1b4cc585e52b98b8ae5..0000000000000000000000000000000000000000
--- a/hpvm/docs/hpvm-specification.md
+++ /dev/null
@@ -1,221 +0,0 @@
-# Table of Contents
-* [HPVM Abstraction](#abstraction)
-    * [Dataflow Node](#node)
-    * [Dataflow Edge](#edge)
-    * [Input and Output Bind](#bind)
-    * [Host Code](#host)
-* [HPVM Implementation](#implementation)
-    * [Intrinsics for Describing Graphs](#describing)
-    * [Intrinsics for Querying Graphs](#querying)
-    * [Intrinsics for Memory Allocation and Synchronization](#memory)
-    * [Intrinsics for Graph Interaction](#interaction)
-* [Implementation Limitations](#limitations)
-
-<a name="abstraction"></a>
-# HPVM Abstraction 
-An HPVM program is a combination of host code plus a set of one or more distinct dataflow graphs. Each dataflow graph (DFG) is a hierarchical graph with side effects. The DFG must be acyclic. Nodes represent units of execution, and edges between nodes describe the explicit data transfer requirements. A node can begin execution once a data item becomes available on every one of its input edges. Repeated transfer of data items between nodes (if more inputs are provided) yields a pipelined execution of different nodes in the graph. The execution of a DFG is initiated and terminated by host code that launches the graph. Nodes may access globally shared memory through load and store instructions (side-effects).
-
-<a name="node"></a>
-## Dataflow Node 
-A *dataflow node* represents unit of computation in the DFG. A node can begin execution once a data item becomes available on every one of its input edges.
-
-A single static dataflow node represents multiple dynamic instances of the node, each executing the same computation with different index values used to uniquely identify each dynamic instance w.r.t. the others. The dynamic instances of a node may be executed concurrently, and any required synchronization must imposed using HPVM synchronization operations.
-
-Each dataflow node in a DFG can either be a *leaf node* or an *internal node*. An internal node contains a complete DFG, called a *child graph*, and the child graph itself can have internal nodes and/or leaf nodes.
-
-Internal nodes only create the structure of the child graph, and cannot include actual computation. 
-
-Leaf nodes contain code expressing actual computations. Leaf nodes may contain instructions to query the structure of the underlying DFG, and any non host side HPVM operation for synchronization and memory allocation.
-
-Note that the graph is fully interpreted at compile-time and  cannot be modified at runtime except for the number of dynamic instances, which can be data dependent.
-
-<a name="edge"></a>
-## Dataflow Edge 
-A *dataflow edge* from the output ```out``` of a source dataflow node ```Src``` to the input ```in``` of a sink dataflow node ```Dst``` describes the explicit data transfer requirements. ```Src``` and ```Dst``` node must belong to the same child graph, i.e. must be children of the same internal node.
-
-An edge from source to sink has the semantics of copying the specified data from the source to the sink after the source node has completed execution. The pairs ```(Src, out)``` and ```(Dst, in)```, representing source and sink respectively, must be unique w.r.t. every other edge in the same child graph, i.e. two dataflow edges in the same child graph cannot have the same source or destination.
-
-A static edge also represents multiple dynamic instances of that edge between the dynamic instances of the source and the sink nodes.
-
-An edge can be instantiated at runtime using one of two replication mechanisms:
-- *All-to-all*, where all dynamic instances of the source node are connected to all dynamic instances of the sink node, thus expressing a synchronization barrier between the two groups of nodes, or
-- *One-to-one*, where each dynamic instance of the source node is connected to a single corresponding instance of the sink node. One-to-one replication requires that the grid structure (number of dimensions and the extents in each dimension) of the source and sink nodes be identical.
-
-<a name="bind"></a>
-## Input and Output Bind 
-An internal node is responsible for mapping its inputs, provided by incoming dataflow edges, to the inputs to one or more nodes of the child graph.
-
-An internal node binds its input ```ip``` to input ```ic``` of its child node ```Dst``` using an *input bind*.
-The pair ```(Dst, ic)``` must be unique, i.e. no two input binds in the same graph can have the same destination, as that would create a conflict. Semantically, these represent name bindings of input values and not data movement.
-
-Conversely, an internal node binds output ```oc``` of its child node ```Src``` to its output ```op``` using an *output bind*. The pair ```(Src, oc)``` and destination ```op``` must be unique, i.e. no two output binds in the same graph can have the same source destination, as that would create a conflict.
-
-A bind is always ***all-to-all***.
-
-<a name="host"></a>
-## Host Code 
-In an HPVM program, the host code is responsible for setting up, initiating the execution and blocking for completion of a DFG. The host can interact with the DFG to sustain a streaming computation by sending all data required for, and receiving all data produced by, one execution of the DFG. The list of actions that can be performed by the host is described below:
-
-- **Initialization and Cleanup**:
-All HPVM operations must be enclosed by the HPVM initialization and cleanup. These operations perform initialization and cleanup of runtime constructs that provide the runtime support for HPVM.
-- **Track Memory**:
-Memory objects that are passed to dataflow graphs need to be managed by the HPVM runtime. Our memory model assumes two separate address spaces for host and device memory. The HPVM runtime includes a memory tracker for tracking the location of HPVM-managed memory objects between these address spaces. Track memory inserts the specified memory object in the memory tracker and starts tracking it.
-- **Untrack Memory**:
-Stop tracking specified memory object and remove it from memory tracker.
-- **Request Memory**:
-If the specified memory object is not present in host memory, copy it to host memory.
-- **Launch**:
-The host code initiates execution of specified DFG, either streaming or non streaming.
-    - Non streaming DFG: The host provides all data items required for execution of the DFG at the time of the launch.
-    - Streaming DFG: No data is provided by the launch operation. Streaming execution is sustained by push and pop operations, described below.
-- **Push**:
-Push a set of data items required for one graph execution to the specified DFG. The DFG must have been launched using a streaming launch operation. This is a blocking operation.
-- **Pop**:
-Read data produced from one execution of the specified DFG. The DFG must have been launched using a streaming launch operation. This is a blocking operation.
-- **Wait**:
-The host code blocks for completion of specified DFG.
-    - For a non-streaming DFG, the data produced by the DFG are ready to be read by the host.
-    - For a streaming DFG, no more data may be provided for processing by the DFG.
-
-<a name="implementation"></a>
-# HPVM Implementation 
-
-This section describes the implementation of HPVM on top of LLVM IR.
-
-iN is the N-bit integer type in LLVM.
-
-We use intrinsic functions to implement the HPVM IR.
-
-The code for each dataflow node is given as a separate LLVM function, called the node function. The node function may call additional, auxiliary functions. However, the auxiliary functions are not allowed to include any HPVM intrinsics, as they are not considered to be part of the HPVM node hierarchy.
-
-The incoming dataflow edges and their data types are denoted by the parameters to the node function. The outgoing dataflow edges are represented by the return type of the node function, which must be an LLVM struct type with zero or more fields (one per outgoing edge).
-
-Each top-level DFG in an HPVM program is defined by its own *root node function* which creates the underlying DFG structure. The DFG is the (internal) root node's child graph. Unlike regular internal nodes, the root node only has one dynamic instance because it instantiates the top-level DFG. The DFG is launched by the host code using the root node function, as described below.
-
-We represent nodes with opaque handles (pointers of LLVM type i8\*). We represent input edges of a node as integer indices into the list of function arguments, and output edges of a node as integer indices into the return struct type.
-
-Pointer arguments of node functions are required to be annotated with attributes in, and/or out, depending on their expected use (read only, write only, read write).
-
-<a name="describing"></a>
-## Intrinsics for Describing Graphs 
-
-The intrinsics for describing graphs can only be used by internal nodes. Also, internal nodes are only allowed to have these intrinsics as part of their node function, with the exception of a return statement of the appropriate type, in order to return the result of the outgoing dataflow edges.
-
-
-```i8* llvm.hpvm.createNode(i8* F)```  
-Create a static dataflow node with one dynamic instance executing node function ```F```. Return a handle to the created node.
-
-```i8* llvm.hpvm.createNode1D(i8* F, i64 n1)```  
-Create a static dataflow node replicated in one dimension, namely ```x```, with ```n1``` dynamic instances executing node function ```F```. Return a handle to the created node.
-
-```i8* llvm.hpvm.createNode2D(i8* F, i64 n1, i64 n2)```  
-Create a static dataflow node replicated in two dimensions, namely ```x``` and ```y```, with ```n1``` and ```n2``` dynamic instances in each dimension respectively, executing node function ```F```. Return a handle to the created node.
-
-```i8* llvm.hpvm.createNode3D(i8* F, i64 n1, i64 n2, i64 n3)```  
-Create a static dataflow node replicated in three dimensions, namely ```x```, ```y``` and ```z```, with ```n1```, ```n2``` and ```n3``` dynamic instances in each dimension respectively, executing node function ```F```. Return a handle to the created node.
-
-```i8* llvm.hpvm.createEdge(i8* Src, i8* Dst, i1 ReplType, i32 sp, i32 dp, i1 isStream)```  
-Create edge from output ```sp``` of node ```Src``` to input ```dp``` of node ```Dst```. Argument ```dp``` of ```Dst```'s node function and field ```sp``` of the return struct in ```Src```'s node function must have matching types. ```ReplType``` chooses between a one-to-one (0) or all-to-all (1) edge. ```isStream``` chooses a streaming (1) or non streaming (0) edge. Return a handle to the created edge.
-
-```void llvm.hpvm.bind.input(i8* N, i32 ip, i32 ic, i1 isStream)```  
-Bind input ```ip``` of current node to input ```ic``` of child node ```N```. Argument ```ic``` of ```N```'s node function and argument ```ip``` of the current node function must have matching types. ```isStream``` chooses a streaming (1) or non streaming (0) bind.
-
-```void llvm.hpvm.bind.output(i8* N, i32 oc, i32 op, i1 isStream)```  
-Bind output ```oc``` of child node ```N``` to output ```op``` of current node. Field ```oc``` of the return struct in ```N```'s node function and field ```op``` of the return struct in the current node function must have matching types. ```isStream``` chooses a streaming (1) or non streaming (0) bind.
-
-<a name="querying"></a>
-## Intrinsics for Querying Graphs 
-
-The following intrinsics are used to query the structure of the DFG. They can only be used by leaf nodes.
-
-```i8* llvm.hpvm.getNode()```  
-Return a handle to the current leaf node.
-
-```i8* llvm.hpvm.getParentNode(i8* N)```  
-Return a handle to the parent in the hierarchy of node ```N```.
-
-```i32 llvm.hpvm.getNumDims(i8* N)```  
-Get the number of dimensions of node ```N```.
-
-```i64 llvm.hpvm.getNodeInstanceID.{x,y,z}(i8* N)```  
-Get index of current dynamic node instance of node ```N``` in dimension x, y or z respectively. The dimension must be one of the dimensions in which the node is replicated.
-
-```i64 llvm.hpvm.getNumNodeInstances.{x,y,z}(i8* N)```  
-Get number of dynamic instances of node ```N``` in dimension x, y or z respectively. The dimension must be one of the dimensions in which the node is replicated.
-
-<a name="memory"></a>
-## Intrinsics for Memory Allocation and Synchronization
-
-The following intrinsics are used for memory allocation and synchronization. They can only be used by leaf nodes.
-
-```i8* llvm.hpvm.malloc(i64 nBytes)```  
-Allocate a block of memory of size ```nBytes``` and return pointer to it. The allocated object can be shared by all nodes.  
-*Note that the returned pointer must somehow be communicated explicitly for use by other nodes.*
-
-```i32 llvm.hpvm.atomic.add(i8* m, i32 v)```  
-Atomically computes the bitwise ADD of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.sub(i8* m, i32 v)```  
-Atomically computes the bitwise SUB of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.min(i8* m, i32 v)```  
-Atomically computes the bitwise MIN of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.max(i8* m, i32 v)```  
-Atomically computes the bitwise MAX of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.xchg(i8* m, i32 v)```  
-Atomically computes the bitwise XCHG of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.and(i8* m, i32 v)```  
-Atomically computes the bitwise AND of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.or(i8* m, i32 v)```  
-Atomically computes the bitwise OR of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```i32 llvm.hpvm.atomic.xor(i8* m, i32 v)```  
-Atomically computes the bitwise XOR of ```v``` and the value stored at memory location ```[m]``` w.r.t. the dynamic instances of the current leaf node and stores the result back into ```[m]```. Returns the value previously stored at ```[m]```.
-
-```void llvm.hpvm.barrier()```  
-Local synchronization barrier across dynamic instances of current leaf node.
-
-<a name="interaction"></a>
-## Intrinsics for Graph Interaction
-
-The following intrinsics are for graph initialization/termination and interaction with the host code, and can be used only by the host code.
-
-```void llvm.hpvm.init()```  
-Initialization of HPVM runtime.
-
-```void llvm.hpvm.cleanup()```  
-Cleanup of HPVM runtime created objects.
-
-```void llvm.hpvm.trackMemory(i8* ptr, i64 sz)```  
-Insert memory starting at ```ptr``` of size ```sz``` in the memory tracker. ```ptr``` becomes the key for identifying this memory object. As soon as a memory object is inserted in the memory tracker it starts being tracked, and can be passed as a data item to a DFG.
-
-```void llvm.hpvm.untrackMemory(i8* ptr)```  
-Stop tracking memory object with key ```ptr```, and remove it from memory tracker.
-
-```void llvm.hpvm.requestMemory(i8* ptr, i64 sz)```  
-If memory object with key ```ptr``` is not located in host memory, copy it to host memory.
-
-```i8* llvm.hpvm.launch(i8* RootGraph, i8* Args, i1 isStream)```  
-Launch the execution of a top-level DFG with root node function ```RootGraph```. ```Args``` is a pointer to a packed struct, containing one field per argument of the ```RootGraph``` function, consecutively. For non-streaming DFGs with a non empty result type, ```Args``` must contain an additional field of the type ```RootGraph.returnTy```, where the result of the graph will be returned. ```isStream``` chooses between a non streaming (0) or streaming (1) graph execution. Return a handle to the invoked DFG.
-
-```void llvm.hpvm.wait(i8* GraphID)```  
-Wait for completion of execution of DFG with handle ```GraphID```.
-
-```void llvm.hpvm.push(i8* GraphID, i8* args)```  
-Push set of input data ```args``` (same as type included in launch) to streaming DFG with handle ```GraphID```.
-
-```i8* llvm.hpvm.pop(i8* GraphID)```  
-Pop and return data from streaming DFG with handle ```GraphID```. The return type is a struct containing a field for every output of DFG. 
-
-<a name="limitations"></a>
-## Implementation Limitations
-Due to limitations of our current prototype implementation, the following restrictions are imposed:
-
-- In HPVM, a memory object is represented as a (pointer, size) pair that includes the address of memory object, and the size (in bytes) of the pointed-to object. Therefore, when an edge/bind carries a pointer, it must be followed by an i64 size value.           
-- Pointers cannot be transferred between nodes using dataflow edges. Instead, they should be passed using the bind operation from the (common) parent of the source and sink nodes.
-
-- Instantiation of dataflow nodes is supported in up to three dimensions.
diff --git a/hpvm/docs/index.rst b/hpvm/docs/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..694a51148d474d9cb8249bbee7ba333b6b65b0ab
--- /dev/null
+++ b/hpvm/docs/index.rst
@@ -0,0 +1,41 @@
+.. _contents:
+
+The HPVM Compiler Infrastructure
+================================
+
+HPVM is a compiler for heterogeneous parallel systems.
+For more about what HPVM is, see `our website <https://publish.illinois.edu/hpvm-project/>`_
+and publications:
+`PPoPP'18 paper <https://dl.acm.org/doi/pdf/10.1145/3200691.3178493>`_,
+`OOPSLA'19 paper <https://dl.acm.org/doi/10.1145/3360612>`_,
+`PPoPP'21 paper <https://dl.acm.org/doi/10.1145/3437801.3446108>`_.
+
+This is the documentation of HPVM at **version 1.0**.
+
+Audience
+--------
+
+TODO: write something here.
+
+Documentation
+-------------
+
+.. toctree::
+   :maxdepth: 1
+
+   install
+   getting-started
+   tests
+   components/index
+   references/index
+   developerdocs/index
+
+Indices and tables
+------------------
+
+* :ref:`genindex`
+
+Support
+-------
+
+All questions can be directed to `hpvm-dev@lists.cs.illinois.edu <mailto:hpvm-dev@lists.cs.illinois.edu>`_.
diff --git a/hpvm/docs/install.rst b/hpvm/docs/install.rst
new file mode 100644
index 0000000000000000000000000000000000000000..47b400d2be63dbbc6caa548384656dc9e4d27bd1
--- /dev/null
+++ b/hpvm/docs/install.rst
@@ -0,0 +1,164 @@
+Install
+===============
+
+Dependencies
+------------
+
+The following components are required to be installed on your machine to build HPVM.
+
+* GCC (>=5.1)
+
+  * In addition, each version of CUDA-nvcc requires GCC to be not newer than a certain version.
+    See `here <https://gist.github.com/ax3l/9489132>`_ for the support matrix.
+
+* CMake (>=3.17)
+* GNU Make (>=3.79)
+* OpenCL (>=1.0.0)
+* CUDA (>=9.1)
+* Python (==3.6) with pip (>=20)
+
+Python must be strictly 3.6 (any subversion from 3.6.0 to 3.6.13).
+Alternatively, if you use Anaconda for package management,
+we provide a conda environment file that covers all Python and package requirements:
+
+.. code-block:: bash
+
+   conda env create -n hpvm -f hpvm/env.yaml
+
+
+Supported Architectures
+-----------------------
+
+Supported/tested CPU architectures:
+
+* Intel Xeon E5-2640
+* Intel Xeon W-2135
+* ARM Cortex A-57
+
+Supported/tested GPU architectures for OpenCL backend:
+
+* Nvidia Quadro P1000
+* Nvidia GeForce GTX 1080
+
+Supported/tested GPU architectures for Tensor Backend:
+
+* Nvidia Jetson TX2
+* Nvidia GeForce GTX 1080
+
+HPVM has not been tested on, but might work with, other CPUs supported by the LLVM
+backend and GPUs supported by OpenCL (e.g. Intel, AMD).
+
+**NOTE**: Approximations are tuned for the Jetson TX2, and the same speedups may not exist for other architectures.
+
+
+Installing from Source
+----------------------
+
+Checkout HPVM and go to directory ``./hpvm`` under project root:
+
+.. code-block:: shell
+
+   git clone --recursive -b approx_hpvm_reorg --single-branch https://gitlab.engr.illinois.edu/llvm/hpvm.git
+   cd hpvm/
+
+HPVM needs to be able to find CUDA.
+If CUDA is on your system's ``$PATH`` (e.g. if it was installed at the default location),
+HPVM can find it automatically.
+
+Use the HPVM installer script to download, configure, and build HPVM along with LLVM and Clang:
+
+.. code-block:: shell
+
+   ./install.sh
+
+* Without arguments, this script will interactively prompt you for some parameters.
+  Alternatively, use ``./install.sh -h`` for a list of available arguments
+  and pass arguments as required.
+
+* ``./install.sh`` can relay additional arguments to CMake, but the leading dash
+  must be dropped, whether the arguments are given at the prompt or on the command line.
+  For example,
+
+  .. code-block:: shell
+
+   ./install.sh -j32 DCMAKE_BUILD_TYPE=Release
+
+  will compile HPVM with 32 threads in Release mode; similarly, entering
+  ``DCMAKE_BUILD_TYPE=Release`` at the prompt also sends ``-DCMAKE_BUILD_TYPE=Release``
+  to CMake, which gives a Release-mode build.
+
+After configuring HPVM,
+the installer will also compile HPVM by default, which you can opt out of.
+If you do so, follow the next section "Manually Build HPVM" to compile HPVM manually,
+and "Benchmarks and Tests" to run the test cases manually if you wish.
+Otherwise, you can skip the next two sections.
+
+* Specifically, the HPVM installer downloads LLVM and Clang, copies the HPVM source into
+  ``llvm/tools``, and builds the entire tree. It also builds a modified LLVM C-Backend,
+  based on the one maintained by `Julia Computing <https://github.com/JuliaComputing/llvm-cbe>`_,
+  as a part of HPVM; this backend is currently used to generate OpenCL kernels for GPUs.
+
+Troubleshooting
+^^^^^^^^^^^^^^^
+
+If CMake did not find your CUDA installation, the following environment variables will help it:
+
+* ``CUDA_TOOLKIT_PATH`` --- Path to the CUDA toolkit
+* ``CUDA_INCLUDE_PATH`` --- Path to the CUDA headers
+* ``CUDA_LIB_PATH`` --- Path to CUDA libraries
+
+You can use ``set_paths.sh`` for this purpose: modify the values of these variables
+in ``set_paths.sh`` according to your system, and source the script:
+
+.. code-block:: shell
+
+   source set_paths.sh
+
+Manually Build HPVM
+-------------------
+
+Alternatively, you can manually build HPVM with CMake.
+Please note that in this case,
+the installer script still *must* be executed to obtain some required components,
+but without the build step.
+In the current directory (``hpvm/``), do
+
+.. code-block:: shell
+
+   mkdir build
+   cd build
+   cmake ../llvm [options]
+   export PATH=$(realpath ./bin):$PATH
+
+**Note** that you must manually add the ``build/bin`` directory to your ``$PATH``
+as an absolute path (as shown above).
+
+Some common options that can be used with CMake are:
+
+* ``-DCMAKE_INSTALL_PREFIX=directory`` --- Specify the full pathname of the directory where you want the HPVM tools and libraries to be installed.
+* ``-DCMAKE_BUILD_TYPE=type`` --- Valid options for type are Debug, Release, RelWithDebInfo, and MinSizeRel. Default is Debug.
+* ``-DLLVM_ENABLE_ASSERTIONS=On`` --- Compile with assertion checks enabled (default is Yes for Debug builds, No for all other build types).
+
+Now, compile the HPVM Compilation Tool ``approxhpvm.py`` using:
+
+.. code-block:: shell
+
+   make -j<number of threads> approxhpvm.py
+
+With all the aforementioned steps, HPVM should be built and ready to use.
+In particular, ``approxhpvm.py`` should be an executable command from your command line.
+
+Benchmarks and Tests
+--------------------
+
+We provide a number of general benchmarks, DNN benchmarks, and test cases, written in HPVM.
+
+The ``make`` targets ``check-hpvm-pass``, ``check-hpvm-dnn``, and ``check-hpvm-profiler``
+test various components of HPVM and are increasingly time-consuming.
+You can run these tests the same way ``approxhpvm.py`` is compiled: for example,
+
+.. code-block:: shell
+
+   make -j<number of threads> check-hpvm-pass
+
+runs ``check-hpvm-pass`` tests. See TODO for details on benchmarks and test cases.
diff --git a/hpvm/docs/references/compilation-process.rst b/hpvm/docs/references/compilation-process.rst
new file mode 100644
index 0000000000000000000000000000000000000000..1115de935f4adcb06e1946bf97983641877fa5ee
--- /dev/null
+++ b/hpvm/docs/references/compilation-process.rst
@@ -0,0 +1,23 @@
+.. _hpvm-comp-process:
+
+HPVM Compilation Process
+========================
+
+Compilation of an HPVM program involves the following steps:
+
+
+#. ``clang`` takes an HPVM-C/C++ program (e.g. ``main.c``) and produces an LLVM IR (``main.ll``) file that contains the HPVM-C function calls. These functions are declared in ``test/benchmark/include/hpvm.h``, which must be included in the program.
+#. ``opt`` takes ``main.ll`` and invokes the GenHPVM pass on it, which converts the HPVM-C function calls to HPVM intrinsics. This generates the HPVM textual representation (``main.hpvm.ll``).
+#. ``opt`` takes the HPVM textual representation (``main.hpvm.ll``) and invokes the following passes in sequence: 
+
+   * BuildDFG: Converts the textual representation to the internal HPVM representation.
+   * LocalMem and DFG2LLVM_OpenCL: Invoked only when GPU target is selected. Generates the kernel module (``main.kernels.ll``) and the portion of the host code that invokes the kernel into the host module (``main.host.ll``).
+   * DFG2LLVM_CPU: Generates either all of the host module (``main.host.ll``) or the remainder of it, depending on the chosen target.
+   * ClearDFG: Deletes the internal HPVM representation from memory.
+
+#. ``clang`` is used to compile any remaining project files that will later be linked with the host module.
+#. ``llvm-link`` takes the host module and all the other generated ``.ll`` files, and links them with the HPVM runtime module (``hpvm-rt.bc``) to generate the linked host module (``main.host.linked.ll``).
+#. Generate the executable code from the generated ``ll`` files for all parts of the program:
+
+   * GPU target: ``llvm-cbe`` takes the kernel module (``main.kernels.ll``) and generates an OpenCL representation of the kernels that will be invoked by the host.
+   * CPU target: ``clang`` takes the linked host module (``main.host.linked.ll``) and generates the CPU binary.
diff --git a/hpvm/docs/references/hpvm-c.rst b/hpvm/docs/references/hpvm-c.rst
new file mode 100644
index 0000000000000000000000000000000000000000..8956bf0c87118ff37e3c2020d94138cfd9b03c29
--- /dev/null
+++ b/hpvm/docs/references/hpvm-c.rst
@@ -0,0 +1,151 @@
+.. role:: raw-html-m2r(raw)
+   :format: html
+
+
+HPVM-C Language Specification
+=============================
+
+An HPVM program is a combination of host code and one or more data flow graphs (DFG) at the IR level. We provide C function declarations representing the HPVM intrinsics that allow creating, querying, and interacting with the DFGs. More details about the HPVM IR intrinsics can be found in `the HPVM IR Specification <hpvm-specification.html>`_.
+
+An HPVM-C program contains both the host and the DFG code. Each HPVM kernel, represented by a leaf node in the DFG, can be compiled to multiple different targets (e.g. CPU and GPU) as described below. 
+
+This document describes all the API calls that can be used in an HPVM-C program.
+
+Host API
+--------
+
+``void __hpvm__init()``:raw-html-m2r:`<br>`
+Used before all other HPVM calls to initialize the HPVM runtime.
+
+``void __hpvm__cleanup()``:raw-html-m2r:`<br>`
+Used at the end of HPVM program to clean up all remaining runtime-created HPVM objects.
+
+``void llvm_hpvm_track_mem(void* ptr, size_t sz)``:raw-html-m2r:`<br>`
+Insert memory starting at ``ptr`` of size ``sz`` in the memory tracker of the HPVM runtime.
+
+``void llvm_hpvm_untrack_mem(void* ptr)``:raw-html-m2r:`<br>`
+Stop tracking the memory object identified by ``ptr``.
+
+``void llvm_hpvm_request_mem(void* ptr, size_t sz)``:raw-html-m2r:`<br>`
+If the memory object identified by ``ptr`` is not in host memory, copy it to host memory.
+
+``void* __hpvm__launch(unsigned isStream, void* rootGraph, void* args)``:raw-html-m2r:`<br>`
+Launches the execution of the dataflow graph with node function ``rootGraph``. ``args`` is a pointer to a packed struct, containing one field per argument of the RootGraph function, consecutively. For non-streaming DFGs with a non empty result type, ``args`` must contain an additional field of the type ``RootGraph.returnTy``, where the result of the graph will be returned. ``isStream`` chooses between a non streaming (0) or streaming (1) graph execution. Returns a handle to the executing graph.
+
+``void __hpvm__wait(void* G)``:raw-html-m2r:`<br>`
+Waits for completion of execution of the dataflow graph with handle ``G``.
+
+``void __hpvm__push(void* G, void* args)``:raw-html-m2r:`<br>`
+Push a set of input data items ``args`` (of the same type as included in the launch) to the streaming DFG with handle ``G``.
+
+``void* __hpvm__pop(void* G)``:raw-html-m2r:`<br>`
+Pop and return data produced from one execution of streaming DFG with handle ``G``. The return type is a struct containing a field for every output of DFG. 
+
+Internal Node API
+-----------------
+
+``void* __hpvm__createNodeND(unsigned dims, void* F, ...)``:raw-html-m2r:`<br>`
+Creates a static dataflow node replicated in ``dims`` dimensions (0 to 3), each executing node function ``F``. The arguments following ``F`` are the size of each dimension, respectively, passed in as a ``size_t``. Returns a handle to the created dataflow node.
+
+``void* __hpvm__edge(void* src, void* dst, unsigned replType, unsigned sp, unsigned dp, unsigned isStream)``:raw-html-m2r:`<br>`
+Creates an edge from output ``sp`` of node ``src`` to input ``dp`` of node ``dst``. If ``replType`` is 0, the edge is a one-to-one edge, otherwise it is an all-to-all edge. ``isStream`` defines whether or not the edge is streaming. Returns a handle to the created edge.
+
+``void __hpvm__bindIn(void* N, unsigned ip, unsigned ic, unsigned isStream)``:raw-html-m2r:`<br>`
+Binds the input ``ip`` of the current node to input ``ic`` of child node function ``N``. ``isStream`` defines whether or not the input bind is streaming.
+
+``void __hpvm__bindOut(void* N, unsigned op, unsigned oc, unsigned isStream)``:raw-html-m2r:`<br>`
+Binds the output ``op`` of the current node to output ``oc`` of child node function ``N``. ``isStream`` defines whether or not the output bind is streaming.
+
+``void __hpvm__hint(enum Target target)`` (C):raw-html-m2r:`<br>`
+``void __hpvm__hint(hpvm::Target target)`` (C++):raw-html-m2r:`<br>`
+Must be called once in each node function. Indicates which hardware target the current function should run on.
+
+``void __hpvm__attributes(unsigned ni, …, unsigned no, …)``:raw-html-m2r:`<br>`
+Must be called once at the beginning of each node function. Defines the properties of the pointer arguments to the current function. ``ni`` represents the number of input arguments, and ``no`` the number of output arguments. The arguments following ``ni`` are the input arguments, and the arguments following ``no`` are the output arguments. Arguments can be marked as both input and output. All pointer arguments must be included.
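+
+Putting the internal node API together, a minimal level 1 wrapper node might look like
+the following sketch. The function names ``Wrapper`` and ``KernelLeaf``, the argument
+layout, and the ``DEVICE_TARGET`` constant are illustrative assumptions, not part of the API:
+
+.. code-block:: c
+
+   // Hypothetical leaf node, defined elsewhere.
+   void KernelLeaf(float *in, size_t in_size, float *out, size_t out_size);
+
+   // Hypothetical level-1 wrapper: creates a 1-D kernel node with n dynamic
+   // instances and forwards its own arguments to the kernel's arguments.
+   void Wrapper(float *in, size_t in_size, float *out, size_t out_size, size_t n) {
+     __hpvm__hint(DEVICE_TARGET);       // placeholder for an hpvm target constant
+     __hpvm__attributes(1, in, 1, out); // in is an input pointer; out is an output pointer
+     void *kernel = __hpvm__createNodeND(1, KernelLeaf, n);
+     __hpvm__bindIn(kernel, 0, 0, 0);   // in       -> kernel argument 0
+     __hpvm__bindIn(kernel, 1, 1, 0);   // in_size  -> kernel argument 1
+     __hpvm__bindIn(kernel, 2, 2, 0);   // out      -> kernel argument 2
+     __hpvm__bindIn(kernel, 3, 3, 0);   // out_size -> kernel argument 3
+   }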
+
+Leaf Node API
+-------------
+
+``void __hpvm__hint(enum Target target)`` (C):raw-html-m2r:`<br>`
+``void __hpvm__hint(hpvm::Target target)`` (C++):raw-html-m2r:`<br>`
+As described in internal node API.
+
+``void __hpvm__attributes(unsigned ni, …, unsigned no, …)``:raw-html-m2r:`<br>`
+As described in internal node API.
+
+``void __hpvm__return(unsigned n, ...)``:raw-html-m2r:`<br>`
+Returns ``n`` values from a leaf node function. The remaining arguments are the values to be returned. All ``__hpvm__return`` statements within the same function must return the same number of values.
+
+``void* __hpvm__getNode()``:raw-html-m2r:`<br>`
+Returns a handle to the current leaf node.
+
+``void* __hpvm__getParentNode(void* N)``:raw-html-m2r:`<br>`
+Returns a handle to the parent node of node ``N``.
+
+``long __hpvm__getNodeInstanceID_{x,y,z}(void* N)``:raw-html-m2r:`<br>`
+Returns the dynamic ID of the current instance of node ``N`` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
+
+``long __hpvm__getNumNodeInstances_{x,y,z}(void* N)``:raw-html-m2r:`<br>`
+Returns the number of dynamic instances of node ``N`` in the x, y, or z dimension respectively. The dimension must be one of the dimensions in which the node is replicated.
+
+``void* __hpvm__malloc(long nBytes)``:raw-html-m2r:`<br>`
+Allocates a block of memory of size ``nBytes`` and returns a pointer to it. The allocated object can be shared by all nodes. *Note that the returned pointer must somehow be communicated explicitly for use by other nodes.*
+
+``int __hpvm__atomic_add(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically adds ``v`` to the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_sub(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically subtracts ``v`` from the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_min(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically computes the min of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_max(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically computes the max of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_xchg(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically swaps ``v`` with the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_and(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically computes the bitwise AND of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_or(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically computes the bitwise OR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``int __hpvm__atomic_xor(int* m, int v)``:raw-html-m2r:`<br>`
+Atomically computes the bitwise XOR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``void __hpvm__barrier()``:raw-html-m2r:`<br>`
+Local synchronization barrier across dynamic instances of current leaf node.
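+
+As an illustration of the leaf node API, the sketch below shows a hypothetical
+vector-add kernel; the function name, its arguments, and the ``DEVICE_TARGET``
+constant are assumptions for this example only:
+
+.. code-block:: c
+
+   void VecAddLeaf(float *a, size_t a_size, float *b, size_t b_size,
+                   float *c, size_t c_size) {
+     __hpvm__hint(DEVICE_TARGET);        // placeholder for an hpvm target constant
+     __hpvm__attributes(2, a, b, 1, c);  // a, b are inputs; c is an output
+     void *self = __hpvm__getNode();
+     long i = __hpvm__getNodeInstanceID_x(self);  // index of this dynamic instance
+     c[i] = a[i] + b[i];                 // one element per dynamic instance
+   }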
+
+Porting a Program from C to HPVM-C
+==================================
+
+The following represents the required steps to port a regular C program into an HPVM program with HPVM-C. These steps are described at a high level; for more detail, please see `hpvm-cava </hpvm/test/benchmarks/hpvm-cava>`_ provided in `benchmarks </hpvm/test/benchmarks>`_.
+
+
+* Separate the computation that will become a kernel into its own (leaf node) function and add the attributes and target hint.
+* Create a level 1 wrapper node function that will describe the thread-level parallelism (for the GPU). The node will:
+
+  * Use the ``createNode[ND]()`` method to create a kernel node and specify how many threads will execute it.
+  * Bind its arguments to the kernel arguments.
+
+* If desired, create a level 2 wrapper node function which will describe the threadblock-level parallelism (for the GPU). This node will:
+
+  * Use the ``createNode[ND]()`` method to create a level 1 wrapper node and specify how many threadblocks will execute it.
+  * Bind its arguments to its child node's arguments.
+
+* Add a root node function that creates all the top-level wrapper nodes, binds their arguments, and connects their edges.
+
+  * Each root node represents a DFG.
+
+* All the above node functions have the combined arguments of all the kernels that are nested at each level. 
+* The host code will have to include the following (see the sketch after this list):
+
+  * Initialize the HPVM runtime using the ``init()`` method.
+  * Create an argument struct for each DFG and assign its member variables.
+  * Add all the memory that is required by the kernel into the memory tracker.
+  * Launch the DFG by calling the ``launch()`` method on the root node function, and passing the corresponding argument struct.
+  * Wait for the DFG to complete execution.
+  * Read out any generated memory using the ``request_mem()`` method.
+  * Remove all the tracked memory from the memory tracker.
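+
+A minimal host-code sketch combining the steps above. The names ``Root``, ``RootArgs``,
+the include path, and the fixed size ``N`` are illustrative; the packed struct layout
+must match the root node function's parameters:
+
+.. code-block:: c
+
+   #include "hpvm.h"   // assumed include for the HPVM-C declarations
+   #define N 1024
+
+   // Hypothetical root node function, defined elsewhere.
+   void Root(float *in, size_t in_size, float *out, size_t out_size, size_t n);
+
+   // Argument struct: one field per root node argument, packed and in order.
+   typedef struct __attribute__((packed)) {
+     float *in;  size_t in_size;
+     float *out; size_t out_size;
+     size_t n;
+   } RootArgs;
+
+   int main(void) {
+     static float in[N], out[N];
+     __hpvm__init();                             // initialize the HPVM runtime
+     llvm_hpvm_track_mem(in, sizeof in);         // track all DFG-visible memory
+     llvm_hpvm_track_mem(out, sizeof out);
+     RootArgs args = {in, sizeof in, out, sizeof out, N};
+     void *dfg = __hpvm__launch(0, (void *)Root, (void *)&args);  // non-streaming launch
+     __hpvm__wait(dfg);                          // block until the DFG finishes
+     llvm_hpvm_request_mem(out, sizeof out);     // copy results back to host memory
+     llvm_hpvm_untrack_mem(in);
+     llvm_hpvm_untrack_mem(out);
+     __hpvm__cleanup();
+     return 0;
+   }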
diff --git a/hpvm/docs/references/hpvm-specification.rst b/hpvm/docs/references/hpvm-specification.rst
new file mode 100644
index 0000000000000000000000000000000000000000..90226b333db2bfc8339bea8d6367ad9f8fcbe8e0
--- /dev/null
+++ b/hpvm/docs/references/hpvm-specification.rst
@@ -0,0 +1,266 @@
+.. role:: raw-html-m2r(raw)
+   :format: html
+
+HPVM Abstraction
+================
+
+Table of Contents
+------------------
+
+* `HPVM Abstraction <#abstraction>`_
+
+  * `Dataflow Node <#node>`_
+  * `Dataflow Edge <#edge>`_
+  * `Input and Output Bind <#bind>`_
+  * `Host Code <#host>`_
+
+* `HPVM Implementation <#implementation>`_
+
+  * `Intrinsics for Describing Graphs <#describing>`_
+  * `Intrinsics for Querying Graphs <#querying>`_
+  * `Intrinsics for Memory Allocation and Synchronization <#memory>`_
+  * `Intrinsics for Graph Interaction <#interaction>`_
+
+* `Implementation Limitations <#limitations>`_
+
+:raw-html-m2r:`<a name="abstraction"></a>`
+
+An HPVM program is a combination of host code plus a set of one or more distinct dataflow graphs. Each dataflow graph (DFG) is a hierarchical graph with side effects. The DFG must be acyclic. Nodes represent units of execution, and edges between nodes describe the explicit data transfer requirements. A node can begin execution once a data item becomes available on every one of its input edges. Repeated transfer of data items between nodes (if more inputs are provided) yields a pipelined execution of different nodes in the graph. The execution of a DFG is initiated and terminated by host code that launches the graph. Nodes may access globally shared memory through load and store instructions (side-effects).
+
+:raw-html-m2r:`<a name="node"></a>`
+
+Dataflow Node
+-------------
+
+A *dataflow node* represents a unit of computation in the DFG. A node can begin execution once a data item becomes available on every one of its input edges.
+
+A single static dataflow node represents multiple dynamic instances of the node, each executing the same computation with different index values used to uniquely identify each dynamic instance w.r.t. the others. The dynamic instances of a node may be executed concurrently, and any required synchronization must be imposed using HPVM synchronization operations.
+
+Each dataflow node in a DFG can either be a *leaf node* or an *internal node*. An internal node contains a complete DFG, called a *child graph*, and the child graph itself can have internal nodes and/or leaf nodes.
+
+Internal nodes only create the structure of the child graph, and cannot include actual computation. 
+
+Leaf nodes contain code expressing actual computations. Leaf nodes may contain instructions to query the structure of the underlying DFG, and any non-host-side HPVM operations for synchronization and memory allocation.
+
+Note that the graph is fully interpreted at compile-time and cannot be modified at runtime except for the number of dynamic instances, which can be data dependent.
+
+:raw-html-m2r:`<a name="edge"></a>`
+
+Dataflow Edge
+-------------
+
+A *dataflow edge* from the output ``out`` of a source dataflow node ``Src`` to the input ``in`` of a sink dataflow node ``Dst`` describes the explicit data transfer requirements. The ``Src`` and ``Dst`` nodes must belong to the same child graph, i.e. they must be children of the same internal node.
+
+An edge from source to sink has the semantics of copying the specified data from the source to the sink after the source node has completed execution. The pairs ``(Src, out)`` and ``(Dst, in)``, representing source and sink respectively, must be unique w.r.t. every other edge in the same child graph, i.e. two dataflow edges in the same child graph cannot have the same source or destination.
+
+A static edge also represents multiple dynamic instances of that edge between the dynamic instances of the source and the sink nodes.
+
+An edge can be instantiated at runtime using one of two replication mechanisms:
+
+
+* *All-to-all*, where all dynamic instances of the source node are connected to all dynamic instances of the sink node, thus expressing a synchronization barrier between the two groups of nodes, or
+* *One-to-one*, where each dynamic instance of the source node is connected to a single corresponding instance of the sink node. One-to-one replication requires that the grid structure (number of dimensions and the extents in each dimension) of the source and sink nodes be identical.
+
+:raw-html-m2r:`<a name="bind"></a>`
+
+Input and Output Bind
+---------------------
+
+An internal node is responsible for mapping its inputs, provided by incoming dataflow edges, to the inputs to one or more nodes of the child graph.
+
+An internal node binds its input ``ip`` to input ``ic`` of its child node ``Dst`` using an *input bind*.
+The pair ``(Dst, ic)`` must be unique, i.e. no two input binds in the same graph can have the same destination, as that would create a conflict. Semantically, these represent name bindings of input values and not data movement.
+
+Conversely, an internal node binds output ``oc`` of its child node ``Src`` to its output ``op`` using an *output bind*. The pair ``(Src, oc)`` and destination ``op`` must be unique, i.e. no two output binds in the same graph can have the same source or destination, as that would create a conflict.
+
+A bind is always **all-to-all**.
+
+:raw-html-m2r:`<a name="host"></a>`
+
+Host Code
+---------
+
+In an HPVM program, the host code is responsible for setting up a DFG, initiating its execution, and blocking for its completion. The host can interact with the DFG to sustain a streaming computation by sending all data required for, and receiving all data produced by, one execution of the DFG. The list of actions that can be performed by the host is described below:
+
+
+* **Initialization and Cleanup**:
+  All HPVM operations must be enclosed by the HPVM initialization and cleanup. These operations perform initialization and cleanup of runtime constructs that provide the runtime support for HPVM.
+* **Track Memory**:
+  Memory objects that are passed to dataflow graphs need to be managed by the HPVM runtime. Our memory model assumes two separate address spaces for host and device memory. The HPVM runtime includes a memory tracker for tracking the location of HPVM-managed memory objects between these address spaces. Track memory inserts the specified memory object in the memory tracker and starts tracking it.
+* **Untrack Memory**:
+  Stop tracking specified memory object and remove it from memory tracker.
+* **Request Memory**:
+  If the specified memory object is not present in host memory, copy it to host memory.
+* **Launch**:
+  The host code initiates execution of specified DFG, either streaming or non streaming.
+
+  * Non streaming DFG: The host provides all data items required for execution of the DFG at the time of the launch.
+  * Streaming DFG: No data is provided by the launch operation. Streaming execution is sustained by push and pop operations, described below.
+
+* **Push**:
+  Push a set of data items required for one graph execution to the specified DFG. The DFG must have been launched using a streaming launch operation. This is a blocking operation.
+* **Pop**:
+  Read data produced from one execution of the specified DFG. The DFG must have been launched using a streaming launch operation. This is a blocking operation.
+* **Wait**:
+  The host code blocks for completion of specified DFG.
+
+  * For a non-streaming DFG, the data produced by the DFG are ready to be read by the host.
+  * For a streaming DFG, no more data may be provided for processing by the DFG.
+
+:raw-html-m2r:`<a name="implementation"></a>`
+
+HPVM Implementation
+===================
+
+This section describes the implementation of HPVM on top of LLVM IR.
+
+``iN`` is the N-bit integer type in LLVM.
+
+We use intrinsic functions to implement the HPVM IR.
+
+The code for each dataflow node is given as a separate LLVM function, called the node function. The node function may call additional, auxiliary functions. However, the auxiliary functions are not allowed to include any HPVM intrinsics, as they are not considered to be part of the HPVM node hierarchy.
+
+The incoming dataflow edges and their data types are denoted by the parameters to the node function. The outgoing dataflow edges are represented by the return type of the node function, which must be an LLVM struct type with zero or more fields (one per outgoing edge).
+
+Each top-level DFG in an HPVM program is defined by its own *root node function* which creates the underlying DFG structure. The DFG is the (internal) root node's child graph. Unlike regular internal nodes, the root node only has one dynamic instance because it instantiates the top-level DFG. The DFG is launched by the host code using the root node function, as described below.
+
+We represent nodes with opaque handles (pointers of LLVM type i8*). We represent input edges of a node as integer indices into the list of function arguments, and output edges of a node as integer indices into the return struct type.
+
+Pointer arguments of node functions are required to be annotated with the attributes ``in`` and/or ``out``, depending on their expected use (read only, write only, read-write).
+
+:raw-html-m2r:`<a name="describing"></a>`
+
+Intrinsics for Describing Graphs
+--------------------------------
+
+The intrinsics for describing graphs can only be used by internal nodes. Also, internal nodes are only allowed to have these intrinsics as part of their node function, with the exception of a return statement of the appropriate type, in order to return the result of the outgoing dataflow edges.
+
+``i8* llvm.hpvm.createNode(i8* F)``:raw-html-m2r:`<br>`
+Create a static dataflow node with one dynamic instance executing node function ``F``. Return a handle to the created node.
+
+``i8* llvm.hpvm.createNode1D(i8* F, i64 n1)``:raw-html-m2r:`<br>`
+Create a static dataflow node replicated in one dimension, namely ``x``, with ``n1`` dynamic instances executing node function ``F``. Return a handle to the created node.
+
+``i8* llvm.hpvm.createNode2D(i8* F, i64 n1, i64 n2)``:raw-html-m2r:`<br>`
+Create a static dataflow node replicated in two dimensions, namely ``x`` and ``y``, with ``n1`` and ``n2`` dynamic instances in each dimension respectively, executing node function ``F``. Return a handle to the created node.
+
+``i8* llvm.hpvm.createNode3D(i8* F, i64 n1, i64 n2, i64 n3)``:raw-html-m2r:`<br>`
+Create a static dataflow node replicated in three dimensions, namely ``x``, ``y`` and ``z``, with ``n1``, ``n2`` and ``n3`` dynamic instances in each dimension respectively, executing node function ``F``. Return a handle to the created node.
+
+``i8* llvm.hpvm.createEdge(i8* Src, i8* Dst, i1 ReplType, i32 sp, i32 dp, i1 isStream)``:raw-html-m2r:`<br>`
+Create edge from output ``sp`` of node ``Src`` to input ``dp`` of node ``Dst``. Argument ``dp`` of ``Dst``'s node function and field ``sp`` of the return struct in ``Src``'s node function must have matching types. ``ReplType`` chooses between a one-to-one (0) or all-to-all (1) edge. ``isStream`` chooses a streaming (1) or non streaming (0) edge. Return a handle to the created edge.
+
+``void llvm.hpvm.bind.input(i8* N, i32 ip, i32 ic, i1 isStream)``:raw-html-m2r:`<br>`
+Bind input ``ip`` of current node to input ``ic`` of child node ``N``. Argument ``ic`` of ``N``'s node function and argument ``ip`` of the current node function must have matching types. ``isStream`` chooses a streaming (1) or non streaming (0) bind.
+
+``void llvm.hpvm.bind.output(i8* N, i32 oc, i32 op, i1 isStream)``:raw-html-m2r:`<br>`
+Bind output ``oc`` of child node ``N`` to output ``op`` of current node. Field ``oc`` of the return struct in ``N``'s node function and field ``op`` of the return struct in the current node function must have matching types. ``isStream`` chooses a streaming (1) or non streaming (0) bind.
+
+:raw-html-m2r:`<a name="querying"></a>`
+
+Intrinsics for Querying Graphs
+------------------------------
+
+The following intrinsics are used to query the structure of the DFG. They can only be used by leaf nodes.
+
+``i8* llvm.hpvm.getNode()``:raw-html-m2r:`<br>`
+Return a handle to the current leaf node.
+
+``i8* llvm.hpvm.getParentNode(i8* N)``:raw-html-m2r:`<br>`
+Return a handle to the parent in the hierarchy of node ``N``.
+
+``i32 llvm.hpvm.getNumDims(i8* N)``:raw-html-m2r:`<br>`
+Get the number of dimensions of node ``N``.
+
+``i64 llvm.hpvm.getNodeInstanceID.{x,y,z}(i8* N)``:raw-html-m2r:`<br>`
+Get index of current dynamic node instance of node ``N`` in dimension x, y or z respectively. The dimension must be one of the dimensions in which the node is replicated.
+
+``i64 llvm.hpvm.getNumNodeInstances.{x,y,z}(i8* N)``:raw-html-m2r:`<br>`
+Get number of dynamic instances of node ``N`` in dimension x, y or z respectively. The dimension must be one of the dimensions in which the node is replicated.
+
+:raw-html-m2r:`<a name="memory"></a>`
+
+Intrinsics for Memory Allocation and Synchronization
+----------------------------------------------------
+
+The following intrinsics are used for memory allocation and synchronization. They can only be used by leaf nodes.
+
+``i8* llvm.hpvm.malloc(i64 nBytes)``:raw-html-m2r:`<br>`
+Allocate a block of memory of size ``nBytes`` and return a pointer to it. The allocated object can be shared by all nodes.:raw-html-m2r:`<br>`
+*Note that the returned pointer must somehow be communicated explicitly for use by other nodes.*
+
+``i32 llvm.hpvm.atomic.add(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically adds ``v`` to the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.sub(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically subtracts ``v`` from the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.min(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically computes the min of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.max(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically computes the max of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.xchg(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically swaps ``v`` with the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.and(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically computes the bitwise AND of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.or(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically computes the bitwise OR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``i32 llvm.hpvm.atomic.xor(i8* m, i32 v)``:raw-html-m2r:`<br>`
+Atomically computes the bitwise XOR of ``v`` and the value stored at memory location ``[m]`` w.r.t. the dynamic instances of the current leaf node and stores the result back into ``[m]``. Returns the value previously stored at ``[m]``.
+
+``void llvm.hpvm.barrier()``:raw-html-m2r:`<br>`
+Local synchronization barrier across dynamic instances of current leaf node.
+
+:raw-html-m2r:`<a name="interaction"></a>`
+
+Intrinsics for Graph Interaction
+--------------------------------
+
+The following intrinsics are for graph initialization/termination and interaction with the host code, and can be used only by the host code.
+
+``void llvm.hpvm.init()``:raw-html-m2r:`<br>`
+Initialization of HPVM runtime.
+
+``void llvm.hpvm.cleanup()``:raw-html-m2r:`<br>`
+Cleanup of HPVM runtime created objects.
+
+``void llvm.hpvm.trackMemory(i8* ptr, i64 sz)``:raw-html-m2r:`<br>`
+Insert memory starting at ``ptr`` of size ``sz`` in the memory tracker. ``ptr`` becomes the key for identifying this memory object. As soon as a memory object is inserted in the memory tracker it starts being tracked, and can be passed as a data item to a DFG.
+
+``void llvm.hpvm.untrackMemory(i8* ptr)``:raw-html-m2r:`<br>`
+Stop tracking memory object with key ``ptr``, and remove it from memory tracker.
+
+``void llvm.hpvm.requestMemory(i8* ptr, i64 sz)``:raw-html-m2r:`<br>`
+If memory object with key ``ptr`` is not located in host memory, copy it to host memory.
+
+``i8* llvm.hpvm.launch(i8* RootGraph, i8* Args, i1 isStream)``:raw-html-m2r:`<br>`
+Launch the execution of a top-level DFG with root node function ``RootGraph``. ``Args`` is a pointer to a packed struct, containing one field per argument of the ``RootGraph`` function, consecutively. For non-streaming DFGs with a non empty result type, ``Args`` must contain an additional field of the type ``RootGraph.returnTy``, where the result of the graph will be returned. ``isStream`` chooses between a non streaming (0) or streaming (1) graph execution. Return a handle to the invoked DFG.
+
+``void llvm.hpvm.wait(i8* GraphID)``:raw-html-m2r:`<br>`
+Wait for completion of execution of DFG with handle ``GraphID``.
+
+``void llvm.hpvm.push(i8* GraphID, i8* args)``:raw-html-m2r:`<br>`
+Push set of input data ``args`` (same as type included in launch) to streaming DFG with handle ``GraphID``.
+
+``i8* llvm.hpvm.pop(i8* GraphID)``:raw-html-m2r:`<br>`
+Pop and return data from streaming DFG with handle ``GraphID``. The return type is a struct containing a field for every output of DFG. 
+
+:raw-html-m2r:`<a name="limitations"></a>`
+
+Implementation Limitations
+--------------------------
+
+Due to limitations of our current prototype implementation, the following restrictions are imposed:
+
+
+* In HPVM, a memory object is represented as a (pointer, size) pair that includes the address of the memory object, and the size (in bytes) of the pointed-to object. Therefore, when an edge/bind carries a pointer, it must be followed by an i64 size value.
+* Pointers cannot be transferred between nodes using dataflow edges. Instead, they should be passed using the bind operation from the (common) parent of the source and sink nodes.
+* Instantiation of dataflow nodes is supported in up to three dimensions.
diff --git a/hpvm/docs/references/index.rst b/hpvm/docs/references/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..e2650fb9e2b6729f514120b289f0f1dfd0a54b76
--- /dev/null
+++ b/hpvm/docs/references/index.rst
@@ -0,0 +1,11 @@
+References
+============
+
+Below are some technical details of the HPVM system and the HPVM-C language.
+
+.. toctree::
+   :maxdepth: 1
+
+   hpvm-c
+   hpvm-specification
+   compilation-process
diff --git a/hpvm/docs/tests.rst b/hpvm/docs/tests.rst
new file mode 120000
index 0000000000000000000000000000000000000000..729ecaa5ca892a1fae9d88818e54cb76ba2a3c9a
--- /dev/null
+++ b/hpvm/docs/tests.rst
@@ -0,0 +1 @@
+../test/README.rst
\ No newline at end of file
diff --git a/hpvm/docs/tradeoff-curves/alexnet2.pdf b/hpvm/docs/tradeoff-curves/alexnet2_cifar10.pdf
similarity index 100%
rename from hpvm/docs/tradeoff-curves/alexnet2.pdf
rename to hpvm/docs/tradeoff-curves/alexnet2_cifar10.pdf
diff --git a/hpvm/docs/tradeoff-curves/alexnet.pdf b/hpvm/docs/tradeoff-curves/alexnet_cifar10.pdf
similarity index 100%
rename from hpvm/docs/tradeoff-curves/alexnet.pdf
rename to hpvm/docs/tradeoff-curves/alexnet_cifar10.pdf
diff --git a/hpvm/docs/tradeoff-curves/alexnet_imagenet.pdf b/hpvm/docs/tradeoff-curves/alexnet_imagenet.pdf
index b0e0ec99473c01e4e2809fc16a298fe81e993376..b3bcc6e53b091d0c8ac317fe8f3d9cfb6a79fb80 100644
Binary files a/hpvm/docs/tradeoff-curves/alexnet_imagenet.pdf and b/hpvm/docs/tradeoff-curves/alexnet_imagenet.pdf differ
diff --git a/hpvm/docs/tradeoff-curves/alexnet_imagenet_tradeoff.pdf b/hpvm/docs/tradeoff-curves/alexnet_imagenet_tradeoff.pdf
deleted file mode 100644
index b3bcc6e53b091d0c8ac317fe8f3d9cfb6a79fb80..0000000000000000000000000000000000000000
Binary files a/hpvm/docs/tradeoff-curves/alexnet_imagenet_tradeoff.pdf and /dev/null differ
diff --git a/hpvm/docs/tradeoff-curves/index.rst b/hpvm/docs/tradeoff-curves/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..551ac34089fd55872ef3c408948c6c6d63707d87
--- /dev/null
+++ b/hpvm/docs/tradeoff-curves/index.rst
@@ -0,0 +1,6 @@
+Gallery
+=======
+
+This gallery contains example tradeoff curves for the 10 DNN benchmarks in HPVM.
+
+.. image:: alexnet_cifar10.pdf
diff --git a/hpvm/docs/tradeoff-curves/mobilenet.pdf b/hpvm/docs/tradeoff-curves/mobilenet_cifar10.pdf
similarity index 100%
rename from hpvm/docs/tradeoff-curves/mobilenet.pdf
rename to hpvm/docs/tradeoff-curves/mobilenet_cifar10.pdf
diff --git a/hpvm/docs/tradeoff-curves/resnet18.pdf b/hpvm/docs/tradeoff-curves/resnet18_cifar10.pdf
similarity index 100%
rename from hpvm/docs/tradeoff-curves/resnet18.pdf
rename to hpvm/docs/tradeoff-curves/resnet18_cifar10.pdf
diff --git a/hpvm/docs/tradeoff-curves/resnet50_imagenet.pdf b/hpvm/docs/tradeoff-curves/resnet50_imagenet.pdf
index 6d3be5c8ad19198902b15ff818f2984644026ae0..67d3651bf15ab0b7c1350988846072fd712f4b4f 100644
Binary files a/hpvm/docs/tradeoff-curves/resnet50_imagenet.pdf and b/hpvm/docs/tradeoff-curves/resnet50_imagenet.pdf differ
diff --git a/hpvm/docs/tradeoff-curves/resnet50_imagenet_tradeoff.pdf b/hpvm/docs/tradeoff-curves/resnet50_imagenet_tradeoff.pdf
deleted file mode 100644
index 67d3651bf15ab0b7c1350988846072fd712f4b4f..0000000000000000000000000000000000000000
Binary files a/hpvm/docs/tradeoff-curves/resnet50_imagenet_tradeoff.pdf and /dev/null differ
diff --git a/hpvm/projects/hpvm-profiler/hpvm_profiler/__init__.py b/hpvm/projects/hpvm-profiler/hpvm_profiler/__init__.py
index 88f74c9194bb105a2a731ed063aed7e2ac875e6a..14d0f491b8dccb41f7136087f4146f64e0470693 100644
--- a/hpvm/projects/hpvm-profiler/hpvm_profiler/__init__.py
+++ b/hpvm/projects/hpvm-profiler/hpvm_profiler/__init__.py
@@ -1,7 +1,9 @@
+from dataclasses import dataclass
 from pathlib import Path
-from subprocess import CalledProcessError, PIPE
+from subprocess import PIPE, CalledProcessError
 from typing import Iterable, List, Tuple, Union
-from dataclasses import dataclass
+
+import matplotlib.pyplot as plt
 from tqdm import trange
 
 PathLike = Union[Path, str]
@@ -14,25 +16,20 @@ def profile_configs(
     output_config_path: PathLike,
     profile_filename: str = "profile_info.txt",
     qos_filename: str = "final_accuracy",
-):
+) -> None:
     """
     Profile an HPVM configuration file with an HPVM binary.
     The configuration file must have the baseline as the first configuration.
 
-    binary_path: Union[Path, str]
-        Path to binary to be executed in profiling.
-    config_path: Union[Path, str]
-        Path to config file (HPVM configuration format)
+    binary_path: Path to binary to be executed in profiling.
+    config_path: Path to config file (HPVM configuration format)
         with configs to enumerate for profiling.
-    output_config_path: Union[Path, str]
-        Path where the output configs are written.
+    output_config_path: Path where the output configs are written.
         The output config file has the same configs as the input `config_path` file,
         but the performance and energy readings are updated.
-    profile_filename: str
-        Name of profile file generated by the binary (in current directory).
+    profile_filename: Name of profile file generated by the binary (in current directory).
         This defaults to "profile_info.txt" and should not be changed for HPVM binaries.
-    qos_filename: str
-        Name of QoS file generated by the binary (in current directory).
+    qos_filename: Name of QoS file generated by the binary (in current directory).
         It contains a single float number as the QoS of this run.
         This defaults to "final_accuracy" and should not be changed for HPVM binaries.
     """
@@ -76,26 +73,21 @@ def plot_hpvm_configs(
     save_to: PathLike = None,
     show_qos_loss: bool = True,
     **fig_kwargs,
-):
+) -> plt.Figure:
     """
     Plot the QoS-speedup information in an HPVM configuration file.
     It is recommended to profile the config file first (using `profile_configs`)
     to obtain real speedup numbers.
     This function creates a `matplotlib.pyplot.Figure`, plots on it, and returns it.
 
-    config_path: Union[Path, str]
-        Path to the config file (HPVM configuration format).
-    save_to: Union[Path, str]
-        File to save figure into. Default is None: don't save figure (just return it).
-    show_qos_loss: bool
-        Show the loss of QoS on x axis of the figure. Defaults to True.
+    config_path: Path to the config file (HPVM configuration format).
+    save_to: File to save figure into. Default is None: don't save figure (just return it).
+    show_qos_loss: Show the loss of QoS on x axis of the figure. Defaults to True.
         If False, will use (absolute) QoS instead of QoS loss.
-    fig_kwargs:
-        Arguments to pass to `plt.subplots`.
+    fig_kwargs: Arguments to pass to `plt.subplots`.
     """
 
     import numpy as np
-    import matplotlib.pyplot as plt
 
     _, configs = read_hpvm_configs(config_path)
     get_qos = lambda c: c.qos_loss if show_qos_loss else c.qos
@@ -109,6 +101,7 @@ def plot_hpvm_configs(
         fig.savefig(save_to, dpi=300)
     return fig
 
+
 @dataclass
 class Config:
     conf_name: str
diff --git a/hpvm/projects/keras/docs/Support.md b/hpvm/projects/keras/docs/Support.md
deleted file mode 100644
index b568d3d640204fd90c977e63e24dc36dc6d92336..0000000000000000000000000000000000000000
--- a/hpvm/projects/keras/docs/Support.md
+++ /dev/null
@@ -1,54 +0,0 @@
-## Supported Keras Operators 
-
-The Keras frontend supports `Sequential()` Keras models.
-The list of supported operations is as follows:
-
-* `Conv2D`
-* `DepthwiseConv2D`
-* `Dense`
-* `BatchNormalization`
-* `MaxPooling2D`
-* `AveragePooling2D`
-* `Flatten`
-* `Add`
-* `ZeroPadding2D`
-* `Activation` 
-   * `relu`
-   * `tanh`
-   * `softmax`
-
-
-
-## Limitations 
-
-* Currently, we support Convolutional Neural Networks (CNNs) that include the supported operators (above) - RNNs/LSTMs not supported
-* We currently only support models in NCHW format (NHWC is not supported)
-* Softmax operator should be the last operation in the CNN pipeline 
-* Softmax operation must be a separate operator (not specified as activation to another type of Keras operator). Example of what works:
-
-```python
-Activation ("softmax")
-```
-
-Example of what is NOT supported:
-
-```python
-Dense(num_classes, activation="softmax")
-```
-
-* For convolutions with stride > 1 `same` convolution is NOT supported. Explicitly add `ZeroPadding2D` layer before `Conv2D` or `DepthwiseConv2D` operators. Example of what does NOT work:
-
-```python
-Conv2D(num_filters, kernel_size = (3,3), strides = (2,2), padding = `same`)
-```
-
-Example of what works instead:
-
-```python
-# NOTE: Amount of padding varies with kernel sizes and strides
-ZeroPadding2D(padding=(1,1), data_format = `channels_first`) # only support NCHW
-Conv2D(num_filters, kernel_size = (3,3), strides = (2,2), padding = `valid`)
-```
-
-
-
diff --git a/hpvm/projects/keras/docs/Support.rst b/hpvm/projects/keras/docs/Support.rst
new file mode 100644
index 0000000000000000000000000000000000000000..8d9d11c4cb0cb92068c28afb8a729ca2b1e25365
--- /dev/null
+++ b/hpvm/projects/keras/docs/Support.rst
@@ -0,0 +1,54 @@
+Supported Keras Operators
+=========================
+
+The Keras frontend supports ``Sequential()`` Keras models.
+The list of supported operations is as follows:
+
+
+* ``Conv2D``
+* ``DepthwiseConv2D``
+* ``Dense``
+* ``BatchNormalization``
+* ``MaxPooling2D``
+* ``AveragePooling2D``
+* ``Flatten``
+* ``Add``
+* ``ZeroPadding2D``
+* ``Activation`` 
+
+  * ``relu``
+  * ``tanh``
+  * ``softmax``
+
+Limitations
+-----------
+
+* Currently, we support Convolutional Neural Networks (CNNs) composed of the supported operators listed above; RNNs/LSTMs are not supported.
+* We currently only support models in NCHW format (NHWC is not supported).
+* The Softmax operator must be the last operation in the CNN pipeline.
+* The Softmax operation must be a separate operator (not specified as the activation of another Keras operator). Example of what works:
+
+.. code-block:: python
+
+   Activation("softmax")
+
+Example of what is NOT supported:
+
+.. code-block:: python
+
+   Dense(num_classes, activation="softmax")
+
+
+* For convolutions with stride > 1, ``same`` padding is NOT supported. Explicitly add a ``ZeroPadding2D`` layer before the ``Conv2D`` or ``DepthwiseConv2D`` operator. Example of what does NOT work:
+
+.. code-block:: python
+
+   Conv2D(num_filters, kernel_size=(3, 3), strides=(2, 2), padding="same")
+
+Example of what works instead:
+
+.. code-block:: python
+
+   # NOTE: the amount of padding varies with kernel sizes and strides
+   ZeroPadding2D(padding=(1, 1), data_format="channels_first")  # only NCHW is supported
+   Conv2D(num_filters, kernel_size=(3, 3), strides=(2, 2), padding="valid")
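+
+Putting these rules together, here is a minimal ``Sequential`` model that
+follows all of the constraints above (the layer sizes and input shape are
+illustrative assumptions, not requirements):
+
+.. code-block:: python
+
+   from keras.models import Sequential
+   from keras.layers import (
+       Activation, Conv2D, Dense, Flatten, MaxPooling2D, ZeroPadding2D
+   )
+
+   model = Sequential([
+       # NCHW layout: input_shape is (channels, height, width)
+       ZeroPadding2D(padding=(1, 1), data_format="channels_first",
+                     input_shape=(3, 32, 32)),
+       # Stride > 1: explicit padding, then a `valid` convolution
+       Conv2D(32, kernel_size=(3, 3), strides=(2, 2), padding="valid",
+              data_format="channels_first"),
+       Activation("relu"),
+       MaxPooling2D(pool_size=(2, 2), data_format="channels_first"),
+       Flatten(),
+       Dense(10),
+       # Softmax as a separate, final operator
+       Activation("softmax"),
+   ])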
diff --git a/hpvm/projects/predtuner b/hpvm/projects/predtuner
index 67b66ea17f5e22778ae51742cffd95f52b1c31ba..aa28a41ca12b15af31e1d15b687367e27cde3878 160000
--- a/hpvm/projects/predtuner
+++ b/hpvm/projects/predtuner
@@ -1 +1 @@
-Subproject commit 67b66ea17f5e22778ae51742cffd95f52b1c31ba
+Subproject commit aa28a41ca12b15af31e1d15b687367e27cde3878
diff --git a/hpvm/projects/torch2hpvm/README.md b/hpvm/projects/torch2hpvm/README.md
deleted file mode 100644
index 1f06142f524b34760f1fffc84a5b2a2f07bf23a3..0000000000000000000000000000000000000000
--- a/hpvm/projects/torch2hpvm/README.md
+++ /dev/null
@@ -1,111 +0,0 @@
-# PyTorch Frontend for HPVM
-
-`torch2hpvm` is a PyTorch frontend for HPVM. It provides a set of API that
-
-- Generates a PyTorch `module` into HPVM-C code;
-- Exports a PyTorch dataset to ApproxHPVM dataset format;
-- Compiles the generated code into binary by invoking HPVM automatically.
-
-## Installation
-
-`pip` is the recommended package manager (also available within `conda`).
-Using `pip`:
-
-```bash
-pip install -e ./
-```
-
-## Getting Started
-
-Let's look at an example that uses DNNs and weights pre-shipped with HPVM.
-This is found at `hpvm/test/dnn_benchmarks/pytorch/test_frontend.py`.
-*Note* that below we'll be working under directory `hpvm/test/dnn_benchmarks/pytorch`.
-
-We'll be generating ResNet-18 into an HPVM-compiled binary.
-First, prepare 2 datasets for autotuning and testing.
-
-```python
-from torch2hpvm import BinDataset
-from pathlib import Path
-
-data_dir = Path(__file__).parent / "../model_params/resnet18_cifar10"
-dataset_shape = 5000, 3, 32, 32
-tuneset = BinDataset(data_dir / "tune_input.bin", data_dir / "tune_labels.bin", dataset_shape)
-testset = BinDataset(data_dir / "test_input.bin", data_dir / "test_labels.bin", dataset_shape)
-```
-
-`BinDataset` is a dataset created over files of ApproxHPVM dataset format.
-Any instance `torch.utils.data.Dataset` can be used here.
-
-*Note* that each `module` is bound to 2 datasets: a "tune" and a "test" set.
-The generated binary accepts an argument to be either the string "tune" or "test",
-and performs inference over a dataset accordingly.
-This is because the dataset can contain arbitrary Python code which cannot yet be exported into HPVM-C;
-instead the frontend has to export some predefined datasets for the model to use.
-See TODOs (1).
-
-Create a DNN `module` and load the checkpoint:
-
-```python
-import torch
-from torch.nn import Module
-import dnn  # Defined at `hpvm/test/dnn_benchmarks/pytorch`
-
-model: Module = dnn.ResNet18()
-checkpoint = Path(__file__).parent / "../model_params/resnet18_cifar10.pth.tar"
-model.load_state_dict(torch.load(checkpoint))
-```
-
-Any `torch.nn.Module` can be similarly used,
-as long as they only contain the tensor operators supported in HPVM
-(see "Supported Operators" and TODOs (2)).
-
-Now we are ready to export the model. The main functioning class of `torch2hpvm` is `ModelExporter`:
-
-```python
-from torch2hpvm import ModelExporter
-
-output_dir = Path("./resnet18_hpvm")
-build_dir = output_dir / "build"
-target_binary = build_dir / "resnet18"
-batch_size = 500
-conf_file = "" # TODO: points to your configuration file.
-exporter = ModelExporter(model, tuneset, testset, output_dir, config_file=conf_file)
-exporter.generate(batch_size=batch_size).compile(target_binary, build_dir)
-```
-
-`output_dir`, `build_dir`, and `target_binary` define the folder for code generation, compilation,
-and path to the compiled binary respectively.
-`batch_size` is the batch size the binary uses during inference.
-
-*Note* that `conf_file` is the path to an HPVM approximation configuration file.
-This file decides what approximation the binary will use during inference.
-This path is hardcoded into the binary and is only read when the binary starts,
-so it's fine to have `conf_file` point to a non-existing path.
-An example can be found at `test/dnn_benchmarks/hpvm-c/benchmarks/resnet18_cifar10/data/tuner_confs.txt`.
-
-## Supported Operators
-
-Any builtin and custom PyTorch `Module` are supported
-*as long as* the generated ONNX model consists of only the following operators
-when the Module is exported into ONNX:
-
-| Convolution | Linear | Pooling           | Pointwise          | Other    |
-|-------------|--------|-------------------|--------------------|----------|
-| Conv        | MatMul | GlobalAveragePool | BatchNormalization | Flatten  |
-|             | Gemm   | AveragePool       | Relu               | Softmax  |
-|             |        | MaxPool           | Tanh               | Identity |
-|             |        |                   |                    | Pad      |
-|             |        |                   |                    | Add      |
-
-This choice of operators is largely constrained by backend (tensor_runtime) supports.
-
-## TODOs
-
-1. Optionally insert a Python-C interface in the generated binary to
-   call back into a Dataset class and read the data.
-   - Needs pybind11, hardcoding of Python environment, and some fiddling with import mechanism.
-1. Expand the list of operators supported in the frontend.
-   - Most ideally, create a high-level description of operators that can tie
-     HPVM-C intrinsics and the frontend list of operators together.
-
diff --git a/hpvm/projects/torch2hpvm/README.rst b/hpvm/projects/torch2hpvm/README.rst
new file mode 100644
index 0000000000000000000000000000000000000000..928aa2e19f1d8efdcbb33268925c7c3d6ba0394f
--- /dev/null
+++ b/hpvm/projects/torch2hpvm/README.rst
@@ -0,0 +1,146 @@
+PyTorch Frontend for HPVM
+=========================
+
+``torch2hpvm`` is a PyTorch frontend for HPVM. It provides a set of APIs that
+
+* generate HPVM-C code from a PyTorch ``module``;
+* export a PyTorch dataset to the ApproxHPVM dataset format;
+* compile the generated code into a binary by invoking HPVM automatically.
+
+Installation
+------------
+
+``pip3`` is the recommended package manager (also available within ``conda``).
+Using ``pip3``:
+
+.. code-block:: bash
+
+   pip3 install -e ./
+
+Getting Started
+---------------
+
+Let's look at an example that uses DNNs and weights pre-shipped with HPVM.
+This is found at ``hpvm/test/dnn_benchmarks/pytorch/test_frontend.py``.
+*Note* that below we'll be working under directory ``hpvm/test/dnn_benchmarks/pytorch``.
+
+We'll be compiling ResNet-18 into an HPVM binary.
+First, prepare 2 datasets, for autotuning and testing respectively.
+
+.. code-block:: python
+
+   from torch2hpvm import BinDataset
+   from pathlib import Path
+
+   data_dir = Path(__file__).parent / "../model_params/resnet18_cifar10"
+   dataset_shape = 5000, 3, 32, 32
+   tuneset = BinDataset(data_dir / "tune_input.bin", data_dir / "tune_labels.bin", dataset_shape)
+   testset = BinDataset(data_dir / "test_input.bin", data_dir / "test_labels.bin", dataset_shape)
+
+``BinDataset`` is a dataset backed by files in the ApproxHPVM dataset format.
+Any instance of ``torch.utils.data.Dataset`` can be used here.
+
+*Note* that each ``module`` is bound to 2 datasets: a "tune" and a "test" set.
+The generated binary accepts one argument, which must be either the string "tune" or "test",
+and performs inference over the corresponding dataset.
+This is because the dataset can contain arbitrary Python code which cannot yet be exported into HPVM-C;
+instead the frontend has to export some predefined datasets for the model to use.
+See TODOs (1).
+
+Create a DNN ``module`` and load the checkpoint:
+
+.. code-block:: python
+
+   import torch
+   from torch.nn import Module
+   import dnn  # Defined at `hpvm/test/dnn_benchmarks/pytorch`
+
+   model: Module = dnn.ResNet18()
+   checkpoint = Path(__file__).parent / "../model_params/resnet18_cifar10.pth.tar"
+   model.load_state_dict(torch.load(checkpoint))
+
+Any ``torch.nn.Module`` can be used similarly,
+as long as it only contains the tensor operators supported in HPVM
+(see "Supported Operators" and TODOs (2)).
+
+Now we are ready to export the model. The main class of ``torch2hpvm`` is ``ModelExporter``:
+
+.. code-block:: python
+
+   from torch2hpvm import ModelExporter
+
+   output_dir = Path("./resnet18_hpvm")
+   build_dir = output_dir / "build"
+   target_binary = build_dir / "resnet18"
+   batch_size = 500
+   conf_file = "" # Change this to point to your configuration file.
+   exporter = ModelExporter(model, tuneset, testset, output_dir, config_file=conf_file)
+   exporter.generate(batch_size=batch_size).compile(target_binary, build_dir)
+
+``output_dir``, ``build_dir``, and ``target_binary`` define the folder for code generation,
+the folder for compilation, and the path to the compiled binary, respectively.
+``batch_size`` is the batch size the binary uses during inference.
+
+*Note* that ``conf_file`` is the path to an HPVM approximation configuration file.
+This file decides what approximation the binary will use during inference.
+This path is hardcoded into the binary and is only read when the binary starts,
+so it's fine to have ``conf_file`` point to a non-existing path.
+An example can be found at ``test/dnn_benchmarks/hpvm-c/benchmarks/resnet18_cifar10/data/tuner_confs.txt``.
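+
+Once ``compile`` finishes, the binary can be run directly.
+As noted above, it takes a single "tune" or "test" argument;
+the path below assumes the ``target_binary`` defined in this example:
+
+.. code-block:: bash
+
+   ./resnet18_hpvm/build/resnet18 test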
+
+Supported Operators
+-------------------
+
+Any builtin or custom PyTorch ``Module`` is supported,
+*as long as* the ONNX model obtained when the ``Module`` is exported
+consists of only the following operators:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Convolution
+     - Linear
+     - Pooling
+     - Pointwise
+     - Other
+   * - Conv
+     - MatMul
+     - GlobalAveragePool
+     - BatchNormalization
+     - Flatten
+   * - 
+     - Gemm
+     - AveragePool
+     - Relu
+     - Softmax
+   * - 
+     - 
+     - MaxPool
+     - Tanh
+     - Identity
+   * - 
+     - 
+     - 
+     - 
+     - Pad
+   * - 
+     - 
+     - 
+     - 
+     - Add
+
+
+This choice of operators is largely constrained by what the backend (``tensor_runtime``) supports.
+
+TODOs
+-----
+
+
+#. Optionally insert a Python-C interface in the generated binary to
+   call back into a Dataset class and read the data.
+
+   * Needs pybind11, hardcoding of Python environment, and some fiddling with import mechanism.
+
+#. Expand the list of operators supported in the frontend.
+
+   * Most ideally, create a high-level description of operators that can tie
+     HPVM-C intrinsics and the frontend list of operators together.
diff --git a/hpvm/test/README.md b/hpvm/test/README.md
deleted file mode 100644
index 18cb05b833434fcffc7e4c50b5f38150c924fb19..0000000000000000000000000000000000000000
--- a/hpvm/test/README.md
+++ /dev/null
@@ -1,91 +0,0 @@
-# HPVM Test and Benchmarks
-
-## Directory Organization
-
-This directory is organized as follows:
-
-* `unitTests/` and `regressionTests/`: unit and regression tests for HPVM.
-  These are LLVM-bitcode test cases for HPVM passes.
-
-* `benchmarks/`: includes a few applications written in HPVM-C, a template, and directions for compiling and running these benchmarks.
-
-* `dnn_benchmarks/`: ten (10) DNN benchmarks in HPVM-C, Keras and PyTorch, supported by ApproxHPVM.
-  This tests HPVM as well as the Keras and PyTorch frontends.
-
-  * `dnn_benchmarks/hpvm-c` contains the HPVM-C version of these DNNs.
-    Their organization and usage are similar to the benchmarks under `benchmarks/`.
-
-    Each subfolder contains a DNN with 2 versions (2 `.cpp` files):
-    the `tensor`-targeted version which compiles to `tensor_runtime`,
-    and the `cudnn`-targeted version which compiles to operators in `cuDNN`
-    (has `_cudnn` in name).
-
-  * `dnn_benchmarks/keras` contains these DNNs implemented in Keras,
-    and code for generating them down to HPVM-C (testing Keras frontend).
-  * `dnn_benchmarks/pytorch` contains these DNNs in PyTorch
-    and code for generating them down to HPVM-C (testing PyTorch/ONNX frontend).
-
-  The code generated from Keras and PyTorch frontend should be largely similar and functionally equivalent.
-
-## Running Test Cases and Benchmarks
-
-The easiest way to run tests is to use `make` targets,
-which will also take care of all compilation of test cases and test fixtures.
-The following targets runs these tests respectively:
-
-* `make -j check-hpvm-pass` runs tests in `hpvm_pass`: `hpvm_pass/**/*.ll`.
-  These are regression and unit tests for HPVM passes.
-* `make -j check-hpvm-dnn` runs all 20 DNN benchmarks under `dnn_benchmarks/hpvm-c`
-  (10 DNNs x 2 versions) and validates their accuracy.
-
-  *Note* that this can take quite long due to the size of DNNs and datasets.
-  Depending on your hardware capability, this test can take 5-30 minutes.
-  Also, this is set to run sequentially out of GPU memory concerns.
-
-* `make -j check-hpvm-profiler` runs `hpvm-profiler` on some smaller networks
-  (as it is extremely time-consuming) and presents the tradeoff curve with profiled speedup.
-
-  *Note* that if you're on an NVIDIA Jetson TX2, you may want to run
-  `bash dnn_benchmarks/profiling/jetson_clocks.sh`
-  to ensure that the clocks are running at the maximum frequency
-
-Underneath, `llvm-lit` is used to discover and run the tests.
-
-`benchmarks/` can only be compiled in-source with `make`.
-We are working to migrate it into the `cmake` system.
-
-## Compiling Benchmarks
-
-This section explains how to compile the benchmarks without running them as tests.
-
-### HPVM-C DNN Benchmarks
-
-To build (not run) all `dnn_benchmarks/hpvm-c`, use `make -j dnn_benchmarks`.
-For each benchmark `${bench_name}`, the binary is generated at
-`${build_dir}/tools/hpvm/test/dnn_benchmarks/hpvm-c/${bench_name}`.
-
-Alternatively, it's possible to build just 1 DNN benchmark.
-The output of CMake shows a list of these benchmarks as target names, starting with
-> List of test dnn benchmarks: alexnet2_cifar10;alexnet2_cifar10...
-
-Currently, there are 20 of them. These are:
-
-|                   |                         |
-|-------------------|-------------------------|
-| lenet_mnist       | lenet_mnist_cudnn       |
-| alexnet_cifar10   | alexnet_cifar10_cudnn   |
-| alexnet2_cifar10  | alexnet2_cifar10_cudnn  |
-| vgg16_cifar10     | vgg16_cifar10_cudnn     |
-| vgg16_cifar100    | vgg16_cifar100_cudnn    |
-| mobilenet_cifar10 | mobilenet_cifar10_cudnn |
-| resnet18_cifar10  | resnet18_cifar10_cudnn  |
-| alexnet_imagenet  | alexnet_imagenet_cudnn  |
-| vgg16_imagenet    | vgg16_imagenet_cudnn    |
-| resnet50_imagenet | resnet50_imagenet_cudnn |
-
-`_cudnn` suffix indicates the code is generated onto cuDNN functions.
-Otherwise they are generated to `tensor_runtime` DNN functions which are hand-written in CUDA.
-
-### TODO: figure out how to
-
-1. Auto run Keras and PyTorch tests (generating, compiling and running all DNNs)
diff --git a/hpvm/test/README.rst b/hpvm/test/README.rst
new file mode 100644
index 0000000000000000000000000000000000000000..e770b05cb39796aab6deb480012470a553923323
--- /dev/null
+++ b/hpvm/test/README.rst
@@ -0,0 +1,128 @@
+Tests and Benchmarks
+====================
+
+Directory Organization
+----------------------
+
+The ``hpvm/test`` directory holds all tests and benchmarks in HPVM and is organized as follows:
+
+* ``hpvm_pass/``: unit and regression tests for HPVM Passes, written in LLVM bitcode.
+* ``benchmarks/``: includes a few applications written in HPVM-C, a template, and directions for compiling and running these benchmarks.
+
+  * ``benchmarks/parboil``: Selected benchmarks from the `Parboil <http://impact.crhc.illinois.edu/parboil/parboil.aspx>`_ benchmark suite.
+  * ``benchmarks/pipeline``: An edge detection pipeline benchmark.
+  * ``benchmarks/hpvm-cava``: A camera ISP pipeline, adapted from C code provided by our collaborators at `Harvard <http://vlsiarch.eecs.harvard.edu>`_.
+
+* ``dnn_benchmarks/``: ten (10) DNN benchmarks in HPVM-C, Keras and PyTorch, supported by ApproxHPVM.
+  This tests HPVM as well as the Keras and PyTorch frontends.
+
+  * ``dnn_benchmarks/hpvm-c`` contains the HPVM-C version of these DNNs.
+    Their organization and usage are similar to the benchmarks under ``benchmarks/``.
+
+    Each subfolder contains a DNN with 2 versions (2 ``.cpp`` files):
+    the ``tensor``-targeted version which compiles to ``tensor_runtime``,
+    and the ``cudnn``-targeted version which compiles to operators in ``cuDNN``
+    (with ``_cudnn`` in its name).
+
+  * ``dnn_benchmarks/keras`` contains these DNNs implemented in Keras,
+    and code for generating them down to HPVM-C (testing the Keras frontend).
+  * ``dnn_benchmarks/pytorch`` contains these DNNs in PyTorch
+    and code for generating them down to HPVM-C (testing the PyTorch/ONNX frontend).
+
+  The code generated from the Keras and PyTorch frontends should be largely similar and functionally equivalent.
+
+Running Test Cases and Benchmarks
+---------------------------------
+
+The easiest way to run tests is to use ``make`` targets,
+which will also take care of all compilation of test cases and test fixtures.
+The following targets run these tests respectively:
+
+* ``make -j check-hpvm-pass`` runs tests in ``hpvm_pass``: ``hpvm_pass/**/*.ll``.
+  These are regression and unit tests for HPVM passes.
+* ``make -j check-hpvm-dnn`` runs all 20 DNN benchmarks under ``dnn_benchmarks/hpvm-c``
+  (10 DNNs x 2 versions) and validates their accuracy.
+
+  *Note* that this can take quite long due to the size of DNNs and datasets.
+  Depending on your hardware capability, this test can take 5-30 minutes.
+  Also, this is set to run sequentially out of GPU memory concerns.
+
+* ``make -j check-hpvm-profiler`` runs ``hpvm-profiler`` on some smaller networks
+  (as it is extremely time-consuming) and presents the tradeoff curve with profiled speedup.
+
+  *Note* that if you're on an NVIDIA Jetson TX2, you may want to run
+  ``bash dnn_benchmarks/profiling/jetson_clocks.sh``
+  to ensure that the clocks are running at the maximum frequency.
+
+Underneath, ``llvm-lit`` is used to discover and run the tests.
+
+``benchmarks/`` can only be compiled in-source with ``make``.
+We are working to migrate it into the ``cmake`` system.
+
+Compiling Benchmarks
+--------------------
+
+This section explains how to compile the benchmarks without running them as tests.
+
+HPVM-C DNN Benchmarks
+^^^^^^^^^^^^^^^^^^^^^
+
+To build (not run) all ``dnn_benchmarks/hpvm-c``, use ``make -j dnn_benchmarks``.
+For each benchmark ``${bench_name}``, the binary is generated at
+``${build_dir}/tools/hpvm/test/dnn_benchmarks/hpvm-c/${bench_name}``.
+
+Alternatively, it's possible to build just one DNN benchmark
+(see the example after the table below).
+The output of CMake shows a list of these benchmarks as target names, starting with:
+
+   List of test dnn benchmarks: alexnet2_cifar10;alexnet2_cifar10...
+
+
+Currently, there are 20 of them. These are:
+
+.. list-table::
+
+   * - lenet_mnist
+     - lenet_mnist_cudnn
+   * - alexnet_cifar10
+     - alexnet_cifar10_cudnn
+   * - alexnet2_cifar10
+     - alexnet2_cifar10_cudnn
+   * - vgg16_cifar10
+     - vgg16_cifar10_cudnn
+   * - vgg16_cifar100
+     - vgg16_cifar100_cudnn
+   * - mobilenet_cifar10
+     - mobilenet_cifar10_cudnn
+   * - resnet18_cifar10
+     - resnet18_cifar10_cudnn
+   * - alexnet_imagenet
+     - alexnet_imagenet_cudnn
+   * - vgg16_imagenet
+     - vgg16_imagenet_cudnn
+   * - resnet50_imagenet
+     - resnet50_imagenet_cudnn
+
+
+The ``_cudnn`` suffix indicates that the code is generated to call cuDNN functions;
+otherwise, it is generated to call ``tensor_runtime`` DNN functions, which are hand-written in CUDA.
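+
+For example, to build a single benchmark and locate its binary
+(assuming ``${build_dir}`` is your CMake build directory; any target name
+from the table above works):
+
+.. code-block:: bash
+
+   cd ${build_dir}
+   make -j lenet_mnist
+   ls tools/hpvm/test/dnn_benchmarks/hpvm-c/lenet_mnist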
+
+TODO: figure out how to
+^^^^^^^^^^^^^^^^^^^^^^^
+
+
+#. Auto run Keras and PyTorch tests (generating, compiling and running all DNNs)