Power-aware Deep Learning Model Serving

This repository contains the code for the paper "Power-aware Deep Learning Model Serving with μ-Serve" (USENIX ATC 2024).

µ-Serve is a model-serving framework that jointly optimizes power consumption and serving latency/throughput when serving multiple ML models in a homogeneous GPU cluster. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize model-serving performance, but they fall short of leveraging GPU frequency scaling for power saving. We demonstrate (1) the benefits of GPU frequency scaling for power saving in model serving, and (2) the necessity of co-designing and optimizing scheduling, fine-grained model multiplexing, and GPU frequency scaling.

Overview

Below is the overall architecture of the proposed model serving framework:

The framework consists of three main components:

  • Power-aware model partitioning and placement (based on AlpaServe)
  • Proxy model-based request scheduling that addresses the autoregressive nature of LLM serving
  • Dynamic GPU frequency scaling based on the (latency) SLO attainment of the serving workload

In the offline phase, sensitivity scores are profiled and used to partition the models across devices, creating opportunities for GPU frequency tuning. In addition, proxy models are trained on historical model input/output data to predict the output sequence length of each request (in LLM serving). In the online phase, requests are scheduled based on the proxy-model predictions (in a shortest-job-first manner), and GPU frequencies are adjusted dynamically, driven by the SLO attainment of the serving workload.
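
As a concrete illustration of this online loop, the minimal sketch below orders requests by the proxy model's predicted output length (shortest-job-first) and steps the GPU SM clock up or down whenever recent SLO attainment crosses a target. All names and constants here (SJFQueue, FrequencyController, the clock range, step size, window length) are illustrative assumptions rather than the repository's actual API; locking clocks with nvidia-smi -lgc requires root privileges.

# Minimal sketch of the online phase: SJF ordering by predicted output
# length plus SLO-attainment-driven GPU frequency stepping. Names and
# constants are illustrative assumptions, not the repository's API.
import heapq
import subprocess
from collections import deque

class SJFQueue:
    """Order pending requests by the proxy model's predicted output length."""
    def __init__(self):
        self._heap = []

    def push(self, predicted_len, request):
        heapq.heappush(self._heap, (predicted_len, id(request), request))

    def pop(self):
        return heapq.heappop(self._heap)[-1]   # shortest predicted job first

class FrequencyController:
    """Step the GPU SM clock up/down based on recent SLO attainment."""
    def __init__(self, slo_target=0.99, step_mhz=75, min_mhz=900, max_mhz=1380):
        self.slo_target, self.step_mhz = slo_target, step_mhz
        self.min_mhz, self.max_mhz = min_mhz, max_mhz
        self.current_mhz = max_mhz
        self.window = deque(maxlen=200)        # 1 if a request met its latency SLO

    def record(self, met_slo):
        self.window.append(1 if met_slo else 0)

    def adjust(self):
        if not self.window:
            return
        attainment = sum(self.window) / len(self.window)
        if attainment >= self.slo_target and self.current_mhz > self.min_mhz:
            self.current_mhz -= self.step_mhz  # SLO headroom: lower clock, save power
        elif attainment < self.slo_target and self.current_mhz < self.max_mhz:
            self.current_mhz += self.step_mhz  # SLO at risk: raise clock
        # Lock the SM clock (requires root; supported on Volta+ GPUs such as V100).
        subprocess.run(["nvidia-smi", "-lgc", f"{self.current_mhz},{self.current_mhz}"],
                       check=False)

In the actual framework, the length prediction comes from the fine-tuned proxy model and the controller acts on the SLO attainment reported by the serving backend.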

Code Structure

The code is organized as follows:

  • output-token-len-predictions/: code for training and evaluating the output token length predictor
    • Note: This step is optional for artifact evaluation (AE), since proxy-model fine-tuning takes >24 hours per task type. An evaluation of the predictor accuracy can be found in model-serving/.
  • model-serving/: code for the proxy model-based request scheduling
  • characterization/: code for characterizing the output token length distribution of the LLM model
  • power/: code for profiling the sensitivity scores, machine/device profiles, and frequency scaling
  • requirements.txt: Python dependencies

Usage

Requirements

The repo is tested on:

  • Ubuntu 22.04.4 LTS
  • Python 3.11.4
  • Conda 23.9.0
  • NVIDIA Tesla V100 (32GB)

Create a Conda environment and install the Python dependencies:

conda create -n atc24-env python=3.11.4 -y
conda activate atc24-env
pip install -r requirements.txt

[Functionality & Reproducing] Proxy-model Prediction

Note that this subsection is optional for AE, as training takes >24 hours; the description and usage commands below are provided to demonstrate functionality.

[Optional] Training dataset generation:

cd output-token-len-predictions
python preprocess_dataset.py [--FLAGS]

[Optional] Training and evaluation of the output token length predictor:

python latency_prediction.py [--FLAGS]

The predictor supports five task modes:

  • Regression --task_type 0
  • Binary Classification --task_type 1
  • Multi-class Classification --task_type 2
  • Multi-class Ordinal Classification --task_type 3
  • Bi-class Ordinal Classification --task_type 4

For regression and ordinal classification, you can choose to use L1 loss or MSE loss during training:

  • L1 loss --l1_loss
  • MSE loss (default; omit the flag)

To enable multi-round support, add the --multi_round flag.
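
For example, the following invocation (an illustrative combination of the flags listed above; any dataset-related flags are omitted) trains the regression predictor with L1 loss:

python latency_prediction.py --task_type 0 --l1_loss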

Example commands can be found in output-token-len-predictions/script.sh.

Prediction Accuracy Eval

Accuracy of various training methods:

cd model-serving/prediction/final/
python eval_prediction.py

Results should be consistent with Table 2 in the paper, where multi-class classification achieves an accuracy of around 0.57 and regression with L1 loss achieves the highest accuracy of 0.61 (this figure is after data cleaning, which we will update in the camera-ready version).

Prediction overhead analysis:

cd model-serving/prediction/
python predictor_overhead_vs_model_serving_latency.py

[Functionality] Characterization

To characterize the output token length distribution of your own LLM, use the code in characterization/:

cd characterization/
python characterization.py
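
For a quick look at an output-length distribution outside the provided script, the minimal sketch below computes token-length percentiles for a list of generated outputs; the gpt2 tokenizer and the in-line example strings are assumptions for illustration, not part of the repository.

# Sketch: token-length percentiles for a set of generated outputs.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumption: any HF tokenizer works here
outputs = ["example generated answer one", "another, somewhat longer generated answer"]
lengths = [len(tokenizer.encode(text)) for text in outputs]
print("p50/p90/p99 output tokens:", np.percentile(lengths, [50, 90, 99]))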

[Functionality & Reproducing] SSJF Scheduler

cd model-serving/
python auto_eval.py
python auto_eval_lineplot.py

Results are located in model-serving/results/.

See README.md in model-serving/ for details (about the functionality and the support for batching).

[Functionality & Reproducing] Power Saving

cd power/
./eval.sh

Results will be located in the same directory power/.
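
Independently of the provided scripts, the frequency scaling and power readings can be sanity-checked with standard NVIDIA tooling (plain nvidia-smi queries, not part of this repository):

nvidia-smi -q -d SUPPORTED_CLOCKS
nvidia-smi --query-gpu=clocks.sm,power.draw --format=csv -l 1

The first command lists the SM/memory clock pairs available for locking on the device; the second samples the current SM clock and power draw once per second.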

[Optional] To profile a new model apart from existing ones, run profile_models.py and characterization.py.

[Optional] To profile a new GPU device, run operator-clustering/benchmark.py.

See README.md in power/ for remaining details.

Citation

If you find this repository useful, please consider citing the following paper:

@inproceedings{qiu2024atc,
  title={Power-aware Deep Learning Model Serving with $\mu$-Serve},
  author={Qiu, Haoran and Mao, Weichao and Patke, Archit and Cui, Shengkun and Jha, Saurabh and Wang, Chen and Franke, Hubertus and Kalbarczyk, Zbigniew T and Ba{\c{s}}ar, Tamer and Iyer, Ravishankar K},
  booktitle={Proceedings of the 2024 USENIX Annual Technical Conference (USENIX ATC 2024)},
  year={2024}
}

Getting Support