LLM on AMD GPU: Memory Footprint and Performance Improvements on AMD Ryzen™ AI and Radeon™ Platforms

May 23, 2024


Written by: Hisham Chowdhury (AMD), Sonbol Yazdanbakhsh (AMD), Lucas Neves (AMD)



Introduction

Over the past year, AMD, in close partnership with Microsoft, has made significant advances in accelerating generative AI workloads via ONNXRuntime with DirectML on AMD platforms. As a follow-up to our previous releases, we are happy to share that, in close collaboration with Microsoft, we are bringing 4-bit quantization support and acceleration for LLMs (Large Language Models) to integrated and discrete AMD Radeon GPU platforms running ONNXRuntime with DirectML.

LLMs are invariably bottlenecked by memory bandwidth and memory availability on the system. Depending on the number of parameters in the LLM (7B, 13B, 70B, etc.), memory consumption grows significantly, which puts some systems out of contention for running such workloads. To solve that problem and make a large set of integrated and discrete GPUs available to these LLM workloads, we are introducing 4-bit quantization for LLM parameters, which greatly reduces memory usage while increasing performance at the same time.
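
To put rough numbers on the memory side, the back-of-the-envelope sketch below compares weight storage at 16-bit versus 4-bit precision for the parameter counts mentioned above. It deliberately ignores activations, the KV cache, and the small per-block scale metadata added by the quantized format; it is an illustration, not a measurement.

# Rough weight-storage estimate: 16-bit vs. 4-bit parameters.
# Activations, KV cache, and quantization metadata are ignored for simplicity.
GIB = 1024 ** 3

for params_b in (7, 13, 70):                 # model sizes in billions of parameters
    n = params_b * 1e9
    fp16_gib = n * 2 / GIB                   # 2 bytes per 16-bit weight
    int4_gib = n * 0.5 / GIB                 # 0.5 bytes per 4-bit weight
    print(f"{params_b:>3}B params: {fp16_gib:6.1f} GiB (fp16) -> {int4_gib:5.1f} GiB (4-bit)")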

Fig: Software stack on the AMD Radeon platform with DirectML

NEW! Activation-Aware Quantization


With the latest DirectML and AMD driver preview release, Microsoft and AMD are happy to introduce Activation-Aware Quantization (AWQ)-based LLM acceleration on AMD GPU platforms. The AWQ technique compresses weights to 4-bit wherever possible with minimal impact on accuracy, significantly reducing the memory footprint of running these LLM models while increasing performance at the same time.

The AWQ technique achieves this compression while maintaining accuracy by identifying the top 1% of salient weights that are necessary for preserving model accuracy and quantizing the remaining 99% of the weight parameters. It takes the actual data distribution of the activations into account when deciding which weights to quantize from 16-bit to 4-bit, resulting in up to a 3x memory reduction for the quantized weights/LLM parameters. Because it considers the activation data distribution, it also preserves model accuracy better than traditional weight quantization techniques that do not.
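
The snippet below is a minimal conceptual sketch of that idea, not the actual AWQ implementation used by the toolchain or the driver: per-channel activation statistics rank weight columns by salience, the most salient fraction is left untouched, and the rest are quantized to signed 4-bit with a simple shared scale per group of weights.

import numpy as np

def sketch_awq_quantize(W, act_scale, keep_frac=0.01, group=128):
    """Conceptual sketch only (not the production AWQ algorithm).

    W          : (out, in) weight matrix
    act_scale  : (in,) average activation magnitude per input channel
    keep_frac  : fraction of columns kept unquantized (~top 1% salient)
    group      : number of weight columns sharing one quantization scale
    """
    in_ch = W.shape[1]
    n_keep = max(1, int(keep_frac * in_ch))
    salient = np.argsort(-act_scale)[:n_keep]       # columns seeing the largest activations

    Wq = W.astype(np.float32).copy()
    mask = np.ones(in_ch, dtype=bool)
    mask[salient] = False                           # salient columns stay in high precision

    cols = np.where(mask)[0]
    for start in range(0, len(cols), group):
        idx = cols[start:start + group]
        block = Wq[:, idx]
        scale = np.abs(block).max() / 7.0 + 1e-8    # map block to the int4 range [-7, 7]
        q = np.clip(np.round(block / scale), -7, 7) # quantize
        Wq[:, idx] = q * scale                      # simulated dequantized 4-bit weights
    return Wq, salient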

This 4-bit AWQ quantization is performed using the Microsoft Olive toolchain for DirectML; at runtime, ML layers resident in the AMD driver dequantize the parameters and accelerate them on the ML hardware, delivering the performance boost on AMD Radeon GPUs. The quantization described here is post-training quantization, performed offline before the model is deployed for inference. This now makes it possible to run these language models on-device on systems with limited memory, which was not possible before.
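
At inference time, the packed 4-bit weights are expanded back to higher precision just before (or fused into) the matrix multiplications. The sketch below shows the general shape of such a block-wise dequantization, assuming unsigned 4-bit values packed two per byte with one scale and zero point per block (for example, a block size of 128, as suggested by the directml-int4-awq-block-128 model folder used in the sample later in this post). The actual kernels live inside DirectML and the AMD driver's ML layers; this is only an illustration of the math.

import numpy as np

def dequantize_int4_block(packed, scales, zero_points, block=128):
    """Conceptual block-wise 4-bit dequantization (illustrative only).

    packed      : uint8 array, two 4-bit weights per byte
    scales      : one scale per block of `block` weights
    zero_points : one integer zero point per block
    """
    lo = packed & 0x0F                                  # low nibble
    hi = (packed >> 4) & 0x0F                           # high nibble
    q = np.stack([lo, hi], axis=-1).reshape(-1).astype(np.float32)

    n_blocks = len(q) // block
    q = q[:n_blocks * block].reshape(n_blocks, block)
    w = (q - zero_points[:, None]) * scales[:, None]    # dequantize per block
    return w.reshape(-1)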

Performance


The memory footprint is reduced compared to running the 16-bit variant of the weights on AMD Radeon™ RX 7900 XTX systems, with a similar reduction on AMD Radeon™ 780M based AMD Ryzen™ AI platforms:

*Figures in charts are averages (see endnote RX-1107)

As mentioned earlier, the transition to 4-bit quantization for LLM parameters not only improves memory utilization, but also improves performance by significantly reducing memory bandwidth requirements:

*Figures in charts are averages (see endnote RX-1108)

*Figures in charts are averages (see endnote RM-159)
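
The performance gain follows from token generation being memory-bandwidth bound: each generated token requires reading essentially all of the weights once, so a rough ceiling on tokens per second is memory bandwidth divided by the weight footprint. Shrinking the weights roughly 3-4x raises that ceiling proportionally. The numbers below are purely illustrative; the 800 GB/s figure is an assumption chosen for the arithmetic, not a measured specification of any platform above.

# Illustrative only: rough tokens/sec ceiling = bandwidth / bytes read per token.
bandwidth_gbs = 800          # assumed memory bandwidth in GB/s, for illustration only
params = 7e9                 # 7B-parameter model

for label, bytes_per_weight in (("fp16", 2.0), ("4-bit", 0.5)):
    weight_bytes = params * bytes_per_weight
    ceiling = bandwidth_gbs * 1e9 / weight_bytes
    print(f"{label:>5}: ~{ceiling:5.0f} tokens/s upper bound")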

Running a Sample LLM Application Using 4-bit Quantized Models

Run Question-Answer Models Using the ONNXRuntime GenAI Backend:

Download the 4-bit quantized ONNX model (e.g., Phi-3-mini-4k) and run the sample question-answer script:

# Download the 4-bit AWQ-quantized ONNX model (Phi-3-mini-4k)
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx

# Set up a Python environment with the DirectML build of ONNXRuntime GenAI
conda create --name=llm-int4 python
conda activate llm-int4
pip install numpy onnxruntime-genai-directml

# Fetch the sample question-answer script and run it against the DirectML int4 model
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/model-qa.py -o model-qa.py
python model-qa.py -m Phi-3-mini-4k-instruct-onnx\directml\directml-int4-awq-block-128 --timing --max_length=256
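
For reference, model-qa.py drives generation through the onnxruntime-genai Python API. A pared-down loop in the style of that script is sketched below; the API surface has evolved across onnxruntime-genai releases, so treat this as illustrative and refer to model-qa.py for the authoritative usage.

import onnxruntime_genai as og

# Path to the 4-bit AWQ DirectML model downloaded above
model_path = r"Phi-3-mini-4k-instruct-onnx\directml\directml-int4-awq-block-128"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Phi-3 chat template around a sample question
prompt = "<|user|>\nWhat is 4-bit quantization?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

# Generate and stream tokens until the model is done
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)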
