LLM on AMD GPU: Memory Footprint and Performance Improvements on AMD Ryzen™ AI and Radeon™ Platforms

May 23, 2024


Written by: Hisham Chowdhury (AMD), Sonbol Yazdanbakhsh (AMD), Lucas Neves (AMD)



Introduction

Over the past year, AMD, in close partnership with Microsoft, has made significant advances in accelerating generative AI workloads via ONNXRuntime with DirectML on AMD platforms. As a follow-up to our previous releases, we are happy to share that, in close collaboration with Microsoft, we are bringing 4-bit quantization support and acceleration for LLMs (Large Language Models) to integrated and discrete AMD Radeon GPU platforms running ONNXRuntime with DirectML.

LLMs are invariably bottlenecked by memory bandwidth and memory availability on the system. Depending on the number of parameters in the LLM (7B, 13B, 70B, etc.), memory consumption grows significantly, which puts some systems out of contention for running such workloads. To solve that problem and make a large set of integrated and discrete GPUs available to these LLM workloads, we are introducing 4-bit quantization for LLM parameters, which greatly reduces memory usage while increasing performance at the same time.
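
To put rough numbers on the memory side, the back-of-the-envelope sketch below compares weight storage at 16-bit versus 4-bit precision for the parameter counts mentioned above. It deliberately ignores activations, the KV cache, and the small per-block scale metadata added by the quantized format; it is an illustration, not a measurement.

# Rough weight-storage estimate: 16-bit vs. 4-bit parameters.
# Activations, KV cache, and quantization metadata are ignored for simplicity.
GIB = 1024 ** 3

for params_b in (7, 13, 70):                 # model sizes in billions of parameters
    n = params_b * 1e9
    fp16_gib = n * 2 / GIB                   # 2 bytes per 16-bit weight
    int4_gib = n * 0.5 / GIB                 # 0.5 bytes per 4-bit weight
    print(f"{params_b:>3}B params: {fp16_gib:6.1f} GiB (fp16) -> {int4_gib:5.1f} GiB (4-bit)")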

Fig: Software stack on the AMD Radeon platform with DirectML

NEW! Activation-Aware Quantization


With the latest DirectML and AMD driver preview release, Microsoft and AMD are happy to introduce Activation-Aware Quantization (AWQ)-based LLM acceleration on AMD GPU platforms. The AWQ technique compresses weights to 4-bit wherever possible with minimal impact on accuracy, significantly reducing the memory footprint of running these LLM models while increasing performance at the same time.

The AWQ technique achieves this compression while maintaining accuracy by identifying the top 1% of salient weights that are necessary for preserving model accuracy and quantizing the remaining 99% of the weight parameters. It takes the actual data distribution of the activations into account when deciding which weights to quantize from 16-bit to 4-bit, resulting in up to a 3x memory reduction for the quantized weights/LLM parameters. Because it considers the activation data distribution, it also preserves model accuracy better than traditional weight quantization techniques that do not.
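
The snippet below is a minimal conceptual sketch of that idea, not the actual AWQ implementation used by the toolchain or the driver: per-channel activation statistics rank weight columns by salience, the most salient fraction is left untouched, and the rest are quantized to signed 4-bit with a simple shared scale per group of weights.

import numpy as np

def sketch_awq_quantize(W, act_scale, keep_frac=0.01, group=128):
    """Conceptual sketch only (not the production AWQ algorithm).

    W          : (out, in) weight matrix
    act_scale  : (in,) average activation magnitude per input channel
    keep_frac  : fraction of columns kept unquantized (~top 1% salient)
    group      : number of weight columns sharing one quantization scale
    """
    in_ch = W.shape[1]
    n_keep = max(1, int(keep_frac * in_ch))
    salient = np.argsort(-act_scale)[:n_keep]       # columns seeing the largest activations

    Wq = W.astype(np.float32).copy()
    mask = np.ones(in_ch, dtype=bool)
    mask[salient] = False                           # salient columns stay in high precision

    cols = np.where(mask)[0]
    for start in range(0, len(cols), group):
        idx = cols[start:start + group]
        block = Wq[:, idx]
        scale = np.abs(block).max() / 7.0 + 1e-8    # map block to the int4 range [-7, 7]
        q = np.clip(np.round(block / scale), -7, 7) # quantize
        Wq[:, idx] = q * scale                      # simulated dequantized 4-bit weights
    return Wq, salient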

This 4-bit AWQ quantization is performed using the Microsoft Olive toolchain for DirectML; at runtime, ML layers resident in the AMD driver dequantize the parameters and accelerate them on the ML hardware, delivering the performance boost on AMD Radeon GPUs. The quantization described here is post-training quantization, performed offline before the model is deployed for inference. This now makes it possible to run these language models on-device on systems with limited memory, which was not possible before.
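
At inference time, the packed 4-bit weights are expanded back to higher precision just before (or fused into) the matrix multiplications. The sketch below shows the general shape of such a block-wise dequantization, assuming unsigned 4-bit values packed two per byte with one scale and zero point per block (for example, a block size of 128, as suggested by the directml-int4-awq-block-128 model folder used in the sample later in this post). The actual kernels live inside DirectML and the AMD driver's ML layers; this is only an illustration of the math.

import numpy as np

def dequantize_int4_block(packed, scales, zero_points, block=128):
    """Conceptual block-wise 4-bit dequantization (illustrative only).

    packed      : uint8 array, two 4-bit weights per byte
    scales      : one scale per block of `block` weights
    zero_points : one integer zero point per block
    """
    lo = packed & 0x0F                                  # low nibble
    hi = (packed >> 4) & 0x0F                           # high nibble
    q = np.stack([lo, hi], axis=-1).reshape(-1).astype(np.float32)

    n_blocks = len(q) // block
    q = q[:n_blocks * block].reshape(n_blocks, block)
    w = (q - zero_points[:, None]) * scales[:, None]    # dequantize per block
    return w.reshape(-1)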

Performance


The memory footprint is reduced compared to running the 16-bit variant of the weights on AMD Radeon™ RX 7900 XTX systems, with a similar reduction on AMD Radeon™ 780M based AMD Ryzen™ AI platforms:

*Figures in charts are averages (see endnote RX-1107)

As mentioned earlier, the transition to 4-bit quantization for LLM parameters not only improves memory utilization, but also improves performance by significantly reducing memory bandwidth requirements:

*Figures in charts are averages (see endnote RX-1108)

*Figures in charts are averages (see endnote RM-159)
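
The performance gain follows from token generation being memory-bandwidth bound: each generated token requires reading essentially all of the weights once, so a rough ceiling on tokens per second is memory bandwidth divided by the weight footprint. Shrinking the weights roughly 3-4x raises that ceiling proportionally. The numbers below are purely illustrative; the 800 GB/s figure is an assumption chosen for the arithmetic, not a measured specification of any platform above.

# Illustrative only: rough tokens/sec ceiling = bandwidth / bytes read per token.
bandwidth_gbs = 800          # assumed memory bandwidth in GB/s, for illustration only
params = 7e9                 # 7B-parameter model

for label, bytes_per_weight in (("fp16", 2.0), ("4-bit", 0.5)):
    weight_bytes = params * bytes_per_weight
    ceiling = bandwidth_gbs * 1e9 / weight_bytes
    print(f"{label:>5}: ~{ceiling:5.0f} tokens/s upper bound")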

Running a Sample LLM Application Using 4-bit Quantized Models

Run Question-Answer Models Using the ONNXRuntime GenAI Backend:

Download the 4-bit quantized ONNX model (e.g., Phi-3-mini-4k) and run the sample question-answer script:

# Download the 4-bit AWQ-quantized ONNX model (Phi-3-mini-4k)
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx

# Set up a Python environment with the DirectML build of ONNXRuntime GenAI
conda create --name=llm-int4 python
conda activate llm-int4
pip install numpy onnxruntime-genai-directml

# Fetch the sample question-answer script and run it against the DirectML int4 model
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/model-qa.py -o model-qa.py
python model-qa.py -m Phi-3-mini-4k-instruct-onnx\directml\directml-int4-awq-block-128 --timing --max_length=256
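
For reference, model-qa.py drives generation through the onnxruntime-genai Python API. A pared-down loop in the style of that script is sketched below; the API surface has evolved across onnxruntime-genai releases, so treat this as illustrative and refer to model-qa.py for the authoritative usage.

import onnxruntime_genai as og

# Path to the 4-bit AWQ DirectML model downloaded above
model_path = r"Phi-3-mini-4k-instruct-onnx\directml\directml-int4-awq-block-128"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Phi-3 chat template around a sample question
prompt = "<|user|>\nWhat is 4-bit quantization?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

# Generate and stream tokens until the model is done
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)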
