[How-To] Running Optimized Llama2 with Microsoft DirectML on AMD Radeon Graphics
Nov 15, 2023

Prepared by Hisham Chowdhury (AMD) and Sonbol Yazdanbakhsh (AMD).
Microsoft and AMD continue to collaborate on enabling and accelerating AI workloads across AMD GPUs on Windows platforms. Following up on our earlier improvements to Stable Diffusion workloads, we are happy to share that the Microsoft and AMD engineering teams worked closely to optimize Llama2 to run on AMD GPUs, accelerated via the Microsoft DirectML platform API and AMD driver ML metacommands. The driver-resident ML metacommands use the Wave Matrix Multiply Accumulate (WMMA) intrinsics of AMD Matrix Processing Cores to accelerate DirectML-based ML workloads, including Stable Diffusion and Llama2.
Fig 1: OnnxRuntime-DirectML on AMD GPUs
As we continue to further optimize Llama2, watch for future updates and improvements via Microsoft Olive and AMD graphics drivers.
Below are brief instructions on how to optimize the Llama2 model with Microsoft Olive, and how to run the model on any DirectML-capable AMD graphics card with ONNXRuntime, accelerated via the DirectML platform API.
If you have already optimized the ONNX model for execution and just want to run the inference, please advance to Step 3 below.
Prerequisites:
- Git (Git for Windows)
- Anaconda
- onnxruntime_directml 1.16.2 or newer
- A platform with an AMD Graphics Processing Unit (GPU)
- Driver: AMD Software: Adrenalin Edition™ 23.11.1 or newer (https://www.amd.com/en/support)
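To confirm the environment meets these requirements, a quick check from Python (a minimal sketch, not one of the article's scripts):

    # Sanity check: confirm the DirectML build of ONNX Runtime is installed
    # and recent enough, and that the DirectML execution provider is exposed.
    import onnxruntime as ort

    print(ort.__version__)                # expect 1.16.2 or newer
    print(ort.get_available_providers())  # should list 'DmlExecutionProvider'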
Download the Llama 2 models from Meta’s release, then use Microsoft Olive to convert them to ONNX format and optimize the ONNX models for efficient, hardware-accelerated execution on AMD GPUs.
Step 1: Open the Anaconda terminal and input the following commands:
- conda create --name=llama2_Optimize python=3.9
- conda activate llama2_Optimize
- git clone https://github.com/microsoft/Olive.git
- cd Olive
- pip install -r requirements.txt
- pip install -e .
- cd examples/directml/llama_v2
- pip install -r requirements.txt
Step 2: Request access to the Llama 2 weights from Meta, convert the weights to ONNX, and optimize the ONNX models:
- python llama_v2.py --optimize
- Note: The first time this script is invoked it can take a while, since it needs to download the Llama 2 weights from Meta. When prompted, paste the URL that Meta sent to your e-mail address (the link is valid for 24 hours).
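Before moving on, you can optionally confirm that the generated model loads under DirectML. A minimal sketch; the model path below is hypothetical, so substitute the output path from your Olive run:

    # Optional check: confirm the optimized model loads with the DirectML
    # execution provider. Replace the path with the one Olive printed.
    import onnxruntime as ort

    sess = ort.InferenceSession(
        "models/optimized/llama_v2.onnx",
        providers=["DmlExecutionProvider"],
    )
    print([i.name for i in sess.get_inputs()])  # inspect the expected inputs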
Step 3: Once the optimized ONNX model is generated from Step 2, or if you already have the models locally, follow the instructions below to run Llama2 on AMD Graphics.
Open the Anaconda terminal and input the following commands:
- conda create --name=llama2 python=3.9
- conda activate llama2
- pip install gradio==3.42.0
- pip install markdown
- pip install mdtex2html
- pip install optimum
- pip install tabulate
- pip install pygments
- pip install onnxruntime_directml (make sure it’s 1.16.2 or newer)
- git clone https://github.com/microsoft/Olive.git
- cd Olive\examples\directml\llama_v2
Copy the optimized models generated in Step 2 into the “Olive\examples\directml\llama_v2\models” folder.
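The exact layout depends on your Olive run; a quick way to inspect what you copied (a minimal sketch, run from the llama_v2 example directory):

    # Print the copied model tree under the models folder.
    from pathlib import Path

    for p in sorted(Path("models").rglob("*")):
        print(p)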
Then run the model with a prompt, for example:
- python run_llama_v2_io_binding.py --prompt="what is the capital of California and what is California famous for?"
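The script’s name refers to ONNX Runtime I/O binding, which pre-binds input and output buffers so tensors are not copied between host and GPU on every call. A minimal sketch of the pattern; the model path and the "input_ids"/"logits" tensor names are hypothetical, not the actual llama_v2 interface:

    # Schematic I/O-binding pattern with onnxruntime_directml.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])

    io = sess.io_binding()
    tokens = np.zeros((1, 8), dtype=np.int64)  # placeholder token IDs
    io.bind_cpu_input("input_ids", tokens)     # hypothetical input name
    io.bind_output("logits")                   # hypothetical output name
    sess.run_with_iobinding(io)                # buffers stay bound across calls

    print(io.copy_outputs_to_cpu()[0].shape)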
To use the Chat App, an interactive interface for running the llama_v2 model, follow these steps:
Open the Anaconda terminal and input the following commands:
- conda create --name=llama2_chat python=3.9
- conda activate llama2_chat
- pip install gradio==3.42.0
- pip install markdown
- pip install mdtex2html
- pip install optimum
- pip install tabulate
- pip install pygments
- pip install onnxruntime_directml (make sure it’s 1.16.2 or newer)
- git clone https://github.com/microsoft/Olive.git
- cd Olive\examples\directml\llama_v2
Copy the optimized models into the “Olive\examples\directml\llama_v2\models” folder, as described in Step 3 above.
Launch the Chat App
- python chat_app/app.py
- Click on the local URL printed in the terminal
The chat page opens in your browser; add your prompt and start chatting.
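For reference, such a Gradio chat front end is wired up roughly like the sketch below (gradio==3.42.0; the respond() body is a placeholder, not the real model call):

    # Sketch of a minimal Gradio chat UI like chat_app/app.py.
    import gradio as gr

    def respond(message, history):
        # The real app would run the optimized Llama 2 model here via
        # onnxruntime_directml and return the generated text.
        return "echo: " + message

    gr.ChatInterface(respond).launch()  # prints the local URL to click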
