Data Curation Just Got Smarter: How Essential AI Used AMD Instinct GPUs to Label the Web for Smarter AI

Jun 25, 2025

Essential AI - Graphic showing AI image representing Math, Web Code, STEM, and Medical

Introduction

In our previous post, Pretraining Just Got Smarter: How Essential AI’s Research Redefines the Role of Reasoning in Model Development, we looked at how Essential AI is redefining model reasoning through smarter pretraining techniques. Smarter models start with smarter data—and that’s where Essential AI’s latest work takes a major leap forward.

Their new dataset, Essential-Web v1.0, is a massive, fully labeled resource offering researchers instant access to clean, richly annotated training data across science, medicine, code, and more. Built from 24 trillion tokens and annotated with a detailed taxonomy, it enables fast, domain-specific data extraction and simplifies AI training and curation. And it’s all made possible by AMD Instinct™ MI300X GPUs, which deliver the scale, speed, and efficiency required to make this vision real.

Why This Dataset Matters

Most large-scale datasets today fall into two categories. Some are massive but opaque, filtered with black-box models and vague quality scores. Others are carefully curated for specific domains, often requiring custom pipelines, which are expensive and time-consuming to rebuild.

Essential-Web offers a new approach: you can build your own dataset in under 15 minutes with no classifiers and no scraping, just filters.

Because a consistent taxonomy has been applied to all 23.6 billion web documents, researchers can use simple filters over the labels to extract domain-specific corpora. For example:

A math dataset focused on reasoning and correctness
A medical QA corpus with high-quality source material
A web code dataset with spam and boilerplate removed
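To make the idea of "just filters" concrete, here is a minimal Python sketch of label-based filtering. It is an illustration only: the `Doc` fields (`topic`, `reasoning_depth`, `is_boilerplate`), the label values, and the `filter_docs` helper are all hypothetical and are not the actual Essential-Web schema or API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Doc:
    """One labeled web document. All field names here are hypothetical."""
    text: str
    topic: str             # assumed taxonomy label, e.g. "mathematics"
    reasoning_depth: int   # assumed scale: 1 (none) to 5 (deep)
    is_boilerplate: bool   # assumed spam/boilerplate flag


def filter_docs(
    docs: Iterable[Doc],
    topics: Set[str],
    min_reasoning_depth: int = 1,
    drop_boilerplate: bool = True,
) -> Iterator[Doc]:
    """Yield only the documents whose labels satisfy every condition."""
    for doc in docs:
        if doc.topic not in topics:
            continue
        if doc.reasoning_depth < min_reasoning_depth:
            continue
        if drop_boilerplate and doc.is_boilerplate:
            continue
        yield doc


# Tiny made-up sample standing in for the labeled corpus.
SAMPLE = [
    Doc("Proof that sqrt(2) is irrational", "mathematics", 4, False),
    Doc("Buy cheap calculators now!!!", "mathematics", 1, True),
    Doc("Dosage guidance Q&A", "medicine", 3, False),
    Doc("List of multiplication facts", "mathematics", 2, False),
]

# A "math dataset focused on reasoning": topic filter plus a depth threshold.
math_subset = list(filter_docs(SAMPLE, {"mathematics"}, min_reasoning_depth=3))
print([d.text for d in math_subset])
```

Because the expensive step (labeling) has already been done once, a new domain-specific corpus under this sketch is only a different set of filter arguments, not a new classifier.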

The days of rebuilding pipelines from scratch for every new use case might be behind us.

We’re excited to see what the community builds with Essential-Web—including the next generation of state-of-the-art models trained on this open, richly labeled foundation and accelerated by AMD Instinct.

Powered by AMD Instinct MI300X GPUs

Labeling 24 trillion tokens is a serious compute challenge. Essential AI ran inference across the full dataset using 512 AMD Instinct MI300X GPUs, totaling around 90,000 GPU-hours of compute.

That infrastructure helped them:

Distill a fast, 0.5 billion parameter annotator model from a larger teacher
Improve throughput by 50 times while maintaining label quality
Achieve over 96% recall on math and code domains while filtering out more than 95% of noise
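The recall and noise-filtering figures above can be read as two simple ratios. The sketch below shows one common way to compute them on a labeled evaluation set. The post does not spell out how Essential AI measured these numbers, so the definitions, function names, and toy data here are assumptions for illustration.

```python
from typing import Sequence


def recall(is_relevant: Sequence[bool], kept: Sequence[bool]) -> float:
    """Fraction of truly relevant documents that the filter kept."""
    kept_relevant = [k for rel, k in zip(is_relevant, kept) if rel]
    return sum(kept_relevant) / len(kept_relevant)


def noise_removed(is_relevant: Sequence[bool], kept: Sequence[bool]) -> float:
    """Fraction of irrelevant (noise) documents that the filter dropped."""
    kept_noise = [k for rel, k in zip(is_relevant, kept) if not rel]
    return 1 - sum(kept_noise) / len(kept_noise)


# Toy evaluation set: 4 relevant documents followed by 4 noise documents.
is_relevant = [True, True, True, True, False, False, False, False]
# Hypothetical filter decisions: keeps 3 of 4 relevant, lets 1 of 4 noise through.
kept = [True, True, True, False, False, False, False, True]

print(recall(is_relevant, kept))         # 0.75
print(noise_removed(is_relevant, kept))  # 0.75
```

A filter is useful for curation when both numbers are high at once: high recall means little good data is lost, and a high noise-removal rate means the resulting corpus is clean.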

"At Essential AI, AMD’s MI300X GPUs have been the cornerstone of our web-scale data-taxonomy pipeline. Their 192 GB of HBM lets us run thousands of concurrent requests per GPU. Across a 1024-GPU cluster, our vLLM-based inference service sustained more than 80,000 requests per second. We fine-tuned the Qwen-2.5-32B-Instruct for our taxonomy classification, and distilled it to a lightweight 0.5B parameter model for fast inference, and automatically label 24 trillion tokens spanning over 10 billion documents. The MI300X makes production-grade, planet-scale data curation not just feasible, but fast." - Yash Vanjani (Member of Technical Staff - Essential AI)

AMD is proud to have played a role in enabling this scale of open data curation.
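To put that scale in perspective, the figures quoted at the start of this section (24 trillion tokens, 512 GPUs, roughly 90,000 GPU-hours) support a quick back-of-envelope calculation. This is a rough sketch, not a reported measurement: it assumes all 512 GPUs ran concurrently and that all of the GPU-hours went to labeling the full 24 trillion tokens.

```python
# Figures taken from the post.
TOTAL_TOKENS = 24e12   # tokens labeled
NUM_GPUS = 512         # MI300X GPUs used for inference
GPU_HOURS = 90_000     # approximate total GPU-hours

# Assumption: all GPUs run in parallel for the whole job.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Assumption: every GPU-hour was spent labeling the 24T tokens.
tokens_per_gpu_second = TOTAL_TOKENS / (GPU_HOURS * 3600)

print(f"Wall-clock time: about {wall_clock_hours:.0f} hours ({wall_clock_days:.1f} days)")
print(f"Average throughput: about {tokens_per_gpu_second:,.0f} tokens per GPU-second")
```

Under those assumptions, the job works out to roughly a week of wall-clock time, with each GPU processing on the order of tens of thousands of tokens per second on average.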

Clean Filters. Strong Results.

To test the usefulness of the dataset, Essential AI used these filters to build benchmark datasets in math, web code, medical, and STEM. They then trained 2.3-billion-parameter models on them and compared the results to the best open-source baselines.

Domain	Improvement vs. Baseline
Math	Slightly below
Web Code	+14 percent
STEM	+25 percent
Medical	+9 percent

These gains came without hand-built pipelines — just smart filtering and efficient infrastructure.

*All performance claims mentioned herein are provided by Essential AI and have not been independently verified by AMD. Performance benefits are impacted by a variety of variables. Results herein are specific to Essential AI and may not be typical. GD-181a.

Ready to Use

If you want to try it yourself:

Article By

Karim Bhalwani
