Data Curation Just Got Smarter: How Essential AI Used AMD Instinct GPUs to Label the Web for Smarter AI

Jun 25, 2025

Essential AI - Graphic showing AI image representing Math, Web Code, STEM, and Medical

Introduction

In our previous post, "Pretraining Just Got Smarter: How Essential AI’s Research Redefines the Role of Reasoning in Model Development," we looked at how Essential AI is redefining model reasoning through smarter pretraining techniques. Smarter models start with smarter data, and that’s where Essential AI’s latest work takes a major leap forward.

Their new dataset, Essential-Web v1.0, is a massive, fully labeled resource offering researchers instant access to clean, richly annotated training data across science, medicine, code, and more. Built from 24 trillion tokens and annotated using a rich taxonomy, it enables fast, domain-specific data extraction and simplifies the process of AI training and curation. And it’s all made possible by AMD Instinct™ MI300X GPUs, which deliver the scale, speed, and efficiency required to make this vision real.

Why This Dataset Matters

Most large-scale datasets today fall into two categories. Some are massive but opaque, filtered with black-box models and vague quality scores. Others are carefully curated for specific domains, often requiring custom pipelines, which are expensive and time-consuming to rebuild.

Essential-Web offers a new approach. With Essential-Web, you can build your own dataset in under 15 minutes — no classifiers, no scraping, just filters.

By applying a consistent taxonomy to 23.6 billion web documents, researchers can now use simple filters to extract domain-specific corpora in minutes. For example:

  •  A math dataset focused on reasoning and correctness
  •  A medical QA corpus with high-quality source material
  •  A web code dataset with spam and boilerplate removed
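
To make filter-based curation concrete, here is a minimal Python sketch. The record layout and the field names (`topic`, `reasoning_depth`, `is_spam`) are hypothetical stand-ins, not the actual Essential-Web schema, so check the dataset's own documentation for the real taxonomy fields.

```python
from typing import Callable, Iterable, Iterator

Doc = dict
Predicate = Callable[[Doc], bool]

# Hypothetical labeled records: field names and values are illustrative
# assumptions, not the real Essential-Web taxonomy.
DOCS = [
    {"id": 1, "topic": "math", "reasoning_depth": 4, "is_spam": False},
    {"id": 2, "topic": "math", "reasoning_depth": 1, "is_spam": False},
    {"id": 3, "topic": "medicine", "reasoning_depth": 3, "is_spam": False},
    {"id": 4, "topic": "code", "reasoning_depth": 3, "is_spam": True},
    {"id": 5, "topic": "code", "reasoning_depth": 2, "is_spam": False},
]


def select(docs: Iterable[Doc], *predicates: Predicate) -> Iterator[Doc]:
    """Yield the documents that satisfy every predicate (logical AND)."""
    for doc in docs:
        if all(pred(doc) for pred in predicates):
            yield doc


# A math subset that keeps only documents labeled with deeper reasoning.
math_reasoning = list(
    select(
        DOCS,
        lambda d: d["topic"] == "math",
        lambda d: d["reasoning_depth"] >= 3,
    )
)

# A code subset with spam removed.
clean_code = list(
    select(DOCS, lambda d: d["topic"] == "code", lambda d: not d["is_spam"])
)

print([d["id"] for d in math_reasoning])
print([d["id"] for d in clean_code])
```

Because the labels are already attached to each document, building a new corpus comes down to composing predicates like these rather than training a new classifier.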

The days of rebuilding pipelines from scratch for every new use case might be behind us.

We’re excited to see what the community builds with Essential-Web—including the next generation of state-of-the-art models trained on this open, richly labeled foundation and accelerated by AMD Instinct.

Powered by AMD Instinct MI300X GPUs

Labeling 24 trillion tokens at this scale is a serious compute challenge. Essential ran inference across the full dataset using 512 AMD Instinct MI300X GPUs, totaling around 90,000 GPU-hours of compute.
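
A quick back-of-the-envelope calculation puts those figures in perspective. The sketch below uses only the numbers quoted above and assumes, purely for illustration, that all 512 GPUs ran concurrently with no idle time. The source figures are themselves approximate ("around 90,000"), so the results are rough estimates.

```python
# Figures quoted in the post.
TOTAL_TOKENS = 24e12   # 24 trillion tokens labeled
NUM_GPUS = 512         # AMD Instinct MI300X GPUs used for inference
GPU_HOURS = 90_000     # approximate total compute

# Assumption: every GPU is busy for the whole run (perfect utilization).
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Average labeling throughput implied by the totals.
tokens_per_gpu_hour = TOTAL_TOKENS / GPU_HOURS
tokens_per_gpu_second = tokens_per_gpu_hour / 3600

print(f"Wall-clock time: {wall_clock_hours:.0f} hours ({wall_clock_days:.1f} days)")
print(f"Average throughput: {tokens_per_gpu_second:,.0f} tokens per GPU per second")
```

Under these assumptions the run works out to roughly 176 hours, a little over a week of wall-clock time, at an average of about 74,000 tokens labeled per GPU per second.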

That infrastructure helped them:

  • Distill a fast, 0.5 billion parameter annotator model from a larger teacher
  • Improve throughput by 50 times while maintaining label quality
  • Achieve over 96% recall on math and code domains while filtering out more than 95% of noise
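
The recall and noise-filtering figures can be read as standard classification metrics. The sketch below assumes that recall means the share of truly in-domain documents the filter keeps, and that noise filtering means the share of out-of-domain documents it discards. The post does not spell out Essential AI's exact evaluation protocol, and the toy labels are made up for illustration.

```python
def recall_and_noise_removed(is_relevant, is_kept):
    """Return (recall, fraction of noise removed) for a keep/discard filter.

    is_relevant[i] is True if document i truly belongs to the target domain.
    is_kept[i] is True if the filter kept document i.
    """
    if len(is_relevant) != len(is_kept):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(is_relevant, is_kept))
    kept_relevant = sum(1 for r, k in pairs if r and k)
    missed_relevant = sum(1 for r, k in pairs if r and not k)
    removed_noise = sum(1 for r, k in pairs if not r and not k)
    kept_noise = sum(1 for r, k in pairs if not r and k)

    recall = kept_relevant / (kept_relevant + missed_relevant)
    noise_removed = removed_noise / (removed_noise + kept_noise)
    return recall, noise_removed


# Toy example: 4 in-domain documents and 5 noise documents.
relevant = [True, True, True, True, False, False, False, False, False]
kept = [True, True, True, False, False, False, False, False, True]

recall, noise_removed = recall_and_noise_removed(relevant, kept)
print(f"recall={recall:.2f}, noise removed={noise_removed:.2f}")
```

In this toy case the filter keeps 3 of 4 in-domain documents (recall 0.75) and discards 4 of 5 noise documents (0.80). The figures reported above correspond to both ratios exceeding 0.95 on the math and code domains.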

"At Essential AI, AMD’s MI300X GPUs have been the cornerstone of our web-scale data-taxonomy pipeline. Their 192 GB of HBM lets us run thousands of concurrent requests per GPU. Across a 1024-GPU cluster, our vLLM-based inference service sustained more than 80,000 requests per second. We fine-tuned the Qwen-2.5-32B-Instruct for our taxonomy classification, and distilled it to a lightweight 0.5B parameter model for fast inference, and automatically label 24 trillion tokens spanning over 10 billion documents. The MI300X makes production-grade, planet-scale data curation not just feasible, but fast." - Yash Vanjani (Member of Technical Staff - Essential AI)

AMD is proud to have played a role in enabling this scale of open data curation.

Clean Filters. Strong Results.

To test the usefulness of the dataset, Essential used their filters to build benchmark datasets in math, web code, medical, and STEM. They then trained 2.3 billion parameter models and compared them to the best open-source baselines.

Domain      Improvement vs. Baseline
Math        Slightly below
Web Code    +14 percent
STEM        +25 percent
Medical     +9 percent

These gains came without hand-built pipelines — just smart filtering and efficient infrastructure. 

*All performance claims mentioned herein are provided by Essential AI and have not been independently verified by AMD. Performance benefits are impacted by a variety of variables. Results herein are specific to Essential AI and may not be typical. GD-181a.

Ready to Use

If you want to try it yourself:
