Large Language Diffusion Models



Renmin University of China · Ant Group


TL;DR: We introduce LLaDA, a diffusion model with an unprecedented 8B scale, trained entirely from scratch, rivaling LLaMA3 8B in performance.

What is now proved was once only imagined. – William Blake


Motivation

We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning,
conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from
the core principle of generative modeling: approximating the true language distribution through maximum
likelihood estimation.
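For reference, that principle in symbols (standard textbook material, not notation specific to the paper): maximizing likelihood is equivalent to minimizing the KL divergence between the data distribution and the model, and nothing in this objective requires the model to factorize left-to-right.

\[
\max_{\theta}\;\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\theta}(x)\right]
\quad\Longleftrightarrow\quad
\min_{\theta}\;\mathrm{KL}\!\left(p_{\text{data}}\,\|\,p_{\theta}\right)
\]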


We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the capabilities listed above.

Method

LLaDA is a masked diffusion model [1, 2] that follows the standard pretraining and SFT pipeline while sampling via diffusion. During pretraining, tokens are masked independently at a random ratio t ~ U[0, 1]; during SFT, only response tokens may be masked. At inference, the model simulates a diffusion process from fully masked (t = 1) to fully unmasked (t = 0), predicting all masked tokens simultaneously at each step and applying a flexible remasking strategy.
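To make the recipe concrete, here is a minimal PyTorch sketch of both training and sampling. It is an illustration under stated assumptions, not LLaDA's actual implementation: `model` stands for any Transformer mask predictor returning per-position logits, `MASK_ID` is a placeholder for the reserved mask token, and low-confidence remasking is just one of the strategies the flexible remasking step admits.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the reserved mask token (assumption)

def pretraining_loss(model, x0):
    """Monte Carlo estimate of the masked-diffusion loss: draw a ratio
    t ~ U[0, 1], mask each token independently with probability t,
    predict all masked tokens in one forward pass, and weight the
    cross-entropy on masked positions by 1/t."""
    b, L = x0.shape
    t = torch.rand(b, 1, device=x0.device).clamp(min=1e-3)  # per-sequence ratio
    masked = torch.rand(b, L, device=x0.device) < t         # forward process
    xt = x0.masked_fill(masked, MASK_ID)
    logits = model(xt)                                      # (b, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    # SFT reuses this loss with one change: `masked` is drawn over
    # response tokens only, so prompt tokens are never masked.
    return (ce * masked / t).sum() / (b * L)

@torch.no_grad()
def sample(model, prompt, gen_len, steps):
    """Reverse process for a single prompt (batch size 1): start fully
    masked (t = 1), predict every masked token at each step, then remask
    the least-confident predictions so the mask count decays to zero."""
    pad = torch.full((1, gen_len), MASK_ID, device=prompt.device)
    x = torch.cat([prompt, pad], dim=1)
    for step in range(steps):
        masked = x == MASK_ID
        n_masked = int(masked.sum())
        if n_masked == 0:
            break
        probs = model(x).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        x = torch.where(masked, pred, x)    # tentatively unmask everything
        t = 1 - step / steps                # noise level before this step
        s = 1 - (step + 1) / steps          # noise level after this step
        n_remask = int(n_masked * s / t)    # keep a fraction s/t masked
        if n_remask > 0:
            conf = conf.masked_fill(~masked, float("inf"))  # protect known tokens
            idx = conf.topk(n_remask, dim=-1, largest=False).indices
            x = x.scatter(1, idx, MASK_ID)
    return x
```

With `steps` equal to `gen_len`, this linear schedule reveals exactly one token per step; fewer steps reveal several tokens at once, trading quality for speed.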


Scalability

LLaDA demonstrates impressive scalability: its overall trend is highly competitive with that of an autoregressive baseline trained on the same data.

[Demo: a text generation process different from the traditional left-to-right approach. Prompt: "Explain what artificial intelligence is."]

Case Studies

Math problem

User

Emma decides to run 4 sprints 5 times a week. She runs 75 meters each sprint. How many total meters does she run a week?

LLaDA-Instruct

Emma runs 4 sprints each time, so she runs a total of 4 x 5 = 20 sprints a week. Each sprint is 75 meters long, so Emma runs a total of 20 x 75 = 1500 meters a week. The final result is 1500.

Daily life

User

Please recommend me three famous movies

LLaDA-Instruct

Certainly! Here are three famous movies that have achieved widespread popularity and critical acclaim:

1. The Shawshank Redemption (1994) – Inspired by a novella and directed by Frank Darabont, this film tells the story of Andy Dufresne, a man convicted of murdering his wife, dealing with his harsh life in Shawshank Prison and forming an unlikely friendship with a fellow inmate.

2. The Godfather (1972) – Directed by Francis Ford Coppola, this film is based on Mario Puzo’s novel of the same name. It follows the story of the Corleone family, particularly Michael Corleone, who rises to power and takes over the family’s criminal empire.

3. The Empire Strikes Back (1980) – Directed by George Lucas, this film continues the story of a young Skywalker in the Star Wars saga. It follows Luke Skywalker as he helps the Rebel Alliance fight against the Galactic Empire and destroy the powerful weapon, the Death Star.

All of these films are celebrated for their cinematic achievements, powerful storytelling, and memorable characters.

Code generation

User

I am writing python code, I have an array, please
help me sort it from largest to smallest

LLaDA-Instruct

Certainly! You can sort an array from largest to smallest in Python using the `sort()` method or the `sorted()` function. Here's an example:

### Using `sort()` method:

```python
my_array = [3, 1, 4, 1, 5, 2]
my_array.sort(reverse=True)

print(my_array)
```
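The response mentions `sorted()` but only demonstrates `sort()`. For completeness, a quick sketch of the `sorted()` variant (an editorial addition, not part of the model's output):

```python
my_array = [3, 1, 4, 1, 5, 2]
desc = sorted(my_array, reverse=True)  # returns a new list
print(desc)      # [5, 4, 3, 2, 1, 1]
print(my_array)  # [3, 1, 4, 1, 5, 2] -- original unchanged
```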

Translation (English → Chinese)

User

Please help me translate into Chinese: ‘What is now proved was once only imagined’

LLaDA-Instruct

现在被证明的曾经只是想象

Translation (English → German)

User

Please help me translate into German: ‘What is now proved was once only imagined’

LLaDA-Instruct

Wer heute bestätigt ist, war einst nur immaginiert.

Multi-turn dialogue case 1

User

Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken' are?

LLaDA-Instruct

Certainly! The initial two lines of the renowned poem “The Road Not Taken” by Robert Frost are: “Two roads diverged in a yellow wood,
And sorry I could not travel both.” These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.

User

Please help me translate into Chinese.

LLaDA-Instruct

Sure! The first two lines of “The Road Not Taken” by Robert Frost can be translated into Chinese as:
“两条路分岔在黄色的树林中,遗憾我不能同时走”

User

Please translate into German.

LLaDA-Instruct

Sure! The first two lines of “The Road Not Taken” by Robert Frost can be translated into German as:
“Zwei Wege trennten sich im gelben Wald, und leider konnte ich nicht beide tun.”

User

Please also write a poem about life choices. I require 5 sentences, and the first word of each sentence must start with the letter C.

LLaDA-Instruct

Certainly! Here’s a poem about life choices, with each sentence starting with the letter “C”:
Choices cascade through life’s journey, Creating connections with each decision, Careful
consideration leads to better paths, Courageous choices shape our destiny, Choosing wisely guides us through the years.

Multi-turn dialogue case 2

User

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.
How many clips did Natalia sell altogether in April and May?

LLaDA-Instruct

Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May.
Altogether, Natalia sold 48 + 24 = 72 clips in April and May. The final result is 72.

User

Natalia sold three times as many items in June as she did in May. How many clips did Natalia sell altogether in these three months?

LLaDA-Instruct

Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May.
Natalia sold three times as many items in June as she did in May, so she sold 24 * 3 = 72 clips in June.
Altogether, Natalia sold 48 + 24 + 72 = 144 clips in these three months. The final result is 144.

BibTeX

@misc{nie2025largelanguagediffusionmodels,
title={Large Language Diffusion Models},
author={Shen Nie and Fengqi Zhu and Zebin You and Xiaolu Zhang and Jingyang Ou and Jun Hu and Jun Zhou and Yankai Lin and Ji-Rong Wen and Chongxuan Li},
year={2025},
eprint={2502.09992},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.09992},
}

References

[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z.,
and Li, C. Your absorbing discrete diffusion secretly
models the conditional distributions of clean data.
arXiv preprint arXiv:2406.03736, 2024.
