Three key designs make FlashAR both fast and easy to adapt:
- Intermediate Branching. The vertical head is attached at an upper-intermediate layer, where features are still semantically rich and not yet specialized to horizontal prediction, rather than at the final layer. Both branches run concurrently, adding no depth to the critical path (a minimal sketch follows this list).
- Learnable Fusion Gate. A lightweight MLP-based gate adaptively balances the horizontal and vertical logits at each spatial position, avoiding the blurring artifacts of naive averaging (see the gate sketch below).
- Two-Stage Adaptation. Stage 1 freezes the backbone and trains only the new vertical head, giving it a stable initialization; Stage 2 fine-tunes all parameters jointly. This keeps post-training stable and data-efficient (a training-loop sketch follows).
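
To make the intermediate branching concrete, here is a minimal PyTorch sketch of a decoder with a tap at an upper-intermediate layer. The class and attribute names (`DualHeadDecoder`, `branch_at`) and the choice of branch index are illustrative assumptions, not FlashAR's published configuration:

```python
import torch
import torch.nn as nn

class DualHeadDecoder(nn.Module):
    """Intermediate branching: a second (vertical) head taps an
    upper-intermediate layer instead of the final one. Names and the
    branch index are illustrative assumptions, not FlashAR's spec."""

    def __init__(self, layers: nn.ModuleList, branch_at: int, d_model: int, vocab: int):
        super().__init__()
        self.layers = layers
        self.branch_at = branch_at  # e.g. ~3/4 of the way up the stack
        self.vertical_head = nn.Linear(d_model, vocab)
        self.horizontal_head = nn.Linear(d_model, vocab)

    def forward(self, x: torch.Tensor):
        v_logits = None
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.branch_at:
                # Branch here: features are still generic, and the vertical
                # head hangs off a side tap, adding no depth to the main path.
                v_logits = self.vertical_head(x)
        h_logits = self.horizontal_head(x)
        return h_logits, v_logits

# Usage with a toy 12-layer stack, branching after layer 8:
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    for _ in range(12)
)
model = DualHeadDecoder(layers, branch_at=8, d_model=512, vocab=50304)
```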
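
The fusion gate can be sketched as a small MLP that produces a per-position mixing weight. Everything here (names, hidden width, gating on the concatenated branch features) is an assumption about one plausible realization, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Per-position gate blending horizontal and vertical logits.
    Hypothetical sketch; inputs and sizes are assumptions."""

    def __init__(self, d_model: int, hidden: int = 128):
        super().__init__()
        # Lightweight MLP: maps concatenated branch features to a
        # scalar mixing weight in (0, 1) for each position.
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h_feat, v_feat, h_logits, v_logits):
        # h_feat, v_feat: (B, T, d_model); *_logits: (B, T, vocab)
        g = torch.sigmoid(self.mlp(torch.cat([h_feat, v_feat], dim=-1)))  # (B, T, 1)
        # A learned convex combination per position, rather than the
        # uniform 0.5/0.5 average that blurs predictions.
        return g * h_logits + (1.0 - g) * v_logits
```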
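
And a minimal sketch of the two-stage recipe. The `model.backbone`, `model.vertical_head`, `model.gate`, and `model.loss(batch)` attributes are hypothetical handles, assumed only for illustration:

```python
import torch

def two_stage_adaptation(model, stage1_loader, stage2_loader):
    """Hypothetical two-stage recipe; attribute names and learning
    rates are assumptions, not FlashAR's published hyperparameters."""
    # Stage 1: freeze the backbone; train only the new vertical head
    # and gate so their random initialization cannot destabilize it.
    for p in model.backbone.parameters():
        p.requires_grad = False
    new_params = list(model.vertical_head.parameters()) + list(model.gate.parameters())
    opt1 = torch.optim.AdamW(new_params, lr=1e-4)
    for batch in stage1_loader:
        loss = model.loss(batch)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: unfreeze everything and fine-tune jointly at a lower LR.
    for p in model.parameters():
        p.requires_grad = True
    opt2 = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for batch in stage2_loader:
        loss = model.loss(batch)
        opt2.zero_grad(); loss.backward(); opt2.step()
```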
At inference time, FlexAttention compiles the sparse diagonal masks on the fly, and KV caches are updated in batched operations, translating the theoretical parallelism into real wall-clock gains.
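
Below is a hedged sketch of how such a diagonal mask can be expressed with PyTorch's FlexAttention; it covers only the mask side, not the batched KV-cache updates. `flex_attention` and `create_block_mask` are the real API, but the token-to-diagonal mapping (a 16x16 grid, serialized row-major, with diagonal = row + col) is an assumption about one plausible layout:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Assumed layout: a 16x16 token grid where each token's diagonal
# index is row + col. FlashAR's actual mapping may differ.
side, T = 16, 16 * 16
idx = torch.arange(T, device="cuda")
diag_id = idx // side + idx % side  # diagonal of each token

def diagonal_mask(b, h, q_idx, kv_idx):
    # A query attends only to tokens on its own or earlier diagonals,
    # so every token on one diagonal can be decoded in parallel.
    return diag_id[kv_idx] <= diag_id[q_idx]

# The predicate is compiled once into a sparse block layout;
# fully masked blocks are skipped entirely at runtime.
block_mask = create_block_mask(diagonal_mask, B=None, H=None, Q_LEN=T, KV_LEN=T)

q = torch.randn(1, 8, T, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# torch.compile fuses the mask into the attention kernel,
# which is where the wall-clock gains come from.
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, block_mask=block_mask)
```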