FlashAR: Efficient Post-Training Acceleration
for Autoregressive Image Generation

¹Zhejiang University, China    ²University of Adelaide, Australia
* Equal contribution    † Project lead    ‡ Corresponding authors

Generated samples from FlashAR. The first row shows 512×512 text-guided generation results (Emu3.5-FlashAR), while the second row presents class-conditional generation at 384×384 and 256×256 (LlamaGen-FlashAR).

22.9× wall-clock speedup    ·    80K training images    ·    80.29 GenEval score

Abstract

FlashAR is a lightweight post-training framework that transforms a pre-trained raster-scan autoregressive model into a highly parallel generator. Instead of training from scratch or altering the prediction objective, FlashAR introduces a vertical head branched from an intermediate layer alongside the original horizontal head, fused by a learnable gate—enabling diagonal-parallel decoding that reduces serial steps from H×W to H+W−1. With only 0.05% of the original training data, FlashAR achieves up to 22.9× wall-clock speedup on Emu3.5-34B at 512×512 resolution, while preserving generation quality.
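The step-count reduction follows from grouping all positions (i, j) with the same anti-diagonal index i+j into one parallel decoding step. A minimal, illustrative sketch (`diagonal_schedule` is a hypothetical helper, not part of any released code):

```python
# Illustrative sketch: anti-diagonal decoding schedule for an H x W token grid.
# All tokens on the same anti-diagonal (constant i + j) are generated in one
# parallel step, so serial steps drop from H * W to H + W - 1.

def diagonal_schedule(H: int, W: int) -> list[list[tuple[int, int]]]:
    """Group grid positions (row, col) by anti-diagonal index d = row + col."""
    steps = [[] for _ in range(H + W - 1)]
    for i in range(H):
        for j in range(W):
            steps[i + j].append((i, j))
    return steps

# A 32x32 grid has 1024 tokens but only 32 + 32 - 1 = 63 anti-diagonals,
# matching the 1024 vs. 63 decoding steps in the efficiency table.
print(len(diagonal_schedule(32, 32)))  # 63
```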

Method


Overview of FlashAR. A vertical head branches from an intermediate layer of the pre-trained AR backbone. Horizontal and vertical predictions are dynamically fused via a learnable gate, enabling parallel decoding along anti-diagonals.

Three key designs make FlashAR both fast and easy to adapt:

  • Intermediate Branching. The vertical head is attached at an upper-intermediate layer rather than the final layer, where features are still semantically rich and not yet specialized to horizontal prediction. Both branches run concurrently with no extra critical-path depth.
  • Learnable Fusion Gate. A lightweight MLP-based gate adaptively balances horizontal and vertical logits at each spatial position, avoiding the blurring artifacts of naive averaging.
  • Two-Stage Adaptation. Stage 1 freezes the backbone to initialize the vertical head; Stage 2 jointly fine-tunes everything. This keeps post-training stable and data-efficient.
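The gated fusion can be sketched in PyTorch as follows. The gate architecture, dimensions, and branching point here are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class GatedDualHead(nn.Module):
    """Sketch of FlashAR-style dual-head fusion: a vertical head branched
    from an intermediate layer, combined with the original horizontal
    logits via a position-wise learnable gate. Dimensions are illustrative."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.vertical_head = nn.Linear(d_model, vocab_size)  # branched head
        # Lightweight MLP gate conditioned on both branch features.
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.SiLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),
        )

    def forward(self, h_final, h_mid, horizontal_logits):
        # h_final, h_mid: (B, N, d_model); horizontal_logits: (B, N, vocab)
        vertical_logits = self.vertical_head(h_mid)
        g = self.gate(torch.cat([h_final, h_mid], dim=-1))  # (B, N, 1)
        # Convex combination avoids the blurring of naive averaging.
        return g * horizontal_logits + (1 - g) * vertical_logits
```

In Stage 1 only `vertical_head` and `gate` would receive gradients (backbone frozen); Stage 2 unfreezes everything.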

At inference time, FlexAttention compiles sparse diagonal masks on-the-fly and KV caches are updated in batched operations, translating the theoretical parallelism into real wall-clock gains.
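FlexAttention consumes a boolean predicate over (query, key) indices and compiles it into a sparse block mask. A sketch of one plausible diagonal-causal predicate, assuming raster-order token layout on an H×W grid (the exact mask FlashAR uses is not specified here):

```python
# Sketch of a diagonal-causal mask predicate for tokens in raster order on
# an H x W grid: a query may attend to itself and to any key on a strictly
# earlier anti-diagonal -- the visibility implied by anti-diagonal decoding.
# The real FlashAR mask may differ; this only illustrates the mask_mod-style
# predicate interface that FlexAttention compiles into a sparse block mask.

W = 32  # grid width (tokens per row); illustrative

def diag_index(flat_idx: int, width: int = W) -> int:
    row, col = divmod(flat_idx, width)
    return row + col

def diagonal_causal_mask(q_idx: int, kv_idx: int) -> bool:
    """True where attention is allowed."""
    return diag_index(kv_idx) < diag_index(q_idx) or kv_idx == q_idx
```

Tokens on the same anti-diagonal do not attend to each other, which is what lets the whole diagonal be decoded in one batched step with a single KV-cache update.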

Results

ImageNet 256×256 Class-conditional Generation

| Size | Method | Type | Epochs | FID↓ | IS↑ | P/R-F1↑ | Steps | img/s |
|---|---|---|---|---|---|---|---|---|
| B (~120M) | LlamaGen | From scratch | 300 | 5.46 | 193.6 | 0.594 | 256 | 117.9 |
| | PAR | From scratch | 300 | 6.21 | 204.4 | 0.537 | 67 | 174.1 |
| | NAR | From scratch | 300 | 4.65 | 212.3 | 0.600 | 31 | 419.7 |
| | BlockDiffusion | Post-training | 75 | 5.91 | 176.2 | 0.589 | 64 | 186.3 |
| | FlashAR | Post-training | 25 | 4.68 | 208.3 | 0.605 | 31 | 447.2 |
| L (~360M) | LlamaGen | From scratch | 300 | 3.80 | 248.3 | 0.639 | 256 | 47.1 |
| | PAR | From scratch | 300 | 4.32 | 189.4 | 0.576 | 67 | 93.8 |
| | NAR | From scratch | 300 | 3.06 | 263.9 | 0.641 | 31 | 195.4 |
| | VAR | From scratch | 200 | 3.30 | 274.4 | 0.634 | 10 | 129.3 |
| | BlockDiffusion | Post-training | 75 | 4.55 | 243.5 | 0.645 | 64 | 103.2 |
| | FlashAR | Post-training | 25 | 3.16 | 289.0 | 0.656 | 31 | 224.7 |
| XL (~700M) | LlamaGen | From scratch | 300 | 3.39 | 227.1 | 0.648 | 256 | 23.7 |
| | PAR | From scratch | 300 | 3.50 | 234.4 | 0.619 | 67 | 53.9 |
| | NAR | From scratch | 300 | 2.70 | 277.5 | 0.676 | 31 | 98.1 |
| | BlockDiffusion | Post-training | 75 | 4.13 | 258.6 | 0.654 | 64 | 41.7 |
| | FlashAR | Post-training | 25 | 2.94 | 293.7 | 0.672 | 31 | 109.3 |
| XXL (~1.4B) | LlamaGen | From scratch | 300 | 3.09 | 253.6 | 0.647 | 256 | 14.1 |
| | PAR | From scratch | 300 | 3.20 | 288.3 | 0.632 | 67 | 33.9 |
| | NAR | From scratch | 300 | 2.58 | 293.5 | 0.673 | 31 | 56.9 |
| | BlockDiffusion | Post-training | 75 | 3.78 | 264.9 | 0.652 | 64 | 26.8 |
| | FlashAR | Post-training | 25 | 2.79 | 289.4 | 0.690 | 31 | 63.4 |

Emu3.5-Image-34B — Inference Efficiency

| Method | Type | Training Steps | Training Data | Latency (s)↓ | Decoding Steps |
|---|---|---|---|---|---|
| Emu3.5-Image | From scratch | 940K | 150B | 130.10 | 1024 |
| BlockDiffusion | Post-training | 50K | 80M | 6.17 | 64 |
| FlashAR | Post-training | 50K | 80M | 5.68 | 63 |

FlashAR's 5.68 s per image versus the raster-scan baseline's 130.10 s yields the reported 22.9× wall-clock speedup (130.10 / 5.68 ≈ 22.9).

Emu3.5-Image-34B — GenEval (512×512)

| Method | Overall | Single Obj | Two Obj | Counting | Colors | Position | Color Attr |
|---|---|---|---|---|---|---|---|
| Emu3.5-Image | 80.48 | 100.00 | 94.95 | 53.75 | 90.96 | 73.00 | 70.25 |
| BlockDiffusion | 73.83 | 96.88 | 88.89 | 47.50 | 85.64 | 68.00 | 58.44 |
| FlashAR | 80.29 | 98.75 | 91.92 | 53.75 | 92.55 | 80.00 | 64.00 |

Visualizations

Text-guided Image Generation (Emu3.5-FlashAR)


Class-conditional Generation (FlashAR-XXL, ImageNet 256×256)

Classes shown: 387 (lesser panda), 90 (lorikeet), 250 (Siberian husky), 933 (cheeseburger), 437 (beacon), 979 (valley), 985 (daisy).

BibTeX

@article{zhou2026flashar,
  title={FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation},
  author={Zhou, Junkang and He, Yefei and Chen, Feng and Wang, Weijie and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2605.09430},
  year={2026}
}