Arena-T2I Hard

Benchmarking and Improving Faithfulness with a Dependency-Aware Checklist

Arena Intelligence & UCLA
Yuanhao Ban¹, Tong Xie², Sohyun An², Yunqi Hong², Evan Frick¹, I-Hung Hsu¹, Wei-Lin Chiang¹, Ion Stoica¹, Cho-Jui Hsieh¹
¹Arena Intelligence Inc   ²UCLA

Abstract

Existing text-to-image (T2I) faithfulness benchmarks rely on atomic instructions that top systems already satisfy near-perfectly, so a single binary VLM-judge score no longer reveals which constraints a model fails. We introduce Arena-T2I Hard, a 310-prompt stress benchmark drawn from real arena T2I logs, with ~30 decomposed yes/no constraints per prompt across six categories. The strongest system we evaluate reaches only 0.855, with a 33 pp spread across 11 systems. High public-arena rankings also fail to predict faithfulness, confirming that holistic preference scores reward aesthetics over fine-grained prompt adherence.

The dependency-aware checklist that scores the benchmark also serves as a post-training reward. Combined with a Bradley-Terry aesthetic reward via group-decoupled normalization (GDPO), it improves faithfulness and aesthetics on SD3.5-Medium and FLUX.1-dev, beating every single-reward, naive weighted-sum, and 4-reward-ensemble baseline.

Drawbacks of Current Benchmarks

Existing faithfulness benchmarks rely on synthetic templates, LLM rewriting, or short curated captions. Even the longest-form predecessor (DPG-Bench) averages only ~70 words per prompt, and top systems already score over 95% on these benchmarks. Arena-T2I Hard instead draws from real T2I-arena user votes and selects for compositional difficulty, with ~430 words and ~30 decomposed yes/no questions per prompt.

Existing faithfulness benchmarks vs. Arena-T2I Hard

The Benchmark

Arena-T2I Hard

310 prompts sampled from a public T2I arena leaderboard (Jan–Mar 2026): real user requests, filtered and hand-selected for compositional difficulty and balanced across six visual-style categories.

  • 310 stress-test prompts
  • ~430 words / prompt (~6× DPG)
  • ~30 yes/no questions / prompt
  • 5 max dependency depth
  • 33 pp spread across 11 systems
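
To make the "max dependency depth" statistic concrete, here is a minimal sketch of how such a depth could be measured on a checklist DAG. It assumes depth means the number of questions on the longest dependency chain, with a root counted as depth 1; the page does not state which convention is used. The function name `max_dependency_depth` and the small `deps` mapping (taken from a few questions of the benchmark example on this page) are illustrative.

```python
from functools import lru_cache
from typing import Dict, List


def max_dependency_depth(deps: Dict[str, List[str]]) -> int:
    """Length of the longest parent chain in a checklist DAG.

    Assumption: a root question has depth 1, and each dependent question
    is one level deeper than its deepest parent.
    """

    @lru_cache(maxsize=None)
    def depth(q: str) -> int:
        # A question with no parents gets 1 + 0 = 1.
        return 1 + max((depth(p) for p in deps[q]), default=0)

    return max(depth(q) for q in deps)


# Subset of the example checklist: person -> gown -> gown colour.
deps = {
    "q0": [],            # Is there a person?
    "q1": [],            # Is there a throne-like chair?
    "q2": ["q0", "q1"],  # Is the person seated on the chair?
    "q3": ["q0"],        # Is the person wearing a gown?
    "q4": ["q3"],        # Is the gown pink?
}
print(max_dependency_depth(deps))  # 3 (q0 -> q3 -> q4)
```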

Leaderboard: 11 frontier closed-source systems

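| #  | Model                                | Arena-T2I Hard ↑ | DPG-Bench ↑ | DSG ↑ | Arena rank | Arena score |
|----|--------------------------------------|------------------|-------------|-------|------------|-------------|
| 1  | gemini-3-pro-image-preview-2k        | 0.855            | 0.970       | 0.946 | #3         | 1244±4      |
| 2  | grok-imagine-image-20260306          | 0.849            | 0.965       | 0.934 | #8         | 1170±4      |
| 3  | gpt-image-1.5-high-fidelity          | 0.796            | 0.970       | 0.942 | #4         | 1242±4      |
| 4  | recraft-v4                           | 0.787            | 0.956       | 0.914 | #29        | 1109±5      |
| 5  | wan2.6-t2i-v2                        | 0.768            | 0.954       | 0.917 | #21        | 1132±4      |
| 6  | gemini-2.5-flash-image (nano-banana) | 0.768            | 0.954       | 0.921 | #14        | 1152±3      |
| 7  | gpt-image-1                          | 0.722            | 0.959       | 0.938 | #27        | 1115±3      |
| 8  | imagen-4.0-ultra-generate-001        | 0.680            | 0.961       | 0.928 | #17        | 1148±4      |
| 9  | imagen-4.0-generate-001              | 0.659            | 0.947       | 0.913 | #22        | 1130±3      |
| 10 | hunyuan-image-3.0-fal                | 0.609            | 0.947       | 0.837 | #15        | 1151±3      |
| 11 | ideogram-v3-quality                  | 0.523            | 0.894       | 0.855 | #42        | 1049±4      |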
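
As a quick arithmetic check on the leaderboard, the sketch below recomputes the best-to-worst spread on each benchmark from the scores listed above. The variable names are illustrative; the numbers are copied from the leaderboard rows in order.

```python
# Scores copied from the leaderboard, rows 1 to 11.
scores = {
    "Arena-T2I Hard": [0.855, 0.849, 0.796, 0.787, 0.768, 0.768,
                       0.722, 0.680, 0.659, 0.609, 0.523],
    "DPG-Bench": [0.970, 0.965, 0.970, 0.956, 0.954, 0.954,
                  0.959, 0.961, 0.947, 0.947, 0.894],
    "DSG": [0.946, 0.934, 0.942, 0.914, 0.917, 0.921,
            0.938, 0.928, 0.913, 0.837, 0.855],
}

# Spread in percentage points between the best and worst system.
spread_pp = {
    name: round((max(vals) - min(vals)) * 100, 1)
    for name, vals in scores.items()
}

for name, pp in spread_pp.items():
    print(f"{name}: {pp} pp")
# Arena-T2I Hard: 33.2 pp
# DPG-Bench: 7.6 pp
# DSG: 10.9 pp
```

The same 11 systems that sit within about 8 to 11 pp of each other on DPG-Bench and DSG are separated by roughly 33 pp on Arena-T2I Hard.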

How We Score: Dependency-Aware Checklist

A frozen LLM decomposer maps each prompt to a DAG of yes/no questions, and a VLM judge answers them in BFS order. If a parent fails, its descendants are scored NO without a further VLM call. This avoids the inflated scores that flat checklists produce when an attribute fires on the wrong object. The faithfulness score is the fraction of all questions answered YES, so auto-failed descendants count against it.

“A red cat sitting on a blue chair in a garden”
q0  Is there a cat? (root)  → YES
q1  Is the cat red? (depends_on: q0)  → YES
q2  Is there a chair? (root)  → NO
q5  Is there a garden? (root)  → YES
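
The scoring rule described above can be sketched as a parents-first traversal of the question DAG. This is a minimal illustration, not the authors' implementation: `score_checklist`, `faithfulness`, and the `judge` callable are hypothetical names, the judge here is a lookup table standing in for a real VLM call, and q3 and q4 are assumed questions filling the gap in the numbering of the cat-and-chair example.

```python
from collections import deque
from typing import Callable, Dict, List


def score_checklist(
    deps: Dict[str, List[str]],
    judge: Callable[[str], bool],
) -> Dict[str, str]:
    """Answer questions parents-first; auto-fail descendants of failed parents."""
    pending = {q: len(parents) for q, parents in deps.items()}
    children: Dict[str, List[str]] = {q: [] for q in deps}
    for q, parents in deps.items():
        for p in parents:
            children[p].append(q)

    verdicts: Dict[str, str] = {}
    queue = deque(q for q, n in pending.items() if n == 0)  # roots first
    while queue:
        q = queue.popleft()
        if any(verdicts[p] != "YES" for p in deps[q]):
            verdicts[q] = "SKIP"  # counted as not satisfied, judge not called
        else:
            verdicts[q] = "YES" if judge(q) else "NO"
        for c in children[q]:
            pending[c] -= 1
            if pending[c] == 0:  # every parent has a verdict now
                queue.append(c)
    return verdicts


def faithfulness(verdicts: Dict[str, str]) -> float:
    """Yes-ratio over all questions, including skipped ones."""
    return sum(v == "YES" for v in verdicts.values()) / len(verdicts)


# "A red cat sitting on a blue chair in a garden"
deps = {
    "q0": [],            # Is there a cat?
    "q1": ["q0"],        # Is the cat red?
    "q2": [],            # Is there a chair?
    "q3": ["q2"],        # (assumed) Is the chair blue?
    "q4": ["q0", "q2"],  # (assumed) Is the cat sitting on the chair?
    "q5": [],            # Is there a garden?
}
answers = {"q0": True, "q1": True, "q2": False, "q5": True}
calls: List[str] = []


def judge(q: str) -> bool:
    calls.append(q)  # record which questions reach the judge
    return answers[q]


verdicts = score_checklist(deps, judge)
print(verdicts["q3"], verdicts["q4"])  # SKIP SKIP
print(len(calls))                      # 4 judge calls for 6 questions
print(faithfulness(verdicts))          # 0.5
```

Because the chair is missing, the two chair-dependent questions never reach the judge, so a blue object elsewhere in the image cannot earn credit for "the chair is blue".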

A Benchmark Example

One real Arena-T2I Hard prompt, scored against the image from the top-ranked system, gemini-3-pro-image-preview, which still satisfies only 23 of 27 constraints (0.85). The hairbrush and hand mirror aren’t pink (q17, q19 = No), and the shoes are hidden by the gown, so q10 is indeterminate (N/A), which auto-skips its dependent q11.

gemini-3-pro generation: a person in a pink corseted ball gown and tiara seated on a pink throne, surrounded by pink props under a dramatic spotlight
Generated by gemini-3-pro-image-preview · Arena-T2I Hard prompt #45
This image depicts a person dressed in a highly elaborate and glamorous costume, likely for a performance or themed event. Costume: a full-length, pink, corseted gown with ruffles and layers—a fitted, corset-like bodice accentuating the waist and a voluminous, ruffled skirt. Accessories: a tiara or crown, high-heeled shoes matching the pink theme, and multiple pieces of jewelry including bracelets and rings. Makeup & hair: elaborate, dramatic makeup with bold eyeliner and heavy eyeshadow, possibly glitter; hair in a voluminous, wavy updo. Setting: seated on a large, pink throne-like chair, surrounded by pink props—a hairbrush, a hand mirror and a fan—against a dark, theatrical background with a floral arrangement and a mirror. Lighting: dramatic, with a spotlight effect, suggesting a drag show or themed event.
0.85 faithfulness = 23 Yes / 27 questions · 2 No · 1 N/A · 1 Skip
  • Yes · q0. Is there a person in the image?
  • Yes · q1. Is there a large, throne-like chair in the image?
  • Yes · q2. Is the person seated on the throne-like chair? · needs q0, q1
  • Yes · q3. Is the person wearing a gown? · needs q0
  • Yes · q4. Is the gown pink? · needs q3
  • Yes · q5. Is the gown full-length? · needs q3
  • Yes · q6. Does the gown have a fitted, corseted bodice? · needs q3
  • Yes · q7. Does the gown have a voluminous skirt? · needs q3
  • Yes · q8. Is the gown adorned with ruffles and layers? · needs q3
  • Yes · q9. Is the person wearing a tiara or crown? · needs q0
  • N/A · q10. Is the person wearing high-heeled shoes? · needs q0
  • Yes · q12. Is the person wearing jewelry, such as bracelets and rings? · needs q0
  • Yes · q13. Does the person have elaborate makeup, including bold eyeliner and heavy eyeshadow? · needs q0
  • Yes · q14. Is the person's hair styled in a voluminous, wavy updo? · needs q0
  • Yes · q15. Is the throne-like chair pink? · needs q1
  • Yes · q16. Is there a hairbrush in the image?
  • No · q17. Is the hairbrush pink? · needs q16
  • Yes · q18. Is there a hand mirror in the image?
  • No · q19. Is the hand mirror pink? · needs q18
  • Yes · q20. Is there a fan in the image?
  • Yes · q21. Is the fan pink? · needs q20
  • Yes · q22. Are the hairbrush, hand mirror, and fan surrounding the chair? · needs q1, q16, q18, q20
  • Yes · q23. Is the background dark?
  • Yes · q24. Is there a floral arrangement in the background?
  • Yes · q25. Is there a mirror in the background?
  • Yes · q26. Is there dramatic lighting creating a spotlight effect on the person? · needs q0

Yes: constraint satisfied · No: violated · N/A: judge couldn’t determine (counted as not satisfied) · Skip: auto-failed because a parent question failed (no VLM call).
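
The 0.85 figure for this example can be re-derived from the per-question verdicts. The short tally below transcribes them from the list above (q11 is the auto-skipped question); the variable names are illustrative.

```python
from collections import Counter

# Verdicts for q0..q26, transcribed from the example.
NO = {17, 19}   # hairbrush and hand mirror are not pink
NA = {10}       # shoes hidden by the gown
SKIP = {11}     # dependent of q10, auto-failed without a VLM call


def verdict(i: int) -> str:
    if i in NO:
        return "No"
    if i in NA:
        return "N/A"
    if i in SKIP:
        return "Skip"
    return "Yes"


verdicts = {f"q{i}": verdict(i) for i in range(27)}
counts = Counter(verdicts.values())

# Only "Yes" counts toward the score; No, N/A, and Skip all count against it.
score = counts["Yes"] / len(verdicts)
print(counts["Yes"], len(verdicts), f"{score:.2f}")  # 23 27 0.85
```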

Bonus

The Checklist as a Post-Training Reward

Single rewards don't transfer across axes: Bradley-Terry (BT) preference rewards lift aesthetics but drag faithfulness down, while faithfulness rewards do the reverse. We combine the checklist (faithfulness) with a BT reward (aesthetics) under group-decoupled normalization (GDPO), which standardizes each reward within its rollout group before combining, so neither axis dominates the gradient. On FLUX.1-dev this lifts both at once: +10.5% faithfulness and +3.5% PickScore. Under a pairwise win-rate protocol (MMRB2), our Faith + Pick run is the strongest row on both SD3.5-Medium and FLUX.1-dev, beating every single-reward, naive weighted-sum, and 4-reward-ensemble baseline. A 1,899-vote human study agrees (64.1% preferred overall).
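
To illustrate why standardizing each reward within its rollout group matters, here is a minimal sketch that contrasts it with a naive weighted sum. It covers only the normalize-then-combine step, not the policy update. The equal weights, the population standard deviation, the `eps` term, and the reward values are assumptions chosen for illustration, not settings from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def standardize(xs: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Zero-mean, unit-variance scaling within one rollout group."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / (sigma + eps) for x in xs]


def decoupled_advantages(
    faith: Sequence[float],
    aesthetic: Sequence[float],
    weights: Tuple[float, float] = (1.0, 1.0),
) -> List[float]:
    """Standardize each reward on its own, then combine."""
    zf, za = standardize(faith), standardize(aesthetic)
    return [weights[0] * f + weights[1] * a for f, a in zip(zf, za)]


def naive_advantages(
    faith: Sequence[float],
    aesthetic: Sequence[float],
    weights: Tuple[float, float] = (1.0, 1.0),
) -> List[float]:
    """Sum the raw rewards first, then standardize the sum."""
    raw = [weights[0] * f + weights[1] * a for f, a in zip(faith, aesthetic)]
    return standardize(raw)


# One hypothetical group of 4 rollouts for the same prompt.
faith = [0.50, 0.60, 0.70, 0.80]      # checklist yes-ratio, range 0..1
aesthetic = [21.0, 19.0, 22.0, 20.0]  # preference score on a wider scale

naive = naive_advantages(faith, aesthetic)
decoupled = decoupled_advantages(faith, aesthetic)

# Rollout 3 is the most faithful; rollout 0 is the least faithful but prettier.
print(naive[0] > naive[3])          # True: the wide-scale reward decides
print(decoupled[3] > decoupled[0])  # True: both axes carry equal weight
```

In the naive sum the aesthetic reward's larger numeric range swamps the checklist score, so the ordering of rollouts follows aesthetics alone. After per-reward standardization, a gain of one standard deviation on either axis moves the advantage by the same amount.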

Figure: Pairwise net win-rate matrix on SD3.5-Medium

Pairwise net win-rates on the 1k test set (Gemini-3-flash, MMRB2 rubric). Faith + Pick GDPO wins every cell of its row.
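
For readers unfamiliar with the metric, the sketch below shows one common way to compute a net win-rate matrix from pairwise judgments. It assumes net win-rate means (wins − losses) / comparisons with ties contributing zero; the page does not spell out the exact definition. The method labels and outcomes are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def net_win_rate(outcomes: List[str]) -> float:
    """(wins - losses) / comparisons; ties contribute zero (assumed definition)."""
    wins = outcomes.count("win")
    losses = outcomes.count("loss")
    return (wins - losses) / len(outcomes)


def net_win_matrix(
    judgments: Dict[Tuple[str, str], List[str]],
) -> Dict[Tuple[str, str], float]:
    """Fill both (row, col) and (col, row); the matrix is antisymmetric."""
    matrix: Dict[Tuple[str, str], float] = {}
    for (row, col), outcomes in judgments.items():
        rate = net_win_rate(outcomes)
        matrix[(row, col)] = rate
        matrix[(col, row)] = -rate
    return matrix


# Hypothetical judge outcomes, from the row method's point of view.
judgments = {
    ("Faith + Pick", "Pick only"): ["win", "win", "loss", "tie"],
    ("Faith + Pick", "Faith only"): ["win", "tie", "tie", "loss", "win"],
}
matrix = net_win_matrix(judgments)

# "Wins every cell of its row" means every entry in that row is positive.
row_wins = all(v > 0 for (r, _), v in matrix.items() if r == "Faith + Pick")
print(matrix[("Faith + Pick", "Pick only")])   # 0.25
print(matrix[("Faith + Pick", "Faith only")])  # 0.2
print(row_wins)                                # True
```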

BibTeX

@article{ban2026arenat2ihard,
  title   = {Arena-T2I Hard: Benchmarking and Improving Faithfulness
             with a Dependency-Aware Checklist},
  author  = {Ban, Yuanhao and Xie, Tong and An, Sohyun and Hong, Yunqi
             and Frick, Evan and Hsu, I-Hung and Chiang, Wei-Lin
             and Stoica, Ion and Hsieh, Cho-Jui},
  year    = {2026}
}