



The rapid advancement of Text-to-Image (T2I) models has ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I evaluation benchmarks fall short in prompt diversity and complexity and rely on coarse evaluation metrics, making it difficult to assess the fine-grained alignment between textual instructions and generated images.
To remedy these gaps, we introduce TIIF-Bench (Text-to-Image Instruction Following Benchmark), a benchmark built for the fine-grained assessment of T2I models.
We extract ten concept pools from existing benchmarks and define 36 novel combinations of them across six compositional prompt dimensions. Each dimension incorporates multiple attributes,
ensuring that every prompt is semantically distinct and exhibits diverse sentence structures.
Additionally, two important but previously overlooked dimensions, text rendering and style control, are introduced as dedicated categories in TIIF-Bench.
We also collect 100 real-world designer-level prompts that encode rich human priors and aesthetic judgment.
For each prompt, we provide a concise version and an extended version to assess the sensitivity of T2I models to prompt length.
In total, the benchmark offers:
\(6_{\text{compositional}} \times (300_{\text{short}} + 300_{\text{long}}) + 1_{\text{text rendering}} \times (300_{\text{short}} + 300_{\text{long}}) + 1_{\text{style control}} \times (300_{\text{short}} + 300_{\text{long}}) + 1_{\text{designer}} \times (100_{\text{short}} + 100_{\text{long}}) = 5000\) prompts.
In evaluation, each prompt is accompanied by a set of attribute-specific yes/no questions, enabling VLM judges (we provide GPT-4o and Qwen2.5-VL-72B as judge models) to evaluate alignment at a finer granularity than a single coarse score. Text rendering accuracy is further quantified by the proposed GNED metric.
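To make the judging protocol concrete, the following is a minimal sketch of how an attribute-specific yes/no question could be posed to GPT-4o through the OpenAI Python client, together with a plain normalized edit distance in the spirit of GNED. The question template, answer parsing, aggregation, and the exact GNED formulation are illustrative assumptions, not the benchmark's released implementation.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_yes_no(image_path: str, question: str, model: str = "gpt-4o") -> bool:
    """Ask the judge VLM a single attribute-specific yes/no question about an image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question} Answer strictly with 'yes' or 'no'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=3,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def attribute_accuracy(image_path: str, questions: list[str]) -> float:
    """Fraction of attribute questions the generated image satisfies."""
    answers = [ask_yes_no(image_path, q) for q in questions]
    return sum(answers) / len(answers)

def normalized_edit_distance(target: str, rendered: str) -> float:
    """Levenshtein distance divided by the longer string's length (0 = exact match)."""
    m, n = len(target), len(rendered)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))  # distances from the empty prefix of `target`
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                                   # deletion
                      dp[j - 1] + 1,                               # insertion
                      prev + (target[i - 1] != rendered[j - 1]))   # substitution
            prev, dp[j] = dp[j], cur
    return dp[n] / max(m, n)

Under these assumptions, a per-dimension score is the average of attribute_accuracy over all prompts in that dimension, while the edit-distance term is averaged over the text rendering prompts.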
Our findings indicate that, with the exception of GPT-4o, both open-source and closed-source models perform relatively well on prompts involving object attributes (e.g., color, texture, shape),
yet consistently struggle with spatial and reasoning tasks such as 2D/3D layouts and logical instructions.
Moreover, models that achieve higher overall scores on TIIF-Bench tend to be more robust to prompt length, whereas lower-performing models exhibit greater sensitivity.
This suggests a positive correlation between a model's instruction comprehension and its image generation quality.
Finally, although AR-based models generally produce lower-fidelity images, their instruction-following performance is comparable to that of advanced diffusion-based models, highlighting the inherent advantage of autoregressive architectures in semantic understanding.
Leaderboard: evaluated by GPT-4o / Qwen2.5-VL-72B on the testmini subset of TIIF-Bench.
Leaderboard: evaluated by GPT-4o / Qwen2.5-VL-72B on the ENTIRE set of TIIF-Bench.
We first group the prompts in existing benchmarks based on their semantics and leverage GPT-4o to extract the underlying object–attribute/relation pairs, forming a set of dimension-specific concept pools. In total, we construct 10 concept pools from existing benchmarks and categorize them into three groups, as summarized in the table below:
Building upon these concept pools, we generate prompts by randomly combining attributes from each pool and leveraging GPT-4o to compose them into natural instructions. We define 36 distinct combinations, each paired with a dedicated meta-prompt that guides GPT-4o in assembling the instruction. Prompts that combine elements drawn from a single concept-pool group are classified as Basic Following, whereas Advanced Following prompts intertwine elements taken from different concept-pool groups, yielding more intricate compositions.
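As a rough illustration of this assembly step, the sketch below samples one attribute per pool from toy stand-in pools and asks GPT-4o to weave them into a single instruction. The pool contents, the meta-prompt wording, and the compose_prompt helper are hypothetical placeholders rather than the actual pools and meta-prompts used to build TIIF-Bench.

import random
from openai import OpenAI

client = OpenAI()

CONCEPT_POOLS = {  # toy stand-ins for the ten extracted concept pools
    "object":   ["a ceramic vase", "a wooden robot", "a glass lantern"],
    "color":    ["crimson", "emerald green", "pale blue"],
    "shape":    ["cubic", "spherical", "star-shaped"],
    "relation": ["on the left of", "behind", "stacked on top of"],
}

META_PROMPT = (
    "Combine the following object-attribute/relation elements into one fluent "
    "text-to-image instruction. Keep every element, add no new constraints, "
    "and vary the sentence structure:\n{elements}"
)

def compose_prompt(combination: list[str], model: str = "gpt-4o") -> str:
    """Sample one attribute per pool in `combination` and compose a natural instruction."""
    elements = {pool: random.choice(CONCEPT_POOLS[pool]) for pool in combination}
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": META_PROMPT.format(elements=elements)}],
    )
    return resp.choices[0].message.content.strip()

# One combination drawn from a single pool group versus one mixing groups,
# loosely mirroring the Basic vs. Advanced Following split:
basic_prompt = compose_prompt(["object", "color", "shape"])
advanced_prompt = compose_prompt(["object", "color", "relation"])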
To extend evaluation beyond conventional instruction-following skills, we introduce three novel dimensions:
(i) Text rendering evaluates a model's ability to accurately reproduce complex typographic elements, using prompts sourced from the Lex-Art corpus.
(ii) Style control assesses the model's capacity to adhere to high-level artistic directives, with prompts manually curated from leading AIGC creator communities.
(iii) Designer-level prompts involve complex instructions that incorporate practical constraints and domain-specific knowledge, also collected through manual annotation.
The text rendering and style control dimensions are included in the Advanced Following set, while the designer-level prompts constitute the Designer Level Following set.
Finally, for each generated prompt, we leverage GPT-4o to construct a corresponding long-form variant by expanding the content through natural-language paraphrasing and stylistic elaboration while faithfully preserving its original semantics. The whole pipeline can thus be summarized as: concept-pool extraction, combinatorial prompt assembly via GPT-4o meta-prompts, long-form expansion, and attribute-level question construction for evaluation.
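To make the long-form expansion step concrete, below is a minimal sketch using GPT-4o; the rewriting instruction is an assumed paraphrase of the procedure described above, not the exact wording used to build the benchmark.

from openai import OpenAI

client = OpenAI()

def expand_prompt(short_prompt: str, model: str = "gpt-4o") -> str:
    """Produce a long-form variant of a short prompt with unchanged semantics."""
    instruction = (
        "Rewrite the following text-to-image prompt as a longer, more detailed "
        "version. Paraphrase naturally and elaborate stylistically, but do not "
        "add, remove, or alter any stated object, attribute, or relation:\n"
        f"{short_prompt}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instruction}],
    )
    return resp.choices[0].message.content.strip()

# Paired with the composition step sketched earlier, this yields the
# (short, long) prompt pairs used to probe sensitivity to prompt length:
short = "A crimson, star-shaped glass lantern stacked on top of a wooden robot."
long_variant = expand_prompt(short)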
@misc{wei2025tiifbenchdoest2imodel,
title={TIIF-Bench: How Does Your T2I Model Follow Your Instructions?},
author={Xinyu Wei and Jinrui Zhang and Zeqing Wang and Hongyang Wei and Zhen Guo and Lei Zhang},
year={2025},
eprint={2506.02161},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.02161},
}