TIIF-Bench

How Does Your T2I Model Follow Your Instructions?

Hong Kong Polytechnic University · Sun Yat-sen University · Tsinghua University · OPPO Y-Lab

Introduction

The rapid advancement of Text-to-Image (T2I) models has ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I evaluation benchmarks fall short in prompt diversity and complexity and rely on coarse evaluation metrics, making it difficult to assess the fine-grained alignment between textual instructions and generated images.

To remedy these gaps, we introduce TIIF-Bench (Text-to-Image Instruction Following Benchmark), a benchmark built for the fine-grained assessment of T2I models. We extract ten concept pools from existing benchmarks and define 36 novel combinations of them across six compositional prompt dimensions. Each dimension incorporates multiple attributes, ensuring that every prompt is semantically distinct and exhibits diverse sentence structures. Additionally, two previously overlooked dimensions, text rendering and style control, are introduced as dedicated categories in TIIF-Bench. We also collect 100 real-world designer-level prompts that encode rich human priors and aesthetic judgment. For each prompt, we provide a concise version and an extended version to assess the sensitivity of T2I models to prompt length. In total, the benchmark offers:

\(6_{\text{compositional}} \times (300_{\text{short}} + 300_{\text{long}}) + 1_{\text{text generation}} \times (300_{\text{short}} + 300_{\text{long}}) + 1_{\text{style control}} \times (300_{\text{short}} + 300_{\text{long}}) + 1_{\text{designer}} \times (100_{\text{short}} + 100_{\text{long}}) = 5000\) prompts.

In evaluation, each prompt is accompanied by a set of attribute-specific yes/no questions, enabling VLMs (we offer GPT-4o and QwenVL2.5-72B as judge models) to evaluate generated images at a finer granularity than a single coarse score. Text rendering accuracy is further quantified by the proposed GNED metric.
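To make the protocol concrete, here is a minimal sketch of how yes/no judge answers could be aggregated and how a normalized edit distance over rendered text could be computed. The exact GNED formulation and judge prompts are defined in the paper; the function names and scoring details below are illustrative assumptions, not the official implementation.

# Illustrative sketch only: the exact GNED formulation and judge prompts
# follow the TIIF-Bench paper; everything below is a simplified stand-in.

def aggregate_yes_no(answers: list[str]) -> float:
    """Average a list of 'yes'/'no' judge answers into a 0-1 score."""
    votes = [1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers]
    return sum(votes) / len(votes) if votes else 0.0

def normalized_edit_distance(pred: str, target: str) -> float:
    """Levenshtein distance between rendered and target text,
    normalized by the longer string's length (0 = exact match)."""
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 0.0
    row = list(range(n + 1))                      # DP row over target prefixes
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,              # deletion
                         row[j - 1] + 1,          # insertion
                         prev + (pred[i - 1] != target[j - 1]))  # substitution
            prev = cur
    return row[n] / max(m, n)

score = aggregate_yes_no(["Yes", "yes", "No"])               # -> 0.667
err = normalized_edit_distance("HELLO WORLD", "HELLO WORD")  # small rendering error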

Our findings indicate that, with the exception of GPT-4o, both open-source and closed-source models perform relatively well on prompts involving object attributes (e.g., color, texture, shape), yet consistently struggle with spatial and reasoning tasks such as 2D/3D layouts and logical instructions. Moreover, models that achieve higher overall scores on TIIF-Bench tend to be more robust to prompt length, whereas lower-performing models exhibit greater sensitivity. This suggests a positive correlation between a model's instruction comprehension and its image generation quality. Finally, although AR-based models generally produce lower-fidelity images, their instruction-following performance is comparable to that of advanced diffusion-based models, highlighting the inherent advantage of autoregressive architectures in semantic understanding.

Leaderboard

Evaluated by GPT-4o / QwenVL2.5-72B on the testmini subset of TIIF-Bench

Evaluated by GPT-4o / QwenVL2.5-72B on the entire set of TIIF-Bench

TIIF-Bench Construction

We first group the prompts in existing benchmarks based on their semantics and leverage GPT-4o to extract the underlying object–attribute/relation pairs, forming a set of dimension-specific concept pools. In total, we construct 10 concept pools from existing benchmarks and categorize them into three groups, as summarized in the table below:

Table: Concept Pools
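As a rough sketch of this extraction step, assuming an OpenAI-style chat client, the pool construction could look like the following; the prompt wording and post-processing here are illustrative, not the paper's exact ones.

# Hypothetical sketch of concept-pool extraction with GPT-4o; the extraction
# prompt and post-processing are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

EXTRACT_PROMPT = (
    "Group the following text-to-image prompts by semantics and list the "
    "underlying object-attribute / object-relation pairs as JSON, keyed by "
    "dimension:\n{prompts}"
)

def extract_concept_pools(prompts: list[str]) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": EXTRACT_PROMPT.format(prompts="\n".join(prompts))}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)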

Building upon the concept pools, we generate prompts by randomly combining attributes from each pool and leveraging GPT-4o to compose them into natural instructions. We define 36 distinct combinations, each paired with a dedicated meta-prompt that guides GPT-4o in assembling the instruction. Prompts that combine elements drawn from a single concept-pool group are classified as Basic Following; in contrast, Advanced Following prompts intertwine elements taken from different concept-pool groups, yielding more intricate compositions.
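A minimal sketch of this composition step is given below. The pool contents and meta-prompt wording are placeholders rather than the actual templates; in practice the filled meta-prompt is sent to GPT-4o, which returns the final benchmark instruction.

# Illustrative sketch: sample one attribute per dimension from the concept
# pools and fill a combination-specific meta-prompt. Pool contents and the
# meta-prompt text are placeholders.
import random

CONCEPT_POOLS = {
    "color": ["crimson", "emerald green", "pale blue"],
    "texture": ["metallic", "woven", "glossy"],
    "2d_layout": ["to the left of", "above", "partially occluding"],
}

META_PROMPT = (
    "Write one natural text-to-image instruction that combines the following "
    "attributes: {attributes}. Vary the sentence structure and keep every "
    "attribute explicit."
)

def build_meta_prompt(combination: list[str], rng: random.Random) -> str:
    attributes = [f"{dim}: {rng.choice(CONCEPT_POOLS[dim])}" for dim in combination]
    return META_PROMPT.format(attributes="; ".join(attributes))

print(build_meta_prompt(["color", "texture", "2d_layout"], random.Random(0)))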

To extend evaluation beyond conventional instruction-following skills, we introduce three novel dimensions:

(i) Text rendering evaluates a model's ability to accurately reproduce complex typographic elements, using prompts sourced from the Lex-Art corpus.

(ii) Style control assesses the model's capacity to adhere to high-level artistic directives, with prompts manually curated from leading AIGC creator communities.

(iii) Designer-level prompts involve complex instructions that incorporate practical constraints and domain-specific knowledge, also collected through manual annotation.

The text rendering and style control dimensions are included in the Advanced Following set, while the designer-level prompts constitute the Designer Level Following set.
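Putting the grouping together, here is a hypothetical data-structure view of the set assignment described above; only the grouping itself is taken from the text, and any labels beyond it are placeholders.

# Hypothetical sketch of the TIIF-Bench set assignment described above.
TIIF_BENCH_SETS = {
    "basic_following": [
        "combinations drawn from a single concept-pool group",
    ],
    "advanced_following": [
        "combinations spanning different concept-pool groups",
        "text_rendering",
        "style_control",
    ],
    "designer_level_following": [
        "designer_prompts",  # 100 short + 100 long real-world prompts
    ],
}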


Finally, for each generated prompt, we leverage GPT-4o to construct a corresponding long-form variant by expanding the content through natural language paraphrasing and stylistic elaboration, while faithfully preserving its original semantics. The whole pipeline can be summarized as:

Figure: Overall TIIF-Bench construction pipeline
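As an illustration of this final expansion step, the snippet below lengthens a short prompt with GPT-4o while asking it to preserve the semantics; the expansion instruction is an assumption, not the paper's exact meta-prompt.

# Illustrative sketch of producing the long-form variant of a prompt with
# GPT-4o; the expansion instruction below is an assumption.
from openai import OpenAI

client = OpenAI()

EXPAND_PROMPT = (
    "Rewrite the following text-to-image prompt as a longer, stylistically "
    "elaborate paraphrase. Preserve every object, attribute, and relation "
    "exactly and do not introduce new constraints.\n\nPrompt: {prompt}"
)

def expand_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": EXPAND_PROMPT.format(prompt=short_prompt)}],
    )
    return response.choices[0].message.content.strip()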

BibTeX


@misc{wei2025tiifbenchdoest2imodel,
  title={TIIF-Bench: How Does Your T2I Model Follow Your Instructions?},
  author={Xinyu Wei and Jinrui Zhang and Zeqing Wang and Hongyang Wei and Zhen Guo and Lei Zhang},
  year={2025},
  eprint={2506.02161},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.02161},
}