TL;DR: We enable FLUX to generate subject identity-consistent images across diverse text prompts without additional training.
Abstract
Recent text-to-image diffusion models have significantly improved visual quality and text alignment.
However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains challenging. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment.
In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment.
Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers.
Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only when the identity prompt is ambiguous.
Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs.
Comparison with existing methods
Qualitative results
Quantitative results
Our method achieves state-of-the-art performance on CQS, balancing per-image alignment and identity consistency. We highlight the best score in light red and the second-best in yellow.
How does it work?
Selective text-embedding modification
We selectively modify text embeddings to control identity and per-image semantics, amplifying identity-consistent features while suppressing irrelevant ones.
Additionally, based on our analysis, we repurpose FLUX padding embeddings—dominated by dummy semantics—as semantic containers by injecting meaningful per-image information.
These strategies enable stable identity preservation without sacrificing per-image prompt alignment.
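As an illustrative sketch of this idea, the two edits can be expressed as operations on a prompt's token-embedding matrix. Everything here is our own scaffolding, not the paper's implementation: `modify_text_embeddings`, `alpha`, and the index arguments are hypothetical names, and NumPy stands in for the actual FLUX text-encoder tensors.

```python
import numpy as np

def modify_text_embeddings(emb, id_idx, pad_idx, per_image_emb, alpha=1.5):
    """Sketch of selective text-embedding modification (hypothetical API).

    emb:           [seq_len, dim] text-encoder output for one prompt
    id_idx:        indices of identity-describing tokens
    pad_idx:       indices of padding tokens (dummy semantics in FLUX)
    per_image_emb: [dim] pooled embedding of the per-image prompt
    alpha:         amplification factor for identity features
    """
    out = emb.copy()
    # Amplify identity-consistent features so they dominate generation.
    out[id_idx] *= alpha
    # Repurpose padding embeddings as semantic containers: overwrite their
    # dummy content with meaningful per-image information.
    out[pad_idx] = per_image_emb  # broadcasts [dim] over the padding rows
    return out
```

In practice the amplification and injection would act on the text embeddings fed to the diffusion transformer; the sketch only shows the selective read-modify-write pattern.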
Adaptive feature-sharing
We first automatically evaluate identity ambiguity by analyzing feature cohesion across transformer blocks and timesteps.
When ambiguity is high, we selectively apply residual feature sharing from a cached identity embedding.
This adaptive mechanism reinforces identity consistency without sacrificing visual diversity or prompt fidelity.
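A minimal sketch of this gating logic, under the assumption that "feature cohesion" can be approximated by the mean pairwise cosine similarity of per-image identity features (the paper aggregates cohesion across transformer blocks and timesteps; `cohesion`, `adaptive_share`, `threshold`, and `beta` are our hypothetical names):

```python
import numpy as np

def cohesion(feats):
    """Mean pairwise cosine similarity of per-image identity features.
    feats: [k, dim], one feature vector per generated image."""
    f = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    sim = f @ f.T
    k = len(feats)
    # Average over the off-diagonal pairs only.
    return (sim.sum() - k) / (k * (k - 1))

def adaptive_share(feats, cached_id_feat, threshold=0.8, beta=0.5):
    """Apply residual feature sharing from a cached identity embedding
    only when cohesion is low, i.e. the identity prompt is ambiguous."""
    if cohesion(feats) < threshold:
        return feats + beta * (cached_id_feat - feats)
    return feats
```

When cohesion is already high, the features pass through untouched, which is what preserves visual diversity and prompt fidelity for unambiguous prompts.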
We propose the Consistency Quality Score (CQS), a unified metric that jointly evaluates the balance between identity preservation and per-image text alignment. By measuring both criteria simultaneously within a single score, CQS enables a comprehensive and reliable ranking of identity-consistent generation methods.
Given an identity prompt pid and per-image prompts [p1, …, pk], we generate a set of images X = {x1, …, xk}, where each image xi reflects both its per-image prompt pi and the identity prompt pid. We define two alignment scores: ti, which measures alignment with both the identity and per-image prompts, and ai, which measures alignment with the per-image prompt alone.
Identity consistency across the generated images is quantified with DreamSim. Since DreamSim measures distance, scores are transformed via (1 − ·) and rescaled with min–max normalization to match the range of the VQA scores.
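A minimal sketch of this transformation, assuming the target range is given by the VQA score bounds (`identity_scores` is a hypothetical helper; the epsilon and exact rescaling details are our assumptions):

```python
import numpy as np

def identity_scores(dreamsim_dist, vqa_min, vqa_max):
    """Convert DreamSim distances into identity scores on the VQA scale.

    Lower distance means more similar, so apply (1 - d) first, then
    min-max normalize and rescale into [vqa_min, vqa_max]."""
    s = 1.0 - np.asarray(dreamsim_dist, dtype=float)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)  # min-max to [0, 1]
    return vqa_min + s * (vqa_max - vqa_min)
```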
To explicitly account for the imbalance between identity consistency and per-image text alignment, we define the alignment gap Δi = ai − ti, which measures how much the per-image alignment deviates from the combined identity–prompt alignment. We summarize the dataset-level tendency via the mean positive and negative gaps, and define per-sample reward and penalty terms that interpolate between these dataset-level statistics and instance-level deviations, with λ ∈ [0,1] balancing the two. The adjusted identity score then incorporates these terms, with μ, τ ≥ 0 controlling the strength of the penalties and rewards.
The final per-sample score is the harmonic mean of the adjusted identity score and the per-image alignment score.
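Putting the prose above together, the scoring pipeline can be sketched in LaTeX as follows. This is our reconstruction from the surrounding definitions, not the paper's verbatim equations; in particular the interpolation form of the reward/penalty terms and the exact harmonic-mean pairing are assumptions.

```latex
% Hedged reconstruction of CQS from the surrounding prose; the paper's
% exact definitions may differ.
\Delta_i = a_i - t_i, \qquad
\bar{\Delta}^{+} = \operatorname{mean}\{\Delta_i \mid \Delta_i > 0\}, \qquad
\bar{\Delta}^{-} = \operatorname{mean}\{\lvert\Delta_i\rvert \mid \Delta_i < 0\}

% Per-sample reward and penalty, interpolating dataset-level statistics
% and instance-level deviations:
r_i = \lambda\,\bar{\Delta}^{-} + (1-\lambda)\max(-\Delta_i,\,0), \qquad
q_i = \lambda\,\bar{\Delta}^{+} + (1-\lambda)\max(\Delta_i,\,0)

% Adjusted identity score (s_i: normalized DreamSim identity score):
\tilde{s}_i = s_i + \tau\, r_i - \mu\, q_i

% Final per-sample score: harmonic mean of identity and alignment:
\mathrm{CQS}_i = \frac{2\,\tilde{s}_i\, a_i}{\tilde{s}_i + a_i}
```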
By integrating identity consistency and per-image text alignment into a single metric and explicitly penalizing their imbalance, CQS provides a balanced, comprehensive criterion for evaluating and ranking identity-consistent generation methods.
BibTeX
@article{shin2025asemconsist,
  title={ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation},
  author={Shin, Minjung and Cho, Hyunin and Uh, Youngjung and others},
  journal={arXiv preprint arXiv:2512.23245},
  year={2025}
}