MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

Yonsei University, NAVER AI Lab, SNU AIIS

Teaser: our method faithfully preserves the identity of the customized reference subject while accurately reflecting camera pose variations (i.e., rotations) and maintaining multi-view consistency in the surrounding background as well as in the subject.

Abstract

Multi-view generation with camera pose control and prompt-based customization are both essential for controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, while customization models lack explicit viewpoint control, making the two difficult to unify.

Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts.

To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry with a feature-field representation, building on a text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistency-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds.

Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.

What is multi-view customization?

As generative models continue to advance, users increasingly expect fine-grained control over both viewpoint and personalization. Traditional research has focused on two separate directions—camera control, which generates images from specified viewpoints, and customization, which captures user-specific identities or concepts for personalized content generation.

Real-world applications such as virtual prototyping, digital humans, and creative design require both forms of control simultaneously. This motivates the task of Multi-View Customization, which unifies camera control and customization within a single generative framework. The task imposes three requirements:

  1. Viewpoint Alignment — generate images consistent with specified camera parameters, ensuring geometrically aligned perspectives for both the subject and the surrounding environment. This dual-level alignment is crucial for realistic and spatially coherent multi-view generation.
  2. Subject Identity Preservation — faithfully preserve the appearance and identity of the reference subject across all generated views.
  3. Contextual Consistency — maintain coherent surroundings across multiple viewpoints while adapting to diverse textual prompts.


As summarized in Table 1, existing methods only partially satisfy these requirements: customization methods preserve identity but lack viewpoint control; multi-view generation models ensure geometric consistency but cannot adapt to personalized subjects; viewpoint-aware customization remains subject-centric and often yields inconsistent backgrounds across views. These limitations motivate a dedicated framework that jointly handles personalization and multi-view consistency.

Comparison with existing methods


- CustomDiffusion360 is the state-of-the-art method for viewpoint-aware subject customization.
- Txt-MVgen with LoRA: A text-conditioned camera-motion-controllable model, CameraCtrl, customized with the conventional DreamBooth-LoRA.
- Custom Image + Img-MVgen: This method generates multi-view images by inputting a single customized image into the image-conditioned multi-view generation model, Stable Virtual Camera (SEVA). The single input image is taken from the first frame of the output produced by our model, conditioned on the text prompt.

Proposed method

To address this task, we propose MVCustom, a diffusion-based framework explicitly designed for robust multi-view customization. Our method separates the training and inference stages to effectively handle limited data and ensure geometric consistency across diverse prompts.

Finetuning for model customization

In the training stage, we leverage pose-conditioned transformer blocks. A key change, however, is the use of a video diffusion backbone enhanced with dense spatio-temporal attention, which transfers temporal coherence into holistic cross-frame consistency and ensures spatial coherence of both the subject and its surroundings across views.
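As a rough illustration of this design, the sketch below shows one way a pose-conditioned transformer block with dense spatio-temporal attention over all views could look in PyTorch. The class name, the flattened 3x4 extrinsic pose encoding, and all dimensions are assumptions for exposition, not the released implementation.

```python
# Minimal sketch (not the authors' code): a pose-conditioned transformer block
# whose attention is "dense" over all views, i.e., every token of every view
# attends to all tokens of all other views instead of factorized
# spatial/temporal passes. Pose encoding and dimensions are assumed.
import torch
import torch.nn as nn


class PoseConditionedSpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8, pose_dim: int = 12):
        super().__init__()
        # Camera pose (e.g., a flattened 3x4 extrinsic matrix) is projected and
        # added to every token of the corresponding view.
        self.pose_proj = nn.Linear(pose_dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # x:    (B, F, N, C) latent tokens for F views with N tokens each
        # pose: (B, F, pose_dim) camera parameters per view
        B, F, N, C = x.shape
        x = x + self.pose_proj(pose)[:, :, None, :]     # inject pose per view
        tokens = x.reshape(B, F * N, C)                 # all views as one sequence
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.reshape(B, F, N, C)


block = PoseConditionedSpatioTemporalBlock()
x = torch.randn(2, 8, 64, 320)   # 2 samples, 8 views, 8x8 latent tokens
pose = torch.randn(2, 8, 12)     # flattened 3x4 extrinsics per view
print(block(x, pose).shape)      # torch.Size([2, 8, 64, 320])
```

Attending over all F x N tokens jointly is what distinguishes dense spatio-temporal attention from factorized spatial/temporal attention, at the cost of quadratic growth in the attended token count.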

Inference-time multi-view consistency under limited data

At inference, the key challenge is ensuring multi-view geometric consistency for novel prompts, particularly for the subject's surroundings that lack supervision from limited training data. To address this, we introduce two novel inference-stage techniques: depth-aware feature rendering, which explicitly enforces geometric consistency using inferred 3D scene geometry, and consistent-aware latent completion, which naturally completes previously unseen regions revealed by viewpoint shifts.
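To make the first of these steps concrete, the following is a minimal, assumption-laden sketch of depth-aware feature rendering: source-view latent features are unprojected with an inferred depth map and re-projected into the target view, producing a warped latent plus a visibility mask whose holes mark the regions that latent completion must fill. The function name, camera conventions, and nearest-neighbour splatting are illustrative choices, not the paper's exact procedure.

```python
# Sketch only: warp latent features from a source view into a target view using
# depth and relative camera pose; mask==0 marks disoccluded regions.
import torch


def render_latent_to_target(feat_src, depth_src, K, T_src2tgt):
    """feat_src: (C, H, W) latent features, depth_src: (H, W) depth map,
    K: (3, 3) intrinsics at latent resolution, T_src2tgt: (4, 4) relative pose."""
    C, H, W = feat_src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)

    # Unproject source pixels to 3D camera coordinates, then move to target frame.
    cam = torch.linalg.inv(K) @ pix * depth_src.reshape(1, -1)
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1])], dim=0)
    cam_tgt = (T_src2tgt @ cam_h)[:3]

    # Project into the target image plane.
    proj = K @ cam_tgt
    z = proj[2].clamp(min=1e-6)
    u = (proj[0] / z).round().long()
    v = (proj[1] / z).round().long()

    # Nearest-neighbour splat into the target latent; mask records covered pixels.
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (proj[2] > 0)
    warped = torch.zeros_like(feat_src)
    mask = torch.zeros(H, W)
    idx = v[valid] * W + u[valid]
    warped.reshape(C, -1)[:, idx] = feat_src.reshape(C, -1)[:, valid]
    mask.reshape(-1)[idx] = 1.0
    return warped, mask  # mask==0 marks regions left for latent completion


# Tiny usage with random inputs at latent resolution (assumed values).
C, H, W = 4, 32, 32
feat = torch.randn(C, H, W)
depth = torch.rand(H, W) * 4 + 1          # depths in [1, 5)
K = torch.tensor([[32.0, 0.0, 16.0], [0.0, 32.0, 16.0], [0.0, 0.0, 1.0]])
T = torch.eye(4)
T[0, 3] = 0.5                             # small lateral camera shift
warped, mask = render_latent_to_target(feat, depth, K, T)
print(warped.shape, mask.mean())          # (4, 32, 32), fraction of covered pixels
```

Regions where the mask is zero correspond to the previously unseen regions revealed by viewpoint shifts described above; the consistency-aware completion step would synthesize them under the text prompt while keeping the warped, identity-preserving features fixed elsewhere.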

BibTeX

@article{shin2025mvcustom,
      title={MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion},
      author={Shin, Minjung and Cho, Hyunin and Go, Sooyeon and Kim, Jin-Hwa and Uh, Youngjung},
      journal={arXiv preprint arXiv:2510.13702},
      year={2025}
    }