Vision-Language-Garment Models | CVPR 2025

Multimodal foundation models for digital garments.

ABSTRACT

A vision-language-garment model is a multimodal foundation model for digital garments, obtained by fine-tuning a large multimodal model on a custom sewing pattern dataset with a novel tokenization scheme for these patterns. This approach transfers web knowledge to tasks requiring garment understanding and reasoning, enabling new applications. Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG’s zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.
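To make the fine-tuning idea concrete, the minimal sketch below shows one way a sewing pattern could be serialized into discrete tokens that a multimodal model can be trained on. The panel structure, special-token names, and coordinate quantization here are illustrative assumptions, not the tokenization scheme used in AIpparel or VLG.

```python
# Hypothetical sketch: flattening a sewing pattern into a token sequence for
# fine-tuning a vision-language model. Names and quantization are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 256  # assumed quantization resolution for vertex coordinates


@dataclass
class Panel:
    name: str                          # e.g. "front_bodice"
    vertices: List[Tuple[float, float]]  # 2D outline in normalized [0, 1] coords


def quantize(x: float, bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate to a discrete bin index."""
    return min(bins - 1, max(0, int(x * bins)))


def pattern_to_tokens(panels: List[Panel]) -> List[str]:
    """Serialize panels into special tokens plus discrete coordinate tokens."""
    tokens = ["<pattern_start>"]
    for panel in panels:
        tokens.append(f"<panel:{panel.name}>")
        for x, y in panel.vertices:
            tokens += [f"<x_{quantize(x)}>", f"<y_{quantize(y)}>"]
        tokens.append("<panel_end>")
    tokens.append("<pattern_end>")
    return tokens


if __name__ == "__main__":
    front = Panel("front_bodice", [(0.1, 0.1), (0.9, 0.1), (0.9, 0.8), (0.1, 0.8)])
    print(pattern_to_tokens([front]))
```

In such a setup, the special tokens would be added to the base model's vocabulary and the model fine-tuned end to end to emit these sequences conditioned on text and image inputs.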

Related Projects

  • Nakayama, Ackermann, Kesdogan et al. “AIpparel: A Multimodal Foundation Model for Digital Garments”, CVPR 2025 (link)
  • Ackermann et al. “Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation”, CVPRW 2025 (link)


CITATION

@inproceedings{Nakayama:2025:aipparel,
author = {K. Nakayama and J. Ackermann and T. Kesdogan and Y. Zheng and M. Korosteleva and O. Sorkine-Hornung and L. Guibas and G. Yang and G. Wetzstein},
title = {{AIpparel: A Multimodal Foundation Model for Digital Garments}},
booktitle = {CVPR},
year = {2025},
}

@inproceedings{Ackermann:2025:vlg,
author = {J. Ackermann and K. Nakayama and G. Yang and T. Wu and G. Wetzstein},
title = {{Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation}},
booktitle = {CVPRW},
year = {2025},
}

Vision-language-garment models enable applications including text-to-garment generation, image-to-garment generation, and language-instructed garment editing, among others.
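As a rough sketch of how these applications could all be served by one instruction-following model, the snippet below composes task-specific multimodal prompts. The template strings and the `<image>` placeholder are illustrative assumptions, not the prompts used by AIpparel or VLG.

```python
# Hypothetical sketch: phrasing garment tasks as instruction prompts for a single
# multimodal model. Templates below are assumptions for illustration only.
def build_prompt(task: str, text: str = "", has_image: bool = False) -> str:
    """Compose an instruction prompt for a given garment task."""
    templates = {
        "text_to_garment": "Generate a sewing pattern for: {text}",
        "image_to_garment": "<image> Reconstruct the sewing pattern of the pictured garment.",
        "edit": "<image> Edit the garment as follows: {text}",
    }
    prompt = templates[task].format(text=text)
    if has_image and "<image>" not in prompt:
        prompt = "<image> " + prompt
    return prompt


print(build_prompt("text_to_garment", "a sleeveless A-line summer dress"))
print(build_prompt("edit", "shorten the sleeves to elbow length", has_image=True))
```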


Vision-Language-Garment Models transfer web knowledge to garment understanding and generation. Top: we tune existing vision-language models end to end for garment-specific tasks, predicting sewing patterns from multimodal (textual and visual) inputs. Bottom: the reasoning-transfer performance of our proposed model (red) improves significantly over relevant baselines, including DressCode (blue), SewFormer (orange), and SewFormer-FT (green), across various reasoning dimensions.

Acknowledgements

We thank LVMH and Google for their support.