
	---
	language:
	- en
	library_name: diffusers
	license: apache-2.0
	tags:
	- text-to-image
	- stable diffusion
	- personalization
	- msdiffusion
	pipeline_tag: text-to-image
	---

	# Introduction

	MS-Diffusion is a framework for layout-guided, zero-shot image personalization with multiple subjects. It integrates grounding tokens with the feature resampler to maintain detail fidelity across subjects. Guided by the layout, MS-Diffusion adapts the cross-attention to multi-subject inputs so that each subject condition acts on its own image region. The proposed multi-subject cross-attention composes the subjects coherently while preserving text control.

	![example](teaser_new.png)

	- Project Page: [https://MS-Diffusion.github.io](https://MS-Diffusion.github.io)
	- GitHub: [https://github.com/MS-Diffusion/MS-Diffusion](https://github.com/MS-Diffusion/MS-Diffusion)
	- Paper (arXiv): [https://arxiv.org/abs/2406.07209](https://arxiv.org/abs/2406.07209)

	# Model

	Download the pretrained base models from [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [CLIP-G](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k).

	Please refer to our [GitHub repository](https://github.com/MS-Diffusion/MS-Diffusion) to prepare the environment and get detailed instructions on how to run the model.
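
As a convenience, the two base models linked in this section could be fetched with `huggingface_hub`. This is a minimal sketch, not the procedure from the GitHub repository: the repo IDs come from the links above, while the `./pretrained` root and the `sdxl` / `clip_g` folder names are assumptions, and the inference script may expect different paths.

```python
from pathlib import Path

# Repo IDs taken from the two links in the Model section.
# The short folder names are illustrative assumptions.
BASE_MODELS = {
    "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
    "clip_g": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
}


def plan_downloads(root="./pretrained"):
    """Map each base-model repo ID to a local target directory (no network access)."""
    root = Path(root)
    return {repo_id: root / name for name, repo_id in BASE_MODELS.items()}


def download_base_models(root="./pretrained"):
    """Fetch both base models. Needs `pip install huggingface_hub` and network access."""
    # Imported lazily so that plan_downloads() works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(target))


# Usage (downloads many gigabytes, so it is left commented out):
# download_base_models("./pretrained")
```

The download is kept separate from the path planning so the target layout can be checked before any large transfer starts.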

	# Important Notes

	- This repo contains only the trained model checkpoint; it does not include data, code, or base models. Please check the GitHub repository carefully for detailed instructions.
	- The `scale` parameter determines the extent of image control. By default, `scale` is set to 0.6. In practice, a `scale` of 0.4 works better if your input contains subjects that should affect the whole image, such as the background. Feel free to adjust `scale` in your applications.
	- The model works best with layout inputs. You can use the default layouts in the inference script, but more accurate and realistic layouts produce better results.
	- Although MS-Diffusion outperforms state-of-the-art personalized diffusion methods in both single-subject and multi-subject generation, it is still affected by the backgrounds of subject images. The best practice is to use masked subject images, since they contain no irrelevant information.
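
The `scale` and layout notes above can be sketched as a small helper. This is a hedged illustration, not code from the MS-Diffusion repository: the box format `(x0, y0, x1, y1)` in normalized `[0, 1]` coordinates and the 0.9 area threshold for "covers the whole image" are assumptions, and the function names are hypothetical. Only the values 0.6 and 0.4 come from the notes.

```python
DEFAULT_SCALE = 0.6      # default stated in the notes above
WHOLE_IMAGE_SCALE = 0.4  # suggested when a subject (e.g. a background) spans the image


def validate_box(box):
    """Check a box given as (x0, y0, x1, y1) in normalized [0, 1] coordinates.

    The coordinate convention is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError(f"invalid normalized box: {box}")
    return (x0, y0, x1, y1)


def box_area(box):
    """Return the fraction of the image covered by a validated box."""
    x0, y0, x1, y1 = validate_box(box)
    return (x1 - x0) * (y1 - y0)


def suggest_scale(boxes, whole_image_area=0.9):
    """Pick a starting `scale` from the subject layout.

    Returns 0.4 if any subject box covers at least `whole_image_area` of the
    image (an illustrative threshold), otherwise the default 0.6.
    """
    if any(box_area(b) >= whole_image_area for b in boxes):
        return WHOLE_IMAGE_SCALE
    return DEFAULT_SCALE


# Two side-by-side subjects: neither spans the image.
side_by_side = [(0.0, 0.25, 0.5, 0.75), (0.5, 0.25, 1.0, 0.75)]
# A background subject plus a centered foreground subject.
with_background = [(0.0, 0.0, 1.0, 1.0), (0.25, 0.25, 0.75, 0.75)]

print(suggest_scale(side_by_side))
print(suggest_scale(with_background))
```

The returned value is only a starting point; as the notes say, `scale` should be tuned per application.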