File size: 3,650 Bytes
a1919f5 81c06ad a1919f5 b256093 a1919f5 81c06ad a1919f5 b256093 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 |
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-to-image
frameworks:
- Pytorch
tasks:
- any-to-any
---
## News
- **May 27, 2025**: We fine-tuned Nexus-Gen using the [BLIP-3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k) dataset, significantly improving the model's robustness to text prompts in image generation, **achieving a GenEval score of 0.79**. The [model checkpoints](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-Gen) have been updated.
## What is the Nexus-Gen
[Nexus-Gen](https://huggingface.co/papers/2504.21356) is a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. To align the embedding space of the LLM and diffusion model, we conduct a dual-phase alignment training process. (1) The autoregressive LLM learns to predict image embeddings conditioned on multimodal inputs, while (2) the vision decoder is trained to reconstruct high-fidelity images from these embeddings. During training the LLM, we identified a critical discrepancy between the autoregressive paradigm's training and inference phases, where error accumulation in continuous embedding space severely degrades generation quality. To avoid this issue, we introduce a prefilled autoregression strategy that prefills input sequence with position-embedded special tokens instead of continuous embeddings. Through dual-phase training, Nexus-Gen has developed the integrated capability to comprehensively address the image understanding, generation and editing tasks as follows.
More information please refer to our repo: https://github.com/modelscope/Nexus-Gen.git


## Getting Started
### Installation
1. Install [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio.git) from source:
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
2. Install requirements
```
pip install -r requirements.txt
```
3. Install [ms-swift](https://github.com/modelscope/ms-swift.git) if you want to perform finetuning on Nexus-Gen.
```
pip install ms-swift -U
```
### Prepare models
```shell
python download_models.py
```
### Image Understanding
```shell
python image_understanding.py
```
### Image Generation
image generation with detailed prompt.
```shell
python image_generation.py
```
Polish prompt and generate images with Nexus-Gen.
```shell
image_generation_with_selfpolish.py
```
### Image Editing
```shell
python image_editing.py
```
### Training Codes
Nexus-Gen is trained base on [ms-swift](https://github.com/modelscope/ms-swift.git) and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio.git). You can find the training scripts in `train/scripts/train_decoder.sh` and `train_llm.sh`.
## Limitations
- Please note that Nexus-Gen is trained primarily with English corpus, therefore instruction-following with non-English is not supported.
- Please note that Nexus-Gen was trained on limited text-to-image data and may not be robust to short prompts.
## Citation
Feel free to reference our work if you find it helpful.
```
@misc{zhang2025nexusgenunifiedmodelimage,
title={Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing},
author={Hong Zhang and Zhongjie Duan and Xingjun Wang and Yuze Zhao and Weiyi Lu and Zhipeng Di and Yixuan Xu and Yingda Chen and Yu Zhang},
year={2025},
eprint={2504.21356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.21356v2},
}
```
|