LucasFang committed on
Commit 84e293d · verified · 1 Parent(s): 59c6a86

Update README.md

Files changed (1)
  1. README.md +93 -3
README.md CHANGED
@@ -1,3 +1,93 @@
- ---
- license: apache-2.0
- ---
+ <p align="center">
+ <img src="assets/pic/PUMA.png" width="230">
+ </p>
+
+ # PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation
+
+ <div align="center">
+ <a href="https://rongyaofang.github.io/puma/"><img src="https://img.shields.io/badge/Project-Homepage-green" alt="Home"></a>
+ <a href="https://arxiv.org/abs/2410.13861"><img src="https://img.shields.io/badge/ArXiv-2410.13861-red"></a>
+ <img src="https://visitor-badge.laobi.icu/badge?page_id=rongyaofang/PUMA" alt="visitors">
+
+ [Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)<sup>1\*</sup>, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)<sup>2\*</sup>, [Kun Wang]()<sup>3</sup>, [Hao Li](https://scholar.google.com/citations?user=qHqQsY4AAAAJ&hl=zh-CN)<sup>1,4</sup>, [Hao Tian]()<sup>3</sup>, [Xingyu Zeng]()<sup>3</sup>, [Rui Zhao]()<sup>3</sup>, [Jifeng Dai](https://jifengdai.org/)<sup>4,5</sup>, [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)<sup>1 :envelope:</sup>, [Xihui Liu](https://xh-liu.github.io/)<sup>2 :envelope:</sup>
+
+ <sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University
+
+ *Equal contribution, :envelope:Corresponding authors
+ </div>
+
+ ## <a name="env"></a>Environment Setup
+ ```
+ conda create -n puma python==3.8
+ conda activate puma
+ pip install -r requirements.txt
+ ```
+
+ ## <a name="checkpoint"></a>Checkpoint Download
+ ```
+ # First replace <token> with your Hugging Face token
+ python download_ckpt.py
+ ```
+ To download manually, get the checkpoints from [here](https://huggingface.co/LucasFang/PUMA) and place them under **./ckpts**.
+
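The manual download described above could also be scripted. The following is a minimal sketch, not the contents of `download_ckpt.py`: it assumes the `huggingface_hub` package is installed and that the token is supplied through an `HF_TOKEN` environment variable. The helper names `build_download_args` and `download_checkpoints` are hypothetical.

```python
import os

REPO_ID = "LucasFang/PUMA"  # repository linked in the README
CKPT_DIR = "./ckpts"        # target directory named in the README


def build_download_args(token=None):
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    # Assumption: fall back to the HF_TOKEN environment variable if no token is passed.
    token = token or os.environ.get("HF_TOKEN")
    if not token or token == "<token>":
        raise ValueError("Provide a real Hugging Face token, not the <token> placeholder")
    return {"repo_id": REPO_ID, "local_dir": CKPT_DIR, "token": token}


def download_checkpoints(token=None):
    """Download the whole repository snapshot into CKPT_DIR and return its path."""
    # Imported lazily so the argument helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_args(token))
```

Calling `download_checkpoints()` with `HF_TOKEN` set would fetch every file in the repository; whether all of them are needed for inference is not stated here.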
+ ## <a name="multi-granular"></a>Multi-granular Visual Decoding
+ ```
+ python infer_detokenizer.py --num_tokens <chosen number from [1, 4, 16, 64, 256]>
+ ```
+
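The five accepted values are consecutive powers of four. As an illustration only, the sketch below shows how a script might restrict `--num_tokens` to those values with `argparse`. It is a hypothetical stand-in, not the actual argument handling in `infer_detokenizer.py`, and `grid_side` assumes the tokens form a square grid, which this README does not state.

```python
import argparse
import math

# Token counts accepted by --num_tokens, as listed in the command above.
ALLOWED_NUM_TOKENS = (1, 4, 16, 64, 256)


def parse_args(argv=None):
    """Parse --num_tokens, rejecting any value outside ALLOWED_NUM_TOKENS."""
    parser = argparse.ArgumentParser(description="Hypothetical detokenizer CLI")
    parser.add_argument("--num_tokens", type=int, choices=ALLOWED_NUM_TOKENS, required=True)
    return parser.parse_args(argv)


def grid_side(num_tokens):
    """Side length of a square grid with num_tokens cells (square layout is an assumption)."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens:
        raise ValueError(f"{num_tokens} is not a perfect square")
    return side
```

For example, `parse_args(["--num_tokens", "16"])` accepts the value, while `--num_tokens 3` makes `argparse` exit with a usage error.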
+ ## <a name="abstract"></a>Abstract
+
+ > **PUMA** introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels in a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool for visual tasks.
+
+ Read the full paper [here](https://arxiv.org/abs/2410.13861).
+
+ ## <a name="framework"></a>Framework
+
+ <p align="center">
+ <img src="assets/pic/main_figure.jpg" width="920">
+ </p>
+
+ - PUMA uses multi-granular visual representations as the unified inputs and outputs of the MLLM, which lets it handle a variety of visual tasks: text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding.
+
+ ## <a name="decoding"></a>Multi-granular Semantic Visual Decoding
+
+ <p align="center">
+ <img src="assets/pic/rec.jpg" width="920">
+ </p>
+
+ - PUMA's visual decoding spans five image representations of different granularity (f<sub>0</sub> to f<sub>4</sub>), each with a corresponding decoder (D<sub>0</sub> to D<sub>4</sub>); the decoders are trained using SDXL. This allows PUMA to achieve both precise image reconstruction and semantic-guided generation, supporting control as well as diversity in image generation tasks.
+
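To make the idea of five granularities concrete, here is a toy sketch that builds a five-level pyramid by repeated 2x2 average pooling of a 16x16 grid. Both the pooling scheme and the 16x16 starting size are assumptions chosen to match the token counts 256, 64, 16, 4 and 1; this README does not say how the representations are actually computed, nor which end of the f<sub>0</sub> to f<sub>4</sub> range is the finest. Scalars stand in for feature vectors.

```python
def avg_pool_2x2(grid):
    """Halve a square grid of scalars by averaging each non-overlapping 2x2 block."""
    half = len(grid) // 2
    return [
        [
            (grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
             + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(half)
        ]
        for r in range(half)
    ]


def build_pyramid(finest, levels=5):
    """Return [finest, pooled once, pooled twice, ...] with `levels` entries."""
    pyramid = [finest]
    for _ in range(levels - 1):
        pyramid.append(avg_pool_2x2(pyramid[-1]))
    return pyramid


# Toy 16x16 grid of scalar "features"; real features would be vectors.
finest = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
pyramid = build_pyramid(finest)
token_counts = [len(level) * len(level[0]) for level in pyramid]
```

Each pooling step keeps coarser, more global information and discards spatial detail, which is one way to picture the trade-off between precise reconstruction (many tokens) and diverse generation (few tokens).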
+ ## <a name="t2i"></a>Diverse Text-to-image Generation
+
+ <p align="center">
+ <img src="assets/pic/gen.jpg" width="920">
+ </p>
+
+ ## <a name="image_editing"></a>Image Editing
+
+ <p align="center">
+ <img src="assets/pic/edit.jpg" width="920">
+ </p>
+
+ ## <a name="cond_gen"></a>Image Conditional Generation
+
+ <p align="center">
+ <img src="assets/pic/cond_gen.jpg" width="920">
+ </p>
+
+ ## <a name="citation"></a>Citation
+
+ If you find PUMA useful in your research, please consider citing us:
+
+ ```
+ @article{fang2024puma,
+ title ={PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
+ author ={Rongyao Fang and Chengqi Duan and Kun Wang and Hao Li and Hao Tian and Xingyu Zeng and Rui Zhao and Jifeng Dai and Hongsheng Li and Xihui Liu},
+ journal ={arXiv preprint arXiv:2410.13861},
+ year ={2024}
+ }
+ ```
+
+ ## <a name="license"></a>License
+
+ This project is released under the [Apache 2.0 license](LICENSE).