---
license: mit
tags:
- lumos
- image to image
- text to image
- novel view synthesis
- image to video
---
<p align="center">
  <img src="asset/logo.gif"  height=20>
</p> 

<div style="display:flex;justify-content: center">
  <a href="https://arxiv.org/pdf/2412.07767"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:Lumos&color=red&logo=arxiv"></a> &ensp;
  <a href="https://xiaomabufei.github.io/lumos/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a> &ensp;
</div>

# 🥳 What is Lumos?
<b>TL;DR: <font color="purple">Lumos</font> is a pure vision-based generative framework that demonstrates the feasibility and scalability of learning visual generative priors from images alone. It can be efficiently adapted to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.</b>
<details><summary>CLICK for the full abstract</summary>
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive.
We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling.
Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner.
We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models.
We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning.
We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video.
</details>

# 🪄✨ Lumos Model Card
![row01](asset/teaser.png)

## 🚀 Model Structure
![pipeline](asset/method.png)

[Lumos](https://arxiv.org/pdf/2412.07767) consists of transformer blocks for latent diffusion and can be adapted to various visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.

Source code is available at https://github.com/xiaomabufei/lumos.
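
The toy sketch below only illustrates the general idea of a conditioned latent-diffusion transformer as described above; the backbone, conditioning projection, sampler, and all sizes are stand-ins, and the actual architecture and sampling schedule are defined in the GitHub repository.

```python
# Toy sketch of a conditioned latent-diffusion transformer (NOT the real Lumos code).
import torch
import torch.nn as nn


class TinyDiT(nn.Module):
    """Stand-in for the transformer backbone operating on latent tokens."""

    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.cond_proj = nn.Linear(dim, dim)  # projects conditioning tokens (image or text features)
        self.time_proj = nn.Linear(1, dim)    # embeds the diffusion timestep
        self.out = nn.Linear(dim, dim)

    def forward(self, latents, cond, t):
        h = latents + self.time_proj(t[:, None, None].float())
        h = torch.cat([self.cond_proj(cond), h], dim=1)   # prepend conditioning tokens
        h = self.blocks(h)
        return self.out(h[:, cond.shape[1]:])             # noise prediction for latent tokens only


@torch.no_grad()
def sample(model, cond, steps=50, tokens=256, dim=64):
    """Toy DDPM-style ancestral sampler over latent tokens."""
    x = torch.randn(cond.shape[0], tokens, dim)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0],), i)
        eps = model(x, cond, t)
        x = (x - betas[i] / (1.0 - alphas_bar[i]).sqrt() * eps) / (1.0 - betas[i]).sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x  # in the real pipeline, a VAE decoder maps these latents back to pixels


cond = torch.randn(1, 16, 64)      # e.g. DINO image tokens (I2I) or T5 text tokens (T2I)
latents = sample(TinyDiT(), cond)
print(latents.shape)               # torch.Size([1, 256, 64])
```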

## 📋 Model Description

- **Developed by:** Lumos
- **Model type:** Diffusion-Transformer-based generative model
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md)
- **Model Description:** **Lumos-I2I** generates images from image prompts. It uses a [Transformer Latent Diffusion architecture](https://arxiv.org/abs/2310.00426) with a fixed, pretrained vision encoder ([DINO](https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/dino_vitbase16_pretrain.pth)). **Lumos-T2I** generates images from text prompts. It is a [Transformer Latent Diffusion Model](https://arxiv.org/abs/2310.00426) that uses a fixed, pretrained text encoder ([T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl)); a minimal sketch of preparing these conditioning inputs is shown after this list.
- **Resources for more information:** Check out our [GitHub Repository](https://github.com/xiaomabufei/lumos) and the [Lumos report on arXiv](https://arxiv.org/pdf/2412.07767).
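
The snippet below is a rough, hypothetical sketch of how the frozen conditioning encoders named above could be loaded and queried: DINO ViT-B/16 via `torch.hub` for Lumos-I2I and the T5-XXL encoder via `transformers` for Lumos-T2I. How these features are actually injected into the Lumos transformer is defined in the GitHub repository, not here.

```python
# Hypothetical sketch of preparing conditioning features for Lumos-I2I / Lumos-T2I.
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Image condition (Lumos-I2I): frozen DINO ViT-B/16 loaded from torch.hub.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vitb16").eval()
image = torch.randn(1, 3, 224, 224)  # replace with a normalized, resized real image
with torch.no_grad():
    # Patch tokens (plus CLS) from the last block: shape (1, 197, 768) at 224x224.
    image_tokens = dino.get_intermediate_layers(image, n=1)[0]

# Text condition (Lumos-T2I): frozen T5-XXL encoder (very large; ~11B parameters).
tok = T5Tokenizer.from_pretrained("DeepFloyd/t5-v1_1-xxl")
t5 = T5EncoderModel.from_pretrained("DeepFloyd/t5-v1_1-xxl").eval()
ids = tok("a cat wearing a wizard hat", return_tensors="pt")
with torch.no_grad():
    text_tokens = t5(**ids).last_hidden_state  # shape (1, seq_len, 4096)
```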