Update README.md
README.md
CHANGED
@@ -8,10 +8,7 @@ pipeline_tag: image-text-to-text
library_name: transformers
---

## R1-Onevision

[\[📂 GitHub\]](https://github.com/Fancy-MLLM/R1-Onevision) [\[📝 Report\]](https://yangyi-vai.notion.site/r1-onevision?pvs=4)
[\[🤗 HF Dataset\]](https://huggingface.co/datasets/Fancy-MLLM/R1-onevision) [\[🤗 Reasoning Benchmark\]](https://huggingface.co/datasets/Fancy-MLLM/R1-OneVision-Bench) [\[🤗 HF Demo\]](https://huggingface.co/spaces/Fancy-MLLM/R1-OneVision)

## Model Overview

@@ -36,82 +33,3 @@ bf16: true
```
bf16: true
flash_attn: fa2
```
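
The tail of the training config above enables bf16 and FlashAttention-2 (`fa2`). If the `flash-attn` package is installed on your inference machine, you can mirror those settings when loading the model; this is a minimal sketch using standard `from_pretrained` arguments, not something the original card prescribes:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Optional: match the training setup (bf16 + FlashAttention-2) at inference time.
# Requires a GPU with bf16 support and the flash-attn package; otherwise drop
# attn_implementation and fall back to the default attention backend.
# (device_map="auto" additionally needs the accelerate package.)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Fancy-MLLM/R1-Onevision-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```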

Training loss curve:

<img src="https://cdn-uploads.huggingface.co/production/uploads/65af78bb3e82498d4c65ed2a/8BNyo-v68aFvab2kXxtt1.png"/>

## Usage

You can load the model using the Hugging Face `transformers` library (the example below also uses `qwen_vl_utils` to preprocess the image inputs):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info

# Load the processor and the model in bf16 on GPU
MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda").eval()

# A single-image chat message; replace <your image path> with a local path or URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<your image path>"},
            {"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
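
Because the model produces long chain-of-thought answers (hence `max_new_tokens=4096`), it can be convenient to stream tokens as they are generated rather than waiting for the full decode. A minimal sketch using the standard `transformers` `TextStreamer`, reusing `model`, `processor`, and `inputs` from the snippet above (an optional convenience, not part of the original card):

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated; skip_prompt hides the echoed input.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=4096, streamer=streamer)
```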

## Ongoing Work

1. **Rule-Based Reinforcement Learning (RL)**

   We are actively exploring the integration of rule-based systems into reinforcement learning to strengthen the model's decision-making. Combining domain-specific rules with the learning process aims to make training in complex environments more efficient and safer; an illustrative sketch of one way rules can enter the reward is given after this list.

2. **Training with General Data and Multimodal Reasoning CoT**

   We are expanding the training set with more general data alongside multimodal reasoning Chain-of-Thought (CoT) data, so the model can draw on a broader range of information and handle diverse reasoning tasks across domains.

3. **Incorporating Chinese Multimodal Reasoning CoT Data**

   We are also integrating Chinese multimodal reasoning CoT data into training. This language-specific data should improve the model's reasoning in Chinese and broaden its multilingual, multimodal reasoning proficiency.

4. **Release of the 3B Model**

   We are preparing a smaller 3B model that balances performance and resource efficiency: it aims to retain strong multimodal reasoning while being better suited to environments with limited compute, offering a more compact alternative to the current 7B model.
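
The rule-based RL direction above is still in progress and its exact design is not described here. As a purely illustrative sketch (the `<think>`/`<answer>` format, weights, and function names below are assumptions, not the project's actual recipe), a rule-based reward for reasoning traces often combines a format check with an exact-match answer check:

```python
import re

# Illustrative only: a toy rule-based reward in the spirit of item 1 above.
# The tag format and the weights are assumptions, not R1-Onevision's recipe.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score a completion with hand-written rules instead of a learned reward model."""
    match = THINK_ANSWER.search(completion)
    if match is None:
        return 0.0                                  # rule 1: output must follow the format
    format_reward = 0.1                             # small bonus for well-formed output
    predicted = match.group(1).strip()
    accuracy_reward = 1.0 if predicted == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward          # rule 2: exact-match answer check

# A well-formed, correct completion earns the full reward.
print(rule_based_reward("<think>3 + 4 = 7</think> <answer>7</answer>", "7"))  # 1.1
```

Hand-written rewards like this avoid training a separate reward model, which is part of why they are attractive for efficiency and safety.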

## Institution

- Zhejiang University

## Model Contact