---
library_name: transformers
tags: []
---

# HumanF-MarkrAI/Gukbap-Ovis2-16B-VL🍚

## Model Details🍚

### Model Description
- **Developed by:** HumanF-MarkrAI
- **Model type:** Korean-VL-Ovis2-16B
- **Language(s):** Korean + English
- **Context Length:** 2048
- **License:** cc-by-4.0 
- **Finetuned from model:** [AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B).  
  
### Model Sources
For training, we used four `H100 80GB` GPUs.

  
### Implications🍚
If you want to know more details about our model, please see the [🔥Gukbap-LMM Blog🔥](https://kyujinpy.tistory.com/169).  
We also provide Korean-LMM training code based on Ovis: [🔥Github🔥](https://github.com/Marker-Inc-Korea/Ovis2-FFT-Korean). Please give it a star⭐⭐!!

  
### Training Method (SFT)🧐
The following papers describe the foundational methodologies behind our dataset construction and training method.  
- [LIMA](https://arxiv.org/abs/2305.11206).
- [Ovis](https://arxiv.org/abs/2405.20797).  

  
### SFT Text-Datasets (Private)
When creating our open-source-based dataset, we used `microsoft/WizardLM-2-8x22B` through [DeepInfra](https://deepinfra.com/).  
Our datasets are built with the `Evolving system` proposed by [WizardLM](https://wizardlm.github.io/WizardLM2/) (a minimal sketch of this step appears after the hyperparameters below).  
For training, we used 1,849 training samples and 200 validation samples.
  
- **Wizard-Korea-Datasets:** [MarkrAI/Markr_WizardLM_train_ver4](https://huggingface.co/datasets/MarkrAI/Markr_WizardLM_train_ver4).   
> Learning rate: 2e-5; Epoch: 2
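
Below is a minimal, hypothetical sketch of one WizardLM-style "evolving" step issued to `microsoft/WizardLM-2-8x22B` through DeepInfra's OpenAI-compatible endpoint. The evolution prompt and the `evolve` helper are illustrative only; our actual pipeline and prompts are private.

```python
# Hypothetical sketch of a WizardLM-style "evolving" step via DeepInfra's
# OpenAI-compatible API. The prompt below is illustrative, not our pipeline.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra OpenAI-compatible API
    api_key="YOUR_DEEPINFRA_API_KEY",
)

# "In-depth evolving": rewrite a seed instruction into a harder variant.
EVOLVE_PROMPT = (
    "Rewrite the following instruction so it becomes more complex, adding one "
    "extra constraint or reasoning step, while keeping it answerable and in Korean.\n\n"
    "Instruction: {seed}"
)

def evolve(seed: str) -> str:
    response = client.chat.completions.create(
        model="microsoft/WizardLM-2-8x22B",
        messages=[{"role": "user", "content": EVOLVE_PROMPT.format(seed=seed)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(evolve("국밥의 유래를 설명해 주세요."))  # "Please explain the origin of gukbap."
```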

    
## Benchmarks🤗

### Global MM Benchmark Score (Zero-shot)

We evaluated internally with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit?tab=readme-ov-file).  
We utilized **chatgpt-0125**, **gpt-4o-mini**, and **gpt-4-turbo** as the judge models for `MMBench`, `MathVista`, and `MMVet`, respectively.  
   
| Model | MMStar | MathVista | HallusionBench | AI2D | OCRBench | MMVet | MMBench_V11 | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|:-----:|:-----:|:-----:|
| Step-1o (closed model) | 69.3 | 74.7 | 55.8 | 89.1 | **92.6** | **82.8** | 87.3 | **78.8** |
| InternVL2.5-78B-MPO (Open) | **72.1** | **76.6** | 58.1 | **89.2** | 90.9 | 73.5 | **87.8** | 78.3 |
| InternVL2.5-38B-MPO (Open) | 70.1 | 73.6 | **59.7** | 87.9 | 89.4 | 72.6 | 85.4 | 77.0 |
| Ovis2-16B (Open) | 67.2 | 73.7 | 56.8 | 86.3 | 87.9 | 68.4 | 85.7 | 75.14 |
| **Gukbap-Ovis2-16B-VL🍚** | 65.67 | 73.70 | 54.52 | 85.46 | 84.80 | 66.83 | 85.22 | **73.74** |
| Gemini-2.0-Flash | 69.4 | 70.4 | 58.0 | 83.1 | 82.5 | 73.6 | 71.0 | 72.6 |
| GPT-4o-20241120 | 65.1 | 59.9 | 56.2 | 84.9 | 80.6 | 74.5 | 84.3 | 72.2 |
| Ovis1.6-Gemma2-9B (Open) | 62.00 | 67.10 | 51.96 | 84.42 | 82.60 | 64.68 | 82.20 | 70.71 |
| **Gukbap-Gemma2-9B-VL🍚** | 62.13 | 66.00 | 53.01 | 84.49 | 82.80 | 63.90 | 82.20 | **70.65** |
| LLaVA-OneVision-72B | 65.8 | 68.4 | 47.9 | 86.2 | 74.1 | 60.6 | 84.5 | 69.6 |
| VARCO-VISION-14B (NCSoft) | 64.1 | 67.6 | 46.8 | 83.9 | 81.5 | 53.0 | 81.2 | 68.3 |
| GPT-4o-mini-20240718 | 54.8 | 52.4 | 46.1 | 77.8 | 78.5 | 66.9 | 76.0 | 64.6 |
> HallusionBench score: (aAcc + fAcc + qAcc) / 3
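
As a quick sanity check on the table, the AVG column is the unweighted mean of the seven benchmark columns, and the HallusionBench entry is itself the three-way mean described in the note above:

```python
# Recompute the AVG column for Gukbap-Ovis2-16B-VL from the table above.
scores = {
    "MMStar": 65.67, "MathVista": 73.70, "HallusionBench": 54.52, "AI2D": 85.46,
    "OCRBench": 84.80, "MMVet": 66.83, "MMBench_V11": 85.22,
}
print(round(sum(scores.values()) / len(scores), 2))  # 73.74

# HallusionBench is itself averaged over its three sub-accuracies:
def hallusion_score(a_acc: float, f_acc: float, q_acc: float) -> float:
    return (a_acc + f_acc + q_acc) / 3
```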
   
### Korean MM Benchmark Score (Zero-shot)

We evaluated internally with [🔥our code🔥](https://github.com/Marker-Inc-Korea/KoVLMEval).  
We utilized **gpt-4o-2024-08-06** as the judge for the `K-LLAVA-W` evaluation.  
  
| Model | K-MMBench | K-MMStar | K-DTCBench | K-LLAVA-W | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|
| GPT-4o-20241120 | - | - | - | 85.50 | - |
| **Gukbap-Ovis2-16B-VL🍚** | 88.24 | 61.00 | 79.58 | **66.67** | **73.87** |
| **Ovis2-16B** | **88.31** | **61.80** | 81.25 | 61.00 | 71.94 |
| Gukbap-Gemma2-9B-VL🍚 | 80.16 | 54.20 | 52.92 | 63.83 | 62.78 |
| Ovis1.6-Gemma2-9B | 52.46 | 50.40 | 47.08 | 55.67 | 51.40 |
| VARCO-VISION-14B | 87.16 | 58.13 | **85.42** | 51.17 | 70.47 |
| llama-3.2-Korean-Bllossom-AICA-5B | 26.01 | 21.60 | 17.08 | 45.33 | 27.51 |
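
`K-LLAVA-W` follows the LLaVA-Bench (In-the-Wild) style of GPT-judged scoring. The sketch below shows the general shape of such a judging call with **gpt-4o-2024-08-06**; the prompt and score parsing are illustrative placeholders, and the actual evaluation code is in the KoVLMEval repository linked above.

```python
# Illustrative LLaVA-Bench-style judging call for K-LLAVA-W; the real prompt
# and parsing live in the KoVLMEval repository.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a Korean visual question answering result.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the helpfulness and accuracy of each answer from 1 to 10.
Reply with exactly two numbers: "<reference_score> <candidate_score>"."""

def judge(question: str, reference: str, candidate: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0.0,
    )
    ref_score, cand_score = map(float, response.choices[0].message.content.split()[:2])
    # LLaVA-W reports the candidate score relative to the reference answer (x100).
    return 100.0 * cand_score / ref_score
```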
   
### MM Benchmarks
- Global MM Bench dataset: [OpenCompass MM leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal)
- Korean MM Bench dataset: [NCSOFT](https://huggingface.co/NCSOFT).
  
## Inference
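The snippet below follows the standard Ovis2 single-image inference flow: load the model with `trust_remote_code=True`, preprocess an image-plus-question query (`max_partition` controls high-resolution image tiling), and generate deterministically.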
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load model
if __name__ == '__main__':
    # other checkpoints: HumanF-MarkrAI/Gukbap-Ovis2-34B-VL (larger), AIDC-AI/Ovis2-16B (base)
    model = AutoModelForCausalLM.from_pretrained("HumanF-MarkrAI/Gukbap-Ovis2-16B-VL",
                                                torch_dtype=torch.bfloat16,
                                                multimodal_max_length=2048,
                                                cache_dir="/data/cache/",
                                                trust_remote_code=True).cuda()
    text_tokenizer = model.get_text_tokenizer()
    visual_tokenizer = model.get_visual_tokenizer()

    # single-image input (K-LLAVA-W)
    image_path = './images/ex_4.jpg'
    images = [Image.open(image_path)]
    max_partition = 9
    text = '이미지에서 잘리지 않은 과일은 몇 개인가요?'  # "How many fruits in the image are not cut?"
    query = f'<image>\n{text}'

    # format conversation
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    input_ids = input_ids.unsqueeze(0).to(device=model.device)
    attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
    if pixel_values is not None:
        pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
    pixel_values = [pixel_values]

    # generate output
    with torch.inference_mode():
        gen_kwargs = dict(
            max_new_tokens=2048,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=None,
            repetition_penalty=None,
            eos_token_id=model.generation_config.eos_token_id,
            pad_token_id=text_tokenizer.pad_token_id,
            use_cache=True
        )
        output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
        output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
        print(f'Output:\n{output}')
```
  
## Chat Prompt😶‍🌫️
```yaml
<|im_start|>user
<image>
Hello! My favorite food is Gukbap🍚!<|im_end|>
<|im_start|>assistant
(model answer)
```
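
In the inference example above, `model.preprocess_inputs` applies this template automatically, so you normally do not need to build the prompt string by hand.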

     
## Gukbap-VL Series models🍚🍚
- [HumanF-MarkrAI/Gukbap-Gemma2-9B-VL](https://huggingface.co/HumanF-MarkrAI/Gukbap-Gemma2-9B-VL)
- [HumanF-MarkrAI/Gukbap-Ovis2-34B-VL](https://huggingface.co/HumanF-MarkrAI/Gukbap-Ovis2-34B-VL)
   
  
## BibTeX
```
@article{HumanF-MarkrAI,
  title={Gukbap-Ovis2-16B-VL},
  author={MarkrAI},
  year={2025},
  url={https://huggingface.co/HumanF-MarkrAI}
}
```