---
library_name: transformers
tags: []
---
# HumanF-MarkrAI/Gukbap-Ovis2-16B-VL🍚
## Model Details🍚
### Model Description
- **Developed by:** HumanF-MarkrAI
- **Model type:** Korean-VL-Ovis2-16B
- **Language(s):** Korean + English
- **Context Length:** 2048
- **License:** cc-by-4.0
- **Finetuned from model:** [AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B).
### Model Sources
For training, we used 4× `H100 80GB` GPUs.
### Implications🍚
If you want to know more about our model, please see the [🔥Gukbap-LMM Blog🔥](https://kyujinpy.tistory.com/169).
We also provide Korean-LMM training code based on Ovis: [🔥Github🔥](https://github.com/Marker-Inc-Korea/Ovis2-FFT-Korean). Please give it a star⭐⭐!!
### Training Method (SFT)🧐
The following papers provide the foundational methodology for our dataset construction and training approach.
- [LIMA](https://arxiv.org/abs/2305.11206).
- [Ovis](https://arxiv.org/abs/2405.20797).
### SFT Text-Datasets (Private)
To build our `open-source-based dataset`, we used `microsoft/WizardLM-2-8x22B` through [DeepInfra](https://deepinfra.com/).
Our datasets are generated with the `Evolving system` (Evol-Instruct) proposed by [WizardLM](https://wizardlm.github.io/WizardLM2/); a minimal sketch of the API call is shown after the dataset link below.
For training, we used 1,849 training samples and 200 validation samples.
- **Wizard-Korea-Datasets:** [MarkrAI/Markr_WizardLM_train_ver4](https://huggingface.co/datasets/MarkrAI/Markr_WizardLM_train_ver4).
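For illustration only, the sketch below shows how a seed instruction could be evolved with `microsoft/WizardLM-2-8x22B` through DeepInfra's OpenAI-compatible endpoint. The evolution prompt and the seed instruction are hypothetical placeholders; our actual Evol-Instruct prompts and filtering pipeline are private.
```python
# Hedged sketch: evolve a seed instruction with WizardLM-2-8x22B via DeepInfra.
# EVOLVE_PROMPT is a hypothetical placeholder; the real Evol-Instruct prompts are private.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra's OpenAI-compatible endpoint
    api_key="YOUR_DEEPINFRA_API_KEY",
)

EVOLVE_PROMPT = (
    "Rewrite the following instruction so that it becomes more complex and "
    "requires deeper reasoning, while staying answerable and in Korean.\n\n"
    "Instruction: {instruction}"
)

def evolve(instruction: str) -> str:
    """Return a more complex (evolved) version of a seed instruction."""
    response = client.chat.completions.create(
        model="microsoft/WizardLM-2-8x22B",
        messages=[{"role": "user", "content": EVOLVE_PROMPT.format(instruction=instruction)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(evolve("서울에서 부산까지 가는 방법을 알려주세요."))  # "Tell me how to get from Seoul to Busan."
```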
> Learning rate: 2e-5; Epoch: 2
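As a rough illustration of these hyperparameters (not our actual training script, which follows the Ovis-based repository linked above), an equivalent Hugging Face `TrainingArguments` setup might look like this; every value not stated on this card is a placeholder.
```python
# Hedged sketch: learning rate and epoch count come from the card above;
# all other values (batch size, accumulation, logging) are hypothetical placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gukbap-ovis2-16b-vl-sft",  # hypothetical output path
    num_train_epochs=2,                      # Epoch: 2 (from the card)
    learning_rate=2e-5,                      # Learning rate: 2e-5 (from the card)
    per_device_train_batch_size=1,           # placeholder; actual value not published
    gradient_accumulation_steps=8,           # placeholder; actual value not published
    bf16=True,                               # H100 GPUs support bfloat16
    logging_steps=10,
    save_strategy="epoch",
)
```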
## Benchmarks🤗
### Global MM Benchmark Score (Zero-shot)
We evaluated internally using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit?tab=readme-ov-file).
We used **chatgpt-0125**, **gpt-4o-mini**, and **gpt-4-turbo** for the GPT-assisted evaluation of `MMBench`, `MathVista`, and `MMVet`, respectively.
| Model | MMStar | MathVista | HallusionBench | AI2D | OCRBench | MMVet | MMBench_V11 | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|:-----:|:-----:|:-----:|
| Step-1o (closed model) | 69.3 | **74.7** | **89.1** | 55.8 | **92.6** | **82.8** | 87.3 | **78.8** |
| InternVL2.5-78B-MPO (Open) | **72.1** | 76.6 | 58.1 | **89.2** | 90.9 | 73.5 | **87.8** | 78.3 |
| InternVL2.5-38B-MPO (Open) | 70.1 | 73.6 | 59.7 | 87.9 | 89.4 | 72.6 | 85.4 | 77.0 |
| Ovis2-16B (Open) | 67.2 | 73.7 | 56.8 | 86.3 | 87.9 | 68.4 | 85.7 | 75.14 |
| **Gukbap-Ovis2-16B-VL🍚** | 65.67 | 73.70 | 54.52 | 85.46 | 84.80 | 66.83 | 85.22 | **73.74** |
| Gemini-2.0-Flash | 69.4 | 70.4 | 58.0 | 83.1 | 82.5 | 73.6 | 71.0 | 72.6 |
| GPT-4o-20241120 | 65.1 | 59.9 | 56.2 | 84.9 | 80.6 | 74.5 | 84.3 | 72.2 |
| Ovis1.6-Gemma2-9B (Open) | 62.00 | 67.10 | 84.42 | 51.96 | 82.60 | 64.68 | 82.20 | 70.71 |
| **Gukbap-Gemma2-9B-VL🍚** | 62.13 | 66.00 | 84.49 | 53.01 | 82.80 | 63.90 | 82.20 | **70.65** |
| LLaVA-OneVision-72B | 65.8 | 68.4 | 47.9 | 86.2 | 74.1 | 60.6 | 84.5 | 69.6 |
| VARCO-VISION-14B (NCSoft) | 64.1 | 67.6 | 46.8 | 83.9 | 81.5 | 53.0 | 81.2 | 68.3 |
| GPT-4o-mini-20240718 | 54.8 | 52.4 | 46.1 | 77.8 | 78.5 | 66.9 | 76.0 | 64.6 |
> HallusionBench score: (aAcc + fAcc + qAcc) / 3
### Korean MM Benchmark Score (Zero-shot)
We evaluated internally using [🔥our code🔥](https://github.com/Marker-Inc-Korea/KoVLMEval).
We used **gpt-4o-2024-08-06** as the judge for the `K-LLAVA-W` evaluation; a minimal sketch of this judging step follows the results table below.
| Model | K-MMBench | K-MMStar | K-DTCBench | K-LLAVA-W | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|
| GPT-4o-20241120 | NaN | NaN | NaN | 85.50 | NaN |
| **Gukbap-Ovis2-16B-VL🍚** | 88.24 | 61.00 | 79.58 | **66.67** | **73.87** |
| **Ovis2-16B** | **88.31** | **61.80** | 81.25 | 61.00 | 71.94 |
| Gukbap-Gemma2-9B-VL🍚 | 80.16 | 54.20 | 52.92 | 63.83 | 62.78 |
| Ovis1.6-Gemma2-9B | 52.46 | 50.40 | 47.08 | 55.67 | 51.40 |
| VARCO-VISION-14B | 87.16 | 58.13 | **85.42** | 51.17 | 70.47 |
| llama-3.2-Korean-Bllossom-AICA-5B | 26.01 | 21.60 | 17.08 | 45.33 | 27.51 |
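For reference, `K-LLAVA-W` follows an LLaVA-Bench(-Wild)-style GPT-judged protocol. The sketch below only illustrates that judging step under standard assumptions; the rubric prompt is hypothetical, and the exact prompts and scoring script are in the [KoVLMEval](https://github.com/Marker-Inc-Korea/KoVLMEval) repository.
```python
# Hedged sketch of an LLaVA-Bench(-Wild)-style GPT judge, which K-LLAVA-W follows.
# The rubric prompt below is hypothetical; the real prompts live in the KoVLMEval repo.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are given an image description, a question, a reference answer,
and a model answer. Rate each answer from 1 to 10 for helpfulness, relevance,
accuracy, and level of detail. Reply with two integers: "reference_score model_score"."""

def judge(context: str, question: str, reference: str, answer: str) -> tuple[int, int]:
    """Return (reference_score, model_score) from the GPT judge."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"[Context]\n{context}\n\n[Question]\n{question}\n\n"
                                        f"[Reference]\n{reference}\n\n[Model answer]\n{answer}"},
        ],
        temperature=0.0,
    )
    ref_score, model_score = map(int, response.choices[0].message.content.split()[:2])
    return ref_score, model_score

# The reported K-LLAVA-W number is typically 100 * model_score / reference_score,
# averaged over all samples (assumption based on the LLaVA-Bench protocol).
```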
### MM Benchmarks
- Global MM Bench dataset: [OpenCompass MM leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal)
- Korean MM Bench dataset: [NCSOFT](https://huggingface.co/NCSOFT).
## Inference
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# Optionally pin the visible GPU before loading the model
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"

if __name__ == '__main__':
    # Load the model; the 34B variants (HumanF-MarkrAI/Gukbap-Ovis2-34B-VL, AIDC-AI/Ovis2-34B)
    # can be loaded the same way.
    model = AutoModelForCausalLM.from_pretrained("HumanF-MarkrAI/Gukbap-Ovis2-16B-VL",
                                                 torch_dtype=torch.bfloat16,
                                                 multimodal_max_length=2048,
                                                 cache_dir="/data/cache/",  # optional local cache directory
                                                 trust_remote_code=True).cuda()
    text_tokenizer = model.get_text_tokenizer()
    visual_tokenizer = model.get_visual_tokenizer()

    # single-image input (K-LLAVA-W example)
    image_path = './images/ex_4.jpg'
    images = [Image.open(image_path)]
    max_partition = 9
    text = '이미지에서 잘리지 않은 과일은 몇 개인가요?'  # "How many uncut fruits are in the image?"
    query = f'<image>\n{text}'

    # format conversation
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    input_ids = input_ids.unsqueeze(0).to(device=model.device)
    attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
    if pixel_values is not None:
        pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
    pixel_values = [pixel_values]

    # generate output
    with torch.inference_mode():
        gen_kwargs = dict(
            max_new_tokens=2048,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=None,
            repetition_penalty=None,
            eos_token_id=model.generation_config.eos_token_id,
            pad_token_id=text_tokenizer.pad_token_id,
            use_cache=True
        )
        output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
        output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
        print(f'Output:\n{output}')
```
## Chat Prompt😶🌫️
```yaml
<|im_start|>user<image>
Hello! My favorite food is Gukbap🍚!<|im_end|>
<|im_start|>assistant
(model answer)
```
## Gukbap-VL Series models🍚🍚
- [HumanF-MarkrAI/Gukbap-Gemma2-9B-VL](https://huggingface.co/HumanF-MarkrAI/Gukbap-Gemma2-9B-VL)
- [HumanF-MarkrAI/Gukbap-Ovis2-34B-VL](https://huggingface.co/HumanF-MarkrAI/Gukbap-Ovis2-34B-VL)
## BibTeX
```
@article{HumanF-MarkrAI,
title={Gukbap-Ovis2-16B-VL},
author={MarkrAI},
year={2025},
url={https://huggingface.co/HumanF-MarkrAI}
}
```