|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# HumanF-MarkrAI/Gukbap-Qwen2.5-34B-VL🍚 |
|
|
|
## Model Details🍚 |
|
|
|
### Model Description |
|
- **Developed by:** HumanF-MarkrAI |
|
- **Model type:** Korean-VL-Qwen2.5-34B |
|
- **Language(s):** Korean + English |
|
- **Context Length:** 2048 |
|
- **License:** cc-by-nc-4.0 |
|
- **Finetuned from model:** [AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B). |
|
|
|
### Model Sources |
|
For training, we used 6× `H100 80GB` GPUs.
|
|
|
|
|
### Implications🍚 |
|
For details about our model, please see the [🔥Gukbap-LMM Blog🔥](https://kyujinpy.tistory.com/169).

We also provide Korean-LMM training code based on Ovis: [🔥Github🔥](https://github.com/Marker-Inc-Korea/Ovis2-FFT-Korean). Please give it a star⭐⭐!!
|
|
|
|
|
### Training Method (SFT)🧐 |
|
The following papers describe the foundational methodologies behind our dataset construction and training method.
|
- [LIMA](https://arxiv.org/abs/2305.11206). |
|
- [Ovis](https://arxiv.org/abs/2405.20797). |
|
|
|
|
|
### SFT Text-Datasets (Private) |
|
To build the `open-source based dataset`, we used `microsoft/WizardLM-2-8x22B` through [DeepInfra](https://deepinfra.com/).

Our datasets were generated with the `Evolving system` proposed by [WizardLM](https://wizardlm.github.io/WizardLM2/); a minimal sketch of such a call is shown below.

For training, we used 1,849 training samples and 200 validation samples.
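The sketch below is only an illustration of what one evolving call could look like. It assumes DeepInfra's OpenAI-compatible endpoint, and `EVOLVE_PROMPT` and `evolve_instruction` are hypothetical placeholders rather than our private pipeline.

```python
# Minimal sketch of one "evolving" call, assuming DeepInfra's OpenAI-compatible
# endpoint. The actual evolving prompts and data pipeline are private;
# EVOLVE_PROMPT below is a hypothetical placeholder, not our production prompt.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_API_KEY",                # assumption: a DeepInfra API token
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra's OpenAI-compatible API
)

EVOLVE_PROMPT = (
    "Rewrite the following instruction so that it becomes more complex and "
    "requires deeper reasoning, while staying answerable:\n\n{instruction}"
)

def evolve_instruction(instruction: str) -> str:
    """Ask WizardLM-2-8x22B to produce a harder variant of a seed instruction."""
    response = client.chat.completions.create(
        model="microsoft/WizardLM-2-8x22B",
        messages=[{"role": "user", "content": EVOLVE_PROMPT.format(instruction=instruction)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(evolve_instruction("Explain the difference between supervised fine-tuning and preference optimization."))
```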
|
|
|
- **Wizard-Korea-Datasets:** [MarkrAI/Markr_WizardLM_train_ver4](https://huggingface.co/datasets/MarkrAI/Markr_WizardLM_train_ver4). |
|
> Learning rate: 2e-5; Epochs: 3
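For reference, the hyperparameters above map roughly onto a standard Hugging Face `TrainingArguments` setup. This is only an assumed sketch; the actual runs use the Ovis training code linked above, and the batch size, scheduler, and warmup shown here are hypothetical.

```python
# Assumed sketch of the SFT hyperparameters above (lr 2e-5, 3 epochs, bf16 on H100s).
# The real run uses the Ovis training code linked above; batch size, scheduler and
# warmup below are hypothetical examples, not the exact values we used.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gukbap-qwen2.5-34b-vl-sft",
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,                          # H100 80GB x6
    per_device_train_batch_size=1,      # hypothetical
    gradient_accumulation_steps=8,      # hypothetical
    lr_scheduler_type="cosine",         # hypothetical
    warmup_ratio=0.03,                  # hypothetical
    logging_steps=10,
    save_strategy="epoch",
)
```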
|
|
|
|
|
## Benchmarks🤗
|
|
|
### Global MM Benchmark Score (Zero-shot) |
|
|
|
We evaluated our model internally using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit?tab=readme-ov-file).

We used **chatgpt-0125**, **gpt-4o-mini**, and **gpt-4-turbo** as the judge models for `MMBench`, `MathVista`, and `MMVet`, respectively.
|
|
|
| Model | MMStar | MathVista | HallusionBench | AI2D | OCRBench | MMVet | MMBench_V11 | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|:-----:|:-----:|:-----:|
| Step-1o (closed model) | 69.3 | **74.7** | **89.1** | 55.8 | **92.6** | **82.8** | 87.3 | **78.8** |
| InternVL2.5-78B-MPO (Open) | **72.1** | 76.6 | 58.1 | **89.2** | 90.9 | 73.5 | **87.8** | 78.3 |
| Ovis2-34B (Open) | 69.2 | 76.1 | 58.8 | 88.3 | 89.4 | 77.1 | 86.5 | 77.9 |
| InternVL2.5-38B-MPO (Open) | 70.1 | 73.6 | 59.7 | 87.9 | 89.4 | 72.6 | 85.4 | 77.0 |
| **Gukbap-Qwen2.5-34B-VL🍚** | 69.33 | 77.40 | 55.66 | 88.31 | 84.7 | 74.13 | 86.53 | **76.58** |
| Gemini-2.0-Flash | 69.4 | 70.4 | 58.0 | 83.1 | 82.5 | 73.6 | 71.0 | 72.6 |
| GPT-4o-20241120 | 65.1 | 59.9 | 56.2 | 84.9 | 80.6 | 74.5 | 84.3 | 72.2 |
| Ovis1.6-Gemma2-9B (Open) | 62.00 | 67.10 | 84.42 | 51.96 | 82.60 | 64.68 | 82.20 | 70.71 |
| **Gukbap-Gemma2-9B-VL🍚** | 62.13 | 66.00 | 84.49 | 53.01 | 82.80 | 63.90 | 82.20 | **70.65** |
| LLaVA-OneVision-72B | 65.8 | 68.4 | 47.9 | 86.2 | 74.1 | 60.6 | 84.5 | 69.6 |
| VARCO-VISION-14B (NCSoft) | 64.1 | 67.6 | 46.8 | 83.9 | 81.5 | 53.0 | 81.2 | 68.3 |
| GPT-4o-mini-20240718 | 54.8 | 52.4 | 46.1 | 77.8 | 78.5 | 66.9 | 76.0 | 64.6 |
|
> HallusionBench score: (aAcc + fAcc + qAcc) / 3 |
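For clarity, the HallusionBench column above is the simple mean of its three sub-accuracies. The snippet below only restates that formula; the numbers in it are made-up placeholders, since the per-metric breakdown is not reported here.

```python
# HallusionBench column = mean of the three sub-accuracies (aAcc, fAcc, qAcc).
# The values below are illustrative placeholders, not reported results.
def hallusion_bench_score(a_acc: float, f_acc: float, q_acc: float) -> float:
    return (a_acc + f_acc + q_acc) / 3

print(round(hallusion_bench_score(70.0, 50.0, 47.0), 2))  # 55.67
```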
|
|
|
### Korean MM Benchmark Score (Zero-shot) |
|
|
|
We evaluated our model internally using [🔥our code🔥](https://github.com/Marker-Inc-Korea/KoVLMEval).

We used **gpt-4o-2024-08-06** as the judge model for the `K-LLAVA-W` evaluation (a minimal sketch of such a judge call follows the table below).
|
|
|
| Model | K-MMBench | K-MMStar | K-DTCBench | K-LLAVA-W | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|
| GPT-4o-20241120 | NaN | NaN | NaN | 85.50 | NaN |
| **Gukbap-Qwen2.5-34B-VL🍚** | 89.10 | 68.13 | 77.08 | **69.00** | **75.83** |
| **Ovis2-34B** | **89.56** | **68.27** | 76.25 | 53.67 | 71.94 |
| Gukbap-Gemma2-9B-VL🍚 | 80.16 | 54.20 | 52.92 | 63.83 | 62.78 |
| Ovis1.6-Gemma2-9B | 52.46 | 50.40 | 47.08 | 55.67 | 51.40 |
| VARCO-VISION-14B | 87.16 | 58.13 | **85.42** | 51.17 | 70.47 |
| llama-3.2-Korean-Bllossom-AICA-5B | 26.01 | 21.60 | 17.08 | 45.33 | 27.51 |
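The sketch below only shows the general shape of such an LLM-as-judge call with the OpenAI Python SDK. The actual prompts, rubric, and score parsing live in our [KoVLMEval](https://github.com/Marker-Inc-Korea/KoVLMEval) repository; `JUDGE_PROMPT` here is a hypothetical placeholder.

```python
# Illustrative LLM-as-judge call for K-LLAVA-W using gpt-4o-2024-08-06.
# The real rubric and parsing are in the KoVLMEval repo; JUDGE_PROMPT is a
# hypothetical placeholder, not the actual evaluation prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a vision-language model.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Give a score from 1 to 10 on the first line."
)

def judge(question: str, reference: str, prediction: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```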
|
|
|
### MM Benchmarks |
|
- Global MM Bench dataset: [OpenCompass MM leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal)
|
- Korean MM Bench dataset: [NCSOFT](https://huggingface.co/NCSOFT). |
|
|
|
## Inference |
|
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"

if __name__ == '__main__':
    # load model
    # HumanF-MarkrAI/Gukbap-Qwen2-34B-VL
    # AIDC-AI/Ovis2-34B
    model = AutoModelForCausalLM.from_pretrained(
        "HumanF-MarkrAI/Gukbap-Qwen2-34B-VL",
        torch_dtype=torch.bfloat16,
        multimodal_max_length=2048,
        cache_dir="/data/cache/",
        trust_remote_code=True,
    ).cuda()
    text_tokenizer = model.get_text_tokenizer()
    visual_tokenizer = model.get_visual_tokenizer()

    # single-image input (K-LLAVA-W)
    image_path = './images/ex_4.jpg'
    images = [Image.open(image_path)]
    max_partition = 9
    text = '이미지에서 잘리지 않은 과일은 몇 개인가요?'  # "How many uncut fruits are in the image?"
    query = f'<image>\n{text}'

    # format conversation
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    input_ids = input_ids.unsqueeze(0).to(device=model.device)
    attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
    if pixel_values is not None:
        pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
    pixel_values = [pixel_values]

    # generate output
    with torch.inference_mode():
        gen_kwargs = dict(
            max_new_tokens=2048,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=None,
            repetition_penalty=None,
            eos_token_id=model.generation_config.eos_token_id,
            pad_token_id=text_tokenizer.pad_token_id,
            use_cache=True
        )
        output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
        output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
        print(f'Output:\n{output}')
```
|
|
|
## Chat Prompt😶🌫️ |
|
```yaml
<|im_start|>user
<image>
Hello! My favorite food is Gukbap🍚!<|im_end|>
<|im_start|>assistant
(model answer)
```
|
|
|
|
|
## Gukbap-VL Series models🍚🍚 |
|
- [HumanF-MarkrAI/Gukbap-Gemma2-9B-VL](https://huggingface.co/HumanF-MarkrAI/Gukbap-Gemma2-9B-VL) |
|
|
|
|
|
## BibTeX |
|
```
@article{HumanF-MarkrAI,
  title={Gukbap-Qwen2.5-34B-VL},
  author={MarkrAI},
  year={2025},
  url={https://huggingface.co/HumanF-MarkrAI}
}
```