HumanF-MarkrAI/Gukbap-Qwen2.5-34B-VL🍚

Model Details🍚

Model Description

  • Developed by: HumanF-MarkrAI
  • Model type: Korean-VL-Qwen2.5-34B
  • Language(s): Korean + English
  • Context Length: 2048
  • License: cc-by-nc-4.0
  • Finetuned from model: AIDC-AI/Ovis2-34B

Model Sources

For training, we used 6× NVIDIA H100 80GB GPUs.

Implications🍚

If you want to know more about our model's details, please see the 🔥Gukbap-LMM Blog🔥.
We also provide Korean-LMM training code based on Ovis: 🔥Github🔥. Please give it a star⭐⭐!!

Training Method (SFT)🧐

The following papers contain the foundational methodologies for the dataset construction and training methods we are currently using.

SFT Text-Datasets (Private)

To build our open-source-based dataset, we used microsoft/WizardLM-2-8x22B through DeepInfra.
Our datasets were generated with the evolving (Evol-Instruct) system proposed by WizardLM. For training, we used 1,849 training examples and 200 validation examples.
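
For illustration only, here is a minimal sketch of one evolution step. It assumes DeepInfra's OpenAI-compatible endpoint (the base URL and the DEEPINFRA_API_KEY variable are assumptions) and uses a hypothetical evolution prompt; our actual pipeline and prompts are private.

import os
from openai import OpenAI

# Assumption: DeepInfra exposes an OpenAI-compatible API at this base URL.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

# Hypothetical depth-evolution prompt in the WizardLM Evol-Instruct style.
EVOLVE_PROMPT = (
    "Rewrite the following instruction so that it requires deeper, multi-step "
    "reasoning, while staying answerable and self-contained.\n\n"
    "Instruction: {seed}"
)

def evolve(seed: str) -> str:
    """Ask WizardLM-2-8x22B for a harder variant of a seed instruction."""
    response = client.chat.completions.create(
        model="microsoft/WizardLM-2-8x22B",
        messages=[{"role": "user", "content": EVOLVE_PROMPT.format(seed=seed)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(evolve("Describe the fruits shown in the image."))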

Benchmarks🤗

Global MM Benchmark Score (Zero-shot)

We evaluated internally using VLMEvalKit.
We used chatgpt-0125, gpt-4o-mini, and gpt-4-turbo as the judge models for MMBench, MathVista, and MMVet, respectively.

| Model | MMStar | MathVista | HallusionBench | AI2D | OCRBench | MMVet | MMBench_V11 | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|:-----:|:-----:|:-----:|
| Step-1o (closed model) | 69.3 | 74.7 | 55.8 | 89.1 | 92.6 | 82.8 | 87.3 | 78.8 |
| InternVL2.5-78B-MPO (Open) | 72.1 | 76.6 | 58.1 | 89.2 | 90.9 | 73.5 | 87.8 | 78.3 |
| Ovis2-34B (Open) | 69.2 | 76.1 | 58.8 | 88.3 | 89.4 | 77.1 | 86.5 | 77.9 |
| InternVL2.5-38B-MPO (Open) | 70.1 | 73.6 | 59.7 | 87.9 | 89.4 | 72.6 | 85.4 | 77.0 |
| **Gukbap-Qwen2.5-34B-VL🍚** | 69.33 | 77.40 | 55.66 | 88.31 | 84.7 | 74.13 | 86.53 | 76.58 |
| Gemini-2.0-Flash | 69.4 | 70.4 | 58.0 | 83.1 | 82.5 | 73.6 | 71.0 | 72.6 |
| GPT-4o-20241120 | 65.1 | 59.9 | 56.2 | 84.9 | 80.6 | 74.5 | 84.3 | 72.2 |
| Ovis1.6-Gemma2-9B (Open) | 62.00 | 67.10 | 51.96 | 84.42 | 82.60 | 64.68 | 82.20 | 70.71 |
| **Gukbap-Gemma2-9B-VL🍚** | 62.13 | 66.00 | 53.01 | 84.49 | 82.80 | 63.90 | 82.20 | 70.65 |
| LLaVA-OneVision-72B | 65.8 | 68.4 | 47.9 | 86.2 | 74.1 | 60.6 | 84.5 | 69.6 |
| VARCO-VISION-14B (NCSoft) | 64.1 | 67.6 | 46.8 | 83.9 | 81.5 | 53.0 | 81.2 | 68.3 |
| GPT-4o-mini-20240718 | 54.8 | 52.4 | 46.1 | 77.8 | 78.5 | 66.9 | 76.0 | 64.6 |

HallusionBench score: (aAcc + fAcc + qAcc) / 3
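
For reference, this aggregate is just the mean of the three HallusionBench accuracies reported by VLMEvalKit; a one-line sketch with purely illustrative inputs:

def hallusionbench_score(a_acc: float, f_acc: float, q_acc: float) -> float:
    """Mean of aAcc (overall), fAcc (per-figure), and qAcc (per-question-pair) accuracy."""
    return (a_acc + f_acc + q_acc) / 3

print(hallusionbench_score(60.0, 45.0, 51.0))  # illustrative values only -> 52.0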

Korean MM Benchmark Score (Zero-shot)

We evaluated internally using 🔥our code🔥.
We used gpt-4o-2024-08-06 as the judge model for the K-LLAVA-W evaluation.

| Model | K-MMBench | K-MMStar | K-DTCBench | K-LLAVA-W | AVG |
|:---------:|:-----:|:------:|:-----:|:-----:|:----:|
| GPT-4o-20241120 | NaN | NaN | NaN | 85.50 | NaN |
| **Gukbap-Qwen2.5-34B-VL🍚** | 89.10 | 68.13 | 77.08 | 69.00 | 75.83 |
| Ovis2-34B | 89.56 | 68.27 | 76.25 | 53.67 | 71.94 |
| **Gukbap-Gemma2-9B-VL🍚** | 80.16 | 54.20 | 52.92 | 63.83 | 62.78 |
| Ovis1.6-Gemma2-9B | 52.46 | 50.40 | 47.08 | 55.67 | 51.40 |
| VARCO-VISION-14B | 87.16 | 58.13 | 85.42 | 51.17 | 70.47 |
| llama-3.2-Korean-Bllossom-AICA-5B | 26.01 | 21.60 | 17.08 | 45.33 | 27.51 |
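
For illustration, here is a minimal sketch of the LLaVA-Bench-style judging call behind the K-LLAVA-W column above, assuming the OpenAI Python client and a simplified, hypothetical rubric (the exact prompts are in our evaluation code linked above):

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Hypothetical rubric; K-LLAVA-W follows the LLaVA-Bench (In-the-Wild) judging scheme.
JUDGE_PROMPT = (
    "You are grading a vision-language model.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Rate the model answer from 1 to 10 and reply with the number only."
)

def judge(question: str, reference: str, answer: str) -> str:
    """Score one model answer against a reference with the GPT judge."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # the judge model used for K-LLAVA-W
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content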


Inference

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load model
if __name__ == '__main__':
    # HumanF-MarkrAI/Gukbap-Qwen2-34B-VL
    # AIDC-AI/Ovis2-34B
    model = AutoModelForCausalLM.from_pretrained("HumanF-MarkrAI/Gukbap-Qwen2-34B-VL",
                                                torch_dtype=torch.bfloat16,
                                                multimodal_max_length=2048,
                                                cache_dir="/data/cache/",  # adjust the cache path to your environment
                                                trust_remote_code=True).cuda()
    text_tokenizer = model.get_text_tokenizer()
    visual_tokenizer = model.get_visual_tokenizer()

    # single-image input (K-LLAVA-W)
    image_path = './images/ex_4.jpg'
    images = [Image.open(image_path)]
    max_partition = 9
    text = '이미지에서 잘리지 않은 과일은 몇 개인가요?'  # "How many uncut fruits are in the image?"
    query = f'<image>\n{text}'

    # format conversation
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    input_ids = input_ids.unsqueeze(0).to(device=model.device)
    attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
    if pixel_values is not None:
        pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
    pixel_values = [pixel_values]

    # generate output
    with torch.inference_mode():
        gen_kwargs = dict(
            max_new_tokens=2048,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=None,
            repetition_penalty=None,
            eos_token_id=model.generation_config.eos_token_id,
            pad_token_id=text_tokenizer.pad_token_id,
            use_cache=True
        )
        output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
        output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
        print(f'Output:\n{output}')

Chat Prompt😶‍🌫️

<|im_start|>user
<image>
Hello! My favorite food is Gukbap🍚!<|im_end|>
<|im_start|>assistant
(model answer)
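
This is the template that model.preprocess_inputs renders from the '<image>\n{text}' query in the inference snippet above; a quick way to check it yourself (reusing model and image_path from that snippet):

# Inspect the rendered chat prompt for a query (continues the inference example).
prompt, input_ids, pixel_values = model.preprocess_inputs(
    '<image>\nHello! My favorite food is Gukbap🍚!',
    [Image.open(image_path)],
    max_partition=9,
)
print(prompt)  # should print the <|im_start|> template shown above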

Gukbap-VL Series models🍚🍚

  • HumanF-MarkrAI/Gukbap-Qwen2.5-34B-VL🍚
  • HumanF-MarkrAI/Gukbap-Gemma2-9B-VL🍚

BibTeX

@article{HumanF-MarkrAI,
  title={Gukbap-Qwen2.5-34B-VL},
  author={MarkrAI},
  year={2025},
  url={https://huggingface.co/HumanF-MarkrAI}
}