Qwen/Qwen2-VL-7B-Instruct · Finetuning script using HuggingFace (No llama-factory)

2U1

Sep 11, 2024

•

edited Sep 11, 2024

https://github.com/2U1/Qwen2-VL-Finetune

I made a code for who wants to use the huggingface version to finetune, and having difficult using some other frameworks like me.

This code only uses huggingface for fine-tuning the 7B and 2B model.

Also, you can set different learning_rate for vision_model and language_model. ( Also for the merger)

Feedback and issues are welcome!

tanliboy

Sep 11, 2024

Thanks for sharing it! Any video demo with this fine-tuning codebase?

2U1

Sep 11, 2024

@tanliboy I'm working on with fine-tuning with video. It will soon be updated!

2U1

Sep 11, 2024

@tanliboy I've updated the code for video training! Do you need a inference demo with video via cli or gradio?

Anu0202

Sep 14, 2024

@2U1 thanks for the scripts for LORA tuning the model.

I was trying to finetune it on a small dataset ~2000 samples (single image single turn QA)

I was trying to do it on Kaggle with 29GB RAM and 2 * T4 GPUs with 15GB each...but I am always getting into CUDA OOM (no offload, on params offloaded) and RAM OOM if param and optimizer both offloaded to CPU. Is there any way out? What is the suggested compute?

Also, I am using 2B param model for now. Can you throw some light on this? Thanks!

2U1

Sep 15, 2024

@Anu0202 Thanks for your interest!
It takes a lot of memory so you should use offloading and decrease the max pixel values.

tanliboy

Sep 16, 2024

Thanks, @Anu0202 ! Will try it out .

lucreziaT

Oct 31, 2024

Hello, thank you for sharing the code! I followed all the instructions, so I have the environment with all the packages installed, and the train dataset in the right format.
When i launch the fine-tuning with : bash scripts/finetune_lora_vision.sh --data_path my.json --image_folder myfolder --model_id '/anaconda3/envs/qwen2/lib/python3.10/site-packages/transformers/models/qwen2_vl/'
I have many errors that are related to the flash_attn package: 'ImportError: /anaconda3/envs/qwen2/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c105Error4whatEv'
Do you have any clue about what the problem could be? My version of flash_attn is 2.5.8, of Python is 3.10.14 , CUDA is 12.6.77 and I am working on Ubuntu 20.04.6

2U1

Nov 1, 2024

@lucreziaT If so, you can downgrade the torch to torch==2.3.0.
I'll try some other combination with this again.

lucreziaT

Nov 4, 2024

Hello, in the end, I had to downgrade CUDA to version 12.1 .
I now have a new issue:
RuntimeError: shape mismatch: value tensor of shape [256, 3584] cannot be broadcast to indexing result of shape [0, 3584]
I see from here: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/discussions/33 that I should add a processor.apply_chat_template, but I don't know where. Do you have any clue?

2U1

Nov 5, 2024

@lucreziaT Does your data looks like

[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      }
    ]
  }
  ...
]

When you are using my code, You should have <image>\n in the text.

blingster

Nov 27, 2024

•

edited Nov 27, 2024

can you fine tune with more than 1 image? ie: could something below work?

[
{
"id": "000000033471",
"image": ["000000033471.jpg", "image2.jpg", "image3.jpg",],
"conversations": [
{
"from": "human",
"value": "\n\n\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
}
]
}
...
]

2U1

Dec 4, 2024

@blingster Yes You could finetune with more than 1 images. It works.

Rageshhf

Dec 9, 2024

Hi, I am working on creating a multimodal chatbot for a specific web application using a multimodal large language model (LLM). For text-based queries, I implement a retrieval mechanism. However, when the user query includes an image, I need to perform fine-tuning to handle such cases.

To achieve this, I scraped various pages of the web application and created some QA pairs using a vision-based LLM. These QA pairs were used to fine-tune the Qwen-2 VL model. Despite experimenting with multiple fine-tuning approaches, none of them have worked effectively.

The issues I encountered include:

The model loses its generalization capability and becomes overfitted to the custom data.
It answers only the questions similar to the training data, failing to handle broader or slightly varied queries.

I ensured the training data was as diverse as possible, yet the problem persists. Could you please help me figure out the issue? Are there any better alternatives or strategies I should consider? @2U1

2U1

Dec 9, 2024

@Rageshhf If you are performaing a full finetuning then you could fine tune only 1 epoch.
There is a research about catastrophic forgetting is a bit more often in LoRA (It depends on the rank of the lora module), so I think non-lora would be a bit more proper way for the task you are doing.

Rageshhf

Dec 10, 2024

Thanks for your suggestion! However, I have a doubt. Since LoRA freezes the pre-trained weights, shouldn't it help reduce catastrophic forgetting? In that case, how would a non-LoRA approach work for my scenario? Also, if I choose full fine-tuning, can I use the same dataset? @2U1

2U1

Dec 11, 2024

@Rageshhf According to this paper https://arxiv.org/pdf/2410.21228
While LoRA does freeze most pre-trained weights, research shows it can introduce "intruder dimensions" that reduce adaptability and degrade generalization, especially when tasks change over time. However it can be a bit more stable when you increase the rank when performing LoRA.

BTW, yes you can use the same dataset.

Rageshhf

Dec 11, 2024

Okey. Will look into it.

ascension-hf

Dec 16, 2024

•

edited Dec 16, 2024

I am experiencing the same problem as @Rageshhf . I am fine-tuning on a smaller specific coastal classification dataset that contains a satellite image with a question and answer prompting style. (1055 images). After training (for 1 epoch with the standard parameters I am experiencing the following:

The model sometimes forgets that it has the ability to analyse images. (Only thinks it is a language model)
Starts hallucinating like crazy sometimes and repeats itself a lot.
In longer conversations, it sometimes starts answering just nothing.
Performance on training data does improve a bit (loss is near zero, but performance on training data is not close to perfect accuracy)
Fails to perform inference well on varying the prompt given the same training image.

I have already tried different training scripts, the fine-tuning 8bit, and LoRA training. What do you recommend @2U1 to change? Do you have good results fine-tuning it yourself on a specific dataset? I am currently trying lower learning rates, (maybe they are way too high, so that it overwrites previous memory -> catastrophic forgetting).

Also, if 1 use single Q&A pair for each training sample, would my model lose the ability to do few-shot and longer conversations?

Looking forward for any help! :)

2U1

Dec 17, 2024

@ascension-hf I finetuned the model on my own dataset (about 170k image only on specific domain). It did not lose the general ability and it shows better perfornamce than 72b, on the domain I trained.
I've made few scenarios such as multi-turn conversation, describing and so on. This could be the reason my model wans't changed much.

If you have limited data, I think lowering the learning rate and using larger batch size (increase the accumulation step) could lead to a better result.

ascension-hf

Jan 7

Thanks! Did your dataset include various conversational techniques (multi turn convs / descriptions)? Which dataset did you use? Do you think if I train on single image single turn conversations, my model would lose it's ability to do multi-turn? Does that then need to be in the dataset as well? Why do you say ur model did not change much because it encapsulates all types of LLM interactions, and in this way is still able to support everything?

Thanks for your help.

2U1

Jan 7

•

edited Jan 10

@ascension-hf

I've included conversation types as single-turn/description/multi-turn/complex-reasoning and so on.
It's my own dataset I've made.
I think it won't lose the ability but you should be careful on overfitting, that this model is a bit more sensitive than other models I've tested.
I can't exactly understand what it means but it supports 'everything'. I meant for understanding and instruction following.

Since I saw the issues that qwen2-vl has overfitted, I've tested with a partial data of mine and I saw a similar issue. I think the hyperparameter for this modle should be carefully adjusted, or mix-up some data with some other open-sourced dataset to maintain the ability.

2U1

Jan 10

•

edited about 1 month ago

@ascension-hf I'll try to find out what was wrong about it. I'm testing with the 2B model and adjusting some hyperparameters.

I've seen some description in the InternVL code that, to add new domain-specific data on top of the general data from the open-source data will enhance downstream capabilities while retaining the foundational skills.

I think not all models should do this, but some models can show non-overfitting performance using this technique.

2U1

28 days ago

@ascension-hf @Rageshhf
Sorry it took a long time to answer the question. I was experimenting the parameters that are used to tune other models like MIniCPM-V, Deepseek-VL2, InterVL2 and so on.
Increasing the weight_decay to 0.1 and decreasing the adam_beta2 to 0.95 makes the model not to overfit to your training data (These are the values that other models were using).
You could try this one. Also, decreasing the lr and chaning the schedular to constant may help.

Rageshhf

27 days ago

Thanks for the update @2U1 . Will try it out.

jain645

19 days ago

Is this code able to fine-tune the 72b model if I change the model names?

2U1

19 days ago

@jain645 It should work but, I haven't tried becuase of the vram.

ascension-hf

12 days ago

Thanks @2U1 , increasing the weight decay seem to reduce the overfitting. Do you think this repo will also be compatible for fintuning of https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct?

2U1

12 days ago

@2U1 I'm not sure. Our country is on a holiday right now, So I'm gonna start working on it next week.

2U1

9 days ago

@ascension-hf I've updated the code for supporting Qwen2.5-VL!

ascension-hf

9 days ago

•

edited 8 days ago

@2U1 Thank you! I am currently using the new Qwen 2.5 model and it runs fine for me.
I do have a question how the input and labels are formatted in the fine-tuning repo.

It is currently implemented like this:

        input_ids = torch.cat([prompt_input_ids, response_input_ids], dim=1).squeeze(0)
        labels = torch.cat(
            [
                torch.tensor([IGNORE_INDEX] * len(prompt_input_ids[0])),
                response_input_ids.squeeze(0),
            ],
            dim=0,
        )

Why is the gpt response (label) concatenated in the input? Shouldn't it only be in the label? Can't the model just repeat the last part of the input and get the answer out? And why does the input length always need to match label length? (In my case I am padding my labels a lot)

I tried implementing it like this (only prompt in input and only got response in label while ensuring that length are equal):

        input_ids = torch.cat([prompt_input_ids], dim=1).squeeze(0)
        labels = torch.cat(
            [
                torch.tensor([IGNORE_INDEX] * ( len(prompt_input_ids[0]) - len(response_input_ids[0]) )),
                response_input_ids.squeeze(0),
            ],
            dim=0,
        )

But this gives me very bad results during training.

Thanks for your help!

2U1

8 days ago

@ascension-hf In a standard language modeling loss, each input token position corresponds to one output position that is compared againts a label. Hence, the input and label lengths must match.
By concatenating both prompt and the response into a single sequence and then masking the prompt tokens in the lables, it maintains the alignment.
This is crucial because if you only provide the prompt as input and have the response as the label (with extra padding), you break the natural correspondence between each input token and its target output.

martignago

7 days ago

Thank you @2U1 , could you maybe give me some advice?

I am on the Brazilian Portuguese Domain, have no interest in videos and the ML support we have around here is so that we don't even have a Decoder-Only model for our language that has competitive performance in our language.

So, previously my pipeline was using GOT-OCR-2.0 & Qwen2.5-3B finetuned on our data, the OCR to do the OCRzation and the model would extract/format the data in JSON, all datasets were made from GPT-4o data.

Now a new kind of document entered our domain, it has tons o new information, around 50 pages of text, most of the text is absolutely useless. Do you think I would be able to finetune to increase the knowledge in my domain, considering that the Qwen2.5-VL only supports english (I know it's tokenizer has like 130000 tokens), but according to them, it is mostly trained in english or chinese. Is it worth it?

Sorry if I look a little dumb....

Been full finetuning and using own architectures for not long.

Thank you!

2U1

7 days ago

•

edited 7 days ago

@martignago To the best of my knowledge Qwen2.5 supports Portugese, so it should work, if you train the model for OCR data. I'm using the model on English but, the OCR capability for Korean is quite good, so I think it's worth it.

2U1

7 days ago

@ascension-hf @martignago Sorry I found out that my latest code had some issues when training. It wasn't seeing the image when training. I've updated the code to properly train the model.
Also, Qwen2.5-VL has some issues with Flash-attention2. Should wait a little bit to use flash-attnetion2 when training.

Cherryblade29

6 days ago

hello , i have created a dataset for fine-tune qwen-2.5 vl for ocr tasks (arabic language) , could anyone , help me with it ?

ascension-hf

6 days ago

@2U1 Wow, that bug fix changed everything for me! My scores are going up and the model is doing what it is supposed to do! Thank you so much!