|
## Model Summary |
|
|
|
BigDocs-Phi-3.5-instruct is a multimodal model trained on BigDocs for document intelligence tasks.
|
|
|
We use microsoft/Phi-3.5-vision-instruct as the base model and perform two stages of training (sketched below):

1. Continual Pre-Training (CPT) with BigDocs-CPT, keeping the encoder and adapter trainable.

2. Fine-Tuning (FT) with DocDownstream-1.0, keeping the decoder and adapter trainable.
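
As a rough illustration of this stage-wise trainability (not the actual BigDocs training code), the sketch below freezes and unfreezes parameter groups by name. The substrings used to pick out the vision encoder, adapter, and decoder (`vision_embed_tokens`, `img_projection`, `model.layers`, `lm_head`) are assumptions based on the Phi-3.5-vision module layout and may need adjusting.

```python
from transformers import AutoModelForCausalLM

# Rough sketch of the two-stage trainability setup described above.
# The name substrings below are assumptions based on the Phi-3.5-vision
# module layout; inspect model.named_parameters() to confirm them.

def set_trainable(model, trainable_substrings):
    """Freeze all parameters, then unfreeze those whose names contain
    any of the given substrings."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="eager",  # avoids the flash_attn dependency
)

# Stage 1 (CPT on BigDocs-CPT): vision encoder + adapter trainable, decoder frozen.
set_trainable(model, ["vision_embed_tokens"])

# Stage 2 (FT on DocDownstream-1.0): decoder + adapter trainable, vision encoder frozen.
set_trainable(model, ["model.layers", "lm_head", "img_projection"])
```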
|
|
|
|
|
## General Document Benchmarks |
|
|
|
Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them against the corresponding base checkpoints, instruction-tuned models, and models trained on [DocStruct4M+DocDownstream]; the BigDocs-trained models deliver consistently strong performance across tasks.
|
|
|
| **Model** | **DocVQA**<br>*VAL* | **InfoVQA**<br>*VAL* | **DeepForm**<br>*TEST* | **KLC**<br>*TEST* | **WTQ**<br>*TEST* | **TabFact**<br>*TEST* | **ChartQA**<br>*TEST* | **TextVQA**<br>*VAL* | **MMMU**<br>*VAL* | **DudeMini**<br>*TEST* | **SlideVQA-M**<br>*TEST* | **TableVQA**<br>*TEST* | **Avg. Score** | |
|
|-----------------------------------|---------------------|-----------------------|-------------------------|-------------------|-------------------|-----------------------|-----------------------|----------------------|------------------|------------------------|--------------------------|-------------------------|----------------| |
|
| DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 | |
|
| DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 | |
|
| DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 | |
|
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 | |
|
| Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 | |
|
| Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 | |
|
| Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 | |
|
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
|
| Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 | |
|
| Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 | |
|
| **Phi3.5-Vision-4B + BigDocs (Ours)** | **87.05** | **70.05** | **70.97** | **37.45** | **51.21** | **81.24** | **81.56** | **68.72** | **45.00** | **36.15** | **32.47** | **67.77** | **60.80** | |
|
| LLaVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 | |
|
| LLaVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 | |
|
| LLaVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 | |
|
| Llama-3.2-90B | 74.15* | 48.71 | 4.18 | 1.81 | 24.20 | 63.01 | 11.36* | 71.69 | 57.78 | 41.24 | 26.09 | 41.57 | 38.82 | |
|
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 69.10 | 54.55 | 67.58 | 72.87 | 64.62 | |
|
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 64.78 | 35.11 | 0.00 | 81.27 | 50.73 | |
|
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 58.22 | 48.15 | 52.05 | 80.43 | 57.05 | |
|
| Qwen2-VL-72B | 96.50 | 84.50 | 30.45 | 24.78 | 55.63 | 0.00 | 88.30 | 85.50 | 64.50 | 35.87 | 2.15 | 74.23 | 58.40 | |
|
|
|
|
|
### Input Formats |
|
|
|
BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct: |
|
|
|
Single image: |
|
``` |
|
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n |
|
``` |
|
|
|
Multi-turn conversations: |
|
``` |
|
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n |
|
``` |
|
|
|
For multi-image usage, add one `<|image_{i}|>` placeholder per image at the front of the prompt, with indices starting from 1. An example prompt is shown below:
|
``` |
|
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n |
|
``` |
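
As a quick sanity check, these placeholder strings can also be assembled in Python. The snippet below is a small sketch that mirrors the multi-image format above; the image count and question are example values only.

```python
# Small sketch: build a multi-image prompt string matching the format above.
# The image count and question are example values only.
num_images = 3
placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
question = "Summarize these pages."
prompt = f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"
print(prompt)
```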
|
### Loading the model locally |
|
After obtaining the BigDocs-Phi-3.5-instruct model checkpoint, users can run the following sample code for inference.
|
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="flash_attention_2",
)

# For best performance, use num_crops=4 for multi-frame and num_crops=16 for single-frame inputs.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4,
)

images = []
placeholder = ""

# Note: if you run out of memory, consider reducing the number of frames in this example.
for i in range(1, 20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder + "Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 1000,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args,
)

# Remove input tokens from the generated sequence before decoding.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(response)
```
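
For single-image document tasks (e.g. DocVQA-style questions), the same `model` and `processor` can be reused. The sketch below assumes a local file `document.png` and an example question, both placeholders; per the comment above, re-creating the processor with `num_crops=16` is recommended for single-frame inputs.

```python
# Single-image sketch reusing `model` and `processor` from the example above.
# "document.png" and the question are placeholders; for single-frame inputs,
# the comment above recommends re-creating the processor with num_crops=16.
from PIL import Image

image = Image.open("document.png")
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is the total amount on this invoice?"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

output_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
)

# Remove input tokens from the generated sequence before decoding.
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```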