OpenGVLab/InternVL2-40B

@@ -76,6 +76,7 @@ We also welcome you to experience the InternVL2 series models in our [online dem
 > Please use transformers==4.37.2 to ensure the model works normally.
 ```python
 import numpy as np
 import torch
 import torchvision.transforms as T
@@ -163,17 +164,44 @@ def load_image(image_file, input_size=448, max_num=6):
     return pixel_values
 path = 'OpenGVLab/InternVL2-40B'
-# You need to set device_map='auto' to use multiple GPUs for inference.
-import os
-os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True,
-    device_map='auto').eval()
 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
 # set the max number of tiles in `max_num`
 pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
@@ -317,6 +345,10 @@ print(f'User: {question}')
 print(f'Assistant: {response}')
 ```
 ## Deployment
 ### LMDeploy
@@ -575,6 +607,10 @@ InternVL 2.0 是一个多模态大语言模型系列，包含各种规模的模
 For example code, please [click here](#quick-start).
 ## Deployment
 ### LMDeploy

 > Please use transformers==4.37.2 to ensure the model works normally.
 ```python
+import math
 import numpy as np
 import torch
 import torchvision.transforms as T
     return pixel_values
+def split_model(model_name):
+    device_map = {}
+    world_size = torch.cuda.device_count()
+    num_layers = {'InternVL2-8B': 32, 'InternVL2-26B': 48,
+                  'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
+    # Since the first GPU will be used for ViT, treat it as half a GPU.
+    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
+    num_layers_per_gpu = [num_layers_per_gpu] * world_size
+    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
+    layer_cnt = 0
+    for i, num_layer in enumerate(num_layers_per_gpu):
+        for _ in range(num_layer):
+            device_map[f'language_model.model.layers.{layer_cnt}'] = i
+            layer_cnt += 1
+    device_map['vision_model'] = 0
+    device_map['mlp1'] = 0
+    device_map['language_model.model.tok_embeddings'] = 0
+    device_map['language_model.model.embed_tokens'] = 0
+    device_map['language_model.output'] = 0
+    device_map['language_model.model.norm'] = 0
+    device_map['language_model.lm_head'] = 0
+    # Keep the last decoder layer on GPU 0, alongside the final norm and output head.
+    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
+    return device_map
 path = 'OpenGVLab/InternVL2-40B'
+device_map = split_model('InternVL2-40B')
+print(device_map)
+# If you set `load_in_8bit=True`, you will need one 80GB GPU.
+# If you set `load_in_8bit=False`, you will need at least two 80GB GPUs.
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
+    load_in_8bit=True,
     low_cpu_mem_usage=True,
     trust_remote_code=True,
+    device_map=device_map).eval()
 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
 # set the max number of tiles in `max_num`
 pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
 print(f'Assistant: {response}')
 ```
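The layer allocation performed by `split_model` above can be checked without any GPUs. The following is a minimal sketch, not part of the original example: the helper names `split_layers` and `layer_device_map` are hypothetical, and the clamping step is an added assumption that keeps the total from exceeding the real layer count when the two `ceil` roundings overshoot (for example, 60 layers on 4 GPUs).

```python
import math


def split_layers(num_layers, world_size):
    """Return the number of decoder layers per GPU, mirroring split_model."""
    if world_size < 1:
        raise ValueError('world_size must be at least 1')
    # GPU 0 also hosts the ViT, so it is treated as half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    # Assumption (not in the original code): clamp the counts so that
    # they sum to exactly num_layers instead of overshooting.
    remaining = num_layers
    for i, count in enumerate(counts):
        counts[i] = min(count, remaining)
        remaining -= counts[i]
    return counts


def layer_device_map(num_layers, world_size):
    """Build only the decoder-layer part of the device map."""
    device_map = {}
    layer = 0
    for gpu, count in enumerate(split_layers(num_layers, world_size)):
        for _ in range(count):
            device_map[f'language_model.model.layers.{layer}'] = gpu
            layer += 1
    # As in split_model, the last layer is moved back to GPU 0.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map


print(split_layers(60, 2))  # [20, 40]
print(split_layers(60, 4))  # [9, 18, 18, 15]
```

With two GPUs, InternVL2-40B's 60 layers split 20/40, so GPU 0 has room for the vision encoder. Without the clamp, four GPUs would yield 9 + 18 + 18 + 18 = 63 entries, three of which name layers that do not exist.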
+## Finetune
+SWIFT from the ModelScope community supports fine-tuning InternVL on image and video data. Please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
 ## Deployment
 ### LMDeploy
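 For example code, please [click here](#quick-start).
+## Finetune
+SWIFT from the ModelScope community supports fine-tuning InternVL on image and video data. For details, see [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md).
 ## Deployment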
 ### LMDeploy