Strange CUDA Error for Multi-GPU Setup

#6
by floschne - opened

Hi,
First of all thanks for sharing the model and providing such detailed docs!

I'm experiencing a strange CUDA error when running the model on 4 A40 (46GB) GPUs with the device map code you provided.
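Roughly, the load follows your docs; here is a minimal sketch of the arguments I pass (the model id and the memory caps below are placeholders on my side, not the exact values):

```python
# Sketch of the multi-GPU load, not the exact script. The real call is
#   AutoModelForCausalLM.from_pretrained("<model-id>", **load_kwargs)
# where <model-id> is the model from this repo.
load_kwargs = dict(
    torch_dtype="bfloat16",                     # model runs in bf16
    device_map="auto",                          # let Accelerate shard layers across GPUs
    max_memory={i: "44GiB" for i in range(4)},  # leave headroom on the 46GB A40s
    attn_implementation="flash_attention_2",    # use flash-attn 2.x kernels
)
print(load_kwargs["device_map"], len(load_kwargs["max_memory"]))
```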

2025-02-08 23:57:22.458 | ERROR    | __main__:main:210 - Error during response generation for sample 0: varlen_fwd(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: Optional[torch.Tensor], arg4: torch.Tensor, arg5: torch.Tensor, arg6: Optional[torch.Tensor], arg7: Optional[torch.Tensor], arg8: Optional[torch.Tensor], arg9: Optional[torch.Tensor], arg10: int, arg11: int, arg12: float, arg13: float, arg14: bool, arg15: bool, arg16: int, arg17: int, arg18: float, arg19: bool, arg20: Optional[torch.Generator]) -> list[torch.Tensor]

Invoked with: tensor([[[0., 0., 0.,  ..., 0., -0., 0.],
         [0., 0., 0.,  ..., 0., -0., 0.],
         [0., 0., 0.,  ..., -0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., -0.,  ..., 0., 0., 0.],
         [0., 0., -0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 0., -0., 0.],
         [0., 0., 0.,  ..., 0., -0., 0.],
         [0., 0., 0.,  ..., -0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., -0.,  ..., 0., 0., 0.],
         [0., 0., -0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 0., -0., 0.],
         [0., 0., 0.,  ..., 0., -0., 0.],
         [0., 0., 0.,  ..., -0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., -0.,  ..., 0., 0., 0.],
         [0., 0., -0.,  ..., 0., 0., 0.]],

        ...,

         [-0.0012, -0.0014, -0.0009,  ...,  0.0038,  0.0004, -0.0022]],

        [[ 0.0004, -0.0002, -0.0007,  ..., -0.0007, -0.0022,  0.0021],
         [ 0.0006, -0.0008,  0.0017,  ..., -0.0023, -0.0007,  0.0018],
         [-0.0007, -0.0018,  0.0015,  ..., -0.0010, -0.0012, -0.0011],
         ...,
         [ 0.0023, -0.0028,  0.0023,  ...,  0.0049,  0.0030, -0.0028],
         [-0.0020,  0.0014,  0.0004,  ...,  0.0001, -0.0033, -0.0050],
         [-0.0012, -0.0014, -0.0009,  ...,  0.0038,  0.0004, -0.0022]]],
       device='cuda:1', dtype=torch.bfloat16), None, tensor([1748], device='cuda:1', dtype=torch.int32), tensor([1748], device='cuda:1', dtype=torch.int32), None, None, None, None, 4575867776795852673, 4575867776795852673, 0.0, 0.08838834764831845, False, True, -1, -1, 0.0, False, None

After some googling, I think it's somehow related to flash-attention (the binary signature of varlen_fwd doesn't match what the modeling code passes in), but I'm afraid I can't fix it without some serious monkey patching. I'm using flash-attn==2.7.3, CUDA==12.1, and transformers==4.48.0. Do you know how to fix this?
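One workaround I could try in the meantime (this is my assumption, not something verified for this model): since the varlen_fwd binding comes from the flash-attn extension, loading with PyTorch's built-in SDPA attention instead should avoid calling into flash-attn at all:

```python
# Hypothetical fallback, not verified for this model: select PyTorch's
# scaled-dot-product attention so the flash-attn varlen_fwd binding is
# never invoked. Passed to AutoModelForCausalLM.from_pretrained(...).
fallback_kwargs = dict(
    torch_dtype="bfloat16",
    device_map="auto",
    attn_implementation="sdpa",  # built-in kernel, no flash-attn dependency
)
print(fallback_kwargs["attn_implementation"])
```

That would sidestep the crash at some cost in speed/memory, but I'd prefer to keep flash-attn working if possible.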
