User-VLM 360°

Architecture

Overview

User-VLM 360° is a series of personalized Vision-Language Models (VLMs) designed for social human-robot interaction. The models introduce User-aware tuning, which addresses the semantic gap between user queries and the scene observed by the robot's camera. Unlike traditional instruction tuning, which adds latency and degrades performance, User-aware tuning inherently aligns cross-modal user representations, enabling real-time, robust adaptation in dynamic robotic environments.

This approach allows open-weight VLMs to be customized to produce personalized responses based on user attributes such as age, gender, emotion, and ethnicity, while maintaining ethical and safety considerations.

Training Details

Base Model: User-VLM 360° is built on PaliGemma 2, which consists of a SigLIP vision encoder and Gemma 2 as the language model.
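This composition can be verified from the model configuration alone, without downloading the full weights. A minimal sketch follows; the model_type strings in the comments reflect recent transformers releases and may differ slightly across versions:

from transformers import AutoConfig

# Inspect the architecture without loading the 10B parameters
config = AutoConfig.from_pretrained("ACIDE/User-VLM-10B-base")
print(config.model_type)                # "paligemma"
print(config.vision_config.model_type)  # "siglip_vision_model" (SigLIP encoder)
print(config.text_config.model_type)    # "gemma2" (Gemma 2 language model)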

Fine-tuning Process:

  1. Base Model Tuning:
    • Tuned the MLP projection layer for 1 epoch so the model produces user and scene descriptions.
  2. Instruction Model Tuning:
    • Instruction-tuned the base model on personalized, user-specific Q&A datasets.
    • Used a Sparse Mixture of LoRA Experts (MoLE) with 3 LoRA experts (rank=16, alpha=32, one expert selected per input) together with a standalone LoRA adapter (rank=16, alpha=32), trained over 2 epochs; a sketch of the MoLE layer follows this list.
  3. Bias Mitigation:
    • Applied Direct Preference Optimization (DPO) with LoRA (rank=16, alpha=32) over 1 epoch; see the DPO sketch below.
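The exact MoLE implementation from the paper is not reproduced in this card; the following is a minimal, self-contained sketch of a sparse mixture of LoRA experts wrapping a frozen linear layer, with three experts and a top-1 router matching the configuration above (class and variable names are illustrative):

import torch
import torch.nn as nn

class SparseMoLELinear(nn.Module):
    """Illustrative sketch: frozen base linear layer + 3 LoRA experts with top-1 routing."""
    def __init__(self, base: nn.Linear, num_experts: int = 3, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen; only adapters and router train
        self.scaling = alpha / r
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(r, base.in_features) * 0.01) for _ in range(num_experts)]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(base.out_features, r)) for _ in range(num_experts)]
        )
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x):
        out = self.base(x)
        expert = self.router(x).argmax(dim=-1)  # "one chosen": top-1 expert per token
        for e in range(len(self.A)):
            # Compute each expert's low-rank update and keep it only where that expert was routed
            mask = (expert == e).unsqueeze(-1).to(x.dtype)
            out = out + ((x @ self.A[e].T) @ self.B[e].T) * self.scaling * mask
        return out

For the bias-mitigation step, here is a minimal text-only sketch of DPO with peft and trl. Argument names follow recent trl releases; the dataset contents and output path are placeholders, `model` and `processor` are assumed to come from the instruction-tuned checkpoint, and the actual training used image-conditioned preference data:

from datasets import Dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Placeholder preference pairs; the real data contrasts unbiased vs. biased responses
prefs = Dataset.from_dict({
    "prompt":   ["Describe the user."],
    "chosen":   ["A neutral, respectful description."],
    "rejected": ["A stereotyped description."],
})

lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model=model,                                                   # instruction-tuned model from step 2
    args=DPOConfig(output_dir="user-vlm-dpo", num_train_epochs=1),
    train_dataset=prefs,
    processing_class=processor.tokenizer,
    peft_config=lora_config,
)
trainer.train()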

Model Usage

Example Code:


# The base model is not instruction-tuned and is therefore not suitable for conversational use.
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "ACIDE/User-VLM-10B-base"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

def generate_description(image, model, processor):
    # "<image>" inserts the image tokens; the empty suffix elicits a caption-style description
    prompt = "<image> "
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

    # Strip the prompt tokens and decode only the newly generated text
    generation = generation[0][input_len:]
    return processor.decode(generation, skip_special_tokens=True)

# Example usage
from transformers.image_utils import load_image
url = "https://media.istockphoto.com/id/1282695693/photo/little-boy-sitting-on-chair-at-the-table.jpg"
image = load_image(url)

description = generate_description(image, model, processor)
print(description)
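For conversational use, the instruction-tuned variant should be loaded instead, with the question placed after the image token. A minimal sketch follows; the repo ID is an assumption inferred from the base checkpoint's naming, so check the ACIDE collection for the published name:

# Hypothetical: conversational use with the instruction-tuned variant.
# "ACIDE/User-VLM-10B-Instruct" is an assumed repo ID, not confirmed by this card.
chat_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "ACIDE/User-VLM-10B-Instruct", torch_dtype=torch.bfloat16
).to(device)

prompt = "<image> What activity would you suggest for this user?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(device)
with torch.inference_mode():
    out = chat_model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))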

Ethical Considerations & Limitations

  • Research-Only Use: This model is intended strictly for research purposes and should not be deployed in real-world applications without further ethical validation.
  • Demographic Personalization: While the model can adapt responses based on user attributes, care must be taken to prevent bias and discrimination.
  • No Liability: The authors do not accept any liability regarding the use of this model. Responsibility for ethical and appropriate use remains with the users.

Citation

If you use this model in your research, please cite the following papers:

@article{rahimi2025user,
  title={User-VLM: LLM Contextualization with Multimodal Pre-trained User Models},
  author={Rahimi, Hamed and Abrini, Mouad and Khoramshahi, Mahdi and Chetouani, Mohamed},
  year={2025}
}

@article{rahimi2025uservlm360,
  title={User-VLM 360°: Personalized Vision Language Models with User-aware Tuning for Social Human Robot Interactions},
  author={Rahimi, Hamed and Bhaj, Adil and Abrini, Mouad and Khoramshahi, Mahdi and Ghogho, Mounir and Chetouani, Mohamed},
  year={2025}
}

License

This model is licensed under the MIT License.

Contact

For any questions or issues regarding the model, please open an issue on the repository or contact the maintainers directly.
