|
--- |
|
library_name: transformers |
|
tags: |
|
- robotics |
|
license: mit |
|
datasets: |
|
- ACIDE/user-vlm-pt |
|
language: |
|
- en |
|
base_model: |
|
- google/paligemma2-10b-ft-docci-448 |
|
pipeline_tag: image-text-to-text |
|
--- |
|
# User-VLM 360° |
|
 |
|
|
|
## Overview |
|
**User-VLM 360°** is a series of personalized Vision-Language Models (VLMs) designed for social human-robot interactions. The model introduces **User-aware tuning**, addressing the **semantic gap** that arises from the misalignment between user queries and the observed scene as captured by a robot's camera. Unlike traditional instruction tuning, which introduces latency and reduces performance, **User-VLM 360°** enables **real-time, robust adaptation** in dynamic robotic environments by inherently aligning cross-modal user representations. |
|
|
|
This model allows for **customization of open-weight VLMs** to produce **personalized responses** based on demographic attributes such as age, gender, emotion, and ethnicity while maintaining ethical and safety considerations. |
|
|
|
## Training Details |
|
**Base Model:** User-VLM 360° is built on **PaliGemma 2**, which consists of a **SigLIP vision encoder** and **Gemma 2 as the language model**. |
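
For reference, these components are exposed as separate submodules in the Hugging Face `PaliGemmaForConditionalGeneration` implementation. The minimal sketch below simply inspects them and shows a stage-1-style setup in which only the MLP projector is left trainable; attribute names follow recent `transformers` releases and may differ between versions, and the actual training loop is omitted.

```python
# Minimal sketch: inspect the PaliGemma 2 submodules that User-VLM 360° builds on
# and freeze everything except the MLP projector, as in stage 1 of the fine-tuning
# process described below. Attribute names follow recent transformers releases.
import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-10b-ft-docci-448", torch_dtype=torch.bfloat16
)

print(type(model.vision_tower).__name__)           # SigLIP vision encoder
print(type(model.multi_modal_projector).__name__)  # MLP projector between vision and language
print(type(model.language_model).__name__)         # Gemma 2 language model

# Stage-1 style setup: train only the projector, keep the encoder and LM frozen.
for name, param in model.named_parameters():
    param.requires_grad = "multi_modal_projector" in name
```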
|
|
|
 |
|
|
|
### Fine-tuning Process: |
|
1. **Base Model Tuning:** |
|
- Tuned the **MLP layer** to provide **user and scene descriptions** over **1 epoch**. |
|
2. **Instruction Model Tuning:** |
|
- Instruction-tuned the **base model** using **personalized, user-specific Q&A datasets**. |
|
- Used a **Sparse Mixture of LoRA Experts (MoLE)** (3 LoRA modules, rank=16, alpha=32, with one expert selected per input) alongside a standalone **LoRA (rank=16, alpha=32)** over **2 epochs**; an illustrative MoLE sketch follows this list.
|
3. **Bias Mitigation:** |
|
- Applied **Direct Preference Optimization (DPO)** over **1 epoch** using **LoRA (rank=16, alpha=32)**; see the DPO sketch below.
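
The standalone LoRA adapters above map directly onto `peft`'s `LoraConfig(r=16, lora_alpha=32)`. The sparse Mixture of LoRA Experts, however, is not a stock `peft` feature; the toy layer below only illustrates the routing idea under the stated hyper-parameters (3 experts, rank 16, alpha 32, one expert chosen per token) and is not the implementation used for User-VLM 360°.

```python
# Illustrative MoLE layer (not the authors' code): a frozen linear projection
# augmented with 3 LoRA experts and a hard top-1 router.
import torch
import torch.nn as nn


class MoLELinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 3, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # keep the pretrained weights frozen
            p.requires_grad = False
        self.scaling = alpha / r
        self.router = nn.Linear(base.in_features, num_experts)
        self.lora_A = nn.ModuleList([nn.Linear(base.in_features, r, bias=False) for _ in range(num_experts)])
        self.lora_B = nn.ModuleList([nn.Linear(r, base.out_features, bias=False) for _ in range(num_experts)])
        for expert_B in self.lora_B:                # zero-init B so the layer starts equal to the base
            nn.init.zeros_(expert_B.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        choice = self.router(x).argmax(dim=-1, keepdim=True)   # hard top-1 expert per token
        out = self.base(x)
        for i, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            out = out + (choice == i) * B(A(x)) * self.scaling  # add only the chosen expert's update
        return out
```

Note that a hard argmax gate passes no gradient to the router; practical MoLE training typically relies on a softmax or Gumbel-style relaxation, so treat this purely as an illustration of the routing idea.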
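
For the bias-mitigation stage, a hedged sketch using the `trl` library is shown below. The preference file name and its `prompt`/`chosen`/`rejected` columns are assumptions, vision-language DPO additionally expects an image field, and argument names (e.g. `processing_class` vs. `tokenizer`) vary between `trl` versions.

```python
# Hedged sketch of the DPO bias-mitigation stage with a rank-16 / alpha-32 LoRA adapter.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from trl import DPOConfig, DPOTrainer

# In practice DPO would start from the instruction-tuned checkpoint; the base
# model id from this card is used here only as a stand-in.
model_id = "ACIDE/User-VLM-10B-base"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Hypothetical preference data with "prompt", "chosen" and "rejected" fields.
preference_data = load_dataset("json", data_files="preference_pairs.json", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="user-vlm-dpo", num_train_epochs=1),  # single DPO epoch, as described above
    train_dataset=preference_data,
    processing_class=processor,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # rank-16 / alpha-32 adapter
)
trainer.train()
```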
|
|
|
## Model Usage |
|
### Example Code: |
|
```python
# The base model is not instruction-tuned and therefore is not suitable for use in a conversational mode.
import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "ACIDE/User-VLM-10B-base"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)


def generate_description(image, model, processor):
    prompt = "<image> "
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        generation = generation[0][input_len:]
        decoded = processor.decode(generation, skip_special_tokens=True)

    return decoded


# Example usage
url = "https://media.istockphoto.com/id/1282695693/photo/little-boy-sitting-on-chair-at-the-table.jpg"
image = load_image(url)

description = generate_description(image, model, processor)
print(description)
```
|
|
|
## Ethical Considerations & Limitations |
|
- **Research-Only Use:** This model is intended strictly for **research purposes** and should not be deployed in real-world applications without further ethical validation. |
|
- **Demographic Personalization:** While the model can adapt responses based on user attributes, **care must be taken to prevent bias and discrimination**. |
|
- **No Liability:** The authors **do not accept any liability** regarding the use of this model. Responsibility for ethical and appropriate use remains with the users. |
|
|
|
## Citation |
|
If you use this model in your research, please cite the following papers: |
|
```bibtex |
|
@article{rahimi2025user,
  title={User-VLM: LLM Contextualization with Multimodal Pre-trained User Models},
  author={Rahimi, Hamed and Abrini, Mouad and Khoramshahi, Mahdi and Chetouani, Mohamed},
  year={2025}
}

@article{rahimi2025user360,
  title={User-VLM 360°: Personalized Vision Language Models with User-aware Tuning for Social Human Robot Interactions},
  author={Rahimi, Hamed and Bhaj, Adil and Abrini, Mouad and Khoramshahi, Mahdi and Ghogho, Mounir and Chetouani, Mohamed},
  year={2025}
}
|
``` |
|
|
|
## License |
|
This model is licensed under the **MIT License**. |
|
|
|
## Contact |
|
For any questions or issues regarding the model, please open an issue on the repository or contact the maintainers directly. |