	---
	library_name: transformers
	tags:
	- robotics
	license: mit
	datasets:
	- ACIDE/user-vlm-pt
	language:
	- en
	base_model:
	- google/paligemma2-10b-ft-docci-448
	pipeline_tag: image-text-to-text
	---
	# User-VLM 360°
	![Architecture](result-final.pdf)

	## Overview
	User-VLM 360° is a series of personalized Vision-Language Models (VLMs) designed for social human-robot interaction. The models introduce user-aware tuning, which addresses the semantic gap that arises when a user's query is misaligned with the scene observed by a robot's camera. Unlike traditional instruction tuning, which introduces latency and reduces performance, User-VLM 360° enables real-time, robust adaptation in dynamic robotic environments by inherently aligning cross-modal user representations.

	These models allow open-weight VLMs to be customized to produce personalized responses based on user attributes such as age, gender, emotion, and ethnicity, while maintaining ethical and safety considerations.

	## Training Details
	Base Model: User-VLM 360° is built on PaliGemma 2, which consists of a SigLIP vision encoder and Gemma 2 as the language model.

	![Deployment on Pepper](pepper2.pdf)

	### Fine-tuning Process:
	1. Base Model Tuning:
	   - Tuned the MLP layer to provide user and scene descriptions over 1 epoch.
	2. Instruction Model Tuning:
	   - Instruction-tuned the base model using personalized, user-specific Q&A datasets.
	   - Used a Sparse Mixture of LoRA Experts (MoLE) (3 LoRA modules, rank=16, alpha=32, with one module chosen) and a standalone LoRA (rank=16, alpha=32) over 2 epochs.
	3. Bias Mitigation:
	   - Applied Direct Preference Optimization (DPO) over 1 epoch using LoRA (rank=16, alpha=32).
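
The two mechanisms named in the steps above can be illustrated with a small, dependency-free sketch. It is not the authors' training code. The rank and alpha values come from the list above, which gives a LoRA scaling factor of alpha / rank = 2. Everything else is an assumption made for illustration: the toy matrices use rank 1 so they stay readable, the linear gate with top-1 routing is one plausible reading of "one chosen", and `beta=0.1` in the DPO loss is a hypothetical default that the model card does not state.

```python
import math

RANK, ALPHA = 16, 32      # values quoted in the fine-tuning steps
SCALE = ALPHA / RANK      # standard LoRA scaling factor, here 2.0


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(a, b, x, scale=SCALE):
    """LoRA update scale * B @ (A @ x); A is (r, d_in), B is (d_out, r)."""
    return [scale * y for y in matvec(b, matvec(a, x))]


def mole_top1(w, experts, gate, x):
    """Frozen layer output plus the one LoRA expert with the highest gate score.

    The linear gate and top-1 routing are assumptions, not details from the card.
    """
    scores = matvec(gate, x)
    k = max(range(len(scores)), key=scores.__getitem__)
    a, b = experts[k]
    base = matvec(w, x)
    return [y + d for y, d in zip(base, lora_delta(a, b, x))], k


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities.

    beta is a hypothetical value; the card does not report it.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# Toy setup: identity frozen weight and three rank-1 experts (illustrative only).
w = [[1, 0], [0, 1]]
experts = [
    ([[1, 0]], [[1], [0]]),
    ([[0, 1]], [[0], [1]]),
    ([[1, 1]], [[1], [1]]),
]
gate = [[1, 0], [0, 1], [-1, -1]]

out, k = mole_top1(w, experts, gate, [1.0, 3.0])
print(k, out)  # 1 [1.0, 9.0]
```

Only the selected expert contributes to the output, so the cost per input stays close to that of a single LoRA. The DPO loss equals log 2 when the policy and reference agree, and it falls as the policy favors the chosen response more strongly than the reference does.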

	## Model Usage
	### Example Code:
	```python
	# The base model is not instruction-tuned, so it is not suitable for conversational use.
	import torch
	from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
	from transformers.image_utils import load_image

	model_id = "ACIDE/User-VLM-10B-base"
	# Use a GPU when one is available; otherwise fall back to the CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"

	processor = PaliGemmaProcessor.from_pretrained(model_id)
	model = PaliGemmaForConditionalGeneration.from_pretrained(
	    model_id, torch_dtype=torch.bfloat16
	).to(device)

	def generate_description(image, model, processor):
	    """Generate a description for a single image."""
	    prompt = "<image> "
	    model_inputs = processor(text=prompt, images=image, return_tensors="pt")
	    # Cast floating-point inputs to bfloat16 and move them to the model's device.
	    model_inputs = model_inputs.to(torch.bfloat16).to(model.device)
	    input_len = model_inputs["input_ids"].shape[-1]

	    with torch.inference_mode():
	        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
	    # Drop the prompt tokens and keep only the newly generated ones.
	    generation = generation[0][input_len:]
	    return processor.decode(generation, skip_special_tokens=True)

	# Example usage
	url = "https://media.istockphoto.com/id/1282695693/photo/little-boy-sitting-on-chair-at-the-table.jpg"
	image = load_image(url)

	description = generate_description(image, model, processor)
	print(description)
	```

	## Ethical Considerations & Limitations
	- Research-Only Use: This model is intended strictly for research purposes and should not be deployed in real-world applications without further ethical validation.
	- Demographic Personalization: While the model can adapt responses based on user attributes, care must be taken to prevent bias and discrimination.
	- No Liability: The authors do not accept any liability regarding the use of this model. Responsibility for ethical and appropriate use remains with the users.

	## Citation
	If you use this model in your research, please cite the following papers:
	```bibtex
	@article{rahimi2025user,
	  title={User-VLM: LLM Contextualization with Multimodal Pre-trained User Models},
	  author={Rahimi, Hamed and Abrini, Mouad and Khoramshahi, Mahdi and Chetouani, Mohamed},
	  year={2025}
	}

	@article{rahimi2025user360,
	  title={User-VLM 360°: Personalized Vision Language Models with User-aware Tuning for Social Human Robot Interactions},
	  author={Rahimi, Hamed and Bhaj, Adil and Abrini, Mouad and Khoramshahi, Mahdi and Ghogho, Mounir and Chetouani, Mohamed},
	  year={2025}
	}
	```

	## License
	This model is licensed under the MIT License.

	## Contact
	For any questions or issues regarding the model, please open an issue on the repository or contact the maintainers directly.