|
---
license: mit
pipeline_tag: image-to-text
library_name: transformers
---
|
|
|
# Q-SiT: Image Quality Scoring and Interpreting with Large Language Models |
|
|
|
Q-SiT is a model for image quality scoring and interpreting. Recognizing the inherent connection between perception and decision-making in the human visual system, it uses a Large Language Model to perform both tasks simultaneously. Unlike previous approaches, which treat scoring and interpreting as separate tasks, Q-SiT handles them within a single unified framework.
|
|
|
Project page: https://github.com/Q-Future/Q-SiT |
|
|
|
## Quicker Start with Hugging Face AutoModel |
|
|
|
No need to install the GitHub repo. Make sure you are using Transformers version 4.45.0 (`pip install transformers==4.45.0`).
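
If you are unsure which version is installed, a quick sanity check like the one below can help (this is just a convenience snippet, not part of the official Q-SiT code):

```python
import transformers

# The examples below were written against transformers 4.45.0; other releases
# may change the LLaVA-OneVision processor or generation behavior.
assert transformers.__version__ == "4.45.0", (
    f"Expected transformers 4.45.0, got {transformers.__version__}"
)
```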
|
|
|
### Image Quality Interpreting Chat |
|
|
|
```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "zhangzicheng/q-sit-mini"
# To use the primary version, switch to:
# model_id = "zhangzicheng/q-sit"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How is the clarity of the human in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true", stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True).split("assistant")[-1])
# Example output: very low
```
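
The same pipeline works for any quality-related question and any image source. As a minimal variation, a local file can be used instead of the URL above; the path `my_photo.jpg` and the question text are placeholders, and the `model` and `processor` objects from the previous block are reused:

```python
# Reuse the model and processor loaded above; only the image and question change.
local_image = Image.open("my_photo.jpg").convert("RGB")  # hypothetical local file

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How severe is the noise in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=local_image, text=prompt, return_tensors="pt").to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True).split("assistant")[-1])
```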
|
|
|
### Image Quality Scoring |
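
Scoring does not rely on free-form generation. The example below appends the assistant prefix "The quality of this image is ", generates a single next token, reads the logits of the five rating words (Excellent, Good, Fair, Poor, Bad), converts them to probabilities with a softmax, and takes their weighted average. This is exactly what the `wa5` helper computes:

$$
\text{score} = \sum_{i=1}^{5} p_i\, w_i, \qquad p = \operatorname{softmax}\big(z_\text{Excellent}, z_\text{Good}, z_\text{Fair}, z_\text{Poor}, z_\text{Bad}\big), \qquad w = (1,\ 0.75,\ 0.5,\ 0.25,\ 0),
$$

which yields a score in the range 0 to 1.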
|
|
|
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration, AutoTokenizer
import numpy as np

def wa5(logits):
    # Softmax over the five rating logits, then a weighted average with
    # Excellent=1, Good=0.75, Fair=0.5, Poor=0.25, Bad=0.
    logprobs = np.array([logits["Excellent"], logits["Good"], logits["Fair"], logits["Poor"], logits["Bad"]])
    probs = np.exp(logprobs) / np.sum(np.exp(logprobs))
    return np.inner(probs, np.array([1, 0.75, 0.5, 0.25, 0]))

model_id = "zhangzicheng/q-sit-mini"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define rating tokens
toks = ["Excellent", "Good", "Fair", "Poor", "Bad"]
ids_ = [id_[0] for id_ in tokenizer(toks)["input_ids"]]
print("Rating token IDs:", ids_)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Assume you are an image quality evaluator. "
                                     "Your rating should be chosen from the following five categories: "
                                     "Excellent, Good, Fair, Poor, and Bad (from high to low). "
                                     "How would you rate the quality of this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Load image
raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true", stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

# Manually append the assistant prefix "The quality of this image is "
prefix_text = "The quality of this image is "
prefix_ids = tokenizer(prefix_text, return_tensors="pt")["input_ids"].to(0)
inputs["input_ids"] = torch.cat([inputs["input_ids"], prefix_ids], dim=-1)
inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])  # Update attention mask

# Generate exactly one token (the rating)
output = model.generate(
    **inputs,
    max_new_tokens=1,  # Generate only the rating token
    output_logits=True,
    return_dict_in_generate=True,
)

# Extract logits for the generated rating token
last_logits = output.logits[-1][0]  # Shape: [vocab_size]
logits_dict = {tok: last_logits[id_].item() for tok, id_ in zip(toks, ids_)}
weighted_score = wa5(logits_dict)
print("Weighted average score:", weighted_score)
# Weighted average score: 0.045549712192942585 (in the range 0-1)
# Multiply by 5 if you want a 0-5 scale.
```
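
To score a folder of images, the steps above can be wrapped into a small helper. The sketch below is not part of the official evaluation scripts; it simply reuses the `model`, `processor`, `tokenizer`, `toks`, `ids_`, `conversation`, and `wa5` objects defined above, and the directory path is a placeholder.

```python
import os

def score_image(image):
    """Return a Q-SiT quality score in [0, 1] for a single PIL image."""
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(0, torch.float16)
    # Append the assistant prefix and generate the single rating token, as above.
    prefix_ids = tokenizer("The quality of this image is ", return_tensors="pt")["input_ids"].to(0)
    inputs["input_ids"] = torch.cat([inputs["input_ids"], prefix_ids], dim=-1)
    inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])
    out = model.generate(**inputs, max_new_tokens=1, output_logits=True, return_dict_in_generate=True)
    last_logits = out.logits[-1][0]
    return wa5({tok: last_logits[id_].item() for tok, id_ in zip(toks, ids_)})

# Hypothetical directory of images to score.
image_dir = "/path/to/images"
for name in sorted(os.listdir(image_dir)):
    if name.lower().endswith((".jpg", ".jpeg", ".png")):
        img = Image.open(os.path.join(image_dir, name)).convert("RGB")
        print(name, round(score_image(img), 4))
```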
|
|
|
For dataset evaluation scripts, please refer to [this directory](https://github.com/Q-Future/Q-SiT/tree/main/eval_scripts). For training information, see the [Training Q-SiT](https://github.com/Q-Future/Q-SiT#training-q-sit) section of the GitHub repository. |
|
|
|
## Citation |
|
|
|
If you find our work useful, please cite our paper as: |
|
```bibtex
@misc{zhang2025teachinglmmsimagequality,
      title={Teaching LMMs for Image Quality Scoring and Interpreting},
      author={Zicheng Zhang and Haoning Wu and Ziheng Jia and Weisi Lin and Guangtao Zhai},
      year={2025},
      eprint={2503.09197},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.09197},
}
```