Zhaorun committed
Commit f0c9ee0 · verified · 1 Parent(s): ab92717

Update README.md

Files changed (1):
  1. README.md +10 -28
README.md CHANGED
@@ -7,39 +7,21 @@ sdk: static
 pinned: false
 ---

- # :woman_judge: [**MJ-Bench**: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?]()
-
-
- <!-- <h3 align="center"><a href="https://arxiv.org/abs/2407.04842" style="color:#9C276A">
- **MJ-BENCH**: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?</a></h3> -->
-
- <h5 align="center"> If our project helps you, please consider giving us a star ⭐ 🥹🙏 </h2>
-
- <h5 align="center">
-
-
- <div align="center">
- <img src="https://github.com/MJ-Bench/MJ-Bench.github.io/blob/main/static/images/dataset_overview.png" width="80%">
- </div>
-
- [![project](https://img.shields.io/badge/🥳-Project-9C276A.svg)](https://mj-bench.github.io/)
- [![hf_space](https://img.shields.io/badge/🤗-Huggingface-9C276A.svg)](https://huggingface.co/MJ-Bench)
- [![hf_leaderboard](https://img.shields.io/badge/🤗-Leaderboard-9C276A.svg)](https://huggingface.co/spaces/MJ-Bench/MJ-Bench-Leaderboard)
- [![Dataset](https://img.shields.io/badge/🤗-Dataset-9C276A.svg)](https://huggingface.co/datasets/MJ-Bench/MJ-Bench)<br>
- [![arXiv](https://img.shields.io/badge/Arxiv-2407.04842-AD1C18.svg?logo=arXiv)](https://arxiv.org/abs/2407.04842)
- [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FMJ-Bench%2FMJ-Bench&count_bg=%23C25AE6&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=Visitor&edge_flat=false)](https://hits.seeyoufarm.com)
- [![GitHub issues](https://img.shields.io/github/issues/MJ-Bench/MJ-Bench?color=critical&label=Issues)](https://github.com/MJ-Bench/MJ-Bench/issues)
- [![GitHub Stars](https://img.shields.io/github/stars/MJ-Bench/MJ-Bench?style=social)](https://github.com/MJ-Bench/MJ-Bench/stargazers)
- <br>
-
- </h5>
-
- Multimodal judges play a pivotal role in Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF). They serve as judges, providing crucial feedback to align foundation models (FMs) with desired behaviors. However, the evaluation of these multimodal judges often lacks thoroughness, leading to potential misalignment and unsafe fine-tuning outcomes.
-
- To address this, we introduce **MJ-Bench**, a novel benchmark designed to evaluate multimodal judges using a comprehensive preference dataset. MJ-Bench assesses feedback for image generation models across four key perspectives: ***alignment***, ***safety***, ***image quality***, and ***bias***.
-
- We evaluate a wide range of multimodal judges, including:
-
- - Scoring models
- - Open-source Vision-Language Models (VLMs) such as the LLaVA family
- - Closed-source VLMs like GPT-4o and Claude 3
+ # πŸ‘©β€βš–οΈ [**MJ-Bench**: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?](https://mj-bench.github.io/)
11
 
12
+ Project page: https://mj-bench.github.io/
13
 
14
+ ![Dataset Overview](https://raw.githubusercontent.com/MJ-Bench/MJ-Bench.github.io/main/static/images/dataset_overview.png)
 
15
 
 
16
 
17
+ While text-to-image models like DALL·E 3 and Stable Diffusion are rapidly proliferating, they often suffer from hallucination, bias, and unsafe, low-quality output. To address these issues effectively, the models must be aligned with desired behaviors using feedback from a multimodal judge. Despite their significance, current multimodal judges are often evaluated inadequately, leaving their capabilities and limitations poorly understood and potentially leading to misalignment and unsafe fine-tuning outcomes.
+
+ To address this issue, we introduce MJ-Bench, a novel benchmark that incorporates a comprehensive preference dataset to evaluate multimodal judges on the feedback they provide to image generation models across four key perspectives: **alignment**, **safety**, **image quality**, and **bias**.
+
+ Specifically, we evaluate a large variety of multimodal judges, including:
+
+ - 6 smaller-scale CLIP-based scoring models
+ - 11 open-source VLMs (e.g., the LLaVA family)
+ - 4 closed-source VLMs (e.g., GPT-4o and Claude 3)
+
+ We are actively updating the [leaderboard](https://mj-bench.github.io/), and you are welcome to evaluate your multimodal judge on [our dataset](https://huggingface.co/datasets/MJ-Bench/MJ-Bench) and submit the results to the [Hugging Face leaderboard](https://huggingface.co/spaces/MJ-Bench/MJ-Bench-Leaderboard).
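
For reference, the preference data described in the updated README can be pulled straight from the Hub. Below is a minimal sketch using the 🤗 `datasets` library with the repo id from the README; the subset name, split, and schema are assumptions, so check the dataset card before relying on them.

```python
# Minimal sketch: load the MJ-Bench preference data from the Hugging Face Hub.
# The repo id comes from the README above; the subset ("alignment") and the
# split are assumptions -- consult the dataset card for the real layout.
from datasets import load_dataset

dataset = load_dataset("MJ-Bench/MJ-Bench", "alignment", split="test")

# Print the field types of a few examples to discover the actual schema.
for example in dataset.select(range(3)):
    print({key: type(value).__name__ for key, value in example.items()})
```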
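The "CLIP-based scoring models" in the judge list score a prompt-image pair by embedding similarity, so a preference is obtained by comparing the scores of the two candidate images. The sketch below illustrates that idea with an off-the-shelf CLIP checkpoint; it is a generic CLIPScore-style stand-in, not MJ-Bench's own judges or evaluation code, and the `triples` input format is hypothetical.

```python
# Illustrative CLIPScore-style judge: embed the prompt and both candidate
# images with CLIP and prefer the image with the higher similarity logit.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def judge(prompt: str, image_a: Image.Image, image_b: Image.Image) -> int:
    """Return 0 if image_a matches the prompt better, 1 otherwise."""
    inputs = processor(text=[prompt], images=[image_a, image_b],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (2 images, 1 text)
    return int(logits.squeeze(1).argmax().item())

def preference_accuracy(triples) -> float:
    """Fraction of (prompt, chosen, rejected) triples where the judge picks
    the human-preferred image -- the kind of number one would report to the
    leaderboard."""
    hits = sum(judge(prompt, chosen, rejected) == 0
               for prompt, chosen, rejected in triples)
    return hits / len(triples)
```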