update model card.
README.md
CHANGED
@@ -8,14 +8,14 @@ base_model:
 
 - [GUI-Actor-7B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)
 - [GUI-Actor-2B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL)
-- [GUI-Actor-7B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL)
-- [GUI-Actor-3B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)
+- [GUI-Actor-7B-Qwen2.5-VL (coming soon)](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL)
+- [GUI-Actor-3B-Qwen2.5-VL (coming soon)](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)
 - [GUI-Actor-Verifier-2B](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)
 
-This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**
+This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://aka.ms/GUI-Actor).
 It is developed based on [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), augmented by an attention-based action head and finetuned to perform GUI grounding using the dataset [here (coming soon)]().
 
-For more details on model design and evaluation, please check
+For more details on model design and evaluation, please check: [🌐 Project Page](https://aka.ms/GUI-Actor) | [💻 Github Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper]().
 
 ## 📊 Performance Comparison on GUI Grounding Benchmarks
 Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Huggingface.

@@ -118,7 +118,7 @@ print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
 # Predicted click point: [0.9709, 0.1548]
 ```
 
-## Citation
+## 📚 Citation
 ```
 @article{wu2025guiactor,
   title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
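The updated card describes the model as Qwen2-VL-7B-Instruct augmented with an attention-based action head that grounds actions without emitting coordinate text. As a rough illustration of that idea (not the repository's actual implementation), the sketch below reads a normalized click point off an attention distribution over the image patch grid; the grid size, the weighted-mean readout, and the function name are all assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch only: assume the action head yields an attention
# distribution over the image patch grid, and take the click point as the
# attention-weighted mean of patch centers. Grid size, readout rule, and
# function name are assumptions, not the repository's implementation.
def click_from_patch_attention(attn: np.ndarray) -> tuple[float, float]:
    """attn: (rows, cols) attention weights over image patches, summing to 1.

    Returns a click point (x, y) normalized to [0, 1] in image coordinates.
    """
    rows, cols = attn.shape
    xs = (np.arange(cols) + 0.5) / cols  # normalized x-center of each patch column
    ys = (np.arange(rows) + 0.5) / rows  # normalized y-center of each patch row
    px = float((attn.sum(axis=0) * xs).sum())  # expectation of x under attention
    py = float((attn.sum(axis=1) * ys).sum())  # expectation of y under attention
    return px, py

# Toy example: attention mass concentrated near the top-right of a 28x28 patch grid.
attn = np.zeros((28, 28))
attn[4, 26], attn[4, 27] = 0.7, 0.3
px, py = click_from_patch_attention(attn)
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
# Predicted click point: [0.9571, 0.1607]
```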