zR committed · Commit 4437983 · Parent: 678da3b

publish

Browse files:
- README.md (+83 −76)
- README_zh.md (+169 −0)
README.md
CHANGED
---
license: other
language:
- zh
- en
base_model:
- THUDM/glm-4v-9b
pipeline_tag: image-text-to-text
library_name: transformers
---

# CogAgent

[中文阅读](README_zh.md)

## About the Model

The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both screenshots and language input.

This version of the CogAgent model has already been applied in ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers in advancing the research and applications of GUI agents based on vision-language models.

## Running the Model

Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model.
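
For orientation, here is a minimal inference sketch. It assumes the glm-4v-style `transformers` remote-code interface of the base model and a hypothetical hub id; the GitHub examples are authoritative for preprocessing and generation settings.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogagent-9b-20241220"  # assumed hub id; check this card's repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# One screenshot plus the concatenated text query (see "Input and Output" below).
image = Image.open("screenshot.png").convert("RGB")
query = "Task: Mark all emails as read(Platform: Mac)(Answer in Action-Operation-Sensitive format.)"

# glm-4v-style models take the image inside the chat message.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```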

## Input and Output

`cogagent-9b-20241220` is an agent execution model rather than a conversational model. It does not support continuous conversations but does support continuous execution history. Below are guidelines on how users should format their input for the model and interpret the formatted output.

### User Input

1. **`task` field**
   A task description provided by the user, similar to a textual prompt. The input should be concise and clear to guide the `CogAgent-9B-20241220` model to complete the task.

2. **`platform` field**
   `CogAgent-9B-20241220` supports operation on several platforms with GUI interfaces:
   - **Windows**: Use the `WIN` field for Windows 10 or 11.
   - **Mac**: Use the `MAC` field for Mac 14 or 15.
   - **Mobile**: Use the `Mobile` field for Android 13, 14, 15, or similar Android-based UI versions.

   If using other systems, results may vary. Use the `Mobile` field for mobile devices, `WIN` for Windows, and `MAC` for Mac.

3. **`format` field**
   Specifies the desired format of the returned data. Options include:
   - `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding operations, and sensitivity levels.
   - `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
   - `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and sensitivity levels.
   - `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
   - `Answer in Action-Operation format.`: Returns actions and corresponding operations.

4. **`history` field**
   The input should be concatenated in the following order:

```
query = f'{task}{history}{platform}{format}'
```
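
The card does not pin down how each history step is serialized (the GitHub demo is authoritative), so the helper below is only a hedged sketch of assembling the query; the numbered history rendering is an illustrative assumption.

```python
def build_query(task: str, history_steps: list[str], platform: str, fmt: str) -> str:
    """Assemble the query in the documented {task}{history}{platform}{format} order.

    The "(History steps: ...)" rendering is an assumption for illustration;
    follow the GitHub examples for the exact serialization.
    """
    history = ""
    if history_steps:
        steps = "\n".join(f"{i}. {s}" for i, s in enumerate(history_steps))
        history = f"\n(History steps:\n{steps})\n"
    return f"{task}{history}(Platform: {platform})\n(Answer in {fmt} format.)"


query = build_query(
    task="Task: Mark all emails as read",
    history_steps=["CLICK(box=[[219,186,311,207]], element_info='Mark All as Read')"],
    platform="Mac",
    fmt="Action-Operation-Sensitive",
)
```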

### Model Output

1. **Sensitive Operations**: Includes types like `<<Sensitive Operation>>` or `<<General Operation>>`, returned only when `Sensitive` is requested.
2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations, returned based on the requested format.
3. **General Responses**: A summary that precedes the formatted fields.
4. **`Grounded Operation` field**: Describes the model's specific actions, such as coordinates, element types, and element descriptions. Actions include:
   - `CLICK`: Simulates mouse clicks or touch gestures.
   - `LONGPRESS`: Simulates long presses (supported only in `Mobile` mode).
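
Because the reply is a strictly formatted string rather than JSON (see the Notes below), parsing is regex work. Here is a sketch based on the field labels shown in the examples that follow; the patterns are assumptions to tune against real model output.

```python
import re

def parse_response(text: str) -> dict:
    """Extract the labeled fields and the grounded call from a CogAgent reply."""
    fields = {}
    for key in ("Status", "Plan", "Action"):
        m = re.search(rf"^{key}:\s*(.+)$", text, re.MULTILINE)
        if m:
            fields[key.lower()] = m.group(1).strip()
    op = re.search(r"Grounded Operation:\s*([A-Z_]+)\((.*)\)", text)
    if op:
        fields["operation"] = op.group(1)  # e.g. CLICK, RIGHT_CLICK, LONGPRESS
        box = re.search(r"box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]", op.group(2))
        if box:
            fields["box"] = [int(v) for v in box.groups()]  # [x1, y1, x2, y2]
    sens = re.search(r"<<(.+?)>>", text)
    if sens:
        fields["sensitivity"] = sens.group(1)  # "General Operation" / "Sensitive Operation"
    return fields
```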

### Example

If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the prompt should be:

```
Task: Mark all emails as read
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)
```

Below are examples of model responses based on different requested formats:

<details>
<summary>Answer in Action-Operation-Sensitive format</summary>

```
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Plan-Action-Operation format</summary>

```
Status: None
Plan: None
Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Status-Action-Operation-Sensitive format</summary>

```
Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Action-Operation format</summary>

```
Status: None
Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Action-Operation format</summary>

```
Action: Right-click on the first email in the left-side list to open the action menu.
Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
```

</details>

### Notes

1. This model is not a conversational model and does not support continuous dialogue. Send a specific instruction and concatenate any history using the method suggested above.
2. An image must be provided as input; textual prompts alone cannot drive GUI agent tasks.
3. The model outputs a strictly formatted string (STR), not JSON; parse it exactly as specified.
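
One practical consequence of the sensitivity tags above: an execution loop can refuse to auto-run steps the model flags as sensitive. A small illustrative policy sketch (the tag strings follow the examples in this card):

```python
import re

def should_auto_execute(response: str) -> bool:
    """Only auto-execute steps the model tagged <<General Operation>> (illustrative policy)."""
    tag = re.search(r"<<(.+?)>>", response)
    return bool(tag) and tag.group(1) == "General Operation"

reply = ("Action: Empty the trash.\n"
         "Grounded Operation: CLICK(box=[[219,186,311,207]])\n"
         "<<Sensitive Operation>>")
if not should_auto_execute(reply):
    print("Sensitive step - ask the user to confirm before executing.")
```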

## Previous Work

In November 2023, we released the first generation of CogAgent. You can find related code and weights in the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).

<div align="center">
    <img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
</div>

<table>
  <tr>
    <td>
        <h2> CogVLM </h2>
        <p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
        <p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.</p>
        <p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
    </td>
    <td>
        <h2> CogAgent </h2>
        <p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
        <p><b>CogAgent</b> is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, <b>supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.</b></p>
        <p><b>CogAgent-18B achieved state-of-the-art performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly outperformed existing models on GUI operation datasets such as AITW and Mind2Web.</p>
    </td>
  </tr>
</table>

## License

Please follow the [Model License](LICENSE) for using the model weights.
README_zh.md
ADDED
# CogAgent

## About the Model

The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual open-source VLM base model. Through methods such as data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` has been greatly improved in GUI perception, inference prediction accuracy, action space completeness, and task universality and generalization, and it accepts bilingual (Chinese and English) screenshots and language interaction.

This version of the CogAgent model has already been applied in ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release helps academic researchers and developers jointly advance the research and applications of GUI agents based on vision-language models.

## Running the Model

Please visit our [GitHub](https://github.com/THUDM/CogAgent) for concrete examples of running the model.

## Input and Output

`cogagent-9b-20241220` is an agent-style execution model rather than a conversational model. It does not support continuous conversations, but it does support a continuous execution history. This section shows how users should format their input for the model and obtain the model's formatted replies.

### User Input

1. `task` field

   The user's task description, similar to a text-format prompt, which guides the CogAgent1.5 model to complete the user's task instruction. Keep it concise and clear.

2. `platform` field

   CogAgent1.5 supports agent operation on multiple platforms. The operating systems with graphical interfaces that we support are these three:

   - Windows 10 and 11: use the `WIN` field.
   - Mac 14 and 15: use the `MAC` field.
   - Android 13, 14, 15 and other Android UI distributions whose GUI and interaction are nearly identical: use the `Mobile` field.

   On other systems the results may be poor, but you can try the `Mobile` field for mobile devices, the `WIN` field for Windows devices, and the `MAC` field for Mac devices.

3. `format` field

   The format in which the user wants CogAgent1.5 to return data. The options are:

   - `Answer in Action-Operation-Sensitive format.`: the default in this repository's demo; returns the model's actions, corresponding operations, and their sensitivity levels.
   - `Answer in Status-Plan-Action-Operation format.`: returns the model's status, plans, actions, and corresponding operations.
   - `Answer in Status-Action-Operation-Sensitive format.`: returns the model's status, actions, corresponding operations, and their sensitivity levels.
   - `Answer in Status-Action-Operation format.`: returns the model's status, actions, and corresponding operations.
   - `Answer in Action-Operation format.`: returns the model's actions and corresponding operations.

4. `history` field

   The concatenation order and result should look like this:

```
query = f'{task}{history}{platform}{format}'
```

### Model Output

1. Sensitive operations: includes the types `<<Sensitive Operation>>` and `<<General Operation>>`, returned only when `Sensitive` is requested.
2. `Plan`, `Agent`, `Status`, `Action` fields: describe the model's behavior and operations, returned only when the corresponding field is requested; for example, a format containing `Action` returns the `Action` field's content.
3. A general answer section, which precedes the formatted answer and serves as a summary.
4. `Grounded Operation` field: describes the model's concrete operation, including its location, type, and content. `box` is the coordinate region of the action, `element_type` is the type of the target element, and `element_info` is a description of the target element. These details are wrapped in an operation instruction. The instructions include:
   - `CLICK`: a click operation, simulating a mouse click or finger touch.
   - `LONGPRESS`: a long-press operation, supported only in `Mobile` mode.
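
The `box` coordinates in `Grounded Operation` are not raw pixels. Assuming the 0-1000 normalized screen scale used in the GitHub demo (an assumption to verify against the repo), mapping a box back to screenshot pixels looks like this:

```python
def box_to_pixels(box: list[int], screen_w: int, screen_h: int) -> tuple[int, int, int, int]:
    # Assumes box = [x1, y1, x2, y2] on a 0-1000 normalized scale (per the GitHub demo).
    x1, y1, x2, y2 = box
    return (x1 * screen_w // 1000, y1 * screen_h // 1000,
            x2 * screen_w // 1000, y2 * screen_h // 1000)

# e.g. CLICK(box=[[219,186,311,207]]) on a 1920x1080 screenshot
print(box_to_pixels([219, 186, 311, 207], 1920, 1080))  # -> (420, 200, 597, 223)
```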

### Example

The user wants all emails marked as read, is on a Mac, and wants the reply in the Action-Operation-Sensitive format. The correctly concatenated prompt should be:

```
Task: Mark all emails as read
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)
```

Next, here are the model's replies under the different format requirements:

<details>
<summary>Answer in Action-Operation-Sensitive format</summary>

```
Action: Click the "Mark All as Read" button in the toolbar at the top of the page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Plan-Action-Operation format</summary>

```
Status: None
Plan: None
Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Status-Action-Operation-Sensitive format</summary>

```
Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
Action: Click the "Mark All as Read" button in the toolbar at the top of the page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Action-Operation format</summary>

```
Status: None
Action: At the top of the inbox page, click the "Mark All as Read" button to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Action-Operation format</summary>

```
Action: In the email list on the left, right-click the first email to open the action menu.
Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
```

</details>

### Notes

1. This model is not a conversational model and does not support continuous dialogue. Send a specific instruction and concatenate any history using the method we provide.
2. An image must be passed to the model; GUI agent tasks cannot be performed with text-only dialogue.
3. The model's output follows strict format requirements; parse it strictly as specified. The output is a STR format, and JSON output is not supported.

## Previous Work

In November 2023, we released the first generation of CogAgent. You can find the related code and weights in the [CogVLM & CogAgent official repository](https://github.com/THUDM/CogVLM).

<div align="center">
    <img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
</div>

<table>
  <tr>
    <td>
        <h2> CogVLM </h2>
        <p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
        <p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, and supports image understanding at 490x490 resolution as well as multi-turn dialogue.</p>
        <p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
    </td>
    <td>
        <h2> CogAgent </h2>
        <p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
        <p><b>CogAgent</b> is an open-source vision-language model improved upon CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, and <b>supports image understanding at 1120x1120 resolution. On top of CogVLM's capabilities, it further gains GUI-image agent capabilities.</b></p>
        <p><b>CogAgent-18B achieved state-of-the-art general performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpassed existing models on GUI operation datasets including AITW and Mind2Web.</p>
    </td>
  </tr>
</table>

## License

Please follow the [Model License](LICENSE) for use of the model weights.