THUDM/cogagent-9b-20241220

@@ -1,96 +1,103 @@
 ---
 license: other
 language:
-- zh
-- en
 base_model:
-- THUDM/glm-4v-9b
 pipeline_tag: image-text-to-text
 library_name: transformers
 ---
 # CogAgent
-## 关于模型
-`cogagent-9b-20241220` 是 我们基于 [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b) 训练得到的一个专门用于 Agent任务的模型。
-`cogagent-9b-20241220` 是一款较为先进的智能体模型，它具备强大的跨平台兼容性，能够实现对多种计算设备上的图形界面进行自动化的操作。
-无论是Windows、macOS还是Android系统，`cogagent-9b-20241220` 都能够接收用户指令，自动获取设备屏幕截图，经过模型推理后执行自动化设备操作。
-## 运行模型
-请前往我们的[github](https://github.com/THUDM/CogAgent) 查看具体的运行示例。
-## 输入和输出
-cogagent-9b-20241220是一个Agent类执行模型而非对话模型，不支持连续对话，但支持连续的执行历史。
-下面说明用户应如何整理输入并按格式传给模型，以及如何解析模型按规则返回的回复。
-### 用户输入部分
-1. `task` 字段
-   用户输入的任务描述，类似文本格式的prompt，该输入可以指导 CogAgent1.5 模型完成用户任务指令。请保证简洁明了。
-2. `platform` 字段
-   CogAgent1.5 支持在多个平台上执行可操作Agent功能, 我们支持的带有图形界面的操作系统有三个系统，
-    - Windows 10，11，请使用 `WIN` 字段。
-    - Mac 14，15，请使用 `MAC` 字段。
-    - Android 13，14，15 以及其他GUI和UI操作方式几乎相同的安卓UI发行版，请使用 `Mobile` 字段。
-   如果您使用的是其他系统，效果可能不佳，但可以尝试使用 `Mobile` 字段用于手机设备，`WIN` 字段用于Windows设备，`MAC`
-   字段用于Mac设备。
-3. `format` 字段
-   用户希望 CogAgent1.5 返回何种格式的数据, 这里有以下几种选项:
-    - `Answer in Action-Operation-Sensitive format.`: 本仓库中demo默认使用的返回方式，返回模型的行为，对应的操作，以及对应的敏感程度。
-    - `Answer in Status-Plan-Action-Operation format.`: 返回模型的状态，计划，行为，以及相应的操作。
-    - `Answer in Status-Action-Operation-Sensitive format.`: 返回模型的状态，行为，对应的操作，以及对应的敏感程度。
-    - `Answer in Status-Action-Operation format.`: 返回模型的状态，行为，以及相应的操作。
-    - `Answer in Action-Operation format.`: 返回模型的行为，对应的操作。
-4. `history` 字段
-   拼接顺序和结果应该如下所示：
    ```
    query = f'{task}{history}{platform}{format}'
    ```
-### 模型返回部分
-1. 敏感操作: 包括 `<<敏感操作>> <<一般操作>>` 几种类型，只有要求返回`Sensitive`的时候返回。
-2. `Plan`, `Agent`, `Status`, `Action` 字段: 用于描述模型的行为和操作。只有要求返回对应字段的时候返回，例如带有`Action`则返回
-   `Action`字段内容。
-3. 常规回答部分，这部分回答会在格式化回答之前，表示综述。
-4. `Grounded Operation` 字段:
-   用于描述模型的具体操作，包括操作的位置，类型，以及具体的操作内容。其中 `box` 代表执行区域的坐标，`element_type` 代表执行的元素类型，
-   `element_info` 代表执行的元素描述。这些信息被一个 `操作指令` 操作所包裹。这些指令包括：
-    - `CLICK`: 点击操作，模拟鼠标点击或者手指触摸。
-    - `LONGPRESS`: 长按操作。仅在 `Mobile` 模式下支持。
-### 例子
-用户的任务是希望帮忙将所有邮件标记为已读，用户使用的是 Mac系统，希望返回的是Action-Operation-Sensitive格式。
-正确拼接后的提示词应该为：
 ```
-Task: 帮我将所有的邮件标注为已读
 (Platform: Mac)
 (Answer in Action-Operation-Sensitive format.)
 ```
-接着，这里展现了不同格式要求下的返回结果:
 <details>
 <summary>Answer in Action-Operation-Sensitive format</summary>
 ```
-Action: 点击页面顶部工具栏中的“全部标为已读”按钮，将所有邮件标记为已读。
-Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
-<<一般操作>>
 ```
 </details>
@@ -100,9 +107,9 @@ Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本
 ```
 Status: None
-Plan: None.
-Action: 点击收件箱页面顶部中间的“全部标记为已读”按钮，将所有邮件标记为已读。
-Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
 ```
 </details>
@@ -111,10 +118,10 @@ Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本
 <summary>Answer in Status-Action-Operation-Sensitive format</summary>
 ```
-Status: 当前处于邮箱界面[[0, 2, 998, 905]]，左侧是邮箱分类[[1, 216, 144, 570]]，中间是收件箱[[144, 216, 998, 903]]，已经点击“全部标为已读”按钮[[223, 178, 311, 210]]。
-Action: 点击页面顶部工具栏中的“全部标为已读”按钮，将所有邮件标记为已读。
-Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
-<<一般操作>>
 ```
 </details>
@@ -124,8 +131,8 @@ Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本
 ```
 Status: None
-Action: 在收件箱页面顶部，点击“全部标记为已读”按钮，将所有邮件标记为已读。
-Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
 ```
 </details>
@@ -134,23 +141,23 @@ Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本
 <summary>Answer in Action-Operation format</summary>
 ```
-Action: 在左侧邮件列表中，右键单击第一封邮件，以打开操作菜单。
 Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
 ```
 </details>
-### 注意事项
-1. 该模型不是对话模型，不支持连续对话，请发送具体指令，并参考我们提供的历史拼接方式进行拼接。
-2. 该模型必须要有图片传入，纯文字对话无法实现GUI Agent任务。
-3. 该模型输出有严格的格式要求，请严格按照我们的要求进行解析。输出格式为 STR 格式，不支持输出JSON 格式。
-## 先前的工作
-在2023年11月，我们发布了CogAgent的第一代模型，现在，你可以在 [CogVLM&CogAgent官方仓库](https://github.com/THUDM/CogVLM)
-找到相关代码和权重地址。
 <div align="center">
     <img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
@@ -161,18 +168,18 @@ Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]'
     <td>
       <h2> CogVLM </h2>
       <p> 📖  Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
-      <p><b>CogVLM</b> 是一个强大的开源视觉语言模型（VLM）。CogVLM-17B拥有100亿的视觉参数和70亿的语言参数，支持490*490分辨率的图像理解和多轮对话。</p>
-      <p><b>CogVLM-17B 在10个经典的跨模态基准测试中取得了最先进的性能，</b>包括 NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA 和 TDIUC 基准测试。</p>
     </td>
     <td>
       <h2> CogAgent </h2>
       <p> 📖  Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
-      <p><b>CogAgent</b> 是一个基于CogVLM改进的开源视觉语言模型。CogAgent-18B拥有110亿的视觉参数和70亿的语言参数, <b>支持1120*1120分辨率的图像理解。在CogVLM的能力之上，它进一步拥有了GUI图像Agent的能力。</b></p>
-      <p> <b>CogAgent-18B 在9个经典的跨模态基准测试中实现了最先进的通用性能，</b>包括 VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, 和 POPE 测试基准。它在包括AITW和Mind2Web在内的GUI操作数据集上显著超越了现有的模型。</p>
     </td>
   </tr>
 </table>
-## 协议
-模型权重的使用请遵循 [Model License](LICENSE)。

 ---
 license: other
 language:
+  - zh
+  - en
 base_model:
+  - THUDM/glm-4v-9b
 pipeline_tag: image-text-to-text
 library_name: transformers
 ---
 # CogAgent
+[中文阅读](README_zh.md)
+## About the Model
+The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual
+open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements,
+`CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space
+completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both
+screenshots and language input.
+This version of the CogAgent model has already been applied in
+ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers
+in advancing the research and applications of GUI agents based on vision-language models.
+## Running the Model
+Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model.
+## Input and Output
+`cogagent-9b-20241220` is an agent execution model rather than a conversational model. It does not support continuous
+conversations but does support continuous execution history. Below are guidelines on how users should format their input
+for the model and interpret the formatted output.
+### User Input
+1. **`task` field**
+   A task description provided by the user, similar to a textual prompt. The input should be concise and clear to guide
+   the `CogAgent-9B-20241220` model to complete the task.
+2. **`platform` field**
+   `CogAgent-9B-20241220` supports three platforms with graphical interfaces:
+    - **Windows**: Use the `WIN` field for Windows 10 or 11.
+    - **Mac**: Use the `MAC` field for macOS 14 or 15.
+    - **Mobile**: Use the `Mobile` field for Android 13, 14, 15, or other Android-based distributions whose GUI and
+      interaction model are nearly identical.
+   On other systems results may be worse, but you can still try `Mobile` for mobile devices, `WIN` for Windows
+   devices, and `MAC` for Mac devices.
+3. **`format` field**
+   Specifies the desired format of the returned data. Options include:
+    - `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding
+      operations, and sensitivity levels.
+    - `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
+    - `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and
+      sensitivity levels.
+    - `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
+    - `Answer in Action-Operation format.`: Returns actions and corresponding operations.
+4. **`history` field**
+   The record of previously executed steps, if any. The fields are concatenated in the following order:
    ```
    query = f'{task}{history}{platform}{format}'
    ```
+### Model Output
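+1. **Sensitive Operations**: A marker such as `<<Sensitive Operation>>` or `<<General Operation>>`, returned only
+   when the requested format includes `Sensitive`.
+2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations. Each field is returned
+   only when the requested format names it; for example, a format containing `Action` returns the `Action` field.
+3. **General Response**: A free-text summary that appears before the formatted fields.
+4. **`Grounded Operation` field**: Describes the model's specific operation, including its location, type, and
+   content. `box` gives the coordinates of the target region, `element_type` the type of the target element, and
+   `element_info` a description of that element. These arguments are wrapped in an operation command. Commands include:
+    - `CLICK`: Simulates a mouse click or a finger tap.
+    - `LONGPRESS`: Simulates a long press (supported only in `Mobile` mode).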
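The output fields described above can be pulled out of the raw string with a few regular expressions. The following is a minimal sketch, not the project's own parser: the helper name `parse_response` and the returned dictionary keys are hypothetical, and the patterns are inferred only from the sample responses in this README.

```python
import re
from typing import Optional

# Matches a line such as:
#   Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='...', element_info='...')
_OPERATION_RE = re.compile(
    r"^Grounded Operation:\s*(?P<name>[A-Z_]+)\((?P<args>.*)\)\s*$",
    re.MULTILINE,
)
_BOX_RE = re.compile(r"box=\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]")
_QUOTED_ARG_RE = re.compile(r"(\w+)='([^']*)'")

# Markers as they appear in the English and Chinese examples of this README.
_SENSITIVE_MARKERS = ("<<Sensitive Operation>>", "<<敏感操作>>")


def parse_response(response: str) -> Optional[dict]:
    """Extract the grounded operation from a model response (hypothetical helper).

    Returns None when the response has no `Grounded Operation` line.
    """
    match = _OPERATION_RE.search(response)
    if match is None:
        return None
    args = match.group("args")
    box_match = _BOX_RE.search(args)
    box = tuple(int(v) for v in box_match.groups()) if box_match else None
    quoted = dict(_QUOTED_ARG_RE.findall(args))
    return {
        "operation": match.group("name"),
        "box": box,
        "element_type": quoted.get("element_type"),
        "element_info": quoted.get("element_info"),
        "sensitive": any(marker in response for marker in _SENSITIVE_MARKERS),
    }
```

A caller might use the `sensitive` flag to ask the user for confirmation before executing the operation. Note that this sketch does not handle argument values that themselves contain a single quote.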
+### Example
+If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the
+prompt should be:
 ```
+Task: Mark all emails as read
 (Platform: Mac)
 (Answer in Action-Operation-Sensitive format.)
 ```
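A prompt like the one above can be assembled programmatically. The sketch below is illustrative only: `build_query` is a hypothetical helper, the newline-separated layout is inferred from this single example, and because the exact layout of the `history` text is not specified in this README, the history string is passed through verbatim.

```python
# The five format strings listed under the `format` field.
ALLOWED_FORMATS = (
    "Answer in Action-Operation-Sensitive format.",
    "Answer in Status-Plan-Action-Operation format.",
    "Answer in Status-Action-Operation-Sensitive format.",
    "Answer in Status-Action-Operation format.",
    "Answer in Action-Operation format.",
)


def build_query(task: str, platform: str, answer_format: str, history: str = "") -> str:
    """Concatenate task, history, platform and format, in that order.

    Assumption: fields are separated by newlines, as in the example prompt.
    `history` must already be formatted by the caller; it is inserted as-is.
    """
    if answer_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown format: {answer_format!r}")
    parts = [f"Task: {task}"]
    if history:
        parts.append(history)
    parts.append(f"(Platform: {platform})")
    parts.append(f"({answer_format})")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_query(
        "Mark all emails as read",
        "Mac",
        "Answer in Action-Operation-Sensitive format.",
    ))
```

Validating the format string up front catches typos early, since the model expects one of the exact sentences listed above.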
+Below are examples of model responses based on different requested formats:
 <details>
 <summary>Answer in Action-Operation-Sensitive format</summary>
 ```
+Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
+<<General Operation>>
 ```
 </details>
 ```
 Status: None
+Plan: None
+Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
 ```
 </details>
 <summary>Answer in Status-Action-Operation-Sensitive format</summary>
 ```
+Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
+Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
+<<General Operation>>
 ```
 </details>
 ```
 Status: None
+Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
 ```
 </details>
 <summary>Answer in Action-Operation format</summary>
 ```
+Action: Right-click on the first email in the left-side list to open the action menu.
 Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
 ```
 </details>
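To act on a `box`, an executor needs a concrete screen position. This README does not state the coordinate system; the sketch below assumes that box values are relative coordinates on a 0 to 1000 scale in `[x1, y1, x2, y2]` order, which is a guess based on the sample values (for example `[[0, 2, 998, 905]]` for the whole window). Both the helper name and the `scale` parameter are hypothetical; verify the convention against the GitHub repository before relying on it.

```python
from typing import Sequence, Tuple


def box_center_in_pixels(
    box: Sequence[int],
    screen_width: int,
    screen_height: int,
    scale: int = 1000,  # assumed range of the relative coordinates
) -> Tuple[int, int]:
    """Map the center of a relative box to pixel coordinates (integer math)."""
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) * screen_width // (2 * scale)
    center_y = (y1 + y2) * screen_height // (2 * scale)
    return center_x, center_y


if __name__ == "__main__":
    # Box taken from the CLICK examples above, on a hypothetical 1920x1080 screen.
    print(box_center_in_pixels([219, 186, 311, 207], 1920, 1080))
```

Integer arithmetic is used so that the result does not depend on floating-point rounding.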
+### Notes
+1. This model is not a conversational model and does not support continuous dialogue. Please send specific instructions
+   and use the suggested concatenation method.
+2. Images must be provided as input; textual prompts alone cannot execute GUI agent tasks.
+3. The model's output follows a strict format; parse it exactly as described above. The output is a plain string,
+   and JSON output is not supported.
+## Previous Work
+In November 2023, we released the first generation of CogAgent. You can find related code and weights in
+the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).
 <div align="center">
     <img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
     <td>
       <h2> CogVLM </h2>
       <p> 📖  Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
+      <p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.</p>
+      <p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks,</b> including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
     </td>
     <td>
       <h2> CogAgent </h2>
       <p> 📖  Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
+      <p><b>CogAgent</b> is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, <b>supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.</b></p>
+      <p><b>CogAgent-18B achieved state-of-the-art generalist performance on 9 classic cross-modal benchmarks,</b> including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly outperformed existing models on GUI operation datasets, including AITW and Mind2Web.</p>
     </td>
   </tr>
 </table>
+## License
+Please follow the [Model License](LICENSE) for using the model weights.

README_zh.md ADDED

	@@ -0,0 +1,169 @@

+# CogAgent
+## 关于模型
+`CogAgent-9B-20241220` 模型基于 [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b)
+双语开源VLM基座模型，通过数据的采集与优化、多阶段训练与策略改进等方法，`CogAgent-9B-20241220` 在GUI
+感知、推理预测准确性、动作空间完善性、任务的普适和泛化性上得到了大幅提升，能够接受中英文双语的屏幕截图和语言交互。
+此版CogAgent模型已被应用于智谱AI的 [GLM-PC产品](https://cogagent.aminer.cn/home)
+。我们希望这版模型的发布能够帮助到学术研究者们和开发者们，一起推进基于视觉语言模型的 GUI agent 的研究和应用。
+## 运行模型
+请前往我们的[github](https://github.com/THUDM/CogAgent) 查看具体的运行示例。
+## 输入和输出
+cogagent-9b-20241220是一个Agent类执行模型而非对话模型，不支持连续对话，但支持连续的执行历史。
+下面说明用户应如何整理输入并按格式传给模型，以及如何解析模型按规则返回的回复。
+### 用户输入部分
+1. `task` 字段
+   用户输入的任务描述，类似文本格式的prompt，该输入可以指导 CogAgent1.5 模型完成用户任务指令。请保证简洁明了。
+2. `platform` 字段
+   CogAgent1.5 支持在多个平台上执行可操作Agent功能, 我们支持的带有图形界面的操作系统有三个系统，
+    - Windows 10，11，请使用 `WIN` 字段。
+    - Mac 14，15，请使用 `MAC` 字段。
+    - Android 13，14，15 以及其他GUI和UI操作方式几乎相同的安卓UI发行版，请使用 `Mobile` 字段。
+   如果您使用的是其他系统，效果可能不佳，但可以尝试使用 `Mobile` 字段用于手机设备，`WIN` 字段用于Windows设备，`MAC`
+   字段用于Mac设备。
+3. `format` 字段
+   用户希望 CogAgent1.5 返回何种格式的数据, 这里有以下几种选项:
+    - `Answer in Action-Operation-Sensitive format.`: 本仓库中demo默认使用的返回方式，返回模型的行为，对应的操作，以及对应的敏感程度。
+    - `Answer in Status-Plan-Action-Operation format.`: 返回模型的状态，计划，行为，以及相应的操作。
+    - `Answer in Status-Action-Operation-Sensitive format.`: 返回模型的状态，行为，对应的操作，以及对应的敏感程度。
+    - `Answer in Status-Action-Operation format.`: 返回模型的状态，行为，以及相应的操作。
+    - `Answer in Action-Operation format.`: 返回模型的行为，对应的操作。
+4. `history` 字段
+   拼接顺序和结果应该如下所示：
+   ```
+   query = f'{task}{history}{platform}{format}'
+   ```
+### 模型返回部分
+1. 敏感操作: 包括 `<<敏感操作>> <<一般操作>>` 几种类型，只有要求返回`Sensitive`的时候返回。
+2. `Plan`, `Agent`, `Status`, `Action` 字段: 用于描述模型的行为和操作。只有要求返回对应字段的时候返回，例如带有`Action`则返回
+   `Action`字段内容。
+3. 常规回答部分，这部分回答会在格式化回答之前，表示综述。
+4. `Grounded Operation` 字段:
+   用于描述模型的具体操作，包括操作的位置，类型，以及具体的操作内容。其中 `box` 代表执行区域的坐标，`element_type` 代表执行的元素类型，
+   `element_info` 代表执行的元素描述。这些信息被一个 `操作指令` 操作所包裹。这些指令包括：
+    - `CLICK`: 点击操作，模拟鼠标点击或者手指触摸。
+    - `LONGPRESS`: 长按操作。仅在 `Mobile` 模式下支持。
+### 例子
+用户的任务是希望帮忙将所有邮件标记为已读，用户使用的是 Mac系统，希望返回的是Action-Operation-Sensitive格式。
+正确拼接后的提示词应该为：
+```
+Task: 帮我将所有的邮件标注为已读
+(Platform: Mac)
+(Answer in Action-Operation-Sensitive format.)
+```
+接着，这里展现了不同格式要求下的返回结果:
+<details>
+<summary>Answer in Action-Operation-Sensitive format</summary>
+```
+Action: 点击页面顶部工具栏中的“全部标为已读”按钮，将所有邮件标记为已读。
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
+<<一般操作>>
+```
+</details>
+<details>
+<summary>Answer in Status-Plan-Action-Operation format</summary>
+```
+Status: None
+Plan: None.
+Action: 点击收件箱页面顶部中间的“全部标记为已读”按钮，将所有邮件标记为已读。
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
+```
+</details>
+<details>
+<summary>Answer in Status-Action-Operation-Sensitive format</summary>
+```
+Status: 当前处于邮箱界面[[0, 2, 998, 905]]，左侧是邮箱分类[[1, 216, 144, 570]]，中间是收件箱[[144, 216, 998, 903]]，已经点击“全部标为已读”按钮[[223, 178, 311, 210]]。
+Action: 点击页面顶部工具栏中的“全部标为已读”按钮，将所有邮件标记为已读。
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
+<<一般操作>>
+```
+</details>
+<details>
+<summary>Answer in Status-Action-Operation format</summary>
+```
+Status: None
+Action: 在收件箱页面顶部，点击“全部标记为已读”按钮，将所有邮件标记为已读。
+Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
+```
+</details>
+<details>
+<summary>Answer in Action-Operation format</summary>
+```
+Action: 在左侧邮件列表中，右键单击第一封邮件，以打开操作菜单。
+Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
+```
+</details>
+### 注意事项
+1. 该模型不是对话模型，不支持连续对话，请发送具体指令，并参考我们提供的历史拼接方式进行拼接。
+2. 该模型必须要有图片传入，纯文字对话无法实现GUI Agent任务。
+3. 该模型输出有严格的格式要求，请严格按照我们的要求进行解析。输出格式为 STR 格式，不支持输出JSON 格式。
+## 先前的工作
+在2023年11月，我们发布了CogAgent的第一代模型，现在，你可以在 [CogVLM&CogAgent官方仓库](https://github.com/THUDM/CogVLM)
+找到相关代码和权重地址。
+<div align="center">
+    <img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
+</div>
+<table>
+  <tr>
+    <td>
+      <h2> CogVLM </h2>
+      <p> 📖  Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
+      <p><b>CogVLM</b> 是一个强大的开源视觉语言模型（VLM）。CogVLM-17B拥有100亿的视觉参数和70亿的语言参数，支持490*490分辨率的图像理解和多轮对话。</p>
+      <p><b>CogVLM-17B 在10个经典的跨模态基准测试中取得了最先进的性能，</b>包括 NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA 和 TDIUC 基准测试。</p>
+    </td>
+    <td>
+      <h2> CogAgent </h2>
+      <p> 📖  Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
+      <p><b>CogAgent</b> 是一个基于CogVLM改进的开源视觉语言模型。CogAgent-18B拥有110亿的视觉参数和70亿的语言参数, <b>支持1120*1120分辨率的图像理解。在CogVLM的能力之上，它进一步拥有了GUI图像Agent的能力。</b></p>
+      <p> <b>CogAgent-18B 在9个经典的跨模态基准测试中实现了最先进的通用性能，</b>包括 VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, 和 POPE 测试基准。它在包括AITW和Mind2Web在内的GUI操作数据集上显著超越了现有的模型。</p>
+    </td>
+  </tr>
+</table>
+## 协议
+模型权重的使用请遵循 [Model License](LICENSE)。