Update README.md
README.md
CHANGED
@@ -1,13 +1,13 @@
----
-license: apache-2.0
-datasets:
-- declare-lab/Emma-X-GCOT
-metrics:
-- accuracy
-base_model:
-- openvla/openvla-7b
-pipeline_tag: image-text-to-text
----

<h1 align="center">✨
<br/>
@@ -30,20 +30,19 @@ Meet Emma-X, an Embodied Multimodal Action Model

## Model Overview

-EMMA-X is an Embodied Multimodal Action (VLA) Model designed to bridge the gap between Visual-Language Models (VLMs) and robotic control tasks. EMMA-X generalizes effectively across diverse environments, objects, and instructions while excelling at long-horizon spatial reasoning and grounded task planning using a novel Trajectory Segmentation Strategy.
-
-EMMA-X is trained on a dataset derived from BridgeV2, containing 60,000 robot manipulation trajectories. Trained using a hierarchical dataset with visual grounded chain-of-thought reasoning, EMMA-X's output will include the following components:

-Grounded Chain-of-Thought Reasoning:
-Helps break down tasks into smaller, manageable subtasks, ensuring accurate task execution by mitigating hallucination in reasoning.

-Gripper Position Guidance: Affordance point inside the image.

-Look-Ahead Spatial Reasoning:
-Enables the model to plan actions while considering spatial guidance for effective planning, enhancing long-horizon task performance.

-

## Model Card
- **Developed by:** SUTD Declare Lab
@@ -53,8 +52,8 @@ Action: Action policy in 7-dimensional vector to control the robot ([WidowX-6Dof
- **Finetuned from:** [`openvla-7B`](https://huggingface.co/openvla/openvla-7b/)
- **Pretraining Dataset:** Augmented version of [Bridge V2](https://rail-berkeley.github.io/bridgedata/), for more info check our repository.
- **Repository:** [https://github.com/declare-lab/Emma-X/](https://github.com/declare-lab/Emma-X/)
-- **Paper:** [Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning]()
-- **Project Page & Videos:** []()

## Getting Started
```python
+---
+license: apache-2.0
+datasets:
+- declare-lab/Emma-X-GCOT
+metrics:
+- accuracy
+base_model:
+- openvla/openvla-7b
+pipeline_tag: image-text-to-text
+---

<h1 align="center">✨
<br/>

## Model Overview

+EMMA-X is an Embodied Multimodal Action (VLA) Model designed to bridge the gap between Vision-Language Models (VLMs) and robotic control tasks. EMMA-X generalizes effectively across diverse environments, objects, and instructions while excelling at long-horizon spatial reasoning and grounded task planning, using a novel Trajectory Segmentation Strategy. It relies on:

+- Hierarchical Embodiment Dataset: EMMA-X is trained on a dataset derived from BridgeV2 containing 60,000 robot manipulation trajectories, hierarchically annotated with visually grounded chain-of-thought reasoning. Trained on this dataset, EMMA-X produces output with the following components:

+- Grounded Chain-of-Thought Reasoning: Breaks tasks down into smaller, manageable subtasks, ensuring accurate task execution by mitigating hallucination in reasoning.

+- Gripper Position Guidance: An affordance point inside the image that indicates the target gripper position.

+- Look-Ahead Spatial Reasoning: Lets the model plan actions with explicit spatial guidance, improving long-horizon task performance.

+It generates:
+
+- Action: A 7-dimensional action vector for controlling the robot ([WidowX-6DoF](https://www.trossenrobotics.com/widowx-250)); a schematic sketch of this output structure follows below.
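As an illustration of the output structure described above, here is a minimal, hypothetical Python sketch of how one prediction step (grounded reasoning, gripper affordance point, and 7-dimensional action) could be represented and unpacked. The container and field names are assumptions, and the [dx, dy, dz, droll, dpitch, dyaw, gripper] ordering follows the common BridgeData/WidowX convention rather than anything stated in this README.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EmmaXOutput:
    """Illustrative container for one EMMA-X prediction step (names are assumptions)."""
    reasoning: str                   # grounded chain-of-thought text (subtask plan)
    gripper_point: Tuple[int, int]   # affordance point in image pixel coordinates (x, y)
    action: Tuple[float, ...]        # 7-dimensional control vector

def describe_action(action):
    """Unpack a 7-D action assuming the usual BridgeData/WidowX layout:
    [dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    dx, dy, dz, droll, dpitch, dyaw, gripper = action
    return (f"translate by ({dx:.3f}, {dy:.3f}, {dz:.3f}) m, "
            f"rotate by ({droll:.3f}, {dpitch:.3f}, {dyaw:.3f}) rad, "
            f"gripper {'open' if gripper > 0.5 else 'close'}")

# Hypothetical example values, for illustration only.
step = EmmaXOutput(
    reasoning="Subtask: move above the red block before grasping it.",
    gripper_point=(212, 148),
    action=(0.02, -0.01, 0.03, 0.0, 0.0, 0.1, 1.0),
)
print(describe_action(step.action))
```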

## Model Card
- **Developed by:** SUTD Declare Lab

- **Finetuned from:** [`openvla-7B`](https://huggingface.co/openvla/openvla-7b/)
- **Pretraining Dataset:** Augmented version of [Bridge V2](https://rail-berkeley.github.io/bridgedata/); for more information, see our repository.
- **Repository:** [https://github.com/declare-lab/Emma-X/](https://github.com/declare-lab/Emma-X/)
+- **Paper:** [Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning](https://arxiv.org/pdf/2412.11974)
+- **Project Page & Videos:** [https://declare-lab.github.io/Emma-X/](https://declare-lab.github.io/Emma-X/)
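The Getting Started code itself is outside this diff. As rough orientation only, the sketch below uses the OpenVLA-style transformers interface documented for the base openvla/openvla-7b checkpoint that Emma-X is finetuned from; whether the Emma-X checkpoint exposes the same processor and `predict_action` call is an assumption, so defer to the Getting Started section and the Emma-X repository for authoritative usage.

```python
# Sketch only: OpenVLA-style inference with Hugging Face transformers, as documented
# for the base openvla/openvla-7b model. Swapping in an Emma-X checkpoint id here is
# an assumption; follow this README's Getting Started section for the real usage.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "openvla/openvla-7b"  # base model; an Emma-X checkpoint id would go here (assumption)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current camera frame
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D vector: translation, rotation, gripper
```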
## Getting Started
```python
|