Image-Text-to-Text · Safetensors · openvla · custom_code
soujanyaporia committed · Commit 3f185b0 · verified · 1 Parent(s): 200d588

Update README.md

Files changed (1)
  1. README.md +20 -21
README.md CHANGED
@@ -1,13 +1,13 @@
- ---
- license: apache-2.0
- datasets:
- - declare-lab/Emma-X-GCOT
- metrics:
- - accuracy
- base_model:
- - openvla/openvla-7b
- pipeline_tag: image-text-to-text
- ---

  <h1 align="center">✨
  <br/>
@@ -30,20 +30,19 @@ Meet Emma-X, an Embodied Multimodal Action Model

  ## Model Overview

- EMMA-X is an Embodied Multimodal Action (VLA) Model designed to bridge the gap between Visual-Language Models (VLMs) and robotic control tasks. EMMA-X generalizes effectively across diverse environments, objects, and instructions while excelling at long-horizon spatial reasoning and grounded task planning using a novel Trajectory Segmentation Strategy.

- Hierarchical Embodiment Dataset:
- EMMA-X is trained on a dataset derived from BridgeV2, containing 60,000 robot manipulation trajectories. Trained using a hierarchical dataset with visual grounded chain-of-thought reasoning, EMMA-X's output will include the following components:

- Grounded Chain-of-Thought Reasoning:
- Helps break down tasks into smaller, manageable subtasks, ensuring accurate task execution by mitigating hallucination in reasoning.

- Gripper Position Guidance: Affordance point inside the image.

- Look-Ahead Spatial Reasoning:
- Enables the model to plan actions while considering spatial guidance for effective planning, enhancing long-horizon task performance.

- Action: Action policy in 7-dimensional vector to control the robot ([WidowX-6Dof](https://www.trossenrobotics.com/widowx-250)).

  ## Model Card
  - **Developed by:** SUTD Declare Lab
@@ -53,8 +52,8 @@ Action: Action policy in 7-dimensional vector to control the robot ([WidowX-6Dof
  - **Finetuned from:** [`openvla-7B`](https://huggingface.co/openvla/openvla-7b/)
  - **Pretraining Dataset:** Augmented version of [Bridge V2](https://rail-berkeley.github.io/bridgedata/), for more info check our repository.
  - **Repository:** [https://github.com/declare-lab/Emma-X/](https://github.com/declare-lab/Emma-X/)
- - **Paper:** [Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning]()
- - **Project Page & Videos:** []()

  ## Getting Started
  ```python
 
+ ---
+ license: apache-2.0
+ datasets:
+ - declare-lab/Emma-X-GCOT
+ metrics:
+ - accuracy
+ base_model:
+ - openvla/openvla-7b
+ pipeline_tag: image-text-to-text
+ ---

  <h1 align="center">✨
  <br/>
 

  ## Model Overview

+ EMMA-X is an Embodied Multimodal Action (VLA) Model designed to bridge the gap between Visual-Language Models (VLMs) and robotic control tasks. EMMA-X generalizes effectively across diverse environments, objects, and instructions while excelling at long-horizon spatial reasoning and grounded task planning using a novel Trajectory Segmentation Strategy. It relies on:

+ - Hierarchical Embodiment Dataset: EMMA-X is trained on a dataset of 60,000 robot manipulation trajectories derived from BridgeV2. Because this hierarchical dataset is annotated with visually grounded chain-of-thought reasoning, EMMA-X's output includes the following components:

+ - Grounded Chain-of-Thought Reasoning: Breaks tasks down into smaller, manageable subtasks, ensuring accurate task execution by mitigating hallucination in reasoning.

+ - Gripper Position Guidance: An affordance point inside the image.

+ - Look-Ahead Spatial Reasoning: Enables the model to plan actions with spatial guidance, improving long-horizon task performance.

+ It generates:
+
+ - Action: An action policy expressed as a 7-dimensional vector that controls the robot ([WidowX-6Dof](https://www.trossenrobotics.com/widowx-250)); a sketch of how such a vector is commonly laid out follows this list.
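The exact output schema is not spelled out on this card, so the sketch below is only a rough illustration of how such a structured response, with its 7-dimensional end-effector action, might be unpacked. The field names (`reasoning`, `gripper_position`, `action`) and the `[dx, dy, dz, droll, dpitch, dyaw, gripper]` ordering follow the common BridgeV2/OpenVLA convention and are assumptions, not a documented interface.

```python
import numpy as np

# Hypothetical structured output from EMMA-X. The dictionary keys and the 7-D
# action layout ([dx, dy, dz, droll, dpitch, dyaw, gripper]) are assumptions
# based on the usual BridgeV2 / OpenVLA convention, not a documented schema.
example_output = {
    "reasoning": "Subtask: move above the red cup before grasping it.",  # grounded chain-of-thought
    "gripper_position": (212, 148),  # affordance point, assumed to be image-pixel coordinates
    "action": np.array([0.02, -0.01, 0.00, 0.0, 0.0, 0.1, 1.0]),  # 7-D action vector
}

def split_action(action: np.ndarray):
    """Split a 7-D action into translation, rotation, and gripper parts."""
    assert action.shape == (7,)
    delta_xyz = action[:3]   # Cartesian end-effector displacement (typically meters)
    delta_rpy = action[3:6]  # roll/pitch/yaw deltas (typically radians)
    gripper = action[6]      # gripper command, e.g. 1.0 = open, 0.0 = closed
    return delta_xyz, delta_rpy, gripper

delta_xyz, delta_rpy, gripper = split_action(example_output["action"])
print("translate by", delta_xyz, "rotate by", delta_rpy, "gripper:", gripper)
```

For the actual inference interface, see the Getting Started snippet below and the GitHub repository.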
 
  ## Model Card
  - **Developed by:** SUTD Declare Lab

  - **Finetuned from:** [`openvla-7B`](https://huggingface.co/openvla/openvla-7b/)
  - **Pretraining Dataset:** Augmented version of [Bridge V2](https://rail-berkeley.github.io/bridgedata/), for more info check our repository.
  - **Repository:** [https://github.com/declare-lab/Emma-X/](https://github.com/declare-lab/Emma-X/)
+ - **Paper:** [Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning](https://arxiv.org/pdf/2412.11974)
+ - **Project Page & Videos:** [https://declare-lab.github.io/Emma-X/](https://declare-lab.github.io/Emma-X/)

  ## Getting Started
  ```python