|
# Cosmos Autoregressive-based World Foundation Models |
|
|
|
## Table of Contents |
|
- [Getting Started](#getting-started) |
|
- [Set Up Docker Environment](#set-up-docker-environment) |
|
- [Download Checkpoints](#download-checkpoints) |
|
- [Usage](#usage) |
|
- [Model Types](#model-types) |
|
- [Single and Batch Generation](#single-and-batch-generation) |
|
- [Sample Commands](#sample-commands) |
|
- [Base Models (4B/12B)](#base-basepy-4b-and-12b) |
|
- [Video2World Models (5B/13B)](#video2world-video2worldpy-5b-and-13b) |
|
- [Arguments](#arguments) |
|
- [Common Parameters](#common-parameters) |
|
- [Base Specific Parameters](#base-specific-parameters) |
|
- [Video2World Specific Parameters](#video2world-specific-parameters) |
|
- [Safety Features](#safety-features) |
|
|
|
This page details the steps for using the Cosmos autoregressive-based world foundation models. |
|
|
|
## Getting Started |
|
|
|
### Set Up Docker Environment |
|
|
|
Follow our [Installation Guide](../../../INSTALL.md) to set up the Docker environment. All commands on this page should be run inside Docker. |
|
|
|
### Download Checkpoints |
|
|
|
1. Generate a [Hugging Face](https://huggingface.co/settings/tokens) access token. Set the token type to 'Read' (the default type is 'Fine-grained').
|
|
|
2. Log in to Hugging Face with the access token: |
|
|
|
```bash |
|
huggingface-cli login |
|
``` |
|
|
|
3. Download the Cosmos model weights from [Hugging Face](https://huggingface.co/collections/nvidia/cosmos-6751e884dc10e013a0a0d8e6): |
|
|
|
```bash |
|
PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B |
|
``` |
|
|
|
4. The downloaded files should be in the following structure: |
|
|
|
``` |
|
checkpoints/
├── Cosmos-1.0-Autoregressive-4B
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Autoregressive-5B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Autoregressive-12B
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Autoregressive-13B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Tokenizer-CV8x8x8
│   ├── decoder.jit
│   ├── encoder.jit
│   └── mean_std.pt
├── Cosmos-1.0-Tokenizer-DV8x16x16
│   ├── decoder.jit
│   └── encoder.jit
├── Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8
│   ├── aux_vars.pt
│   └── model.pt
└── Cosmos-1.0-Guardrail
    ├── aegis/
    ├── blocklist/
    ├── face_blur_filter/
    └── video_content_safety_filter/
|
``` |
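
If you want to confirm the download completed, the following is a minimal sketch (not part of the repository; run it from the repository root) that checks for the files listed above:

```python
# Minimal sketch: verify the expected checkpoint layout shown above.
from pathlib import Path

EXPECTED = {
    "Cosmos-1.0-Autoregressive-4B": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-5B-Video2World": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-12B": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-13B-Video2World": ["model.pt", "config.json"],
    "Cosmos-1.0-Tokenizer-CV8x8x8": ["decoder.jit", "encoder.jit", "mean_std.pt"],
    "Cosmos-1.0-Tokenizer-DV8x16x16": ["decoder.jit", "encoder.jit"],
    "Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8": ["aux_vars.pt", "model.pt"],
    "Cosmos-1.0-Guardrail": [],  # contains the guardrail subfolders
}

root = Path("checkpoints")
for model_dir, files in EXPECTED.items():
    base = root / model_dir
    targets = [base / f for f in files] if files else [base]
    for path in targets:
        print(("ok     " if path.exists() else "MISSING"), path)
```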
|
|
|
## Usage |
|
|
|
|
|
### Model Types |
|
|
|
There are two model types available for autoregressive world generation: |
|
|
|
1. **Base**: Supports world generation from image/video input |
|
|
|
* Models: `Cosmos-1.0-Autoregressive-4B` and `Cosmos-1.0-Autoregressive-12B` |
|
* Inference script: [base.py](/cosmos1/models/autoregressive/inference/base.py) |
|
|
|
2. **Video2World**: Supports world generation from image/video input and text input |
|
|
|
* Models: `Cosmos-1.0-Autoregressive-5B-Video2World` and `Cosmos-1.0-Autoregressive-13B-Video2World` |
|
* Inference script: [video2world.py](/cosmos1/models/autoregressive/inference/video2world.py) |
|
|
|
Our models now support video extension up to 33 frames. Starting from either a single image or a 9-frame video input, they can generate the remaining frames to reach the 33-frame length (generating 32 or 24 frames, respectively). |
|
|
|
We have evaluated all eight possible configurations (4 models × 2 vision input types: image or video) using 100 test videos on physical AI topics. Below are the failure rates for each configuration:
|
|
|
| Model | Image input | Video input (9 frames) | |
|
|:------------------------------------------|:--------------:|:-------------------------:| |
|
| Cosmos-1.0-Autoregressive-4B | 15% | 1% | |
|
| Cosmos-1.0-Autoregressive-5B-Video2World | 7% | 2% | |
|
| Cosmos-1.0-Autoregressive-12B | 2% | 1% | |
|
| Cosmos-1.0-Autoregressive-13B-Video2World | 3% | 0% | |
|
|
|
We define failure cases as videos with severe distortions, such as: |
|
|
|
* Sudden appearance of large unexpected objects |
|
* Video degrading to a single solid color |
|
|
|
Note that the following are not considered failures in our analysis: |
|
|
|
* Static video frames |
|
* Minor object distortions or artifacts |
|
|
|
### Single and Batch Generation |
|
|
|
We support both single and batch video generation. |
|
|
|
For generating a single video, `base` mode requires the input argument `--input_image_or_video_path` (image/video input), while `video2world` mode requires both `--input_image_or_video_path` (image/video input) and `--prompt` (text input). |
|
|
|
Note that our models only work with 1024x640 resolution videos. If the input image/video is not at this resolution, it will be resized and cropped.
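
The inference scripts handle this preprocessing for you. If you prefer to prepare inputs yourself, here is a rough, illustrative sketch of resizing and center-cropping an image to 1024x640; it assumes Pillow is installed, uses placeholder file names, and is not the exact preprocessing code used by the model:

```python
# Illustrative resize-and-center-crop to 1024x640 (width x height).
# Assumes Pillow; file names are placeholders.
import math

from PIL import Image

TARGET_W, TARGET_H = 1024, 640

def resize_and_crop(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path).convert("RGB")
    # Scale so the image fully covers the target size, then crop the center.
    scale = max(TARGET_W / img.width, TARGET_H / img.height)
    resized = img.resize((math.ceil(img.width * scale), math.ceil(img.height * scale)))
    left = (resized.width - TARGET_W) // 2
    top = (resized.height - TARGET_H) // 2
    resized.crop((left, top, left + TARGET_W, top + TARGET_H)).save(dst_path)

resize_and_crop("my_input.jpg", "my_input_1024x640.jpg")
```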
|
|
|
For generating a batch of videos, both `base` and `video2world` require `--batch_input_path` (path to a JSONL file). For `base`, each line of the JSONL file must contain a "visual_input" field:
|
|
|
```json |
|
{"visual_input": "path/to/video1.mp4"} |
|
{"visual_input": "path/to/video2.mp4"} |
|
``` |
|
|
|
For `video2world`, each line in the JSONL file must contain both "prompt" and "visual_input" fields: |
|
|
|
```json |
|
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"} |
|
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"} |
|
``` |
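
If you assemble these batch files programmatically, here is a minimal sketch using only the standard library; the output file names, paths, and prompts below are placeholders:

```python
# Minimal sketch: write batch input JSONL files in the two formats shown above.
import json

base_entries = [
    {"visual_input": "path/to/video1.mp4"},
    {"visual_input": "path/to/video2.mp4"},
]

video2world_entries = [
    {"prompt": "prompt1", "visual_input": "path/to/video1.mp4"},
    {"prompt": "prompt2", "visual_input": "path/to/video2.mp4"},
]

with open("base.jsonl", "w") as f:
    f.writelines(json.dumps(entry) + "\n" for entry in base_entries)

with open("video2world.jsonl", "w") as f:
    f.writelines(json.dumps(entry) + "\n" for entry in video2world_entries)
```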
|
|
|
### Sample Commands |
|
|
|
There are two main demo scripts for autoregressive world generation: `base.py` and `video2world.py`. Below you will find sample commands for single and batch generation, as well as commands for running with low-memory GPUs using model offloading. We also provide a memory usage table comparing different offloading strategies to help with configuration. |
|
|
|
#### Base (base.py): 4B and 12B |
|
|
|
Generates world from image/video input. |
|
|
|
The `input_type` argument can be either `video` or `image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance. Please use the provided values in the command examples. |
|
|
|
Note that the command examples below all use video input. If you want to use image input, please change the `input_type` to `image`. |
|
|
|
##### Single Generation |
|
|
|
```bash |
|
# Example using 4B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \ |
|
--input_type=video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-4B \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-4B \ |
|
--top_p=0.8 \ |
|
--temperature=1.0 |
|
|
|
# Example for low-memory GPUs using 4B model with model offloading |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \ |
|
--input_type=video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-4B \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-4B \ |
|
--top_p=0.8 \ |
|
--temperature=1.0 \ |
|
--offload_guardrail_models \ |
|
--offload_diffusion_decoder \ |
|
--offload_ar_model \ |
|
--offload_tokenizer |
|
|
|
# Example using 12B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \ |
|
--input_type=video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-12B \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-12B \ |
|
--top_p=0.9 \ |
|
--temperature=1.0 |
|
|
|
# Example for low-memory GPUs using 12B model with model offloading |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \ |
|
--input_type=video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-12B \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-12B \ |
|
--top_p=0.9 \ |
|
--temperature=1.0 \ |
|
--offload_guardrail_models \ |
|
--offload_diffusion_decoder \ |
|
--offload_ar_model \ |
|
--offload_tokenizer |
|
``` |
|
|
|
##### Batch Generation |
|
|
|
```bash |
|
# Example using 4B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \ |
|
--input_type=video \ |
|
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \ |
|
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-4B \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-4B \ |
|
--top_p=0.8 \ |
|
--temperature=1.0 |
|
|
|
# Example using 12B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \ |
|
--input_type=video \ |
|
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \ |
|
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-12B \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-12B \ |
|
--top_p=0.9 \ |
|
--temperature=1.0 |
|
``` |
|
|
|
##### Example Output |
|
|
|
Here is an example output video generated using base.py with image input, using `Cosmos-1.0-Autoregressive-12B`: |
|
|
|
<video src="https://github.com/user-attachments/assets/634403a5-1873-42d7-8dd0-eb7fb4ac8cf4"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The image is from the [BDD dataset](http://bdd-data.berkeley.edu/).
|
|
|
Here is an example output video generated using base.py with 9-frame video input, using `Cosmos-1.0-Autoregressive-12B`: |
|
|
|
<video src="https://github.com/user-attachments/assets/1a3ff099-87d7-41e8-b149-a25cfcd4f40b"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`. |
|
|
|
##### Inference Time and GPU Memory Usage |
|
|
|
These numbers may vary based on system specifications and are provided for reference only. |
|
|
|
| Offloading Strategy | Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B | |
|
|-------------|---------|---------| |
|
| No offloading | 31.3 GB | 47.5 GB | |
|
| Guardrails | 28.9 GB | 45.2 GB | |
|
| Guardrails & Diffusion decoder | 28.5 GB | 43.1 GB | |
|
| Guardrails & Diffusion decoder & Tokenizer | 27.3 GB | 42.9 GB | |
|
| Guardrails & Diffusion decoder & Tokenizer & AR model | 18.7 GB | 27.4 GB | |
|
|
|
End-to-end inference runtime on one H100 without offloading and after model initialization: |
|
|
|
| Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B | |
|
|---------|---------| |
|
| ~62 seconds | ~119 seconds | |
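
If you are unsure which offloading flags you need, below is a rough sketch (assuming PyTorch, which is included in the Docker environment) for comparing free GPU memory against the tables above; the thresholds are illustrative values taken from the 4B column:

```python
# Rough sketch: report free GPU memory to help pick an offloading strategy.
# Assumes PyTorch and a CUDA-capable GPU; thresholds are illustrative (4B column above).
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3
print(f"Free GPU memory: {free_gb:.1f} GB of {total_bytes / 1024**3:.1f} GB")

if free_gb >= 32:
    print("The 4B model should run without offloading.")
elif free_gb >= 19:
    print("Consider enabling some of the offloading flags (see table above).")
else:
    print("Consider offloading guardrails, diffusion decoder, tokenizer, and the AR model.")
```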
|
|
|
#### Video2World (video2world.py): 5B and 13B |
|
|
|
Generates world from image/video and text input. |
|
|
|
The `input_type` argument can be either `text_and_video` or `text_and_image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance. Please use the provided values in the command examples. |
|
|
|
Note that the command examples below all use video input. If you want to use image input, please change the `input_type` to `text_and_image`. |
|
|
|
##### Single Generation |
|
|
|
```bash |
|
# Example using 5B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \ |
|
--input_type=text_and_video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \ |
|
--top_p=0.7 \ |
|
--temperature=1.0 |
|
|
|
# Example for low-memory GPUs using 5B model with model offloading |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \ |
|
--input_type=text_and_video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \ |
|
--top_p=0.7 \ |
|
--temperature=1.0 \ |
|
--offload_guardrail_models \ |
|
--offload_diffusion_decoder \ |
|
--offload_ar_model \ |
|
--offload_tokenizer \ |
|
--offload_text_encoder_model |
|
|
|
# Example using 13B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \ |
|
--input_type=text_and_video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \ |
|
--top_p=0.8 \ |
|
--temperature=1.0 \ |
|
--offload_guardrail_models |
|
|
|
# Example for low-memory GPUs using 13B model with model offloading |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \ |
|
--input_type=text_and_video \ |
|
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \ |
|
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \ |
|
--video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \ |
|
--top_p=0.8 \ |
|
--temperature=1.0 \ |
|
--offload_guardrail_models \ |
|
--offload_diffusion_decoder \ |
|
--offload_ar_model \ |
|
--offload_tokenizer \ |
|
--offload_text_encoder_model |
|
``` |
|
|
|
##### Batch Generation |
|
|
|
```bash |
|
# Example using 5B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \ |
|
--input_type=text_and_video \ |
|
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \ |
|
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-5B-Video2World \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \ |
|
--top_p=0.7 \ |
|
--temperature=1.0 |
|
|
|
# Example using 13B model |
|
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \ |
|
--input_type=text_and_video \ |
|
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \ |
|
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-13B-Video2World \ |
|
--ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \ |
|
--top_p=0.8 \ |
|
--temperature=1.0 \ |
|
--offload_guardrail_models |
|
``` |
|
|
|
##### Example Output |
|
|
|
Here is an example output video generated using video2world.py with image input, using `Cosmos-1.0-Autoregressive-13B-Video2World`: |
|
|
|
<video src="https://github.com/user-attachments/assets/869f3b81-fabd-462e-a545-c04cdd9c1d22"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The prompt for generating the video is: |
|
|
|
``` |
|
A driving video captures a serene urban street scene on a sunny day. The camera is mounted on the dashboard of a moving vehicle, providing a first-person perspective as it travels down a two-lane road. The street is lined with parked cars on both sides, predominantly black and silver sedans and SUVs. The road is flanked by a mix of residential and commercial buildings, with a prominent red-brick building on the left side, featuring multiple windows and a flat roof. The sky is clear with a few scattered clouds, casting soft shadows on the street. Trees with lush green foliage line the right side of the road, providing a natural contrast to the urban environment. The camera remains steady, maintaining a consistent forward motion, suggesting a leisurely drive. Traffic is light, with a few vehicles moving in the opposite direction, including a black sedan and a yellow taxi. Street signs are visible, including a no-parking sign on the right. The overall atmosphere is calm and peaceful, with no pedestrians visible, emphasizing the focus on the drive and the surrounding urban landscape. |
|
``` |
|
|
|
Here is an example output video generated using video2world.py with 9-frame video input, using `Cosmos-1.0-Autoregressive-13B-Video2World`: |
|
|
|
<video src="https://github.com/user-attachments/assets/81840e1c-624b-4b01-9240-ab7db3722e58"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`. The prompt for generating the video is: |
|
|
|
``` |
|
A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions. |
|
``` |
|
|
|
##### Inference Time and GPU Memory Usage |
|
|
|
These numbers may vary based on system specifications and are provided for reference only. |
|
|
|
| Offloading Strategy | Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World | |
|
|-------------|---------|---------| |
|
| No offloading | 66.2 GB | > 80 GB | |
|
| Guardrails | 58.7 GB | 76.6 GB | |
|
| Guardrails & T5 encoder | 41.3 GB | 58.0 GB | |
|
| Guardrails & T5 encoder & Diffusion decoder | 29.0 GB | 46.9 GB | |
|
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer | 28.8 GB | 46.7 GB | |
|
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer & AR model | 21.1 GB | 30.9 GB | |
|
|
|
End-to-end inference runtime on one H100 (after model initialization), with no offloading for the 5B model and guardrail offloading for the 13B model:
|
|
|
| Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World | |
|
|---------|---------| |
|
| ~73 seconds | ~150 seconds | |
|
|
|
### Arguments |
|
|
|
#### Common Parameters |
|
|
|
| Parameter | Description | Default | |
|
|-----------|-------------|---------| |
|
| `--checkpoint_dir` | Directory containing model weights | "checkpoints" | |
|
| `--video_save_name` | Output video filename for single video generation | "output" | |
|
| `--video_save_folder` | Folder where all output videos are stored | "outputs/" | |
|
| `--input_image_or_video_path` | Input image or video path. Required for single video generation | None | |
|
| `--batch_input_path` | Path to a JSONL file listing input images or videos. Required for batch video generation | None |
|
| `--num_input_frames` | Number of input frames to use for Video2World prediction | 9 | |
|
| `--temperature` | Temperature used while sampling | 1.0 (recommend using values in sample commands provided) | |
|
| `--top_p` | Top-p value for top-p sampling | 0.8 (recommend using values in sample commands provided) | |
|
| `--seed` | Random seed | 0 | |
|
| `--disable_diffusion_decoder` | When set to True, use discrete tokenizer to decode discrete tokens to video. Otherwise, use diffusion decoder to decode video | False | |
|
| `--offload_guardrail_models` | Offload guardrail models after inference, used for low-memory GPUs | False | |
|
| `--offload_diffusion_decoder` | Offload diffusion decoder after inference, used for low-memory GPUs | False | |
|
| `--offload_ar_model` | Offload AR model after inference, used for low-memory GPUs | False | |
|
| `--offload_prompt_upsampler` | Offload prompt upsampler after inference, used for low-memory GPUs | False | |
|
|
|
#### Base Specific Parameters |
|
|
|
| Parameter | Description | Default | |
|
|-----------|-------------|---------| |
|
| `--ar_model_dir` | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
|
| `--input_type` | Input type, either `video` or `image` | "video" | |
|
|
|
#### Video2World Specific Parameters |
|
|
|
| Parameter | Description | Default | |
|
|-----------|-------------|---------| |
|
| `--ar_model_dir` | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
|
| `--input_type` | Input type, either `text_and_video` or `text_and_image` | "text_and_video" | |
|
| `--prompt` | Text prompt for single video generation. Required for single video generation | None | |
|
| `--input_prompts_path` | Path to JSONL file for batch video generation. Required for batch video generation | None | |
|
| `--offload_text_encoder_model` | Offload text encoder after inference, used for low-memory GPUs | False | |
|
|
|
### Safety Features |
|
|
|
The model includes a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed, and any faces that do appear in the output are blurred by the guardrail.
|
|
|
For more information, check out the [Cosmos Guardrail Documentation](../guardrail/README.md). |
|
|