# Cosmos Autoregressive-based World Foundation Models

## Table of Contents

- [Getting Started](#getting-started)
  - [Set Up Docker Environment](#set-up-docker-environment)
  - [Download Checkpoints](#download-checkpoints)
- [Usage](#usage)
  - [Model Types](#model-types)
  - [Single and Batch Generation](#single-and-batch-generation)
  - [Sample Commands](#sample-commands)
    - [Base Models (4B/12B)](#base-basepy-4b-and-12b)
    - [Video2World Models (5B/13B)](#video2world-video2worldpy-5b-and-13b)
  - [Arguments](#arguments)
    - [Common Parameters](#common-parameters)
    - [Base Specific Parameters](#base-specific-parameters)
    - [Video2World Specific Parameters](#video2world-specific-parameters)
  - [Safety Features](#safety-features)

This page details the steps for using the Cosmos autoregressive-based world foundation models.

## Getting Started

### Set Up Docker Environment

Follow our [Installation Guide](../../../INSTALL.md) to set up the Docker environment. All commands on this page should be run inside Docker.

### Download Checkpoints

1. Generate a [Hugging Face](https://huggingface.co/settings/tokens) access token, setting its permission to 'Read' (the default token type is 'Fine-grained').
2. Log in to Hugging Face with the access token:
   ```bash
   huggingface-cli login
   ```
3. Download the Cosmos model weights from [Hugging Face](https://huggingface.co/collections/nvidia/cosmos-6751e884dc10e013a0a0d8e6):
   ```bash
   PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B
   ```
4. The downloaded files should be in the following structure:
   ```
   checkpoints/
   ├── Cosmos-1.0-Autoregressive-4B
   │   ├── model.pt
   │   └── config.json
   ├── Cosmos-1.0-Autoregressive-5B-Video2World
   │   ├── model.pt
   │   └── config.json
   ├── Cosmos-1.0-Autoregressive-12B
   │   ├── model.pt
   │   └── config.json
   ├── Cosmos-1.0-Autoregressive-13B-Video2World
   │   ├── model.pt
   │   └── config.json
   ├── Cosmos-1.0-Tokenizer-CV8x8x8
   │   ├── decoder.jit
   │   ├── encoder.jit
   │   └── mean_std.pt
   ├── Cosmos-1.0-Tokenizer-DV8x16x16
   │   ├── decoder.jit
   │   └── encoder.jit
   ├── Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8
   │   ├── aux_vars.pt
   │   └── model.pt
   └── Cosmos-1.0-Guardrail
       ├── aegis/
       ├── blocklist/
       ├── face_blur_filter/
       └── video_content_safety_filter/
   ```
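
As an optional sanity check (this is not part of the official setup), you can confirm that the autoregressive model weights landed where the inference scripts expect them before proceeding:

```bash
# Optional sanity check, assuming it is run from the repo root:
# verify that each autoregressive model directory contains model.pt.
for d in Cosmos-1.0-Autoregressive-4B \
         Cosmos-1.0-Autoregressive-5B-Video2World \
         Cosmos-1.0-Autoregressive-12B \
         Cosmos-1.0-Autoregressive-13B-Video2World; do
  if [ -f "checkpoints/$d/model.pt" ]; then
    echo "OK: $d"
  else
    echo "MISSING: checkpoints/$d/model.pt"
  fi
done
```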

## Usage

### Model Types

There are two model types available for autoregressive world generation:

1. **Base**: Supports world generation from image/video input
   * Models: `Cosmos-1.0-Autoregressive-4B` and `Cosmos-1.0-Autoregressive-12B`
   * Inference script: [base.py](/cosmos1/models/autoregressive/inference/base.py)
2. **Video2World**: Supports world generation from image/video input and text input
   * Models: `Cosmos-1.0-Autoregressive-5B-Video2World` and `Cosmos-1.0-Autoregressive-13B-Video2World`
   * Inference script: [video2world.py](/cosmos1/models/autoregressive/inference/video2world.py)

Our models now support video extension up to 33 frames. Starting from either a single image or a 9-frame video input, they can generate the remaining frames to reach the 33-frame length (generating 32 or 24 frames, respectively).

We have evaluated all eight possible configurations (4 models × 2 vision input types: image or video) using 100 test videos on physical AI topics. Below are the failure rates for each configuration:

| Model | Image input | Video input (9 frames) |
|:------------------------------------------|:--------------:|:-------------------------:|
| Cosmos-1.0-Autoregressive-4B | 15% | 1% |
| Cosmos-1.0-Autoregressive-5B-Video2World | 7% | 2% |
| Cosmos-1.0-Autoregressive-12B | 2% | 1% |
| Cosmos-1.0-Autoregressive-13B-Video2World | 3% | 0% |

We define failure cases as videos with severe distortions, such as:

* Sudden appearance of large unexpected objects
* Video degrading to a single solid color

Note that the following are not considered failures in our analysis:

* Static video frames
* Minor object distortions or artifacts

### Single and Batch Generation

We support both single and batch video generation.

For generating a single video, `base` mode requires the input argument `--input_image_or_video_path` (image/video input), while `video2world` mode requires both `--input_image_or_video_path` (image/video input) and `--prompt` (text input).

Note that our models only work with 1024x640 resolution videos. If the input image/video is not at this resolution, it will be resized and cropped.
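
The inference scripts handle this resizing and cropping automatically. If you want to pre-process or inspect inputs yourself, a scale-and-center-crop along these lines (assuming `ffmpeg` is installed; the pipeline itself does not require it) produces a 1024x640 video:

```bash
# Optional pre-processing sketch (requires ffmpeg, which the inference
# scripts do not need): scale so the frame covers 1024x640 while keeping
# the aspect ratio, then center-crop to exactly that resolution.
ffmpeg -i input.mp4 \
  -vf "scale=1024:640:force_original_aspect_ratio=increase,crop=1024:640" \
  output_1024x640.mp4
```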

For generating a batch of videos, both `base` and `video2world` require `--batch_input_path` (path to a JSONL file). For `base`, each line of the JSONL file must contain a "visual_input" field:

```json
{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}
```

For `video2world`, each line of the JSONL file must contain both "prompt" and "visual_input" fields:

```json
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
```
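
As an illustration (this is not a utility shipped with the repo), a `video2world` batch file can be assembled from a folder of clips with a short shell loop; the folder path and the fixed prompt below are placeholders:

```bash
# Hypothetical helper: pair every .mp4 in a folder with the same placeholder
# prompt and write one JSON object per line. Paths containing double quotes
# would break the JSON; this sketch does not handle that case.
for f in path/to/videos/*.mp4; do
  printf '{"prompt": "%s", "visual_input": "%s"}\n' \
    "A video recorded from a moving vehicle's perspective." "$f"
done > video2world_batch.jsonl
```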

### Sample Commands

There are two main demo scripts for autoregressive world generation: `base.py` and `video2world.py`. Below you will find sample commands for single and batch generation, as well as commands for running with low-memory GPUs using model offloading. We also provide a memory usage table comparing different offloading strategies to help with configuration.

#### Base (base.py): 4B and 12B

Generates world from image/video input.

The `input_type` argument can be either `video` or `image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance. Please use the provided values in the command examples.

Note that the command examples below all use video input. If you want to use image input, please change the `input_type` to `image`.

##### Single Generation

```bash
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0

# Example for low-memory GPUs using 4B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer

# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0

# Example for low-memory GPUs using 12B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer
```

##### Batch Generation

```bash
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0

# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0
```

##### Example Output

Here is an example output video generated using `base.py` with image input, using `Cosmos-1.0-Autoregressive-12B`:

<video src="https://github.com/user-attachments/assets/634403a5-1873-42d7-8dd0-eb7fb4ac8cf4">
  Your browser does not support the video tag.
</video>

The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The image is from the [BDD dataset](http://bdd-data.berkeley.edu/).

Here is an example output video generated using `base.py` with 9-frame video input, using `Cosmos-1.0-Autoregressive-12B`:

<video src="https://github.com/user-attachments/assets/1a3ff099-87d7-41e8-b149-a25cfcd4f40b">
  Your browser does not support the video tag.
</video>

The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`.

##### Inference Time and GPU Memory Usage

These numbers may vary based on system specifications and are provided for reference only.

| Offloading Strategy | Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|-------------|---------|---------|
| No offloading | 31.3 GB | 47.5 GB |
| Guardrails | 28.9 GB | 45.2 GB |
| Guardrails & Diffusion decoder | 28.5 GB | 43.1 GB |
| Guardrails & Diffusion decoder & Tokenizer | 27.3 GB | 42.9 GB |
| Guardrails & Diffusion decoder & Tokenizer & AR model | 18.7 GB | 27.4 GB |

End-to-end inference runtime on one H100 without offloading and after model initialization:

| Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|---------|---------|
| ~62 seconds | ~119 seconds |

#### Video2World (video2world.py): 5B and 13B

Generates world from image/video and text input.

The `input_type` argument can be either `text_and_video` or `text_and_image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance. Please use the provided values in the command examples.

Note that the command examples below all use video input. If you want to use image input, please change the `input_type` to `text_and_image`.

##### Single Generation

```bash
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0

# Example for low-memory GPUs using 5B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model

# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models

# Example for low-memory GPUs using 13B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model
```

##### Batch Generation

```bash
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0

# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models
```

##### Example Output

Here is an example output video generated using `video2world.py` with image input, using `Cosmos-1.0-Autoregressive-13B-Video2World`:

<video src="https://github.com/user-attachments/assets/869f3b81-fabd-462e-a545-c04cdd9c1d22">
  Your browser does not support the video tag.
</video>

The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The prompt for generating the video is:

```
A driving video captures a serene urban street scene on a sunny day. The camera is mounted on the dashboard of a moving vehicle, providing a first-person perspective as it travels down a two-lane road. The street is lined with parked cars on both sides, predominantly black and silver sedans and SUVs. The road is flanked by a mix of residential and commercial buildings, with a prominent red-brick building on the left side, featuring multiple windows and a flat roof. The sky is clear with a few scattered clouds, casting soft shadows on the street. Trees with lush green foliage line the right side of the road, providing a natural contrast to the urban environment. The camera remains steady, maintaining a consistent forward motion, suggesting a leisurely drive. Traffic is light, with a few vehicles moving in the opposite direction, including a black sedan and a yellow taxi. Street signs are visible, including a no-parking sign on the right. The overall atmosphere is calm and peaceful, with no pedestrians visible, emphasizing the focus on the drive and the surrounding urban landscape.
```

Here is an example output video generated using `video2world.py` with 9-frame video input, using `Cosmos-1.0-Autoregressive-13B-Video2World`:

<video src="https://github.com/user-attachments/assets/81840e1c-624b-4b01-9240-ab7db3722e58">
  Your browser does not support the video tag.
</video>

The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`. The prompt for generating the video is:

```
A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions.
```

##### Inference Time and GPU Memory Usage

These numbers may vary based on system specifications and are provided for reference only.

| Offloading Strategy | Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|-------------|---------|---------|
| No offloading | 66.2 GB | > 80 GB |
| Guardrails | 58.7 GB | 76.6 GB |
| Guardrails & T5 encoder | 41.3 GB | 58.0 GB |
| Guardrails & T5 encoder & Diffusion decoder | 29.0 GB | 46.9 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer | 28.8 GB | 46.7 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer & AR model | 21.1 GB | 30.9 GB |

End-to-end inference runtime on one H100, after model initialization, with no offloading for the 5B model and guardrail offloading for the 13B model:

| Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---------|---------|
| ~73 seconds | ~150 seconds |

### Arguments

#### Common Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--checkpoint_dir` | Directory containing model weights | "checkpoints" |
| `--video_save_name` | Output video filename for single video generation | "output" |
| `--video_save_folder` | Folder where all output videos are stored | "outputs/" |
| `--input_image_or_video_path` | Input image or video path. Required for single video generation | None |
| `--batch_input_path` | Path to a JSONL file of input images or videos. Required for batch video generation | None |
| `--num_input_frames` | Number of input frames to use for Video2World prediction | 9 |
| `--temperature` | Temperature used while sampling | 1.0 (recommend using values in sample commands provided) |
| `--top_p` | Top-p value for top-p sampling | 0.8 (recommend using values in sample commands provided) |
| `--seed` | Random seed | 0 |
| `--disable_diffusion_decoder` | When set to True, use the discrete tokenizer to decode discrete tokens to video instead of the diffusion decoder | False |
| `--offload_guardrail_models` | Offload guardrail models after inference, used for low-memory GPUs | False |
| `--offload_diffusion_decoder` | Offload diffusion decoder after inference, used for low-memory GPUs | False |
| `--offload_ar_model` | Offload AR model after inference, used for low-memory GPUs | False |
| `--offload_tokenizer` | Offload tokenizer after inference, used for low-memory GPUs | False |

#### Base Specific Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--ar_model_dir` | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
| `--input_type` | Input type, either `video` or `image` | "video" |

#### Video2World Specific Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--ar_model_dir` | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
| `--input_type` | Input type, either `text_and_video` or `text_and_image` | "text_and_video" |
| `--prompt` | Text prompt for single video generation. Required for single video generation | None |
| `--input_prompts_path` | Path to a JSONL file for batch video generation. Required for batch video generation | None |
| `--offload_text_encoder_model` | Offload text encoder after inference, used for low-memory GPUs | False |

### Safety Features

The model uses a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed, and any faces that appear in the output will be blurred by the guardrail.

For more information, check out the [Cosmos Guardrail Documentation](../guardrail/README.md).