# Cosmos Autoregressive-based World Foundation Models

## Table of Contents
- [Getting Started](#getting-started)
  - [Set Up Docker Environment](#set-up-docker-environment)
  - [Download Checkpoints](#download-checkpoints)
- [Usage](#usage)
  - [Model Types](#model-types)
  - [Single and Batch Generation](#single-and-batch-generation)
  - [Sample Commands](#sample-commands)
    - [Base Models (4B/12B)](#base-basepy-4b-and-12b)
    - [Video2World Models (5B/13B)](#video2world-video2worldpy-5b-and-13b)
  - [Arguments](#arguments)
    - [Common Parameters](#common-parameters)
    - [Base Specific Parameters](#base-specific-parameters)
    - [Video2World Specific Parameters](#video2world-specific-parameters)
  - [Safety Features](#safety-features)

This page details the steps for using the Cosmos autoregressive-based world foundation models.

## Getting Started

### Set Up Docker Environment

Follow our [Installation Guide](../../../INSTALL.md) to set up the Docker environment. All commands on this page should be run inside Docker.

### Download Checkpoints

1. Generate a [Hugging Face](https://huggingface.co/settings/tokens) access token. Set the access token to 'Read' permission (default is 'Fine-grained').

2. Log in to Hugging Face with the access token:

```bash
huggingface-cli login
```

3. Download the Cosmos model weights from [Hugging Face](https://huggingface.co/collections/nvidia/cosmos-6751e884dc10e013a0a0d8e6):

```bash
PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B
```

4. The downloaded files should be in the following structure:

```
checkpoints/
β”œβ”€β”€ Cosmos-1.0-Autoregressive-4B
β”‚   β”œβ”€β”€ model.pt
β”‚   └── config.json
β”œβ”€β”€ Cosmos-1.0-Autoregressive-5B-Video2World
β”‚   β”œβ”€β”€ model.pt
β”‚   └── config.json
β”œβ”€β”€ Cosmos-1.0-Autoregressive-12B
β”‚   β”œβ”€β”€ model.pt
β”‚   └── config.json
β”œβ”€β”€ Cosmos-1.0-Autoregressive-13B-Video2World
β”‚   β”œβ”€β”€ model.pt
β”‚   └── config.json
β”œβ”€β”€ Cosmos-1.0-Tokenizer-CV8x8x8
β”‚   β”œβ”€β”€ decoder.jit
β”‚   β”œβ”€β”€ encoder.jit
β”‚   └── mean_std.pt
β”œβ”€β”€ Cosmos-1.0-Tokenizer-DV8x16x16
β”‚   β”œβ”€β”€ decoder.jit
β”‚   └── encoder.jit
β”œβ”€β”€ Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8
β”‚   β”œβ”€β”€ aux_vars.pt
β”‚   └── model.pt
└── Cosmos-1.0-Guardrail
    β”œβ”€β”€ aegis/
    β”œβ”€β”€ blocklist/
    β”œβ”€β”€ face_blur_filter/
    └── video_content_safety_filter/
```
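After downloading, you can sanity-check that the expected checkpoint directories are present. This is an illustrative sketch based on the layout above (the directory names are copied from it); it is not a script shipped with the repository:

```python
from pathlib import Path

# Checkpoint directories expected after download (taken from the layout above).
EXPECTED = [
    "Cosmos-1.0-Autoregressive-4B",
    "Cosmos-1.0-Autoregressive-5B-Video2World",
    "Cosmos-1.0-Autoregressive-12B",
    "Cosmos-1.0-Autoregressive-13B-Video2World",
    "Cosmos-1.0-Tokenizer-CV8x8x8",
    "Cosmos-1.0-Tokenizer-DV8x16x16",
    "Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8",
    "Cosmos-1.0-Guardrail",
]

def check_checkpoints(root: str = "checkpoints") -> list[str]:
    """Return the expected checkpoint directories that are missing under root."""
    root_path = Path(root)
    return [name for name in EXPECTED if not (root_path / name).is_dir()]

if __name__ == "__main__":
    missing = check_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All expected checkpoint directories found.")
```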

## Usage


### Model Types

There are two model types available for autoregressive world generation:

1. **Base**: Supports world generation from image/video input

* Models: `Cosmos-1.0-Autoregressive-4B` and `Cosmos-1.0-Autoregressive-12B`
* Inference script: [base.py](/cosmos1/models/autoregressive/inference/base.py)

2. **Video2World**: Supports world generation from image/video input and text input

* Models: `Cosmos-1.0-Autoregressive-5B-Video2World` and `Cosmos-1.0-Autoregressive-13B-Video2World`
* Inference script: [video2world.py](/cosmos1/models/autoregressive/inference/video2world.py)

Our models now support video extension up to 33 frames. Starting from either a single image or a 9-frame video input, they can generate the remaining frames to reach the 33-frame length (generating 32 or 24 frames, respectively).

We have evaluated all eight possible configurations (4 models × 2 vision input types: image or video) using 100 test videos on physical AI topics. Below are the failure rates for each configuration:

| Model                                      | Image input | Video input (9 frames) |
|:------------------------------------------|:--------------:|:-------------------------:|
| Cosmos-1.0-Autoregressive-4B              | 15%           | 1%                       |
| Cosmos-1.0-Autoregressive-5B-Video2World  | 7%            | 2%                       |
| Cosmos-1.0-Autoregressive-12B             | 2%            | 1%                       |
| Cosmos-1.0-Autoregressive-13B-Video2World | 3%            | 0%                       |

We define failure cases as videos with severe distortions, such as:

* Sudden appearance of large unexpected objects
* Video degrading to a single solid color

Note that the following are not considered failures in our analysis:

* Static video frames
* Minor object distortions or artifacts

### Single and Batch Generation

We support both single and batch video generation.

For generating a single video, `base` mode requires the input argument `--input_image_or_video_path` (image/video input), while `video2world` mode requires both `--input_image_or_video_path` (image/video input) and `--prompt` (text input).

Note that our models only work with videos at 1024x640 resolution. If the input image/video is not at this resolution, it will be resized and cropped.
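One plausible way to reach 1024x640 is scale-to-cover followed by a center crop. The helper below computes the resize dimensions and crop box for that scheme; it is an illustrative sketch, not the repository's actual preprocessing code:

```python
def resize_and_crop_box(width: int, height: int,
                        target_w: int = 1024, target_h: int = 640):
    """Compute resize dimensions and a center-crop box to reach target_w x target_h.

    Scales the input so it covers the target resolution, then center-crops.
    Returns ((new_w, new_h), (left, top, right, bottom)).
    """
    scale = max(target_w / width, target_h / height)  # cover, don't letterbox
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    return (new_w, new_h), (left, top, left + target_w, top + target_h)
```

For example, a 1920x1080 input would first be scaled to 1138x640 and then cropped symmetrically on the left and right.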

For generating a batch of videos, both `base` and `video2world` require `--batch_input_path` (path to a JSONL file). For `base`, each line of the JSONL file must contain a "visual_input" field:

```json
{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}
```

For `video2world`, each line in the JSONL file must contain both "prompt" and "visual_input" fields:

```json
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
```
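A small helper for writing such JSONL files can be sketched as follows. The field names `visual_input` and `prompt` are taken from the examples above; the helper itself is illustrative and not part of the repository:

```python
import json

def write_batch_jsonl(entries: list[dict], out_path: str) -> None:
    """Write one JSON object per line, as expected by --batch_input_path."""
    with open(out_path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

# base mode: only "visual_input" per line
base_entries = [
    {"visual_input": "path/to/video1.mp4"},
    {"visual_input": "path/to/video2.mp4"},
]

# video2world mode: both "prompt" and "visual_input" per line
v2w_entries = [
    {"prompt": "prompt1", "visual_input": "path/to/video1.mp4"},
    {"prompt": "prompt2", "visual_input": "path/to/video2.mp4"},
]
```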

### Sample Commands

There are two main demo scripts for autoregressive world generation: `base.py` and `video2world.py`. Below you will find sample commands for single and batch generation, as well as commands for running with low-memory GPUs using model offloading. We also provide a memory usage table comparing different offloading strategies to help with configuration.

#### Base (base.py): 4B and 12B

Generates world from image/video input.

The `input_type` argument can be either `video` or `image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance. Please use the provided values in the command examples.

Note that the command examples below all use video input. If you want to use image input, please change the `input_type` to `image`.

##### Single Generation

```bash
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0

# Example for low-memory GPUs using 4B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer

# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0

# Example for low-memory GPUs using 12B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer
```

##### Batch Generation

```bash
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0

# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0
```

##### Example Output

Here is an example output video generated using `base.py` with image input, using `Cosmos-1.0-Autoregressive-12B`:

<video src="https://github.com/user-attachments/assets/634403a5-1873-42d7-8dd0-eb7fb4ac8cf4">
  Your browser does not support the video tag.
</video>

The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The image is from the [BDD dataset](http://bdd-data.berkeley.edu/).

Here is an example output video generated using `base.py` with 9-frame video input, using `Cosmos-1.0-Autoregressive-12B`:

<video src="https://github.com/user-attachments/assets/1a3ff099-87d7-41e8-b149-a25cfcd4f40b">
  Your browser does not support the video tag.
</video>

The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`.

##### Inference Time and GPU Memory Usage

These numbers may vary based on system specifications and are provided for reference only.

| Offloading Strategy | Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|-------------|---------|---------|
| No offloading | 31.3 GB | 47.5 GB |
| Guardrails | 28.9 GB | 45.2 GB |
| Guardrails & Diffusion decoder | 28.5 GB | 43.1 GB |
| Guardrails & Diffusion decoder & Tokenizer | 27.3 GB | 42.9 GB |
| Guardrails & Diffusion decoder & Tokenizer & AR model | 18.7 GB | 27.4 GB |

End-to-end inference runtime on one H100 without offloading and after model initialization:

| Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|---------|---------|
| ~62 seconds | ~119 seconds |

#### Video2World (video2world.py): 5B and 13B

Generates world from image/video and text input.

The `input_type` argument can be either `text_and_video` or `text_and_image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance. Please use the provided values in the command examples.

Note that the command examples below all use video input. If you want to use image input, please change the `input_type` to `text_and_image`.

##### Single Generation

```bash
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0

# Example for low-memory GPUs using 5B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model

# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models

# Example for low-memory GPUs using 13B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model
```

##### Batch Generation

```bash
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0

# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models
```

##### Example Output

Here is an example output video generated using `video2world.py` with image input, using `Cosmos-1.0-Autoregressive-13B-Video2World`:

<video src="https://github.com/user-attachments/assets/869f3b81-fabd-462e-a545-c04cdd9c1d22">
  Your browser does not support the video tag.
</video>

The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The prompt for generating the video is:

```
A driving video captures a serene urban street scene on a sunny day. The camera is mounted on the dashboard of a moving vehicle, providing a first-person perspective as it travels down a two-lane road. The street is lined with parked cars on both sides, predominantly black and silver sedans and SUVs. The road is flanked by a mix of residential and commercial buildings, with a prominent red-brick building on the left side, featuring multiple windows and a flat roof. The sky is clear with a few scattered clouds, casting soft shadows on the street. Trees with lush green foliage line the right side of the road, providing a natural contrast to the urban environment. The camera remains steady, maintaining a consistent forward motion, suggesting a leisurely drive. Traffic is light, with a few vehicles moving in the opposite direction, including a black sedan and a yellow taxi. Street signs are visible, including a no-parking sign on the right. The overall atmosphere is calm and peaceful, with no pedestrians visible, emphasizing the focus on the drive and the surrounding urban landscape.
```

Here is an example output video generated using `video2world.py` with 9-frame video input, using `Cosmos-1.0-Autoregressive-13B-Video2World`:

<video src="https://github.com/user-attachments/assets/81840e1c-624b-4b01-9240-ab7db3722e58">
  Your browser does not support the video tag.
</video>

The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`. The prompt for generating the video is:

```
A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions.
```

##### Inference Time and GPU Memory Usage

These numbers may vary based on system specifications and are provided for reference only.

| Offloading Strategy | Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|-------------|---------|---------|
| No offloading | 66.2 GB | > 80 GB |
| Guardrails | 58.7 GB | 76.6 GB |
| Guardrails & T5 encoder | 41.3 GB | 58.0 GB |
| Guardrails & T5 encoder & Diffusion decoder | 29.0 GB | 46.9 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer | 28.8 GB | 46.7 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer & AR model | 21.1 GB | 30.9 GB |

End-to-end inference runtime on one H100 with no offloading for 5B model and guardrail offloading for 13B, after model initialization:

| Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---------|---------|
| ~73 seconds | ~150 seconds |

### Arguments

#### Common Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--checkpoint_dir` | Directory containing model weights | "checkpoints" |
| `--video_save_name` | Output video filename for single video generation | "output" |
| `--video_save_folder` | Folder where all output videos are stored | "outputs/" |
| `--input_image_or_video_path` | Input image or video path. Required for single video generation | None |
| `--batch_input_path` | Path to a JSONL file listing input images or videos. Required for batch video generation | None |
| `--num_input_frames` | Number of input frames to use for Video2World prediction | 9 |
| `--temperature` | Temperature used while sampling | 1.0 (recommend using values in sample commands provided) |
| `--top_p` | Top-p value for top-p sampling | 0.8 (recommend using values in sample commands provided) |
| `--seed` | Random seed | 0 |
| `--disable_diffusion_decoder` | When set to True, use discrete tokenizer to decode discrete tokens to video. Otherwise, use diffusion decoder to decode video | False |
| `--offload_guardrail_models` | Offload guardrail models after inference, used for low-memory GPUs | False |
| `--offload_diffusion_decoder` | Offload diffusion decoder after inference, used for low-memory GPUs | False |
| `--offload_ar_model` | Offload AR model after inference, used for low-memory GPUs | False |
| `--offload_prompt_upsampler` | Offload prompt upsampler after inference, used for low-memory GPUs | False |

#### Base Specific Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--ar_model_dir` | Directory containing AR model weight | "Cosmos-1.0-Autoregressive-4B" |
| `--input_type` | Input type, either `video` or `image` | "video" |

#### Video2World Specific Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--ar_model_dir` | Directory containing AR model weight | "Cosmos-1.0-Autoregressive-4B" |
| `--input_type` | Input type, either `text_and_video` or `text_and_image` | "text_and_video" |
| `--prompt` | Text prompt for single video generation. Required for single video generation | None |
| `--input_prompts_path` | Path to JSONL file for batch video generation. Required for batch video generation | None |
| `--offload_text_encoder_model` | Offload text encoder after inference, used for low-memory GPUs | False |

### Safety Features

The model uses a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any human faces in generated videos are blurred by the guardrail.

For more information, check out the [Cosmos Guardrail Documentation](../guardrail/README.md).