zzc0208 committed (verified)
Commit f1f9265 · Parent(s): 3d65f8f

Upload 265 files

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +7 -0
  2. .github/workflows/bot-autolint.yaml +50 -0
  3. .github/workflows/ci.yaml +54 -0
  4. .gitignore +184 -0
  5. .pre-commit-config.yaml +62 -0
  6. CITATION.bib +9 -0
  7. CIs/add_license_all.sh +2 -0
  8. Dockerfile +26 -0
  9. LICENSE +201 -0
  10. README.md +401 -12
  11. app.py +441 -93
  12. app/app_sana.py +502 -0
  13. app/app_sana_4bit.py +409 -0
  14. app/app_sana_4bit_compare_bf16.py +313 -0
  15. app/app_sana_controlnet_hed.py +306 -0
  16. app/app_sana_multithread.py +565 -0
  17. app/safety_check.py +72 -0
  18. app/sana_controlnet_pipeline.py +353 -0
  19. app/sana_pipeline.py +304 -0
  20. asset/Sana.jpg +3 -0
  21. asset/app_styles/controlnet_app_style.css +28 -0
  22. asset/controlnet/ref_images/A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a la.jpg +3 -0
  23. asset/controlnet/ref_images/a house.png +3 -0
  24. asset/controlnet/ref_images/a living room.png +3 -0
  25. asset/controlnet/ref_images/nvidia.png +0 -0
  26. asset/controlnet/samples_controlnet.json +26 -0
  27. asset/docs/4bit_sana.md +68 -0
  28. asset/docs/8bit_sana.md +109 -0
  29. asset/docs/ComfyUI/Sana_CogVideoX.json +1142 -0
  30. asset/docs/ComfyUI/Sana_FlowEuler.json +508 -0
  31. asset/docs/ComfyUI/Sana_FlowEuler_2K.json +508 -0
  32. asset/docs/ComfyUI/Sana_FlowEuler_4K.json +508 -0
  33. asset/docs/ComfyUI/comfyui.md +40 -0
  34. asset/docs/metrics_toolkit.md +118 -0
  35. asset/docs/model_zoo.md +157 -0
  36. asset/docs/sana_controlnet.md +75 -0
  37. asset/docs/sana_lora_dreambooth.md +144 -0
  38. asset/example_data/00000000.jpg +3 -0
  39. asset/example_data/00000000.png +3 -0
  40. asset/example_data/00000000.txt +1 -0
  41. asset/example_data/00000000_InternVL2-26B.json +5 -0
  42. asset/example_data/00000000_InternVL2-26B_clip_score.json +5 -0
  43. asset/example_data/00000000_VILA1-5-13B.json +5 -0
  44. asset/example_data/00000000_VILA1-5-13B_clip_score.json +5 -0
  45. asset/example_data/00000000_prompt_clip_score.json +5 -0
  46. asset/example_data/meta_data.json +7 -0
  47. asset/examples.py +69 -0
  48. asset/logo.png +0 -0
  49. asset/model-incremental.jpg +3 -0
  50. asset/model_paths.txt +2 -0
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ asset/controlnet/ref_images/a[[:space:]]house.png filter=lfs diff=lfs merge=lfs -text
+ asset/controlnet/ref_images/a[[:space:]]living[[:space:]]room.png filter=lfs diff=lfs merge=lfs -text
+ asset/controlnet/ref_images/A[[:space:]]transparent[[:space:]]sculpture[[:space:]]of[[:space:]]a[[:space:]]duck[[:space:]]made[[:space:]]out[[:space:]]of[[:space:]]glass.[[:space:]]The[[:space:]]sculpture[[:space:]]is[[:space:]]in[[:space:]]front[[:space:]]of[[:space:]]a[[:space:]]painting[[:space:]]of[[:space:]]a[[:space:]]la.jpg filter=lfs diff=lfs merge=lfs -text
+ asset/example_data/00000000.jpg filter=lfs diff=lfs merge=lfs -text
+ asset/example_data/00000000.png filter=lfs diff=lfs merge=lfs -text
+ asset/model-incremental.jpg filter=lfs diff=lfs merge=lfs -text
+ asset/Sana.jpg filter=lfs diff=lfs merge=lfs -text
.github/workflows/bot-autolint.yaml ADDED
@@ -0,0 +1,50 @@
+ name: Auto Lint (triggered by "auto lint" label)
+ on:
+   pull_request:
+     types:
+       - opened
+       - edited
+       - closed
+       - reopened
+       - synchronize
+       - labeled
+       - unlabeled
+ # run only one unit test for a branch / tag.
+ concurrency:
+   group: ci-lint-${{ github.ref }}
+   cancel-in-progress: true
+ jobs:
+   lint-by-label:
+     if: contains(github.event.pull_request.labels.*.name, 'lint wanted')
+     runs-on: ubuntu-latest
+     steps:
+       - name: Check out Git repository
+         uses: actions/checkout@v4
+         with:
+           token: ${{ secrets.PAT }}
+           ref: ${{ github.event.pull_request.head.ref }}
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.10'
+       - name: Test pre-commit hooks
+         continue-on-error: true
+         uses: pre-commit/[email protected]  # sync with https://github.com/Efficient-Large-Model/VILA-Internal/blob/main/.github/workflows/pre-commit.yaml
+         with:
+           extra_args: --all-files
+       - name: Check if there are any changes
+         id: verify_diff
+         run: |
+           git diff --quiet . || echo "changed=true" >> $GITHUB_OUTPUT
+       - name: Commit files
+         if: steps.verify_diff.outputs.changed == 'true'
+         run: |
+           git config --local user.email "[email protected]"
+           git config --local user.name "GitHub Action"
+           git add .
+           git commit -m "[CI-Lint] Fix code style issues with pre-commit ${{ github.sha }}" -a
+           git push
+       - name: Remove label(s) after lint
+         uses: actions-ecosystem/action-remove-labels@v1
+         with:
+           labels: lint wanted
.github/workflows/ci.yaml ADDED
@@ -0,0 +1,54 @@
+ name: ci
+ on:
+   pull_request:
+   push:
+     branches: [main, feat/Sana-public, feat/Sana-public-for-NVLab]
+ concurrency:
+   group: ci-${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: true
+ # if: ${{ github.repository == 'Efficient-Large-Model/Sana' }}
+ jobs:
+   pre-commit:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Check out Git repository
+         uses: actions/checkout@v4
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: 3.10.10
+       - name: Test pre-commit hooks
+         uses: pre-commit/[email protected]
+   tests-bash:
+     # needs: pre-commit
+     runs-on: self-hosted
+     steps:
+       - name: Check out Git repository
+         uses: actions/checkout@v4
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: 3.10.10
+       - name: Set up the environment
+         run: |
+           bash environment_setup.sh
+       - name: Run tests with Slurm
+         run: |
+           sana-run --pty -m ci -J tests-bash bash tests/bash/entry.sh
+
+   # tests-python:
+   #   needs: pre-commit
+   #   runs-on: self-hosted
+   #   steps:
+   #     - name: Check out Git repository
+   #       uses: actions/checkout@v4
+   #     - name: Set up Python
+   #       uses: actions/setup-python@v5
+   #       with:
+   #         python-version: 3.10.10
+   #     - name: Set up the environment
+   #       run: |
+   #         ./environment_setup.sh
+   #     - name: Run tests with Slurm
+   #       run: |
+   #         sana-run --pty -m ci -J tests-python pytest tests/python
.gitignore ADDED
@@ -0,0 +1,184 @@
+ # Sana related files
+ *_dev.py
+ *_dev.sh
+ .count.db
+ .gradio/
+ .idea/
+ *.png
+ tmp*
+ output*
+ output/
+ outputs/
+ wandb/
+ .vscode/
+ private/
+ ldm_ae*
+ data/*
+ *.pth
+ .gradio/
+ *.bin
+ *.safetensors
+ *.pkl
+
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+ .pdm.toml
+ .pdm-python
+ .pdm-build/
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
.pre-commit-config.yaml ADDED
@@ -0,0 +1,62 @@
+ repos:
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v5.0.0
+     hooks:
+       - id: trailing-whitespace
+         name: (Common) Remove trailing whitespaces
+       - id: mixed-line-ending
+         name: (Common) Fix mixed line ending
+         args: [--fix=lf]
+       - id: end-of-file-fixer
+         name: (Common) Remove extra EOF newlines
+       - id: check-merge-conflict
+         name: (Common) Check for merge conflicts
+       - id: requirements-txt-fixer
+         name: (Common) Sort "requirements.txt"
+       - id: fix-encoding-pragma
+         name: (Python) Remove encoding pragmas
+         args: [--remove]
+       # - id: debug-statements
+       #   name: (Python) Check for debugger imports
+       - id: check-json
+         name: (JSON) Check syntax
+       - id: check-yaml
+         name: (YAML) Check syntax
+       - id: check-toml
+         name: (TOML) Check syntax
+   # - repo: https://github.com/shellcheck-py/shellcheck-py
+   #   rev: v0.10.0.1
+   #   hooks:
+   #     - id: shellcheck
+   - repo: https://github.com/google/yamlfmt
+     rev: v0.13.0
+     hooks:
+       - id: yamlfmt
+   - repo: https://github.com/executablebooks/mdformat
+     rev: 0.7.16
+     hooks:
+       - id: mdformat
+         name: (Markdown) Format docs with mdformat
+   - repo: https://github.com/asottile/pyupgrade
+     rev: v3.2.2
+     hooks:
+       - id: pyupgrade
+         name: (Python) Update syntax for newer versions
+         args: [--py37-plus]
+   - repo: https://github.com/psf/black
+     rev: 22.10.0
+     hooks:
+       - id: black
+         name: (Python) Format code with black
+   - repo: https://github.com/pycqa/isort
+     rev: 5.12.0
+     hooks:
+       - id: isort
+         name: (Python) Sort imports with isort
+   - repo: https://github.com/pre-commit/mirrors-clang-format
+     rev: v15.0.4
+     hooks:
+       - id: clang-format
+         name: (C/C++/CUDA) Format code with clang-format
+         args: [-style=google, -i]
+         types_or: [c, c++, cuda]
CITATION.bib ADDED
@@ -0,0 +1,9 @@
+ @misc{xie2024sana,
+       title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
+       author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
+       year={2024},
+       eprint={2410.10629},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2410.10629},
+ }
CIs/add_license_all.sh ADDED
@@ -0,0 +1,2 @@
+ #!/bin/bash
+ addlicense -s -c 'NVIDIA CORPORATION & AFFILIATES' -ignore "**/*__init__.py" **/*.py
Dockerfile ADDED
@@ -0,0 +1,26 @@
+ FROM nvcr.io/nvidia/pytorch:24.06-py3
+
+ ENV PATH=/opt/conda/bin:$PATH
+
+ RUN apt-get update && apt-get install -y \
+     libgl1-mesa-glx \
+     libglib2.0-0 \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ RUN curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o ~/miniconda.sh \
+     && sh ~/miniconda.sh -b -p /opt/conda \
+     && rm ~/miniconda.sh
+
+ COPY pyproject.toml pyproject.toml
+ COPY diffusion diffusion
+ COPY configs configs
+ COPY sana sana
+ COPY app app
+ COPY tools tools
+
+ COPY environment_setup.sh environment_setup.sh
+ RUN ./environment_setup.sh
+
+ CMD ["python", "-u", "-W", "ignore", "app/app_sana.py", "--share", "--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml", "--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth"]
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2024 Nvidia
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -1,12 +1,401 @@
- ---
- title: Twig V0 Alpha Demo CPU
- emoji: 🖼
- colorFrom: purple
- colorTo: red
- sdk: gradio
- sdk_version: 5.0.1
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <p align="center" style="border-radius: 10px">
+   <img src="asset/logo.png" width="35%" alt="logo"/>
+ </p>
+
+ # ⚡️Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
+
+ ### <div align="center"> ICLR 2025 Oral Presentation </div>
+
+ <div align="center">
+   <a href="https://nvlabs.github.io/Sana/"><img src="https://img.shields.io/static/v1?label=Project&message=Github&color=blue&logo=github-pages"></a> &ensp;
+   <a href="https://hanlab.mit.edu/projects/sana/"><img src="https://img.shields.io/static/v1?label=Page&message=MIT&color=darkred&logo=github-pages"></a> &ensp;
+   <a href="https://arxiv.org/abs/2410.10629"><img src="https://img.shields.io/static/v1?label=Arxiv&message=Sana&color=red&logo=arxiv"></a> &ensp;
+   <a href="https://nv-sana.mit.edu/"><img src="https://img.shields.io/static/v1?label=Demo:6x3090&message=MIT&color=yellow"></a> &ensp;
+   <a href="https://nv-sana.mit.edu/4bit/"><img src="https://img.shields.io/static/v1?label=Demo:1x3090&message=4bit&color=yellow"></a> &ensp;
+   <a href="https://nv-sana.mit.edu/ctrlnet/"><img src="https://img.shields.io/static/v1?label=Demo:1x3090&message=ControlNet&color=yellow"></a> &ensp;
+   <a href="https://replicate.com/chenxwh/sana"><img src="https://img.shields.io/static/v1?label=API:H100&message=Replicate&color=pink"></a> &ensp;
+   <a href="https://discord.gg/rde6eaE5Ta"><img src="https://img.shields.io/static/v1?label=Discuss&message=Discord&color=purple&logo=discord"></a> &ensp;
+ </div>
+
+ <p align="center" border-radius="10px">
+   <img src="asset/Sana.jpg" width="90%" alt="teaser_page1"/>
+ </p>
+
+ ## 💡 Introduction
+
+ We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution.
+ Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and it is deployable on a laptop GPU.
+ Core designs include:
+
+ (1) [**DC-AE**](https://hanlab.mit.edu/projects/dc-ae): unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. \
+ (2) **Linear DiT**: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality (see the sketch below). \
+ (3) **Decoder-only text encoder**: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment. \
+ (4) **Efficient training and sampling**: we propose **Flow-DPM-Solver** to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
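*Editor's illustration (not code from this repository):* a toy sketch of the two efficiency levers in (1) and (2) above. The tensor shapes, the ReLU feature map, and the patch-size-1 token arithmetic are assumptions made for intuition only, not Sana's actual implementation.

```python
import torch
import torch.nn.functional as F


def linear_attention(q, k, v):
    """O(N) attention sketch: push q/k through a positive feature map, then
    reassociate the matmuls as (k^T v) first, so cost grows linearly with the
    token count N instead of quadratically. Shapes: (batch, heads, N, dim)."""
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # dim-by-dim summary, never an N x N matrix
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)  # per-query normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)


# Why a 32x autoencoder matters for the token count of a 1024 x 1024 image:
#   8x  compression -> 128 x 128 latent -> 16,384 tokens
#   32x compression ->  32 x  32 latent ->  1,024 tokens (16x fewer to attend over)
```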
+
+ As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image. Sana enables content creation at low cost.
+
+ <p align="center" border-radius="10px">
+   <img src="asset/model-incremental.jpg" width="90%" alt="teaser_page2"/>
+ </p>
+
+ ## 🔥🔥 News
+
+ - (🔥 New) \[2025/2/10\] 🚀Sana + ControlNet is released. [\[Guidance\]](asset/docs/sana_controlnet.md) | [\[Model\]](asset/docs/model_zoo.md) | [\[Demo\]](https://nv-sana.mit.edu/ctrlnet/)
+ - (🔥 New) \[2025/1/30\] CAME-8bit optimizer code is released, saving more GPU memory during training. [\[How to config\]](https://github.com/NVlabs/Sana/blob/main/configs/sana_config/1024ms/Sana_1600M_img1024_CAME8bit.yaml#L86)
+ - (🔥 New) \[2025/1/29\] 🎉 🎉 🎉**SANA 1.5 is out! Figure out how to do efficient training & inference scaling!** 🚀[\[Tech Report\]](https://arxiv.org/abs/2501.18427)
+ - (🔥 New) \[2025/1/24\] 4bit-Sana is released, powered by the [SVDQuant and Nunchaku](https://github.com/mit-han-lab/nunchaku) inference engine. Now you can run Sana within **8GB** of GPU VRAM. [\[Guidance\]](asset/docs/4bit_sana.md) [\[Demo\]](https://svdquant.mit.edu/) [\[Model\]](asset/docs/model_zoo.md)
+ - (🔥 New) \[2025/1/24\] DCAE-1.1 is released, with better reconstruction quality. [\[Model\]](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.1) [\[diffusers\]](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers)
+ - (🔥 New) \[2025/1/23\] **Sana is accepted as an Oral by ICLR-2025.** 🎉🎉🎉
+
+ ______________________________________________________________________
+
+ - (🔥 New) \[2025/1/12\] DC-AE tiling lets Sana-4K generate 4096x4096px images within 22GB of GPU memory. With model offloading and 8bit/4bit quantization, 4K Sana runs within **8GB** of GPU VRAM. [\[Guidance\]](asset/docs/model_zoo.md#-3-4k-models)
+ - (🔥 New) \[2025/1/11\] The Sana code-base license changed to Apache 2.0.
+ - (🔥 New) \[2025/1/10\] Run Sana inference with 8bit quantization. [\[Guidance\]](asset/docs/8bit_sana.md#quantization)
+ - (🔥 New) \[2025/1/8\] 4K resolution [Sana models](asset/docs/model_zoo.md) are supported in [Sana-ComfyUI](https://github.com/Efficient-Large-Model/ComfyUI_ExtraModels) and a [workflow](asset/docs/ComfyUI/Sana_FlowEuler_4K.json) is also prepared. [\[4K guidance\]](asset/docs/ComfyUI/comfyui.md)
+ - (🔥 New) \[2025/1/8\] 1.6B 4K resolution [Sana models](asset/docs/model_zoo.md) are released: [\[BF16 pth\]](https://huggingface.co/Efficient-Large-Model/Sana_1600M_4Kpx_BF16) or [\[BF16 diffusers\]](https://huggingface.co/Efficient-Large-Model/Sana_1600M_4Kpx_BF16_diffusers). 🚀 Get your 4096x4096 resolution images within 20 seconds! Find more samples on the [Sana page](https://nvlabs.github.io/Sana/). Thanks to [SUPIR](https://github.com/Fanghua-Yu/SUPIR) for their wonderful work and support.
+ - (🔥 New) \[2025/1/2\] A bug in the `diffusers` pipeline is fixed. [Solved PR](https://github.com/huggingface/diffusers/pull/10431)
+ - (🔥 New) \[2025/1/2\] 2K resolution [Sana models](asset/docs/model_zoo.md) are supported in [Sana-ComfyUI](https://github.com/Efficient-Large-Model/ComfyUI_ExtraModels) and a [workflow](asset/docs/ComfyUI/Sana_FlowEuler_2K.json) is also prepared.
+ - ✅ \[2024/12\] 1.6B 2K resolution [Sana models](asset/docs/model_zoo.md) are released: [\[BF16 pth\]](https://huggingface.co/Efficient-Large-Model/Sana_1600M_2Kpx_BF16) or [\[BF16 diffusers\]](https://huggingface.co/Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers). 🚀 Get your 2K resolution images within 4 seconds! Find more samples on the [Sana page](https://nvlabs.github.io/Sana/). Thanks to [SUPIR](https://github.com/Fanghua-Yu/SUPIR) for their wonderful work and support.
+ - ✅ \[2024/12\] `diffusers` supports Sana-LoRA fine-tuning! Sana-LoRA's training and convergence speed is super fast. [\[Guidance\]](asset/docs/sana_lora_dreambooth.md) or [\[diffusers docs\]](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sana.md).
+ - ✅ \[2024/12\] `diffusers` has Sana! [All Sana models in diffusers safetensors](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) are released, and the diffusers pipelines `SanaPipeline`, `SanaPAGPipeline`, and `DPMSolverMultistepScheduler` (with FlowMatching) are all supported now. We have prepared a [Model Card](asset/docs/model_zoo.md) to help you choose.
+ - ✅ \[2024/12\] The 1.6B BF16 [Sana model](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16) is released for stable fine-tuning.
+ - ✅ \[2024/12\] We release the [ComfyUI node](https://github.com/Efficient-Large-Model/ComfyUI_ExtraModels) for Sana. [\[Guidance\]](asset/docs/ComfyUI/comfyui.md)
+ - ✅ \[2024/11\] All multi-linguistic (Emoji & Chinese & English) SFT models are released: [1.6B-512px](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing), [1.6B-1024px](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing), [600M-512px](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px), [600M-1024px](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px). The metric performance is shown [here](#performance).
+ - ✅ \[2024/11\] The Sana Replicate API is available at [Sana-API](https://replicate.com/chenxwh/sana).
+ - ✅ \[2024/11\] 1.6B [Sana models](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) are released.
+ - ✅ \[2024/11\] Training & Inference & Metrics code are released.
+ - ✅ \[2024/11\] Working on [`diffusers`](https://github.com/huggingface/diffusers/pull/9982).
+ - \[2024/10\] [Demo](https://nv-sana.mit.edu/) is released.
+ - \[2024/10\] [DC-AE Code](https://github.com/mit-han-lab/efficientvit/blob/master/applications/dc_ae/README.md) and [weights](https://huggingface.co/collections/mit-han-lab/dc-ae-670085b9400ad7197bb1009b) are released!
+ - \[2024/10\] [Paper](https://arxiv.org/abs/2410.10629) is on Arxiv!
+
+ ## Performance
+
+ | Methods (1024x1024) | Throughput (samples/s) | Latency (s) | Params (B) | Speedup | FID 👇 | CLIP 👆 | GenEval 👆 | DPG 👆 |
+ |---|---|---|---|---|---|---|---|---|
+ | FLUX-dev | 0.04 | 23.0 | 12.0 | 1.0× | 10.15 | 27.47 | _0.67_ | 84.0 |
+ | **Sana-0.6B** | 1.7 | 0.9 | 0.6 | 39.5× | _5.81_ | 28.36 | 0.64 | 83.6 |
+ | **[Sana-0.6B-MultiLing](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px)** | 1.7 | 0.9 | 0.6 | 39.5× | **5.61** | <u>28.80</u> | <u>0.68</u> | _84.2_ |
+ | **Sana-1.6B** | 1.0 | 1.2 | 1.6 | 23.3× | <u>5.76</u> | _28.67_ | 0.66 | **84.8** |
+ | **[Sana-1.6B-MultiLing](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing)** | 1.0 | 1.2 | 1.6 | 23.3× | 5.92 | **28.94** | **0.69** | <u>84.5</u> |
+
+ <details>
+ <summary><h3>Click to show all</h3></summary>
+
+ | Methods | Throughput (samples/s) | Latency (s) | Params (B) | Speedup | FID 👇 | CLIP 👆 | GenEval 👆 | DPG 👆 |
+ |---|---|---|---|---|---|---|---|---|
+ | _**512 × 512 resolution**_ | | | | | | | | |
+ | PixArt-α | 1.5 | 1.2 | 0.6 | 1.0× | 6.14 | 27.55 | 0.48 | 71.6 |
+ | PixArt-Σ | 1.5 | 1.2 | 0.6 | 1.0× | _6.34_ | _27.62_ | <u>0.52</u> | _79.5_ |
+ | **Sana-0.6B** | 6.7 | 0.8 | 0.6 | 5.0× | <u>5.67</u> | <u>27.92</u> | _0.64_ | <u>84.3</u> |
+ | **Sana-1.6B** | 3.8 | 0.6 | 1.6 | 2.5× | **5.16** | **28.19** | **0.66** | **85.5** |
+ | _**1024 × 1024 resolution**_ | | | | | | | | |
+ | LUMINA-Next | 0.12 | 9.1 | 2.0 | 2.8× | 7.58 | 26.84 | 0.46 | 74.6 |
+ | SDXL | 0.15 | 6.5 | 2.6 | 3.5× | 6.63 | _29.03_ | 0.55 | 74.7 |
+ | PlayGroundv2.5 | 0.21 | 5.3 | 2.6 | 4.9× | _6.09_ | **29.13** | 0.56 | 75.5 |
+ | Hunyuan-DiT | 0.05 | 18.2 | 1.5 | 1.2× | 6.54 | 28.19 | 0.63 | 78.9 |
+ | PixArt-Σ | 0.4 | 2.7 | 0.6 | 9.3× | 6.15 | 28.26 | 0.54 | 80.5 |
+ | DALLE3 | - | - | - | - | - | - | _0.67_ | 83.5 |
+ | SD3-medium | 0.28 | 4.4 | 2.0 | 6.5× | 11.92 | 27.83 | 0.62 | <u>84.1</u> |
+ | FLUX-dev | 0.04 | 23.0 | 12.0 | 1.0× | 10.15 | 27.47 | _0.67_ | _84.0_ |
+ | FLUX-schnell | 0.5 | 2.1 | 12.0 | 11.6× | 7.94 | 28.14 | **0.71** | **84.8** |
+ | **Sana-0.6B** | 1.7 | 0.9 | 0.6 | **39.5×** | <u>5.81</u> | 28.36 | 0.64 | 83.6 |
+ | **Sana-1.6B** | 1.0 | 1.2 | 1.6 | **23.3×** | **5.76** | <u>28.67</u> | <u>0.66</u> | **84.8** |
+
+ </details>
+
+ ## Contents
+
+ - [Env](#-1-dependencies-and-installation)
+ - [Demo](#-2-how-to-play-with-sana-inference)
+ - [Model Zoo](asset/docs/model_zoo.md)
+ - [Training](#-3-how-to-train-sana)
+ - [Testing](#-4-metric-toolkit)
+ - [TODO](#to-do-list)
+ - [Citation](#bibtex)
+
+ # 🔧 1. Dependencies and Installation
+
+ - Python >= 3.10.0 (We recommend using [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
+ - [PyTorch >= 2.0.1+cu12.1](https://pytorch.org/)
+
+ ```bash
+ git clone https://github.com/NVlabs/Sana.git
+ cd Sana
+
+ ./environment_setup.sh sana
+ # or you can install each component step by step following environment_setup.sh
+ ```
+
+ # 💻 2. How to Play with Sana (Inference)
+
+ ## 💰Hardware requirement
+
+ - 9GB VRAM is required for the 0.6B model and 12GB VRAM for the 1.6B model. Our quantized version will require less than 8GB for inference.
+ - All tests are done on A100 GPUs; results may differ on other GPU models.
+
+ ## 🔛 Choose your model: [Model card](asset/docs/model_zoo.md)
+
+ ## 🔛 Quick start with [Gradio](https://www.gradio.app/guides/quickstart)
+
+ ```bash
+ # official online demo
+ DEMO_PORT=15432 \
+ python app/app_sana.py \
+     --share \
+     --config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
+     --model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
+     --image_size=1024
+ ```
+
+ ### 1. How to use `SanaPipeline` with `🧨diffusers`
+
+ > \[!IMPORTANT\]
+ > Upgrade to `diffusers>=0.32.0.dev` to make `SanaPipeline` and `SanaPAGPipeline` available!
+ >
+ > ```bash
+ > pip install git+https://github.com/huggingface/diffusers
+ > ```
+ >
+ > Make sure to set `pipe.transformer`'s `torch_dtype` and `variant` according to the [Model Card](asset/docs/model_zoo.md).
+ >
+ > Set `pipe.text_encoder` to BF16 and `pipe.vae` to FP32 or BF16. For more info, see the [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana#sanapipeline).
+
+ ```python
+ # run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
+ import torch
+ from diffusers import SanaPipeline
+
+ pipe = SanaPipeline.from_pretrained(
+     "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
+     variant="bf16",
+     torch_dtype=torch.bfloat16,
+ )
+ pipe.to("cuda")
+
+ pipe.vae.to(torch.bfloat16)
+ pipe.text_encoder.to(torch.bfloat16)
+
+ prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
+ image = pipe(
+     prompt=prompt,
+     height=1024,
+     width=1024,
+     guidance_scale=4.5,
+     num_inference_steps=20,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ )[0]
+
+ image[0].save("sana.png")
+ ```
+
+ ### 2. How to use `SanaPAGPipeline` with `🧨diffusers`
+
+ ```python
+ # run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
+ import torch
+ from diffusers import SanaPAGPipeline
+
+ pipe = SanaPAGPipeline.from_pretrained(
+     "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
+     variant="fp16",
+     torch_dtype=torch.float16,
+     pag_applied_layers="transformer_blocks.8",
+ )
+ pipe.to("cuda")
+
+ pipe.text_encoder.to(torch.bfloat16)
+ pipe.vae.to(torch.bfloat16)
+
+ prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
+ image = pipe(
+     prompt=prompt,
+     guidance_scale=5.0,
+     pag_scale=2.0,
+     num_inference_steps=20,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ )[0]
+ image[0].save('sana.png')
+ ```
+
+ <details>
+ <summary><h3>3. How to use Sana in this repo</h3></summary>
+
+ ```python
+ import torch
+ from app.sana_pipeline import SanaPipeline
+ from torchvision.utils import save_image
+
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+ generator = torch.Generator(device=device).manual_seed(42)
+
+ sana = SanaPipeline("configs/sana_config/1024ms/Sana_1600M_img1024.yaml")
+ sana.from_pretrained("hf://Efficient-Large-Model/Sana_1600M_1024px_BF16/checkpoints/Sana_1600M_1024px_BF16.pth")
+ prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
+
+ image = sana(
+     prompt=prompt,
+     height=1024,
+     width=1024,
+     guidance_scale=5.0,
+     pag_guidance_scale=2.0,
+     num_inference_steps=18,
+     generator=generator,
+ )
+ save_image(image, 'output/sana.png', nrow=1, normalize=True, value_range=(-1, 1))
+ ```
+
+ </details>
+
+ <details>
+ <summary><h3>4. Run Sana (Inference) with Docker</h3></summary>
+
+ ```bash
+ # Pull related models
+ huggingface-cli download google/gemma-2b-it
+ huggingface-cli download google/shieldgemma-2b
+ huggingface-cli download mit-han-lab/dc-ae-f32c32-sana-1.0
+ huggingface-cli download Efficient-Large-Model/Sana_1600M_1024px
+
+ # Run with docker
+ docker build . -t sana
+ docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
+     -v ~/.cache:/root/.cache \
+     sana
+ ```
+
+ </details>
+
+ ## 🔛 Run inference with TXT or JSON files
+
+ ```bash
+ # Run samples in a txt file
+ python scripts/inference.py \
+     --config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
+     --model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
+     --txt_file=asset/samples/samples_mini.txt
+
+ # Run samples in a json file
+ python scripts/inference.py \
+     --config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
+     --model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
+     --json_file=asset/samples/samples_mini.json
+ ```
+
+ Each line of [`asset/samples/samples_mini.txt`](asset/samples/samples_mini.txt) contains a prompt to generate an image.
+
+ # 🔥 3. How to Train Sana
+
+ ## 💰Hardware requirement
+
+ - 32GB VRAM is required for training both the 0.6B and 1.6B models
+
+ ### 1). Train with image-text pairs in a directory
+
+ We provide a training example here, and you can also select your desired config file from the [config files dir](configs/sana_config) based on your data structure.
+
+ To launch Sana training, you will first need to prepare data in the following format. [Here](asset/example_data) is an example of the data structure for reference.
+
+ ```bash
+ asset/example_data
+ ├── AAA.txt
+ ├── AAA.png
+ ├── BCC.txt
+ ├── BCC.png
+ ├── ......
+ ├── CCC.txt
+ └── CCC.png
+ ```
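*Editor's note:* a minimal sketch of creating one such image-text pair, assuming each `.txt` file holds a plain-text caption for the same-named image; the file names and caption below are placeholders.

```python
from pathlib import Path

from PIL import Image

pair_dir = Path("asset/example_data")
pair_dir.mkdir(parents=True, exist_ok=True)
# A gray placeholder image and its caption, mirroring the AAA.png / AAA.txt layout above.
Image.new("RGB", (512, 512), color=(128, 128, 128)).save(pair_dir / "AAA.png")
(pair_dir / "AAA.txt").write_text("a plain gray square used as a data-format example\n")
```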
+
+ Then Sana's training can be launched via
+
+ ```bash
+ # Example of training Sana 0.6B with 512x512 resolution from scratch
+ bash train_scripts/train.sh \
+     configs/sana_config/512ms/Sana_600M_img512.yaml \
+     --data.data_dir="[asset/example_data]" \
+     --data.type=SanaImgDataset \
+     --model.multi_scale=false \
+     --train.train_batch_size=32
+
+ # Example of fine-tuning Sana 1.6B with 1024x1024 resolution
+ bash train_scripts/train.sh \
+     configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
+     --data.data_dir="[asset/example_data]" \
+     --data.type=SanaImgDataset \
+     --model.load_from=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
+     --model.multi_scale=false \
+     --train.train_batch_size=8
+ ```
+
+ ### 2). Train with image-text pairs in webdataset format
+
+ We also provide conversion scripts to convert your data to the required format. You can refer to the [data conversion scripts](asset/data_conversion_scripts) for more details.
+
+ ```bash
+ python tools/convert_ImgDataset_to_WebDatasetMS_format.py
+ ```
+
+ Then Sana's training can be launched via
+
+ ```bash
+ # Example of training Sana 0.6B with 512x512 resolution from scratch
+ bash train_scripts/train.sh \
+     configs/sana_config/512ms/Sana_600M_img512.yaml \
+     --data.data_dir="[asset/example_data_tar]" \
+     --data.type=SanaWebDatasetMS \
+     --model.multi_scale=true \
+     --train.train_batch_size=32
+ ```
+
+ # 💻 4. Metric toolkit
+
+ Refer to the [Toolkit Manual](asset/docs/metrics_toolkit.md).
+
+ # 💪To-Do List
+
+ We will try our best to release:
+
+ - \[✅\] Training code
+ - \[✅\] Inference code
+ - \[✅\] Model zoo
+ - \[✅\] ComfyUI
+ - \[✅\] DC-AE Diffusers
+ - \[✅\] Sana merged into Diffusers (https://github.com/huggingface/diffusers/pull/9982)
+ - \[✅\] LoRA training by [@paul](https://github.com/sayakpaul) (`diffusers`: https://github.com/huggingface/diffusers/pull/10234)
+ - \[✅\] 2K/4K resolution models (thanks to [@SUPIR](https://github.com/Fanghua-Yu/SUPIR) for providing a 4K super-resolution model)
+ - \[✅\] 8bit / 4bit Laptop development
+ - \[💻\] ControlNet (train & inference & models)
+ - \[💻\] Larger model size
+ - \[💻\] Better-reconstruction F32/F64 VAEs
+ - \[💻\] **Sana1.5 (Focus on: Human body / Human face / Text rendering / Realism / Efficiency)**
+
+ # 🤗Acknowledgements
+
+ **Thanks to the following open-source codebases for their wonderful work!**
+
+ - [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha)
+ - [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma)
+ - [Efficient-ViT](https://github.com/mit-han-lab/efficientvit)
+ - [ComfyUI_ExtraModels](https://github.com/city96/ComfyUI_ExtraModels)
+ - [SVDQuant and Nunchaku](https://github.com/mit-han-lab/nunchaku)
+ - [diffusers](https://github.com/huggingface/diffusers)
+
+ ## 🌟 Star History
+
+ [![Star History Chart](https://api.star-history.com/svg?repos=NVlabs/Sana&type=Date)](https://star-history.com/#NVlabs/sana&Date)
+
+ # 📖BibTeX
+
+ ```
+ @misc{xie2024sana,
+       title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
+       author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
+       year={2024},
+       eprint={2410.10629},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2410.10629},
+ }
+ ```
app.py CHANGED
@@ -1,73 +1,369 @@
- import gradio as gr
- import numpy as np
  import random
 
- # import spaces #[uncomment to use ZeroGPU]
- from diffusers import DiffusionPipeline
  import torch
 
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model_repo_id = "stabilityai/sdxl-turbo"  # Replace to the model you would like to use
 
  if torch.cuda.is_available():
-     torch_dtype = torch.float16
- else:
-     torch_dtype = torch.float32
 
- pipe = DiffusionPipeline.from_pretrained(model_repo_id, torch_dtype=torch_dtype)
- pipe = pipe.to(device)
 
- MAX_SEED = np.iinfo(np.int32).max
- MAX_IMAGE_SIZE = 1024
-
-
- # @spaces.GPU #[uncomment to use ZeroGPU]
- def infer(
-     prompt,
-     negative_prompt,
-     seed,
-     randomize_seed,
-     width,
-     height,
-     guidance_scale,
-     num_inference_steps,
-     progress=gr.Progress(track_tqdm=True),
- ):
      if randomize_seed:
          seed = random.randint(0, MAX_SEED)
 
-     generator = torch.Generator().manual_seed(seed)
 
-     image = pipe(
          prompt=prompt,
          negative_prompt=negative_prompt,
          guidance_scale=guidance_scale,
          num_inference_steps=num_inference_steps,
-         width=width,
-         height=height,
          generator=generator,
-     ).images[0]
 
-     return image, seed
 
  examples = [
-     "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
-     "An astronaut riding a green horse",
-     "A delicious ceviche cheesecake slice",
  ]
 
  css = """
- #col-container {
-     margin: 0 auto;
-     max-width: 640px;
- }
  """
-
- with gr.Blocks(css=css) as demo:
-     with gr.Column(elem_id="col-container"):
-         gr.Markdown(" # Text-to-Image Gradio Template")
-
          with gr.Row():
              prompt = gr.Text(
                  label="Prompt",
@@ -76,19 +372,66 @@ with gr.Blocks(css=css) as demo:
                  placeholder="Enter your prompt",
                  container=False,
              )
-
-         run_button = gr.Button("Run", scale=0, variant="primary")
-
-         result = gr.Image(label="Result", show_label=False)
-
-         with gr.Accordion("Advanced Settings", open=False):
              negative_prompt = gr.Text(
                  label="Negative prompt",
                  max_lines=1,
                  placeholder="Enter a negative prompt",
-                 visible=False,
              )
-
              seed = gr.Slider(
                  label="Seed",
                  minimum=0,
@@ -96,59 +439,64 @@ with gr.Blocks(css=css) as demo:
                  step=1,
                  value=0,
              )
-
              randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
-
-         with gr.Row():
-             width = gr.Slider(
-                 label="Width",
-                 minimum=256,
-                 maximum=MAX_IMAGE_SIZE,
-                 step=32,
-                 value=1024,  # Replace with defaults that work for your model
-             )
-
-             height = gr.Slider(
-                 label="Height",
-                 minimum=256,
-                 maximum=MAX_IMAGE_SIZE,
-                 step=32,
-                 value=1024,  # Replace with defaults that work for your model
-             )
-
-         with gr.Row():
-             guidance_scale = gr.Slider(
-                 label="Guidance scale",
-                 minimum=0.0,
-                 maximum=10.0,
-                 step=0.1,
-                 value=0.0,  # Replace with defaults that work for your model
              )
-
-             num_inference_steps = gr.Slider(
-                 label="Number of inference steps",
                  minimum=1,
-                 maximum=50,
                  step=1,
-                 value=2,  # Replace with defaults that work for your model
              )
 
-     gr.Examples(examples=examples, inputs=[prompt])
      gr.on(
-         triggers=[run_button.click, prompt.submit],
-         fn=infer,
          inputs=[
              prompt,
              negative_prompt,
              seed,
-             randomize_seed,
-             width,
              height,
-             guidance_scale,
-             num_inference_steps,
          ],
-         outputs=[result, seed],
      )
 
  if __name__ == "__main__":
-     demo.launch()
+ #!/usr/bin/env python
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ #
+ # SPDX-License-Identifier: Apache-2.0
+ from __future__ import annotations
+
+ import argparse
+ import os
  import random
+ import socket
+ import sqlite3
+ import time
+ import uuid
+ from datetime import datetime
+
+ import gradio as gr
+ import numpy as np
+ import spaces
  import torch
+ from PIL import Image
+ from torchvision.utils import make_grid, save_image
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ from app import safety_check
+ from app.sana_pipeline import SanaPipeline
+
+ MAX_SEED = np.iinfo(np.int32).max
+ CACHE_EXAMPLES = torch.cuda.is_available() and os.getenv("CACHE_EXAMPLES", "1") == "1"
+ MAX_IMAGE_SIZE = int(os.getenv("MAX_IMAGE_SIZE", "4096"))
+ USE_TORCH_COMPILE = os.getenv("USE_TORCH_COMPILE", "0") == "1"
+ ENABLE_CPU_OFFLOAD = os.getenv("ENABLE_CPU_OFFLOAD", "0") == "1"
+ DEMO_PORT = int(os.getenv("DEMO_PORT", "15432"))
+ os.environ["GRADIO_EXAMPLES_CACHE"] = "./.gradio/cache"
+ COUNTER_DB = os.getenv("COUNTER_DB", ".count.db")
+
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+
+ style_list = [
+     {
+         "name": "(No style)",
+         "prompt": "{prompt}",
+         "negative_prompt": "",
+     },
+     {
+         "name": "Cinematic",
+         "prompt": "cinematic still {prompt} . emotional, harmonious, vignette, highly detailed, high budget, bokeh, "
+         "cinemascope, moody, epic, gorgeous, film grain, grainy",
+         "negative_prompt": "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured",
+     },
+     {
+         "name": "Photographic",
+         "prompt": "cinematic photo {prompt} . 35mm photograph, film, bokeh, professional, 4k, highly detailed",
+         "negative_prompt": "drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, ugly",
+     },
+     {
+         "name": "Anime",
+         "prompt": "anime artwork {prompt} . anime style, key visual, vibrant, studio anime, highly detailed",
+         "negative_prompt": "photo, deformed, black and white, realism, disfigured, low contrast",
+     },
+     {
+         "name": "Manga",
+         "prompt": "manga style {prompt} . vibrant, high-energy, detailed, iconic, Japanese comic style",
+         "negative_prompt": "ugly, deformed, noisy, blurry, low contrast, realism, photorealistic, Western comic style",
+     },
+     {
+         "name": "Digital Art",
+         "prompt": "concept art {prompt} . digital artwork, illustrative, painterly, matte painting, highly detailed",
+         "negative_prompt": "photo, photorealistic, realism, ugly",
+     },
+     {
+         "name": "Pixel art",
+         "prompt": "pixel-art {prompt} . low-res, blocky, pixel art style, 8-bit graphics",
+         "negative_prompt": "sloppy, messy, blurry, noisy, highly detailed, ultra textured, photo, realistic",
+     },
+     {
+         "name": "Fantasy art",
+         "prompt": "ethereal fantasy concept art of {prompt} . magnificent, celestial, ethereal, painterly, epic, "
+         "majestic, magical, fantasy art, cover art, dreamy",
+         "negative_prompt": "photographic, realistic, realism, 35mm film, dslr, cropped, frame, text, deformed, "
+         "glitch, noise, noisy, off-center, deformed, cross-eyed, closed eyes, bad anatomy, ugly, "
+         "disfigured, sloppy, duplicate, mutated, black and white",
+     },
+     {
+         "name": "Neonpunk",
+         "prompt": "neonpunk style {prompt} . cyberpunk, vaporwave, neon, vibes, vibrant, stunningly beautiful, crisp, "
+         "detailed, sleek, ultramodern, magenta highlights, dark purple shadows, high contrast, cinematic, "
+         "ultra detailed, intricate, professional",
+         "negative_prompt": "painting, drawing, illustration, glitch, deformed, mutated, cross-eyed, ugly, disfigured",
+     },
+     {
+         "name": "3D Model",
+         "prompt": "professional 3d model {prompt} . octane render, highly detailed, volumetric, dramatic lighting",
+         "negative_prompt": "ugly, deformed, noisy, low poly, blurry, painting",
+     },
+ ]
+
+ styles = {k["name"]: (k["prompt"], k["negative_prompt"]) for k in style_list}
+ STYLE_NAMES = list(styles.keys())
+ DEFAULT_STYLE_NAME = "(No style)"
+ SCHEDULE_NAME = ["Flow_DPM_Solver"]
+ DEFAULT_SCHEDULE_NAME = "Flow_DPM_Solver"
+ NUM_IMAGES_PER_PROMPT = 1
+ INFER_SPEED = 0
+
+
+ def norm_ip(img, low, high):
+     img.clamp_(min=low, max=high)
+     img.sub_(low).div_(max(high - low, 1e-5))
+     return img
+
+
+ def open_db():
+     db = sqlite3.connect(COUNTER_DB)
+     db.execute("CREATE TABLE IF NOT EXISTS counter(app CHARS PRIMARY KEY UNIQUE, value INTEGER)")
127
+ db.execute('INSERT OR IGNORE INTO counter(app, value) VALUES("Sana", 0)')
128
+ return db
129
 
130
+
131
+ def read_inference_count():
132
+ with open_db() as db:
133
+ cur = db.execute('SELECT value FROM counter WHERE app="Sana"')
134
+ db.commit()
135
+ return cur.fetchone()[0]
136
+
137
+
138
+ def write_inference_count(count):
139
+ count = max(0, int(count))
140
+ with open_db() as db:
141
+ db.execute(f'UPDATE counter SET value=value+{count} WHERE app="Sana"')
142
+ db.commit()
143
+
144
+
145
+ def run_inference(num_imgs=1):
146
+ write_inference_count(num_imgs)
147
+ count = read_inference_count()
148
+
149
+ return (
150
+ f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: "
151
+ f"16px; color:red; font-weight: bold;'>{count}</span>"
152
+ )
153
+
154
+
155
+ def update_inference_count():
156
+ count = read_inference_count()
157
+ return (
158
+ f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: "
159
+ f"16px; color:red; font-weight: bold;'>{count}</span>"
160
+ )
161
+
162
+
163
+ def apply_style(style_name: str, positive: str, negative: str = "") -> tuple[str, str]:
164
+ p, n = styles.get(style_name, styles[DEFAULT_STYLE_NAME])
165
+ if not negative:
166
+ negative = ""
167
+ return p.replace("{prompt}", positive), n + negative
168
+
169
+
170
+ def get_args():
171
+ parser = argparse.ArgumentParser()
172
+ parser.add_argument("--config", type=str, help="config")
173
+ parser.add_argument(
174
+ "--model_path",
175
+ nargs="?",
176
+ default="hf://Swarmeta-AI/Twig-v0-alpha/Twig-v0-alpha-1.6B-2048x-fp16.pth",
177
+ type=str,
178
+ help="Path to the model file (positional)",
179
+ )
180
+ parser.add_argument("--output", default="./", type=str)
181
+ parser.add_argument("--bs", default=1, type=int)
182
+ parser.add_argument("--image_size", default=1024, type=int)
183
+ parser.add_argument("--cfg_scale", default=5.0, type=float)
184
+ parser.add_argument("--pag_scale", default=2.0, type=float)
185
+ parser.add_argument("--seed", default=42, type=int)
186
+ parser.add_argument("--step", default=-1, type=int)
187
+ parser.add_argument("--custom_image_size", default=None, type=int)
188
+ parser.add_argument("--share", action="store_true")
189
+ parser.add_argument(
190
+ "--shield_model_path",
191
+ type=str,
192
+ help="The path to shield model, we employ ShieldGemma-2B by default.",
193
+ default="google/shieldgemma-2b",
194
+ )
195
+
196
+ return parser.parse_known_args()[0]
197
+
198
+
199
+ args = get_args()
200
 
201
  if torch.cuda.is_available():
202
+ model_path = args.model_path
203
+ pipe = SanaPipeline(args.config)
204
+ pipe.from_pretrained(model_path)
205
+ pipe.register_progress_bar(gr.Progress())
206
 
207
+ # safety checker
208
+ safety_checker_tokenizer = AutoTokenizer.from_pretrained(args.shield_model_path)
209
+ safety_checker_model = AutoModelForCausalLM.from_pretrained(
210
+ args.shield_model_path,
211
+ device_map="auto",
212
+ torch_dtype=torch.bfloat16,
213
+ ).to(device)
214
 
215
+
216
+ def save_image_sana(img, seed="", save_img=False):
217
+ unique_name = f"{str(uuid.uuid4())}_{seed}.png"
218
+ save_path = os.path.join(f"output/online_demo_img/{datetime.now().date()}")
219
+ os.umask(0o000) # file permission: 666; dir permission: 777
220
+ os.makedirs(save_path, exist_ok=True)
221
+ unique_name = os.path.join(save_path, unique_name)
222
+ if save_img:
223
+ save_image(img, unique_name, nrow=1, normalize=True, value_range=(-1, 1))
224
+
225
+ return unique_name
226
+
227
+
228
+ def randomize_seed_fn(seed: int, randomize_seed: bool) -> int:
229
  if randomize_seed:
230
  seed = random.randint(0, MAX_SEED)
231
+ return seed
232
+
233
+
234
+ @torch.no_grad()
235
+ @torch.inference_mode()
236
+ @spaces.GPU(enable_queue=True)
237
+ def generate(
238
+ prompt: str = None,
239
+ negative_prompt: str = "",
240
+ style: str = DEFAULT_STYLE_NAME,
241
+ use_negative_prompt: bool = False,
242
+ num_imgs: int = 1,
243
+ seed: int = 0,
244
+ height: int = 1024,
245
+ width: int = 1024,
246
+ flow_dpms_guidance_scale: float = 5.0,
247
+ flow_dpms_pag_guidance_scale: float = 2.0,
248
+ flow_dpms_inference_steps: int = 20,
249
+ randomize_seed: bool = False,
250
+ ):
251
+ global INFER_SPEED
252
+ # seed = 823753551
253
+ box = run_inference(num_imgs)
254
+ seed = int(randomize_seed_fn(seed, randomize_seed))
255
+ generator = torch.Generator(device=device).manual_seed(seed)
256
+ print(f"PORT: {DEMO_PORT}, model_path: {model_path}")
257
+ if safety_check.is_dangerous(safety_checker_tokenizer, safety_checker_model, prompt, threshold=0.2):
258
+ prompt = "A red heart."
259
+
260
+ print(prompt)
261
+
262
+ num_inference_steps = flow_dpms_inference_steps
263
+ guidance_scale = flow_dpms_guidance_scale
264
+ pag_guidance_scale = flow_dpms_pag_guidance_scale
265
 
266
+ if not use_negative_prompt:
267
+ negative_prompt = None # type: ignore
268
+ prompt, negative_prompt = apply_style(style, prompt, negative_prompt)
269
 
270
+ pipe.progress_fn(0, desc="Sana Start")
271
+
272
+ time_start = time.time()
273
+ images = pipe(
274
  prompt=prompt,
275
+ height=height,
276
+ width=width,
277
  negative_prompt=negative_prompt,
278
  guidance_scale=guidance_scale,
279
+ pag_guidance_scale=pag_guidance_scale,
280
  num_inference_steps=num_inference_steps,
281
+ num_images_per_prompt=num_imgs,
282
  generator=generator,
283
+ )
284
 
285
+ pipe.progress_fn(1.0, desc="Sana End")
286
+ INFER_SPEED = (time.time() - time_start) / num_imgs
287
+
288
+ save_img = False
289
+ if save_img:
290
+ img = [save_image_sana(img, seed, save_img=save_image) for img in images]
291
+ print(img)
292
+ else:
293
+ img = [
294
+ Image.fromarray(
295
+ norm_ip(img, -1, 1)
296
+ .mul(255)
297
+ .add_(0.5)
298
+ .clamp_(0, 255)
299
+ .permute(1, 2, 0)
300
+ .to("cpu", torch.uint8)
301
+ .numpy()
302
+ .astype(np.uint8)
303
+ )
304
+ for img in images
305
+ ]
306
 
307
+ torch.cuda.empty_cache()
308
+
309
+ return (
310
+ img,
311
+ seed,
312
+ f"<span style='font-size: 16px; font-weight: bold;'>Inference Speed: {INFER_SPEED:.3f} s/Img</span>",
313
+ box,
314
+ )
315
+
316
+
317
+ model_size = "1.6" if "1600M" in args.model_path else "0.6"
318
+ title = f"""
319
+ <div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
320
+ <img src="https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/main/asset/logo.png" width="50%" alt="logo"/>
321
+ </div>
322
+ """
323
+ DESCRIPTION = f"""
324
+ <p><span style="font-size: 36px; font-weight: bold;">Sana-{model_size}B</span><span style="font-size: 20px; font-weight: bold;">{args.image_size}px</span></p>
325
+ <p style="font-size: 16px; font-weight: bold;">Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer</p>
326
+ <p><span style="font-size: 16px;"><a href="https://arxiv.org/abs/2410.10629">[Paper]</a></span> <span style="font-size: 16px;"><a href="https://github.com/NVlabs/Sana">[Github]</a></span> <span style="font-size: 16px;"><a href="https://nvlabs.github.io/Sana">[Project]</a></span></p>
327
+ <p style="font-size: 16px; font-weight: bold;">Powered by <a href="https://hanlab.mit.edu/projects/dc-ae">DC-AE</a> with 32x latent space, </p>running on node {socket.gethostname()}.
328
+ <p style="font-size: 16px; font-weight: bold;">Unsafe word will give you a 'Red Heart' in the image instead.</p>
329
+ """
330
+ if model_size == "0.6":
331
+ DESCRIPTION += "\n<p>0.6B model's text rendering ability is limited.</p>"
332
+ if not torch.cuda.is_available():
333
+ DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"
334
 
335
  examples = [
336
+ 'a cyberpunk cat with a neon sign that says "Sana"',
337
+ "A very detailed and realistic full body photo set of a tall, slim, and athletic Shiba Inu in a white oversized straight t-shirt, white shorts, and short white shoes.",
338
+ "Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, volumetric lighting, spectacular, ambient lights, light pollution, cinematic atmosphere, art nouveau style, illustration art artwork by SenseiJaye, intricate detail.",
339
+ "portrait photo of a girl, photograph, highly detailed face, depth of field",
340
+ 'make me a logo that says "So Fast" with a really cool flying dragon shape with lightning sparks all over the sides and all of it contains Indonesian language',
341
+ "🐶 Wearing 🕶 flying on the 🌈",
342
+ "👧 with 🌹 in the ❄️",
343
+ "an old rusted robot wearing pants and a jacket riding skis in a supermarket.",
344
+ "professional portrait photo of an anthropomorphic cat wearing fancy gentleman hat and jacket walking in autumn forest.",
345
+ "Astronaut in a jungle, cold color palette, muted colors, detailed",
346
+ "a stunning and luxurious bedroom carved into a rocky mountainside seamlessly blending nature with modern design with a plush earth-toned bed textured stone walls circular fireplace massive uniquely shaped window framing snow-capped mountains dense forests",
347
  ]
348
 
349
  css = """
350
+ .gradio-container{max-width: 640px !important}
351
+ h1{text-align:center}
352
  """
353
+ with gr.Blocks(css=css, title="Sana") as demo:
354
+ gr.Markdown(title)
355
+ gr.HTML(DESCRIPTION)
356
+ gr.DuplicateButton(
357
+ value="Duplicate Space for private use",
358
+ elem_id="duplicate-button",
359
+ visible=os.getenv("SHOW_DUPLICATE_BUTTON") == "1",
360
+ )
361
+ info_box = gr.Markdown(
362
+ value=f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: 16px; color:red; font-weight: bold;'>{read_inference_count()}</span>"
363
+ )
364
+ demo.load(fn=update_inference_count, outputs=info_box) # update the value when re-loading the page
365
+ # with gr.Row(equal_height=False):
366
+ with gr.Group():
367
  with gr.Row():
368
  prompt = gr.Text(
369
  label="Prompt",
372
  placeholder="Enter your prompt",
373
  container=False,
374
  )
375
+ run_button = gr.Button("Run", scale=0)
376
+ result = gr.Gallery(label="Result", show_label=False, columns=NUM_IMAGES_PER_PROMPT, format="png")
377
+ speed_box = gr.Markdown(
378
+ value=f"<span style='font-size: 16px; font-weight: bold;'>Inference speed: {INFER_SPEED} s/Img</span>"
379
+ )
380
+ with gr.Accordion("Advanced options", open=False):
381
+ with gr.Group():
382
+ with gr.Row(visible=True):
383
+ height = gr.Slider(
384
+ label="Height",
385
+ minimum=256,
386
+ maximum=MAX_IMAGE_SIZE,
387
+ step=32,
388
+ value=args.image_size,
389
+ )
390
+ width = gr.Slider(
391
+ label="Width",
392
+ minimum=256,
393
+ maximum=MAX_IMAGE_SIZE,
394
+ step=32,
395
+ value=args.image_size,
396
+ )
397
+ with gr.Row():
398
+ flow_dpms_inference_steps = gr.Slider(
399
+ label="Sampling steps",
400
+ minimum=5,
401
+ maximum=40,
402
+ step=1,
403
+ value=20,
404
+ )
405
+ flow_dpms_guidance_scale = gr.Slider(
406
+ label="CFG Guidance scale",
407
+ minimum=1,
408
+ maximum=10,
409
+ step=0.1,
410
+ value=4.5,
411
+ )
412
+ flow_dpms_pag_guidance_scale = gr.Slider(
413
+ label="PAG Guidance scale",
414
+ minimum=1,
415
+ maximum=4,
416
+ step=0.5,
417
+ value=1.0,
418
+ )
419
+ with gr.Row():
420
+ use_negative_prompt = gr.Checkbox(label="Use negative prompt", value=False, visible=True)
421
  negative_prompt = gr.Text(
422
  label="Negative prompt",
423
  max_lines=1,
424
  placeholder="Enter a negative prompt",
425
+ visible=True,
426
+ )
427
+ style_selection = gr.Radio(
428
+ show_label=True,
429
+ container=True,
430
+ interactive=True,
431
+ choices=STYLE_NAMES,
432
+ value=DEFAULT_STYLE_NAME,
433
+ label="Image Style",
434
  )
435
  seed = gr.Slider(
436
  label="Seed",
437
  minimum=0,
439
  step=1,
440
  value=0,
441
  )
442
  randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
443
+ with gr.Row(visible=True):
444
+ schedule = gr.Radio(
445
+ show_label=True,
446
+ container=True,
447
+ interactive=True,
448
+ choices=SCHEDULE_NAME,
449
+ value=DEFAULT_SCHEDULE_NAME,
450
+ label="Sampler Schedule",
451
+ visible=True,
 
452
  )
453
+ num_imgs = gr.Slider(
454
+ label="Num Images",
455
  minimum=1,
456
+ maximum=6,
457
  step=1,
458
+ value=1,
459
  )
460
 
461
+ gr.Examples(
462
+ examples=examples,
463
+ inputs=prompt,
464
+ outputs=[result, seed],
465
+ fn=generate,
466
+ cache_examples=CACHE_EXAMPLES,
467
+ )
468
+
469
+ use_negative_prompt.change(
470
+ fn=lambda x: gr.update(visible=x),
471
+ inputs=use_negative_prompt,
472
+ outputs=negative_prompt,
473
+ api_name=False,
474
+ )
475
+
476
  gr.on(
477
+ triggers=[
478
+ prompt.submit,
479
+ negative_prompt.submit,
480
+ run_button.click,
481
+ ],
482
+ fn=generate,
483
  inputs=[
484
  prompt,
485
  negative_prompt,
486
+ style_selection,
487
+ use_negative_prompt,
488
+ num_imgs,
489
  seed,
490
  height,
491
+ width,
492
+ flow_dpms_guidance_scale,
493
+ flow_dpms_pag_guidance_scale,
494
+ flow_dpms_inference_steps,
495
+ randomize_seed,
496
  ],
497
+ outputs=[result, seed, speed_box, info_box],
498
+ api_name="run",
499
  )
500
 
501
  if __name__ == "__main__":
502
+ demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=DEMO_PORT, debug=False, share=args.share)
app/app_sana.py ADDED
@@ -0,0 +1,502 @@
1
+ #!/usr/bin/env python
2
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ #
16
+ # SPDX-License-Identifier: Apache-2.0
17
+ from __future__ import annotations
18
+
19
+ import argparse
20
+ import os
21
+ import random
22
+ import socket
23
+ import sqlite3
24
+ import time
25
+ import uuid
26
+ from datetime import datetime
27
+
28
+ import gradio as gr
29
+ import numpy as np
30
+ import spaces
31
+ import torch
32
+ from PIL import Image
33
+ from torchvision.utils import make_grid, save_image
34
+ from transformers import AutoModelForCausalLM, AutoTokenizer
35
+
36
+ from app import safety_check
37
+ from app.sana_pipeline import SanaPipeline
38
+
39
+ MAX_SEED = np.iinfo(np.int32).max
40
+ CACHE_EXAMPLES = torch.cuda.is_available() and os.getenv("CACHE_EXAMPLES", "1") == "1"
41
+ MAX_IMAGE_SIZE = int(os.getenv("MAX_IMAGE_SIZE", "4096"))
42
+ USE_TORCH_COMPILE = os.getenv("USE_TORCH_COMPILE", "0") == "1"
43
+ ENABLE_CPU_OFFLOAD = os.getenv("ENABLE_CPU_OFFLOAD", "0") == "1"
44
+ DEMO_PORT = int(os.getenv("DEMO_PORT", "15432"))
45
+ os.environ["GRADIO_EXAMPLES_CACHE"] = "./.gradio/cache"
46
+ COUNTER_DB = os.getenv("COUNTER_DB", ".count.db")
47
+
48
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
49
+
50
+ style_list = [
51
+ {
52
+ "name": "(No style)",
53
+ "prompt": "{prompt}",
54
+ "negative_prompt": "",
55
+ },
56
+ {
57
+ "name": "Cinematic",
58
+ "prompt": "cinematic still {prompt} . emotional, harmonious, vignette, highly detailed, high budget, bokeh, "
59
+ "cinemascope, moody, epic, gorgeous, film grain, grainy",
60
+ "negative_prompt": "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured",
61
+ },
62
+ {
63
+ "name": "Photographic",
64
+ "prompt": "cinematic photo {prompt} . 35mm photograph, film, bokeh, professional, 4k, highly detailed",
65
+ "negative_prompt": "drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, ugly",
66
+ },
67
+ {
68
+ "name": "Anime",
69
+ "prompt": "anime artwork {prompt} . anime style, key visual, vibrant, studio anime, highly detailed",
70
+ "negative_prompt": "photo, deformed, black and white, realism, disfigured, low contrast",
71
+ },
72
+ {
73
+ "name": "Manga",
74
+ "prompt": "manga style {prompt} . vibrant, high-energy, detailed, iconic, Japanese comic style",
75
+ "negative_prompt": "ugly, deformed, noisy, blurry, low contrast, realism, photorealistic, Western comic style",
76
+ },
77
+ {
78
+ "name": "Digital Art",
79
+ "prompt": "concept art {prompt} . digital artwork, illustrative, painterly, matte painting, highly detailed",
80
+ "negative_prompt": "photo, photorealistic, realism, ugly",
81
+ },
82
+ {
83
+ "name": "Pixel art",
84
+ "prompt": "pixel-art {prompt} . low-res, blocky, pixel art style, 8-bit graphics",
85
+ "negative_prompt": "sloppy, messy, blurry, noisy, highly detailed, ultra textured, photo, realistic",
86
+ },
87
+ {
88
+ "name": "Fantasy art",
89
+ "prompt": "ethereal fantasy concept art of {prompt} . magnificent, celestial, ethereal, painterly, epic, "
90
+ "majestic, magical, fantasy art, cover art, dreamy",
91
+ "negative_prompt": "photographic, realistic, realism, 35mm film, dslr, cropped, frame, text, deformed, "
92
+ "glitch, noise, noisy, off-center, deformed, cross-eyed, closed eyes, bad anatomy, ugly, "
93
+ "disfigured, sloppy, duplicate, mutated, black and white",
94
+ },
95
+ {
96
+ "name": "Neonpunk",
97
+ "prompt": "neonpunk style {prompt} . cyberpunk, vaporwave, neon, vibes, vibrant, stunningly beautiful, crisp, "
98
+ "detailed, sleek, ultramodern, magenta highlights, dark purple shadows, high contrast, cinematic, "
99
+ "ultra detailed, intricate, professional",
100
+ "negative_prompt": "painting, drawing, illustration, glitch, deformed, mutated, cross-eyed, ugly, disfigured",
101
+ },
102
+ {
103
+ "name": "3D Model",
104
+ "prompt": "professional 3d model {prompt} . octane render, highly detailed, volumetric, dramatic lighting",
105
+ "negative_prompt": "ugly, deformed, noisy, low poly, blurry, painting",
106
+ },
107
+ ]
108
+
109
+ styles = {k["name"]: (k["prompt"], k["negative_prompt"]) for k in style_list}
110
+ STYLE_NAMES = list(styles.keys())
111
+ DEFAULT_STYLE_NAME = "(No style)"
112
+ SCHEDULE_NAME = ["Flow_DPM_Solver"]
113
+ DEFAULT_SCHEDULE_NAME = "Flow_DPM_Solver"
114
+ NUM_IMAGES_PER_PROMPT = 1
115
+ INFER_SPEED = 0
116
+
117
+
118
+ def norm_ip(img, low, high):
119
+ img.clamp_(min=low, max=high)
120
+ img.sub_(low).div_(max(high - low, 1e-5))
121
+ return img
122
+
123
+
124
+ def open_db():
125
+ db = sqlite3.connect(COUNTER_DB)
126
+ db.execute("CREATE TABLE IF NOT EXISTS counter(app CHARS PRIMARY KEY UNIQUE, value INTEGER)")
127
+ db.execute('INSERT OR IGNORE INTO counter(app, value) VALUES("Sana", 0)')
128
+ return db
129
+
130
+
131
+ def read_inference_count():
132
+ with open_db() as db:
133
+ cur = db.execute('SELECT value FROM counter WHERE app="Sana"')
134
+ db.commit()
135
+ return cur.fetchone()[0]
136
+
137
+
138
+ def write_inference_count(count):
139
+ count = max(0, int(count))
140
+ with open_db() as db:
141
+ db.execute(f'UPDATE counter SET value=value+{count} WHERE app="Sana"')
142
+ db.commit()
143
+
144
+
145
+ def run_inference(num_imgs=1):
146
+ write_inference_count(num_imgs)
147
+ count = read_inference_count()
148
+
149
+ return (
150
+ f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: "
151
+ f"16px; color:red; font-weight: bold;'>{count}</span>"
152
+ )
153
+
154
+
155
+ def update_inference_count():
156
+ count = read_inference_count()
157
+ return (
158
+ f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: "
159
+ f"16px; color:red; font-weight: bold;'>{count}</span>"
160
+ )
161
+
162
+
163
+ def apply_style(style_name: str, positive: str, negative: str = "") -> tuple[str, str]:
164
+ p, n = styles.get(style_name, styles[DEFAULT_STYLE_NAME])
165
+ if not negative:
166
+ negative = ""
167
+ return p.replace("{prompt}", positive), n + negative
168
+
169
+
170
+ def get_args():
171
+ parser = argparse.ArgumentParser()
172
+ parser.add_argument("--config", type=str, help="config")
173
+ parser.add_argument(
174
+ "--model_path",
175
+ nargs="?",
176
+ default="hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth",
177
+ type=str,
178
+ help="Path to the model file (positional)",
179
+ )
180
+ parser.add_argument("--output", default="./", type=str)
181
+ parser.add_argument("--bs", default=1, type=int)
182
+ parser.add_argument("--image_size", default=1024, type=int)
183
+ parser.add_argument("--cfg_scale", default=5.0, type=float)
184
+ parser.add_argument("--pag_scale", default=2.0, type=float)
185
+ parser.add_argument("--seed", default=42, type=int)
186
+ parser.add_argument("--step", default=-1, type=int)
187
+ parser.add_argument("--custom_image_size", default=None, type=int)
188
+ parser.add_argument("--share", action="store_true")
189
+ parser.add_argument(
190
+ "--shield_model_path",
191
+ type=str,
192
+ help="The path to shield model, we employ ShieldGemma-2B by default.",
193
+ default="google/shieldgemma-2b",
194
+ )
195
+
196
+ return parser.parse_known_args()[0]
197
+
198
+
199
+ args = get_args()
200
+
201
+ if torch.cuda.is_available():
202
+ model_path = args.model_path
203
+ pipe = SanaPipeline(args.config)
204
+ pipe.from_pretrained(model_path)
205
+ pipe.register_progress_bar(gr.Progress())
206
+
207
+ # safety checker
208
+ safety_checker_tokenizer = AutoTokenizer.from_pretrained(args.shield_model_path)
209
+ safety_checker_model = AutoModelForCausalLM.from_pretrained(
210
+ args.shield_model_path,
211
+ device_map="auto",
212
+ torch_dtype=torch.bfloat16,
213
+ ).to(device)
214
+
215
+
216
+ def save_image_sana(img, seed="", save_img=False):
217
+ unique_name = f"{str(uuid.uuid4())}_{seed}.png"
218
+ save_path = os.path.join(f"output/online_demo_img/{datetime.now().date()}")
219
+ os.umask(0o000) # file permission: 666; dir permission: 777
220
+ os.makedirs(save_path, exist_ok=True)
221
+ unique_name = os.path.join(save_path, unique_name)
222
+ if save_img:
223
+ save_image(img, unique_name, nrow=1, normalize=True, value_range=(-1, 1))
224
+
225
+ return unique_name
226
+
227
+
228
+ def randomize_seed_fn(seed: int, randomize_seed: bool) -> int:
229
+ if randomize_seed:
230
+ seed = random.randint(0, MAX_SEED)
231
+ return seed
232
+
233
+
234
+ @torch.no_grad()
235
+ @torch.inference_mode()
236
+ @spaces.GPU(enable_queue=True)
237
+ def generate(
238
+ prompt: str = None,
239
+ negative_prompt: str = "",
240
+ style: str = DEFAULT_STYLE_NAME,
241
+ use_negative_prompt: bool = False,
242
+ num_imgs: int = 1,
243
+ seed: int = 0,
244
+ height: int = 1024,
245
+ width: int = 1024,
246
+ flow_dpms_guidance_scale: float = 5.0,
247
+ flow_dpms_pag_guidance_scale: float = 2.0,
248
+ flow_dpms_inference_steps: int = 20,
249
+ randomize_seed: bool = False,
250
+ ):
251
+ global INFER_SPEED
252
+ # seed = 823753551
253
+ box = run_inference(num_imgs)
254
+ seed = int(randomize_seed_fn(seed, randomize_seed))
255
+ generator = torch.Generator(device=device).manual_seed(seed)
256
+ print(f"PORT: {DEMO_PORT}, model_path: {model_path}")
257
+ if safety_check.is_dangerous(safety_checker_tokenizer, safety_checker_model, prompt, threshold=0.2):
258
+ prompt = "A red heart."
259
+
260
+ print(prompt)
261
+
262
+ num_inference_steps = flow_dpms_inference_steps
263
+ guidance_scale = flow_dpms_guidance_scale
264
+ pag_guidance_scale = flow_dpms_pag_guidance_scale
265
+
266
+ if not use_negative_prompt:
267
+ negative_prompt = None # type: ignore
268
+ prompt, negative_prompt = apply_style(style, prompt, negative_prompt)
269
+
270
+ pipe.progress_fn(0, desc="Sana Start")
271
+
272
+ time_start = time.time()
273
+ images = pipe(
274
+ prompt=prompt,
275
+ height=height,
276
+ width=width,
277
+ negative_prompt=negative_prompt,
278
+ guidance_scale=guidance_scale,
279
+ pag_guidance_scale=pag_guidance_scale,
280
+ num_inference_steps=num_inference_steps,
281
+ num_images_per_prompt=num_imgs,
282
+ generator=generator,
283
+ )
284
+
285
+ pipe.progress_fn(1.0, desc="Sana End")
286
+ INFER_SPEED = (time.time() - time_start) / num_imgs
287
+
288
+ save_img = False
289
+ if save_img:
290
+ img = [save_image_sana(img, seed, save_img=save_image) for img in images]
291
+ print(img)
292
+ else:
293
+ img = [
294
+ Image.fromarray(
295
+ norm_ip(img, -1, 1)
296
+ .mul(255)
297
+ .add_(0.5)
298
+ .clamp_(0, 255)
299
+ .permute(1, 2, 0)
300
+ .to("cpu", torch.uint8)
301
+ .numpy()
302
+ .astype(np.uint8)
303
+ )
304
+ for img in images
305
+ ]
306
+
307
+ torch.cuda.empty_cache()
308
+
309
+ return (
310
+ img,
311
+ seed,
312
+ f"<span style='font-size: 16px; font-weight: bold;'>Inference Speed: {INFER_SPEED:.3f} s/Img</span>",
313
+ box,
314
+ )
315
+
316
+
317
+ model_size = "1.6" if "1600M" in args.model_path else "0.6"
318
+ title = f"""
319
+ <div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
320
+ <img src="https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/main/asset/logo.png" width="50%" alt="logo"/>
321
+ </div>
322
+ """
323
+ DESCRIPTION = f"""
324
+ <p><span style="font-size: 36px; font-weight: bold;">Sana-{model_size}B</span><span style="font-size: 20px; font-weight: bold;">{args.image_size}px</span></p>
325
+ <p style="font-size: 16px; font-weight: bold;">Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer</p>
326
+ <p><span style="font-size: 16px;"><a href="https://arxiv.org/abs/2410.10629">[Paper]</a></span> <span style="font-size: 16px;"><a href="https://github.com/NVlabs/Sana">[Github]</a></span> <span style="font-size: 16px;"><a href="https://nvlabs.github.io/Sana">[Project]</a></span></p>
327
+ <p style="font-size: 16px; font-weight: bold;">Powered by <a href="https://hanlab.mit.edu/projects/dc-ae">DC-AE</a> with 32x latent space, </p>running on node {socket.gethostname()}.
328
+ <p style="font-size: 16px; font-weight: bold;">Unsafe word will give you a 'Red Heart' in the image instead.</p>
329
+ """
330
+ if model_size == "0.6":
331
+ DESCRIPTION += "\n<p>0.6B model's text rendering ability is limited.</p>"
332
+ if not torch.cuda.is_available():
333
+ DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"
334
+
335
+ examples = [
336
+ 'a cyberpunk cat with a neon sign that says "Sana"',
337
+ "A very detailed and realistic full body photo set of a tall, slim, and athletic Shiba Inu in a white oversized straight t-shirt, white shorts, and short white shoes.",
338
+ "Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, volumetric lighting, spectacular, ambient lights, light pollution, cinematic atmosphere, art nouveau style, illustration art artwork by SenseiJaye, intricate detail.",
339
+ "portrait photo of a girl, photograph, highly detailed face, depth of field",
340
+ 'make me a logo that says "So Fast" with a really cool flying dragon shape with lightning sparks all over the sides and all of it contains Indonesian language',
341
+ "🐶 Wearing 🕶 flying on the 🌈",
342
+ "👧 with 🌹 in the ❄️",
343
+ "an old rusted robot wearing pants and a jacket riding skis in a supermarket.",
344
+ "professional portrait photo of an anthropomorphic cat wearing fancy gentleman hat and jacket walking in autumn forest.",
345
+ "Astronaut in a jungle, cold color palette, muted colors, detailed",
346
+ "a stunning and luxurious bedroom carved into a rocky mountainside seamlessly blending nature with modern design with a plush earth-toned bed textured stone walls circular fireplace massive uniquely shaped window framing snow-capped mountains dense forests",
347
+ ]
348
+
349
+ css = """
350
+ .gradio-container{max-width: 640px !important}
351
+ h1{text-align:center}
352
+ """
353
+ with gr.Blocks(css=css, title="Sana") as demo:
354
+ gr.Markdown(title)
355
+ gr.HTML(DESCRIPTION)
356
+ gr.DuplicateButton(
357
+ value="Duplicate Space for private use",
358
+ elem_id="duplicate-button",
359
+ visible=os.getenv("SHOW_DUPLICATE_BUTTON") == "1",
360
+ )
361
+ info_box = gr.Markdown(
362
+ value=f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: 16px; color:red; font-weight: bold;'>{read_inference_count()}</span>"
363
+ )
364
+ demo.load(fn=update_inference_count, outputs=info_box) # update the value when re-loading the page
365
+ # with gr.Row(equal_height=False):
366
+ with gr.Group():
367
+ with gr.Row():
368
+ prompt = gr.Text(
369
+ label="Prompt",
370
+ show_label=False,
371
+ max_lines=1,
372
+ placeholder="Enter your prompt",
373
+ container=False,
374
+ )
375
+ run_button = gr.Button("Run", scale=0)
376
+ result = gr.Gallery(label="Result", show_label=False, columns=NUM_IMAGES_PER_PROMPT, format="png")
377
+ speed_box = gr.Markdown(
378
+ value=f"<span style='font-size: 16px; font-weight: bold;'>Inference speed: {INFER_SPEED} s/Img</span>"
379
+ )
380
+ with gr.Accordion("Advanced options", open=False):
381
+ with gr.Group():
382
+ with gr.Row(visible=True):
383
+ height = gr.Slider(
384
+ label="Height",
385
+ minimum=256,
386
+ maximum=MAX_IMAGE_SIZE,
387
+ step=32,
388
+ value=args.image_size,
389
+ )
390
+ width = gr.Slider(
391
+ label="Width",
392
+ minimum=256,
393
+ maximum=MAX_IMAGE_SIZE,
394
+ step=32,
395
+ value=args.image_size,
396
+ )
397
+ with gr.Row():
398
+ flow_dpms_inference_steps = gr.Slider(
399
+ label="Sampling steps",
400
+ minimum=5,
401
+ maximum=40,
402
+ step=1,
403
+ value=20,
404
+ )
405
+ flow_dpms_guidance_scale = gr.Slider(
406
+ label="CFG Guidance scale",
407
+ minimum=1,
408
+ maximum=10,
409
+ step=0.1,
410
+ value=4.5,
411
+ )
412
+ flow_dpms_pag_guidance_scale = gr.Slider(
413
+ label="PAG Guidance scale",
414
+ minimum=1,
415
+ maximum=4,
416
+ step=0.5,
417
+ value=1.0,
418
+ )
419
+ with gr.Row():
420
+ use_negative_prompt = gr.Checkbox(label="Use negative prompt", value=False, visible=True)
421
+ negative_prompt = gr.Text(
422
+ label="Negative prompt",
423
+ max_lines=1,
424
+ placeholder="Enter a negative prompt",
425
+ visible=True,
426
+ )
427
+ style_selection = gr.Radio(
428
+ show_label=True,
429
+ container=True,
430
+ interactive=True,
431
+ choices=STYLE_NAMES,
432
+ value=DEFAULT_STYLE_NAME,
433
+ label="Image Style",
434
+ )
435
+ seed = gr.Slider(
436
+ label="Seed",
437
+ minimum=0,
438
+ maximum=MAX_SEED,
439
+ step=1,
440
+ value=0,
441
+ )
442
+ randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
443
+ with gr.Row(visible=True):
444
+ schedule = gr.Radio(
445
+ show_label=True,
446
+ container=True,
447
+ interactive=True,
448
+ choices=SCHEDULE_NAME,
449
+ value=DEFAULT_SCHEDULE_NAME,
450
+ label="Sampler Schedule",
451
+ visible=True,
452
+ )
453
+ num_imgs = gr.Slider(
454
+ label="Num Images",
455
+ minimum=1,
456
+ maximum=6,
457
+ step=1,
458
+ value=1,
459
+ )
460
+
461
+ gr.Examples(
462
+ examples=examples,
463
+ inputs=prompt,
464
+ outputs=[result, seed],
465
+ fn=generate,
466
+ cache_examples=CACHE_EXAMPLES,
467
+ )
468
+
469
+ use_negative_prompt.change(
470
+ fn=lambda x: gr.update(visible=x),
471
+ inputs=use_negative_prompt,
472
+ outputs=negative_prompt,
473
+ api_name=False,
474
+ )
475
+
476
+ gr.on(
477
+ triggers=[
478
+ prompt.submit,
479
+ negative_prompt.submit,
480
+ run_button.click,
481
+ ],
482
+ fn=generate,
483
+ inputs=[
484
+ prompt,
485
+ negative_prompt,
486
+ style_selection,
487
+ use_negative_prompt,
488
+ num_imgs,
489
+ seed,
490
+ height,
491
+ width,
492
+ flow_dpms_guidance_scale,
493
+ flow_dpms_pag_guidance_scale,
494
+ flow_dpms_inference_steps,
495
+ randomize_seed,
496
+ ],
497
+ outputs=[result, seed, speed_box, info_box],
498
+ api_name="run",
499
+ )
500
+
501
+ if __name__ == "__main__":
502
+ demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=DEMO_PORT, debug=False, share=args.share)
app/app_sana_4bit.py ADDED
@@ -0,0 +1,409 @@
1
+ #!/usr/bin/env python
2
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ #!/usr/bin/env python
6
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
7
+ #
8
+ # Licensed under the Apache License, Version 2.0 (the "License");
9
+ # you may not use this file except in compliance with the License.
10
+ # You may obtain a copy of the License at
11
+ #
12
+ # http://www.apache.org/licenses/LICENSE-2.0
13
+ #
14
+ # Unless required by applicable law or agreed to in writing, software
15
+ # distributed under the License is distributed on an "AS IS" BASIS,
16
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17
+ # See the License for the specific language governing permissions and
18
+ # limitations under the License.
19
+ #
20
+ # SPDX-License-Identifier: Apache-2.0
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import os
25
+ import random
26
+ import time
27
+ import uuid
28
+ from datetime import datetime
29
+
30
+ import gradio as gr
31
+ import numpy as np
32
+ import spaces
33
+ import torch
34
+ from diffusers import SanaPipeline
35
+ from nunchaku.models.transformer_sana import NunchakuSanaTransformer2DModel
36
+ from torchvision.utils import save_image
37
+
38
+ MAX_SEED = np.iinfo(np.int32).max
39
+ CACHE_EXAMPLES = torch.cuda.is_available() and os.getenv("CACHE_EXAMPLES", "1") == "1"
40
+ MAX_IMAGE_SIZE = int(os.getenv("MAX_IMAGE_SIZE", "4096"))
41
+ USE_TORCH_COMPILE = os.getenv("USE_TORCH_COMPILE", "0") == "1"
42
+ ENABLE_CPU_OFFLOAD = os.getenv("ENABLE_CPU_OFFLOAD", "0") == "1"
43
+ DEMO_PORT = int(os.getenv("DEMO_PORT", "15432"))
44
+ os.environ["GRADIO_EXAMPLES_CACHE"] = "./.gradio/cache"
45
+ COUNTER_DB = os.getenv("COUNTER_DB", ".count.db")
46
+ INFER_SPEED = 0
47
+
48
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
49
+
50
+ style_list = [
51
+ {
52
+ "name": "(No style)",
53
+ "prompt": "{prompt}",
54
+ "negative_prompt": "",
55
+ },
56
+ {
57
+ "name": "Cinematic",
58
+ "prompt": "cinematic still {prompt} . emotional, harmonious, vignette, highly detailed, high budget, bokeh, "
59
+ "cinemascope, moody, epic, gorgeous, film grain, grainy",
60
+ "negative_prompt": "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured",
61
+ },
62
+ {
63
+ "name": "Photographic",
64
+ "prompt": "cinematic photo {prompt} . 35mm photograph, film, bokeh, professional, 4k, highly detailed",
65
+ "negative_prompt": "drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, ugly",
66
+ },
67
+ {
68
+ "name": "Anime",
69
+ "prompt": "anime artwork {prompt} . anime style, key visual, vibrant, studio anime, highly detailed",
70
+ "negative_prompt": "photo, deformed, black and white, realism, disfigured, low contrast",
71
+ },
72
+ {
73
+ "name": "Manga",
74
+ "prompt": "manga style {prompt} . vibrant, high-energy, detailed, iconic, Japanese comic style",
75
+ "negative_prompt": "ugly, deformed, noisy, blurry, low contrast, realism, photorealistic, Western comic style",
76
+ },
77
+ {
78
+ "name": "Digital Art",
79
+ "prompt": "concept art {prompt} . digital artwork, illustrative, painterly, matte painting, highly detailed",
80
+ "negative_prompt": "photo, photorealistic, realism, ugly",
81
+ },
82
+ {
83
+ "name": "Pixel art",
84
+ "prompt": "pixel-art {prompt} . low-res, blocky, pixel art style, 8-bit graphics",
85
+ "negative_prompt": "sloppy, messy, blurry, noisy, highly detailed, ultra textured, photo, realistic",
86
+ },
87
+ {
88
+ "name": "Fantasy art",
89
+ "prompt": "ethereal fantasy concept art of {prompt} . magnificent, celestial, ethereal, painterly, epic, "
90
+ "majestic, magical, fantasy art, cover art, dreamy",
91
+ "negative_prompt": "photographic, realistic, realism, 35mm film, dslr, cropped, frame, text, deformed, "
92
+ "glitch, noise, noisy, off-center, deformed, cross-eyed, closed eyes, bad anatomy, ugly, "
93
+ "disfigured, sloppy, duplicate, mutated, black and white",
94
+ },
95
+ {
96
+ "name": "Neonpunk",
97
+ "prompt": "neonpunk style {prompt} . cyberpunk, vaporwave, neon, vibes, vibrant, stunningly beautiful, crisp, "
98
+ "detailed, sleek, ultramodern, magenta highlights, dark purple shadows, high contrast, cinematic, "
99
+ "ultra detailed, intricate, professional",
100
+ "negative_prompt": "painting, drawing, illustration, glitch, deformed, mutated, cross-eyed, ugly, disfigured",
101
+ },
102
+ {
103
+ "name": "3D Model",
104
+ "prompt": "professional 3d model {prompt} . octane render, highly detailed, volumetric, dramatic lighting",
105
+ "negative_prompt": "ugly, deformed, noisy, low poly, blurry, painting",
106
+ },
107
+ ]
108
+
109
+ styles = {k["name"]: (k["prompt"], k["negative_prompt"]) for k in style_list}
110
+ STYLE_NAMES = list(styles.keys())
111
+ DEFAULT_STYLE_NAME = "(No style)"
112
+ SCHEDULE_NAME = ["Flow_DPM_Solver"]
113
+ DEFAULT_SCHEDULE_NAME = "Flow_DPM_Solver"
114
+ NUM_IMAGES_PER_PROMPT = 1
115
+
116
+
117
+ def apply_style(style_name: str, positive: str, negative: str = "") -> tuple[str, str]:
118
+ p, n = styles.get(style_name, styles[DEFAULT_STYLE_NAME])
119
+ if not negative:
120
+ negative = ""
121
+ return p.replace("{prompt}", positive), n + negative
122
+
123
+
124
+ def get_args():
125
+ parser = argparse.ArgumentParser()
126
+ parser.add_argument(
127
+ "--model_path",
128
+ nargs="?",
129
+ default="Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
130
+ type=str,
131
+ help="Path to the model file (positional)",
132
+ )
133
+ parser.add_argument("--share", action="store_true")
134
+
135
+ return parser.parse_known_args()[0]
136
+
137
+
138
+ args = get_args()
139
+
140
+ if torch.cuda.is_available():
141
+
142
+ transformer = NunchakuSanaTransformer2DModel.from_pretrained("mit-han-lab/svdq-int4-sana-1600m")
143
+ pipe = SanaPipeline.from_pretrained(
144
+ "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
145
+ transformer=transformer,
146
+ variant="bf16",
147
+ torch_dtype=torch.bfloat16,
148
+ ).to(device)
149
+
150
+ pipe.text_encoder.to(torch.bfloat16)
151
+ pipe.vae.to(torch.bfloat16)
152
+
153
+
154
+ def save_image_sana(img, seed="", save_img=False):
155
+ unique_name = f"{str(uuid.uuid4())}_{seed}.png"
156
+ save_path = os.path.join(f"output/online_demo_img/{datetime.now().date()}")
157
+ os.umask(0o000) # file permission: 666; dir permission: 777
158
+ os.makedirs(save_path, exist_ok=True)
159
+ unique_name = os.path.join(save_path, unique_name)
160
+ if save_img:
161
+ save_image(img, unique_name, nrow=1, normalize=True, value_range=(-1, 1))
162
+
163
+ return unique_name
164
+
165
+
166
+ def randomize_seed_fn(seed: int, randomize_seed: bool) -> int:
167
+ if randomize_seed:
168
+ seed = random.randint(0, MAX_SEED)
169
+ return seed
170
+
171
+
172
+ @torch.no_grad()
173
+ @torch.inference_mode()
174
+ @spaces.GPU(enable_queue=True)
175
+ def generate(
176
+ prompt: str = None,
177
+ negative_prompt: str = "",
178
+ style: str = DEFAULT_STYLE_NAME,
179
+ use_negative_prompt: bool = False,
180
+ num_imgs: int = 1,
181
+ seed: int = 0,
182
+ height: int = 1024,
183
+ width: int = 1024,
184
+ flow_dpms_guidance_scale: float = 5.0,
185
+ flow_dpms_inference_steps: int = 20,
186
+ randomize_seed: bool = False,
187
+ ):
188
+ global INFER_SPEED
189
+ # seed = 823753551
190
+ seed = int(randomize_seed_fn(seed, randomize_seed))
191
+ generator = torch.Generator(device=device).manual_seed(seed)
192
+ print(f"PORT: {DEMO_PORT}, model_path: {args.model_path}")
193
+
194
+ print(prompt)
195
+
196
+ num_inference_steps = flow_dpms_inference_steps
197
+ guidance_scale = flow_dpms_guidance_scale
198
+
199
+ if not use_negative_prompt:
200
+ negative_prompt = None # type: ignore
201
+ prompt, negative_prompt = apply_style(style, prompt, negative_prompt)
202
+
203
+ time_start = time.time()
204
+ images = pipe(
205
+ prompt=prompt,
206
+ height=height,
207
+ width=width,
208
+ negative_prompt=negative_prompt,
209
+ guidance_scale=guidance_scale,
210
+ num_inference_steps=num_inference_steps,
211
+ num_images_per_prompt=num_imgs,
212
+ generator=generator,
213
+ ).images
214
+ INFER_SPEED = (time.time() - time_start) / num_imgs
215
+
216
+ save_img = False
217
+ if save_img:
218
+ img = [save_image_sana(img, seed, save_img=save_image) for img in images]
219
+ print(img)
220
+ else:
221
+ img = images
222
+
223
+ torch.cuda.empty_cache()
224
+
225
+ return (
226
+ img,
227
+ seed,
228
+ f"<span style='font-size: 16px; font-weight: bold;'>Inference Speed: {INFER_SPEED:.3f} s/Img</span>",
229
+ )
230
+
231
+
232
+ model_size = "1.6" if "1600M" in args.model_path else "0.6"
233
+ title = f"""
234
+ <div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
235
+ <img src="https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/main/asset/logo.png" width="30%" alt="logo"/>
236
+ </div>
237
+ """
238
+ DESCRIPTION = f"""
239
+ <p style="font-size: 30px; font-weight: bold; text-align: center;">Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer (4bit version)</p>
240
+ """
241
+ if model_size == "0.6":
242
+ DESCRIPTION += "\n<p>0.6B model's text rendering ability is limited.</p>"
243
+ if not torch.cuda.is_available():
244
+ DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"
245
+
246
+ examples = [
247
+ 'a cyberpunk cat with a neon sign that says "Sana"',
248
+ "A very detailed and realistic full body photo set of a tall, slim, and athletic Shiba Inu in a white oversized straight t-shirt, white shorts, and short white shoes.",
249
+ "Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, volumetric lighting, spectacular, ambient lights, light pollution, cinematic atmosphere, art nouveau style, illustration art artwork by SenseiJaye, intricate detail.",
250
+ "portrait photo of a girl, photograph, highly detailed face, depth of field",
251
+ 'make me a logo that says "So Fast" with a really cool flying dragon shape with lightning sparks all over the sides and all of it contains Indonesian language',
252
+ "🐶 Wearing 🕶 flying on the 🌈",
253
+ "👧 with 🌹 in the ❄️",
254
+ "an old rusted robot wearing pants and a jacket riding skis in a supermarket.",
255
+ "professional portrait photo of an anthropomorphic cat wearing fancy gentleman hat and jacket walking in autumn forest.",
256
+ "Astronaut in a jungle, cold color palette, muted colors, detailed",
257
+ "a stunning and luxurious bedroom carved into a rocky mountainside seamlessly blending nature with modern design with a plush earth-toned bed textured stone walls circular fireplace massive uniquely shaped window framing snow-capped mountains dense forests",
258
+ ]
259
+
260
+ css = """
261
+ .gradio-container {max-width: 850px !important; height: auto !important;}
262
+ h1 {text-align: center;}
263
+ """
264
+ theme = gr.themes.Base()
265
+ with gr.Blocks(css=css, theme=theme, title="Sana") as demo:
266
+ gr.Markdown(title)
267
+ gr.HTML(DESCRIPTION)
268
+ gr.DuplicateButton(
269
+ value="Duplicate Space for private use",
270
+ elem_id="duplicate-button",
271
+ visible=os.getenv("SHOW_DUPLICATE_BUTTON") == "1",
272
+ )
273
+ # with gr.Row(equal_height=False):
274
+ with gr.Group():
275
+ with gr.Row():
276
+ prompt = gr.Text(
277
+ label="Prompt",
278
+ show_label=False,
279
+ max_lines=1,
280
+ placeholder="Enter your prompt",
281
+ container=False,
282
+ )
283
+ run_button = gr.Button("Run", scale=0)
284
+ result = gr.Gallery(
285
+ label="Result",
286
+ show_label=False,
287
+ height=750,
288
+ columns=NUM_IMAGES_PER_PROMPT,
289
+ format="jpeg",
290
+ )
291
+
292
+ speed_box = gr.Markdown(
293
+ value=f"<span style='font-size: 16px; font-weight: bold;'>Inference speed: {INFER_SPEED} s/Img</span>"
294
+ )
295
+ with gr.Accordion("Advanced options", open=False):
296
+ with gr.Group():
297
+ with gr.Row(visible=True):
298
+ height = gr.Slider(
299
+ label="Height",
300
+ minimum=256,
301
+ maximum=MAX_IMAGE_SIZE,
302
+ step=32,
303
+ value=1024,
304
+ )
305
+ width = gr.Slider(
306
+ label="Width",
307
+ minimum=256,
308
+ maximum=MAX_IMAGE_SIZE,
309
+ step=32,
310
+ value=1024,
311
+ )
312
+ with gr.Row():
313
+ flow_dpms_inference_steps = gr.Slider(
314
+ label="Sampling steps",
315
+ minimum=5,
316
+ maximum=40,
317
+ step=1,
318
+ value=20,
319
+ )
320
+ flow_dpms_guidance_scale = gr.Slider(
321
+ label="CFG Guidance scale",
322
+ minimum=1,
323
+ maximum=10,
324
+ step=0.1,
325
+ value=4.5,
326
+ )
327
+ with gr.Row():
328
+ use_negative_prompt = gr.Checkbox(label="Use negative prompt", value=False, visible=True)
329
+ negative_prompt = gr.Text(
330
+ label="Negative prompt",
331
+ max_lines=1,
332
+ placeholder="Enter a negative prompt",
333
+ visible=True,
334
+ )
335
+ style_selection = gr.Radio(
336
+ show_label=True,
337
+ container=True,
338
+ interactive=True,
339
+ choices=STYLE_NAMES,
340
+ value=DEFAULT_STYLE_NAME,
341
+ label="Image Style",
342
+ )
343
+ seed = gr.Slider(
344
+ label="Seed",
345
+ minimum=0,
346
+ maximum=MAX_SEED,
347
+ step=1,
348
+ value=0,
349
+ )
350
+ randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
351
+ with gr.Row(visible=True):
352
+ schedule = gr.Radio(
353
+ show_label=True,
354
+ container=True,
355
+ interactive=True,
356
+ choices=SCHEDULE_NAME,
357
+ value=DEFAULT_SCHEDULE_NAME,
358
+ label="Sampler Schedule",
359
+ visible=True,
360
+ )
361
+ num_imgs = gr.Slider(
362
+ label="Num Images",
363
+ minimum=1,
364
+ maximum=6,
365
+ step=1,
366
+ value=1,
367
+ )
368
+
369
+ gr.Examples(
370
+ examples=examples,
371
+ inputs=prompt,
372
+ outputs=[result, seed],
373
+ fn=generate,
374
+ cache_examples=CACHE_EXAMPLES,
375
+ )
376
+
377
+ use_negative_prompt.change(
378
+ fn=lambda x: gr.update(visible=x),
379
+ inputs=use_negative_prompt,
380
+ outputs=negative_prompt,
381
+ api_name=False,
382
+ )
383
+
384
+ gr.on(
385
+ triggers=[
386
+ prompt.submit,
387
+ negative_prompt.submit,
388
+ run_button.click,
389
+ ],
390
+ fn=generate,
391
+ inputs=[
392
+ prompt,
393
+ negative_prompt,
394
+ style_selection,
395
+ use_negative_prompt,
396
+ num_imgs,
397
+ seed,
398
+ height,
399
+ width,
400
+ flow_dpms_guidance_scale,
401
+ flow_dpms_inference_steps,
402
+ randomize_seed,
403
+ ],
404
+ outputs=[result, seed, speed_box],
405
+ api_name="run",
406
+ )
407
+
408
+ if __name__ == "__main__":
409
+ demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=DEMO_PORT, debug=False, share=args.share)
app/app_sana_4bit_compare_bf16.py ADDED
@@ -0,0 +1,313 @@
1
+ # Changed from https://huggingface.co/spaces/playgroundai/playground-v2.5/blob/main/app.py
2
+ import argparse
3
+ import os
4
+ import random
5
+ import time
6
+ from datetime import datetime
7
+
8
+ import GPUtil
9
+
10
+ # import gradio last to avoid conflicts with other imports
11
+ import gradio as gr
12
+ import safety_check
13
+ import spaces
14
+ import torch
15
+ from diffusers import SanaPipeline
16
+ from nunchaku.models.transformer_sana import NunchakuSanaTransformer2DModel
17
+ from transformers import AutoModelForCausalLM, AutoTokenizer
18
+
19
+ MAX_IMAGE_SIZE = 2048
20
+ MAX_SEED = 1000000000
21
+
22
+ DEFAULT_HEIGHT = 1024
23
+ DEFAULT_WIDTH = 1024
24
+
25
+ # num_inference_steps, guidance_scale, seed
26
+ EXAMPLES = [
27
+ [
28
+ "🐶 Wearing 🕶 flying on the 🌈",
29
+ 1024,
30
+ 1024,
31
+ 20,
32
+ 5,
33
+ 2,
34
+ ],
35
+ [
36
+ "大漠孤烟直, 长河落日圆",
37
+ 1024,
38
+ 1024,
39
+ 20,
40
+ 5,
41
+ 23,
42
+ ],
43
+ [
44
+ "Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, "
45
+ "volumetric lighting, spectacular, ambient lights, light pollution, cinematic atmosphere, "
46
+ "art nouveau style, illustration art artwork by SenseiJaye, intricate detail.",
47
+ 1024,
48
+ 1024,
49
+ 20,
50
+ 5,
51
+ 233,
52
+ ],
53
+ [
54
+ "A photo of a Eurasian lynx in a sunlit forest, with tufted ears and a spotted coat. The lynx should be "
55
+ "sharply focused, gazing into the distance, while the background is softly blurred for depth. Use cinematic "
56
+ "lighting with soft rays filtering through the trees, and capture the scene with a shallow depth of field "
57
+ "for a natural, peaceful atmosphere. 8K resolution, highly detailed, photorealistic, "
58
+ "cinematic lighting, ultra-HD.",
59
+ 1024,
60
+ 1024,
61
+ 20,
62
+ 5,
63
+ 2333,
64
+ ],
65
+ [
66
+ "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. "
67
+ "She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. "
68
+ "She wears sunglasses and red lipstick. She walks confidently and casually. "
69
+ "The street is damp and reflective, creating a mirror effect of the colorful lights. "
70
+ "Many pedestrians walk about.",
71
+ 1024,
72
+ 1024,
73
+ 20,
74
+ 5,
75
+ 23333,
76
+ ],
77
+ [
78
+ "Cozy bedroom with vintage wooden furniture and a large circular window covered in lush green vines, "
79
+ "opening to a misty forest. Soft, ambient lighting highlights the bed with crumpled blankets, a bookshelf, "
80
+ "and a desk. The atmosphere is serene and natural. 8K resolution, highly detailed, photorealistic, "
81
+ "cinematic lighting, ultra-HD.",
82
+ 1024,
83
+ 1024,
84
+ 20,
85
+ 5,
86
+ 233333,
87
+ ],
88
+ ]
89
+
90
+
91
+ def hash_str_to_int(s: str) -> int:
92
+ """Hash a string to an integer."""
93
+ modulus = 10**9 + 7 # Large prime modulus
94
+ hash_int = 0
95
+ for char in s:
96
+ hash_int = (hash_int * 31 + ord(char)) % modulus
97
+ return hash_int
98
+
99
+
100
+ def get_pipeline(
101
+ precision: str, use_qencoder: bool = False, device: str | torch.device = "cuda", pipeline_init_kwargs: dict = {}
102
+ ) -> SanaPipeline:
103
+ if precision == "int4":
104
+ assert torch.device(device).type == "cuda", "int4 only supported on CUDA devices"
105
+ transformer = NunchakuSanaTransformer2DModel.from_pretrained("mit-han-lab/svdq-int4-sana-1600m")
106
+
107
+ pipeline_init_kwargs["transformer"] = transformer
108
+ if use_qencoder:
109
+ raise NotImplementedError("Quantized encoder not supported for Sana for now")
110
+ else:
111
+ assert precision == "bf16"
112
+ pipeline = SanaPipeline.from_pretrained(
113
+ "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
114
+ variant="bf16",
115
+ torch_dtype=torch.bfloat16,
116
+ **pipeline_init_kwargs,
117
+ )
118
+
119
+ pipeline = pipeline.to(device)
120
+ return pipeline
121
+
122
+
123
+ def get_args() -> argparse.Namespace:
124
+ parser = argparse.ArgumentParser()
125
+ parser.add_argument(
126
+ "-p",
127
+ "--precisions",
128
+ type=str,
129
+ default=["int4"],
130
+ nargs="*",
131
+ choices=["int4", "bf16"],
132
+ help="Which precisions to use",
133
+ )
134
+ parser.add_argument("--use-qencoder", action="store_true", help="Whether to use 4-bit text encoder")
135
+ parser.add_argument("--no-safety-checker", action="store_true", help="Disable safety checker")
136
+ parser.add_argument("--count-use", action="store_true", help="Whether to count the number of uses")
137
+ return parser.parse_args()
138
+
139
+
140
+ args = get_args()
141
+
142
+
143
+ pipelines = []
144
+ pipeline_init_kwargs = {}
145
+ for i, precision in enumerate(args.precisions):
146
+
147
+ pipeline = get_pipeline(
148
+ precision=precision,
149
+ use_qencoder=args.use_qencoder,
150
+ device="cuda",
151
+ pipeline_init_kwargs={**pipeline_init_kwargs},
152
+ )
153
+ pipelines.append(pipeline)
154
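+ # reuse the first pipeline's VAE and text encoder for the remaining precisions so only one copy sits in GPU memory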
+ if i == 0:
155
+ pipeline_init_kwargs["vae"] = pipeline.vae
156
+ pipeline_init_kwargs["text_encoder"] = pipeline.text_encoder
157
+
158
+ # safety checker
159
+ safety_checker_tokenizer = AutoTokenizer.from_pretrained(args.shield_model_path)
160
+ safety_checker_model = AutoModelForCausalLM.from_pretrained(
161
+ args.shield_model_path,
162
+ device_map="auto",
163
+ torch_dtype=torch.bfloat16,
164
+ ).to(pipeline.device)
165
+
166
+
167
+ @spaces.GPU(enable_queue=True)
168
+ def generate(
169
+ prompt: str = None,
170
+ height: int = 1024,
171
+ width: int = 1024,
172
+ num_inference_steps: int = 4,
173
+ guidance_scale: float = 0,
174
+ seed: int = 0,
175
+ ):
176
+ print(f"Prompt: {prompt}")
177
+ is_unsafe_prompt = False
178
+ if safety_check.is_dangerous(safety_checker_tokenizer, safety_checker_model, prompt, threshold=0.2):
179
+ prompt = "A peaceful world."
180
+ images, latency_strs = [], []
181
+ for i, pipeline in enumerate(pipelines):
182
+ progress = gr.Progress(track_tqdm=True)
183
+ start_time = time.time()
184
+ image = pipeline(
185
+ prompt=prompt,
186
+ height=height,
187
+ width=width,
188
+ guidance_scale=guidance_scale,
189
+ num_inference_steps=num_inference_steps,
190
+ generator=torch.Generator().manual_seed(seed),
191
+ ).images[0]
192
+ end_time = time.time()
193
+ latency = end_time - start_time
194
+ if latency < 1:
195
+ latency = latency * 1000
196
+ latency_str = f"{latency:.2f}ms"
197
+ else:
198
+ latency_str = f"{latency:.2f}s"
199
+ images.append(image)
200
+ latency_strs.append(latency_str)
201
+ if is_unsafe_prompt:
202
+ for i in range(len(latency_strs)):
203
+ latency_strs[i] += " (Unsafe prompt detected)"
204
+ torch.cuda.empty_cache()
205
+
206
+ if args.count_use:
207
+ if os.path.exists("use_count.txt"):
208
+ with open("use_count.txt") as f:
209
+ count = int(f.read())
210
+ else:
211
+ count = 0
212
+ count += 1
213
+ current_time = datetime.now()
214
+ print(f"{current_time}: {count}")
215
+ with open("use_count.txt", "w") as f:
216
+ f.write(str(count))
217
+ with open("use_record.txt", "a") as f:
218
+ f.write(f"{current_time}: {count}\n")
219
+
220
+ return *images, *latency_strs
221
+
222
+
223
+ with open("./assets/description.html") as f:
224
+ DESCRIPTION = f.read()
225
+ gpus = GPUtil.getGPUs()
226
+ if len(gpus) > 0:
227
+ gpu = gpus[0]
228
+ memory = gpu.memoryTotal / 1024
229
+ device_info = f"Running on {gpu.name} with {memory:.0f} GiB memory."
230
+ else:
231
+ device_info = "Running on CPU 🥶 This demo does not work on CPU."
232
+ notice = f'<strong>Notice:</strong>&nbsp;We will replace unsafe prompts with a default prompt: "A peaceful world."'
233
+
234
+ with gr.Blocks(
235
+ css_paths=[f"assets/frame{len(args.precisions)}.css", "assets/common.css"],
236
+ title=f"SVDQuant SANA-1600M Demo",
237
+ ) as demo:
238
+
239
+ def get_header_str():
240
+
241
+ if args.count_use:
242
+ if os.path.exists("use_count.txt"):
243
+ with open("use_count.txt") as f:
244
+ count = int(f.read())
245
+ else:
246
+ count = 0
247
+ count_info = (
248
+ f"<div style='display: flex; justify-content: center; align-items: center; text-align: center;'>"
249
+ f"<span style='font-size: 18px; font-weight: bold;'>Total inference runs: </span>"
250
+ f"<span style='font-size: 18px; color:red; font-weight: bold;'>&nbsp;{count}</span></div>"
251
+ )
252
+ else:
253
+ count_info = ""
254
+ header_str = DESCRIPTION.format(device_info=device_info, notice=notice, count_info=count_info)
255
+ return header_str
256
+
257
+ header = gr.HTML(get_header_str())
258
+ demo.load(fn=get_header_str, outputs=header)
259
+
260
+ with gr.Row():
261
+ image_results, latency_results = [], []
262
+ for i, precision in enumerate(args.precisions):
263
+ with gr.Column():
264
+ gr.Markdown(f"# {precision.upper()}", elem_id="image_header")
265
+ with gr.Group():
266
+ image_result = gr.Image(
267
+ format="png",
268
+ image_mode="RGB",
269
+ label="Result",
270
+ show_label=False,
271
+ show_download_button=True,
272
+ interactive=False,
273
+ )
274
+ latency_result = gr.Text(label="Inference Latency", show_label=True)
275
+ image_results.append(image_result)
276
+ latency_results.append(latency_result)
277
+ with gr.Row():
278
+ prompt = gr.Text(
279
+ label="Prompt", show_label=False, max_lines=1, placeholder="Enter your prompt", container=False, scale=4
280
+ )
281
+ run_button = gr.Button("Run", scale=1)
282
+
283
+ with gr.Row():
284
+ seed = gr.Slider(label="Seed", show_label=True, minimum=0, maximum=MAX_SEED, value=233, step=1, scale=4)
285
+ randomize_seed = gr.Button("Random Seed", scale=1, min_width=50, elem_id="random_seed")
286
+ with gr.Accordion("Advanced options", open=False):
287
+ with gr.Group():
288
+ height = gr.Slider(label="Height", minimum=256, maximum=4096, step=32, value=1024)
289
+ width = gr.Slider(label="Width", minimum=256, maximum=4096, step=32, value=1024)
290
+ with gr.Group():
291
+ num_inference_steps = gr.Slider(label="Sampling Steps", minimum=10, maximum=50, step=1, value=20)
292
+ guidance_scale = gr.Slider(label="Guidance Scale", minimum=1, maximum=10, step=0.1, value=5)
293
+
294
+ input_args = [prompt, height, width, num_inference_steps, guidance_scale, seed]
295
+
296
+ gr.Examples(examples=EXAMPLES, inputs=input_args, outputs=[*image_results, *latency_results], fn=generate)
297
+
298
+ gr.on(
299
+ triggers=[prompt.submit, run_button.click],
300
+ fn=generate,
301
+ inputs=input_args,
302
+ outputs=[*image_results, *latency_results],
303
+ api_name="run",
304
+ )
305
+ randomize_seed.click(
306
+ lambda: random.randint(0, MAX_SEED), inputs=[], outputs=seed, api_name=False, queue=False
307
+ ).then(fn=generate, inputs=input_args, outputs=[*image_results, *latency_results], api_name=False, queue=False)
308
+
309
+ gr.Markdown("MIT Accessibility: https://accessibility.mit.edu/", elem_id="accessibility")
310
+
311
+
312
+ if __name__ == "__main__":
313
+ demo.queue(max_size=20).launch(server_name="0.0.0.0", debug=True, share=True)
app/app_sana_controlnet_hed.py ADDED
@@ -0,0 +1,306 @@
1
+ # Adapted from https://github.com/GaParmar/img2img-turbo/blob/main/gradio_sketch2image.py
2
+ import argparse
3
+ import os
4
+ import random
5
+ import socket
6
+ import tempfile
7
+ import time
8
+
9
+ import gradio as gr
10
+ import numpy as np
11
+ import torch
12
+ from PIL import Image
13
+ from transformers import AutoModelForCausalLM, AutoTokenizer
14
+
15
+ from app import safety_check
16
+ from app.sana_controlnet_pipeline import SanaControlNetPipeline
17
+
18
+ STYLES = {
19
+ "None": "{prompt}",
20
+ "Cinematic": "cinematic still {prompt}. emotional, harmonious, vignette, highly detailed, high budget, bokeh, cinemascope, moody, epic, gorgeous, film grain, grainy",
21
+ "3D Model": "professional 3d model {prompt}. octane render, highly detailed, volumetric, dramatic lighting",
22
+ "Anime": "anime artwork {prompt}. anime style, key visual, vibrant, studio anime, highly detailed",
23
+ "Digital Art": "concept art {prompt}. digital artwork, illustrative, painterly, matte painting, highly detailed",
24
+ "Photographic": "cinematic photo {prompt}. 35mm photograph, film, bokeh, professional, 4k, highly detailed",
25
+ "Pixel art": "pixel-art {prompt}. low-res, blocky, pixel art style, 8-bit graphics",
26
+ "Fantasy art": "ethereal fantasy concept art of {prompt}. magnificent, celestial, ethereal, painterly, epic, majestic, magical, fantasy art, cover art, dreamy",
27
+ "Neonpunk": "neonpunk style {prompt}. cyberpunk, vaporwave, neon, vibes, vibrant, stunningly beautiful, crisp, detailed, sleek, ultramodern, magenta highlights, dark purple shadows, high contrast, cinematic, ultra detailed, intricate, professional",
28
+ "Manga": "manga style {prompt}. vibrant, high-energy, detailed, iconic, Japanese comic style",
29
+ }
30
+ DEFAULT_STYLE_NAME = "None"
31
+ STYLE_NAMES = list(STYLES.keys())
32
+
33
+ MAX_SEED = 1000000000
34
+ DEFAULT_SKETCH_GUIDANCE = 0.28
35
+ DEMO_PORT = int(os.getenv("DEMO_PORT", "15432"))
36
+
37
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
38
+
39
+ blank_image = Image.new("RGB", (1024, 1024), (255, 255, 255))
40
+
41
+
42
+ def get_args():
43
+ parser = argparse.ArgumentParser()
44
+ parser.add_argument("--config", type=str, help="config")
45
+ parser.add_argument(
46
+ "--model_path",
47
+ nargs="?",
48
+ default="hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth",
49
+ type=str,
50
+ help="Path to the model file (positional)",
51
+ )
52
+ parser.add_argument("--output", default="./", type=str)
53
+ parser.add_argument("--bs", default=1, type=int)
54
+ parser.add_argument("--image_size", default=1024, type=int)
55
+ parser.add_argument("--cfg_scale", default=5.0, type=float)
56
+ parser.add_argument("--pag_scale", default=2.0, type=float)
57
+ parser.add_argument("--seed", default=42, type=int)
58
+ parser.add_argument("--step", default=-1, type=int)
59
+ parser.add_argument("--custom_image_size", default=None, type=int)
60
+ parser.add_argument("--share", action="store_true")
61
+ parser.add_argument(
62
+ "--shield_model_path",
63
+ type=str,
64
+ help="The path to shield model, we employ ShieldGemma-2B by default.",
65
+ default="google/shieldgemma-2b",
66
+ )
67
+
68
+ return parser.parse_known_args()[0]
69
+
70
+
71
+ args = get_args()
72
+
73
+ if torch.cuda.is_available():
74
+ model_path = args.model_path
75
+ pipe = SanaControlNetPipeline(args.config)
76
+ pipe.from_pretrained(model_path)
77
+ pipe.register_progress_bar(gr.Progress())
78
+
79
+ # safety checker
80
+ safety_checker_tokenizer = AutoTokenizer.from_pretrained(args.shield_model_path)
81
+ safety_checker_model = AutoModelForCausalLM.from_pretrained(
82
+ args.shield_model_path,
83
+ device_map="auto",
84
+ torch_dtype=torch.bfloat16,
85
+ ).to(device)
86
+
87
+
88
+ def save_image(img):
89
+ if isinstance(img, dict):
90
+ img = img["composite"]
91
+ temp_file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
92
+ img.save(temp_file.name)
93
+ return temp_file.name
94
+
95
+
96
+ def norm_ip(img, low, high):
97
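+ # clamp to [low, high] and rescale in place to the [0, 1] range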
+ img.clamp_(min=low, max=high)
98
+ img.sub_(low).div_(max(high - low, 1e-5))
99
+ return img
100
+
101
+
102
+ @torch.no_grad()
103
+ @torch.inference_mode()
104
+ def run(
105
+ image,
106
+ prompt: str,
107
+ prompt_template: str,
108
+ sketch_thickness: int,
109
+ guidance_scale: float,
110
+ inference_steps: int,
111
+ seed: int,
112
+ blend_alpha: float,
113
+ ) -> tuple[Image.Image, str]:
114
+
115
+ print(f"Prompt: {prompt}")
116
+ image_numpy = np.array(image["composite"].convert("RGB"))
117
+
118
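+ # 3145628 ~= 1024*1024*3, i.e. the canvas is (almost) entirely white or entirely black and nothing was drawn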
+ if prompt.strip() == "" and (np.sum(image_numpy == 255) >= 3145628 or np.sum(image_numpy == 0) >= 3145628):
119
+ return blank_image, "Please input the prompt or draw something."
120
+
121
+ if safety_check.is_dangerous(safety_checker_tokenizer, safety_checker_model, prompt, threshold=0.2):
122
+ prompt = "A red heart."
123
+
124
+ prompt = prompt_template.format(prompt=prompt)
125
+ pipe.set_blend_alpha(blend_alpha)
126
+ start_time = time.time()
127
+ images = pipe(
128
+ prompt=prompt,
129
+ ref_image=image["composite"],
130
+ guidance_scale=guidance_scale,
131
+ num_inference_steps=inference_steps,
132
+ num_images_per_prompt=1,
133
+ sketch_thickness=sketch_thickness,
134
+ generator=torch.Generator(device=device).manual_seed(seed),
135
+ )
136
+
137
+ latency = time.time() - start_time
138
+
139
+ if latency < 1:
140
+ latency = latency * 1000
141
+ latency_str = f"{latency:.2f}ms"
142
+ else:
143
+ latency_str = f"{latency:.2f}s"
144
+ torch.cuda.empty_cache()
145
+
146
+ img = [
147
+ Image.fromarray(
148
+ norm_ip(img, -1, 1)
149
+ .mul(255)
150
+ .add_(0.5)
151
+ .clamp_(0, 255)
152
+ .permute(1, 2, 0)
153
+ .to("cpu", torch.uint8)
154
+ .numpy()
155
+ .astype(np.uint8)
156
+ )
157
+ for img in images
158
+ ]
159
+ img = img[0]
160
+ return img, latency_str
161
+
162
+
163
+ model_size = "1.6" if "1600M" in args.model_path else "0.6"
164
+ title = f"""
165
+ <div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
166
+ <img src="https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/main/asset/logo.png" width="50%" alt="logo"/>
167
+ </div>
168
+ """
169
+ DESCRIPTION = f"""
170
+ <p><span style="font-size: 36px; font-weight: bold;">Sana-ControlNet-{model_size}B</span><span style="font-size: 20px; font-weight: bold;">{args.image_size}px</span></p>
171
+ <p style="font-size: 18px; font-weight: bold;">Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer</p>
172
+ <p><span style="font-size: 16px;"><a href="https://arxiv.org/abs/2410.10629">[Paper]</a></span> <span style="font-size: 16px;"><a href="https://github.com/NVlabs/Sana">[Github]</a></span> <span style="font-size: 16px;"><a href="https://nvlabs.github.io/Sana">[Project]</a></span</p>
173
+ <p style="font-size: 18px; font-weight: bold;">Powered by <a href="https://hanlab.mit.edu/projects/dc-ae">DC-AE</a> with 32x latent space, </p>running on node {socket.gethostname()}.
174
+ <p style="font-size: 16px; font-weight: bold;">Unsafe word will give you a 'Red Heart' in the image instead.</p>
175
+ """
176
+ if model_size == "0.6":
177
+ DESCRIPTION += "\n<p>0.6B model's text rendering ability is limited.</p>"
178
+ if not torch.cuda.is_available():
179
+ DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"
180
+
181
+
182
+ with gr.Blocks(css_paths="asset/app_styles/controlnet_app_style.css", title=f"Sana Sketch-to-Image Demo") as demo:
183
+ gr.Markdown(title)
184
+ gr.HTML(DESCRIPTION)
185
+
186
+ with gr.Row(elem_id="main_row"):
187
+ with gr.Column(elem_id="column_input"):
188
+ gr.Markdown("## INPUT", elem_id="input_header")
189
+ with gr.Group():
190
+ canvas = gr.Sketchpad(
191
+ value=blank_image,
192
+ height=640,
193
+ image_mode="RGB",
194
+ sources=["upload", "clipboard"],
195
+ type="pil",
196
+ label="Sketch",
197
+ show_label=False,
198
+ show_download_button=True,
199
+ interactive=True,
200
+ transforms=[],
201
+ canvas_size=(1024, 1024),
202
+ scale=1,
203
+ brush=gr.Brush(default_size=3, colors=["#000000"], color_mode="fixed"),
204
+ format="png",
205
+ layers=False,
206
+ )
207
+ with gr.Row():
208
+ prompt = gr.Text(label="Prompt", placeholder="Enter your prompt", scale=6)
209
+ run_button = gr.Button("Run", scale=1, elem_id="run_button")
210
+ download_sketch = gr.DownloadButton("Download Sketch", scale=1, elem_id="download_sketch")
211
+ with gr.Row():
212
+ style = gr.Dropdown(label="Style", choices=STYLE_NAMES, value=DEFAULT_STYLE_NAME, scale=1)
213
+ prompt_template = gr.Textbox(
214
+ label="Prompt Style Template", value=STYLES[DEFAULT_STYLE_NAME], scale=2, max_lines=1
215
+ )
216
+
217
+ with gr.Row():
218
+ sketch_thickness = gr.Slider(
219
+ label="Sketch Thickness",
220
+ minimum=1,
221
+ maximum=4,
222
+ step=1,
223
+ value=2,
224
+ )
225
+ with gr.Row():
226
+ inference_steps = gr.Slider(
227
+ label="Sampling steps",
228
+ minimum=5,
229
+ maximum=40,
230
+ step=1,
231
+ value=20,
232
+ )
233
+ guidance_scale = gr.Slider(
234
+ label="CFG Guidance scale",
235
+ minimum=1,
236
+ maximum=10,
237
+ step=0.1,
238
+ value=4.5,
239
+ )
240
+ blend_alpha = gr.Slider(
241
+ label="Blend Alpha",
242
+ minimum=0,
243
+ maximum=1,
244
+ step=0.1,
245
+ value=0,
246
+ )
247
+ with gr.Row():
248
+ seed = gr.Slider(label="Seed", show_label=True, minimum=0, maximum=MAX_SEED, value=233, step=1, scale=4)
249
+ randomize_seed = gr.Button("Random Seed", scale=1, min_width=50, elem_id="random_seed")
250
+
251
+ with gr.Column(elem_id="column_output"):
252
+ gr.Markdown("## OUTPUT", elem_id="output_header")
253
+ with gr.Group():
254
+ result = gr.Image(
255
+ format="png",
256
+ height=640,
257
+ image_mode="RGB",
258
+ type="pil",
259
+ label="Result",
260
+ show_label=False,
261
+ show_download_button=True,
262
+ interactive=False,
263
+ elem_id="output_image",
264
+ )
265
+ latency_result = gr.Text(label="Inference Latency", show_label=True)
266
+
267
+ download_result = gr.DownloadButton("Download Result", elem_id="download_result")
268
+ gr.Markdown("### Instructions")
269
+ gr.Markdown("**1**. Enter a text prompt (e.g. a cat)")
270
+ gr.Markdown("**2**. Start sketching or upload a reference image")
271
+ gr.Markdown("**3**. Change the image style using a style template")
272
+ gr.Markdown("**4**. Try different seeds to generate different results")
273
+
274
+ run_inputs = [canvas, prompt, prompt_template, sketch_thickness, guidance_scale, inference_steps, seed, blend_alpha]
275
+ run_outputs = [result, latency_result]
276
+
277
+ randomize_seed.click(
278
+ lambda: random.randint(0, MAX_SEED),
279
+ inputs=[],
280
+ outputs=seed,
281
+ api_name=False,
282
+ queue=False,
283
+ ).then(run, inputs=run_inputs, outputs=run_outputs, api_name=False)
284
+
285
+ style.change(
286
+ lambda x: STYLES[x],
287
+ inputs=[style],
288
+ outputs=[prompt_template],
289
+ api_name=False,
290
+ queue=False,
291
+ ).then(fn=run, inputs=run_inputs, outputs=run_outputs, api_name=False)
292
+ gr.on(
293
+ triggers=[prompt.submit, run_button.click, canvas.change],
294
+ fn=run,
295
+ inputs=run_inputs,
296
+ outputs=run_outputs,
297
+ api_name=False,
298
+ )
299
+
300
+ download_sketch.click(fn=save_image, inputs=canvas, outputs=download_sketch)
301
+ download_result.click(fn=save_image, inputs=result, outputs=download_result)
302
+ gr.Markdown("MIT Accessibility: https://accessibility.mit.edu/", elem_id="accessibility")
303
+
304
+
305
+ if __name__ == "__main__":
306
+ demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=DEMO_PORT, debug=False, share=args.share)
app/app_sana_multithread.py ADDED
@@ -0,0 +1,565 @@
1
+ #!/usr/bin/env python
2
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ #
16
+ # SPDX-License-Identifier: Apache-2.0
17
+ from __future__ import annotations
18
+
19
+ import argparse
20
+ import os
21
+ import random
22
+ import uuid
23
+ from datetime import datetime
24
+
25
+ import gradio as gr
26
+ import numpy as np
27
+ import spaces
28
+ import torch
29
+ from diffusers import FluxPipeline
30
+ from PIL import Image
31
+ from torchvision.utils import make_grid, save_image
32
+ from transformers import AutoModelForCausalLM, AutoTokenizer
33
+
34
+ from app import safety_check
35
+ from app.sana_pipeline import SanaPipeline
36
+
37
+ MAX_SEED = np.iinfo(np.int32).max
38
+ CACHE_EXAMPLES = torch.cuda.is_available() and os.getenv("CACHE_EXAMPLES", "1") == "1"
39
+ MAX_IMAGE_SIZE = int(os.getenv("MAX_IMAGE_SIZE", "4096"))
40
+ USE_TORCH_COMPILE = os.getenv("USE_TORCH_COMPILE", "0") == "1"
41
+ ENABLE_CPU_OFFLOAD = os.getenv("ENABLE_CPU_OFFLOAD", "0") == "1"
42
+ DEMO_PORT = int(os.getenv("DEMO_PORT", "15432"))
43
+ os.environ["GRADIO_EXAMPLES_CACHE"] = "./.gradio/cache"
44
+
45
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
46
+
47
+ style_list = [
48
+ {
49
+ "name": "(No style)",
50
+ "prompt": "{prompt}",
51
+ "negative_prompt": "",
52
+ },
53
+ {
54
+ "name": "Cinematic",
55
+ "prompt": "cinematic still {prompt} . emotional, harmonious, vignette, highly detailed, high budget, bokeh, "
56
+ "cinemascope, moody, epic, gorgeous, film grain, grainy",
57
+ "negative_prompt": "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured",
58
+ },
59
+ {
60
+ "name": "Photographic",
61
+ "prompt": "cinematic photo {prompt} . 35mm photograph, film, bokeh, professional, 4k, highly detailed",
62
+ "negative_prompt": "drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, ugly",
63
+ },
64
+ {
65
+ "name": "Anime",
66
+ "prompt": "anime artwork {prompt} . anime style, key visual, vibrant, studio anime, highly detailed",
67
+ "negative_prompt": "photo, deformed, black and white, realism, disfigured, low contrast",
68
+ },
69
+ {
70
+ "name": "Manga",
71
+ "prompt": "manga style {prompt} . vibrant, high-energy, detailed, iconic, Japanese comic style",
72
+ "negative_prompt": "ugly, deformed, noisy, blurry, low contrast, realism, photorealistic, Western comic style",
73
+ },
74
+ {
75
+ "name": "Digital Art",
76
+ "prompt": "concept art {prompt} . digital artwork, illustrative, painterly, matte painting, highly detailed",
77
+ "negative_prompt": "photo, photorealistic, realism, ugly",
78
+ },
79
+ {
80
+ "name": "Pixel art",
81
+ "prompt": "pixel-art {prompt} . low-res, blocky, pixel art style, 8-bit graphics",
82
+ "negative_prompt": "sloppy, messy, blurry, noisy, highly detailed, ultra textured, photo, realistic",
83
+ },
84
+ {
85
+ "name": "Fantasy art",
86
+ "prompt": "ethereal fantasy concept art of {prompt} . magnificent, celestial, ethereal, painterly, epic, "
87
+ "majestic, magical, fantasy art, cover art, dreamy",
88
+ "negative_prompt": "photographic, realistic, realism, 35mm film, dslr, cropped, frame, text, deformed, "
89
+ "glitch, noise, noisy, off-center, deformed, cross-eyed, closed eyes, bad anatomy, ugly, "
90
+ "disfigured, sloppy, duplicate, mutated, black and white",
91
+ },
92
+ {
93
+ "name": "Neonpunk",
94
+ "prompt": "neonpunk style {prompt} . cyberpunk, vaporwave, neon, vibes, vibrant, stunningly beautiful, crisp, "
95
+ "detailed, sleek, ultramodern, magenta highlights, dark purple shadows, high contrast, cinematic, "
96
+ "ultra detailed, intricate, professional",
97
+ "negative_prompt": "painting, drawing, illustration, glitch, deformed, mutated, cross-eyed, ugly, disfigured",
98
+ },
99
+ {
100
+ "name": "3D Model",
101
+ "prompt": "professional 3d model {prompt} . octane render, highly detailed, volumetric, dramatic lighting",
102
+ "negative_prompt": "ugly, deformed, noisy, low poly, blurry, painting",
103
+ },
104
+ ]
105
+
106
+ styles = {k["name"]: (k["prompt"], k["negative_prompt"]) for k in style_list}
107
+ STYLE_NAMES = list(styles.keys())
108
+ DEFAULT_STYLE_NAME = "(No style)"
109
+ SCHEDULE_NAME = ["Flow_DPM_Solver"]
110
+ DEFAULT_SCHEDULE_NAME = "Flow_DPM_Solver"
111
+ NUM_IMAGES_PER_PROMPT = 1
112
+ TEST_TIMES = 0
113
+ FILENAME = f"output/port{DEMO_PORT}_inference_count.txt"
114
+
115
+
116
+ def set_env(seed=0):
117
+ torch.manual_seed(seed)
118
+ torch.set_grad_enabled(False)
119
+
120
+
121
+ def read_inference_count():
122
+ global TEST_TIMES
123
+ try:
124
+ with open(FILENAME) as f:
125
+ count = int(f.read().strip())
126
+ except FileNotFoundError:
127
+ count = 0
128
+ TEST_TIMES = count
129
+
130
+ return count
131
+
132
+
133
+ def write_inference_count(count):
134
+ with open(FILENAME, "w") as f:
135
+ f.write(str(count))
136
+
137
+
138
+ def run_inference(num_imgs=1):
139
+ TEST_TIMES = read_inference_count()
140
+ TEST_TIMES += int(num_imgs)
141
+ write_inference_count(TEST_TIMES)
142
+
143
+ return (
144
+ f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: "
145
+ f"16px; color:red; font-weight: bold;'>{TEST_TIMES}</span>"
146
+ )
147
+
148
+
149
+ def update_inference_count():
150
+ count = read_inference_count()
151
+ return (
152
+ f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: "
153
+ f"16px; color:red; font-weight: bold;'>{count}</span>"
154
+ )
155
+
156
+
157
+ def apply_style(style_name: str, positive: str, negative: str = "") -> tuple[str, str]:
158
+ p, n = styles.get(style_name, styles[DEFAULT_STYLE_NAME])
159
+ if not negative:
160
+ negative = ""
161
+ return p.replace("{prompt}", positive), n + negative
162
+
163
+
164
+ def get_args():
165
+ parser = argparse.ArgumentParser()
166
+ parser.add_argument("--config", type=str, help="config")
167
+ parser.add_argument(
168
+ "--model_path",
169
+ nargs="?",
170
+ default="output/Sana_D20/SANA.pth",
171
+ type=str,
172
+ help="Path to the model file (positional)",
173
+ )
174
+ parser.add_argument("--output", default="./", type=str)
175
+ parser.add_argument("--bs", default=1, type=int)
176
+ parser.add_argument("--image_size", default=1024, type=int)
177
+ parser.add_argument("--cfg_scale", default=5.0, type=float)
178
+ parser.add_argument("--pag_scale", default=2.0, type=float)
179
+ parser.add_argument("--seed", default=42, type=int)
180
+ parser.add_argument("--step", default=-1, type=int)
181
+ parser.add_argument("--custom_image_size", default=None, type=int)
182
+ parser.add_argument(
183
+ "--shield_model_path",
184
+ type=str,
185
+ help="The path to shield model, we employ ShieldGemma-2B by default.",
186
+ default="google/shieldgemma-2b",
187
+ )
188
+
189
+ return parser.parse_args()
190
+
191
+
192
+ args = get_args()
193
+
194
+ if torch.cuda.is_available():
195
+ weight_dtype = torch.float16
196
+ model_path = args.model_path
197
+ pipe = SanaPipeline(args.config)
198
+ pipe.from_pretrained(model_path)
199
+ pipe.register_progress_bar(gr.Progress())
200
+
201
+ repo_name = "black-forest-labs/FLUX.1-dev"
202
+ pipe2 = FluxPipeline.from_pretrained(repo_name, torch_dtype=torch.float16).to("cuda")
203
+
204
+ # safety checker
205
+ safety_checker_tokenizer = AutoTokenizer.from_pretrained(args.shield_model_path)
206
+ safety_checker_model = AutoModelForCausalLM.from_pretrained(
207
+ args.shield_model_path,
208
+ device_map="auto",
209
+ torch_dtype=torch.bfloat16,
210
+ ).to(device)
211
+
212
+ set_env(42)
213
+
214
+
215
+ def save_image_sana(img, seed="", save_img=False):
216
+ unique_name = f"{str(uuid.uuid4())}_{seed}.png"
217
+ save_path = os.path.join(f"output/online_demo_img/{datetime.now().date()}")
218
+ os.umask(0o000) # file permission: 666; dir permission: 777
219
+ os.makedirs(save_path, exist_ok=True)
220
+ unique_name = os.path.join(save_path, unique_name)
221
+ if save_img:
222
+ save_image(img, unique_name, nrow=1, normalize=True, value_range=(-1, 1))
223
+
224
+ return unique_name
225
+
226
+
227
+ def randomize_seed_fn(seed: int, randomize_seed: bool) -> int:
228
+ if randomize_seed:
229
+ seed = random.randint(0, MAX_SEED)
230
+ return seed
231
+
232
+
233
+ @spaces.GPU(enable_queue=True)
234
+ async def generate_2(
235
+ prompt: str = None,
236
+ negative_prompt: str = "",
237
+ style: str = DEFAULT_STYLE_NAME,
238
+ use_negative_prompt: bool = False,
239
+ num_imgs: int = 1,
240
+ seed: int = 0,
241
+ height: int = 1024,
242
+ width: int = 1024,
243
+ flow_dpms_guidance_scale: float = 5.0,
244
+ flow_dpms_pag_guidance_scale: float = 2.0,
245
+ flow_dpms_inference_steps: int = 20,
246
+ randomize_seed: bool = False,
247
+ ):
248
+ seed = int(randomize_seed_fn(seed, randomize_seed))
249
+ generator = torch.Generator(device=device).manual_seed(seed)
250
+ print(f"PORT: {DEMO_PORT}, model_path: {model_path}")
251
+ if safety_check.is_dangerous(safety_checker_tokenizer, safety_checker_model, prompt):
252
+ prompt = "A red heart."
253
+
254
+ print(prompt)
255
+
256
+ if not use_negative_prompt:
257
+ negative_prompt = None # type: ignore
258
+ prompt, negative_prompt = apply_style(style, prompt, negative_prompt)
259
+
260
+ with torch.no_grad():
261
+ images = pipe2(
262
+ prompt=prompt,
263
+ height=height,
264
+ width=width,
265
+ guidance_scale=3.5,
266
+ num_inference_steps=50,
267
+ num_images_per_prompt=num_imgs,
268
+ max_sequence_length=256,
269
+ generator=generator,
270
+ ).images
271
+
272
+ save_img = False
273
+ img = images
274
+ if save_img:
275
+ img = [save_image_sana(img, seed, save_img=save_img) for img in images]
276
+ print(img)
277
+ torch.cuda.empty_cache()
278
+
279
+ return img
280
+
281
+
282
+ @spaces.GPU(enable_queue=True)
283
+ async def generate(
284
+ prompt: str = None,
285
+ negative_prompt: str = "",
286
+ style: str = DEFAULT_STYLE_NAME,
287
+ use_negative_prompt: bool = False,
288
+ num_imgs: int = 1,
289
+ seed: int = 0,
290
+ height: int = 1024,
291
+ width: int = 1024,
292
+ flow_dpms_guidance_scale: float = 5.0,
293
+ flow_dpms_pag_guidance_scale: float = 2.0,
294
+ flow_dpms_inference_steps: int = 20,
295
+ randomize_seed: bool = False,
296
+ ):
297
+ global TEST_TIMES
298
+ # seed = 823753551
299
+ seed = int(randomize_seed_fn(seed, randomize_seed))
300
+ generator = torch.Generator(device=device).manual_seed(seed)
301
+ print(f"PORT: {DEMO_PORT}, model_path: {model_path}, time_times: {TEST_TIMES}")
302
+ if safety_check.is_dangerous(safety_checker_tokenizer, safety_checker_model, prompt):
303
+ prompt = "A red heart."
304
+
305
+ print(prompt)
306
+
307
+ num_inference_steps = flow_dpms_inference_steps
308
+ guidance_scale = flow_dpms_guidance_scale
309
+ pag_guidance_scale = flow_dpms_pag_guidance_scale
310
+
311
+ if not use_negative_prompt:
312
+ negative_prompt = None # type: ignore
313
+ prompt, negative_prompt = apply_style(style, prompt, negative_prompt)
314
+
315
+ pipe.progress_fn(0, desc="Sana Start")
316
+
317
+ with torch.no_grad():
318
+ images = pipe(
319
+ prompt=prompt,
320
+ height=height,
321
+ width=width,
322
+ negative_prompt=negative_prompt,
323
+ guidance_scale=guidance_scale,
324
+ pag_guidance_scale=pag_guidance_scale,
325
+ num_inference_steps=num_inference_steps,
326
+ num_images_per_prompt=num_imgs,
327
+ generator=generator,
328
+ )
329
+
330
+ pipe.progress_fn(1.0, desc="Sana End")
331
+
332
+ save_img = False
333
+ if save_img:
334
+ img = [save_image_sana(img, seed, save_img=save_img) for img in images]
335
+ print(img)
336
+ else:
337
+ if num_imgs > 1:
338
+ nrow = 2
339
+ else:
340
+ nrow = 1
341
+ img = make_grid(images, nrow=nrow, normalize=True, value_range=(-1, 1))
342
+ img = img.mul(255).add_(0.5).clamp_(0, 255).permute(1, 2, 0).to("cpu", torch.uint8).numpy()
343
+ img = [Image.fromarray(img.astype(np.uint8))]
344
+
345
+ torch.cuda.empty_cache()
346
+
347
+ return img
348
+
349
+
350
+ TEST_TIMES = read_inference_count()
351
+ model_size = "1.6" if "D20" in args.model_path else "0.6"
352
+ title = f"""
353
+ <div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
354
+ <img src="https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/main/asset/logo.png" width="50%" alt="logo"/>
355
+ </div>
356
+ """
357
+ DESCRIPTION = f"""
358
+ <p><span style="font-size: 36px; font-weight: bold;">Sana-{model_size}B</span><span style="font-size: 20px; font-weight: bold;">{args.image_size}px</span></p>
359
+ <p style="font-size: 16px; font-weight: bold;">Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer</p>
360
+ <p><span style="font-size: 16px;"><a href="https://arxiv.org/abs/2410.10629">[Paper]</a></span> <span style="font-size: 16px;"><a href="https://github.com/NVlabs/Sana">[Github]</a></span> <span style="font-size: 16px;"><a href="https://nvlabs.github.io/Sana">[Project]</a></span</p>
361
+ <p style="font-size: 16px; font-weight: bold;">Powered by <a href="https://hanlab.mit.edu/projects/dc-ae">DC-AE</a> with 32x latent space</p>
362
+ <p style="font-size: 16px; font-weight: bold;">Unsafe word will give you a 'Red Heart' in the image instead.</p>
363
+ """
364
+ if model_size == "0.6":
365
+ DESCRIPTION += "\n<p>0.6B model's text rendering ability is limited.</p>"
366
+ if not torch.cuda.is_available():
367
+ DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"
368
+
369
+ examples = [
370
+ 'a cyberpunk cat with a neon sign that says "Sana"',
371
+ "A very detailed and realistic full body photo set of a tall, slim, and athletic Shiba Inu in a white oversized straight t-shirt, white shorts, and short white shoes.",
372
+ "Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, volumetric lighting, spectacular, ambient lights, light pollution, cinematic atmosphere, art nouveau style, illustration art artwork by SenseiJaye, intricate detail.",
373
+ "portrait photo of a girl, photograph, highly detailed face, depth of field",
374
+ 'make me a logo that says "So Fast" with a really cool flying dragon shape with lightning sparks all over the sides and all of it contains Indonesian language',
375
+ "🐶 Wearing 🕶 flying on the 🌈",
376
+ # "👧 with 🌹 in the ❄️",
377
+ # "an old rusted robot wearing pants and a jacket riding skis in a supermarket.",
378
+ # "professional portrait photo of an anthropomorphic cat wearing fancy gentleman hat and jacket walking in autumn forest.",
379
+ # "Astronaut in a jungle, cold color palette, muted colors, detailed",
380
+ # "a stunning and luxurious bedroom carved into a rocky mountainside seamlessly blending nature with modern design with a plush earth-toned bed textured stone walls circular fireplace massive uniquely shaped window framing snow-capped mountains dense forests",
381
+ ]
382
+
383
+ css = """
384
+ .gradio-container{max-width: 1024px !important}
385
+ h1{text-align:center}
386
+ """
387
+ with gr.Blocks(css=css) as demo:
388
+ gr.Markdown(title)
389
+ gr.Markdown(DESCRIPTION)
390
+ gr.DuplicateButton(
391
+ value="Duplicate Space for private use",
392
+ elem_id="duplicate-button",
393
+ visible=os.getenv("SHOW_DUPLICATE_BUTTON") == "1",
394
+ )
395
+ info_box = gr.Markdown(
396
+ value=f"<span style='font-size: 16px; font-weight: bold;'>Total inference runs: </span><span style='font-size: 16px; color:red; font-weight: bold;'>{read_inference_count()}</span>"
397
+ )
398
+ demo.load(fn=update_inference_count, outputs=info_box) # update the value when re-loading the page
399
+ # with gr.Row(equal_height=False):
400
+ with gr.Group():
401
+ with gr.Row():
402
+ prompt = gr.Text(
403
+ label="Prompt",
404
+ show_label=False,
405
+ max_lines=1,
406
+ placeholder="Enter your prompt",
407
+ container=False,
408
+ )
409
+ run_button = gr.Button("Run-sana", scale=0)
410
+ run_button2 = gr.Button("Run-flux", scale=0)
411
+
412
+ with gr.Row():
413
+ result = gr.Gallery(label="Result from Sana", show_label=True, columns=NUM_IMAGES_PER_PROMPT, format="webp")
414
+ result_2 = gr.Gallery(
415
+ label="Result from FLUX", show_label=True, columns=NUM_IMAGES_PER_PROMPT, format="webp"
416
+ )
417
+
418
+ with gr.Accordion("Advanced options", open=False):
419
+ with gr.Group():
420
+ with gr.Row(visible=True):
421
+ height = gr.Slider(
422
+ label="Height",
423
+ minimum=256,
424
+ maximum=MAX_IMAGE_SIZE,
425
+ step=32,
426
+ value=1024,
427
+ )
428
+ width = gr.Slider(
429
+ label="Width",
430
+ minimum=256,
431
+ maximum=MAX_IMAGE_SIZE,
432
+ step=32,
433
+ value=1024,
434
+ )
435
+ with gr.Row():
436
+ flow_dpms_inference_steps = gr.Slider(
437
+ label="Sampling steps",
438
+ minimum=5,
439
+ maximum=40,
440
+ step=1,
441
+ value=18,
442
+ )
443
+ flow_dpms_guidance_scale = gr.Slider(
444
+ label="CFG Guidance scale",
445
+ minimum=1,
446
+ maximum=10,
447
+ step=0.1,
448
+ value=5.0,
449
+ )
450
+ flow_dpms_pag_guidance_scale = gr.Slider(
451
+ label="PAG Guidance scale",
452
+ minimum=1,
453
+ maximum=4,
454
+ step=0.5,
455
+ value=2.0,
456
+ )
457
+ with gr.Row():
458
+ use_negative_prompt = gr.Checkbox(label="Use negative prompt", value=False, visible=True)
459
+ negative_prompt = gr.Text(
460
+ label="Negative prompt",
461
+ max_lines=1,
462
+ placeholder="Enter a negative prompt",
463
+ visible=True,
464
+ )
465
+ style_selection = gr.Radio(
466
+ show_label=True,
467
+ container=True,
468
+ interactive=True,
469
+ choices=STYLE_NAMES,
470
+ value=DEFAULT_STYLE_NAME,
471
+ label="Image Style",
472
+ )
473
+ seed = gr.Slider(
474
+ label="Seed",
475
+ minimum=0,
476
+ maximum=MAX_SEED,
477
+ step=1,
478
+ value=0,
479
+ )
480
+ randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
481
+ with gr.Row(visible=True):
482
+ schedule = gr.Radio(
483
+ show_label=True,
484
+ container=True,
485
+ interactive=True,
486
+ choices=SCHEDULE_NAME,
487
+ value=DEFAULT_SCHEDULE_NAME,
488
+ label="Sampler Schedule",
489
+ visible=True,
490
+ )
491
+ num_imgs = gr.Slider(
492
+ label="Num Images",
493
+ minimum=1,
494
+ maximum=6,
495
+ step=1,
496
+ value=1,
497
+ )
498
+
499
+ run_button.click(fn=run_inference, inputs=num_imgs, outputs=info_box)
500
+
501
+ gr.Examples(
502
+ examples=examples,
503
+ inputs=prompt,
504
+ outputs=[result],
505
+ fn=generate,
506
+ cache_examples=CACHE_EXAMPLES,
507
+ )
508
+ gr.Examples(
509
+ examples=examples,
510
+ inputs=prompt,
511
+ outputs=[result_2],
512
+ fn=generate_2,
513
+ cache_examples=CACHE_EXAMPLES,
514
+ )
515
+
516
+ use_negative_prompt.change(
517
+ fn=lambda x: gr.update(visible=x),
518
+ inputs=use_negative_prompt,
519
+ outputs=negative_prompt,
520
+ api_name=False,
521
+ )
522
+
523
+ run_button.click(
524
+ fn=generate,
525
+ inputs=[
526
+ prompt,
527
+ negative_prompt,
528
+ style_selection,
529
+ use_negative_prompt,
530
+ num_imgs,
531
+ seed,
532
+ height,
533
+ width,
534
+ flow_dpms_guidance_scale,
535
+ flow_dpms_pag_guidance_scale,
536
+ flow_dpms_inference_steps,
537
+ randomize_seed,
538
+ ],
539
+ outputs=[result],
540
+ queue=True,
541
+ )
542
+
543
+ run_button2.click(
544
+ fn=generate_2,
545
+ inputs=[
546
+ prompt,
547
+ negative_prompt,
548
+ style_selection,
549
+ use_negative_prompt,
550
+ num_imgs,
551
+ seed,
552
+ height,
553
+ width,
554
+ flow_dpms_guidance_scale,
555
+ flow_dpms_pag_guidance_scale,
556
+ flow_dpms_inference_steps,
557
+ randomize_seed,
558
+ ],
559
+ outputs=[result_2],
560
+ queue=True,
561
+ )
562
+
563
+
564
+ if __name__ == "__main__":
565
+ demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=DEMO_PORT, debug=True, share=True)
app/safety_check.py ADDED
@@ -0,0 +1,72 @@
1
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ #
15
+ # SPDX-License-Identifier: Apache-2.0
16
+ # This file is copied from https://github.com/mit-han-lab/hart/blob/main/hart/utils/safety_check.py
17
+ import json
18
+ from functools import lru_cache
19
+ from typing import Dict, Sequence
20
+
21
+ import torch
22
+ import transformers
23
+ from torch.nn.functional import softmax
24
+ from transformers import AutoModelForCausalLM, AutoTokenizer
25
+
26
+ __all__ = ["is_dangerous"]
27
+
28
+ __rank, __local_rank, __world_size, __device = (
29
+ 0,
30
+ 0,
31
+ 1,
32
+ "cuda" if torch.cuda.is_available() else "cpu",
33
+ )
34
+
35
+
36
+ safety_policy = """
37
+ The prompt must not seek to generate harmful, abusive, hateful, sexually explicit, or otherwise inappropriate content targeting individuals or protected groups.
38
+ """
39
+
40
+
41
+ def is_dangerous(tokenizer, model, user_prompt, threshold=0.5):
42
+ prompt = f"""You are a policy expert trying to help determine whether a user
43
+ prompt is in violation of the defined safety policies.
44
+
45
+ <start_of_turn>
46
+ Human Question: {user_prompt.strip()}
47
+ <end_of_turn>
48
+
49
+ Our safety principle is defined in the below:
50
+
51
+ {safety_policy.strip()}
52
+
53
+ Does the human question violate the above principle? Your answer must start
54
+ with 'Yes' or 'No'. And then walk through step by step to be sure we answer
55
+ correctly.
56
+ """
57
+
58
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
59
+ with torch.no_grad():
60
+ logits = model(**inputs).logits
61
+
62
+ # Extract the logits for the Yes and No tokens
63
+ vocab = tokenizer.get_vocab()
64
+ selected_logits = logits[0, -1, [vocab["Yes"], vocab["No"]]]
65
+
66
+ # Convert these logits to a probability with softmax
67
+ probabilities = softmax(selected_logits, dim=0)
68
+
69
+ # Return probability of 'Yes'
70
+ score = probabilities[0].item()
71
+
72
+ return score > threshold
app/sana_controlnet_pipeline.py ADDED
@@ -0,0 +1,353 @@
1
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ #
15
+ # SPDX-License-Identifier: Apache-2.0
16
+ import warnings
17
+ from dataclasses import dataclass, field
18
+ from typing import Optional, Tuple
19
+
20
+ import cv2  # used below when ref_image is passed as a file path
+ import numpy as np
21
+ import pyrallis
22
+ import torch
23
+ import torch.nn as nn
24
+ from PIL import Image
25
+
26
+ warnings.filterwarnings("ignore") # ignore warning
27
+
28
+
29
+ from diffusion import DPMS, FlowEuler
30
+ from diffusion.data.datasets.utils import (
31
+ ASPECT_RATIO_512_TEST,
32
+ ASPECT_RATIO_1024_TEST,
33
+ ASPECT_RATIO_2048_TEST,
34
+ ASPECT_RATIO_4096_TEST,
35
+ )
36
+ from diffusion.model.builder import build_model, get_tokenizer_and_text_encoder, get_vae, vae_decode, vae_encode
37
+ from diffusion.model.utils import get_weight_dtype, prepare_prompt_ar, resize_and_crop_tensor
38
+ from diffusion.utils.config import SanaConfig, model_init_config
39
+ from diffusion.utils.logger import get_root_logger
40
+ from tools.controlnet.utils import get_scribble_map, transform_control_signal
41
+ from tools.download import find_model
42
+
43
+
44
+ def guidance_type_select(default_guidance_type, pag_scale, attn_type):
45
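+ # PAG (perturbed-attention guidance) is only used when pag_scale > 1 and the model uses linear attention; otherwise fall back to plain classifier-free guidance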
+ guidance_type = default_guidance_type
46
+ if not (pag_scale > 1.0 and attn_type == "linear"):
47
+ guidance_type = "classifier-free"
48
+ elif pag_scale > 1.0 and attn_type == "linear":
49
+ guidance_type = "classifier-free_PAG"
50
+ return guidance_type
51
+
52
+
53
+ def classify_height_width_bin(height: int, width: int, ratios: dict) -> Tuple[int, int]:
54
+ """Returns binned height and width."""
55
+ ar = float(height / width)
56
+ closest_ratio = min(ratios.keys(), key=lambda ratio: abs(float(ratio) - ar))
57
+ default_hw = ratios[closest_ratio]
58
+ return int(default_hw[0]), int(default_hw[1])
59
+
60
+
61
+ def get_ar_from_ref_image(ref_image):
62
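+ # return the reference image's aspect ratio as a reduced "h:w" string, e.g. 768x1024 -> "3:4"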
+ def reduce_ratio(h, w):
63
+ def gcd(a, b):
64
+ while b:
65
+ a, b = b, a % b
66
+ return a
67
+
68
+ divisor = gcd(h, w)
69
+ return f"{h // divisor}:{w // divisor}"
70
+
71
+ if isinstance(ref_image, str):
72
+ ref_image = Image.open(ref_image)
73
+ w, h = ref_image.size
74
+ return reduce_ratio(h, w)
75
+
76
+
77
+ @dataclass
78
+ class SanaControlNetInference(SanaConfig):
79
+ config: Optional[str] = "configs/sana_config/1024ms/Sana_1600M_img1024.yaml" # config
80
+ model_path: str = field(
81
+ default="output/Sana_D20/SANA.pth", metadata={"help": "Path to the model file (positional)"}
82
+ )
83
+ output: str = "./output"
84
+ bs: int = 1
85
+ image_size: int = 1024
86
+ cfg_scale: float = 5.0
87
+ pag_scale: float = 2.0
88
+ seed: int = 42
89
+ step: int = -1
90
+ custom_image_size: Optional[int] = None
91
+ shield_model_path: str = field(
92
+ default="google/shieldgemma-2b",
93
+ metadata={"help": "The path to shield model, we employ ShieldGemma-2B by default."},
94
+ )
95
+
96
+
97
+ class SanaControlNetPipeline(nn.Module):
98
+ def __init__(
99
+ self,
100
+ config: Optional[str] = "configs/sana_config/1024ms/Sana_1600M_img1024.yaml",
101
+ ):
102
+ super().__init__()
103
+ config = pyrallis.load(SanaControlNetInference, open(config))
104
+ self.args = self.config = config
105
+
106
+ # set some hyper-parameters
107
+ self.image_size = self.config.model.image_size
108
+
109
+ self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
110
+ logger = get_root_logger()
111
+ self.logger = logger
112
+ self.progress_fn = lambda progress, desc: None
113
+ self.thickness = 2
114
+ self.blend_alpha = 0.0
115
+
116
+ self.latent_size = self.image_size // config.vae.vae_downsample_rate
117
+ self.max_sequence_length = config.text_encoder.model_max_length
118
+ self.flow_shift = config.scheduler.flow_shift
119
+ guidance_type = "classifier-free_PAG"
120
+
121
+ weight_dtype = get_weight_dtype(config.model.mixed_precision)
122
+ self.weight_dtype = weight_dtype
123
+ self.vae_dtype = get_weight_dtype(config.vae.weight_dtype)
124
+
125
+ self.base_ratios = eval(f"ASPECT_RATIO_{self.image_size}_TEST")
126
+ self.vis_sampler = self.config.scheduler.vis_sampler
127
+ logger.info(f"Sampler {self.vis_sampler}, flow_shift: {self.flow_shift}")
128
+ self.guidance_type = guidance_type_select(guidance_type, self.args.pag_scale, config.model.attn_type)
129
+ logger.info(f"Inference with {self.weight_dtype}, PAG guidance layer: {self.config.model.pag_applied_layers}")
130
+
131
+ # 1. build vae and text encoder
132
+ self.vae = self.build_vae(config.vae)
133
+ self.tokenizer, self.text_encoder = self.build_text_encoder(config.text_encoder)
134
+
135
+ # 2. build Sana model
136
+ self.model = self.build_sana_model(config).to(self.device)
137
+
138
+ # 3. pre-compute null embedding
139
+ with torch.no_grad():
140
+ null_caption_token = self.tokenizer(
141
+ "", max_length=self.max_sequence_length, padding="max_length", truncation=True, return_tensors="pt"
142
+ ).to(self.device)
143
+ self.null_caption_embs = self.text_encoder(null_caption_token.input_ids, null_caption_token.attention_mask)[
144
+ 0
145
+ ]
146
+
147
+ def build_vae(self, config):
148
+ vae = get_vae(config.vae_type, config.vae_pretrained, self.device).to(self.vae_dtype)
149
+ return vae
150
+
151
+ def build_text_encoder(self, config):
152
+ tokenizer, text_encoder = get_tokenizer_and_text_encoder(name=config.text_encoder_name, device=self.device)
153
+ return tokenizer, text_encoder
154
+
155
+ def build_sana_model(self, config):
156
+ # model setting
157
+ model_kwargs = model_init_config(config, latent_size=self.latent_size)
158
+ model = build_model(
159
+ config.model.model,
160
+ use_fp32_attention=config.model.get("fp32_attention", False) and config.model.mixed_precision != "bf16",
161
+ **model_kwargs,
162
+ )
163
+ self.logger.info(f"use_fp32_attention: {model.fp32_attention}")
164
+ self.logger.info(
165
+ f"{model.__class__.__name__}:{config.model.model},"
166
+ f"Model Parameters: {sum(p.numel() for p in model.parameters()):,}"
167
+ )
168
+ return model
169
+
170
+ def from_pretrained(self, model_path):
171
+ state_dict = find_model(model_path)
172
+ state_dict = state_dict.get("state_dict", state_dict)
173
+ if "pos_embed" in state_dict:
174
+ del state_dict["pos_embed"]
175
+ missing, unexpected = self.model.load_state_dict(state_dict, strict=False)
176
+ self.model.eval().to(self.weight_dtype)
177
+
178
+ self.logger.info("Generating sample from ckpt: %s" % model_path)
179
+ self.logger.warning(f"Missing keys: {missing}")
180
+ self.logger.warning(f"Unexpected keys: {unexpected}")
181
+
182
+ def register_progress_bar(self, progress_fn=None):
183
+ self.progress_fn = progress_fn if progress_fn is not None else self.progress_fn
184
+
185
+ def set_blend_alpha(self, blend_alpha):
186
+ self.blend_alpha = blend_alpha
187
+
188
+ @torch.inference_mode()
189
+ def forward(
190
+ self,
191
+ prompt=None,
192
+ ref_image=None,
193
+ negative_prompt="",
194
+ num_inference_steps=20,
195
+ guidance_scale=5,
196
+ pag_guidance_scale=2.5,
197
+ num_images_per_prompt=1,
198
+ sketch_thickness=2,
199
+ generator=torch.Generator().manual_seed(42),
200
+ latents=None,
201
+ ):
202
+ self.ori_height, self.ori_width = ref_image.height, ref_image.width
203
+ self.guidance_type = guidance_type_select(self.guidance_type, pag_guidance_scale, self.config.model.attn_type)
204
+
205
+ # 1. pre-compute negative embedding
206
+ if negative_prompt != "":
207
+ null_caption_token = self.tokenizer(
208
+ negative_prompt,
209
+ max_length=self.max_sequence_length,
210
+ padding="max_length",
211
+ truncation=True,
212
+ return_tensors="pt",
213
+ ).to(self.device)
214
+ self.null_caption_embs = self.text_encoder(null_caption_token.input_ids, null_caption_token.attention_mask)[
215
+ 0
216
+ ]
217
+
218
+ if prompt is None:
219
+ prompt = [""]
220
+ prompts = prompt if isinstance(prompt, list) else [prompt]
221
+ samples = []
222
+
223
+ for prompt in prompts:
224
+ # data prepare
225
+ prompts, hw, ar = (
226
+ [],
227
+ torch.tensor([[self.image_size, self.image_size]], dtype=torch.float, device=self.device).repeat(
228
+ num_images_per_prompt, 1
229
+ ),
230
+ torch.tensor([[1.0]], device=self.device).repeat(num_images_per_prompt, 1),
231
+ )
232
+
233
+ ar = get_ar_from_ref_image(ref_image)
234
+ prompt += f" --ar {ar}"
235
+ for _ in range(num_images_per_prompt):
236
+ prompt_clean, _, hw, ar, custom_hw = prepare_prompt_ar(
237
+ prompt, self.base_ratios, device=self.device, show=False
238
+ )
239
+ prompts.append(prompt_clean.strip())
240
+
241
+ self.latent_size_h, self.latent_size_w = (
242
+ int(hw[0, 0] // self.config.vae.vae_downsample_rate),
243
+ int(hw[0, 1] // self.config.vae.vae_downsample_rate),
244
+ )
245
+
246
+ with torch.no_grad():
247
+ # prepare text feature
248
+ if not self.config.text_encoder.chi_prompt:
249
+ max_length_all = self.config.text_encoder.model_max_length
250
+ prompts_all = prompts
251
+ else:
252
+ chi_prompt = "\n".join(self.config.text_encoder.chi_prompt)
253
+ prompts_all = [chi_prompt + prompt for prompt in prompts]
254
+ num_chi_prompt_tokens = len(self.tokenizer.encode(chi_prompt))
255
+ max_length_all = (
256
+ num_chi_prompt_tokens + self.config.text_encoder.model_max_length - 2
257
+ ) # magic number 2: [bos], [_]
258
+
259
+ caption_token = self.tokenizer(
260
+ prompts_all,
261
+ max_length=max_length_all,
262
+ padding="max_length",
263
+ truncation=True,
264
+ return_tensors="pt",
265
+ ).to(device=self.device)
266
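+ # keep the [bos] token plus the last (model_max_length - 1) positions, dropping the prepended chi prompt tokens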
+ select_index = [0] + list(range(-self.config.text_encoder.model_max_length + 1, 0))
267
+ caption_embs = self.text_encoder(caption_token.input_ids, caption_token.attention_mask)[0][:, None][
268
+ :, :, select_index
269
+ ].to(self.weight_dtype)
270
+ emb_masks = caption_token.attention_mask[:, select_index]
271
+ null_y = self.null_caption_embs.repeat(len(prompts), 1, 1)[:, None].to(self.weight_dtype)
272
+
273
+ n = len(prompts)
274
+ if latents is None:
275
+ z = torch.randn(
276
+ n,
277
+ self.config.vae.vae_latent_dim,
278
+ self.latent_size_h,
279
+ self.latent_size_w,
280
+ generator=generator,
281
+ device=self.device,
282
+ )
283
+ else:
284
+ z = latents.to(self.device)
285
+ model_kwargs = dict(data_info={"img_hw": hw, "aspect_ratio": ar}, mask=emb_masks)
286
+
287
+ # control signal
288
+ if isinstance(ref_image, str):
289
+ ref_image = cv2.imread(ref_image)
290
+ elif isinstance(ref_image, Image.Image):
291
+ ref_image = np.array(ref_image)
292
+ control_signal = get_scribble_map(
293
+ input_image=ref_image,
294
+ det="Scribble_HED",
295
+ detect_resolution=int(hw.min()),
296
+ thickness=sketch_thickness,
297
+ )
298
+
299
+ control_signal = transform_control_signal(control_signal, hw).to(self.device).to(self.weight_dtype)
300
+
301
+ control_signal_latent = vae_encode(
302
+ self.config.vae.vae_type, self.vae, control_signal, self.config.vae.sample_posterior, self.device
303
+ )
304
+
305
+ model_kwargs["control_signal"] = control_signal_latent
306
+
307
+ if self.vis_sampler == "flow_euler":
308
+ flow_solver = FlowEuler(
309
+ self.model,
310
+ condition=caption_embs,
311
+ uncondition=null_y,
312
+ cfg_scale=guidance_scale,
313
+ model_kwargs=model_kwargs,
314
+ )
315
+ sample = flow_solver.sample(
316
+ z,
317
+ steps=num_inference_steps,
318
+ )
319
+ elif self.vis_sampler == "flow_dpm-solver":
320
+ scheduler = DPMS(
321
+ self.model.forward_with_dpmsolver,
322
+ condition=caption_embs,
323
+ uncondition=null_y,
324
+ guidance_type=self.guidance_type,
325
+ cfg_scale=guidance_scale,
326
+ model_type="flow",
327
+ model_kwargs=model_kwargs,
328
+ schedule="FLOW",
329
+ )
330
+ scheduler.register_progress_bar(self.progress_fn)
331
+ sample = scheduler.sample(
332
+ z,
333
+ steps=num_inference_steps,
334
+ order=2,
335
+ skip_type="time_uniform_flow",
336
+ method="multistep",
337
+ flow_shift=self.flow_shift,
338
+ )
339
+
340
+ sample = sample.to(self.vae_dtype)
341
+ with torch.no_grad():
342
+ sample = vae_decode(self.config.vae.vae_type, self.vae, sample)
343
+
344
+ if self.blend_alpha > 0:
345
+ print(f"blend image and mask with alpha: {self.blend_alpha}")
346
+ sample = sample * (1 - self.blend_alpha) + control_signal * self.blend_alpha
347
+
348
+ sample = resize_and_crop_tensor(sample, self.ori_width, self.ori_height)
349
+ samples.append(sample)
350
+
351
+ return sample
352
+
353
+ return samples
app/sana_pipeline.py ADDED
@@ -0,0 +1,304 @@
1
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ #
15
+ # SPDX-License-Identifier: Apache-2.0
16
+ import argparse
17
+ import warnings
18
+ from dataclasses import dataclass, field
19
+ from typing import Optional, Tuple
20
+
21
+ import pyrallis
22
+ import torch
23
+ import torch.nn as nn
24
+
25
+ warnings.filterwarnings("ignore") # ignore warning
26
+
27
+
28
+ from diffusion import DPMS, FlowEuler
29
+ from diffusion.data.datasets.utils import (
30
+ ASPECT_RATIO_512_TEST,
31
+ ASPECT_RATIO_1024_TEST,
32
+ ASPECT_RATIO_2048_TEST,
33
+ ASPECT_RATIO_4096_TEST,
34
+ )
35
+ from diffusion.model.builder import build_model, get_tokenizer_and_text_encoder, get_vae, vae_decode
36
+ from diffusion.model.utils import get_weight_dtype, prepare_prompt_ar, resize_and_crop_tensor
37
+ from diffusion.utils.config import SanaConfig, model_init_config
38
+ from diffusion.utils.logger import get_root_logger
39
+
40
+ # from diffusion.utils.misc import read_config
41
+ from tools.download import find_model
42
+
43
+
44
+ def guidance_type_select(default_guidance_type, pag_scale, attn_type):
45
+ guidance_type = default_guidance_type
46
+ if not (pag_scale > 1.0 and attn_type == "linear"):
47
+ guidance_type = "classifier-free"
48
+ elif pag_scale > 1.0 and attn_type == "linear":
49
+ guidance_type = "classifier-free_PAG"
50
+ return guidance_type
51
+
52
+
53
+ def classify_height_width_bin(height: int, width: int, ratios: dict) -> Tuple[int, int]:
54
+ """Returns binned height and width."""
55
+ ar = float(height / width)
56
+ closest_ratio = min(ratios.keys(), key=lambda ratio: abs(float(ratio) - ar))
57
+ default_hw = ratios[closest_ratio]
58
+ return int(default_hw[0]), int(default_hw[1])
59
+
60
+
61
+ @dataclass
62
+ class SanaInference(SanaConfig):
63
+ config: Optional[str] = "configs/sana_config/1024ms/Sana_1600M_img1024.yaml" # config
64
+ model_path: str = field(
65
+ default="output/Sana_D20/SANA.pth", metadata={"help": "Path to the model file (positional)"}
66
+ )
67
+ output: str = "./output"
68
+ bs: int = 1
69
+ image_size: int = 1024
70
+ cfg_scale: float = 5.0
71
+ pag_scale: float = 2.0
72
+ seed: int = 42
73
+ step: int = -1
74
+ custom_image_size: Optional[int] = None
75
+ shield_model_path: str = field(
76
+ default="google/shieldgemma-2b",
77
+ metadata={"help": "The path to shield model, we employ ShieldGemma-2B by default."},
78
+ )
79
+
80
+
81
+ class SanaPipeline(nn.Module):
82
+ def __init__(
83
+ self,
84
+ config: Optional[str] = "configs/sana_config/1024ms/Sana_1600M_img1024.yaml",
85
+ ):
86
+ super().__init__()
87
+ config = pyrallis.load(SanaInference, open(config))
88
+ self.args = self.config = config
89
+
90
+ # set some hyper-parameters
91
+ self.image_size = self.config.model.image_size
92
+
93
+ self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
94
+ logger = get_root_logger()
95
+ self.logger = logger
96
+ self.progress_fn = lambda progress, desc: None
97
+
98
+ self.latent_size = self.image_size // config.vae.vae_downsample_rate
99
+ self.max_sequence_length = config.text_encoder.model_max_length
100
+ self.flow_shift = config.scheduler.flow_shift
101
+ guidance_type = "classifier-free_PAG"
102
+
103
+ weight_dtype = get_weight_dtype(config.model.mixed_precision)
104
+ self.weight_dtype = weight_dtype
105
+ self.vae_dtype = get_weight_dtype(config.vae.weight_dtype)
106
+
107
+ self.base_ratios = eval(f"ASPECT_RATIO_{self.image_size}_TEST")
108
+ self.vis_sampler = self.config.scheduler.vis_sampler
109
+ logger.info(f"Sampler {self.vis_sampler}, flow_shift: {self.flow_shift}")
110
+ self.guidance_type = guidance_type_select(guidance_type, self.args.pag_scale, config.model.attn_type)
111
+ logger.info(f"Inference with {self.weight_dtype}, PAG guidance layer: {self.config.model.pag_applied_layers}")
112
+
113
+ # 1. build vae and text encoder
114
+ self.vae = self.build_vae(config.vae)
115
+ self.tokenizer, self.text_encoder = self.build_text_encoder(config.text_encoder)
116
+
117
+ # 2. build Sana model
118
+ self.model = self.build_sana_model(config).to(self.device)
119
+
120
+ # 3. pre-compute null embedding
121
+ with torch.no_grad():
122
+ null_caption_token = self.tokenizer(
123
+ "", max_length=self.max_sequence_length, padding="max_length", truncation=True, return_tensors="pt"
124
+ ).to(self.device)
125
+ self.null_caption_embs = self.text_encoder(null_caption_token.input_ids, null_caption_token.attention_mask)[
126
+ 0
127
+ ]
128
+
129
+ def build_vae(self, config):
130
+ vae = get_vae(config.vae_type, config.vae_pretrained, self.device).to(self.vae_dtype)
131
+ return vae
132
+
133
+ def build_text_encoder(self, config):
134
+ tokenizer, text_encoder = get_tokenizer_and_text_encoder(name=config.text_encoder_name, device=self.device)
135
+ return tokenizer, text_encoder
136
+
137
+ def build_sana_model(self, config):
138
+ # model setting
139
+ model_kwargs = model_init_config(config, latent_size=self.latent_size)
140
+ model = build_model(
141
+ config.model.model,
142
+ use_fp32_attention=config.model.get("fp32_attention", False) and config.model.mixed_precision != "bf16",
143
+ **model_kwargs,
144
+ )
145
+ self.logger.info(f"use_fp32_attention: {model.fp32_attention}")
146
+ self.logger.info(
147
+ f"{model.__class__.__name__}:{config.model.model},"
148
+ f"Model Parameters: {sum(p.numel() for p in model.parameters()):,}"
149
+ )
150
+ return model
151
+
152
+ def from_pretrained(self, model_path):
153
+ state_dict = find_model(model_path)
154
+ state_dict = state_dict.get("state_dict", state_dict)
155
+ if "pos_embed" in state_dict:
156
+ del state_dict["pos_embed"]
157
+ missing, unexpected = self.model.load_state_dict(state_dict, strict=False)
158
+ self.model.eval().to(self.weight_dtype)
159
+
160
+ self.logger.info("Generating sample from ckpt: %s" % model_path)
161
+ self.logger.warning(f"Missing keys: {missing}")
162
+ self.logger.warning(f"Unexpected keys: {unexpected}")
163
+
164
+ def register_progress_bar(self, progress_fn=None):
165
+ self.progress_fn = progress_fn if progress_fn is not None else self.progress_fn
166
+
167
+ @torch.inference_mode()
168
+ def forward(
169
+ self,
170
+ prompt=None,
171
+ height=1024,
172
+ width=1024,
173
+ negative_prompt="",
174
+ num_inference_steps=20,
175
+ guidance_scale=5,
176
+ pag_guidance_scale=2.5,
177
+ num_images_per_prompt=1,
178
+ generator=torch.Generator().manual_seed(42),
179
+ latents=None,
180
+ ):
181
+ self.ori_height, self.ori_width = height, width
182
+ self.height, self.width = classify_height_width_bin(height, width, ratios=self.base_ratios)
183
+ self.latent_size_h, self.latent_size_w = (
184
+ self.height // self.config.vae.vae_downsample_rate,
185
+ self.width // self.config.vae.vae_downsample_rate,
186
+ )
187
+ self.guidance_type = guidance_type_select(self.guidance_type, pag_guidance_scale, self.config.model.attn_type)
188
+
189
+ # 1. pre-compute negative embedding
190
+ if negative_prompt != "":
191
+ null_caption_token = self.tokenizer(
192
+ negative_prompt,
193
+ max_length=self.max_sequence_length,
194
+ padding="max_length",
195
+ truncation=True,
196
+ return_tensors="pt",
197
+ ).to(self.device)
198
+ self.null_caption_embs = self.text_encoder(null_caption_token.input_ids, null_caption_token.attention_mask)[
199
+ 0
200
+ ]
201
+
202
+ if prompt is None:
203
+ prompt = [""]
204
+ prompts = prompt if isinstance(prompt, list) else [prompt]
205
+ samples = []
206
+
207
+ for prompt in prompts:
208
+ # data prepare
209
+ prompts, hw, ar = (
210
+ [],
211
+ torch.tensor([[self.image_size, self.image_size]], dtype=torch.float, device=self.device).repeat(
212
+ num_images_per_prompt, 1
213
+ ),
214
+ torch.tensor([[1.0]], device=self.device).repeat(num_images_per_prompt, 1),
215
+ )
216
+
217
+ for _ in range(num_images_per_prompt):
218
+ prompts.append(prepare_prompt_ar(prompt, self.base_ratios, device=self.device, show=False)[0].strip())
219
+
220
+ with torch.no_grad():
221
+ # prepare text feature
222
+ if not self.config.text_encoder.chi_prompt:
223
+ max_length_all = self.config.text_encoder.model_max_length
224
+ prompts_all = prompts
225
+ else:
226
+ chi_prompt = "\n".join(self.config.text_encoder.chi_prompt)
227
+ prompts_all = [chi_prompt + prompt for prompt in prompts]
228
+ num_chi_prompt_tokens = len(self.tokenizer.encode(chi_prompt))
229
+ max_length_all = (
230
+ num_chi_prompt_tokens + self.config.text_encoder.model_max_length - 2
231
+ ) # magic number 2: [bos], [_]
232
+
233
+ caption_token = self.tokenizer(
234
+ prompts_all,
235
+ max_length=max_length_all,
236
+ padding="max_length",
237
+ truncation=True,
238
+ return_tensors="pt",
239
+ ).to(device=self.device)
240
+ select_index = [0] + list(range(-self.config.text_encoder.model_max_length + 1, 0))
241
+ caption_embs = self.text_encoder(caption_token.input_ids, caption_token.attention_mask)[0][:, None][
242
+ :, :, select_index
243
+ ].to(self.weight_dtype)
244
+ emb_masks = caption_token.attention_mask[:, select_index]
245
+ null_y = self.null_caption_embs.repeat(len(prompts), 1, 1)[:, None].to(self.weight_dtype)
246
+
247
+ n = len(prompts)
248
+ if latents is None:
249
+ z = torch.randn(
250
+ n,
251
+ self.config.vae.vae_latent_dim,
252
+ self.latent_size_h,
253
+ self.latent_size_w,
254
+ generator=generator,
255
+ device=self.device,
256
+ )
257
+ else:
258
+ z = latents.to(self.device)
259
+ model_kwargs = dict(data_info={"img_hw": hw, "aspect_ratio": ar}, mask=emb_masks)
260
+ if self.vis_sampler == "flow_euler":
261
+ flow_solver = FlowEuler(
262
+ self.model,
263
+ condition=caption_embs,
264
+ uncondition=null_y,
265
+ cfg_scale=guidance_scale,
266
+ model_kwargs=model_kwargs,
267
+ )
268
+ sample = flow_solver.sample(
269
+ z,
270
+ steps=num_inference_steps,
271
+ )
272
+ elif self.vis_sampler == "flow_dpm-solver":
273
+ scheduler = DPMS(
274
+ self.model,
275
+ condition=caption_embs,
276
+ uncondition=null_y,
277
+ guidance_type=self.guidance_type,
278
+ cfg_scale=guidance_scale,
279
+ pag_scale=pag_guidance_scale,
280
+ pag_applied_layers=self.config.model.pag_applied_layers,
281
+ model_type="flow",
282
+ model_kwargs=model_kwargs,
283
+ schedule="FLOW",
284
+ )
285
+ scheduler.register_progress_bar(self.progress_fn)
286
+ sample = scheduler.sample(
287
+ z,
288
+ steps=num_inference_steps,
289
+ order=2,
290
+ skip_type="time_uniform_flow",
291
+ method="multistep",
292
+ flow_shift=self.flow_shift,
293
+ )
294
+
295
+ sample = sample.to(self.vae_dtype)
296
+ with torch.no_grad():
297
+ sample = vae_decode(self.config.vae.vae_type, self.vae, sample)
298
+
299
+ sample = resize_and_crop_tensor(sample, self.ori_width, self.ori_height)
300
+ samples.append(sample)
301
+
302
+ return sample
303
+
304
+ return samples
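
`SanaPipeline.forward` above returns the decoded image tensor for the first prompt (the final `return samples` is unreachable), so it is simplest to pass a single prompt string per call. A minimal usage sketch follows; the config path is the `SanaInference` default shown above, and the checkpoint path is a placeholder that must point to downloaded Sana weights.

```python
# Minimal sketch of driving the SanaPipeline defined above. The config path is the
# SanaInference default; the checkpoint path is a placeholder for real weights.
import torch
from torchvision.utils import save_image

from app.sana_pipeline import SanaPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = torch.Generator(device=device).manual_seed(42)

pipe = SanaPipeline("configs/sana_config/1024ms/Sana_1600M_img1024.yaml")
pipe.from_pretrained("output/Sana_D20/SANA.pth")  # placeholder checkpoint path

image = pipe(
    prompt="a cyberpunk cat with a neon sign that says 'Sana'",
    height=1024,
    width=1024,
    guidance_scale=5.0,
    pag_guidance_scale=2.0,
    num_inference_steps=20,
    generator=generator,
)
save_image(image, "sana_sample.png", normalize=True, value_range=(-1, 1))
```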
asset/Sana.jpg ADDED

Git LFS Details

  • SHA256: 1a10d77cfe5a1a703c2cb801d0f3fe9fa32a05c60dfff22b0bc7a479980df61c
  • Pointer size: 132 Bytes
  • Size of remote file: 1.16 MB
asset/app_styles/controlnet_app_style.css ADDED
@@ -0,0 +1,28 @@
1
+ @import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.1/css/all.min.css');
2
+
3
+ h1{text-align:center}
4
+
5
+ .wrap.svelte-p4aq0j.svelte-p4aq0j {
6
+ display: none;
7
+ }
8
+
9
+ #column_input, #column_output {
10
+ width: 500px;
11
+ display: flex;
12
+ align-items: center;
13
+ }
14
+
15
+ #input_header, #output_header {
16
+ display: flex;
17
+ justify-content: center;
18
+ align-items: center;
19
+ width: 400px;
20
+ }
21
+
22
+ #accessibility {
23
+ text-align: center; /* Center-aligns the text */
24
+ margin: auto; /* Centers the element horizontally */
25
+ }
26
+
27
+ #random_seed {height: 71px;}
28
+ #run_button {height: 87px;}
asset/controlnet/ref_images/A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a la.jpg ADDED

Git LFS Details

  • SHA256: f74cbd0c051c90decfa85c903a69a8cda3998bf62199d56cabbf835c625c13c6
  • Pointer size: 131 Bytes
  • Size of remote file: 155 kB
asset/controlnet/ref_images/a house.png ADDED

Git LFS Details

  • SHA256: 30241902aa9a42a1f0f5ce628f4c63a9bda15c1f1af2aadbbf8a459a3b4c81cf
  • Pointer size: 131 Bytes
  • Size of remote file: 407 kB
asset/controlnet/ref_images/a living room.png ADDED

Git LFS Details

  • SHA256: db300835c3bfca4615fa51b26593565b525217f8fc3dad7e40c24d7197322953
  • Pointer size: 131 Bytes
  • Size of remote file: 215 kB
asset/controlnet/ref_images/nvidia.png ADDED
asset/controlnet/samples_controlnet.json ADDED
@@ -0,0 +1,26 @@
1
+ [
2
+ {
3
+ "prompt": "A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a landscape.",
4
+ "ref_image_path": "asset/controlnet/ref_images/A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a la.jpg"
5
+ },
6
+ {
7
+ "prompt": "an architecture in INDIA,15th-18th style, with a lot of details",
8
+ "ref_image_path": "asset/controlnet/ref_images/a house.png"
9
+ },
10
+ {
11
+ "prompt": "An IKEA modern style living room with sofa, coffee table, stairs, etc., a brand new theme.",
12
+ "ref_image_path": "asset/controlnet/ref_images/a living room.png"
13
+ },
14
+ {
15
+ "prompt": "A modern new living room with sofa, coffee table, carpet, stairs, etc., high quality high detail, high resolution.",
16
+ "ref_image_path": "asset/controlnet/ref_images/a living room.png"
17
+ },
18
+ {
19
+ "prompt": "big eye, vibrant colors, intricate details, captivating gaze, surreal, dreamlike, fantasy, enchanting, mysterious, magical, moonlit, mystical, ethereal, enchanting {macro lens, high aperture, low ISO}",
20
+ "ref_image_path": "asset/controlnet/ref_images/nvidia.png"
21
+ },
22
+ {
23
+ "prompt": "shining eye, bright and vivid colors, radiant glow, sparkling reflections, joyful, uplifting, optimistic, hopeful, magical, luminous, celestial, dreamy {zoom lens, high aperture, natural light, vibrant color film}",
24
+ "ref_image_path": "asset/controlnet/ref_images/nvidia.png"
25
+ }
26
+ ]
asset/docs/4bit_sana.md ADDED
@@ -0,0 +1,68 @@
1
+ <!--Copyright 2024 NVIDIA CORPORATION & AFFILIATES
2
+ #
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License. -->
15
+
16
+ # 4bit SanaPipeline
17
+
18
+ ### 1. Environment setup
19
+
20
+ Follow the official [SVDQuant-Nunchaku](https://github.com/mit-han-lab/nunchaku) repository to set up the environment. The guidance can be found [here](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation).
21
+
22
+ ### 2. Code snippet for inference
23
+
24
+ Here we show the code snippet for SanaPipeline. For SanaPAGPipeline, please refer to the [SanaPAGPipeline](https://github.com/mit-han-lab/nunchaku/blob/main/examples/sana_1600m_pag.py) section.
25
+
26
+ ```python
27
+ import torch
28
+ from diffusers import SanaPipeline
29
+
30
+ from nunchaku.models.transformer_sana import NunchakuSanaTransformer2DModel
31
+
32
+ transformer = NunchakuSanaTransformer2DModel.from_pretrained("mit-han-lab/svdq-int4-sana-1600m")
33
+ pipe = SanaPipeline.from_pretrained(
34
+ "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
35
+ transformer=transformer,
36
+ variant="bf16",
37
+ torch_dtype=torch.bfloat16,
38
+ ).to("cuda")
39
+
40
+ pipe.text_encoder.to(torch.bfloat16)
41
+ pipe.vae.to(torch.bfloat16)
42
+
43
+ image = pipe(
44
+ prompt="A cute 🐼 eating 🎋, ink drawing style",
45
+ height=1024,
46
+ width=1024,
47
+ guidance_scale=4.5,
48
+ num_inference_steps=20,
49
+ generator=torch.Generator().manual_seed(42),
50
+ ).images[0]
51
+ image.save("sana_1600m.png")
52
+ ```
53
+
54
+ ### 3. Online demo
55
+
56
+ 1). Launch the 4bit Sana.
57
+
58
+ ```bash
59
+ python app/app_sana_4bit.py
60
+ ```
61
+
62
+ 2). Compare with BF16 version
63
+
64
+ Refer to the original [Nunchaku-Sana](https://github.com/mit-han-lab/nunchaku/tree/main/app/sana/t2i) guidance for the SanaPAGPipeline.
65
+
66
+ ```bash
67
+ python app/app_sana_4bit_compare_bf16.py
68
+ ```
asset/docs/8bit_sana.md ADDED
@@ -0,0 +1,109 @@
1
+ <!-- Copyright 2024 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License. -->
14
+
15
+ # SanaPipeline
16
+
17
+ [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.
18
+
19
+ The abstract from the paper is:
20
+
21
+ *We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*
22
+
23
+ <Tip>
24
+
25
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
26
+
27
+ </Tip>
28
+
29
+ This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj) and [chenjy2003](https://github.com/chenjy2003). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model).
30
+
31
+ Available models:
32
+
33
+ | Model | Recommended dtype |
34
+ |:-----:|:-----------------:|
35
+ | [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` |
36
+ | [`Efficient-Large-Model/Sana_1600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | `torch.float16` |
37
+ | [`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | `torch.float16` |
38
+ | [`Efficient-Large-Model/Sana_1600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | `torch.float16` |
39
+ | [`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | `torch.float16` |
40
+ | [`Efficient-Large-Model/Sana_600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | `torch.float16` |
41
+ | [`Efficient-Large-Model/Sana_600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px_diffusers) | `torch.float16` |
42
+
43
+ Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) collection for more information.
44
+
45
+ Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.
46
+
47
+ <Tip>
48
+
49
+ Make sure to pass the `variant` argument for downloaded checkpoints to use lower disk space. Set it to `"fp16"` for models with recommended dtype as `torch.float16`, and `"bf16"` for models with recommended dtype as `torch.bfloat16`. By default, `torch.float32` weights are downloaded, which use twice the amount of disk storage. Additionally, `torch.float32` weights can be downcasted on-the-fly by specifying the `torch_dtype` argument. Read about it in the [docs](https://huggingface.co/docs/diffusers/v0.31.0/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained).
50
+
51
+ </Tip>
52
+
53
+ ## Quantization
54
+
55
+ Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on image quality depending on the model.
56
+
57
+ Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaPipeline`] for inference with bitsandbytes.
58
+
59
+ ```py
60
+ import torch
61
+ from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaTransformer2DModel, SanaPipeline
62
+ from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel
63
+
64
+ quant_config = BitsAndBytesConfig(load_in_8bit=True)
65
+ text_encoder_8bit = AutoModel.from_pretrained(
66
+ "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
67
+ subfolder="text_encoder",
68
+ quantization_config=quant_config,
69
+ torch_dtype=torch.float16,
70
+ )
71
+
72
+ quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
73
+ transformer_8bit = SanaTransformer2DModel.from_pretrained(
74
+ "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
75
+ subfolder="transformer",
76
+ quantization_config=quant_config,
77
+ torch_dtype=torch.float16,
78
+ )
79
+
80
+ pipeline = SanaPipeline.from_pretrained(
81
+ "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
82
+ text_encoder=text_encoder_8bit,
83
+ transformer=transformer_8bit,
84
+ torch_dtype=torch.float16,
85
+ device_map="balanced",
86
+ )
87
+
88
+ prompt = "a tiny astronaut hatching from an egg on the moon"
89
+ image = pipeline(prompt).images[0]
90
+ image.save("sana.png")
91
+ ```
92
+
93
+ ## SanaPipeline
94
+
95
+ \[\[autodoc\]\] SanaPipeline
96
+
97
+ - all
98
+ - __call__
99
+
100
+ ## SanaPAGPipeline
101
+
102
+ \[\[autodoc\]\] SanaPAGPipeline
103
+
104
+ - all
105
+ - __call__
106
+
107
+ ## SanaPipelineOutput
108
+
109
+ \[\[autodoc\]\] pipelines.sana.pipeline_output.SanaPipelineOutput
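
As a concrete instance of the `variant` tip earlier in this document, the snippet below (a hedged sketch, not part of the upstream docs) loads the fp16 variant of the 1600M diffusers checkpoint so only half-precision weights are downloaded, and keeps the text encoder and VAE in `bfloat16` as the dtype note above requires.

```python
# Hedged sketch of the `variant` tip: download only the fp16 weights, then keep
# the text encoder and VAE in bfloat16 as recommended in this document.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    variant="fp16",
    torch_dtype=torch.float16,
).to("cuda")

pipe.text_encoder.to(torch.bfloat16)
pipe.vae.to(torch.bfloat16)

image = pipe(prompt="a tiny astronaut hatching from an egg on the moon").images[0]
image.save("sana_fp16_variant.png")
```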
asset/docs/ComfyUI/Sana_CogVideoX.json ADDED
@@ -0,0 +1,1142 @@
1
+ {
2
+ "last_node_id": 37,
3
+ "last_link_id": 48,
4
+ "nodes": [
5
+ {
6
+ "id": 5,
7
+ "type": "GemmaLoader",
8
+ "pos": [
9
+ 283.376953125,
10
+ 603.7484741210938
11
+ ],
12
+ "size": [
13
+ 315,
14
+ 106
15
+ ],
16
+ "flags": {},
17
+ "order": 0,
18
+ "mode": 0,
19
+ "inputs": [],
20
+ "outputs": [
21
+ {
22
+ "name": "GEMMA",
23
+ "type": "GEMMA",
24
+ "links": [
25
+ 9,
26
+ 11
27
+ ],
28
+ "slot_index": 0
29
+ }
30
+ ],
31
+ "properties": {
32
+ "Node name for S&R": "GemmaLoader"
33
+ },
34
+ "widgets_values": [
35
+ "google/gemma-2-2b-it",
36
+ "cuda",
37
+ "BF16"
38
+ ]
39
+ },
40
+ {
41
+ "id": 12,
42
+ "type": "SanaTextEncode",
43
+ "pos": [
44
+ 670.9176635742188,
45
+ 797.39501953125
46
+ ],
47
+ "size": [
48
+ 400,
49
+ 200
50
+ ],
51
+ "flags": {},
52
+ "order": 7,
53
+ "mode": 0,
54
+ "inputs": [
55
+ {
56
+ "name": "GEMMA",
57
+ "type": "GEMMA",
58
+ "link": 11
59
+ }
60
+ ],
61
+ "outputs": [
62
+ {
63
+ "name": "CONDITIONING",
64
+ "type": "CONDITIONING",
65
+ "links": [
66
+ 3
67
+ ],
68
+ "slot_index": 0
69
+ }
70
+ ],
71
+ "properties": {
72
+ "Node name for S&R": "SanaTextEncode"
73
+ },
74
+ "widgets_values": [
75
+ "\"\""
76
+ ]
77
+ },
78
+ {
79
+ "id": 4,
80
+ "type": "SanaResolutionSelect",
81
+ "pos": [
82
+ 300.2852783203125,
83
+ 392.79766845703125
84
+ ],
85
+ "size": [
86
+ 315,
87
+ 102
88
+ ],
89
+ "flags": {},
90
+ "order": 1,
91
+ "mode": 0,
92
+ "inputs": [],
93
+ "outputs": [
94
+ {
95
+ "name": "width",
96
+ "type": "INT",
97
+ "links": [
98
+ 7
99
+ ],
100
+ "slot_index": 0
101
+ },
102
+ {
103
+ "name": "height",
104
+ "type": "INT",
105
+ "links": [
106
+ 8
107
+ ],
108
+ "slot_index": 1
109
+ }
110
+ ],
111
+ "properties": {
112
+ "Node name for S&R": "SanaResolutionSelect"
113
+ },
114
+ "widgets_values": [
115
+ "1024px",
116
+ "1.46"
117
+ ]
118
+ },
119
+ {
120
+ "id": 7,
121
+ "type": "SanaTextEncode",
122
+ "pos": [
123
+ 674.2115478515625,
124
+ 504.2879638671875
125
+ ],
126
+ "size": [
127
+ 400,
128
+ 200
129
+ ],
130
+ "flags": {},
131
+ "order": 6,
132
+ "mode": 0,
133
+ "inputs": [
134
+ {
135
+ "name": "GEMMA",
136
+ "type": "GEMMA",
137
+ "link": 9
138
+ }
139
+ ],
140
+ "outputs": [
141
+ {
142
+ "name": "CONDITIONING",
143
+ "type": "CONDITIONING",
144
+ "links": [
145
+ 2
146
+ ],
147
+ "slot_index": 0
148
+ }
149
+ ],
150
+ "properties": {
151
+ "Node name for S&R": "SanaTextEncode"
152
+ },
153
+ "widgets_values": [
154
+ "A cyberpunk cat with a neon sign that says 'Sana'."
155
+ ]
156
+ },
157
+ {
158
+ "id": 24,
159
+ "type": "PreviewImage",
160
+ "pos": [
161
+ 1443.0323486328125,
162
+ 352.056396484375
163
+ ],
164
+ "size": [
165
+ 210,
166
+ 246
167
+ ],
168
+ "flags": {},
169
+ "order": 13,
170
+ "mode": 0,
171
+ "inputs": [
172
+ {
173
+ "name": "images",
174
+ "type": "IMAGE",
175
+ "link": 47
176
+ }
177
+ ],
178
+ "outputs": [],
179
+ "properties": {
180
+ "Node name for S&R": "PreviewImage"
181
+ },
182
+ "widgets_values": []
183
+ },
184
+ {
185
+ "id": 25,
186
+ "type": "VHS_VideoCombine",
187
+ "pos": [
188
+ 2825.935546875,
189
+ -102.76895904541016
190
+ ],
191
+ "size": [
192
+ 767.7372436523438,
193
+ 310
194
+ ],
195
+ "flags": {},
196
+ "order": 18,
197
+ "mode": 0,
198
+ "inputs": [
199
+ {
200
+ "name": "images",
201
+ "type": "IMAGE",
202
+ "link": 30
203
+ },
204
+ {
205
+ "name": "audio",
206
+ "type": "AUDIO",
207
+ "link": null,
208
+ "shape": 7
209
+ },
210
+ {
211
+ "name": "meta_batch",
212
+ "type": "VHS_BatchManager",
213
+ "link": null,
214
+ "shape": 7
215
+ },
216
+ {
217
+ "name": "vae",
218
+ "type": "VAE",
219
+ "link": null,
220
+ "shape": 7
221
+ }
222
+ ],
223
+ "outputs": [
224
+ {
225
+ "name": "Filenames",
226
+ "type": "VHS_FILENAMES",
227
+ "links": null,
228
+ "shape": 3
229
+ }
230
+ ],
231
+ "properties": {
232
+ "Node name for S&R": "VHS_VideoCombine"
233
+ },
234
+ "widgets_values": {
235
+ "frame_rate": 8,
236
+ "loop_count": 0,
237
+ "filename_prefix": "CogVideoX_Fun",
238
+ "format": "video/h264-mp4",
239
+ "pix_fmt": "yuv420p",
240
+ "crf": 19,
241
+ "save_metadata": true,
242
+ "pingpong": false,
243
+ "save_output": true,
244
+ "videopreview": {
245
+ "hidden": false,
246
+ "paused": false,
247
+ "params": {
248
+ "filename": "CogVideoX_Fun_00005.mp4",
249
+ "subfolder": "",
250
+ "type": "output",
251
+ "format": "video/h264-mp4",
252
+ "frame_rate": 8
253
+ },
254
+ "muted": false
255
+ }
256
+ }
257
+ },
258
+ {
259
+ "id": 27,
260
+ "type": "CogVideoTextEncode",
261
+ "pos": [
262
+ 1713.936279296875,
263
+ 174.2305450439453
264
+ ],
265
+ "size": [
266
+ 471.90142822265625,
267
+ 168.08047485351562
268
+ ],
269
+ "flags": {},
270
+ "order": 9,
271
+ "mode": 0,
272
+ "inputs": [
273
+ {
274
+ "name": "clip",
275
+ "type": "CLIP",
276
+ "link": 35
277
+ }
278
+ ],
279
+ "outputs": [
280
+ {
281
+ "name": "conditioning",
282
+ "type": "CONDITIONING",
283
+ "links": [
284
+ 32
285
+ ],
286
+ "slot_index": 0,
287
+ "shape": 3
288
+ },
289
+ {
290
+ "name": "clip",
291
+ "type": "CLIP",
292
+ "links": [
293
+ 36
294
+ ],
295
+ "slot_index": 1
296
+ }
297
+ ],
298
+ "properties": {
299
+ "Node name for S&R": "CogVideoTextEncode"
300
+ },
301
+ "widgets_values": [
302
+ "fireworks display over night city. The video is of high quality, and the view is very clear. High quality, masterpiece, best quality, highres, ultra-detailed, fantastic.",
303
+ 1,
304
+ false
305
+ ]
306
+ },
307
+ {
308
+ "id": 28,
309
+ "type": "CogVideoTextEncode",
310
+ "pos": [
311
+ 1720.936279296875,
312
+ 393.230712890625
313
+ ],
314
+ "size": [
315
+ 463.01251220703125,
316
+ 144
317
+ ],
318
+ "flags": {},
319
+ "order": 11,
320
+ "mode": 0,
321
+ "inputs": [
322
+ {
323
+ "name": "clip",
324
+ "type": "CLIP",
325
+ "link": 36
326
+ }
327
+ ],
328
+ "outputs": [
329
+ {
330
+ "name": "conditioning",
331
+ "type": "CONDITIONING",
332
+ "links": [
333
+ 33
334
+ ],
335
+ "slot_index": 0,
336
+ "shape": 3
337
+ },
338
+ {
339
+ "name": "clip",
340
+ "type": "CLIP",
341
+ "links": null
342
+ }
343
+ ],
344
+ "properties": {
345
+ "Node name for S&R": "CogVideoTextEncode"
346
+ },
347
+ "widgets_values": [
348
+ "The video is not of a high quality, it has a low resolution. Watermark present in each frame. Strange motion trajectory. ",
349
+ 1,
350
+ true
351
+ ]
352
+ },
353
+ {
354
+ "id": 30,
355
+ "type": "CogVideoImageEncodeFunInP",
356
+ "pos": [
357
+ 2088.93603515625,
358
+ 595.230712890625
359
+ ],
360
+ "size": [
361
+ 253.60000610351562,
362
+ 146
363
+ ],
364
+ "flags": {},
365
+ "order": 15,
366
+ "mode": 0,
367
+ "inputs": [
368
+ {
369
+ "name": "vae",
370
+ "type": "VAE",
371
+ "link": 37
372
+ },
373
+ {
374
+ "name": "start_image",
375
+ "type": "IMAGE",
376
+ "link": 38
377
+ },
378
+ {
379
+ "name": "end_image",
380
+ "type": "IMAGE",
381
+ "link": null,
382
+ "shape": 7
383
+ }
384
+ ],
385
+ "outputs": [
386
+ {
387
+ "name": "image_cond_latents",
388
+ "type": "LATENT",
389
+ "links": [
390
+ 34
391
+ ],
392
+ "slot_index": 0
393
+ }
394
+ ],
395
+ "properties": {
396
+ "Node name for S&R": "CogVideoImageEncodeFunInP"
397
+ },
398
+ "widgets_values": [
399
+ 49,
400
+ true,
401
+ 0
402
+ ]
403
+ },
404
+ {
405
+ "id": 33,
406
+ "type": "CogVideoDecode",
407
+ "pos": [
408
+ 2442.93603515625,
409
+ -105.76895904541016
410
+ ],
411
+ "size": [
412
+ 315,
413
+ 198
414
+ ],
415
+ "flags": {},
416
+ "order": 17,
417
+ "mode": 0,
418
+ "inputs": [
419
+ {
420
+ "name": "vae",
421
+ "type": "VAE",
422
+ "link": 40
423
+ },
424
+ {
425
+ "name": "samples",
426
+ "type": "LATENT",
427
+ "link": 41
428
+ }
429
+ ],
430
+ "outputs": [
431
+ {
432
+ "name": "images",
433
+ "type": "IMAGE",
434
+ "links": [
435
+ 30
436
+ ]
437
+ }
438
+ ],
439
+ "properties": {
440
+ "Node name for S&R": "CogVideoDecode"
441
+ },
442
+ "widgets_values": [
443
+ true,
444
+ 240,
445
+ 360,
446
+ 0.2,
447
+ 0.2,
448
+ true
449
+ ]
450
+ },
451
+ {
452
+ "id": 34,
453
+ "type": "DownloadAndLoadCogVideoModel",
454
+ "pos": [
455
+ 1714.936279296875,
456
+ -138.76895141601562
457
+ ],
458
+ "size": [
459
+ 362.1656799316406,
460
+ 218
461
+ ],
462
+ "flags": {},
463
+ "order": 2,
464
+ "mode": 0,
465
+ "inputs": [
466
+ {
467
+ "name": "block_edit",
468
+ "type": "TRANSFORMERBLOCKS",
469
+ "link": null,
470
+ "shape": 7
471
+ },
472
+ {
473
+ "name": "lora",
474
+ "type": "COGLORA",
475
+ "link": null,
476
+ "shape": 7
477
+ },
478
+ {
479
+ "name": "compile_args",
480
+ "type": "COMPILEARGS",
481
+ "link": null,
482
+ "shape": 7
483
+ }
484
+ ],
485
+ "outputs": [
486
+ {
487
+ "name": "model",
488
+ "type": "COGVIDEOMODEL",
489
+ "links": [
490
+ 31
491
+ ]
492
+ },
493
+ {
494
+ "name": "vae",
495
+ "type": "VAE",
496
+ "links": [
497
+ 37,
498
+ 40
499
+ ],
500
+ "slot_index": 1
501
+ }
502
+ ],
503
+ "properties": {
504
+ "Node name for S&R": "DownloadAndLoadCogVideoModel"
505
+ },
506
+ "widgets_values": [
507
+ "alibaba-pai/CogVideoX-Fun-V1.1-5b-InP",
508
+ "bf16",
509
+ "disabled",
510
+ false,
511
+ "sdpa",
512
+ "main_device"
513
+ ]
514
+ },
515
+ {
516
+ "id": 31,
517
+ "type": "ImageResizeKJ",
518
+ "pos": [
519
+ 1722.936279296875,
520
+ 615.230712890625
521
+ ],
522
+ "size": [
523
+ 315,
524
+ 266
525
+ ],
526
+ "flags": {},
527
+ "order": 14,
528
+ "mode": 0,
529
+ "inputs": [
530
+ {
531
+ "name": "image",
532
+ "type": "IMAGE",
533
+ "link": 48
534
+ },
535
+ {
536
+ "name": "get_image_size",
537
+ "type": "IMAGE",
538
+ "link": null,
539
+ "shape": 7
540
+ },
541
+ {
542
+ "name": "width_input",
543
+ "type": "INT",
544
+ "link": null,
545
+ "widget": {
546
+ "name": "width_input"
547
+ },
548
+ "shape": 7
549
+ },
550
+ {
551
+ "name": "height_input",
552
+ "type": "INT",
553
+ "link": null,
554
+ "widget": {
555
+ "name": "height_input"
556
+ },
557
+ "shape": 7
558
+ }
559
+ ],
560
+ "outputs": [
561
+ {
562
+ "name": "IMAGE",
563
+ "type": "IMAGE",
564
+ "links": [
565
+ 38
566
+ ],
567
+ "slot_index": 0,
568
+ "shape": 3
569
+ },
570
+ {
571
+ "name": "width",
572
+ "type": "INT",
573
+ "links": null,
574
+ "shape": 3
575
+ },
576
+ {
577
+ "name": "height",
578
+ "type": "INT",
579
+ "links": null,
580
+ "shape": 3
581
+ }
582
+ ],
583
+ "properties": {
584
+ "Node name for S&R": "ImageResizeKJ"
585
+ },
586
+ "widgets_values": [
587
+ 720,
588
+ 480,
589
+ "lanczos",
590
+ false,
591
+ 2,
592
+ 0,
593
+ 0,
594
+ "disabled"
595
+ ]
596
+ },
597
+ {
598
+ "id": 29,
599
+ "type": "CLIPLoader",
600
+ "pos": [
601
+ 1216.935791015625,
602
+ -8.769308090209961
603
+ ],
604
+ "size": [
605
+ 451.30548095703125,
606
+ 82
607
+ ],
608
+ "flags": {},
609
+ "order": 3,
610
+ "mode": 0,
611
+ "inputs": [],
612
+ "outputs": [
613
+ {
614
+ "name": "CLIP",
615
+ "type": "CLIP",
616
+ "links": [
617
+ 35
618
+ ],
619
+ "slot_index": 0,
620
+ "shape": 3
621
+ }
622
+ ],
623
+ "properties": {
624
+ "Node name for S&R": "CLIPLoader"
625
+ },
626
+ "widgets_values": [
627
+ "text_encoders/t5xxl_fp16.safetensors",
628
+ "sd3"
629
+ ]
630
+ },
631
+ {
632
+ "id": 26,
633
+ "type": "CogVideoSampler",
634
+ "pos": [
635
+ 2423.935791015625,
636
+ 152.23048400878906
637
+ ],
638
+ "size": [
639
+ 330,
640
+ 574
641
+ ],
642
+ "flags": {},
643
+ "order": 16,
644
+ "mode": 0,
645
+ "inputs": [
646
+ {
647
+ "name": "model",
648
+ "type": "COGVIDEOMODEL",
649
+ "link": 31
650
+ },
651
+ {
652
+ "name": "positive",
653
+ "type": "CONDITIONING",
654
+ "link": 32
655
+ },
656
+ {
657
+ "name": "negative",
658
+ "type": "CONDITIONING",
659
+ "link": 33
660
+ },
661
+ {
662
+ "name": "samples",
663
+ "type": "LATENT",
664
+ "link": null,
665
+ "shape": 7
666
+ },
667
+ {
668
+ "name": "image_cond_latents",
669
+ "type": "LATENT",
670
+ "link": 34,
671
+ "shape": 7
672
+ },
673
+ {
674
+ "name": "context_options",
675
+ "type": "COGCONTEXT",
676
+ "link": null,
677
+ "shape": 7
678
+ },
679
+ {
680
+ "name": "controlnet",
681
+ "type": "COGVIDECONTROLNET",
682
+ "link": null,
683
+ "shape": 7
684
+ },
685
+ {
686
+ "name": "tora_trajectory",
687
+ "type": "TORAFEATURES",
688
+ "link": null,
689
+ "shape": 7
690
+ },
691
+ {
692
+ "name": "fastercache",
693
+ "type": "FASTERCACHEARGS",
694
+ "link": null,
695
+ "shape": 7
696
+ }
697
+ ],
698
+ "outputs": [
699
+ {
700
+ "name": "samples",
701
+ "type": "LATENT",
702
+ "links": [
703
+ 41
704
+ ],
705
+ "slot_index": 0
706
+ }
707
+ ],
708
+ "properties": {
709
+ "Node name for S&R": "CogVideoSampler"
710
+ },
711
+ "widgets_values": [
712
+ 49,
713
+ 25,
714
+ 6,
715
+ 1123398248636718,
716
+ "randomize",
717
+ "CogVideoXDDIM",
718
+ 1
719
+ ]
720
+ },
721
+ {
722
+ "id": 35,
723
+ "type": "SanaCheckpointLoader",
724
+ "pos": [
725
+ 286.5307922363281,
726
+ 235.45753479003906
727
+ ],
728
+ "size": [
729
+ 315,
730
+ 82
731
+ ],
732
+ "flags": {},
733
+ "order": 4,
734
+ "mode": 0,
735
+ "inputs": [],
736
+ "outputs": [
737
+ {
738
+ "name": "model",
739
+ "type": "MODEL",
740
+ "links": [
741
+ 43
742
+ ],
743
+ "slot_index": 0
744
+ }
745
+ ],
746
+ "properties": {
747
+ "Node name for S&R": "SanaCheckpointLoader"
748
+ },
749
+ "widgets_values": [
750
+ "Efficient-Large-Model/Sana_1600M_1024px_MultiLing",
751
+ "SanaMS_1600M_P1_D20"
752
+ ]
753
+ },
754
+ {
755
+ "id": 37,
756
+ "type": "ExtraVAELoader",
757
+ "pos": [
758
+ 1070.8033447265625,
759
+ 747.4982299804688
760
+ ],
761
+ "size": [
762
+ 315,
763
+ 106
764
+ ],
765
+ "flags": {},
766
+ "order": 5,
767
+ "mode": 0,
768
+ "inputs": [],
769
+ "outputs": [
770
+ {
771
+ "name": "VAE",
772
+ "type": "VAE",
773
+ "links": [
774
+ 46
775
+ ],
776
+ "slot_index": 0
777
+ }
778
+ ],
779
+ "properties": {
780
+ "Node name for S&R": "ExtraVAELoader"
781
+ },
782
+ "widgets_values": [
783
+ "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",
784
+ "dcae-f32c32-sana-1.0-diffusers",
785
+ "BF16"
786
+ ]
787
+ },
788
+ {
789
+ "id": 1,
790
+ "type": "KSampler",
791
+ "pos": [
792
+ 1101.390625,
793
+ 196.0309600830078
794
+ ],
795
+ "size": [
796
+ 300,
797
+ 480
798
+ ],
799
+ "flags": {},
800
+ "order": 10,
801
+ "mode": 0,
802
+ "inputs": [
803
+ {
804
+ "name": "model",
805
+ "type": "MODEL",
806
+ "link": 43
807
+ },
808
+ {
809
+ "name": "positive",
810
+ "type": "CONDITIONING",
811
+ "link": 2
812
+ },
813
+ {
814
+ "name": "negative",
815
+ "type": "CONDITIONING",
816
+ "link": 3
817
+ },
818
+ {
819
+ "name": "latent_image",
820
+ "type": "LATENT",
821
+ "link": 4
822
+ }
823
+ ],
824
+ "outputs": [
825
+ {
826
+ "name": "LATENT",
827
+ "type": "LATENT",
828
+ "links": [
829
+ 5
830
+ ],
831
+ "slot_index": 0,
832
+ "shape": 3
833
+ }
834
+ ],
835
+ "properties": {
836
+ "Node name for S&R": "KSampler"
837
+ },
838
+ "widgets_values": [
839
+ 869595936769725,
840
+ "randomize",
841
+ 28,
842
+ 5,
843
+ "euler",
844
+ "normal",
845
+ 1
846
+ ]
847
+ },
848
+ {
849
+ "id": 6,
850
+ "type": "EmptyDCAELatentImage",
851
+ "pos": [
852
+ 723.0592041015625,
853
+ 317.112548828125
854
+ ],
855
+ "size": [
856
+ 315,
857
+ 106
858
+ ],
859
+ "flags": {},
860
+ "order": 8,
861
+ "mode": 0,
862
+ "inputs": [
863
+ {
864
+ "name": "width",
865
+ "type": "INT",
866
+ "link": 7,
867
+ "widget": {
868
+ "name": "width"
869
+ }
870
+ },
871
+ {
872
+ "name": "height",
873
+ "type": "INT",
874
+ "link": 8,
875
+ "widget": {
876
+ "name": "height"
877
+ }
878
+ }
879
+ ],
880
+ "outputs": [
881
+ {
882
+ "name": "LATENT",
883
+ "type": "LATENT",
884
+ "links": [
885
+ 4
886
+ ],
887
+ "slot_index": 0
888
+ }
889
+ ],
890
+ "properties": {
891
+ "Node name for S&R": "EmptyDCAELatentImage"
892
+ },
893
+ "widgets_values": [
894
+ 512,
895
+ 512,
896
+ 1
897
+ ]
898
+ },
899
+ {
900
+ "id": 2,
901
+ "type": "VAEDecode",
902
+ "pos": [
903
+ 1452.4869384765625,
904
+ 217.9922637939453
905
+ ],
906
+ "size": [
907
+ 200,
908
+ 50
909
+ ],
910
+ "flags": {},
911
+ "order": 12,
912
+ "mode": 0,
913
+ "inputs": [
914
+ {
915
+ "name": "samples",
916
+ "type": "LATENT",
917
+ "link": 5
918
+ },
919
+ {
920
+ "name": "vae",
921
+ "type": "VAE",
922
+ "link": 46
923
+ }
924
+ ],
925
+ "outputs": [
926
+ {
927
+ "name": "IMAGE",
928
+ "type": "IMAGE",
929
+ "links": [
930
+ 47,
931
+ 48
932
+ ],
933
+ "slot_index": 0,
934
+ "shape": 3
935
+ }
936
+ ],
937
+ "properties": {
938
+ "Node name for S&R": "VAEDecode"
939
+ },
940
+ "widgets_values": []
941
+ }
942
+ ],
943
+ "links": [
944
+ [
945
+ 2,
946
+ 7,
947
+ 0,
948
+ 1,
949
+ 1,
950
+ "CONDITIONING"
951
+ ],
952
+ [
953
+ 3,
954
+ 12,
955
+ 0,
956
+ 1,
957
+ 2,
958
+ "CONDITIONING"
959
+ ],
960
+ [
961
+ 4,
962
+ 6,
963
+ 0,
964
+ 1,
965
+ 3,
966
+ "LATENT"
967
+ ],
968
+ [
969
+ 5,
970
+ 1,
971
+ 0,
972
+ 2,
973
+ 0,
974
+ "LATENT"
975
+ ],
976
+ [
977
+ 7,
978
+ 4,
979
+ 0,
980
+ 6,
981
+ 0,
982
+ "INT"
983
+ ],
984
+ [
985
+ 8,
986
+ 4,
987
+ 1,
988
+ 6,
989
+ 1,
990
+ "INT"
991
+ ],
992
+ [
993
+ 9,
994
+ 5,
995
+ 0,
996
+ 7,
997
+ 0,
998
+ "GEMMA"
999
+ ],
1000
+ [
1001
+ 11,
1002
+ 5,
1003
+ 0,
1004
+ 12,
1005
+ 0,
1006
+ "GEMMA"
1007
+ ],
1008
+ [
1009
+ 30,
1010
+ 33,
1011
+ 0,
1012
+ 25,
1013
+ 0,
1014
+ "IMAGE"
1015
+ ],
1016
+ [
1017
+ 31,
1018
+ 34,
1019
+ 0,
1020
+ 26,
1021
+ 0,
1022
+ "COGVIDEOMODEL"
1023
+ ],
1024
+ [
1025
+ 32,
1026
+ 27,
1027
+ 0,
1028
+ 26,
1029
+ 1,
1030
+ "CONDITIONING"
1031
+ ],
1032
+ [
1033
+ 33,
1034
+ 28,
1035
+ 0,
1036
+ 26,
1037
+ 2,
1038
+ "CONDITIONING"
1039
+ ],
1040
+ [
1041
+ 34,
1042
+ 30,
1043
+ 0,
1044
+ 26,
1045
+ 4,
1046
+ "LATENT"
1047
+ ],
1048
+ [
1049
+ 35,
1050
+ 29,
1051
+ 0,
1052
+ 27,
1053
+ 0,
1054
+ "CLIP"
1055
+ ],
1056
+ [
1057
+ 36,
1058
+ 27,
1059
+ 1,
1060
+ 28,
1061
+ 0,
1062
+ "CLIP"
1063
+ ],
1064
+ [
1065
+ 37,
1066
+ 34,
1067
+ 1,
1068
+ 30,
1069
+ 0,
1070
+ "VAE"
1071
+ ],
1072
+ [
1073
+ 38,
1074
+ 31,
1075
+ 0,
1076
+ 30,
1077
+ 1,
1078
+ "IMAGE"
1079
+ ],
1080
+ [
1081
+ 40,
1082
+ 34,
1083
+ 1,
1084
+ 33,
1085
+ 0,
1086
+ "VAE"
1087
+ ],
1088
+ [
1089
+ 41,
1090
+ 26,
1091
+ 0,
1092
+ 33,
1093
+ 1,
1094
+ "LATENT"
1095
+ ],
1096
+ [
1097
+ 43,
1098
+ 35,
1099
+ 0,
1100
+ 1,
1101
+ 0,
1102
+ "MODEL"
1103
+ ],
1104
+ [
1105
+ 46,
1106
+ 37,
1107
+ 0,
1108
+ 2,
1109
+ 1,
1110
+ "VAE"
1111
+ ],
1112
+ [
1113
+ 47,
1114
+ 2,
1115
+ 0,
1116
+ 24,
1117
+ 0,
1118
+ "IMAGE"
1119
+ ],
1120
+ [
1121
+ 48,
1122
+ 2,
1123
+ 0,
1124
+ 31,
1125
+ 0,
1126
+ "IMAGE"
1127
+ ]
1128
+ ],
1129
+ "groups": [],
1130
+ "config": {},
1131
+ "extra": {
1132
+ "ds": {
1133
+ "scale": 0.5644739300537776,
1134
+ "offset": [
1135
+ 515.970442108866,
1136
+ 435.7565370847522
1137
+ ]
1138
+ },
1139
+ "groupNodes": {}
1140
+ },
1141
+ "version": 0.4
1142
+ }
asset/docs/ComfyUI/Sana_FlowEuler.json ADDED
@@ -0,0 +1,508 @@
1
+ {
2
+ "last_node_id": 10,
3
+ "last_link_id": 11,
4
+ "nodes": [
5
+ {
6
+ "id": 1,
7
+ "type": "VAEDecode",
8
+ "pos": [
9
+ 1116.951416015625,
10
+ 273.2231140136719
11
+ ],
12
+ "size": [
13
+ 200,
14
+ 50
15
+ ],
16
+ "flags": {},
17
+ "order": 8,
18
+ "mode": 0,
19
+ "inputs": [
20
+ {
21
+ "name": "samples",
22
+ "type": "LATENT",
23
+ "link": 1
24
+ },
25
+ {
26
+ "name": "vae",
27
+ "type": "VAE",
28
+ "link": 2
29
+ }
30
+ ],
31
+ "outputs": [
32
+ {
33
+ "name": "IMAGE",
34
+ "type": "IMAGE",
35
+ "links": [
36
+ 9
37
+ ],
38
+ "slot_index": 0,
39
+ "shape": 3
40
+ }
41
+ ],
42
+ "properties": {
43
+ "Node name for S&R": "VAEDecode"
44
+ },
45
+ "widgets_values": []
46
+ },
47
+ {
48
+ "id": 2,
49
+ "type": "GemmaLoader",
50
+ "pos": [
51
+ -41.03317642211914,
52
+ 680.6829223632812
53
+ ],
54
+ "size": [
55
+ 315,
56
+ 106
57
+ ],
58
+ "flags": {},
59
+ "order": 0,
60
+ "mode": 0,
61
+ "inputs": [],
62
+ "outputs": [
63
+ {
64
+ "name": "GEMMA",
65
+ "type": "GEMMA",
66
+ "links": [
67
+ 10,
68
+ 11
69
+ ],
70
+ "slot_index": 0
71
+ }
72
+ ],
73
+ "properties": {
74
+ "Node name for S&R": "GemmaLoader"
75
+ },
76
+ "widgets_values": [
77
+ "Efficient-Large-Model/gemma-2-2b-it",
78
+ "cuda",
79
+ "BF16"
80
+ ]
81
+ },
82
+ {
83
+ "id": 3,
84
+ "type": "ExtraVAELoader",
85
+ "pos": [
86
+ 801.2960205078125,
87
+ 863.7061157226562
88
+ ],
89
+ "size": [
90
+ 315,
91
+ 106
92
+ ],
93
+ "flags": {},
94
+ "order": 1,
95
+ "mode": 0,
96
+ "inputs": [],
97
+ "outputs": [
98
+ {
99
+ "name": "VAE",
100
+ "type": "VAE",
101
+ "links": [
102
+ 2
103
+ ],
104
+ "slot_index": 0
105
+ }
106
+ ],
107
+ "properties": {
108
+ "Node name for S&R": "ExtraVAELoader"
109
+ },
110
+ "widgets_values": [
111
+ "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",
112
+ "dcae-f32c32-sana-1.0-diffusers",
113
+ "BF16"
114
+ ]
115
+ },
116
+ {
117
+ "id": 4,
118
+ "type": "KSampler",
119
+ "pos": [
120
+ 770.397216796875,
121
+ 267.5942077636719
122
+ ],
123
+ "size": [
124
+ 300,
125
+ 480
126
+ ],
127
+ "flags": {},
128
+ "order": 7,
129
+ "mode": 0,
130
+ "inputs": [
131
+ {
132
+ "name": "model",
133
+ "type": "MODEL",
134
+ "link": 3
135
+ },
136
+ {
137
+ "name": "positive",
138
+ "type": "CONDITIONING",
139
+ "link": 4
140
+ },
141
+ {
142
+ "name": "negative",
143
+ "type": "CONDITIONING",
144
+ "link": 5
145
+ },
146
+ {
147
+ "name": "latent_image",
148
+ "type": "LATENT",
149
+ "link": 6
150
+ }
151
+ ],
152
+ "outputs": [
153
+ {
154
+ "name": "LATENT",
155
+ "type": "LATENT",
156
+ "links": [
157
+ 1
158
+ ],
159
+ "slot_index": 0,
160
+ "shape": 3
161
+ }
162
+ ],
163
+ "properties": {
164
+ "Node name for S&R": "KSampler"
165
+ },
166
+ "widgets_values": [
167
+ 1057228702589644,
168
+ "fixed",
169
+ 28,
170
+ 2,
171
+ "euler",
172
+ "normal",
173
+ 1
174
+ ]
175
+ },
176
+ {
177
+ "id": 5,
178
+ "type": "EmptySanaLatentImage",
179
+ "pos": [
180
+ 392.18475341796875,
181
+ 367.0936279296875
182
+ ],
183
+ "size": [
184
+ 315,
185
+ 106
186
+ ],
187
+ "flags": {},
188
+ "order": 6,
189
+ "mode": 0,
190
+ "inputs": [
191
+ {
192
+ "name": "width",
193
+ "type": "INT",
194
+ "link": 7,
195
+ "widget": {
196
+ "name": "width"
197
+ }
198
+ },
199
+ {
200
+ "name": "height",
201
+ "type": "INT",
202
+ "link": 8,
203
+ "widget": {
204
+ "name": "height"
205
+ }
206
+ }
207
+ ],
208
+ "outputs": [
209
+ {
210
+ "name": "LATENT",
211
+ "type": "LATENT",
212
+ "links": [
213
+ 6
214
+ ],
215
+ "slot_index": 0
216
+ }
217
+ ],
218
+ "properties": {
219
+ "Node name for S&R": "EmptySanaLatentImage"
220
+ },
221
+ "widgets_values": [
222
+ 512,
223
+ 512,
224
+ 1
225
+ ]
226
+ },
227
+ {
228
+ "id": 6,
229
+ "type": "PreviewImage",
230
+ "pos": [
231
+ 1143.318115234375,
232
+ 385.34552001953125
233
+ ],
234
+ "size": [
235
+ 605.93505859375,
236
+ 665.570068359375
237
+ ],
238
+ "flags": {},
239
+ "order": 9,
240
+ "mode": 0,
241
+ "inputs": [
242
+ {
243
+ "name": "images",
244
+ "type": "IMAGE",
245
+ "link": 9
246
+ }
247
+ ],
248
+ "outputs": [],
249
+ "properties": {
250
+ "Node name for S&R": "PreviewImage"
251
+ },
252
+ "widgets_values": []
253
+ },
254
+ {
255
+ "id": 9,
256
+ "type": "GemmaTextEncode",
257
+ "pos": [
258
+ 320.47918701171875,
259
+ 884.2686767578125
260
+ ],
261
+ "size": [
262
+ 400,
263
+ 200
264
+ ],
265
+ "flags": {},
266
+ "order": 4,
267
+ "mode": 0,
268
+ "inputs": [
269
+ {
270
+ "name": "GEMMA",
271
+ "type": "GEMMA",
272
+ "link": 10
273
+ }
274
+ ],
275
+ "outputs": [
276
+ {
277
+ "name": "CONDITIONING",
278
+ "type": "CONDITIONING",
279
+ "links": [
280
+ 5
281
+ ],
282
+ "slot_index": 0
283
+ }
284
+ ],
285
+ "properties": {
286
+ "Node name for S&R": "GemmaTextEncode"
287
+ },
288
+ "widgets_values": [
289
+ ""
290
+ ]
291
+ },
292
+ {
293
+ "id": 10,
294
+ "type": "SanaTextEncode",
295
+ "pos": [
296
+ 323.21978759765625,
297
+ 632.0758666992188
298
+ ],
299
+ "size": [
300
+ 400,
301
+ 200
302
+ ],
303
+ "flags": {},
304
+ "order": 5,
305
+ "mode": 0,
306
+ "inputs": [
307
+ {
308
+ "name": "GEMMA",
309
+ "type": "GEMMA",
310
+ "link": 11
311
+ }
312
+ ],
313
+ "outputs": [
314
+ {
315
+ "name": "CONDITIONING",
316
+ "type": "CONDITIONING",
317
+ "links": [
318
+ 4
319
+ ],
320
+ "slot_index": 0
321
+ }
322
+ ],
323
+ "properties": {
324
+ "Node name for S&R": "SanaTextEncode"
325
+ },
326
+ "widgets_values": [
327
+ "a dog and a cat"
328
+ ]
329
+ },
330
+ {
331
+ "id": 7,
332
+ "type": "SanaCheckpointLoader",
333
+ "pos": [
334
+ -15.461307525634766,
335
+ 297.74456787109375
336
+ ],
337
+ "size": [
338
+ 315,
339
+ 106
340
+ ],
341
+ "flags": {},
342
+ "order": 2,
343
+ "mode": 0,
344
+ "inputs": [],
345
+ "outputs": [
346
+ {
347
+ "name": "model",
348
+ "type": "MODEL",
349
+ "links": [
350
+ 3
351
+ ],
352
+ "slot_index": 0
353
+ }
354
+ ],
355
+ "properties": {
356
+ "Node name for S&R": "SanaCheckpointLoader"
357
+ },
358
+ "widgets_values": [
359
+ "Efficient-Large-Model/Sana_1600M_1024px_BF16",
360
+ "SanaMS_1600M_P1_D20",
361
+ "BF16"
362
+ ]
363
+ },
364
+ {
365
+ "id": 8,
366
+ "type": "SanaResolutionSelect",
367
+ "pos": [
368
+ -24.12485122680664,
369
+ 469.7320556640625
370
+ ],
371
+ "size": [
372
+ 315,
373
+ 102
374
+ ],
375
+ "flags": {},
376
+ "order": 3,
377
+ "mode": 0,
378
+ "inputs": [],
379
+ "outputs": [
380
+ {
381
+ "name": "width",
382
+ "type": "INT",
383
+ "links": [
384
+ 7
385
+ ],
386
+ "slot_index": 0
387
+ },
388
+ {
389
+ "name": "height",
390
+ "type": "INT",
391
+ "links": [
392
+ 8
393
+ ],
394
+ "slot_index": 1
395
+ }
396
+ ],
397
+ "properties": {
398
+ "Node name for S&R": "SanaResolutionSelect"
399
+ },
400
+ "widgets_values": [
401
+ "1024px",
402
+ "1.00"
403
+ ]
404
+ }
405
+ ],
406
+ "links": [
407
+ [
408
+ 1,
409
+ 4,
410
+ 0,
411
+ 1,
412
+ 0,
413
+ "LATENT"
414
+ ],
415
+ [
416
+ 2,
417
+ 3,
418
+ 0,
419
+ 1,
420
+ 1,
421
+ "VAE"
422
+ ],
423
+ [
424
+ 3,
425
+ 7,
426
+ 0,
427
+ 4,
428
+ 0,
429
+ "MODEL"
430
+ ],
431
+ [
432
+ 4,
433
+ 10,
434
+ 0,
435
+ 4,
436
+ 1,
437
+ "CONDITIONING"
438
+ ],
439
+ [
440
+ 5,
441
+ 9,
442
+ 0,
443
+ 4,
444
+ 2,
445
+ "CONDITIONING"
446
+ ],
447
+ [
448
+ 6,
449
+ 5,
450
+ 0,
451
+ 4,
452
+ 3,
453
+ "LATENT"
454
+ ],
455
+ [
456
+ 7,
457
+ 8,
458
+ 0,
459
+ 5,
460
+ 0,
461
+ "INT"
462
+ ],
463
+ [
464
+ 8,
465
+ 8,
466
+ 1,
467
+ 5,
468
+ 1,
469
+ "INT"
470
+ ],
471
+ [
472
+ 9,
473
+ 1,
474
+ 0,
475
+ 6,
476
+ 0,
477
+ "IMAGE"
478
+ ],
479
+ [
480
+ 10,
481
+ 2,
482
+ 0,
483
+ 9,
484
+ 0,
485
+ "GEMMA"
486
+ ],
487
+ [
488
+ 11,
489
+ 2,
490
+ 0,
491
+ 10,
492
+ 0,
493
+ "GEMMA"
494
+ ]
495
+ ],
496
+ "groups": [],
497
+ "config": {},
498
+ "extra": {
499
+ "ds": {
500
+ "scale": 1,
501
+ "offset": [
502
+ 363.9719256481908,
503
+ -27.1040341608292
504
+ ]
505
+ }
506
+ },
507
+ "version": 0.4
508
+ }
asset/docs/ComfyUI/Sana_FlowEuler_2K.json ADDED
@@ -0,0 +1,508 @@
1
+ {
2
+ "last_node_id": 38,
3
+ "last_link_id": 47,
4
+ "nodes": [
5
+ {
6
+ "id": 4,
7
+ "type": "VAEDecode",
8
+ "pos": [
9
+ 776.332763671875,
10
+ 105.08650970458984
11
+ ],
12
+ "size": [
13
+ 200,
14
+ 50
15
+ ],
16
+ "flags": {},
17
+ "order": 8,
18
+ "mode": 0,
19
+ "inputs": [
20
+ {
21
+ "name": "samples",
22
+ "type": "LATENT",
23
+ "link": 3
24
+ },
25
+ {
26
+ "name": "vae",
27
+ "type": "VAE",
28
+ "link": 24
29
+ }
30
+ ],
31
+ "outputs": [
32
+ {
33
+ "name": "IMAGE",
34
+ "type": "IMAGE",
35
+ "links": [
36
+ 11
37
+ ],
38
+ "slot_index": 0,
39
+ "shape": 3
40
+ }
41
+ ],
42
+ "properties": {
43
+ "Node name for S&R": "VAEDecode"
44
+ },
45
+ "widgets_values": []
46
+ },
47
+ {
48
+ "id": 9,
49
+ "type": "GemmaLoader",
50
+ "pos": [
51
+ -381.6518859863281,
52
+ 512.5463256835938
53
+ ],
54
+ "size": [
55
+ 315,
56
+ 106
57
+ ],
58
+ "flags": {},
59
+ "order": 0,
60
+ "mode": 0,
61
+ "inputs": [],
62
+ "outputs": [
63
+ {
64
+ "name": "GEMMA",
65
+ "type": "GEMMA",
66
+ "links": [
67
+ 39,
68
+ 41
69
+ ],
70
+ "slot_index": 0
71
+ }
72
+ ],
73
+ "properties": {
74
+ "Node name for S&R": "GemmaLoader"
75
+ },
76
+ "widgets_values": [
77
+ "Efficient-Large-Model/gemma-2-2b-it",
78
+ "cuda",
79
+ "BF16"
80
+ ]
81
+ },
82
+ {
83
+ "id": 29,
84
+ "type": "ExtraVAELoader",
85
+ "pos": [
86
+ 460.67730712890625,
87
+ 695.5695190429688
88
+ ],
89
+ "size": [
90
+ 315,
91
+ 106
92
+ ],
93
+ "flags": {},
94
+ "order": 1,
95
+ "mode": 0,
96
+ "inputs": [],
97
+ "outputs": [
98
+ {
99
+ "name": "VAE",
100
+ "type": "VAE",
101
+ "links": [
102
+ 24
103
+ ],
104
+ "slot_index": 0
105
+ }
106
+ ],
107
+ "properties": {
108
+ "Node name for S&R": "ExtraVAELoader"
109
+ },
110
+ "widgets_values": [
111
+ "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",
112
+ "dcae-f32c32-sana-1.0-diffusers",
113
+ "BF16"
114
+ ]
115
+ },
116
+ {
117
+ "id": 10,
118
+ "type": "KSampler",
119
+ "pos": [
120
+ 429.7785339355469,
121
+ 99.45759582519531
122
+ ],
123
+ "size": [
124
+ 300,
125
+ 480
126
+ ],
127
+ "flags": {},
128
+ "order": 7,
129
+ "mode": 0,
130
+ "inputs": [
131
+ {
132
+ "name": "model",
133
+ "type": "MODEL",
134
+ "link": 33
135
+ },
136
+ {
137
+ "name": "positive",
138
+ "type": "CONDITIONING",
139
+ "link": 42
140
+ },
141
+ {
142
+ "name": "negative",
143
+ "type": "CONDITIONING",
144
+ "link": 47
145
+ },
146
+ {
147
+ "name": "latent_image",
148
+ "type": "LATENT",
149
+ "link": 46
150
+ }
151
+ ],
152
+ "outputs": [
153
+ {
154
+ "name": "LATENT",
155
+ "type": "LATENT",
156
+ "links": [
157
+ 3
158
+ ],
159
+ "slot_index": 0,
160
+ "shape": 3
161
+ }
162
+ ],
163
+ "properties": {
164
+ "Node name for S&R": "KSampler"
165
+ },
166
+ "widgets_values": [
167
+ 1057228702589644,
168
+ "fixed",
169
+ 28,
170
+ 2,
171
+ "euler",
172
+ "normal",
173
+ 1
174
+ ]
175
+ },
176
+ {
177
+ "id": 33,
178
+ "type": "EmptySanaLatentImage",
179
+ "pos": [
180
+ 51.56604766845703,
181
+ 198.95700073242188
182
+ ],
183
+ "size": [
184
+ 315,
185
+ 106
186
+ ],
187
+ "flags": {},
188
+ "order": 6,
189
+ "mode": 0,
190
+ "inputs": [
191
+ {
192
+ "name": "width",
193
+ "type": "INT",
194
+ "link": 28,
195
+ "widget": {
196
+ "name": "width"
197
+ }
198
+ },
199
+ {
200
+ "name": "height",
201
+ "type": "INT",
202
+ "link": 29,
203
+ "widget": {
204
+ "name": "height"
205
+ }
206
+ }
207
+ ],
208
+ "outputs": [
209
+ {
210
+ "name": "LATENT",
211
+ "type": "LATENT",
212
+ "links": [
213
+ 46
214
+ ],
215
+ "slot_index": 0
216
+ }
217
+ ],
218
+ "properties": {
219
+ "Node name for S&R": "EmptySanaLatentImage"
220
+ },
221
+ "widgets_values": [
222
+ 512,
223
+ 512,
224
+ 1
225
+ ]
226
+ },
227
+ {
228
+ "id": 13,
229
+ "type": "PreviewImage",
230
+ "pos": [
231
+ 802.6994018554688,
232
+ 217.20889282226562
233
+ ],
234
+ "size": [
235
+ 605.93505859375,
236
+ 665.570068359375
237
+ ],
238
+ "flags": {},
239
+ "order": 9,
240
+ "mode": 0,
241
+ "inputs": [
242
+ {
243
+ "name": "images",
244
+ "type": "IMAGE",
245
+ "link": 11
246
+ }
247
+ ],
248
+ "outputs": [],
249
+ "properties": {
250
+ "Node name for S&R": "PreviewImage"
251
+ },
252
+ "widgets_values": []
253
+ },
254
+ {
255
+ "id": 25,
256
+ "type": "SanaCheckpointLoader",
257
+ "pos": [
258
+ -356.08001708984375,
259
+ 129.6079559326172
260
+ ],
261
+ "size": [
262
+ 315,
263
+ 106
264
+ ],
265
+ "flags": {},
266
+ "order": 2,
267
+ "mode": 0,
268
+ "inputs": [],
269
+ "outputs": [
270
+ {
271
+ "name": "model",
272
+ "type": "MODEL",
273
+ "links": [
274
+ 33
275
+ ],
276
+ "slot_index": 0
277
+ }
278
+ ],
279
+ "properties": {
280
+ "Node name for S&R": "SanaCheckpointLoader"
281
+ },
282
+ "widgets_values": [
283
+ "Efficient-Large-Model/Sana_1600M_2Kpx_BF16",
284
+ "SanaMS_1600M_P1_D20_2K",
285
+ "BF16"
286
+ ]
287
+ },
288
+ {
289
+ "id": 6,
290
+ "type": "SanaResolutionSelect",
291
+ "pos": [
292
+ -364.7435607910156,
293
+ 301.5954284667969
294
+ ],
295
+ "size": [
296
+ 315,
297
+ 102
298
+ ],
299
+ "flags": {},
300
+ "order": 3,
301
+ "mode": 0,
302
+ "inputs": [],
303
+ "outputs": [
304
+ {
305
+ "name": "width",
306
+ "type": "INT",
307
+ "links": [
308
+ 28
309
+ ],
310
+ "slot_index": 0
311
+ },
312
+ {
313
+ "name": "height",
314
+ "type": "INT",
315
+ "links": [
316
+ 29
317
+ ],
318
+ "slot_index": 1
319
+ }
320
+ ],
321
+ "properties": {
322
+ "Node name for S&R": "SanaResolutionSelect"
323
+ },
324
+ "widgets_values": [
325
+ "2K",
326
+ "1.00"
327
+ ]
328
+ },
329
+ {
330
+ "id": 14,
331
+ "type": "SanaTextEncode",
332
+ "pos": [
333
+ -17.398910522460938,
334
+ 463.93927001953125
335
+ ],
336
+ "size": [
337
+ 400,
338
+ 200
339
+ ],
340
+ "flags": {},
341
+ "order": 4,
342
+ "mode": 0,
343
+ "inputs": [
344
+ {
345
+ "name": "GEMMA",
346
+ "type": "GEMMA",
347
+ "link": 39
348
+ }
349
+ ],
350
+ "outputs": [
351
+ {
352
+ "name": "CONDITIONING",
353
+ "type": "CONDITIONING",
354
+ "links": [
355
+ 42
356
+ ],
357
+ "slot_index": 0
358
+ }
359
+ ],
360
+ "properties": {
361
+ "Node name for S&R": "SanaTextEncode"
362
+ },
363
+ "widgets_values": [
364
+ "a dog and a cat"
365
+ ]
366
+ },
367
+ {
368
+ "id": 37,
369
+ "type": "GemmaTextEncode",
370
+ "pos": [
371
+ -20.1395263671875,
372
+ 716.132080078125
373
+ ],
374
+ "size": [
375
+ 400,
376
+ 200
377
+ ],
378
+ "flags": {},
379
+ "order": 5,
380
+ "mode": 0,
381
+ "inputs": [
382
+ {
383
+ "name": "GEMMA",
384
+ "type": "GEMMA",
385
+ "link": 41
386
+ }
387
+ ],
388
+ "outputs": [
389
+ {
390
+ "name": "CONDITIONING",
391
+ "type": "CONDITIONING",
392
+ "links": [
393
+ 47
394
+ ],
395
+ "slot_index": 0
396
+ }
397
+ ],
398
+ "properties": {
399
+ "Node name for S&R": "GemmaTextEncode"
400
+ },
401
+ "widgets_values": [
402
+ ""
403
+ ]
404
+ }
405
+ ],
406
+ "links": [
407
+ [
408
+ 3,
409
+ 10,
410
+ 0,
411
+ 4,
412
+ 0,
413
+ "LATENT"
414
+ ],
415
+ [
416
+ 11,
417
+ 4,
418
+ 0,
419
+ 13,
420
+ 0,
421
+ "IMAGE"
422
+ ],
423
+ [
424
+ 24,
425
+ 29,
426
+ 0,
427
+ 4,
428
+ 1,
429
+ "VAE"
430
+ ],
431
+ [
432
+ 28,
433
+ 6,
434
+ 0,
435
+ 33,
436
+ 0,
437
+ "INT"
438
+ ],
439
+ [
440
+ 29,
441
+ 6,
442
+ 1,
443
+ 33,
444
+ 1,
445
+ "INT"
446
+ ],
447
+ [
448
+ 33,
449
+ 25,
450
+ 0,
451
+ 10,
452
+ 0,
453
+ "MODEL"
454
+ ],
455
+ [
456
+ 39,
457
+ 9,
458
+ 0,
459
+ 14,
460
+ 0,
461
+ "GEMMA"
462
+ ],
463
+ [
464
+ 41,
465
+ 9,
466
+ 0,
467
+ 37,
468
+ 0,
469
+ "GEMMA"
470
+ ],
471
+ [
472
+ 42,
473
+ 14,
474
+ 0,
475
+ 10,
476
+ 1,
477
+ "CONDITIONING"
478
+ ],
479
+ [
480
+ 46,
481
+ 33,
482
+ 0,
483
+ 10,
484
+ 3,
485
+ "LATENT"
486
+ ],
487
+ [
488
+ 47,
489
+ 37,
490
+ 0,
491
+ 10,
492
+ 2,
493
+ "CONDITIONING"
494
+ ]
495
+ ],
496
+ "groups": [],
497
+ "config": {},
498
+ "extra": {
499
+ "ds": {
500
+ "scale": 0.9090909090909091,
501
+ "offset": [
502
+ 623.7012344346042,
503
+ 257.61183690683845
504
+ ]
505
+ }
506
+ },
507
+ "version": 0.4
508
+ }
asset/docs/ComfyUI/Sana_FlowEuler_4K.json ADDED
@@ -0,0 +1,508 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "last_node_id": 131,
3
+ "last_link_id": 146,
4
+ "nodes": [
5
+ {
6
+ "id": 121,
7
+ "type": "VAEDecode",
8
+ "pos": [
9
+ 3658.290771484375,
10
+ 1351.9073486328125
11
+ ],
12
+ "size": [
13
+ 200,
14
+ 50
15
+ ],
16
+ "flags": {},
17
+ "order": 8,
18
+ "mode": 0,
19
+ "inputs": [
20
+ {
21
+ "name": "samples",
22
+ "type": "LATENT",
23
+ "link": 133
24
+ },
25
+ {
26
+ "name": "vae",
27
+ "type": "VAE",
28
+ "link": 146
29
+ }
30
+ ],
31
+ "outputs": [
32
+ {
33
+ "name": "IMAGE",
34
+ "type": "IMAGE",
35
+ "links": [
36
+ 141
37
+ ],
38
+ "slot_index": 0,
39
+ "shape": 3
40
+ }
41
+ ],
42
+ "properties": {
43
+ "Node name for S&R": "VAEDecode"
44
+ },
45
+ "widgets_values": []
46
+ },
47
+ {
48
+ "id": 122,
49
+ "type": "GemmaLoader",
50
+ "pos": [
51
+ 2500.30615234375,
52
+ 1759.3671875
53
+ ],
54
+ "size": [
55
+ 315,
56
+ 106
57
+ ],
58
+ "flags": {},
59
+ "order": 0,
60
+ "mode": 0,
61
+ "inputs": [],
62
+ "outputs": [
63
+ {
64
+ "name": "GEMMA",
65
+ "type": "GEMMA",
66
+ "links": [
67
+ 142,
68
+ 143
69
+ ],
70
+ "slot_index": 0
71
+ }
72
+ ],
73
+ "properties": {
74
+ "Node name for S&R": "GemmaLoader"
75
+ },
76
+ "widgets_values": [
77
+ "Efficient-Large-Model/gemma-2-2b-it",
78
+ "cuda",
79
+ "BF16"
80
+ ]
81
+ },
82
+ {
83
+ "id": 125,
84
+ "type": "EmptySanaLatentImage",
85
+ "pos": [
86
+ 2933.52392578125,
87
+ 1445.77783203125
88
+ ],
89
+ "size": [
90
+ 315,
91
+ 106
92
+ ],
93
+ "flags": {},
94
+ "order": 6,
95
+ "mode": 0,
96
+ "inputs": [
97
+ {
98
+ "name": "width",
99
+ "type": "INT",
100
+ "link": 139,
101
+ "widget": {
102
+ "name": "width"
103
+ }
104
+ },
105
+ {
106
+ "name": "height",
107
+ "type": "INT",
108
+ "link": 140,
109
+ "widget": {
110
+ "name": "height"
111
+ }
112
+ }
113
+ ],
114
+ "outputs": [
115
+ {
116
+ "name": "LATENT",
117
+ "type": "LATENT",
118
+ "links": [
119
+ 138
120
+ ],
121
+ "slot_index": 0
122
+ }
123
+ ],
124
+ "properties": {
125
+ "Node name for S&R": "EmptySanaLatentImage"
126
+ },
127
+ "widgets_values": [
128
+ 512,
129
+ 512,
130
+ 1
131
+ ]
132
+ },
133
+ {
134
+ "id": 129,
135
+ "type": "GemmaTextEncode",
136
+ "pos": [
137
+ 2861.818359375,
138
+ 1962.9530029296875
139
+ ],
140
+ "size": [
141
+ 400,
142
+ 200
143
+ ],
144
+ "flags": {},
145
+ "order": 5,
146
+ "mode": 0,
147
+ "inputs": [
148
+ {
149
+ "name": "GEMMA",
150
+ "type": "GEMMA",
151
+ "link": 143
152
+ }
153
+ ],
154
+ "outputs": [
155
+ {
156
+ "name": "CONDITIONING",
157
+ "type": "CONDITIONING",
158
+ "links": [
159
+ 137
160
+ ],
161
+ "slot_index": 0
162
+ }
163
+ ],
164
+ "properties": {
165
+ "Node name for S&R": "GemmaTextEncode"
166
+ },
167
+ "widgets_values": [
168
+ ""
169
+ ]
170
+ },
171
+ {
172
+ "id": 130,
173
+ "type": "SanaCheckpointLoader",
174
+ "pos": [
175
+ 2525.8779296875,
176
+ 1376.4288330078125
177
+ ],
178
+ "size": [
179
+ 315,
180
+ 106
181
+ ],
182
+ "flags": {},
183
+ "order": 1,
184
+ "mode": 0,
185
+ "inputs": [],
186
+ "outputs": [
187
+ {
188
+ "name": "model",
189
+ "type": "MODEL",
190
+ "links": [
191
+ 135
192
+ ],
193
+ "slot_index": 0
194
+ }
195
+ ],
196
+ "properties": {
197
+ "Node name for S&R": "SanaCheckpointLoader"
198
+ },
199
+ "widgets_values": [
200
+ "Efficient-Large-Model/Sana_1600M_4Kpx_BF16",
201
+ "SanaMS_1600M_P1_D20_4K",
202
+ "BF16"
203
+ ]
204
+ },
205
+ {
206
+ "id": 127,
207
+ "type": "SanaResolutionSelect",
208
+ "pos": [
209
+ 2517.21435546875,
210
+ 1548.416259765625
211
+ ],
212
+ "size": [
213
+ 315,
214
+ 102
215
+ ],
216
+ "flags": {},
217
+ "order": 2,
218
+ "mode": 0,
219
+ "inputs": [],
220
+ "outputs": [
221
+ {
222
+ "name": "width",
223
+ "type": "INT",
224
+ "links": [
225
+ 139
226
+ ],
227
+ "slot_index": 0
228
+ },
229
+ {
230
+ "name": "height",
231
+ "type": "INT",
232
+ "links": [
233
+ 140
234
+ ],
235
+ "slot_index": 1
236
+ }
237
+ ],
238
+ "properties": {
239
+ "Node name for S&R": "SanaResolutionSelect"
240
+ },
241
+ "widgets_values": [
242
+ "4K",
243
+ "1.00"
244
+ ]
245
+ },
246
+ {
247
+ "id": 128,
248
+ "type": "SanaTextEncode",
249
+ "pos": [
250
+ 2864.55908203125,
251
+ 1710.7601318359375
252
+ ],
253
+ "size": [
254
+ 400,
255
+ 200
256
+ ],
257
+ "flags": {},
258
+ "order": 4,
259
+ "mode": 0,
260
+ "inputs": [
261
+ {
262
+ "name": "GEMMA",
263
+ "type": "GEMMA",
264
+ "link": 142
265
+ }
266
+ ],
267
+ "outputs": [
268
+ {
269
+ "name": "CONDITIONING",
270
+ "type": "CONDITIONING",
271
+ "links": [
272
+ 136
273
+ ],
274
+ "slot_index": 0
275
+ }
276
+ ],
277
+ "properties": {
278
+ "Node name for S&R": "SanaTextEncode"
279
+ },
280
+ "widgets_values": [
281
+ "a dog and a cat"
282
+ ]
283
+ },
284
+ {
285
+ "id": 123,
286
+ "type": "ExtraVAELoader",
287
+ "pos": [
288
+ 3325.43359375,
289
+ 1988.7694091796875
290
+ ],
291
+ "size": [
292
+ 315,
293
+ 106
294
+ ],
295
+ "flags": {},
296
+ "order": 3,
297
+ "mode": 0,
298
+ "inputs": [],
299
+ "outputs": [
300
+ {
301
+ "name": "VAE",
302
+ "type": "VAE",
303
+ "links": [
304
+ 146
305
+ ],
306
+ "slot_index": 0
307
+ }
308
+ ],
309
+ "properties": {
310
+ "Node name for S&R": "ExtraVAELoader"
311
+ },
312
+ "widgets_values": [
313
+ "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",
314
+ "dcae-f32c32-sana-1.0-diffusers",
315
+ "BF16"
316
+ ]
317
+ },
318
+ {
319
+ "id": 126,
320
+ "type": "PreviewImage",
321
+ "pos": [
322
+ 3684.657470703125,
323
+ 1464.02978515625
324
+ ],
325
+ "size": [
326
+ 605.93505859375,
327
+ 665.570068359375
328
+ ],
329
+ "flags": {},
330
+ "order": 9,
331
+ "mode": 0,
332
+ "inputs": [
333
+ {
334
+ "name": "images",
335
+ "type": "IMAGE",
336
+ "link": 141
337
+ }
338
+ ],
339
+ "outputs": [],
340
+ "properties": {
341
+ "Node name for S&R": "PreviewImage"
342
+ },
343
+ "widgets_values": []
344
+ },
345
+ {
346
+ "id": 124,
347
+ "type": "KSampler",
348
+ "pos": [
349
+ 3311.736572265625,
350
+ 1346.2784423828125
351
+ ],
352
+ "size": [
353
+ 300,
354
+ 480
355
+ ],
356
+ "flags": {},
357
+ "order": 7,
358
+ "mode": 0,
359
+ "inputs": [
360
+ {
361
+ "name": "model",
362
+ "type": "MODEL",
363
+ "link": 135
364
+ },
365
+ {
366
+ "name": "positive",
367
+ "type": "CONDITIONING",
368
+ "link": 136
369
+ },
370
+ {
371
+ "name": "negative",
372
+ "type": "CONDITIONING",
373
+ "link": 137
374
+ },
375
+ {
376
+ "name": "latent_image",
377
+ "type": "LATENT",
378
+ "link": 138
379
+ }
380
+ ],
381
+ "outputs": [
382
+ {
383
+ "name": "LATENT",
384
+ "type": "LATENT",
385
+ "links": [
386
+ 133
387
+ ],
388
+ "slot_index": 0,
389
+ "shape": 3
390
+ }
391
+ ],
392
+ "properties": {
393
+ "Node name for S&R": "KSampler"
394
+ },
395
+ "widgets_values": [
396
+ 1057228702589645,
397
+ "fixed",
398
+ 28,
399
+ 2,
400
+ "euler",
401
+ "normal",
402
+ 1
403
+ ]
404
+ }
405
+ ],
406
+ "links": [
407
+ [
408
+ 133,
409
+ 124,
410
+ 0,
411
+ 121,
412
+ 0,
413
+ "LATENT"
414
+ ],
415
+ [
416
+ 135,
417
+ 130,
418
+ 0,
419
+ 124,
420
+ 0,
421
+ "MODEL"
422
+ ],
423
+ [
424
+ 136,
425
+ 128,
426
+ 0,
427
+ 124,
428
+ 1,
429
+ "CONDITIONING"
430
+ ],
431
+ [
432
+ 137,
433
+ 129,
434
+ 0,
435
+ 124,
436
+ 2,
437
+ "CONDITIONING"
438
+ ],
439
+ [
440
+ 138,
441
+ 125,
442
+ 0,
443
+ 124,
444
+ 3,
445
+ "LATENT"
446
+ ],
447
+ [
448
+ 139,
449
+ 127,
450
+ 0,
451
+ 125,
452
+ 0,
453
+ "INT"
454
+ ],
455
+ [
456
+ 140,
457
+ 127,
458
+ 1,
459
+ 125,
460
+ 1,
461
+ "INT"
462
+ ],
463
+ [
464
+ 141,
465
+ 121,
466
+ 0,
467
+ 126,
468
+ 0,
469
+ "IMAGE"
470
+ ],
471
+ [
472
+ 142,
473
+ 122,
474
+ 0,
475
+ 128,
476
+ 0,
477
+ "GEMMA"
478
+ ],
479
+ [
480
+ 143,
481
+ 122,
482
+ 0,
483
+ 129,
484
+ 0,
485
+ "GEMMA"
486
+ ],
487
+ [
488
+ 146,
489
+ 123,
490
+ 0,
491
+ 121,
492
+ 1,
493
+ "VAE"
494
+ ]
495
+ ],
496
+ "groups": [],
497
+ "config": {},
498
+ "extra": {
499
+ "ds": {
500
+ "scale": 0.7513148009015777,
501
+ "offset": [
502
+ -1938.732003792888,
503
+ -1072.7654372703548
504
+ ]
505
+ }
506
+ },
507
+ "version": 0.4
508
+ }
asset/docs/ComfyUI/comfyui.md ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## 🖌️ Sana-ComfyUI
2
+
3
+ [Original Repo](https://github.com/city96/ComfyUI_ExtraModels)
4
+
5
+ ### Model info / implementation
6
+
7
+ - Uses Gemma2 2B as the text encoder
8
+ - Multiple resolutions and models available
9
+ - Compressed latent space (32 channels, /32 compression) - needs custom VAE
10
+
11
+ ### Usage
12
+
13
+ 1. All the checkpoints will be downloaded automatically.
14
+ 1. KSampler (Flow Euler) is available for now; Flow DPM-Solver will be available soon.
15
+
16
+ ```bash
17
+ git clone https://github.com/comfyanonymous/ComfyUI.git
18
+ cd ComfyUI
19
+ git clone https://github.com/Efficient-Large-Model/ComfyUI_ExtraModels.git custom_nodes/ComfyUI_ExtraModels
20
+
21
+ python main.py
22
+ ```
23
+
24
+ ### A sample workflow for Sana
25
+
26
+ [Sana workflow](Sana_FlowEuler.json)
27
+
28
+ ![Sana](https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/page/asset/content/comfyui/sana.jpg)
29
+
30
+ ### A sample for T2I(Sana) + I2V(CogVideoX)
31
+
32
+ [Sana + CogVideoX workflow](Sana_CogVideoX.json)
33
+
34
+ [![Sample T2I + I2V](https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/page/asset/content/comfyui/sana-cogvideox.jpg)](https://nvlabs.github.io/Sana/asset/content/comfyui/Sana_CogVideoX_Fun.mp4)
35
+
36
+ ### A sample workflow for a Sana 4096x4096 image (an 18GB GPU is needed)
37
+
38
+ [Sana workflow](Sana_FlowEuler_4K.json)
39
+
40
+ ![Sana](https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/page/asset/content/comfyui/Sana_4K_workflow.jpg)
asset/docs/metrics_toolkit.md ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 💻 How to Run Inference & Test Metrics (FID, CLIP Score, GenEval, DPG-Bench, etc...)
2
+
3
+ This toolkit will automatically run inference with your model and log the metric results to wandb as charts for better illustration. We currently support:
4
+
5
+ - \[x\] [FID](https://github.com/mseitzer/pytorch-fid) & [CLIP-Score](https://github.com/openai/CLIP)
6
+ - \[x\] [GenEval](https://github.com/djghosh13/geneval)
7
+ - \[x\] [DPG-Bench](https://github.com/TencentQQGYLab/ELLA)
8
+ - \[x\] [ImageReward](https://github.com/THUDM/ImageReward/tree/main)
9
+
10
+ ### 0. Install the corresponding envs for GenEval and DPG-Bench
11
+
12
+ Make sure you can activate the following envs:
13
+
14
+ - `conda activate geneval`([GenEval](https://github.com/djghosh13/geneval))
15
+ - `conda activate dpg`([DPG-Bench](https://github.com/TencentQQGYLab/ELLA))
16
+
17
+ ### 0.1 Prepare data.
18
+
19
+ We measure FID & CLIP-Score on [MJHQ-30K](https://huggingface.co/datasets/playgroundai/MJHQ-30K)
20
+
21
+ ```python
22
+ from huggingface_hub import hf_hub_download
23
+
24
+ hf_hub_download(
25
+ repo_id="playgroundai/MJHQ-30K",
26
+ filename="mjhq30k_imgs.zip",
27
+ local_dir="data/test/PG-eval-data/MJHQ-30K/",
28
+ repo_type="dataset"
29
+ )
30
+ ```
31
+
32
+ Unzip mjhq30k_imgs.zip into its per-category folder structure.
33
+
34
+ ```
35
+ data/test/PG-eval-data/MJHQ-30K/imgs/
36
+ ├── animals
37
+ ├── art
38
+ ├── fashion
39
+ ├── food
40
+ ├── indoor
41
+ ├── landscape
42
+ ├── logo
43
+ ├── people
44
+ ├── plants
45
+ └── vehicles
46
+ ```
47
+
48
+ ### 0.2 Prepare checkpoints
49
+
50
+ ```bash
51
+ huggingface-cli download Efficient-Large-Model/Sana_1600M_1024px --repo-type model --local-dir ./output/Sana_1600M_1024px --local-dir-use-symlinks False
52
+ ```
53
+
54
+ ### 1. Directly \[Inference and Metric\] a .pth file
55
+
56
+ ```bash
57
+ # We provide four scripts for evaluating metrics:
58
+ fid_clipscore_launch=scripts/bash_run_inference_metric.sh
59
+ geneval_launch=scripts/bash_run_inference_metric_geneval.sh
60
+ dpg_launch=scripts/bash_run_inference_metric_dpg.sh
61
+ image_reward_launch=scripts/bash_run_inference_metric_imagereward.sh
62
+
63
+ # Use the following format to evaluate your models:
64
+ # bash $corresponding_metric_launch $your_config_file_path $your_relative_pth_file_path
65
+
66
+ # example
67
+ bash $geneval_launch \
68
+ configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
69
+ output/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth
70
+ ```
71
+
72
+ ### 2. \[Inference and Metric\] a list of .pth files using a txt file
73
+
74
+ You can also list all the .pth files of a job in one txt file, e.g. [model_paths.txt](../model_paths.txt).
75
+
76
+ ```bash
77
+ # Use the following format to evaluate your models, gathered in a txt file:
78
+ # bash $corresponding_metric_launch $your_config_file_path $your_txt_file_path_containing_pth_path
79
+
80
+ # We suggest following the file tree structure in our project for robust experiments
81
+ # example
82
+ bash scripts/bash_run_inference_metric.sh \
83
+ configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
84
+ asset/model_paths.txt
85
+ ```
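+
+ For reference, such a txt file simply lists one relative `.pth` path per line (the paths below are illustrative):
+
+ ```
+ output/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth
+ output/your_job_name/checkpoints/epoch_1_step_6666.pth
+ ```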
86
+
87
+ ### 3. You will get the following data tree.
88
+
89
+ ```
90
+ output
91
+ ├──your_job_name/ (everything will be saved here)
92
+ │ ├──config.yaml
93
+ │ ├──train_log.log
94
+
95
+ │ ├──checkpoints (all checkpoints)
96
+ │ │ ├──epoch_1_step_6666.pth
97
+ │ │ ├──epoch_1_step_8888.pth
98
+ │ │ ├──......
99
+
100
+ │ ├──vis (all visualization result dirs)
101
+ │ │ ├──visualization_file_name
102
+ │ │ │ ├──xxxxxxx.jpg
103
+ │ │ │ ├──......
104
+ │ │ ├──visualization_file_name2
105
+ │ │ │ ├──xxxxxxx.jpg
106
+ │ │ │ ├──......
107
+ │ ├──......
108
+
109
+ │ ├──metrics (all metrics testing related files)
110
+ │ │ ├──model_paths.txt Optional(👈)(relative path of testing ckpts)
111
+ │ │ │ ├──output/your_job_name/checkpoints/epoch_1_step_6666.pth
112
+ │ │ │ ├──output/your_job_name/checkpoints/epoch_1_step_8888.pth
113
+ │ │ ├──fid_img_paths.txt Optional(👈)(name of testing img_dir in vis)
114
+ │ │ │ ├──visualization_file_name
115
+ │ │ │ ├──visualization_file_name2
116
+ │ │ ├──cached_img_paths.txt Optional(👈)
117
+ │ │ ├──......
118
+ ```
asset/docs/model_zoo.md ADDED
@@ -0,0 +1,157 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## 🔥 1. We provide all the links to Sana .pth and diffusers safetensor weights below
2
+
3
+ | Model | Reso | pth link | diffusers | Precision | Description |
4
+ |----------------------|--------|-----------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------------|
5
+ | Sana-0.6B | 512px | [Sana_600M_512px](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px) | [Efficient-Large-Model/Sana_600M_512px_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px_diffusers) | fp16/fp32 | Multi-Language |
6
+ | Sana-0.6B | 1024px | [Sana_600M_1024px](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px) | [Efficient-Large-Model/Sana_600M_1024px_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | fp16/fp32 | Multi-Language |
7
+ | Sana-1.6B | 512px | [Sana_1600M_512px](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px) | [Efficient-Large-Model/Sana_1600M_512px_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | fp16/fp32 | - |
8
+ | Sana-1.6B | 512px | [Sana_1600M_512px_MultiLing](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing) | [Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | fp16/fp32 | Multi-Language |
9
+ | Sana-1.6B | 1024px | [Sana_1600M_1024px](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px) | [Efficient-Large-Model/Sana_1600M_1024px_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | fp16/fp32 | - |
10
+ | Sana-1.6B | 1024px | [Sana_1600M_1024px_MultiLing](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing) | [Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | fp16/fp32 | Multi-Language |
11
+ | Sana-1.6B | 1024px | [Sana_1600M_1024px_BF16](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16) | [Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | **bf16**/fp32 | Multi-Language |
12
+ | Sana-1.6B | 1024px | - | [mit-han-lab/svdq-int4-sana-1600m](https://huggingface.co/mit-han-lab/svdq-int4-sana-1600m) | **int4** | Multi-Language |
13
+ | Sana-1.6B | 2Kpx | [Sana_1600M_2Kpx_BF16](https://huggingface.co/Efficient-Large-Model/Sana_1600M_2Kpx_BF16) | [Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_2Kpx_BF16_diffusers) | **bf16**/fp32 | Multi-Language |
14
+ | Sana-1.6B | 4Kpx | [Sana_1600M_4Kpx_BF16](https://huggingface.co/Efficient-Large-Model/Sana_1600M_4Kpx_BF16) | [Efficient-Large-Model/Sana_1600M_4Kpx_BF16_diffusers](https://huggingface.co/Efficient-Large-Model/Sana_1600M_4Kpx_BF16_diffusers) | **bf16**/fp32 | Multi-Language |
16
+ | ControlNet | | | | | |
17
+ | Sana-1.6B-ControlNet | 1Kpx | [Sana_1600M_1024px_BF16_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_ControlNet_HED) | Coming soon | **bf16**/fp32 | Multi-Language |
18
+ | Sana-0.6B-ControlNet | 1Kpx | [Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED) | Coming soon | fp16/fp32 | - |
19
+
20
+ ## ❗ 2. Make sure to use the correct precision (fp16/bf16/fp32) for training and inference.
21
+
22
+ ### We provide two samples for using fp16 and bf16 weights, respectively.
23
+
24
+ ❗️Make sure to set `variant` and `torch_dtype` in diffusers pipelines to the desired precision.
25
+
26
+ #### 1). For fp16 models
27
+
28
+ ```python
29
+ # run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
30
+ import torch
31
+ from diffusers import SanaPipeline
32
+
33
+ pipe = SanaPipeline.from_pretrained(
34
+ "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
35
+ variant="fp16",
36
+ torch_dtype=torch.float16,
37
+ )
38
+ pipe.to("cuda")
39
+
40
+ pipe.vae.to(torch.bfloat16)
41
+ pipe.text_encoder.to(torch.bfloat16)
42
+
43
+ prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
44
+ image = pipe(
45
+ prompt=prompt,
46
+ height=1024,
47
+ width=1024,
48
+ guidance_scale=5.0,
49
+ num_inference_steps=20,
50
+ generator=torch.Generator(device="cuda").manual_seed(42),
51
+ )[0]
52
+
53
+ image[0].save("sana.png")
54
+ ```
55
+
56
+ #### 2). For bf16 models
57
+
58
+ ```python
59
+ # run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
60
+ import torch
61
+ from diffusers import SanaPAGPipeline
62
+
63
+ pipe = SanaPAGPipeline.from_pretrained(
64
+ "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
65
+ variant="bf16",
66
+ torch_dtype=torch.bfloat16,
67
+ pag_applied_layers="transformer_blocks.8",
68
+ )
69
+ pipe.to("cuda")
70
+
71
+ pipe.text_encoder.to(torch.bfloat16)
72
+ pipe.vae.to(torch.bfloat16)
73
+
74
+ prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
75
+ image = pipe(
76
+ prompt=prompt,
77
+ guidance_scale=5.0,
78
+ pag_scale=2.0,
79
+ num_inference_steps=20,
80
+ generator=torch.Generator(device="cuda").manual_seed(42),
81
+ )[0]
82
+ image[0].save('sana.png')
83
+ ```
84
+
85
+ ## ❗ 3. 4K models
86
+
87
+ 4K models need VAE tiling to avoid OOM issues (a 16GB+ GPU is recommended).
88
+
89
+ ```python
90
+ # run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
91
+ import torch
92
+ from diffusers import SanaPipeline
93
+
94
+ pipe = SanaPipeline.from_pretrained(
95
+ "Efficient-Large-Model/Sana_1600M_4Kpx_BF16_diffusers",
96
+ variant="bf16",
97
+ torch_dtype=torch.bfloat16,
98
+ )
99
+ pipe.to("cuda")
100
+
101
+ pipe.vae.to(torch.bfloat16)
102
+ pipe.text_encoder.to(torch.bfloat16)
103
+
104
+ # for 4096x4096 image generation OOM issues, feel free to adjust the tile size
105
+ if pipe.transformer.config.sample_size == 128:
106
+ pipe.vae.enable_tiling(
107
+ tile_sample_min_height=1024,
108
+ tile_sample_min_width=1024,
109
+ tile_sample_stride_height=896,
110
+ tile_sample_stride_width=896,
111
+ )
112
+ prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
113
+ image = pipe(
114
+ prompt=prompt,
115
+ height=4096,
116
+ width=4096,
117
+ guidance_scale=5.0,
118
+ num_inference_steps=20,
119
+ generator=torch.Generator(device="cuda").manual_seed(42),
120
+ )[0]
121
+
122
+ image[0].save("sana_4K.png")
123
+ ```
124
+
125
+ ## ❗ 4. int4 inference
126
+
127
+ This int4 model is quantized with [SVDQuant-Nunchaku](https://github.com/mit-han-lab/nunchaku). You first need to follow the [installation guidance](https://github.com/mit-han-lab/nunchaku?tab=readme-ov-file#installation) of the nunchaku engine; then you can use the following code snippet to perform inference with the int4 Sana model.
128
+
129
+ Here we show the code snippet for SanaPipeline. For SanaPAGPipeline, please refer to the [SanaPAGPipeline](https://github.com/mit-han-lab/nunchaku/blob/main/examples/sana_1600m_pag.py) section.
130
+
131
+ ```python
132
+ import torch
133
+ from diffusers import SanaPipeline
134
+
135
+ from nunchaku.models.transformer_sana import NunchakuSanaTransformer2DModel
136
+
137
+ transformer = NunchakuSanaTransformer2DModel.from_pretrained("mit-han-lab/svdq-int4-sana-1600m")
138
+ pipe = SanaPipeline.from_pretrained(
139
+ "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
140
+ transformer=transformer,
141
+ variant="bf16",
142
+ torch_dtype=torch.bfloat16,
143
+ ).to("cuda")
144
+
145
+ pipe.text_encoder.to(torch.bfloat16)
146
+ pipe.vae.to(torch.bfloat16)
147
+
148
+ image = pipe(
149
+ prompt="A cute 🐼 eating 🎋, ink drawing style",
150
+ height=1024,
151
+ width=1024,
152
+ guidance_scale=4.5,
153
+ num_inference_steps=20,
154
+ generator=torch.Generator().manual_seed(42),
155
+ ).images[0]
156
+ image.save("sana_1600m.png")
157
+ ```
asset/docs/sana_controlnet.md ADDED
@@ -0,0 +1,75 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!-- Copyright 2024 NVIDIA CORPORATION & AFFILIATES
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License");
4
+ you may not use this file except in compliance with the License.
5
+ You may obtain a copy of the License at
6
+
7
+ http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ Unless required by applicable law or agreed to in writing, software
10
+ distributed under the License is distributed on an "AS IS" BASIS,
11
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ See the License for the specific language governing permissions and
13
+ limitations under the License.
14
+
15
+ SPDX-License-Identifier: Apache-2.0 -->
16
+
17
+ ## 🔥 ControlNet
18
+
19
+ We incorporate a [ControlNet](https://github.com/lllyasviel/ControlNet)-like module that enables fine-grained control over text-to-image diffusion models. We implement a ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation.
20
+
21
+ <p align="center">
22
+ <img src="https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/page/asset/content/controlnet/sana_controlnet.jpg" height=480>
23
+ </p>
24
+
25
+ ## Inference of `Sana + ControlNet`
26
+
27
+ ### 1). Gradio Interface
28
+
29
+ ```bash
30
+ python app/app_sana_controlnet_hed.py \
31
+ --config configs/sana_controlnet_config/Sana_1600M_1024px_controlnet_bf16.yaml \
32
+ --model_path hf://Efficient-Large-Model/Sana_1600M_1024px_BF16_ControlNet_HED/checkpoints/Sana_1600M_1024px_BF16_ControlNet_HED.pth
33
+ ```
34
+
35
+ <p align="center" border-raduis="10px">
36
+ <img src="https://nvlabs.github.io/Sana/asset/content/controlnet/controlnet_app.jpg" width="90%" alt="teaser_page2"/>
37
+ </p>
38
+
39
+ ### 2). Inference with JSON file
40
+
41
+ ```bash
42
+ python tools/controlnet/inference_controlnet.py \
43
+ --config configs/sana_controlnet_config/Sana_1600M_1024px_controlnet_bf16.yaml \
44
+ --model_path hf://Efficient-Large-Model/Sana_1600M_1024px_BF16_ControlNet_HED/checkpoints/Sana_1600M_1024px_BF16_ControlNet_HED.pth \
45
+ --json_file asset/controlnet/samples_controlnet.json
46
+ ```
47
+
48
+ ### 3). Inference code snippet
49
+
50
+ ```python
51
+ import torch
52
+ from PIL import Image
53
+ from app.sana_controlnet_pipeline import SanaControlNetPipeline
54
+
55
+ device = "cuda" if torch.cuda.is_available() else "cpu"
56
+
57
+ pipe = SanaControlNetPipeline("configs/sana_controlnet_config/Sana_1600M_1024px_controlnet_bf16.yaml")
58
+ pipe.from_pretrained("hf://Efficient-Large-Model/Sana_1600M_1024px_BF16_ControlNet_HED/checkpoints/Sana_1600M_1024px_BF16_ControlNet_HED.pth")
59
+
60
+ ref_image = Image.open("asset/controlnet/ref_images/A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a la.jpg")
61
+ prompt = "A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a landscape."
62
+
63
+ images = pipe(
64
+ prompt=prompt,
65
+ ref_image=ref_image,
66
+ guidance_scale=4.5,
67
+ num_inference_steps=10,
68
+ sketch_thickness=2,
69
+ generator=torch.Generator(device=device).manual_seed(0),
70
+ )
71
+ ```
72
+
73
+ ## Training of `Sana + ControlNet`
74
+
75
+ ### Coming soon
asset/docs/sana_lora_dreambooth.md ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # DreamBooth training example for SANA
2
+
3
+ [DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text2image models like stable diffusion given just a few (3~5) images of a subject.
4
+
5
+ The `train_dreambooth_lora_sana.py` script shows how to implement the training procedure with [LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) and adapt it for [SANA](https://arxiv.org/abs/2410.10629).
6
+
7
+ This will also allow us to push the trained model parameters to the Hugging Face Hub platform.
8
+
9
+ ## Running locally with PyTorch
10
+
11
+ ### Installing the dependencies
12
+
13
+ Before running the scripts, make sure to install the library's training dependencies:
14
+
15
+ **Important**
16
+
17
+ To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
18
+
19
+ ```bash
20
+ git clone https://github.com/huggingface/diffusers
21
+ cd diffusers
22
+ pip install -e .
23
+ ```
24
+
25
+ And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
26
+
27
+ ```bash
28
+ accelerate config
29
+ ```
30
+
31
+ Or for a default accelerate configuration without answering questions about your environment
32
+
33
+ ```bash
34
+ accelerate config default
35
+ ```
36
+
37
+ Or if your environment doesn't support an interactive shell (e.g., a notebook)
38
+
39
+ ```python
40
+ from accelerate.utils import write_basic_config
41
+ write_basic_config()
42
+ ```
43
+
44
+ When running `accelerate config`, if we specify torch compile mode to True there can be dramatic speedups.
45
+ Note also that we use PEFT library as backend for LoRA training, make sure to have `peft>=0.14.0` installed in your environment.
46
+
47
+ ### Dog toy example
48
+
49
+ Now let's get our dataset. For this example we will use some dog images: https://huggingface.co/datasets/diffusers/dog-example.
50
+
51
+ Let's first download it locally:
52
+
53
+ ```python
54
+ from huggingface_hub import snapshot_download
55
+
56
+ local_dir = "data/dreambooth/dog"
57
+ snapshot_download(
58
+ "diffusers/dog-example",
59
+ local_dir=local_dir, repo_type="dataset",
60
+ ignore_patterns=".gitattributes",
61
+ )
62
+ ```
63
+
64
+ This will also allow us to push the trained LoRA parameters to the Hugging Face Hub platform.
65
+
66
+ [Here is the Model Card](model_zoo.md) for you to choose the desired pre-trained models and set it to `MODEL_NAME`.
67
+
68
+ Now, we can launch training using [file here](../../train_scripts/train_lora.sh):
69
+
70
+ ```bash
71
+ bash train_scripts/train_lora.sh
72
+ ```
73
+
74
+ or you can run it locally:
75
+
76
+ ```bash
77
+ export MODEL_NAME="Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers"
78
+ export INSTANCE_DIR="data/dreambooth/dog"
79
+ export OUTPUT_DIR="trained-sana-lora"
80
+
81
+ accelerate launch --num_processes 8 --main_process_port 29500 --gpu_ids 0,1,2,3 \
82
+ train_scripts/train_dreambooth_lora_sana.py \
83
+ --pretrained_model_name_or_path=$MODEL_NAME \
84
+ --instance_data_dir=$INSTANCE_DIR \
85
+ --output_dir=$OUTPUT_DIR \
86
+ --mixed_precision="bf16" \
87
+ --instance_prompt="a photo of sks dog" \
88
+ --resolution=1024 \
89
+ --train_batch_size=1 \
90
+ --gradient_accumulation_steps=4 \
91
+ --use_8bit_adam \
92
+ --learning_rate=1e-4 \
93
+ --report_to="wandb" \
94
+ --lr_scheduler="constant" \
95
+ --lr_warmup_steps=0 \
96
+ --max_train_steps=500 \
97
+ --validation_prompt="A photo of sks dog in a pond, yarn art style" \
98
+ --validation_epochs=25 \
99
+ --seed="0" \
100
+ --push_to_hub
101
+ ```
102
+
103
+ For using `push_to_hub`, make you're logged into your Hugging Face account:
104
+
105
+ ```bash
106
+ huggingface-cli login
107
+ ```
108
+
109
+ To better track our training experiments, we're using the following flags in the command above:
110
+
111
+ - `report_to="wandb` will ensure the training runs are tracked on [Weights and Biases](https://wandb.ai/site). To use it, be sure to install `wandb` with `pip install wandb`. Don't forget to call `wandb login <your_api_key>` before training if you haven't done it before.
112
+ - `validation_prompt` and `validation_epochs` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
113
+
114
+ ## Notes
115
+
116
+ Additionally, we welcome you to explore the following CLI arguments:
117
+
118
+ - `--lora_layers`: The transformer modules to apply LoRA training on. Please specify the layers in a comma seperated. E.g. - "to_k,to_q,to_v" will result in lora training of attention layers only.
119
+ - `--complex_human_instruction`: Instructions for complex human attention as shown in [here](https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55).
120
+ - `--max_sequence_length`: Maximum sequence length to use for text embeddings.
121
+
122
+ We provide several options for optimizing memory optimization:
123
+
124
+ - `--offload`: When enabled, we will offload the text encoder and VAE to CPU, when they are not used.
125
+ - `cache_latents`: When enabled, we will pre-compute the latents from the input images with the VAE and remove the VAE from memory once done.
126
+ - `--use_8bit_adam`: When enabled, we will use the 8bit version of AdamW provided by the `bitsandbytes` library.
127
+
128
+ Refer to the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana) of the `SanaPipeline` to know more about the models available under the SANA family and their preferred dtypes during inference.
129
+
130
+ ## Samples
131
+
132
+ We show some samples during Sana-LoRA fine-tuning process below.
133
+
134
+ <p align="center" border-raduis="10px">
135
+ <img src="https://nvlabs.github.io/Sana/asset/content/dreambooth/step0.jpg" width="90%" alt="sana-lora-step0"/>
136
+ <br>
137
+ <em> training samples at step=0 </em>
138
+ </p>
139
+
140
+ <p align="center" border-raduis="10px">
141
+ <img src="https://nvlabs.github.io/Sana/asset/content/dreambooth/step500.jpg" width="90%" alt="sana-lora-step500"/>
142
+ <br>
143
+ <em> training samples at step=500 </em>
144
+ </p>
asset/example_data/00000000.jpg ADDED

Git LFS Details

  • SHA256: 093affd5bbefce86625ad616d192a87b006ffe5758b93200cb54d3afbd849434
  • Pointer size: 132 Bytes
  • Size of remote file: 1.54 MB
asset/example_data/00000000.png ADDED

Git LFS Details

  • SHA256: 093affd5bbefce86625ad616d192a87b006ffe5758b93200cb54d3afbd849434
  • Pointer size: 132 Bytes
  • Size of remote file: 1.54 MB
asset/example_data/00000000.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ a cyberpunk cat with a neon sign that says "Sana".
asset/example_data/00000000_InternVL2-26B.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "00000000": {
3
+ "InternVL2-26B": "a cyberpunk cat with a neon sign that says 'Sana'"
4
+ }
5
+ }
asset/example_data/00000000_InternVL2-26B_clip_score.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "00000000": {
3
+ "InternVL2-26B": "27.1037"
4
+ }
5
+ }
asset/example_data/00000000_VILA1-5-13B.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "00000000": {
3
+ "VILA1-5-13B": "a cyberpunk cat with a neon sign that says 'Sana'"
4
+ }
5
+ }
asset/example_data/00000000_VILA1-5-13B_clip_score.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "00000000": {
3
+ "VILA1-5-13B": "27.2321"
4
+ }
5
+ }
asset/example_data/00000000_prompt_clip_score.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "00000000": {
3
+ "prompt": "26.7331"
4
+ }
5
+ }
asset/example_data/meta_data.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "name": "sana-dev",
3
+ "__kind__": "Sana-ImgDataset",
4
+ "img_names": [
5
+ "00000000", "00000000", "00000000.png", "00000000.jpg"
6
+ ]
7
+ }
asset/examples.py ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024 NVIDIA CORPORATION & AFFILIATES
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ #
15
+ # SPDX-License-Identifier: Apache-2.0
16
+
17
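+ # Each row below is consumed by the Gradio demo as
+ # (prompt, sampler, num_inference_steps, guidance_scale, PAG guidance scale);
+ # this mapping is an assumption inferred from the demo's example inputs.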
+ examples = [
18
+ [
19
+ "A small cactus with a happy face in the Sahara desert.",
20
+ "flow_dpm-solver",
21
+ 20,
22
+ 5.0,
23
+ 2.5,
24
+ ],
25
+ [
26
+ "An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history "
27
+ "of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits "
28
+ "mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt, he wears a brown beret "
29
+ "and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile "
30
+ "as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and "
31
+ "the Parisian streets and city in the background, depth of field, cinematic 35mm film.",
32
+ "flow_dpm-solver",
33
+ 20,
34
+ 5.0,
35
+ 2.5,
36
+ ],
37
+ [
38
+ "An illustration of a human heart made of translucent glass, standing on a pedestal amidst a stormy sea. "
39
+ "Rays of sunlight pierce the clouds, illuminating the heart, revealing a tiny universe within. "
40
+ "The quote 'Find the universe within you' is etched in bold letters across the horizon. "
41
+ "blue and pink, brilliantly illuminated in the background.",
42
+ "flow_dpm-solver",
43
+ 20,
44
+ 5.0,
45
+ 2.5,
46
+ ],
47
+ [
48
+ "A transparent sculpture of a duck made out of glass. The sculpture is in front of a painting of a landscape.",
49
+ "flow_dpm-solver",
50
+ 20,
51
+ 5.0,
52
+ 2.5,
53
+ ],
54
+ [
55
+ "A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.",
56
+ "flow_dpm-solver",
57
+ 20,
58
+ 5.0,
59
+ 2.5,
60
+ ],
61
+ [
62
+ "a kayak in the water, in the style of optical color mixing, aerial view, rainbowcore, "
63
+ "national geographic photo, 8k resolution, crayon art, interactive artwork",
64
+ "flow_dpm-solver",
65
+ 20,
66
+ 5.0,
67
+ 2.5,
68
+ ],
69
+ ]
asset/logo.png ADDED
asset/model-incremental.jpg ADDED

Git LFS Details

  • SHA256: 92680c603480e472a718643a447abed80b76aedcbf8965e0b6571985ed552a6b
  • Pointer size: 131 Bytes
  • Size of remote file: 873 kB
asset/model_paths.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ output/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth
2
+ output/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth