import os
import random
import uuid
import json
import time
import asyncio
import re
from threading import Thread
from io import BytesIO
import subprocess

import gradio as gr
import spaces
import torch
import numpy as np
from PIL import Image
import edge_tts

# Install flash-attn without building CUDA kernels (if needed).
# The extra variable is merged into os.environ so pip still sees PATH etc.;
# passing a bare dict would wipe the subprocess environment.
subprocess.run(
    'pip install flash-attn --no-build-isolation',
    env={**os.environ, 'FLASH_ATTENTION_SKIP_CUDA_BUILD': "TRUE"},
    shell=True
)

from transformers import AutoProcessor, AutoModelForImageTextToText, TextIteratorStreamer
from diffusers import DiffusionPipeline

# ------------------------------------------------------------------------------
# Global Configurations
# ------------------------------------------------------------------------------
DESCRIPTION = "# SmolVLM2 with Flux.1 Integration 📺"
if not torch.cuda.is_available():
    DESCRIPTION += "\n<p>⚠️ Running on CPU. This demo may not work on CPU.</p>"
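# NOTE: the pipelines below assume a CUDA GPU with enough memory for both
# FLUX.1-dev (bfloat16) and SmolVLM2-2.2B-Instruct. On CPU a device is still
# selected, but generation will be impractically slow or may fail outright,
# hence the warning appended to DESCRIPTION above.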
" css = ''' h1 { text-align: center; display: block; } ''' device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # ------------------------------------------------------------------------------ # FLUX.1 IMAGE GENERATION SETUP # ------------------------------------------------------------------------------ MAX_SEED = np.iinfo(np.int32).max def save_image(img: Image.Image) -> str: """Save a PIL image with a unique filename and return the path.""" unique_name = str(uuid.uuid4()) + ".png" img.save(unique_name) return unique_name def randomize_seed_fn(seed: int, randomize_seed: bool) -> int: if randomize_seed: seed = random.randint(0, MAX_SEED) return seed # Initialize Flux.1 pipeline base_model = "black-forest-labs/FLUX.1-dev" pipe = DiffusionPipeline.from_pretrained(base_model, torch_dtype=torch.bfloat16) lora_repo = "strangerzonehf/Flux-Super-Realism-LoRA" trigger_word = "Super Realism" # Leave blank if no trigger word is needed. pipe.load_lora_weights(lora_repo) pipe.to("cuda") # Define style prompts for Flux.1 style_list = [ { "name": "3840 x 2160", "prompt": "hyper-realistic 8K image of {prompt}. ultra-detailed, lifelike, high-resolution, sharp, vibrant colors, photorealistic", }, { "name": "2560 x 1440", "prompt": "hyper-realistic 4K image of {prompt}. ultra-detailed, lifelike, high-resolution, sharp, vibrant colors, photorealistic", }, { "name": "HD+", "prompt": "hyper-realistic 2K image of {prompt}. ultra-detailed, lifelike, high-resolution, sharp, vibrant colors, photorealistic", }, { "name": "Style Zero", "prompt": "{prompt}", }, ] styles = {s["name"]: s["prompt"] for s in style_list} DEFAULT_STYLE_NAME = "3840 x 2160" STYLE_NAMES = list(styles.keys()) def apply_style(style_name: str, positive: str) -> str: return styles.get(style_name, styles[DEFAULT_STYLE_NAME]).replace("{prompt}", positive) def generate_image_flux( prompt: str, seed: int = 0, width: int = 1024, height: int = 1024, guidance_scale: float = 3, randomize_seed: bool = False, style_name: str = DEFAULT_STYLE_NAME, ): """Generate an image using the Flux.1 pipeline with style prompts.""" seed = int(randomize_seed_fn(seed, randomize_seed)) positive_prompt = apply_style(style_name, prompt) if trigger_word: positive_prompt = f"{trigger_word} {positive_prompt}" images = pipe( prompt=positive_prompt, width=width, height=height, guidance_scale=guidance_scale, num_inference_steps=28, num_images_per_prompt=1, output_type="pil", ).images image_paths = [save_image(img) for img in images] return image_paths, seed # ------------------------------------------------------------------------------ # SMOLVLM2 MODEL SETUP # ------------------------------------------------------------------------------ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct") model = AutoModelForImageTextToText.from_pretrained( "HuggingFaceTB/SmolVLM2-2.2B-Instruct", _attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16 ).to("cuda:0") # ------------------------------------------------------------------------------ # CHAT / INFERENCE FUNCTION # ------------------------------------------------------------------------------ @spaces.GPU def model_inference(input_dict, history, max_tokens): """ Implements a chat interface using SmolVLM2. Special behavior: - If the query text starts with "@image", the Flux.1 pipeline is used to generate an image. - Otherwise, the query is processed with SmolVLM2. - In the SmolVLM2 branch, a progress message "Processing with SmolVLM2..." is yielded. 
""" text = input_dict["text"] files = input_dict.get("files", []) # If the text begins with "@image", use Flux.1 image generation. if text.strip().lower().startswith("@image"): prompt = text[len("@image"):].strip() yield "Hold Tight Generating Flux.1 Image..." image_paths, used_seed = generate_image_flux( prompt=prompt, seed=1, width=1024, height=1024, guidance_scale=3, randomize_seed=True, style_name=DEFAULT_STYLE_NAME, ) yield gr.Image(image_paths[0]) return # Default: Use SmolVLM2 inference. yield "Processing with SmolVLM2..." user_content = [] media_queue = [] # If no conversation history, process current input. if not history: text = text.strip() for file in files: if file.endswith((".png", ".jpg", ".jpeg", ".gif", ".bmp")): media_queue.append({"type": "image", "path": file}) elif file.endswith((".mp4", ".mov", ".avi", ".mkv", ".flv")): media_queue.append({"type": "video", "path": file}) if "" in text or "