Spaces: Luigi/ZeroGPU-LLM-Inference

Runtime error


Luigi committed on Apr 12

Commit a703203

1 Parent(s): 794ee70

switch to gradio version for stability reason


Files changed (3)

README.md +19 -18
app.py +185 -217
requirements.txt +2 -1

README.md CHANGED

@@ -3,15 +3,22 @@ title: Multi-GGUF LLM Inference
 emoji: 🧠
 colorFrom: pink
 colorTo: purple
-sdk: streamlit
-sdk_version: 1.44.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Run GGUF models with llama.cpp
 ---
-This Streamlit app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`.
 ### 🔄 Supported Models:
 - `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
@@ -23,19 +30,13 @@ This Streamlit app enables **chat-based inference** on various GGUF models using
 - `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`
 ### ⚙️ Features:
-- Model selection in the sidebar
-- Customizable system prompt and generation parameters
-- Chat-style UI with streaming responses
-- **Markdown output rendering** for readable, styled output
-- **DeepSeek-compatible `<think>` tag handling** — shows model reasoning in a collapsible expander
-### 🧠 Memory-Safe Design (for HuggingFace Spaces):
-- Loads only **one model at a time** to prevent memory bloat
-- Utilizes **manual unloading and `gc.collect()`** to free memory when switching models
-- Adjusts `n_ctx` context length to operate within a 16 GB RAM limit
-- Automatically downloads models as needed
-- Limits history to the **last 8 user-assistant turns** to prevent context overflow
-Ideal for deploying multiple GGUF chat models on **free-tier HuggingFace Spaces**!
-Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference

 emoji: 🧠
 colorFrom: pink
 colorTo: purple
+sdk: gradio
+sdk_version: 3.29.1
 app_file: app.py
 pinned: false
 license: apache-2.0
+short_description: Chat-based inference for GGUF models using llama.cpp and Gradio
 ---
+This Gradio app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`. The application features:
+- **Real-Time Web Search Integration:** Uses DuckDuckGo to retrieve up-to-date context; debug output is displayed in real time.
+- **Streaming Token-by-Token Responses:** Users see the generated answer as it comes in.
+- **Response Cancellation:** A cancel button allows stopping response generation in progress.
+- **Customizable Prompts & Generation Parameters:** Adjust the system prompt (with dynamic date insertion), temperature, token limits, and more.
+- **Memory-Safe Design:** Loads one model at a time with proper memory management, ideal for deployment on Hugging Face Spaces.
+- **Rate Limit Handling:** Implements exponential backoff to cope with DuckDuckGo API rate limits.
 ### 🔄 Supported Models:
 - `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
 - `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`
 ### ⚙️ Features:
+- **Model Selection:** Select from multiple GGUF models.
+- **Customizable Prompts & Parameters:** Set a system prompt (e.g., automatically including today’s date), adjust temperature, token limits, and more.
+- **Chat-style Interface:** Interactive Gradio UI with streaming token-by-token responses.
+- **Real-Time Web Search & Debug Output:** Leverages DuckDuckGo to fetch recent context, with a dedicated debug panel showing web search progress and results.
+- **Response Cancellation:** Cancel in-progress answer generation using a cancel button.
+- **Memory-Safe & Rate-Limit Resilient:** Loads one model at a time with proper cleanup and incorporates exponential backoff to handle API rate limits.
+Ideal for deploying multiple GGUF chat models on Hugging Face Spaces with a robust, user-friendly interface!
+For further details, check the [Spaces configuration guide](https://huggingface.co/docs/hub/spaces-config-reference).
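
The new README mentions exponential backoff for DuckDuckGo rate limits, but the `retrieve_context` function added in this commit only catches exceptions and returns an empty string, so the retry logic is not visible in this diff. As a minimal sketch of what such a wrapper could look like, the following uses a hypothetical helper `retry_with_backoff` (its name and the parameters `max_attempts`, `base_delay`, `max_delay`, and `sleep` are assumptions for illustration and are not part of the commit):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponentially growing delays.

    Hypothetical helper: not taken from the commit shown in this diff.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

A search call could then be wrapped as `retry_with_backoff(lambda: retrieve_context(query))`, provided `retrieve_context` is changed to let rate-limit errors propagate instead of swallowing them. Note that the 2-second `join` timeout used in `chat_response` would cut off any retry that waits longer than that.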

app.py CHANGED

@@ -1,38 +1,23 @@
-import streamlit as st
-import os, gc, shutil, re, time, threading, queue
 from itertools import islice
 from llama_cpp import Llama
 from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
 from huggingface_hub import hf_hub_download
 from duckduckgo_search import DDGS
 # ------------------------------
-# Initialize Session State
 # ------------------------------
-if "chat_history" not in st.session_state:
-    st.session_state.chat_history = []
-if "pending_response" not in st.session_state:
-    st.session_state.pending_response = False
-if "model_name" not in st.session_state:
-    st.session_state.model_name = None
-if "llm" not in st.session_state:
-    st.session_state.llm = None
 # ------------------------------
-# Custom CSS for Improved Look & Feel
-# ------------------------------
-st.markdown("""
-<style>
-    .chat-container { margin: 1em 0; }
-    .chat-assistant { background-color: #eef7ff; padding: 1em; border-radius: 10px; margin-bottom: 1em; }
-    .chat-user { background-color: #e6ffe6; padding: 1em; border-radius: 10px; margin-bottom: 1em; }
-    .message-time { font-size: 0.8em; color: #555; text-align: right; }
-    .loading-spinner { font-size: 1.1em; color: #ff6600; }
-</style>
-""", unsafe_allow_html=True)
-# ------------------------------
-# Required Storage and Model Definitions
 # ------------------------------
 REQUIRED_SPACE_BYTES = 5 * 1024 ** 3  # 5 GB
@@ -94,26 +79,13 @@ MODELS = {
     },
 }
 # ------------------------------
-# Helper Functions
 # ------------------------------
-def retrieve_context(query, max_results=6, max_chars_per_result=600):
-    """Retrieve web search context using DuckDuckGo."""
-    try:
-        with DDGS() as ddgs:
-            results = list(islice(ddgs.text(query, region="wt-wt", safesearch="off", timelimit="y"), max_results))
-            context = ""
-            for i, result in enumerate(results, start=1):
-                title = result.get("title", "No Title")
-                snippet = result.get("body", "")[:max_chars_per_result]
-                context += f"Result {i}:\nTitle: {title}\nSnippet: {snippet}\n\n"
-            return context.strip()
-    except Exception as e:
-        st.error(f"Error during web retrieval: {e}")
-        return ""
 def try_load_model(model_path):
-    """Attempt to initialize the model from a specified path."""
     try:
         return Llama(
             model_path=model_path,
@@ -132,26 +104,20 @@ def try_load_model(model_path):
         return str(e)
 def download_model(selected_model):
-    """Download the model using Hugging Face Hub."""
-    with st.spinner(f"Downloading {selected_model['filename']}..."):
-        hf_hub_download(
-            repo_id=selected_model["repo_id"],
-            filename=selected_model["filename"],
-            local_dir="./models",
-            local_dir_use_symlinks=False,
-        )
 def validate_or_download_model(selected_model):
-    """Ensure the model is available and loaded properly; download if necessary."""
     model_path = os.path.join("models", selected_model["filename"])
     os.makedirs("models", exist_ok=True)
     if not os.path.exists(model_path):
-        if shutil.disk_usage(".").free < REQUIRED_SPACE_BYTES:
-            st.info("Insufficient storage space. Consider cleaning up old models.")
         download_model(selected_model)
     result = try_load_model(model_path)
     if isinstance(result, str):
-        st.warning(f"Initial model load failed: {result}\nAttempting re-download...")
         try:
             os.remove(model_path)
         except Exception:
@@ -159,22 +125,98 @@ def validate_or_download_model(selected_model):
         download_model(selected_model)
         result = try_load_model(model_path)
         if isinstance(result, str):
-            st.error(f"Model failed to load after re-download: {result}")
-            st.stop()
     return result
 # ------------------------------
-# Caching the Model Loading
 # ------------------------------
-@st.cache_resource
-def load_cached_model(selected_model):
-    return validate_or_download_model(selected_model)
-def stream_response(llm, messages, max_tokens, temperature, top_k, top_p, repeat_penalty, response_queue):
-    """Stream the model response token-by-token."""
-    final_text = ""
     try:
-        stream = llm.create_chat_completion(
             messages=messages,
             max_tokens=max_tokens,
             temperature=temperature,
@@ -184,169 +226,95 @@ def stream_response(llm, messages, max_tokens, temperature, top_k, top_p, repeat
             stream=True,
         )
         for chunk in stream:
             if "choices" in chunk:
                 delta = chunk["choices"][0]["delta"].get("content", "")
-                final_text += delta
-                response_queue.put(delta)
                 if chunk["choices"][0].get("finish_reason", ""):
                     break
     except Exception as e:
-        response_queue.put(f"\nError: {e}")
-    response_queue.put(None)  # Signal the end of streaming
 # ------------------------------
-# Sidebar: Settings and Advanced Options
 # ------------------------------
-with st.sidebar:
-    st.header("⚙️ Settings")
-    # Basic Settings
-    selected_model_name = st.selectbox("Select Model", list(MODELS.keys()),
-                                       help="Choose from the available model configurations.")
-    system_prompt_base = st.text_area("System Prompt",
-                                       value="You are a helpful assistant.",
-                                       height=80,
-                                       help="Define the base context for the AI's responses.")
-    # Generation Parameters
-    st.subheader("Generation Parameters")
-    max_tokens = st.slider("Max Tokens", 64, 1024, 1024, step=32,
-                           help="The maximum number of tokens the assistant can generate.")
-    temperature = st.slider("Temperature", 0.1, 2.0, 0.7,
-                            help="Controls randomness. Lower values are more deterministic.")
-    top_k = st.slider("Top-K", 1, 100, 40,
-                      help="Limits the token candidates to the top-k tokens.")
-    top_p = st.slider("Top-P", 0.1, 1.0, 0.95,
-                      help="Nucleus sampling parameter; restricts to a cumulative probability.")
-    repeat_penalty = st.slider("Repetition Penalty", 1.0, 2.0, 1.1,
-                               help="Penalizes token repetition to improve output variety.")
-    # Advanced Settings using expandable sections
-    with st.expander("Web Search Settings"):
-        enable_search = st.checkbox("Enable Web Search", value=False,
-                                    help="Include recent web search context to augment the prompt.")
-        max_results = st.number_input("Max Results for Context", min_value=1, max_value=20, value=6, step=1,
-                                      help="How many search results to use.")
-        max_chars_per_result = st.number_input("Max Chars per Result", min_value=100, max_value=2000, value=600, step=50,
-                                               help="Max characters to extract from each search result.")
-# ------------------------------
-# Model Loading/Reloading if Needed
-# ------------------------------
-selected_model = MODELS[selected_model_name]
-if st.session_state.model_name != selected_model_name:
-    with st.spinner("Loading selected model..."):
-        st.session_state.llm = load_cached_model(selected_model)
-        st.session_state.model_name = selected_model_name
-llm = st.session_state.llm
 # ------------------------------
-# Main Title and Chat History Display
 # ------------------------------
-st.title(f"🧠 {selected_model['description']}")
-st.caption(f"Powered by `llama.cpp` | Model: {selected_model['filename']}")
-# Render chat history with improved styling
-for chat in st.session_state.chat_history:
-    role = chat["role"]
-    content = chat["content"]
-    if role == "assistant":
-        st.markdown(f"<div class='chat-assistant'>{content}</div>", unsafe_allow_html=True)
-    else:
-        st.markdown(f"<div class='chat-user'>{content}</div>", unsafe_allow_html=True)
-# ------------------------------
-# Chat Input and Processing
-# ------------------------------
-user_input = st.chat_input("Your message...")
-if user_input:
-    if st.session_state.pending_response:
-        st.warning("Please wait until the current response is finished.")
-    else:
-        # Append user message with timestamp (if desired)
-        timestamp = time.strftime("%H:%M")
-        st.session_state.chat_history.append({"role": "user", "content": f"{user_input}\n\n<span class='message-time'>{timestamp}</span>"})
-        with st.chat_message("user"):
-            st.markdown(f"<div class='chat-user'>{user_input}</div>", unsafe_allow_html=True)
-        st.session_state.pending_response = True
-        # Retrieve web search context asynchronously, with a timeout, if enabled
-        retrieved_context = ""
-        if enable_search:
-            result_list = []
-            def run_search():
-                result = retrieve_context(user_input, max_results=max_results, max_chars_per_result=max_chars_per_result)
-                result_list.append(result)
-            search_thread = threading.Thread(target=run_search)
-            search_thread.start()
-            # Wait only up to 2 seconds for the search to return
-            search_thread.join(timeout=2)
-            if result_list:
-                retrieved_context = result_list[0]
-            # Display whichever result (or lack thereof) in the sidebar
-            with st.sidebar:
-                st.markdown("### Retrieved Context")
-                st.text_area("", value=retrieved_context or "No context found.", height=150)
-        # Augment the user prompt with the system prompt and optional web context
-        if enable_search and retrieved_context:
-            augmented_user_input = (
-                f"{system_prompt_base.strip()}\n\n"
-                f"Use the following recent web search context to help answer the query:\n\n"
-                f"{retrieved_context}\n\n"
-                f"User Query: {user_input}"
             )
-        else:
-            augmented_user_input = f"{system_prompt_base.strip()}\n\nUser Query: {user_input}"
-        # Limit conversation history to the last few turns (for context)
-        MAX_TURNS = 2
-        trimmed_history = st.session_state.chat_history[-(MAX_TURNS * 2):]
-        if trimmed_history and trimmed_history[-1]["role"] == "user":
-            messages = trimmed_history[:-1] + [{"role": "user", "content": augmented_user_input}]
-        else:
-            messages = trimmed_history + [{"role": "user", "content": augmented_user_input}]
-        # Set up a placeholder for displaying the streaming response and a queue for tokens
-        visible_placeholder = st.empty()
-        progress_bar = st.progress(0)
-        response_queue = queue.Queue()
-        # Start streaming response in a separate thread
-        stream_thread = threading.Thread(
-            target=stream_response,
-            args=(llm, messages, max_tokens, temperature, top_k, top_p, repeat_penalty, response_queue),
-            daemon=True
-        )
-        stream_thread.start()
-        # Poll the queue to update the UI with incremental tokens and update progress
-        final_response = ""
-        timeout = 300  # seconds
-        start_time = time.time()
-        progress = 0
-        while True:
-            try:
-                update = response_queue.get(timeout=0.1)
-                if update is None:
-                    break
-                final_response += update
-                # Remove any special tags from the output (for cleaner UI)
-                visible_response = re.sub(r"<think>.*?</think>", "", final_response, flags=re.DOTALL)
-                visible_placeholder.markdown(f"<div class='chat-assistant'>{visible_response}</div>", unsafe_allow_html=True)
-                progress = min(progress + 1, 100)
-                progress_bar.progress(progress)
-                start_time = time.time()
-            except queue.Empty:
-                if time.time() - start_time > timeout:
-                    st.error("Response generation timed out.")
-                    break
-        # Append assistant response with timestamp
-        timestamp = time.strftime("%H:%M")
-        st.session_state.chat_history.append({"role": "assistant", "content": f"{final_response}\n\n<span class='message-time'>{timestamp}</span>"})
-        st.session_state.pending_response = False
-        progress_bar.empty()  # Clear progress bar
-        gc.collect()

+import os
+import time
+import re
+import gc
+import threading
 from itertools import islice
+from datetime import datetime
+import gradio as gr
 from llama_cpp import Llama
 from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
 from huggingface_hub import hf_hub_download
 from duckduckgo_search import DDGS
 # ------------------------------
+# Global Cancellation Event
 # ------------------------------
+cancel_event = threading.Event()
 # ------------------------------
+# Model Definitions and Global Variables
 # ------------------------------
 REQUIRED_SPACE_BYTES = 5 * 1024 ** 3  # 5 GB
     },
 }
+LOADED_MODELS = {}
+CURRENT_MODEL_NAME = None
 # ------------------------------
+# Model Loading Helper Functions
 # ------------------------------
 def try_load_model(model_path):
     try:
         return Llama(
             model_path=model_path,
         return str(e)
 def download_model(selected_model):
+    hf_hub_download(
+        repo_id=selected_model["repo_id"],
+        filename=selected_model["filename"],
+        local_dir="./models",
+        local_dir_use_symlinks=False,
+    )
 def validate_or_download_model(selected_model):
     model_path = os.path.join("models", selected_model["filename"])
     os.makedirs("models", exist_ok=True)
     if not os.path.exists(model_path):
         download_model(selected_model)
     result = try_load_model(model_path)
     if isinstance(result, str):
         try:
             os.remove(model_path)
         except Exception:
         download_model(selected_model)
         result = try_load_model(model_path)
         if isinstance(result, str):
+            raise Exception(f"Model load failed: {result}")
     return result
+def load_model(model_name):
+    global LOADED_MODELS, CURRENT_MODEL_NAME
+    if model_name in LOADED_MODELS:
+        return LOADED_MODELS[model_name]
+    selected_model = MODELS[model_name]
+    model = validate_or_download_model(selected_model)
+    LOADED_MODELS[model_name] = model
+    CURRENT_MODEL_NAME = model_name
+    return model
 # ------------------------------
+# Web Search Context Retrieval Function
 # ------------------------------
+def retrieve_context(query, max_results=6, max_chars_per_result=600):
+    try:
+        with DDGS() as ddgs:
+            results = list(islice(ddgs.text(query, region="wt-wt", safesearch="off", timelimit="y"), max_results))
+            context = ""
+            for i, result in enumerate(results, start=1):
+                title = result.get("title", "No Title")
+                snippet = result.get("body", "")[:max_chars_per_result]
+                context += f"Result {i}:\nTitle: {title}\nSnippet: {snippet}\n\n"
+            return context.strip()
+    except Exception:
+        return ""
+# ------------------------------
+# Chat Response Generation (Streaming) with Cancellation
+# ------------------------------
+def chat_response(user_message, chat_history, system_prompt, enable_search,
+                  max_results, max_chars, model_name, max_tokens, temperature, top_k, top_p, repeat_penalty):
+    """
+    Generator function that:
+      - Uses the chat history (list of dicts) from the Chatbot.
+      - Appends the new user message.
+      - Optionally retrieves web search context.
+      - Streams the assistant response token-by-token.
+      - Checks for cancellation.
+    """
+    # Reset the cancellation event.
+    cancel_event.clear()
+    # Prepare internal history.
+    internal_history = list(chat_history) if chat_history else []
+    internal_history.append({"role": "user", "content": user_message})
+    # Retrieve web search context (with debug feedback).
+    debug_message = ""
+    if enable_search:
+        debug_message = "Initiating web search..."
+        yield internal_history, debug_message
+        search_result = [""]
+        def do_search():
+            search_result[0] = retrieve_context(user_message, max_results, max_chars)
+        search_thread = threading.Thread(target=do_search)
+        search_thread.start()
+        search_thread.join(timeout=2)
+        retrieved_context = search_result[0]
+        if retrieved_context:
+            debug_message = f"Web search results:\n\n{retrieved_context}"
+        else:
+            debug_message = "Web search returned no results or timed out."
+    else:
+        retrieved_context = ""
+        debug_message = "Web search disabled."
+    # Augment prompt.
+    if enable_search and retrieved_context:
+        augmented_user_input = (
+            f"{system_prompt.strip()}\n\n"
+            "Use the following recent web search context to help answer the query:\n\n"
+            f"{retrieved_context}\n\n"
+            f"User Query: {user_message}"
+        )
+    else:
+        augmented_user_input = f"{system_prompt.strip()}\n\nUser Query: {user_message}"
+    # Build final prompt messages.
+    messages = internal_history[:-1] + [{"role": "user", "content": augmented_user_input}]
+    # Load the model.
+    model = load_model(model_name)
+    # Add an empty assistant message.
+    internal_history.append({"role": "assistant", "content": ""})
+    assistant_message = ""
     try:
+        stream = model.create_chat_completion(
             messages=messages,
             max_tokens=max_tokens,
             temperature=temperature,
             stream=True,
         )
         for chunk in stream:
+            # Check if a cancellation has been requested.
+            if cancel_event.is_set():
+                assistant_message += "\n\n[Response generation cancelled by user]"
+                internal_history[-1]["content"] = assistant_message
+                yield internal_history, debug_message
+                break
             if "choices" in chunk:
                 delta = chunk["choices"][0]["delta"].get("content", "")
+                assistant_message += delta
+                internal_history[-1]["content"] = assistant_message
+                yield internal_history, debug_message
                 if chunk["choices"][0].get("finish_reason", ""):
                     break
     except Exception as e:
+        internal_history[-1]["content"] = f"Error: {e}"
+        yield internal_history, debug_message
+    gc.collect()
 # ------------------------------
+# Cancel Function
 # ------------------------------
+def cancel_generation():
+    cancel_event.set()
+    return "Cancellation requested."
 # ------------------------------
+# Gradio UI Definition
 # ------------------------------
+with gr.Blocks(title="Multi-GGUF LLM Inference") as demo:
+    gr.Markdown("## 🧠 Multi-GGUF LLM Inference with Web Search")
+    gr.Markdown("Interact with the model. Select your model, set your system prompt, and adjust parameters on the left.")
+    with gr.Row():
+        with gr.Column(scale=3):
+            default_model = list(MODELS.keys())[0] if MODELS else "No models available"
+            model_dropdown = gr.Dropdown(
+                label="Select Model",
+                choices=list(MODELS.keys()) if MODELS else [],
+                value=default_model,
+                info="Choose from available models."
             )
+            today = datetime.now().strftime('%Y-%m-%d')
+            default_prompt = f"You are a helpful assistant. Today is {today}. Please leverage the latest web data when responding to queries."
+            system_prompt_text = gr.Textbox(label="System Prompt",
+                                            value=default_prompt,
+                                            lines=3,
+                                            info="Define the base context for the AI's responses.")
+            gr.Markdown("### Generation Parameters")
+            max_tokens_slider = gr.Slider(label="Max Tokens", minimum=64, maximum=1024, value=1024, step=32,
+                                          info="Maximum tokens for the response.")
+            temperature_slider = gr.Slider(label="Temperature", minimum=0.1, maximum=2.0, value=0.7, step=0.1,
+                                           info="Controls the randomness of the output.")
+            top_k_slider = gr.Slider(label="Top-K", minimum=1, maximum=100, value=40, step=1,
+                                     info="Limits token candidates to the top-k tokens.")
+            top_p_slider = gr.Slider(label="Top-P (Nucleus Sampling)", minimum=0.1, maximum=1.0, value=0.95, step=0.05,
+                                     info="Limits token candidates to a cumulative probability threshold.")
+            repeat_penalty_slider = gr.Slider(label="Repetition Penalty", minimum=1.0, maximum=2.0, value=1.1, step=0.1,
+                                              info="Penalizes token repetition to improve diversity.")
+            gr.Markdown("### Web Search Settings")
+            enable_search_checkbox = gr.Checkbox(label="Enable Web Search", value=False,
+                                                 info="Include recent search context to improve answers.")
+            max_results_number = gr.Number(label="Max Search Results", value=6, precision=0,
+                                           info="Maximum number of search results to retrieve.")
+            max_chars_number = gr.Number(label="Max Chars per Result", value=600, precision=0,
+                                         info="Maximum characters to retrieve per search result.")
+            clear_button = gr.Button("Clear Chat")
+            cancel_button = gr.Button("Cancel Generation")
+        with gr.Column(scale=7):
+            chatbot = gr.Chatbot(label="Chat", type="messages")
+            msg_input = gr.Textbox(label="Your Message", placeholder="Enter your message and press Enter")
+            search_debug = gr.Markdown(label="Web Search Debug")
+    def clear_chat():
+        return [], "", ""
+    clear_button.click(fn=clear_chat, outputs=[chatbot, msg_input, search_debug])
+    cancel_button.click(fn=cancel_generation, outputs=search_debug)
+    # Submission that returns conversation and debug info.
+    msg_input.submit(
+        fn=chat_response,
+        inputs=[msg_input, chatbot, system_prompt_text, enable_search_checkbox,
+                max_results_number, max_chars_number, model_dropdown,
+                max_tokens_slider, temperature_slider, top_k_slider, top_p_slider, repeat_penalty_slider],
+        outputs=[chatbot, search_debug],
+        # Uncomment streaming=True if supported.
+        # streaming=True,
+    )
+demo.launch()
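
The cancellation logic in `chat_response` can be hard to follow inside the full function, so here is a stripped-down sketch of the same pattern: a generator that yields the growing response text and checks a `threading.Event` between chunks. The function name `stream_with_cancel` and the plain list of string chunks are assumptions for illustration; the commit itself iterates over the stream returned by `create_chat_completion`.

```python
import threading
from typing import Iterable, Iterator


def stream_with_cancel(chunks: Iterable[str], event: threading.Event) -> Iterator[str]:
    """Yield the accumulated text after each chunk, stopping once event is set.

    Hypothetical helper that mirrors the loop in chat_response.
    """
    text = ""
    for chunk in chunks:
        # Cancellation is only observed between chunks, never mid-chunk.
        if event.is_set():
            yield text + "\n\n[Response generation cancelled by user]"
            return
        text += chunk
        yield text


if __name__ == "__main__":
    cancel_event = threading.Event()
    for partial in stream_with_cancel(["Hel", "lo", " world"], cancel_event):
        print(repr(partial))
        if partial == "Hello":
            cancel_event.set()  # simulates the user pressing the cancel button
```

One design point worth keeping in mind: the commit uses a single module-level `cancel_event`, so a cancel request is shared by every session served by the same process. Passing a per-request event, as in the sketch, is one way to avoid that.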

requirements.txt CHANGED

@@ -4,4 +4,5 @@ docopt @ https://github.com/GoogleCloudPlatform/gcloud-python-wheels/raw/refs/he
 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 llama-cpp-python
 streamlit
-duckduckgo_search

 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 llama-cpp-python
 streamlit
+duckduckgo_search
+gradio