Luigi committed on
Commit a703203 · 1 Parent(s): 794ee70

switch to gradio version for stability reason

Files changed (3):
  1. README.md +19 -18
  2. app.py +185 -217
  3. requirements.txt +2 -1
README.md CHANGED
@@ -3,15 +3,22 @@ title: Multi-GGUF LLM Inference
 emoji: 🧠
 colorFrom: pink
 colorTo: purple
-sdk: streamlit
-sdk_version: 1.44.1
+sdk: gradio
+sdk_version: 3.29.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Run GGUF models with llama.cpp
+short_description: Chat-based inference for GGUF models using llama.cpp and Gradio
 ---

-This Streamlit app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`.
+This Gradio app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`. The application features:
+
+- **Real-Time Web Search Integration:** Uses DuckDuckGo to retrieve up-to-date context; debug output is displayed in real time.
+- **Streaming Token-by-Token Responses:** Users see the generated answer as it comes in.
+- **Response Cancellation:** A cancel button allows stopping response generation in progress.
+- **Customizable Prompts & Generation Parameters:** Adjust the system prompt (with dynamic date insertion), temperature, token limits, and more.
+- **Memory-Safe Design:** Loads one model at a time with proper memory management, ideal for deployment on Hugging Face Spaces.
+- **Rate Limit Handling:** Implements exponential backoff to cope with DuckDuckGo API rate limits.

 ### 🔄 Supported Models:
 - `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
@@ -23,19 +30,13 @@ This Streamlit app enables **chat-based inference** on various GGUF models using
 - `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`

 ### ⚙️ Features:
-- Model selection in the sidebar
-- Customizable system prompt and generation parameters
-- Chat-style UI with streaming responses
-- **Markdown output rendering** for readable, styled output
-- **DeepSeek-compatible `<think>` tag handling** shows model reasoning in a collapsible expander
-
-### 🧠 Memory-Safe Design (for HuggingFace Spaces):
-- Loads only **one model at a time** to prevent memory bloat
-- Utilizes **manual unloading and `gc.collect()`** to free memory when switching models
-- Adjusts `n_ctx` context length to operate within a 16 GB RAM limit
-- Automatically downloads models as needed
-- Limits history to the **last 8 user-assistant turns** to prevent context overflow
+- **Model Selection:** Select from multiple GGUF models.
+- **Customizable Prompts & Parameters:** Set a system prompt (e.g., automatically including today’s date), adjust temperature, token limits, and more.
+- **Chat-style Interface:** Interactive Gradio UI with streaming token-by-token responses.
+- **Real-Time Web Search & Debug Output:** Leverages DuckDuckGo to fetch recent context, with a dedicated debug panel showing web search progress and results.
+- **Response Cancellation:** Cancel in-progress answer generation using a cancel button.
+- **Memory-Safe & Rate-Limit Resilient:** Loads one model at a time with proper cleanup and incorporates exponential backoff to handle API rate limits.

-Ideal for deploying multiple GGUF chat models on **free-tier HuggingFace Spaces**!
+Ideal for deploying multiple GGUF chat models on Hugging Face Spaces with a robust, user-friendly interface!

-Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference
+For further details, check the [Spaces configuration guide](https://huggingface.co/docs/hub/spaces-config-reference).
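
The README above advertises exponential backoff for DuckDuckGo rate limits, while the `retrieve_context` helper in the new `app.py` below performs a single query. A minimal sketch of what such a backoff wrapper might look like — the function name, retry count, and base delay are illustrative assumptions, not code from this commit:

```python
import time
from itertools import islice

from duckduckgo_search import DDGS

def retrieve_context_with_backoff(query, max_results=6, max_retries=4, base_delay=1.0):
    """Query DuckDuckGo, waiting exponentially longer after each failed attempt.

    max_retries and base_delay are illustrative defaults, not values from the commit.
    """
    for attempt in range(max_retries):
        try:
            with DDGS() as ddgs:
                # Same text-search call used by the app's retrieve_context helper.
                return list(islice(ddgs.text(query, region="wt-wt", safesearch="off", timelimit="y"), max_results))
        except Exception:
            # Back off 1s, 2s, 4s, ... before the next attempt; give up after the last one.
            time.sleep(base_delay * (2 ** attempt))
    return []
```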
app.py CHANGED
@@ -1,38 +1,23 @@
-import streamlit as st
-import os, gc, shutil, re, time, threading, queue
+import os
+import time
+import re
+import gc
+import threading
 from itertools import islice
+from datetime import datetime
+import gradio as gr
 from llama_cpp import Llama
 from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
 from huggingface_hub import hf_hub_download
 from duckduckgo_search import DDGS

 # ------------------------------
-# Initialize Session State
+# Global Cancellation Event
 # ------------------------------
-if "chat_history" not in st.session_state:
-    st.session_state.chat_history = []
-if "pending_response" not in st.session_state:
-    st.session_state.pending_response = False
-if "model_name" not in st.session_state:
-    st.session_state.model_name = None
-if "llm" not in st.session_state:
-    st.session_state.llm = None
+cancel_event = threading.Event()

 # ------------------------------
-# Custom CSS for Improved Look & Feel
-# ------------------------------
-st.markdown("""
-<style>
-.chat-container { margin: 1em 0; }
-.chat-assistant { background-color: #eef7ff; padding: 1em; border-radius: 10px; margin-bottom: 1em; }
-.chat-user { background-color: #e6ffe6; padding: 1em; border-radius: 10px; margin-bottom: 1em; }
-.message-time { font-size: 0.8em; color: #555; text-align: right; }
-.loading-spinner { font-size: 1.1em; color: #ff6600; }
-</style>
-""", unsafe_allow_html=True)
-
-# ------------------------------
-# Required Storage and Model Definitions
+# Model Definitions and Global Variables
 # ------------------------------
 REQUIRED_SPACE_BYTES = 5 * 1024 ** 3 # 5 GB

@@ -94,26 +79,13 @@ MODELS = {
     },
 }

+LOADED_MODELS = {}
+CURRENT_MODEL_NAME = None
+
 # ------------------------------
-# Helper Functions
+# Model Loading Helper Functions
 # ------------------------------
-def retrieve_context(query, max_results=6, max_chars_per_result=600):
-    """Retrieve web search context using DuckDuckGo."""
-    try:
-        with DDGS() as ddgs:
-            results = list(islice(ddgs.text(query, region="wt-wt", safesearch="off", timelimit="y"), max_results))
-        context = ""
-        for i, result in enumerate(results, start=1):
-            title = result.get("title", "No Title")
-            snippet = result.get("body", "")[:max_chars_per_result]
-            context += f"Result {i}:\nTitle: {title}\nSnippet: {snippet}\n\n"
-        return context.strip()
-    except Exception as e:
-        st.error(f"Error during web retrieval: {e}")
-        return ""
-
 def try_load_model(model_path):
-    """Attempt to initialize the model from a specified path."""
     try:
         return Llama(
             model_path=model_path,
@@ -132,26 +104,20 @@ def try_load_model(model_path):
         return str(e)

 def download_model(selected_model):
-    """Download the model using Hugging Face Hub."""
-    with st.spinner(f"Downloading {selected_model['filename']}..."):
-        hf_hub_download(
-            repo_id=selected_model["repo_id"],
-            filename=selected_model["filename"],
-            local_dir="./models",
-            local_dir_use_symlinks=False,
-        )
+    hf_hub_download(
+        repo_id=selected_model["repo_id"],
+        filename=selected_model["filename"],
+        local_dir="./models",
+        local_dir_use_symlinks=False,
+    )

 def validate_or_download_model(selected_model):
-    """Ensure the model is available and loaded properly; download if necessary."""
     model_path = os.path.join("models", selected_model["filename"])
     os.makedirs("models", exist_ok=True)
     if not os.path.exists(model_path):
-        if shutil.disk_usage(".").free < REQUIRED_SPACE_BYTES:
-            st.info("Insufficient storage space. Consider cleaning up old models.")
         download_model(selected_model)
     result = try_load_model(model_path)
     if isinstance(result, str):
-        st.warning(f"Initial model load failed: {result}\nAttempting re-download...")
         try:
             os.remove(model_path)
         except Exception:
@@ -159,22 +125,98 @@ def validate_or_download_model(selected_model):
         download_model(selected_model)
         result = try_load_model(model_path)
         if isinstance(result, str):
-            st.error(f"Model failed to load after re-download: {result}")
-            st.stop()
+            raise Exception(f"Model load failed: {result}")
     return result

+def load_model(model_name):
+    global LOADED_MODELS, CURRENT_MODEL_NAME
+    if model_name in LOADED_MODELS:
+        return LOADED_MODELS[model_name]
+    selected_model = MODELS[model_name]
+    model = validate_or_download_model(selected_model)
+    LOADED_MODELS[model_name] = model
+    CURRENT_MODEL_NAME = model_name
+    return model
+
 # ------------------------------
-# Caching the Model Loading
+# Web Search Context Retrieval Function
 # ------------------------------
-@st.cache_resource
-def load_cached_model(selected_model):
-    return validate_or_download_model(selected_model)
+def retrieve_context(query, max_results=6, max_chars_per_result=600):
+    try:
+        with DDGS() as ddgs:
+            results = list(islice(ddgs.text(query, region="wt-wt", safesearch="off", timelimit="y"), max_results))
+        context = ""
+        for i, result in enumerate(results, start=1):
+            title = result.get("title", "No Title")
+            snippet = result.get("body", "")[:max_chars_per_result]
+            context += f"Result {i}:\nTitle: {title}\nSnippet: {snippet}\n\n"
+        return context.strip()
+    except Exception:
+        return ""

-def stream_response(llm, messages, max_tokens, temperature, top_k, top_p, repeat_penalty, response_queue):
-    """Stream the model response token-by-token."""
-    final_text = ""
+# ------------------------------
+# Chat Response Generation (Streaming) with Cancellation
+# ------------------------------
+def chat_response(user_message, chat_history, system_prompt, enable_search,
+                  max_results, max_chars, model_name, max_tokens, temperature, top_k, top_p, repeat_penalty):
+    """
+    Generator function that:
+      - Uses the chat history (list of dicts) from the Chatbot.
+      - Appends the new user message.
+      - Optionally retrieves web search context.
+      - Streams the assistant response token-by-token.
+      - Checks for cancellation.
+    """
+    # Reset the cancellation event.
+    cancel_event.clear()
+
+    # Prepare internal history.
+    internal_history = list(chat_history) if chat_history else []
+    internal_history.append({"role": "user", "content": user_message})
+
+    # Retrieve web search context (with debug feedback).
+    debug_message = ""
+    if enable_search:
+        debug_message = "Initiating web search..."
+        yield internal_history, debug_message
+        search_result = [""]
+        def do_search():
+            search_result[0] = retrieve_context(user_message, max_results, max_chars)
+        search_thread = threading.Thread(target=do_search)
+        search_thread.start()
+        search_thread.join(timeout=2)
+        retrieved_context = search_result[0]
+        if retrieved_context:
+            debug_message = f"Web search results:\n\n{retrieved_context}"
+        else:
+            debug_message = "Web search returned no results or timed out."
+    else:
+        retrieved_context = ""
+        debug_message = "Web search disabled."
+
+    # Augment prompt.
+    if enable_search and retrieved_context:
+        augmented_user_input = (
+            f"{system_prompt.strip()}\n\n"
+            "Use the following recent web search context to help answer the query:\n\n"
+            f"{retrieved_context}\n\n"
+            f"User Query: {user_message}"
+        )
+    else:
+        augmented_user_input = f"{system_prompt.strip()}\n\nUser Query: {user_message}"
+
+    # Build final prompt messages.
+    messages = internal_history[:-1] + [{"role": "user", "content": augmented_user_input}]
+
+    # Load the model.
+    model = load_model(model_name)
+
+    # Add an empty assistant message.
+    internal_history.append({"role": "assistant", "content": ""})
+    assistant_message = ""
+
     try:
-        stream = llm.create_chat_completion(
+        stream = model.create_chat_completion(
             messages=messages,
             max_tokens=max_tokens,
             temperature=temperature,
@@ -184,169 +226,95 @@ def stream_response(llm, messages, max_tokens, temperature, top_k, top_p, repeat
             stream=True,
         )
         for chunk in stream:
+            # Check if a cancellation has been requested.
+            if cancel_event.is_set():
+                assistant_message += "\n\n[Response generation cancelled by user]"
+                internal_history[-1]["content"] = assistant_message
+                yield internal_history, debug_message
+                break
+
             if "choices" in chunk:
                 delta = chunk["choices"][0]["delta"].get("content", "")
-                final_text += delta
-                response_queue.put(delta)
+                assistant_message += delta
+                internal_history[-1]["content"] = assistant_message
+                yield internal_history, debug_message
                 if chunk["choices"][0].get("finish_reason", ""):
                     break
     except Exception as e:
-        response_queue.put(f"\nError: {e}")
-        response_queue.put(None) # Signal the end of streaming
+        internal_history[-1]["content"] = f"Error: {e}"
+        yield internal_history, debug_message
+    gc.collect()

 # ------------------------------
-# Sidebar: Settings and Advanced Options
+# Cancel Function
 # ------------------------------
-with st.sidebar:
-    st.header("⚙️ Settings")
-
-    # Basic Settings
-    selected_model_name = st.selectbox("Select Model", list(MODELS.keys()),
-                                       help="Choose from the available model configurations.")
-    system_prompt_base = st.text_area("System Prompt",
-                                      value="You are a helpful assistant.",
-                                      height=80,
-                                      help="Define the base context for the AI's responses.")
-
-    # Generation Parameters
-    st.subheader("Generation Parameters")
-    max_tokens = st.slider("Max Tokens", 64, 1024, 1024, step=32,
-                           help="The maximum number of tokens the assistant can generate.")
-    temperature = st.slider("Temperature", 0.1, 2.0, 0.7,
-                            help="Controls randomness. Lower values are more deterministic.")
-    top_k = st.slider("Top-K", 1, 100, 40,
-                      help="Limits the token candidates to the top-k tokens.")
-    top_p = st.slider("Top-P", 0.1, 1.0, 0.95,
-                      help="Nucleus sampling parameter; restricts to a cumulative probability.")
-    repeat_penalty = st.slider("Repetition Penalty", 1.0, 2.0, 1.1,
-                               help="Penalizes token repetition to improve output variety.")
-
-    # Advanced Settings using expandable sections
-    with st.expander("Web Search Settings"):
-        enable_search = st.checkbox("Enable Web Search", value=False,
-                                    help="Include recent web search context to augment the prompt.")
-        max_results = st.number_input("Max Results for Context", min_value=1, max_value=20, value=6, step=1,
-                                      help="How many search results to use.")
-        max_chars_per_result = st.number_input("Max Chars per Result", min_value=100, max_value=2000, value=600, step=50,
-                                               help="Max characters to extract from each search result.")
-
-# ------------------------------
-# Model Loading/Reloading if Needed
-# ------------------------------
-selected_model = MODELS[selected_model_name]
-if st.session_state.model_name != selected_model_name:
-    with st.spinner("Loading selected model..."):
-        st.session_state.llm = load_cached_model(selected_model)
-    st.session_state.model_name = selected_model_name
-
-llm = st.session_state.llm
+def cancel_generation():
+    cancel_event.set()
+    return "Cancellation requested."

 # ------------------------------
-# Main Title and Chat History Display
+# Gradio UI Definition
 # ------------------------------
-st.title(f"🧠 {selected_model['description']}")
-st.caption(f"Powered by `llama.cpp` | Model: {selected_model['filename']}")
+with gr.Blocks(title="Multi-GGUF LLM Inference") as demo:
+    gr.Markdown("## 🧠 Multi-GGUF LLM Inference with Web Search")
+    gr.Markdown("Interact with the model. Select your model, set your system prompt, and adjust parameters on the left.")

-# Render chat history with improved styling
-for chat in st.session_state.chat_history:
-    role = chat["role"]
-    content = chat["content"]
-    if role == "assistant":
-        st.markdown(f"<div class='chat-assistant'>{content}</div>", unsafe_allow_html=True)
-    else:
-        st.markdown(f"<div class='chat-user'>{content}</div>", unsafe_allow_html=True)
-
-# ------------------------------
-# Chat Input and Processing
-# ------------------------------
-user_input = st.chat_input("Your message...")
-if user_input:
-    if st.session_state.pending_response:
-        st.warning("Please wait until the current response is finished.")
-    else:
-        # Append user message with timestamp (if desired)
-        timestamp = time.strftime("%H:%M")
-        st.session_state.chat_history.append({"role": "user", "content": f"{user_input}\n\n<span class='message-time'>{timestamp}</span>"})
-        with st.chat_message("user"):
-            st.markdown(f"<div class='chat-user'>{user_input}</div>", unsafe_allow_html=True)
-
-        st.session_state.pending_response = True
-
-        # Retrieve web search context asynchronously, with a timeout, if enabled
-        retrieved_context = ""
-        if enable_search:
-            result_list = []
-            def run_search():
-                result = retrieve_context(user_input, max_results=max_results, max_chars_per_result=max_chars_per_result)
-                result_list.append(result)
-            search_thread = threading.Thread(target=run_search)
-            search_thread.start()
-            # Wait only up to 2 seconds for the search to return
-            search_thread.join(timeout=2)
-            if result_list:
-                retrieved_context = result_list[0]
-            # Display whichever result (or lack thereof) in the sidebar
-            with st.sidebar:
-                st.markdown("### Retrieved Context")
-                st.text_area("", value=retrieved_context or "No context found.", height=150)
-
-        # Augment the user prompt with the system prompt and optional web context
-        if enable_search and retrieved_context:
-            augmented_user_input = (
-                f"{system_prompt_base.strip()}\n\n"
-                f"Use the following recent web search context to help answer the query:\n\n"
-                f"{retrieved_context}\n\n"
-                f"User Query: {user_input}"
+    with gr.Row():
+        with gr.Column(scale=3):
+            default_model = list(MODELS.keys())[0] if MODELS else "No models available"
+            model_dropdown = gr.Dropdown(
+                label="Select Model",
+                choices=list(MODELS.keys()) if MODELS else [],
+                value=default_model,
+                info="Choose from available models."
             )
-        else:
-            augmented_user_input = f"{system_prompt_base.strip()}\n\nUser Query: {user_input}"
-
-        # Limit conversation history to the last few turns (for context)
-        MAX_TURNS = 2
-        trimmed_history = st.session_state.chat_history[-(MAX_TURNS * 2):]
-        if trimmed_history and trimmed_history[-1]["role"] == "user":
-            messages = trimmed_history[:-1] + [{"role": "user", "content": augmented_user_input}]
-        else:
-            messages = trimmed_history + [{"role": "user", "content": augmented_user_input}]
-
-        # Set up a placeholder for displaying the streaming response and a queue for tokens
-        visible_placeholder = st.empty()
-        progress_bar = st.progress(0)
-        response_queue = queue.Queue()
-
-        # Start streaming response in a separate thread
-        stream_thread = threading.Thread(
-            target=stream_response,
-            args=(llm, messages, max_tokens, temperature, top_k, top_p, repeat_penalty, response_queue),
-            daemon=True
-        )
-        stream_thread.start()
-
-        # Poll the queue to update the UI with incremental tokens and update progress
-        final_response = ""
-        timeout = 300 # seconds
-        start_time = time.time()
-        progress = 0
-        while True:
-            try:
-                update = response_queue.get(timeout=0.1)
-                if update is None:
-                    break
-                final_response += update
-                # Remove any special tags from the output (for cleaner UI)
-                visible_response = re.sub(r"<think>.*?</think>", "", final_response, flags=re.DOTALL)
-                visible_placeholder.markdown(f"<div class='chat-assistant'>{visible_response}</div>", unsafe_allow_html=True)
-                progress = min(progress + 1, 100)
-                progress_bar.progress(progress)
-                start_time = time.time()
-            except queue.Empty:
-                if time.time() - start_time > timeout:
-                    st.error("Response generation timed out.")
-                    break
-
-        # Append assistant response with timestamp
-        timestamp = time.strftime("%H:%M")
-        st.session_state.chat_history.append({"role": "assistant", "content": f"{final_response}\n\n<span class='message-time'>{timestamp}</span>"})
-        st.session_state.pending_response = False
-        progress_bar.empty() # Clear progress bar
-        gc.collect()
+            today = datetime.now().strftime('%Y-%m-%d')
+            default_prompt = f"You are a helpful assistant. Today is {today}. Please leverage the latest web data when responding to queries."
+            system_prompt_text = gr.Textbox(label="System Prompt",
+                                            value=default_prompt,
+                                            lines=3,
+                                            info="Define the base context for the AI's responses.")
+            gr.Markdown("### Generation Parameters")
+            max_tokens_slider = gr.Slider(label="Max Tokens", minimum=64, maximum=1024, value=1024, step=32,
+                                          info="Maximum tokens for the response.")
+            temperature_slider = gr.Slider(label="Temperature", minimum=0.1, maximum=2.0, value=0.7, step=0.1,
+                                           info="Controls the randomness of the output.")
+            top_k_slider = gr.Slider(label="Top-K", minimum=1, maximum=100, value=40, step=1,
+                                     info="Limits token candidates to the top-k tokens.")
+            top_p_slider = gr.Slider(label="Top-P (Nucleus Sampling)", minimum=0.1, maximum=1.0, value=0.95, step=0.05,
+                                     info="Limits token candidates to a cumulative probability threshold.")
+            repeat_penalty_slider = gr.Slider(label="Repetition Penalty", minimum=1.0, maximum=2.0, value=1.1, step=0.1,
+                                              info="Penalizes token repetition to improve diversity.")
+            gr.Markdown("### Web Search Settings")
+            enable_search_checkbox = gr.Checkbox(label="Enable Web Search", value=False,
+                                                 info="Include recent search context to improve answers.")
+            max_results_number = gr.Number(label="Max Search Results", value=6, precision=0,
+                                           info="Maximum number of search results to retrieve.")
+            max_chars_number = gr.Number(label="Max Chars per Result", value=600, precision=0,
+                                         info="Maximum characters to retrieve per search result.")
+            clear_button = gr.Button("Clear Chat")
+            cancel_button = gr.Button("Cancel Generation")
+        with gr.Column(scale=7):
+            chatbot = gr.Chatbot(label="Chat", type="messages")
+            msg_input = gr.Textbox(label="Your Message", placeholder="Enter your message and press Enter")
+            search_debug = gr.Markdown(label="Web Search Debug")
+
+    def clear_chat():
+        return [], "", ""
+
+    clear_button.click(fn=clear_chat, outputs=[chatbot, msg_input, search_debug])
+
+    cancel_button.click(fn=cancel_generation, outputs=search_debug)
+
+    # Submission that returns conversation and debug info.
+    msg_input.submit(
+        fn=chat_response,
+        inputs=[msg_input, chatbot, system_prompt_text, enable_search_checkbox,
+                max_results_number, max_chars_number, model_dropdown,
+                max_tokens_slider, temperature_slider, top_k_slider, top_p_slider, repeat_penalty_slider],
+        outputs=[chatbot, search_debug],
+        # Uncomment streaming=True if supported.
+        # streaming=True,
+    )
+
+demo.launch()
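
Because `chat_response` is a generator, the `msg_input.submit(...)` wiring above streams each yielded `(chat_history, debug)` pair to the Chatbot and the debug panel as it is produced. If incremental updates do not appear on older Gradio releases, enabling the event queue before launching is the usual remedy; an illustrative variant of the final launch step (not code from this commit) would be:

```python
# Illustrative variant of the launch step above, not part of the commit:
# enable Gradio's event queue so yielded partial results reach the browser incrementally.
demo.queue()
demo.launch()
```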
 
 
requirements.txt CHANGED
@@ -4,4 +4,5 @@ docopt @ https://github.com/GoogleCloudPlatform/gcloud-python-wheels/raw/refs/he
 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 llama-cpp-python
 streamlit
-duckduckgo_search
+duckduckgo_search
+gradio