---
title: Websearch
emoji: 🏢
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.36.2
app_file: app.py
pinned: false
---
# Gradio News‑to‑Context Service
## Prerequisites

```bash
pip install gradio httpx trafilatura python-dateutil
```
## Environment

```bash
export SERPER_API_KEY="YOUR-KEY-HERE"
```
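At startup it is worth failing fast when the key is absent rather than erroring mid-request. A minimal sketch; `require_api_key` is an illustrative helper name, not part of the app's code:

```python
import os

def require_api_key(name: str = "SERPER_API_KEY") -> str:
    # Read the key from the environment and fail fast with a clear
    # message if it was never exported.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set {name} before starting the app")
    return key
```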
## How it works – design notes
| Step | Technique | Why it matters |
|---|---|---|
| API search | Serper’s Google‑News JSON endpoint | Fast, cost‑effective, and immune to Google’s bot‑blocking |
| Concurrency | `httpx.AsyncClient` + `asyncio.gather` | Fetches 10 articles in < 2 s on typical broadband |
| Extraction | Trafilatura | Consistently tops accuracy charts for main‑content extraction and needs no browser or heavy ML models |
| Date parsing | `python-dateutil` | Converts fuzzy strings (“16 hours ago”) into ISO `YYYY-MM-DD` so the LLM sees absolute dates |
| LLM‑friendly output | Markdown headings and horizontal rules | Chunk boundaries are explicit; hyperlinks are preserved for optional citation |
## Extending in production

- **Caching** – add `aiocache` or Redis to avoid re‑fetching identical URLs within a TTL.
- **Long‑content trimming** – if an article can exceed your LLM’s context window, pipe `body` through a sentence ranker or GPT‑based summariser before concatenation.
- **Paywalls / PDFs** – guard `extract_main_text` with fallback libraries (e.g. `readability-lxml` or `pymupdf`) for unusual formats.
- **Rate‑limiting** – Serper’s free tier allows 100 req/day; wrap the call with exponential backoff on HTTP 429.
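The rate-limiting item can be handled with a small retry wrapper. A hedged sketch, assuming the caller passes in the request coroutine; `call_with_backoff` and its parameters are illustrative names, not part of the existing app:

```python
import asyncio
import random

async def call_with_backoff(send, max_retries=5, base=1.0, cap=30.0):
    # `send` is any zero-argument coroutine function returning an object
    # with a `.status_code` attribute (e.g. an httpx response).
    for attempt in range(max_retries):
        resp = await send()
        if resp.status_code != 429:
            return resp
        # 1 s, 2 s, 4 s, ... capped at `cap`, with jitter so many
        # workers don't retry in lock-step.
        delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)
        await asyncio.sleep(delay)
    raise RuntimeError("rate limit: retries exhausted")
```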
Drop this file into any Python 3.10+ environment, set `SERPER_API_KEY`, install the four libraries above, and you have a ready‑to‑embed “query → context” micro‑service for your LLM pipeline.