victor HF Staff

Add Gradio web search application and update README with usage instructions

e90574b 5 months ago


metadata

title: Websearch
emoji: 🏢
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.36.2
app_file: app.py
pinned: false

Gradio News‑to‑Context Service

Prerequisites

$ pip install gradio httpx trafilatura python-dateutil

Environment

export SERPER_API_KEY="YOUR‑KEY‑HERE"
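As a minimal sketch of how the key might be read at start-up (the helper name `get_serper_key` is hypothetical and not taken from the app), the lookup can fail early with a readable message instead of sending an unauthenticated request:

```python
import os
from typing import Mapping


def get_serper_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Serper API key, or raise a readable error if it is unset."""
    # `env` is injectable so the lookup can be exercised without touching
    # the real process environment.
    key = env.get("SERPER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SERPER_API_KEY is not set; export it before starting the app"
        )
    return key

# Typical use at start-up (raises if the variable is missing):
#     api_key = get_serper_key()
```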

How it works – design notes

| Step | Technique and why it matters |
| --- | --- |
| API search | Serper’s Google‑News JSON is fast and cost‑effective, and it sidesteps Google’s bot‑blocking. |
| Concurrency | `httpx.AsyncClient` + `asyncio.gather` fetches 10 articles in under 2 s on typical broadband. |
| Extraction | Trafilatura scores well in published main‑content extraction comparisons and needs no browser or heavy ML models. |
| Date parsing | `python‑dateutil` normalises date strings to ISO YYYY‑MM‑DD so the LLM sees absolute dates. Relative strings (“16 hours ago”) are not resolved against the current time by `dateutil.parser`, so they need separate handling. |
| LLM‑friendly output | Markdown headings and horizontal rules make chunk boundaries explicit; hyperlinks are preserved for optional citation. |
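The steps in the table can be sketched with the standard library alone. This is an illustration under stated assumptions, not the app's actual code: `resolve_date`, `fetch_all`, `format_context` and `fake_fetch` are hypothetical names, the article dictionary layout is assumed, and `fake_fetch` stands in for the `httpx.AsyncClient` request so the example runs offline.

```python
import asyncio
import re
from datetime import datetime, timedelta

# Matches relative strings such as "16 hours ago" or "3 days ago".
_RELATIVE = re.compile(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", re.IGNORECASE)


def resolve_date(raw: str, now: datetime) -> str:
    """Resolve a relative date string to ISO YYYY-MM-DD; pass others through."""
    match = _RELATIVE.search(raw)
    if not match:
        return raw  # absolute strings would go to a date parser instead
    amount, unit = int(match.group(1)), match.group(2).lower()
    return (now - timedelta(**{unit + "s": amount})).strftime("%Y-%m-%d")


async def fetch_all(urls, fetch):
    """Run one fetch coroutine per URL concurrently; failures become ''."""
    results = await asyncio.gather(
        *(fetch(url) for url in urls), return_exceptions=True
    )
    return ["" if isinstance(r, BaseException) else r for r in results]


def format_context(articles):
    """Join articles as Markdown chunks separated by horizontal rules."""
    chunks = [
        f"## {a['title']}\n**Date:** {a['date']}\n**Source:** <{a['url']}>\n\n{a['body']}"
        for a in articles
    ]
    return "\n\n---\n\n".join(chunks)


async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP GET followed by text extraction."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise RuntimeError("fetch failed")
    return f"Body of {url}"


now = datetime(2025, 1, 1, 10, 0)
urls = ["https://example.com/a", "https://example.com/bad"]
bodies = asyncio.run(fetch_all(urls, fake_fetch))
articles = [
    {"title": f"Article {i}", "date": resolve_date("16 hours ago", now),
     "url": url, "body": body}
    for i, (url, body) in enumerate(zip(urls, bodies), start=1)
    if body  # drop pages that failed to download
]
print(format_context(articles))
```

Passing `return_exceptions=True` to `asyncio.gather` keeps one failed download from cancelling the rest of the batch, which matters when a single slow or broken site would otherwise empty the whole context.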

Extending in production

- Caching – add `aiocache` or Redis to avoid re‑fetching identical URLs within a TTL.
- Long‑content trimming – if an article can exceed your LLM’s context window, pipe its body through a sentence‑ranker or GPT‑based summariser before concatenation.
- Paywalls / PDFs – guard `extract_main_text` with fallback libraries (e.g. readability‑lxml or pymupdf) for unusual formats.
- Rate‑limiting – Serper free tier allows 100 req/day; wrap the call with exponential backoff on HTTP 429.
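Two of the items above, caching and backoff, can be sketched without extra dependencies. This is a simplified in-memory illustration, not a replacement for `aiocache` or Redis: `TTLCache` and `call_with_backoff` are hypothetical names, and `send` is assumed to be any callable that returns a `(status_code, payload)` pair.

```python
import random
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: forget it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out."""
    for attempt in range(max_retries + 1):
        status, payload = send()
        if status != 429:
            return status, payload
        if attempt < max_retries:
            # Delay doubles on every retry; a little jitter spreads out clients.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status, payload
```

A caller would typically check the cache by URL first and only invoke `call_with_backoff` on a miss, storing the result afterwards.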

Drop `app.py` into any Python 3.10+ environment, set SERPER_API_KEY, pip install the packages listed under Prerequisites, and you have a ready‑to‑embed “query → context” micro‑service for your LLM pipeline.