victor HF Staff
Add Gradio web search application and update README with usage instructions
metadata
title: Websearch
emoji: 🏢
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.36.2
app_file: app.py
pinned: false

Gradio News‑to‑Context Service

Prerequisites

$ pip install gradio httpx trafilatura python-dateutil

Environment

export SERPER_API_KEY="YOUR-KEY-HERE"
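
As a minimal sketch of how an application could read this variable at start-up, the helper below fails fast with a clear message when the key is missing. The name `load_serper_key` is hypothetical and is not taken from app.py.

```python
import os
from typing import Mapping


def load_serper_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Serper API key, or raise a clear error if it is unset."""
    # Hypothetical helper: strips stray whitespace and rejects empty values.
    key = env.get("SERPER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SERPER_API_KEY is not set; export it before starting the app"
        )
    return key
```

Passing the environment as a parameter keeps the check easy to exercise in tests without touching the real process environment.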

How it works – design notes

| Step | Technique | Why it matters |
| --- | --- | --- |
| API search | Serper's Google News JSON API | Fast and cost-effective, and it sidesteps Google's bot-blocking. |
| Concurrency | httpx.AsyncClient + asyncio.gather | Fetches 10 articles in < 2 s on typical broadband. |
| Extraction | Trafilatura | Performs strongly in main-content extraction benchmarks and needs no browser or heavy ML models. |
| Date parsing | python-dateutil | Normalises date strings into ISO YYYY-MM-DD so the LLM sees absolute dates. Relative strings ("16 hours ago") have to be resolved against the current time, because dateutil's parser does not interpret "ago" on its own. |
| LLM-friendly output | Markdown headings and horizontal rules | Make chunk boundaries explicit; hyperlinks are preserved for optional citation. |
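
The concurrency, date and output steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in app.py: the names `to_iso_date`, `fetch_all` and `to_markdown` are hypothetical, the `fetch` coroutine stands in for an httpx.AsyncClient request followed by Trafilatura extraction, and relative timestamps are handled with a small regular expression on the assumption that dateutil's parser alone does not subtract the offset from the current time.

```python
import asyncio
import re
from datetime import datetime, timedelta
from typing import Awaitable, Callable, Iterable, List, Tuple

# Hypothetical pattern for relative timestamps such as "16 hours ago".
_RELATIVE = re.compile(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", re.IGNORECASE)


def to_iso_date(raw: str, now: datetime) -> str:
    """Turn a relative timestamp into YYYY-MM-DD; return other strings unchanged."""
    match = _RELATIVE.search(raw)
    if match is None:
        # Assumption: absolute strings would be handed to dateutil.parser.parse here.
        return raw
    amount, unit = int(match.group(1)), match.group(2).lower()
    return (now - timedelta(**{unit + "s": amount})).strftime("%Y-%m-%d")


async def fetch_all(
    urls: Iterable[str], fetch: Callable[[str], Awaitable[str]]
) -> List[str]:
    """Run every fetch concurrently; a failed fetch yields an empty body."""
    results = await asyncio.gather(
        *(fetch(url) for url in urls), return_exceptions=True
    )
    return [r if isinstance(r, str) else "" for r in results]


def to_markdown(items: Iterable[Tuple[str, str, str, str]]) -> str:
    """Join (title, url, iso_date, body) tuples into rule-separated chunks."""
    chunks = [
        f"## {title}\n**Date:** {date}\n**Source:** [{url}]({url})\n\n{body.strip()}"
        for title, url, date, body in items
    ]
    return "\n\n---\n\n".join(chunks)
```

Because `fetch` is injected, the same functions can be driven by a stub in tests and by a real HTTP client in the service. `return_exceptions=True` keeps one failing URL from cancelling the whole batch.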

Extending in production

  • Caching – add aiocache or Redis to avoid re-fetching identical URLs within the TTL.
  • Long-content trimming – if a single article can exceed your LLM's context window, pass its body through a sentence ranker or GPT-based summariser before concatenation.
  • Paywalls / PDFs – guard extract_main_text with fallback libraries (e.g. readability-lxml or pymupdf) for unusual formats.
  • Rate-limiting – Serper free tier allows 100 req/day; wrap the call in exponential backoff on HTTP 429.
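
Two of the points above, caching and backoff, can be sketched with the standard library. This is a minimal illustration under stated assumptions rather than a production design: `TTLCache` is a hypothetical in-memory stand-in for aiocache or Redis, and `with_backoff` assumes a hypothetical `call` coroutine that returns a `(status_code, payload)` pair.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache; entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]  # entry has expired
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self._clock(), value)


async def with_backoff(
    call: Callable[[], Awaitable[Tuple[int, Any]]],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Tuple[int, Any]:
    """Retry `call` while it reports HTTP 429, doubling the wait each time."""
    for attempt in range(retries + 1):
        status, payload = await call()
        if status != 429:
            return status, payload
        if attempt < retries:
            await sleep(base_delay * 2 ** attempt)
    # Retries exhausted: hand the last 429 response back to the caller.
    return status, payload
```

The clock and sleep functions are parameters so that expiry and retry timing can be checked without real waiting. Adding random jitter to each delay is a common refinement when many clients retry at once.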

Drop app.py into any Python 3.10+ environment, set SERPER_API_KEY, pip install the four libraries listed under Prerequisites, and you have a ready-to-embed "query → context" micro-service for your LLM pipeline.