Commit 3bf50ea
Reinitialize repository without large files
- .github/workflows/sync_to_huggingface_space.yml +18 -0
- .gitignore +0 -0
- README.md +141 -0
- api.py +25 -0
- app.py +84 -0
- requirements.txt +0 -0
- scrapes.py +81 -0
- sentiV_v2.py +153 -0
- tts_hindi_edgetts.py +21 -0
- utils.py +151 -0
.github/workflows/sync_to_huggingface_space.yml
ADDED
@@ -0,0 +1,18 @@
+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  workflow_dispatch:
+
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push -f https://Shakespeared101:[email protected]/spaces/Shakespeared101/news-summarise-tts main
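For the `${{ secrets.HF_TOKEN }}` reference above to resolve, the token has to exist as a repository secret on the GitHub side. A minimal way to add it, assuming the GitHub CLI (`gh`) is installed and authenticated against the repository that hosts this workflow:

```bash
# Store a Hugging Face write token (from hf.co/settings/tokens) as the
# HF_TOKEN repository secret; gh prompts for the value interactively.
gh secret set HF_TOKEN
```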
.gitignore
ADDED
Binary file (146 Bytes)
README.md
ADDED
@@ -0,0 +1,141 @@
+---
+title: "News Summarizer & TTS"
+emoji: "📰"
+colorFrom: "blue"
+colorTo: "green"
+sdk: "streamlit"
+app_file: "app.py"
+pinned: false
+---
+# News Summarisation and Hindi TTS Application
+
+## Project Overview
+
+This project is a web-based application that extracts news articles from multiple sources for a given company, summarizes the articles using advanced NLP techniques (with both Transformer-based and fallback methods), performs sentiment analysis with visual graphs, translates the generated summary to Hindi, and finally converts the Hindi summary into an audio file via text-to-speech (TTS). The application is built using FastAPI for the backend and Streamlit for the frontend, ensuring a smooth and interactive user experience.
+
+## Features
+
+- **News Extraction:**
+  Extracts news articles from multiple sources using web scraping techniques.
+
+- **Summarization:**
+  Generates a combined summary using a Transformer-based summarizer (with fallback to Sumy if needed).
+
+- **Sentiment Analysis:**
+  Analyzes the sentiment of the news content and visualizes the comparative sentiment (Positive, Negative, Neutral) as a bar graph using matplotlib.
+
+- **Translation:**
+  Translates the summary from English to Hindi using deep-translator for improved quality.
+
+- **Text-to-Speech (TTS):**
+  Converts the Hindi summary into an audio file using Edge TTS.
+
+## Setup Instructions
+
+### Dependencies
+
+Install all required packages using the commands below:
+
+```bash
+pip install fastapi uvicorn streamlit transformers newspaper3k beautifulsoup4 edge-tts selenium webdriver-manager spacy nltk sumy sacremoses requests deep-translator matplotlib
+python -m spacy download en_core_web_sm
+python -c "import nltk; nltk.download('vader_lexicon'); nltk.download('punkt')"
+```
+
+### Running the FastAPI Backend
+
+In your project directory, run:
+
+```bash
+uvicorn api:app --reload
+```
+
+This will start the backend server at [http://127.0.0.1:8000](http://127.0.0.1:8000).
+
+### Running the Streamlit Frontend
+
+In another terminal (or a new tab), run:
+
+```bash
+streamlit run app.py
+```
+
+This will launch the web interface where you can input a company name and interact with the application.
+
+## Project Structure
+
+- **`api.py`**
+  Contains the FastAPI application which exposes endpoints for processing news, generating summaries, performing sentiment analysis, translating summaries to Hindi, and creating TTS audio.
+
+- **`utils.py`**
+  Houses utility functions for:
+  - Extracting articles from news URLs.
+  - Generating combined summaries using Transformer models with Sumy as a fallback.
+  - Translating text to Hindi using deep-translator.
+  - Performing comparative sentiment analysis and generating a matplotlib bar chart.
+  - Generating TTS audio from the Hindi summary.
+
+- **`app.py`**
+  Provides a simple and interactive web-based interface using Streamlit. Users can input a company name, view extracted news and summaries, see the sentiment analysis graph, and play the generated TTS audio.
+
+- **`scrapes.py`**
+  Contains functions for scraping valid news URLs and extracting article content from web pages.
+
+- **`sentiV_v2.py`**
+  Implements sentiment analysis on the article content using both NLTK’s VADER and Transformer-based methods.
+
+- **`tts_hindi_edgetts.py`**
+  Utilizes Edge TTS to convert text to speech and saves the output as an audio file.
+
+- **`.gitignore`**
+  Ensures that large or unnecessary files (like the virtual environment folder `venv/`) are not tracked by Git.
+
+## Deployment Details
+
+The application can be deployed on platforms like [Hugging Face Spaces](https://huggingface.co/spaces), Heroku, or Render. For example, if deployed on Hugging Face Spaces:
+
+- The repository is linked to a new Space.
+- The Streamlit interface is used as the main application.
+- The deployment link (e.g., `https://huggingface.co/spaces/your-username/news-summarisation`) will be provided in the repository README for access.
+
+## Usage Instructions
+
+1. **Launch the Application:**
+   Run the FastAPI backend and Streamlit frontend as described above.
+
+2. **Input a Company Name:**
+   On the Streamlit interface, enter the name of a company (e.g., "Tesla", "Netflix") and click the "Fetch News" button.
+
+3. **View Results:**
+   - **News Articles:**
+     See a list of extracted news articles along with their metadata (title, URL, date, sentiment, excerpt).
+   - **Sentiment Analysis:**
+     View the comparative sentiment counts and a bar chart visualizing the distribution of positive, negative, and neutral articles.
+   - **Summaries:**
+     Read the combined summary of the news and the translated Hindi summary.
+   - **Audio:**
+     Play the TTS-generated audio of the Hindi summary.
+
+## Limitations & Future Improvements
+
+### Limitations:
+
+- Reliance on web scraping can sometimes result in incomplete article extraction due to website restrictions.
+- The summarization and translation quality might vary based on input length and complexity.
+- TTS accuracy depends on the Edge TTS service and may not always be perfect.
+
+### Future Improvements:
+
+- Integrate more robust error handling and fallback mechanisms.
+- Enhance the UI for better user experience.
+- Expand the number of news sources and improve the filtering of relevant content.
+- Implement caching to reduce API call latency.
+- Explore additional TTS options for higher quality audio output.
+
+## License
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+## Contributing
+
+Contributions are welcome! Please see the [CONTRIBUTING](CONTRIBUTING.md) file for guidelines on how to contribute to this project.
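Once the backend is up, a quick sanity check is to call the `/news/{company_name}` endpoint directly rather than going through the UI. A sketch, assuming the FastAPI server is running locally on the default port 8000:

```bash
# Request processed news for one company and pretty-print the JSON response.
curl -s http://127.0.0.1:8000/news/Tesla | python -m json.tool
```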
api.py
ADDED
@@ -0,0 +1,25 @@
+from fastapi import FastAPI
+from utils import process_news
+
+app = FastAPI(title="News Summarization & TTS API")
+
+@app.get("/")
+def read_root():
+    return {"message": "Welcome to the News Summarization & TTS API"}
+
+@app.get("/news/{company_name}")
+def get_news(company_name: str):
+    """
+    Fetch processed news for a given company.
+    Returns:
+      • A list of articles with title, URL, date, content, sentiment, and score.
+      • A combined summary of all articles.
+      • A Hindi translated summary.
+      • The TTS audio file path.
+      • Comparative sentiment analysis including a visual graph.
+    """
+    return process_news(company_name)
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
app.py
ADDED
@@ -0,0 +1,84 @@
+import os
+import threading
+import time
+import requests
+import streamlit as st
+import uvicorn
+from fastapi import FastAPI
+from utils import process_news
+
+import spacy
+try:
+    spacy.load("en_core_web_sm")
+except OSError:
+    import os
+    os.system("python -m spacy download en_core_web_sm")
+
+# FastAPI app setup
+api = FastAPI(title="News Summarization & TTS API")
+
+@api.get("/")
+def read_root():
+    return {"message": "Welcome to the News Summarization & TTS API"}
+
+@api.get("/news/{company_name}")
+def get_news(company_name: str):
+    return process_news(company_name)
+
+# Function to run FastAPI in a separate thread
+def run_fastapi():
+    uvicorn.run(api, host="0.0.0.0", port=8000)
+
+# Start FastAPI in a separate thread
+threading.Thread(target=run_fastapi, daemon=True).start()
+
+# Streamlit app setup
+API_URL = "http://127.0.0.1:8000"  # Since FastAPI runs in the same Space
+
+st.title("News Summarization and Hindi TTS Application")
+company = st.text_input("Enter Company Name", "")
+
+if st.button("Fetch News"):
+    if company.strip() == "":
+        st.warning("Please enter a valid company name.")
+    else:
+        with st.spinner("Fetching and processing news..."):
+            time.sleep(2)  # Give FastAPI some time to start
+            try:
+                response = requests.get(f"{API_URL}/news/{company}")
+                if response.status_code == 200:
+                    data = response.json()
+                    st.header(f"News for {data['company']}")
+
+                    for article in data["articles"]:
+                        st.subheader(article.get("title", "No Title"))
+                        st.markdown(f"**URL:** [Read More]({article.get('url', '#')})")
+                        st.markdown(f"**Date:** {article.get('date', 'N/A')}")
+                        st.markdown(f"**Sentiment:** {article.get('sentiment', 'Neutral')} (Score: {article.get('score', 0):.2f})")
+                        st.markdown(f"**Excerpt:** {article.get('content','')[:300]}...")
+                        st.markdown("---")
+
+                    st.subheader("Comparative Sentiment Analysis")
+                    comp_sent = data.get("comparative_sentiment", {})
+                    st.write({k: comp_sent[k] for k in ["Positive", "Negative", "Neutral"]})
+
+                    if "graph" in comp_sent and os.path.exists(comp_sent["graph"]):
+                        st.image(comp_sent["graph"], caption="Sentiment Analysis Graph")
+
+                    st.subheader("Final Combined Summary")
+                    st.write(data.get("final_summary", "No summary available."))
+
+                    st.subheader("Hindi Summary")
+                    st.write(data.get("hindi_summary", ""))
+
+                    st.subheader("Hindi Summary Audio")
+                    audio_path = data.get("tts_audio", None)
+                    if audio_path and os.path.exists(audio_path):
+                        with open(audio_path, "rb") as audio_file:
+                            st.audio(audio_file.read(), format='audio/mp3')
+                    else:
+                        st.error("Audio file not found or TTS generation failed.")
+                else:
+                    st.error("Failed to fetch news from the API. Please try again.")
+            except requests.exceptions.ConnectionError:
+                st.error("API is not running yet. Please wait a moment and try again.")
requirements.txt
ADDED
Binary file (1.26 kB)
scrapes.py
ADDED
@@ -0,0 +1,81 @@
+import requests
+import re
+from bs4 import BeautifulSoup
+from newspaper import Article
+from selenium import webdriver
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.chrome.options import Options
+from webdriver_manager.chrome import ChromeDriverManager
+
+
+def get_valid_news_urls(company_name):
+    search_url = f'https://www.google.com/search?q={company_name}+news&tbm=nws'
+    headers = {'User-Agent': 'Mozilla/5.0'}
+    response = requests.get(search_url, headers=headers)
+
+    if response.status_code != 200:
+        print("⚠️ Google News request failed!")
+        return []
+
+    soup = BeautifulSoup(response.text, 'html.parser')
+    links = []
+    for g in soup.find_all('a', href=True):
+        url_match = re.search(r'(https?://\S+)', g['href'])
+        if url_match:
+            url = url_match.group(1).split('&')[0]
+            if "google.com" not in url:
+                links.append(url)
+
+    return links[:10]  # Limit to top 10 results
+
+def extract_article_content(url):
+    try:
+        article = Article(url)
+        article.download()
+        article.parse()
+        return article.text
+    except Exception as e:
+        print(f"⚠️ Newspaper3k failed: {e}")
+
+    try:
+        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
+        if response.status_code != 200:
+            raise Exception("Request failed")
+        soup = BeautifulSoup(response.text, 'html.parser')
+        paragraphs = soup.find_all('p')
+        return '\n'.join(p.text for p in paragraphs if p.text)
+    except Exception as e:
+        print(f"⚠️ BeautifulSoup failed: {e}")
+
+    try:
+        options = Options()
+        options.add_argument("--headless")  # Run in headless mode
+        options.add_argument("--no-sandbox")
+        options.add_argument("--disable-dev-shm-usage")
+        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
+        driver.get(url)
+        page_content = driver.page_source
+        driver.quit()
+        soup = BeautifulSoup(page_content, 'html.parser')
+        paragraphs = soup.find_all('p')
+        return '\n'.join(p.text for p in paragraphs if p.text)
+    except Exception as e:
+        print(f"⚠️ Selenium failed: {e}")
+
+    return None
+
+def main():
+    company_name = input("Enter company name: ")
+    print(f"\n🔎 Searching news for: {company_name}\n")
+    urls = get_valid_news_urls(company_name)
+
+    for i, url in enumerate(urls, 1):
+        print(f"\n🔗 Article {i}: {url}\n")
+        content = extract_article_content(url)
+        if content:
+            print("📰 Extracted Content:\n", content[:], "...")
+        else:
+            print("⚠️ Failed to extract content....")
+
+if __name__ == "__main__":
+    main()
sentiV_v2.py
ADDED
@@ -0,0 +1,153 @@
+import requests
+import re
+import spacy
+import nltk
+from bs4 import BeautifulSoup
+from newspaper import Article
+from transformers import pipeline
+from selenium import webdriver
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.chrome.options import Options
+from webdriver_manager.chrome import ChromeDriverManager
+from nltk.sentiment import SentimentIntensityAnalyzer
+import time
+
+# Download NLTK resources
+nltk.download('vader_lexicon')
+sia = SentimentIntensityAnalyzer()
+
+# Load spaCy Named Entity Recognition model
+nlp = spacy.load("en_core_web_sm")
+
+# Load BERT Sentiment Analyzer
+bert_sentiment = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")
+
+def get_valid_news_urls(company_name):
+    search_url = f'https://www.google.com/search?q={company_name}+news&tbm=nws'
+    headers = {'User-Agent': 'Mozilla/5.0'}
+    try:
+        response = requests.get(search_url, headers=headers)
+        response.raise_for_status()
+    except requests.RequestException as e:
+        print(f"⚠️ Google News request failed: {e}")
+        return []
+
+    soup = BeautifulSoup(response.text, 'html.parser')
+    links = set()
+    for g in soup.find_all('a', href=True):
+        url_match = re.search(r'(https?://\S+)', g['href'])
+        if url_match:
+            url = url_match.group(1).split('&')[0]
+            if "google.com" not in url:  # Ignore Google-related URLs
+                links.add(url)
+
+    return list(links)[:10]  # Limit to top 10 results
+
+def extract_article_content(url):
+    try:
+        article = Article(url)
+        article.download()
+        article.parse()
+        if article.text.strip():
+            return article.text
+    except Exception as e:
+        print(f"⚠️ Newspaper3k failed: {e}")
+
+    try:
+        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
+        response.raise_for_status()
+        soup = BeautifulSoup(response.text, 'html.parser')
+        paragraphs = soup.find_all('p')
+        text = '\n'.join(p.text for p in paragraphs if p.text)
+        if text.strip():
+            return text
+    except Exception as e:
+        print(f"⚠️ BeautifulSoup failed: {e}")
+
+    try:
+        options = Options()
+        options.add_argument("--headless")
+        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
+        driver.get(url)
+        time.sleep(3)  # Allow time for JavaScript to load content
+        page_content = driver.page_source
+        driver.quit()
+
+        soup = BeautifulSoup(page_content, 'html.parser')
+        paragraphs = soup.find_all('p')
+        text = '\n'.join(p.text for p in paragraphs if p.text)
+        if text.strip():
+            return text
+    except Exception as e:
+        print(f"⚠️ Selenium failed: {e}")
+
+    return None
+
+def filter_relevant_sentences(text, company_name):
+    doc = nlp(text)
+    relevant_sentences = []
+
+    for sent in text.split('. '):
+        doc_sent = nlp(sent)
+        for ent in doc_sent.ents:
+            if company_name.lower() in ent.text.lower():
+                relevant_sentences.append(sent)
+                break
+
+    return '. '.join(relevant_sentences) if relevant_sentences else text
+
+def analyze_sentiment(text):
+    if not text.strip():
+        return "Neutral", 0.0
+
+    vader_scores = sia.polarity_scores(text)
+    vader_compound = vader_scores['compound']
+
+    try:
+        bert_result = bert_sentiment(text[:512])[0]  # Limit to 512 tokens
+        bert_label = bert_result['label']
+        bert_score = bert_result['score']
+        bert_value = bert_score if bert_label == "POSITIVE" else -bert_score
+    except Exception as e:
+        print(f"⚠️ BERT sentiment analysis failed: {e}")
+        bert_value = 0.0
+
+    final_sentiment = (vader_compound + bert_value) / 2
+
+    if final_sentiment > 0.2:
+        return "Positive", final_sentiment
+    elif final_sentiment < -0.2:
+        return "Negative", final_sentiment
+    else:
+        return "Neutral", final_sentiment
+
+def main():
+    company_name = input("Enter company name: ")
+    print(f"\n🔎 Searching news for: {company_name}\n")
+    urls = get_valid_news_urls(company_name)
+
+    if not urls:
+        print("❌ No valid news URLs found.")
+        return
+
+    seen_articles = set()
+
+    for i, url in enumerate(urls, 1):
+        if url in seen_articles:
+            continue
+        seen_articles.add(url)
+
+        print(f"\n🔗 Article {i}: {url}\n")
+        content = extract_article_content(url)
+
+        if content:
+            filtered_text = filter_relevant_sentences(content, company_name)
+            sentiment, score = analyze_sentiment(filtered_text)
+
+            print(f"📰 Extracted Content:\n{filtered_text[:500]}...")
+            print(f"📊 Sentiment: {sentiment} (Score: {score:.2f})")
+        else:
+            print("⚠️ Failed to extract content....")
+
+if __name__ == "__main__":
+    main()
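`analyze_sentiment` above averages VADER's compound score with a signed RoBERTa classifier score and maps the result onto three labels. A minimal interactive check, assuming the module's model and lexicon downloads succeed on import (importing it loads both, which can take a while on first run):

```python
from sentiV_v2 import analyze_sentiment

# Returns a (label, score) tuple; averages above 0.2 map to "Positive",
# below -0.2 to "Negative", everything in between to "Neutral".
label, score = analyze_sentiment("Tesla shares rose after strong quarterly deliveries.")
print(label, round(score, 2))
```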
tts_hindi_edgetts.py
ADDED
@@ -0,0 +1,21 @@
+import edge_tts
+import asyncio
+
+async def text_to_speech_hindi(text, output_file="news_sample.mp3"):
+    """
+    Convert text to Hindi speech and save as an audio file using Edge TTS.
+    """
+    if not text.strip():
+        print("⚠️ No text provided for TTS.")
+        return
+
+    print("🎙️ Generating Hindi speech...")
+    communicate = edge_tts.Communicate(text, voice="hi-IN-MadhurNeural")
+    await communicate.save(output_file)
+
+    print(f"✅ Audio saved as {output_file}")
+    return output_file
+
+# Example usage
+if __name__ == "__main__":
+    asyncio.run(text_to_speech_hindi("आज की मुख्य खबरें टेस्ला के बारे में हैं।"))
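The voice is hard-coded to `hi-IN-MadhurNeural` above. If a different Hindi voice is wanted, the `edge-tts` package ships a command-line entry point that can list what the service offers; a sketch, assuming the package's console script is on PATH:

```bash
# Enumerate available neural voices and keep only the Hindi (hi-IN) entries.
edge-tts --list-voices | grep "hi-IN"
```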
utils.py
ADDED
@@ -0,0 +1,151 @@
+import asyncio
+import nltk
+import matplotlib.pyplot as plt
+from scrapes import get_valid_news_urls, extract_article_content
+from sentiV_v2 import analyze_sentiment
+from newspaper import Article, Config
+from deep_translator import GoogleTranslator  # Replaced googletrans with deep-translator
+
+# Helper: Chunk text into smaller parts based on a fixed word count
+def chunk_text_by_words(text, chunk_size=100):
+    words = text.split()
+    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
+
+def process_articles(company_name):
+    """Extract articles with metadata from news URLs and only keep those relevant to the company."""
+    urls = get_valid_news_urls(company_name)
+    articles = []
+    # Set up a custom config with a browser user-agent to help avoid 403 errors
+    user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
+                  'AppleWebKit/537.36 (KHTML, like Gecko) '
+                  'Chrome/92.0.4515.159 Safari/537.36')
+    config = Config()
+    config.browser_user_agent = user_agent
+    config.request_timeout = 10
+
+    for url in urls:
+        try:
+            art = Article(url, config=config)
+            art.download()
+            art.parse()
+            content = art.text.strip() if art.text.strip() else extract_article_content(url)
+            # Filter out articles that do not mention the company (case-insensitive)
+            if not content or company_name.lower() not in content.lower():
+                continue
+            article_data = {
+                "title": art.title if art.title else "No Title",
+                "url": url,
+                "date": str(art.publish_date) if art.publish_date else "N/A",
+                "content": content
+            }
+            sentiment, score = analyze_sentiment(content)
+            article_data["sentiment"] = sentiment
+            article_data["score"] = score
+            articles.append(article_data)
+        except Exception as e:
+            print(f"Error processing article {url}: {e}")
+    return articles
+
+def generate_combined_summary(articles):
+    """Generate a combined summary from articles.
+    First attempts to use a transformers pipeline; if it fails, falls back to Sumy."""
+    combined_text = " ".join([article["content"] for article in articles])
+    if not combined_text.strip():
+        return ""
+    # Try using transformers summarizer
+    try:
+        from transformers import pipeline
+        summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
+        summary = summarizer(combined_text, max_length=150, min_length=50, do_sample=False)
+        return summary[0]["summary_text"]
+    except Exception as e:
+        print(f"Transformers summarization failed: {e}")
+    # Fallback using Sumy extraction-based summarization
+    try:
+        from sumy.parsers.plaintext import PlaintextParser
+        from sumy.nlp.tokenizers import Tokenizer
+        from sumy.summarizers.lex_rank import LexRankSummarizer
+        parser = PlaintextParser.from_string(combined_text, Tokenizer("english"))
+        summarizer_sumy = LexRankSummarizer()
+        summary_sentences = summarizer_sumy(parser.document, sentences_count=5)
+        summarized_text = " ".join(str(sentence) for sentence in summary_sentences)
+        return summarized_text if summarized_text else combined_text[:500]
+    except Exception as e2:
+        print(f"Sumy summarization failed: {e2}")
+        return combined_text[:500]
+
+def translate_to_hindi(text):
+    """Translate English text to Hindi using deep-translator for better quality."""
+    try:
+        translator = GoogleTranslator(source='auto', target='hi')
+        return translator.translate(text)
+    except Exception as e:
+        print(f"Translation failed: {e}")
+        return text
+
+def comparative_analysis(articles):
+    """Perform comparative sentiment analysis across articles and generate a bar chart."""
+    pos, neg, neu = 0, 0, 0
+    for article in articles:
+        sentiment = article.get("sentiment", "Neutral")
+        if sentiment == "Positive":
+            pos += 1
+        elif sentiment == "Negative":
+            neg += 1
+        else:
+            neu += 1
+
+    # Create a bar chart using matplotlib
+    labels = ['Positive', 'Negative', 'Neutral']
+    counts = [pos, neg, neu]
+    plt.figure(figsize=(6, 4))
+    bars = plt.bar(labels, counts, color=['green', 'red', 'gray'])
+    plt.title("Comparative Sentiment Analysis")
+    plt.xlabel("Sentiment")
+    plt.ylabel("Number of Articles")
+    for bar, count in zip(bars, counts):
+        height = bar.get_height()
+        plt.text(bar.get_x() + bar.get_width()/2., height, str(count), ha='center', va='bottom')
+    image_path = "sentiment_analysis.png"
+    plt.savefig(image_path)
+    plt.close()
+    return {"Positive": pos, "Negative": neg, "Neutral": neu, "graph": image_path}
+
+def generate_tts_audio(text, output_file="news_summary.mp3"):
+    """Generate TTS audio file from text using Edge TTS (via tts_hindi_edgetts.py)."""
+    try:
+        from tts_hindi_edgetts import text_to_speech_hindi
+        return asyncio.run(text_to_speech_hindi(text, output_file))
+    except Exception as e:
+        print(f"TTS generation failed: {e}")
+        return None
+
+def process_news(company_name):
+    """
+    Process news by:
+      • Extracting articles and metadata (only those relevant to the company)
+      • Generating a combined summary of article contents
+      • Translating the summary to Hindi
+      • Generating a Hindi TTS audio file
+      • Performing comparative sentiment analysis with visual output
+    """
+    articles = process_articles(company_name)
+    summary = generate_combined_summary(articles)
+    hindi_summary = translate_to_hindi(summary)
+    tts_audio = generate_tts_audio(hindi_summary)
+    sentiment_distribution = comparative_analysis(articles)
+    result = {
+        "company": company_name,
+        "articles": articles,
+        "comparative_sentiment": sentiment_distribution,
+        "final_summary": summary,
+        "hindi_summary": hindi_summary,
+        "tts_audio": tts_audio  # file path for the generated audio
+    }
+    return result
+
+if __name__ == "__main__":
+    company = input("Enter company name: ")
+    import json
+    data = process_news(company)
+    print(json.dumps(data, indent=4, ensure_ascii=False))
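For reference, the dictionary assembled by `process_news` above (and returned as JSON by the `/news/{company_name}` endpoint in `api.py`) has the following shape; the values shown here are illustrative placeholders, not real output:

```json
{
  "company": "Tesla",
  "articles": [
    {
      "title": "...",
      "url": "https://example.com/article",
      "date": "N/A",
      "content": "...",
      "sentiment": "Positive",
      "score": 0.45
    }
  ],
  "comparative_sentiment": {
    "Positive": 1,
    "Negative": 0,
    "Neutral": 0,
    "graph": "sentiment_analysis.png"
  },
  "final_summary": "...",
  "hindi_summary": "...",
  "tts_audio": "news_summary.mp3"
}
```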