Shakespeared101 committed
Commit 3bf50ea · 0 parent(s)

Reinitialize repository without large files

Files changed (10)
  1. .github/workflows/sync_to_huggingface_space.yml +18 -0
  2. .gitignore +0 -0
  3. README.md +141 -0
  4. api.py +25 -0
  5. app.py +84 -0
  6. requirements.txt +0 -0
  7. scrapes.py +81 -0
  8. sentiV_v2.py +153 -0
  9. tts_hindi_edgetts.py +21 -0
  10. utils.py +151 -0
.github/workflows/sync_to_huggingface_space.yml ADDED
@@ -0,0 +1,18 @@
+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  workflow_dispatch:
+
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push -f https://Shakespeared101:$HF_TOKEN@huggingface.co/spaces/Shakespeared101/news-summarise-tts main
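
(The `secrets.HF_TOKEN` reference assumes a repository secret named `HF_TOKEN` has been created in the GitHub repository settings, containing a Hugging Face access token with write access to the Space.)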
.gitignore ADDED
Binary file (146 Bytes).
 
README.md ADDED
@@ -0,0 +1,141 @@
+---
+title: "News Summarizer & TTS"
+emoji: "📰"
+colorFrom: "blue"
+colorTo: "green"
+sdk: "streamlit"
+app_file: "app.py"
+pinned: false
+---
+# News Summarisation and Hindi TTS Application
+
+## Project Overview
+
+This project is a web-based application that extracts news articles from multiple sources for a given company, summarizes them using advanced NLP techniques (a Transformer-based summarizer with an extractive fallback), performs sentiment analysis with visual graphs, translates the generated summary to Hindi, and finally converts the Hindi summary into an audio file via text-to-speech (TTS). The application is built with FastAPI for the backend and Streamlit for the frontend, ensuring a smooth and interactive user experience.
+
+## Features
+
+- **News Extraction:**
+  Extracts news articles from multiple sources using web scraping techniques.
+
+- **Summarization:**
+  Generates a combined summary using a Transformer-based summarizer (with fallback to Sumy if needed).
+
+- **Sentiment Analysis:**
+  Analyzes the sentiment of the news content and visualizes the comparative sentiment (Positive, Negative, Neutral) as a bar graph using matplotlib.
+
+- **Translation:**
+  Translates the summary from English to Hindi using deep-translator for improved quality.
+
+- **Text-to-Speech (TTS):**
+  Converts the Hindi summary into an audio file using Edge TTS.
+
+## Setup Instructions
+
+### Dependencies
+
+Install all required packages using the commands below:
+
+```bash
+pip install fastapi uvicorn streamlit transformers newspaper3k beautifulsoup4 edge-tts selenium webdriver-manager spacy nltk sumy sacremoses requests deep-translator matplotlib
+python -m spacy download en_core_web_sm
+python -c "import nltk; nltk.download('vader_lexicon'); nltk.download('punkt')"
+```
+
+### Running the FastAPI Backend
+
+In your project directory, run:
+
+```bash
+uvicorn api:app --reload
+```
+
+This will start the backend server at [http://127.0.0.1:8000](http://127.0.0.1:8000). (This step is optional when using `app.py`, which starts the same FastAPI backend in a background thread.)
+
+### Running the Streamlit Frontend
+
+In another terminal (or a new tab), run:
+
+```bash
+streamlit run app.py
+```
+
+This will launch the web interface where you can input a company name and interact with the application.
+
+## Project Structure
+
+- **`api.py`**
+  Contains the FastAPI application which exposes endpoints for processing news, generating summaries, performing sentiment analysis, translating summaries to Hindi, and creating TTS audio.
+
+- **`utils.py`**
+  Houses utility functions for:
+  - Extracting articles from news URLs.
+  - Generating combined summaries using Transformer models with Sumy as a fallback.
+  - Translating text to Hindi using deep-translator.
+  - Performing comparative sentiment analysis and generating a matplotlib bar chart.
+  - Generating TTS audio from the Hindi summary.
+
+- **`app.py`**
+  Provides a simple and interactive web-based interface using Streamlit, and starts the FastAPI backend in a background thread. Users can input a company name, view extracted news and summaries, see the sentiment analysis graph, and play the generated TTS audio.
+
+- **`scrapes.py`**
+  Contains functions for scraping valid news URLs and extracting article content from web pages.
+
+- **`sentiV_v2.py`**
+  Implements sentiment analysis on the article content using both NLTK's VADER and Transformer-based methods.
+
+- **`tts_hindi_edgetts.py`**
+  Utilizes Edge TTS to convert text to speech and saves the output as an audio file.
+
+- **`.gitignore`**
+  Ensures that large or unnecessary files (like the virtual environment folder `venv/`) are not tracked by Git.
+
+## Deployment Details
+
+The application can be deployed on platforms like [Hugging Face Spaces](https://huggingface.co/spaces), Heroku, or Render. For example, if deployed on Hugging Face Spaces:
+
+- The repository is linked to a new Space.
+- The Streamlit interface is used as the main application.
+- The deployment link (e.g., `https://huggingface.co/spaces/your-username/news-summarisation`) will be provided in the repository README for access.
+
+## Usage Instructions
+
+1. **Launch the Application:**
+   Run the FastAPI backend and Streamlit frontend as described above.
+
+2. **Input a Company Name:**
+   On the Streamlit interface, enter the name of a company (e.g., "Tesla", "Netflix") and click the "Fetch News" button.
+
+3. **View Results:**
+   - **News Articles:**
+     See a list of extracted news articles along with their metadata (title, URL, date, sentiment, excerpt).
+   - **Sentiment Analysis:**
+     View the comparative sentiment counts and a bar chart visualizing the distribution of positive, negative, and neutral articles.
+   - **Summaries:**
+     Read the combined summary of the news and the translated Hindi summary.
+   - **Audio:**
+     Play the TTS-generated audio of the Hindi summary.
+
+## Limitations & Future Improvements
+
+### Limitations
+
+- Reliance on web scraping can sometimes result in incomplete article extraction due to website restrictions.
+- Summarization and translation quality may vary with input length and complexity.
+- TTS accuracy depends on the Edge TTS service and may not always be perfect.
+
+### Future Improvements
+
+- Integrate more robust error handling and fallback mechanisms.
+- Enhance the UI for a better user experience.
+- Expand the number of news sources and improve the filtering of relevant content.
+- Implement caching to reduce API call latency.
+- Explore additional TTS options for higher-quality audio output.
+
+## License
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+## Contributing
+
+Contributions are welcome! Please see the [CONTRIBUTING](CONTRIBUTING.md) file for guidelines on how to contribute to this project.
api.py ADDED
@@ -0,0 +1,25 @@
+from fastapi import FastAPI
+from utils import process_news
+
+app = FastAPI(title="News Summarization & TTS API")
+
+@app.get("/")
+def read_root():
+    return {"message": "Welcome to the News Summarization & TTS API"}
+
+@app.get("/news/{company_name}")
+def get_news(company_name: str):
+    """
+    Fetch processed news for a given company.
+    Returns:
+    • A list of articles with title, URL, date, content, sentiment, and score.
+    • A combined summary of all articles.
+    • A Hindi translated summary.
+    • The TTS audio file path.
+    • Comparative sentiment analysis including a visual graph.
+    """
+    return process_news(company_name)
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
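
A minimal client sketch for this endpoint (assuming the server is running locally on port 8000; the response keys mirror the dict built by `process_news` in `utils.py`):

```python
import requests

# Fetch processed news for a company from the locally running API.
resp = requests.get("http://127.0.0.1:8000/news/Tesla", timeout=300)
resp.raise_for_status()
data = resp.json()

print(data["final_summary"])          # combined English summary
print(data["hindi_summary"])          # Hindi translation
print(data["comparative_sentiment"])  # sentiment counts plus the graph image path
print(data["tts_audio"])              # path to the generated MP3 on the server
```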
app.py ADDED
@@ -0,0 +1,84 @@
+import os
+import threading
+import time
+import requests
+import streamlit as st
+import uvicorn
+from fastapi import FastAPI
+from utils import process_news
+
+import spacy
+try:
+    spacy.load("en_core_web_sm")
+except OSError:
+    # Download the spaCy model on first run (os is already imported above)
+    os.system("python -m spacy download en_core_web_sm")
+
+# FastAPI app setup
+api = FastAPI(title="News Summarization & TTS API")
+
+@api.get("/")
+def read_root():
+    return {"message": "Welcome to the News Summarization & TTS API"}
+
+@api.get("/news/{company_name}")
+def get_news(company_name: str):
+    return process_news(company_name)
+
+# Run FastAPI in a background thread so Streamlit and the API share one process
+def run_fastapi():
+    uvicorn.run(api, host="0.0.0.0", port=8000)
+
+threading.Thread(target=run_fastapi, daemon=True).start()
+
+# Streamlit app setup
+API_URL = "http://127.0.0.1:8000"  # FastAPI runs in the same Space
+
+st.title("News Summarization and Hindi TTS Application")
+company = st.text_input("Enter Company Name", "")
+
+if st.button("Fetch News"):
+    if company.strip() == "":
+        st.warning("Please enter a valid company name.")
+    else:
+        with st.spinner("Fetching and processing news..."):
+            time.sleep(2)  # Give FastAPI some time to start
+            try:
+                response = requests.get(f"{API_URL}/news/{company}")
+                if response.status_code == 200:
+                    data = response.json()
+                    st.header(f"News for {data['company']}")
+
+                    for article in data["articles"]:
+                        st.subheader(article.get("title", "No Title"))
+                        st.markdown(f"**URL:** [Read More]({article.get('url', '#')})")
+                        st.markdown(f"**Date:** {article.get('date', 'N/A')}")
+                        st.markdown(f"**Sentiment:** {article.get('sentiment', 'Neutral')} (Score: {article.get('score', 0):.2f})")
+                        st.markdown(f"**Excerpt:** {article.get('content', '')[:300]}...")
+                        st.markdown("---")
+
+                    st.subheader("Comparative Sentiment Analysis")
+                    comp_sent = data.get("comparative_sentiment", {})
+                    st.write({k: comp_sent.get(k, 0) for k in ["Positive", "Negative", "Neutral"]})
+
+                    if "graph" in comp_sent and os.path.exists(comp_sent["graph"]):
+                        st.image(comp_sent["graph"], caption="Sentiment Analysis Graph")
+
+                    st.subheader("Final Combined Summary")
+                    st.write(data.get("final_summary", "No summary available."))
+
+                    st.subheader("Hindi Summary")
+                    st.write(data.get("hindi_summary", ""))
+
+                    st.subheader("Hindi Summary Audio")
+                    audio_path = data.get("tts_audio", None)
+                    if audio_path and os.path.exists(audio_path):
+                        with open(audio_path, "rb") as audio_file:
+                            st.audio(audio_file.read(), format='audio/mp3')
+                    else:
+                        st.error("Audio file not found or TTS generation failed.")
+                else:
+                    st.error("Failed to fetch news from the API. Please try again.")
+            except requests.exceptions.ConnectionError:
+                st.error("API is not running yet. Please wait a moment and try again.")
requirements.txt ADDED
Binary file (1.26 kB).
 
scrapes.py ADDED
@@ -0,0 +1,81 @@
+import requests
+import re
+from bs4 import BeautifulSoup
+from newspaper import Article
+from selenium import webdriver
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.chrome.options import Options
+from webdriver_manager.chrome import ChromeDriverManager
+
+
+def get_valid_news_urls(company_name):
+    search_url = f'https://www.google.com/search?q={company_name}+news&tbm=nws'
+    headers = {'User-Agent': 'Mozilla/5.0'}
+    response = requests.get(search_url, headers=headers)
+
+    if response.status_code != 200:
+        print("⚠️ Google News request failed!")
+        return []
+
+    soup = BeautifulSoup(response.text, 'html.parser')
+    links = []
+    for g in soup.find_all('a', href=True):
+        url_match = re.search(r'(https?://\S+)', g['href'])
+        if url_match:
+            url = url_match.group(1).split('&')[0]
+            if "google.com" not in url:
+                links.append(url)
+
+    return links[:10]  # Limit to top 10 results
+
+def extract_article_content(url):
+    # First attempt: newspaper3k's article parser
+    try:
+        article = Article(url)
+        article.download()
+        article.parse()
+        return article.text
+    except Exception as e:
+        print(f"⚠️ Newspaper3k failed: {e}")
+
+    # Second attempt: plain requests + BeautifulSoup paragraph scrape
+    try:
+        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
+        if response.status_code != 200:
+            raise Exception("Request failed")
+        soup = BeautifulSoup(response.text, 'html.parser')
+        paragraphs = soup.find_all('p')
+        return '\n'.join(p.text for p in paragraphs if p.text)
+    except Exception as e:
+        print(f"⚠️ BeautifulSoup failed: {e}")
+
+    # Last resort: headless Chrome via Selenium for JavaScript-rendered pages
+    try:
+        options = Options()
+        options.add_argument("--headless")  # Run in headless mode
+        options.add_argument("--no-sandbox")
+        options.add_argument("--disable-dev-shm-usage")
+        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
+        driver.get(url)
+        page_content = driver.page_source
+        driver.quit()
+        soup = BeautifulSoup(page_content, 'html.parser')
+        paragraphs = soup.find_all('p')
+        return '\n'.join(p.text for p in paragraphs if p.text)
+    except Exception as e:
+        print(f"⚠️ Selenium failed: {e}")
+
+    return None
+
+def main():
+    company_name = input("Enter company name: ")
+    print(f"\n🔎 Searching news for: {company_name}\n")
+    urls = get_valid_news_urls(company_name)
+
+    for i, url in enumerate(urls, 1):
+        print(f"\n🔗 Article {i}: {url}\n")
+        content = extract_article_content(url)
+        if content:
+            print("📰 Extracted Content:\n", content[:500], "...")
+        else:
+            print("⚠️ Failed to extract content.")
+
+if __name__ == "__main__":
+    main()
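
The extraction strategy here is deliberately tiered: newspaper3k is tried first because it is the fastest and yields the cleanest article text; the plain requests + BeautifulSoup paragraph scrape handles pages newspaper3k cannot parse; and headless Selenium is reserved as the last resort for JavaScript-rendered pages, since spinning up a browser is by far the slowest option.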
sentiV_v2.py ADDED
@@ -0,0 +1,153 @@
+import requests
+import re
+import spacy
+import nltk
+from bs4 import BeautifulSoup
+from newspaper import Article
+from transformers import pipeline
+from selenium import webdriver
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.chrome.options import Options
+from webdriver_manager.chrome import ChromeDriverManager
+from nltk.sentiment import SentimentIntensityAnalyzer
+import time
+
+# Download NLTK resources
+nltk.download('vader_lexicon')
+sia = SentimentIntensityAnalyzer()
+
+# Load spaCy Named Entity Recognition model
+nlp = spacy.load("en_core_web_sm")
+
+# Load RoBERTa-based Transformer sentiment analyzer
+bert_sentiment = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")
+
+def get_valid_news_urls(company_name):
+    search_url = f'https://www.google.com/search?q={company_name}+news&tbm=nws'
+    headers = {'User-Agent': 'Mozilla/5.0'}
+    try:
+        response = requests.get(search_url, headers=headers)
+        response.raise_for_status()
+    except requests.RequestException as e:
+        print(f"⚠️ Google News request failed: {e}")
+        return []
+
+    soup = BeautifulSoup(response.text, 'html.parser')
+    links = set()
+    for g in soup.find_all('a', href=True):
+        url_match = re.search(r'(https?://\S+)', g['href'])
+        if url_match:
+            url = url_match.group(1).split('&')[0]
+            if "google.com" not in url:  # Ignore Google-related URLs
+                links.add(url)
+
+    return list(links)[:10]  # Limit to top 10 results
+
+def extract_article_content(url):
+    try:
+        article = Article(url)
+        article.download()
+        article.parse()
+        if article.text.strip():
+            return article.text
+    except Exception as e:
+        print(f"⚠️ Newspaper3k failed: {e}")
+
+    try:
+        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
+        response.raise_for_status()
+        soup = BeautifulSoup(response.text, 'html.parser')
+        paragraphs = soup.find_all('p')
+        text = '\n'.join(p.text for p in paragraphs if p.text)
+        if text.strip():
+            return text
+    except Exception as e:
+        print(f"⚠️ BeautifulSoup failed: {e}")
+
+    try:
+        options = Options()
+        options.add_argument("--headless")
+        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
+        driver.get(url)
+        time.sleep(3)  # Allow time for JavaScript to load content
+        page_content = driver.page_source
+        driver.quit()
+
+        soup = BeautifulSoup(page_content, 'html.parser')
+        paragraphs = soup.find_all('p')
+        text = '\n'.join(p.text for p in paragraphs if p.text)
+        if text.strip():
+            return text
+    except Exception as e:
+        print(f"⚠️ Selenium failed: {e}")
+
+    return None
+
+def filter_relevant_sentences(text, company_name):
+    """Keep only sentences whose named entities mention the company."""
+    relevant_sentences = []
+
+    for sent in text.split('. '):
+        doc_sent = nlp(sent)
+        for ent in doc_sent.ents:
+            if company_name.lower() in ent.text.lower():
+                relevant_sentences.append(sent)
+                break
+
+    return '. '.join(relevant_sentences) if relevant_sentences else text
+
+def analyze_sentiment(text):
+    """Blend VADER's compound score with a signed Transformer score."""
+    if not text.strip():
+        return "Neutral", 0.0
+
+    vader_scores = sia.polarity_scores(text)
+    vader_compound = vader_scores['compound']
+
+    try:
+        bert_result = bert_sentiment(text[:512])[0]  # Limit input length
+        bert_label = bert_result['label']
+        bert_score = bert_result['score']
+        bert_value = bert_score if bert_label == "POSITIVE" else -bert_score
+    except Exception as e:
+        print(f"⚠️ BERT sentiment analysis failed: {e}")
+        bert_value = 0.0
+
+    final_sentiment = (vader_compound + bert_value) / 2
+
+    if final_sentiment > 0.2:
+        return "Positive", final_sentiment
+    elif final_sentiment < -0.2:
+        return "Negative", final_sentiment
+    else:
+        return "Neutral", final_sentiment
+
+def main():
+    company_name = input("Enter company name: ")
+    print(f"\n🔎 Searching news for: {company_name}\n")
+    urls = get_valid_news_urls(company_name)
+
+    if not urls:
+        print("❌ No valid news URLs found.")
+        return
+
+    seen_articles = set()
+
+    for i, url in enumerate(urls, 1):
+        if url in seen_articles:
+            continue
+        seen_articles.add(url)
+
+        print(f"\n🔗 Article {i}: {url}\n")
+        content = extract_article_content(url)
+
+        if content:
+            filtered_text = filter_relevant_sentences(content, company_name)
+            sentiment, score = analyze_sentiment(filtered_text)
+
+            print(f"📰 Extracted Content:\n{filtered_text[:500]}...")
+            print(f"📊 Sentiment: {sentiment} (Score: {score:.2f})")
+        else:
+            print("⚠️ Failed to extract content.")
+
+if __name__ == "__main__":
+    main()
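
To make the score blending in `analyze_sentiment` concrete, here is a small worked sketch of the arithmetic (the numbers are illustrative, not real model output):

```python
# Illustrative values: VADER compound = 0.60, RoBERTa -> ("POSITIVE", 0.90).
vader_compound = 0.60
bert_value = 0.90                          # would be -0.90 for a "NEGATIVE" label
final = (vader_compound + bert_value) / 2  # 0.75
# Thresholds: > 0.2 -> "Positive", < -0.2 -> "Negative", otherwise "Neutral"
assert final > 0.2                         # this example is classified as Positive
```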
tts_hindi_edgetts.py ADDED
@@ -0,0 +1,21 @@
+import edge_tts
+import asyncio
+
+async def text_to_speech_hindi(text, output_file="news_sample.mp3"):
+    """
+    Convert text to Hindi speech and save as an audio file using Edge TTS.
+    """
+    if not text.strip():
+        print("⚠️ No text provided for TTS.")
+        return
+
+    print("🎙️ Generating Hindi speech...")
+    communicate = edge_tts.Communicate(text, voice="hi-IN-MadhurNeural")
+    await communicate.save(output_file)
+
+    print(f"✅ Audio saved as {output_file}")
+    return output_file
+
+# Example usage
+if __name__ == "__main__":
+    asyncio.run(text_to_speech_hindi("आज की मुख्य खबरें टेस्ला के बारे में हैं।"))
utils.py ADDED
@@ -0,0 +1,151 @@
+import asyncio
+import nltk
+import matplotlib.pyplot as plt
+from scrapes import get_valid_news_urls, extract_article_content
+from sentiV_v2 import analyze_sentiment
+from newspaper import Article, Config
+from deep_translator import GoogleTranslator  # Replaced googletrans with deep-translator
+
+# Helper: Chunk text into smaller parts based on a fixed word count
+def chunk_text_by_words(text, chunk_size=100):
+    words = text.split()
+    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
+
+def process_articles(company_name):
+    """Extract articles with metadata from news URLs and keep only those relevant to the company."""
+    urls = get_valid_news_urls(company_name)
+    articles = []
+    # Set up a custom config with a browser user-agent to help avoid 403 errors
+    user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
+                  'AppleWebKit/537.36 (KHTML, like Gecko) '
+                  'Chrome/92.0.4515.159 Safari/537.36')
+    config = Config()
+    config.browser_user_agent = user_agent
+    config.request_timeout = 10
+
+    for url in urls:
+        try:
+            art = Article(url, config=config)
+            art.download()
+            art.parse()
+            content = art.text.strip() if art.text.strip() else extract_article_content(url)
+            # Filter out articles that do not mention the company (case-insensitive)
+            if not content or company_name.lower() not in content.lower():
+                continue
+            article_data = {
+                "title": art.title if art.title else "No Title",
+                "url": url,
+                "date": str(art.publish_date) if art.publish_date else "N/A",
+                "content": content
+            }
+            sentiment, score = analyze_sentiment(content)
+            article_data["sentiment"] = sentiment
+            article_data["score"] = score
+            articles.append(article_data)
+        except Exception as e:
+            print(f"Error processing article {url}: {e}")
+    return articles
+
+def generate_combined_summary(articles):
+    """Generate a combined summary from articles.
+    First attempts a transformers pipeline; if that fails, falls back to Sumy."""
+    combined_text = " ".join(article["content"] for article in articles)
+    if not combined_text.strip():
+        return ""
+    # Try the transformers summarizer, chunking the input to stay within BART's input limit
+    try:
+        from transformers import pipeline
+        summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
+        chunks = chunk_text_by_words(combined_text, chunk_size=400)
+        summaries = summarizer(chunks, max_length=150, min_length=50, do_sample=False)
+        return " ".join(s["summary_text"] for s in summaries)
+    except Exception as e:
+        print(f"Transformers summarization failed: {e}")
+    # Fallback: Sumy extractive summarization
+    try:
+        from sumy.parsers.plaintext import PlaintextParser
+        from sumy.nlp.tokenizers import Tokenizer
+        from sumy.summarizers.lex_rank import LexRankSummarizer
+        parser = PlaintextParser.from_string(combined_text, Tokenizer("english"))
+        summarizer_sumy = LexRankSummarizer()
+        summary_sentences = summarizer_sumy(parser.document, sentences_count=5)
+        summarized_text = " ".join(str(sentence) for sentence in summary_sentences)
+        return summarized_text if summarized_text else combined_text[:500]
+    except Exception as e2:
+        print(f"Sumy summarization failed: {e2}")
+        return combined_text[:500]
+
+def translate_to_hindi(text):
+    """Translate English text to Hindi using deep-translator for better quality."""
+    try:
+        translator = GoogleTranslator(source='auto', target='hi')
+        return translator.translate(text)
+    except Exception as e:
+        print(f"Translation failed: {e}")
+        return text
+
+def comparative_analysis(articles):
+    """Perform comparative sentiment analysis across articles and generate a bar chart."""
+    pos, neg, neu = 0, 0, 0
+    for article in articles:
+        sentiment = article.get("sentiment", "Neutral")
+        if sentiment == "Positive":
+            pos += 1
+        elif sentiment == "Negative":
+            neg += 1
+        else:
+            neu += 1
+
+    # Create a bar chart using matplotlib
+    labels = ['Positive', 'Negative', 'Neutral']
+    counts = [pos, neg, neu]
+    plt.figure(figsize=(6, 4))
+    bars = plt.bar(labels, counts, color=['green', 'red', 'gray'])
+    plt.title("Comparative Sentiment Analysis")
+    plt.xlabel("Sentiment")
+    plt.ylabel("Number of Articles")
+    for bar, count in zip(bars, counts):
+        height = bar.get_height()
+        plt.text(bar.get_x() + bar.get_width()/2., height, str(count), ha='center', va='bottom')
+    image_path = "sentiment_analysis.png"
+    plt.savefig(image_path)
+    plt.close()
+    return {"Positive": pos, "Negative": neg, "Neutral": neu, "graph": image_path}
+
+def generate_tts_audio(text, output_file="news_summary.mp3"):
+    """Generate a TTS audio file from text using Edge TTS (via tts_hindi_edgetts.py)."""
+    try:
+        from tts_hindi_edgetts import text_to_speech_hindi
+        return asyncio.run(text_to_speech_hindi(text, output_file))
+    except Exception as e:
+        print(f"TTS generation failed: {e}")
+        return None
+
+def process_news(company_name):
+    """
+    Process news by:
+    • Extracting articles and metadata (only those relevant to the company)
+    • Generating a combined summary of article contents
+    • Translating the summary to Hindi
+    • Generating a Hindi TTS audio file
+    • Performing comparative sentiment analysis with visual output
+    """
+    articles = process_articles(company_name)
+    summary = generate_combined_summary(articles)
+    hindi_summary = translate_to_hindi(summary)
+    tts_audio = generate_tts_audio(hindi_summary)
+    sentiment_distribution = comparative_analysis(articles)
+    result = {
+        "company": company_name,
+        "articles": articles,
+        "comparative_sentiment": sentiment_distribution,
+        "final_summary": summary,
+        "hindi_summary": hindi_summary,
+        "tts_audio": tts_audio  # file path for the generated audio
+    }
+    return result
+
+if __name__ == "__main__":
+    company = input("Enter company name: ")
+    import json
+    data = process_news(company)
+    print(json.dumps(data, indent=4, ensure_ascii=False))
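
For reference, a sketch of the JSON shape that `process_news` returns (string values here are illustrative placeholders, not real output):

```python
{
    "company": "Tesla",
    "articles": [
        {"title": "...", "url": "...", "date": "...", "content": "...",
         "sentiment": "Positive", "score": 0.75}
    ],
    "comparative_sentiment": {"Positive": 4, "Negative": 2, "Neutral": 1,
                              "graph": "sentiment_analysis.png"},
    "final_summary": "...",
    "hindi_summary": "...",
    "tts_audio": "news_summary.mp3"
}
```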