Riple's Tuned FastText News Categorization
FastText News Categorization is a simple yet effective project that classifies news articles into categories using Facebook’s fastText library. This repository contains scripts for data preprocessing, model training, evaluation, and prediction on news datasets.
Overview
In today’s digital age, automatically categorizing news articles is essential for improving content organization and enhancing information retrieval. This project leverages FastText to build a text classifier that categorizes news articles into predefined topics (e.g., politics, sports, technology, entertainment).
Features
- Efficient Text Classification: Utilizes FastText’s supervised learning approach for quick and accurate news categorization.
- Easy Model Evaluation: Evaluate the trained model's performance with minimal configuration.
- Prediction Interface: Run predictions on new articles to determine their categories.
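The features above map onto fastText's Python bindings roughly as follows. This is a minimal sketch, not the repository's actual scripts: the file names `news.train` and `news.test`, the hyperparameter values, and the helper names `strip_label_prefix`, `train_and_evaluate`, and `predict_category` are assumptions made for illustration.

```python
def strip_label_prefix(label: str) -> str:
    """Turn '__label__SPORTS' into 'SPORTS'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def train_and_evaluate(train_path: str = "news.train", test_path: str = "news.test"):
    """Train a supervised model and print precision/recall on a held-out file."""
    # Imported lazily so the pure-Python helpers work without fastText installed.
    import fasttext

    # Hyperparameter values are illustrative, not the project's tuned settings.
    model = fasttext.train_supervised(
        input=train_path, epoch=25, lr=0.5, wordNgrams=2
    )

    # test() returns (number of examples, precision@1, recall@1).
    n_examples, precision, recall = model.test(test_path)
    print(f"examples={n_examples} P@1={precision:.3f} R@1={recall:.3f}")
    return model


def predict_category(model, article: str):
    """Return (category name, probability) for a single article."""
    # fastText predicts one line at a time, so newlines are replaced first.
    labels, probabilities = model.predict(article.replace("\n", " "), k=1)
    return strip_label_prefix(labels[0]), float(probabilities[0])
```

With a trained model, `predict_category(model, "Parliament passes the new budget bill")` would return a pair such as a category name and its probability; the actual label depends on the trained model.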
Below is a list of news categories along with their definitions:
- __label__POLITICS_AND_GOVERNMENT: News related to political events, government policies, elections, and political analysis.
- __label__BUSINESS_AND_ECONOMY: News concerning economic trends, business updates, financial markets, and economic policies.
- __label__CRIME_AND_JUSTICE: News focusing on crime reports, legal cases, law enforcement actions, and judicial decisions.
- __label__SPORTS: News covering sports events, athlete performances, game results, and sports analysis.
- __label__ENTERTAINMENT: News related to movies, music, television, celebrity gossip, and cultural events.
- __label__HEALTH_AND_SCIENCE: News covering medical research, health trends, scientific discoveries, and wellness advice.
- __label__ENVIRONMENT_AND_CLIMATE: News addressing long-term environmental issues, climate change, conservation efforts, and sustainability.
- __label__TECHNOLOGY: News about technological advancements, new gadgets, software innovations, and IT trends.
- __label__EDUCATION: News concerning educational policies, academic research, school and university updates, and academic achievements.
- __label__LIFESTYLE_AND_CULTURE: News covering cultural trends, lifestyle, fashion, travel, and social commentary.
- __label__DISASTER_AND_ACCIDENT: News related to natural disasters, accidents, emergencies, and crisis events.
- __label__SOCIAL_ISSUES: News addressing societal challenges, human rights, public debates, and community concerns.
- __label__MILITARY_AND_DEFENSE: News covering military operations, defense policies, international conflicts, and security matters.
- __label__WEATHER_AND_CLIMATE: News focused on immediate weather updates, forecasts, and meteorological conditions.
- __label__PROMOTIONAL: Content intended for advertising, sponsored material, or promotional purposes.
- __label__ARCHIVE: News that is outdated or no longer relevant and is generally not considered worth sharing.
- __label__MISCLENIOUS: News that does not fit into any other category, encompassing miscellaneous topics.
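fastText's supervised mode expects one example per line, with the label written as a `__label__` prefix followed by the text. The sketch below shows one way to produce such a line from a category name and an article body. The helper name `to_fasttext_line` and the whitespace normalization are assumptions for illustration, not the project's actual preprocessing.

```python
import re


def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line: '__label__<NAME> <text on a single line>'."""
    # Collapse newlines, tabs, and repeated spaces so the article stays on one line.
    single_line = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {single_line}"


print(to_fasttext_line("SPORTS", "City win the cup\nafter extra time."))
# __label__SPORTS City win the cup after extra time.
```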
Dataset
The default dataset used in this project is a collection of news articles with labeled categories. The model is trained on 140,000 labeled news articles.
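Evaluating on a test set requires holding out part of the labeled data before training. The source does not state how the split is made, so the following is only a sketch under assumptions: the 10% test fraction, the fixed seed, and the function name `split_lines` are all hypothetical.

```python
import random


def split_lines(lines, test_fraction=0.1, seed=42):
    """Shuffle labeled lines and split them into (train, test) lists."""
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy input standing in for real labeled lines.
train, test = split_lines([f"__label__SPORTS article {i}" for i in range(1000)])
print(len(train), len(test))  # 900 100
```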
Results
After training and evaluation, the model typically achieves an accuracy of around 85-90% on the test set (depending on the dataset and preprocessing quality). Detailed evaluation reports are generated and saved in the results/ directory.
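An overall accuracy figure can hide weak categories, so a per-label breakdown is often useful alongside it. The sketch below computes both from parallel lists of gold and predicted labels. The function name `accuracy_report` and the toy labels are illustrative assumptions; they are not the project's evaluation code or its results.

```python
from collections import Counter


def accuracy_report(gold, predicted):
    """Return overall accuracy and per-label accuracy for parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    overall = sum(correct.values()) / len(gold) if gold else 0.0
    per_label = {label: correct[label] / count for label, count in total.items()}
    return overall, per_label


# Toy labels for illustration only.
gold = ["SPORTS", "SPORTS", "TECHNOLOGY", "EDUCATION"]
predicted = ["SPORTS", "TECHNOLOGY", "TECHNOLOGY", "EDUCATION"]
overall, per_label = accuracy_report(gold, predicted)
print(overall)    # 0.75
print(per_label)  # {'SPORTS': 0.5, 'TECHNOLOGY': 1.0, 'EDUCATION': 1.0}
```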
License
This project is licensed under the Apache 2.0 License.