---
license: apache-2.0
language:
- en
pipeline_tag: text-classification
library_name: fasttext
tags:
- news
---

# Riple's Tuned FastText News Categorization

FastText News Categorization is a simple yet effective project that classifies news articles into categories using Facebook’s FastText library. This repository contains scripts for data preprocessing, model training, evaluation, and prediction on news datasets.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Usage](#usage)
- [Evaluating the Model](#evaluating-the-model)
- [Predicting Categories](#predicting-categories)
- [Dataset](#dataset)
- [Results](#results)
- [Contributing](#contributing)
- [License](#license)

## Overview

Automatically categorizing news articles helps organize content and improves information retrieval. This project uses FastText to build a text classifier that assigns news articles to predefined topics (e.g., politics, sports, technology, entertainment).

## Features

- **Efficient Text Classification:** Uses FastText’s supervised learning approach for quick and accurate news categorization.
- **Easy Model Evaluation:** Evaluate the model’s performance with minimal configuration.
- **Prediction Interface:** Run predictions on new articles to determine their categories.

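
As a rough illustration of the prediction interface, the sketch below wraps the `predict` call from the `fasttext` Python package. The helper name `predict_category` and the model path `model.bin` are assumptions made for this example; they are not names taken from this repository.

```python
def predict_category(model, article, k=1):
    """Return the top-k (label, probability) pairs for one article.

    `model` is expected to behave like a model returned by
    `fasttext.load_model`; any object with a compatible `predict` works.
    """
    # fastText's predict handles one line at a time and rejects text that
    # contains newlines, so collapse all whitespace into single spaces.
    text = " ".join(article.split())
    labels, probabilities = model.predict(text, k=k)
    return [(label, float(p)) for label, p in zip(labels, probabilities)]


# Usage (requires `pip install fasttext`; "model.bin" is a placeholder path):
#   import fasttext
#   model = fasttext.load_model("model.bin")
#   print(predict_category(model, "The central bank raised interest rates today."))
```

Passing a larger `k` returns several candidate labels, which can help when an article plausibly belongs to more than one category.
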
#### News categories and their definitions

- **`__label__POLITICS_AND_GOVERNMENT`:** News related to political events, government policies, elections, and political analysis.
- **`__label__BUSINESS_AND_ECONOMY`:** News concerning economic trends, business updates, financial markets, and economic policies.
- **`__label__CRIME_AND_JUSTICE`:** News focusing on crime reports, legal cases, law enforcement actions, and judicial decisions.
- **`__label__SPORTS`:** News covering sports events, athlete performances, game results, and sports analysis.
- **`__label__ENTERTAINMENT`:** News related to movies, music, television, celebrity gossip, and cultural events.
- **`__label__HEALTH_AND_SCIENCE`:** News covering medical research, health trends, scientific discoveries, and wellness advice.
- **`__label__ENVIRONMENT_AND_CLIMATE`:** News addressing long-term environmental issues, climate change, conservation efforts, and sustainability.
- **`__label__TECHNOLOGY`:** News about technological advancements, new gadgets, software innovations, and IT trends.
- **`__label__EDUCATION`:** News concerning educational policies, academic research, school and university updates, and academic achievements.
- **`__label__LIFESTYLE_AND_CULTURE`:** News covering cultural trends, lifestyle, fashion, travel, and social commentary.
- **`__label__DISASTER_AND_ACCIDENT`:** News related to natural disasters, accidents, emergencies, and crisis events.
- **`__label__SOCIAL_ISSUES`:** News addressing societal challenges, human rights, public debates, and community concerns.
- **`__label__MILITARY_AND_DEFENSE`:** News covering military operations, defense policies, international conflicts, and security matters.
- **`__label__WEATHER_AND_CLIMATE`:** News focused on immediate weather updates, forecasts, and meteorological conditions.
- **`__label__PROMOTIONAL`:** Content intended for advertising, sponsored material, or promotional purposes.
- **`__label__ARCHIVE`:** News that is outdated or no longer relevant and is generally not considered worth sharing.
- **`__label__MISCLENIOUS`:** News that does not fit into any other category, covering miscellaneous topics.

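
The labels listed above carry fastText's `__label__` prefix, which is awkward to show to end users. The following is a minimal sketch of turning a raw label into a display name; `readable_category` is a hypothetical helper introduced here, not part of this repository.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def readable_category(label):
    """Turn '__label__HEALTH_AND_SCIENCE' into 'Health and Science'."""
    # Drop the prefix if present; otherwise use the label unchanged.
    name = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    words = name.split("_")
    # Keep the connective "and" lowercase; capitalize every other word.
    return " ".join(w.lower() if w == "AND" else w.capitalize() for w in words)


print(readable_category("__label__HEALTH_AND_SCIENCE"))
```

The raw label strings should still be used when comparing against model output, since they are the identifiers the model emits.
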
## Dataset

The default dataset used in this project is a collection of news articles with labeled categories. The model was trained on 140,000 labeled news articles.

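
fastText's supervised mode reads plain text with one example per line, each line starting with its label. This repository's preprocessing is not described here, so the sketch below only illustrates that general format; the helper `to_fasttext_line` and the sample records are made up for the example.

```python
def to_fasttext_line(category, text):
    """Format one labeled article as a fastText supervised-training line."""
    # fastText reads one example per line, so newlines inside the article
    # are collapsed into single spaces.
    cleaned = " ".join(text.split())
    return f"__label__{category} {cleaned}"


# Two made-up records, purely for illustration.
records = [
    ("SPORTS", "The home team won\nthe championship final."),
    ("TECHNOLOGY", "A new smartphone was announced today."),
]
lines = [to_fasttext_line(category, text) for category, text in records]
print("\n".join(lines))
```

A file containing such lines can be passed to `fasttext.train_supervised(input=...)`; the hyperparameters used for this model are not stated in this card.
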
## Results

The model typically reaches an accuracy of around 85-90% on the test set, depending on the dataset and preprocessing quality. Detailed evaluation reports are generated and saved in the `results/` directory.

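
For reference, accuracy here means the fraction of test articles whose predicted label matches the true label. The sketch below computes it from two label lists; the function name and the sample labels are illustrative and do not come from this project's evaluation scripts.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels: three of the four predictions match.
truth = ["__label__SPORTS", "__label__TECHNOLOGY", "__label__SPORTS", "__label__EDUCATION"]
preds = ["__label__SPORTS", "__label__TECHNOLOGY", "__label__ENTERTAINMENT", "__label__EDUCATION"]
print(accuracy(truth, preds))
```

With the `fasttext` Python package, `model.test(path)` returns the number of examples along with precision and recall at `k`; when every example has exactly one label and `k=1`, precision at 1 equals this accuracy.
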
## License

This project is licensed under the [Apache 2.0 License](LICENSE).