---
license: apache-2.0
language:
- en
pipeline_tag: text-classification
library_name: fasttext
tags:
- news
---

# Riple's Tuned FastText News Categorization

FastText News Categorization is a simple yet effective project that classifies news articles into categories using Facebook’s FastText library. This repository contains scripts for data preprocessing, model training, evaluation, and prediction on news datasets.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Dataset](#dataset)
- [Results](#results)
- [License](#license)

## Overview

Automatically categorizing news articles improves content organization and information retrieval. This project uses FastText to build a text classifier that assigns each news article to one of a set of predefined topics (e.g., politics, sports, technology, entertainment).

## Features

- **Efficient Text Classification:** Utilizes FastText’s supervised learning approach for quick and accurate news categorization.
- **Easy Model Evaluation:** Evaluate the model's performance with minimal configuration.
- **Prediction Interface:** Run predictions on new articles to determine their categories.
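
As a rough illustration of the prediction step, the sketch below wraps the `fasttext` Python bindings. The file name `model.bin` and the helper names `normalize`, `predict_category`, and `load_model` are assumptions made for this example, not names taken from this repository. Whitespace is collapsed first because `predict` handles one line of text at a time.

```python
from typing import List, Tuple


def normalize(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def predict_category(model, text: str, k: int = 1) -> List[Tuple[str, float]]:
    """Return the top-k (label, probability) pairs for one article."""
    labels, probs = model.predict(normalize(text), k=k)
    return [(label, float(prob)) for label, prob in zip(labels, probs)]


def load_model(path: str = "model.bin"):
    """Load a trained model. The default path is a placeholder (assumption)."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(path)


# Example usage (requires the fasttext package and a trained model file):
# model = load_model("model.bin")
# print(predict_category(model, "The home team won\nthe championship final.", k=3))
```

Passing `k=3` returns the three most likely categories, which can help when an article sits between two topics.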

#### Below is a list of news categories along with their definitions:
- **__label__POLITICS_AND_GOVERNMENT:** News related to political events, government policies, elections, and political analysis.
- **__label__BUSINESS_AND_ECONOMY:** News concerning economic trends, business updates, financial markets, and economic policies.
- **__label__CRIME_AND_JUSTICE:** News focusing on crime reports, legal cases, law enforcement actions, and judicial decisions.
- **__label__SPORTS:** News covering sports events, athlete performances, game results, and sports analysis.
- **__label__ENTERTAINMENT:** News related to movies, music, television, celebrity gossip, and cultural events.
- **__label__HEALTH_AND_SCIENCE:** News covering medical research, health trends, scientific discoveries, and wellness advice.
- **__label__ENVIRONMENT_AND_CLIMATE:** News addressing long-term environmental issues, climate change, conservation efforts, and sustainability.
- **__label__TECHNOLOGY:** News about technological advancements, new gadgets, software innovations, and IT trends.
- **__label__EDUCATION:** News concerning educational policies, academic research, school and university updates, and academic achievements.
- **__label__LIFESTYLE_AND_CULTURE:** News covering cultural trends, lifestyle, fashion, travel, and social commentary.
- **__label__DISASTER_AND_ACCIDENT:** News related to natural disasters, accidents, emergencies, and crisis events.
- **__label__SOCIAL_ISSUES:** News addressing societal challenges, human rights, public debates, and community concerns.
- **__label__MILITARY_AND_DEFENSE:** News covering military operations, defense policies, international conflicts, and security matters.
- **__label__WEATHER_AND_CLIMATE:** News focused on immediate weather updates, forecasts, and meteorological conditions.
- **__label__PROMOTIONAL:** Content intended for advertising, sponsored material, or promotional purposes.
- **__label__ARCHIVE:** News that is outdated or no longer relevant and is generally not considered worth sharing.
- **__label__MISCLENIOUS:** News that does not fit into any of the other categories.
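
The raw labels above carry FastText's `__label__` prefix. If a display name is needed, a small helper along these lines could convert them; `pretty_label` is a hypothetical name used only for this sketch.

```python
LABEL_PREFIX = "__label__"


def pretty_label(raw: str) -> str:
    """Turn '__label__POLITICS_AND_GOVERNMENT' into 'Politics and Government'."""
    name = raw[len(LABEL_PREFIX):] if raw.startswith(LABEL_PREFIX) else raw
    words = name.split("_")
    # Keep the conjunction lowercase, capitalize every other word.
    return " ".join(w.lower() if w == "AND" else w.capitalize() for w in words)


print(pretty_label("__label__POLITICS_AND_GOVERNMENT"))  # Politics and Government
print(pretty_label("__label__SPORTS"))  # Sports
```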


## Dataset

The default dataset used in this project is a collection of news articles with labeled categories. The model was trained on 140,000 labeled news articles.
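
FastText's supervised mode reads plain-text files in which each line starts with a `__label__`-prefixed category followed by the article text. The sketch below shows one way to produce such lines. How this project actually stores its data is not documented here, so the `(category, text)` input shape and the function names are assumptions.

```python
from typing import Iterable, Tuple


def to_fasttext_line(category: str, text: str) -> str:
    """Format one example as '__label__CATEGORY article text' on a single line."""
    label = "__label__" + category.strip().upper().replace(" ", "_")
    body = " ".join(text.split())  # newlines would split the example in two
    return f"{label} {body}"


def write_training_file(rows: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (category, text) pairs to a file in FastText's supervised format."""
    with open(path, "w", encoding="utf-8") as handle:
        for category, text in rows:
            handle.write(to_fasttext_line(category, text) + "\n")


print(to_fasttext_line("Sports", "Local team\nwins the cup"))
# __label__SPORTS Local team wins the cup
```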

## Results

After training and evaluation, the model typically achieves an accuracy of around 85-90% on the test set (depending on the dataset and preprocessing quality). Detailed evaluation reports are generated and saved in the `results/` directory.
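
One way to check the accuracy on a held-out file is the `test` method of the `fasttext` Python bindings, which returns the number of examples together with precision and recall at `k=1`. This is a minimal sketch, not this project's evaluation script. The file names `model.bin` and `test.txt` and the helper name `evaluate` are placeholders.

```python
def evaluate(model, test_path: str) -> dict:
    """Summarize model.test() output for a file in FastText's supervised format."""
    n_examples, precision, recall = model.test(test_path)
    return {
        "examples": n_examples,
        "precision_at_1": precision,
        "recall_at_1": recall,
    }


# Example usage (requires the fasttext package, a model, and a labeled test file):
# import fasttext
# model = fasttext.load_model("model.bin")
# print(evaluate(model, "test.txt"))
```

When every test example has exactly one label, precision at 1 is the share of correct predictions, so it can be read as accuracy.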

## License

This project is licensed under the [Apache 2.0 License](LICENSE).