---
language:
- ta
- ml
- te
tags:
- multimodal
- hate-speech-detection
- text-classification
- audio-classification
- deep-learning
- tamil
- malayalam
- telugu
license: cc-by-nc-4.0
datasets:
- dravidian-hate-speech
model-index:
- name: Multimodal Hate Speech Detection in Dravidian Languages
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Dravidian Hate Speech Dataset
      type: dravidian-hate-speech
    metrics:
    - type: macro-f1
      value: 0.6438
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Dravidian Hate Speech Dataset
      type: dravidian-hate-speech
    metrics:
    - type: macro-f1
      value: 0.88
---
|
|
|
# Multimodal Classification Model (Tamil, Malayalam, Telugu) |
|
|
|
This repository contains deep learning models for **text and audio classification** in three languages: **Tamil, Malayalam, and Telugu**. |
|
|
|
--- |
|
|
|
## 📌 Overview |
|
|
|
The models accept **text and audio inputs** and classify them into predefined categories. Each language has dedicated trained models and label encoders: |
|
|
|
- **Text Model:** Utilizes `xlm-roberta-large` for feature extraction with a deep learning classifier. |
|
- **Audio Model:** Uses **MFCC feature extraction** and a CNN-based classifier. |
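
The exact classifier heads ship inside the saved model files, so the sketch below is illustrative rather than the shipped architecture: the layer sizes are assumptions, and only the input shapes (1024-dimensional XLM-R embeddings for text, 40 MFCCs reshaped to `(40, 1, 1)` for audio) are taken from the prediction code later in this card.

```python
import tensorflow as tf

# Illustrative text head: a dense classifier over mean-pooled XLM-R embeddings.
# Hidden sizes are assumptions; the shipped .h5 file defines the real stack.
def build_text_head(num_classes, embedding_dim=1024):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(embedding_dim,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Illustrative audio head: a small CNN over (40, 1, 1) MFCC inputs.
def build_audio_head(num_classes, n_mfcc=40):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 1), activation="relu", padding="same",
                               input_shape=(n_mfcc, 1, 1)),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```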
|
|
|
--- |
|
|
|
## 🛠 1. Setup |
|
|
|
### 1.1 Clone the Repository |
|
|
|
```bash
git clone https://huggingface.co/vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages
cd Multimodal_Hate_Speech_Detection_in_Dravidian_languages
```
|
|
|
### 1.2 Install Dependencies |
|
|
|
Ensure Python is installed, then run: |
|
|
|
```bash
pip install -r requirements.txt
```
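
The pinned versions live in `requirements.txt`; as a rough guide, the snippets in this card rely at minimum on the following packages (inferred from the imports below, not an authoritative list):

```bash
pip install tensorflow torch transformers librosa advertools indic-nlp-library pandas openpyxl numpy
```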
|
|
|
--- |
|
|
|
## 📂 2. Directory Structure |
|
|
|
```
├── audio_label_encoders/   # Label encoders for audio models
├── audio_models/           # Trained audio classification models
├── text_label_encoders/    # Label encoders for text models
└── text_models/            # Trained text classification models
```
|
|
|
Each folder contains three files, one each for **Tamil, Malayalam, and Telugu**.
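
For Tamil, the files referenced in the code below are `text_models/tamil_text_model.h5`, `audio_models/tamil_audio_model.keras`, `text_label_encoders/tamil_label_encoder.pkl`, and `audio_label_encoders/tamil_audio_label_encoder.pkl`; the Malayalam and Telugu artifacts presumably follow the same naming pattern.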
|
|
|
--- |
|
|
|
## 🚀 3. How to Use |
|
|
|
### 3.1 Load the Models |
|
|
|
```python
import tensorflow as tf
import pickle
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

# Load label encoders
with open("text_label_encoders/tamil_label_encoder.pkl", "rb") as f:
    tamil_text_label_encoder = pickle.load(f)

with open("audio_label_encoders/tamil_audio_label_encoder.pkl", "rb") as f:
    tamil_audio_label_encoder = pickle.load(f)

# Load trained models
text_model = tf.keras.models.load_model("text_models/tamil_text_model.h5")
audio_model = tf.keras.models.load_model("audio_models/tamil_audio_model.keras")
```
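
To work with all three languages at once, the same four artifacts can be loaded per language. A minimal sketch, assuming the Malayalam and Telugu files follow the Tamil naming pattern shown above:

```python
import pickle
import tensorflow as tf

models = {}
for lang in ["tamil", "malayalam", "telugu"]:
    # Assumed filenames, mirroring the Tamil artifacts in this repo
    with open(f"text_label_encoders/{lang}_label_encoder.pkl", "rb") as f:
        text_le = pickle.load(f)
    with open(f"audio_label_encoders/{lang}_audio_label_encoder.pkl", "rb") as f:
        audio_le = pickle.load(f)
    models[lang] = {
        "text_model": tf.keras.models.load_model(f"text_models/{lang}_text_model.h5"),
        "audio_model": tf.keras.models.load_model(f"audio_models/{lang}_audio_model.keras"),
        "text_label_encoder": text_le,
        "audio_label_encoder": audio_le,
    }
```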
|
|
|
--- |
|
|
|
## 📝 4. Text Classification |
|
|
|
### 4.1 Preprocess Text |
|
|
|
```python
from indicnlp.tokenize import indic_tokenize
import advertools as adv

# Tamil stopword list from advertools
stopwords = set(adv.stopwords["tamil"])

def preprocess_tamil_text(text):
    # Tokenize, drop stopwords, and rejoin into a single string
    tokens = indic_tokenize.trivial_tokenize(text, lang="ta")
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
```
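
A quick sanity check (the exact output depends on advertools' Tamil stopword list):

```python
sample = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
print(preprocess_tamil_text(sample))
```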
|
|
|
### 4.2 Extract Features and Predict |
|
|
|
```python
def extract_embeddings(model_name, texts):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    embeddings = []
    batch_size = 16
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
            outputs = model(**encoded_inputs)
            # Mean-pool the token embeddings into one vector per text
            batch_embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
            embeddings.extend(batch_embeddings)
    return np.array(embeddings)

feature_extractor = "xlm-roberta-large"
text = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
processed_text = preprocess_tamil_text(text)
text_embeddings = extract_embeddings(feature_extractor, [processed_text])

text_predictions = text_model.predict(text_embeddings)
predicted_label = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
print("Predicted Label:", predicted_label[0])
```
|
|
|
--- |
|
|
|
## 🔊 5. Audio Classification |
|
|
|
### 5.1 Preprocess Audio |
|
|
|
```python
import librosa

def extract_audio_features(file_path, sr=22050, n_mfcc=40):
    # Load audio at a fixed sample rate and average the MFCCs over time
    audio, _ = librosa.load(file_path, sr=sr)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfccs.T, axis=0)
```
|
|
|
### 5.2 Predict Audio Class |
|
|
|
```python
def predict_audio(file_path):
    features = extract_audio_features(file_path)
    # The CNN expects input of shape (batch, 40, 1, 1)
    reshaped_features = features.reshape((1, 40, 1, 1))
    predicted_class = np.argmax(audio_model.predict(reshaped_features), axis=1)
    predicted_label = tamil_audio_label_encoder.inverse_transform(predicted_class)
    return predicted_label[0]

audio_file = "test_audio.wav"
predicted_audio_label = predict_audio(audio_file)
print("Predicted Audio Label:", predicted_audio_label)
```
|
|
|
--- |
|
|
|
## 📊 6. Batch Processing for a Dataset |
|
|
|
### 6.1 Load Dataset |
|
|
|
```python
import os
import pandas as pd

def load_dataset(base_dir="../test", lang="tamil"):
    dataset = []
    lang_dir = os.path.join(base_dir, lang)
    audio_dir = os.path.join(lang_dir, "audio")
    text_dir = os.path.join(lang_dir, "text")

    # The transcripts live in the first .xlsx file in the text directory
    text_file = os.path.join(text_dir, [file for file in os.listdir(text_dir) if file.endswith(".xlsx")][0])
    text_df = pd.read_excel(text_file)

    audio_files = set(os.listdir(audio_dir))
    for file in text_df["File Name"]:
        transcript_row = text_df.loc[text_df["File Name"] == file]
        transcript = transcript_row.iloc[0]["Transcript"] if not transcript_row.empty else ""
        # Pair each transcript with its .wav file, or mark the row as audio-less
        if (file + ".wav") in audio_files:
            dataset.append({"File Name": os.path.join(audio_dir, file + ".wav"), "Transcript": transcript})
        else:
            dataset.append({"File Name": "Nil", "Transcript": transcript})

    return pd.DataFrame(dataset)

dataset_df = load_dataset()
```
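
The loader above implies the following test-set layout (reconstructed from the code, so treat it as an assumption): one `.xlsx` sheet with `File Name` and `Transcript` columns, plus matching `.wav` files.

```
../test/
└── tamil/
    ├── audio/
    │   └── <file>.wav
    └── text/
        └── <transcripts>.xlsx   # columns: "File Name", "Transcript"
```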
|
|
|
### 6.2 Predict Text and Audio in Bulk |
|
|
|
```python
# Text predictions for the whole dataset
dataset_df["Transcript"] = dataset_df["Transcript"].apply(preprocess_tamil_text)
text_embeddings = extract_embeddings(feature_extractor, dataset_df["Transcript"].tolist())
text_predictions = text_model.predict(text_embeddings)
text_labels = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))

dataset_df["Predicted Text Label"] = text_labels
# Audio predictions, skipping rows without a matching .wav file
dataset_df["Predicted Audio Label"] = dataset_df["File Name"].apply(lambda x: predict_audio(x) if x != "Nil" else "No Audio")
dataset_df.to_csv("predictions.tsv", sep="\t", index=False)
```
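
The resulting `predictions.tsv` is tab-separated with four columns: `File Name`, `Transcript`, `Predicted Text Label`, and `Predicted Audio Label`; rows with no matching `.wav` file carry `No Audio`.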
|
|
|
--- |
|
|
|
## ☁️ 7. Deployment on Hugging Face |
|
|
|
```bash
pip install huggingface_hub
huggingface-cli login
```
|
|
|
```python
from huggingface_hub import upload_file

upload_file(path_or_fileobj="text_models/tamil_text_model.h5", path_in_repo="text_models/tamil_text_model.h5", repo_id="<your-hf-repo>")
```
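
To consume the uploaded artifacts without cloning the whole repo, `hf_hub_download` fetches individual files; a sketch assuming the repo id from the clone step above:

```python
from huggingface_hub import hf_hub_download
import tensorflow as tf

# Download a single model file from the Hub and load it
model_path = hf_hub_download(
    repo_id="vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages",
    filename="text_models/tamil_text_model.h5",
)
text_model = tf.keras.models.load_model(model_path)
```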
|
|
|
--- |
|
|
|
## 📬 Contact |
|
|
|
For issues or improvements, feel free to open an issue or email [**[email protected]**](mailto:[email protected]).
|
|
|
--- |
|
|
|
**License:** CC BY-NC 4.0 |
|
|