---
language:
- ta
- ml
- te
tags:
- multimodal
- hate-speech-detection
- text-classification
- audio-classification
- deep-learning
- tamil
- malayalam
- telugu
license: cc-by-nc-4.0
datasets:
- dravidian-hate-speech
model-index:
- name: Multimodal Hate Speech Detection in Dravidian Languages
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Dravidian Hate Speech Dataset
      type: dravidian-hate-speech
    metrics:
    - type: macro-f1
      value: 0.6438
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Dravidian Hate Speech Dataset
      type: dravidian-hate-speech
    metrics:
    - type: macro-f1
      value: 0.88
---
|
|
|
# Multimodal Classification Model (Tamil, Malayalam, Telugu) |
|
|
|
This repository contains deep learning models for **text and audio classification** in three languages: **Tamil, Malayalam, and Telugu**. |
|
|
|
--- |
|
|
|
## 📌 Overview |
|
|
|
The models accept **text and audio inputs** and classify them into predefined categories. Each language has dedicated trained models and label encoders: |
|
|
|
- **Text Model:** Utilizes `xlm-roberta-large` for feature extraction with a deep learning classifier. |
|
- **Audio Model:** Uses **MFCC feature extraction** and a CNN-based classifier. |
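
The exact classifier heads ship inside the saved model files, so the sketch below is illustrative rather than the shipped architecture: the layer sizes are assumptions, and only the input shapes (1024-dimensional XLM-R embeddings for text, 40 MFCCs reshaped to `(40, 1, 1)` for audio) are taken from the prediction code later in this card.

```python
import tensorflow as tf

# Illustrative text head: a dense classifier over mean-pooled XLM-R embeddings.
# Hidden sizes are assumptions; the shipped .h5 file defines the real stack.
def build_text_head(num_classes, embedding_dim=1024):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(embedding_dim,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Illustrative audio head: a small CNN over (40, 1, 1) MFCC inputs.
def build_audio_head(num_classes, n_mfcc=40):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 1), activation="relu", padding="same",
                               input_shape=(n_mfcc, 1, 1)),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```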
|
|
|
--- |
|
|
|
## 🛠 1. Setup |
|
|
|
### 1.1 Clone the Repository |
|
|
|
```bash
git clone https://huggingface.co/vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages
cd Multimodal_Hate_Speech_Detection_in_Dravidian_languages
```
|
|
|
### 1.2 Install Dependencies |
|
|
|
Ensure Python is installed, then run: |
|
|
|
```bash
pip install -r requirements.txt
```
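
The pinned versions live in `requirements.txt`; as a rough guide, the snippets in this card rely at minimum on the following packages (inferred from the imports below, not an authoritative list):

```bash
pip install tensorflow torch transformers librosa advertools indic-nlp-library pandas openpyxl numpy
```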
|
|
|
--- |
|
|
|
## 📂 2. Directory Structure |
|
|
|
```
├── audio_label_encoders/   # Label encoders for audio models
├── audio_models/           # Trained audio classification models
├── text_label_encoders/    # Label encoders for text models
└── text_models/            # Trained text classification models
```
|
|
|
Each folder contains three files, one each for **Tamil, Malayalam, and Telugu**.
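
For Tamil, the files referenced in the code below are `text_models/tamil_text_model.h5`, `audio_models/tamil_audio_model.keras`, `text_label_encoders/tamil_label_encoder.pkl`, and `audio_label_encoders/tamil_audio_label_encoder.pkl`; the Malayalam and Telugu artifacts presumably follow the same naming pattern.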
|
|
|
--- |
|
|
|
## 🚀 3. How to Use |
|
|
|
### 3.1 Load the Models |
|
|
|
```python
import tensorflow as tf
import pickle
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

# Load label encoders
with open("text_label_encoders/tamil_label_encoder.pkl", "rb") as f:
    tamil_text_label_encoder = pickle.load(f)

with open("audio_label_encoders/tamil_audio_label_encoder.pkl", "rb") as f:
    tamil_audio_label_encoder = pickle.load(f)

# Load trained models
text_model = tf.keras.models.load_model("text_models/tamil_text_model.h5")
audio_model = tf.keras.models.load_model("audio_models/tamil_audio_model.keras")
```
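
To work with all three languages at once, the same four artifacts can be loaded per language. A minimal sketch, assuming the Malayalam and Telugu files follow the Tamil naming pattern shown above:

```python
import pickle
import tensorflow as tf

models = {}
for lang in ["tamil", "malayalam", "telugu"]:
    # Assumed filenames, mirroring the Tamil artifacts in this repo
    with open(f"text_label_encoders/{lang}_label_encoder.pkl", "rb") as f:
        text_le = pickle.load(f)
    with open(f"audio_label_encoders/{lang}_audio_label_encoder.pkl", "rb") as f:
        audio_le = pickle.load(f)
    models[lang] = {
        "text_model": tf.keras.models.load_model(f"text_models/{lang}_text_model.h5"),
        "audio_model": tf.keras.models.load_model(f"audio_models/{lang}_audio_model.keras"),
        "text_label_encoder": text_le,
        "audio_label_encoder": audio_le,
    }
```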
|
|
|
--- |
|
|
|
## 📝 4. Text Classification |
|
|
|
### 4.1 Preprocess Text |
|
|
|
```python
from indicnlp.tokenize import indic_tokenize
import advertools as adv

# Tamil stopword list from advertools
stopwords = set(adv.stopwords["tamil"])

def preprocess_tamil_text(text):
    # Tokenize, drop stopwords, and rejoin into a single string
    tokens = indic_tokenize.trivial_tokenize(text, lang="ta")
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
```
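
A quick sanity check (the exact output depends on advertools' Tamil stopword list):

```python
sample = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
print(preprocess_tamil_text(sample))
```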
|
|
|
### 4.2 Extract Features and Predict |
|
|
|
```python
def extract_embeddings(model_name, texts):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    embeddings = []
    batch_size = 16
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
            outputs = model(**encoded_inputs)
            # Mean-pool the token embeddings into one vector per text
            batch_embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
            embeddings.extend(batch_embeddings)
    return np.array(embeddings)

feature_extractor = "xlm-roberta-large"
text = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
processed_text = preprocess_tamil_text(text)
text_embeddings = extract_embeddings(feature_extractor, [processed_text])

text_predictions = text_model.predict(text_embeddings)
predicted_label = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
print("Predicted Label:", predicted_label[0])
```
|
|
|
--- |
|
|
|
## 🔊 5. Audio Classification |
|
|
|
### 5.1 Preprocess Audio |
|
|
|
```python
import librosa

def extract_audio_features(file_path, sr=22050, n_mfcc=40):
    # Load audio at a fixed sample rate and average the MFCCs over time
    audio, _ = librosa.load(file_path, sr=sr)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfccs.T, axis=0)
```
|
|
|
### 5.2 Predict Audio Class |
|
|
|
```python
def predict_audio(file_path):
    features = extract_audio_features(file_path)
    # The CNN expects input of shape (batch, 40, 1, 1)
    reshaped_features = features.reshape((1, 40, 1, 1))
    predicted_class = np.argmax(audio_model.predict(reshaped_features), axis=1)
    predicted_label = tamil_audio_label_encoder.inverse_transform(predicted_class)
    return predicted_label[0]

audio_file = "test_audio.wav"
predicted_audio_label = predict_audio(audio_file)
print("Predicted Audio Label:", predicted_audio_label)
```
|
|
|
--- |
|
|
|
## 📊 6. Batch Processing for a Dataset |
|
|
|
### 6.1 Load Dataset |
|
|
|
```python
import os
import pandas as pd

def load_dataset(base_dir="../test", lang="tamil"):
    dataset = []
    lang_dir = os.path.join(base_dir, lang)
    audio_dir = os.path.join(lang_dir, "audio")
    text_dir = os.path.join(lang_dir, "text")

    # The transcripts live in the first .xlsx file in the text directory
    text_file = os.path.join(text_dir, [file for file in os.listdir(text_dir) if file.endswith(".xlsx")][0])
    text_df = pd.read_excel(text_file)

    audio_files = set(os.listdir(audio_dir))
    for file in text_df["File Name"]:
        transcript_row = text_df.loc[text_df["File Name"] == file]
        transcript = transcript_row.iloc[0]["Transcript"] if not transcript_row.empty else ""
        # Pair each transcript with its .wav file, or mark the row as audio-less
        if (file + ".wav") in audio_files:
            dataset.append({"File Name": os.path.join(audio_dir, file + ".wav"), "Transcript": transcript})
        else:
            dataset.append({"File Name": "Nil", "Transcript": transcript})

    return pd.DataFrame(dataset)

dataset_df = load_dataset()
```
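
The loader above implies the following test-set layout (reconstructed from the code, so treat it as an assumption): one `.xlsx` sheet with `File Name` and `Transcript` columns, plus matching `.wav` files.

```
../test/
└── tamil/
    ├── audio/
    │   └── <file>.wav
    └── text/
        └── <transcripts>.xlsx   # columns: "File Name", "Transcript"
```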
|
|
|
### 6.2 Predict Text and Audio in Bulk |
|
|
|
```python
# Text predictions for the whole dataset
dataset_df["Transcript"] = dataset_df["Transcript"].apply(preprocess_tamil_text)
text_embeddings = extract_embeddings(feature_extractor, dataset_df["Transcript"].tolist())
text_predictions = text_model.predict(text_embeddings)
text_labels = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))

dataset_df["Predicted Text Label"] = text_labels
# Audio predictions, skipping rows without a matching .wav file
dataset_df["Predicted Audio Label"] = dataset_df["File Name"].apply(lambda x: predict_audio(x) if x != "Nil" else "No Audio")
dataset_df.to_csv("predictions.tsv", sep="\t", index=False)
```
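
The resulting `predictions.tsv` is tab-separated with four columns: `File Name`, `Transcript`, `Predicted Text Label`, and `Predicted Audio Label`; rows with no matching `.wav` file carry `No Audio`.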
|
|
|
--- |
|
|
|
## ☁️ 7. Deployment on Hugging Face |
|
|
|
```bash
pip install huggingface_hub
huggingface-cli login
```
|
|
|
```python
from huggingface_hub import upload_file

upload_file(path_or_fileobj="text_models/tamil_text_model.h5", path_in_repo="text_models/tamil_text_model.h5", repo_id="<your-hf-repo>")
```
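
To consume the uploaded artifacts without cloning the whole repo, `hf_hub_download` fetches individual files; a sketch assuming the repo id from the clone step above:

```python
from huggingface_hub import hf_hub_download
import tensorflow as tf

# Download a single model file from the Hub and load it
model_path = hf_hub_download(
    repo_id="vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages",
    filename="text_models/tamil_text_model.h5",
)
text_model = tf.keras.models.load_model(model_path)
```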
|
|
|
--- |
|
|
|
## 📬 Contact |
|
|
|
For issues or improvements, feel free to open an issue or email [**[email protected]**](mailto:[email protected]).
|
|
|
--- |
|
|
|
**License:** CC BY-NC 4.0 |
|
|