---
language:
- ta
- ml
- te
tags:
- multimodal
- hate-speech-detection
- text-classification
- audio-classification
- deep-learning
- tamil
- malayalam
- telugu
license: cc-by-nc-4.0
datasets:
- dravidian-hate-speech
model-index:
- name: Multimodal Hate Speech Detection in Dravidian Languages
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Dravidian Hate Speech Dataset
      type: dravidian-hate-speech
    metrics:
    - type: macro-f1
      value: 0.6438
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Dravidian Hate Speech Dataset
      type: dravidian-hate-speech
    metrics:
    - type: macro-f1
      value: 0.88
---

# Multimodal Classification Model (Tamil, Malayalam, Telugu)

This repository contains deep learning models for **text and audio classification** in three languages: **Tamil, Malayalam, and Telugu**.

---

## 📌 Overview

The models accept **text and audio inputs** and classify them into predefined categories. Each language has its own trained models and label encoders:

- **Text Model:** Uses `xlm-roberta-large` for feature extraction, followed by a deep learning classifier.
- **Audio Model:** Uses **MFCC feature extraction** and a CNN-based classifier.

---

## 🛠 1. Setup

### 1.1 Clone the Repository

```bash
git clone https://huggingface.co/vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages
cd Multimodal_Hate_Speech_Detection_in_Dravidian_languages
```

### 1.2 Install Dependencies

Ensure Python is installed, then run:

```bash
pip install -r requirements.txt
```

---

## 📂 2. Directory Structure

```
├── audio_label_encoders/   # Label encoders for audio models
├── audio_models/           # Trained audio classification models
├── text_label_encoders/    # Label encoders for text models
└── text_models/            # Trained text classification models
```

Each folder contains three files, one each for **Tamil, Malayalam, and Telugu**.

---

## 🚀 3. How to Use

### 3.1 Load the Models

```python
import tensorflow as tf
import pickle
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

# Load Label Encoders
with open("text_label_encoders/tamil_label_encoder.pkl", "rb") as f:
    tamil_text_label_encoder = pickle.load(f)
with open("audio_label_encoders/tamil_audio_label_encoder.pkl", "rb") as f:
    tamil_audio_label_encoder = pickle.load(f)

# Load Models
text_model = tf.keras.models.load_model("text_models/tamil_text_model.h5")
audio_model = tf.keras.models.load_model("audio_models/tamil_audio_model.keras")
```

---

## 📝 4. Text Classification

### 4.1 Preprocess Text

```python
from indicnlp.tokenize import indic_tokenize
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
import advertools as adv

stopwords = list(sorted(adv.stopwords["tamil"]))

def preprocess_tamil_text(text):
    # Tokenize, drop Tamil stopwords, and rejoin into a single string
    tokens = list(indic_tokenize.trivial_tokenize(text, lang="ta"))
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
```

### 4.2 Extract Features and Predict

```python
def extract_embeddings(model_name, texts):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    embeddings = []
    batch_size = 16
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
            outputs = model(**encoded_inputs)
            # Mean-pool the token embeddings into one vector per text
            batch_embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
            embeddings.extend(batch_embeddings)
    return np.array(embeddings)

feature_extractor = "xlm-roberta-large"
text = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
processed_text = preprocess_tamil_text(text)
text_embeddings = extract_embeddings(feature_extractor, [processed_text])
text_predictions = text_model.predict(text_embeddings)
predicted_label = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
print("Predicted Label:", predicted_label[0])
```

---

## 🔊 5. Audio Classification

### 5.1 Preprocess Audio

```python
import librosa

def extract_audio_features(file_path, sr=22050, n_mfcc=40):
    audio, _ = librosa.load(file_path, sr=sr)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    # Average over time frames to get one n_mfcc-length vector per file
    return np.mean(mfccs.T, axis=0)
```

### 5.2 Predict Audio Class

```python
def predict_audio(file_path):
    features = extract_audio_features(file_path)
    # 40 matches the default n_mfcc in extract_audio_features
    reshaped_features = features.reshape((1, 40, 1, 1))
    predicted_class = np.argmax(audio_model.predict(reshaped_features), axis=1)
    predicted_label = tamil_audio_label_encoder.inverse_transform(predicted_class)
    return predicted_label[0]

audio_file = "test_audio.wav"
predicted_audio_label = predict_audio(audio_file)
print("Predicted Audio Label:", predicted_audio_label)
```

---

## 📊 6. Batch Processing for a Dataset

### 6.1 Load Dataset

```python
import os
import pandas as pd

def load_dataset(base_dir='../test', lang='tamil'):
    dataset = []
    lang_dir = os.path.join(base_dir, lang)
    audio_dir = os.path.join(lang_dir, "audio")
    text_dir = os.path.join(lang_dir, "text")
    # Use the first .xlsx file found in the text directory
    text_file = os.path.join(text_dir, [file for file in os.listdir(text_dir) if file.endswith(".xlsx")][0])
    text_df = pd.read_excel(text_file)
    # List the audio directory once instead of once per row
    audio_files = set(os.listdir(audio_dir))
    for file in text_df["File Name"]:
        transcript_row = text_df.loc[text_df["File Name"] == file]
        transcript = transcript_row.iloc[0]["Transcript"] if not transcript_row.empty else ""
        # "Nil" marks rows that have no matching .wav file
        if (file + ".wav") in audio_files:
            audio_path = os.path.join(audio_dir, file + ".wav")
        else:
            audio_path = "Nil"
        dataset.append({"File Name": audio_path, "Transcript": transcript})
    return pd.DataFrame(dataset)

dataset_df = load_dataset()
```

### 6.2 Predict Text and Audio in Bulk

```python
dataset_df["Transcript"] = dataset_df["Transcript"].apply(preprocess_tamil_text)
text_embeddings = extract_embeddings(feature_extractor, dataset_df["Transcript"].tolist())
text_predictions = text_model.predict(text_embeddings)
text_labels = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
dataset_df["Predicted Text Label"] = text_labels
# Rows without audio ("Nil") get the placeholder label "No Audio"
dataset_df["Predicted Audio Label"] = dataset_df["File Name"].apply(lambda x: predict_audio(x) if x != "Nil" else "No Audio")
dataset_df.to_csv("predictions.tsv", sep="\t", index=False)
```

---

## ☁️ 7. Deployment on Hugging Face

```bash
pip install huggingface_hub
huggingface-cli login
```

```python
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="text_models/tamil_text_model.h5",
    path_in_repo="text_models/tamil_text_model.h5",
    repo_id="",  # set to the target repository ID
)
```

---

## 📬 Contact

To report a problem or suggest an improvement, open an issue or email [**vasantharank.work@gmail.com**](mailto:vasantharank.work@gmail.com).

---

**License:** CC BY-NC 4.0
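
The loading code in section 3.1 hard-codes the Tamil file names, while the directory structure says each folder holds one file per language. The sketch below shows one way to build the same paths for any of the three languages. `artifact_paths` and `SUPPORTED_LANGUAGES` are hypothetical names, not part of this repository, and the Malayalam and Telugu file names are an assumption: only the Tamil names appear in this README, so check the folders before relying on the pattern.

```python
from pathlib import PurePosixPath

# Languages named in this README.
SUPPORTED_LANGUAGES = ("tamil", "malayalam", "telugu")


def artifact_paths(lang):
    """Return repo-relative paths for one language's models and label encoders.

    Hypothetical helper. ASSUMPTION: the Malayalam and Telugu files follow
    the same naming pattern as the Tamil files shown in section 3.1.
    """
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang!r}")
    return {
        "text_model": str(PurePosixPath("text_models") / f"{lang}_text_model.h5"),
        "audio_model": str(PurePosixPath("audio_models") / f"{lang}_audio_model.keras"),
        "text_label_encoder": str(
            PurePosixPath("text_label_encoders") / f"{lang}_label_encoder.pkl"
        ),
        "audio_label_encoder": str(
            PurePosixPath("audio_label_encoders") / f"{lang}_audio_label_encoder.pkl"
        ),
    }


if __name__ == "__main__":
    # Print the paths that section 3.1 loads for Tamil.
    for name, path in artifact_paths("tamil").items():
        print(f"{name}: {path}")
```

For Tamil the helper reproduces the four paths used in section 3.1, so the returned strings can be passed directly to `pickle.load` (via `open`) and `tf.keras.models.load_model`. If the actual file names for another language differ, adjust the patterns inside the function rather than the calling code.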