DeepAr

Model Description

DeepAr is a state-of-the-art Arabic Automatic Speech Recognition (ASR) model based on the whisper-turbo-v3 architecture. It is our latest and most advanced version, trained on the complete CUAIStudents/Ar-ASR dataset.

Key Features:

High-fidelity transcription: Transcribes exactly what is pronounced, maintaining authenticity of speech patterns
Speech improvement tool: Designed to help users identify and correct speech patterns
Superior performance: Outperforms many existing Arabic ASR models based on Whisper and its variants
Arabic with Tashkil: Provides accurate diacritization for comprehensive Arabic text output

What Makes DeepAr Different

Unlike traditional ASR models that normalize speech to standard text, DeepAr transcribes exactly what is pronounced. This unique approach makes it particularly valuable for:

Speech therapy and improvement: Identifies pronunciation patterns and deviations
Language learning: Helps learners understand their actual pronunciation vs. intended speech
Linguistic research: Captures authentic speech patterns for analysis
Pronunciation assessment: Provides detailed feedback on spoken Arabic

Model Details

Base Architecture: whisper-turbo-v3
Language: Arabic (with Tashkil/diacritics)
Task: High-fidelity Automatic Speech Recognition
Training Data: Complete CUAIStudents/Ar-ASR dataset
Model Type: Production-ready, latest version
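
Because the output carries Tashkil, comparing it against undiacritized text (or passing it to a component that expects bare letters) means removing the marks first. A small sketch of that step, assuming the marks of interest are the Unicode combining characters U+064B through U+0652 plus the superscript alef U+0670. `strip_tashkil` is a hypothetical helper name, not something the model or `transformers` provides.

```python
import re

# Fathatan through sukun (U+064B-U+0652) and superscript alef (U+0670)
_TASHKIL = re.compile("[\u064B-\u0652\u0670]")

def strip_tashkil(text):
    """Remove Arabic short-vowel and related diacritic marks from text."""
    return _TASHKIL.sub("", text)
```

Keep the diacritized string as well: for pronunciation feedback the marks are usually the part worth inspecting.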

Performance

DeepAr demonstrates superior performance compared to many Arabic ASR models built on Whisper and its variants, particularly excelling in:

Pronunciation accuracy detection
Diacritic prediction
Handling of Arabic speech variations
Authentic speech pattern recognition

Intended Use

This model is ideal for:

Speech therapy and pronunciation correction applications
Arabic language learning platforms
Linguistic research and analysis
Educational tools for speech improvement
Applications requiring authentic speech transcription
Quality assessment of spoken Arabic

Usage

Installation

pip install transformers torch torchaudio

Quick Start

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load model and processor
processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr")
model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr")

# Load and preprocess audio
audio_path = "path_to_your_arabic_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

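# Process audio: downmix to mono first, since squeeze() alone would leave
# stereo input as a 2-row array instead of a single waveform
mono = waveform.mean(dim=0)
input_features = processor(mono.numpy(), sampling_rate=16000, return_tensors="pt").input_features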

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="ar")
    
# Decode transcription (exactly as pronounced)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Pronounced as: {transcription}")
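
Whisper's feature extractor pads or truncates each input to a 30-second window by default, so a longer recording passed through the snippet above would be cut off. One way to handle this, sketched here on the assumption that fixed-size windows with a small overlap are acceptable, is to compute sample ranges and transcribe each range separately. The helper name `chunk_bounds` and the 1-second overlap are illustrative choices, not part of DeepAr or `transformers`.

```python
def chunk_bounds(num_samples, sample_rate=16000, window_s=30.0, overlap_s=1.0):
    """Return (start, end) sample index pairs covering the whole recording.

    Hypothetical helper: the window and overlap sizes are illustrative defaults.
    """
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if window <= 0 or step <= 0:
        raise ValueError("window_s must be positive and larger than overlap_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds
```

Each `(start, end)` pair can be used to slice the 1-D waveform before it is passed to `processor`. Joining the per-chunk transcriptions and removing text repeated in the overlap is left out of this sketch.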

Speech Analysis Example

def analyze_pronunciation(audio_path, target_text=None):
    """
    Analyze pronunciation and compare with target text if provided
    """
    waveform, sample_rate = torchaudio.load(audio_path)
    
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
    
    # Downmix to mono; squeeze() alone would leave stereo audio as a 2-row array
    mono = waveform.mean(dim=0)
    input_features = processor(mono.numpy(), sampling_rate=16000, return_tensors="pt").input_features
    
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="ar")
    
    actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    
    print(f"Actual pronunciation: {actual_pronunciation}")
    
    if target_text:
        print(f"Target text: {target_text}")
        print("Analysis: Compare the differences for speech improvement")
    
    return actual_pronunciation

# Example usage
pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته")
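
The function above prints both strings but leaves the comparison itself to the reader. A minimal sketch of that comparison uses Python's standard `difflib` at character level, so that a missing or changed diacritic shows up as its own edit. `diff_pronunciation` is a hypothetical helper name, and character-level alignment is an assumption; word-level or phoneme-level alignment may suit some applications better.

```python
import difflib

def diff_pronunciation(target_text, actual_text):
    """List the character-level edits that turn target_text into actual_text.

    Hypothetical helper: each entry is (operation, target_span, actual_span),
    where operation is 'replace', 'delete', or 'insert'.
    """
    # autojunk=False keeps difflib from ignoring frequent characters in long texts
    matcher = difflib.SequenceMatcher(a=target_text, b=actual_text, autojunk=False)
    edits = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, target_text[a_start:a_end], actual_text[b_start:b_end]))
    return edits
```

An empty list means the transcription matched the target exactly. A `delete` entry holding a single diacritic suggests the speaker dropped that vowel mark.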

Batch Processing for Speech Assessment

def assess_multiple_recordings(audio_files, target_texts=None):
    """
    Process multiple recordings for comprehensive speech assessment
    """
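    # Fail early if targets are given but do not line up with the recordings
    if target_texts is not None and len(target_texts) != len(audio_files):
        raise ValueError("target_texts must have the same length as audio_files")

    results = []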
    
    for i, audio_file in enumerate(audio_files):
        waveform, sample_rate = torchaudio.load(audio_file)
        
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)
        
        # Downmix to mono; squeeze() alone would leave stereo audio as a 2-row array
        mono = waveform.mean(dim=0)
        input_features = processor(mono.numpy(), sampling_rate=16000, return_tensors="pt").input_features
        
        with torch.no_grad():
            predicted_ids = model.generate(input_features, language="ar")
        
        pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        
        result = {
            'file': audio_file,
            'pronunciation': pronunciation,
            'target': target_texts[i] if target_texts else None
        }
        results.append(result)
        
        print(f"File {i+1}: {pronunciation}")
    
    return results

# Example usage
audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"]
target_texts = ["النص الأول", "النص الثاني", "النص الثالث"]
assessment_results = assess_multiple_recordings(audio_files, target_texts)
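
The list returned above pairs each transcription with its target, which is enough to compute a per-recording score. The sketch below uses character error rate (Levenshtein distance divided by target length) as one possible metric. The choice of metric and the helper names `edit_distance` and `score_results` are assumptions made for illustration; DeepAr does not ship a scoring function.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def score_results(results):
    """Return copies of the result dicts with a 'cer' key (None when no target is given)."""
    scored = []
    for result in results:
        entry = dict(result)
        target = result.get("target")
        if target:
            entry["cer"] = edit_distance(target, result["pronunciation"]) / len(target)
        else:
            entry["cer"] = None
        scored.append(entry)
    return scored
```

Because diacritics count as characters here, a reading with correct consonants but wrong vowel marks still gets a non-zero score, which fits the pronunciation-assessment use case.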

Training Data

This model was trained on the complete CUAIStudents/Ar-ASR dataset, which pairs Arabic speech with high-quality transcriptions that include diacritics.

Model Advantages

Authentic transcription: Captures exactly what was spoken, not what should have been spoken
High accuracy: Superior performance compared to similar Whisper-based Arabic models
Comprehensive training: Utilizes the complete dataset for optimal coverage
Practical applications: Specifically designed for speech improvement and assessment
Diacritic accuracy: Excellent performance in Arabic diacritization

Limitations

MSA focus: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations

License

This model is released under the MIT License.

MIT License

Copyright (c) 2024 CUAIStudents

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.