--- |
|
library_name: transformers |
|
tags:
- automatic-speech-recognition
- whisper
- arabic
license: mit
language:
- ar
|
--- |
|
# DeepAr |
|
|
|
## Model Description |
|
|
|
DeepAr is an Arabic Automatic Speech Recognition (ASR) model built on the Whisper large-v3-turbo architecture. It is our latest and most advanced release, trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset.
|
|
|
**Key Features:** |
|
- **High-fidelity transcription**: Transcribes exactly what is pronounced, maintaining authenticity of speech patterns |
|
- **Speech improvement tool**: Designed to help users identify and correct speech patterns |
|
- **Superior performance**: Outperforms many existing Arabic ASR models based on Whisper and its variants |
|
- **Arabic with Tashkil**: Outputs fully diacritized Arabic text
|
|
|
## What Makes DeepAr Different |
|
|
|
Unlike traditional ASR models that normalize speech to standard text, DeepAr transcribes **exactly what is pronounced**. This unique approach makes it particularly valuable for: |
|
|
|
- **Speech therapy and improvement**: Identifies pronunciation patterns and deviations |
|
- **Language learning**: Helps learners understand their actual pronunciation vs. intended speech |
|
- **Linguistic research**: Captures authentic speech patterns for analysis |
|
- **Pronunciation assessment**: Provides detailed feedback on spoken Arabic |
|
|
|
## Model Details |
|
|
|
- **Base Architecture**: Whisper large-v3-turbo
|
- **Language**: Arabic (with Tashkil/diacritics) |
|
- **Task**: High-fidelity Automatic Speech Recognition |
|
- **Training Data**: Complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset |
|
- **Model Type**: Fine-tuned Whisper checkpoint (production-ready, latest release)
|
|
|
## Performance |
|
|
|
DeepAr outperforms many Arabic ASR models built on Whisper and its variants, particularly excelling in:
|
- Pronunciation accuracy detection |
|
- Diacritic prediction |
|
- Handling of Arabic speech variations |
|
- Authentic speech pattern recognition |
|
|
|
## Intended Use |
|
|
|
This model is ideal for: |
|
- Speech therapy and pronunciation correction applications |
|
- Arabic language learning platforms |
|
- Linguistic research and analysis |
|
- Educational tools for speech improvement |
|
- Applications requiring authentic speech transcription |
|
- Quality assessment of spoken Arabic |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install transformers torch torchaudio |
|
``` |
|
|
|
### Quick Start |
|
|
|
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load model and processor
processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr")
model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr")

# Load the audio file
audio_path = "path_to_your_arabic_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Downmix to mono if necessary (the feature extractor expects a 1-D signal)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if necessary (Whisper's expected sampling rate)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Convert the waveform to log-mel input features
input_features = processor(
    waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate the transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="ar")

# Decode the transcription (exactly as pronounced)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Pronounced as: {transcription}")
```
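
For quick experiments, the same checkpoint can also be run through the `transformers` pipeline API, which handles audio loading, resampling, and decoding in one call. A minimal sketch (the `device` argument is optional):

```python
from transformers import pipeline

# The ASR pipeline wraps feature extraction, generation, and decoding
asr = pipeline(
    "automatic-speech-recognition",
    model="CUAIStudents/DeepAr",
    device=0,  # GPU index; omit or use device=-1 to run on CPU
)

result = asr("path_to_your_arabic_audio.wav", generate_kwargs={"language": "ar"})
print(result["text"])
```

Note that passing a file path relies on ffmpeg for decoding; a NumPy array of 16 kHz samples can be passed instead.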
|
|
|
### Speech Analysis Example |
|
|
|
```python
def analyze_pronunciation(audio_path, target_text=None):
    """Transcribe a recording and compare it with a target text if provided."""
    waveform, sample_rate = torchaudio.load(audio_path)

    # Downmix to mono and resample to 16 kHz if necessary
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    input_features = processor(
        waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="ar")

    actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    print(f"Actual pronunciation: {actual_pronunciation}")

    if target_text:
        print(f"Target text: {target_text}")
        print("Analysis: compare the differences for speech improvement")

    return actual_pronunciation

# Example usage (reuses the processor and model loaded in Quick Start)
pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته")
```
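
To make the comparison step concrete, a character-level diff between the target text and the recognized pronunciation highlights exactly where they diverge. This is a minimal sketch using Python's standard `difflib`; the `compare_texts` helper is illustrative, not part of the model:

```python
import difflib

def compare_texts(target_text, actual_pronunciation):
    """Print a character-level diff between target and recognized text."""
    matcher = difflib.SequenceMatcher(None, target_text, actual_pronunciation)
    print(f"Similarity: {matcher.ratio():.2%}")
    # Each opcode marks a span that was replaced, inserted, or deleted
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            print(f"{op}: expected '{target_text[i1:i2]}', heard '{actual_pronunciation[j1:j2]}'")

# Example usage with the result from analyze_pronunciation
compare_texts("النص المطلوب قراءته", pronunciation)
```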
|
|
|
### Batch Processing for Speech Assessment |
|
|
|
```python
def assess_multiple_recordings(audio_files, target_texts=None):
    """Process multiple recordings for speech assessment."""
    results = []

    for i, audio_file in enumerate(audio_files):
        waveform, sample_rate = torchaudio.load(audio_file)

        # Downmix to mono and resample to 16 kHz if necessary
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)

        input_features = processor(
            waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
        ).input_features

        with torch.no_grad():
            predicted_ids = model.generate(input_features, language="ar")

        pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        results.append({
            'file': audio_file,
            'pronunciation': pronunciation,
            'target': target_texts[i] if target_texts else None,
        })
        print(f"File {i+1}: {pronunciation}")

    return results

# Example usage
audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"]
target_texts = ["النص الأول", "النص الثاني", "النص الثالث"]
assessment_results = assess_multiple_recordings(audio_files, target_texts)
```
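
The loop above transcribes one file at a time. Because Whisper's feature extractor pads every clip to a fixed 30-second window, several recordings can also be stacked and decoded in a single `generate` call, which is usually faster on a GPU. A sketch under that assumption (`transcribe_batch` is an illustrative helper):

```python
def transcribe_batch(audio_files):
    """Transcribe several recordings in one forward pass."""
    arrays = []
    for audio_file in audio_files:
        waveform, sample_rate = torchaudio.load(audio_file)
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        if sample_rate != 16000:
            waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
        arrays.append(waveform.squeeze().numpy())

    # The feature extractor pads each clip to 30 s, so the batch is rectangular
    input_features = processor(arrays, sampling_rate=16000, return_tensors="pt").input_features

    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="ar")

    return processor.batch_decode(predicted_ids, skip_special_tokens=True)

# Example usage
print(transcribe_batch(["recording1.wav", "recording2.wav"]))
```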
|
|
|
|
|
## Training Data |
|
|
|
This model was trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset, which pairs Arabic speech with high-quality transcriptions that include diacritics.
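
If you want to inspect the training data yourself, the dataset can be loaded with the `datasets` library (`pip install datasets`). A minimal sketch; check the dataset card for the exact splits and column names:

```python
from datasets import load_dataset

# Download and inspect the corpus; splits and columns are listed on the dataset card
ds = load_dataset("CUAIStudents/Ar-ASR")
print(ds)
```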
|
|
|
## Model Advantages |
|
|
|
- **Authentic transcription**: Captures exactly what is spoken, not what should be spoken |
|
- **High accuracy**: Superior performance compared to similar Whisper-based Arabic models |
|
- **Comprehensive training**: Utilizes the complete dataset for optimal coverage |
|
- **Practical applications**: Specifically designed for speech improvement and assessment |
|
- **Diacritic accuracy**: Excellent performance in Arabic diacritization |
|
|
|
|
|
## Limitations |
|
|
|
- **MSA focus**: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations |
|
|
|
## License |
|
|
|
This model is released under the MIT License. |
|
|
|
``` |
|
MIT License |
|
|
|
Copyright (c) 2024 CUAIStudents |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
``` |
|
|