<!doctype html>
<html>
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width" />
    <title>Iqra’Eval Shared Task</title>
    <link rel="stylesheet" href="style.css" />
</head>
<body>
    <div class="card">
        <h1>Iqra’Eval Shared Task</h1>

        <div style="text-align:center; margin: 20px 0;">
          <img src="IqraEval.png" alt="" style="max-width:100%; height:auto;" />
        </div>

        <!-- Overview Section -->
        <h2>Overview</h2>
        <p>
            <strong>Iqra'Eval</strong> is a shared task aimed at advancing <strong>automatic assessment of Qur’anic recitation pronunciation</strong> by leveraging computational methods to detect and diagnose pronunciation errors. The focus on Qur’anic recitation provides a standardized and well-defined context for evaluating Modern Standard Arabic (MSA) pronunciation.
        </p>
        <p>
            Participants will develop systems capable of detecting mispronunciations (e.g., substitution, deletion, or insertion of phonemes).
        </p>

        <!-- Timeline Section -->
        <h2>Timeline</h2>
        <ul>
            <li><strong>June 1, 2025</strong>: Official announcement of the shared task</li>
            <li><strong>June 10, 2025</strong>: Release of training data, development set (QuranMB), phonetizer script, and baseline systems</li>
            <li><strong>July 24, 2025</strong>: Registration deadline and release of test data</li>
            <li><strong>July 27, 2025</strong>: End of evaluation cycle (test set submission closes)</li>
            <li><strong>July 30, 2025</strong>: Final results released</li>
            <li><strong>August 15, 2025</strong>: System description paper submissions due</li>
            <li><strong>August 22, 2025</strong>: Notification of acceptance</li>
            <li><strong>September 5, 2025</strong>: Camera-ready versions due</li>
        </ul>

        <!-- Task Description -->

        <h2>Task Description: Quranic Mispronunciation Detection System</h2>

        <p>
          The aim is to design a model that detects and provides detailed feedback on mispronunciations in Quranic recitations.
          Users read aloud vowelized Quranic verses; the model predicts the phoneme sequence actually uttered by the speaker, which may contain mispronunciations.
          Models are evaluated on the <strong>QuranMB.v2</strong> dataset, which contains human-annotated mispronunciations.
        </p>

        <div class="centered">
          <img src="task.png" alt="System Overview" style="max-width:100%; height:auto;" />
          <p><em>Figure: Overview of the Mispronunciation Detection Workflow</em></p>
        </div>
                    
          <h3>1. Read the Verse</h3>
          <p>
            The user is shown a <strong>Reference Verse</strong> (What should have been said) in Arabic script along with its corresponding <strong>Reference Phoneme Sequence</strong>.
          </p>
          <p><strong>Example:</strong></p>
          <ul>
            <li><strong>Arabic:</strong> إِنَّ الصَّفَا وَالْمَرْوَةَ مِنْ شَعَائِرِ اللَّهِ</li>
            <li>
              <strong>Phoneme:</strong> 
              <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n $ a E a a &lt; i r i l l a h i</code>
            </li>
          </ul>
        
          <h3>2. Save Recording</h3>
          <p>
            The user recites the verse aloud; the system captures and stores the audio waveform for subsequent analysis.
          </p>
        
          <h3>3. Mispronunciation Detection</h3>
          <p>
            The stored audio is fed into a <strong>Mispronunciation Detection Model</strong>. 
            This model predicts the phoneme sequence uttered by the speaker, which may contain mispronunciations.
          </p>
          <p><strong>Example of Mispronunciation:</strong></p>
          <ul>
            <li><strong>Reference Phoneme Sequence (What should have been said):</strong> <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n $ a E a a &lt; i r i l l a h i</code></li>
            <li><strong>Model Phoneme Prediction (What is predicted):</strong> <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n s a E a a &lt; i r u l l a h i</code></li>
            <li>
              <strong>Annotated Phoneme Sequence (What is said):</strong> 
              <code>&lt; i n n a SS A f aa w a l m a r w a m i n <span class="highlight">s</span> a E a a &lt; i <span class="highlight">r u</span> l l a h i</code>
            </li>
          </ul>
          <p>
            In this case, the phoneme <code>$</code> was mispronounced as <code>s</code>, and <code>i</code> was mispronounced as <code>u</code>.
          </p>
          <p>
            The annotated phoneme sequence also shows that the phonemes <code>t a</code> were omitted, but the model failed to detect this deletion.
          </p>
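          <p>
            For illustration, the comparison above can be reproduced with a simple token-level alignment. The following is a minimal sketch using Python's <code>difflib</code>; it is not the official scoring code, and its alignment may differ in edge cases from the evaluation script:
          </p>
          <pre>
from difflib import SequenceMatcher

# Reference phonemes (what should have been said) vs. annotated phonemes (what was said)
reference = "&lt; i n n a SS A f aa w a l m a r w a t a m i n $ a E a a &lt; i r i l l a h i".split()
annotated = "&lt; i n n a SS A f aa w a l m a r w a m i n s a E a a &lt; i r u l l a h i".split()

# Report every non-matching span: substitutions ("replace"), omissions ("delete"), insertions ("insert")
matcher = SequenceMatcher(a=reference, b=annotated, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, "reference:", reference[i1:i2], "said:", annotated[j1:j2])
          </pre>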
                                                             


        <h2>Training Dataset: Description</h2>
        <p>
            All data are hosted on Hugging Face. Two main splits are provided; a minimal loading sketch follows the list:
        </p>
        <ul>
            <li>
                <strong>Training set:</strong> 79 hours of Modern Standard Arabic (MSA) speech, augmented with multiple Qur’anic recitations.  
                <br />
                <code>df = load_dataset("IqraEval/Iqra_train", split="train")</code>
            </li>
            <li>
                <strong>Development set:</strong> 3.4 hours reserved for tuning and validation.  
                <br />
                <code>df = load_dataset("IqraEval/Iqra_train", split="dev")</code>
            </li>
        </ul>
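        <p>
            A minimal loading sketch using the Hugging Face <code>datasets</code> library (assuming it is installed, e.g. via <code>pip install datasets</code>); the field access below follows the column definitions listed next:
        </p>
        <pre>
from datasets import load_dataset

# Training and development splits of the IqraEval corpus
train = load_dataset("IqraEval/Iqra_train", split="train")
dev = load_dataset("IqraEval/Iqra_train", split="dev")

# Inspect one example: raw sentence, diacritized sentence, and phoneme sequence
sample = train[0]
print(sample["sentence"])
print(sample["tashkeel_sentence"])
print(sample["phoneme"])

# The audio column is expected to decode to a waveform array plus sampling rate
print(sample["audio"]["sampling_rate"], len(sample["audio"]["array"]))
        </pre>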
        <p>
            <strong>Column Definitions:</strong>
        </p>
        <ul>
            <li><code>audio</code>: Speech Array.</li>
            <li><code>sentence</code>: Original sentence text (may be partially diacritized or non-diacritized).</li>
            <li><code>index</code>: If from the Quran, the verse index (0–6265, including Basmalah); otherwise <code>-1</code>.</li>
            <li><code>tashkeel_sentence</code>: Fully diacritized sentence (auto-generated via a diacritization tool).</li>
            <li><code>phoneme</code>: Phoneme sequence corresponding to the diacritized sentence (Nawar Halabi phonetizer).</li>
        </ul>
        <p>
            <strong>Data Splits:</strong>  
            <br />
            • Training (train): 79 hours total<br />
            • Development (dev): 3.4 hours total  
        </p>

        <!-- Additional TTS Data -->
        <h2>Training Dataset: TTS Data (Optional Use)</h2>
        <p>
            We also provide a high-quality TTS corpus for auxiliary experiments (e.g., data augmentation, synthetic pronunciation error simulation). This TTS set can be loaded via:
        </p>
        <ul>
            <li><code>df_tts = load_dataset("IqraEval/Iqra_TTS")</code></li>
        </ul>

      <h2>Test Dataset: QuranMB_v2</h2>
        <p>
          To construct a reliable test set, we selected 98 verses from the Qur’an, read aloud by 18 native Arabic speakers (14 female, 4 male), yielding approximately 2 hours of recorded speech. The speakers were instructed to read the text in MSA at their normal tempo, disregarding Qur’anic tajweed rules, while deliberately producing the specified pronunciation errors. To ensure consistency in error production, we developed a custom recording tool that highlighted the modified text and displayed additional instructions specifying the type of error. Before recording, speakers were required to silently read each sentence to familiarize themselves with the intended errors. After recording, three linguistic annotators verified and corrected the phoneme sequence and flagged all pronunciation errors for evaluation.
        </p>
        <ul>
          <li><code>df_test = load_dataset("IqraEval/Iqra_QuranMB_v2")</code></li>
        </ul>
      
        <!-- Resources & Links -->
        <h2>Resources</h2>
        <ul>
            <li>
                <a href="https://huggingface.co/datasets/IqraEval/Iqra_train" target="_blank">
                    Training &amp; Development Data on Hugging Face
                </a>
            </li>
            <li>
                <a href="https://huggingface.co/datasets/IqraEval/Iqra_TTS" target="_blank">
                    IqraEval TTS Data on Hugging Face
                </a>
            </li>
            <li>
                <a href="https://github.com/Iqra-Eval/interspeech_IqraEval" target="_blank">
                    Baseline systems &amp; training scripts (GitHub)
                </a>
            </li>
        </ul>
        <p>
            <em>
                For detailed instructions on data access, phonetizer installation, and baseline usage, please refer to the
                <a href="https://github.com/Iqra-Eval" target="_blank">IqraEval GitHub organization</a>.
            </em>
        </p>

        <!-- Submission Details -->
        <h2>Submission Details (Draft)</h2>
        <p>
            Participants are required to submit a single CSV file containing the predicted phoneme sequences for each audio sample (see the naming requirement below). The file must have exactly two columns:
        </p>
        <ul>
            <li><strong>ID:</strong> Unique identifier of the audio sample.</li>
            <li><strong>Labels:</strong> The predicted phoneme sequence, with each phoneme separated by a single space.</li>
        </ul>
        <p>
            Below is a minimal example illustrating the required format:
        </p>
        <pre>
ID,Labels
0000_0001, i n n a m a a y a k h a l l a h a m i n ʕ i b a a d i h u l ʕ u l a m
0000_0002, m a a n a n s a k h u m i n i ʕ a a y a t i n
0000_0003, y u k h i k u m u n n u ʔ a u ʔ a m a n a t a n m m i n h u
…  
        </pre>
        <p>
            The first column (ID) must exactly match the audio filename (without extension); the second column (Labels) is the predicted phoneme string.
        </p>
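        <p>
            As an illustration only (not an official submission tool), a file in this format can be written with Python's <code>csv</code> module; the IDs and phoneme strings below are placeholders:
        </p>
        <pre>
import csv

# Placeholder predictions: audio ID (filename without extension) -> predicted phoneme string
predictions = {
    "0000_0001": "i n n a m a a",   # placeholder phonemes, not real model output
    "0000_0002": "m a a n a n s a",
}

with open("teamID_submission.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Labels"])
    for audio_id, phonemes in predictions.items():
        writer.writerow([audio_id, phonemes])
        </pre>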
        <p>
            <strong>Important:</strong>
        </p>
        <ul>
            <li>Use UTF-8 encoding.</li>
            <li>Do not include extra spaces at the start or end of each line.</li>
            <li>Submit a single CSV file (no archives). The filename must be <code>teamID_submission.csv</code>.</li>
        </ul>
      
        <h2>Evaluation Criteria</h2>
        <p>
          IqraEval Leaderboard rankings will primarily be based on the <strong>phoneme-level F1-score</strong>.        
        </p>
        <p>
          In addition, we adopt a hierarchical evaluation structure (see this <a href="https://arxiv.org/pdf/2310.13974" target="_blank">MDD overview</a>) that breaks performance down into detection and diagnosis stages.
        </p>

        <p>
          <strong>Hierarchical Evaluation Structure:</strong>
          The hierarchical mispronunciation detection process relies on three sequences:
        </p>
        <ul>
          <li><em>What is said</em> (the <strong>annotated phoneme sequence</strong> from human annotation),</li>
          <li><em>What is predicted</em> (the <strong>model’s phoneme output</strong>),</li>
          <li><em>What should have been said</em> (the <strong>reference phoneme sequence</strong>).</li>
        </ul>
        <p>
          By comparing these three sequences, we compute the following counts:
        </p>
        <ul>
          <li><strong>True Acceptance (TA):</strong>  
            Number of phonemes that are annotated as correct and also recognized as correct by the model.
          </li>
          <li><strong>True Rejection (TR):</strong>  
            Number of phonemes that are annotated as mispronunciations and correctly predicted as mispronunciations.  
            (These labels are further used to measure diagnostic errors by comparing the prediction to the canonical reference.)
          </li>
          <li><strong>False Rejection (FR):</strong>  
            Number of phonemes that are annotated as correct but wrongly predicted as mispronunciations.
          </li>
          <li><strong>False Acceptance (FA):</strong>  
            Number of phonemes that are annotated as mispronunciations but misclassified as correct pronunciations.
          </li>
        </ul>
        <p>
          From these counts, we derive three rates:
        </p>
        <ul>
          <li><strong>False Rejection Rate (FRR):</strong>
            FRR = FR / (TA + FR)
            (Proportion of correctly pronounced phonemes that were mistakenly flagged as errors.)
          </li>
          <li><strong>False Acceptance Rate (FAR):</strong>
            FAR = FA / (FA + TR)
            (Proportion of mispronounced phonemes that were mistakenly classified as correct.)
          </li>
          <li><strong>Diagnostic Error Rate (DER):</strong>
            DER = DE / (CD + DE),
            where DE is the number of misdiagnosed phonemes and CD is the number of correctly diagnosed ones.
          </li>
        </ul>
        <p>
          In addition to these hierarchical measures, we compute the standard <strong>Precision</strong>, <strong>Recall</strong>, and <strong>F-measure</strong> for mispronunciation detection:
        </p>
        <ul>
          <li><strong>Precision:</strong>
            Precision = TR / (TR + FR)
            (Of all phonemes predicted as mispronounced, how many were actually mispronounced?)
          </li>
          <li><strong>Recall:</strong>
            Recall = TR / (TR + FA)
            (Of all truly mispronounced phonemes, how many did we correctly detect?)
          </li>
          <li><strong>F1-score:</strong>
            F1 = 2 * Precision * Recall / (Precision + Recall)
          </li>
        </ul>
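        <p>
          To make the definitions above concrete, here is a minimal sketch (not the official scoring script) that computes these rates from already-aligned counts; obtaining TA, TR, FR, FA, CD, and DE requires aligning the three phoneme sequences first:
        </p>
        <pre>
def mdd_metrics(TA, TR, FR, FA, CD=None, DE=None):
    """Hierarchical MDD metrics from pre-computed counts (illustrative sketch)."""
    frr = FR / (TA + FR)                     # correct phonemes wrongly flagged as errors
    far = FA / (FA + TR)                     # mispronunciations missed by the model
    precision = TR / (TR + FR)
    recall = TR / (TR + FA)
    f1 = 2 * precision * recall / (precision + recall)
    metrics = {"FRR": frr, "FAR": far, "Precision": precision, "Recall": recall, "F1": f1}
    if CD is not None and DE is not None:
        metrics["DER"] = DE / (CD + DE)      # misdiagnosed among detected mispronunciations
    return metrics

# Example with made-up counts
print(mdd_metrics(TA=900, TR=80, FR=30, FA=20, CD=60, DE=20))
        </pre>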

      
                <h2>Potential Research Directions</h2>
          <ol>
            <li>
              <strong>Advanced Mispronunciation Detection Models</strong><br>
              Apply state-of-the-art self-supervised models (e.g., Wav2Vec2.0, HuBERT), using variants that are pre-trained or fine-tuned on Arabic speech. These models can then be fine-tuned on Quranic recitations to improve phoneme-level accuracy; a minimal model-construction sketch is shown after this list.
            </li>
            <li>
              <strong>Data Augmentation Strategies</strong><br>
              Create synthetic mispronunciation examples using pipelines like 
              <a href="https://arxiv.org/abs/2211.00923" target="_blank">SpeechBlender</a>. 
              Augmenting limited Arabic/Quranic speech data helps mitigate data scarcity and improves model robustness.
            </li>
            <li>
              <strong>Analysis of Common Mispronunciation Patterns</strong><br>
              Perform statistical analysis on the QuranMB dataset to identify prevalent errors (e.g., substituting similar phonemes, swapping vowels). 
              These insights can drive targeted training and tailored feedback rules.
            </li>
          </ol> 
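          <p>
            As a starting point for the first direction above, a pre-trained multilingual wav2vec 2.0 checkpoint can be equipped with a CTC head over the phoneme inventory and fine-tuned on the training split. The sketch below only shows model construction with the Hugging Face <code>transformers</code> library; the checkpoint name and vocabulary size are illustrative assumptions, and the tokenizer, data collator, and training loop are omitted:
          </p>
          <pre>
from transformers import Wav2Vec2ForCTC

# Illustrative checkpoint; any Arabic or multilingual wav2vec 2.0 model could be substituted
checkpoint = "facebook/wav2vec2-large-xlsr-53"

model = Wav2Vec2ForCTC.from_pretrained(
    checkpoint,
    vocab_size=64,                 # assumed: phoneme inventory size plus CTC blank/special tokens
    ctc_loss_reduction="mean",
)

# Keep the convolutional feature encoder frozen; fine-tune the transformer layers and the new CTC head
model.freeze_feature_encoder()
          </pre>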



        <!-- Placeholder for Future Details -->
        <h2>Future Updates</h2>
        <p>
            Further details on <strong>evaluation criteria</strong> (exact scoring weights), <strong>submission templates</strong>, and any clarifications will be posted on the shared task website when the test data are released (see the timeline above). Stay tuned!
        </p>

                <h2>References</h2>
          <ul>
            <li>
              El Kheir, Y., et al. 
              "<a href="https://arxiv.org/abs/2211.00923" target="_blank">SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation</a>," 
              <em>arXiv preprint arXiv:2211.00923</em>, 2022.
            </li>
            <li>
              Al Harere, A., & Al Jallad, K. 
              "<a href="https://arxiv.org/abs/2305.06429" target="_blank">Mispronunciation Detection of Basic Quranic Recitation Rules using Deep Learning</a>," 
              <em>arXiv preprint arXiv:2305.06429</em>, 2023.
            </li>
            <li>
              Aly, S. A., et al. 
              "<a href="https://arxiv.org/abs/2111.01136" target="_blank">ASMDD: Arabic Speech Mispronunciation Detection Dataset</a>," 
              <em>arXiv preprint arXiv:2111.01136</em>, 2021.
            </li>
            <li>
              Moustafa, A., & Aly, S. A. 
              "<a href="https://arxiv.org/abs/2111.06331" target="_blank">Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset</a>," 
              <em>arXiv preprint arXiv:2111.06331</em>, 2021.
            </li>
            <li>
              El Kheir, Y., et al.
              "<a href="https://arxiv.org/pdf/2310.13974" target="_blank">Automatic Pronunciation Assessment - A Review</a>," 
              <em>arXiv preprint arXiv:2310.13974</em>, 2023.
            </li>
            
          </ul>

    </div>
</body>
</html>