Spaces: IqraEval/SharedTask_ArabicNLP2025

01Yassine committed on Jun 5

Commit 0469dbf (verified) · 1 Parent(s): 18d0f01

Update index.html

Files changed (1)

index.html +66 -10

@@ -57,7 +57,7 @@
           <h3>1. Read the Verse</h3>
           <p>
-            The user is shown a <strong>Reference Verse</strong> in Arabic script along with its corresponding <strong>Reference Phoneme Sequence</strong>.
           </p>
           <p><strong>Example:</strong></p>
           <ul>
@@ -80,10 +80,10 @@
           </p>
           <p><strong>Example of Mispronunciation:</strong></p>
           <ul>
-            <li><strong>Reference Phoneme Sequence:</strong> <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n $ a E a a &lt; i r i l l a h i</code></li>
-            <li><strong>Model Phoneme Prediction:</strong> <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n s a E a a &lt; i r u l l a h i</code></li>
             <li>
-              <strong>Annotated Phoneme Sequence:</strong>
               <code>&lt; i n n a SS A f aa w a l m a r w a m i n <span class="highlight">s</span> a E a a &lt; i <span class="highlight">r u</span> l l a h i</code>
             </li>
           </ul>
@@ -189,18 +189,74 @@
                 For detailed instructions on data access, phonetizer installation, and baseline usage, please refer to the GitHub README.
             </em>
         </p>
-              <h2>Evaluation Criteria</h2>
         <p>
-            Systems will be scored on their ability to detect and correctly classify phoneme-level errors:
         </p>
         <ul>
-            <li><strong>Detection accuracy:</strong> Did the system spot that a phoneme-level error occurred in the segment?</li>
-            <li><strong>Classification F1-score:</strong> Mispronunciation Detection F1-score</li>
         </ul>
         <p>
-            <em>(Detailed evaluation weights and scripts will be made available on June 5, 2025.)</em>
         </p>
         <!-- Submission Details -->
         <h2>Submission Details (Draft)</h2>

           <h3>1. Read the Verse</h3>
           <p>
+            The user is shown a <strong>Reference Verse</strong> (what should have been said) in Arabic script along with its corresponding <strong>Reference Phoneme Sequence</strong>.
           </p>
           <p><strong>Example:</strong></p>
           <ul>
           </p>
           <p><strong>Example of Mispronunciation:</strong></p>
           <ul>
+            <li><strong>Reference Phoneme Sequence (What should have been said):</strong> <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n $ a E a a &lt; i r i l l a h i</code></li>
+            <li><strong>Model Phoneme Prediction (What is predicted):</strong> <code>&lt; i n n a SS A f aa w a l m a r w a t a m i n s a E a a &lt; i r u l l a h i</code></li>
             <li>
+              <strong>Annotated Phoneme Sequence (What is said):</strong>
               <code>&lt; i n n a SS A f aa w a l m a r w a m i n <span class="highlight">s</span> a E a a &lt; i <span class="highlight">r u</span> l l a h i</code>
             </li>
           </ul>
                 For detailed instructions on data access, phonetizer installation, and baseline usage, please refer to the GitHub README.
             </em>
         </p>
+        <h2>Evaluation Criteria</h2>
+        <p>
+          The primary evaluation metric for the IqraEval system is the <strong>F1-score</strong> at the phoneme level. In addition, we adopt a hierarchical evaluation structure that breaks down performance into detection and diagnostic phases.
+        </p>
         <p>
+          <strong>Hierarchical Evaluation Structure:</strong>
+          The hierarchical mispronunciation detection process relies on three sequences:
+          <ul>
+            <li><em>What is said</em> (the <strong>annotated phoneme sequence</strong> from human annotation),</li>
+            <li><em>What is predicted</em> (the <strong>model’s phoneme output</strong>),</li>
+            <li><em>What should have been said</em> (the <strong>reference phoneme sequence</strong>).</li>
+          </ul>
+          By comparing these three sequences, we compute the following counts:
         </p>
         <ul>
+          <li><strong>True Acceptance (TA):</strong>
+            Number of phonemes that are annotated as correct and also recognized as correct by the model.
+          </li>
+          <li><strong>True Rejection (TR):</strong>
+            Number of phonemes that are annotated as mispronunciations and correctly predicted as mispronunciations.
+            (These labels are further used to measure diagnostic errors by comparing the prediction to the canonical reference.)
+          </li>
+          <li><strong>False Rejection (FR):</strong>
+            Number of phonemes that are annotated as correct but wrongly predicted as mispronunciations.
+          </li>
+          <li><strong>False Acceptance (FA):</strong>
+            Number of phonemes that are annotated as mispronunciations but misclassified as correct pronunciations.
+          </li>
         </ul>
         <p>
+          From these counts, we derive three rates:
+          <ul>
+            <li><strong>False Rejection Rate (FRR):</strong>
+              \( \displaystyle \text{FRR} = \frac{\text{FR}}{\text{TA} + \text{FR}} \)
+              (Proportion of correctly pronounced phonemes that were mistakenly flagged as errors.)
+            </li>
+            <li><strong>False Acceptance Rate (FAR):</strong>
+              \( \displaystyle \text{FAR} = \frac{\text{FA}}{\text{FA} + \text{TR}} \)
+              (Proportion of mispronounced phonemes that were mistakenly classified as correct.)
+            </li>
+            <li><strong>Diagnostic Error Rate (DER):</strong>
+              \( \displaystyle \text{DER} = \frac{\text{DE}}{\text{CD} + \text{DE}} \)
+              where DE is the number of misdiagnosed phonemes and CD is the number of correctly diagnosed ones.
+            </li>
+          </ul>
         </p>
+        <p>
+          In addition to these hierarchical measures, we compute the standard <strong>Precision</strong>, <strong>Recall</strong>, and <strong>F-measure</strong> for mispronunciation detection:
+          <ul>
+            <li><strong>Precision:</strong>
+              \( \displaystyle \text{Precision} = \frac{\text{TR}}{\text{TR} + \text{FR}} \)
+              (Of all phonemes predicted as mispronounced, how many were actually mispronounced?)
+            </li>
+            <li><strong>Recall:</strong>
+              \( \displaystyle \text{Recall} = \frac{\text{TR}}{\text{TR} + \text{FA}} \;=\; 1 - \text{FAR} \)
+              (Of all truly mispronounced phonemes, how many did we correctly detect?)
+            </li>
+            <li><strong>F-measure (F1):</strong>
+              \( \displaystyle \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \)
+              (Harmonic mean of Precision and Recall.)
+            </li>
+          </ul>
+        </p>
+        <p>
+          <em>(Detailed evaluation weights and scripts will be made available on June 5, 2025.)</em>
+        </p>
         <!-- Submission Details -->
         <h2>Submission Details (Draft)</h2>
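
Reviewer note on the Evaluation Criteria added in this commit: the counts and rates it defines can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script. It assumes the three sequences are already aligned position by position (same length, no insertions or deletions), which real scoring would have to handle with an alignment step. The page does not spell out how DE and CD are decided, so the sketch assumes the common convention that a true rejection is a correct diagnosis when the prediction matches the annotated phoneme. The names `MddCounts`, `count_outcomes`, `safe_div`, and `metrics` are hypothetical, and the `predicted` sequence below is invented for illustration (it is not the prediction shown in the page's example).

```python
from dataclasses import dataclass


@dataclass
class MddCounts:
    ta: int = 0  # True Acceptance
    tr: int = 0  # True Rejection
    fr: int = 0  # False Rejection
    fa: int = 0  # False Acceptance
    cd: int = 0  # Correct Diagnosis (subset of TR; assumed definition)
    de: int = 0  # Diagnostic Error (subset of TR; assumed definition)


def count_outcomes(reference, annotated, predicted):
    """Count outcomes over three pre-aligned phoneme sequences (assumption)."""
    if not (len(reference) == len(annotated) == len(predicted)):
        raise ValueError("sequences must be pre-aligned to the same length")
    c = MddCounts()
    for ref, ann, pred in zip(reference, annotated, predicted):
        said_correctly = ann == ref  # annotation agrees with the reference
        accepted = pred == ref       # model output agrees with the reference
        if said_correctly and accepted:
            c.ta += 1
        elif said_correctly:
            c.fr += 1
        elif accepted:
            c.fa += 1
        else:
            c.tr += 1
            # Assumed convention: the diagnosis is correct when the model
            # predicts the phoneme that was actually said.
            if pred == ann:
                c.cd += 1
            else:
                c.de += 1
    return c


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(c):
    """Apply the FRR, FAR, DER, Precision, Recall, and F1 formulas."""
    precision = safe_div(c.tr, c.tr + c.fr)
    recall = safe_div(c.tr, c.tr + c.fa)
    return {
        "FRR": safe_div(c.fr, c.ta + c.fr),
        "FAR": safe_div(c.fa, c.fa + c.tr),
        "DER": safe_div(c.de, c.cd + c.de),
        "Precision": precision,
        "Recall": recall,
        "F1": safe_div(2 * precision * recall, precision + recall),
    }


# Tail of the page's example verse; `predicted` is hypothetical.
reference = "m i n $ a E a a < i r i l l a h i".split()
annotated = "m i n s a E a a < i r u l l a h i".split()
predicted = "m i n s a E a a < i r i l l a h u".split()

counts = count_outcomes(reference, annotated, predicted)
scores = metrics(counts)
print(counts)
print(scores)
```

With this hypothetical prediction, the model catches the `$` to `s` substitution (one TR, correctly diagnosed), misses the `i` to `u` substitution (one FA), and wrongly flags the final `i` (one FR), leaving 14 true acceptances. Precision, Recall, and F1 are therefore all 0.5, and Recall equals 1 - FAR as the formulas state.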