Update README.md
README.md
CHANGED
@@ -139,4 +139,36 @@ The following table summarizes the ROUGE scores (Recall, Precision, and F1) for

## **Improvements**
- Focus on enhancing **bigram overlap** (ROUGE-2) and overall **context understanding**.
- Reduce **irrelevant content** for improved **precision**.
- Improve **sequence coherence** for better **ROUGE-L** scores.
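
As a rough illustration of the ROUGE components discussed above (recall, precision, and F1 for ROUGE-1, ROUGE-2, and ROUGE-L), here is a minimal sketch assuming the `rouge_score` package; it is not this project's evaluation script, and the variable names are placeholders.

```python
# Minimal sketch: per-summary ROUGE-1/2/L recall, precision, and F1.
# Assumes the `rouge_score` package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the cat sat on the mat"          # gold summary (placeholder)
prediction = "a cat was sitting on the mat"   # model summary (placeholder)

# score(target, prediction) returns a dict of Score(precision, recall, fmeasure)
scores = scorer.score(reference, prediction)
for name, s in scores.items():
    print(f"{name}: recall={s.recall:.3f} precision={s.precision:.3f} f1={s.fmeasure:.3f}")
```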

## **METEOR Score**

| Metric   | METEOR Score |
|----------|--------------|
| **Mean** | 0.2079       |
| **Min**  | 0.0915       |
| **Max**  | 0.3216       |
| **STD**  | 0.0769       |

### **Interpretation**
- **Mean**: The average METEOR score indicates good word alignment and synonym matching overall, but still leaves room for improvement.
- **Min**: The lowest METEOR score suggests that some summaries do not align well with their references.
- **Max**: The highest METEOR score shows the model's potential for generating very well-aligned summaries.
- **STD**: The standard deviation indicates some variability in the model's performance across different summaries.
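
As an illustrative sketch only (not necessarily how the numbers above were produced), the per-summary METEOR scores and the Mean/Min/Max/STD statistics in the table could be computed with a recent NLTK release as follows; the function and variable names are placeholders.

```python
# Minimal sketch: per-summary METEOR scores plus Mean/Min/Max/STD statistics.
# Assumes a recent NLTK (pre-tokenized inputs) with WordNet and Punkt data.
import statistics

import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # synonym matching
nltk.download("punkt", quiet=True)    # word tokenization


def meteor_stats(references, predictions):
    """Score each (reference, prediction) pair and summarize the distribution."""
    scores = [
        meteor_score([word_tokenize(ref)], word_tokenize(pred))
        for ref, pred in zip(references, predictions)
    ]
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
    }


# Placeholder usage (not the project's data):
# stats = meteor_stats(reference_summaries, generated_summaries)
# -> e.g. {"mean": 0.21, "min": 0.09, "max": 0.32, "std": 0.08}
```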

### **Conclusion**
- The model's **METEOR score** shows generally solid performance, producing summaries that align well with the reference content, though with noticeable variability in some cases.

### **Improvements**
- Focus on improving **alignment** and **synonym usage** to achieve higher and more consistent **METEOR scores** across summaries.

## **TLDR**

### **Comparison & Final Evaluation**
- **BERTScore** suggests the model is good at generating relevant tokens (precision) but struggles to capture all relevant content (recall).
- **ROUGE-1** is decent, but **ROUGE-2** and **ROUGE-L** show weak performance, particularly in bigram relationships and sequence coherence.
- **METEOR** results show solid alignment, but with significant variability, especially toward the lower end of the score range.
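
For reference, the precision/recall distinction mentioned for BERTScore can be inspected per summary with the `bert-score` package; the sketch below is an assumption about tooling (with placeholder data), not this project's pipeline, and requires a PyTorch backend.

```python
# Minimal sketch: separate BERTScore precision, recall, and F1 values.
# Assumes the `bert-score` package (pip install bert-score) and PyTorch.
from bert_score import score

predictions = ["a cat was sitting on the mat"]   # model summaries (placeholder)
references = ["the cat sat on the mat"]          # gold summaries (placeholder)

# Returns one tensor each for precision, recall, and F1 (one value per pair).
P, R, F1 = score(predictions, references, lang="en", verbose=False)
print(
    f"precision={P.mean().item():.3f} "
    f"recall={R.mean().item():.3f} "
    f"f1={F1.mean().item():.3f}"
)
```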

### **Conclusion**
- The model performs decently but lacks consistency, especially in **bigram overlap** (ROUGE-2) and in capturing **longer sequences** (ROUGE-L). There's room for improvement in **recall** and **precision** to make the summaries more relevant and coherent.
- Focus on improving **recall**, **bigram relationships**, and **precision** to achieve more consistent, high-quality summaries.