app.py (CHANGED)
@@ -84,7 +84,8 @@ with st.expander("ℹ️ - About this app", expanded=False):
 * TMA_prob: % probability that the target classification is True (using logprobs output from GPT-4o)
 * TMA_eval: Boolean based on TMA_prob > 0.5
 * VC_check: used for manually noting corrections
-* TMA_check:
+* TMA_check: used for manually noting corrections
+
 
 Evaluation with GPT4o-as-judge: to clarify, the automated pipeline is not 100% trustworthy, so I was just using the 'FALSE' tags as a starting point
 The complete protocol is as follows:
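A minimal sketch of how the TMA_prob and TMA_eval columns above relate, assuming TMA_prob is obtained by exponentiating the GPT-4o logprob for the "True" token (the function name and example input are illustrative, not taken from app.py):

```python
import math

def tma_from_logprob(logprob: float) -> tuple[float, bool]:
    """Convert a token logprob into a probability and a boolean label.

    Hypothetical helper: logprobs returned by the API are natural-log
    probabilities, so exp() recovers the probability; the 0.5 threshold
    matches the TMA_eval definition above.
    """
    tma_prob = math.exp(logprob)
    tma_eval = tma_prob > 0.5
    return tma_prob, tma_eval

prob, label = tma_from_logprob(-0.05)
# exp(-0.05) ≈ 0.95, so this classification counts as True
```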
@@ -94,6 +95,7 @@ with st.expander("ℹ️ - About this app", expanded=False):
 4. TMA_eval == 'TRUE' AND TMA_prob < 0.9: manually check all remaining target labels where GPT4o was not very certain.
 5. If incorrect classification: enter corrected value in 'VC_check' and 'TMA_check' columns.
 
+
 Takeaways from evaluation:
 * It appears the classifiers experience performance degradation in French-language source documents
 * In particular, the vulnerability classifier had issues
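Protocol step 4 above amounts to a simple filter; a sketch in plain Python with assumed column names and made-up data (the actual app may use a dataframe instead):

```python
# Hypothetical sketch of protocol step 4: keep rows where GPT-4o judged the
# target label True but with probability below 0.9 (the uncertain cases).
# Column names (TMA_eval, TMA_prob) follow the description above.
rows = [
    {"TMA_eval": "TRUE", "TMA_prob": 0.95},   # confident True: skip
    {"TMA_eval": "TRUE", "TMA_prob": 0.72},   # uncertain True: review manually
    {"TMA_eval": "FALSE", "TMA_prob": 0.40},  # False: handled in earlier steps
]
to_review = [r for r in rows if r["TMA_eval"] == "TRUE" and r["TMA_prob"] < 0.9]
# -> one row left for manual checking (the TMA_prob == 0.72 entry)
```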
@@ -101,15 +103,6 @@ with st.expander("ℹ️ - About this app", expanded=False):
 * The GPT4o pipeline is a useful tool for the assessment, but only in terms of increasing accuracy over random sampling. It still takes time to review each document.
 """)
 
-st.write("""
-What Happens in background?
-
-- Step 1: Once the document is provided to app, it undergoes *Pre-processing*.\
-In this step the document is broken into smaller paragraphs \
-(based on word/sentence count).
-- Step 2: The paragraphs are then fed to the **Vulnerability Classifier** which detects if
-the paragraph contains any or multiple references to vulnerable groups.
-""")
 
 st.write("")
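The removed "What Happens in background?" text described a pre-processing step that splits the document into smaller paragraphs by word/sentence count. A rough sketch of a word-count split, assuming a fixed window (the 50-word window and function name are illustrative, not from app.py):

```python
def split_into_paragraphs(text: str, max_words: int = 50) -> list[str]:
    """Break a document into chunks of at most max_words words.

    Hypothetical pre-processing helper: the real app may also split on
    sentence boundaries rather than a plain word window.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = split_into_paragraphs("word " * 120, max_words=50)
# 120 words with a 50-word window -> chunks of 50, 50, and 20 words
```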