jennasparks committed
Commit ca7705e · verified · 1 Parent(s): fbc8c92

Updated with model info

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -8,23 +8,23 @@ pinned: false
  ---
 
 
- # Random Baseline Model for Climate Disinformation Classification
 
  ## Model Description
 
- This is a random baseline model for the Frugal AI Challenge 2024, specifically for the text classification task of identifying climate disinformation. The model serves as a performance floor, randomly assigning labels to text inputs without any learning.
 
  ### Intended Use
 
- - **Primary intended uses**: Baseline comparison for climate disinformation classification models
  - **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
  - **Out-of-scope use cases**: Not intended for production use or real-world classification tasks
 
  ## Training Data
 
- The model uses the QuotaClimat/frugalaichallenge-text-train dataset:
  - Size: ~6000 examples
- - Split: 80% train, 20% test
  - 8 categories of climate disinformation claims
 
  ### Labels
@@ -37,16 +37,19 @@ The model uses the QuotaClimat/frugalaichallenge-text-train dataset:
  6. Proponents are biased
  7. Fossil fuels are needed
 
  ## Performance
 
  ### Metrics
- - **Accuracy**: ~12.5% (random chance with 8 classes)
  - **Environmental Impact**:
  - Emissions tracked in gCO2eq
  - Energy consumption tracked in Wh
 
  ### Model Architecture
- The model implements a random choice between the 8 possible labels, serving as the simplest possible baseline.
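The baseline's prediction step amounts to a uniform draw over the eight labels, which can be sketched in a few lines (illustrative only; the challenge's actual baseline script is not shown here):

```python
import random

NUM_LABELS = 8  # the card's 8 categories of climate disinformation claims

def predict(text: str, rng: random.Random) -> int:
    """Random baseline: ignores the input entirely and draws a label
    uniformly, so expected accuracy is 1/8 = 12.5%."""
    return rng.randrange(NUM_LABELS)
```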
 
  ## Environmental Impact
 
@@ -54,18 +57,15 @@ Environmental impact is tracked using CodeCarbon, measuring:
  - Carbon emissions during inference
  - Energy consumption during inference
 
- This tracking helps establish a baseline for the environmental impact of model deployment and inference.
 
  ## Limitations
- - Makes completely random predictions
- - No learning or pattern recognition
- - No consideration of input text
- - Serves only as a baseline reference
- - Not suitable for any real-world applications
 
  ## Ethical Considerations
 
  - Dataset contains sensitive topics related to climate disinformation
- - Model makes random predictions and should not be used for actual classification
  - Environmental impact is tracked to promote awareness of AI's carbon footprint
  ```
 
  ---
 
 
+ # Fine-tuned ELECTRA model for Climate Disinformation Classification
 
  ## Model Description
 
+ This is our best-performing model for the Frugal AI Challenge 2024, specifically for the text classification task of identifying climate disinformation.
 
  ### Intended Use
 
+ - **Primary intended uses**: Comparison to baseline (random selection of labels) for climate disinformation classification models
  - **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
  - **Out-of-scope use cases**: Not intended for production use or real-world classification tasks
 
  ## Training Data
 
+ The model uses a balanced version of the QuotaClimat/frugalaichallenge-text-train training dataset. The dataset originally had the following structure:
  - Size: ~6000 examples
+ - Split: 80% train, 20% test
  - 8 categories of climate disinformation claims
 
  ### Labels

  6. Proponents are biased
  7. Fossil fuels are needed
 
+ The balancing was done to improve accuracy. We used the MarianMT model to augment the dataset by translating sentences from the classes with the lowest sample counts into Spanish and back-translating them into English. The goal of this strategy was to generate sentences with similar meaning but different wording under the same label. However, to avoid the dataset containing more synthetic than original data, the target number of sentences per category was set to 2 times the size of the smallest category. After augmenting the dataset, we removed duplicate sentences generated by back-translation, in order to avoid pseudoreplication. We then split the data into training and test sets, making sure to keep each original sentence and its back-translations within the same split to avoid data leakage.
+
+
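The deduplication and leakage-free split described above can be sketched as follows (a minimal sketch with hypothetical helper names and data structures; the card's actual pipeline code is not shown):

```python
import random

def dedup_and_split(groups, test_frac=0.2, seed=42):
    """Split augmented data without leakage.

    `groups` maps an original sentence to the list of its back-translated
    variants (hypothetical structure). Exact duplicates produced by
    back-translation are dropped to avoid pseudoreplication, and each
    original lands in the same split as its variants to avoid leakage.
    """
    rng = random.Random(seed)
    keys = sorted(groups)
    rng.shuffle(keys)
    n_test = int(len(keys) * test_frac)
    test_keys = set(keys[:n_test])

    train, test = [], []
    for original, variants in groups.items():
        # Drop duplicates and back-translations identical to the original.
        unique_variants = [v for v in dict.fromkeys(variants) if v != original]
        bucket = test if original in test_keys else train
        bucket.append(original)
        bucket.extend(unique_variants)
    return train, test
```

Splitting by group (original plus its variants) rather than by individual sentence is what prevents a back-translation of a training sentence from leaking into the test set.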
  ## Performance
 
  ### Metrics
+ - **Accuracy**: Training - 0.91, Validation - 0.87, Testing -
  - **Environmental Impact**:
  - Emissions tracked in gCO2eq
  - Energy consumption tracked in Wh
 
  ### Model Architecture
+ We fine-tuned a pre-trained ELECTRA model on our balanced dataset for five epochs. The first four layers of the model were frozen, while training was carried out on the last eight layers. The ELECTRA tokenizer was used to tokenize the sentences, with truncation set to True and padding to max length. An Adam optimizer was used, with a learning rate of 5e-5, epsilon = 1e-7, beta_1 = 0.9, and beta_2 = 0.999. The loss was sparse categorical cross-entropy, and the main metric used was accuracy.
 
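As a rough sketch, the stated setup amounts to the following configuration (names are illustrative, not the card's actual code; with tf.keras the optimizer would be `tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-7, beta_1=0.9, beta_2=0.999)`, and the 12-layer total assumes an ELECTRA-base encoder):

```python
# Hyperparameters as stated in the model card; the dict name is illustrative.
FINETUNE_CONFIG = {
    "epochs": 5,
    "frozen_layers": 4,     # first four encoder layers frozen
    "learning_rate": 5e-5,
    "epsilon": 1e-7,
    "beta_1": 0.9,
    "beta_2": 0.999,
    "loss": "sparse_categorical_crossentropy",
    "metric": "accuracy",
}

def trainable_mask(n_layers, n_frozen):
    """Sketch of the freezing step: True means the layer is trained.

    With Keras this corresponds to setting `layer.trainable = False`
    on the first `n_frozen` encoder layers.
    """
    return [i >= n_frozen for i in range(n_layers)]
```

For a 12-layer encoder, `trainable_mask(12, 4)` leaves the last eight layers trainable, matching the card's 4-frozen/8-trained split.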
  ## Environmental Impact
 
  Environmental impact is tracked using CodeCarbon, measuring:
  - Carbon emissions during inference
  - Energy consumption during inference
 
  ## Limitations
+
+ - The dataset was small to begin with, and even after augmenting the data, using a neural network on the dataset may lead to overfitting.
+ - While visual inspection of some sample augmented sentences suggested that the MarianMT model was successful in its back-translation, validation by subject matter experts is needed to guarantee that the augmented sentences maintain the label they were automatically assigned from the original sentence.
+
 
  ## Ethical Considerations
 
  - Dataset contains sensitive topics related to climate disinformation
  - Environmental impact is tracked to promote awareness of AI's carbon footprint
  ```