title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: true
Benchmark using different techniques
Global Information
Intended Use
- Primary intended uses: Baseline comparison for climate disinformation classification models
- Primary intended users: Researchers and developers participating in the Frugal AI Challenge
- Out-of-scope use cases: Not intended for production use or real-world classification tasks
Training Data
The model uses the QuotaClimat/frugalaichallenge-text-train dataset (a loading sketch follows the list below):
- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims
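As a concrete starting point, here is a minimal sketch of loading the dataset and creating the 80/20 split, assuming the Hugging Face datasets library (split and column names may differ):

```python
# Minimal sketch: load the challenge dataset and carve out a 20% test set.
from datasets import load_dataset

dataset = load_dataset("QuotaClimat/frugalaichallenge-text-train")

# The public data lives in the "train" split; we reserve 20% of it for testing.
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))
```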
Labels
- No relevant claim detected
- Global warming is not happening
- Not caused by humans
- Not bad or beneficial
- Solutions harmful/unnecessary
- Science is unreliable
- Proponents are biased
- Fossil fuels are needed
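For training, these categories are typically mapped to integer class ids. The mapping below is only illustrative; the exact label strings stored in the dataset may differ.

```python
# Illustrative label/id mapping for the 8 categories (exact dataset strings may differ).
LABELS = [
    "No relevant claim detected",
    "Global warming is not happening",
    "Not caused by humans",
    "Not bad or beneficial",
    "Solutions harmful/unnecessary",
    "Science is unreliable",
    "Proponents are biased",
    "Fossil fuels are needed",
]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for i, label in enumerate(LABELS)}
```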
Environmental Impact
Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference
- Energy consumption during inference
This tracking helps establish a baseline for the environmental impact of model deployment and inference.
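As a sketch of how this tracking can be wired in with the codecarbon library (the model and data names below are placeholders, and the challenge template may set this up differently):

```python
# Minimal sketch: measure emissions and energy around the inference step with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # writes an emissions.csv report by default
tracker.start()
try:
    predictions = model.predict(X_test)  # placeholder for the actual inference call
finally:
    emissions_kg = tracker.stop()  # total emissions in kg CO2eq

print(f"Emissions: {emissions_kg * 1000:.2f} g CO2eq")
```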
Ethical Considerations
- Dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint
ML model for Climate Disinformation Classification
Model Description
The goal is to find the best classical ML model for classifying vectorized quotes and detecting climate change disinformation.
Performance
Metrics (measured on an NVIDIA T4 small GPU)
- Accuracy: ~69-72%
- Environmental Impact:
  - Emissions tracked in gCO2eq (~0.7 g)
  - Energy consumption tracked in Wh (~1.8 Wh)
Model Architecture
ML models need numeric inputs, so the quotes must be embedded first. I used the MTEB Leaderboard on Hugging Face to find the embedding model with the best trade-off between performance and number of parameters.
I then chose the "dunzhang/stella_en_400M_v5" model as the embedder: it has the 7th best performance score with only 400M parameters.
Once the quotes are embedded, I have 6091 examples x 1024 features. A train-test split (70% / 30%) is then applied.
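A minimal sketch of this embedding step, assuming the sentence-transformers library and the dataset column names quote/label (the exact loading flags for stella_en_400M_v5 may differ):

```python
# Minimal sketch: embed all quotes with stella_en_400M_v5, then split 70/30.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split

data = load_dataset("QuotaClimat/frugalaichallenge-text-train")["train"]
quotes, labels = data["quote"], data["label"]  # assumed column names

embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)
X = embedder.encode(quotes, batch_size=32, show_progress_bar=True)  # ~6091 x 1024

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)
```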
Using the TPOT classifier, I found that the best model on my data was a logistic regression.
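A sketch of the model search with TPOT and of retraining the logistic regression it converged to (the TPOT settings and logistic regression hyperparameters shown are illustrative, not the exact ones selected):

```python
# Minimal sketch: let TPOT search for a pipeline on the embedded quotes,
# then retrain the winning model type (logistic regression) directly.
from tpot import TPOTClassifier
from sklearn.linear_model import LogisticRegression

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)            # X_train/y_train from the embedding sketch above
print("TPOT accuracy:", tpot.score(X_test, y_test))

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("LogReg accuracy:", clf.score(X_test, y_test))
```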
Here is the resulting confusion matrix:
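It can be reproduced with scikit-learn along these lines (a sketch reusing clf, X_test, and y_test from the previous snippets):

```python
# Minimal sketch: plot the confusion matrix of the fitted classifier on the test set.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, xticks_rotation="vertical")
plt.tight_layout()
plt.show()
```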
Limitations
- The embedding phase takes ~30 seconds for 1800 quotes. It could be optimised, which would have a real influence on carbon emissions.
- It is hard to get above 70% accuracy with "simple" ML models.
- Textual data carries nuances of interpretation that small models cannot capture.
BERT model for Climate Disinformation Classification
Model Description
Fine-tuned BERT model for claim classification.
Performance
Metrics (measured on an NVIDIA T4 small GPU)
- Accuracy: ~90%
- Environmental Impact:
  - Emissions tracked in gCO2eq (~0.25 g)
  - Energy consumption tracked in Wh (~0.7 Wh)
Model Architecture
Fine-tuning of the "bert-uncased" model with 70% train, 15% eval, and 15% test splits.
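A minimal sketch of this fine-tuning, assuming the transformers Trainer API, the bert-base-uncased checkpoint, the quote/label column names, and labels already encoded as integer ids (the exact checkpoint and hyperparameters used may differ):

```python
# Minimal sketch: fine-tune a BERT checkpoint on the 8-class task with the Trainer API.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = load_dataset("QuotaClimat/frugalaichallenge-text-train")["train"]
splits = data.train_test_split(test_size=0.3, seed=42)                # 70% train
eval_test = splits["test"].train_test_split(test_size=0.5, seed=42)   # 15% eval / 15% test
train_ds, eval_ds = splits["train"], eval_test["train"]

checkpoint = "bert-base-uncased"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=8)

def tokenize(batch):
    return tokenizer(batch["quote"], truncation=True, max_length=256)

# Assumes the "label" column already holds integer class ids (see the mapping above).
train_tok = train_ds.map(tokenize, batched=True)
eval_tok = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-climate-disinfo",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_tok, eval_dataset=eval_tok)
trainer.train()
print(trainer.evaluate())
```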
Limitations
- Not optimised yet; I still need to try running it on CPU.
- Small models have limitations: they regularly land between 70-80% accuracy, and it is hard to go beyond that just by changing parameters.
Contact
LinkedIn: Mattéo GIRARDEAU | Email: [email protected]