---
title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: true
---

Benchmark using different techniques

Global Information:

Intended Use

  • Primary intended uses: Baseline comparison for climate disinformation classification models
  • Primary intended users: Researchers and developers participating in the Frugal AI Challenge
  • Out-of-scope use cases: Not intended for production use or real-world classification tasks

Training Data

The model uses the QuotaClimat/frugalaichallenge-text-train dataset:

  • Size: ~6000 examples
  • Split: 80% train, 20% test
  • 8 categories of climate disinformation claims
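
A minimal loading sketch with the Hugging Face datasets library; the split and column names are assumptions, not taken from the dataset card:

```python
from datasets import load_dataset

# Assumption: the dataset exposes a "train" split with "quote" and "label" columns.
ds = load_dataset("QuotaClimat/frugalaichallenge-text-train", split="train")
print(len(ds), ds[0])
```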

Labels

  1. No relevant claim detected
  2. Global warming is not happening
  3. Not caused by humans
  4. Not bad or beneficial
  5. Solutions harmful/unnecessary
  6. Science is unreliable
  7. Proponents are biased
  8. Fossil fuels are needed

Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

  • Carbon emissions during inference
  • Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.
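
A minimal tracking sketch with CodeCarbon; run_inference() is a hypothetical placeholder for the prediction loop being measured:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # writes emissions.csv by default
tracker.start()
run_inference()                # hypothetical placeholder for the model's prediction loop
emissions_kg = tracker.stop()  # returns emissions in kg CO2eq
print(f"{emissions_kg * 1000:.2f} g CO2eq")
```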

Ethical Considerations

  • Dataset contains sensitive topics related to climate disinformation
  • Environmental impact is tracked to promote awareness of AI's carbon footprint

ML model for Climate Disinformation Classification

Model Description

The goal is to find the best classical ML model that processes vectorized (embedded) quotes to detect climate change disinformation.

Performance

Metrics (measured on an NVIDIA T4 small GPU)

  • Accuracy: ~69-72%
  • Environmental Impact:
    • Emissions tracked in gCO2eq (~0.7 g)
    • Energy consumption tracked in Wh (~1.8 Wh)

Model Architecture

ML models prefer numeric inputs, so the quotes first need to be embedded. I used the MTEB Leaderboard on Hugging Face to find the model with the best trade-off between performance and parameter count.

I then chose the "dunzhang/stella_en_400M_v5" model as the embedder. It has the 7th best performance score with only 400M parameters.

Once the quotes are embedded, the data is a matrix of 6091 quotes x 1024 features. It is then split into train and test sets (70% / 30%).
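
A minimal embedding sketch with sentence-transformers, assuming the quotes and labels are already loaded as Python lists (stella_en_400M_v5 requires trust_remote_code=True):

```python
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split

# Assumption: `quotes` is a list of 6091 strings, `labels` the matching category ids.
embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)
X = embedder.encode(quotes)  # shape: (6091, 1024)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
```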

Using TPOT's automated pipeline search, I found that the best model on my data was a logistic regression, as sketched below.
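
A sketch of that search, reusing the split from the embedding step above; the TPOT settings are illustrative, not the exact values used for the submission:

```python
from tpot import TPOTClassifier

# Illustrative search budget; the actual run may have used different settings.
tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print("Test accuracy:", tpot.score(X_test, y_test))
print(tpot.fitted_pipeline_)  # here, a LogisticRegression-based pipeline
```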

The resulting confusion matrix:

(confusion matrix image)
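
For reference, a hedged sketch of how such a confusion matrix can be plotted with scikit-learn from the fitted pipeline and test split above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Assumes `tpot` and the test split from the previous sketches.
ConfusionMatrixDisplay.from_predictions(y_test, tpot.predict(X_test))
plt.show()
```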

Limitations

  • The embedding phase takes ~30 seconds for 1800 quotes. It can be optimised, which would have a real influence on carbon emissions.
  • It is hard to go above 70% accuracy with "simple" ML.
  • Textual data carries nuances of interpretation that small models cannot capture.

BERT model for Climate Disinformation Classification

Model Description

Fine-tuned BERT model for quote classification.

Performance

Metrics (measured on an NVIDIA T4 small GPU)

  • Accuracy: ~90%
  • Environmental Impact:
    • Emissions tracked in gCO2eq (~0.25 g)
    • Energy consumption tracked in Wh (~0.7 Wh)

Model Architecture

Fine-tuning of the "bert-uncased" model with a 70% train / 15% eval / 15% test split.
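
A minimal fine-tuning sketch with the Hugging Face transformers Trainer API; the checkpoint name (bert-base-uncased), the hyperparameters, and the pre-tokenized train_ds / eval_ds datasets are assumptions for illustration:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumption: the underlying checkpoint is bert-base-uncased; 8 claim categories.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=8)

args = TrainingArguments(
    output_dir="bert-climate-disinfo",  # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# train_ds / eval_ds: hypothetical tokenized datasets with input_ids / attention_mask / labels.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
print(trainer.evaluate())
```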

Limitations

  • Not optimized yet; running inference on CPU still needs to be tested.
  • Small models have limitations: accuracy regularly sits between 70-80%, and it is hard to go higher just by changing parameters.

Contact:

LinkedIn: Mattéo GIRARDEAU · Email: [email protected]