Dan Saattrup Nielsen

saattrupdan

AI & ML interests

NLP for low-resource languages.

Organizations

Flax Community, Dansk Data Science Community, DaNLP, AI Sweden Model Hub, Blackbird AI, north, ScandEval, Alexandra Institute, Job Ad Generator, LumiOpen, Danish Foundation Models, CoRal, Merge Crew, RAG Demo, TrustLLM EU, Alpha BidCo

saattrupdan's activity

New activity in alexandrainst/foqa 1 day ago

update citation

#3 opened 2 days ago by davanstrien
updated a Space 2 days ago
reacted to davanstrien's post with 🔥 about 1 month ago
Introducing scandi-fine-web-cleaner (davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
  • 1 reply