SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published 7 days ago • 154
view article Article Fine-tune ModernBERT for text classification using synthetic data By davidberenstein1957 • Dec 30, 2024 • 32
NeMo Curator - Classifier Models Collection Classifier models that can be used in NeMo Curator for labelling/filtering datasets. • 9 items • Updated 26 days ago • 14
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain Paper • 2412.13018 • Published Dec 17, 2024 • 41
🔱 Sailor2 Language Models Collection Sailing in South-East Asia with Inclusive Multilingual LLMs • 9 items • Updated Dec 3, 2024 • 22
DCLM Pools Collection Raw pools for use in DCLM competition • 5 items • Updated Jul 17, 2024 • 1
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published Jun 12, 2024 • 66
MagpieLM Collection Aligning LMs with Fully Open Recipe + Synthetic Data Generated from Open-Source LMs. • 9 items • Updated 30 days ago • 16
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch Paper • 2410.18693 • Published Oct 24, 2024 • 40
ScaleQuest Collection We introduce ScaleQuest, a scalable and novel data synthesis method. Project Page: https://scalequest.github.io/ • 9 items • Updated Jan 7 • 6
C4AI Aya Expanse Collection Aya Expanse is an open-weight research release of a model with highly advanced multilingual capabilities. • 3 items • Updated Dec 16, 2024 • 33
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling Paper • 2401.16380 • Published Jan 29, 2024 • 49
view article Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 76
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates Paper • 2410.07137 • Published Oct 9, 2024 • 7