Xiaosen Zheng

xszheng2020

AI & ML interests

Data-Centric AI and AI Safety.

Recent Activity

upvoted a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

liked a model 5 days ago

allenai/OLMo-1B-0724-hf

liked a dataset 5 days ago

lkevinzc/CountDownZero

xszheng2020's activity

upvoted a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 154

upvoted an article 10 days ago

Open-R1: Update #1

Article • 10 days ago • 270

upvoted a collection 17 days ago

DeepSeek-R1

8 items • Updated 22 days ago • 475

upvoted an article about 1 month ago

Fine-tune ModernBERT for text classification using synthetic data

Article • Dec 30, 2024 • 32

upvoted 2 collections about 2 months ago

NeMo Curator - Classifier Models

Classifier models that can be used in NeMo Curator for labelling/filtering datasets. • 9 items • Updated 26 days ago • 14

FastText Model for Pretraining Data Curation

4 items • Updated 14 days ago • 1

upvoted a paper about 2 months ago

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

Paper • 2412.13018 • Published Dec 17, 2024 • 41

upvoted 2 collections 2 months ago

🔱 Sailor2 Language Models

Sailing in South-East Asia with Inclusive Multilingual LLMs • 9 items • Updated Dec 3, 2024 • 22

DCLM Pools

Raw pools for use in DCLM competition • 5 items • Updated Jul 17, 2024 • 1

upvoted a paper 3 months ago

Sample-Efficient Alignment for LLMs

Paper • 2411.01493 • Published Nov 3, 2024 • 11

upvoted a paper 4 months ago

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper • 2406.08464 • Published Jun 12, 2024 • 66

upvoted a collection 4 months ago

MagpieLM

Aligning LMs with Fully Open Recipe + Synthetic Data Generated from Open-Source LMs. • 9 items • Updated 30 days ago • 16

upvoted a paper 4 months ago

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Paper • 2410.18693 • Published Oct 24, 2024 • 40

upvoted 3 collections 4 months ago

ScaleQuest

We introduce ScaleQuest, a scalable and novel data synthesis method. Project Page: https://scalequest.github.io/ • 9 items • Updated Jan 7 • 6

C4AI Aya Expanse

Aya Expanse is an open-weight research release of a model with highly advanced multilingual capabilities. • 3 items • Updated Dec 16, 2024 • 33

BGE

23 items • Updated 4 days ago • 83

upvoted an article 4 months ago

SmolLM - blazingly fast and remarkably powerful

Article • Jul 16, 2024 • 312

upvoted a paper 4 months ago

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Paper • 2401.16380 • Published Jan 29, 2024 • 49

upvoted an article 4 months ago

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Article • Mar 20, 2024 • 76

upvoted a paper 4 months ago

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Paper • 2410.07137 • Published Oct 9, 2024 • 7