Mohammed Hamdy

mmhamdy

AI & ML interests

TechBio | AI4Sci | NLP | Reinforcement Learning


Organizations

Massive Text Embedding Benchmark · Blog-explorers · Hugging Face for Computer Vision · ASAS AI · ZeroGPU Explorers · Social Post Explorers · C4AI Community · M4-ai · LLMem · Hugging Face Discord Community · open/ acc · Data Is Better Together Contributor

Posts 7

β›“ Evaluating Long Context #2: SCROLLS and ZeroSCROLLS

In this series of posts tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. However, it wasn't introduced to evaluate LLMs, but rather the transformer architecture in general.

πŸ“œ The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (according to 2022 standards). So, what does it offer?

1️⃣ Long Text Focus: Unlike LRA, SCROLLS focuses mainly on text and contains inputs with thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, making it easy to evaluate and compare models (see the loading sketch below).
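
As a quick illustration of that text-to-text format, here's a minimal loading sketch using the 🤗 datasets library. The Hub id tau/scrolls, the gov_report config, and the input/output field names are assumptions on my part, so double-check the dataset card before running.

```python
# Minimal sketch (assumptions: the benchmark is hosted on the Hub as "tau/scrolls"
# and exposes plain "input"/"output" text fields; verify against the dataset card).
from datasets import load_dataset

# GovReport is one of the SCROLLS summarization subsets.
scrolls_gov = load_dataset("tau/scrolls", "gov_report", split="validation")

example = scrolls_gov[0]
print(len(example["input"].split()))  # long source document, thousands of words
print(example["output"][:200])        # target text in the same text-to-text format
```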

Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning (see the prompt sketch after the list below). Other features include:

1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries.
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
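
To make the zero-shot setting concrete, here's a rough sketch of how a prompt for a long-document task might look when no in-context examples are given. The instruction wording and placeholder strings are mine, not taken from the ZeroSCROLLS data.

```python
# Sketch contrasting zero-shot prompting (the ZeroSCROLLS setting) with few-shot prompting.
# The document/question strings below are placeholders, not actual benchmark data.
document = "..."   # a long report, story, or meeting transcript
question = "What are the main findings of the report?"

# Zero-shot: the model sees only an instruction, the document, and the question.
zero_shot_prompt = (
    "You are given a report. Answer the question based only on the report.\n\n"
    f"Report:\n{document}\n\nQuestion: {question}\nAnswer:"
)

# Few-shot would additionally include worked examples in the prompt, which is often
# impractical when each document already fills most of the context window.
```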

πŸ’‘ What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.

- SCROLLS Paper: SCROLLS: Standardized CompaRison Over Long Language Sequences (https://huggingface.co/papers/2201.03533)
- ZeroSCROLLS Paper: ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding (https://huggingface.co/papers/2305.14196)
πŸ”— Evaluating Long Context #1: Long Range Arena (LRA)

Accurately evaluating how well language models handle long contexts is crucial, but it's also quite challenging to do well. In this series of posts, we're going to examine the various benchmarks that were proposed to assess long context understanding, starting with Long Range Arena (LRA).

Introduced in 2020, Long Range Arena (LRA) is one of the earliest benchmarks designed to tackle the challenge of long context evaluation.

πŸ“Œ Key Features of LRA

1️⃣ Diverse Tasks: The LRA benchmark consists of a suite of tasks designed to evaluate model performance on long sequences ranging from 1,000 to 16,000 tokens. These tasks encompass different data types and modalities: Text, Natural and Synthetic Images, and Mathematical Expressions.

2️⃣ Synthetic and Real-world Tasks: LRA comprises both synthetic probing tasks and real-world tasks.

3️⃣ Open-Source and Extensible: Implemented in Python using JAX and Flax, the LRA benchmark code is publicly available, making it easy to extend.

πŸ“Œ Tasks

1️⃣ Long ListOps (hierarchical reasoning over nested mathematical operations; see the sketch after this list)

2️⃣ Byte-level Text Classification and Document Retrieval

3️⃣ Image Classification

4️⃣ Pathfinder and Pathfinder-X (Long-range spatial dependency)
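
To give a feel for the ListOps format, here's a small illustrative evaluator for ListOps-style expressions. The operator set (MAX, MIN, MED, and SM for sum modulo 10) follows the original ListOps formulation; the parser itself is just a sketch, not the benchmark's code.

```python
import statistics

# Illustrative evaluator for ListOps-style expressions, e.g. "[MAX 4 3 [MIN 2 3] 1 0]".
# Operators follow the ListOps convention: MAX, MIN, MED (median), SM (sum mod 10).
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": lambda xs: int(statistics.median(xs)),
    "SM": lambda xs: sum(xs) % 10,
}

def eval_listops(tokens):
    """Recursively evaluate a token list; returns (value, remaining_tokens)."""
    head, rest = tokens[0], tokens[1:]
    if head == "[":
        op, rest = rest[0], rest[1:]
        args = []
        while rest[0] != "]":
            val, rest = eval_listops(rest)
            args.append(val)
        return OPS[op](args), rest[1:]  # skip the closing "]"
    return int(head), rest

expr = "[MAX 4 3 [MIN 2 3] 1 0]"
tokens = expr.replace("[", "[ ").replace("]", " ]").split()
value, _ = eval_listops(tokens)
print(value)  # 4 — the Long ListOps task scales such expressions to sequences of up to ~2K tokens
```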

πŸ‘¨β€πŸ’» Long Range Arena (LRA) Github Repository: https://github.com/google-research/long-range-arena

πŸ“„ Long Range Arena (LRA) paper: Long Range Arena: A Benchmark for Efficient Transformers (2011.04006)