โ Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts about tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, Long Range Arens (LRA) is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. But it wasn't introduced to evaluate LLMs, but rather the transformer architecture in general.
๐ The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (according to 2022 standards). So, what does it offer?
1๏ธโฃ Long Text Focus: SCROLLS (unlike LRA) focus mainly on text and contain inputs with thousands of words, testing models' ability to synthesize information across lengthy documents. 2๏ธโฃ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business. 3๏ธโฃ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1๏ธโฃ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries. 2๏ธโฃ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
๐ก What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.
The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch ๐ช
Whatโs new compared to existing reasoning datasets?
โพ Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.
๐ณ 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.
๐ 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.
โณ Automated filtering: We apply Math Verify to only retain problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to retrieve more correct examples (e.g for cases with malformed answers that canโt be verified with a rules-based parser)
๐ We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.
โจ MIT License : enabling distillation for custom models โจ 32B & 70B models match OpenAI o1-mini in multiple capabilities โจ API live now! Access Chain of Thought reasoning with model='deepseek-reasoner'
reacted to hba123's
post with ๐about 2 months ago
Blindly applying algorithms without understanding the math behind them is not a good idea frmpv. So, I am on a quest to fix this!
I wrote my first hugging face article on how you would derive closed-form solutions for KL-regularised reinforcement learning problems - what is used for DPO.
๐ From instruction-following to creative storytelling, dive into 2024's most impactful AI datasets! These gems are shaping everything from scientific research to video understanding.
๐ Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Tรฉcnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
๐ท๏ธ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. ๐ฝ Culturally Agnostic: no specific regional, cultural knowledge is required. 2. โ๏ธ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.
Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.
I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.
Hugging face presents FineVideo ๐ฅ! Unlocking the next generation of Video understanding ๐
๐คฏ3400 hours of annotated Creative Common videos with rich character descriptions, scene splits, mood, and content descriptions per scene as well as QA pairs. ๐ฅ @mfarre processed over 2M videos of Youtube-CC to make this incredibly powerful selection.
The cleaning process consists of: - Joining the separate splits together / add split column - Converting string messages into list of structs - Removing empty system prompts
I wanted to introduce myself and my company @Overlaiapp. We are a collective of filmmakers, photographers, and AI engineers working on high resolution (8K+) training data.
We plan to share a lot of our datasets with the community and are kicking things off with two curated datasets:
๐ฅ Oversampled: Every clip is captured in stunning 8K resolution, delivering rich detail ideal for fine tuning scenic landscapes and ocean dynamics.
๐ธ Variance: Includes close-up details, slow-motion footage of crashing waves, sweeping landscapes, and wildlife shots.
๐ Detailed Metadata: Every clip is paired with structured metadata, including creative descriptions, precise camera movements, lens information, field of view calculations, and shot settings, ensuring AI models can fully understand and replicate real-world cinematography with accuracy.
โ๏ธ Consistency: Re-thinking training data at the point of capture by "overshooting" a subject, enabling models to learn more nuanced relationships and views across scenes.
๐ Light: Shot during early morning and sunset light for optimal color contrast and dynamic range, maximizing visual quality for color and lighting-sensitive tasks.
๐ Curation: Curated specifically for machine learning, providing clean, high-quality data for next generation model training.
Microsoft researchers dropped a groundbreaking technique that could slash the energy use in transformer computations : their novel "linear-complexity multiplication" (L-Mul) algorithm approximates floating-point multiplication using energy-efficient integer addition instead of costly multiplications.
๐ก Quick reminder on how floats are coded on 8 bits (FP8): In the e4m3 FP8 standard, you encode a number as: Sign (1 bit) | Exponent (4 bits) | Mantissa (3 bits) Example: 0 (positive) | 1000 (8) | 101 (1/2 + 1/8 = 0.625) Calculation: you add one to the mantissa, and multiply it by 2 power (the exponent - a bias term which is 7 for e4m3):
โก๏ธย You get (1 + 0.625) ร 2^(8-7) = 3.25
Now back to the paper. ๐๐ฒ๐ ๐ถ๐ป๐๐ถ๐ด๐ต๐๐:
โก๏ธ Multiplication is extremely energy-intensive compared to addition. For 32-bit operations, multiplication (3.7 pJ) uses 37x more energy than addition (0.1 pJ)!
๐งฎ Traditional floating-point multiplication go like (noting xm the mantissa and xe the exponent): Mul(x,y) = (1 + xm) ยท 2^xe ยท (1 + ym) ยท 2^ye = (1 + xm + ym + xm ยท ym) ยท 2^(xe+ye)
๐ก L-Mul cleverly approximates this as: L-Mul(x,y) = (1 + xm + ym + 2^-l(m)) ยท 2^(xe+ye), eliminating the costly xm ยท ym term
๐ง l(m) term is adaptively set based on mantissa size for optimal accuracy
๐ Benchmarks on the Llama-3.1-8B-Instruct model show L-Mul preserves precision across various NLP tasks, with performance nearly identical to full BFloat16 precision
๐ฌ Authors claim: "We can achieve the same model inference performance while reducing the energy cost of attention computations by 80%."
This breakthrough is still theoretical and would need implementation on dedicated hardware to confirm real-world gains, but itโs a really exciting path for more sustainable AI! ๐ฑ
๐ Evaluating Long Context #1: Long Range Arena (LRA)
Accurately evaluating how well language models handle long contexts is crucial, but it's also quite challenging to do well. In this series of posts, we're going to examine the various benchmarks that were proposed to assess long context understanding, starting with Long Range Arens (LRA)
Introduced in 2020, Long Range Arens (LRA) is one of the earliest benchmarks designed to tackle the challenge of long context evaluation.
๐ Key Features of LRA
1๏ธโฃ Diverse Tasks: The LRA benchmark consists of a suite of tasks designed to evaluate model performance on long sequences ranging from 1,000 to 16,000 tokens. These tasks encompass different data types and modalities: Text, Natural and Synthetic Images, and Mathematical Expressions.
2๏ธโฃ Synthetic and Real-world Tasks: LRA is comprised of both synthetic probing tasks and real-world tasks.
3๏ธโฃ Open-Source and Extensible: Implemented in Python using Jax and Flax, the LRA benchmark code is publicly available, making it easy to extend.
๐ Tasks
1๏ธโฃ Long ListOps
2๏ธโฃ Byte-level Text Classification and Document Retrieval
3๏ธโฃ Image Classification
4๏ธโฃ Pathfinder and Pathfinder-X (Long-range spatial dependency)
We're thrilled to announce the release of Argilla 2.2.0, packed with powerful new features to enhance your data annotation and LLM workflow:
๐จ๏ธ ChatField: Work with text conversations natively in Argilla. Perfect for building datasets for conversational LLMs! โ๏ธ Adjustable Task Distribution: Modify settings on the fly and automatically recalculate completed and pending records. ๐ Progress Tracking: Monitor annotation progress directly from the SDK, including user-specific metrics. ๐ง Automatic Settings Inference: Importing datasets from Hugging Face Hub just got easier with automatic settings detection. ๐ Task Templates: Jump-start your projects with pre-built templates for common dataset types. ๐ง Background Jobs Support: Improved performance for long-running tasks (requires Redis).