⛓ Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts tracing the history of long-context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long-context evaluation. However, it was built to evaluate not LLMs but the transformer architecture in general.
📜 The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (by 2022 standards). So, what does it offer?
1️⃣ Long Text Focus: Unlike LRA, SCROLLS focuses mainly on text; its inputs contain thousands of words, testing models' ability to synthesize information across lengthy documents. 2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business. 3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries. 2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
💡 What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.
Time Stream is a groundbreaking AI tool that transforms your text into a mesmerizing video journey from the past to the future. With this innovative technology, your ideas evolve over time, visualized through a dynamic image strip and a fluid video narrative. Imagine typing a simple prompt and watching as your words transform into vivid scenes that capture every moment of change—like a time machine for creativity! 🎥✨
Key Features: • Text-to-Video Transformation: Enter any text, and Time Stream converts it into a compelling video that travels through time, turning your ideas into a visual story. 📽️ • Dynamic Image Strip: Alongside the video, a vibrant image strip is created, showcasing each stage of the transformation so you can see every detail of the evolution. 📸 • Customizable Settings: Adjust parameters such as strength, guidance scale, and more to fine-tune your video’s appearance and ensure it perfectly matches your creative vision. ⚙️ • User-Friendly Interface: With a modern and sleek design, Time Stream is incredibly easy to use. Its intuitive layout lets you focus on your creativity without any technical hurdles. 🖥️🌟
Time Stream is perfect for artists, storytellers, designers, and anyone who loves to see their ideas come to life in new and exciting ways. Whether you’re reflecting on the past, celebrating the present, or dreaming about the future, Time Stream turns your narrative into a vivid, ever-changing masterpiece. Dive in and let your imagination soar as you journey through time, one image at a time! 🚀🔥
Tutorial 💥 Training a non-English reasoning model with GRPO and Unsloth
I wanted to share my experiment with training reasoning models in languages other than English/Chinese.
Using Llama 3.1 8B as the base model, the GRPO trainer from trl, and Unsloth optimizations, I got a working prototype in Bulgarian after ~5 hours on an L40S GPU. The approach should work for any language where the base model has some pre-training coverage.
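The post doesn't include training code, but one piece worth sketching is a reward that keeps GRPO rollouts in the target language. Below is a minimal sketch of my own (the threshold and scoring are illustrative assumptions, not the author's actual reward), written in the completions-in, scores-out shape that trl's GRPOTrainer expects for reward functions, assuming plain-string completions:

```python
def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(0x0400 <= ord(c) <= 0x04FF for c in letters) / len(letters)

def bulgarian_reward(completions, **kwargs):
    # Hypothetical language-consistency reward: score 1.0 when the
    # completion's reasoning stays mostly in Cyrillic script.
    # The 0.7 threshold is an illustrative choice, not a tuned value.
    return [1.0 if cyrillic_ratio(c) > 0.7 else 0.0 for c in completions]
```

A reward like this would be passed via GRPOTrainer's reward_funcs argument alongside a correctness reward, so the model is pushed to reason in Bulgarian rather than falling back to English.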
The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch 💪
What’s new compared to existing reasoning datasets?
♾ Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.
🐳 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.
📀 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.
⏳ Automated filtering: We apply Math Verify to retain only problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to recover more correct examples (e.g., cases with malformed answers that can't be verified with a rule-based parser).
📊 We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.
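The keep-if-any-correct filtering rule described above is easy to sketch. In the toy version below, simple_verify is a stand-in of mine; the actual pipeline uses Math Verify's rule-based parser and verifier, plus the LLM judge for answers the parser can't handle:

```python
def keep_problem(gold, generations, verify):
    # Automated filtering: retain a problem if at least one generated
    # answer verifies against the gold answer.
    return any(verify(gold, g) for g in generations)

# Toy stand-in verifier (numeric equality, falling back to string match).
# Math Verify additionally parses LaTeX, fractions, sets, etc.
def simple_verify(gold, answer):
    try:
        return abs(float(gold) - float(answer)) < 1e-9
    except ValueError:
        return gold.strip() == answer.strip()

problems = [
    {"gold": "0.5", "generations": ["1/2", "0.5"]},  # kept: second answer verifies
    {"gold": "7", "generations": ["6", "8"]},        # dropped: no correct answer
]
kept = [p for p in problems if keep_problem(p["gold"], p["generations"], simple_verify)]
```

With two generations per problem, this is the step that shrinks 400k problems down to the 220k with at least one correct reasoning trace.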
Hugging Face just launched the AI Agents Course – a free journey from beginner to expert in AI agents!
- Learn AI Agent fundamentals, use cases and frameworks - Use top libraries like LangChain & LlamaIndex - Compete in challenges & earn a certificate - Hands-on projects & real-world applications
What do you need to know about spaCy NER models: ☑️ Models are distributed as Python packages; a package can be installed directly into the environment or via the Python CLI. ☑️ The library has a pipeline for optimized batch processing of requests. ☑️ Architecture: DNN embedding-based models (not transformers)
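To make the pipeline point concrete, here is a minimal batched-NER sketch. Since a pretrained model package such as en_core_web_sm may not be installed, this sketch substitutes a blank pipeline with one hand-written EntityRuler pattern; with a downloaded package you would call spacy.load("en_core_web_sm") and get the statistical NER component instead:

```python
import spacy

# Stand-in for a pretrained NER package: a blank English pipeline plus one
# EntityRuler pattern. A real setup would use spacy.load("en_core_web_sm").
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Hugging Face"}])

texts = ["Hugging Face hosts many models.", "Nothing to tag here."]
# nlp.pipe batches documents for throughput, as noted in the post.
docs = list(nlp.pipe(texts, batch_size=32))
entities = [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]
```

The nlp.pipe call is the batching API the post refers to; processing texts one by one with nlp(text) is noticeably slower on large corpora.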
A Brief Survey of Associations Between Meta-Learning and General AI
The paper titled "A Brief Survey of Associations Between Meta-Learning and General AI" explores how meta-learning techniques can contribute to the development of Artificial General Intelligence (AGI). Here are the key points summarized:
1. General AI (AGI) and Meta-Learning: - AGI aims to develop algorithms that can handle a wide variety of tasks, similar to human intelligence. Current AI systems excel at specific tasks but struggle with generalization to unseen tasks. - Meta-learning or "learning to learn" improves model adaptation and generalization, allowing AI systems to tackle new tasks efficiently using prior experiences.
2. Neural Network Design in Meta-Learning: - Techniques like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks enable self-improvement and adaptability for deep models, supporting generalization across tasks. - Highway networks and ResNet-style models use shortcuts for efficient backpropagation, allowing deeper models that can be used in meta-learning frameworks.
3. Coevolution: - Coevolution involves the mutual evolution of multiple components, such as learners or task-solvers, to improve overall performance. - Coevolution between learners enhances collaboration and competition within AI systems, while coevolution between tasks and solvers (e.g., POWERPLAY and AI-GA frameworks) pushes solvers to adapt to increasingly complex tasks.
4. Curiosity in Meta-Learning: - Curiosity-based exploration encourages AI systems to discover new, diverse features of the environment, avoiding local optima. - Curiosity-based objectives can be combined with performance-based objectives to ensure efficient exploration and adaptation in complex tasks.
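The combination of curiosity-based and performance-based objectives in point 4 can be sketched as simple reward shaping. The squared-error bonus and the beta weighting below are my illustrative choices, in the spirit of curiosity-driven exploration (reward states a learned forward model predicts poorly), not a formula from the survey:

```python
def curiosity_bonus(pred_next_state, true_next_state):
    # Intrinsic reward: squared prediction error of a learned forward model.
    # High error = novel, under-explored region of the environment.
    return sum((p - t) ** 2 for p, t in zip(pred_next_state, true_next_state))

def combined_reward(extrinsic, pred_next, true_next, beta=0.1):
    # Performance-based objective plus beta-weighted curiosity bonus.
    return extrinsic + beta * curiosity_bonus(pred_next, true_next)
```

When the forward model predicts a transition perfectly, the bonus vanishes and only the task reward remains; in novel states the bonus pulls the agent away from local optima.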
5. Forgetting Mechanisms: - Forgetting is crucial to avoid memory overload in AI systems.
I've just started porting over the articles that mattered most to me from Civitai. Look, I'm not going to sit here and whine, complain, and moan: they know why I've left, and they're going to thrive without me. I'm a mere speck compared to their future, and that's amazing. But the journey continues. I've posted my Design 101 for AI, and I believe it's the first one in the series, as it delves back to how Arts and Crafts connect to AI. I'm still looking for a future model hub for the insane 800+ models I'd published, considering that's half of what I've got sitting in my repos on HF.
RAG techniques continuously evolve to enhance LLM response accuracy by retrieving relevant external data during generation. To keep up with current AI trends, new RAG types incorporate deep step-by-step reasoning, tree search, citations, multimodality and other effective techniques.
3. Chain-of-Retrieval Augmented Generation (CoRAG) -> Chain-of-Retrieval Augmented Generation (2501.14342) Retrieves information step-by-step and adjusts it, also deciding how much compute to spend at test time. If needed, it reformulates queries.
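The retrieve-reason-reformulate loop can be sketched in a few lines. The retriever, reasoner, and reformulator below are trivial stand-ins of mine for CoRAG's learned components, and max_steps plays the role of the test-time compute budget:

```python
def corag_answer(question, retrieve, reformulate, reason, max_steps=3):
    """Chain-of-retrieval sketch: retrieve, try to answer, reformulate, repeat."""
    query, evidence = question, []
    for _ in range(max_steps):
        evidence.extend(retrieve(query))
        answer = reason(question, evidence)
        if answer is not None:                   # confident enough: stop early
            return answer
        query = reformulate(question, evidence)  # adjust the query, spend more compute
    return "unknown"

# Trivial stand-ins for the learned components:
corpus = {
    "capital of france": "The capital of France is Paris.",
    "paris": "Paris is located on the Seine.",
}
retrieve = lambda q: [v for k, v in corpus.items() if k in q.lower()]
reason = lambda question, evidence: next((d for d in evidence if "Seine" in d), None)
reformulate = lambda question, evidence: (
    "paris" if any("Paris" in d for d in evidence) else question
)

answer = corag_answer("On which river is the capital of France?", retrieve, reformulate, reason)
```

The first retrieval only surfaces the capital fact; the reformulated query then pulls in the river fact, which is the multi-hop behavior CoRAG targets.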
Note: Python 3.9-3.10 expected. Accelerate under Python 3.11 may require further tweaks to launch. I might try wrapping other frameworks later on here↗️: https://github.com/nicolay-r/nlp-thirdgate
New release: bulk-ner 0.25.1, with the following updates: ✅ Removed sentence index from output #21 ✅ API + support function for custom entity construction ✅ Hub for providers
I am presenting Decoder-Only Transformer (DOT) Policy, a simple behavioral control policy that outperforms SOTA models on two simple benchmark tasks:
✅ PushT (pushing an object to a goal) – 84% success on keypoints, 74% on images (previous best: 75% / 69%) ✅ ALOHA Insert (precise bimanual insertion) – 30% success (previous best: ~21%)
The best part? DOT is much smaller (sometimes 100× fewer parameters) than previous SOTA models, trains faster, and avoids complexity: 🚫 No generative models (Diffusion, VAE, GANs) 🚫 No discretization/tokenization of actions 🚫 No reinforcement learning or multi-stage training ✅ Just learns from human demos, plain and simple
This is still early — more complex real-life tasks need testing, and no guarantees it will actually work well there, but I think it's interesting to share. Sometimes, simpler approaches can be just as effective (or even better) than complex ones.
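The core idea is easy to sketch: a causal transformer over past observations with a plain regression head over continuous actions, so no action tokenization and no generative model. This is a minimal sketch of that pattern, not the author's implementation; every dimension and name below is illustrative:

```python
import torch
import torch.nn as nn

class TinyDOTPolicy(nn.Module):
    """Decoder-only policy sketch: causal self-attention over past
    observations, linear regression head for continuous actions."""
    def __init__(self, obs_dim=8, act_dim=2, d_model=64,
                 n_layers=2, n_heads=4, max_len=32):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, act_dim)  # continuous actions, no tokenization

    def forward(self, obs):                       # obs: (batch, time, obs_dim)
        t = obs.size(1)
        x = self.embed(obs) + self.pos[:, :t]
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        x = self.backbone(x, mask=mask, is_causal=True)
        return self.head(x)                       # (batch, time, act_dim)

policy = TinyDOTPolicy()
actions = policy(torch.randn(3, 10, 8))           # one predicted action per step
```

Training is then plain behavior cloning: an MSE loss between predicted and demonstrated actions, with no RL, diffusion, or multi-stage schedule involved.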
📢 SmolLM2 paper released! Learn how the 🤗 team built one of the best small language models: from data choices to training insights. Check out our findings and share your thoughts! 🤏💡