Trending Papers

Submitted by

jt-zhang

TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

TurboDiffusion accelerates video generation by 100-200x using attention acceleration, step distillation, and quantization, while maintaining video quality.

University of California, Berkeley · Published on Dec 18, 2025

Upvote

42

GitHub 1.91k arXiv Page

Submitted by

amael-apple

Sharp Monocular View Synthesis in Less Than a Second

SHARP synthesizes photorealistic views from a single image using a 3D Gaussian representation, achieving state-of-the-art results with rapid processing.

Apple · Published on Dec 11, 2025

Upvote

13

GitHub 5.08k arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors

· Published on Feb 7, 2025

Upvote

8

GitHub 61.6k arXiv Page

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors

· Published on Nov 17, 2025

Upvote

10

GitHub 13.2k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors

· Published on Oct 8, 2024

Upvote

26

GitHub 26.6k arXiv Page

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

13 authors

· Published on Mar 14, 2025

Upvote

121

GitHub 47.7k arXiv Page

Submitted by

Jiaqi-hkust

Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

A novel framework, Robust-R1, enhances multimodal large language models' robustness to visual degradations through explicit modeling, supervised fine-tuning, reward-driven alignment, and dynamic reasoning depth scaling, achieving state-of-the-art performance on real-world degradation benchmarks.

10 authors

· Published on Dec 19, 2025

Upvote

61

GitHub 169 arXiv Page

Submitted by

Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI · Published on Nov 27, 2025

Upvote

28

GitHub 7.85k arXiv Page

Submitted by

Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI · Published on Nov 27, 2025

Upvote

212

GitHub 7.88k arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

PagedAttention algorithm and vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

9 authors

· Published on Sep 12, 2023

Upvote

26

GitHub 66.2k arXiv Page

Submitted by

taesiri

SAM Audio: Segment Anything in Audio

SAM Audio, a diffusion transformer-based foundation model, achieves superior performance in general audio separation using unified text, visual, and temporal span prompts across various audio types.

AI at Meta · Published on Dec 19, 2025

Upvote

12

GitHub 2.51k arXiv Page

Submitted by

taesiri

StoryMem: Multi-shot Long Video Storytelling with Memory

StoryMem enhances multi-shot video generation with cinematic quality and long-range consistency using a memory bank and pre-trained single-shot video diffusion models.

ByteDance · Published on Dec 22, 2025

Upvote

15

GitHub 144 arXiv Page

Submitted by

unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

Microsoft Research · Published on Aug 26, 2025

Upvote

138

GitHub 19k arXiv Page

Submitted by

taesiri

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Unified Autoencoding combines semantic and pixel-level information through a frequency-band modulator, resulting in a latent space with state-of-the-art performance on image benchmarks.

5 authors

· Published on Dec 22, 2025

Upvote

60

GitHub 91 arXiv Page

Submitted by

wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors

· Published on Sep 27, 2024

Upvote

36

GitHub 51k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors

· Published on Sep 26, 2025

Upvote

139

GitHub 51k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Published on Oct 16, 2025

Upvote

108

GitHub 66.8k arXiv Page

Submitted by

nielsr

DINOv3

DINOv3, a self-supervised learning model, achieves superior performance across various vision tasks by scaling datasets and models, addressing dense feature degradation, and enhancing flexibility with post-hoc strategies.

AI at Meta · Published on Aug 13, 2025

Upvote

291

GitHub 9.06k arXiv Page

Submitted by

taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

5 authors

· Published on Dec 8, 2025

Upvote

31

GitHub 13.1k arXiv Page

Submitted by

taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Published on Nov 20, 2025

Upvote

121

GitHub 6.46k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors

· Published on Mar 20, 2024

Upvote

175

GitHub 64.5k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors

· Published on Dec 28, 2024

Upvote

14

GitHub 27k arXiv Page

Submitted by

AaronHuangWei

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

DSR Suite enhances vision-language models with dynamic spatial reasoning through automated data generation and a geometry selection module that integrates geometric priors.

The University of Hong Kong · Published on Dec 23, 2025

Upvote

40

GitHub 24 arXiv Page

Submitted by

MoonQiu

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.

AI at Meta · Published on Dec 24, 2025

Upvote

12

GitHub 17 arXiv Page

Submitted by

FrancisRing

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

FlashPortrait is a diffusion-based video transformer for long-portrait animation that ensures ID consistency and achieves 6x acceleration through a dynamic sliding-window scheme and higher-order latent derivatives.

Fudan University · Published on Dec 18, 2025

Upvote

10

GitHub 252 arXiv Page

Submitted by

taesiri

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

WorldPlay is a streaming video diffusion model that achieves real-time, interactive world modeling with long-term geometric consistency by using a Dual Action Representation, Reconstituted Context Memory, and Context Forcing.

10 authors

· Published on Dec 16, 2025

Upvote

63

GitHub 739 arXiv Page

Submitted by

JDihlmann

3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework

3D-RE-GEN reconstructs single images into modifiable 3D textured mesh scenes with comprehensive backgrounds, achieving top performance through compositional generation and scene optimization.

3 authors

· Published on Dec 19, 2025

Upvote

10

GitHub 112 arXiv Page

Submitted by

xw-eric

The Unreasonable Effectiveness of Scaling Agents for Computer Use

Behavior Best-of-N (bBoN) improves the reliability and success rates of computer-use agents by generating and selecting among multiple rollouts using behavior narratives, achieving state-of-the-art performance on OSWorld and strong generalization to different operating systems.

Simular · Published on Oct 2, 2025

Upvote

24

GitHub 9.09k arXiv Page

Submitted by

akhaliq

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

FunAudioLLM enhances voice interactions by integrating SenseVoice for multilingual speech recognition, emotion detection, and audio event detection with CosyVoice for natural speech generation across languages, timbres, and styles.

1 author

· Published on Jul 4, 2024

Upvote

40

GitHub 18.3k arXiv Page

Submitted by

xw-eric

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agent S2, a compositional framework using Mixture-of-Grounding and Proactive Hierarchical Planning, achieves state-of-the-art performance in computer use automation across various benchmarks and operating systems.

Simular · Published on Apr 1, 2025

Upvote

27

GitHub 9.08k arXiv Page

Submitted by

xianbao

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5, a Mixture-of-Experts large language model with 355B parameters, achieves strong performance across agentic, reasoning, and coding tasks using multi-stage training and reinforcement learning.

171 authors

· Published on Aug 8, 2025

Upvote

195

GitHub 3.48k arXiv Page

Submitted by

Keh0t0

EgoX: Egocentric Video Generation from a Single Exocentric Video

EgoX framework generates egocentric videos from exocentric inputs using video diffusion models with LoRA adaptation, unified conditioning, and geometry-guided self-attention for coherence and visual fidelity.

KAIST AI · Published on Dec 9, 2025

Upvote

109

GitHub 336 arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors

· Published on Apr 28, 2025

Upvote

35

GitHub 44.7k arXiv Page

Submitted by

taesiri

Memory in the Age of AI Agents

This survey provides an updated overview of agent memory research, distinguishing its forms, functions, and dynamics, and highlights emerging research directions.

47 authors

· Published on Dec 15, 2025

Upvote

111

GitHub 515 arXiv Page

Submitted by

taesiri

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

A framework for Scientific General Intelligence (SGI) is presented, evaluated using SGI-Bench, and improved with Test-Time Reinforcement Learning, highlighting gaps in existing models' scientific capabilities.

107 authors

· Published on Dec 18, 2025

Upvote

102

GitHub 96 arXiv Page

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors

· Published on Oct 23, 2024

Upvote

7

GitHub 51.4k arXiv Page

Submitted by

xw-eric

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Agent S, a framework for autonomous GUI interactions, enhances task automation with experience-augmented hierarchical planning and Multimodal Large Language Models.

6 authors

· Published on Oct 10, 2024

Upvote

26

GitHub 9.09k arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors

· Published on Feb 8, 2025

Upvote

6

GitHub 17k arXiv Page

FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks

An automated pipeline using layout-aware OCR and LLM-based semantic parsing extracts high-quality QA and VQA pairs from educational documents, enhancing LLM training without synthetic data.

6 authors

· Published on Nov 20, 2025

Upvote

1

GitHub 1.77k arXiv Page

Submitted by

taesiri

PersonaLive! Expressive Portrait Image Animation for Live Streaming

PersonaLive is a diffusion-based framework for real-time portrait animation that enhances speed and efficiency through multi-stage training, hybrid implicit signals, appearance distillation, and autoregressive micro-chunk streaming.

GVC Lab at Great Bay University · Published on Dec 12, 2025

Upvote

31

GitHub 743 arXiv Page

Submitted by

dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors

· Published on Oct 26, 2025

Upvote

11

GitHub 17.8k arXiv Page

Submitted by

imsuperkong

WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

WorldWarp addresses the challenge of generating consistent long-range videos by integrating a 3D geometric cache with a spatio-temporal diffusion model, ensuring structural consistency and textural refinement.

National University of Singapore · Published on Dec 22, 2025

Upvote

26

GitHub 52 arXiv Page

Submitted by

Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors

· Published on Apr 14, 2025

Upvote

306

arXiv Page

Submitted by

zhongwenxu

Single-stream Policy Optimization

Single-stream Policy Optimization (SPO) improves policy-gradient training for Large Language Models by eliminating group-based issues and providing a stable, low-variance learning signal, leading to better performance and efficiency.

Tencent · Published on Sep 16, 2025

Upvote

33

GitHub 17.8k arXiv Page

Submitted by

Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

GigaAI · Published on Oct 22, 2025

Upvote

49

GitHub 988 arXiv Page

Submitted by

taesiri

Step-GUI Technical Report

A self-evolving training pipeline built on the Calibrated Step Reward System, together with the GUI-MCP protocol, improves GUI automation efficiency, accuracy, and privacy in real-world scenarios.

StepFun · Published on Dec 17, 2025

Upvote

121

GitHub 1.73k arXiv Page

Submitted by

LiheYoung

In Pursuit of Pixel Supervision for Visual Pre-training

Pixio, an enhanced masked autoencoder, demonstrates competitive performance across various downstream tasks using pixel-space self-supervised learning, outperforming latent-space approaches.

8 authors

· Published on Dec 17, 2025

Upvote

7

GitHub 238 arXiv Page

Submitted by

AdinaY

SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

The SCAIL framework improves character animation by using a novel 3D pose representation and a diffusion-transformer architecture with full-context pose injection, achieving studio-grade quality and realism.

Z.ai · Published on Dec 5, 2025

Upvote

19

GitHub 527 arXiv Page

Submitted by

taesiri

HunyuanVideo 1.5 Technical Report

HunyuanVideo 1.5 is a lightweight video generation model with state-of-the-art visual quality and motion coherence, using a DiT architecture with SSTA and an efficient video super-resolution network.

81 authors

· Published on Nov 24, 2025

Upvote

22

GitHub 2.15k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Data Intelligence Lab@HKU · Published on Oct 14, 2025

Upvote

52

GitHub 11.3k arXiv Page

by AK and the research community