MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
Abstract
Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by typical human problem-solving processes, we hypothesize that the capability to perceive and extract meaningful information from diagrams is crucial, as it directly impacts subsequent inference. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned properties from diagrams and when performing complex reasoning over these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages, allowing each to be optimized independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results show that MathFlow-P-7B yields substantial performance gains when paired with various closed-source and open-source inference models, demonstrating the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
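The decoupling described in the abstract can be pictured as a two-stage pipeline: a perception model first translates the diagram into a textual description of the extracted information, and an inference model then reasons over that description together with the question. The sketch below is a minimal illustration of this idea only; the `MathFlowPipeline` class and the two model callables are hypothetical stand-ins, not the paper's released MathFlow-P-7B or inference models.

```python
# Minimal sketch of a decoupled perception-then-inference pipeline,
# in the spirit of MathFlow. All names here are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable


@dataclass
class MathFlowPipeline:
    # Stage 1: maps a diagram (here just a file path) to a textual
    # description of the information extracted from it.
    perception_model: Callable[[str], str]
    # Stage 2: maps (question, extracted description) to a final answer.
    inference_model: Callable[[str, str], str]

    def solve(self, diagram_path: str, question: str) -> str:
        extracted = self.perception_model(diagram_path)
        return self.inference_model(question, extracted)


# Placeholder stand-ins so the sketch runs end to end.
def dummy_perception(diagram_path: str) -> str:
    return f"[description of diagram at {diagram_path}]"


def dummy_inference(question: str, extracted: str) -> str:
    return f"Answer derived from '{extracted}' for question: {question}"


if __name__ == "__main__":
    pipeline = MathFlowPipeline(dummy_perception, dummy_inference)
    print(pipeline.solve("triangle.png", "Find the length of side AB."))
```

Because the two stages communicate only through text, the perception model can be swapped or retrained independently of whichever closed-source or open-source model handles inference, which is the compatibility property the abstract highlights.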
Community
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts (2025)
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (2025)
- VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search (2025)
- MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions (2025)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning (2025)
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind (2025)
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025)