---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
# LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

🤗 Train Dataset • 🤗 Benchmark • 🤗 Model • 📃 Paper
## 🔍 Table of Contents

- ⚙️ LongWriter-V Deployment
- 🤖️ LongWriter-Agent-V
- 🖥️ Model Training
- 📊 Evaluation
- 👀 Cases
- 📝 Citation
## ⚙️ LongWriter-V Deployment
**Environment Setup**: To run inference with Qwen2.5-VL-based models, you may need to install `transformers` from source (`pip install git+https://github.com/huggingface/transformers`); refer to this issue for more details.
We open-source three models: LongWriter-V-7B and LongWriter-V-7B-DPO, trained from Qwen2.5-VL-7B-Instruct, and LongWriter-V-72B, trained from Qwen2.5-VL-72B-Instruct.
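Below is a minimal inference sketch using the standard Qwen2.5-VL interface in `transformers`. The repo id `THU-KEG/LongWriter-V-7B-DPO` and the image path are assumptions; substitute the checkpoint and input you actually use.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Assumed Hugging Face repo id; replace with the checkpoint you downloaded.
model_id = "THU-KEG/LongWriter-V-7B-DPO"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # hypothetical input image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write a detailed, well-structured 3000-word article about this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Ultra-long outputs need a generous generation budget.
output_ids = model.generate(**inputs, max_new_tokens=8192)
print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```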
## 🤖️ LongWriter-Agent-V
We also open-source LongWriter-Agent-V under `agentwrite/`: our automated pipeline for constructing ultra-long output data. Configure your API key in `config.py`, then run `outline_vlm.py` to obtain the final data.
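A minimal sketch of what that configuration might look like (the variable names here are assumptions; check the released `agentwrite/config.py` for the actual fields):

```python
# agentwrite/config.py -- hypothetical field names; mirror the released file's layout.
OPENAI_API_KEY = "sk-..."                      # key for the model that drives the pipeline
OPENAI_BASE_URL = "https://api.openai.com/v1"  # or any OpenAI-compatible endpoint
```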
## 🖥️ Model Training
You can download and save the LongWriter-V-22K training data through Hugging Face Datasets (🤗 HF Repo).
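For example, with the `datasets` library (the repo id `THU-KEG/LongWriter-V-22K` and the split name are assumptions; use the identifiers shown on the dataset page):

```python
from datasets import load_dataset

# Assumed dataset id and split; replace with the values from the dataset page.
ds = load_dataset("THU-KEG/LongWriter-V-22K", split="train")
print(len(ds), ds[0].keys())  # inspect the size and the fields of one record
```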
You can train the model with LLaMA-Factory; we used the official Qwen2_VL training script. A config sketch is shown below.
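For orientation, here is a sketch of what a LLaMA-Factory full-parameter SFT config for this setup might look like. All values are illustrative assumptions; start from the Qwen2-VL example config shipped with LLaMA-Factory, and register the LongWriter-V-22K data in its `data/dataset_info.json` first.

```yaml
# Illustrative values only -- adapt LLaMA-Factory's official Qwen2-VL example.
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: longwriter_v_22k        # name registered in data/dataset_info.json
template: qwen2_vl
cutoff_len: 32768                # long-output samples need a long context window
output_dir: saves/longwriter-v-7b
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 2.0
bf16: true
```

Training can then be launched with `llamafactory-cli train <config>.yaml`.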
## 📊 Evaluation
We introduce two evaluation benchmarks: MMLongBench-Write and LongWrite-V-Ruler. MMLongBench-Write measures both the quality and the length of long outputs, while LongWrite-V-Ruler is a lightweight stress test of the model's maximum output length.
We provide our evaluation code under `eval/`. Run

```bash
python -m eval.mmlongbench_write --model {model_name} --method {vlm, caption_llm}
python -m eval.longwrite_v_ruler --model {model_name}
```

to get the evaluation results. Remember to configure your OpenAI API key in `config.py`, since we adopt GPT-4o as the judge.
Here are the evaluation results on MMLongBench-Write:
Here are the evaluation results on LongWrite-V-Ruler:
## 👀 Cases
Here are LongWriter-V-7B's outputs to random test prompts (examples truncated for brevity).
## 📝 Citation
If you find our work useful, please cite:
```bibtex
@misc{tu2025longwriterv,
      title={LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models},
      author={Shangqing Tu and Yucheng Wang and Daniel Zhang-Li and Yushi Bai and Jifan Yu and Yuhao Wu and Lei Hou and Huiqin Liu and Zhiyuan Liu and Bin Xu and Juanzi Li},
      year={2025},
      eprint={2502.14834},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14834},
}
```