Alibaba unveils QVQ-72B, an open-source AI model for vision-based reasoning

Unlike conventional models that focus solely on visual or reasoning tasks, QVQ-72B integrates both, offering a powerful solution for tasks requiring deep context analysis.

Alibaba’s Qwen research team has launched QVQ-72B, a new open-source artificial intelligence (AI) model for advanced visual reasoning. The experimental model, released as a preview, combines the ability to analyze visual data from images with structured reasoning to answer complex queries. This marks a significant step forward for vision-based AI technologies.

The QVQ-72B model was revealed on Hugging Face, where its capabilities were detailed. By integrating visual perception with step-by-step reasoning rather than specializing in one or the other, it targets tasks that demand deep contextual analysis. In internal tests, it scored 71.4% on the MathVista (mini) benchmark, narrowly surpassing OpenAI’s o1 model at 71%, and achieved 70.3% on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark.
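For readers who want to try the preview, the checkpoint can be loaded through the standard Hugging Face transformers workflow. The sketch below assumes the `Qwen/QVQ-72B-Preview` model ID and the Qwen2-VL-style processor interface the Qwen team uses for its other vision models; the image file and prompt are illustrative, and the official model card on Hugging Face remains the authoritative reference for exact usage.

```python
# Minimal sketch: prompting the QVQ-72B preview checkpoint.
# Assumptions: the Qwen/QVQ-72B-Preview model ID and the Qwen2-VL-style
# chat interface; "diagram.png" is a hypothetical local file.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/QVQ-72B-Preview"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus a question that requires visual reasoning.
image = Image.open("diagram.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the area of the shaded region? Think step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Long reasoning chains are the point of the model, so allow many new tokens.
output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```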

This release builds on previous efforts such as Alibaba’s QwQ-32B and Marco-o1, both part of its push toward reasoning-focused large language models (LLMs). While the new model improves performance, the team acknowledged limitations such as occasional language-mixing errors and recursive reasoning loops that can hurt accuracy.

By merging visual analysis and reasoning, the QVQ-72B model offers potential applications in areas such as image-based problem-solving, enhanced multimodal tasks, and more nuanced AI-driven interactions. This aligns with Alibaba’s ongoing efforts to develop open-source AI technologies, bolstering its standing in the global AI landscape.