Qwen: QvQ 72B Preview
qwen/qvq-72b-preview
QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities.
Performance
QVQ-72B-Preview | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude3.5 Sonnet-20241022 | Qwen2VL-72B | |
---|---|---|---|---|---|
MMMU(val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
MathVista(mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
MathVision(full) | 35.9 | – | 30.4 | 35.6 | 25.9 |
OlympiadBench | 20.4 | – | 25.9 | – | 11.2 |
Limitations
- Language Mixing and Code-Switching: The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
- Recursive Reasoning Loops: There's a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not even arrive at a final answer.
- Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
- Performance and Benchmark Limitations: Despite the improvements in visual reasoning, QVQ doesn’t entirely replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ doesn’t show significant improvement over Qwen2-VL-72B in basic recognition tasks like identifying people, animals, or plants.
Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.