BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
In this project, we introduce BOP-ASK, a large-scale dataset that teaches vision-language models to reason about physical interactions: where to grasp an object, how to move it without collisions, or which items to move first in a cluttered scene. By combining precise 3D geometry with millions of interaction questions, we enable models to go beyond simple spatial descriptions (“left of”, “behind”) toward actionable understanding that supports real robot manipulation. The result is a step toward AI systems that bridge perception and action, not just describing scenes but reasoning about how to physically operate within them.
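To make the flavor of the data concrete, here is a minimal sketch of what a single interaction question-answer record might look like. The field names, file paths, and answer formats below are illustrative assumptions for this sketch, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class InteractionSample:
    """One hypothetical interaction-reasoning QA pair grounded in scene geometry.

    NOTE: this schema is an assumption for illustration; consult the dataset
    release for the real record format.
    """
    image_path: str                                  # RGB frame of the cluttered scene
    question: str                                    # natural-language interaction query
    answer: str                                      # free-form or structured answer
    grasp_point_px: tuple[int, int] | None = None    # optional pixel-space grasp target

# Hypothetical examples of the three question types mentioned above:
# grasping, collision-free motion, and ordering in clutter.
samples = [
    InteractionSample(
        image_path="scenes/000001/rgb/000042.png",
        question="Where should the gripper grasp the mug to lift it without hitting the bowl?",
        answer="On the handle, approaching from the right.",
        grasp_point_px=(412, 287),
    ),
    InteractionSample(
        image_path="scenes/000001/rgb/000042.png",
        question="Which object must be moved first to free the mug?",
        answer="The cereal box leaning against it.",
    ),
]

for s in samples:
    print(f"Q: {s.question}\nA: {s.answer}\n")
```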
