MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping

Robots often struggle to pick up objects specified by natural language instructions, especially in cluttered environments with many similar items. In this work, we introduce MapleGrasp, a system that helps a robot determine which object an instruction refers to and how to grasp it efficiently. Our key idea is mask-guided feature pooling: the model first segments the target object using vision-language reasoning, then restricts grasp prediction to features from that region, yielding precise grasp points. This simple but effective design makes the system both faster and more accurate, improving grasp success while reducing computation.
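To make the core idea concrete, here is a minimal sketch of mask-guided feature pooling in NumPy. This is an illustration, not the paper's actual implementation: the function name `mask_guided_pool`, the feature-map shapes, and the fallback behavior are all assumptions. The point is simply that a dense feature map is averaged only over the pixels inside the target object's mask, so the resulting descriptor ignores clutter outside the target.

```python
import numpy as np

def mask_guided_pool(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average-pool feature vectors over the masked region only.

    features: (C, H, W) dense feature map from a vision backbone.
    mask:     (H, W) binary segmentation of the target object.
    Returns a (C,) descriptor summarizing the target region alone.
    """
    assert features.shape[1:] == mask.shape
    weights = mask.astype(features.dtype)
    area = weights.sum()
    if area == 0:
        # No target pixels found: fall back to global average pooling
        # (a hypothetical safeguard, not from the paper).
        return features.mean(axis=(1, 2))
    # Weighted sum over spatial dims, normalized by the mask area.
    return (features * weights[None]).sum(axis=(1, 2)) / area

# Toy example: a 2-channel 4x4 feature map; the mask covers the
# top-left 2x2 block, so only those four locations are pooled.
feats = np.stack([
    np.ones((4, 4)),
    np.arange(16, dtype=float).reshape(4, 4),
])
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
pooled = mask_guided_pool(feats, mask)
# pooled[0] == 1.0; pooled[1] == mean of [0, 1, 4, 5] == 2.5
```

In a full pipeline, a descriptor like this would condition a grasp head so its predictions concentrate on the referred object rather than the whole scene.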

Paper Code