3D Scene-Aware Vision-Language Action Modeling for Robot Manipulation

I enhanced OpenVLA by integrating object detection, depth features, and chain-of-thought reasoning to strengthen its spatial and semantic understanding. Using the Molmo VLM and SAM, I implemented object-specific attention masking so that the policy attends to task-relevant objects. I also introduced depth-aware embeddings via a PointNet encoder, fusing RGB and depth data for better 3D spatial reasoning. To improve task execution, I generated detailed GPT-4 task narrations that decompose high-level instructions into structured action steps. Together, these changes raised task success rates by 8% on the LIBERO-Long benchmark, demonstrating stronger performance on long-horizon robotic manipulation.
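The two core components can be illustrated with short sketches. The first is a minimal sketch of object-specific attention masking: a SAM object mask is downsampled to the ViT patch grid and turned into an additive attention bias so attention concentrates on patches covering the task-relevant object. The patch size, penalty value, and function names here are illustrative assumptions, not the exact configuration used.

```python
# Minimal sketch (PyTorch, hypothetical names): convert a SAM object mask into
# an additive attention bias over ViT patch tokens.
import torch
import torch.nn.functional as F

def mask_to_patch_bias(obj_mask: torch.Tensor, patch_size: int = 14,
                       penalty: float = -1e4) -> torch.Tensor:
    """obj_mask: (H, W) binary mask from SAM for the task-relevant object.
    Returns an additive bias of shape (num_patches,) that is 0 on patches
    touching the object and a large negative value elsewhere."""
    # Fraction of each patch covered by the object.
    patch_cov = F.avg_pool2d(obj_mask[None, None].float(),
                             kernel_size=patch_size, stride=patch_size)
    keep = (patch_cov.flatten() > 0).float()   # 1 on object patches, 0 otherwise
    return (1.0 - keep) * penalty              # 0 on object, -1e4 off-object

# Usage inside a (hypothetical) attention layer over visual tokens:
# scores: (batch, heads, queries, num_patches)
# scores = scores + mask_to_patch_bias(sam_mask).view(1, 1, 1, -1)
# attn = scores.softmax(dim=-1)
```

The second sketches the depth stream: the depth map is back-projected into a camera-frame point cloud using the intrinsics, then encoded with a PointNet-style shared MLP and symmetric max pool; the resulting embedding is concatenated with the RGB visual features downstream. Layer sizes and the intrinsics interface are assumptions for illustration.

```python
# Minimal sketch (PyTorch, hypothetical hyperparameters): depth -> point cloud
# -> PointNet-style global embedding.
import torch
import torch.nn as nn

def depth_to_points(depth: torch.Tensor, fx: float, fy: float,
                    cx: float, cy: float) -> torch.Tensor:
    """depth: (H, W) in meters -> (H*W, 3) camera-frame points."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)

class PointNetEncoder(nn.Module):
    """Shared per-point MLP followed by a symmetric max pool."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) -> (out_dim,) global depth embedding
        return self.mlp(points).max(dim=0).values
```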

Report · Video · [Code will be shared soon]