HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
HiFi-CS is a lightweight Vision-Language Model (VLM) framework designed to improve Referring Grasp Synthesis (RGS), the task of having a robot grasp an object specified by a natural language query. By applying Feature-wise Linear Modulation (FiLM) hierarchically, it improves visual grounding for complex, attribute-rich text descriptions in cluttered environments. HiFi-CS pairs a frozen VLM with a lightweight decoder and outperforms baselines in closed-vocabulary settings while being 100x smaller. It also improves open-vocabulary grounding by guiding object detectors such as GroundedSAM. Validated on a 7-DOF robotic arm, HiFi-CS achieves 90.33% visual grounding accuracy in real-world tabletop experiments.
Paper | Video | Code (will be released soon)
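As a rough illustration of the FiLM conditioning mentioned in the abstract, the sketch below scales and shifts each channel of a visual feature map using values predicted from a text embedding, and repeats this at several feature levels. It is not the HiFi-CS implementation: the function names (`linear`, `film`, `hierarchical_film`), the parameter layout, and the toy numbers are all assumptions made for illustration. It is written in plain Python to stay dependency-free, whereas a real model would use a tensor library.

```python
def linear(x, weight, bias):
    """Project vector x (length D) to len(weight) outputs; each weight row has length D."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]


def film(features, text_emb, params):
    """Modulate a feature map given as C channels, each a flat list of spatial values."""
    gamma = linear(text_emb, params["w_gamma"], params["b_gamma"])  # one scale per channel
    beta = linear(text_emb, params["w_beta"], params["b_beta"])     # one shift per channel
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


def hierarchical_film(feature_maps, text_emb, params_per_level):
    """Apply one FiLM layer per level of a feature hierarchy (hypothetical layout)."""
    return [film(f, text_emb, p) for f, p in zip(feature_maps, params_per_level)]


# Toy example: a 2-dim text embedding and two feature levels with 2 channels each.
text_emb = [1.0, 2.0]
params = {
    "w_gamma": [[1.0, 0.0], [0.0, 1.0]],  # gamma = text_emb itself
    "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0, 0.0], [0.0, 0.0]],   # beta = constant bias
    "b_beta": [0.5, -0.5],
}
feature_maps = [
    [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]],  # level 0: 2 channels, 4 positions
    [[2.0], [3.0]],                                # level 1: 2 channels, 1 position
]
# The same parameters are reused at both levels only to keep the example short.
fused = hierarchical_film(feature_maps, text_emb, [params, params])
print(fused)
```

Because the scale and shift depend only on the text, the same query reweights every spatial position of a channel identically. Applying the operation at more than one level lets the language signal influence both coarse and fine visual features.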