Lightweight Language-driven Grasp Detection using Conditional Consistency Model (Part 3)
1. Experiments
Experiment Setup
Dataset. We use the Grasp-Anything dataset in our experiments. Grasp-Anything is a large-scale dataset for language-driven grasp detection with 1M samples. Each image in the dataset is accompanied by one or several prompts describing either a general object-grasping action or grasping an object at a specific location.
Evaluation Metrics. Our primary evaluation metric is the success rate, defined as in previous works: a predicted grasp is considered correct if its IoU with the ground-truth grasp exceeds 25% and its offset angle is less than 30°. We also report the harmonic mean (H) of the success rates on the seen and unseen splits to measure overall performance. The latency (inference time) of all methods is reported in seconds on the same NVIDIA A100 GPU.
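As a concrete reference for this success criterion, the sketch below checks whether a predicted 5-parameter grasp rectangle matches a ground-truth one under the IoU and angle thresholds above. It is a minimal illustration rather than the released evaluation code: the function names are ours, grasps are assumed to be rotated rectangles (center, width, height, angle), and polygon IoU is computed with shapely.

```python
# Minimal sketch of the standard grasp success criterion
# (IoU > 25% and angle offset < 30 degrees); not the authors' released code.
import numpy as np
from shapely.geometry import Polygon

def grasp_to_polygon(x, y, w, h, theta):
    """Convert a grasp (center, width, height, angle in radians) to a rotated rectangle."""
    dx, dy = w / 2.0, h / 2.0
    corners = np.array([[-dx, -dy], [dx, -dy], [dx, dy], [-dx, dy]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return Polygon(corners @ rot.T + np.array([x, y]))

def is_success(pred, gt, iou_thresh=0.25, angle_thresh=np.deg2rad(30)):
    """pred, gt: tuples (x, y, w, h, theta)."""
    p, g = grasp_to_polygon(*pred), grasp_to_polygon(*gt)
    union = p.union(g).area
    iou = p.intersection(g).area / union if union > 0 else 0.0
    # Grasp angles are equivalent modulo pi (rotating the rectangle by 180 degrees
    # yields the same grasp), so compare the smallest angular difference.
    diff = abs(pred[4] - gt[4]) % np.pi
    diff = min(diff, np.pi - diff)
    return iou > iou_thresh and diff < angle_thresh
```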
Comparison with Grasp Detection Methods
Baseline | Seen | Unseen | H | Latency (s) |
---|---|---|---|---|
GR-ConvNet | 0.37 | 0.18 | 0.24 | 0.022 |
Det-Seg-Refine | 0.30 | 0.15 | 0.20 | 0.200 |
GG-CNN | 0.12 | 0.08 | 0.10 | 0.040 |
CLIPORT | 0.36 | 0.26 | 0.29 | 0.131 |
CLIP-Fusion | 0.40 | 0.29 | 0.33 | 0.157 |
MaskGrasp | 0.50 | 0.46 | 0.45 | 0.116 |
LLGD (ours) with 1 timestep | 0.47 | 0.34 | 0.40 | 0.035 |
LLGD (ours) with 3 timesteps | 0.52 | 0.38 | 0.45 | 0.106 |
LLGD (ours) with 10 timesteps | 0.53 | 0.39 | 0.46 | 0.264 |
Table 1: Comparison with Traditional Grasp Detection Methods.
We compare our LLGD with GR-ConvNet, Det-Seg-Refine, GG-CNN, CLIPORT, CLIP-Fusion, and MaskGrasp. Table 1 compares our method with these baselines on the Grasp-Anything dataset. The table shows that our proposed LLGD outperforms traditional grasp detection methods by a clear margin, while our inference time remains competitive with the other methods.
Comparison with Lightweight Diffusion Models
Method | Seen | Unseen | H | Latency (s) |
---|---|---|---|---|
LGD with 3 timesteps | 0.42 | 0.29 | 0.35 | 0.074 |
LGD with 30 timesteps | 0.49 | 0.41 | 0.45 | 0.741 |
LGD with 1000 timesteps | 0.52 | 0.42 | 0.47 | 26.12 |
SnapFusion with 500 timesteps | 0.49 | 0.37 | 0.43 | 12.95 |
LightGrad with 250 timesteps | 0.51 | 0.34 | 0.43 | 6.420 |
LLGD (ours) with 1 timestep | 0.47 | 0.34 | 0.40 | 0.035 |
LLGD (ours) with 3 timesteps | 0.52 | 0.38 | 0.45 | 0.106 |
LLGD (ours) with 10 timesteps | 0.53 | 0.39 | 0.46 | 0.264 |
Table 2: Comparison with Diffusion Models for Language-Driven Grasp Detection.
In this experiment, we compare our LLGD with other diffusion models for language-driven grasp detection. In particular, we compare with LGD using DDPM and with recent lightweight diffusion models: SnapFusion with 500 timesteps and LightGrad with 250 timesteps.
Table 2 shows the results of diffusion models for language-driven grasp detection. Both the accuracy and the inference time of the classical diffusion model LGD depend strongly on the number of denoising timesteps: LGD with 1000 timesteps achieves reasonable accuracy but suffers from very long latency. Lightweight diffusion models such as SnapFusion and LightGrad show reasonable results and inference speeds. However, our method achieves the highest accuracy with the fastest inference speed.
Conditional Consistency Model Demonstration
Figure 1: Consistency model analysis. Given the text prompt "Grasp the cup at its handle", we compare the grasp pose trajectory of our method with that of LGD. The top row illustrates the trajectory of LGD, while the bottom row shows the trajectory of our LLGD.
In this analysis, we verify the effectiveness of our conditional consistency model. In Figure 1, we visualize the grasp pose with respect to the denoising time index. Because LGD employs a discrete diffusion model, it must perform the denoising steps one at a time with a step size of 1, which results in very slow inference; moreover, its grasp pose trajectory still exhibits significant fluctuations. In contrast, our continuous consistency model can select boundary time points arbitrarily. It is evident that the number of iterations required by our method is significantly smaller than that of LGD over the same time horizon, which contributes to the "lightweight" factor. Furthermore, the grasp pose produced by our method has almost converged to the ground truth after only a few steps, while LGD using DDPM has not yet achieved a successful grasp at the same point in the trajectory.
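To make this contrast concrete, the sketch below compares a classical DDPM-style sampler, which iterates over every discrete timestep, with a few-step conditional consistency sampler that evaluates the network only at a handful of chosen boundary time points. This is a simplified illustration under placeholder networks and schedules (eps_model, f, betas, and time_points are our assumptions), not the exact LGD or LLGD implementation.

```python
# Simplified contrast between DDPM-style sampling and few-step consistency
# sampling for conditional grasp pose generation; placeholder models, not the
# paper's actual code.
import torch

@torch.no_grad()
def ddpm_sample(eps_model, cond, shape, betas):
    """Classical DDPM: one denoising step per discrete timestep (T steps total)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):              # t = T-1, ..., 0
        eps = eps_model(x, torch.tensor([t]), cond)    # predict the added noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

@torch.no_grad()
def consistency_sample(f, cond, shape, time_points):
    """Conditional consistency model: f maps a noisy grasp at any time directly
    toward the clean grasp, so only a few boundary time points (e.g. 1, 3, or 10)
    need to be evaluated."""
    t_max = time_points[0]
    x = torch.randn(shape) * t_max           # start from pure noise at the largest time
    x = f(x, torch.tensor([t_max]), cond)    # one-step estimate of the clean grasp
    for t in time_points[1:]:                # optional refinement at chosen time points
        x_t = x + t * torch.randn_like(x)    # re-noise to the selected time point
        x = f(x_t, torch.tensor([t]), cond)
    return x
```

With a single entry in time_points, this sampler performs one network evaluation, loosely corresponding to the 1-timestep variant in Tables 1 and 2; adding further boundary points corresponds to the 3- and 10-timestep variants.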
Ablation Study
Figure 2: Visualization of detection results of different language-driven grasp detection methods.
Visualization. Figure 2 shows qualitative results of our method and other baselines. The outcomes suggest that, given the same text query, our LLGD generates more semantically plausible grasp poses than the other baselines. In particular, the other methods often predict grasp poses at locations that are not well aligned with the text query, while our method produces more suitable detections.
Figure 3: In the wild detection results. Images are from the internet.
In the Wild Detection. Figure 3 illustrates the outcomes of applying our method to random images from the internet. The results demonstrate that our LLGD can effectively detect the grasp pose given the language instructions on real-world images. Our method showcases a promising zero-shot learning ability, as it successfully interprets grasp actions on images it has never encountered during training.
Figure 4: Prediction failure cases.
Failure Cases. Although our method achieves promising results, it still predicts incorrect grasp poses in some cases. The wide variety of objects and grasping prompts poses a challenging problem, as the network cannot capture all the diverse circumstances that arise in real life. Figure 4 depicts failure cases where LLGD predicts incorrect results; these can be attributed to multiple similar objects that are difficult to distinguish and to text prompts that lack the detailed descriptions needed to determine the correct grasp.
Robotic Experiments
Robotic Setup. Our lightweight language-driven grasp detection pipeline is integrated into a robotic grasping framework that employs a KUKA LBR iiwa R820 robot to deliver quantifiable outcomes. A RealSense D435i camera is used to translate the grasp detected by LLGD into a 6DoF grasp pose, similar to previous works. A trajectory optimization planner then executes the grasping action. Experiments were conducted on a table surface for both a single-object scenario and a cluttered-scene scenario, in which various objects were placed to test each setup. Table 3 shows the success rates of our method and other baseline models.
Baseline | Single | Cluttered |
---|---|---|
GR-ConvNet + CLIP | 0.33 | 0.30 |
Det-Seg-Refine + CLIP | 0.30 | 0.23 |
GG-CNN + CLIP | 0.10 | 0.07 |
CLIPORT | 0.27 | 0.30 |
CLIP-Fusion | 0.40 | 0.40 |
SnapFusion | 0.40 | 0.39 |
LightGrad | 0.41 | 0.40 |
LLGD (ours) | 0.43 | 0.42 |
Table 3: Robotic language-driven grasp detection results.
Our method outperforms the other baselines in both the single-object and cluttered scenarios. Furthermore, our lightweight model enables rapid execution without sacrificing grasp detection accuracy.
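For reference, the sketch below illustrates one common way to lift a detected 2D grasp rectangle to a 6DoF grasp pose using the aligned depth image and camera intrinsics, in the spirit of the setup described above. It assumes a top-down grasp with the approach direction along the camera z-axis; the function and parameter names are our own and not taken from the released pipeline.

```python
# Hypothetical sketch: lifting a 2D grasp (pixel center + in-plane angle) to a
# 6DoF grasp pose in the camera frame; not the paper's exact implementation.
import numpy as np

def grasp_2d_to_6dof(x_px, y_px, theta, depth_image, fx, fy, cx, cy):
    """x_px, y_px: grasp center in pixels; theta: in-plane rotation (radians);
    depth_image: aligned depth map in meters; fx, fy, cx, cy: camera intrinsics."""
    z = float(depth_image[int(y_px), int(x_px)])   # depth at the grasp center
    # Back-project the pixel to a 3D point in the camera frame.
    x = (x_px - cx) * z / fx
    y = (y_px - cy) * z / fy
    # Assume a top-down grasp: approach along the camera z-axis, with the
    # gripper closing direction given by the in-plane angle theta.
    rotation = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ])
    pose = np.eye(4)                   # 4x4 homogeneous transform (camera frame)
    pose[:3, :3] = rotation
    pose[:3, 3] = np.array([x, y, z])
    return pose                        # transform to the robot base frame before planning
```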
2. Discussion
Limitation. Despite achieving notable results for real-time applications, our method still has limitations and predicts incorrect grasp poses on challenging real-world images. Faulty grasp poses often occur when the correlation between the text and the attention map of the visual features is not well aligned, as shown in Figure 4. From our experiments, we observe that when grasp instruction sentences contain rare and challenging nouns that are underrepresented in the dataset, the text prompt becomes ambiguous to parse, which is usually the main cause of incorrect grasp pose predictions. Therefore, providing instruction prompts with clear meanings is essential for the robot to understand and execute the correct grasping action.
Future work. We see several prospects for improvement in future work: 1. Expanding our method to 3D space, implementing it for 3D point clouds and RGB-D images to overcome the lack of depth information in robotic applications. 2. Bridging the gap between the semantic concepts in text prompts and the input images by analyzing the detailed geometry of objects to effectively distinguish items with similar structures. 3. Extending the problem to more complex language-driven manipulation applications. For instance, if the robot needs to grasp a plate containing apples, it must manipulate the plate in a manner that prevents the apples from falling.