Visual-Tactile Fusion for Robotic Stable Grasping

*Bin Fang, Chao Yang, Fuchun Sun and Huaping Liu*

### **Abstract**

A stable grasp is the basis of robotic manipulation: it requires balancing the contact forces exerted on the manipulated object. Determining the grasp status from vision, based on the object's shape or texture, is direct but quite challenging, while tactile sensing provides an effective complement. In this work, we propose a visual-tactile fusion framework for predicting grasp stability that also exploits the intrinsic properties of the object. More than 2550 grasping trials are collected using a novel robot hand with multiple tactile sensors, and a visual-tactile-intrinsic deep neural network (DNN) is evaluated on them. The experimental results show the superiority of the proposed method.

**Keywords:** stable grasp, tactile, visual, deep neural network

### **1. Introduction**

In recent years, dexterous robotic manipulation has attracted worldwide attention because of its important role in robotic service, and a stable grasp is the basis of manipulation. However, stable grasping is still challenging, since it depends on many factors, such as the actuator, sensors, movement, object, and environment. With the development of neural networks, data-driven methods [1] have become popular. For example, Levine et al. used 14 robots to perform over 800,000 random grasps to collect data and train a convolutional neural network (CNN) [2]. Guo et al. trained a deep neural network (DNN) with 12K labeled images to learn end-to-end grasping policies [3]. Mahler et al. built a dataset of millions of point clouds for training the Grasp Quality Convolutional Neural Network (GQ-CNN) with an analytic metric; GQ-CNN then developed an optimal grasp strategy that achieves a 93% success rate on eight kinds of objects [4–6]. Zhang et al. trained robots to manipulate objects from videos made in virtual reality (VR); for pick-and-place tasks, the success rate increased as the number of samples increased [7]. Therefore, sufficient high-quality data is important for robotic grasping.

Nowadays, a few robot grasping datasets have been developed. The Playpen dataset contains 60 hours of grasping data from a PR2 robot with RGB-D cameras [8]. The Columbia dataset collects about 22,000 grasping samples via the GraspIt! simulator [9]. Besides experiments with robots and numerical simulations, human manipulation videos are also useful: self-supervised learning algorithms have been developed from video demonstrations [10]. While the above datasets cover the whole grasping process, other datasets concentrate on specific tasks, such as grasp planning and slip detection. Pinto et al. had robots automatically generate 50,000 labeled images for grasp planning with self-supervised learning [11]. MIT built a grasp dataset using a vision-based tactile sensor together with external vision [12]. While some experiments induced slip with extra force or by fixing objects [13, 14], researchers have also recorded the actual random grasping process, with a 46% failure rate over 1000 grasps [15, 16]. Such real data can contribute to precision grasping [17]. The overabundance of object types in daily life makes building datasets difficult, so some researchers select common objects and build 3D object model sets, such as the KIT objects [18] and the YCB object set [19], which are more convenient for research. However, few datasets include both visual and tactile data. Sufficient visual, tactile, and position data can clearly describe the grasping process and improve the robot's grasping ability.

Following this previous work, it is necessary to build a complete dataset for robotic manipulation. In this chapter, a new grasp dataset based on a three-finger robot hand is built. In the following sections, the structure of the multimodal dataset is introduced in detail, and a CNN and long short-term memory networks (LSTMs) are designed for grasp stability prediction.

### **2. Grasp stability prediction**

In this section, the multimodal fusion framework of grasp stability prediction is proposed.

### **2.1 Visual representation learning**

From the visual image set, only 2700 × 2 = 5400 sets of image data are available in total, which makes it difficult to train convolutional visual features from scratch (the ResNet-18 network structure is used in our experiments), since training hardly converges on such a small dataset. A time-contrastive network [10] is therefore used: anchor, positive, and negative frames are captured from the grasping videos, a triplet loss function [20] is defined, and the continuous change of motion in the video is exploited to learn the manipulation process. The learned visual features also serve as pre-training for the convolutional part of the subsequent grasp stability prediction network. As shown in **Figure 1**, multi-view cameras record video of the same grasping process. Images from different viewpoints at the same time represent the same robot state, so the distance between their embedding vectors should be relatively small, while images from the same viewpoint at different times represent different grasping states, so the distance between their embedding vectors should be relatively large. Formally:

$$\left\|f\left(\mathbf{x}_{i}^{a}\right)-f\left(\mathbf{x}_{i}^{p}\right)\right\|_{2}^{2}+\alpha<\left\|f\left(\mathbf{x}_{i}^{a}\right)-f\left(\mathbf{x}_{i}^{n}\right)\right\|_{2}^{2}\tag{1}$$

*Visual-Tactile Fusion for Robotic Stable Grasping DOI: http://dx.doi.org/10.5772/intechopen.91455*

where $f\left(\mathbf{x}_{i}^{a}\right)$, $f\left(\mathbf{x}_{i}^{p}\right)$, and $f\left(\mathbf{x}_{i}^{n}\right)$ represent the anchor, positive, and negative image features extracted by the CNN, and $\alpha$ is the margin. So, we can define the loss function [21] as

$$l(a,p,n)=\frac{1}{N}\sum_{i=1}^{N}\max\left\{d\left(a_{i},p_{i}\right)-d\left(a_{i},n_{i}\right)+\alpha,\ 0\right\}\tag{2}$$
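As a concrete illustration, the loss in Eq. (2) can be sketched in a few lines of Python (a minimal NumPy sketch; the embedding values and margin below are illustrative assumptions, not the chapter's actual training code):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Eq. (2): mean over the batch of max{d(a_i, p_i) - d(a_i, n_i) + margin, 0},
    with d the squared Euclidean distance between embedding vectors."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # d(a_i, p_i)
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # d(a_i, n_i)
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

# Toy batch of N = 2 two-dimensional embeddings (illustrative values only):
a = np.array([[0.0, 0.0], [1.0, 0.0]])  # anchors
p = np.array([[0.1, 0.0], [1.0, 0.1]])  # positives (same state, other view)
n = np.array([[1.0, 0.0], [0.0, 0.0]])  # negatives (different grasp state)
loss = triplet_loss(a, p, n)  # 0.0 here: these triplets already satisfy the margin
```

The loss is zero exactly when every anchor is closer to its positive than to its negative by at least the margin, which is the condition stated in Eq. (1).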


*Industrial Robotics - New Paradigms*


**Figure 1.** *Visual representation network.*

### **2.2 Predicting grasp stability**

To describe object properties such as shape and size, images are captured before grasping from two cameras, represented by *Ib* (**Figure 2**), while *Id* represents the position of the robot with respect to the grasped object. Hence the vision feature *fv* can be calculated as

$$f_{v}=R\left(I_{b},I_{d}\right)\tag{3}$$

where *R* represents the pre-trained neural network.

The images are passed through a standard convolutional network using the ResNet-18 architecture. Different from previous work [22], tactile sensors are used to measure the force applied by the robot during manipulation. Since the tactile data form sequences, LSTMs are applied as the feature extractor:

$$f_{t}=L\left(T_{0},T_{1},\dots,T_{T}\right)\tag{4}$$

where *ft* is the output of the LSTMs at the last time step and *T0*, *T1*, …, *TT* are the inputs to the LSTMs at each step.
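To make Eq. (4) concrete, the LSTM's role as a tactile feature extractor can be sketched as follows (a minimal single-layer NumPy LSTM with random, untrained weights; the sequence length, sensor dimension, and feature size are illustrative assumptions, not those of the chapter's trained network):

```python
import numpy as np

def lstm_last_step(T, Wx, Wh, b):
    """Minimal LSTM: returns the last-step hidden state, i.e. f_t = L(T_0, ..., T_T).
    T: (steps, input_dim); Wx: (4*hidden, input_dim); Wh: (4*hidden, hidden); b: (4*hidden,)."""
    hidden = Wh.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in T:                        # one tactile reading per time step
        z = Wx @ x + Wh @ h + b        # all four gate pre-activations at once
        i, f, o, g = (z[:hidden], z[hidden:2*hidden],
                      z[2*hidden:3*hidden], z[3*hidden:])
        c = sig(f) * c + sig(i) * np.tanh(g)   # cell state update
        h = sig(o) * np.tanh(c)                # hidden state update
    return h  # tactile feature f_t

# Illustrative: a 10-step tactile sequence with 16 channels, mapped to an 8-d feature.
rng = np.random.default_rng(0)
T = rng.normal(size=(10, 16))
Wx = rng.normal(scale=0.1, size=(32, 16))
Wh = rng.normal(scale=0.1, size=(32, 8))
b = np.zeros(32)
ft = lstm_last_step(T, Wx, Wh, b)
```

Only the final hidden state is kept, matching the use of the last time step's output as *ft* in the text.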

**Figure 2.** *Multimodal grasp stability prediction network.*

Besides, the mass and mass distribution of the object also affect the stability of grasping. To simplify the problem, the weight of the object is assumed known and its mass distribution uniform. The intrinsic object property is then described as

$$f_{i}=M(w)\tag{5}$$

where *fi* represents the intrinsic object feature and *w* is the object weight. A multilayer perceptron (MLP) is used to extract this intrinsic feature.

The sensory modalities provide complementary information about the prospects for a successful grasp. For example, the camera images may show that the gripper is near the center of the object, while the tactile readings show that the force is sufficient to keep the grasp stable. To study multimodal fusion for predicting grasp outcomes, a neural network is trained to predict whether the robot's grasp will succeed by integrating visual, tactile, and object-intrinsic information. The network computes *y* = *F*(*X*), where *y* is the probability of a successful grasp and *X* = [*fv*, *ft*, *fi*] is the set of features from the three modalities: visual, tactile, and object intrinsic properties.

**Training the network**: the weights of the visual CNN are initialized with the model pre-trained in Section 2.1. The visual representation network is trained for 200 epochs using the Adam optimizer [23], starting with a learning rate of $10^{-5}$, which is decreased by an order of magnitude halfway through training.

During training, the RGB images are cropped to contain the table that holds the objects. Then, following standard practice in object recognition, the images are resized to 256 × 256 and randomly cropped to 224 × 224; they are also randomly flipped in the horizontal direction. This data augmentation helps prevent overfitting.

### **3. Experiment and data collection**

The experiment platform consists of the Eagle Shoal robot hand, two RealSense SR300 cameras, and the UR5 robot arm. As shown in **Figure 3**, they are arranged around a table of length 600 mm and width 600 mm. A layer of sponge on the surface of the table provides protection, and a soft flannel sheet covers the table to avoid the interference of light reflection. The UR5 robot arm with the Eagle Shoal robot hand is fixed at the backside of the table. One RealSense camera is on the opposite side of the table for recording the front view of grasping; the other is located to the left of the table for recording the lateral view.

**Figure 3.** *Experiment platform.*

The general grasp dataset is built with various variables including shape, size, weight, and grasp style. The objects in the dataset include cuboids, cylinders, and special shapes of different sizes, and their weights are varied by adding granules or water. Different grasping methods are tested by grasping from three directions: back, right, and top. Unstable grasping data is generated by slipping, induced with added weight, changed grasp force, and adjusted grasp position (**Table 1**).

| Hand | Type | Weight | Force (mA) | Direction | Trial | Data type | Total |
|------|------|--------|------------|-----------|-------|-----------|-------|
| Eagle Shoal | 10 objects | Empty | 50/100/150 | Top/right/back | 10 times | T1/I | 900 sets |
| Eagle Shoal | 10 objects | Half/full | 50/100/150 | Top/right/back | 10 times | T1/T2/I/V | 1650 sets |

**Table 1.** *Dataset statistics.*

The detailed processes are as follows:

1. The object is placed at the center of the table; the front camera is used to get the point cloud data and compute the target's position.

2. The object's half-height position is chosen as the grasp point; the robot is controlled to approach the object, and then a random error of 5 mm is added.

3. Based on the object's size, the robot hand grasps in position-loop mode; after 1 second, the robot arm lifts up at a speed of 20 mm/s.

4. After the robot arm moves to a certain position, the positions of the robot fingers are checked. If the hand has bent too much, the grasp is labeled as a failure, and the hand opens directly to prepare for the next grasp.

5. Otherwise the grasp is labeled as a success: for a light object, the robot puts the object down, and for a heavy object, the robot opens the hand and drops the object directly.
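As a minimal illustration of how the pieces of Section 2.2 combine for each recorded trial, the fusion predictor mapping *X* = [*fv*, *ft*, *fi*] to a success probability can be sketched as follows (a NumPy sketch with random, untrained weights; the layer sizes and feature dimensions are illustrative assumptions, not those of the chapter's network):

```python
import numpy as np

def predict_stability(fv, ft, fi, W1, b1, W2, b2):
    """Concatenate the visual, tactile, and intrinsic features, pass them
    through a small MLP, and squash to a grasp-success probability."""
    X = np.concatenate([fv, ft, fi])       # multimodal feature vector
    h = np.maximum(W1 @ X + b1, 0.0)       # hidden layer with ReLU
    logit = W2 @ h + b2
    return float(1.0 / (1.0 + np.exp(-logit)))  # probability of a stable grasp

# Illustrative dimensions: 128-d visual, 8-d tactile, 4-d intrinsic feature.
rng = np.random.default_rng(1)
fv, ft, fi = rng.normal(size=128), rng.normal(size=8), rng.normal(size=4)
W1, b1 = rng.normal(scale=0.05, size=(32, 140)), np.zeros(32)
W2, b2 = rng.normal(scale=0.05, size=32), 0.0
y = predict_stability(fv, ft, fi, W1, b1, W2, b2)
```

In training, the binary labels produced by steps 4 and 5 of the collection procedure would supervise this probability output.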

