
In addition, the mass and mass distribution of the object affect the stability of a grasp. To simplify the problem, the object weight is assumed known and the mass distribution is assumed uniform. The intrinsic object property is then described as

$$f_i = M(w) \tag{5}$$

where $f_i$ represents the intrinsic object feature and $w$ is the object weight; a multilayer perceptron (MLP) $M(\cdot)$ is used to extract the intrinsic feature. The sensory modalities provide complementary information about the prospects for a successful grasp: for example, the camera images show whether the gripper is near the center of the object, while the tactile readings show whether the force is enough to keep the grasp stable. To study multimodal fusion for predicting grasp outcomes, a neural network is trained to predict whether the robot's grasp will be successful by integrating visual, tactile, and object intrinsic information. The network computes $y = f(X)$, where $y$ is the probability of a successful grasp and $X = [f_v, f_t, f_i]$ contains the features from the three modalities: visual, tactile, and object intrinsic properties.

**Figure 2.**
*Network for predicting grasp stability from multimodal information.*
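To make the fusion step concrete, a minimal PyTorch sketch of such a network is given below. The module names, layer sizes, and feature dimensions are illustrative assumptions; only the overall structure (an MLP over the object weight, concatenation of the three features, and a probability output) follows the description above.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of y = f(X) with X = [f_v, f_t, f_i]; all dimensions are assumed."""

    def __init__(self, visual_dim=512, tactile_dim=128, intrinsic_dim=16):
        super().__init__()
        # Eq. (5): f_i = M(w), an MLP over the (scalar) object weight.
        self.intrinsic_mlp = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, intrinsic_dim)
        )
        # Fusion head: concatenated features -> probability of a successful grasp.
        self.head = nn.Sequential(
            nn.Linear(visual_dim + tactile_dim + intrinsic_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_v, f_t, w):
        # f_v: (B, visual_dim), f_t: (B, tactile_dim), w: (B, 1) object weight.
        f_i = self.intrinsic_mlp(w)
        x = torch.cat([f_v, f_t, f_i], dim=1)  # X = [f_v, f_t, f_i]
        return self.head(x)                    # y: grasp success probability
```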

**Train the network**: the weights of the visual CNN are initialized from the model pretrained in Section III-A. The visual representation network is trained for 200 epochs using the Adam optimizer [23], starting with a learning rate of $10^{-5}$, which is decreased by an order of magnitude halfway through training.
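A sketch of this training configuration is shown below, reusing the hypothetical `FusionNet` above; the data loader and label format are assumptions, while the optimizer, initial learning rate, and halfway decay follow the text.

```python
import torch

model = FusionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # initial LR 1e-5
# Decrease the LR by an order of magnitude halfway through 200 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)
criterion = torch.nn.BCELoss()  # binary grasp success/failure

for epoch in range(200):
    for f_v, f_t, w, label in train_loader:  # train_loader is assumed to exist
        optimizer.zero_grad()
        pred = model(f_v, f_t, w)      # (B, 1) success probability
        loss = criterion(pred, label)  # label: float tensor of shape (B, 1)
        loss.backward()
        optimizer.step()
    scheduler.step()
```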

During training, the RGB images are cropped to contain the table that holds the objects. Then, following standard practice in object recognition, the images are resized to 256 × 256 and randomly cropped to 224 × 224; the images are also randomly flipped in the horizontal direction. The same data augmentation is applied throughout to prevent overfitting.
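This augmentation pipeline can be written directly with torchvision transforms, as sketched below; the normalization step is omitted since the text does not specify it.

```python
from torchvision import transforms

# Training-time augmentation as described: resize to 256 x 256,
# random 224 x 224 crop, and random horizontal flip.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```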

**3. Experiment and data collection**

The experiment platform consists of the Eagle Shoal robot hand, two RealSense SR300 cameras, and a UR5 robot arm. As shown in **Figure 3**, they are arranged around a table of length 600 mm and width 600 mm. A layer of sponge on the table surface provides protection, and a soft flannel sheet covers the table to avoid interference from light reflection. The UR5 robot arm with the Eagle Shoal robot hand is fixed at the back side of the table. One RealSense camera is placed on the opposite side of the table to record the front view of the grasp; the other is located to the left of the table to record the lateral view.

A general grasp dataset is built with varied shape, size, weight, grasp style, and other variables. The objects include cuboids, cylinders, and special shapes of different sizes, and their weights are varied by adding granules or water. Different grasping methods are tested by grasping from three directions: back, right, and top. Unstable grasp data are generated by inducing slip through added weight, changed grasp force, and adjusted grasp position (**Table 1**). The detailed process is as follows:


6. After putting the object back at the center of the table, the robot arm returns to its initial position and waits for the next loop.

**Table 1.**
*Dataset statistics.*

The proposed method is compared with traditional classifiers: k-nearest neighbors (KNN) [24], support vector machine (SVM) [25], and naive Bayes (NB) [26]. A total of 2550 grasp sets are divided into 80% for training and 20% for testing. The KNN classifier uses k = 3, and the SVM uses a radial basis function (RBF) kernel. The first criterion is the success rate, calculated as n/m, where n is the number of correct detections and m is the number of labeled samples. The comparison in **Table 2** shows that both the LSTM and the SVM achieve a high success rate. However, the SVM's detections fall on the falling edge of the signal, which means the SVM obtains a good classification score by learning falling-edge features; by the falling edge the object has already been dropped, so such detections cannot help realize a stable grasp. The SVM is therefore unsuitable for this task.
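A minimal scikit-learn sketch of this baseline comparison is given below; the synthetic features stand in for the recorded grasp data, and the feature dimensionality is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 2550 recorded grasp feature vectors.
X, y = make_classification(n_samples=2550, n_features=64, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80/20 split as in the text

baselines = {
    "KNN": KNeighborsClassifier(n_neighbors=3),  # k = 3
    "SVM": SVC(kernel="rbf"),                    # RBF kernel
    "NB": GaussianNB(),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))  # success rate n/m
    print(f"{name}: {acc:.4f}")
```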

Besides the success rate, another criterion is needed to evaluate slip detection. If the predicted output turns from 1 to 0 ahead of the corresponding transition in the label data, the sample is counted as an ahead sample; with $n_{\text{ahead}}$ such samples, the ahead rate is $n_{\text{ahead}}/m$. The results are shown in **Table 2**. Under these two criteria, the LSTM shows superior performance, attaining both a higher success rate and a higher ahead rate (**Figures 4** and **5**).
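Under the reading above (prediction and label are binary sequences that drop from 1 to 0 when the grasp becomes unstable), the ahead rate can be computed as sketched below; the sequence representation is an assumption.

```python
def first_drop(seq):
    """Index of the first 1 -> 0 transition in a binary sequence, or None."""
    for t in range(1, len(seq)):
        if seq[t - 1] == 1 and seq[t] == 0:
            return t
    return None

def ahead_rate(pred_seqs, label_seqs):
    """n_ahead / m: fraction of samples whose predicted drop precedes the labeled drop."""
    n_ahead = 0
    for pred, label in zip(pred_seqs, label_seqs):
        t_pred, t_label = first_drop(pred), first_drop(label)
        if t_pred is not None and t_label is not None and t_pred < t_label:
            n_ahead += 1
    return n_ahead / len(label_seqs)
```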


| **Classifier** | **Success rate** | **Ahead drop** | **Ahead forecast** |
|---|---|---|---|
| KNN | 0.7970 | 0.8176 | 0.4961 |
| SVM | 0.8467 | 0.6667 | 0.2569 |
| NB | 0.6881 | 0.6569 | 0.4843 |
| Ours | 0.9460 | 0.8588 | 0.6373 |

**Table 2.**
*Classification results of different classifiers.*

**Figure 4.**
*All grasp objects, from the YCB object set.*

**Figure 5.**
*Visual and tactile information visualization. Visual: grasping process video image sequence; tactile: grasping process tactile sensor values.*

**4. Conclusions**


In this chapter, an end-to-end approach for predicting grasp stability is proposed. Raw visual, tactile, and object intrinsic information is used, with the tactile sensor providing detailed information about contacts, forces, and compliance. More than 2500 grasps are collected autonomously, and a multimodal deep neural network is proposed to predict grasp stability from the different modalities. The results show that the visual-tactile fusion method improves the ability to predict grasp outcomes. To further validate the method, real-world evaluations of the different models in active grasping are carried out, and the experimental results demonstrate the superiority of the proposed method.

