*Advances in Forest Management under Global Change*

The traditional SIFT descriptor extracts only stable feature points from the image, which discards some image information and takes a long time to compute. The number of feature points also differs from image to image, so the resulting feature sets have different dimensions. Lazebnik et al. improved the number and distribution of SIFT descriptors to obtain dense SIFT [31]. The main difference between the dense SIFT (DSIFT) descriptor and the traditional SIFT descriptor is the sampling method. The SIFT descriptor constructs a scale space to detect and filter feature points. The dense SIFT algorithm instead slides a fixed-size rectangular window over the image from left to right and from top to bottom with a specified step size. The center of the window is used as a key point, and the image block of 16 × 16 pixels around the center is divided into units of 4 × 4 pixels. Within each unit, the SIFT algorithm is used to calculate the gradient histogram in 8 directions, giving a 4 × 4 × 8 = 128-dimensional feature vector that forms the DSIFT descriptor. The feature points extracted by this method are uniformly distributed and have the same size; they remain stable under changes in illumination and perspective, affine transformation, scaling, and rotation.

**3.2 Bag of visual word-based feature representation**

The bag of visual words (BOVW) model originates from the bag-of-words model, which was mainly applied to text classification and retrieval. The core idea of the bag-of-words model is to treat a text as a collection of words, ignoring word order, grammar, and syntax; the words are discrete and independent of each other, so the occurrence of one word does not depend on the presence of other words. The frequency of each word in the text is counted and represented as a histogram, so that each text is represented as a vector.

Owing to the successful application of this model in text retrieval, Csurka et al. introduced it to the field of computer vision as the BOVW model [32]. An image is treated as a document, and the features of the image (usually local features) as the words that make up the image. Unlike a text, an image has no ready-made words, so independent features must be extracted from the image; these are called visual words. Similar features can be regarded as the same visual word. In this way, the image can be described as an unordered set of visual words (local features). Although local features such as SIFT can also describe an image directly, each SIFT descriptor is a 128-dimensional vector, and an image contains hundreds or thousands of SIFT descriptors. The amount of computation is very large, so these vectors are clustered, and each cluster center is used to represent a visual word.

Image classification using the BOVW model mainly includes the following steps:

1. Image feature extraction and description: Local feature vectors of all training images are obtained through methods such as point-of-interest detection, dense sampling, or random sampling. Commonly used local features include the SIFT descriptor and the SURF descriptor.

2. Construct a visual vocabulary: After the local feature vectors of all sample images have been obtained, the k-means algorithm is used to cluster them. The k-means algorithm is an unsupervised learning algorithm. It divides the data into clusters through an iterative process in which the Euclidean distance between each data point and each cluster center is calculated [33]. The smaller the distance, the higher the similarity. k represents the number of clusters, and means represents the mean of the data in each cluster. If there are

**3.3 Classifiers**

#### *3.3.1 Support vector machines*

Support vector machines (SVMs) were proposed by Cortes and Vapnik in 1995 [34]. The SVM is a learning method based on statistical learning (VC) theory and the structural risk minimization principle. It performs well on small-sample, nonlinear, and high-dimensional pattern recognition problems. The basic idea of the SVM is to map vectors from the low-dimensional input space to a high-dimensional space through a nonlinear transformation defined by an inner product function. In this high-dimensional space, the optimal classification hyperplane is the one that maximizes the geometric distance between the support vectors and the classification plane. SVMs were initially developed for two-class problems in the linearly separable case. They require relatively small sample sizes and an appropriate training rule, which has led to their widespread use in image classification and recognition.

With the deepening of research on support vector machines, many scholars have developed toolkits to make them suitable for specific fields. This manuscript uses LIBLINEAR, a linear classifier library designed by Professor Chih-Jen Lin of National Taiwan University mainly for processing large-scale data and features [35]. LIBLINEAR can be used in the following three cases: when the number of features is much larger than the number of samples; when both the number of features and the number of samples are large; and when the number of features is much smaller than the number of samples. Because a linear classifier is less complex than a nonlinear classifier, the training time is greatly reduced, and with a large amount of data the performance of linear and nonlinear classifiers is comparable.
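To make the linear case concrete, the following is a minimal sketch of a two-class linear SVM trained by subgradient descent on the L2-regularized hinge loss. It is not LIBLINEAR and does not reproduce its solvers; the function names (`train_linear_svm`, `predict`), the hyperparameters, and the toy data are assumptions chosen only for illustration.

```python
# Illustrative linear SVM (hinge loss + L2 penalty); NOT the LIBLINEAR solver.
# All names, hyperparameters, and data below are hypothetical.

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=100):
    """Return (w, b) for labels in {+1, -1} using plain subgradient descent."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Gradient of the L2 regularizer: shrink the weights slightly.
            w = [(1.0 - lr * lam) * wj for wj in w]
            if margin < 1.0:
                # The sample lies inside the margin: push the hyperplane away.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable two-class data (hypothetical).
TOY_X = [[2, 2], [3, 1], [1, 3], [2, 3],
         [-2, -2], [-3, -1], [-1, -3], [-2, -3]]
TOY_Y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(TOY_X, TOY_Y)
train_accuracy = sum(predict(w, b, x) == y for x, y in zip(TOY_X, TOY_Y)) / len(TOY_Y)
print("weights:", w, "bias:", b, "training accuracy:", train_accuracy)
```

Each update costs time proportional to the number of features, which is one reason linear classifiers scale to the large feature counts discussed above. A multi-class problem would additionally need a scheme such as one-vs-rest, which this sketch omits.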

#### *3.3.2 Multi-layer perceptron*

The perceptron was proposed by Rosenblatt in 1958 [36]. It is an artificial neural network structure and the earliest feed-forward neural network. A single-layer perceptron contains only two layers, namely, the input layer and the output layer. Because of its limited mapping capability, it can only solve linearly separable classification problems. A multi-layer perceptron has one or more hidden layers between the input layer and the output layer and is mainly used for nonlinear classification and regression. It is trained in the same way as a traditional multilayer neural network, using the back-propagation algorithm.
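The restriction to linearly separable problems can be illustrated with the classic perceptron learning rule. The sketch below is a toy under stated assumptions: the function names, learning rate, and epoch limit are hypothetical, and the two truth tables are standard textbook examples, not data from this chapter. AND is linearly separable, so the rule converges; XOR is not, so no single-layer perceptron can classify all four points.

```python
# Toy single-layer perceptron with a step activation (illustrative only).

def train_perceptron(samples, targets, lr=1.0, epochs=50):
    """Perceptron learning rule; targets are 0 or 1. Returns (w, b)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(samples, targets):
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            delta = t - out
            if delta != 0:
                # Move the decision boundary toward the misclassified sample.
                errors += 1
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
                b += lr * delta
        if errors == 0:  # every sample classified correctly: converged
            break
    return w, b


def accuracy(w, b, samples, targets):
    """Fraction of samples whose thresholded output equals the target."""
    hits = 0
    for x, t in zip(samples, targets):
        out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(out == t)
    return hits / len(samples)


INPUTS = [[0, 0], [0, 1], [1, 0], [1, 1]]
AND_TARGETS = [0, 0, 0, 1]   # linearly separable
XOR_TARGETS = [0, 1, 1, 0]   # not linearly separable

and_acc = accuracy(*train_perceptron(INPUTS, AND_TARGETS), INPUTS, AND_TARGETS)
xor_acc = accuracy(*train_perceptron(INPUTS, XOR_TARGETS), INPUTS, XOR_TARGETS)
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

Adding a hidden layer, as in the multi-layer perceptron, is what removes this limitation.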

The perceptron in this manuscript uses a three-layer structure. Because the extracted features are 1000-dimensional vectors, the input layer contains 1000 nodes, the hidden layer contains 100 nodes, and the output layer contains 7 nodes, corresponding to the number of tea leaf disease types.
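The layer sizes above fix the shapes of the weight matrices. The following sketch builds a 1000-100-7 network with random weights and runs a single forward pass. The text does not state which activation functions this perceptron uses, so the ReLU hidden activation and softmax output here are assumptions, as are the initialization scheme, the random input vector, and every function name.

```python
import math
import random

LAYER_SIZES = (1000, 100, 7)  # input, hidden, output nodes as given in the text


def init_layer(n_in, n_out, rng):
    """Random Gaussian weights and zero biases (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, params):
    """Input -> hidden (ReLU, assumed) -> output (softmax, assumed)."""
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


def count_parameters(sizes):
    """Weights plus biases over all consecutive layer pairs."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


rng = random.Random(0)
params = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
features = [rng.random() for _ in range(LAYER_SIZES[0])]  # stand-in feature vector
probs = forward(features, params)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print("trainable parameters:", count_parameters(LAYER_SIZES))
print("predicted class index:", predicted_class)
```

Under these assumptions the network has 1000 × 100 + 100 + 100 × 7 + 7 = 100,807 trainable parameters, almost all of them in the first layer.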

*Automatic Recognition of Tea Diseases Based on Deep Learning*

*DOI: http://dx.doi.org/10.5772/intechopen.91953*

principle of neuron signal excitation. It sets the output of some neurons to 0, which makes the network sparse and reduces the interdependence of parameters, effectively alleviating overfitting. At the same time, ReLU propagates errors better and alleviates the vanishing gradient problem, so the network converges faster during training.

After the nonlinear neuron output of the first two convolutional layers, a local response normalization operation is introduced. It mimics the lateral inhibition phenomenon of neurobiology. Local response normalization creates a competition mechanism among the outputs of local neurons, making the large responses relatively larger and thereby enhancing the generalization ability of the model.

The first two fully connected layers use the dropout operation. The dropout technique is an effective solution to overfitting that trains only some randomly selected nodes rather than the entire network [37]. In this article, the dropout ratio is set to 0.5.

Softmax is the activation function of the last fully connected layer and is mainly used in the output layer of multi-class classification problems. It makes all output values sum to 1; that is, the output values for the classes are converted into relative probabilities, and the category with the highest relative probability is the predicted value.

**4.2 Training network**

LeafNet is trained using stochastic gradient descent (SGD). The weights of all convolutional layers and fully connected layers are initialized from a Gaussian distribution, and the biases are initialized with a constant of 1. This setting provides the ReLU activation function with positive inputs at the start of training and speeds up the training of the network [25]. Because the number of samples is small, the batch size is set to 16. Batch training can improve the convergence speed of the network and keep the memory usage at a low level. The initial learning rate of all layers of the network is set to 0.1. The learning rate is reduced according to the decline of the error: each reduction multiplies the current learning rate by 0.1, down to a minimum learning rate of 0.0001. The number of epochs is set to 100, the weight decay to 0.0005, and the momentum to 0.9 [38]. LeafNet is implemented using the MatConvNet toolbox for MATLAB. The network is trained on a Windows system with a Core i7-3770K CPU and 8 GB of RAM, and training is accelerated by two NVIDIA GeForce GTX 980 GPUs.

**5. Performance measurements**

As mentioned in [39], the classification accuracy and mean class accuracy (MCA) are used to evaluate the performance of the algorithm. $CCR_k$ is first defined as the correct classification rate for class $k$, as shown in Eq. (1):

$$
CCR_k = \frac{C_k}{N_k} \tag{1}
$$

where $C_k$ is the number of correctly identified elements of class $k$ and $N_k$ is the total number of elements in class $k$. Classification accuracy is then defined by Eq. (2):
