**3.1 Image visual feature**

The extraction and selection of image visual features is an important means of transforming image content into a quantitative, computable description. Such features mainly include global features and local features. Global features describe the overall attributes of the entire image, mainly color, texture, and shape, and correspond to properties that can be observed directly by the eye. Global features are pixel-level shallow features with good stability and real-time performance, and their extraction algorithms are simple and easy to implement. Their shortcomings are high feature dimensionality, a large amount of computation, and sensitivity to changes in image scale, lighting, and perspective. Local features are extracted from local areas of the image, including corners, lines, edges, and regions with special attributes. Local features are distinctive and robust to changes in lighting, rotation, perspective, and scale, and they have low dimensionality and are easy to implement.

The scale-invariant feature transform (SIFT) is a local feature descriptor proposed by David G. Lowe in 1999 [30]. The SIFT descriptor is invariant to image rotation, translation, and scaling and remains stable under affine transformation, changes in perspective and brightness, and noise. It can also be combined with other algorithms to form new, optimized algorithms, thereby increasing computation speed.

The traditional SIFT descriptor extracts only stable feature points in the image, which causes some image information to be lost and the computation time to be long. Moreover, the number of feature points extracted from each image differs, which inevitably leads to feature vectors of different dimensions. Lazebnik et al. improved the number and distribution of SIFT descriptors to obtain dense SIFT (DSIFT) [31]. The main difference between the dense SIFT descriptor and the traditional SIFT descriptor is the sampling method. The SIFT descriptor constructs a scale space to detect and filter feature points, whereas the dense SIFT algorithm slides a fixed-size rectangular window over the image from left to right and from top to bottom with a specified step size. The center of the window is taken as a key point, and an image block of 16 × 16 pixels around the center is divided into units of 4 × 4 pixels. Within each unit, the SIFT algorithm computes a gradient histogram over 8 directions, yielding a 4 × 4 × 8 = 128-dimensional feature vector that forms the DSIFT descriptor. The feature points extracted in this way are uniformly distributed and of identical specification, and they remain stable under illumination changes, perspective changes, affine transformation, scaling, and rotation.
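
As a rough illustration of this sampling scheme, the following Python sketch computes SIFT descriptors on a regular grid. It assumes OpenCV (cv2.SIFT_create, available in OpenCV 4.4+); the grid step, patch size, and function name are illustrative and not taken from the chapter.

```python
import cv2

def dense_sift(gray, step=8, patch_size=16):
    """Sketch of dense SIFT: place keypoints on a regular grid and compute
    a 128-dimensional SIFT descriptor (4 x 4 cells x 8 orientations) at each."""
    sift = cv2.SIFT_create()
    half = patch_size // 2
    keypoints = [
        cv2.KeyPoint(float(x), float(y), float(patch_size))
        for y in range(half, gray.shape[0] - half, step)
        for x in range(half, gray.shape[1] - half, step)
    ]
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors  # shape: (number of grid points, 128)

# Example usage (file name is hypothetical):
# gray = cv2.imread("tea_leaf.jpg", cv2.IMREAD_GRAYSCALE)
# desc = dense_sift(gray)
```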

**3.2 Bag of visual word-based feature representation**

The bag of visual word (BOVW) model was originally applied to text classification and retrieval. Its core idea is to treat a text as a collection of different words while ignoring word order, grammar, and syntax; the words are discrete and independent of one another, i.e., the presence of one word does not depend on the presence of any other. The frequency of each word in the text is counted and represented as a histogram, so that each text is represented as a vector.

Owing to the successful application of the BOVW model in text retrieval, Csurka et al. introduced it into the field of computer vision [32]. An image is regarded as a document, and the features of the image (usually local features) as the words that make it up. Unlike a text, an image contains no ready-made words, so independent features must be extracted from the image; these are called visual words, and similar features can be regarded as the same visual word. In this way, an image can be described as an unordered set of visual words (local features). Although local features (such as SIFT) can also describe an image, each SIFT descriptor is a 128-dimensional vector and an image contains hundreds or thousands of descriptors, so the amount of computation is very large. These vectors are therefore clustered, and the cluster centers are used to represent visual words.

Image classification using the BOVW model mainly includes the following steps (a sketch of steps 2 and 3 is given after the list):

1. Extracting local features: DSIFT descriptors are extracted from each image as described in Section 3.1.

2. Constructing the visual vocabulary: the extracted descriptors are clustered, and each cluster center is treated as a visual word. If clustering produces k cluster centers (i.e., visual words), then the size of the visual vocabulary is also k. This manuscript selects 1000 visual words, so the size of the visual vocabulary is 1000.

3. Representing images by word frequency: using the vocabulary as a reference, count the number of occurrences of each visual word in the image; each image then becomes a word-frequency vector corresponding to the visual word sequence in the vocabulary, that is, each image is represented by a 1000-dimensional numerical vector.

4. Classification: a classifier is selected, and the 1000-dimensional numerical vector generated in the previous step is used as its input.
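
The following is a minimal sketch of steps 2 and 3. It assumes a k-means clustering step (the chapter only states that the descriptors are clustered) and uses scikit-learn; apart from the vocabulary size of 1000, all names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptor_list, k=1000):
    """Cluster all training descriptors; the k cluster centers act as visual words."""
    all_descriptors = np.vstack(descriptor_list)      # stack per-image (n_i, 128) DSIFT arrays
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, kmeans):
    """Assign each descriptor to its nearest visual word and count word frequencies."""
    words = kmeans.predict(descriptors)               # visual-word index for each descriptor
    hist, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
    return hist.astype(np.float32)                    # 1000-dimensional image representation

# descriptor_list holds one (num_keypoints, 128) array per training image, e.g., from dense_sift above.
# kmeans = build_vocabulary(descriptor_list, k=1000)
# X_train = np.array([bovw_histogram(d, kmeans) for d in descriptor_list])
```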

**3.3 Classifiers**

*3.3.1 Support vector machines*

Support vector machines (SVMs) were proposed by Corinna Cortes and Vapnik in 1995 [34]. The SVM is a learning method based on VC statistical theory and the structural risk minimization criterion, and it has advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems. The basic idea of the SVM is to map vectors from a low-dimensional space to a high-dimensional space through a nonlinear transformation defined by the inner product. In this high-dimensional space, the optimal classification hyperplane is determined by maximizing the geometric margin between the support vectors and the classification plane. SVMs were initially used for two-class problems in the linearly separable case; they require relatively small sample sizes and an appropriate training rule, which has led to their widespread use in image classification and recognition.
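
For reference, this idea corresponds to the standard soft-margin formulation below, where $\phi$ denotes the nonlinear mapping to the high-dimensional space; the chapter does not state which exact variant it uses:

$$
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0,
$$

where maximizing the geometric margin $2/\lVert \mathbf{w} \rVert$ is equivalent to minimizing $\lVert \mathbf{w} \rVert^{2}/2$, and the slack variables $\xi_i$ allow for data that are not perfectly separable.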

With the deepening of research on support vector machines, many scholars have developed toolkits to adapt them to specific fields. In this manuscript, the linear classifier LIBLINEAR, designed by Professor Chih-Jen Lin of National Taiwan University, is used; it is intended mainly for processing large-scale data and features [35]. LIBLINEAR is suitable in the following three cases: when the number of features is much larger than the number of samples; when the numbers of features and samples are both large; and when the number of features is much smaller than the number of samples. Because the complexity of a linear classifier is lower than that of a nonlinear classifier, the training time is greatly reduced, and the performance of linear and nonlinear classifiers is comparable on a large amount of data.
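
A minimal training sketch, assuming that scikit-learn's LinearSVC (which is built on the LIBLINEAR library) stands in for the chapter's LIBLINEAR setup; the regularization parameter and preprocessing are illustrative.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# X_train: (num_images, 1000) BOVW word-frequency vectors; y_train: tea disease labels.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
# clf.fit(X_train, y_train)
# predictions = clf.predict(X_test)
```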

*3.3.2 Multi-layer perceptron*

The perceptron was proposed by Rosenblatt in 1958 [36]. It is an artificial neural network structure and the earliest feed-forward neural network. A single-layer perceptron contains only two layers, namely, the input layer and the output layer; because of its limited mapping capability, it can only solve linearly separable classification problems. A multi-layer perceptron has one or more hidden layers between the input layer and the output layer and is mainly used for nonlinear classification and regression. Its training algorithm is consistent with that of the traditional multi-layer neural network and likewise uses back-propagation.

The perceptron in this manuscript uses a three-layer structure. Because the extracted features are 1000-dimensional vectors, the input layer contains 1000 nodes; a single hidden layer and an output layer complete the three-layer structure.
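
A minimal sketch of such a three-layer perceptron using scikit-learn's MLPClassifier; the hidden-layer size, activation, and solver are assumptions, since the excerpt does not state the values used in the chapter.

```python
from sklearn.neural_network import MLPClassifier

# Input layer: 1000 nodes (the BOVW vector); one hidden layer; output layer sized to the classes.
mlp = MLPClassifier(hidden_layer_sizes=(100,),  # single hidden layer; size is an assumption
                    activation="relu",
                    solver="adam",              # gradient-based training via back-propagation
                    max_iter=500)
# mlp.fit(X_train, y_train)
# predictions = mlp.predict(X_test)
```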
