**4. Experiments and results**

#### **4.1 Annotation results**

In the first trial, Cohen's *kappa* and Krippendorff's *alpha* were 0.80 and 0.78, respectively, which is highly acceptable for this study since the scores measured the overall attribute and polarity. To identify the category with the largest variation between the two coders, Cohen's *kappa* was calculated separately for each label. The results (**Table 1**) indicated that polarity had the highest agreement, while attribute showed lower agreement between the two annotators. At the end of the first trial, both coders discussed the issues they encountered while annotating the corpus and made efforts to improve the preliminary annotation schema. The problems included dealing with sentences for which it is difficult to assign aspects.

Based on the revisions of the annotation schema, the coders conducted a second trial. With the revised annotation schema, Cohen's *kappa* for attribute and polarity reached 0.89 and 0.91, respectively. In addition, Cohen's *kappa* and Krippendorff's *alpha* for the aspect-sentiment pair were computed at the end of the second trial, yielding 0.82 and 0.81, respectively, which indicates that the annotation schema in this study is valid.
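For illustration, agreement scores of this kind can be computed with off-the-shelf Python libraries. The sketch below is a minimal example assuming scikit-learn and the third-party `krippendorff` package are installed; the annotator labels are invented for the example and are not taken from the corpus.

```python
# Minimal sketch: inter-annotator agreement for two coders.
# Assumes scikit-learn and the third-party "krippendorff" package;
# the example labels below are invented for illustration.
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Hypothetical attribute labels assigned by two annotators to the same sentences.
coder_a = ["room", "service", "location", "room", "food"]
coder_b = ["room", "service", "room", "room", "food"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Krippendorff's alpha expects a reliability matrix (annotators x items),
# so map the string labels to integer codes first.
labels = sorted(set(coder_a) | set(coder_b))
to_id = {label: i for i, label in enumerate(labels)}
reliability = [[to_id[l] for l in coder_a], [to_id[l] for l in coder_b]]
alpha = krippendorff.alpha(reliability_data=reliability,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```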

#### **4.2 Model training**

The experiment was conducted on a dataset of TripAdvisor hotel reviews containing 5506 sentences, in which the numbers of positive, neutral, and negative sentiment samples are 3032, 2986, and 2725, respectively. Given a dataset, maximizing the predictive performance and training efficiency of a model requires finding the optimal network architecture and tuning the hyper-parameters. In addition, the composition of the samples can significantly affect model performance. To investigate the effect of sentiment sample fractions on model performance, four sub-datasets of 4000 sentiment samples each, with different sentiment fractions, were resampled from the TripAdvisor hotel dataset as train sets: one balanced dataset and three unbalanced datasets in which positive, neutral, and negative samples dominate, respectively. It is also observed that the average number of aspects in a sentence is about 1.4 and the average length of an aspect is about 8.0 characters, indicating that one sentence normally contains more than one aspect and an aspect contains eight characters on average. The train and test sets contain more than 850 and 320 aspects, respectively, which confirms the diversity of aspects in the TripAdvisor hotel review dataset. For each train set, 20% of the reviews were selected as the validation set.

**Table 1.**

*Cohen's kappa for categories of aspect and polarity.*

Attention-based gated RNN models, including LSTM and GRU, were used for ABSA. Attention-based GRU/LSTM without and with aspect embedding are referred to as AT-GRU/AT-LSTM and ATAE-GRU/ATAE-LSTM, respectively. The configurations and hyper-parameters used are summarized in **Table 2**. In the experiments, all word embeddings, with a dimension of 300, were initialized by GloVe [17]; the word embeddings were pre-trained on an unlabeled corpus of about 840 billion tokens. The dimensions of the hidden layer vectors and the aspect embedding are 300 and 100, respectively. The weight matrices are initialized from the uniform distribution U(−0.1, 0.1), and the bias vectors are initialized to zero. The learning rate and mini-batch size are 0.001 and 16, respectively. The best optimizer and number of epochs were selected from {SGD, Adam, AdaBelief} and {100, 300, 500}, respectively, via grid search. The parameters yielding the best performance on the validation set were kept, and the resulting optimal model was used for evaluation on the test set.
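As a point of reference, the sketch below outlines an attention-based LSTM with aspect embedding in the spirit of ATAE-LSTM, in PyTorch. The layer sizes follow the configuration described above, but the wiring details (how the aspect vector is concatenated at the input and in the attention layer) are assumptions for illustration, not the authors' implementation.

```python
# A minimal PyTorch sketch of an attention-based LSTM with aspect embedding
# (ATAE-LSTM style). Layer sizes follow the configuration in the text; the
# wiring details are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn

class ATAELSTM(nn.Module):
    def __init__(self, vocab_size, num_aspects, emb_dim=300,
                 hidden_dim=300, aspect_dim=100, num_classes=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)      # init from GloVe in practice
        self.aspect_emb = nn.Embedding(num_aspects, aspect_dim)
        # Aspect vector is appended to every word embedding at the input.
        self.lstm = nn.LSTM(emb_dim + aspect_dim, hidden_dim, batch_first=True)
        # Attention scores computed from [hidden state; aspect vector].
        self.attn = nn.Linear(hidden_dim + aspect_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim, num_classes)
        self.dropout = nn.Dropout(0.5)                          # dropout ratio from the text

    def forward(self, tokens, aspect):                          # tokens: (B, T), aspect: (B,)
        T = tokens.size(1)
        a = self.aspect_emb(aspect)                             # (B, aspect_dim)
        a_rep = a.unsqueeze(1).expand(-1, T, -1)                # repeat along the time axis
        x = torch.cat([self.word_emb(tokens), a_rep], dim=-1)   # (B, T, emb + aspect)
        h, _ = self.lstm(x)                                     # (B, T, hidden)
        scores = self.attn(torch.cat([h, a_rep], dim=-1))       # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)                    # attention weights over time
        r = (alpha * h).sum(dim=1)                              # weighted sentence representation
        return self.out(self.dropout(r))                        # logits over 3 sentiment classes

model = ATAELSTM(vocab_size=10000, num_aspects=900)
logits = model(torch.randint(0, 10000, (16, 24)), torch.randint(0, 900, (16,)))
print(logits.shape)  # torch.Size([16, 3])
```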

**Table 2.**

*Details of configurations and used hyper-parameters.*

The aim of training is to minimize the cross-entropy error between the target sentiment distribution *y* and the predicted sentiment distribution *ŷ*. However, overfitting is a common issue during training. To avoid overfitting, regularization procedures including L2-regularization, early stopping, and dropout were used in the experiment. L2-regularization adds the squared magnitude of the coefficients as a penalty term to the loss function:

$$\text{loss} = -\sum\_{i} \sum\_{j} y\_i^j \log \hat{y}\_i^j + \lambda ||\theta||^2 \tag{15}$$

where *i* is the index of the review; *j* is the index of the sentiment class (the classification in this paper is three-way); *λ* is the L2-regularization coefficient, which modifies the learning rule to multiplicatively shrink the parameter set on each step before performing the usual gradient update; and *θ* is the parameter set.

On the other hand, early stopping is a commonly used and effective way to avoid over-fitting. It reliably occurs that the training error decreases steadily over time while the validation error begins to rise again at some point. Therefore, early stopping terminates training when no parameters have improved over the best-recorded validation error for a pre-specified number of iterations. Additionally, dropout is a simple way to prevent a neural network from overfitting; it refers to temporarily removing cells and their connections from the network [55]. In an RNN model, dropout can be applied to the input, output, and hidden layers. In this study, only the output layer, with a dropout ratio of 0.5, was followed by a linear layer to transform the feature representation into the conditional probability distribution.
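A minimal sketch of how these three procedures can be combined in a single training loop is given below. The model, data loaders, and the weight-decay value are placeholders; only the patience-based stopping logic and the snapshot of the best parameters reflect the procedure described above.

```python
# Minimal sketch of the three regularization procedures in one training loop:
# L2 regularization (weight decay), dropout (inside the model), and early
# stopping. Model, data loaders, and the weight-decay value are placeholders.
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader,
                              epochs=500, patience=5):
    # weight_decay adds the lambda * ||theta||^2 penalty of Eq. (15);
    # the value 1e-5 is a placeholder, not the chapter's setting.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    criterion = torch.nn.CrossEntropyLoss()
    best_loss, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(epochs):
        model.train()                      # enables dropout layers
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

        model.eval()                       # disables dropout for evaluation
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

        if val_loss < best_loss:           # new best: snapshot the parameters
            best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # stop after `patience` epochs without improvement
                break

    model.load_state_dict(best_state)      # return to the lowest-validation-error point
    return model
```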

Optimizers are algorithms that update the attributes of the neural network, such as the parameter set and the learning rate, to reduce the loss and provide the most accurate results possible. Three optimizers, namely SGD [56], Adam [57], and AdaBelief [58], were used in the experiment to search for the best performance. Standard SGD uses a randomly selected batch of samples from the train set to compute the derivative of the loss, on which the update of the parameter set depends. The updates of standard SGD are noisy because the derivative does not always point toward the minimum; as a result, standard SGD may take longer to converge and may get stuck at local minima. To overcome this issue, SGD with momentum was proposed by Polyak [56] to denoise the derivative by incorporating previous gradient information into the current update of the parameter set. Given a loss function *f*(*θ*) to be optimized, SGD with momentum is given by:

$$
v\_{t+1} = \beta v\_t - \alpha \mathbf{g}\_t \tag{16}
$$

$$
\theta\_{t+1} = \theta\_t + v\_{t+1} \tag{17}
$$

where *α* > 0 is the learning rate; *β* ∈ [0, 1] is the momentum coefficient, which determines the degree to which the previous gradient contributes to the update of the parameter set; and *g<sub>t</sub>* = ∇*f*(*θ<sub>t</sub>*) is the gradient at *θ<sub>t</sub>*.
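As a quick illustration, the toy sketch below applies Eqs. (16) and (17) to a one-dimensional quadratic loss; the loss function and hyper-parameter values are invented for the example.

```python
# Numeric sketch of the SGD-with-momentum update of Eqs. (16)-(17),
# minimizing the toy loss f(theta) = theta^2 (chosen for illustration).
alpha, beta = 0.1, 0.9          # learning rate and momentum coefficient
theta, v = 5.0, 0.0             # initial parameter and velocity

for step in range(200):
    g = 2 * theta               # gradient of theta^2
    v = beta * v - alpha * g    # Eq. (16): velocity accumulates past gradients
    theta = theta + v           # Eq. (17): parameter update
print(round(theta, 4))          # converges toward the minimum at 0
```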

Both Adam and AdaBelief are adaptive learning-rate optimizers. Adam records the first moment of the gradient, *m<sub>t</sub>*, which is similar to SGD with momentum, and at the same time the second moment of the gradient, *v<sub>t</sub>*. *m<sub>t</sub>* and *v<sub>t</sub>* are updated using the exponential moving average (EMA) of *g<sub>t</sub>* and *g<sub>t</sub>*<sup>2</sup>, respectively:

$$m\_{t+1} = \beta\_1 m\_t + (1 - \beta\_1) \mathbf{g}\_t \tag{18}$$

$$
v\_{t+1} = \beta\_2 v\_t + (1 - \beta\_2) \mathbf{g}\_t^2 \tag{19}
$$

where *β*<sub>1</sub> and *β*<sub>2</sub> are exponential decay rates.

The second moment of the gradient in AdaBelief, *s<sub>t</sub>*, is updated using the EMA of (*g<sub>t</sub>* − *m<sub>t</sub>*)<sup>2</sup>, a simple modification of Adam that requires no extra parameters:

$$s\_{t+1} = \beta\_2 s\_t + (1 - \beta\_2) \left(\mathbf{g}\_t - m\_t\right)^2 \tag{20}$$

The update rules for the parameter set using Adam and AdaBelief are given by Eqs. (21) and (22), respectively:

$$
\theta\_{t+1} = \theta\_t - \frac{am\_t}{\sqrt{v\_t} + \varepsilon} \tag{21}
$$

$$
\theta\_{t+1} = \theta\_t - \frac{am\_t}{\sqrt{s\_t} + \varepsilon} \tag{22}
$$

where *ε* is a small number, typically set to 10<sup>−8</sup>.

Specifically, the update direction in Adam is *m<sub>t</sub>*/√*v<sub>t</sub>*, while the update direction in AdaBelief is *m<sub>t</sub>*/√*s<sub>t</sub>*. Intuitively, 1/√*s<sub>t</sub>* is the "belief" in the observation: viewing *m<sub>t</sub>* as the prediction of *g<sub>t</sub>*, AdaBelief takes a large step when the observation *g<sub>t</sub>* is close to the prediction *m<sub>t</sub>*, and a small step when the observation deviates greatly from the prediction.
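The difference between the two update rules can be seen in a few lines of code. The scalar sketch below applies Eqs. (18)-(22) to a single parameter; the bias-correction terms used by the full algorithms are omitted to match the simplified equations above, and the numeric inputs are invented.

```python
# Scalar sketch of the Adam and AdaBelief updates of Eqs. (18)-(22).
# Bias correction, used by the full algorithms, is omitted to match the
# simplified equations in the text; the numeric inputs are invented.
import math

def adam_like_step(theta, g, m, v, s, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g              # Eq. (18): EMA of the gradient
    v = b2 * v + (1 - b2) * g * g          # Eq. (19): EMA of g^2 (Adam)
    s = b2 * s + (1 - b2) * (g - m) ** 2   # Eq. (20): EMA of (g - m)^2 (AdaBelief)
    theta_adam = theta - alpha * m / (math.sqrt(v) + eps)       # Eq. (21)
    theta_adabelief = theta - alpha * m / (math.sqrt(s) + eps)  # Eq. (22)
    return theta_adam, theta_adabelief, m, v, s

# When the observed gradient g is close to the prediction m, (g - m)^2 is
# small, so s < v and AdaBelief takes the larger step.
print(adam_like_step(theta=1.0, g=0.5, m=0.5, v=0.25, s=0.01)[:2])
```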

It is noted that the best models on the validation set were obtained by returning to the parameter set at the point in time with the lowest validation error.

#### **4.3 Results and analysis**

Given the confusion matrix of a multi-class classification task, accuracy is the most basic evaluation measure. Accuracy represents the proportion of correct predictions of the trained model and can be calculated as:

$$Accuracy = \frac{\sum\_{i=1}^{C} TP\_i}{N} \tag{23}$$

where *C* is the number of classes (*C* equals 3 in this study); *N* is the number of samples in the test set; and *TP<sub>i</sub>* is the number of correct predictions for samples of the *i*-th class, which lie on the diagonal of the confusion matrix. In addition to accuracy, classification effectiveness is usually evaluated in terms of macro precision and recall, which are computed per class. As **Figure 1** illustrates, the class being measured is referred to as the positive class, and the remaining classes are collectively referred to as the negative class. Precision is the proportion of correct predictions among all predictions of the positive class, while recall is the proportion of correct predictions among all positive instances. The macro F1-score is the harmonic mean of macro precision and macro recall. The macro-averaged measures take the evaluation of each class into consideration and can be computed as:

$$\text{MacroPrecision} = \frac{1}{C} \sum\_{i=1}^{C} \frac{TP\_i}{TP\_i + FP\_i} \tag{24}$$

$$\text{MacroRecall} = \frac{1}{C} \sum\_{i=1}^{C} \frac{TP\_i}{TP\_i + FN\_i} \tag{25}$$

$$Macro-F1 = \frac{2 \times \text{MacroPrecision} \times \text{MacroRecall}}{\text{MacroPrecision} + \text{MacroRecall}} \tag{26}$$

where *FP<sub>i</sub>* and *FN<sub>i</sub>* are the numbers of false positive and false negative predictions for the *i*-th class, respectively.
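For reference, all four measures can be computed directly from a confusion matrix; the sketch below uses an invented 3 × 3 matrix for illustration.

```python
# Sketch of Eqs. (23)-(26): accuracy and macro-averaged precision, recall,
# and F1-score computed from a confusion matrix. The 3x3 matrix below is
# invented for illustration (rows: true class, columns: predicted class).
import numpy as np

cm = np.array([[50, 5, 5],     # positive
               [10, 30, 10],   # neutral
               [5, 5, 40]])    # negative

tp = np.diag(cm).astype(float)           # TP_i: diagonal entries
fp = cm.sum(axis=0) - tp                 # FP_i: column sum minus diagonal
fn = cm.sum(axis=1) - tp                 # FN_i: row sum minus diagonal

accuracy = tp.sum() / cm.sum()                          # Eq. (23)
macro_p = np.mean(tp / (tp + fp))                       # Eq. (24)
macro_r = np.mean(tp / (tp + fn))                       # Eq. (25)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)  # Eq. (26)
print(f"A={accuracy:.3f} P={macro_p:.3f} R={macro_r:.3f} F={macro_f1:.3f}")
```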

**Figure 1.** *Summary of model performance.*

This study computed the accuracy (A), macro precision (P), macro recall (R), and macro F1-score (F) of AT-GRU, ATAE-GRU, AT-LSTM, and ATAE-LSTM trained with various optimizers and numbers of epochs. The results show: (1) Attention-based models (AT-GRU and AT-LSTM) performed better than attention-based models with aspect embedding (ATAE-GRU and ATAE-LSTM); taking Dataset 1 as an example, the best test-set accuracy using AT-GRU was 80.7%, while the best accuracy using ATAE-GRU was 75.3%. (2) Attention-based GRU performed better than attention-based LSTM; taking AT-GRU and AT-LSTM as examples, the accuracy and macro F1-score of AT-GRU were higher than those of AT-LSTM for all datasets. (3) The balanced dataset (Dataset 1) achieved the best predictive performance for all models. For the unbalanced datasets, the accuracy was close to that of the balanced dataset, but the macro precision, recall, and F1-score were significantly lower, which confirmed that the balanced dataset had the best generalization and stability in this study. (4) For Dataset 3, in which neutral sentiment samples dominated, all models exhibited the worst predictive performance compared with the other datasets. The candidate model for each dataset is shown in **Figure 1**. The candidate model was selected according to accuracy; however, when the accuracies of models were similar, the model with the higher macro F1-score was selected instead. Among the 16 models, AT-GRU trained with the AdaBelief optimizer for 300 epochs on Dataset 1 achieved both the highest accuracy of 80.7% and the highest macro F1-score of 75.0%. **Figure 2** illustrates the normalized confusion matrix of the best predictive model, whose diagonal represents the per-class precisions. The precisions of positive and negative sentiment classification were about 20% higher than that of neutral sentiment classification, which confirms the need to boost the precision of neutral sentiment classification in order to improve the overall accuracy of the model in future work.

**Figure 2.** *Normalized confusion matrix of model with best predictive performance.*

Early stopping was used in this research to avoid overfitting and save training time. **Figure 3** illustrates the learning history of AT-GRU with early stopping on the four datasets, where training stopped when the validation loss kept increasing for 5 epochs (i.e., the "patience" equals 5 in this study). For all datasets, the validation accuracy closely tracked the training accuracy during training, which confirmed that early stopping effectively avoided overfitting. Experimental A/P/R/F results were obtained by training AT-GRU and AT-LSTM with early stopping, and the accuracies obtained by AT-GRU and AT-LSTM were similar. For the balanced dataset, the accuracy and macro F1-score obtained with early stopping were significantly lower than those obtained by the corresponding model without early stopping. This is probably because training terminated at a local minimum of the loss function when the validation loss had risen for 5 epochs. All of the optimizers used in this study aim to avoid the loss function getting stuck at local minima so as to find the global minimum; therefore, training with more epochs was effective in obtaining the model with the best predictive performance. On the other hand, for the unbalanced datasets, the accuracy and macro F1-score obtained with early stopping were similar to those obtained by the corresponding model without early stopping, which indicated that early stopping effectively avoided overfitting as the loss converged quickly on the unbalanced datasets. Although early stopping is a straightforward way of avoiding overfitting and improving training efficiency, the trade-off is that the model returned for the test set may correspond to a local minimum of the loss function, especially for the balanced dataset, and a new hyper-parameter, "patience", to which the results are sensitive, is introduced.

**Figure 3.** *Learning history of AT-GRU using early stopping.*

Three optimizers were used in this study to find the best model. **Figure 4** illustrates the learning history of AT-GRU on the four datasets. The gap between training and validation accuracy was largest for Adam, which indicated the worst generalization among the three optimizers in this study, although Adam converged quickly at the very beginning except for Dataset 3. Both SGD and AdaBelief achieved good predictive performance with good generalization; however, AdaBelief converged faster than SGD, and the best results were achieved by AdaBelief.

**Figure 4.** *Learning history of AT-GRU.*
