**2. Literature review**

This section provides an extensive review of the state-of-the-art techniques for ABSA and of the studies applying DL in tourism.

#### **2.1 Input vectors**

To convert NLP problems into a form that computers can process, texts must first be transformed into numerical values. In ML-based approaches, one-hot encoding and count vectorizers are commonly used. One-hot encoding yields a token-level representation of a sentence. However, it usually leads to high-dimensionality issues and is therefore computationally inefficient [15]. Another issue is the difficulty of extracting meaning: this approach assumes that the words in a sentence are independent, so their similarities can be measured by neither distance nor cosine similarity. As for the count vectorizer, although it can convert a whole sentence into a single vector, it cannot account for word order or context.
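As a toy illustration of both limitations, the sketch below uses a hypothetical four-word vocabulary and scikit-learn's CountVectorizer to show that distinct one-hot vectors are always orthogonal and that bag-of-words counts discard word order.

```python
# A minimal sketch contrasting the two classic ML-era text encodings
# discussed above. The toy corpus and vocabulary are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vocab = ["hotel", "room", "suite", "noisy"]
one_hot = np.eye(len(vocab))  # token-level one-hot: one row per vocabulary word

# Distinct one-hot vectors are orthogonal, so cosine similarity is 0:
# "room" and "suite" look as unrelated as "room" and "noisy".
room, suite = one_hot[1], one_hot[2]
print(room @ suite / (np.linalg.norm(room) * np.linalg.norm(suite)))  # 0.0

# CountVectorizer collapses a sentence into one bag-of-words vector,
# discarding word order: the two sentences below get identical encodings.
vec = CountVectorizer()
X = vec.fit_transform(["the room was clean not noisy",
                       "the room was noisy not clean"])
print((X[0].toarray() == X[1].toarray()).all())  # True: order is lost
```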

In contrast, DL-based approaches use pre-trained word embeddings, as proposed in [16, 17]. Word embedding, or word representation, refers to a learned representation of text in which words with similar meanings have similar representations. It has been shown that using word embeddings as input vectors yields a 6–9% improvement in aspect extraction [18] and a 2% improvement in sentiment polarity identification [19]. Pre-trained word embeddings are favored because random initialization can cause stochastic gradient descent (SGD) to become trapped in local minima [20]. Based on the neural network language model, a feedforward architecture combining a linear projection layer and a non-linear hidden layer can learn word vector representations together with a statistical language model [21].

Word2Vec [16] introduced the skip-gram and continuous bag-of-words (CBOW) models. Given a window size, skip-gram predicts the context from a given word, while CBOW predicts a word from its context. Because word frequency is well suited to deriving classes in neural network language models, frequent words are assigned short binary codes in a Huffman tree; this practice in Word2Vec reduces the number of output units that need to be evaluated. However, the window-based approach of Word2Vec does not operate on co-occurrence statistics of the text and does not exploit the large amount of repetition in texts. Therefore, to capture a global representation of words across all sentences, GloVe takes advantage of the nonzero elements of a word-word co-occurrence matrix [17]. Although the models discussed above perform well on similarity tasks and named entity recognition, they cannot cope with polysemous words. More recently, Embeddings from Language Models (ELMo) [22] and Bidirectional Encoder Representations from Transformers (BERT) [23] can identify context-sensitive features in a corpus. The main difference between the two architectures is that ELMo is feature-based, whereas BERT is deeply bidirectional. Specifically, in ELMo the contextual representation of each token is obtained by concatenating its left-to-right and right-to-left representations. In contrast, BERT applies a masked language model (MLM) to acquire pre-trained deep bidirectional representations. The MLM randomly masks certain tokens in the input and predicts the vocabulary IDs of the masked tokens based only on their context. Additionally, BERT is capable of addressing long-range dependencies in text.
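The sketch below shows how the two Word2Vec variants and the Huffman-tree output layer map onto gensim's API; the tiny corpus is purely illustrative, and real training requires far more text.

```python
# A minimal gensim sketch of the two Word2Vec variants described above.
from gensim.models import Word2Vec

corpus = [["the", "hotel", "staff", "were", "friendly"],
          ["the", "room", "was", "clean", "and", "quiet"],
          ["friendly", "staff", "and", "a", "clean", "room"]]

# sg=1 -> skip-gram (predict context from a word);
# sg=0 -> CBOW (predict a word from its context).
# hs=1 (with negative=0) enables hierarchical softmax, the Huffman-tree
# output layer mentioned above that avoids scoring every output unit.
skip_gram = Word2Vec(corpus, vector_size=50, window=5, sg=1,
                     hs=1, negative=0, min_count=1)
cbow = Word2Vec(corpus, vector_size=50, window=5, sg=0,
                hs=1, negative=0, min_count=1)

print(skip_gram.wv.most_similar("room", topn=3))
```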

Nonetheless, researchers have combined certain features with word embeddings to produce more pertinent results. These features include Part-Of-Speech (POS) and chunk tags, and commonsense knowledge. It has been observed that aspect terms are usually nouns or noun phrases [8]. The original word embeddings of the texts are concatenated with k-dimensional binary vectors representing the k POS or chunk tags, and the concatenated word embeddings are fed into the models [9]. It has been shown that using POS tags as input improves aspect extraction performance, with gains ranging from 1% [18, 20] to 4% [24].

Apart from POS, concepts closely related to affect have been suggested as additional word embeddings [25, 26]. POS tags focus on the grammatical role of the words in a corpus, whereas concepts extracted from SenticNet emphasize multi-word expressions and the dependency relations between clauses. For example, the multi-word expression "win lottery" could be related to the emotion "Arise-joy", and the single-word expression "dog" is associated with the property "Isa-pet" and the emotion "Arise-joy" [26]. After a text is parsed by SenticNet, the obtained concept-level information (properties and emotions) is embedded into deep neural sequential models. The performance of Long Short-Term Memory (LSTM) [27] combined with SenticNet exceeded that of the baseline LSTM [26].
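As a minimal sketch of the POS-concatenation idea described above, the snippet below appends a k-dimensional one-hot POS vector to each word embedding; the tag set and dimensions are hypothetical.

```python
# Illustrative feature concatenation: each token's pre-trained word
# embedding is extended with a one-hot POS indicator before being fed
# into the downstream model.
import numpy as np

EMB_DIM = 100
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]  # hypothetical tag set, k = 4
k = len(POS_TAGS)

def concat_pos(word_vec: np.ndarray, pos: str) -> np.ndarray:
    """Append a k-dimensional one-hot POS vector to a word embedding."""
    pos_vec = np.zeros(k)
    pos_vec[POS_TAGS.index(pos)] = 1.0
    return np.concatenate([word_vec, pos_vec])  # shape: (EMB_DIM + k,)

token_vec = concat_pos(np.random.randn(EMB_DIM), "NOUN")
print(token_vec.shape)  # (104,)
```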

#### **2.2 DL methods for ABSA**

This section reviews the DL methods used for ABSA, including Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Attention-based RNN, and Memory Network.

#### *2.2.1 CNN*

CNN can learn to capture fixed-length expressions based on the assumption that keywords usually include the aspect terms, with little dependence on their positions [28]. Besides, as CNN is a non-linear model, it usually outperforms linear models and rarely relies on language rules [29]. To extract aspects, a local feature window of five words was first created for each word in the sentence; a seven-layer CNN was then tested and produced better results [29]. To capture multi-word expressions, the model proposed in [30] contained two separate convolutional layers with non-linear gates. N-gram features can be obtained by convolutional layers with multiple filters. Li et al. [13] fed the position information between the aspect words and the context words into the CNN input layer and introduced aspect-aware transformation components. Fan et al. [31] integrated the attention mechanism with a convolutional memory network. The proposed model can learn multi-word expressions in a sentence and identify long-distance dependencies.
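The sketch below illustrates how convolutional layers with multiple filter widths produce n-gram features over word embeddings; it is a generic PyTorch example, not a reimplementation of any cited model, and all sizes are illustrative.

```python
# A minimal multi-filter CNN: parallel convolutions of different widths
# extract n-gram features from a sequence of word embeddings.
import torch
import torch.nn as nn

class NGramCNN(nn.Module):
    def __init__(self, emb_dim=100, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        # One Conv1d per filter width; padding roughly preserves length.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w, padding=w // 2) for w in widths)

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)             # Conv1d expects (batch, emb, seq)
        feats = [torch.relu(c(x)) for c in self.convs]
        # Max-pool over time: one n-gram feature vector per filter width.
        return torch.cat([f.max(dim=2).values for f in feats], dim=1)

out = NGramCNN()(torch.randn(2, 20, 100))
print(out.shape)                          # torch.Size([2, 192])
```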

Apart from extracting aspects alone, CNN can identify sentiment polarity at the same time, which can be framed as a multi-label classification or multi-task problem. In studies treating ABSA as multi-label classification, a probability-distribution threshold was applied to select the aspect category, and the aspect vector was concatenated with the word embeddings and then processed by a CNN. Xu et al. [32] combined a CNN with a non-linear CRF to extract aspects; the extracted aspects were then concatenated with the word embeddings and fed into another CNN to identify sentiment polarity. Gu et al. [33] proposed a two-level CNN that integrated aspect mapping and sentiment classification. Compared with conventional ML approaches, this approach can reduce the feature-engineering effort and elapsed time [9]. It should be noted that multi-task CNN does not necessarily outperform single-task methods [19].
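A minimal sketch of such a two-stage design is given below. It only follows the spirit of the pipeline in [32]: a softmax tagger stands in for the CRF for brevity, and all layer sizes are assumptions.

```python
# Illustrative aspect-then-sentiment pipeline: a first CNN drives token-level
# aspect tagging, and the soft tag predictions are concatenated with the
# embeddings for a second CNN that classifies sentence-level sentiment.
import torch
import torch.nn as nn

class AspectThenSentiment(nn.Module):
    def __init__(self, emb_dim=100, hid=64, n_tags=3, n_polarities=3):
        super().__init__()
        self.aspect_cnn = nn.Conv1d(emb_dim, hid, 3, padding=1)
        self.tagger = nn.Linear(hid, n_tags)      # e.g. B/I/O aspect tags
        self.sent_cnn = nn.Conv1d(emb_dim + n_tags, hid, 3, padding=1)
        self.classifier = nn.Linear(hid, n_polarities)

    def forward(self, x):                         # x: (B, T, emb_dim)
        h = torch.relu(self.aspect_cnn(x.transpose(1, 2))).transpose(1, 2)
        tags = self.tagger(h)                     # (B, T, n_tags)
        # Feed soft aspect predictions back in as extra token features.
        x2 = torch.cat([x, tags.softmax(-1)], dim=-1)
        h2 = torch.relu(self.sent_cnn(x2.transpose(1, 2)))
        return tags, self.classifier(h2.max(dim=2).values)

tags, polarity = AspectThenSentiment()(torch.randn(2, 20, 100))
print(tags.shape, polarity.shape)                 # (2, 20, 3) (2, 3)
```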

#### *2.2.2 RNN and attention-based RNN*

RNN has been applied to ABSA and SBSA in UGC. RNN models use a fixed-size vector to represent a sequence, which can be a sentence or a document, by feeding each token into a recurrent unit. The main differences between CNN and RNN are: (1) in RNN the parameters are shared across time steps, so fewer parameters need to be learned; (2) since the outputs of RNN depend on the prior steps, RNN can capture context dependency and is suitable for texts of different lengths [34–36].

However, the standard RNN suffers from gradient explosion and vanishing, which makes it difficult to train and fine-tune the parameters during propagation [34]. LSTM and the Gated Recurrent Unit (GRU) [37] have been proposed to tackle these issues. In addition, Bi-directional RNN (Bi-RNN) models have been proposed in many studies [38, 39]. The principle behind Bi-RNN is that a context-aware representation can be acquired by concatenating the backward and forward vectors. Instead of a forward layer alone, a backward layer is added so that the model learns from both the past and the future, enabling Bi-RNN to make predictions using the following words as well. It has been shown that a Bi-RNN model achieved better results than LSTM on highly skewed data in the task of aspect category detection [40]. In particular, the bidirectional GRU can extract aspects and identify sentiment simultaneously [23, 41], using Bi-LSTM-CRF and CNN to extract the aspects in sentences that contain more than one sentiment target.
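The snippet below is a minimal Bi-LSTM sketch of this idea: forward and backward hidden states are concatenated so that each token representation sees both its left and right context. Sizes and the tagging head are illustrative.

```python
# A minimal bidirectional LSTM over word embeddings.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=100, hidden_size=64,
                 batch_first=True, bidirectional=True)
x = torch.randn(2, 20, 100)            # (batch, seq_len, emb_dim)
out, _ = bilstm(x)                     # (2, 20, 128): forward ++ backward
# A token-level linear layer on top yields e.g. B/I/O aspect tags per word.
tags = nn.Linear(2 * 64, 3)(out)
print(out.shape, tags.shape)           # (2, 20, 128) (2, 20, 3)
```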

Another drawback of RNN is that it encodes peripheral information, especially when fed with information-rich texts, which can further result in semantic-mismatch problems. To tackle this issue, the attention mechanism was proposed: weights are computed for each lower-level element and aggregated into a weighted vector for the higher-level representation [42]. In doing so, the attention mechanism can emphasize the aspects and the sentiment in a sentence. Attention-based LSTM with aspect embeddings [43], position-attention-based LSTM [44], and syntax-aware vectors [45] have been used to capture the important aspects and context words. Aspect and opinion terms can be extracted by the Coupled Multi-Layer Attention model based on GRU [46] and by the Bi-CNN with attention [47]. These frameworks require fewer engineered features than approaches based on CRF.
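A simplified dot-product variant of aspect-aware attention is sketched below; it follows the spirit of [43] but is not the exact formulation of any cited model.

```python
# Aspect-aware attention: scores between each hidden state and the aspect
# embedding weight the hidden states into a single sentence vector.
import torch
import torch.nn as nn

class AspectAttention(nn.Module):
    def __init__(self, hid=128, asp_dim=128):
        super().__init__()
        self.proj = nn.Linear(asp_dim, hid)   # map aspect into hidden space

    def forward(self, h, aspect):             # h: (B, T, hid); aspect: (B, asp_dim)
        q = self.proj(aspect).unsqueeze(2)    # (B, hid, 1)
        scores = torch.bmm(h, q).squeeze(2)   # (B, T): one score per token
        alpha = scores.softmax(dim=1)         # attention weights over tokens
        return torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # (B, hid)

h = torch.randn(2, 20, 128)                   # e.g. Bi-LSTM outputs
ctx = AspectAttention()(h, torch.randn(2, 128))
print(ctx.shape)                              # torch.Size([2, 128])
```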

#### *2.2.3 Memory network*

The development of the deep memory network for ABSA originated from the multi-hop attention mechanism, which applies external memory to compute the influence of context words on the given aspects [36]. A multi-hop attention mechanism operating over an external memory can recognize the importance of context words and infer sentiment polarity from the context. Aspect extraction and sentiment identification can be achieved simultaneously in the memory network model proposed in [13]: Li et al. used the signals obtained in aspect extraction as the basis for predicting sentiment polarity, which in turn was used to identify the aspects.

Memory networks can tackle problems that cannot be addressed by the attention mechanism alone. Specifically, in certain sentences the sentiment polarity depends on the aspect and cannot be inferred from the context alone. For example, consider "the price is high" and "the screen resolution is high". Both sentences contain the word "high": when "high" relates to "price" it conveys negative sentiment, whereas it conveys positive sentiment when it relates to "screen resolution". Wang et al. [48] proposed six techniques for designing target-sensitive memory networks that deal with this issue effectively.
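A minimal multi-hop memory sketch in the spirit of [36] is given below; the hop count, dimensions, and query-update rule are illustrative assumptions, not the cited architecture.

```python
# Multi-hop attention over an external memory: the aspect vector is refined
# across several hops against a memory of context-word embeddings.
import torch
import torch.nn as nn

class MultiHopMemory(nn.Module):
    def __init__(self, dim=100, hops=3):
        super().__init__()
        self.hops = hops
        self.linear = nn.Linear(dim, dim)    # transforms the query each hop

    def forward(self, memory, aspect):       # memory: (B, T, d); aspect: (B, d)
        q = aspect
        for _ in range(self.hops):
            scores = torch.bmm(memory, q.unsqueeze(2)).squeeze(2)   # (B, T)
            attended = torch.bmm(scores.softmax(1).unsqueeze(1),
                                 memory).squeeze(1)                 # (B, d)
            q = self.linear(q) + attended    # refined query for the next hop
        return q                             # final aspect-aware representation

out = MultiHopMemory()(torch.randn(2, 20, 100), torch.randn(2, 100))
print(out.shape)                             # torch.Size([2, 100])
```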

#### **2.3 Studies using DL methods in tourism and research gap**

To obtain a finer-grained sentiment picture of tourists' experiences in economy hotels in China, the authors of [11] used Word2Vec to obtain the word embeddings as model input and trained a bidirectional LSTM with a CRF layer for prediction. The whole model includes a text layer, a POS layer, a connection layer, and an output layer, in which the CRF produces the output, reaching an accuracy of 84%. Chang et al. [49] applied GloVe to pre-train the word embeddings. To improve performance, feature vectors such as sentiment scores, temporal intervals, and reviewer profiles were added to the CNN models. Their results showed that temporal intervals contributed more than the sentiment scores and reviewer profiles in helping managers respond to reviews. Gao et al. [50] explored a model that stacked a CNN on top of an LSTM and showed that the combined model outperformed a single CNN or LSTM model, with improvements of 3.13% and 1.71%, respectively.

To summarize, DL methods have been used extensively for ABSA, but ABSA in the tourism domain remains scarce in the literature. Therefore, this study conducts ABSA for sentiment prediction using a dataset collected from TripAdvisor. The literature review shows that RNN models, especially attention-based RNN models, achieve better accuracy than CNN models. Therefore, attention-based gated RNN models, including LSTM and GRU, were used in this study and are summarized in the following section. Zhou et al. [14] conducted a series of ABSA experiments on the SemEval datasets [51, 52] using various DL methods. Their experimental results confirmed that RNN with an attention mechanism obtained higher accuracies but relatively low precision and recall. This is because the SemEval datasets are naturally imbalanced: the fraction of positive sentiment samples is significantly higher than the fractions of neutral and negative samples, which highlights the importance of the sentiment-class fractions in a dataset. Inspired by ABSA on the SemEval datasets, four datasets with different fractions of sentiment samples were resampled from the TripAdvisor hotel review dataset to investigate the effect of class imbalance on model performance. Also, the optimizers that minimize the loss play a key role in model training; therefore, three optimizers, including a state-of-the-art optimizer, were compared in this study.
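As an illustration of such resampling, the sketch below draws a dataset with a target sentiment-class distribution from a toy review table; the column names, fractions, and data are hypothetical, not the chapter's actual protocol.

```python
# Resample a labeled dataset so its classes follow prescribed fractions.
import pandas as pd

def resample_fractions(df, label_col, fractions, n_total, seed=42):
    """Draw a dataset of size ~n_total whose classes follow `fractions`."""
    parts = []
    for label, frac in fractions.items():
        pool = df[df[label_col] == label]
        n = int(round(frac * n_total))
        # Sample with replacement only if a class is too small for its quota.
        parts.append(pool.sample(n, replace=len(pool) < n, random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle

reviews = pd.DataFrame({"text": ["good", "ok", "bad", "great", "poor"] * 40,
                        "sentiment": ["pos", "neu", "neg", "pos", "neg"] * 40})
balanced = resample_fractions(reviews, "sentiment",
                              {"pos": 1/3, "neu": 1/3, "neg": 1/3}, n_total=90)
print(balanced["sentiment"].value_counts())   # 30 samples per class
```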
