**3. Research design and experiment**

#### **3.1 Corpora design**

In line with the purpose of the study, the corpora consist entirely of English reviews collected from casino resorts in Macao. A self-designed crawler programmed in Python first acquired and stored all hotel URLs, and then used each URL as the initial page to crawl all the UGC belonging to that hotel. The resulting corpus contains 61,544 reviews of 66 hotels. Review length varies greatly, from a minimum of one sentence to a maximum of 15 sentences.
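A minimal sketch of such a crawler is given below. The listing URL, CSS selectors, and page structure are hypothetical placeholders for illustration, not the actual site layout used in this study.

```python
# Hypothetical sketch of the two-stage crawl: collect hotel URLs first,
# then use each as the initial page to gather that hotel's reviews.
import requests
from bs4 import BeautifulSoup

def collect_hotel_urls(index_url: str) -> list[str]:
    """Collect per-hotel landing pages from a listing page (selector assumed)."""
    soup = BeautifulSoup(requests.get(index_url, timeout=10).text, "html.parser")
    return [a["href"] for a in soup.select("a.hotel-link")]   # hypothetical selector

def crawl_reviews(hotel_url: str) -> list[str]:
    """Crawl all review texts reachable from one hotel's landing page."""
    soup = BeautifulSoup(requests.get(hotel_url, timeout=10).text, "html.parser")
    return [p.get_text(strip=True) for p in soup.select("p.review-text")]

reviews = [r for url in collect_hotel_urls("https://example.com/macao-hotels")
           for r in crawl_reviews(url)]
```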

In terms of the size of the corpora requiring annotation, no clear guideline exists, so this study referred to Liu's work and the SemEval tasks. For machine-learning-based studies, a corpus containing 800–1,000 aspects can be considered sufficient, while for a deep-learning-based approach we consider at least 5,000 aspects in total to be acceptable. As the original data had to be annotated before further analysis, 1% of the reviews were randomly sampled from the corpus. The resulting 600 reviews, containing 5,506 sentences, were selected for ABSA in this study.
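The sampling step itself is straightforward; the sketch below assumes one review per line in a text file, with an arbitrary seed added for reproducibility.

```python
# Draw roughly 1% of the corpus for annotation; file layout and seed are
# assumptions made for illustration.
import random

with open("reviews.txt", encoding="utf-8") as f:
    corpus = f.read().splitlines()      # 61,544 reviews in this study

random.seed(42)                         # hypothetical seed
sample = random.sample(corpus, 600)     # the 600 reviews annotated here
```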

#### **3.2 Annotation**

Although previous works annotated corpora and performed sentiment analysis, they did not reveal their annotation principles [51, 53], and their categories are rather coarse. For example, [53] annotated restaurant aspects with the pre-defined categories "Food, Service, Price, Ambience, Anecdotes, and Miscellaneous", without annotating aspects at a finer level. In addition, the reliability and validity of those annotation schemes have not been demonstrated.

As the training of the models discussed above requires the annotation of domain-specific corpora, this study referred to [54]. The design of the annotation schema calls for the identification of aspect-sentiment pairs. Specifically, $A$ is the collection of aspects $a_j$ (with $j = 1, \ldots, s$). Then, a sentiment polarity $p_k$ (with $k = 1, \ldots, t$) is added to each aspect in the form of a tuple $(a_j, p_k)$.
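As an illustration, the tuple schema can be represented as follows; the category and polarity labels are hypothetical examples, not the study's actual label set.

```python
# One annotated sentence carries a set of (a_j, p_k) tuples.
from typing import NamedTuple

class AspectSentiment(NamedTuple):
    aspect: str      # a_j, j = 1, ..., s
    polarity: str    # p_k, k = 1, ..., t

sentence = "The pool was gorgeous but check-in took an hour."
annotation = [AspectSentiment("facility#pool", "positive"),      # hypothetical labels
              AspectSentiment("service#check-in", "negative")]
```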

To ensure reliability and validity, inter-annotator agreement (IAA) is measured in this study with Cohen's *kappa* and Krippendorff's *alpha*, both calculated with the agreement package in NLTK. The two indicators are used to measure (1) the agreement on the entire aspect-sentiment pair and (2) the agreement on each independent category.
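A minimal sketch of this computation with NLTK's agreement package is shown below; the annotator names and labels are illustrative, with each aspect-sentiment pair serialized as a single string label.

```python
# Each record is a (coder, item, label) triple, the format AnnotationTask expects.
from nltk.metrics.agreement import AnnotationTask

triples = [
    ("coder1", "sent1", "facility#pool|positive"),
    ("coder2", "sent1", "facility#pool|positive"),
    ("coder1", "sent2", "service#check-in|negative"),
    ("coder2", "sent2", "service#check-in|neutral"),
]

task = AnnotationTask(data=triples)
print("Cohen's kappa:       ", task.kappa())
print("Krippendorff's alpha:", task.alpha())
```

Agreement on each independent category can be obtained the same way by running the task on the aspect labels and the polarity labels separately.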

#### **3.3 Attention-based gated RNN**

#### *3.3.1 LSTM unit*

The LSTM unit proposed by [25] overcomes the gradient vanishing and exploding issues of the standard RNN. The LSTM unit consists of forget, input, and output gates, as well as a cell memory state. Instead of the recurrent unit computing a weighted sum of the inputs and applying an activation function, the LSTM unit maintains a memory cell $c_t$ at time $t$. Each LSTM unit can be computed as follows:

$$X = \begin{bmatrix} h_{t-1} & x_t \end{bmatrix} \tag{1}$$

$$f_t = \sigma\left(X W_f^T + b_f\right) \tag{2}$$

$$i_t = \sigma\left(X W_i^T + b_i\right) \tag{3}$$

$$o_t = \sigma\left(X W_o^T + b_o\right) \tag{4}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(X W_c^T + b_c\right) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

where $W_f, W_i, W_o, W_c \in \mathbb{R}^{d \times 2d}$ are the weight matrices and $b_f, b_i, b_o, b_c \in \mathbb{R}^d$ are the bias vectors to be learned, parameterizing the transformations of the three gates; $d$ is the dimension of the word embedding; $\sigma$ is the sigmoid activation function, and $\odot$ represents element-wise multiplication; $x_t$ and $h_t$ are the word embedding vector and hidden layer at time $t$, respectively.

The forget gate decides the extent to which the existing memory is kept (Eq. (2)), while the input gate controls the extent to which new memory is added to the memory cell (Eq. (3)). The memory cell is updated by partially forgetting the existing memory and adding new memory content (Eq. (5)). The output gate controls how much of the memory content is exposed by the unit (Eq. (4)). With these three gates, the LSTM unit can decide whether to keep the existing memory. Intuitively, if the LSTM unit detects an important feature in an input sequence at an early stage, it can easily carry this information (the existence of the feature) over a long distance, hence capturing potential long-distance dependencies.
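Eqs. (1)–(6) translate directly into code. The following NumPy sketch performs one LSTM step; the parameters are assumed to be already learned (here: random toy values).

```python
# One LSTM step following Eqs. (1)-(6).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    X = np.concatenate([h_prev, x_t])          # Eq. (1): X in R^{2d}
    f_t = sigmoid(X @ W_f.T + b_f)             # Eq. (2): forget gate
    i_t = sigmoid(X @ W_i.T + b_i)             # Eq. (3): input gate
    o_t = sigmoid(X @ W_o.T + b_o)             # Eq. (4): output gate
    c_t = f_t * c_prev + i_t * np.tanh(X @ W_c.T + b_c)   # Eq. (5): cell update
    h_t = o_t * np.tanh(c_t)                   # Eq. (6): exposed hidden state
    return h_t, c_t

d = 4                                          # toy embedding dimension
rng = np.random.default_rng(0)
W = lambda: rng.standard_normal((d, 2 * d))    # W_* in R^{d x 2d}
b = lambda: np.zeros(d)                        # b_* in R^d
h, c = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d),
                 W(), W(), W(), W(), b(), b(), b(), b())
```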

#### *3.3.2 GRU*

The Gated Recurrent Unit (GRU), which adaptively remembers and forgets, was proposed by [37]. Compared with the LSTM unit, the GRU has no memory cell; instead, reset and update gates modulate the flow of information inside the unit. Each GRU can be computed as follows:

$$X = \begin{bmatrix} h_{t-1} & x_t \end{bmatrix} \tag{7}$$

$$r_t = \sigma\left(X W_r^T + b_r\right) \tag{8}$$

$$z_t = \sigma\left(X W_z^T + b_z\right) \tag{9}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh\left(\begin{bmatrix} r_t \odot h_{t-1} & x_t \end{bmatrix} W^T + b\right) \tag{10}$$

The reset gate filters the information from the previous hidden layer, as the forget gate does in the LSTM unit (Eq. (8)); this effectively allows irrelevant information to be dropped, yielding a more compact representation. The update gate, in turn, decides how much the GRU updates its content (Eq. (9)), which is similar to the LSTM. However, the GRU has no mechanism to control the degree to which its state is exposed; it fully exposes the state at each time step.
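Analogously, Eqs. (7)–(10) can be sketched as one GRU step, again with parameters assumed to be already learned.

```python
# One GRU step following Eqs. (7)-(10).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W, b_r, b_z, b):
    X = np.concatenate([h_prev, x_t])           # Eq. (7)
    r_t = sigmoid(X @ W_r.T + b_r)              # Eq. (8): reset gate
    z_t = sigmoid(X @ W_z.T + b_z)              # Eq. (9): update gate
    X_r = np.concatenate([r_t * h_prev, x_t])   # [r_t ⊙ h_{t-1}, x_t]
    return (1 - z_t) * h_prev + z_t * np.tanh(X_r @ W.T + b)   # Eq. (10)
```

Note that, unlike the LSTM sketch, no cell state is carried between steps; the hidden state is the only recurrent quantity.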

#### *3.3.3 Attention mechanism*

The standard LSTM and GRU cannot detect which part of a sentence is important for aspect-level sentiment classification. To address this issue, [43] proposed an attention mechanism that allows the model to capture the key part of a sentence when different aspects are concerned. A gated RNN model equipped with this attention mechanism produces an attention weight vector $\alpha$ and a weighted hidden representation $r$:

$$M = \tanh\left(\begin{bmatrix} W_h H \\ W_v v_a \otimes e_N \end{bmatrix}\right) \tag{11}$$

$$\alpha = \mathrm{softmax}\left(W_m M\right) \tag{12}$$

$$r = H \alpha^T \tag{13}$$

where $H \in \mathbb{R}^{d_h \times N}$ is the hidden matrix, $d_h$ is the dimension of the hidden layer, and $N$ is the length of the given sentence; $v_a \in \mathbb{R}^{d_a}$ is the aspect embedding, and $e_N \in \mathbb{R}^N$ is an $N$-dimensional vector of ones; $\otimes$ denotes the operator that repeats $v_a$ across the $N$ columns, i.e., $v_a \otimes e_N = [v_a; v_a; \ldots; v_a]$; $W_h \in \mathbb{R}^{d \times d}$, $W_v \in \mathbb{R}^{d_a \times d_a}$, and $W_m \in \mathbb{R}^{d + d_a}$ are the parameters to be learned, and $\alpha \in \mathbb{R}^N$ is the resulting vector of attention weights.

The feature representation $h^*$ of a sentence given an aspect is computed as:

$$h^* = \tanh\left(W_p r + W_x h_N\right) \tag{14}$$

where $h^* \in \mathbb{R}^d$, and $W_p, W_x \in \mathbb{R}^{d \times d}$ are the parameters to be learned.

To better exploit aspect information, the aspect embedding is appended to each word embedding so that it contributes to the attention weights. The hidden layer can thereby gather information from the aspect, and the interdependence between words and the aspect can be modeled when computing the attention weights.
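Putting Eqs. (11)–(14) together, the attention head can be sketched in NumPy as follows; the parameters are random toy values standing in for learned ones, and the dimensions are illustrative.

```python
# Aspect-conditioned attention over hidden states, Eqs. (11)-(14).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def aspect_attention(H, v_a, W_h, W_v, W_m, W_p, W_x):
    d_h, N = H.shape
    V_a = np.outer(v_a, np.ones(N))                   # v_a ⊗ e_N: repeat aspect N times
    M = np.tanh(np.vstack([W_h @ H, W_v @ V_a]))      # Eq. (11)
    alpha = softmax(W_m @ M)                          # Eq. (12): alpha in R^N
    r = H @ alpha                                     # Eq. (13): weighted representation
    return np.tanh(W_p @ r + W_x @ H[:, -1])          # Eq. (14): h*, with h_N = H[:, -1]

d_h, d_a, N = 4, 3, 5                                 # toy dimensions
rng = np.random.default_rng(1)
h_star = aspect_attention(
    rng.standard_normal((d_h, N)),      # H: hidden matrix
    rng.standard_normal(d_a),           # v_a: aspect embedding
    rng.standard_normal((d_h, d_h)),    # W_h
    rng.standard_normal((d_a, d_a)),    # W_v
    rng.standard_normal(d_h + d_a),     # W_m
    rng.standard_normal((d_h, d_h)),    # W_p
    rng.standard_normal((d_h, d_h)),    # W_x
)
```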
