3. Studying emoji usage using formal frameworks

Emoji usage has had a deep impact on human computer-mediated communication (CMC). With the increasing use of social media platforms such as Facebook, Twitter, and Instagram, people now exchange messages and ideas on a massive scale through text-based chat tools that support emoji, imbuing these messages with semantic and emotional meaning. To analyze and extract knowledge from data sets of messages with embedded emoji, many methods have been developed through a multidisciplinary approach that combines ML with NLP, psychology, robotics, and other fields. The tasks addressed with ML algorithms for the analysis of emoji usage include sentiment analysis [5, 19], polarity analysis [10, 20], sentiment lexicon construction [10], utterance embeddings [21], and personality assessment [18], to mention a few. These applications are summarized in Table 1.

This section analyzes how ML algorithms support tasks related to emoji: classification and comparison, sentiment analysis and polarity detection, preprocessing of tweets with embedded emoji, and computer vision techniques that process video to detect facial expressions.

#### 3.1 Emoji classification and comparison

In recent years, deep learning (DL) has emerged as a new area of ML that involves new techniques for signal and information processing. Algorithms of this type employ several nonlinear layers for information processing through supervised and unsupervised feature extraction and transformation for pattern analysis and classification. They also learn multiple levels of representation, yielding models that can describe the complex relations within data. In particular, when data sets are considerably large, a deep-learning approach is generally the best option for reaching a well-trained model, regardless of whether the data are labeled [25, 26]. ML algorithms with shallow architectures still perform well and are effective on simple problems; examples include linear regression (LR), support vector machines (SVM), multilayer perceptrons (MLP) with a single hidden layer, and decision-tree methods such as random forest or ID3, among others. These architectures are limited in their ability to extract patterns from a wide variety of complex problems involving signals, human speech, natural language, images, and sound [25]. A deep-learning approach can overcome these limitations and has shown good results.

Emoji classification and comparison are two important tasks for discriminating between several kinds of emoji, including those with similar meanings. Deep-learning models have been used for this goal in texts with embedded emoji, producing better results than softmax methods such as logistic regression, naive Bayes, artificial neural networks, and others. For example, Xiang Li et al. developed a deep neural network architecture to train a model that predicts the correct emoji for a given utterance [21]. This approach makes it possible for machines to generate automatic messages for humans during a conversation, conveying implicit sentiment and richer semantics.

In the proposal of Li et al. [21], the system receives as input an utterance set $Y = \lbrace y\_1, y\_2, \ldots, y\_n \rbrace$ and an emoji set $X = \lbrace x\_1, x\_2, \ldots, x\_n \rbrace$. The main goal is to train a classification model that predicts the correct emoji for a given utterance.

The architecture used in this work has two parts. The first is a convolutional neural network (CNN) that produces a sentence embedding representing an utterance; the second is the emoji embedding, which has to be trained. To join both parts, a matching structure was created, because embeddings in a continuous vector space represent emoji well and consequently perform better than a discrete softmax classifier.

#### Table 1.

Comparative table of the articles analyzed.

Emoji as a Proxy of Emotional Communication DOI: http://dx.doi.org/10.5772/intechopen.88636

The bottom layer of the CNN is a word embedding layer, as used in NLP tasks. It provides semantic information about a word through a real-valued vector that represents its features. An utterance is a sequence of words, and each word $w\_i$ is encoded as a one-hot vector whose dimension is the dictionary size: the bit of $w\_i$ at the position of the corresponding dictionary word takes the value 1, and the remaining bits are 0. The embedding matrix is defined in Eq. (1) [21]:

$$E\_1 \in \mathbb{R}^{D \times V},\tag{1}$$

where $D$ and $V$ are the word embedding and word dictionary dimensions, respectively. Each column $e\_1(w\_i)$ of $E\_1$ is the embedding of the word $w\_i$ in the dictionary. The convolutional layer uses sliding windows to gather information from the word embeddings; this process uses the function in Eq. (2) [21]:

$$Y\_1 = f(W\_1[e\_1(w\_1); e\_1(w\_2); \dots; e\_1(w\_t)] + b\_1),\tag{2}$$

where $t$ is the window size and $b\_1$ is the bias vector. The parameter to be trained is therefore $W\_1$.

Once a series of continuous representations of local features has been obtained from the convolutional layer, dynamic pooling is used to synthesize these embeddings into a single vector for the whole utterance; the output is produced by max pooling. The hidden layer takes the resulting sentence embedding, denoted $y\_2$, and finally returns the vector that represents the utterance.
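The utterance branch described above (embedding lookup, the sliding-window convolution of Eq. (2), and max pooling) can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of [21]: the sizes `V`, `D`, `H`, the random initial values, and the choice of `tanh` for the function $f$ are all hypothetical, and the embedding matrix is stored with one row per word for convenience.

```python
import math
import random

def embed(word_ids, E1):
    """Look up one embedding per word id (same result as E1 times a one-hot vector)."""
    return [E1[i] for i in word_ids]

def conv_layer(embeddings, W1, b1, t):
    """Apply f(W1 [e(w_1); ...; e(w_t)] + b1) to every window of t consecutive words."""
    outputs = []
    for start in range(len(embeddings) - t + 1):
        # Concatenate the t embeddings of the current window into one vector.
        window = [x for vec in embeddings[start:start + t] for x in vec]
        outputs.append([
            math.tanh(sum(w * x for w, x in zip(row, window)) + b)  # tanh is an assumed f
            for row, b in zip(W1, b1)
        ])
    return outputs

def max_pool(feature_maps):
    """Keep, for each feature, its maximum over all window positions."""
    return [max(column) for column in zip(*feature_maps)]

random.seed(0)
V, D, H, t = 6, 4, 3, 2  # hypothetical dictionary size, embedding size, filters, window
E1 = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
W1 = [[random.uniform(-1, 1) for _ in range(t * D)] for _ in range(H)]
b1 = [0.0] * H

utterance = [0, 3, 5, 1]  # word ids of a four-word utterance
local_features = conv_layer(embed(utterance, E1), W1, b1, t)
sentence_vector = max_pool(local_features)
```

Because pooling takes a maximum per feature, `sentence_vector` has a fixed length `H` regardless of how many words the utterance contains.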

Similarly to the word embedding layer, the emoji embedding layer uses a matrix $E\_2 \in \mathbb{R}^{D \times K}$ to obtain $e\_2(x\_i)$, where $K$ is the length of the one-hot vector that represents each emoji $x\_i$. Each $e\_2(x\_i)$ in $E\_2$ is a parameter of the neural network. Training consists of a forward propagation step that computes the matching score between the given utterance and the correct emoji and the matching score between the given utterance and a negative emoji, followed by backward propagation to update the model parameters. The matching score is computed with the cosine similarity measure, and the network is trained with the hinge loss function. It is worth mentioning that the hinge loss is very useful for pairwise comparison to identify similar emoji types.
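The matching score and the pairwise loss can be illustrated as follows. This is a minimal sketch, assuming the common margin form of the hinge loss; the margin value of 0.5 and the toy two-dimensional vectors are hypothetical and are not taken from [21].

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hinge_loss(utterance_vec, correct_emoji_vec, negative_emoji_vec, margin=0.5):
    """Penalize the model when the negative emoji scores within `margin` of the correct one."""
    positive_score = cosine(utterance_vec, correct_emoji_vec)
    negative_score = cosine(utterance_vec, negative_emoji_vec)
    return max(0.0, margin - positive_score + negative_score)

utterance_vec = [1.0, 0.0]
correct_emoji = [1.0, 0.0]   # points in the same direction as the utterance
negative_emoji = [0.0, 1.0]  # orthogonal to the utterance

loss_when_right = hinge_loss(utterance_vec, correct_emoji, negative_emoji)
loss_when_wrong = hinge_loss(utterance_vec, negative_emoji, correct_emoji)
```

The loss is zero when the correct emoji already beats the negative one by at least the margin, so only confusable pairs contribute gradients during backward propagation.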

Finally, the authors obtain an architecture that uses a CNN and a matching approach to classify emoji and learn emoji embeddings. For the field of robotics, the importance of this work lies in the possibility of producing a facial gesture in response to a statement, conversation, or idea given to a machine, by exploiting the semantic and emotional relations of emoji.

#### 3.2 Emoji sentiment analysis

In decision making, it is relevant to know how people think and what they will do in the future. This creates the need to group people according to their interactions on the Internet and social networks. Sentiment analysis, or opinion mining, is the study of people's opinions, sentiments, or emotions using an NLP approach, which includes, but is not limited to, text mining, data mining, ML, and deep learning [20]. For instance, CNNs have been employed to predict the polarity of emoji in tweets. These techniques have been shown to be more effective than shallow models in image recognition and text classification, where they reach better results [19].

Processing tweets for opinion mining and text analysis plays a crucial role in different areas of industry because it produces relevant results that feed back into the design of products and services. Because Twitter is a platform where user interactions are very informal and unstructured, and people use many languages and acronyms, it becomes necessary to build a model that is language independent and learns without supervision. In this scenario, emoji or emoticons serve as heuristic labels for the system, and the feature extraction process relies on unsupervised techniques. The emoji or emoticons are the final result that represents the sentiment a tweet contains. According to Mohammad Hanafy et al., in order to obtain a trained model for text processing, it is essential to preprocess the data when building the data sets: noisy elements such as hashtags and other extraneous characters like the "at" sign are removed, duplicated words are dropped, and, very importantly, the emoticons are re-emphasized with their scores. Each emoticon has raw data that describe its sentiment, classified as negative, neutral, or positive, and a continuous value is recorded for each class. This representation is used in the auto-label phase to generate the training data, using the scores to determine the emoji [19].
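The preprocessing and auto-label steps can be sketched as below. This is an illustrative reading of the description, not the pipeline of [19]: the regular expression, the rule of keeping only the first occurrence of each word, and the emoticon scores in `EMOTICON_SCORES` are all assumptions made for the example.

```python
import re

# Hypothetical (negative, neutral, positive) scores; real values would come from a lexicon.
EMOTICON_SCORES = {
    ":)": (0.05, 0.15, 0.80),
    ":(": (0.80, 0.15, 0.05),
}
LABELS = ("negative", "neutral", "positive")

def clean_tweet(text):
    """Remove hashtags and @mentions, lowercase, and drop repeated words."""
    text = re.sub(r"[#@]\w+", " ", text)
    seen, kept = set(), []
    for token in text.lower().split():
        if token not in seen:  # keep only the first occurrence of each token
            seen.add(token)
            kept.append(token)
    return kept

def auto_label(tokens):
    """Sum the emoticon scores in a tweet and return the strongest class, or None."""
    totals = [0.0, 0.0, 0.0]
    for token in tokens:
        for i, score in enumerate(EMOTICON_SCORES.get(token, (0.0, 0.0, 0.0))):
            totals[i] += score
    if not any(totals):
        return None  # no known emoticon, so the tweet cannot be labeled automatically
    return LABELS[totals.index(max(totals))]

tokens = clean_tweet("@bob great great day :) #happy")
label = auto_label(tokens)
```

Tweets for which `auto_label` returns `None` would simply be left out of the automatically generated training set.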

The feature extraction stage uses the TF-IDF approach, which indicates the importance of a word through its frequency in a text or set of texts. It is calculated with Eq. (3) [19, 27]:

$$\text{tfIDF}(t,d,F) = tf(t,d) \cdot \log \frac{n\_d}{df(d,t) + 1} \tag{3}$$

where $t$ is the word and $d$ is the tweet; $tf$ is the frequency of the term in the document, $df$ is the number of documents in which the word occurs, and $n\_d$ is the number of tweets.
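Eq. (3) translates almost directly into code. The sketch below assumes that the term frequency is a raw count and that tweets are already tokenized into lists of words; the toy corpus is hypothetical.

```python
import math

def tf(term, tweet):
    """Term frequency: number of times the term occurs in one tweet (raw count assumed)."""
    return tweet.count(term)

def df(term, tweets):
    """Document frequency: number of tweets that contain the term."""
    return sum(1 for tweet in tweets if term in tweet)

def tfidf(term, tweet, tweets):
    """Eq. (3): tf(t, d) * log(n_d / (df + 1))."""
    return tf(term, tweet) * math.log(len(tweets) / (df(term, tweets) + 1))

tweets = [
    ["good", "day"],
    ["bad", "day"],
    ["good", "good", "movie"],
    ["rain"],
]
score_rain = tfidf("rain", tweets[3], tweets)  # rare word, df = 1
score_good = tfidf("good", tweets[2], tweets)  # repeated word, df = 2
```

With this form of the formula, a word that occurs in all but one tweet gets a weight of zero, and a word that occurs in every tweet gets a slightly negative weight, which is why very common words contribute little to the representation.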

Other feature extraction methods employed were bag-of-words (BOW) and Word2Vec. BOW selects a set of important words in the tweets and then represents each document as a vector with the number of occurrences of the selected words. Word2Vec uses a two-layer neural network to represent each word as a vector of a certain length based on its context. This feature extraction model computes a distributed vector representation of each word, whose main advantage is that similar words are close in the vector space. Moreover, it is very useful for named entity recognition, parsing, disambiguation, tagging, and machine translation. In the area of big data processing, the Spark ML library within the Apache Spark engine uses a skip-gram-based implementation that seeks to learn vector representations that take into account the contexts in which words occur [27].
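The bag-of-words representation mentioned above can be sketched as follows. How the "important" words are selected is not specified in the text, so choosing the most frequent words of the corpus is an assumption of this example, as are the function names and the toy data.

```python
from collections import Counter

def build_vocabulary(tweets, size):
    """Select the `size` most frequent words of the corpus (assumed selection rule)."""
    counts = Counter(word for tweet in tweets for word in tweet)
    return [word for word, _ in counts.most_common(size)]

def bow_vector(tweet, vocabulary):
    """Represent a tweet by the number of occurrences of each vocabulary word."""
    counts = Counter(tweet)
    return [counts[word] for word in vocabulary]

tweets = [["good", "day"], ["good", "movie"], ["bad", "day"]]
vocabulary = build_vocabulary(tweets, size=2)
vector = bow_vector(["good", "good", "day", "rain"], vocabulary)
```

Words outside the vocabulary, such as `"rain"` here, are ignored, which is the main difference from Word2Vec, where every word receives a dense vector learned from its context.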

The skip-gram model learns word vector representations that are good at predicting a word's context in the same sentence, given a sequence of training words $W = \lbrace w\_1, w\_2, \ldots, w\_T \rbrace$, where $T = \lVert W \rVert$. The objective is to maximize the average log-likelihood, defined by Eq. (4) [27]:

$$\frac{1}{T} \bullet \sum\_{t=1}^{T} \sum\_{j=-k}^{j=k} \log p\left(w\_{t+j}|w\_t\right),\tag{4}$$

where $k$ is the size of the training window. Each word $w$ is associated with two vectors, $u\_w$ and $v\_w$, which represent it as a word and as a context, respectively. Given the word $w\_j$, the probability of correctly predicting $w\_i$ is computed with Eq. (5) [27]:

$$p\left(w\_i|w\_j\right) = \frac{\exp\left(u\_{w\_i}^{\top} v\_{w\_j}\right)}{\sum\_{l=1}^{V} \exp\left(u\_l^{\top} v\_{w\_j}\right)},\tag{5}$$


where $V$ is the vocabulary size. Computing $p(w\_i|w\_j)$ is expensive because the denominator sums over the whole vocabulary; consequently, Spark ML uses hierarchical softmax, which reduces the computational cost to $O(\log(V))$ [27].
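To make the cost argument concrete, the sketch below evaluates the skip-gram probability of Eq. (5) with a full softmax and the objective of Eq. (4). It assumes the usual dot-product form of the score and skips the offset $j = 0$ and positions outside the sequence; the all-zero toy vectors are hypothetical, and this is not the Spark ML implementation, which relies on hierarchical softmax instead.

```python
import math

def skipgram_probability(i, j, word_vecs, context_vecs):
    """p(w_i | w_j) with a full softmax over the vocabulary (cost proportional to V)."""
    v_j = context_vecs[j]
    scores = [sum(a * b for a, b in zip(u, v_j)) for u in word_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[i] / sum(exps)

def average_log_likelihood(sequence, k, word_vecs, context_vecs):
    """Eq. (4), skipping j = 0 and positions that fall outside the sequence (assumed)."""
    total = 0.0
    for t, center in enumerate(sequence):
        for j in range(-k, k + 1):
            if j != 0 and 0 <= t + j < len(sequence):
                total += math.log(
                    skipgram_probability(sequence[t + j], center, word_vecs, context_vecs)
                )
    return total / len(sequence)

vocab_size, dim = 4, 3
word_vecs = [[0.0] * dim for _ in range(vocab_size)]     # untrained toy vectors
context_vecs = [[0.0] * dim for _ in range(vocab_size)]

p = skipgram_probability(2, 0, word_vecs, context_vecs)
objective = average_log_likelihood([0, 1, 2], 1, word_vecs, context_vecs)
```

Every call to `skipgram_probability` loops over the whole vocabulary, which is the cost that hierarchical softmax avoids.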

These feature extraction models were used with several classifiers, such as SVM, MaxEnt, voting ensembles, CNN, and LSTM, which extends the recurrent neural network (RNN) architecture. The proposed solution is a weighted voting ensemble classifier that combines the outputs of the different models and their classification probabilities, assigning each model a different weight when voting. The proposed model reaches considerable accuracy in comparison with the other models. This approach is very important in scenarios that allow no human intervention and provide no information about the language used; a good combination of classical and deep-learning algorithms is very useful for achieving better accuracy [19].
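A weighted voting ensemble of this kind can be sketched as a weighted sum of class probabilities. The three models, their probabilities, and the weights below are hypothetical values chosen for illustration; [19] does not give the weights in the passage summarized here.

```python
def weighted_vote(model_probabilities, weights):
    """Combine per-model class probabilities with one weight per model."""
    classes = model_probabilities[0].keys()
    combined = {
        label: sum(w * probs[label] for probs, w in zip(model_probabilities, weights))
        for label in classes
    }
    winner = max(combined, key=combined.get)
    return winner, combined

# Hypothetical outputs of three classifiers for one tweet.
model_probabilities = [
    {"negative": 0.2, "neutral": 0.1, "positive": 0.7},
    {"negative": 0.6, "neutral": 0.2, "positive": 0.2},
    {"negative": 0.5, "neutral": 0.3, "positive": 0.2},
]
weights = [0.5, 0.3, 0.2]  # assumed: the first model is trusted the most

winner, combined = weighted_vote(model_probabilities, weights)
```

In this example two of the three models prefer the negative class, yet the ensemble chooses the positive one because the most trusted model is confident, which is the behavior that distinguishes weighted voting from simple majority voting.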

#### 3.3 From video to emoji

As a consequence of the semantic meaning that emoji carry, some applications and research efforts use image processing to generate an emoji classification or an utterance with embedded emoji. For that purpose, Chembula et al. created an application that receives as input a video stream or images of a person and creates an emoticon based on the image of the face. The solution detects the facial expression at the time the message is being generated. Once the facial expression is detected, the device generates a message with the suitable emoticon [28].

This system performs face detection, facial feature detection, and a classification task to finally identify the facial expression. Although the initial processing proposed by Chembula and Pradesh [22] was not specified in the general description, open source solutions can be used to accomplish this job.

OpenCV is an open source library for computer vision that includes classifiers for real-time face detection and tracking, such as Haar classifiers and Adaptive Boosting (AdaBoost). A trained model for this task can be downloaded as an XML file and imported into an OpenCV project. For feature extraction, the library includes algorithms for detecting regions of interest in the human face, such as the eyes, mouth, and nose. For this purpose, it is important to discard unneeded information from the image stream by converting it to grayscale and then applying a Gaussian blur to reduce noise. The Canny algorithm may be used to trace facial features with more precision than alternatives such as Sobel and Laplace [29].

In [24], Microsoft's Emotion API is used as a tool to detect faces in images captured from the computer's webcam. Once the image is captured, the detected face is classified into seven emotion tags. Although the process is not specified exactly, the API mentioned works on an implementation of the OpenCV library for .NET [30], so the algorithms used for face detection should be the same as those described above.

For the classification task, we can use a nearest neighbor classifier, SVM, a logistic classifier, Parzen density classifiers, normal density Bayes classifiers, or Fisher's linear discriminant [31]. Finally, when the classification is done, the output layer consists of a group of emoji types matched to the meaning of each type of emotion detected in the image of the face. The importance of this contribution lies in the possibility of introducing new forms of human-computer interaction through the use of emotions. This can be useful for intelligent assistants, both physical and visual, that are able to react or behave according to the mood of the people who use a particular intelligent ecosystem.
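As an illustration of the last step, the sketch below uses the simplest of the listed options, a nearest neighbor classifier, and maps the predicted emotion to an emoji. The two facial features, the training examples, and the emotion-to-emoji table are hypothetical and only show the shape of the mapping.

```python
import math

# Hypothetical mapping from a detected emotion to the emoji that represents it.
EMOJI_BY_EMOTION = {"happiness": "😀", "sadness": "😢", "surprise": "😮"}

# Hypothetical training data: (mouth curvature, eye openness) -> emotion label.
TRAINING_SET = [
    ((0.9, 0.5), "happiness"),
    ((-0.8, 0.4), "sadness"),
    ((0.1, 1.0), "surprise"),
]

def nearest_neighbor(features, training_set):
    """Return the label of the training example closest to `features`."""
    _, label = min(
        ((math.dist(features, vector), label) for vector, label in training_set),
        key=lambda pair: pair[0],
    )
    return label

def emoji_for_face(features):
    """Classify the facial features and return the matching emoji."""
    return EMOJI_BY_EMOTION[nearest_neighbor(features, TRAINING_SET)]

emoji = emoji_for_face((0.7, 0.6))
```

Any of the other classifiers listed in the text could replace `nearest_neighbor` without changing the final lookup in `EMOJI_BY_EMOTION`.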

Figure 2.

General process of facial detection and its corresponding classification using emoji.

Figure 2 gives a general overview of the process explained above.
