Advanced Techniques

## **Chapter 2**

## Preprocessing of Slang Words for Sentiment Analysis on Public Perceptions in Twitter

*Media Anugerah Ayu and Abdul Haris Muhendra*

## **Abstract**

Nowadays, many people freely express their evaluations of certain issues *via* social media, generating huge amounts of data every day. On Twitter, public opinions are diverse, which makes it possible to process them for sentiment analysis. However, many people conveniently use slang words when expressing their opinions on Twitter. These slang words can lead to miscalculations in language processing due to the absence of the "real words." This research aimed to investigate the effect of adding slang word handling to the preprocessing stage on the performance of the conducted sentiment analysis. The sentiment analysis was performed using the Naïve Bayes classifier as the classification algorithm with term frequency-inverse document frequency (TF-IDF) as the feature extraction. The research focused on comparing the performance of sentiment analysis on data preprocessed using a slang dictionary against data preprocessed without one. The case used in this research was texts related to the COVID-19 pandemic in Indonesia, especially those related to the implementation of vaccines. The performance evaluation results indicate that sentiment analysis of data preprocessed using a slang word dictionary shows better accuracy than analysis of data preprocessed without it.

**Keywords:** sentiment analysis, slang words, social media, performance evaluation, public opinions, Naïve Bayes, Twitter

## **1. Introduction**

The rapid growth of the Internet has led to huge amounts of information spreading through different platforms, such as blog posts, online discussion forums, product websites, social media, and so forth. Several tools and applications, in the form of social media, serve as the basis for people to communicate and share opinions or information through different media, such as text, images, videos, and audio. One popular social medium capable of gathering information and opinions from the general public is Twitter. Twitter is one of the 10 most-visited websites and has been used as a platform to collect data; for example, it has been used to collect the

tweets related to election candidates [1, 2]. Unique features such as hashtags and retweets make data collection easier. The collected data is then analyzed to see whether opinion leans toward positive, negative, or neutral sentiment. Sentiment analysis, or opinion mining, is a text mining method for determining the attitude of a subject toward a certain topic [2, 3]. Many studies have approached it differently: Bouazizi and Ohtsuki [4] proposed multi-class sentiment classification, while Jianqiang and Xiaolin [5] compared preprocessing methods in sentiment analysis. Sentiment analysis requires classifying the collected tweets to determine whether each expresses a positive, negative, or neutral view.

Classification is a process or technique of categorizing different sets of data into different classes [1]. There are two techniques for classifying the data: lexicon-based and machine learning. The lexicon-based approach classifies sentiment based on a dictionary provided beforehand; the dictionary contains a large amount of data, each entry labeled by annotators, either manually or automatically. Machine learning, on the other hand, uses training and testing data to predict the output when classifying the data. Common algorithms include Naïve Bayes, Maximum Entropy, Support Vector Machine, and K-means.

In machine learning, Naïve Bayes is one of the most commonly used techniques for classification. Naïve Bayes works best when used on a well-formed text corpus, a corpus being a large collection of documents. The algorithm uses training data to learn from the given input and make decisions from it. The decision is divided into three sentiments: positive, negative, and neutral. In this research, the Naïve Bayes algorithm was assessed in terms of accuracy, precision, recall, and F-measure.
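As an illustration of how these measures are computed, the sketch below derives accuracy, precision, recall, and F-measure from lists of predicted and actual labels. The function name and label strings are illustrative, not from the original study.

```python
def evaluate(predicted, actual, positive="positive"):
    """Compute accuracy, precision, recall, and F-measure for one class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # true positives
    fp = sum(p == positive and a != positive for p, a in pairs)  # false positives
    fn = sum(p != positive and a == positive for p, a in pairs)  # false negatives
    accuracy = sum(p == a for p, a in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

With three-class sentiment, these per-class scores would typically be averaged across the positive, negative, and neutral labels.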

Out of the many specific kinds of sentiment analysis that have been conducted, assessing sarcasm is regarded as one of the hardest challenges to explore, especially in Indonesia, where research in that area is limited. Sarcasm or irony can also burden the performance of sentiment analysis [6]. Another issue in Indonesia is that tweets are popularly typed using slang words or abbreviations. Singh and Kumari [7] stated that slang is one of the major challenges in this area, along with noise, relevance, emoticons, and folksonomies. Ignoring slang introduces ambiguity that sometimes leads to miscalculation of the sentiment. Several researchers have therefore optimized the data cleaning step for documents in which slang words occur. In Indonesia, some researchers, such as [6, 8, 9], focus specifically on slang words in their papers, with methods varying from improving the stemming process for slang to generating their own slang lexicons. Using one of the basic stemming algorithms for the Indonesian language, sentiment can be evaluated with better accuracy; the common method for Indonesian stemming is the Nazief and Adriani stemming algorithm [10]. Other research studies, such as Drus and Khalid [1], Jianqiang and Xiaolin [5], Rahayu et al. [6], Nuritha et al. [11], Adarsh and Ravikumar [12], Ferdiana et al. [13], Fitri et al. [14], and Mandloi and Patel [15], show the effectiveness of term frequency-inverse document frequency (TF-IDF) as feature extraction from the lexicon-based approach, while Naïve Bayes is the optimum classifier from the machine learning approach.

One interesting case for sentiment analysis in Indonesia is the topic of the coronavirus (COVID-19) pandemic. Over a year into the pandemic, Indonesia had become the country with the highest case prevalence and fatality rate among Southeast Asian countries. By checking the trending tweets that are discussing

### *Preprocessing of Slang Words for Sentiment Analysis on Public Perceptions in Twitter DOI: http://dx.doi.org/10.5772/intechopen.113725*

the virus on Twitter, one can find hashtags related to it, such as "#covid", "#covid19", "#delta", "#omicron", and "#vaccine". Social and physical distancing to reduce the transmission of the virus had been implemented in several countries, including Indonesia; campaigns to limit human-to-human transmission, as well as self-hygiene, were required. More than a year after the first case of coronavirus in Indonesia, positive cases had risen to a total of 4,763,252 as of 12 February 2022. The government had taken action to administer the coronavirus vaccine to help reduce the spread of the virus. Up until 12 February 2022, 135,209,233 people in Indonesia had been fully vaccinated, which is 50.7% of the population.

Observing the sentiment of people talking about the virus can serve as a measure of public perception of, and emotional response to, the global pandemic. Therefore, one objective of this research is to help draw a preliminary picture of the perception of Indonesian people toward the pandemic. The main objective of this research is to seek a better sentiment analysis result when the slang words and abbreviations commonly used in tweets are taken into account in the process. The data is collected through the Twitter API. People's opinions on Twitter are selected by searching for several popular words related to COVID-19 and its vaccination. The collected data is then processed in two different ways: one uses a slang word and abbreviation dictionary in the preprocessing step, while the other does not. Evaluation is then done by comparing the performance measures of both processes, the one with slang words included and the one without.

The remainder of this paper is structured as follows: Section 2 describes related work from previous studies, and Section 3 explains the method used in this research. Section 4 presents results from the preliminary research and the main experiments, together with their discussion. Section 5 presents the conclusion.

## **2. Related work**

This section discusses previous studies related to this research. The discussed studies are grouped into four areas: public perception, sentiment analysis, Twitter, and COVID-19.

#### **2.1 Public perception**

Public opinion/perception refers to the social and political attitudes held by the public toward the emergence, spread, and change of social events in a certain social space. It can be expressed according to entities, behaviors, and emotional words. Previous research has been conducted on many branches of the topic of assessing public perception.

Assessing public perceptions is usually conducted through surveys, including defined-preference or customer satisfaction surveys. Casas and Delmelle [16] discussed how Twitter can be used to assess public perceptions of BRT (bus rapid transit) in the area of Cali, Colombia. The main purpose of their research was to find out what discussions were happening about transportation systems, especially on the topics of user satisfaction and service quality. Moreover, they wanted to verify that the information in tweets in the Latin American context is consistent with existing knowledge about service-quality factors in the country. They used Twitter

Search API (through the twitterSearch library) within a 9-day time frame, filtered by geographic location within a 60 km radius from the center of Cali city. Moreover, they filtered tweets with only two search keywords: MetroCali and MIO.

While the above research used public perception to understand user satisfaction with public transportation in a city, public perception can also be used to crowdsource information in disasters, for example, in gathering information on building seismic safety following the Canterbury earthquakes in New Zealand [17]. The purpose of that research is close to this one, which concerns the nation-level disaster of coronavirus 2019: both seek risk information and expert opinion to relieve public anxiety, in their case regarding acceptance of building standards for withstanding earthquakes.

Regarding social media itself, many researchers discuss it more specifically, especially when talking about measuring public opinion with social media data. Klašnja et al. [18], in Oxford Handbooks Online, discussed social media data and public opinion. They stated three reasons why social media can be used to measure public opinion. First, social media offers a chance to observe the opinions of the public without any prompting or framing effects from analysts: the analyst does not need to set up a burdensome environment or impose a topic; rather, opinions can be observed by choosing a topic of interest and filtering all related opinions. The second factor is the reach of the data. Since social media is found all over the world, it provides enormous amounts of data on a daily or even hourly basis; Twitter itself is likely already the biggest time-series dataset of individual public opinion available to the public. The third and last factor is cost and practicality: with a few lines of code executed on a simple device, anyone can capture a selected topic in real time for free. These three factors are the main reasons why social media is considered a good choice for examining public opinion.

## **2.2 Sentiment analysis**

Sentiment analysis or opinion mining is the study of determining people's opinions, attitudes, and emotions toward things related to them, such as entities, individuals, issues, events, or topics [2, 3]. Its focus is to analyze opinions from a text document. It is part of natural language processing (NLP), the set of techniques for analyzing and describing natural-language text. The study involves classifying the attitude of texts into three common categories: positive, negative, and neutral statements. Various algorithms have been developed to perform this classification.

In their paper, Drus and Khalid [1] present a systematic literature review (SLR) of the sentiment analysis topic. Drawing on five online literature databases, namely Emerald Insight, Science Direct, Association for Computing Machinery (ACM), Scopus, and IEEE, they identified a total of 407 articles with the keywords "sentiment analysis, social media, Facebook, Twitter" published between 2014 and February 2019. After screening the available articles, a total of 24 articles were selected. Of these 24 papers, 7 used lexicon-based methods, 10 used machine learning methods, and 7 combined both. Another paper [2] also conducted an SLR on sentiment analysis focused on Twitter data: of 42 papers deeply reviewed, 23 used machine learning-based approaches, 10 employed lexicon-based approaches, and 9 used hybrid ones.


#### *2.2.1 Lexicon-based approach*

Lexicon-based analysis is one way to approach sentiment analysis; it requires no training data and depends only on a dictionary prepared beforehand, and is therefore considered an unsupervised method [1]. The lexicon-based method determines the overall sentiment tendency of a given text by using a pre-established lexicon of words weighted with their sentiment orientation: it computes the final polarity score of the text from prepared language resources of positive, negative, and neutral words.
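A minimal sketch of this idea, assuming a tiny hand-made lexicon; the entries and weights below are purely illustrative, not a real sentiment resource.

```python
# Illustrative lexicon: positive words weighted +1, negative words -1.
LEXICON = {"bagus": 1, "senang": 1, "buruk": -1, "kecewa": -1}

def lexicon_polarity(tokens):
    """Sum the sentiment weights of known tokens; the sign gives the label."""
    score = sum(LEXICON.get(t, 0) for t in tokens)  # unknown words score 0
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A real lexicon-based system would use a curated resource with thousands of weighted entries, but the final scoring step is essentially this summation.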

Many papers have used the lexicon method to get the sentiment of people's opinions. In their work, Al-Thubaity et al. [19] created their lexicon using the Saudi Dialect Twitter Corpus (SDTC), which consists of 5400 tweets containing Saudi dialect. The corpus was chosen to minimize the risk of dataset prejudice toward a specific topic. Tweet classification was then done using SaudiSenti, a lexicon containing 4431 words. The lexicon was compared with the previously available lexicon AraSenTi, with the result that SaudiSenti outperformed AraSenTi on neutral tweets. Mukhtar et al. [20] work on a lexicon-based approach for the Urdu language. Their method first creates a sentiment lexicon in Urdu with the help of annotators; an analyzer is then created to perform the sentiment analysis. Even though they use the lexicon approach, some techniques common in machine learning pipelines are still used, such as stop word removal, sentence classification, and attribute selection. The result is that the lexicon-based approach outperforms the machine learning approach in many aspects, such as accuracy, precision, recall, F-measure, time taken, and effort, because the lexicon and the analyzer are well developed.

Besides being used to get the sentiment, a lexicon can also be used to collect other things, such as a slang dictionary. Wu et al. [21], Salsabila et al. [22], and Muliady and Widiputra [23] discuss building slang dictionaries. The crawled slang words are retrieved from online dictionaries in the respective languages; some works also attach a sentiment score alongside the meaning, and the entries are curated to avoid mistakes.

#### *2.2.2 Machine learning approach*

According to Vieira et al. [24], machine learning is "an area of artificial intelligence that is concerned with identifying patterns from data and using these patterns to make a prediction about unseen data." It involves learning patterns in the data, storing the processed patterns, and then using them to make predictions. It differs from traditional statistics in at least four ways: it can speculate at the individual level; it focuses on maximizing generalizability; it is a data-driven approach; and it takes individual heterogeneity into account. Among the categories of machine learning, supervised learning is by far the most commonly used approach in research that requires machine learning. Supervised learning is a machine learning setting where prepared correlations between data and expected outcomes are provided as examples [25]. The algorithm learns the optimal function that captures the relationship between the input and the output variable.

As an analogy, the learning process can be compared to a student learning with a teacher. The teacher knows the correct answers to some questions, and the student tries to answer them as closely as possible; if the student gets an answer wrong, the teacher corrects the mistake. Likewise, the model predicts results such that the difference between the predictions and the targets is as

small as possible. A supervised method works by training classifiers on combinations of features; in the tweet context, for example, the features can be hashtags, retweets, emoticons, capitalized words, and so forth [26]. It utilizes algorithms to extract and detect sentiment from data, with the most commonly used algorithms being Naïve Bayes, Support Vector Machine, and Random Forest.

Work presented in Singh et al. [27] performs Twitter sentiment analysis using the Rapid Miner tool. The authors use two common algorithms, Naïve Bayes and k-NN. The dataset was fetched from Twitter on the topic of a government campaign and classified into positive and negative opinions. Both algorithms achieve 100% accuracy in finding positive values but fail to find negative values. The authors suggest using the NLTK toolkit from Python instead of the Rapid Miner tool, since it provides many built-in libraries.

#### **2.3 Twitter**

Twitter is one of the most famous social media platforms that allow users to post brief text updates, with one tweet (text message) limited to 280 characters. This microblogging service was officially released on 13 July 2006 and can be accessed *via* the web or mobile [14]. With over 313 million monthly active users and over 500 million tweets per day, Twitter has become one of the most promising platforms to enhance the social, political, or economic side of individuals or organizations [5].

Many interesting features have made Twitter popular as a data source for studies of public opinion. With the 280-character limit, users spend little time creating one tweet. Moreover, features like "Retweet" make spreading information much faster: users only need to click or tap the retweet icon (a double arrow forming a loop) to make a tweet appear on their homepage. Hashtags (labeled with the sign "#") also help people find topics easily. According to Bouazizi and Ohtsuki [28], hashtags are "labels used on social network and microblogging services which make it easier for users to find messages with a specific theme or content." They are useful not only for spreading news or pointing discussion to the topics at hand but also for setting trending topics. Another unique aspect of Twitter is that its data can be accessed freely through the Twitter API, making data collection easier; by registering as a Twitter developer, data can be collected and processed without breaking any rules.

Various studies have used Twitter as their data source for sentiment analysis. Work presented in Drus and Khalid [1] reviewed 24 papers related to sentiment analysis, of which only 6 did not use Twitter as their context, drawing instead on other sources such as YouTube, Facebook, StockTwits, or news blogs. Another work, Wang et al. [2], reviewed 42 papers using Twitter as their data source for conducting sentiment analysis. A study by Zimmer and Proferes [29] presents a topology of Twitter research over 380 academic publications, ranging from 2006 to 2012, that used Twitter as their main platform of data collection and analysis. Furthermore, a recent study presented in [30] also used Twitter data to develop a sentiment analysis model for stock market prices.

#### **2.4 COVID-19**

It is mentioned in Harapan et al. [31] that the coronavirus was first identified as a cold in 1960, which was treated as a simple nonfatal virus. It was known as COVID-19


when the first case was identified in Wuhan, China, in December 2019. Later, a new type of coronavirus, 2019-nCoV, was identified from the outbreak in Wuhan. WHO declared it a global pandemic on 11 March 2020, as it had affected 172 out of 195 countries with more than 30,000 reported deaths. The coronavirus generally spreads through airborne droplets; people can get infected if the eyes, nose, or mouth come into contact with an infected droplet. It causes respiratory infections, with effects including pneumonia, cold, sneezing, and coughing [32].

The strategy to reduce the spread of the virus consists of simple practices: covering the mouth and nose while coughing or sneezing, maintaining a minimum 1-m distance between persons, and frequent handwashing, which merely slow the virus from spreading. "Social distancing" movements were held in many countries that reported positive cases, with strategies including closing educational institutions and workplaces, canceling events that required mass gatherings, self-quarantining people suspected of contact with the virus, stay-at-home recommendations, and even lockdowns in some cities [33]. People with symptoms of the virus self-quarantine because its incubation period is 14 days or less, with an average of 5 days [34]. Hence, facilities that remain open during the outbreak need to check people for common symptoms, and every facility needs to be equipped with at least a thermal detector and hand sanitizer.

A study presented in Nicola et al. [35] reviewed the pandemic in terms of socioeconomic aspects. The classification is divided into three sectors: primary sectors, industries that supply raw materials; secondary sectors, which produce finished products; and tertiary sectors, including service providers. An important missing part is the social impact. Lockdowns in many countries increased the level of domestic violence and physical, emotional, and sexual abuse. In many instances, domestic violence became more difficult to expose, since no one could leave their house unless necessary; thus, guidelines for finding and reporting domestic abuse were published in several media. Vieira et al. [33] discuss how to maintain well-being during the pandemic. Stress is one of the unavoidable effects of lockdown due to the limited activities that can be done. The authors suggest that people need to be aware of this to prevent the risk of stress-related health problems. Updates on the situation should be checked daily from reliable sources of information; misinformation should be reduced by using more diverse channels such as television, radio, newspapers, and online news; and information should be spread in ways that make clear to people what they need to do.

Another study, Chen et al. [36], focused on retrieving public opinion from one of the popular news websites using keywords related to the coronavirus, ranging from 1 January 2020 to 7 July 2020. Using a skip-gram word-to-vector model and manual screening, filtered trigger words were selected as the dataset. They then mapped the dominant public opinion by analyzing the frequency and probability of keywords in each category.

## **3. Methodology**

As mentioned earlier, this study aims to investigate the effect of including a slang word dictionary step in the preprocessing phase of the tweet data on the performance of the conducted sentiment analysis. The dataset is retrieved by crawling tweets with related keywords on Twitter. The search query used to retrieve the tweets relates to the topic of COVID-19 in Indonesia, such as "corona," "covid-19," and "vaksin," as well as related hashtags, such as "#vaccine," "#vaksin," and "#corona." Tweet data were collected daily for 7 days (1 week) from the day of execution. After that, the data were stored as a CSV file, which was used to obtain the sentiment score. The sentiment scoring was performed automatically using an Indonesian lexicon approach available on GitHub.

To be able to get Twitter datasets, we need to create a Twitter developer account. A developer app is also needed to generate the keys and tokens. Four credentials grant access for data collection: access token, access token secret, consumer key, and consumer key secret. These keys are used to crawl the tweets legitimately *via* the Twitter API. The dictionaries for slang words are retrieved from other works: Okky Ibrohim's slang word dictionary [37] found on GitHub (link), Louis Owen's on GitHub (link), and Rama Prakoso's on GitHub (link). These dictionaries are later used in the slang word part of preprocessing, usually called normalization.
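The crawling setup can be sketched as follows. The credential placeholders and the OR-joined query format are assumptions about how the collection might be configured, not details stated in the paper.

```python
# Placeholder credentials; real values come from a Twitter developer app.
CREDENTIALS = {
    "consumer_key": "<consumer_key>",
    "consumer_secret": "<consumer_key_secret>",
    "access_token": "<access_token>",
    "access_token_secret": "<access_token_secret>",
}

# Keywords and hashtags from the paper's search query.
KEYWORDS = ["corona", "covid-19", "vaksin", "#vaccine", "#vaksin", "#corona"]

def build_query(keywords):
    """Join search keywords with OR so a tweet matching any of them is returned."""
    return " OR ".join(keywords)

# With a client library such as tweepy, one would authenticate with the four
# credentials above and pass build_query(KEYWORDS) to the search endpoint,
# then write the returned tweets to a CSV file.
```

The daily collection over 7 days would then simply re-run this query once per day and append the results to the CSV dataset.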

Later on, the dataset from the crawling process is divided into two parts: training data and testing data. The training data is labeled with positive, negative, or neutral sentiment before being fed to the classification process. Once the data has been labeled and the classifier trained, the testing data is applied to the process as the data to be evaluated. This process is then repeated with a different treatment. The first treatment, without the slang word dictionary, serves as the baseline; the second uses the combination of the slang dictionaries mentioned above. The details of the research process can be seen in **Figure 1**.
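The split itself might look like the following sketch; the 80/20 ratio and the fixed shuffle seed are assumptions for illustration, as the paper does not state its exact split.

```python
import random

def split_dataset(rows, test_ratio=0.2, seed=7):
    """Shuffle labeled tweets and split them into training and testing sets."""
    shuffled = rows[:]                       # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Running both treatments on the same split keeps the comparison between the with-slang and without-slang pipelines fair.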

**Figure 1** shows the research model of the sentiment analysis; the process is divided into four parts for clarity. First, the data was collected from Twitter through the API credentials. The collected data were stored in a database as a corpus-type file (.csv) and then moved to the preprocessing stage, where the tweets were read. The preprocessing stage was split into two methods: the first did not use the slang word and abbreviation dictionary, while the second did. Stemming used the Python library "Sastrawi," which reduces words in the Indonesian language (Bahasa Indonesia) to their base form. The results were labeled using the TextBlob library in Python. The training and testing sets were processed for feature extraction, and the model was then evaluated on the results; when the machine learning algorithm failed to predict the sentiment, an error was recorded. In the end, by examining both accuracy and error, this study can draw conclusions about the tweets.

### **3.1 Data preprocessing**

Steps done in the preprocessing phase of this research are: case folding, cleansing, negation conversion, emoticon conversion, tokenization, stop word removal, and stemming. The difference between processes one and two is the additional slang word and abbreviation dictionary applied before the stemming process. The preprocessing methods are also listed in **Figure 1**.

#### **Figure 1.**

*Process flow of the sentiment analysis with slang words.*

Case folding is a step where all uppercase letters in the tweet document are converted to lowercase; only the letters "a" to "z" are accepted at this stage. The purpose is to remove redundancy where two tokens differ only in letter case. Next, cleansing is done to remove words that do not correlate with the result of sentiment classification. A tweet document has various attributes that do not affect the sentiment, since nearly every tweet contains them. Examples of such unimportant attributes are the mention feature (symbolized by "@"), hashtags (symbolized by "#"), links (marked by "http," "bit.ly," and ".com"), and special characters (!@#\$%^&\*()\_+{}[]|?<>;':). These attributes are replaced by a space character to make the text easier to classify. Then comes the conversion of negation words that occur in a tweet. A negation changes the sentiment value of the document; thus, the negation word is combined with the word that follows it. Examples of negation words are "bukan," "jangan," and "tidak." This is followed by emoticon conversion, which removes every emoticon from the text.
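The case folding, cleansing, and negation steps can be sketched as below. The regular expressions and the underscore used to join negation pairs are illustrative choices, not the exact rules of the original pipeline.

```python
import re

# Indonesian negation words from the text.
NEGATIONS = {"bukan", "jangan", "tidak"}

def case_fold(text):
    """Lowercase the tweet so case differences do not create duplicate tokens."""
    return text.lower()

def cleanse(text):
    """Strip mentions, hashtags, links, and special characters."""
    text = re.sub(r"@\w+|#\w+", " ", text)                           # mentions, hashtags
    text = re.sub(r"http\S+|\S*bit\.ly\S*|\S*\.com\S*", " ", text)   # links
    text = re.sub(r"[^a-z\s]", " ", text)                            # keep only a-z
    return re.sub(r"\s+", " ", text).strip()                         # collapse spaces

def convert_negation(tokens):
    """Merge a negation word with the word that follows it."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Note that `cleanse` assumes `case_fold` has already run, since its final filter keeps only lowercase letters and spaces.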

The next process is tokenizing, which splits the text into single words (tokens); words in the document are separated by spaces. The result of this process is a set of single words ready for weighting. Then, stop word removal is performed to remove words that do not suit the document topic and do not affect the accuracy of sentiment classification. The stop words are stored in a stop word database; any stop word found in a document is replaced by a space character. Then comes the slang word step, which is the main part of this research: by comparing the pipeline with this additional step against the one without it, their performance can be analyzed. The step works by replacing words that do not follow standard Indonesian spelling (EYD, "Ejaan yang Disempurnakan") according to the slang word dictionary used. After that, stemming is done to convert the words in a document back to their root forms using certain rules. Indonesian stemming is done by removing suffixes, prefixes, and confixes from the words in the document.
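A sketch of these steps follows, with tiny illustrative stop word and slang dictionaries standing in for the full resources cited earlier; a real pipeline would finish with Sastrawi stemming.

```python
STOPWORDS = {"yang", "di", "ke", "dan", "ini"}           # tiny illustrative list
SLANG = {"gk": "tidak", "yg": "yang", "bgt": "banget"}   # illustrative entries only

def tokenize(text):
    """Split a cleaned tweet on whitespace into single-word tokens."""
    return text.split()

def remove_stopwords(tokens):
    """Drop words that do not affect the sentiment classification."""
    return [t for t in tokens if t not in STOPWORDS]

def normalize_slang(tokens):
    """Replace slang with standard (EYD) words via dictionary lookup."""
    return [SLANG.get(t, t) for t in tokens]

# Stemming would follow, e.g. with the Sastrawi library, reducing each
# normalized token to its Indonesian root form.
```

Running `normalize_slang` before stemming matters: the stemmer expects standard-form words, so unnormalized slang would otherwise pass through unstemmmed.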

### **3.2 Feature extraction with TF-IDF**

TF-IDF is one of the methods commonly used in feature extraction. This method is famous for being efficient, easy, and accurate. It is used to calculate the weight of the words used in information retrieval. It calculates the value of TF and IDF on every token (words) in every document in the corpus.

TF is the number of times a word occurs in a document: the more often a word appears in a document, the more it affects that document, and vice versa. IDF is a word weighting based on how many documents contain the word: the more documents contain a certain word, the less that word affects any single document, and vice versa. The equations to determine TF-IDF are given below:

$$\text{IDF}\left(\mathbf{w}\right) = \log\left(\frac{N}{\text{DF}(\mathbf{w})}\right) \tag{1}$$

$$\text{TF-IDF}\left(\mathbf{w}, d\right) = \text{TF}\left(\mathbf{w}, d\right) \times \text{IDF}\left(\mathbf{w}\right) \tag{2}$$

Where IDF(w) is the inverse document frequency of word *w*, *N* is the number of documents in the corpus, DF(w) is the number of documents containing word *w*, TF-IDF(w,d) is the weight of word *w* in document *d*, TF(w,d) is the frequency of occurrence of word *w* in document *d*, *w* is a word, and *d* is a document.
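A minimal implementation of Eqs. (1) and (2) is sketched below; raw counts are used for TF, since the chapter does not specify a TF variant, so that choice is an assumption:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight each token per Eqs. (1)-(2): IDF(w) = log(N / DF(w))."""
    N = len(corpus)
    df = Counter()                     # DF(w): documents containing w
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)              # TF(w, d): raw count of w in d
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

docs = [["vaksin", "aman"], ["vaksin", "vaksin", "efektif"]]
w = tf_idf(docs)
# "vaksin" appears in every document, so its IDF (and weight) is 0.
```

Note that a word occurring in all *N* documents gets IDF = log(1) = 0, which is the mathematical expression of the intuition above that ubiquitous words carry no discriminative weight.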

*Preprocessing of Slang Words for Sentiment Analysis on Public Perceptions in Twitter DOI: http://dx.doi.org/10.5772/intechopen.113725*

#### **3.3 Classification with Naïve Bayes algorithm**

In this paper, the algorithm used for the classification process is the Naïve Bayes algorithm. The algorithm was chosen because it is simple and can perform well with a small dataset, which is useful for classifying positive and negative words that are conditionally independent of each other. Given an appropriate probability model, this classifier can be trained effectively in a supervised learning setting. The algorithm is derived from a classifier based on the presence or absence of class A in a given document B. The following is the basic formula used in the Naïve Bayes algorithm:

$$P(\mathbf{A}|\mathbf{B}) = \frac{P(\mathbf{A})P(\mathbf{B}|\mathbf{A})}{P(\mathbf{B})} \tag{3}$$

Where A belongs to a positive or negative class and B is the document whose class is being predicted. The numerator terms *P*(A) and *P*(B|A) are obtained during data training. Each tweet is represented by its attributes (*a*1, *a*2, *a*3, … , *a*n), where *a*1 is the first word, *a*2 the second, and so on, and *V* represents the set of classes. During classification, this method assigns the category or class with the highest probability (*V*MAP) given the attributes (*a*1, *a*2, *a*3, … , *a*n). The equation is given below:

$$V_{\text{MAP}} = \operatorname*{argmax}_{v_j \in V} P\left(v_j \mid a_1, a_2, a_3, \dots, a_n\right) \tag{4}$$

By using Bayes theorem, Eq. (4) can be written as:

$$V_{\text{MAP}} = \operatorname*{argmax}_{v_j \in V} \frac{P\left(a_1, a_2, a_3, \dots, a_n \mid v_j\right) P\left(v_j\right)}{P\left(a_1, a_2, a_3, \dots, a_n\right)} \tag{5}$$

*P*(*a*1, *a*2, *a*3, … , *an*) is constant for every *vj*; thus, the equation can be simplified to Eq. (6) below:

$$V_{\text{MAP}} = \operatorname*{argmax}_{v_j \in V} P\left(a_1, a_2, a_3, \dots, a_n \mid v_j\right) P\left(v_j\right) \tag{6}$$

Naïve Bayes Classifier simplifies this by assuming that in every category, each attribute is conditionally independent of each other. Thus:

$$P\left(a_1, a_2, a_3, \dots, a_n \mid v_j\right) = \prod_i P\left(a_i \mid v_j\right) \tag{7}$$

Then, substituting Eq. (7) into Eq. (6) gives formula (8) below:

$$V_{\text{MAP}} = \operatorname*{argmax}_{v_j \in V} P\left(v_j\right) \times \prod_i P\left(a_i \mid v_j\right) \tag{8}$$

*P*(*vj*) and the probability of word *ai* for every category, *P*(*ai*|*vj*), are calculated during the training process using formulas (9) and (10):

$$P\left(v_j\right) = \frac{|\text{docs}_j|}{|\text{training}|} \tag{9}$$

$$P\left(a_i \mid v_j\right) = \frac{n_i + 1}{n + |\text{vocabulary}|} \tag{10}$$

Where docsj is the number of documents in category *j*, training is the number of documents used in the training process, *ni* is the number of occurrences of word *ai* in category *vj*, *n* is the total number of word occurrences in category *vj*, and vocabulary is the number of unique words in the training data.
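The training formulas (9) and (10) and the decision rule (8) can be sketched as follows; the toy documents and labels are hypothetical stand-ins for tokenized tweets:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate P(v_j) per Eq. (9) and P(a_i|v_j) per Eq. (10)."""
    vocab = {w for d in docs for w in d}
    priors, cond = {}, {}
    for v in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == v]
        priors[v] = len(class_docs) / len(docs)              # Eq. (9)
        counts = Counter(w for d in class_docs for w in d)
        n = sum(counts.values())                             # words in class v
        cond[v] = {w: (counts[w] + 1) / (n + len(vocab))     # Eq. (10)
                   for w in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    # V_MAP per Eq. (8), using log-probabilities for numerical stability.
    scores = {v: math.log(priors[v]) +
                 sum(math.log(cond[v][w]) for w in doc if w in vocab)
              for v in priors}
    return max(scores, key=scores.get)

# Toy training data (hypothetical tokenized tweets).
docs = [["bagus", "aman"], ["buruk", "bahaya"], ["bagus"]]
labels = ["pos", "neg", "pos"]
priors, cond, vocab = train_nb(docs, labels)
```

The add-one term in Eq. (10) (Laplace smoothing) keeps any word unseen in a class from zeroing out the whole product in Eq. (8).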

#### **3.4 Design of experiments**

There are two phases of experiments in this research: preliminary works and main experiments. In the preliminary research, we experimented with two variables: the use of a slang dictionary and the ratio used to split the data into training and testing sets. **Table 1** shows the design of experiments (DoEs) for the preliminary research. The data used was 4000 tweets crawled on 14 July 2021. The slang word dictionary used for the preliminary works was dictionary A (Okky Ibrohim), which generated six results. These results were then analyzed to choose which data split ratio to use in the main experiments.

The main experiments were then conducted with a different parameter: the slang word dictionary used. There were eight different experiments, as presented in **Table 2**.

## **4. Result and discussion**

This section presents the results and discussions from two phases of the research study, that is, preliminary works and main experiments.


#### **Table 1.**

*DoEs for preliminary research.*


**Table 2.** *DoEs for the main experiment.*


### **4.1 Preliminary works**

The preliminary work used tweets crawled with a Python script, with the search query limited to a 50 km radius from the central geocode of Jakarta, Indonesia. **Table 3** shows the first five rows of the raw crawled data.

The preprocessing stage was then applied to the crawled tweets. As explained in the methodology section, preprocessing cleans the data to simplify further processing. Two types of preprocessing were performed: (i) tweets cleaned without using a slang word dictionary and (ii) tweets cleaned using a slang word dictionary. **Table 4** shows the results from both preprocessing variants. It can be observed that a number of non-standard words are left unresolved when the slang dictionary is not used.

Next, results from the preprocessing stage were labeled using a lexicon-based approach, which counts the words carrying sentiment values and scores them against a dictionary of positive and negative words. The sentiment score is divided into three classes: positive for a score above 0, negative for a score below 0, and neutral for a score of exactly 0. **Tables 5** and **6** show the first five tweets labeled with the lexicon-based approach.
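A minimal sketch of this lexicon-based labeling; the lexicon entries below are hypothetical stand-ins for the real positive and negative word dictionaries:

```python
# Hypothetical lexicon entries (word -> sentiment value).
POS_WORDS = {"aman": 1, "efektif": 1}
NEG_WORDS = {"bahaya": -1, "sakit": -1}

def label_sentiment(tokens):
    # Sum the sentiment values of all tokens found in the lexicons.
    score = sum(POS_WORDS.get(t, 0) + NEG_WORDS.get(t, 0) for t in tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

print(label_sentiment(["vaksin", "aman", "efektif"]))  # → ('positive', 2)
```

This also shows why slang normalization matters before labeling: a slang spelling absent from the lexicon contributes 0 to the score even when its standard form carries sentiment.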


**Table 3.**

*The first five results of crawled tweets.*


#### **Table 4.**

*Results from cleaning process of the first five tweets.*


#### **Table 5.**

*First five tweets tokenized and labeled using slang word dictionary.*

The results of tokenizing the tweets show differences between the runs with the slang word dictionary and the runs without it. This leads to different calculations when feature extraction is applied, which propagates to differences in polarity scores, even though the polarity labels are all the same.


#### **Table 6.**

*First five tweets tokenized and labeled without slang word dictionary.*

**Figure 2** shows the sentiment distribution of the tweet dataset, as both tweet counts and percentages. There is a difference in the total number of tweets resulting from preprocessing and labeling with the slang word dictionary (1952 tweets) and without it (1958 tweets).

After the dataset was cleaned and labeled, it went through the feature extraction process. TF-IDF feature extraction was selected with *n*-gram and bigram features. The dataset was then split into training data and testing data, classified using Naïve Bayes, and assessed for performance. Three data splitting ratios, 60:40, 70:30, and 80:20, were used in the experiments; the performance evaluation results are displayed in **Figure 3**, where Ratio 3 (80:20) shows the best performance among the three.
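Using scikit-learn (assuming it is available), the extract-split-classify chain can be sketched as below. The toy corpus, the unigram-plus-bigram setting, and the random seed are illustrative assumptions, not the study's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy stand-in corpus; the study used ~4000 preprocessed tweets.
texts = ["vaksin aman", "vaksin efektif", "vaksin bahaya",
         "efek samping buruk"] * 10
labels = ["pos", "pos", "neg", "neg"] * 10

# TF-IDF features with unigrams and bigrams.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)

# 80:20 split (Ratio 3, the best-performing split in the study).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, random_state=42)

clf = MultinomialNB().fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
```

Swapping the input corpus between the with-dictionary and without-dictionary preprocessing outputs, while holding this chain fixed, is exactly the comparison the experiments perform.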

Next, main experiments were performed with the following notes:


### *Advances in Sentiment Analysis – Techniques, Applications, and Challenges*

#### **Figure 2.**

*Sentiment distribution of dataset after preprocessing and labeling.*

#### **Figure 3.**

*Performance evaluation results of the sentiment classification process.*

(Dict. B) with 1026 words, and Rama Prakoso (Dict. C) with 1319 words and our own dictionary (Dict. D) with 882 words;

3. The main experiment covered 8 different experiments as presented in **Table 7**. These main experiments were conducted at an 80:20 data splitting ratio, as the preliminary results showed that this ratio gave the best performance.

Performance evaluation covering accuracy, precision, recall, and F1-score was done for each of the 16 experiments in the main phase. In addition, the computation time for each experiment was observed and monitored. Experiment 1 took approximately 1 hour (59 minutes and 26 seconds) to complete, while Experiment 16 took less than 1 hour (41 minutes and 55 seconds). This shows that using a slang word dictionary in preprocessing can reduce the total computation time required.



#### **Table 7.**

*DoE of the main experiment.*

#### **Figure 4.**

The performance evaluation results in **Figure 4** show that using a slang word dictionary can improve the accuracy of the sentiment classification process. Experiment 1, which did not use a slang word dictionary, has the lowest accuracy of all the experiments; the others all used slang word dictionaries. These results are also quite promising compared with another recent study [38], which reported 71.97% as the highest accuracy of its sentiment analysis of tweets in social networks.

An ANOVA test was then conducted to analyze whether the experimental results show a significant difference between treatments [39]. Since only one factor, the slang word dictionary, was varied in the experiments, single-factor ANOVA was used for the analysis. The results of the ANOVA analysis are presented in **Table 8**.
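A single-factor ANOVA of this kind can be reproduced with SciPy (assuming it is available). The accuracy values below are hypothetical repeated-run scores, not the study's data:

```python
from scipy.stats import f_oneway

# Hypothetical accuracy scores from repeated runs of two treatments;
# the real analysis compared all slang-dictionary treatments at alpha = 0.05.
no_dict   = [0.7372, 0.7380, 0.7365, 0.7391]
with_dict = [0.7615, 0.7608, 0.7622, 0.7611]

f_stat, p_value = f_oneway(no_dict, with_dict)
reject_h0 = p_value < 0.05  # significant difference between treatments
```

With more than two treatment groups, further groups are simply passed as additional arguments to `f_oneway`; a significant result then motivates a post hoc test such as the LSD test used below.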

Data in **Table 8** shows that the *p*-value, 4.04253E−18, is far below the significance level (0.05) used for the ANOVA. It can therefore be concluded that the null hypothesis H0 is rejected and the alternative hypothesis H1 is accepted.

$$\mathrm{H}_0: \mu_1 = \mu_2 = \dots = \mu_m \quad \text{(null hypothesis)}$$

$$\mathrm{H}_1: \mu_1 \neq \mu_m \quad \text{(alternative hypothesis)} \tag{11}$$


#### **Table 8.**

*The ANOVA results from the main experiment.*

Where H0: there is no significant difference in treatment of dictionary in sentiment analysis and H1: there is a significant difference in treatment of dictionary in sentiment analysis.

In the next step, since there is a significant difference among the dictionary groups, the least significant difference (LSD) test can be conducted to see which groups differ significantly [40]. The test is done using the following formula, which applies here because the same number of repetitions was performed in each experiment.

$$\text{LSD} = t_{v,\alpha} \sqrt{\text{MS}_{\text{S(A)}} \cdot \frac{2}{s}}$$

$$\text{LSD} = 1.997729654 \sqrt{0.0000225152352610331 \times \frac{2}{s}} = 0.005995218 \tag{12}$$

**Table 9** shows the application of the LSD value, together with the notation labeling used to identify significant differences among the groups of experiments.
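For illustration, the LSD threshold of Eq. (12) can be recomputed and applied in Python. Note that *s* = 5 repetitions is an inferred value that reproduces the reported 0.005995218, and the group means below are hypothetical:

```python
import math
from itertools import combinations

def lsd(t_crit, ms_within, s):
    # LSD = t_{v,alpha} * sqrt(MS_S(A) * 2 / s), valid for equal repetitions s.
    return t_crit * math.sqrt(ms_within * 2 / s)

# Values from Eq. (12); s = 5 repetitions reproduces the reported LSD.
threshold = lsd(1.997729654, 0.0000225152352610331, 5)

# Two experiments differ significantly if their mean accuracies differ by
# more than the LSD threshold (hypothetical means shown here).
means = {"Exp1": 0.7372, "Exp16": 0.7615}
results = {(a, b): abs(means[a] - means[b]) > threshold
           for a, b in combinations(means, 2)}
```

Pairs whose mean difference exceeds the threshold receive different notation letters in a table like **Table 9**; pairs within the threshold share a letter.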


#### **Table 9.**

*Results of LSD test of the main experiment.*


The results in **Table 9** show that Experiment 16 differs significantly from the other groups. Which experiments differ significantly is determined from the notation assigned: for example, Experiment 13 carries the six notations "defghi," meaning that any other experiment sharing one of these notations does not differ significantly from Experiment 13. By contrast, Experiment 1 carries the notation "a" while Experiment 16 carries the notation "i," so the two experiments are significantly different from each other. Even though Experiment 16 shares the notation "i" with Experiments 6, 13, 12, 8, and 14, it differs significantly from all experiments carrying only the notations "a" through "h" and is the experiment with the highest accuracy.

Regarding the combination of dictionaries used in the research, the difference in accuracy is clear: Experiment 1 achieved 72.92% accuracy, while Experiment 16 achieved 76.15%. Increasing the number of dictionary words generally improved the accuracy of the sentiment results. However, comparing Dictionary A (Okky Ibrohim), with 16,167 words, against our own dictionary, created with the help of annotators and containing only 882 words, raises a question, since, for example, Experiment 12 (Dictionary ABC) reached 75.697% accuracy while Experiment 15 (Dictionary BCD) reached 75.090%. Further analysis showed why this happened: Dictionary A is outdated, containing many slang terms that are rarely used nowadays, although some common slang words still in use remain in it. In Dictionary D, the slang words were taken manually from the raw crawled data, vetted by annotators, and translated with the help of annotators and KBBI (Kamus Besar Bahasa Indonesia). Even though it contains only about one tenth of the words of Dictionary A, Dictionary D contributes around 270 unique words not found in Dictionary A, which helped make the preprocessing more accurate.

The above discussion shows that preprocessing with slang word dictionaries significantly improved the performance of the sentiment analysis. However, it should also be highlighted that the quality of the dictionary's slang word collection has an effect on the improvement achieved. The research was limited to four slang word dictionaries in Bahasa Indonesia, each with a limited word collection. Determining the optimum size of the slang word collection used in the preprocessing stage is a challenge whose resolution could significantly improve sentiment analysis performance. Another limitation that could be expanded on is the machine learning algorithm used; it would also be interesting to find out how combinations of different algorithms and slang word dictionaries contribute to sentiment analysis performance.

## **5. Conclusion**

This study has shown that sentiment analysis can be performed well using a Naïve Bayes Classifier combined with TF-IDF for feature extraction. Moreover, it has also been shown that the number of instances in the dataset affects the performance of the sentiment analysis. In the preliminary stage, with the same 80:20 data split, the accuracy score was 64.796%, while the lowest accuracy score in the main experiment, where the number of instances was much larger, was 73.722%, an improvement of about 8.926 percentage points in accuracy.

Another highlight of this study is how including a slang word dictionary in the preprocessing contributed to improved sentiment analysis performance. The experiments without the dictionary and with all of the dictionaries combined gave different evaluation scores: accuracy improved from 73.722% in Experiment 1 to 76.248% in Experiment 6, an increment of 2.526 percentage points. In addition, the total time required for the complete sentiment analysis process was significantly reduced, from 59 minutes and 26 seconds without the slang word dictionary to 41 minutes and 55 seconds with it.

## **Author details**

Media Anugerah Ayu\* and Abdul Haris Muhendra Faculty of Engineering and Technology, Sampoerna University, Jakarta, Indonesia

\*Address all correspondence to: media.ayu@sampoernauniversity.ac.id

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## **References**

[1] Drus Z, Khalid H. Sentiment analysis in social media and its application: Systematic literature review. Procedia Computer Science. 2019;**161**:707-714. DOI: 10.1016/j.procs.2019.11.174

[2] Wang Y, Guo J, Yuan C, Li B. Sentiment analysis of Twitter data. Applied Sciences. 2022;**12**:11775. DOI: 10.3390/app122211775

[3] Heikal M, Torki M, El-Makky N. Sentiment analysis of Arabic tweets using deep learning. Procedia Computer Science. 2018;**142**:114-122. DOI: 10.1016/ j.procs.2018.10.466

[4] Bouazizi M, Ohtsuki T. Multi-class sentiment analysis on Twitter: Classification performance and challenges. Big Data Mining and Analytics. 2019;**2**(3):181-194. DOI: 10.26599/BDMA.2019.9020002

[5] Jianqiang Z, Xiaolin G. Comparison research on text pre-processing methods on Twitter sentiment analysis. IEEE Access. 2017;**5**:2870-2879. DOI: 10.1109/ access.2017.2672677

[6] Rahayu DA, Kuntur S, Hayatin N. Sarcasm detection on Indonesian twitter feeds. Proceeding of the Electrical Engineering Computer Science and Informatics. 2018;**5**(5):137-141. DOI: 10.11591/eecsi.v5i5.1724

[7] Singh T, Kumari M. Role of text preprocessing in Twitter sentiment analysis. Procedia Compuer Science. 2016;**89**: 549-554. DOI: 10.1016/j.procs.2016. 06.095

[8] Maylawati DS, Zulfikar WB, Slamet C. An improved of stemming algorithm for mining Indonesian text with slang on social media. In: 6th International Conference on Cyber and IT Service Management (CITSM). 2018

[9] Yunitasari Y, Musdholifah A, Sari AK. Sarcasm detection for sentiment analysis in Indonesian tweets. Indonesian Journal of Computing and Cybernetics Systems. 2019;**13**:53-62. DOI: 10.22146/ijccs.41136

[10] Adriani M, Asian J, Nazief B, Tahaghoghi SM, Williams HE. Stemming Indonesian: A confix-stripping approach. ACM Transactions on Asian Language Information Processing. 2007;**6**(4):1-33. DOI: 10.1145/1316457.1316459

[11] Nuritha I, Arifiyanti AA, Widartha VP. Analysis of public perception on organic coffee through text mining approach using Naive Bayes classifier. In: East Indonesia Conference on Computer and Information Technology (EIConCIT). 2018. pp. 153-158

[12] Adarsh MJ, Ravikumar P. Sarcasm detection in text data to bring out genuine sentiments for sentimental analysis. In: 2019 1st International Conference on Advances in Information Technology (ICAIT). 2019. DOI: 10.1109/icait47043.2019.8987393

[13] Ferdiana R, Jatmiko F, Purwanti DD, Ayu AS, Dicka WF. Dataset Indonesia untuk Analisis Sentimen. Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI). 2019;**8**(4):334-339. DOI: 10.22146/jnteti.v8i4.533

[14] Fitri VA, Andreswari R, Hasibuan MA. Sentiment analysis of social media Twitter with case of anti-LGBT campaign in Indonesia using Naïve Bayes, decision tree, and random forest algorithm. Procedia Computer Science. 2019;**161**:765-772

[15] Mandloi L, Patel R. Twitter sentiments analysis using machine learning methods. In: International Conference for Emerging Technology (INCET). 2020. pp. 1-5

[16] Casas I, Delmelle EC. Tweeting about public transit-gleaning public perceptions from a social media microblog. Case Studies on Transport Policy. 2017;**5**(4):634-642. DOI: 10.1016/ j.cstp.2017.08.004

[17] Mora K, Chang J, Beatson A, Morahan C. Public perceptions of building seismic safety following the Canterbury earthquakes: A qualitative analysis using Twitter and focus groups. International Journal of Disaster Risk Reduction. 2015;**13**:1-9. DOI: 10.1016/j. ijdrr.2015.03.008

[18] Klašnja M, Barberá P, Beauchamp N, Nagler J, Tucker JA. Measuring Public Opinion with Social Media Data. In: Atkeson LR, Alvarez RM, editors. The Oxford Handbook of Polling and Survey Methods, Oxford Handbooks (2018; online ed). Oxford Academic; 5 Oct 2015. pp. 555-582. DOI: 10.1093/ oxfordhb/9780190213299.013.3

[19] Al-Thubaity A, Alqahtani Q, Aljandal A. Sentiment lexicon for sentiment analysis of Saudi dialect tweets. Procedia Computer Science. 2018;**142**:301-307. DOI: 10.1016/j. procs.2018.10.494

[20] Mukhtar N, Khan MA, Chiragh N. Lexicon-based approach outperforms supervised machine learning approach for Urdu sentiment analysis in multiple domains. Telematics and Informatics. 2018;**35**(8):2173-2183. DOI: 10.1016/j. tele.2018.08.003

[21] Wu L, Morstatter F, Liu H. SlangSD: Building and using a sentiment dictionary of slang words for short-text sentiment classification. Language Resources and Evaluation. 2018;**52**(3):839-852. DOI: 10.1007/s10579-018-9416-0

[22] Salsabila NA, Winatmoko YA, Septiandri AA. Colloquial Indonesian Lexicon. In: 2018 International Conference on Asian Language Processing (IALP). 2018. pp. 226-229. DOI: 10.1109/ialp.2018.8629151

[23] Muliady W, Widiputra H. Generating Indonesian Slang Lexicons from Twitter. In: 2012 2nd International Conference on Uncertainty Reasoning and Knowledge Engineering. 2012. pp. 123-126. DOI: 10.1109/urke.2012. 6319524

[24] Vieira S, Pinaya WH, Mechelli A. Introduction to machine learning. In: Mechelli A, Vieira S, editors. Machine Learning. Academic Press; 2020. pp. 1- 20. DOI: 10.1016/b978-0-12-815739- 8.00001-8

[25] Yeturu K. Machine learning algorithms, applications, and practices in data science. In: Srinivasa Rao ASR, Rao CR, editors. Handbook of Statistics Principles and Methods for Data Science. Elsevier; 2020. pp. 81-206. DOI: 10.1016/ bs.host.2020.01.002

[26] Jianqiang Z, Xiaolin G, Xuejun Z. Deep convolution neural networks for Twitter sentiment analysis. IEEE Access. 2018;**6**:23253-23260. DOI: 10.1109/ access.2017.2776930

[27] Singh S, Pareek A, Sharma A. Twitter sentiment analysis using rapid miner tool. International Journal of Computer Applications. 2019;**177**(16): 44-50. DOI: 10.5120/ijca2019919604

[28] Bouazizi M, Ohtsuki T. A pattern-based approach for multi-class sentiment analysis in Twitter. IEEE Access. 2017;**5**:20617-20639. DOI: 10.1109/access.2017.2740982

[29] Zimmer M, Proferes N. A topology of Twitter research: Disciplines, methods, and ethics. Aslib Journal of Information Management. 2014;**66**(3):250-261. DOI: 10.1108/ajim-09-2013-0083

[30] Guo X, Li J. A novel twitter sentiment analysis model with baseline correlation for financial market prediction with improved efficiency. In: Proceedings of the Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019. 2019. pp. 472-477

[31] Harapan H, Itoh N, Yufika A, Winardi W, Keam S, Te H, et al. Coronavirus disease 2019 (COVID-19): A literature review. Journal of Infection and Public Health. 2020;**13**:667-673

[32] Kumar D, Malviya R, Sharm PK. Corona virus: A review of COVID-19. Eurasian Journal of Medicine and Oncology. 2020;**4**(10):8-25. DOI: 10.14744/ejmo.2020.51418

[33] Vieira CM, Franco OH, Restrepo CG, Abel T. COVID-19: The forgotten priorities of the pandemic. Maturitas. 2020;**136**:38-41. DOI: 10.1016/j. maturitas.2020.04.004

[34] WHO. 2019 Novel Coronavirus (2019-nCoV) Strategic Preparedness and Response Plan for the South-East Asia Region. 2020. pp. 1-22. Retrieved from World Health Organization

[35] Nicola M, Alsafi Z, Sohrabi C, Kerwan A, Al-Jabir A, Iosifidis C, et al. The socio-economic implications of the coronavirus pandemic (COVID-19): A review. International Journal of Surgery. 2020;**78**:185-193. DOI: 10.1016/j.ijsu. 2020.04.018

[36] Chen L, Liu Y, Chang Y, Wang X, Luo X. Public opinion analysis of novel coronavirus from online data. Journal of Safety Science and Resilience. 2020;**1**(2): 120-127. DOI: 10.1016/j.jnlssr.2020. 08.002

[37] Ibrohim O, Budi I. Multi label hate speech and abusive language detection in Indonesian Twitter. ALW3: 3rd Workshop on Abusive Language Online. 2019. pp. 46-57

[38] AminiMotlagh M, Shahhoseini H, Fatehi N. A reliable sentiment analysis for classification of tweets in social networks. Social Network Analysis and Mining. 2023;**13**:7. DOI: 10.1007/ s13278-022-00998-2

[39] Alassaf M, Qamar AM. Improving sentiment analysis of Arabic tweets by one-way ANOVA. Journal of King Saud University - Computer and Information Sciences. 2020;**34**(6):2849-2859. DOI: 10.1016/j. jksuci.2020.10.023

[40] Williams LJ, Abdi H. Fisher's least significant difference (LSD) test. In: Salkind N, editor. Encyclopedia of Research Design. Thousand Oaks: Sage; 2010. DOI: 10.4135/9781412961288.n154

## **Chapter 3**

## A Comparative Performance Evaluation of Algorithms for the Analysis and Recognition of Emotional Content

*Konstantinos Kyritsis, Nikolaos Spatiotis, Isidoros Perikos and Michael Paraskevas*

## **Abstract**

Sentiment Analysis is highly valuable in Natural Language Processing (NLP) across domains, processing and evaluating sentiment in text for emotional understanding. This technology has diverse applications, including social media monitoring, brand management, market research, and customer feedback analysis. Sentiment Analysis identifies positive, negative, or neutral sentiments, providing insights into decision-making, customer experiences, and business strategies. With advanced machine learning models like Transformers, Sentiment Analysis achieves remarkable progress in sentiment classification. These models capture nuances, context, and variations for more accurate results. In the digital age, Sentiment Analysis is indispensable for businesses, organizations, and researchers, offering deep insights into opinions, sentiments, and trends. It impacts customer service, reputation management, brand perception, market research, and social impact analysis. In the following experimental research, we will examine the Zero-Shot technique on pre-trained Transformers and observe that, depending on the Model we use, we can achieve up to 83% in terms of the model's ability to distinguish between classes in this Sentiment Analysis problem.

**Keywords:** Sentiment Analysis, Natural Language Processing (NLP), sentiment classification, machine learning, transformers

## **1. Introduction**

In this chapter, we present relatively new technologies in the field of sentiment analysis and examine their performance. The term "Sentiment Analysis" emerged and gained popularity around the late 2000s. While the concept of sentiment analysis had been present before, the term "Sentiment Analysis" was formally defined to refer to the automated processing and evaluation of sentiment expressed in texts, primarily natural language texts. Since then, Sentiment Analysis has evolved and expanded with the development of advanced machine learning models, such as the scikit-learn library and later the Transformers. These powerful tools have significantly enhanced the capabilities of sentiment analysis by providing more accurate and efficient sentiment classification algorithms. Sentiment Analysis falls into a distinct category of text classification. It involves the process of comprehending and evaluating the sentiment expressed within a sentence, paragraph, or text, with the primary objective of identifying and categorizing the emotional tone conveyed in these written expressions. Sentiment Analysis commonly employs several categories to capture the nuances of sentiment. The Positive category encompasses texts that convey positive emotions, including pleasure, excitement, joy, and optimism; the Negative category refers to texts that express negative emotions, such as frustration, sadness, anger, and worry. Finally, texts falling into the Neutral category do not exhibit strong positive or negative sentiments; they often maintain an impartial stance, describing information or presenting neutral viewpoints.

It is possible to expand the aforementioned categories to five by further distinguishing between "Positive" and "Negative." This can be accomplished by introducing additional subcategories: "Very Positive" and "Positive" under the Positive category, as well as "Very Negative" and "Negative" under the Negative category. With this refinement, along with the inclusion of the Neutral category, the total number of sentiment categories becomes five. However, it is essential to exercise caution when implementing such subdivisions. Introducing more categories may have implications for evaluation metrics, as it can create ambiguity between closely related terms, making it more challenging for the model to accurately differentiate and classify them.

In general, Sentiment Analysis represents a crucial area in Natural Language Processing (NLP), offering the ability to comprehend and evaluate the emotional aspects of human expressions through automated processing. By automatically analyzing and interpreting text data, Sentiment Analysis enables us to gain insights into people's sentiments, opinions, and attitudes, thereby facilitating various applications such as market research, brand monitoring, social media analysis, and customer feedback analysis.

Transformers are a class of advanced machine learning models that have emerged in recent years and have revolutionized the field of Natural Language Processing (NLP) [1]. Unlike more traditional machine algorithms, Transformers have the ability to analyze and understand complex linguistic relationships, enabling them to solve problems like Sentiment Analysis with high levels of accuracy.

On the other hand, machine learning algorithms can also be used for Sentiment Analysis, such as Naive Bayes, Decision Trees, Random Forests, Support Vector Machines, and others. These algorithms are more traditional and rely on statistical and algebraic methods. They can be successfully applied to sentence or text-level Sentiment Analysis but may not achieve the same level of accuracy and results as Transformers.

In contrast, Transformers utilize deep attention-based neural networks and specialized models with millions of parameters, such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and others, which have been trained on large volumes of text data. These models can learn rich linguistic features and word compositions to recognize and categorize sentiments with high accuracy.

In summary, while the traditional Sentiment Analysis algorithms in the scikit-learn library can produce reliable results, Transformers are more advanced models capable of handling more complex linguistic problems and achieving higher accuracy in Sentiment Analysis tasks.

*A Comparative Performance Evaluation of Algorithms for the Analysis and Recognition… DOI: http://dx.doi.org/10.5772/intechopen.112627*

In the following sections, we will dive into the Zero-Shot technique, the dataset employed, the utilization of Tokenizers in Transformers, the applications of Transformers in various tasks, and a detailed examination of four pre-trained Transformer Models. We will explore how these models function and their experimental performance on the same dataset used in the Zero-Shot technique. Additionally, we will evaluate the effectiveness of each model based on various evaluation metrics, and from an overall table of the models' metrics and a bar chart, we will see which model exhibits the best overall performance. The chapter will conclude with directions for future work.

## **2. Related works**

In the literature, various works examine the use of transformers in sentiment analysis and in text classification. In the work presented in Prottasha et al. [2], the authors fine-tuned the BERT model, which had been pre-trained on the largest BanglaLM dataset. The model was subsequently combined with layers of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). The proposed research compared various word embedding approaches, such as Word2Vec, GloVe, fastText, and BERT. The researchers demonstrated that the transformer-based BERT model outperformed conventional techniques, achieving state-of-the-art results with sufficient fine-tuning. The study also compared several machine learning and deep learning algorithms to validate the performance of the hybrid integrated model CNN-BiLSTM (Bidirectional-LSTM). The results were analyzed using accuracy, precision, recall, F1 score, Cohen's kappa, and Receiver Operating Characteristic Area Under Curve (ROC AUC). Furthermore, the proposed model's performance was evaluated on various sentiment datasets, including the Secure Anonymised Information Linkage (SAIL) dataset, the aspect-based sentiment analysis (ABSA) dataset (cricket and restaurant parts), the BengFastText dataset, the YouTube Comments dataset, and the CogniScenti dataset. The results showed that the hybrid integrated model CNN-BiLSTM outperformed other techniques in terms of accuracy and F1 score, especially when combined with Bangla-BERT embedding.

In the work presented in Chi et al. [3], the main focus is to explore the use of pretrained BERT models for aspect-based sentiment analysis (ABSA) tasks. The authors investigate different methods of constructing auxiliary sentences to transform ABSA into a sentence-pair classification task. These methods include question sentences, single pseudo sentences, question sentences with labels, and pseudo questions with labels. Through fine-tuning the pre-trained BERT model, they achieve new state-of-the-art results on the ABSA task using pair sentences on the datasets they evaluated. Specifically, they achieve an F1 score of 92.18 on the SentiHood dataset and an F1 score of 95.6 on the SemEval-2014 Task 4 dataset.

In the work presented in Zhang et al. [4], the authors propose a comprehensive multitask transformer network called Broad Multitask Transformer Network for Sentiment Analysis (BMT-Net) to address these issues. BMT-Net combines the strengths of feature-based and fine-tuning approaches and is specifically designed to leverage robust and contextual representations. The authors' proposed architecture ensures that the learned representations are applicable across multiple tasks through the use of multitask transformers. Furthermore, BMT-Net is capable of thoroughly learning robust contextual representations for a broad learning system, thanks to its powerful ability to explore deep and extensive feature spaces. The authors conducted experiments using two widely used datasets, namely the binary Stanford Sentiment Treebank (SST-2) and SemEval Sentiment Analysis in Twitter (Twitter). When compared to other state-of-the-art methods, their approach achieves superior results. Specifically, it achieves an improved F1 score of 0.778 for Twitter sentiment analysis and an accuracy of 94.0% for the SST-2 dataset. These experimental findings not only demonstrate BMT-Net's proficiency in sentiment analysis, but also emphasize the importance of previously overlooked design choices concerning the exploration of contextual features in deep and extensive domains.

In the work presented in Junyan et al. [5], the authors propose the multimodal Sparse Phased Transformer (SPT) as a solution that mitigates the complexities associated with self-attention and memory usage. SPT employs a sampling function to generate a sparse attention matrix, effectively compressing long sequences into shorter sequences of hidden states. At each layer, SPT captures interactions between hidden states from different modalities. To further enhance efficiency, the authors utilize layer-wise parameter sharing and Factorized Co-Attention. These techniques allow for parameter sharing between Cross Attention Blocks, minimizing the impact on task performance. The authors evaluate the model using three sentiment analysis datasets and achieve comparable or superior performance compared to existing methods, all the while reducing the number of parameters by 90%. Through the experiments, the authors demonstrate that SPT, along with parameter sharing, can effectively capture multimodal interactions while reducing the model size and improving sample efficiency.

In the work presented in Tan et al. [6], the authors introduce a hybrid deep learning approach that combines the benefits of both sequence models and Transformer models while mitigating the limitations of sequence models. The proposed model incorporates the Robustly optimized BERT approach and Long Short-Term Memory (LSTM) for sentiment analysis. The Robustly optimized BERT approach effectively maps words into a condensed and meaningful word embedding space, while the LSTM model excels at capturing long-range contextual semantics. Through experimental evaluations, the results demonstrate that the proposed hybrid model surpasses the performance of state-of-the-art methods. It achieves impressive F1 scores of 93, 91, and 90% on the Internet Movie Database (IMDb) dataset, Twitter US Airline Sentiment dataset, and Sentiment140 dataset, respectively. These findings highlight the effectiveness of the hybrid approach in sentiment analysis tasks.

In the work presented in Tesfagergish et al. [7], the authors tackle the problem of emotion detection as a component of the broader sentiment analysis task and propose a two-stage methodology. The first stage involves an unsupervised Zero-Shot learning model, which utilizes a sentence transformer to generate probabilities for 34 different emotions. This model operates without relying on labeled data. The output of the Zero-Shot model serves as input for the second stage, which involves training a supervised machine learning classifier using ensemble learning techniques and sentiment labels. Through the proposed hybrid semi-supervised approach, the authors achieve the highest accuracy of 87.3% on the English SemEval 2017 dataset. This methodology effectively combines unsupervised and supervised techniques to address sentiment analysis, incorporating emotion detection and outperforming alternative methods.

## **3. Zero-Shot text classification**

Sentiment Analysis on text datasets whose classes a model encounters for the first time is a relatively new research direction compared to other domains. Transformer models, pre-trained on natural language, make this possible through the Zero-Shot Text Classification technique [8].


Zero-Shot Text Classification is a machine learning technique that leverages a model's ability to classify text into categories it has never seen before. The technique is applied to texts and categories that were not part of the model's training. This means that the model can recognize and classify data (texts) into new categories that it has not "seen" during its pre-training phase. During pre-training, these models are trained on a large volume of texts from various sources, developing a general understanding of language [9].

With this technique, the models can comprehend the meaning of the text and evaluate it in relation to predefined categories provided to them, even without having seen them before. What is important here is that they recognize the meaning of these categories. As a result, these models can classify text into new categories, increasing their flexibility and applicability in various cases, such as Zero-Shot Sentiment Analysis [10].
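As a concrete illustration, the Hugging Face transformers library exposes this technique as a ready-made pipeline. The sketch below is a minimal example, assuming the "typeform/distilbert-base-uncased-mnli" checkpoint as one NLI-trained model; any such checkpoint can be substituted, and the candidate labels are simply the sentiment categories we want the model to consider.

```python
from transformers import pipeline

# Loading the pipeline downloads the checkpoint on first use.
# Any NLI-trained model works here; this name is one known example.
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli",
)

CANDIDATE_LABELS = ["negative", "neutral", "positive"]

def predict_sentiment(text: str) -> str:
    """Return the highest-scoring candidate label for a single text."""
    result = classifier(text, candidate_labels=CANDIDATE_LABELS)
    # The pipeline sorts labels by descending score, so the first is the prediction.
    return result["labels"][0]

print(predict_sentiment("My flight was delayed for three hours and nobody helped."))
```

Note that the model was never trained on these three labels; it only needs to understand what the label words mean.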

## **4. Research design and methodology**

#### **4.1 Data description**

In the context of our work, we explore the "Twitter US Airline Sentiment" dataset using various variations of BERT, employing the Zero-Shot text classification technique [11]. The "Twitter US Airline Sentiment" dataset is a popular collection of tweets related to US airline companies and the evaluation of their services. This dataset was published on the Kaggle platform and comprises 14,640 tweets, accompanied by comments from each customer who wrote them, the airline company mentioned in each tweet, and the corresponding sentiment category (positive, negative, or neutral). Therefore, each comment is labeled as positive, negative, or neutral. This dataset is frequently utilized in Natural Language Processing and the development of machine learning algorithms for sentiment analysis in text data. We will experimentally explore four different pre-trained Transformers using the Zero-Shot text classification technique to evaluate their performance on an unseen dataset of customer comments for airline companies. The task involves categorizing texts into positive, neutral, and negative sentiment labels, essentially performing Sentiment Analysis. These Transformers have not been previously exposed to or trained specifically on this dataset, making the evaluation more robust and insightful [11]. By investigating how these models respond to the new data, we aim to gain valuable insights into their effectiveness in sentiment analysis tasks and their adaptability to previously unseen contexts.

We will experimentally examine several pre-trained Transformer models to determine if they are effective enough to perform Sentiment classification on three classes using the Zero-Shot technique. For this purpose, we selected a dataset from Kaggle that consists of a total of 14,640 customer comments on airline companies. These comments are divided into 9178 Negative comments, 3099 Neutral comments, and 2363 Positive comments [11]. The following bar plot visually illustrates the distribution of instances based on their category. This dataset does not have a good class distribution or balance; it is imbalanced, which makes the classification task more challenging for any algorithm (Twitter US Airline Sentiment) (**Figure 1**) [12].

So, we are dealing with a quite demanding dataset for any model trained on it. However, we will examine this dataset using the Zero-Shot technique, which means without any training. Therefore, the Transformer models should have a deep understanding of the English language to achieve better results [8].

**Figure 1.** *Bar plot of the dataset's label distribution.*

The data preprocessing we performed was relatively straightforward. We removed all columns that were irrelevant to our purpose, keeping only the column containing the comments and the column with the labels, which represent the actual sentiment ratings (negative, neutral, or positive). We also removed all the names of the airlines. All the other preprocessing steps that we would traditionally apply to input texts are now handled by the built-in tokenizer of each Transformer model.
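The preprocessing described above can be sketched with pandas. The column names "text" and "airline_sentiment" follow the Kaggle CSV; the tiny inline frame below stands in for the real file, and the airline-name list is abbreviated for illustration.

```python
import re
import pandas as pd

# Abbreviated, illustrative list of airline names/handles to strip from the tweets.
AIRLINES = ["virginamerica", "united", "southwestair", "delta",
            "usairways", "americanair", "jetblue"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the comment and label columns and strip airline names/handles."""
    out = df[["text", "airline_sentiment"]].copy()
    pattern = re.compile(r"@?(" + "|".join(AIRLINES) + r")", flags=re.IGNORECASE)
    out["text"] = out["text"].str.replace(pattern, "", regex=True).str.strip()
    return out

# Inline stand-in for the Kaggle "Twitter US Airline Sentiment" CSV.
raw = pd.DataFrame({
    "tweet_id": [1, 2],
    "text": ["@united thanks for the smooth flight!", "@JetBlue my bag is lost again"],
    "airline_sentiment": ["positive", "negative"],
    "airline": ["United", "JetBlue"],
})

clean = preprocess(raw)
print(clean)
```

In practice the real file would be loaded with `pd.read_csv("Tweets.csv")` before calling the same function.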

#### **4.2 Tokenizers**

#### *4.2.1 BERT tokenizer*

The BERT tokenizer is responsible for breaking down the text into smaller units called "tokens." The underlying concept of the BERT tokenizer is to represent the text using a set of tokens that correspond to meaningful units of the text, such as words, sub-words, and punctuation symbols.

The tokenizer operates in two main steps. First, it segments the text into words and punctuation symbols. Then, it converts these words and symbols into unique tokens, each of which is assigned a unique numerical identifier. This transformation allows BERT to operate with inputs of a predetermined size, as each token represents a unit of information [13].

The BERT tokenizer is designed to work in conjunction with the BERT model, preparing input that represents the text through these tokens.


The BERT tokenizer also includes special functionalities, such as handling special characters (e.g., punctuation marks) and managing sequences that deviate from the fixed length limit by applying truncation (for sequences that are too long) or padding (for sequences that are too short).

Using the BERT tokenizer, the input text is effectively prepared for processing by the BERT model. It enables the model to understand the meaning of the text and evaluate it in relation to pre-defined categories, without having seen them before [14].
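The two-step process above can be illustrated with a deliberately simplified, pure-Python sketch of greedy WordPiece matching. The toy vocabulary here is invented for the example; the real BERT tokenizer uses roughly 30,000 entries learned from data.

```python
# Toy vocabulary; real BERT ships ~30,000 learned WordPiece entries.
VOCAB = {"[UNK]", "play", "##ing", "##ed", "the", "game", "##s"}

def wordpiece(word: str) -> list:
    """Greedy longest-match-first WordPiece split of a single lowercase word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:                     # continuation pieces carry a ## prefix
                candidate = "##" + candidate
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:                     # no sub-word matches: fall back to [UNK]
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

def tokenize(text: str) -> list:
    """Step 1: split into words; step 2: WordPiece each word."""
    return [t for w in text.lower().split() for t in wordpiece(w)]

print(tokenize("the playing games"))  # → ['the', 'play', '##ing', 'game', '##s']
```

The real tokenizer then maps each token to its numerical identifier in the vocabulary.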

### *4.2.2 DistilBERT tokenizer*

DistilBERT is a compressed version of BERT with fewer layers and reduced parameters. Its tokenizer, however, is essentially the same as BERT's: it breaks the text down into the same WordPiece "tokens" using the same vocabulary. The compression applies to the model's layers rather than to tokenization itself, so the fundamental function of the BERT tokenizer, representing the text using tokens, is fully preserved in DistilBERT [15].

#### *4.2.3 DistilRoBERTa tokenizer*

The tokenizer of DistilRoBERTa also differs from that of BERT. DistilRoBERTa is based on the RoBERTa model, which is an improved version of BERT. Its tokenizer follows the same general process of breaking the text down into smaller units called "tokens," but it uses RoBERTa's byte-level Byte-Pair Encoding (BPE) vocabulary instead of BERT's WordPiece vocabulary, so the tokenization rules and token processing differ. Because the vocabulary is byte-level, any input string can be tokenized without falling back to an unknown token. Overall, the tokenizer of DistilRoBERTa is adapted to the architecture and requirements of the DistilRoBERTa model for efficient and effective text processing.

### **4.3 Transformers**

## *4.3.1 Masked language modeling (MLM)*

In order to better understand Transformers and how they work in relation to Sentiment Analysis, we need to grasp one of their fundamental techniques: Masked Language Modeling (MLM).

First and foremost, it is important to know that Transformers have been designed differently depending on the task they aim to accomplish. For instance, when the task at hand is Sentiment Classification, Named Entity Recognition, or Question Answering, suitable Transformers such as BERT, DistilBERT, RoBERTa, and others have been developed specifically for these purposes. On the other hand, when our task involves translation or summarization, appropriate Transformers include Facebook's BART, Google's T5, and others. Similarly, for text generation, models like OpenAI's GPT, GPT-2, GPT-3, GPT-3.5, and GPT-4 are employed.

Masked Language Modeling (MLM) is a technique used in the field of Natural Language Processing (NLP) and Machine Learning to train language models.

In Masked Language Modeling, a randomly selected word or sequence of words in a sentence is hidden (masked), and the model is tasked with predicting what that hidden word or words are. This encourages the model to understand the context and meaning of the surrounding words in order to make the prediction.

For example, a sentence that could be used in an MLM model is as follows: "The big \_\_\_\_\_\_\_\_ soared through the sky, capturing everyone's attention."

In this case, a word like "bird," "plane," or "kite" could be masked, and the model would need to predict the correct word within the context of the sentence.

MLM training is widely used, with BERT (Bidirectional Encoder Representations from Transformers) being one of the best-known examples. BERT is trained on large corpora of text in which a random portion of the words is masked, and the model attempts to predict each masked word from its context.

Masked Language Modeling models have been successfully used in various applications, such as text completion, information retrieval, and language understanding. The idea is that MLM models can learn from the sequential content of text and reproduce human-like language understanding to a great extent. This ability adds to the model's capacity to classify or characterize texts based on sentiment [16].
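The MLM behavior described above can be tried directly with the fill-mask pipeline from the Hugging Face transformers library, using BERT's `[MASK]` token in the example sentence from the text. This is a sketch; loading the model downloads it on first use.

```python
from transformers import pipeline

# bert-base-uncased was pre-trained with the masked language modeling objective.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def top_predictions(sentence: str, k: int = 3) -> list:
    """Return the k most probable fillings for the [MASK] slot."""
    return [r["token_str"] for r in fill_mask(sentence, top_k=k)]

print(top_predictions("The big [MASK] soared through the sky."))
```

Words such as "bird" or "plane" typically rank highly, showing that the model predicts the masked word from the surrounding context.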

#### *4.3.2 Pre-trained models*

#### *4.3.2.1 bert-base-uncased*

"bert-base-uncased" is a specific pre-trained model variant of BERT (Bidirectional Encoder Representations from Transformers). BERT is a successful machine learning model for Natural Language Processing (NLP) that is trained on large bodies of text to understand the semantic richness of words and sentence structure.

The "bert-base-uncased" version refers to a particular implementation of BERT in which all text is converted to lowercase (uncased) before tokenization. This means that words like "Hello" and "hello" are treated identically by the model.

The difference between "bert-base-uncased" and "BERT" is that "BERT" is a general term referring to the original idea and architecture of the model, while "bert-base-uncased" is a specific implementation of that idea with specific processing parameters.

In general, the designation "bert-base-uncased" is used to describe a specific pretrained BERT model with certain settings. There are also other variations of BERT, such as "bert-base-cased" (where uppercase and lowercase letters are preserved) and "bert-large-uncased" (a larger model size with more parameters).

As the variations of BERT can have different settings and parameters, it is important to be familiar with the descriptions and documentation to understand precisely what the differences and functionalities of each variation are [13].
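The "uncased" preprocessing can be illustrated with a short pure-Python sketch that mirrors what the uncased tokenizer does before splitting the text: lowercasing and stripping accent marks.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Mimic 'uncased' preprocessing: lowercase, then strip accent marks.
    (The real tokenizer applies this before WordPiece splitting.)"""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category 'Mn'), e.g. the accent in 'é'.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Hello"))  # → hello
print(uncased_normalize("Café"))   # → cafe
```

After this step, "Hello" and "hello" produce identical token sequences, which is exactly the behavior described above.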

#### *4.3.2.2 distilbert-base-cased*

"distilbert-base-cased" is a variation of the original BERT (Bidirectional Encoder Representations from Transformers) model that has undergone a process called "distillation" to compress the original model into a smaller size without significant loss in performance.

The differences between "distilbert-base-cased" and the original BERT lie in the following aspects:

1. Fewer layers: DistilBERT keeps 6 Transformer layers instead of BERT's 12, reducing the parameter count by roughly 40%.

2. Distillation training: the smaller model is trained to reproduce the output distributions of the full BERT model, which acts as its "teacher."

3. Comparable quality: despite the compression, DistilBERT retains about 97% of BERT's performance on standard language understanding benchmarks while running about 60% faster.
The main advantages of "distilbert-base-cased" are its lower memory requirements and computational power compared to the original BERT, making it suitable for applications with limited resources, such as systems with limited memory capacity or low computational power.

Overall, "distilbert-base-cased" is a compressed version of the original BERT that offers performance close to that of the full BERT model while requiring less space and computational power [15].

### *4.3.2.3 distilbert-base-uncased-mnli*

"distilbert-base-uncased-mnli" is a variation of the BERT (Bidirectional Encoder Representations from Transformers) model that has been trained on the MultiNLI (Multi-Genre Natural Language Inference) dataset.

The differences of "distilbert-base-uncased-mnli" from the original BERT are as follows:

1. Compressed architecture: it uses the distilled (DistilBERT) architecture, with fewer layers and parameters than the full BERT model.

2. Uncased input: all text is lowercased before tokenization, as in "bert-base-uncased."

3. NLI fine-tuning: after pre-training, the model is fine-tuned on the MultiNLI dataset, specializing it in judging the relationship between sentence pairs.
Variations of BERT, such as "distilbert-base-uncased-mnli," provide pre-trained models that are adapted to specific domains and datasets. In the case of "distilbert-base-uncased-mnli," it has been specifically trained on the MultiNLI dataset for better performance in logical analysis and in evaluating the relationship between sentences.

Overall, "distilbert-base-uncased-mnli" is a compressed variation of the BERT model that has been trained on the MultiNLI dataset. This variation offers a smaller model size while maintaining the ability to comprehend and evaluate the relationship between sentences [15].

## *4.3.2.3.1 MultiNLI (multi-genre natural language inference)*

The MultiNLI (Multi-Genre Natural Language Inference) dataset is a popular dataset used in the field of Natural Language Processing (NLP) to evaluate the ability of models to understand the meaning and relationship between sentences.

The MultiNLI dataset consists of pairs of sentences known as "hypothesis" and "premise." The "hypothesis" is a statement expressing an idea or hypothesis, while the "premise" is the sentence from which the hypothesis is derived. The main task is to evaluate whether the hypothesis is "entailment," "contradiction," or "neutral" based on the relationship between the two sentences.

MultiNLI encompasses a variety of linguistic materials, covering different genres of literature, scientific texts, news articles, and other types of written material. This ensures the diversity and generalization of the dataset, ensuring that models trained on it can comprehend and respond to various linguistic scenarios.

MultiNLI has been widely used as a dataset for evaluating and training NLP models, including BERT models. Using MultiNLI, we can study a model's ability to understand the meaning of sentences and process the relationships between them.

Overall, MultiNLI represents an important dataset for the development and evaluation of NLP models that deal with recognizing and evaluating the relationship between sentences [17].
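The bridge from MultiNLI to Zero-Shot classification is that each candidate label is turned into a hypothesis and paired with the input text as the premise. The template "This example is {}." is the default used by the Hugging Face zero-shot pipeline; the sketch below only builds the pairs, which an NLI-trained model would then score for entailment.

```python
PREMISE = "The crew was rude and we sat on the tarmac for two hours."
CANDIDATE_LABELS = ["negative", "neutral", "positive"]

def build_nli_pairs(premise: str, labels: list) -> list:
    """Turn a classification problem into MultiNLI-style (premise, hypothesis) pairs.
    An NLI model scores each pair for entailment; the most-entailed label wins."""
    return [(premise, f"This example is {label}.") for label in labels]

for premise, hypothesis in build_nli_pairs(PREMISE, CANDIDATE_LABELS):
    print(f"premise: {premise!r}  hypothesis: {hypothesis!r}")
```

This reframing is why models fine-tuned on MultiNLI, such as those examined in this chapter, can classify into labels they have never been trained on.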

## *4.3.2.4 nli-distilroberta-base*

Let us first examine RoBERTa in relation to BERT to understand the version of the pre-trained model nli-distilroberta-base:

RoBERTa is a pre-trained model for Natural Language Processing (NLP) that is a variation of the original BERT (Bidirectional Encoder Representations from Transformers) model. The name "RoBERTa" stands for "Robustly Optimized BERT approach."

The differences between RoBERTa and the original BERT are as follows:

1. More training data and longer training: RoBERTa is trained on substantially more text, with larger batches and more training steps.

2. Dynamic masking: the masked positions change across training passes instead of being fixed once during preprocessing.

3. No Next Sentence Prediction: RoBERTa drops BERT's next-sentence-prediction objective, which was found to add little value.

4. Larger model size: RoBERTa has a larger model size compared to the original BERT. This implies that RoBERTa has more parameters and a more detailed representation of words and sentences [18].

The above differences constitute improvements that allow RoBERTa to achieve better results in various NLP tasks compared to the original BERT. However, it is important to note that the basic architecture and ideas that guided BERT remain at the core of RoBERTa, with the differences mainly focusing on training and data preprocessing [18].

Therefore, "nli-distilroberta-base" is a pre-trained model for Natural Language Processing (NLP) that is a variation of the original BERT (Bidirectional Encoder Representations from Transformers) model. This variation utilizes the DistilRoBERTa architecture and has been trained on the NLI (Natural Language Inference) dataset.

The differences of "nli-distilroberta-base" from the original BERT are as follows:

1. Distilled RoBERTa backbone: it uses the compressed DistilRoBERTa architecture, with fewer layers and parameters than the full model.

2. Byte-level BPE tokenization: it inherits RoBERTa's tokenizer instead of BERT's WordPiece vocabulary.

3. NLI fine-tuning: the model is fine-tuned on Natural Language Inference data (sentence pairs labeled entailment, contradiction, or neutral), which makes it well suited to the Zero-Shot technique.
## **5. Experimental results**

We will describe the results we obtained from the experimental process of four pre-trained Transformers on the same dataset (Twitter US Airline Sentiment) that we described earlier [12]. We conducted the experiments using our own computational resources. The code for the metrics we present was written in Python, utilizing the relevant Transformers libraries with pipelines for the Zero-Shot technique. All Confusion Matrices were generated by combining two functions: confusion\_matrix from the sklearn.metrics library and sns.heatmap from the seaborn library.
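The combination of the two functions named above can be sketched as follows. The gold labels and predictions here are toy stand-ins for real model output, and the matrix is row-normalized so each cell shows the share of the true class, matching the percentage figures in this section.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

LABELS = ["negative", "neutral", "positive"]

# Toy stand-ins for the dataset's gold labels and a model's predictions.
y_true = ["negative", "negative", "neutral", "positive", "positive", "neutral"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "neutral"]

# normalize="true" divides each row by the true-class support.
cm = confusion_matrix(y_true, y_pred, labels=LABELS, normalize="true")

sns.heatmap(cm, annot=True, fmt=".2f", xticklabels=LABELS, yticklabels=LABELS)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("confusion_matrix.png")
print(cm)
```

The diagonal of the resulting matrix holds the correctly predicted shares discussed for each model below.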

#### **5.1 Zero-Shot and sentiment analysis distilbert-base-cased**

Applying the Zero-Shot technique to the pre-trained Transformer distilbert-basecased, we obtain the Confusion Matrix (**Figure 2**).

The diagonal of the Confusion Matrix always shows the percentages that the Transformer predicted correctly (**Table 1**). So here we can see that the Transformer correctly predicted 18% of the positive sentiments, 25% of the negative sentiments, and 60% of the neutral sentiments. In the other cells, we can observe the following:

• 23% of the comments that were actually positive were predicted as negative by the model.

#### **Figure 2.**

*Confusion Matrix of distilbert-base-cased.*


#### **Table 1.**

*Metrics of distilbert-base-cased.*



The ROC AUC score measures the model's ability to discriminate between classes. A score of 0.5 represents randomness, while a score of 1 represents perfect discrimination. In this case, the ROC AUC score is approximately 0.531, suggesting an ability to discriminate between classes that is only marginally better than random.

Overall, the model appears to have relatively low performance based on the presented metrics. This could be because the pre-trained model may not have adequately understood such comments, which often contain irony or sarcasm. This does not mean that pre-trained Transformers cannot understand such comments; it means that this specific model has not reached the levels of language comprehension required for use in Zero-Shot Sentiment Classification.

#### **5.2 Zero-Shot and sentiment analysis bert-base-uncased**

Applying the Zero-Shot technique to the pre-trained Transformer bert-baseuncased, we obtain the Confusion Matrix (**Figure 3**).

The diagonal of the Confusion Matrix always shows the percentages that the Transformer predicted correctly. So here we can see that the Transformer correctly predicted 58% of the negative sentiments, 17% of the neutral sentiments, and 40% of the positive sentiments. In the other cells, we can observe the following (**Table 2**).


**Figure 3.** *Confusion Matrix of bert-base-uncased.*


**Table 2.**

*Metrics of bert-base-uncased.*


Based on the metrics provided for the Transformer bert-base-uncased for sentiment analysis with Zero-Shot text classification, we can draw the following conclusions:

The validation accuracy is low, at approximately 0.464. This indicates that the model struggles to recognize the three classes in the dataset.

The F1 score for the model is likewise approximately 0.464, representing the harmonic mean of precision and recall. This suggests that the model has limited performance in both precision and recall.

The ROC AUC score is approximately 0.508, barely above the 0.5 chance level. This indicates that the model has very limited ability to correctly distinguish the three classes.
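The three metrics reported throughout this section can be computed with scikit-learn as sketched below. The label and probability arrays are toy stand-ins for real model output; note that multi-class ROC AUC needs per-class probabilities and a one-vs-rest scheme.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

CLASSES = ["negative", "neutral", "positive"]

# Toy stand-ins for gold labels (indices into CLASSES) and model probabilities.
y_true = np.array([0, 0, 1, 2, 2, 1])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.2, 0.7, 0.1],
])
y_pred = y_prob.argmax(axis=1)  # predicted class = highest-probability column

accuracy = accuracy_score(y_true, y_pred)
# Weighted F1 averages per-class F1 scores by class support,
# which matters for an imbalanced dataset like this one.
f1 = f1_score(y_true, y_pred, average="weighted")
# Multi-class ROC AUC via one-vs-rest over the probability columns.
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")

print(f"accuracy={accuracy:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
```

Applied to a real model's outputs on the full dataset, this yields the validation accuracy, F1 score, and ROC AUC values compared in **Table 5**.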

## **5.3 Zero-Shot and sentiment analysis distilbert-base-uncased-mnli**

Applying the Zero-Shot technique to the pre-trained Transformer distilbert-baseuncased-mnli, we obtain the Confusion Matrix (**Figure 4**).

The diagonal of the Confusion Matrix always shows the percentages that the Transformer predicted correctly. So here we can see that the Transformer correctly predicted 3.7% of the neutral sentiments, 92% of the positive sentiments, and 67% of the negative sentiments. In the other cells, we can observe the following (**Table 3**).

**Figure 4.**

*Confusion Matrix of distilbert-base-uncased-mnli.*

**Table 3.**

*Metrics of distilbert-base-uncased-mnli.*


Based on the metrics provided for the Transformer distilbert-base-uncased-mnli for sentiment analysis with Zero-Shot text classification, we can draw the following conclusions:

Validation accuracy (Val\_accuracy): The validation accuracy is 0.576. This means that the Transformer correctly classifies the sentiment of the text into three categories (3 classes) with an average accuracy of 57.6%. This accuracy indicates that the Transformer has a relatively moderate performance, and there is room for improvement.

F1 score: The F1 score is 0.576, which is equal to the validation accuracy. The F1 score is a measure of the overall performance that combines precision and recall. The value of 0.576 indicates that the Transformer has a moderate performance and needs improvement in this area.

ROC AUC score: The ROC AUC score is 0.768. This metric evaluates the model's ability to distinguish between classes and correctly rank examples based on the predicted probabilities. A ROC AUC score of 0.768 indicates that the Transformer has a relatively good discriminative ability between classes, but there is still room for improvement.

Overall, we can say that the Transformer distilbert-base-uncased-mnli has a moderate performance in sentiment analysis with Zero-Shot text classification, and there is room for improvement in terms of accuracy and F1 score. However, the ability to distinguish between classes, as represented by the ROC AUC score, is relatively good.

## **5.4 Zero-Shot and sentiment analysis nli-distilroberta-base**

Applying the Zero-Shot technique to the pre-trained Transformer nli-distilroberta-base, we obtain the Confusion Matrix (**Figure 5**).

The diagonal of the Confusion Matrix always shows the percentages that the Transformer predicted correctly. So here we can see that the Transformer correctly predicted 4.4% of the neutral sentiments, 87% of the positive sentiments, and 86% of the negative sentiments. In the other cells, we can observe the following (**Table 4**).


## **Figure 5.**

*Confusion Matrix of nli-distilroberta-base.*


**Table 4.**

*Metrics of nli-distilroberta-base.*


• 10% of the comments that were actually negative were predicted as positive by the model.

Based on the metrics provided for the Transformer nli-distilroberta-base, which performs sentiment analysis using Zero-Shot text classification, we can draw the following conclusions:

The validation accuracy is approximately 0.692, the highest among the four models examined.

The F1 score is likewise approximately 0.692, indicating a reasonable balance of precision and recall across the three classes.

The ROC AUC score is approximately 0.826, indicating a good ability to distinguish between the classes.

In summary, the Transformer nli-distilroberta-base demonstrates a relatively good performance in sentiment analysis using Zero-Shot text classification. It achieves the highest accuracy, F1 score, and ROC AUC score of the models examined, indicating its effectiveness in accurately predicting sentiment and distinguishing between different sentiment classes (**Table 5** and **Figure 6**).

Based on the overall table for the performances of the Transformers we have and the bar plot, we can draw the following conclusions:

• **Validation accuracy:** The models differ significantly in validation accuracy. The nli-distilroberta-base model has the highest validation accuracy, at around 69.2%, while the distilbert-base-cased model has the lowest, at around 31.1%.

• **F1 score:** As with validation accuracy, the nli-distilroberta-base model achieves the highest F1 score, at around 69.2%, while the distilbert-base-cased model has the lowest, at around 31.1%.

• **ROC AUC score:** The nli-distilroberta-base model has the highest ROC AUC score, at around 82.6%, indicating good discriminative ability between classes, whereas the distilbert-base-cased model has the lowest, at around 53.1%.

**Table 5.**

*Comparison of the metrics of all models.*

**Figure 6.**

*Bar plot comparison of all models.*

Overall, the nli-distilroberta-base model stands out among the other three models in all metrics. It demonstrates higher validation accuracy, F1 score, and ROC AUC score compared to the other models. On the other hand, the distilbert-base-cased model shows the lowest performance across all metrics.

Therefore, we can conclude that the nli-distilroberta-base model is the most effective among the four models examined for sentiment analysis.
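For reference, validation accuracy and F1 scores of the kind compared above can be computed directly. This is a plain-Python sketch; the assumption that the chapter's F1 column is macro-averaged over the three classes is ours, and the toy labels below are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy run with three sentiment classes
y_true = ["neg", "neg", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "pos", "neu", "pos", "pos", "neg"]
acc = accuracy(y_true, y_pred)                        # 4/6 ≈ 0.667
f1 = macro_f1(y_true, y_pred, ["neg", "neu", "pos"])  # 13/18 ≈ 0.722
```

Macro averaging weights every class equally, which matters here because the airline dataset's classes are imbalanced.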

## **6. Conclusions**

Because we followed the Zero-Shot Sentiment Classification technique, we could not fine-tune the models to achieve optimal results on this specific dataset. These experiments nevertheless highlight a technique that can be applied to vast datasets: with the Zero-Shot technique, we can perform Sentiment Classification without human supervision. One might consider how many human hours would be required to label a massive dataset without errors. In this respect, the method can be likened to unsupervised learning. Furthermore, it offers another way to gauge how well a pre-trained model has internalized the language.

The percentages achieved in the experiments indicate to what extent this particular Transformer model has been trained on similar data and how well it has understood the language. To our knowledge, the Zero-Shot Sentiment Classification technique has not previously been applied to the Twitter US Airline Sentiment dataset, so no comparative reference exists. However, the technique has been used in other domains; for example, one study proposes a method for conducting Zero-Shot Aspect-Based Sentiment Analysis without using domain-specific training data [19].


The abundance of user-generated information on the Web necessitates accurate methods for analyzing and determining users' opinions and attitudes toward events, products, and entities. In this study, we designed and implemented BERT-like Transformers for the task of Zero-Shot classification. These four pre-trained Transformer models deliver commendable results despite their relatively modest parameter counts.

Future work will focus on several directions based on the presented results in this chapter using the Zero-Shot technique. First, exploring other models known for their performance in sentiment analysis with the Zero-Shot technique should be considered. Evaluating their accuracy, F1 score, and ROC AUC score and comparing them to those of the existing models will be beneficial. Experimenting with different model variations to identify the most suitable one for specific requirements is recommended. Additionally, examining data preprocessing techniques and evaluating the steps involved in data preprocessing should be conducted. Lastly, exploring ensemble models that combine multiple models to enhance performance can be advantageous. The utilization of diverse models can offer improvements in terms of accuracy and overall performance. These are potential avenues for future work to enhance the results of sentiment analysis using the Zero-Shot techniques.

## **Acknowledgements**

We would like to express our gratitude to Mr. Panagiotis Hadjidoukas from the Department of Computer Logic, part of the Department of Computer Engineering & Informatics at the University of Patras, for providing us with the computational resources necessary to conduct these demanding experiments.

This work was partially supported by the Project entitled "Strengthening the Research Activities of the Directorate of Infrastructure and Networks," funded by the Computer Technology Institute and Press "Diophantus" with project code 0822/001.

## **Author details**

Konstantinos Kyritsis1,2, Nikolaos Spatiotis1,2, Isidoros Perikos1,2,3 and Michael Paraskevas1,2\*

1 Computer Technology Institute and Press "Diophantus," Patras, Greece

2 Electrical and Computer Engineering Department, University of Peloponnese, Greece

3 Computer Engineering and Informatics Department, University of Patras, Greece

\*Address all correspondence to: mparask@cti.gr

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. New York: Cornell University Library in Ithaca; 2017. Available from: https://arxiv.org/abs/1706.03762

[2] Prottasha NJ, Sami AA, Kowsher M, Murad SA, Bairagi AK, Masud M, et al. Transfer learning for sentiment analysis using BERT based supervised fine-tuning. Sensors. 2022;**22**(11):4157

[3] Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1; 2019. pp. 380-385. Available from: https://aclanthology.org/N19-1035/

[4] Zhang T, Gong X, Chen CLP. BMT-Net: Broad multitask transformer network for sentiment analysis. IEEE Transactions on Cybernetics. 2022;**52**(7):6232-6243. Available from: https://ieeexplore.ieee.org/document/9369997

[5] Cheng J, Fostiropoulos I, Boehm B, Soleymani M. Multimodal phased transformer for sentiment analysis. In: Conference on Empirical Methods in Natural Language Processing. United States: Association for Computational Linguistics (ACL); 2021. Available from: https://aclanthology.org/2021.emnlp-main.189/

[6] Tan KL, Lee CP, Lim KM, Anbananthen KSM. Sentiment analysis with ensemble hybrid deep learning model. IEEE Access. 2022;**10**:103694-103704. Available from: https://doaj.org/article/948b7ca90291416fb31bda6b789b8920

[7] Tesfagergish SG, Kapočiūtė-Dzikienė J, Damaševičius R. Zero-Shot emotion detection for semi-supervised sentiment analysis using sentence transformers and ensemble learning. Applied Sciences. 2022;**12**(17):8662

[8] Yang P, Wang J, Gan R, Zhu X, Zhang L, Wu Z, et al. Zero-Shot learners for natural language understanding via a unified multiple choice perspective. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Pennsylvania, United States: Association for Computational Linguistics (ACL); 2022

[9] Yin W, Hay J, Roth D. Benchmarking Zero-Shot text classification: Datasets, evaluation and entailment approach. In: EMNLP-IJCNLP 2019. Pennsylvania, United States: Association for Computational Linguistics (ACL); 2019. Available from: https://aclanthology.org/D19-1404/

[10] Pushp PK, Srivastava MM. Train once, test anywhere: Zero-Shot learning for text classification. arXiv: Computation and Language. 2017. [preprint]

[11] Delangue C, Chaumond J, Wolf T. Hugging Face [Online]. U.S.: Hugging Face, Inc.; 2016. Available from: https://huggingface.co/

[12] Sculley D, Elliott J, Hamner B, Moser J. Kaggle [Online]. 2010. Available from: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

[13] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. United States: Association for Computational Linguistics (ACL); 2019. DOI: 10.48550/arXiv.1810.04805

[14] Kudo T, Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Empirical Methods in Natural Language Processing: System Demonstrations. Pennsylvania, United States: Association for Computational Linguistics (ACL); 2018

[15] Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS 2019. Vancouver; 2020

[16] Salazar J, Liang D, Nguyen TQ, Kirchhoff K. Masked language model scoring. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics [Online]. Pennsylvania, United States: Association for Computational Linguistics (ACL); 2020. Available from: https://aclanthology.org/2020.acl-main.240/

[17] Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. 2020;**21**:1-67

[18] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A robustly optimized BERT Pretraining approach. In: Proceedings of the 20th Chinese National Conference on Computational Linguistics. Huhhot; 2021

[19] Shu L, Xu H, Liu B, Chen J. Zero-Shot Aspect-Based Sentiment Analysis. 2022. Available from: https://arxiv.org/pdf/2202.01924.pdf
