**1.1 Data collections**

Two types of datasets are used: training data (CSV files) and testing data (JSON obtained by crawling). The training data contains 66,171 records, each consisting of a username, a caption, and a label. The testing data consists of 1,894 Instagram captions. The retrieved captions are processed to produce weights, which are used later during the Instagram caption classification process.

The data in **Table 1** is then divided into five categories: Fashion, Food & Beverage, Technology, Health & Beauty, and Lifestyle & Travel. **Table 2** shows the proportion of data in each category.



#### **Table 1.**

*Instagram caption data.*


#### **Table 2.**

*Proportion of the amount of data in each category/class.*

**Table 2** shows that the dataset is imbalanced: the classes are represented in disproportionate ratios. The disproportion is most visible between the Health & Beauty and Technology categories, whose amounts of data differ significantly. This imbalance affects the prediction for each class later on. With an imbalanced dataset, the model tends to predict the majority class, while the minority class is treated as noise or, on some occasions, ignored altogether. As a result, the minority class is more likely to be misclassified than the majority class. In this research, the imbalanced dataset is handled by using the *F*<sub>1</sub> score as the performance metric.
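The following minimal sketch illustrates why the *F*<sub>1</sub> score is more informative than accuracy on an imbalanced dataset. The labels `majority` and `minority`, the toy predictions, and the helper `f1_per_class` are hypothetical and are not taken from the chapter's data or implementation.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1}, where F1 = 2 * P * R / (P + R) for each class."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy imbalanced data: 8 majority samples, 2 minority samples.
y_true = ["majority"] * 8 + ["minority"] * 2
# A degenerate model that always predicts the majority class.
y_pred = ["majority"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = f1_per_class(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)

print(f"accuracy = {accuracy:.2f}")
print(f"per-class F1 = {scores}")
print(f"macro F1 = {macro_f1:.2f}")
```

In this toy case the accuracy looks acceptable even though the minority class is never predicted, whereas the per-class *F*<sub>1</sub> score of the minority class is zero and exposes the problem.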

#### *1.1.1 Text preprocessing*

Text preprocessing is the first step of text mining. In text mining, preprocessing transforms poorly formatted input into structured data that meets the requirements of the subsequent processing.

The preprocessing stage is presented in **Figure 2**. After the data was collected, the next step was text preprocessing, which included case folding, tokenizing, and cleaning.

*Text Classification on the Instagram Caption Using Support Vector Machine DOI: http://dx.doi.org/10.5772/intechopen.99684*

**Figure 2.** *Text preprocessing.*

**Case folding** is the process of converting the letters in the text into lowercase. Characters other than the letters A-Z are omitted. This step is needed because Instagram captions use lowercase and uppercase letters inconsistently. Case folding aims to make all Instagram caption data conform to a standard form, which is usually lowercase [5]. Characters that are neither letters nor numbers, such as punctuation and spaces, are treated as delimiters. The illustration is displayed in **Figure 3**.
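Case folding as described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `case_fold` and the sample caption are hypothetical, and the sketch assumes that only the letters a-z are kept, with every other character (digits included) replaced by a space so that it acts as a delimiter.

```python
import re


def case_fold(caption: str) -> str:
    """Lowercase a caption and keep only the letters a-z."""
    lowered = caption.lower()
    # Assumption: any run of characters outside a-z becomes a single space,
    # so punctuation, digits, and whitespace all act as delimiters.
    return re.sub(r"[^a-z]+", " ", lowered).strip()


folded = case_fold("New SHOES, 50% OFF!!")
print(folded)  # new shoes off
```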

**Tokenizing** is the process of dividing the sequence of characters in a text into single word units by identifying the particular characters that serve as word separators [5]. Each word is separated from the next by a space character, so the tokenizing process relies on the space characters in the document to separate the words. The process is illustrated in **Figure 4**.
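Because tokenizing here relies on space characters as word separators, it can be sketched with Python's built-in `str.split`. The function name `tokenize` and the sample text are hypothetical, and the sketch assumes the input has already been lowercased and stripped of punctuation.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split a text into word tokens on whitespace."""
    # str.split() without arguments splits on any run of whitespace
    # and drops empty strings, so repeated spaces yield no empty tokens.
    return text.split()


tokens = tokenize("new shoes  off")
print(tokens)  # ['new', 'shoes', 'off']
```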

**Filtering** is a method that extracts key words from the token results by using either a stoplist (removing unnecessary words) or a wordlist (keeping crucial words). Some examples of English stopwords are "the", "from", and "and". The purpose of using stopwords is to remove words that carry little information from a text so that the focus falls on the essential words instead. Filtering is done by determining which terms will be used to represent a document, so that the terms describe the content of each document and distinguish it from the others. This process is illustrated in **Figure 5**.
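A minimal sketch of filtering with a stoplist and an optional wordlist is shown below. The names `STOPLIST` and `filter_tokens` and the sample tokens are hypothetical. The stoplist contains only the three examples mentioned in the text, whereas a stoplist used in practice would be much longer.

```python
from typing import Iterable, List, Optional, Set

# Assumption: only the stopword examples named in the text.
STOPLIST: Set[str] = {"the", "from", "and"}


def filter_tokens(
    tokens: Iterable[str],
    stoplist: Set[str] = STOPLIST,
    wordlist: Optional[Set[str]] = None,
) -> List[str]:
    """Drop stoplist words; if a wordlist is given, keep only its words."""
    kept = [token for token in tokens if token not in stoplist]
    if wordlist is not None:
        kept = [token for token in kept if token in wordlist]
    return kept


filtered = filter_tokens(["the", "dress", "from", "paris", "and", "milan"])
print(filtered)  # ['dress', 'paris', 'milan']
```

Passing a `wordlist` switches the function from removing unimportant words to keeping only the listed important ones, which mirrors the two options described above.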
