**3.2.2 Creation of sample data for the classifier**

Positive samples and negative samples must be created manually. There are two points to consider.

First, this process is delicate: positive tweets and negative tweets must be labeled accurately. Therefore, it is necessary to acquire records of actual earthquakes and to select positive tweets by referring to these earthquake records, so that the tweets are labeled precisely.

Second, one must prepare equal numbers of positive tweets and negative tweets. The number of samples needed depends on the task; it is generally said that the sample data must comprise 300–500 samples. In practice, one should keep increasing the number of samples until the resulting classifier provides sufficient performance.
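The balancing step above can be sketched in a few lines. This is a minimal illustration, assuming the tweets have already been labeled by hand; the function name and the `(text, label)` representation are our own, not the chapter's.

```python
import random

def balance_samples(labeled_tweets, seed=0):
    """Downsample the larger class so positive and negative samples are equal.

    `labeled_tweets` is a list of (text, label) pairs with label +1 or -1;
    this representation is our own illustration, not from the chapter.
    """
    positives = [t for t in labeled_tweets if t[1] == +1]
    negatives = [t for t in labeled_tweets if t[1] == -1]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    # draw n samples from each class, then shuffle so classes are interleaved
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample
```

Downsampling the majority class is only one way to balance; collecting more samples of the minority class, as the chapter suggests, is usually preferable when possible.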

### **3.2.3 Extraction of features from sample data**

Next, one must select the features of tweets to be used for machine learning. In the spam mail filter example, the words included in the sample mails are chosen as features. Toretter uses three kinds of features. We explain them in detail below, using the following sentence as an example.

Oh! Earthquake happened right now!

**Keyword features**: all the words included in a tweet.

example sentence → Oh, earthquake, happened, right, now

**Statistical features**: the number of words in a tweet message and the position of the search keyword within the tweet.

example sentence → number of words: *five*; position of the search keyword: *second*

**Context features**: the words before and after the search keyword.

example sentence → *Oh*, *happened*

*Statistical features* are the most effective of the three, according to the results of our earlier research (Sakaki et al., 2010). We surmise that this is because people who experience an earthquake are surprised and in an emergency situation, so they tend to post short tweets such as "Oh! earthquake!" and "It's shaking".

Of course, these features can differ depending on language, country, and culture. Therefore, effective features should be chosen when creating a filter for tweets.
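The three feature kinds can be made concrete with a short sketch. The function name and the simple punctuation-stripping tokenizer are our own simplifications, not the chapter's implementation; the output matches the worked example above.

```python
def extract_features(tweet, keyword):
    """Extract keyword, statistical, and context features from one tweet.

    Tokenization by whitespace splitting and punctuation stripping is our
    own simplification of whatever tokenizer Toretter actually uses.
    """
    words = [w.strip("!?,.").lower() for w in tweet.split()]
    words = [w for w in words if w]
    idx = words.index(keyword)  # index of the search keyword (0-based)
    return {
        "keyword_features": words,            # all words in the tweet
        "word_count": len(words),             # statistical feature 1
        "keyword_position": idx + 1,          # statistical feature 2 (1-based)
        # context features: neighbors of the search keyword
        "word_before": words[idx - 1] if idx > 0 else None,
        "word_after": words[idx + 1] if idx + 1 < len(words) else None,
    }

feats = extract_features("Oh! Earthquake happened right now!", "earthquake")
# word_count: 5, keyword_position: 2, word_before: "oh", word_after: "happened"
```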

Earthquake Observation by Social Sensors 325

#### **3.2.4 Applying machine learning**

Several machine learning methods can create a classifier for this kind of problem: the Naive Bayes classifier, neural networks, decision trees, and the Support Vector Machine (SVM). In this chapter, the Support Vector Machine is used for our explanation because SVM is regarded as a superior method for classification problems and regression problems, and many SVM software packages exist. We treat SVM-Light, a popular SVM tool, as an example in this chapter.

Creating a classifier demands three steps.

3.2.4.1 Create training data from tweets

First, it is necessary to convert the tweet data into the training data file format of SVM-Light, which is

<target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
-1 1:0.43 3:0.12 9284:0.2 # abcdef

In this file format, each line corresponds to a single tweet. **<target>** expresses the polarity of each tweet: +1 means positive and −1 means negative. **<feature>** expresses the feature ID of a feature, and **<value>** expresses the weight of that feature in the tweet. Each feature must be assigned a feature ID. For example, if one assigns feature IDs as in Table 7, then a tweet is converted into training data for SVM-Light as shown below.

I am in Japan, earthquake right now → +1 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:7 8:5 9:1 10:1

| feature ID | feature |
| --- | --- |
| 0 | I |
| 1 | am |
| 2 | in |
| 3 | Japan |
| 4 | earthquake |
| 5 | right |
| 6 | now |
| 7 | number of words in the tweet |
| 8 | position of the search keyword |
| 9 | word before the keyword is "Japan" |
| 10 | word after the keyword is "right" |

Table 7. Sample features for SVM-Light.

After converting the collected positive and negative tweets into a *training data file*, one must run the following command to create a classifier for tweets:

svm\_learn *"training data file" "model file"*

*svm\_learn* is the SVM-Light command that creates a model file for the classifier; the *model file* is obtained as its output. Tweets can then be classified with the command *svm\_classify* using this model file. When classifying new tweets into the positive class and the negative class, each tweet is converted into *test data* in the same format as the *training data*. Then the following command is executed:

svm\_classify *"test data file" "model file" "output file"*

The polarity of each tweet can be read from the *output file*. New tweets are thus classifiable into a positive class and a negative class by the classifier for tweets, as described.

SVM-Light (Joachims, 2008), LIBSVM (Chih-Chung & Chih-Jen, 2011), and Classias (Okazaki, 2009) are compatible to the extent that the process we explain here is also applicable to LIBSVM and Classias. (Toretter uses Classias as its SVM tool.)

**4. Earthquake detection from time-series data using a probabilistic model**

The second step of Fig. 4 detects an earthquake from the positive tweets.

First, it is difficult to trust these tweets directly, because some users misinterpret shaking caused by something other than an earthquake, and some ill-willed users post positive tweets to deceive others. In this respect social sensors closely resemble physical sensors, which also sometimes produce a wrong value. Therefore, we must process positive tweets to detect earthquakes with high accuracy, similarly to the treatment of physical sensors.

Figure 10 depicts the sizes of earthquakes and the counts of positive tweets filtered by SVM on Feb 11, 2011. These two graphs are correlated: whenever an earthquake occurs, a peak appears in the graph of positive tweet counts. Therefore, we can detect earthquakes by detecting the peaks of positive tweet counts.

Fig. 10. Sizes of earthquakes and changes of filtered tweet counts, Feb 11, 2011.

Many methods have been used to detect peaks in time-series data for purposes such as burst detection (Kleinberg, 2002; Zhu & Shasha, 2003) and anomaly detection (Cheng et al., 2008; Krishnamurthy et al., 2003). Toretter uses a static rule, *5 tweets in 5 min*, that is calculated using an exponential function. We explain this method hereinafter.

**4.1 Temporal model**

To detect an earthquake using physical sensors, we must calculate the probability of earthquake occurrence based on the signals from those sensors. Similarly, we must calculate the probability of earthquake occurrence from the signals of social sensors. In this subsection, we explain the temporal model we use to calculate this probability.

Figure 11 presents graphs of positive tweet counts during earthquakes. In Fig. 11, the green line shows an exponential function. As shown here, the green line resembles the red line,

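The static peak-detection rule described in Section 4, raising an alarm when enough positive tweets arrive within a short time window (*5 tweets in 5 min*), can be sketched as follows. The threshold and window come from the chapter; the sliding-window formulation and the function name are our own illustration.

```python
from bisect import bisect_left

def detect_peak(timestamps, threshold=5, window_sec=300):
    """Return True if any sliding window of `window_sec` seconds contains
    at least `threshold` positive tweets.

    `timestamps` is a sorted list of Unix times of tweets the classifier
    marked positive. The 5-tweets-in-5-minutes rule is the chapter's static
    rule; this windowed implementation is our own sketch of it.
    """
    for i, t in enumerate(timestamps):
        # count tweets in the window (t - window_sec, t]
        start = bisect_left(timestamps, t - window_sec + 1)
        if i - start + 1 >= threshold:
            return True
    return False
```

A production system would evaluate this rule incrementally as tweets stream in, rather than rescanning the whole list; the batch form above is only meant to make the rule explicit.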