### **2. Overview of earthquake observation by social sensors**

We explain the basic idea of *social sensors* and introduce internet service users as social sensors to observe earthquakes.

#### **2.1 Earthquake observation services performed by social sensors**

We introduce four earthquake observation services that use information from internet users. In this chapter, we examine Toretter as an example; we explain its detailed mechanisms in the next section.

#### **Did You Feel It?**

The web site *Did You Feel It?*, which is operated by the United States Geological Survey (USGS), is shown in Fig. 2. Through the internet, it gathers earthquake information from users who experienced those earthquakes directly (Intensity, 2005).

Fig. 2. Screenshot of *Did You Feel It*?

#### **TED**

The USGS also manages the Twitter Earthquake Detector (TED), which gathers tweets referring to earthquake occurrences from Twitter. It acquires location information and photographs attached to those tweets and displays this information on maps (Survey, 2009).

#### **iShake**


The iShake project has developed a smartphone application (Fig. 3) that uses a phone to measure acceleration during an earthquake and report those data to researchers for processing (CITRIS, 2011). This project, conducted by UC Berkeley, is designed to create a system that moves beyond *Did You Feel It?*. Data from smartphone applications can complement data obtained from ground monitoring instruments, thereby improving the resolution and accuracy of earthquake intensity maps.

Fig. 3. Screenshot of *iShake*.

#### **Toretter**

*Toretter* extracts tweets referring to earthquakes and estimates the location of the earthquake epicenter using location information of those tweets (Sakaki et al., 2010). A temporal model and a spatial model of social sensors are defined for earthquake detection. Then, methods are proposed to detect earthquakes and to estimate the location of an earthquake epicenter automatically.

The Toretter mechanism is shown in Fig. 4.

Fig. 4. Image of the Toretter mechanism.

First, it collects tweets referring to earthquakes by crawling with the Twitter API and filtering the tweet messages using a tweet classifier. Second, it tries to detect an earthquake from the collected tweets based on a temporal model for earthquake detection. Finally, it extracts location information for each tweet from Twitter. The system uses that information and a particle filter to estimate the earthquake epicenter based on a spatial model for social sensors.
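As a rough illustration (not Toretter's actual implementation), the three steps can be sketched as follows. The keyword test, the count threshold, and the centroid below are simplified stand-ins for the real tweet classifier, temporal model, and particle filter:

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str    # tweet message
    time: float  # posting time (epoch seconds)
    lat: float   # latitude attached to the tweet
    lon: float   # longitude attached to the tweet

def classify(tweet):
    """Step 1 stand-in: keep only tweets that report an earthquake."""
    return "earthquake" in tweet.text.lower()

def detect(tweets, window=600, threshold=3):
    """Step 2 stand-in: report an event when at least `threshold`
    positive tweets fall inside one `window`-second interval."""
    times = sorted(t.time for t in tweets)
    for start in times:
        if sum(1 for x in times if start <= x < start + window) >= threshold:
            return True
    return False

def estimate_epicenter(tweets):
    """Step 3 stand-in: centroid of tweet locations in place of a
    particle filter over the spatial model."""
    n = len(tweets)
    return (sum(t.lat for t in tweets) / n, sum(t.lon for t in tweets) / n)
```

In practice, each stand-in is replaced by the corresponding component described in the following sections: an SVM classifier, a temporal detection model, and a particle-filter location estimator.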

In this chapter, we explain methods of earthquake observation using social sensors according to the Toretter mechanism. We explain this entire process in the following section.

#### **2.2 Overview of social sensors**

We introduce the notion of *social sensors* and describe their features in comparison to physical sensors.

#### **2.2.1 Basic idea of social sensors**

Many methods and infrastructure can be used to observe events and natural phenomena using physical sensors: heavy traffic, air pollution, astronomical events, weather phenomena, and earthquakes are some examples. The basic mechanisms of such observations by physical sensors are presented on the right side of Fig. 5. First, a target event for observation occurs. Second, some sensors for the target event respond with a positive signal. Third, a central server collects signals from sensors and analyzes them. Finally, the server detects the target event or produces some observation values as output.

If users of social media observe an event, then similarly to physical sensors, they make posts about the event. For example, some Twitter users might post "Oh earthquake!" or "pouring rain, thunder & lightning " or "It's a double rainbow! & the moon is out. Beautiful!". These actions by users are analogous to the response of physical sensors to a stimulus: the users and sensors send a signal when an event occurs. Therefore, a user of social media is a sensor of a kind. We designate such sensors as social sensors.

Fig. 5. Correspondence between event observation by social sensors and by physical sensors.

Earthquake Observation by Social Sensors 317


An observation system incorporating social sensors is depicted on the left side of Fig. 5. First, an event occurs. Second, social media users make posts about the event. Third, the posts are collected at a central server and analyzed. Finally, the server detects the event or produces some observation value. This whole process corresponds to a process of observation by physical sensors, presented for comparison in Fig. 5

Methods for observing phenomena by physical sensors can be adapted to social sensors. Actually, some services based on social media use methods of observation resembling methods used with physical sensors.

Regarding Twitter users as social sensors, we can work with the following assumptions.

1. Each Twitter user is regarded as a sensor. A sensor detects a target event and makes a report probabilistically.
2. Each tweet is associated with a time and location, which is a latitude–longitude pair.


#### **2.2.2 Features of social sensors**

Social sensors differ from physical sensors in some points. We describe features of social sensors in comparison to physical sensors.



Social sensors are uncontrollable. They sometimes become inoperable because some users are not on-line; maybe they are sleeping or busy doing something else. They also function improperly more often than physical sensors because users misinterpret events more often than physical sensors. Therefore, it is necessary to know that social sensors are noisier than physical sensors and that their signals must be analyzed more carefully.

Social sensors, which are users of social media, are located over a wide area. They can give responses to events of many kinds, ranging from natural phenomena, such as earthquakes and hurricanes, to events related to human activities, such as heavy traffic, live performances, and elections. The extremely numerous social sensors all over the world present the possibility of responding to events of many kinds. In other words, detection of target events can be done with no cost to set up sensors. However, when using social media systems such as Twitter, which incorporate these social sensors, it is necessary to filter the signals (tweets) posted by social sensors (Twitter users) according to the event that is to be observed. Using some method, it is necessary to extract tweets referring to a target event. We summarize the features of social sensors and physical sensors in Table 1.

| features | physical sensors | social sensors |
|---|---|---|
| accuracy | high accuracy | noisy |
| versatility | target events only | events of any kind |
| cost | high | very low |
| processing | simple | complex |

Table 1. Features of physical sensors and social sensors.

We explain these methods in the next section.

### **3. Tweet collection**

In the first step portrayed in Fig. 4, it is necessary to collect tweets referring to an earthquake from Twitter. This process includes two steps: crawling tweets from Twitter and filtering out tweets that do not refer to the earthquake. For crawling and filtering tweets, we recommend using script programming languages, such as Python, Perl, and Ruby.

#### **3.1 Crawling tweets from Twitter**

To collect tweets or some user information from Twitter, one must use the Twitter Application Programmers Interface (API). Twitter API is a group of commands that are necessary to extract data from Twitter. Twitter has APIs of three kinds: Search API, REST API, and Streaming API. In this section, we introduce Search API and Streaming API, which are necessary to crawl tweets from Twitter. We explain REST API later because REST API is necessary to extract location information from Twitter information.

Additionally, it is known that Twitter API specifications are subject to change. When using Twitter API, it is necessary to know the latest details and requirements. They are obtainable from Twitter API documentation1.

#### **3.1.1 Twitter Search API**

The Twitter Search API extracts tweets from Twitter, including search keywords or those fitting other retrieval conditions, in chronological order. It is possible to use language, date, location and other conditions as retrieval conditions.

When searching tweets including *earthquake* posted from 1 Sep 2011 to 5 Sep 2011, one might access the following URL:


http://search.twitter.com/search.json?q=earthquake&since=2011-09-01&until=2011-09-05

| tweet time | user | tweet text |
|---|---|---|
| 2011-09-04 04:47:10 | user 1 | The truth of 311 seismic terror. http://t.co/R9I6U9w 911 #earthquake #fukushima #japan #CNN #tsunami #prayforJapan |
| 2011-09-04 04:47:09 | user 2 | FML! What did I say?! @..... RT @.... 24 HOUR EARTHQUAKE WARNING for San Diego, - 6.0+ likely - hey @.... |
| 2011-09-04 04:47:08 | user 3 | ML 2.3 SOUTHERN GREECE: Magnitude ML 2.3 Region SOUTHERN GREECE Date time 2011-09-04 04:37:42.0 UTC Location ... |

Table 2. Search results of keyword *earthquake* after the conversion.

It is possible to obtain results such as those in Fig. 6, described in JavaScript Object Notation (JSON) format, which is a text-based open standard designed for human-readable data. It is possible to convert the result in Fig. 6 into Table 2 by parsing it with a script programming language. Parameters that are often used to collect tweets are shown in Table 3 (this table is based on the Twitter API Documentation2).
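For illustration, the URL above can be built and the JSON response converted into rows of Table 2 with a short script. The endpoint is the legacy Search API shown earlier, and the field names (`results`, `created_at`, `from_user`, `text`) follow its old response format, so treat this as a sketch rather than a recipe for the current API:

```python
import json
from urllib.parse import urlencode

# Build the query URL shown above (legacy Search API endpoint).
params = {"q": "earthquake", "since": "2011-09-01", "until": "2011-09-05"}
url = "http://search.twitter.com/search.json?" + urlencode(params)

# A response in the shape the legacy Search API used to return;
# field names follow its old documentation, trimmed for illustration.
raw = """{"results": [
 {"created_at": "Sun, 04 Sep 2011 04:47:10 +0000",
  "from_user": "user 1", "text": "The truth of 311 seismic terror. ..."},
 {"created_at": "Sun, 04 Sep 2011 04:47:09 +0000",
  "from_user": "user 2", "text": "FML! What did I say?! ..."}
]}"""

def to_rows(payload):
    """Convert a JSON result (Fig. 6) into (time, user, text) rows (Table 2)."""
    return [(r["created_at"], r["from_user"], r["text"])
            for r in json.loads(payload)["results"]]
```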

<sup>1</sup> https://dev.twitter.com/docs

<sup>2</sup> https://dev.twitter.com/docs/api/1/get/search


Fig. 6. Search results from Twitter Search API.


| name | explanation | required | value |
|---|---|---|---|
| q | search keywords | required | |
| rpp | the number of tweets to return per page | optional | up to 100 |
| result_type | type of search result | optional | mixed/recent/popular |
| until | tweets before the given date | optional | before today |
| since | tweets after the given date | optional | after 5 days ago |
| lang | restricts tweets to the given language | optional | jp/en/all/others |

Table 3. Search conditions of Twitter Search API.

Some points must be considered when using Twitter Search API:

• The number of API requests is limited. (No limit is published, but it is possible to access the Twitter Search API at least 500 times per hour.)
• It is possible to collect tweets posted only during the prior five days. It is not possible to search tweets posted six days ago.
• It is only possible to collect the latest 1500 tweets at one time. (Technically speaking, it is possible to access one page with a request and track pages back to the 15th page. One page includes 100 tweets at most. Therefore it is possible to acquire the latest 1500 tweets at one time.)


Therefore, we recommend collecting tweets every 10 min or more often: if tweets including *earthquake* are posted at 2000 tweets per hour and one uses the Twitter Search API only once per hour, it is impossible to crawl all of them. Actually, tweets including *earthquake* were posted at more than 5000 per hour when the earthquake occurred on March 11, 2011.

Toretter requests the API command *search* 15 times every 5 min to collect the latest tweets each time: 180 command executions per hour.
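The crawl budget described above can be sketched as follows; `rpp` and `page` are the paging parameters of the legacy Search API (Table 3), and one crawl of 15 pages covers the 1500-tweet limit:

```python
from urllib.parse import urlencode

def page_urls(keyword, pages=15, rpp=100):
    """URLs for one crawl: up to `pages` pages of `rpp` tweets each,
    i.e. at most the 1500 latest tweets (the Search API limit)."""
    base = "http://search.twitter.com/search.json?"
    return [base + urlencode({"q": keyword, "rpp": rpp, "page": p})
            for p in range(1, pages + 1)]

urls = page_urls("earthquake")
# One crawl every 5 min, as Toretter does: 15 requests x 12 crawls/hour.
calls_per_hour = len(urls) * (60 // 5)
```

This reproduces the figure quoted above: 15 *search* requests every 5 min, 180 command executions per hour.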

#### **3.1.2 Twitter Streaming API**

The Twitter Streaming API is defined in the Twitter API documentation as follows:


The Twitter Streaming API enables high-throughput near-real-time access to various subsets of public and protected Twitter data.

The Twitter Streaming API provides the methods shown in Table 4, of which the *filter* method can be used to crawl tweets related to earthquakes.


| command | explanation |
|---|---|
| filter | returns public statuses that match one or more filtering conditions. |
| sample | returns a random sample of all public statuses. (The ratio is about 1%.) |
| firehose | returns all public statuses. (A few companies have permission to access this command.) |
| retweet | returns all retweets. |
| link | returns all statuses containing http: and https:. |

Table 4. Streaming API methods.

The *filter* method returns public statuses that match one or more filtering conditions. All conditions of *filter* are presented in Table 5. It is possible to use the parameter *track* to collect tweets because keywords can be set as a condition value of *track*.


| condition | explanation |
|---|---|
| follow | returns public statuses that reference the given set of users. |
| track | returns public statuses that include specified keywords. |
| locations | returns public statuses posted from a specific set of bounding boxes to track. |

Table 5. Conditions of the *filter* method.

When using a *filter* command with the parameter keyword, *earthquake*, it is necessary to create a file called *tracking* that contains *track=earthquake*. Then one can access the following URL:

https://stream.twitter.com/1/statuses/filter.json

Streaming API also returns results in the form of JSON, shown in Fig. 6. Therefore, it is possible to parse those results in the same way as results obtained with Search API.
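A sketch of consuming such a stream: each status arrives as one JSON object per line of the long-lived HTTP response (the line-delimited framing is an assumption of this sketch), simulated here with an in-memory stream:

```python
import io
import json

def read_stream(stream):
    """Consume a streaming connection: statuses arrive as newline-delimited
    JSON objects and are parsed like Search API results."""
    for line in stream:
        line = line.strip()
        if not line:        # skip keep-alive blank lines
            continue
        yield json.loads(line)

# Simulated connection; in practice this would be the long-lived HTTP
# response body of statuses/filter.json.
fake = io.StringIO('{"text": "Oh earthquake!"}\n\n{"text": "still shaking"}\n')
texts = [status["text"] for status in read_stream(fake)]
```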

It is possible to collect tweets including *earthquake* in real time. Some points must be considered when using Twitter Streaming API:

• The prepared server must have sufficiently high specifications to process all data received from Twitter.
• It is impossible to use some characters in Twitter Streaming API (e.g., Japanese characters cannot be used in Twitter Streaming API).


Using Toretter, we want to detect earthquakes in Japan. For that purpose, it is necessary to collect tweets including *earthquake* in Japanese. However, Japanese characters cannot be used in Twitter Streaming API. Therefore, Toretter uses the Twitter Search API to crawl tweets. To collect tweets of languages other than English, it is necessary to check whether that language is supported by the Twitter Streaming API.



| tweet | real-time |
|---|---|
| SYF News: Magnitude 7.0 earthquake shakes Vanuatu; no tsunami alert | no |
| HOLY \*\*\*\* EARTHQUAKE | yes |
| Powerful earthquake rocks Vanuatu, no tsunami warnings (Newkerala) | no |
| AAAAAAAAAH earthquake ! | yes |
| Holy \*\*\*\*, that earthquake scared the \*\*\*\* outta me | yes |
| a year on after our very first earthquake... and the shakes are still happening | no |

Table 6. Sample tweets and relevance of real-time earthquake detection.

#### **3.2 Filtering tweets using machine learning**

We collected data from tweets including keywords related to earthquakes, such as *earthquake* and *shake*. Sample tweets are presented in Table 6.

Those tweets include not only tweets that users posted immediately after they felt earthquakes, but also tweets that users posted shortly after they heard earthquake news, or perhaps after they misinterpreted the shaking of a large truck passing nearby. Figure 7 presents sizes of earthquakes and counts of Japanese tweets including the keyword *earthquake* on February 11, 2011. When the seismic activity reached a peak, the graph of tweet counts invariably showed a peak. However, when the graph of tweet counts showed a peak, the seismic activity did not necessarily show a peak. Some "false-positive" peaks in the graph of tweet counts arise from mistakes by users or from news related to earthquakes. Therefore, we must filter tweets to extract those posted immediately after an earthquake. We designate tweets posted by users who felt earthquakes as *positive* tweets, and other tweets as *negative* tweets.

Fig. 7. Size of earthquakes and change of tweet counts on February 11, 2011.

Here, we describe the creation of a classifier to categorize crawled tweets into *positive* tweets and *negative* tweets, using a Support Vector Machine, a supervised learning method.

#### **3.2.1 Supervised learning**

Supervised learning is a machine learning approach that solves classification and regression problems by analyzing training data. It is often used for spam mail filtering and gender estimation of Web users.
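The supervised-learning loop can be illustrated with a toy spam filter. For brevity, this sketch trains a perceptron rather than an SVM — both learn a separating hyperplane from labeled samples — and the training messages and vocabulary are invented for illustration:

```python
def featurize(text, vocab):
    """Bag-of-words features: 1.0 if the vocabulary word occurs in the message."""
    words = text.lower().split()
    return [1.0 if w in words else 0.0 for w in vocab]

def train(samples, vocab, epochs=20):
    """Learn weights from (text, label) pairs; label +1 = spam, -1 = normal."""
    w, b = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text, vocab)
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if (1 if score > 0 else -1) != label:   # update only on mistakes
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

def predict(text, vocab, w, b):
    x = featurize(text, vocab)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

The three steps described below (prepare labeled samples, extract features, learn a classifier) correspond to preparing `samples`, calling `featurize`, and calling `train`.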

Toretter uses Support Vector Machine (SVM), an extremely effective supervised learning method.

3.2.1.1 Support Vector Machine

SVM is a method used to create a classifier for two-class pattern classification. The SVM projects each training sample as points (as presented on the left side of Fig. 8) into multi-dimensional feature space. It creates a hyperplane that has the largest distance to the nearest training sample points of each class (as presented on the right side of Fig. 8). One must input positive samples and negative samples into SVM, which creates a classifier for two classes by searching for the hyperplane.

Fig. 8. Mechanism of Support Vector Machine.

To study SVM in detail, several books are useful (Bishop, 2006).

3.2.1.2 Process of creating a classifier using machine learning

Figure 9 depicts the process of supervised learning, which has three steps. We explain this process using the example of the creation of a spam filter along the lines of Fig. 9.

Fig. 9. Process of Machine Learning.

First, we prepare both sample collections of spam mails as positive samples and those of other mails as negative samples. Those must be classified manually by humans.

Second, we extract various features from samples. We must select effective features for classification. Effective features are those which positive samples seem to have and which negative samples do not seem to have, or vice versa. For example, all words included in samples are often used to create spam filters because we can infer that spam messages include words such as "Free!", "50% off!", and "Call now!" with high probability.

Third, we input both positive samples and negative samples with feature information and create a classifier for those samples. If a new mail is input into the classifier, then it outputs a positive value or a negative value. If the output is positive, the new mail is regarded as a spam message.

#### **3.2.2 Creation of sample data for the classifier**

Positive samples and negative samples must be created manually. There are two points of consideration.

First, this process is very sensitive. One must classify positive tweets and negative tweets accurately. Therefore, it is necessary to acquire records of actual earthquakes. One must choose positive tweets referring to these earthquake records to classify them precisely.

Second, one must prepare equal numbers of positive tweets and negative tweets. The number of samples needed depends on the task. Generally, it is said that sample data must comprise 300–500 samples. Actually, one should increase the number of samples until finding the classification which provides sufficient performance.

#### **3.2.3 Extraction of features from sample data**

Next, one must select features of tweets for machine learning. In the spam mail filter example, words included in sample mails are chosen as features. Toretter uses features of three kinds. We explain them in detail and use the following sentence for explanation.

Oh! Earthquake happened right now!

**Keyword features** all words included in a tweet.
example sentence → *Oh*, *earthquake*, *happened*, *right*, *now*

**Statistical features** number of words in a tweet message and the position of the search keyword within a tweet.
example sentence → number of words: *five*, the position of the search keyword: *second*

**Context features** words before and after a search keyword.
example sentence → *Oh*, *happened*

*Statistical features* are the most effective of these three kinds according to the results of our earlier research (Sakaki et al., 2010). We guess that this is because people who experience an earthquake are surprised and in an emergency situation, so they tend to post short tweets such as "Oh! earthquake!" and "It's shaking".

Of course, these features can differ depending on language, country, and culture. Therefore, effective features should be chosen when creating a filter for tweets.
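Toretter's three feature kinds (keyword, statistical, and context features) can be computed for the example sentence with a short sketch; the whitespace-and-punctuation tokenizer here is a simplification:

```python
def extract_features(text, keyword="earthquake"):
    """Compute Toretter's three feature kinds for one tweet.
    Tokenization: split on whitespace, strip surrounding punctuation."""
    words = [w.strip("!?,.").lower() for w in text.split()]
    words = [w for w in words if w]
    pos = words.index(keyword) + 1  # 1-indexed position of the search keyword
    return {
        "keyword": words,                  # keyword features: all words
        "statistical": (len(words), pos),  # word count, keyword position
        "context": (words[pos - 2] if pos > 1 else None,        # word before
                    words[pos] if pos < len(words) else None),  # word after
    }

features = extract_features("Oh! Earthquake happened right now!")
```

For the example sentence this yields five words with the keyword in the second position, and the context words *oh* and *happened*, matching the description above.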
