2.1.1. Brisbane hailstorm 2014

The Brisbane hailstorm occurred in Brisbane, Australia, on November 27, 2014. It was the city's worst hailstorm in a decade, injuring about 40 people and causing around 1.1 billion Australian dollars in damage. The data about this hailstorm were collected between November 27 and November 28, 2014, and contained both text and images. The dataset contained 280,000 tweets. Figure 2 presents an example of the tweets (left column) and the word cloud for the data (right column). A word cloud is an image consisting of the words used in the data, where the size of each word indicates its frequency of occurrence.

2.3. Feature extraction

In event detection, a set of features is required. A feature vector is a set of features used to reduce the dimensionality of the data, which is especially important for large data volumes. Feature extraction reduces the amount of resources required to describe a large set of data accurately. Two approaches to feature extraction were employed for the different data sets: content-based and description-based. Content-based feature extraction operates on the content of an object itself, whereas description-based extraction relies on metadata such as keywords. In this study, content-based features were used for images, and description-based features were used for texts.

2.3.1. Textual features

Multiple Kernel-Based Multimedia Fusion for Automated Event Detection from Tweets

http://dx.doi.org/10.5772/intechopen.77178


In extracting textual features, two major processes are executed: filtering and feature calculation. Filtering derives the key information from the tweets. Feature calculation then represents the significance of a word within a given document using a measurement named term frequency-inverse document frequency (TF-IDF) [13].

The filtering consists of six steps: retaining only tweets written in English; converting all words to lowercase; splitting each string into a list of tokens based on whitespace; removing punctuation marks from the text; eliminating common words that tell nothing about the dataset (such as "the," "and," "for," etc.); and reducing each word to its stem by removing any prefixes or suffixes.

After the filtering, TF-IDF is calculated. TF-IDF is a statistical measure of the significance of a word within tweets, based on how often the word occurs in an individual tweet compared with how often it occurs in other tweets [14]. The advantage of the TF-IDF technique is that it supports information retrieval: a word's TF-IDF value increases in proportion to the number of times the word appears in a document, offset by the frequency of the word across the database. The TF-IDF algorithm utilizes a combination of term frequency and inverse document frequency.

Suppose there is a vocabulary of k words; then each document d is represented by a k-vector V_d = (t_1, …, t_i, …, t_k)^T of weighted word frequencies with components t_i. TF-IDF is computed as follows:

t_i = (n_id / n_d) log(N / n_i)    (1)

where n_id is the number of occurrences of word i in document d, n_d is the total number of words in document d, n_i is the number of documents in the database that contain word i, and N is the total number of documents in the database. TF-IDF is thus the product of the word frequency n_id / n_d and the inverse document frequency log(N / n_i). For a word i, the more often it occurs in document d (i.e., the higher n_id is), the bigger t_i is, meaning that word i is more significant in d; this significance is offset by the frequency of word i across the whole database through the log(N / n_i) term.
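The filtering steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the language-filtering step is omitted, and the tiny stop-word list and suffix-stripping "stemmer" are stand-ins for whatever resources the study actually used (e.g., a full stop-word list and a proper stemmer).

```python
import string

# Stand-in stop-word list and suffix set (hypothetical, for illustration only).
STOP_WORDS = {"the", "and", "for", "a", "an", "of", "in", "on", "is", "at"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    # Crude stemming: strip one common suffix if the remainder stays long enough.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def filter_tweet(text):
    # Steps: lowercase -> whitespace tokenization -> punctuation removal
    # -> stop-word elimination -> stemming.
    tokens = text.lower().split()
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

print(filter_tweet("Huge hail falling in Brisbane!"))
# → ['huge', 'hail', 'fall', 'brisbane']
```

Each tweet thus becomes a short list of content-bearing stems, ready for the TF-IDF calculation.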

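Eq. (1) itself can be sketched directly. In this minimal illustration, the input is a list of already-filtered token lists (one per tweet), and n_i is taken as the number of documents containing word i, consistent with the inverse-document-frequency term log(N / n_i).

```python
import math
from collections import Counter

def tfidf(documents):
    """Per-document TF-IDF weights, t_i = (n_id / n_d) * log(N / n_i) -- Eq. (1)."""
    N = len(documents)
    n_i = Counter()                 # number of documents containing word i
    for doc in documents:
        n_i.update(set(doc))
    weights = []
    for doc in documents:
        n_d = len(doc)              # total words in document d
        counts = Counter(doc)       # n_id: occurrences of word i in d
        weights.append({w: (c / n_d) * math.log(N / n_i[w])
                        for w, c in counts.items()})
    return weights

docs = [["hail", "storm", "brisbane"],
        ["hail", "damage", "car"],
        ["sunny", "day"]]
w = tfidf(docs)
# "hail" occurs in 2 of 3 documents, once among the 3 words of the first:
# t_hail = (1/3) * log(3/2) ≈ 0.135
```

Note that a word occurring in every document gets weight log(N/N) = 0, which is exactly the offsetting effect described above.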