2.1.1. Brisbane hailstorm 2014

The Brisbane hailstorm occurred in Brisbane, Australia on November 27, 2014. It was the worst hailstorm in a decade, injuring about 40 people and causing around 1.1 billion Australian dollars in damage. The data on this hailstorm were collected between November 27 and November 28, 2014 and contained both text and images. The dataset contained 280,000 tweets. Figure 2 presents an example of the tweets (left column) and the word cloud for the data (right column). A word cloud is an image consisting of the words used in the data, where the size of each word indicates its frequency of occurrence.

Figure 2. An example of tweets on the Brisbane hailstorm (left) and the word cloud for the event (right).

2.1.2. California wildfires 2017

The 2017 wildfire season in California started in April and extended to December. A total of 1,381,405 acres were burned and the economic cost was over 13.028 billion US dollars. The data for this event were collected over 5 days in July 2017. The dataset contained 600,000 tweets, with some tweets consisting of both text and images.

2.2. Data preprocessing

The goal of data preprocessing is to discover important features from collected raw data. Preprocessing is a set of techniques applied prior to analysis to remove imperfections, inconsistencies, and redundancy. In this study, the text data needed substantial preprocessing because many tweets were not properly formatted or contained spelling errors. The text data were therefore cleaned with a filter before further handling. For image data, we extracted each image's hyperlink and removed a tweet if its hyperlink was empty or did not work, since in this study a tweet must contain both an image and text. After preprocessing, the data are ready for feature extraction.

2.3. Feature extraction

Event detection requires a set of features. A feature vector is a set of features that represents the data in reduced dimensionality, which matters especially for large volumes of data. Feature extraction reduces the amount of resources required to describe a large set of data accurately. Two approaches to feature extraction were employed for the different kinds of data: content-based and description-based. Content-based feature extraction is based on the content of an object, whereas description-based extraction relies on metadata such as keywords. In this study, content-based features were used for images, and description-based features were used for texts.
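
As a rough illustration of the two approaches, the sketch below builds a description-based vector from keyword counts in a tweet's text and a content-based vector from a coarse intensity histogram of an image's pixels. The keyword list, the function names, the number of bins, and the choice of a histogram are all assumptions made for this example; the study's actual features are not specified here.

```python
import re
from collections import Counter

# Hypothetical keyword vocabulary; the keywords used in the study are not given.
KEYWORDS = ["hail", "storm", "fire", "damage"]


def description_features(text, keywords=KEYWORDS):
    """Count how often each keyword occurs in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[k] for k in keywords]


def content_features(pixels, bins=4):
    """Normalised histogram of grayscale intensities in the range 0-255."""
    hist = [0] * bins
    for p in pixels:
        # Map an intensity to its bin; clamp so 255 lands in the last bin.
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist] if total else [0.0] * bins


tweet = "Huge hail storm in Brisbane, hail everywhere!"
image = [0, 10, 70, 130, 200, 255, 255, 64]  # toy grayscale pixel values

text_vec = description_features(tweet)   # one count per keyword
image_vec = content_features(image)      # one fraction per intensity bin
feature_vec = text_vec + image_vec       # simple concatenation (an assumption)
```

Concatenating the two vectors gives one fixed-length representation per tweet. This is only one possible way to combine text and image features, chosen here for brevity.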
