#### 2.4.2. Kernel-based data fusion

Kernel methods are based on a kernel function, a similarity function defined over pairs of data points. The kernel function enables a kernel method to operate in a high-dimensional space by applying only an inner product, introducing nonlinearity into the decision function by implicitly mapping the original features of the sources onto a higher-dimensional space. For a kernel function $\kappa(\mathbf{x}, \mathbf{y})$ and a mapping function $\phi : X \rightarrow F$, the model built by the kernel method can be expressed as an inner product:

$$\kappa(\mathbf{x}, \mathbf{y}) = \left\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \right\rangle \tag{2}$$

where $\kappa(\mathbf{x}, \mathbf{y})$ is positive semidefinite and $\phi : X \rightarrow F$ maps each instance $\mathbf{x}$, $\mathbf{y}$ into the feature space $F$, which is a Hilbert space. With the kernel method, a simple mining technique such as classification can then be applied to analyze the data.
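
As a concrete illustration of Eq. (2), the following minimal Python sketch (illustrative only, not from the chapter, whose implementation is in Matlab) verifies that the homogeneous degree-2 polynomial kernel $(\mathbf{x} \cdot \mathbf{y})^2$ equals an inner product under an explicit feature map $\phi$:

```python
import numpy as np

# Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that <phi(x), phi(y)> = (x . y)^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = phi(x) @ phi(y)   # inner product computed in the feature space F
rhs = (x @ y) ** 2      # kernel evaluated directly in the input space
print(lhs, rhs)         # both print 16.0: no explicit mapping is needed
```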

Kernel methods can be described as a class of algorithms for pattern analysis, whose best-known member is the support vector machine [18]. Many kernels exist, including polynomial, Fisher, radial basis function (RBF), string, and graph kernels. Several commonly used kernel functions are:

$$\text{Linear function}: \kappa(\mathbf{x}\_i, \mathbf{x}) = \mathbf{x}\_i \cdot \mathbf{x} \tag{3}$$

$$\text{Polynomial function}: \kappa(\mathbf{x}\_i, \mathbf{x}) = [(\mathbf{x}\_i \cdot \mathbf{x}) + 1]^p \tag{4}$$

$$\text{Radial basis function (RBF)}: \kappa(\mathbf{x}\_i, \mathbf{x}) = e^{-\left\|\mathbf{x}\_i - \mathbf{x}\right\|^2 / 2\sigma^2} \tag{5}$$

where $\mathbf{x}\_i$ and $\mathbf{x}$ are two samples represented as feature vectors, $\left\|\mathbf{x}\_i - \mathbf{x}\right\|$ is the distance between the two feature vectors, $\sigma$ is a free parameter, and $p$ is a constant.
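
For reference, Eqs. (3)–(5) translate directly into code; this is a generic Python sketch, with the parameter names $p$ and $\sigma$ taken from the text:

```python
import numpy as np

def linear_kernel(xi, x):
    # Eq. (3): plain inner product of the two feature vectors
    return xi @ x

def polynomial_kernel(xi, x, p=2):
    # Eq. (4): shifted inner product raised to the constant power p
    return (xi @ x + 1) ** p

def rbf_kernel(xi, x, sigma=1.0):
    # Eq. (5): Gaussian of the squared distance, with free parameter sigma
    return np.exp(-np.linalg.norm(xi - x) ** 2 / (2 * sigma ** 2))

xi, x = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(linear_kernel(xi, x), polynomial_kernel(xi, x), rbf_kernel(xi, x))
```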

Studies show that nonlinear kernels, for example, string or RBF kernels, achieve significantly higher accuracy on multimedia data than linear classification models [19]. Kernel-based data fusion, denoted kernel fusion, was pioneered by Lanckriet et al. [20] as a statistical learning framework for genomic data fusion and has since been applied widely. In particular, the kernel representation resolves the heterogeneity of data sources by transforming different data structures into kernel matrices.

#### 2.4.3. Multiple kernel learning for fusion

When dealing with multimedia input, kernel-based data fusion could simply merge all the features from the different sources into one concatenated vector before classification. However, it is hard to combine features into one representation without facing the problem of dimensionality [21]. Multiple kernel learning (MKL) is one of the most popular fusion technologies (Lan et al.); it allows possibly heterogeneous data sources to be combined by reducing them to the common framework of kernel matrices. This reduction is achieved by using a kernel for each type of feature rather than one kernel for all the features. For a set of base kernels $\kappa\_l$, the optimal kernel combination is calculated as:

$$\kappa\_{optimal} = \sum\_{l} \beta\_l \kappa\_l \tag{6}$$

where $\beta\_l$ is the weight for each base kernel $\kappa\_l$.

Multiple kernel learning is flexible for multimodal data, since each set of data features is assigned a different notion of similarity, i.e., a different kernel. Instead of building one specialized kernel for an application with multimodal data, it is possible to define a kernel for each kind of data and linearly combine these kernels [22]. Multiple kernel learning then provides the optimal combination of the kernels. In this study, semi-infinite programming [23] is used to optimize the kernel weights robustly and automatically. It solves the MKL problem in two steps: first, the problem is initialized with a small number of linear constraints; second, the parameters are solved for.

In event detection, the MKL framework defines a new kernel function as a linear combination of base kernels:

$$\kappa(\mathbf{x}\_i, \mathbf{x}) = \sum\_{l} \beta\_l \kappa\_l(\mathbf{x}\_i, \mathbf{x}) \tag{7}$$

where each base kernel $\kappa\_l$ is selected for one specific feature, the nonnegative coefficient $\beta\_l$ represents the weight of the $l$th base kernel in the combination, and $\sum\_{l} \beta\_l = 1$.

A kernel is utilized for each of the features, followed by a combination of the multiple feature kernels as indicated in Eq. (7). To select the spread parameter $\sigma$ for each kernel, cross-validation is performed with a grid search over the range 0.001–0.01. This selection suits our data, yielding the best classification accuracy without requiring long processing times. Cross-validation is a model evaluation method applied during the training phase to find unknown parameters. To find the best kernel for the image features and the text features, cross-validation is applied; the best kernel means the best $\sigma$ of the RBF kernel. The final kernel is the weighted sum of the feature kernels, each with its optimal $\sigma$, as sketched below.
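
The selection just described might be sketched as follows in Python (the chapter's implementation is in Matlab; `X_text`, `X_image`, `y`, and the 0.70/0.30 weights are placeholders, not the authors' data). Note that scikit-learn parameterizes the RBF kernel by `gamma`, which corresponds to $1/(2\sigma^2)$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical per-modality feature matrices and labels standing in for the
# chapter's tweet features.
rng = np.random.default_rng(0)
X_text, X_image = rng.normal(size=(100, 50)), rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100) * 2 - 1           # labels in {+1, -1}

def best_sigma(X, y, grid=np.linspace(0.001, 0.01, 10)):
    """Pick the RBF spread sigma by cross-validated grid search, as in the text."""
    scores = []
    for sigma in grid:
        K = rbf_kernel(X, X, gamma=1.0 / (2 * sigma ** 2))   # gamma = 1/(2 sigma^2)
        svc = SVC(kernel="precomputed", C=1.0)
        scores.append(cross_val_score(svc, K, y, cv=5).mean())
    return grid[int(np.argmax(scores))]

s_text, s_img = best_sigma(X_text, y), best_sigma(X_image, y)
K_text = rbf_kernel(X_text, X_text, gamma=1.0 / (2 * s_text ** 2))
K_img = rbf_kernel(X_image, X_image, gamma=1.0 / (2 * s_img ** 2))

beta = (0.7, 0.3)                                  # example weights, cf. Eq. (6)
K_fused = beta[0] * K_text + beta[1] * K_img       # weighted sum of feature kernels
```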


MKL is coupled with classifier learning, such as the support vector machine (SVM) [24] in our method, which mutually enhances the interpretability of results. The support vector machine is formalized as solving an optimization problem: it finds the best hyperplane separating relevant and irrelevant vectors by maximizing the size of the margin between the two sets. By using a kernel, it can find the maximum-margin hyperplane in a transformed space.

For a given set of $n$ training examples $\{(\mathbf{x}\_i, y\_i)\}\_{i=1}^{n}$, with $\mathbf{x}\_i \in \mathbb{R}^d$ a training example and $y\_i \in \{+1, -1\}$ the corresponding class label, the nonlinear support vector machine maps each training example $\mathbf{x}\_i$ in the input space to a higher-dimensional space as $\phi(\mathbf{x}\_i)$ using a nonlinear mapping function $\phi$. It constructs an optimal hyperplane, defined by Eq. (8), to separate the two classes.

$$\mathbf{w}^T \phi(\mathbf{x}) + b = 0 \tag{8}$$

where $b \in \mathbb{R}$ and $\mathbf{w}$ is a normal vector. The hyperplane constructed in the kernel feature space is a maximum-margin hyperplane, one that maximizes the margin between the two classes. This is achieved by solving the primal SVM problem:

$$\begin{aligned} \min \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum\_{i} \xi\_i^2 \qquad \text{subject to} \\ & y\_i (\mathbf{w}^T \phi(\mathbf{x}\_i) + b) \ge 1 - \xi\_i, \quad i = 1, 2, \ldots, n \\ & \xi\_i \ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \tag{9}$$

where $\xi\_i$ are nonnegative slack variables and $C$ is a regularization parameter that determines the trade-off between the margin and the error on the training data. The minimization is over the parameters $\mathbf{w}$, $b$, and $\xi\_i$. The corresponding SVM dual problem for the primal problem described in Eq. (9), derived from its Lagrangian, is:

$$\begin{aligned} \max \quad & \sum\_{i=1}^{n} \alpha\_i - \frac{1}{2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \alpha\_i \alpha\_j y\_i y\_j \left( k(\mathbf{x}\_i, \mathbf{x}\_j) + \frac{1}{C} \delta\_{ij} \right) \quad \text{subject to} \\ & \sum\_{i=1}^{n} y\_i \alpha\_i = 0 \\ & 0 \le \alpha\_i \le C, \quad i = 1, 2, \dots, n \end{aligned} \tag{10}$$

where $\delta\_{ij}$ is the Kronecker delta, defined to be 1 if $i = j$ and 0 otherwise.

The dual problem is a key point for deriving SVM algorithms and studying their convergence properties. The function $k(\mathbf{x}\_i, \mathbf{x}\_j) = \phi(\mathbf{x}\_i)^T \phi(\mathbf{x}\_j)$ is the kernel function, and the $\alpha\_j$ are the Lagrange coefficients. The Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for the solution to the optimal parameters when there are one or more inequality constraints. Here, the KKT conditions for Eq. (10) are also sufficient for optimality, since Eq. (10) meets the following three conditions: the objective function is concave, the inequality constraints are continuously differentiable convex functions, and the equality constraint is an affine function. According to the KKT conditions, the optimal parameters $\alpha^{\*}$, $\mathbf{w}^{\*}$, and $b^{\*}$ must satisfy:

$$\alpha\_i^\* \left[ \mathbf{y}\_i \left( \sum\_{j=1}^n \alpha\_j^\* \mathbf{y}\_j k(\mathbf{x}\_i, \mathbf{x}\_j) + b^\* \right) - \mathbf{1} + \xi\_i \right] = \mathbf{0}, \ i = 1, 2, \dots, n \tag{11}$$

In classification, usually only a small subset of the Lagrange multipliers $\alpha\_i^{\*}$ are nonzero. The training examples with nonzero $\alpha\_i^{\*}$ are defined as support vectors. They construct the optimal separating hyperplane as:

$$\mathbf{w}^{\*T}\phi(\mathbf{x}) + b^{\*} = \sum\_{j=1}^{n} \alpha\_{j}^{\*} y\_{j} k(\mathbf{x}, \mathbf{x}\_{j}) + b^{\*} = 0 \tag{12}$$
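
For illustration, the dual solution and its support vectors can be inspected with an off-the-shelf solver. The sketch below uses scikit-learn's `SVC` on a precomputed Gram matrix with toy data; note that `SVC` solves the standard 1-norm soft-margin dual, a close stand-in for the 2-norm variant of Eqs. (9) and (10):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy stand-in data; in the chapter this would be the fused tweet kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] > 0, 1, -1)          # labels in {+1, -1}

K = rbf_kernel(X, X, gamma=0.5)           # precomputed kernel matrix k(x_i, x_j)
svc = SVC(kernel="precomputed", C=1.0).fit(K, y)

# Per Eq. (11), only examples with nonzero alpha_i* remain in the solution:
print(svc.support_)     # indices of the support vectors
print(svc.dual_coef_)   # y_i * alpha_i* for each support vector
print(svc.intercept_)   # the bias b* in Eq. (12)
```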


In the SVM framework, the task of multiple kernel learning is considered a way of optimizing the kernel weights while training the SVM. For multiple kernels, Eq. (10) can be converted into the following equation to derive the dual form of MKL:

$$\begin{aligned} \max \quad & \sum\_{i=1}^{n} \alpha\_i - \frac{1}{2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \alpha\_i \alpha\_j y\_i y\_j \sum\_{l=1}^{m} \beta\_l k\_l(\mathbf{x}\_i, \mathbf{x}\_j) \quad \text{subject to} \\ & \sum\_{i=1}^{n} y\_i \alpha\_i = 0 \\ & 0 \le \alpha\_i \le C, \quad i = 1, 2, \dots, n \\ & \beta\_l \ge 0, \quad \sum\_{l=1}^{m} \beta\_l = 1, \quad l = 1, 2, \dots, m \end{aligned} \tag{13}$$

In Eq. (13), both the base kernel weights $\beta\_l$ and the Lagrange coefficients $\alpha\_j$ need to be optimized. A two-step procedure is considered to decompose the problem into two optimization problems.

In the first step, through grid search and cross-validation, the best weights $\beta\_l$ are derived by minimizing the 2-norm soft-margin error function using linear programming. The weights for text features and image features change according to the type of data; for example, for the wildfire data, the weight for text features was chosen as 0.70 and the weight for image features as 0.30. In the second step, the Lagrange coefficients $\alpha\_j$ are obtained by maximizing Eq. (13) using quadratic programming. The interior point method is used to solve the quadratic program in the proposed method, which achieves optimization by traversing the convex interior of the feasible region.
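
A hedged sketch of this two-step procedure follows, reusing `K_text`, `K_img`, and `y` from the σ-selection sketch above. A plain cross-validated grid search stands in for the linear program of step 1, and `SVC`'s SMO-based solver stands in for the interior point method of step 2:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Step 1: search the kernel weights. With two modalities, beta_img = 1 - beta_text
# keeps the simplex constraint sum(beta_l) = 1 of Eq. (13).
best_b, best_score = 0.0, -np.inf
for b_text in np.linspace(0.0, 1.0, 11):
    K = b_text * K_text + (1.0 - b_text) * K_img
    score = cross_val_score(SVC(kernel="precomputed", C=1.0), K, y, cv=5).mean()
    if score > best_score:
        best_b, best_score = b_text, score

# Step 2: with the weights fixed, solve for the Lagrange coefficients alpha_j
# by training on the optimally weighted kernel (the QP part of Eq. (13)).
K_opt = best_b * K_text + (1.0 - best_b) * K_img
final_svc = SVC(kernel="precomputed", C=1.0).fit(K_opt, y)
```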

#### 2.5. Final event detection

As described above, the training process of multimedia data fusion builds the system by deriving the parameters $\alpha\_j$, $b$, $\mathbf{x}\_i$, $\beta\_l$, and $k\_l$. For a test input $\mathbf{x}$, the decision function for MKL, i.e., the event detection function $F(\mathbf{x})$, is a convex combination of the base kernels, computed as:


$$F(\mathbf{x}) = \text{sign}\left(\sum\_{i} \sum\_{l} \beta\_l k\_l(\mathbf{x}\_i, \mathbf{x}) \, \alpha\_i + b\right) \tag{14}$$

where $\mathbf{x}\_i$ are the support vectors, $\alpha\_i$ denote the Lagrange multipliers corresponding to the support vectors, and $b$ is a bias that intercepts the hyperplane separating the two groups in the normalized data space.
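
Continuing the sketch above, Eq. (14) can be evaluated from the fitted model's support vectors; `K_new`, a hypothetical fused kernel between new tweets (rows) and the training tweets (columns), is simulated here from training rows:

```python
import numpy as np

# K_new: (n_new x n_train) fused kernel sum_l beta_l * k_l(x_i, x) for new tweets;
# the first five training rows of K_opt stand in for new data.
K_new = K_opt[:5, :]

# Eq. (14): sign of sum_i (alpha_i y_i) k_fused(x_i, x) + b over support vectors.
scores = K_new[:, final_svc.support_] @ final_svc.dual_coef_.ravel() + final_svc.intercept_
labels = np.sign(scores)   # +1: event happened, -1: event has not happened
print(labels)
```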

Depending on the sign of Eq. (14), the Twitter data are divided into two groups. The first group contains tweets of the positive class, meaning the event has happened. The second group contains tweets of the negative class, meaning the event has not happened. Both classes are based on image and text features extracted from the same tweet.

## 3. Experiment design, result, and discussion

### 3.1. Experiment design

Experiments have been done to build the event detection method and test its performance on real tweets. The algorithm is implemented in Matlab. In the experiments, tweets that contain both text and image are collected from the Twitter streams. The data collection covers two events: the Brisbane hailstorm and the California wildfire.

The data are separated into two sets: training and testing. The training data are divided into two groups, the event has happened or the event has not happened, which are manually labeled; each group has the same number of tweets. The same process is applied to the testing data. The numbers of samples for the two sets are the same. The reasons to have the same number of samples are that the greater the size of the training and testing sets, the better the algorithm is trained and tested, and that the total number of samples is big enough to split the data into two equal sets. For each tweet to be used for detecting whether an event has happened or not, its features are extracted for the fusing operation.

In order to validate the performance of the proposed MKL event detection using both text and image, two other methods are also built and tested. Both of the other methods are based on single kernel learning, with one method taking text only as input and the other taking image only as input.

### 3.2. Performance evaluation parameters

In order to measure the performance of the proposed method and those of the comparison methods more objectively and comprehensively, four performance parameters are used, including accuracy (A), precision, recall, and F-score [25]. They are defined below.

The accuracy for the event detection method is defined as

$$A = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$
