of the word in the whole database. This offsetting results in different ti values for words i that are unevenly distributed among the documents.
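
If the offsetting described above is the familiar tf-idf scheme, in which term frequency is scaled by the log-inverse of document frequency, it can be sketched as follows (an assumption, since the full weighting definition lies outside this excerpt):

```python
import math

def tfidf(docs):
    """Weight each word by term frequency times log-inverse document frequency,
    so words spread evenly over all documents are offset toward zero."""
    n = len(docs)
    df = {}  # number of documents containing each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["event", "news", "news"], ["event", "sports"]]
w = tfidf(docs)
# "event" occurs in every document, so its weight is offset exactly to zero
```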

2.3.2. Visual features

In calculating visual features, each image is represented by a visual-word vector. A visual word is a cluster of keypoints in images that share a specific pattern. A keypoint is a highly distinctive section of an image, one that can be correctly matched against a large database of features. Keypoints are detected from various image features; in this study, four types of features are used, namely histogram of oriented gradients (HOG) [15], gray-level co-occurrence matrix (GLCM) [16], color histogram (CH) [17], and scale-invariant feature transform (SIFT) [14].

HOG is a feature descriptor computed by counting occurrences of gradient orientations in localized portions of an image. Because it operates on local cells, HOG is invariant to geometric and photometric transformations, except for object orientation.
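
A minimal sketch of the core HOG computation, the per-cell histogram of gradient orientations weighted by gradient magnitude (the number of bins and the normalization are illustrative defaults, not the study's settings):

```python
import numpy as np

def grad_orientation_histogram(cell, n_bins=9):
    """Histogram of gradient orientations for one cell (a minimal HOG-style descriptor)."""
    gy, gx = np.gradient(cell.astype(float))
    mag = np.hypot(gx, gy)                        # gradient magnitude per pixel
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    bins = np.floor(ang / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())    # vote with magnitude as weight
    return hist / (np.linalg.norm(hist) + 1e-9)   # L2-normalize, as HOG blocks do
```

A horizontal intensity ramp, for example, puts all of its votes into the 0-degree bin.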

GLCM is obtained by counting how often pairs of pixels with specific values, in a specified spatial relationship, occur in an image. It is used to describe texture, such as that of a land surface. It can provide useful information about the texture of an object, but not about its shape or size.
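
The co-occurrence counting can be sketched directly (the pixel offset, the number of gray levels, and the absence of normalization here are illustrative choices, not the study's settings):

```python
import numpy as np

def glcm(img, levels=4, offset=(0, 1)):
    """Gray-level co-occurrence matrix for one pixel offset (dy, dx), raw counts:
    m[a, b] counts how often value a has value b at the given offset."""
    dy, dx = offset
    m = np.zeros((levels, levels), dtype=int)
    h, w = img.shape
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m
```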

CH is defined as the distribution of colors in an image. It records the number of pixels of each color in a fixed list of color ranges. A major drawback of a color histogram is that it does not take into account the size and shape of the object.
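
A joint RGB histogram can be sketched as follows (the quantization of each channel into four ranges is an illustrative assumption):

```python
import numpy as np

def color_histogram(img, bins_per_channel=4):
    """Joint RGB histogram: fraction of pixels falling into each (R, G, B) color range.
    img is an H x W x 3 uint8 array."""
    q = (img // (256 // bins_per_channel)).reshape(-1, 3)   # quantize each channel
    idx = q[:, 0] * bins_per_channel**2 + q[:, 1] * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel**3)
    return hist / hist.sum()   # normalize so images of different sizes compare
```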

SIFT is an algorithm for detecting and describing local features in images, producing an image descriptor for image-based matching and recognition. It detects interest points in a grayscale image; at each interest point, statistics of the local gradient directions of image intensities are accumulated into a summarizing description of the surrounding image structure. The descriptor is then used to match corresponding interest points between different images.
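
The matching step can be illustrated with a nearest-neighbor matcher using Lowe's ratio test, the test conventionally paired with SIFT descriptors (the study's actual matching procedure is not specified in this excerpt):

```python
import numpy as np

def match_descriptors(d1, d2, ratio=0.8):
    """Match descriptors between two images with Lowe's ratio test: accept a match
    only if the nearest neighbor is clearly closer than the second nearest,
    which filters out ambiguous correspondences."""
    matches = []
    for i, d in enumerate(d1):
        dist = np.linalg.norm(d2 - d, axis=1)
        j, j2 = np.argsort(dist)[:2]
        if dist[j] < ratio * dist[j2]:
            matches.append((i, j))
    return matches
```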

In calculating visual words, the four types of features are first computed for an image, and keypoints are derived from these features. The K-means clustering algorithm is then used to group the keypoints into a large number of clusters, each of which is considered a visual word representing a specific pattern. In this way, the clustering process generates a visual-word vocabulary describing the different patterns in the images. The number of clusters determines the size of the vocabulary.
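
The vocabulary-building step can be sketched as follows, assuming plain K-means on stacked keypoint descriptors (the initialization and iteration scheme here are illustrative):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster keypoint descriptors with K-means; the k centroids are the visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers, labels

def bow_vector(labels, k):
    """Histogram of visual-word occurrences: the image's visual-word vector."""
    return np.bincount(labels, minlength=k)

# toy descriptors drawn from two distinct local patterns
desc = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
words, labels = build_vocabulary(desc, k=2)
bow = bow_vector(labels, k=2)
```

With k set to a large value, as in the text, the same procedure yields a large vocabulary rather than the two words of this toy example.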

54 Machine Learning - Advanced Techniques and Emerging Applications

2.4. Multimedia data fusion

Starting from an introduction of multimedia data fusion, this section discusses the principle of kernel-based data fusion, then presents the details of the proposed multiple kernel learning for data fusion, and finally gives the details of the final event detection.

2.4.1. About multimedia data fusion

Multimedia data fusion is the process of bringing together different features of multimedia for the purpose of analyzing specific media data. Common multimedia analyses that require understanding multimodal data include event detection, human tracking, audio-visual speaker detection, and semantic concept detection. The purpose of data fusion is to improve the analysis: by combining modalities through a fusion strategy, a multimedia analysis can improve the accuracy of its output, resulting in more reliable decision-making.

There are many fusion methods, such as linear fusion, linear weighted fusion, nonlinear fusion, and nonlinear weighted fusion. This study concerns a fusion strategy that combines textual and visual modalities in the context of event detection. A new method of multimedia fusion is proposed, based on multiple kernel learning (MKL). It has the advantages of being incorporated into classifier learning and of handling a large volume of data.
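
The core idea of MKL, learning a convex combination of per-modality kernels jointly with the classifier, can be illustrated with a simplified sketch. Here both kernels are linear and the combination weights are fixed for illustration; a full MKL solver would learn them, and the chapter's actual kernel choices are not specified in this excerpt:

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix K[i, j] = <x_i, x_j> for one modality's feature vectors."""
    return X @ X.T

def combined_kernel(kernels, betas):
    """MKL combines per-modality kernels as K = sum_m beta_m * K_m,
    with beta_m >= 0 and sum(beta_m) == 1."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and abs(betas.sum() - 1.0) < 1e-9
    return sum(b * Km for b, Km in zip(betas, kernels))

# toy per-modality features: text vectors and visual-word counts for 3 samples
X_text = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_vis = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
K = combined_kernel([linear_kernel(X_text), linear_kernel(X_vis)], [0.5, 0.5])
```

The combined Gram matrix K can then be passed to any kernel classifier, which is how the fusion is incorporated into classifier learning.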
