*Multimedia Information Retrieval*

**7. Big data processing platforms**

**7.1 The Hadoop platform for the distributed computing of big data**

First of all, Hadoop is a free framework, written in Java, created and distributed by the Apache Foundation, and intended for the processing of large data (of the order of petabytes and more) as well as for their intensive management. Inspired by several technical publications written by the giant Google, its goal is to provide a distributed, scalable and extensible storage and data processing system. It can handle a large number of data types (including unstructured data). It is organized in a non-relational mode that is more general than NoSQL; for example, data can be stored with two systems, HDFS (Hadoop Distributed File System) and HBase, which together form a column-oriented database management system, with columns projected onto servers distributed in clusters [31].

Hadoop parallelizes the processing of data across the many nodes of a cluster of computers, which speeds up calculations and hides the latency of input and output operations. Hadoop contains a reliable distributed file system that ensures fault tolerance through data replication.

URL access frequencies [28]. MapReduce has three main parts: the Master, the Map function and the Reduce function. An example of this data flow is shown in **Figure 4**.

The Master is responsible for managing the Map and Reduce functions and for providing data and procedures; it organizes communication between mappers and reducers. The Map function is applied to each input record and produces a list of intermediate records. The Reduce function (also known as the Reducer) is applied to each group of intermediate records with the same key and generates a value. The MapReduce process therefore includes the following steps:

• The input data are divided into records.

• Map functions process these data and produce key/value pairs for each record.

• All key/value pairs resulting from the Map function are merged together, grouped by a key, and then sorted.

• The intermediate results are passed to the Reduce function, which produces the final result [30].

**Figure 4.**
*An example of data flow in the MapReduce big data architecture [29].*

*7.2.1 Motivation of Spark*

Since its inception, Hadoop has become an important technology for Big Data. One of the main reasons for this success is its ability to manage huge amounts of data regardless of their type (structured, semi-structured, unstructured). However, users have consistently complained about the high latency of Hadoop MapReduce, stating that its batch responses are very painful for real-time applications when it comes to processing and analyzing data.
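The map, shuffle/sort and reduce steps listed in Section 7.1 can be sketched in plain Python. This is a minimal single-process illustration of the data flow, not a Hadoop program: the request log and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are hypothetical, chosen to mirror the URL access frequency example mentioned above.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (key, value) pair for each input record.
    # Here the key is a URL and the value is a count of 1.
    return [(url, 1) for url in records]

def shuffle_phase(pairs):
    # Shuffle: merge all pairs from the mappers, sort them,
    # and group them by key.
    pairs.sort(key=itemgetter(0))
    return {key: [value for _, value in group]
            for key, group in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    # Reduce: aggregate each group of intermediate values
    # into a single final value per key.
    return {key: sum(values) for key, values in grouped.items()}

# Example: counting URL access frequencies from a log of requests.
log = ["/home", "/about", "/home", "/home", "/about"]
result = reduce_phase(shuffle_phase(map_phase(log)))
# result == {"/about": 2, "/home": 3}
```

In a real Hadoop job, the Master assigns these phases to different cluster nodes: many mappers run `map_phase` in parallel on separate splits of the input, the framework performs the shuffle and sort between nodes, and the reducers each receive one group of keys.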
