**7.1 The Hadoop platform for the distributed computing of big data**

Hadoop is a free framework, written in Java, created and distributed by the Apache Software Foundation, and intended for processing and intensively managing large volumes of data (on the order of petabytes and more). Inspired by several technical publications from Google, its goal is to provide a distributed, scalable and extensible system for data storage and processing. It can handle a large number of data types, including unstructured data. Hadoop is said to be organized in a non-relational mode that is more general than NoSQL: data can be stored, for example, with two types of systems, HDFS (Hadoop Distributed File System) and HBase, a column-oriented database management system designed for servers distributed in clusters [31].

Hadoop parallelizes the processing of data across many nodes that are part of a cluster of computers, which speeds up calculations and hides the latency of input and output operations. Hadoop contains a reliable distributed file system that ensures fault tolerance through data replication.
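The map, shuffle and reduce steps behind this kind of parallel processing can be illustrated with a minimal sketch. It is not the Hadoop API (which is Java-based): the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, local threads stand in for cluster nodes, and a list of strings stands in for the data blocks that a distributed file system would hold.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(block):
    """Emit (word, 1) pairs for one block of text, as a mapper on one node would."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_blocks):
    """Group the intermediate values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(blocks, workers=4):
    """Run the map phase in parallel over the blocks, then shuffle and reduce.

    Assumption: threads only simulate the nodes of a cluster here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, blocks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    blocks = ["big data needs big clusters", "clusters process data in parallel"]
    print(word_count(blocks))
```

In a real cluster each block would additionally be replicated on several nodes, so that a failed mapper could be restarted on another copy of the same block.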


*Towards Large Scale Image Retrieval System Using Parallel Frameworks*

*DOI: http://dx.doi.org/10.5772/intechopen.94910*

**7.2 The Spark platform for the distributed computing of big data**

*7.2.1 Motivation of Spark*

Since its inception, Hadoop has become an important technology for Big Data. One of the main reasons for this success is its ability to manage huge amounts of data regardless of their type (structured, semi-structured, unstructured). However, users have consistently complained about the high latency of Hadoop MapReduce, stating that its batch-mode response is very painful for real-time applications when it comes to processing and analyzing data.

*7.2.2 History of Spark*

Spark is a high-speed cluster computing system that originated in the AMPLab at UC Berkeley and has been developed with contributions from nearly 250 developers from 50 companies, to make data analysis faster both to write and to run. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which would later become the AMPLab. Researchers in the lab had previously worked on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. So from the start Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery. Research papers about Spark were published at academic conferences, and shortly after its inception in 2009 it was already 10–100 times faster than MapReduce for some jobs. Some of the early Spark users were other groups at UC Berkeley, including machine learning researchers such as the Mobile Millennium project, which used Spark to monitor and forecast traffic congestion in the San Francisco Bay Area. In a very short time, however, many external organizations started using Spark.

In 2011, the AMPLab started developing higher-level components on Spark, such as Shark and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS). Spark was open sourced in March 2010, and it was transferred to the Apache Software Foundation in June 2013, where it is now a top-level project [32].

*7.2.3 Definition*

Apache Spark is an open source processing framework built around speed, ease of use and the ability to handle large data sets of diverse nature (text data, graph data, etc.). Spark extends the MapReduce model to efficiently support multiple types of computations, including iterative processing, interactive queries, and stream processing (**Figure 5**) [32].<sup>2</sup>

<sup>2</sup> https://meritis.fr/bigdata/larchitecture-framework-spark/

*7.2.4 Advantages of Spark over Hadoop MapReduce*

Spark is a strong framework for future big data applications that may require low-latency queries, iterative computing, and real-time processing. Spark has many advantages over the Hadoop MapReduce framework, among them [32, 33]:

*7.2.4.1 Speed*

Spark is an open source compute environment similar to Hadoop, but it has some useful differences that make it superior for some workloads: it allows loading the
