#### *7.2.4.1 Speed*

Spark is an open source compute environment similar to Hadoop, but it has some useful differences that make it superior for certain workloads: it allows loading the dataset into distributed memory, which optimizes iterative workloads and queries. Spark can run jobs 10 to 100 times faster than Hadoop MapReduce simply by reducing the number of reads and writes to disk.

<sup>2</sup> https://meritis.fr/bigdata/larchitecture-framework-spark/
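A minimal sketch of this in-memory approach, assuming a running Spark cluster; the HDFS path is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheExample")
      .getOrCreate()

    // Load a text file into an RDD; the path is a placeholder.
    val lines = spark.sparkContext.textFile("hdfs:///data/events.txt")

    // Persist the RDD in distributed memory: the first action reads
    // from disk, every later action reuses the in-memory copy.
    lines.cache()

    println(lines.count())                     // triggers the disk read, then caches
    println(lines.filter(_.nonEmpty).count())  // served from memory

    spark.stop()
  }
}
```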

#### *7.2.4.2 Iterative processing*

Many algorithms, such as machine learning algorithms, apply the same function over several successive steps. Hadoop MapReduce is based on an acyclic data flow model: the output of one MapReduce job is the input of the next. Between two MapReduce operations there is a synchronization barrier, and the data must be written back to disk every time, so a lot of time is wasted in I/O operations [33].

With Spark, the concept of RDDs (Resilient Distributed Datasets) allows data to be kept in memory, reserving the disk for the final results only. There is no synchronization barrier between steps that could slow down the process, so Spark reduces the number of reads and writes to disk.
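To make the contrast concrete, here is a sketch of an iterative job in the spirit of Spark's own regression examples; the toy dataset, learning rate, and iteration count are illustrative assumptions. The cached RDD is re-read from memory on every pass, with no disk round trip between iterations:

```scala
import org.apache.spark.sql.SparkSession

object IterativeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IterativeExample").getOrCreate()
    val sc = spark.sparkContext

    // Toy (feature, label) points; a real job would load them from storage.
    val points = sc.parallelize(Seq((1.0, 2.0), (2.0, 4.1), (3.0, 5.9)))
    points.cache() // keep the working set in memory across iterations

    var w = 0.0 // single model weight, updated at each step
    for (_ <- 1 to 10) {
      // Each pass applies the same function to the cached data: no
      // MapReduce-style write/read barrier between two iterations.
      val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / points.count()
      w -= 0.1 * gradient
    }
    println(s"fitted weight: $w")

    spark.stop()
  }
}
```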

#### *7.2.4.3 Interactive queries*

In interactive data mining, where a user needs to run multiple queries on the same subset of data, Hadoop reloads the same data from disk once per query.

Spark, by contrast, loads the data only once, stores it in distributed memory, and then runs all the queries against that in-memory copy.
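A sketch of this load-once, query-many pattern; the log file path and query predicates are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object InteractiveQueries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InteractiveQueries").getOrCreate()

    // Hypothetical log path: adjust for a real deployment.
    val logs = spark.sparkContext.textFile("hdfs:///logs/app.log")
    logs.cache() // one disk read; the data then stays in cluster memory

    // Several ad hoc queries over the same subset: only the first
    // action touches the disk, the others are served from memory.
    println(logs.filter(_.contains("ERROR")).count())
    println(logs.filter(_.contains("WARN")).count())
    println(logs.filter(_.startsWith("2024")).count())

    spark.stop()
  }
}
```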
