*Multimedia Information Retrieval*

*Towards Large Scale Image Retrieval System Using Parallel Frameworks*
*DOI: http://dx.doi.org/10.5772/intechopen.94910*

Spark Standalone is the easiest way to set up. This cluster manager relies on Akka for exchanges and on Zookeeper to guarantee the high availability of the master node. It has a console for supervising processing and a mechanism for collecting logs from the slaves. Alternatively, Spark can run on YARN, the Hadoop cluster manager, alongside Hadoop jobs. Finally, Mesos, which is more sophisticated and more general, allows you to configure more finely the allocation of resources (memory, CPU) to the different applications.

*7.2.6 Components of Spark*

Because Spark's core engine is both fast and versatile, it powers multiple specialized high-level components for various workloads, such as SQL or machine learning, and these components can be combined like libraries in a software project. Spark Core contains the basic functionality of Spark, including components for job scheduling, memory management, fault recovery, interaction with storage systems, and more. Spark Core also provides the API that defines Resilient Distributed Datasets (RDDs), the main programming abstraction in Spark. RDDs represent a collection of objects distributed over several compute nodes that can be manipulated in parallel, and Spark Core offers many APIs for building and manipulating these collections.

Beyond the Spark Core API, additional libraries are part of the Spark ecosystem and provide further capabilities for big data analysis. These libraries are Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX (**Figure 6**) [32].

**Figure 6.** *Spark components [32].*

*7.2.7 Resilient distributed dataset (RDD)*

*7.2.7.1 Definition*

An RDD is a collection of objects partitioned across a set of machines, allowing programmers to perform in-memory calculations on large clusters in a way that provides fault tolerance.<sup>4</sup>

<sup>4</sup> Spark Programming Guide - Spark 1.2.0 Documentation. [Online]. Available: http://spark.apache.org/docs/1.2.0/programming-guide.html.

*7.2.7.2 Characteristics*

1. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD holds enough information to simply rebuild that partition. This removes the need for replication to achieve fault tolerance.

2. There are two ways to create an RDD: by referencing external data or by parallelizing an existing collection. Spark allows you to create an RDD from any data source accepted by Hadoop (local file, HDFS, HBase, etc.).

3. You can modify an RDD with a transformation, but the transformation returns a new RDD while the original RDD remains the same.

4. RDDs support two types of operations, transformations and actions. **Transformation:** transformations do not return a single value; they return a new RDD. Nothing is evaluated when you call a transformation function: the evaluation of transformations is lazy, and operations are performed only when a result must be returned. **Action:** an operation that evaluates and returns a new value. When an action function is called on an RDD object, all pending data processing is computed at that time and the result value is returned [33].

**8. Storage**

When processing a large amount of data, the input data and results must be stored. Additionally, the performance of data-intensive applications typically depends on the hardware and software infrastructure used for storage.

**8.1 Classical storage**

In this type of storage, the primary and easiest way to store data is a simple hard drive attached directly to the node. This type of storage system is sometimes referred to as direct-attached storage (DAS). On these disks, data is stored using a classic hierarchical file system such as ext3 or ReiserFS. These file systems are typically implemented as an operating system driver, a component that is sensitive in terms of security, performance, and reliability. This type of storage allows fast read and write operations, since everything is done locally. It is also simple to use, as it works with any operating system. However, there is no easy way to exchange data between multiple nodes.

**8.2 Centralized network storage**

A second way to store data is centralized network storage, usually referred to as Network-Attached Storage (NAS). In this case, a node with one or more disks attached allows other nodes to read and write files through a standard interface and serves them over the network. Network File System (NFS) is primarily a protocol for accessing files over the network. While the server is free to implement any means of accessing the actual data to be served over the network, most implementations simply rely on the data being directly accessible on the server. One of the main advantages of this type of architecture is the ease of sharing data between multiple compute nodes. Since the data is stored on a single server, it is also easy to maintain.

**8.3 Parallel and distributed storage**

In order to overcome the limitations of centralized network storage, data can be distributed across multiple storage nodes. Using multiple nodes allows access to multiple files at the same time without conflict. It also allows better throughput
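
The lineage mechanism described in point 1 of section 7.2.7.2 can be sketched in plain Python. This is a hedged illustration, not Spark's implementation: `rebuild_partition` and the recorded functions are hypothetical names, and real lineage graphs track dependencies between whole RDDs rather than lists of per-element functions.

```python
# Sketch of lineage-based recovery: a lost partition is rebuilt by
# re-applying the recorded transformations to the still-available
# parent data, instead of restoring a stored replica.

def rebuild_partition(parent_partition, lineage_fns):
    # Re-apply every recorded transformation, in order, to the parent data.
    data = list(parent_partition)
    for fn in lineage_fns:
        data = [fn(x) for x in data]
    return data

parent = [1, 2, 3]                              # parent partition (still available)
lineage = [lambda x: x + 1, lambda x: x * 10]   # recorded transformations
recovered = rebuild_partition(parent, lineage)  # recompute; no replica needed
print(recovered)  # [20, 30, 40]
```

Because the lineage is just a description of how the partition was derived, storing it is far cheaper than replicating the partition's data.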
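
The `parallelize` path from point 2 of section 7.2.7.2 can be illustrated by its partitioning step alone. The round-robin split below is just one plausible strategy, and `partition` is a hypothetical local helper, not Spark's actual implementation (which also distributes the partitions across the cluster).

```python
# Sketch of parallelizing an existing collection: split it into a
# fixed number of partitions that could then be placed on different
# compute nodes. Only the local partitioning step is shown.

def partition(collection, num_partitions):
    items = list(collection)
    # Round-robin assignment: element i goes to partition i mod n.
    parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(items):
        parts[i % num_partitions].append(x)
    return parts

print(partition(range(7), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```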
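
The lazy transformation/action contract described above can be mimicked with a minimal, hypothetical `MiniRDD` class. This is plain Python, not Spark's API: transformations merely record work in a pipeline, and only the `collect` action evaluates it.

```python
# Toy illustration of lazy transformations vs. eager actions:
# map/filter record work and return a NEW MiniRDD; collect (an
# action) is the only call that actually evaluates the pipeline.

class MiniRDD:
    def __init__(self, data, pipeline=None):
        self._data = data                 # source collection
        self._pipeline = pipeline or []   # recorded transformations

    def map(self, fn):
        # Transformation: computes nothing, returns a new MiniRDD.
        return MiniRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._pipeline + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded pipeline evaluated.
        items = list(self._data)
        for kind, fn in self._pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = MiniRDD(range(5))
doubled = rdd.map(lambda x: x * 2)       # nothing evaluated yet
big = doubled.filter(lambda x: x > 4)    # still nothing evaluated
print(big.collect())                     # action triggers evaluation: [6, 8]
```

Note that `rdd` itself is never modified: each transformation returns a new object, matching point 3 above.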
