**8.3 Parallel and distributed storage**

In order to overcome the limitations of centralized network storage, data can be distributed across multiple storage nodes. Using multiple nodes allows multiple files to be accessed at the same time without conflict. It also improves throughput when playing the same file, provided replicas exist on multiple nodes. A distributed file system is usually designed to scale better than centralized storage. Additionally, in theory, distributed storage systems avoid the single-point-of-failure problem.

A distributed storage system often has a single entry-point node that receives all requests to read or write data. As its role is central and critical, its workload should be kept to a minimum. This master node generally only coordinates the nodes involved in the storage system and stores metadata (file name, file size, access attributes, …). Thus, when a request to read or write a file is received, the client is redirected to another node, which actually processes the request. While the metadata is stored on the master node, the actual data is stored on the other nodes and can be replicated. The downside of distributed storage is that, for the best performance, the application should take data locality into account: even though the default behavior of the storage system may be reasonable, it is usually better to read or write data from the closest storage node than from a node with a high network cost.
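To make this pattern concrete, the following is a minimal Python sketch, with hypothetical classes rather than any real system's API, of a master that stores only metadata and redirects clients, and of a client that prefers the replica with the lowest network cost:

```python
# Hypothetical toy model of the entry-point pattern described above:
# the master holds only metadata and redirects clients to a storage
# node; the data itself never passes through the master.
from dataclasses import dataclass, field

@dataclass
class StorageNode:
    name: str
    network_cost: int            # lower = "closer" to the client
    blocks: dict = field(default_factory=dict)

    def read(self, filename):
        return self.blocks[filename]

@dataclass
class MasterNode:
    # Metadata only: file name -> nodes holding a replica.
    replicas: dict = field(default_factory=dict)

    def locate(self, filename):
        """Redirect: return candidate nodes, never the data itself."""
        return self.replicas[filename]

def read_file(master, filename):
    # Locality-aware choice: prefer the lowest network cost replica.
    nodes = master.locate(filename)
    best = min(nodes, key=lambda n: n.network_cost)
    return best.read(filename)

# Usage: two replicas; the client transparently reads the cheaper one.
near = StorageNode("rack-local", network_cost=1, blocks={"img.jpg": b"..."})
far = StorageNode("remote", network_cost=10, blocks={"img.jpg": b"..."})
master = MasterNode(replicas={"img.jpg": [far, near]})
assert read_file(master, "img.jpg") == b"..."
```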

Examples of such storage systems are the Hadoop Distributed File System (HDFS) and Tachyon.

#### *8.3.1 HDFS architecture*

HDFS has a master/slave architecture. An HDFS cluster consists of a single master called the NameNode, which manages the file system namespace and regulates access to files by clients (open, close, rename, etc.), as well as a set of DataNodes that manage the actual data storage (**Figure 7**) [30].

**Figure 7.** *HDFS architecture [30].*
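As an illustration, here is a minimal sketch using the third-party `hdfs` Python package (a WebHDFS client); the NameNode host, port, user, and paths are placeholder assumptions for an actual cluster:

```python
from hdfs import InsecureClient

# Metadata operations (listing, status, rename, ...) go to the NameNode.
client = InsecureClient("http://namenode-host:9870", user="hadoop")
print(client.list("/user/hadoop"))
print(client.status("/user/hadoop/images"))

# For reads and writes the NameNode only resolves block locations;
# the bytes themselves are streamed to and from the DataNodes.
with client.write("/user/hadoop/images/copy.jpg", overwrite=True) as writer:
    writer.write(b"...image bytes...")
with client.read("/user/hadoop/images/copy.jpg") as reader:
    data = reader.read()
```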

**8.4 Definition of Tachyon**

Tachyon is a memory-centric, distributed storage system that allows users to share data across cluster processing platforms and to perform reads and writes at memory speed. It achieves a write throughput up to 110 times that of HDFS [34]. To ensure fault tolerance, lost output is recovered by rerunning the operations that created it, a technique called lineage [34]. Lineage is thus a central feature of Tachyon: the lineage layer provides high-throughput I/O and tracks the job sequence and data dependencies in the storage layer.
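The idea behind lineage can be shown with a toy Python sketch (not Tachyon's actual API): instead of replicating output, record the operation that produced it and rerun that operation when the output is lost:

```python
store = {}      # in-memory "storage layer"
lineage = {}    # output name -> (function, input names)

def run(name, fn, *inputs):
    """Run a job, store its output, and record its lineage."""
    lineage[name] = (fn, inputs)
    store[name] = fn(*(store[i] for i in inputs))
    return store[name]

def recover(name):
    """Recompute a lost output by replaying its lineage, recursively."""
    fn, inputs = lineage[name]
    for i in inputs:
        if i not in store and i in lineage:
            recover(i)
    store[name] = fn(*(store[i] for i in inputs))
    return store[name]

store["raw"] = [1, 2, 3, 4]
run("squares", lambda xs: [x * x for x in xs], "raw")
run("total", sum, "squares")

del store["squares"], store["total"]   # simulate a lost worker
assert recover("total") == 30          # recomputed, no replica needed
```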

#### *8.4.1 Tachyon architecture*

Tachyon uses a standard master–slave architecture similar to that of HDFS (see **Figure 8**);<sup>5</sup> in Tachyon this architecture is called master–worker.

The master manages the metadata and contains a workflow manager, which interacts with a cluster resource manager to allocate resources for recomputation. The workers manage local resources and report their status to the master; each worker uses a RAM disk to store memory-mapped files.

**Figure 8.** *The Tachyon architecture [34].*
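The worker-side storage idea can be sketched with Python's standard `mmap` module; `/mnt/ramdisk` is an assumed tmpfs mount point, so reads and writes hit memory rather than disk:

```python
import mmap
import os

path = "/mnt/ramdisk/block_0001"  # assumed RAM disk mount point

# Write a fixed-size data block to the RAM-backed file system.
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Map the block into the process address space; slice access then
# operates at memory speed, which is the point of the RAM disk.
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    mm[0:5] = b"hello"
    assert mm[0:5] == b"hello"

os.remove(path)
```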

#### *8.4.2 The components of Tachyon*

Tachyon's design uses a single master and multiple workers. Tachyon can be divided into three components: the master, the workers, and the clients. The master and workers together constitute the Tachyon servers, which are the components that a system administrator maintains and manages. Clients are typically applications, such as Spark or MapReduce jobs, or Tachyon command-line users, so Tachyon users usually only need to interact with the client part of Tachyon [34].

a. **Master:** Tachyon can be deployed in one of two modes: with a single master or with several masters. The master is primarily responsible for managing the overall metadata of the system, for example the file system tree. Clients interact with the master to read or modify this metadata. In addition, all workers periodically check in with the master to maintain their membership in the cluster. The master does not initiate communication with the other components; it only interacts with them by responding to requests.

b. **Workers:** Tachyon workers are responsible for managing the local resources allocated to Tachyon. These resources can be local memory, SSDs, or hard drives, and they are user-configurable. Tachyon workers store data as blocks and respond to client requests to read or write data by reading or creating new blocks. However, a worker is only responsible for the data in these blocks; the mapping from files to blocks is metadata managed by the master.

c. **Clients:** The Tachyon client provides users with a gateway to interact with the Tachyon servers. It initiates communication with the master to carry out metadata operations and with workers to read and write data, as sketched in the example after this list.
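As a hedged illustration of the client role, frameworks such as Spark can act as Tachyon clients through a Hadoop-compatible file system URI; the master host name is a placeholder, 19998 was Tachyon's default master port, and the cluster is assumed to ship the Tachyon client library:

```python
from pyspark import SparkContext

sc = SparkContext(appName="tachyon-client-sketch")

# The client resolves metadata through the Tachyon master, then streams
# the actual bytes from the workers holding the blocks.
rdd = sc.textFile("tachyon://tachyon-master:19998/data/features.txt")
print(rdd.count())

# Writes follow the same client path.
rdd.saveAsTextFile("tachyon://tachyon-master:19998/data/features_out")
```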

**9. Conclusion**

In this chapter, we presented the domain of content-based image retrieval for large-scale image collections using parallel platforms. We covered the basic concepts of content-based image retrieval, parallel processing frameworks, and distributed storage systems such as HDFS and Tachyon.


<sup>5</sup> https://www.slideshare.net/DavidGroozman/tachyon-meetup-slides


