## 3. Real-time big data architecture

Big data architects often need a distributed system structure for data analysis, which in turn requires data storage. S. Tehranian et al. proposed an architectural model that provides performance, reliability, and scalability, consisting of candidate hardware and software for distributed real-time processing of satellite data at ground stations [26]. The resulting prototype was implemented in C++ using the open-source Adaptive Communication Environment (ACE) framework and tested on a cluster; the real-time system achieved significant performance without sacrificing reliability or high availability. Structures and mechanisms for parallel programming are needed so that RS data can be analyzed in a distributed manner. In this context, Y. Ma et al. proposed a generic parallel programming framework for RS applications on high-performance clusters [27]. The proposed mechanism offers programming templates that provide both distributed and generic parallel programming frameworks for RS algorithms. magHD, proposed for the storage, analysis, and visualization of multidimensional data, combines Hadoop and MapReduce technologies with various indexing techniques for use on clusters [28]. Some systems, such as [29], contain all of the steps of ingestion, filtering, load balancing, processing, combining, and interpreting. In that work, a real-time approach to continuous feature extraction and detection, aimed at finding rivers, land, and rail from satellite images, was proposed using the Hadoop ecosystem and MapReduce.


Performance-Aware High-Performance Computing for Remote Sensing Big Data Analytics
http://dx.doi.org/10.5772/intechopen.75934


At a minimum, the components that the big data architecture should have in order to be able to perform real-time analysis are as follows: user interface, distributed file system, distributed database, high-performance computing (Figure 2).

A distributed file system is a virtual file system that allows data to be stored across multiple computers in a cluster. In clusters that may consist of heterogeneous nodes, it provides applications with a common interface for accessing the data, isolated from the underlying operating systems and file systems.

A distributed database consists of separate databases on each node/server of a multi-computer cluster connected by a network. The distributed database management system handles the distribution and management of data across the servers.

High-performance computing systems provide an infrastructure with a computing environment fast enough for parallel analysis of big data. It is crucial for the system to respond to the user within a reasonable time.


Figure 2. Generalized real-time big data architecture.

The user interface is the component that lets the user interact with the server-based application through a visual interface: querying, loading, deleting, and updating data, requesting analyses, and defining workflows. The user interface is not investigated as a major concern in this chapter.

### 3.1. Storage

fragments in hundreds of image files. Thus, in this type of access, a large number of read/write operations take place, which is a time-consuming process.

An "irregular access pattern" is used by algorithms such as the fast Fourier transform (FFT), image distortion, and information extraction, which require scattered access or the entire image. These algorithms use diagonal and polygonal access patterns as irregular access. In these patterns, parts of different sizes can reside on different nodes, even though they are small, non-contiguous pieces of data.

In addition to I/O difficulties, the problem of identifying irregular data areas also arises.
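The contrast between contiguous and irregular access can be sketched by looking at the linear offsets each pattern touches in a row-major image band; the array dimensions below are invented for illustration:

```python
# Contiguous vs. irregular access over a row-major image band.
# WIDTH and HEIGHT are illustrative values, not figures from the chapter.
WIDTH, HEIGHT = 1024, 1024

def row_offsets(y, width=WIDTH):
    """Linear offsets touched when reading one image row (contiguous)."""
    return [y * width + x for x in range(width)]

def diagonal_offsets(n=HEIGHT, width=WIDTH):
    """Offsets touched when reading the main diagonal (scattered)."""
    return [i * width + i for i in range(n)]

row = row_offsets(0)
diag = diagonal_offsets()
# A row advances one element at a time; the diagonal jumps width + 1
# elements per access, so each read may land on a different block or node.
print(row[1] - row[0], diag[1] - diag[0])  # → 1 1025
```

The contiguous pattern can be served by a few large sequential reads, while the diagonal pattern forces many small scattered reads, which is exactly why irregular access is costly.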


74 Data Mining


### 3.1.1. Distributed file system

Even the SSD and PCM devices that are being used in place of HDDs for data storage fall far short of the I/O performance required for big data. Enterprise storage architectures, such as DAS, NAS, and SAN, also have drawbacks and limitations when used as distributed systems [2].

A Distributed File System (DFS) is a virtual file system that spans multiple nodes on the cloud [15]. It abstracts away the heterogeneity of data nodes in different centers: the distributed file system offers a common interface through which applications access data on heterogeneous nodes, even when those nodes individually run different operating systems and file systems. There are some capabilities that a distributed file system should generally provide. Location transparency means the application can access data as if it were held locally, without actually holding it. Access transparency is a common interface for data access that is independent of the operating system and the file system. Fault tolerance is the ability to keep replicas of data on more than one node, so that in the event of a failure the data is preserved on the remaining nodes holding a replica. Scalability means that the number of nodes the file system runs on can be increased to the required amount, without the system going down, if needed.
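The fault-tolerance capability can be sketched in a few lines: with a replication factor of three, a block stays readable after node failures as long as one replica-holding node survives. The node count and the round-robin placement policy below are invented for the illustration:

```python
NODES = list(range(6))  # six data nodes, invented for the sketch

def place_replicas(block_id, replication=3, nodes=NODES):
    # Round-robin placement starting at a deterministic node.
    start = block_id % len(nodes)
    return {nodes[(start + k) % len(nodes)] for k in range(replication)}

def available(block_id, failed):
    # The block stays readable while at least one replica-holding node is up.
    return bool(place_replicas(block_id) - failed)

print(sorted(place_replicas(7)))       # → [1, 2, 3]
print(available(7, failed={1, 2}))     # → True (node 3 still holds a copy)
print(available(7, failed={1, 2, 3}))  # → False
```

Real distributed file systems use far more elaborate, rack-aware placement policies, but the survival argument is the same.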

The Hadoop distributed file system (HDFS) is an open-source distributed file system released under an Apache license (Figure 3). HDFS is designed especially for big datasets and high availability, apart from the common capabilities. It is also platform independent, as it is implemented in Java. Applications access data via the HDFS API, so file access is isolated from the local file systems. Compared with other distributed file systems (iRODS, Lustre), its design differs in terms of performance, and HDFS is stated to be the only DFS with automatic load balancing [15]. At the same time, its platform independence and its MapReduce support make it easy to use on many systems, which is why it is the preferred choice.

Figure 3. HDFS architecture.
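As a rough sketch of the storage arithmetic behind HDFS, a file is split into fixed-size blocks and each block is replicated across data nodes; the 128 MB block size and replication factor 3 used below are common HDFS defaults, not figures taken from this chapter:

```python
# Back-of-the-envelope HDFS footprint: a file is split into fixed-size
# blocks, and each block is stored on several data nodes. The defaults
# below are common HDFS settings, assumed for illustration.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB
REPLICATION = 3

def hdfs_footprint(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks, blocks * replication

blocks, replicas = hdfs_footprint(1 * 1024**4)  # e.g., a 1 TB scene archive
print(blocks, replicas)  # → 8192 24576
```

A single 1 TB dataset thus becomes thousands of independently replicated blocks, which is what lets HDFS spread both the I/O load and the failure risk across the cluster.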

The Google file system (GFS) is a proprietary distributed file system developed by Google for its own use [30]. It was developed to meet the need for a scalable distributed file system that arises in data-intensive big data applications. It is designed to enable reliable, efficient, and fault-tolerant use of data across many thousands of drives and machines, with thousands of simultaneous users.

Storage systems used in cloud platforms, such as Amazon Simple Storage Service (S3), Nirvanix Cloud Storage, OpenStack Swift, and Windows Azure Blob storage, do not fully meet the scalability and replication needs of cloud applications or the concurrency and performance requirements of analysis applications.

The general parallel file system (GPFS) is a high-performance clustered file system developed by IBM. GPFS can be built on shared drives or on shared-nothing distributed parallel nodes. Since it fully supports POSIX file-system semantics, it removes the need to learn the new API sets introduced by other storage systems. HDFS and GFS, on the other hand, are not completely POSIX compliant and require new API definitions to provide analysis solutions in the cloud. In the study conducted by Schmuck et al., it is stated that GPFS can match the file-reading performance of HDFS by means of a meta-block concept [31]. A meta-block is a set of consecutive data blocks located on the same disk. In the proposed approach, a trade-off was sought between different block sizes.

A comparison of the parallel virtual file system (PVFS) with HDFS [32] shows that PVFS does not offer a significant improvement in terms of completion time or throughput.

Rasdaman is a database that supports large multidimensional arrays that conventional databases cannot handle, and it can therefore naturally store large remote sensing data [33]. The architecture of Rasdaman is based on partitioning arrays into tiles, a process called "tiling." Rasdaman's parallel server architecture provides a scalable and distributed environment for efficiently processing large numbers of concurrent user requests, which makes it possible to serve distributed datasets over the web. To retrieve and process a dataset from Rasdaman, retrieval queries must be run in the query language defined by the Open Geospatial Consortium's (OGC) web coverage processing service (WCPS) standard. The PetaScope component, developed as a Java servlet, provides multidimensional data retrieval, retrieval filtering, and processing queries by implementing the OGC standard interfaces. It also adds support for geographic and temporal coordinate systems.

Because of the size of RS data in remote sensing applications, the unavoidable I/O load and irregular data access patterns are not well served by traditional cluster-based parallel I/O systems [25]. In the study conducted by L. Wang et al., an RS-file-based parallel file system for remote sensing applications was proposed and implemented using the OrangeFS file system. By providing an application-specific data placement policy, efficiency is achieved for different data access patterns. The improvement in the performance of the proposed system averages about 20%.

### 3.1.2. Distributed database


The classical approaches to managing structured data use a schema for data storage and a relational database for retrieving data. Existing database management tools have proved inadequate for processing large volumes of data that grow rapidly and become complex. Data warehouse and data mart approaches have gained popularity in systems with large amounts of structured data [2]. The data warehouse is used to store and analyze data and to report results to the user. The data mart approach improves data access and analysis on top of the data warehouse. The enterprise data warehouse (EDW), favored by large organizations, allows data processing and analysis capability to be applied to a very large, unified enterprise database [21]. Some cloud providers offer solutions scaling to a petabyte of data and more with EDW. For example, Amazon Redshift uses a massively parallel processing (MPP) architecture consisting of a large number of processors for high-performance querying, with columnar storage and data compression. In addition, the amount of I/O required by queries is reduced by using locally attached storage and zone maps.
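The zone-map idea mentioned above can be sketched in a few lines: per-block min/max metadata lets a range query skip blocks that cannot contain matches, reducing I/O before any data is read. The data below is invented for the illustration:

```python
# Toy zone-map sketch: each storage block keeps (min, max) metadata, so a
# range predicate can prove most blocks irrelevant without reading them.
# Block contents are invented for the example.
blocks = [list(range(i * 100, i * 100 + 100)) for i in range(10)]
zone_map = [(min(b), max(b)) for b in blocks]

def scan(pred_lo, pred_hi):
    scanned = 0
    hits = []
    for (lo, hi), block in zip(zone_map, blocks):
        if hi < pred_lo or lo > pred_hi:
            continue            # zone map proves no match: skip the block I/O
        scanned += 1
        hits += [v for v in block if pred_lo <= v <= pred_hi]
    return scanned, hits

scanned, hits = scan(250, 260)
print(scanned, len(hits))  # → 1 11  (nine of ten blocks were never read)
```

The same pruning works on sorted or clustered columns in columnar stores, which is why it pairs naturally with columnar storage and compression.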

For storing and managing unstructured or non-relational data, the NoSQL approach is divided into two independent parts: data storage and management [2]. With its key-value storage model, NoSQL's focal point on the storage side is scalability and high performance. On the management side, data management tasks can be performed at the application layer through lower-level access mechanisms. The most important features of a NoSQL database are schema freedom, which allows the data structure to be changed quickly without rewriting existing data, and the resulting flexibility to store structured data heterogeneously. The most popular NoSQL database is Cassandra, which was first used by Facebook and published as open source in 2008. There are also NoSQL implementations such as SimpleDB, Google BigTable, MongoDB, and Voldemort. Social networking applications such as Twitter, LinkedIn, and Netflix have also benefited from NoSQL capabilities.



According to the method proposed by L. Wang et al. for the management problem of conventional remote sensing data, the image data is divided into blocks based on the GeoSOT global discrete grid system, and the data blocks are stored in HBase [34]. In this method, the data is first recorded in the MetaDataInfo table, with the satellite-sensor acquisition time used as the row ID. In the DataGridBlock table, the row ID is kept together with the MetaDataInfo row ID as well as the geographic coordinate. Because HBase tables store rows in ascending order of row key, blocks that are geographically close are held in adjacent rows of the table. When a spatial query arrives, the GeoSOT codes are first calculated and the DataGridBlock table is filtered by these codes. In addition, a distributed processing method that uses the MapReduce model to deal with image data blocks is designed. When MapReduce starts the job, it splits the table into regions, each region containing a set of image data blocks. The map function then processes each data block from its region and sends the intermediate results to the reduce function.
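A hypothetical sketch of the row-key idea: since HBase keeps rows sorted lexicographically by key, prefixing the key with a spatial grid code makes geographically close blocks neighbors on disk, so a spatial query scans one contiguous key range. The grid codes and timestamps below are invented, not GeoSOT values from the paper:

```python
# Hypothetical row keys of the form <grid code>_<acquisition time>.
# Sorting them models how an HBase table physically orders its rows.
rows = [
    ("G0012", "20180101T1010"),
    ("G0999", "20180101T0905"),
    ("G0011", "20180102T1130"),
    ("G0013", "20180103T0800"),
]
row_keys = sorted(code + "_" + ts for code, ts in rows)
print(row_keys)
# The three neighboring G001x blocks sort together, ahead of distant G0999,
# even though their acquisition times are interleaved.
```

Putting the spatial code first in the key is the design choice that matters: with the timestamp first, nearby blocks would be scattered across the whole table.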

The analysis of ultra-big databases has attracted many researchers' interest, as traditional databases are inefficient for storing and analyzing large digital data. Apache HBase, the NoSQL distributed database developed on top of HDFS, is one result of this research. A study by M.N. Vora evaluated a hybrid approach in which HDFS holds nontextual data such as images while HBase keeps the records for these data [35]. This hybrid architecture makes it possible to search and retrieve data faster.

### 3.2. HPC systems

When data analysis is considered, the most important difficulty is scalability, which depends on the volume of data. In recent years, researchers have focused more on accelerating their analysis algorithms. However, the amount of data grows much faster than CPU speeds, which has pushed processors toward supporting parallel computing with multiple cores. For real-time applications, timeliness comes first. Thus, many difficulties arise not only in hardware development but also in the development of software architectures. The most important trend at this point is to make distributed computing improvements using cloud computing technology.

The technologies used in remote sensing applications have difficulties in delivering, processing, and responding in time [36]. R. Patrick and J. Karpjoo surveyed web technologies, grid computing, data mining, and parallel computation on remote sensing data. The size of the data volume, the data formats, and the download time are general difficulties. With the combination of these technologies, the processing time in some applications can be reduced enough for decision-makers to act in a timely manner. Although it is reasonable to process remote sensing data as soon as possible, it does not seem possible to perform real-time processing automatically.

### 3.2.1. Onboard architecture


### 3.2.1.1. Multi-core processor

The multi-core processor is an integrated circuit with two or more processor cores that process multiple tasks efficiently. After processor frequencies reached their limits due to heating problems, the processing capability of new CPUs has continued to increase by multiplying the number of cores [37]. Recent CPUs can handle 8–12 simultaneous threads. To benefit from multi-core CPUs, the problem should be divided into partitions that can be processed simultaneously. Multi-core programming is needed to achieve this benefit, and applications often need to be rewritten accordingly. Multi-core programming is the implementation of algorithms using a multi-core processor on a single computer to improve performance. APIs and standards such as OpenMP and MPI are needed to implement algorithms that can run simultaneously on multi-core processors.
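The APIs named above (OpenMP, MPI) target C/C++ codes; as a language-neutral sketch of the same divide-into-partitions idea, the snippet below uses Python's standard multiprocessing pool instead (the input data is invented):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker (ideally one per core) processes its partition independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    size = -(-len(data) // workers)  # ceiling division: partition the input
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        # Map partitions to workers, then combine the partial results.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # → 499999500000
```

The structure mirrors an OpenMP parallel-for with a reduction clause: split, process in parallel, combine. The rewriting cost the paragraph mentions lies exactly in finding a split whose partitions are truly independent.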

### 3.2.1.2. Graphic processing units

The GPU is a specialized circuit originally dedicated to graphics processing. After its brilliant rise in manipulating computer graphics and image processing in recent years, this technology is now widely used for developing parallel algorithms on RS image data [38–40]. Complex RS algorithms must be rewritten to benefit from GPU parallelism by using thousands of simultaneous threads.

### 3.2.1.3. Field programmable gate arrays

Some onboard remote sensing data processing scenarios require components that can operate with low weight and low power, especially in systems using air vehicles such as unmanned aerial vehicles and satellites. While these components reduce the amount of payload, they can produce real- or near-real-time analysis results while data is being obtained from the sensor. For this purpose, programmable hardware devices such as FPGAs can be used [41, 42]. FPGAs are digital integrated circuits which consist of an array of programmable logic blocks and reconfigurable interconnects that allow the blocks to be wired together. However, FPGA programming is required, which means learning a new set of tools and APIs.

### 3.2.2. Distributed architecture

### 3.2.2.1. Cluster

One of the most used approaches when considering hardware-based improvements is solutions built from commercial off-the-shelf (COTS) computers. In this approach, a cluster is created from a number of computers that work together as a team [43]. These parallel systems, installed with a large number of CPUs, provide good results for both real-time and near-real-time applications using both remote sensors and data streams, but they are expensive, and their scalability does not exceed a certain capacity.


Cavallaro et al. have addressed the classification of land cover types over an image-based dataset as a concrete big data problem in their work [44]. In the scope of the study, PiSVM, an implementation based on LibSVM, was used for classification. The PiSVM code is stable despite the I/O limits. When PiSVM is run in parallel, MPI is used for communication across multiple nodes. For the parallel analysis, the JUDGE cluster at the Jülich Supercomputing Center in Germany was used. The training time was reduced significantly with parallel PiSVM compared to the serial MATLAB run, while the accuracy of the SVM remains the same as in serial operation (97%).

### 3.2.2.2. Cloud

Cloud computing is one of the most powerful big data techniques [45]. The ability to provide flexible processing, memory, and drives by virtualizing the computing resources of physical computers has made the supercomputing concept more affordable and easily accessible [46]. The cloud concept, which provides a multi-computer infrastructure for data management and analysis, brings great ease in terms of high scalability and usability, fault tolerance, and performance. Especially considering critical applications that need to extract information in near real time from the data that next-generation remote sensors can produce, it is very important to use cloud computing technologies for high-performance computing [21]. In addition, the cloud computing infrastructure can create an efficient platform for the storage of big data as well as for the analysis process. Thus, with this technology, expensive computing hardware such as cluster systems, allocated space, and software requirements can be eliminated [22].

The Hadoop ecosystem has emerged as one of the most successful infrastructures for cloud and big data analysis [23, 45]. The platform brings together several tools for various purposes, with two major services: HDFS, a distributed file system, and MapReduce, a high-performance parallel data processing engine. The Apache Hadoop framework provides an open-source implementation of the MapReduce model. This model allows big datasets to be processed concurrently on multiple computers. Remote sensing applications using the MapReduce model have become a research topic for improving the performance of the analysis process, as a result of the exponential data growth driven by the latest developments in sensor technologies [15, 16]. The processing services on the cloud are accessed via the distributed file system. To reduce data movement, it is reasonable to process the data on the computer where it is stored.
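The map/shuffle/reduce division of work can be sketched in plain Python (a toy word count standing in for an RS processing job; real Hadoop distributes these phases over HDFS blocks and cluster nodes):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs, here (word, 1) per token."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values of each key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```

Because map calls are independent per record and reduce calls are independent per key, both phases parallelize naturally across machines, which is what the framework exploits.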

### 4. Performance-aware HPC

Processing of big geospatial data is vital for time-critical applications such as natural disasters, climate change, and military/national security systems. Its challenges are related not only to the massive data volume but also to the intrinsic complexity and high dimensionality of geospatial datasets [47]. Hadoop and similar technologies have attracted increasing attention in the geosciences community for handling big geospatial data. Many investigations were carried out to adopt those technologies for processing big geospatial data, but there are very few studies on optimizing the computing resources to handle dynamic geo-processing workloads efficiently.

In existing software systems, computing is seen as the most expensive part. As the amount of daily collected data has grown exponentially with recent technologies, data movement now has a deeper impact on performance, with a bigger cost than computing, which is cheap and massively parallel [48]. At this point, new high-performance systems need to update themselves to adapt to the data-centric paradigm. New systems must use data locality to achieve that adaptation. Current systems ignore the incurred cost of communication and rely on hardware cache coherency to virtualize data movement. The increasing amount of data leads to more communication between the processing elements, and that situation requires supporting data locality and affinity. In the upcoming model, data locality should be achieved with recent techniques which include tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. Combining the best of these techniques can help us develop a comprehensive model for managing data locality on a high-performance computing system [49].

### 4.1. Data partition strategy


Domain decomposition is an important subject in high-performance modeling and numerical simulations in areas such as intelligence and military, meteorology, agriculture, urbanism, and search and rescue [50]. It makes parallel computing possible by dividing a large computational task into smaller parts and distributing them to different computing resources. To increase performance, high-performance computing is extensively used by dividing the entire problem into multiple subdomains and distributing each one to a different computing node in a parallel fashion. Inconvenient allocation of resources induces imbalanced task loads and redundant communications among computing nodes. Hence, resource allocation is a vital part which has a deep impact on the efficiency of the parallel process. The resource allocation algorithm should minimize the total execution time by taking into consideration the communication and computing cost of each computing node, and it should reduce the total communication cost of the entire processing. In this chapter, a new data partitioning strategy is proposed to benefit from the current state of the resources on the cloud. In this new strategy, RS data should be partitioned based on performance metrics which are formulated as a combination of the available network, memory, CPU, and storage resources. After the appropriate cloud site and resource nodes are found by stages 1 and 2, as described in Sections 6.2 and 6.3, the received data is divided among the selected resource nodes according to their available resources. For this dividing operation, the system should determine which portion of the data will be allocated to the found nodes based on performance metrics, such as the heuristic function below:

$$p_i = \frac{t_{throughput}^{i} + t_{processing}^{i}}{\sum_{j=1}^{n}\left(t_{throughput}^{j} + t_{processing}^{j}\right)} \tag{1}$$

$p_i$ is the portion of node $i$; $t^i_{throughput}$ is the transfer time needed for the data with the network throughput of node $i$, and $t^i_{processing}$ is the processing time needed for the data with the CPU frequency of node $i$. $\sum_{j=1}^{n}\left(t^j_{throughput} + t^j_{processing}\right)$ is the total sum of transfer and processing times over the selected nodes. In the formula, the transfer time can be computed as $t^i_{throughput} = data\_size / bandwidth_i$ and the processing time as $t^i_{processing} = data\_size / CPU\_frequency_i$.
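A minimal sketch of Eq. (1) (the node figures below are invented; the units only need to be mutually consistent):

```python
def partition_portions(data_size, nodes):
    """Compute each node's portion p_i per Eq. (1).

    nodes: list of (bandwidth, cpu_frequency) tuples for the selected nodes,
    so that t_i = data_size / bandwidth + data_size / cpu_frequency.
    """
    times = [data_size / bw + data_size / freq for bw, freq in nodes]
    total = sum(times)
    return [t / total for t in times]

# Two hypothetical nodes: (bandwidth, CPU frequency) in consistent units.
portions = partition_portions(1000.0, [(100.0, 2.0), (50.0, 1.0)])
# portions sum to 1, as required of a partition
```

Note that, as written, the heuristic assigns portions proportionally to each node's combined transfer and processing time over the selected node set.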

### 4.2. Geo-distributed cloud

The geo-distributed cloud is the application of cloud computing technologies in which multiple cloud sites distributed across different geographic locations interconnect data and applications [51]. Cloud providers prefer distributed cloud systems to enable lower latency and provide better performance for cloud consumers. Recently, most large online services have become geo-distributed in response to the exponential increase in data [52]. The most important reason for realizing services as geo-distributed is latency: geo-distributed clouds provide a point of presence near clients in order to reduce latency. In this chapter, we introduce a novel resource allocation technique for managing RS big data in a geo-distributed private cloud. This new approach selects the most appropriate cloud site, which has minimum latency. It also finds an efficient data layout which gives higher performance for the selected data nodes in the related cloud site. Within this context, resource management should match instantaneous application requirements with optimal CPU, storage, memory, and network resources [53]. Putting the data on a more appropriate node with enough resources, and providing an efficient layout of the dependent data partitions on the nodes with minimum network latency, should decrease the processing time of the algorithms and also minimize the time spent transferring dependent data needed by algorithms running on different nodes.

In the proposed approach, acquired RS data should be stored on the geo-distributed cloud in two stages (Figure 4). In the first stage, each cloud site determines a score based on latency, bandwidth capacity, and CPU, memory, and storage workloads. At this point, each cloud site runs a multi-criteria decision-making (MCDM) process. MCDM is a subdiscipline of operations research that evaluates multiple criteria in decision-making problems [54]. As the criteria values for the resources, the workloads of the resources can be computed by dividing the used amount of a resource by its total capacity. For the CPU workload, the equation is given as follows:

$$\mathbf{W}\_{\rm CPU}^{\rm i} = \frac{\left(\sum\_{j=1}^{n} \mathbf{c}\_{ij}\right)}{\mathbf{t}\_{\rm CPU}^{i}} \tag{2}$$


Figure 4. Geo-distributed cloud hierarchy and scene on earth.


Let $t^i_{CPU}$ be the total number of physical CPU threads in the cloud site $i$ and $c_{ij}$ be the number of virtual CPUs that are allocated for virtual machines (VMs) in the cloud site. For the storage workload, the equation is given as follows:

$$W_{STR}^{i} = \frac{\left(\sum_{j=1}^{n} s_{ij}\right)}{t_{STR}^{i}} \tag{3}$$


Let $t^i_{STR}$ be the total size of storage in the cloud site $i$ and $s_{ij}$ be the size of storage that is used for incoming data in the cloud site. For the memory workload, the equation is given as follows:

$$W_{MEM}^{i} = \frac{\left(\sum_{j=1}^{n} m_{ij}\right)}{t_{MEM}^{i}} \tag{4}$$

Let $t^i_{MEM}$ be the total size of memory in the cloud site $i$ and $m_{ij}$ be the size of memory that is used for incoming data in the cloud site.
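The three workload metrics of Eqs. (2)–(4) share the same used-over-capacity form, so they can be sketched with one helper (all capacities and allocations below are invented):

```python
def workload(used_amounts, total_capacity):
    """Generic workload of Eqs. (2)-(4): sum of used amounts over total capacity."""
    return sum(used_amounts) / total_capacity

# Hypothetical cloud site i:
w_cpu = workload([4, 8, 2], 32)      # Eq. (2): vCPUs per VM over physical threads
w_str = workload([120, 300], 1000)   # Eq. (3): used storage over capacity (GB)
w_mem = workload([16, 24], 128)      # Eq. (4): used memory over capacity (GB)
```

Each value lies in [0, 1] as long as allocations do not exceed capacity, which makes the three workloads directly comparable as MCDM criteria.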

The time of transferring data between the consumer and the cloud site $i$ is described as the latency $L_i$, and the bandwidth between them is $B_i$.

After determining the criteria for decision-making, a processing method is needed to compute numerical values using the relative importance of the criteria and to determine a ranking of each alternative [54]. Some of the well-known MCDM processing models are the weighted sum model (WSM), the weighted product model (WPM), and the analytic hierarchy process (AHP). If we define the solution with AHP, which is based on decomposing a complex problem into a system of hierarchies, the best alternative can be defined by the relationship below:

$$A_{AHP} = \min_{i} \sum_{j=1}^{N} w_{ij} C_{j} \quad \text{for } i = 1, 2, 3, \ldots, M \tag{5}$$

where $\sum_{j=1}^{N} w_{ij} = 1$, $N$ is the number of criteria, and $i$ denotes the cloud site. If we write the model for the earlier five criteria:

$$\mathbf{A\_{AHP}} = \min\_{\mathbf{i}} \left( \mathbf{w\_1 L\_i} + \mathbf{w\_2} \frac{1}{\mathbf{B\_i}} + \mathbf{w\_3} \mathbf{W\_{CPU}^i} + \mathbf{w\_4} \mathbf{W\_{STR}^i} + \mathbf{w\_5} \mathbf{W\_{MEM}^i} \right) \text{ for } \mathbf{i} = 1, 2, 3, \dots, \mathbf{M} \tag{6}$$
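Eqs. (5)–(6) reduce to scoring each candidate site and taking the minimum. A sketch with invented weights and site measurements (a real deployment would first normalize the criteria to comparable scales, as AHP's use of relative values implies):

```python
def site_score(weights, latency, bandwidth, w_cpu, w_str, w_mem):
    """Score of one cloud site per Eq. (6); lower is better."""
    w1, w2, w3, w4, w5 = weights
    return w1 * latency + w2 / bandwidth + w3 * w_cpu + w4 * w_str + w5 * w_mem

def select_site(weights, sites):
    """Return the index of the minimum-score site, the alternative A_AHP."""
    scores = [site_score(weights, *site) for site in sites]
    return scores.index(min(scores))

weights = (0.3, 0.2, 0.2, 0.15, 0.15)  # sum to 1, per the constraint on w_ij
sites = [
    (0.05, 10.0, 0.9, 0.5, 0.7),  # site 0: low latency but heavily loaded
    (0.20, 5.0, 0.2, 0.3, 0.2),   # site 1: higher latency, mostly idle
]
best = select_site(weights, sites)
```

In this example the idle site wins despite its higher latency, because the workload criteria dominate under these weights.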


Although AHP is similar to WSM, it uses relative values instead of actual values. This makes it possible to use AHP in multidimensional decision-making problems by removing the problem of combining different dimensions in various units (similar to adding oranges and apples).

After the minimum valued alternative cloud site is found, the second stage takes place for evaluating which resources should be used optimally in the related cloud site and finding an optimal layout in the network for RS big data.

### 4.3. Resource optimization in cloud network with performance metrics

Large-scale networks and their applications lack centralized access to information [55]. When we interpret RS big data on the cloud as a large-scale network, the optimization of resource allocation depends on the local observations and information of each node. At this point, control and optimization algorithms should be deployed in a distributed manner to find the optimum resource allocation in such a network. The optimization algorithm should be robust against link or node failures and horizontally scalable.

To achieve distributed optimization of the system, each node, which can be called an agent, should run the optimization algorithm locally. In this system, which consists of multiple agents connected over a network, each agent has a local objective function and a local constraint set which are known only by that agent. The agents try to decide on a global decision vector cooperatively, based on their objective functions and constraints (Figure 5).

Figure 5. Multi-agent optimization problem for resource allocation on a network. $f_i(x)$: local objective function, where $f_i: \mathbb{R}^n \to \mathbb{R}$. $X_i$: local constraint set, where $X_i \subset \mathbb{R}^n$. $x$: global decision vector which the agents collectively try to decide on, where $x \in \mathbb{R}^n$.

The agents cooperatively optimize a global objective function denoted by f(x), which is a combination of the local objective functions, that is:


$$f(x) = T\left(f_1(x), \ldots, f_n(x)\right) \tag{7}$$

where $T: \mathbb{R}^n \to \mathbb{R}$ and the decision vector $x$ is bounded by a constraint set, where $x \in C$, which consists of local constraints and global constraints that may be imposed by the network structure, that is:

$$\mathbf{C} = \left(\bigcap\_{i=1}^{n} \mathbf{X}\_{i}\right) \cap \mathbf{C}\_{\mathbf{g}} \tag{8}$$

where $C_g$ represents the global constraints. This model leads to the following optimization problem:

$$\text{minimize} f(\mathbf{x}) \text{ subject to } \mathbf{x} \in \mathbb{C} \tag{9}$$

where $f: \mathbb{R}^n \to \mathbb{R}$ and the set $C$ is the constraint set. The decision vector in Eq. (9) can be considered as a resource vector whose components correspond to the resources allocated to each node, or as a global decision vector which is estimated by the nodes on the network using local information.

After defining some basic notation and terminology for the optimization of a function, we continue where we left off. The cloud site should decide which resources will be used for incoming RS data once it has been selected to receive it. Each node in the cloud site evaluates the network latency and bandwidth to the other nodes for optimum network communication. In addition, the current amounts of CPU, memory, and storage are also taken into account to find the best-fitting solution to store and process the RS data. Hence, each node solves the formula defined below:

$$\min \mathbf{f}\_{i}(\mathbf{x}) = \sum\_{j=1}^{n} \mathbf{x}\_{j} \left( \mathbf{w}\_{1} \mathbf{L}\_{ij} + \mathbf{w}\_{2} \frac{1}{\mathbf{B}\_{ij}} + \mathbf{w}\_{3} \mathbf{H}\_{ij} + \mathbf{w}\_{4} \mathbf{W}\_{\text{CPU}}^{j} + \mathbf{w}\_{5} \mathbf{W}\_{\text{STR}}^{j} + \mathbf{w}\_{6} \mathbf{W}\_{\text{MEM}}^{j} \right), \quad i = 1, 2, \ldots, N$$

$$\text{subject to } \begin{cases} \sum\_{j=1}^{n} \mathbf{x}\_{j} \mathbf{A}\_{\text{MEM}}^{j} \ge \mathbf{I}\_{\text{size}} \\ \sum\_{j=1}^{n} \mathbf{x}\_{j} \mathbf{A}\_{\text{STR}}^{j} \ge \mathbf{I}\_{\text{size}} \\ \mathbf{H}\_{ij} \le \text{Hop}\_{\text{max}} \\ \mathbf{B}\_{ij} \le \mathbf{C}\_{\text{max}} \\ \mathbf{x}\_{j} \in \{0, 1\} \; \forall j \end{cases} \tag{10}$$

where $x_j$ indicates whether the $j$th node is included in the data-receiving group together with node $i$, $L_{ij}$ is the latency between $i$ and $j$, $B_{ij}$ is the bandwidth between $i$ and $j$, $H_{ij}$ is the hop count between $i$ and $j$, and $W^j_{CPU}$, $W^j_{STR}$, and $W^j_{MEM}$ are the CPU, storage, and memory workloads of node $j$. Each node should minimize $f_i(\mathbf{x})$ over the decision vector, with weights obtained from the AHP model, in a decentralized manner. The individual optimization problem is a mixed-integer program for each node. Lagrangian relaxation is a heuristic method for solving mixed-integer problems by decomposing the complicating constraints: they are moved into the objective function with an associated vector $\mu$ of Lagrange multipliers. Applying it to our problem, we derive the dual problem of Eq. (10) for an efficient solution by relaxing the first two constraints:

$$\min \mathcal{L}(\mathbf{x}, \mu\_{i}, \lambda\_{i}) = \sum\_{j=1}^{n} \mathbf{x}\_{j} \mathbf{U}\_{ij} + \mu\_{i} \left( \mathbf{I}\_{\text{size}} - \sum\_{j=1}^{n} \mathbf{x}\_{j} \mathbf{A}\_{\text{MEM}}^{j} \right) + \lambda\_{i} \left( \mathbf{I}\_{\text{size}} - \sum\_{j=1}^{n} \mathbf{x}\_{j} \mathbf{A}\_{\text{STR}}^{j} \right) \tag{11}$$

$$\text{subject to } \begin{cases} \mathbf{H}\_{ij} \le \text{Hop}\_{\text{max}} \\ \mathbf{B}\_{ij} \le \mathbf{C}\_{\text{max}} \\ \mathbf{x}\_{j} \in \{0, 1\} \; \forall j \end{cases}$$

where the utilization function for node $j$ as evaluated by node $i$ is:

$$\mathbf{U}\_{ij} = \mathbf{w}\_{1} \mathbf{L}\_{ij} + \mathbf{w}\_{2} \frac{1}{\mathbf{B}\_{ij}} + \mathbf{w}\_{3} \mathbf{H}\_{ij} + \mathbf{w}\_{4} \mathbf{W}\_{\text{CPU}}^{j} + \mathbf{w}\_{5} \mathbf{W}\_{\text{STR}}^{j} + \mathbf{w}\_{6} \mathbf{W}\_{\text{MEM}}^{j}$$

and $\mu_i \ge 0$ and $\lambda_i \ge 0$ are the dual variables.
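To make Eqs. (10)-(11) concrete, the sketch below computes the utilization $U_{ij}$ from raw metrics and evaluates the relaxed objective for one fixed assignment. All metric values, AHP weights, multipliers, and the data size are hypothetical illustrations:

```python
# Evaluating the utilization function U_ij and the relaxed (Lagrangian)
# objective of Eq. (11) for one candidate binary assignment x.

w = (0.3, 0.2, 0.1, 0.15, 0.15, 0.1)   # hypothetical AHP weights w1..w6

# per-node metrics as seen from node i:
# (L_ij, B_ij, H_ij, W_cpu_j, W_str_j, W_mem_j, A_mem_j, A_str_j)
nodes = [
    (2.0, 100.0, 1, 0.30, 0.20, 0.40, 16.0, 500.0),
    (5.0,  40.0, 2, 0.70, 0.60, 0.80,  8.0, 250.0),
    (1.0, 200.0, 1, 0.50, 0.10, 0.20, 32.0, 1000.0),
]

def utilization(m):
    """U_ij = w1*L + w2/B + w3*H + w4*W_cpu + w5*W_str + w6*W_mem."""
    L, B, H, wc, ws, wm = m[:6]
    return w[0]*L + w[1]/B + w[2]*H + w[3]*wc + w[4]*ws + w[5]*wm

def lagrangian(x, mu, lam, i_size):
    """Relaxed objective of Eq. (11): utility plus penalized capacity gaps."""
    util = sum(x[j] * utilization(nodes[j]) for j in range(len(nodes)))
    mem_gap = i_size - sum(x[j] * nodes[j][6] for j in range(len(nodes)))
    str_gap = i_size - sum(x[j] * nodes[j][7] for j in range(len(nodes)))
    return util + mu * mem_gap + lam * str_gap

x = [1, 0, 1]   # hypothetical: include nodes 0 and 2 in the receiving group
print(round(lagrangian(x, mu=0.01, lam=0.001, i_size=40.0), 4))
```

When the chosen nodes overshoot the required memory and storage, the two gap terms are negative, so positive multipliers reward assignments that satisfy the relaxed coverage constraints.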

The optimization problem obtained above is separable in the variables $x_j$, and it decomposes into sub-problems, one for each node $i$. Thus each node needs to solve only the one-dimensional optimization problem of Eq. (11), which consists of its own utility function and the Lagrange multipliers available at node $i$. Generally, the subgradient method is used to solve the resulting dual problem because of the simplicity of its computations per iteration. First-order methods such as the subgradient method have a slower convergence rate when high accuracy is required, but they are very effective in large-scale multi-agent optimization problems where the aim is to find near-optimal approximate solutions [55].
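The dual subgradient iteration can be sketched as follows: for fixed multipliers the relaxed objective is linear in each binary $x_j$, so the inner minimum sets $x_j = 1$ exactly when its relaxed coefficient is negative; the multipliers are then pushed along the constraint violations and projected back onto the nonnegative orthant. This is a minimal sketch with a constant step size and hypothetical utilities and capacities; without averaging, the iterates may oscillate rather than converge exactly:

```python
# Dual subgradient sketch for the relaxed problem of Eq. (11), from the
# point of view of one node i. Utilities U_ij, capacities, the required
# data size, and the step size are all hypothetical.

U     = [0.8, 0.5, 2.0]         # U_ij for candidate nodes j
A_mem = [16.0, 32.0, 8.0]       # available memory per node
A_str = [500.0, 1000.0, 250.0]  # available storage per node
I_SIZE = 40.0                   # required memory and storage
step   = 0.01                   # constant subgradient step size

mu = lam = 0.0
for _ in range(200):
    # inner minimization over x in {0,1}^n for the current multipliers:
    # coefficient of x_j is U_ij - mu*A_mem_j - lam*A_str_j
    x = [1 if U[j] - mu * A_mem[j] - lam * A_str[j] < 0 else 0
         for j in range(len(U))]
    # subgradients = violations of the two relaxed coverage constraints
    g_mem = I_SIZE - sum(x[j] * A_mem[j] for j in range(len(U)))
    g_str = I_SIZE - sum(x[j] * A_str[j] for j in range(len(U)))
    # projected subgradient step on the dual variables (kept >= 0)
    mu  = max(0.0, mu  + step * g_mem)
    lam = max(0.0, lam + step * g_str)

print(x, mu, lam)
```

In practice, averaging the primal iterates or using a diminishing step size yields the near-optimal approximate solutions that [55] describes for large-scale multi-agent problems.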

When a near-optimal approximate solution is found, the RS data should be divided into partitions and distributed proportionally to the nodes in the solution according to the performance heuristic (Eq. (1)) given in Section 6.1.
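The proportional split can be sketched as follows: each selected node receives a share of the data proportional to its performance score. The scores stand in for the heuristic of Eq. (1), which is not reproduced here, and all values are hypothetical; integer rounding simply assigns the remainder to the last node:

```python
# Partitioning RS data across the selected nodes in proportion to a
# per-node performance score (a stand-in for the heuristic of Eq. (1)).

def partition(total_size, scores):
    """Split total_size into integer chunks proportional to scores."""
    weight = sum(scores.values())
    parts, assigned = {}, 0
    names = list(scores)
    for name in names[:-1]:
        share = int(total_size * scores[name] / weight)
        parts[name] = share
        assigned += share
    parts[names[-1]] = total_size - assigned   # remainder to the last node
    return parts

scores = {"n3": 5.0, "n1": 3.0, "n2": 2.0}     # higher score = faster node
print(partition(1000, scores))                 # integer split summing to 1000
```

This keeps faster or better-connected nodes responsible for proportionally larger partitions, which matches the performance-aware distribution goal of the chapter.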

### 5. Conclusion

As a result of technological developments, the amount of data produced daily by many organizations has reached terabyte levels. Remotely sensed data, which is spatially and spectrally amplified and heterogeneous owing to the variety of sensing techniques, poses great difficulties for storing, transferring, and analyzing with conventional methods. Distributed approaches have become a necessity in critical applications where real-/near-real-time analysis of such big data is needed, since conventional methods are inadequate. Existing distributed file systems, databases, and high-performance computing systems experience difficulties in optimizing workflows for the analysis, storage, and retrieval of spatial big data from remote sensing and data streams.

Based on the research investigated in this chapter, it is observed that existing techniques and systems do not offer a solution that covers all of the open problems in real- and near-real-time big data analysis in remote sensing. Hadoop and similar technologies have attracted increasing attention in the geosciences community for handling big geospatial data. Many investigations have been carried out on adopting those technologies for processing big geospatial data, but there are very few studies on optimizing computing resources to handle a dynamic geo-processing workload efficiently.

In this chapter, a two-stage innovative approach has been proposed to store RS big data on a suitable cloud site and to process it with optimized resource allocation on a geo-distributed cloud. In the first stage, each cloud site determines a score based on latency, bandwidth capacity, and CPU, memory, and storage workloads through an MCDM process. After the minimum-valued alternative cloud site is found, the second stage evaluates which resources of that cloud site should be used and finds an optimal layout in the network for the RS big data with respect to latency, bandwidth capacity, and the amounts of CPU, memory, and storage. Lastly, the data should be divided into partitions based on a performance metric computed from the available network and processing resources of the selected nodes in the cloud site.

As future work, optimal replication methods will be investigated to prevent failures while transferring and processing RS data in a distributed manner. To achieve this, a performance-based approach is considered to maintain high-performance computing.

### Author details

Mustafa Kemal Pektürk<sup>1</sup>\* and Muhammet Ünal<sup>2</sup>

\*Address all correspondence to: mkpekturk@havelsan.com.tr

1 HAVELSAN A.Ş., Ankara, Turkey

2 Gazi University, Ankara, Turkey

### References

[1] Fadiya SO, Saydam S, Zira VV. Advancing big data for humanitarian needs. Procedia Engineering. 2014;78:88-95

[2] Chen CLP, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences. 2014;275:314-347

[3] Özköse H, Arı ES, Gencer C. Yesterday, today and tomorrow of big data. Procedia-Social and Behavioral Sciences. 2015;195:1042-1050
