## 2. Big data

Advances in remote sensing technologies have led to a tremendous increase in remote sensor data [8]. The amount of remote sensing (RS) data collected by a single satellite data center has grown dramatically, reaching several terabytes per day [9], because modern sensors offer high resolution and a large number of spectral bands thanks to new camera technologies. Thus, RS data reaching such high dimensions is defined as "big data."

Large image files consisting of both voluminous pixels and multiple spectral bands (multispectral/hyperspectral) are difficult to read and store in memory. In addition, the data mining algorithms that extract information from satellite and remote sensing data involve high computational complexity. With these features, remote sensor applications are both data intensive [10, 11] and compute intensive [12]. Moreover, the computational complexity of many data mining algorithms is super-linear in the number of samples and the size of each sample. Hence, optimizing the algorithms alone is not enough to obtain better performance when those two variables continue to increase.

In big data analysis, the main difficulties at the computer-architecture level are CPU intensiveness and slow input/output (I/O) operations. According to Moore's law, CPU and drive performance doubles roughly every 18 months. The trend in I/O interfaces, by contrast, approaches the improvement in network speeds but still lags behind it [13]. Although I/O interfaces operate at the same speed as the processor bus, I/O falls behind because peripheral hardware cards operate at low speeds. PCI cards are used to increase I/O performance, so I/O performance grows by about 33% annually. Despite these improvements in traditional approaches, when the collected data increase exponentially, relatively slow data-processing techniques make real-time analysis particularly difficult [2]. For this reason, the most important factor determining the performance of analytical processes is the inherent limitation of hardware resources [14]. Therefore, modern technologies and high-performance computing (HPC) techniques with parallel processing capabilities, such as multi-core programming, GPU programming, field-programmable gate arrays (FPGA), cluster computing, and cloud computing, are needed to analyze large volumes of data from which extracting information is complex and time-consuming [14–16].

HPC systems have recently attracted increasing attention in remote sensing applications because of the sheer volume of data involved [6, 7]. HPC integrates a number of computing environments and programming techniques to solve large-scale problems in the remote sensing domain. Many remote sensing applications, such as environmental studies, military applications, and the tracking and monitoring of hazards, require real-time or near-real-time processing capabilities for timely intervention. HPC systems such as multi- or many-integrated-core processors, GPU/GPGPU, FPGA, clusters, and clouds have become inevitable to meet this requirement. Technologies such as multi-core CPUs, GPUs, and FPGAs meet parallel computing needs, particularly for onboard image processing. In these technologies, the developed algorithms run on a single node, albeit with multiple cores, so they can only scale vertically through hardware upgrades, for example, more CPUs, better graphics cards, or more memory [17]. When distributed computing is considered, the most common approach is a cluster built from commercial off-the-shelf (COTS) computer equipment [6, 7].
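The data-intensive side of these workloads is easy to quantify with simple arithmetic: the raw in-memory size of a scene is rows × columns × bands × bytes per sample. The following is a minimal sketch; the scene dimensions below are illustrative assumptions, not figures from the text.

```python
# Rough estimate of the raw (uncompressed) in-memory size of one
# multispectral/hyperspectral scene. All dimensions are hypothetical.
def scene_size_bytes(rows, cols, bands, bytes_per_sample=2):
    """Raw scene size: rows x cols x bands samples, 16 bits each by default."""
    return rows * cols * bands * bytes_per_sample

# e.g. a 10,000 x 10,000-pixel scene with 224 bands at 16 bits/sample
size = scene_size_bytes(10_000, 10_000, 224)
print(f"{size / 1024**3:.1f} GiB per scene")  # ~41.7 GiB
```

Even one such scene exceeds the memory of a typical workstation, which is why the pixel count and band count together, not either alone, drive the storage and processing burden.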

Big data is characterized by many studies in terms of three Vs: volume, velocity, and variety [22, 23]. Volume, the most prominent of the three, expresses the size of the dataset. Velocity indicates the rate at which big data is produced; as the production rate increases, so does the need for faster processing of the data. Variety refers to the diversity of the data sources.

Regarding variety, the majority of the data obtained is unstructured or semi-structured [21]. Regarding velocity, the requirements vary by application area. In general, velocity is addressed in terms of the time interval within which processing must happen: batch processing, near-real-time requirements, continuous input–output (real-time) requirements, and stream-processing requirements. Batch processing is typically used to improve data and analysis pipelines, live analytics requires continuous, real-time analysis, and critical applications demand immediate intervention based on the analysis of incoming data streams.
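The distinction between the batch and stream ends of the velocity spectrum can be contrasted in a minimal sketch (the sliding-window size and the sample readings below are invented for illustration): batch processing needs the whole dataset before it can answer, while stream processing emits a result per arriving record.

```python
from collections import deque

def batch_mean(values):
    """Batch processing: the full dataset must be collected first."""
    return sum(values) / len(values)

def stream_means(source, window=3):
    """Stream processing: emit a sliding-window mean as each record arrives."""
    buf = deque(maxlen=window)  # keeps only the most recent `window` records
    for v in source:
        buf.append(v)
        yield sum(buf) / len(buf)

readings = [4.0, 8.0, 6.0, 10.0]
print(batch_mean(readings))          # 7.0, available only after all data arrive
print(list(stream_means(readings)))  # [4.0, 6.0, 6.0, 8.0], one value per record
```

The stream variant trades exactness over the whole dataset for bounded memory and an answer at every step, which is what continuous and near-real-time applications require.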

### 2.1. Remote sensing data

Remote sensing (RS) is defined as the ability to measure the properties of a surface or object from a distance [9]. RS data are obtained from various data acquisition technologies (lidar, hyperspectral cameras, etc.) mounted on airplanes and unmanned aerial vehicles. With recent developments in RS technologies, the amount, production rate, and diversity of remote sensor data have increased exponentially (Table 1). Thus, the data received from remote sensors are treated as RS "big data."

Diversity and multidimensionality are the greatest factors in the complexity of RS big data. RS data are used across the geosciences, covering environmental monitoring, underground, atmospheric, hydrological, and oceanographic content. Owing to such a wide range of application areas, the diversity of RS data has increased greatly. As far as is known, there are approximately 7000 different types of RS datasets in NASA archives [9]. Numerous satellites and sensors have emerged offering higher spatial, temporal, and even spectral resolution. As remote sensing data continue to grow in volume and complexity, a new architecture has become a necessity for existing algorithms and data management [24].

Table 1. Satellite data centers, data rates, and volumes.

Performance-Aware High-Performance Computing for Remote Sensing Big Data Analytics

http://dx.doi.org/10.5772/intechopen.75934

### 2.2. Remote sensing algorithms

The processing of RS data plays an important role in Earth observation systems. The longest RS workflow starts with data acquisition and ends with a thematic application (Figure 1). Within each step of the workflow, processes may run sequentially or concurrently. An RS application typically consists of the following stages, in order [9]:

Figure 1. Remote sensing data-processing steps [9].

• Data Acquisition/Collection: Images obtained from satellite or hyperspectral cameras are captured on the downlink channel. With high spatial and temporal resolution, data volume keeps increasing.

• Data Transmission: The data stream obtained is sent simultaneously to the satellite center. High bandwidth in the data transfer is critical for applications requiring real-time processing. In practice, transmitting all data in real time is impossible because the growth of data driven by satellite and camera technologies outpaces the data transmission rate.

• Preprocessing: Preprocessing concerning image quality and geographical accuracy, such as decomposition, radiometric verification, and geometric verification, is performed.

• Value-added Operations: Value-added operations such as fine-tuning, orthorectification, fusion, and mosaicking are performed. The mosaic process takes longer than the other value-added operations.

• Information Extraction/Thematic Implementation: In this step, classifications, attribute extractions, and quantitative deductions are obtained from the images subjected to preprocessing and value-added operations, for example, extracting information such as leaf area index, surface temperature, and snowy areas from RS images.
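The stages above can be sketched end to end as a toy pipeline. Every function name, threshold, and array shape below is invented for illustration; real preprocessing, value-added, and thematic steps are far more involved than these stand-ins.

```python
import numpy as np

def preprocess(raw):
    """Toy radiometric step: rescale digital numbers to a reflectance-like [0, 1]."""
    return raw.astype(np.float64) / raw.max()

def value_added(img, kernel_size=3):
    """Toy value-added step: box-filter smoothing (a stand-in for fusion/mosaicking)."""
    k = kernel_size // 2
    padded = np.pad(img, k, mode="edge")  # replicate border pixels
    out = np.empty_like(img)
    rows, cols = img.shape
    for i in range(rows):
        for j in range(cols):
            out[i, j] = padded[i:i + kernel_size, j:j + kernel_size].mean()
    return out

def extract_information(img, threshold=0.5):
    """Toy thematic step: threshold into a binary class map (e.g. 'bright surface')."""
    return img > threshold

# Tiny synthetic 2x2 "scene" of raw digital numbers
raw = np.array([[10, 200], [30, 180]], dtype=np.uint16)
mask = extract_information(value_added(preprocess(raw)))
print(mask.sum(), "pixels classified above threshold")
```

The point of the sketch is the data flow: each stage consumes the previous stage's full output, which is why the stages can be pipelined but an individual scene must usually pass through them in order.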


### 2.3. Data access patterns in remote sensing algorithms

The data access patterns in remote sensing algorithms vary according to the characteristics of the algorithms. There are four different access patterns for RS data. Depending on the size of the RS data, the unavoidable I/O load and the irregular data access patterns make traditional cluster-based parallel I/O systems inapplicable [25].

The "sequential row access pattern" is used by algorithms that require pixel-based processing, such as radiometric verification and the support vector machine (SVM) classifier. When such algorithms are implemented in parallel, each processor logically needs multiple consecutive image rows.
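A minimal sketch of how consecutive image rows might be divided among processors under this pattern (the row and processor counts, and the helper name, are arbitrary choices for illustration):

```python
def partition_rows(n_rows, n_procs):
    """Assign each processor a contiguous block of image rows.
    Earlier processors take one extra row when n_rows % n_procs != 0."""
    base, extra = divmod(n_rows, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        count = base + (1 if p < extra else 0)
        ranges.append((start, start + count))  # half-open [start, end)
        start += count
    return ranges

# 10 rows over 3 processors -> [(0, 4), (4, 7), (7, 10)]
print(partition_rows(10, 3))
```

Because each processor's share is one contiguous run of rows, the reads map naturally onto large sequential I/O requests, which is exactly why this pattern is the friendliest of the four to parallel file systems.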

The "rectangular block access pattern" is used by algorithms such as convolution filtering and resampling, which require neighborhood-based processing. Non-contiguous I/O patterns arise in these algorithms and are not efficiently supported by normal parallel file systems.
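For such neighborhood-based operators, each rectangular block must be read together with a halo of surrounding pixels, which is what makes the I/O non-contiguous: the halo rows and columns come from different parts of the file. A sketch, with block and halo sizes chosen arbitrarily:

```python
import numpy as np

def read_block_with_halo(image, r0, r1, c0, c1, halo=1):
    """Read a rectangular block [r0:r1, c0:c1] plus a halo of neighboring
    pixels, clamped at the image borders, for neighborhood-based processing."""
    rows, cols = image.shape
    return image[max(r0 - halo, 0):min(r1 + halo, rows),
                 max(c0 - halo, 0):min(c1 + halo, cols)]

img = np.arange(36).reshape(6, 6)
block = read_block_with_halo(img, 2, 4, 2, 4, halo=1)  # 2x2 block -> 4x4 read
print(block.shape)
```

In-memory slicing hides the cost, but on disk each row of the block (plus halo) is a separate short read at a different offset, which is the non-contiguous pattern the text describes.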

The "cross-file access pattern" is used by algorithms such as fusion and the normalized difference vegetation index (NDVI) that require inter-band calculations. The data consist of small, non-contiguous fragments spread across hundreds of image files. Thus, this type of access involves a large number of read/write operations, which is time-consuming.
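NDVI illustrates the pattern well: the red and near-infrared bands typically live in separate files, so every output pixel needs a read from each of them. A sketch, with small in-memory arrays standing in for the two band files:

```python
import numpy as np

def ndvi(red, nir, eps=1e-9):
    """Normalized difference vegetation index: (NIR - red) / (NIR + red).
    In practice, red and nir are read from two separate band files."""
    red = red.astype(np.float64)
    nir = nir.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against division by zero

red_band = np.array([[50.0, 100.0]])   # stand-in for the red-band file
nir_band = np.array([[150.0, 100.0]])  # stand-in for the NIR-band file
print(ndvi(red_band, nir_band))        # close to [[0.5, 0.0]]
```

The arithmetic itself is trivial; the cost is that producing one tile of output requires interleaved reads from two (or more) files, multiplying the number of small I/O operations.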

The "irregular access pattern" is used by algorithms such as the fast Fourier transform (FFT), image distortion, and information extraction, which require scattered access or access to the entire image. These algorithms use diagonal and polygonal access patterns as forms of irregular access. In these patterns, the pieces of data, although small and non-contiguous, can have different sizes and reside on different nodes. In addition to the I/O difficulties, the problem of identifying the irregular data regions also arises.
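Transform-domain algorithms such as the FFT are the extreme case: every output coefficient depends on every input pixel, so the access pattern cannot be reduced to contiguous rows or rectangular blocks. A small sketch of that whole-image dependency:

```python
import numpy as np

# Every 2-D FFT coefficient is a weighted sum over ALL pixels, so the
# algorithm needs access to the entire image, not just a row or a tile.
img = np.arange(16, dtype=np.float64).reshape(4, 4)
spectrum = np.fft.fft2(img)

# The DC (0, 0) coefficient alone already sums every pixel in the image.
print(spectrum[0, 0].real)  # 120.0 == img.sum()
```

Changing any single input pixel perturbs the whole spectrum, which is why distributing such algorithms forces scattered, cross-node data access rather than a clean spatial partition.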
