Processing large and growing collections of images (Big Data) imposes a parallelisation of calculations to obtain results in reasonable time and with optimum precision. Massively parallel machines are more and more available at increasingly affordable cost, as is the case with multiprocessors. This justifies our motivation to direct our research efforts in large-scale image classification towards the exploitation of such architectures, with new Big Data platforms that use the performance of these machines.

**5. The basics of big data**

Every day, we generate trillions of bytes of data (Big Data). This data comes from everywhere: sensors used to collect climate information, messages on social media sites, digital images and videos posted online, transactional records of online purchases, and GPS signals from mobile phones, to name a few sources.

Big Data is characterized by its volume (massive data); it is also known for its variety of formats and new structures, as well as for the speed at which it must be processed. Until now, according to our research, no single piece of software has been able to handle all of this data, which takes many types and forms and is growing very rapidly. Big Data issues are therefore part of our daily life, and more advanced solutions are needed to manage this mass of data in a short time.

Distributed computing is concerned with processing large amounts of data. This processing cannot be achieved with traditional data processing paradigms; it requires distributed platforms. The literature offers several solutions for implementing this paradigm. Among them, Google developed a very reliable programming model for Big Data processing: the MapReduce model. This model is implemented on several platforms, such as Hadoop. Despite its advantages, Hadoop suffers from latency problems, which is the main reason a new alternative was developed to improve data processing performance: the Spark platform, which is more powerful, more flexible, and faster than Hadoop MapReduce.

In this chapter, we will explain the basics of Big Data, as well as the platforms used to process and store it.

**5.1 Definition**

Big Data refers to a very large volume of often heterogeneous data that comes in several forms and formats (text, sensor data, sound, video, route data, log files, etc.) and mixes structured, semi-structured, and unstructured data. Big Data has a complex nature that requires powerful technologies and advanced algorithms for its processing and storage; it therefore cannot be processed with tools such as a traditional DBMS [23]. Most scientists and data experts define Big Data with the concept of the 3 Vs, as follows [23]:


• Volume: this represents the amount of data generated, stored, and used. The volume of data stored today is exploding: it is almost 800,000 petabytes, Twitter generates more than 7 terabytes of data every day, Facebook generates more than 10 terabytes, and the data volume in 2020 could reach 40 zettabytes (**Figure 1**) [24] (a short sketch after this list makes these orders of magnitude concrete).

• Velocity: data is generated quickly and must be processed quickly to extract useful and relevant information. For example, Walmart (an international chain of discount retailers) generates over 2.5 petabytes (PB) of data every hour from its customers' transactions. YouTube is another good example of the fast speed of big data.

• Variety: big data is generated from various distributed sources in multiple formats (e.g. videos, documents, comments, logs). Large data sets include structured and unstructured data, public or private, local or remote, shared or confidential, complete or incomplete, etc.

**Figure 1.** *The 3 V big data model.*
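The orders of magnitude in the Volume bullet are easier to compare once expressed in one unit. The short Python sketch below is a minimal back-of-the-envelope conversion using decimal (SI) units; it uses only the figures already quoted above, and the variable names are ours.

```python
# Back-of-the-envelope check of the volumes quoted above, in decimal
# (SI) units. All figures come from the text; only conversions are new.
TB = 10**12   # terabyte, in bytes
PB = 10**15   # petabyte
ZB = 10**21   # zettabyte

stored_today = 800_000 * PB    # "almost 800,000 petabytes"
volume_2020 = 40 * ZB          # "40 zettabytes"

print(volume_2020 / PB)             # 40 ZB = 40,000,000 PB
print(volume_2020 / stored_today)   # => 50.0, a fifty-fold growth

twitter_yearly = 7 * TB * 365       # "more than 7 terabytes every day"
print(twitter_yearly / PB)          # ~2.6 PB of tweets per year
```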

Thereafter, the three original dimensions are widened by two other dimensions of big data (together known as the "5 V Big Data"):

• Veracity: the truthfulness (or validity) of data refers to its reliability and accuracy, and to the confidence that big data inspires in decision-makers. If the users of this data doubt its quality or relevance, it becomes difficult to invest more in it.

• Value: this last V plays a key role in Big Data; the Big Data approach only makes sense if it achieves the strategic goal of creating value for customers and for companies in all areas (**Figure 2**).

One of the reasons for the emergence of the concept of Big Data is the need to meet the technical challenge of processing large volumes of information of several types (structured, semi-structured, and unstructured) generated at high speed. Big Data is based on four data sources [25]:

1. The logs (connection logs) from traffic on the company's official website: these data sources are the paths taken by visitors to reach the site: search engines, directories, bounces from other sites, etc. Businesses today have a web storefront in the form of their official website, which generates traffic that is essential to analyze; these companies therefore place trackers on the different pages in order to measure navigation paths, the time spent on each page, and so on (a minimal parsing sketch follows **Figure 3** below). Some of the best-known analytics solutions include Google Analytics, Adobe Omniture, and Coremetrics.

to private data, sensitive and security information, documents protected by copyright, etc. (**Figure 3**).<sup>1</sup>

**Figure 3.** *The four sources of big data.*

<sup>1</sup> The four sources of big data, https://www.communication-web.net/2016/03/07/les-4-sources-du-big-data/
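As a minimal illustration of the log analysis described in source 1, the sketch below reconstructs navigation paths and time spent per page from a toy access log. The log format, visitor ids, and page names are hypothetical stand-ins; real trackers such as Google Analytics collect far richer, proprietary data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log format: one hit per line, "visitor_id timestamp page".
raw_log = """\
v1 2020-01-01T10:00:00 /home
v1 2020-01-01T10:00:40 /products
v1 2020-01-01T10:02:10 /checkout
v2 2020-01-01T10:01:00 /home
v2 2020-01-01T10:01:30 /contact"""

visits = defaultdict(list)  # visitor id -> ordered (time, page) hits
for line in raw_log.splitlines():
    visitor, ts, page = line.split()
    visits[visitor].append((datetime.fromisoformat(ts), page))

for visitor, hits in visits.items():
    print(visitor, "path:", " -> ".join(page for _, page in hits))
    # Time on a page = gap until the next hit (unknown for the last page).
    for (t0, page), (t1, _) in zip(hits, hits[1:]):
        print(f"  {page}: {(t1 - t0).total_seconds():.0f}s")
```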

**5.2 Big data processing and storage technologies**

Big Data requires redefining the data storage and processing systems so that they can support this volume of data. Indeed, several technologies have been proposed to represent this data; each of them takes at least one of two axes: improving storage capacities or improving computing power [23]:

• Improvement of storage capacities: improving the storage of distributed systems, where the same file can be distributed over several hard drives; this allows storage volumes to be increased using commodity hardware. These storage technologies keep evolving to offer faster access to data, such as NoSQL, HDFS from the Hadoop platform, HBase, Cloud Computing, etc.

• Improved computing power: the goal of these techniques is to allow processing over a large set of data at reasonable cost and to improve execution performance, such as processing time and fault tolerance (a minimal sketch follows this list). Before the appearance of the Hadoop platform, there were several technologies in this vein, such as Cloud Computing, massively parallel processing (MPP) architectures, and In-Memory technologies.
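To illustrate the "improved computing power" axis referenced above, here is a minimal sketch that splits one computation across CPU cores using only Python's standard library. The chunked sum-of-squares workload is an arbitrary stand-in for per-record processing; nothing here is specific to Hadoop, MPP, or In-Memory products.

```python
# A single computation split across CPU cores: each worker processes
# one chunk of the data, and the partial results are combined at the end.
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for real per-record processing."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    with Pool(n_workers) as pool:          # one process per chunk
        partials = pool.map(process_chunk, chunks)

    print(sum(partials))                   # combine partial results
```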

**6. MapReduce**

**6.1 Why MapReduce?**

Traditional business systems normally have a centralized server to store and process data. This traditional model is certainly not suited to large volumes of scalable data, which cannot be handled by standard database servers. In addition, a centralized system creates too much of a bottleneck when processing multiple files simultaneously. Google solved this bottleneck issue with the MapReduce model.

**6.2 MapReduce model definition**

MapReduce was designed in the 2000s by Google engineers. It is a programming model designed to process several terabytes of data on the thousands of computing nodes of a cluster [26]. MapReduce can process terabytes and petabytes of data quickly and efficiently, so its popularity has grown rapidly among companies in many fields. It provides a highly efficient platform for the parallel execution of applications, the allocation of data in distributed database systems, and fault-tolerant network communication [27]. The main goal of MapReduce is to facilitate data parallelization, distribution, and load balancing within a simple library [26].
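To make the model concrete, here is a minimal, single-machine sketch of the MapReduce idea in Python. It is our own illustrative simulation, not Google's or Hadoop's implementation: the map_fn/reduce_fn names and the in-memory "shuffle" are assumptions, and a real cluster would distribute each phase across nodes.

```python
# The user supplies map and reduce functions; the "framework" handles
# collecting intermediate pairs, grouping them by key (the shuffle),
# and applying the reduction. Word count is the canonical example.
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Map phase: emit (word, 1) for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce phase: sum the counts emitted for one word."""
    return (word, sum(counts))

def mapreduce(documents, map_fn, reduce_fn):
    # Map every input record, collecting all intermediate (key, value) pairs.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle: group pairs by key (done by the network shuffle on a cluster).
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(key, (v for _, v in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

docs = ["big data needs big platforms", "data platforms scale"]
print(mapreduce(docs, map_fn, reduce_fn))
# [('big', 2), ('data', 2), ('needs', 1), ('platforms', 2), ('scale', 1)]
```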

**6.3 The MapReduce model architecture**

Google created MapReduce to process large quantities of unstructured or semi-structured data, such as documents and logs of web page requests, on large clusters of nodes. It produced different types of data, such as inverted indices, among others.
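Since the text singles out inverted indices as a typical MapReduce output, here is a standalone, hedged sketch of the same map/shuffle/reduce pattern specialised to that task. The toy corpus, document ids, and dictionary-based shuffle are our own illustrative assumptions, not Google's implementation.

```python
# Building an inverted index (word -> posting list of document ids)
# with the three MapReduce phases written out explicitly.
from collections import defaultdict

corpus = [(0, "big data platforms"), (1, "image data")]

# Map phase: emit (word, doc_id) pairs.
pairs = [(word.lower(), doc_id)
         for doc_id, text in corpus
         for word in text.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, doc_id in pairs:
    grouped[word].append(doc_id)

# Reduce phase: de-duplicate and sort each posting list.
inverted_index = {word: sorted(set(ids)) for word, ids in grouped.items()}
print(inverted_index)
# {'big': [0], 'data': [0, 1], 'platforms': [0], 'image': [1]}
```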

