#### **6. Application services execution time prediction**

#### **6.1 Problem statement**

The application services execution time prediction problem can be described in the following way. Suppose we know the ID *ID*(*cn*) of every computational node *cn*; for every *w<sub>k</sub>* and every *s*<sup>*k*</sup><sub>*j*</sub>, the set of IDs of the *cn* computing installations, a set of variants of the *s*<sup>*k*</sup><sub>*j*</sub> parameters (for more details, see below), the amount of resources requested for known executions of *s*<sup>*k*</sup><sub>*j*</sub> with its parameter variants on certain *cn* nodes, and the execution time of *s*<sup>*k*</sup><sub>*j*</sub> with its input data variant on some *cn* ∈ *N*. Neither the source code of the program, nor its binary code, nor the architecture of the computer installations is known. We emphasize that the *s*<sup>*k*</sup><sub>*j*</sub> execution time is known only on some *cn* nodes, not on all of them. Further, an application service will be treated simply as a program; for the sake of terminological simplicity, we will use the term program and denote it *P<sub>i</sub>*. Because the set of *s*<sup>*k*</sup><sub>*j*</sub> is limited (|*W*| is finite, and every |*w<sub>k</sub>*| is finite), there is a numeration that assigns every *s*<sup>*k*</sup><sub>*j*</sub> a unique index *i*.

We will use the following notations for the problem statement:

1. *P<sub>i</sub>*—unique program ID;



5. *cn<sub>i</sub>*—unique computational node ID;

6. {*cn*<sub>1</sub>, *cn*<sub>2</sub>, …, *cn<sub>N</sub>*} = {*cn<sub>i</sub>*}<sub>*i*=1</sub><sup>*N*</sup>—a set of *N* unique IDs of computer installations;

By the notions above, the problem statement can be specified as following:

#### • **Given**

{*P<sub>i</sub>*}<sub>*i*=1</sub><sup>*M*</sup>—*M* program IDs; {*Arg*<sup>(*i*)</sup><sub>*j*</sub>}<sub>*j*=1</sub><sup>*A<sub>i</sub>*</sup>—the set of argument variants of program *P<sub>i</sub>*; {*cn<sub>k</sub>*}<sub>*k*=1</sub><sup>*N*</sup>—computational node IDs;

*V* = {[(*P<sub>i</sub>*, *Arg*<sup>(*i*)</sup><sub>*j*</sub>), *cn<sub>k</sub>*]}, where *i* = 1, …, *M*, *j* = 1, …, *A<sub>i</sub>*, *k* = 1, …, *N*, and $|V| = \left(\sum_{i=1}^{M} A_i\right) \cdot N$; *T*(*v*), a partially defined function on *V*. The value of *T*(*v*) is the execution time of program *P<sub>i</sub>* with arguments *Arg*<sup>(*i*)</sup><sub>*j*</sub> on computational node *cn<sub>k</sub>*.

#### • **Required**

Define the values of the function *T*(*v*) at the points of *V* where it is undefined.

The problem of estimating the execution time of a program on a computer is a classic one that has been known since the 1960s. It still exists in many forms, for example, for worst-case execution time estimation and for different computer architectures [12–15]. Different approaches to this problem have been proposed: analytical [16], statistical [17], based on program behavior analysis [18], time series prediction [19], and neural networks [20]. Execution time can also be predicted from test runs [21]. All of the execution time prediction algorithms mentioned above use the history of program executions. Their main drawback is that they are applicable only when the execution history on the particular computer installation is known; that is, to predict the execution time on a certain computer installation, one needs the whole history of the program's executions on that installation.

As we will demonstrate below, to predict the program execution time on some computational nodes, a few execution histories of this program on nodes from a certain set are sufficient; there is no need to know every program run on every computational node. The accuracy of the prediction depends on the number of available execution histories on computer installations from this set. To solve the execution time prediction problem, the following technique was used:

1. Let us represent the information about the set *V* and the function *T*(*v*) as a matrix in which each row corresponds to a pair (*P<sub>i</sub>*, *Arg*<sup>(*i*)</sup><sub>*j*</sub>) and each column to a computational node *cn<sub>k</sub>*. The intersection of a row and a column holds the execution time of the corresponding program *P<sub>i</sub>* with parameters *Arg*<sup>(*i*)</sup><sub>*j*</sub> on *cn<sub>k</sub>*. We will denote this matrix *PC*;
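As an illustration, below is a minimal sketch of such a *PC matrix* in Python; the program names, node names, and times are hypothetical, and `np.nan` marks the points of *V* where *T*(*v*) is undefined.

```python
import numpy as np

# Rows: (program, argument-set) pairs; columns: computational nodes.
# All names and times are illustrative, not taken from the datasets in [22].
programs = ["P1", "P2", "P3"]
nodes = ["cn1", "cn2", "cn3", "cn4"]

# Known execution times in seconds; np.nan = this (P_i, Arg_j) pair
# was never run on this node, i.e., T(v) is undefined there.
PC = np.array([
    [120.0,  95.0, np.nan,  60.0],
    [np.nan, 40.0,  33.0,  np.nan],
    [300.0, np.nan, 210.0, 150.0],
])
```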


$$PredictionError\left(P_i, Arg^{(i)}_j, cn_k\right) = \frac{|predict - target|}{target} \tag{3}$$

where *predict* is the predicted execution time of program *P<sub>i</sub>* with parameters *Arg*<sup>(*i*)</sup><sub>*j*</sub> on *cn<sub>k</sub>*, and *target* is the true execution time of program *P<sub>i</sub>* with the same parameters. *The total prediction error* is calculated as the average of the errors computed by Eq. (3) over all programs.
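A direct transcription of Eq. (3) and of the total error as a sketch (the function names are ours):

```python
import numpy as np

def prediction_error(predict: float, target: float) -> float:
    """Relative prediction error of Eq. (3)."""
    return abs(predict - target) / target

def total_prediction_error(predicts, targets) -> float:
    """Total prediction error: the average of Eq. (3) over all programs."""
    return float(np.mean([prediction_error(p, t)
                          for p, t in zip(predicts, targets)]))
```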

To form the *PC matrix*, the datasets from the website [22] dated 8 June 2021 (the datasets on this website are periodically updated) were used. These datasets contain descriptions of the total amount of resources of numerous computer installations and the results of executions of programs with various input parameters. From this site, we took three datasets with execution results of both MPI and OpenMP programs on a wide range of computers. These programs cover numerous application areas [23].

Naturally, the question arises: why were MPI programs taken when developing the method for estimating the execution time? This class of programs is used primarily on supercomputers, and it is well known that the execution time of a supercomputer program strongly depends on its architecture. Therefore, if we manage to develop a time estimation method for this case, then for the computing devices used in traditional servers it will certainly be no worse.

A brief description of the selected datasets is presented in **Table 1**.

The data used was uploaded to GitHub [24] along with the developed algorithm.


**Table 1.**

*Datasets of MPI and OpenMP programs executions on various computer installations from [22].*


It has been recognized that the considered problem is very similar to those solved in recommender systems, where matrix decomposition algorithms such as that in [25] are widely used in various combinations. The proposed solution was developed as a combination of several algorithms. They are run in parallel on the same *PC matrix*, and each makes its own prediction; at the end, all predictions are averaged, and the resulting value is taken as the expected execution time of the program. This technique is called ensemble averaging [26]. Three algorithms were chosen: ridge regression, Pearson correlation, and matrix decomposition.


The *ridge regression* algorithm is used as is. It is worth mentioning that ridge regression is used to predict the values of empty entries in a row of the *PC matrix*: if a value in a row is empty, it is predicted from all the known values in that row. In fact, an interpolation problem is solved. If the columns corresponding to the computer installations in the *PC matrix* are ordered by performance (this data can be taken from the descriptions of the computer installations), then this type of regression can give quite good predictions. Ridge regression works well on dense *PC matrices* with a small number of empty entries.
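A hedged sketch of this per-row interpolation, assuming (as the text suggests) that the columns are already ordered by node performance so that the column index can serve as the single regression feature; `alpha` and the helper name are our choices:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fill_row_with_ridge(row: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Predict the empty (NaN) entries of one PC-matrix row from its known ones."""
    known = ~np.isnan(row)
    if known.sum() < 2 or known.all():
        return row.copy()                 # nothing to fit, or nothing to fill
    x = np.arange(len(row), dtype=float).reshape(-1, 1)  # column index feature
    model = Ridge(alpha=alpha).fit(x[known], row[known])
    filled = row.copy()
    filled[~known] = model.predict(x[~known])
    return filled
```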

*Pearson correlation* is proposed to estimate the proximity of vectors of program execution times; since the algorithm is used as is and no novelty was introduced into it, it is only briefly recalled below. The columns of the *PC matrix* are considered as such vectors. The correlation between the columns *cn<sub>i</sub>* and *cn<sub>j</sub>* shows how close in performance the corresponding computational nodes are on the given set of program executions: if the Pearson correlation between these columns is close to 1, then *cn<sub>i</sub>* and *cn<sub>j</sub>* are close to each other from the point of view of performance. In this case, an estimate of the program execution time on *cn<sub>j</sub>* can be obtained by multiplying the time estimate on the *cn<sub>i</sub>* node by a constant.
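A sketch of both steps: correlating two columns over the programs they share, and scaling a known time from *cn<sub>i</sub>* to estimate *cn<sub>j</sub>*. The mean-ratio scaling constant is our assumption, not a detail given in the text:

```python
import numpy as np

def column_correlation(PC: np.ndarray, i: int, j: int) -> float:
    """Pearson correlation of columns i and j over their common known rows."""
    both = ~np.isnan(PC[:, i]) & ~np.isnan(PC[:, j])
    if both.sum() < 2:
        return np.nan                     # too few common runs
    return float(np.corrcoef(PC[both, i], PC[both, j])[0, 1])

def estimate_via_neighbor(PC: np.ndarray, row: int, i: int, j: int) -> float:
    """Estimate the time on cn_j by scaling the known time on cn_i."""
    both = ~np.isnan(PC[:, i]) & ~np.isnan(PC[:, j])
    scale = float(np.mean(PC[both, j] / PC[both, i]))
    return PC[row, i] * scale
```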

The procedure for distributing computing nodes into groups consists of the following steps:

1. Calculate the Pearson correlation for each pair of columns of the *PC matrix*: *N*(*N*−1)/2 pairs, where *N* is the number of computational nodes;

2. Build a graph in which:

	- a. Each vertex represents a *cn*;
	- b. An edge between *cn<sub>i</sub>* and *cn<sub>j</sub>* exists if the absolute value of the Pearson correlation between the corresponding columns of the *PC matrix* is greater than some threshold (the threshold value is an algorithm parameter). This gives us a graph whose nodes are the *cn<sub>i</sub>* and whose edges connect correlated pairs of computational nodes.

To search for groups, a special algorithm presented in [24] was developed; its complexity does not exceed *H*<sup>3</sup>, and the number of groups is less than 3<sup>*H*/3</sup>, where *H* is the number of vertices in the graph. This algorithm was described in detail in [31].
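The group-search algorithm itself is given in [24]; as a stand-in illustration only (not the authors' algorithm), the sketch below builds the correlation graph and extracts maximal cliques with `networkx`, reusing `column_correlation` from the previous sketch:

```python
import itertools
import networkx as nx

def correlation_groups(PC, threshold: float = 0.97):
    """Group columns (nodes) whose pairwise |Pearson correlation| > threshold."""
    n = PC.shape[1]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i, j in itertools.combinations(range(n), 2):
        r = column_correlation(PC, i, j)  # from the previous sketch
        if r == r and abs(r) > threshold: # r == r is False for NaN
            G.add_edge(i, j)
    return list(nx.find_cliques(G))       # maximal cliques as node groups
```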

However, if the threshold for the Pearson correlation value is close to 1, then the further prediction algorithm described in [32] works well even if some vertices in the cliques are missed.

If it is not possible to calculate the Pearson correlation (e.g., the considered program *P* has not been run on any of the computer installations from the clique) but the corresponding row of the *PC matrix* for program *P* is non-empty, then ridge regression should be used for the prediction (see step 4 above).

The error of prediction for the algorithm presented above can be estimated by Eq. (3).

*Matrix decomposition:* Because we consider this stage of the proposed algorithm ensemble our main contribution to the problem, we will spend more space on its description. As mentioned above, the problem of program execution time prediction is very similar to the problem solved in recommender systems. In these systems, there are usually two types of objects, and the relationships between them are measured. For example, such objects can be users and movies, users and books, users and goods, and so on. The relationship between them is often a measure of the extent to which the user prefers particular movies, books, or goods; it is called the rating. One can build a rating matrix whose rows (or columns) correspond to movies, books, or goods, and whose columns (or rows) correspond to users. This matrix is usually sparse, since there are many users and objects, and users cannot physically rate all the objects. The problem that recommender systems solve is to determine the ratings of all users for all objects; in other words, the system has to fill in the empty entries of the rating matrix.

Let us consider the following analogy: users are computer installations, goods are programs, and ratings are execution times. Thus, the computer installations "rate" the programs, and the smaller the rating (execution time), the better the computer installation suits the program.


**Figure 2.** *Matrix decomposition.*

The problem of filling empty entries in recommender systems is solved by the matrix decomposition method [23]. We propose to use this technique to solve the problem under consideration.

Applying matrix decomposition techniques to a matrix results in two or more matrices whose product approximates the original one. For the empty entries of the original matrix, that is, for unknown values, the product of the matrices gives estimates of those unknown values.

Decomposition of the *PC matrix* allows one to obtain vector representations of programs and computational nodes, which have a remarkable property: the scalar product of the vector representation of a program and the vector representation of a node approximates the program's execution time on that node. These vector representations are called embeddings of programs and computational nodes, respectively. Embedding techniques and methods of applying them are well known in areas such as NLP [33], topic modeling [34], and recommender systems [25].

**Figure 2** demonstrates the matrix decomposition. The rows are programs, and the columns are the computers to be rated. The matrix entries are the execution times corresponding to the pairs (program, computer); some entries may be empty. *K* is a parameter of the decomposition and is subject to tuning to reach acceptable accuracy. The result of decomposing a rating matrix of size *N*×*M* is a program matrix of size *N*×*K* and a computer matrix of size *K*×*M*. The rows of the program matrix are vector representations of the programs, and the columns of the computer matrix are vector representations of the "computational power" of the corresponding computers for the programs under consideration.

In our study, the ALS (alternating least squares) algorithm [35] was selected.
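Below is a minimal ALS sketch for the masked *PC matrix*, assuming random initialization, a fixed number of sweeps, and a small ridge term; a production implementation such as the one in [35] adds convergence checks and parameter tuning:

```python
import numpy as np

def als_decompose(PC, K=1, reg=0.1, sweeps=20, seed=0):
    """Factor PC (with NaNs) into P (N x K) and C (K x M) so that PC ~= P @ C."""
    rng = np.random.default_rng(seed)
    n, m = PC.shape
    mask = ~np.isnan(PC)
    P = rng.standard_normal((n, K))        # program embeddings
    C = rng.standard_normal((K, m))        # computer embeddings
    I = reg * np.eye(K)
    for _ in range(sweeps):
        for r in range(n):                 # fix C, solve for each program row
            cols = mask[r]
            A = C[:, cols]
            P[r] = np.linalg.solve(A @ A.T + I, A @ PC[r, cols])
        for c in range(m):                 # fix P, solve for each computer column
            rows = mask[:, c]
            B = P[rows]
            C[:, c] = np.linalg.solve(B.T @ B + I, B.T @ PC[rows, c])
    return P, C                            # (P @ C) estimates the empty entries
```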

As mentioned above, three algorithms were chosen for program execution time prediction: ridge regression, Pearson correlation, and matrix decomposition. As said previously, they are combined into an averaging ensemble [26] to improve the accuracy of the predicted execution time.
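A sketch of the ensemble averaging [26]: each algorithm independently fills the empty entries, and the per-entry predictions are averaged. The predictor names in the usage comment are illustrative wrappers around the sketches above, not names from the source:

```python
import numpy as np

def ensemble_predict(PC, predictors):
    """predictors: callables that map PC (with NaNs) to a fully filled matrix."""
    filled = np.stack([predict(PC) for predict in predictors])
    return filled.mean(axis=0)             # averaged expected execution times

# Usage (illustrative names):
# estimate = ensemble_predict(PC, [ridge_fill, pearson_fill, als_fill])
```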

#### **6.2 Experimental study**

Here, the results of the experimental comparison of the algorithms are presented.

The purpose of the experiments is to analyze the quality of the prediction results.


#### *6.2.1 Analysis of the quality of prediction based on the Pearson correlation grouping algorithm*

The MPIM2007 dataset with 13 programs and 437 computer installations from [22] was used for the experiments. The algorithm based on grouping computer installations by Pearson correlation is very sensitive to the presence of outliers in the data, as well as to how sparse the *PC matrix* is. To analyze the quality of prediction by this algorithm, three experiments were conducted, with the ridge regression algorithm used as the baseline. Each experiment followed the same methodology: in each row, exactly one known value was removed, and then it was predicted from the remaining values in that row according to the algorithm above.
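A sketch of this leave-one-out methodology under our reading of it: per row, hide one known value, predict it from the rest, and average Eq. (3) over the rows. Here `fill_row` stands for any of the row-filling sketches above:

```python
import numpy as np

def leave_one_out_error(PC, fill_row, seed=0):
    """Hide one known value per row, predict it, and average Eq. (3)."""
    rng = np.random.default_rng(seed)
    errors = []
    for r in range(PC.shape[0]):
        known = np.flatnonzero(~np.isnan(PC[r]))
        if known.size < 2:
            continue                       # nothing left to predict from
        c = rng.choice(known)              # the value to hide
        row = PC[r].copy()
        target, row[c] = row[c], np.nan
        predict = fill_row(row)[c]         # e.g., fill_row_with_ridge above
        errors.append(abs(predict - target) / target)
    return float(np.mean(errors))
```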

In the first experiment, the execution time was predicted by the ridge regression algorithm; the resulting prediction error was 0.25, or 25%. In the second and third experiments, grouping algorithms based on Pearson correlation were used with a correlation threshold of 0.97; the grouping resulted in 46 groups with two or more computer installations and 27 groups with only one computer installation. In the second experiment, execution time was predicted only for groups of size two or more; groups of size one were ignored. The resulting prediction error was 0.068, or 6.8%. In the third experiment, execution time was predicted by the Pearson correlation algorithm for groups of size two or more, while for groups of size one the ridge regression algorithm was used. The resulting prediction error was 0.115, or 11.5%. Thus, the accuracy of prediction on dense matrices is 88.5%.

#### *6.2.2 Analysis of the quality of prediction based on ALS matrix decomposition algorithm*

The MPIM2007 dataset with 13 programs and 437 computer installations from [22] was used for the experiments, which were conducted according to the methodology described in Section 4. To study the quality of the predictions based on the ALS matrix decomposition algorithm, four experiments were made with *K* = 1, 2, 3, 4. The results of the matrix decomposition were compared with each other, as well as with the ridge regression algorithm, which was chosen as the baseline prediction algorithm. In **Figure 3**, the *X*-axis is the percentage of empty entries in the *PC matrix* (entries were removed from it at random), and the *Y*-axis is the prediction error. For comparison, the result of predictions by the Pearson correlation algorithm was also added to **Figure 3**.


**Figure 3.** *Ridge regression and ALS matrix decomposition with* K *= 1, 2, 3, 4.*

According to the plots in **Figure 3**, the ridge regression and cliques algorithms work well on dense matrices (up to 1% of empty entries). The matrix decomposition technique with *K* = 1 gives the best results for sparse matrices in which the number of empty entries is more than 15%. Even if 80% of the entries are removed from the *PC matrix*, they can be predicted with an accuracy of 60%.

Thus, as a result of experiments, we can conclude that the matrix decomposition technique with *K* = 1 gives the best solutions when the percentage of empty entries in the *PC matrix* is more than **15%**.

An important advantage of using the matrix decomposition technique is the vector representations (embeddings) of programs and computer installations; in the case *K* = 1, they are points in a one-dimensional space. Thus, a total ordering on the set of computer installations can be defined, and one can work with them as scalars. **Figure 3** shows that the smaller the embedding of a computational node, the smaller the execution time of the corresponding program.
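A two-line sketch of this ordering, assuming `C` and `nodes` from the earlier sketches and *K* = 1:

```python
import numpy as np

# C has shape (1, M) when K = 1, so each computer embedding is a scalar.
order = np.argsort(C[0])                   # ascending scalar embeddings
ranked_nodes = [nodes[k] for k in order]   # nodes ordered by predicted speed
```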

#### *6.2.3 Analysis of the prediction quality of an ensemble of algorithms*

The ensemble averaging described above (with errors estimated by Eq. (3)) was used. The prediction result was compared with the following algorithms: *ridge regression*, *Cliques*, and *ALS* with *K* = 1. All three datasets—MPIL2007, MPIM2007, and ACCEL OMP from [22]—were used for the experiments, which were conducted according to the methodology described in Section 4. As can be seen in **Figure 3**, the ALS algorithm gives the best estimates when the matrix sparsity is at least 14%. According to **Figure 4**, the ensemble averaging is better when the matrix sparsity is between 14% and 94%. One more test of the proposed method was done on the ACCEL OMP dataset, which covers 15 programs and 30 computers; the results of this experiment are shown in **Figure 5**.

#### **7. Organization of NPC computing resources**

As I presented in [2], a computing node (CN) can be an edge [36] as well as a supercomputer or HPC installation. Existing data center construction approaches demand high-quality communication channel service to ensure availability of service, as well as very high capital costs of constructing a centralized data center. Significant problems of a traditional DC are scaling and a low level of resource utilization due to the lack of a centralized management and orchestration system [37].

**Figure 4.** *Results of ridge, cliques, matrix decomposition, and an ensemble of algorithms on MPIM2007 (1–94%).*

**Figure 5.** *Results of ridge, cliques, matrix decomposition, and an ensemble of algorithms on ACCEL OMP (1–92%).*

The advantages of building an NPC based on edges over the traditional approach have been discussed in detail in [38] and are characterized by: reduction of transport requirements due to the proximity of the service copy to the final consumer; reduced cost of organizing a data center due to the absence of the need to build a centralized one; efficient scaling through the use of a centralized cloud platform; and increased network efficiency due to a centralized management and orchestration system and the proximity of the service to the client. The problems of organizing the control plane and the data plane in edges are in many ways similar to those already listed above for the NPC Federate DTN control layer (see **Figure 1**). The main difference is that the decision-making speed must be much higher.


Another important point is the following. Managing and optimizing power has been a long-standing challenge in computer systems, with many fundamental techniques, and it has been receiving increasing attention. Today, a wide range of applications such as IoT services, ML inference, data analytics, and scientific computing use the Functions as a Service (FaaS) offerings of cloud platforms such as Amazon Lambda, Azure and Google Functions, and so on. Serverless computing, in which applications are partitioned into small, fine-grained "functions" whose execution is managed by the cloud platform [39], has become a widespread application deployment model. However, the constant evolution of cloud abstractions and usage models poses new energy efficiency challenges.

Serverless functions can provide unique opportunities to reduce cloud heat generation by reducing the concentration of computing power in a single small location. Many power-saving techniques, such as workload migration and on-demand scaling, that are difficult for regular VMs and containers can be significantly easier to develop and optimize for serverless functions that can "run anywhere" [40]. In this way, FaaS can give cloud platforms new power-management tools to quickly and finely move applications to environmentally friendly locations, which will be especially useful for distributed cloud edges powered by renewable energy such as solar and wind [41]. The FaaS programming model also allows power management at the function level: a function can have multiple implementations that differ in power consumption and performance, which may allow the cloud provider to use the appropriate implementation based on power availability and application performance constraints. Finally, functions are reused, enabling data-driven and machine learning methods, such as transfer learning, that can be used for general and practical energy management.

One possible approach to managing energy consumption and carbon footprint is a computing infrastructure model based on the concept of a network of cloud data centers and cloud edges, where computation can be scheduled based on energy consumption and carbon emissions. In the FaaS model, functions are not tied to any specific servers or locations and can potentially "run anywhere" as long as the runtime platform has access to the function's code dependencies and the container/VM "image". By separating computing from its location, serverless computing allows us to run functions in the most power-efficient location; even though individual functions may not be energy efficient, they can be run in carbon-friendly locations to achieve better overall carbon efficiency. This location independence can be an extremely potent technique for sustainable computing but is often challenging for other workloads. Because renewable energy sources (such as solar and wind) can be intermittent, the availability of servers powered by them is only transient [42].

AI routing in distributed edge clouds can offer different trade-offs between energy and carbon emissions depending on location, time, and availability of resources and hardware. Functions can be run on the edge to ensure low latency. The trade-off between power consumption and performance adds a new dimension to the discussion of future cloud architectures. While edge clouds may have performance and security/privacy benefits, their energy benefits require further analysis.

As is well known, maximum efficiency of program execution requires a certain set of hardware and a particular configuration of it. Several attempts have already been made to implement the approach of dynamically adapting the architecture to the application; see, for example, [32]. Currently, to meet this need, the resource disaggregation approach [27] is proposed. Its essence is as follows. Data centers have been using the monolithic server model for over 20 years, where each server has a motherboard that houses all types of hardware resources, typically including the processor, memory chips, storage devices, and network cards. Resource disaggregation divides the server's hardware resources into standalone network-connected devices that applications can access remotely. Applications must be given virtualized and secure access to hardware resources, and data centers must support these applications with tools that ensure good performance.


In the history of computer architecture, there have been many attempts to make computers dynamically adaptable to the structure of the algorithm of the program being executed [32]; however, none of them was very successful. The resource disaggregation direction described above is now gaining momentum: instead of a monolithic server hosting all types of hardware resources (CPU cores, memory chips, storage, and NICs), the server is split into individual devices connected by a high-speed network. The user can choose the architecture of the virtual machine and the configuration of the individual devices to ensure the most efficient execution of applications, with virtualized and reliable access to all devices. This approach to computer architecture requires a rethinking of the operating system concept; the traditional view of it has already become obsolete, but this is a topic for a separate publication.

#### **8. Conclusion**

The Network Powered by Computing (NPC) concept of a next-generation computational infrastructure was presented. This concept is based on the convergence of data communication networks with computing facilities such as DCs, edges, and HPC centers, united by a common functional architecture. The NPC concept is the incarnation of the slogan I got from John Gage of Sun Microsystems: "The network is the computer". Here, we considered the functional architecture of the NPC and described the main problems on the way to its implementation. The presented concept makes it possible to achieve deep automation in the management of the resources of this infrastructure, load distribution, and energy consumption through the use of methods based on machine learning algorithms.

The organization of the ASNF layer, which, together with the OAM and NPCIC layers, is essentially a new generation of operating environment, an analogue of the traditional operating system, remains an open issue. But this is an independent, large topic that requires a separate publication.
