*2.1.6 Optimal policy learning mechanism*

The local controllers obtain the QoS information (such as the delay, jitter, and PLR) from the data-plane devices for all the service requests and service classes on the E2E paths. The service requests and service classes are shown in **Table 2** [36] and **Table 3** [37]. A service request is a combination of the E2E delay, jitter, and PLR for an application. An example of the offered service classes in five E2E domains is shown in **Table 3**. Each local controller shares this information with the global controller, so the global controller has an E2E view of the network.
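The per-domain metrics can be composed into an E2E view in a straightforward way: delay and jitter are additive along the path, while the E2E PLR follows from the per-domain loss probabilities. A minimal sketch (the function name, report format, and values are hypothetical, not from the chapter):

```python
def e2e_view(domain_reports):
    """Combine per-domain QoS reports into an end-to-end view.

    Delay and jitter are additive along the path; the E2E packet loss
    ratio is derived from the per-domain loss probabilities.
    """
    delay = sum(r["delay_ms"] for r in domain_reports)
    jitter = sum(r["jitter_ms"] for r in domain_reports)
    success = 1.0
    for r in domain_reports:
        success *= 1.0 - r["plr"]
    return {"delay_ms": delay, "jitter_ms": jitter, "plr": 1.0 - success}

# Five domains on an E2E path (values illustrative).
reports = [
    {"delay_ms": 40, "jitter_ms": 2, "plr": 0.001},
    {"delay_ms": 20, "jitter_ms": 1, "plr": 0.002},
    {"delay_ms": 15, "jitter_ms": 1, "plr": 0.000},
    {"delay_ms": 0,  "jitter_ms": 0, "plr": 0.000},
    {"delay_ms": 45, "jitter_ms": 3, "plr": 0.001},
]
print(e2e_view(reports)["delay_ms"])  # 120
```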

Reinforcement learning with a Q-learning-enabled AI agent is used to maximize the agent's rewards. Q-learning is one methodology for applying reinforcement learning. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations. For a finite Markov decision process (FMDP), Q-learning computes an optimal policy that maximizes the expected value of the accumulated reward over all successive steps, beginning from the current state. Q-learning can find an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy [38]. Q is the name of the function that the algorithm learns: the maximum expected reward for an action taken in a given state [39].
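The core of tabular Q-learning is the update Q(s, a) ← Q(s, a) + α[r + γ·max<sub>a′</sub> Q(s′, a′) − Q(s, a)], paired with a partly random (e.g., ε-greedy) policy for exploration. A generic sketch, not the chapter's exact agent (function names and parameters are illustrative):

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Partly random action selection, needed for sufficient exploration."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)
# With Q initialized to zero, a reward of 1.0 yields 0.1 * (1.0 + 0) = 0.1.
print(q_update(Q, "s0", "a0", 1.0, "s1", ["a0", "a1"]))  # 0.1
```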

If the service request meets the end-to-end QoS demand for a state-action pair, a high reward factor is assigned. For this purpose, the *DC* ratio is checked for the state-action pair. The *DC* ratio denotes whether the QoS requirements are met for a service request. For example, if the application's service request has an E2E delay demand of 150 and the service classes offer delays of 40, 20, 15, 0, and 45 on the E2E path (a total of 120), then the ratio is 150/120, i.e., 1.25. Hence, if *DC* > 1, the service request is awarded a high *Q* value. On the contrary, if *DC* < 1, the reward is low for the state-action pair of that service request. This process continues until all possible source-to-destination paths are explored and the *DC* value is checked against each state-action pair.

**Table 2.** *An example of the E2E service requests.*

**Table 3.** *An example of the service classes on the E2E path passing through five domains.*

*Management of Software-Defined Networking Powered by Artificial Intelligence DOI: http://dx.doi.org/10.5772/intechopen.97197*
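The *DC* check above can be sketched as follows; the reward values are illustrative placeholders, not the chapter's actual reward factors:

```python
# Illustrative reward factors (not from the chapter).
HIGH_REWARD = 1.0
LOW_REWARD = -1.0

def dc_ratio(demand, offered_per_domain):
    """Ratio of the requested E2E delay budget to the delay
    offered by the service classes along the path."""
    return demand / sum(offered_per_domain)

def reward(demand, offered_per_domain):
    """High reward when the offered path meets the demand (DC > 1)."""
    return HIGH_REWARD if dc_ratio(demand, offered_per_domain) > 1 else LOW_REWARD

# Example from the text: demand 150, offered delays 40+20+15+0+45 = 120.
print(dc_ratio(150, [40, 20, 15, 0, 45]))  # 1.25
print(reward(150, [40, 20, 15, 0, 45]))    # 1.0 (high reward, since DC > 1)
```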
