**7. Digital twin AI advisor using Bellman function**

The digital twin advisor uses the Bellman [20] expectation function to find the optimal actions for achieving Nash equilibrium. The Bellman expectation function for strategy π is

$$v_{\pi}\left(s\right) = E_{\pi}\left[R_{t+1} + \beta\, v\left(S_{t+12}\right)\right] \tag{8}$$

where

$R_{t+1}$ = immediate reward

$\beta v\left(S_{t+12}\right)$ = discounted future value (12-month estimation).
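To make Eq. (8) concrete, the following is a minimal sketch that estimates the expectation by averaging simulated rollouts of strategy π over the 12-month horizon. The `env` interface, the `v_pi` and `pi` names, and the discount value `BETA = 0.95` are illustrative assumptions, not the chapter's implementation.

```python
# Minimal sketch of Eq. (8): Monte Carlo estimate of v_pi(s).
# Assumed interface (not from the chapter): env.reset_to(state) returns the
# state, env.step(action) returns (reward, next_state); `pi` maps states to
# actions.
BETA = 0.95  # assumed discount factor; the chapter does not state its value

def v_pi(env, pi, state, n_rollouts=200, horizon=12):
    """Average discounted 12-month return under strategy pi."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset_to(state)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):          # one episode = 12 months
            reward, s = env.step(pi(s))
            ret += discount * reward
            discount *= BETA
        total += ret
    return total / n_rollouts
```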

The optimal policy is formed from the actions that yield the optimal value function; thus

$$q_{\pi\ast}\left(s, a\right) = R_{s}^{a} + \beta \max_{a \in A} v_{\pi\ast}\left(S_{t+12}\right) \tag{9}$$

where

$R_{s}^{a}$ = immediate state reward from strategy π*

$\beta \max_{a \in A} v_{\pi\ast}\left(S_{t+12}\right)$ = discounted maximum future value (12-month estimation).
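Eq. (9) can be read as: take the immediate reward of an action and add the discounted best value reachable over the following 12 months. The sketch below expresses that reading; `env.peek`, `env.actions`, and `v_star` are hypothetical names standing in for the simulation's internals.

```python
# Sketch of Eq. (9): immediate state reward plus discounted maximum future
# value. env.peek(state, action) -> (reward, next_state) is an assumed,
# side-effect-free lookahead; v_star approximates v_pi* at the 12-month mark.
def q_star(env, state, action, v_star, beta=0.95):
    reward, next_state = env.peek(state, action)       # immediate reward R_s^a
    future = max(v_star(env.peek(next_state, a)[1])    # best 12-month value
                 for a in env.actions)
    return reward + beta * future

def best_action(env, state, v_star):
    """The advisor's recommendation: the action that maximizes q_star."""
    return max(env.actions, key=lambda a: q_star(env, state, a, v_star))
```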

In our digital twin, the AI assistant uses the Bellman function. It returns the combination of actions that gives the best value over a floating 12-month horizon. This is achieved by first analyzing the value of each action and sorting the actions by the magnitude of that value. Combinations of the best actions are then evaluated until the marginal productivity of the value is reached; see the example in **Figure 8**. One simulation episode is 12 months; thus, the Bellman function maximizes future reward even when the episode is coming to an end.

**Figure 8.** *Bellman function principle of marginal productivity value.*
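The selection loop described above might look like the following sketch: rank each action by its individual 12-month value, then add actions greedily until the combined value stops improving. The `evaluate` callback, which should return the simulated 12-month value of a set of actions, is a placeholder rather than the chapter's actual code.

```python
# Sketch of the advisor's combination search: rank actions by value, then
# grow the action set until marginal productivity of the value is reached.
# `evaluate(actions)` is assumed to return the combined 12-month value.
def advise(actions, evaluate):
    ranked = sorted(actions, key=lambda a: evaluate([a]), reverse=True)
    chosen, best_value = [], float("-inf")
    for action in ranked:
        value = evaluate(chosen + [action])
        if value <= best_value:       # no marginal gain: stop adding actions
            break
        chosen, best_value = chosen + [action], value
    return chosen, best_value
```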

The simulation game is implemented in Unity 3D, which makes it possible to play the learning game episodes. Each episode is 12 months and consists of several workplace challenges. In the test runs we used the Cash Cow episode, where the problems are easy, the market situation is steady, and the company does not seek a special increase in revenue. State-space problems are signaled by the workers, who come to meet the team leader (the agent). In this ODT there are so far 25 workplace challenges, which reduce QWL according to a situational probability matrix. The leader has 32 best management practices (the action space) that may be used as the leader prefers. Each action reduces profit and may improve QWL according to a state-space, situation-specific probability function [23] (**Figure 9**).

**Figure 9.** *Simulation game user-interface.*
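As a rough illustration of these state and action spaces, the toy model below draws monthly challenges from a probability vector and applies the leader's chosen practices, each with a profit cost and a chance of improving QWL. The counts come from the text; the probabilities, costs, and effect sizes are invented placeholders, not the calibrated matrices of [23].

```python
# Toy model of the described spaces: 25 workplace challenges that reduce QWL
# with situational probabilities, and 32 management practices that cost
# profit and may improve QWL. All numeric effects are placeholders.
import random

N_CHALLENGES, N_PRACTICES = 25, 32
challenge_prob = [0.05] * N_CHALLENGES            # placeholder probabilities
# (cost to profit, probability of QWL gain, size of QWL gain) per practice
action_effects = [(1.0, 0.5, 0.02)] * N_PRACTICES

def month_step(qwl, profit, chosen_actions):
    for c in range(N_CHALLENGES):                 # workers signal problems
        if random.random() < challenge_prob[c]:
            qwl -= 0.01                           # a challenge reduces QWL
    for a in chosen_actions:                      # leader applies practices
        cost, p_gain, gain = action_effects[a]
        profit -= cost                            # each action reduces profit
        if random.random() < p_gain:              # ...and may improve QWL
            qwl += gain
    return min(max(qwl, 0.0), 1.0), profit
```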

We tested the simulation using three different competence values: 30%, 60%, and 90%. **Table 1** contains the results of the three simulation rounds, as follows:

• BIAS = human simulation episode (round) with a bias toward maximizing short-term profit. Only problem-solving actions are taken; in the BIAS episode the focus is on maximizing short-term profit.

• Learning = human simulation episode where the leader has learned to maximize the best result in QWL and profit. The agent executes the best learned strategy (see **Figure 7**) with a long-term profit mindset, solving problems as well as possible and following the yearly management plan of actions.

• Bellman = artificial intelligence episode where all actions are chosen according to the Bellman function (see **Figure 8**).

It seems that with a management competence level of 30% there are difficulties in achieving the budgeted target result in profit. If QWL is sacrificed for short-term wins, the cumulative profit result at the end of the year will be poor. It seems that in

