**4. Defining management-game for ODT**

The Bayesian theorem with a stochastic game is utilized in defining the management-game for the ODT. Using game theory, we can model the strategic interaction between different players (agents) in a predefined environment. Our management-game is a multi-agent game for the profit unit, where the agents are the workers and the manager. The concept is non-symmetric because the manager (team leader) and the workers have different roles and their reward characteristics differ. Workers are motivated to maintain and improve their self-esteem (QWL). In addition, there might be some hidden motivation drivers. The team leader's motivation drivers are unit profit and possible personal incentives, which may be hidden (e.g. biases). In our game, the focus is the profit-unit manager's behavior and learning.


The self-esteem categories:

• Physical and emotional safety (PE);

• Collaboration and identity (CI); and

• Objectives and creativity (OC).

The chosen categories and their effect on performance form the theory of the QWL index. It is also important to know that, in addition to being a production parameter, the QWL index has a logical connection to customer satisfaction (see [17]) (**Figure 3**). Finally, the QWL index is the combination of all three self-esteem factors according to the following equation:

$$
QWL = PE(x_1) \ast \frac{CI(x_2) + OC(x_3)}{2} \tag{3}
$$

where

QWL is the quality of working life index (0 … 1). PE(x1) is the value of the function of physical and emotional safety. CI(x2) is the value of the function of collaboration and identity. OC(x3) is the value of the function of objectives and creativity.

The functions of the self-esteem categories are adjusted so that the final QWL result is always between 0 and 100% [12].
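As a minimal sketch, equation (3) can be evaluated as follows; the category functions below are hypothetical clamping placeholders, since the chapter only states that the real functions are calibrated so that QWL stays within 0 … 100%:

```python
def clamp01(x: float) -> float:
    return min(max(x, 0.0), 1.0)

# Hypothetical placeholder category functions; the chapter calibrates the
# real PE, CI and OC functions so that QWL stays between 0 and 100% [12].
def PE(x1: float) -> float: return clamp01(x1)   # physical and emotional safety
def CI(x2: float) -> float: return clamp01(x2)   # collaboration and identity
def OC(x3: float) -> float: return clamp01(x3)   # objectives and creativity

def qwl(x1: float, x2: float, x3: float) -> float:
    """Equation (3): PE acts as a multiplier on the mean of CI and OC."""
    return PE(x1) * (CI(x2) + OC(x3)) / 2.0

print(qwl(0.8, 0.7, 0.6))  # -> 0.52, i.e. a QWL of 52%
```

Note how PE multiplies the whole expression: a collapse in physical and emotional safety drags the entire index down regardless of the other two categories.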


**Figure 3.** *The theory of QWL.*


At Nash equilibrium, the optimal outcome of the game is one where no agent wants to deviate from the chosen policy, because that policy appears to be aligned with the opponents' policies. Workplace problems tend to reduce workers' self-esteem, thus decreasing QWL as a production parameter. Management practices tend to improve QWL, but each action reduces short-term profit. The manager's strategy hypothesis guides the actions at different state events. When the consequence data of action tendencies update the status after each Markov sequence, the player can update the management strategy, which in turn controls the next actions. Bayesian probability relates to the player's subjective behavior, relying on the phenomenon that rational thinking will probably lead to the optimal result as new information becomes available [19].

The manager should learn the optimal leadership strategy without knowing the exact reward function or state transition function. This approach is called stochastic model-free reinforcement learning and can be defined with the Nash Q-learning approach. The leader has a prior belief about the state of nature of the profit-unit business situation and the expected future reward. The uniqueness of the game comes from the fact that it has predictive features that allow the use of reinforcement-learning artificial intelligence for learning the Nash equilibrium between staff QWL and organization profit.
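To ground the idea, here is a minimal tabular Q-learning sketch for the leader, a single-agent simplification of Nash Q-learning; the state and action encodings and all constants are hypothetical:

```python
from collections import defaultdict
import random

# Minimal tabular Q-learning sketch for the leader (a single-agent
# simplification of Nash Q-learning; all encodings are hypothetical).
Q = defaultdict(float)            # Q[(state, action)] -> expected reward
ALPHA, BETA, EPS = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def choose(state, actions):
    if random.random() < EPS:                         # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

def update(state, action, reward, next_state, actions):
    # Model-free update: neither the reward function nor the state
    # transition function needs to be known in advance.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + BETA * best_next - Q[(state, action)])

update("qwl_problem", "improve_qwl", reward=-100.0, next_state="qwl_ok",
       actions=["improve_qwl", "business_as_usual"])
```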

The management game is a signaling game, since workers give essential signals about possible workplace problems that may threaten their self-esteem (QWL) and therefore team performance. The workers' preferred strategy is to give their leader signals about the problems. In the simplified digital team-leader learning-game, the workers' strategy may be stationary, meaning that the workers' behavior may be chosen in advance when the event scenario is known.
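As an illustration of such a stationary strategy, the sketch below pre-scripts a signal for every month of a known event scenario; all event and signal names are invented for illustration:

```python
# A minimal sketch of a stationary worker strategy: the signal for each
# month is fixed in advance from a known event scenario. Event and signal
# names are hypothetical illustrations, not from the chapter.
EVENT_SCENARIO = {
    1: "none", 2: "overtime_pressure", 3: "none",
    4: "team_conflict", 5: "none", 6: "unclear_goals",
}

def worker_signal(month: int) -> str:
    """Stationary policy: the same event always produces the same signal."""
    event = EVENT_SCENARIO.get(month, "none")
    return {"none": "ok",
            "overtime_pressure": "PE_threat",   # physical and emotional safety
            "team_conflict": "CI_threat",       # collaboration and identity
            "unclear_goals": "OC_threat",       # objectives and creativity
            }[event]
```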

The team leader, as an agent of the management game, is responsible for the team's profit performance, which is the outcome of producing customer value measured by revenue. The agent registers the workers' signals and forms an own prior belief for the strategy. The agent also monitors scorecards of business outcomes, the monthly and cumulative profit, and forms a prior-belief policy on how to act on these measures. The agent is rewarded by the profit each month and by the cumulative profit at the end of the year. After each state transition, the agent gets profit signals and QWL signals from the workers' response to the state change in the workers' QWL. State-change signals and reward results may cause changes in the agent's preference strategy for the next sequence (Markov sequence [19]) (**Figure 4**).
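A minimal sketch of this prior-belief bookkeeping, assuming a simple Bayesian update of the leader's belief that a QWL problem exists (all probabilities are invented):

```python
def bayes_update(prior: float, p_signal_given_problem: float,
                 p_signal_given_ok: float) -> float:
    """One Bayesian update of the leader's belief that a QWL problem exists,
    after observing a worker signal. All numbers are illustrative assumptions."""
    evidence = (p_signal_given_problem * prior
                + p_signal_given_ok * (1.0 - prior))
    return p_signal_given_problem * prior / evidence

belief = 0.2                              # prior belief that a QWL problem exists
belief = bayes_update(belief, 0.8, 0.1)   # a threat signal arrives
print(round(belief, 2))                   # -> 0.67: the signal sharply raises the belief
```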

**Figure 4.** *Leader's prior belief is biased and this strategy leads to delayed punishment.*

The leader's reward function (γL) is the combination of the monthly profit change and the expected effect on future profit. πLi is the leader's strategy at the current state (month). It seems that at the beginning the leader's strategy is weighted toward the monthly profit, and the expected future reward is based on a simple linear regression of the data achieved so far. This means a biased prior belief, where the expected reward is not nearly the same as the outcome of the strategy. Thus, the value function under the biased strategy is the following:

$$
\gamma_{\mathrm{L}} = \gamma_{t\text{€}} + \gamma_{12\text{€}} \tag{4}
$$

where

γt€ is the observed state reward.

γ12€ is the expected future reward.
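As a sketch of this biased prior, assuming "simple linear regression of data achieved so far" means extrapolating past monthly profits to the year end (illustrative numbers in k€):

```python
import numpy as np

# Observed monthly profits so far (illustrative numbers, k€).
profits = np.array([10.0, 11.0, 9.5, 12.0])
months = np.arange(1, len(profits) + 1)

# Biased prior: fit a line to past profits and extrapolate to year end.
slope, intercept = np.polyfit(months, profits, deg=1)
future_months = np.arange(len(profits) + 1, 13)
gamma_12 = float(np.sum(slope * future_months + intercept))  # expected future reward
gamma_t = profits[-1]                                        # observed state reward
gamma_L = gamma_t + gamma_12                                 # equation (4)
```

The bias is structural: a straight line through past profits ignores the effect of QWL on future effective working time, which is exactly what the leader has not yet learned.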

When the leader gets more experience and learns to understand the complexity of the system, as well as the meaning of the workers' QWL, the prior-belief value function changes. The QWL change starts to be more interesting, because the leader learns to expect more future profit when QWL improves. Thus, along with this information, the leader adjusts the strategy for optimizing the cumulative yearly profit. Here the stochastic nature of the leadership game is key to learning the Nash general-sum equilibrium between QWL and profit.

$$
\gamma_{\mathrm{L}} = \alpha_t \left( \gamma_{t\text{€}} + \gamma_{12\text{€}} \right) \tag{5}
$$

where

γt€ is the observed state reward.

γ12€ is the expected future reward by improving the QWL.

αt is the learning rate.

QWL is improved by leadership actions that reduce the monthly working time available for making revenue. Thus, improving QWL reduces monthly revenue and profit, but may increase effective working time in the future and so increase future profit. On a monthly basis this phenomenon may be contradictory and confusing, but with practice, the best reward is achieved where both the workers' and the leader's payoff functions flourish. This is the Nash equilibrium, where yearly QWL is improved together with high profit. In the Nash equilibrium, the leader's choices are the best response to the workers' signals and the cumulative business outcome at the end of the year.
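A toy computation (all numbers invented) illustrates the trade-off: investing working time in QWL actions costs revenue each month but compounds into higher effective working time and a better yearly result:

```python
# Toy trade-off (all numbers invented): invest hours per worker per month in
# QWL actions. Revenue is lost now, but QWL raises effective working time later.
HOURS, RATE, COST = 160, 50.0, 40.0   # monthly hours, revenue €/h, cost €/h

def yearly_profit(invest_hours: float, qwl_gain_per_hour: float) -> float:
    qwl_effect = 1.0
    total = 0.0
    for month in range(12):
        effective = (HOURS - invest_hours) * qwl_effect
        total += effective * RATE - HOURS * COST
        qwl_effect += invest_hours * qwl_gain_per_hour  # QWL compounds slowly
    return total

print(yearly_profit(0.0, 0.004))  # no QWL investment: 19200.0
print(yearly_profit(2.0, 0.004))  # 2 h/month invested: 22171.2, pays off by year end
```

Every invested hour produces a visible monthly loss, which is why the myopic strategy of equation (4) rejects it; only the cumulative yearly view reveals the gain.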

The Bayesian stochastic, strategic, non-symmetric signaling learning game follows a Markov decision process [20–22]. The management-game forms the stochastic game tuple

$$\left[\text{N}, \text{S}, \text{C}, \text{A}, \text{T}, \text{P}, \text{R}\right] \tag{6}$$

where

N is the set of players, i.

S is the set of states, s.

C is the set of competences at actions a.

A is the set of actions, a.

T is the set of signals, τ.

P is the transition probability function, P: S × A × C, thus P(s, c, a); ρ: S × A | C → Δ is the transition function, where Δ is the set of probability distributions over the state space S.

R is the reward function, R = r1, …, rn, γ: S × A | C → R.

There is incomplete but perfect information. The agents (workers and leader) do not know other agents' payoff functions in detail, but they can observe other agents' immediate payoffs and actions from past months. A leader does not know exactly which actions would be the best but can choose actions that should be good enough. The leader gets the workers' emotional feedback immediately, together with information on the monthly profit change and the cumulative reward. After several game rounds, the player (leader) will learn the optimal actions that improve both the QWL and the annual profit. Thus, the player will achieve the Nash equilibrium of the stochastic Markov learning game.
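A minimal sketch of the tuple as a typed container, with hypothetical type choices for each element:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str    # e.g. "jan_ok", "feb_qwl_problem" (illustrative encodings)
Action = str
Signal = str

@dataclass
class ManagementGame:
    """Stochastic game tuple [N, S, C, A, T, P, R] of equation (6)."""
    players: List[str]                    # N
    states: List[State]                   # S
    competences: Dict[Action, float]      # C, competence level at each action
    actions: List[Action]                 # A
    signals: List[Signal]                 # T
    transition: Callable[[State, Action], Dict[State, float]]  # P: distribution
    #   over next states (competence folded into the action here)
    reward: Callable[[State, Action], Tuple[float, ...]]       # R = (r1, ..., rn)
```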

The leadership-game Q-learning function is (7).
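As a hedged reconstruction (the chapter names the Nash Q-learning approach in Section 4), the standard update for player i has the following form; the chapter's exact notation for (7) may differ:

$$
Q^i_{t+1}(s, a^1, \ldots, a^n) = (1 - \alpha_t)\, Q^i_t(s, a^1, \ldots, a^n) + \alpha_t \left[ \gamma^i_t + \beta \, \mathrm{Nash}Q^i_t(s') \right] \tag{7}
$$

where NashQ^i_t(s′) is player i's payoff in a Nash equilibrium of the stage game at the next state s′, αt is the learning rate as in (5), and β is a discount factor (an assumption of this reconstruction).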

**5. Management-game Markov sequences**

The management-game has context-specific Markov sequences. State and state-change transitions follow the Markov property, where the future is independent of the past given the current situation. Once a state is defined, its change is determined by the behavior of the parties. State change is sequential, following the players' actions and the state transition probability function. The sequences are the following (a simulation sketch of one monthly round is given after the list):

First Month (January)

1. Workers interpret the state situation and give signals based on prior belief (τ).

2. Leader observes the signals and updates the signal-strategy (πτ).

3. Leader updates the standard-strategy (πst). Note: in this first month there is no data to update this year's profit strategy.

4. Leader makes actions (or decides to do nothing) (a).

5. Actions lead to a state change, with possible outside intervention (stochastic).

6. Leader observes immediate (γ€1) and cumulative (γΣ€) profit rewards (or sacrifices). From now on, the leader also gets the profit outcome, and thus updates the profit strategy as well.

7. According to the combination of rewards, the leader updates prior beliefs concerning own behavior.

8. Leader updates the profit-strategy and standard-strategy for choosing actions t + 1.

9. Workers give signals to be considered when deciding actions t + 1.

10. Leader updates the signal-strategy for choosing actions t + 1.

11–13. Leader makes actions t + 1 in line with all three strategies.

14. State transition to state t + 1.

15… From now on, the supervisor should update all three strategies simultaneously as learning sequences progress (**Figures 5** and **6**).
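The simulation sketch referenced above compresses one monthly round (items 1–14) into code; every strategy representation, name, and number is a hypothetical illustration:

```python
import random

# Minimal sketch of one monthly Markov sequence (items 1-14 above).
def game_round(month: int, state: str, strategies: dict, cum_profit: float):
    signal = "QWL_threat" if state == "problem" else "ok"               # 1. workers signal
    strategies["signal"][signal] = "act" if signal != "ok" else "wait"  # 2. update pi_tau
    strategies["standard"].setdefault(state, "observe")                 # 3. update pi_st
    action = strategies["signal"][signal]                               # 4. act (or do nothing)
    state = random.choice(["ok", "problem"])                            # 5. stochastic state change
    monthly = 1500.0 if action == "act" else 1600.0                     # 6. immediate reward (gamma_e1)
    cum_profit += monthly                                               #    cumulative reward (gamma_sum)
    if month > 1:                                                       # 3./6. in January there is no
        strategies["profit"][state] = action                            #    data for the profit strategy
    return state, cum_profit                                            # 14. transition to t + 1

strategies = {"signal": {}, "standard": {}, "profit": {}}
state, cum = "ok", 0.0
for month in range(1, 13):                                              # 15. repeat over the year
    state, cum = game_round(month, state, strategies, cum)
```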

