**5. Management-game Markov sequences**

Management-game has context-specific Markov sequences. State and state-change transitions follow the Markov property: the future is independent of the past given the current situation. Once a state is defined, its change is determined by the behavior of the parties. The state change is sequential, following the players' actions and the state transition probability function.

*Deep Learning Applications*

The leader's reward function (γL) is the combination of the monthly profit change and the expected effect on future profit. πLi is the leader's strategy at the current state (month). At the beginning, the leader's strategy is weighted towards the monthly profit, and the expected future reward is based on a simple linear regression of the data achieved so far. This means a biased prior belief, where the expected reward is not nearly the same as the outcome of the strategy. Thus, the value function under the biased strategy is the following:

$$\gamma_{L} = \gamma_{t\text{€}} + \gamma_{12\text{€}} \tag{4}$$

where

γt€ is the observed state reward.

γ12€ is the expected future reward from improving the QWL.

When the leader gains more experience and learns to understand the complexity of the system, as well as the meaning of the workers' QWL, the prior-belief value function changes. The change in QWL becomes more interesting, because the leader learns to expect more future profit when the QWL improves. Thus, along with this information, the leader adjusts the strategy to optimize the cumulative yearly profit. Here the stochastic nature of the leadership game is the key to learning the Nash general-sum equilibrium between the QWL and profit:

$$\gamma_{L} = \alpha_{t} \left( \gamma_{t\text{€}} + \gamma_{12\text{€}} \right) \tag{5}$$

where

αt is the learning rate.

γt€ is the observed state reward. γ12€ is the expected future reward.

QWL is improved by leadership actions that reduce the monthly working time used for making the revenue. Thus, improving the QWL reduces the monthly revenue and profit, but it may increase effective working time in the future and so increase the future profit. On a monthly basis this phenomenon may be contradictory and confusing, but with practice the best reward is achieved where both the workers' and the leader's payoff functions flourish. This means the Nash equilibrium where the yearly QWL is improved with high profit. In the Nash equilibrium, the leader's choices are the best response to the workers' choices.

This Bayesian stochastic strategic non-symmetric signaling learning game follows a Markov decision process [20–22]. Management-game forms the stochastic game tuple

$$\left[ N, S, C, A, T, P, R \right] \tag{6}$$

where

N is the set of players, i. S is the set of states, s.

A is the set of actions, a. T is the set of signals, τ.

C is the set of competences at actions a.

R is the reward function, R = r1, …, rn; γ: S × A | C → ℝ.

P is the transition probability function, P: S × A × C, thus P(s, c, a); ρ: S × A | C → Δ is the transition function, where Δ is the set of probability distributions over the state space S.

There is incomplete but perfect information. The agents (workers and the leader) do not know the other agents' payoff functions in detail, but they can observe the other agents' behavior: the workers' signals and the business's cumulative outcome at the end of the year.

The sequences are the following:

First Month (January)



11–13. The leader takes actions at t + 1 in line with all three strategies.

14. State transition to state t + 1.

15. From now on, the supervisor should update all three strategies simultaneously as the learning sequences progress (**Figures 5** and **6**).
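The monthly loop above (its earlier steps precede this excerpt) can be sketched as a minimal simulation step. This is an illustrative sketch only: the state representation, the strategy interface, and all names here are assumptions, not the chapter's implementation.

```python
def monthly_sequence(state, strategies, transition):
    """One Markov step of the management game (illustrative sketch).

    state      -- current month's state s_t (any summary object)
    strategies -- the three strategies (profit, signal, standard), each
                  mapping a state to a proposed action (assumed interface)
    transition -- state transition function giving s_{t+1}
    """
    # Steps 11-13: the leader takes actions in line with all three strategies.
    actions = {name: pi(state) for name, pi in strategies.items()}
    # Step 14: state transition to state t + 1.
    next_state = transition(state, actions)
    # Step 15: here all three strategies would be updated as learning progresses.
    return next_state, actions

# Toy stand-ins showing the assumed interfaces:
strategies = {
    "profit": lambda s: "push_revenue",
    "signal": lambda s: "listen_to_workers",
    "standard": lambda s: "plan_ahead",
}
transition = lambda s, a: s + 1  # months advance deterministically in this toy
state, actions = monthly_sequence(0, strategies, transition)
```

In a full simulation the transition would be stochastic, drawn from the distribution Δ over the state space S defined by the game tuple.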

The leadership-game Q-learning function is (7):

$$Q_{t+1}\left(s, c, a\right) = \left(1 - \alpha_{t}\right) Q_{t}\left(s, c, a\right) + \alpha_{t} \left[ \gamma_{\Delta t} + \gamma_{\Delta 12} + \beta \max_{a} Q_{t}\left(s_{t+1}, c, a\right) \right] \tag{7}$$

where

β ∈ [0, 1] is the discounted reward factor.

αt ∈ [0, 1] is the learning rate; (1 − αt) weights the previous Q-value.

γ∆t is the monthly profit reward.

γ∆12 is the expected cumulative profit reward (floating 12 months).

max<sub>a</sub> Q<sub>t</sub>(s<sub>t+1</sub>, c, a) is the expected maximum equilibrium-strategy state-change pay from the best actions a at competence levels c.
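Update rule (7) translates directly into code. The sketch below is a minimal rendering under the definitions above; the dictionary Q-table keyed by (state, competence, action) tuples and all parameter names are assumed representations, not the chapter's software.

```python
def q_update(Q, s, c, a, s_next, actions, gamma_dt, gamma_d12,
             alpha=0.1, beta=0.9):
    """One leadership-game Q-learning step, following Eq. (7).

    Q         -- Q-table: dict mapping (state, competence, action) -> value
    gamma_dt  -- monthly profit reward
    gamma_d12 -- expected cumulative profit reward (floating 12 months)
    alpha     -- learning rate in [0, 1]
    beta      -- discounted reward factor in [0, 1]
    """
    # max_a Q_t(s_{t+1}, c, a): expected maximum pay over the next actions.
    best_next = max(Q.get((s_next, c, a2), 0.0) for a2 in actions)
    target = gamma_dt + gamma_d12 + beta * best_next
    # (1 - alpha) keeps the previous estimate; alpha mixes in the new target.
    Q[(s, c, a)] = (1 - alpha) * Q.get((s, c, a), 0.0) + alpha * target
    return Q[(s, c, a)]

# Single-step example with an empty table (best_next is therefore 0):
Q = {}
v = q_update(Q, s="jan", c="coach", a="improve_qwl", s_next="feb",
             actions=["improve_qwl", "push_revenue"],
             gamma_dt=100.0, gamma_d12=50.0, alpha=0.5, beta=0.9)
```

With an empty table the update reduces to α(γ∆t + γ∆12), i.e. 0.5 × 150 = 75 in this example, which matches a hand calculation of Eq. (7).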

**Figure 6.** *Management game learning phenomenon for finding equilibrium.*

With the expected equilibrium-strategy pay max<sub>a</sub> Q<sub>t</sub>(s<sub>t+1</sub>, c, a), the Q-learning points can be calibrated so that the function gives approximately 0 points when no actions are taken, that is, when no learning is achieved. In our Q-learning function this value is a 221 € monthly improvement value per employee. This corresponds to the cost of one absence day per month for each worker, or one additional working day in work efficiency. Using this value, Q-learning gives 0 points regardless of the supervisor's skills.
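One way to read this calibration is as a baseline subtraction from the monthly reward. Only the 221 € value comes from the text; the interpretation, function name, and interface below are illustrative assumptions.

```python
BASELINE_EUR = 221.0  # the chapter's monthly improvement value per employee

def calibrated_points(improvement_eur_per_employee, n_employees=1):
    """Q-learning reward points after baseline calibration (sketch).

    Returns approximately 0 points when the monthly improvement does not
    exceed the baseline (one absence day per month per worker, or one day
    of work efficiency), regardless of the supervisor's skills.
    """
    return (improvement_eur_per_employee - BASELINE_EUR) * n_employees
```

Under this reading, a supervisor whose actions produce exactly the baseline improvement scores 0 points, and only improvement beyond the baseline earns positive Q-learning points.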


*The Digital Twin of an Organization by Utilizing Reinforcing Deep Learning*
*DOI: http://dx.doi.org/10.5772/intechopen.96168*

**6. Management learning strategies**

There are three different strategic areas of prior beliefs that form the manager's learning context. These strategies are influenced by the supervisor's interaction skills (competences), which tend to either promote or hinder learning in the area. Every manager has personal competences, which seem to form a personal Nash equilibrium and corresponding Q-learning results. According to this article, the Nash equilibrium seems to be different for each combination of the manager's competences. In addition, the leader's strategic mind-set defines the equilibrium. Indeed, the management equilibrium seems to be an evolving phenomenon, depending on the organization and the change of its players' characteristics (**Figure 7**).

**Figure 7.**
*Management learning strategies.*

The profit-strategy's (π€) focus is to learn from experience how the target profit is achieved in the anticipated time span. Economic profit indicators are usually constantly monitored, giving them a lot of attention. In addition, the organization's profit target time span is determined by the management system, which creates a certain predefined attitude towards achieving profit. From a strategic point of view, there is a big difference between focusing on the maximum result this month and aiming for the maximum profit with a delay of several months. If a management system requires maximum results over a short period of time, it reinforces the detrimental profit-maximization bias. Under this bias, the team leader tends to push the workers' performance too hard, which leads to a maximized performance that is declining. In addition, a manager under this bias neglects employee signals, because the signals pose a risk that short-term profits are threatened when scarce working hours are used to solve the problem. Clearly, this behavior damages the signal game, as employees learn that problems are not worth reporting.

The focus of the signal-strategy (πτ) is to learn to understand employee signals and utilize them to achieve the best reward. This strategy is strongly related to the psychological agreement between the workers and the supervisor. When working-team members learn to play a general-sum game, the signals are provided early and in a constructive way, which fosters optimal actions. In case the signal-strategy turns into a zero-sum game, the signals tend to be hidden or used to harm other members of the team. Thus, creating the best foundation for the signal-strategy is grounded on continuous fostering of the psychological agreement in the working society.

The focus of the standard-strategy (πst) is to learn how to plan actions in advance to secure the reward in the future. Usually this strategy comes from the
