**2.3. Learning algorithm for LPN**

Using LPN to model a dynamic system, the system state is modeled as the Petri net marking, i.e. the set of colored tokens in all places of the Petri net, and a change of the system state (i.e. a system action) is modeled as the firing of transitions. Some parameters of the system can be expressed as token number and color, arc weight function, transition delay time, and so on. For example, different system signals are expressed as tokens of different colors. When the system is modeled, some parameters are unknown or uncertain, so these parameters are initially set randomly. As the system runs, the system parameters are obtained gradually and appropriately through the system's interaction with the environment and the effect of RL.

An example of an LPN model is shown in Figure 1. Using LPN, a mapping of input tokens to output tokens is obtained. For example, in Figure 1, colored tokens *Cij* (*i*=1; *j*=1, 2, …, *n*) are input to *P1* by *Trinput*. There are *n* weight functions *W*(<*C1j*>, *VWC1j,1,j*) on the same arc *F1,j*. The value *VWCij,i,j* determines which of the weight functions in *W*(<*Cij*>, *VWCij,i,j*) token *C1j* obeys in order to fire a transition. After token *C1j* passes through arc *Fi,j* (*i*=1; *j*=1, 2, …, *n*), one of the transitions *Tri,j* (*i*=1; *j*=1, 2, …, *n*) fires and generates tokens *Cij* (*i*=2; *j*=1, 2, …, *n*) in *P2*. After *P2* holds colored tokens *Cij* (*i*=2; *j*=1, 2, …, *n*), *Tri,j* (*i*=2; *j*=1, 2, …, *n*) fires and differently colored tokens *Cij* (*i*=3; *j*=1, 2, …, *n*) are generated. In this way, a mapping *C1j* – *C3j* is obtained. At the same time, a reward is received from the environment according to whether the *C3j* generated from *C1j* accords with the system rule. These rewards are propagated to every *VWCij,i,j* and adjust the *VWCij,i,j*. After training, the LPN is able to express a correct mapping of input-output tokens.

In LPN, there are two kinds of parameters. One is a discrete parameter: the arc's weight function, which describes the input and output colored tokens of a transition. The other is a continuous parameter: the delay time of transition firing. Now, we will discuss the two kinds of parameters, which are learnt using RL.

*2.3.1. Discrete parameter learning*

In LPN, RL is used to adjust *VW* and *VT* through interaction with the environment. RL can learn the optimal policy of a dynamic system by observing the environment state and improving its behavior through trial and error with the environment. The RL agent senses the environment and takes actions; it receives numeric rewards and punishments from some reward function. The agent learns to choose actions that maximize a long-term sum or average of the future reward it will receive.

The arc weight function learning algorithm is based on Q-learning, a kind of RL [18]. In the arc weight function learning algorithm, *VWCij,i,j* is first set randomly, so the weight function obeyed on an arc is initially arbitrary. When the system runs, formula (1) is used to update *VWCij,i,j*:

$$VW_{C_{ij},\,i,\,j} = VW_{C_{ij},\,i,\,j} + \alpha\left[r + \gamma\!\left(\overline{VW_{C_{i+1,j},\,i+1,\,j}}\right) - VW_{C_{ij},\,i,\,j}\right] \tag{1}$$

where,

i. *α* is the step-size and *γ* is the discount rate;

ii. the discounted value of the succeeding weight function, *γ*(*VWCi+1,j,i+1,j*), is computed by formula (2):


$$\gamma\!\left(\overline{VW_{C_{i+1,j},\,i+1,\,j}}\right) = \gamma^{\,t}\,\overline{VW_{C_{i+1,j},\,i+1,\,j}} \tag{2}$$

where *t* is the time taken for <*Ci+1,j*> to be generated by *W*(<*Cij*>, *VWCij,i,j*).
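To make the update concrete, here is a minimal Python sketch of formulas (1)-(2). It rests on two assumptions that are ours rather than the chapter's: the overlined term is read as the maximum value among the next stage's weight functions, as in standard Q-learning, and the weight function values are stored in a plain dictionary keyed by (token, i, j).

```python
def update_weight_value(vw, key, next_keys, reward, elapsed, alpha=0.1, gamma=0.9):
    """TD update of an arc weight function value, following formulas (1)-(2).

    vw        : dict mapping (token, i, j) -> current value VW
    key       : the entry being updated
    next_keys : entries of the next stage (i+1) reachable from this one
    reward    : immediate reward r from the environment
    elapsed   : time t needed to generate the next token, used in formula (2)
    """
    # Overlined term of formula (1): best value among the next stage's weight functions
    # (our reading; assumption, not stated explicitly in the chapter).
    best_next = max((vw[k] for k in next_keys), default=0.0)
    # Formula (2): discount that value by gamma ** t.
    discounted_next = (gamma ** elapsed) * best_next
    # Formula (1): move VW toward the one-step target.
    delta = alpha * (reward + discounted_next - vw[key])
    vw[key] += delta
    return delta  # Algorithm 1, step ii: this amount is fed back upstream as a reward
```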

When every weight function of an input arc of the transition has obtained a value, each transition has a value for its action, and the policy for action selection needs to be considered. The simplest action selection rule is to select the action with the highest estimated state-action value, i.e. the transition corresponding to the maximum *VWCij,i,j*. This action is called a greedy action. If a greedy action is selected, the learner (agent) exploits its current knowledge. If it selects one of the non-greedy actions instead, the agent explores in order to improve its policy. Exploitation is the right thing to do to maximize the expected reward on a single play, whereas exploration may produce the greater total reward in the long run. Here, a near-greedy selection rule called the ε-greedy method is used for action selection; i.e., an action is selected randomly with a small probability ε, and the action with the biggest *VWCij,i,j* is selected with probability 1−ε. Now, we show the algorithm of LPN, which is listed in Table 1.

Algorithm 1. Weight function learning algorithm

**Step 1.** Initialization: Set all *VWij* and *r* of all input arcs' weight functions to zero.

**Step 2.** Initialize the learning Petri net, i.e. set the Petri net state to *M0*.

Repeat i) and ii) until the system reaches an end state.

i. When a place receives a colored token *Cij*, a choice must be made as to which arc weight function is obeyed, among the functions whose domain includes this token. The choice follows the ε-greedy selection policy (ε is set by the user according to the execution environment, usually 0 < ε << 1). A: Select the function with the biggest *VWCij,i,j* with probability 1−ε;

B: Select a function randomly with probability *ε*.

ii. The transition correlated with the selected function fires and a reward is observed. Adjust the weight function value using *VWCij,i,j* = *VWCij,i,j* + *α*[*r* + *γ*(*VWCi+1,j,i+1,j*) − *VWCij,i,j*]. At the same time, *α*[*r* + *γ*(*VWCi+1,j,i+1,j*) − *VWCij,i,j*] is fed back to the weight function that generated *Cij* as its reward for the next time.

**Table 1.** Weight function learning algorithm
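Step i of this algorithm is an ε-greedy choice among the weight functions that accept the arriving token. Below is a minimal sketch of that choice in Python; the representation of candidate weight functions as dictionary keys, the default `epsilon`, and the toy example values are all assumptions made for illustration.

```python
import random

def epsilon_greedy(candidates, vw, epsilon=0.05):
    """Step i of Algorithm 1: choose which arc weight function the arriving token obeys.

    candidates : keys of the weight functions whose domain includes the token
    vw         : dict mapping each key to its learned value VW
    epsilon    : small exploration probability (0 < epsilon << 1)
    """
    if random.random() < epsilon:                   # B: explore with probability epsilon
        return random.choice(list(candidates))
    return max(candidates, key=lambda k: vw[k])     # A: exploit with probability 1 - epsilon

# Example with assumed values: three weight functions on arc F1,j compete for token C1j.
# vw = {("C1j", "w1"): 0.2, ("C1j", "w2"): 0.8, ("C1j", "w3"): 0.1}
# chosen = epsilon_greedy(vw.keys(), vw)
```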


*2.3.2. Continuous parameter learning*

The delay time of a transition is a continuous variable, so delay time learning is a problem of RL in continuous action spaces. There are several methods for RL in continuous spaces, such as the discretization method, the function approximation method, and so on [4]. Here, both the discretization method and the function approximation method are used for delay time learning in LPN.

*Discretization method*

As shown in Figure 2 (i), the transition *tr1* has a delay time *t1*. When *p1* has a token <*tokenn*>, the system is in the state in which *p1* holds a token, and transition *tr1* is enabled. Because *tr1* has a delay time *t1*, *tr1* does not fire immediately. After time *t1* passes and *tr1* fires, the token in *p1* is taken out and this state is terminated. Thus, during the delay time of *tr1*, the state in which *p1* holds a token continues.

Because the delay time is a continuous variable, the different delay times are discretized so that RL can be used to optimize the delay time. For example, *tr1* in Figure 2 (i) has an undefined delay time *t1*. *Tr1* is discretized into several different transitions which have different delay times (shown in Figure 2 (ii)), and every delay time has a value item *Q*. After *Tr1* fires with delay time *t1i*, it receives a reward *r* immediately or after its subsequent transitions receive rewards. The value of *Q* is updated by formula (3).

$$Q(P, Tr) \leftarrow Q(P, Tr) + \alpha\left[r + \gamma\, Q(P', Tr') - Q(P, Tr)\right] \tag{3}$$

where *Q*(*P*, *Tr*) is the value of transition *Tr* at Petri net state *P*, *Q*(*P'*, *Tr'*) is the value of transition *Tr'* at the next state *P'* of *P*, *α* is a step-size and *γ* is a discount rate.

**Figure 2.** Transformation from the high-level Petri net to the learning model: (i) the high-level time Petri net model; (ii) the discretization learning model for the delay time

After *Q* is renewed, the optimal delay time can be selected. In Figure 2 (ii), when *tr11*, …, *tr1n* have obtained the values *Q11*, …, *Q1n* respectively, the transition is selected by the soft-max method according to a probability given by the Gibbs distribution:

$$\Pr\{t_t = t \mid p_t = p\} = \frac{e^{\beta Q(p,t)}}{\sum_{b \in A} e^{\beta Q(p,b)}} \tag{4}$$

where Pr{*tt* = *t* | *pt* = *p*} is the probability of selecting transition *t* at state *p*, *β* is a positive inverse temperature constant and *A* is the set of available transitions.
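The selection probabilities of formula (4) can be computed in a few lines of Python. This is only a sketch; the dictionary representation of the available transitions and the example values are assumptions made for illustration.

```python
import math

def gibbs_probabilities(q, beta=1.0):
    """Selection probabilities of formula (4) for the available transitions.

    q    : dict mapping each available transition (set A) to its value Q(p, b)
    beta : positive inverse temperature; a larger beta makes the choice greedier
    """
    weights = {tr: math.exp(beta * value) for tr, value in q.items()}
    total = sum(weights.values())
    return {tr: w / total for tr, w in weights.items()}

# Example with assumed values: tr12 receives the largest selection probability.
# print(gibbs_probabilities({"tr11": 0.2, "tr12": 1.5, "tr13": 0.7}, beta=2.0))
```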

Now, we give the learning algorithm for the delay time of LPN using the discretization method; it is listed in Table 2.

Transition's delay time learning algorithm 1 (Discretization method):

**Step 1.** Initialization: discretize the delay time and set *Q*(*p*, *t*) of every transition's delay time to zero.

**Step 2.** Initialize the Petri net, i.e. set the Petri net state to *P1*. Repeat (i) and (ii) until the system reaches an end state.

i. Select a transition using formula (4).

ii. After the transition fires and the reward is observed, the value of *Q*(*p*, *t*) is adjusted using formula (3).

**Step 3.** Repeat Step 2 until *t* is optimal as required.

**Table 2.** Delay time learning algorithm using the discretization method
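The algorithm of Table 2 can be sketched compactly in Python under simplifying assumptions of our own: the environment is reduced to a reward function of the chosen delay time, the episode is one step long, and the names `DELAY_TIMES`, `reward_of` and `q` are illustrative rather than taken from the chapter.

```python
import math
import random

# Hypothetical discretization of tr1's delay time (cf. Figure 2 (ii)); values are assumed.
DELAY_TIMES = [0.5, 1.0, 1.5, 2.0, 2.5]

def reward_of(delay):
    """Stand-in for the environment's reward signal; assumed for the example only."""
    return -abs(delay - 1.5) + random.gauss(0.0, 0.05)

def gibbs_select(q, beta=2.0):
    """Step (i): soft-max selection over the discretized transitions, formula (4)."""
    weights = [math.exp(beta * q[t]) for t in DELAY_TIMES]
    return random.choices(DELAY_TIMES, weights=weights, k=1)[0]

def learn_delay_time(episodes=500, alpha=0.1):
    q = {t: 0.0 for t in DELAY_TIMES}       # Step 1: every Q(p, t) starts at zero
    for _ in range(episodes):               # Step 3: repeat until t is good enough
        t = gibbs_select(q)                 # Step (i): select a transition by formula (4)
        r = reward_of(t)                    # Step (ii): fire it and observe the reward
        # Formula (3); the next state is terminal in this one-step example,
        # so the gamma * Q(P', Tr') term is zero.
        q[t] += alpha * (r - q[t])
    return max(q, key=q.get)                # discretized delay time with the highest value

# print(learn_delay_time())
```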

*Function approximation method*


First, the transition delay time is selected randomly and executed. The value of the delay time is obtained using formula (3). When the system has been executed *m* times, the data (*ti*, *Qi*(*p,ti*)) (*i* = 1, 2, …, *m*) are obtained. The relation between the value *Q* of the delay time and the delay time *t* is supposed to be *Q* = *F*(*t*). Using the least squares method, *F*(*t*) is obtained as follows. It is supposed that *F* is a function class constituted by polynomials, and that formula (5) holds.

$$f(t) = \sum\_{k=0}^{n} a\_k t^k \in F \tag{5}$$

The data (*ti*, *Qi*(*p,ti*)) are substituted in formula (5). Then:

$$f(t_i) = \sum_{k=0}^{n} a_k t_i^k \quad (i = 1, 2, \dots, m;\ m \ge n) \tag{6}$$

Here, the number *m* of data points (*ti*, *Qi*(*p,ti*)) is not less than the degree *n* of the polynomial in formula (5). According to the least squares method, we have (7).

$$\|\delta\|^2 = \sum_{i=1}^{m} \delta_i^2 = \sum_{i=1}^{m} \left[\sum_{k=0}^{n} a_k t_i^k - Q_i\right]^2 \to \min \tag{7}$$

In fact, (7) is the problem of finding the minimum of function (8).

$$\|\delta\|^2 = \sum_{i=1}^{m} \left[\sum_{k=0}^{n} a_k t_i^k - Q_i\right]^2 \tag{8}$$

Setting the partial derivative of (8) with respect to each coefficient *aj* to zero, equations (9) and (10) are obtained from (8).

$$\frac{\partial \|\delta\|^2}{\partial a_j} = 2\sum_{i=1}^{m} \left[\sum_{k=0}^{n} a_k t_i^k - Q_i\right] t_i^{\,j} = 0 \quad (j = 0, 1, \dots, n) \tag{9}$$


$$\sum_{k=0}^{n} \left(\sum_{i=1}^{m} t_i^{\,j+k}\right) a_k = \sum_{i=1}^{m} t_i^{\,j} Q_i \quad (j = 0, 1, \dots, n) \tag{10}$$

The solution *a0*, *a1*, …, *an* of equation (10) can be deduced, and *Q* = *f*(*t*) is obtained. The delay time *t\*opt* which maximizes *Q* = *f*(*t*) is the expected optimal delay time; its candidates satisfy the stationarity condition (11).

$$\frac{\partial f(t)}{\partial t} = 0 \tag{11}$$

The multiple solutions *t* = *topt* (*opt* = 1, 2, …, *n*−1) of (11) are checked with function (5), and the *t\*opt* ∈ {*topt*} which makes *f*(*t\*opt*) = max *f*(*topt*) (*opt* = 1, 2, …, *n*−1) is the expected optimal delay time. *t\*opt* is then used as the delay time, the system is executed, and a new *Q*(*p*, *t\*opt*) is obtained. This (*t\*opt*, *Q*(*p*, *t\*opt*)) is used as new data, and the least squares method can be applied again to acquire a more precise delay time.
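The fit of formulas (5)-(10) and the maximization in (11) can be sketched with numpy. This is a minimal sketch under our own assumptions: the sampled (*ti*, *Qi*) pairs are already available, numpy's least-squares routine replaces the hand-solved normal equations (10), and the boundary of the sampled range is also checked along with the stationary points.

```python
import numpy as np

def fit_optimal_delay(ts, qs, degree=3):
    """Fit Q = f(t) as a degree-n polynomial (formulas (5)-(10)) and
    return the candidate t that maximizes f(t) (formula (11)).

    ts, qs : sequences of observed delay times t_i and their values Q_i(p, t_i)
    """
    ts = np.asarray(ts, dtype=float)
    qs = np.asarray(qs, dtype=float)
    # Least-squares fit; equivalent to solving the normal equations (10).
    coeffs = np.polyfit(ts, qs, degree)
    poly = np.poly1d(coeffs)
    # Formula (11): stationary points of f(t), kept if real and inside the sampled range.
    candidates = [r.real for r in poly.deriv().roots
                  if abs(r.imag) < 1e-9 and ts.min() <= r.real <= ts.max()]
    candidates += [ts.min(), ts.max()]   # also consider the ends of the sampled range
    return max(candidates, key=poly)     # t*_opt with the largest fitted Q

# Example with assumed data: the fitted optimum lies near t = 1.5.
# ts = [0.5, 1.0, 1.5, 2.0, 2.5]; qs = [0.1, 0.6, 0.9, 0.7, 0.2]
# print(fit_optimal_delay(ts, qs, degree=2))
```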

After the values of the actions are obtained, the soft-max method is selected as the action selection policy. We then give the learning algorithm for the delay time of the Learning Petri net using the function approximation method; it is listed in Table 3.

Transition's delay time learning algorithm 2 (Function approximation method):

**Step 1.** Initialization: Set *Q*(*p*, *t*) of every transition's delay time to zero.

**Step 2.** Initialize the Petri net, i.e. set the Petri net state to *P1*.

Repeat (i) and (ii) until the system reaches an end state.

i. Randomly select the transition delay time *t*.

ii. After the transition fires and the reward is observed, the value of *Q*(*p*, *t*) is adjusted using formula (3).

**Step 3.** Repeat Step 2 until adequate data have been collected. Then evaluate the optimal *t* using the function approximation method.

**Table 3.** Delay time learning algorithm using the function approximation method

A simulation of an AIBO voice command recognition and control system will be constructed for making AIBO understand several human voice commands in Japanese and English and take the corresponding actions. The simulation system is developed on Sony AIBO's OPEN-R (Open Architecture for Entertainment Robot) [19]. The architecture of the simulation system is shown in Figure 3. Because there are English and Japanese voice commands for the same AIBO action, the partnerships of voice and action are established in part (4). The duration of an AIBO action is learnt in part (5). After an AIBO action finishes, the rewards for the correctness of the action and for the action duration are given by the touch of different AIBO sensors.

**Figure 3.** System architecture of voice command recognition

*LPN model for AIBO voice command recognition system*

In the LPN model for the AIBO voice command recognition system, AIBO action changes and action times are modeled as transitions and transition delays, respectively. The human voice command is modeled by tokens of different colors. The LPN model is shown in Figure 4. The meaning of every transition is listed below: *Trinput* changes the voice signal into colored tokens which describe the voice characteristics. *Tr11*, *Tr12* and *Tr13* analyze the voice signal. *Tr1* generates 35 different tokens *VL1*…*VL35* according to the voice length. *Tr2* generates 8 different tokens *E21*…*E28* according to the energy characteristic of the front twenty voice samples. *Tr3* generates 8 different tokens *E41*…*E48* according to the energy characteristic of the front forty voice samples [8]. These three types of token are compounded into a compound token *<VLl> + <VE2m> + <VE4n>* in *p2* [12].

*Tr2j* generates the different voice tokens. The input arc's weight function is *((<VLl>+<VE2m>+<VE4n>), VWVlmn,2j)* and the output arc's weight function is a different voice token. A voice token then generates a different action token through *Tr3j*. When *Pr4* – *Pr8* hold tokens, AIBO's action lasts. *Tr4j* takes the token out of *p4* – *p8* and makes the corresponding AIBO action terminate. *Tr4j* has a delay time *DT4i*, and every *DT4i* has a value *VT4i*. The transition adopts the delay time *DT4i* according to *VT4i*.

When the system begins running, it cannot yet recognize the voice commands. A voice command arrives and is changed into a compound token in *p2*. This compound token randomly generates a voice token, which is put into *p3*. This voice token randomly arouses an action token. A reward for action correctness is received, and then *VW* and *VT* are updated. For example, a compound colored token *(<VLl>+ <VE2m> + <VE4n>)* fires *Tr21* and a colored token
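To make the token flow concrete, the sketch below shows how a compound voice token might be mapped to an action token using learned weight values and an ε-greedy choice. The token tuples, the action list and the reward handling are purely illustrative assumptions, not details taken from the chapter or from OPEN-R.

```python
import random

ACTIONS = ["sit", "stand", "walk"]   # hypothetical AIBO action tokens

def choose_action(vw, compound_token, epsilon=0.1):
    """Map a compound voice token (<VLl>+<VE2m>+<VE4n>) to an action token."""
    values = {a: vw.get((compound_token, a), 0.0) for a in ACTIONS}
    if random.random() < epsilon:          # explore occasionally
        return random.choice(ACTIONS)
    return max(values, key=values.get)     # otherwise exploit the learned weights

def update(vw, compound_token, action, reward, alpha=0.1):
    """Strengthen the chosen mapping when the action matches the voice command."""
    key = (compound_token, action)
    vw[key] = vw.get(key, 0.0) + alpha * (reward - vw.get(key, 0.0))

# vw = {}
# token = ("VL3", "E22", "E45")           # an assumed compound voice token
# a = choose_action(vw, token); update(vw, token, a, reward=1.0)
```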

*Results of simulation*
