**4.1. Decision rule**

$$V\big(t_0, x(t_0)\big) = \min_{u} \left\{ \int_{t_0}^{t_0 + \Delta t} f\big(t, x(t), u(t)\big)\,dt + V\big(t_0 + \Delta t, x(t_0) + \Delta x\big) \right\} \tag{42}$$

where Δ*x* = *x*(*t*<sup>0</sup> + Δ*t*) − *x*(*t*0). Discretizing via Taylor series expansion, we get

$$V\big(t_0, x_0\big) = \min_{u} \left\{ f\big(t_0, x_0, u_0\big)\Delta t + V\big(t_0, x_0\big) + V_t\big(t_0, x_0\big)\Delta t + V_x\big(t_0, x_0\big)\Delta x + \cdots \right\} \tag{43}$$

Canceling *V*(*t*0, *x*0) on both sides, dividing by Δ*t*, and letting Δ*t* →0 (so that Δ*x*/Δ*t* →*g*(*t*, *x*, *u*)), we have

$$-V_t\big(t, x\big) = \min_{u} \left\{ f\big(t, x, u\big) + V_x\big(t, x\big)\, g\big(t, x, u\big) \right\} \tag{44}$$

with boundary condition

$$V\big(T, x(T)\big) = 0. \tag{45}$$

**Theorem 1 [8]:** Let [*t*0, *t*1] denote the range of time over which a sequence of controls is applied, and for any *t*, such that *t*<sup>0</sup> ≤*t* ≤*t*1, let the set Λ*t*,*x*(*t*) be not empty, as the restriction of the control to the time interval [*t*, *t*1] is feasible for *x*(*t*). Then, for any process (*x*(*t*), *u*(*t*)) and any *t*<sup>0</sup> ≤*τ*<sup>1</sup> ≤*τ*<sup>2</sup> ≤*t*1:

$$V\big(\tau_1, x(\tau_1)\big) \le \int_{\tau_1}^{\tau_2} f\big(t, x(t), u(t)\big)\,dt + V\big(\tau_2, x(\tau_2)\big). \tag{46}$$

*Proof:* Let *u*\* be any optimal control in Λ*τ*2,*x*(*τ*2), where *u*\* is defined on [*τ*2, *t*1], and let the concatenated control ū be defined on [*τ*1, *t*1] and given by

$$\bar{u}(t) = \begin{cases} u(t), & \tau_1 \le t \le \tau_2, \\ u^*(t), & \tau_2 \le t \le t_1. \end{cases} \tag{47}$$

Then, ū ∈Λ*τ*1,*x*(*τ*1). Hence,

$$V\big(\tau_1, x(\tau_1)\big) \le \int_{\tau_1}^{\tau_2} f\big(t, x(t), \bar{u}(t)\big)\,dt + V\big(\tau_2, x(\tau_2)\big), \tag{48}$$

which, since ū coincides with *u* on [*τ*1, *τ*2], establishes Eq. (46).
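To make the recursion in Eqs. (42)–(45) concrete, the sketch below discretizes time, state, and control on small grids and computes the value function backward from the boundary condition (45). The particular running cost *f*, dynamics *g*, and grid sizes are illustrative assumptions, not quantities from this chapter.

```python
import numpy as np

# Illustrative stand-ins for the f(t, x, u) and g(t, x, u) of Eqs. (42)-(45):
# a quadratic running cost and simple linear dynamics dx/dt = g(t, x, u).
def f(t, x, u):
    return x**2 + u**2

def g(t, x, u):
    return -0.5 * x + u

T, n_t = 1.0, 51                 # horizon and number of time steps
ts = np.linspace(0.0, T, n_t)
dt = ts[1] - ts[0]
xs = np.linspace(-2.0, 2.0, 81)  # state grid
us = np.linspace(-1.0, 1.0, 21)  # control grid

V = np.zeros((n_t, xs.size))     # boundary condition (45): V(T, x) = 0

# Backward recursion, the discrete analogue of Eq. (42):
# V(t, x) = min_u { f(t, x, u) dt + V(t + dt, x + g(t, x, u) dt) }
for k in range(n_t - 2, -1, -1):
    t = ts[k]
    for i, x in enumerate(xs):
        x_next = x + g(t, x, us) * dt             # one Euler step per control
        V_next = np.interp(x_next, xs, V[k + 1])  # interpolate V(t + dt, .)
        V[k, i] = np.min(f(t, x, us) * dt + V_next)

i0 = int(np.argmin(np.abs(xs - 1.0)))
print(f"V(0, x = {xs[i0]:.2f}) ~ {V[0, i0]:.4f}")
```

Interpolation is needed because an Euler step generally lands between grid points; refining the grids tightens the approximation to Eq. (44) at the cost of run time.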

A successful sequential decision requires a decision rule that prescribes a procedure for action selection in each state at a specified decision epoch. This is a well-known strategy in the field of operations research. Moreover, problems of decision-making under uncertainty are best modeled as Markov decision processes (MDPs) [8]. When a rule depends on the previous states and actions of the system only through the current state or action, it is said to be Markovian, and it is deterministic if it chooses an action with certainty [8]. Conversely, a decision rule that depends on the full past history of the system is known as "history dependent". In general, an MDP can be expressed as a process that


follows a time-homogeneous, finite-state, and finite-action semi-MDP (SMDP), defined by the following elements (a small data-structure sketch follows the list):

**i.** *P*(*xt*+1 | *ut*, *xt*), *t* =0, 1, 2, ⋯, transition probability;

**ii.** *P*(*rt* | *ut*, *xt*), reward probability; and

**iii.** *P*(*ut* | *xt*) =*π*(*ut* | *xt*), policy.
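As a rough illustration, the three elements above can be held as arrays for a small finite SMDP; the state and action counts below are arbitrary, and an expected reward stands in for the full reward probability of (ii), both simplifying assumptions.

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# (i) transition probability P(x_{t+1} | u_t, x_t):
# P[u, x, :] is a distribution over next states.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# (ii) expected reward R[u, x], a simplified stand-in for P(r_t | u_t, x_t).
R = rng.random((n_actions, n_states))

# (iii) policy pi(u_t | x_t): pi[x, :] is a distribution over actions.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Every row must be a probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
assert np.allclose(pi.sum(axis=1), 1.0)
```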

This implies that, although the system state may change several times between decision epochs, only the state at a decision epoch is relevant to the decision maker. Consider the stochastic process *x*0, *x*1, *x*2, ..., where *xt* and *x*(*t*) may be used interchangeably. Note that we are considering optimal control of a discrete-time Markov process with a finite time horizon *T*, where the Markov process *x* takes values in some measurable space Ω. In what follows, assume that we have a sequence of controls *u*0, *u*1, *u*2, ...,

where *ut* is the action taken by the decision maker at time *t* =0, 1, 2, ⋯ and takes values in some measurable space *U* of allowable controls. The decision rule is described by considering a class of randomized history-dependent strategies consisting of a sequence of functions

$$d_n = \left(d_0, d_1, \ldots, d_{T-1}\right), \tag{50}$$
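In code, such a strategy is naturally a list of functions, one per decision epoch, each mapping the observed history to a distribution over actions. A minimal sketch, where the type aliases and the uniform rule are hypothetical choices:

```python
from typing import Callable, Tuple
import numpy as np

# A history is the alternating record of states and actions observed so far;
# a randomized decision rule maps it to a distribution over actions.
History = Tuple[int, ...]
DecisionRule = Callable[[History], np.ndarray]

def uniform_rule(n_actions: int) -> DecisionRule:
    """A trivial randomized rule: ignore the history and weight
    every action equally."""
    def rule(history: History) -> np.ndarray:
        return np.full(n_actions, 1.0 / n_actions)
    return rule

T = 5
# The strategy of Eq. (50): one decision rule per epoch 0, ..., T-1.
d = [uniform_rule(n_actions=2) for _ in range(T)]

action_probs = d[0]((0,))  # action distribution given the history (x0,)
```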

and also by considering the following sequence of events: at each decision epoch *t*, the decision maker observes the state *xt*, selects an action *ut* according to the decision rule *dt*, receives a reward *rt*, and the system then moves to state *xt*+1 with transition probability *P*(*xt*+1 | *ut*, *xt*). The history available at epoch *t* is the record of past states and actions, *ht <sup>d</sup>* =(*x*0, *u*0, ⋯, *ut*−1, *xt*) ∈*Ht <sup>d</sup>* .
The basic problem, therefore, is to find a policy *π* =(*d*0, *d*1), consisting of *d*0 and *d*1, that minimizes the objective functional *J*(*x*0) =*∫ f* (*x*1, *d*1(*x*1)) *P*(*x*<sup>1</sup> | *x*0, *d*0(*x*0)), where the policy is given by *P*(*ut* | *xt*) =*π*(*ut* | *xt*). Hence, we set *μt <sup>π</sup>* :*Ht <sup>d</sup>* →*R*<sup>1</sup> to denote the total expected reward obtained by using Eq. (50) at decision epochs *t*, *t* + 1, ⋯, *T* −1. Assuming that the history at decision epoch *t* is *ht <sup>d</sup>* <sup>∈</sup>*Ht <sup>d</sup>* , the expected reward under the decision rule satisfies, for *t* < *T*,

$$\mu_t^{\pi}\left(h_t^d\right) = E_{h_t}^{\pi}\left[\sum_{k=t}^{T-1} r_k\left(x_k, u_k\right) + \eta_T\left(x_T\right)\right] \tag{51}$$
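When the decision rules are Markovian, the expectation in Eq. (51) can be computed exactly by backward induction over the horizon. A sketch under that assumption, reusing the hypothetical `P`, `R`, and `pi` arrays from the SMDP example above and taking the terminal reward *ηT* to be zero:

```python
import numpy as np

def expected_reward(P, R, pi, T):
    """Backward induction for Eq. (51) with a Markovian policy:
    mu[t, x] = E[ sum_{k=t}^{T-1} r_k(x_k, u_k) | x_t = x ],
    with the terminal reward eta_T taken as zero."""
    n_actions, n_states, _ = P.shape
    mu = np.zeros((T + 1, n_states))  # mu[T] holds the terminal reward
    for t in range(T - 1, -1, -1):
        for x in range(n_states):
            # Average over actions drawn from pi and successors drawn from P.
            mu[t, x] = sum(
                pi[x, u] * (R[u, x] + P[u, x] @ mu[t + 1])
                for u in range(n_actions)
            )
    return mu

mu = expected_reward(P, R, pi, T=10)
print("Expected 10-step reward from each state:", mu[0])
```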

In particular, if the SMDP elements (i)–(iii) are stationary, then, for a given policy *π* and an initial state *x*, the future rewards can be estimated. Let *V <sup>π</sup>*(*x*) be the value function; then, the expected discounted return can be measured as

$$V^{\pi}\left(x\right) = E\left[\sum_{t=0}^{\infty} \theta^{t} r_t \mid x_0 = x; \pi\right] \tag{52}$$

where *θ* ∈(0, 1) is the discount factor.
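In this stationary finite case, Eq. (52) need not be estimated by simulation: the value of a fixed policy satisfies the linear system *V* =*rπ* +*θ PπV*, which can be solved directly. A sketch, again over the hypothetical finite SMDP arrays used above and an assumed discount factor:

```python
import numpy as np

def discounted_value(P, R, pi, theta):
    """Solve (I - theta * P_pi) V = r_pi for the policy value of Eq. (52)."""
    n_states = P.shape[1]
    r_pi = np.einsum("xu,ux->x", pi, R)    # policy-averaged one-step reward
    P_pi = np.einsum("xu,uxy->xy", pi, P)  # policy-averaged transition matrix
    return np.linalg.solve(np.eye(n_states) - theta * P_pi, r_pi)

V = discounted_value(P, R, pi, theta=0.9)
print("Discounted values V(x):", V)
```

Solving the linear system is exact for small state spaces; Monte Carlo rollouts of Eq. (52) become preferable only when the state space is too large to enumerate.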

However, the entire cast of players involved in oil spill control (the contingency planners, response officials, government agencies, pipeline operators, tanker owners, etc.) shares a keen interest in being able to anticipate oil spill response costs for planning purposes, according to Arapostathis et al. [9]. This means that the type of decision and/or action chosen at a given point in time is a function of the clean-up cost. In other words, the clean-up/response cost is a key indicator for the optimal control. Thus, to set the pace for rapid response, it is important to introduce cost concepts into the control paradigm, as discussed in the next section.
