Centralized access has received less attention than decentralized access in cognitive radio research in general and in the application of MDPs in particular. On the one hand, decentralized access constitutes a harder research challenge because each agent has only partial and sometimes unreliable information about the wireless network and the spectrum bands, which leads to the harder class of POMDP problems. On the other hand, although centralized access relies on a spectrum broker that generally has full information about the system state, the dimension of the problem increases proportionally to the total number of managed channels. Therefore, although the MDP or CMDP problem may be solvable, its dimension imposes a serious computational overhead. This drawback may be overcome with an off-line computation of the policies. However, when traffic conditions are non-stationary this approach is not applicable, and approximate solutions based on reinforcement learning strategies should be explored. In this work we focus on the application of MDPs to centralized access and how it can be exploited to balance the GoS of each class of user.

**3.3 Other applications**

Other applications of MDPs have been found within the framework of cognitive radio. In Hoang et al. (2010), the authors propose an algorithm based on a finite-horizon MDP to schedule the duration of spectrum sensing periods and data transmission periods at the cognitive users, aiming to improve their throughput. Berthold et al. (2008) formulate the spectral resource detection problem as an MDP, allowing the cognitive users to select the frequency bands with the most available resources. Galindo-Serrano and Giupponi (2010) deal with the problem of aggregated interference generated by multiple cognitive radios at the receivers of primary (licensed) users. The problem is formulated as a POMDP and solved heuristically by means of an approximate dynamic programming method known as distributed Q-learning. In this work we highlight another application of MDPs: dynamic trading of spectrum bands. While this issue has typically been addressed with a game-theoretic approach, we explore the use of MDP and CMDP formulations to balance benefit and grade of service (GoS) for primary users in a centralized spectrum access framework.

**4. System model**

In this section we consider two models for coordinated spectrum access. In the first one, secondary users are accepted or rejected according to an admission policy that considers only the impact on the blocking probability of primary users; in this first model there is a trade-off between the blocking probabilities of licensed and unlicensed users. The second model includes a spectrum bidding procedure, in which secondary users offer a price, drawn from a finite set of prices for mathematical tractability, for the use of a channel. In this second model the trade-off appears between the blocking probability of licensed users and the expected benefit obtained from spectrum rental.

**4.1 Priority-based access**

This access is based only on priority, not on bidding price, *i.e.* licensed users are given higher priority than secondary users. The objective is to minimize the blocking probability of licensed users while also keeping that of unlicensed users low. The general rule is that primary users are always accepted if there are available channels but, depending on the available channels, the controller can deny access to secondary users. Once a secondary user occupies a channel, it is this user who decides when to release it; it cannot be removed by the controller.

There are several approaches to address this type of problem. One of them is to formulate an MDP where the expected cost is obtained as a linear combination (more precisely, a convex combination) of the blocking probabilities of each class of users. By adjusting the weighting factors we can compute a Pareto front for both blocking probabilities. A Pareto front is defined as the set of values of several coupled objective functions such that, at every point of the set, one objective cannot be improved without worsening the remaining objective values. In this type of access, the Pareto front allows us to fix a blocking probability value for the licensed users and to know the best possible performance for the unlicensed users.

Incoming traffic is characterized by a classic Poisson model. Licensed users arrive with a rate of *λ<sup>L</sup>* arrivals per unit of time; the arrival rate for unlicensed users is denoted by *λ<sup>U</sup>*. The licensed spectrum managed by the central controller is assumed to be divided into channels (or bands) of equal bandwidth, and each user occupies a single channel. The average holding times for licensed and unlicensed users are given by 1/*μ<sup>L</sup>* and 1/*μ<sup>U</sup>* respectively, where *μ<sup>L</sup>* and *μ<sup>U</sup>* denote the departure rates for each class. Because a Poisson traffic model is considered, both the inter-arrival times and the channel holding times are exponentially distributed random variables for both user classes. The model can easily be extended by including more user classes, the probability that a user occupies two or more channels, and so on. Essentially the procedure is the same, but the Markov chain would comprise more states as more features are added to the model. In this model, the state of the Markov chain is determined by the number of channels *k* occupied by licensed users (LU) and the number of channels *s* occupied by secondary users (SU). Because spectrum is a limited resource, there is a finite number *N* of channels. Figure 1 depicts a diagram of the model and its parameters. Note that we can map all the possible combinations of (*k*,*s*) for 0 ≤ *k* ≤ *N*, 0 ≤ *s* ≤ *N* and *k* + *s* ≤ *N* to a single integer *i* such that

$$0 \le i < \frac{N(N+1)}{2} + N + 1. \tag{4}$$

The number on the right-hand side of (4) is the total number of states. Let *NT* denote this number.
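As a quick check, the state space and the mapping from (*k*, *s*) to a single index *i* can be enumerated directly. This is a minimal sketch; the helper name and the particular enumeration order are our own choices, since the text does not fix one:

```python
def enumerate_states(N):
    """Enumerate all states (k, s) with k + s <= N and map each to an index i."""
    states = [(k, s) for k in range(N + 1) for s in range(N + 1 - k)]
    index = {ks: i for i, ks in enumerate(states)}  # i = 0, ..., NT - 1
    return states, index

N = 10
states, index = enumerate_states(N)
NT = N * (N + 1) // 2 + N + 1  # total number of states, from (4)
assert len(states) == NT == (N + 1) * (N + 2) // 2
```

For *N* = 10 this gives *NT* = 66 states, matching (4).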

Fig. 1. Diagram of the priority based access model. The system has *N* channels that can be occupied by *k* licensed users (LU) and *s* secondary users (SU) such that *k* + *s* ≤ *N*. The total departure rates for each type of users depend on *k* and *s*.

The model described above consists of a continuous-time Markov chain. In the framework of MDPs we have to define the actions and the costs of these actions. Let *g*(*i*, *u*) denote the instantaneous cost of taking action *u* at state *i*. In the system considered, action *u* is simply defined as

$$u = \begin{cases} 1 & \text{, if the incoming user is not accepted} \\ 0 & \text{, otherwise} \end{cases} \tag{5}$$

The above formula refers only to unlicensed users. It is assumed that licensed users are always accepted unless all channels are occupied. The function *g*(*i*, *u*) is given by the convex combination of two per-stage cost functions, *i.e.* *g*(*i*, *u*) = *αgL*(*i*, *u*) + (1 − *α*)*gU*(*i*, *u*), where

$$g_L(i,u) = \begin{cases} 1 & \text{, if } i \equiv (k,s) \text{ and } k+s=N \\ 0 & \text{, otherwise} \end{cases} \tag{6}$$

where the symbol "≡" denotes equivalence, *i.e.* *i* maps to a state (*k*,*s*) such that *k* + *s* = *N*. Similarly,

$$g_U(i,u) = \begin{cases} 1 & \text{, if } i \equiv (k,s) \text{ and } k+s=N \\ u & \text{, otherwise} \end{cases} \tag{7}$$

These functions determine the blocking probability per unit of time for each class of users. Note that the blocking probability is defined as the probability that the system does not provide a channel to an incoming user. The objective is to find a policy such that, for a relative importance given to each cost (determined by *α*), the expected average value of the combined cost is minimized. The function to minimize is then given by

$$\lim_{K \to \infty} \frac{1}{E\{t_K\}} E\left\{ \int_0^{t_K} g(x(t), u(t)) \, dt \right\} \tag{8}$$

where *tK* is the completion time of the *K*-th transition. The problem can be solved by formulating its auxiliary discrete-time average cost problem. Let *γ* be a scalar greater than the transition rate at any state of the chain, *i.e.* *γ* > *vi*(*u*). We can compute the transition probabilities *p̃i*,*j*(*u*) of the auxiliary discrete-time problem from the probabilities *pi*,*j*(*u*) of the original problem as

$$\tilde{p}\_{i,j}(u) = \begin{cases} \frac{v\_i(u)}{\gamma} p\_{i,j}(u) & \text{, if } i \neq j \\ 1 - \frac{v\_i(u)}{\gamma} & \text{, if } i = j \end{cases} \tag{9}$$

It is known (see Bertsekas (2007)) that if the scalar *λ* and the vector *h̃* satisfy

$$\tilde{h}(i) = \min_{u \in \{0, 1\}} \left[ g(i,u) - \lambda + \sum_{j=1}^{N_T} \tilde{p}_{ij}(u) \tilde{h}(j) \right] \quad i = 1, \dots, N_T \tag{10}$$

then *λ* and the vector *h* with components *h*(*i*) = *γh̃*(*i*) solve the original problem. It can be anticipated that the structure of this problem, essentially a connection admission control problem, requires a threshold-type solution in which incoming unlicensed users are only admitted into the system if the number of occupied channels is below a certain threshold.
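The procedure above can be sketched numerically: build the uniformized chain of (9) and apply relative value iteration to the Bellman equation (10) for the priority-based model. This is a minimal sketch only; the modeling of rejected arrivals as self-transitions, the stopping rule, and the helper names are our own assumptions, and relative value iteration is used here in place of an exact solve:

```python
import numpy as np

def solve_priority_mdp(N, lam_L, lam_U, mu_L, mu_U, alpha, tol=1e-10, max_iter=200_000):
    """Relative value iteration for the uniformized priority-based access MDP."""
    states = [(k, s) for k in range(N + 1) for s in range(N + 1 - k)]
    idx = {ks: i for i, ks in enumerate(states)}
    NT = len(states)
    gamma = lam_L + lam_U + N * max(mu_L, mu_U)   # gamma > v_i(u) for all i, u
    P = np.zeros((2, NT, NT))                     # uniformized chain, cf. (9)
    g = np.zeros((2, NT))                         # per-stage cost from (6)-(7)
    for i, (k, s) in enumerate(states):
        full = (k + s == N)
        for u in (0, 1):                          # u = 1 rejects secondary users
            g[u, i] = alpha * float(full) + (1 - alpha) * (1.0 if full else float(u))
            if not full:
                P[u, i, idx[(k + 1, s)]] += lam_L / gamma       # LU arrival
                if u == 0:
                    P[u, i, idx[(k, s + 1)]] += lam_U / gamma   # SU admitted
            if k > 0:
                P[u, i, idx[(k - 1, s)]] += k * mu_L / gamma    # LU departure
            if s > 0:
                P[u, i, idx[(k, s - 1)]] += s * mu_U / gamma    # SU departure
            P[u, i, i] += 1.0 - P[u, i].sum()                   # self-loop
    h = np.zeros(NT)                              # relative values, cf. (10)
    lam = 0.0
    for _ in range(max_iter):
        Th = (g + P @ h).min(axis=0)
        lam, h_new = Th[0], Th - Th[0]            # reference state: (0, 0)
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    policy = (g + P @ h).argmin(axis=0)           # 1 where SUs are rejected
    return states, policy, lam
```

With the scenario-1 parameters of Table 1 (*λ<sup>L</sup>* = 30, *λ<sup>U</sup>* = 10, *μ* = 5, *N* = 10) and *α* = 0, only the cost of unlicensed users matters, the policy admits every user, and the average cost approaches the Erlang loss probability *E*(10, 8) ≈ 0.12; for larger *α*, the resulting policy exhibits the threshold structure the text anticipates.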

### **4.2 Auction-based access**

As explained in the introduction, public administrations assign spectrum bands to wireless operators by a license scheme. Generally, operators gain spectrum licenses by bidding for them in public auction processes. We refer to this spectrum assignment framework as the primary market. The increasing demand for spectrum and the existence of spectrum holes have revealed the inefficiency of this mechanism. One practical and economically feasible way to solve this inefficiency is to allow spectrum owners to sell their spectrum opportunities in a secondary market. In contrast to the primary market, the secondary market operates in real time. Secondary users, which may be operators without a spectrum license, submit their bids for spectrum opportunities to the spectrum owner, who determines the winner or winners, giving them access to the band and charging them the bidding price.

The arrival processes are modeled, as in the previous subsection, as independent Poisson processes. The arrival rates for licensed and unlicensed users are *λ<sup>L</sup>* and *λ<sup>U</sup>* respectively, and the service rates are *μ<sup>L</sup>* and *μ<sup>U</sup>*. Again, it is assumed that each incoming user occupies a single channel. The system state is given by the number of primary users *k* and secondary users *s* holding a channel: (*k*,*s*) for 0 ≤ *k* ≤ *N*, 0 ≤ *s* ≤ *N* and *k* + *s* ≤ *N*. Each state is mapped into an integer *i* ≡ (*k*,*s*), so that *i* = 0, 1, . . . , *NT* − 1, where *NT* is given by (4). For mathematical tractability, the bidding prices are restricted to a finite set of values **B** = {*b*1, *b*2, . . . , *bm*}, given in money charged per unit of time. Each price in this set has a probability *pi*, *i* = 1, . . . , *m*, of being offered by an incoming user. Obviously, ∑<sup>*m*</sup><sub>*i*=1</sub> *pi* = 1. Figure 2 depicts the model described.

Fig. 2. Diagram of the auction based access model. Secondary users (SU) can offer up to *m* different bid prices. Each bid offer is assigned a probability. The access policy decides upon each offer according to the price offered and the system's state.

In this case, the objective of the MDP is to obtain the maximum economic profit with the minimum impact on the licensed users. The control *u* at each stage determines the admitted and rejected bidding prices. Logically, the control should be defined as a threshold, *i.e.* when *u* = *i* only bids equal to or above *bi* are admitted. For notational convenience, the control *u* = *m* + 1 indicates that no bid is accepted. The per-stage reward function *g*(*i*, *u*) is given by the linear combination of *gL*(*i*, *u*) (defined in the previous subsection) and *gU*(*i*, *u*), defined in this model as the expected benefit at stage *i* when decision *u* is made. Therefore *g*(*i*, *u*) = *αgL*(*i*, *u*) + *βgU*(*i*, *u*), where the scalars *α* and *β* are weighting factors. Note that *β* < 0, since the objective is to minimize the average expected cost given by *g*(*i*, *u*). Let *Bi* denote the expected income when an unlicensed user whose bidding price is *bi* is accepted. Since the average channel holding time for unlicensed users is 1/*μ<sup>U</sup>*, then *Bi* = *bi*/*μ<sup>U</sup>*. Given a control *u*, *P* (*r*|*u*) denotes the conditional probability that the bidding price of the next accepted secondary user is *br*.

$$P\left(r|u\right) = \begin{cases} \frac{p\_r}{\sum\_{j=u}^{m} p\_j} & \text{, if } r \ge u\\ 0 & \text{, otherwise} \end{cases} \tag{11}$$

Let us define *g̃U*(*i*, *u*, *j*) as the average benefit associated with the transition from state *i* to state *j*. Its expression is

$$\tilde{g}_U(i,u,j) = \begin{cases} p_U \sum_{r=1}^{m} B_r P(r|u) & \text{, if } j = i+1 \\ 0 & \text{, otherwise} \end{cases} \tag{12}$$

where *pU* = *λ<sup>U</sup>*/(*λ<sup>U</sup>* + *λ<sup>L</sup>*) denotes the probability that the next arrival corresponds to a secondary user. Therefore, the per-stage benefit *gU*(*i*, *u*) is given by

$$g_U(i,u) = \sum_{j=1}^{N_T} \tilde{g}_U(i,u,j) \, p_{i,j}(u) = p_{i,i+1}(u) \, p_U \sum_{r=1}^{m} B_r P(r|u). \tag{13}$$

We can formulate the auxiliary discrete-time average cost problem for the model described. The equation providing the optimum average cost *λ* is

$$\tilde{h}(i) = \min_{u \in \{1, \dots, m+1\}} \left[ \alpha g_L(i,u) + \beta g_U(i,u) v_i(u) - \lambda + \sum_{j=1}^{N_T} \tilde{p}_{ij}(u) \tilde{h}(j) \right] \tag{14}$$

for *i* = 1, . . . , *NT*. The structure of this problem also anticipates a threshold-type solution; in this case, there will be a set of thresholds, one per bidding price. By properly adjusting the weighting factors *α* and *β* we can also compute a Pareto front, allowing us to determine the maximum possible benefit for a given blocking objective for the licensed users.
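The quantities in (11)–(13) are straightforward to compute. The sketch below evaluates, for each threshold control *u*, the conditional bid distribution *P*(*r*|*u*) and the expected benefit factor of (12); the function name and the price values in the usage example are illustrative assumptions:

```python
def bid_quantities(b, p, mu_U, lam_U, lam_L):
    """Per-control expected income for the auction model, following (11)-(12).

    b: bid prices b_1..b_m (money per unit time); p: their probabilities.
    Control u (1-based) admits only bids b_r with r >= u; u = m + 1 rejects all.
    """
    m = len(b)
    B = [bi / mu_U for bi in b]              # expected income per accepted bid, B_i
    p_U = lam_U / (lam_U + lam_L)            # next arrival is a secondary user
    results = {}
    for u in range(1, m + 1):
        tail = sum(p[u - 1:])                # sum_{j=u}^{m} p_j in (11)
        P_r = [(p[r - 1] / tail if r >= u else 0.0) for r in range(1, m + 1)]
        # expected benefit factor from (12): p_U * sum_r B_r P(r|u)
        results[u] = p_U * sum(Br * Pr for Br, Pr in zip(B, P_r))
    results[m + 1] = 0.0                     # no bid accepted
    return results

# Illustrative values: three prices, mu_U = 5, symmetric arrival rates.
income = bid_quantities([1.0, 2.0, 4.0], [0.5, 0.3, 0.2], 5.0, 10.0, 10.0)
```

A higher threshold *u* raises the expected income per accepted bid while admitting fewer users, which is exactly the benefit/GoS trade-off the Pareto front captures.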

## **4.3 Constrained MDP**

So far, the approach to merging several objectives has consisted of combining them into a single objective by means of a weighted sum and solving the problem as a conventional MDP. However, as explained in Section 2, when several objectives concur in an MDP problem, the formulation strategy may consist of optimizing one of them subject to constraints on the other objectives. This strategy results in a CMDP formulation of the problem. Solving MDPs by iterative methods such as policy or value iteration allows us to find deterministic policies, *i.e.* policies that associate each system state *i* ∈ *S* with a single control *u* ∈ *U*(*i*), where *U*(*i*) is the subset of *U* containing the controls allowed in state *i*. However, these policies do not, in general, solve CMDP problems. Instead, the solution of a CMDP is a randomized policy, defined as a function that associates each state with a probability distribution defined over the elements of *U*(*i*).

There are mainly two approaches to solving CMDPs: linear programming (LP) and Lagrangian relaxation of the Bellman equation. This work follows the former. The LP formulation relies on the use of the *dual* variables *φ* (*i*, *u*), defined as the stationary probability that the system is in state *i* and chooses action *u* under a given randomized stationary policy. The problems addressed here result, under every stationary policy, in a truncated birth-death process, since primary users are always accepted. In consequence, every resulting Markov chain is *irreducible*; in other words, it is recurrent and there are no transient states. Moreover, the state and action spaces are finite. Under these circumstances, as shown in Puterman (2005), every feasible solution of the LP problem corresponds to some randomized stationary policy. Therefore, if the constrained problem is feasible, then there exists an optimal randomized stationary policy.

The LP approach consists of expressing the objective and the constraints in terms of *φ* (*i*, *u*). Once the problem is discretized, the average cost is defined as

$$\lambda = \lim\_{K \to \infty} \frac{1}{K} E\left\{ \sum\_{k=0}^{K} g\left(\mathbf{x}\_k, \boldsymbol{\mu}\_k\right) \right\} \tag{15}$$

where *k* denotes the decision epoch of the process. The objective is to find the policy *μ* solving

$$\min\_{\mu} \lambda \tag{16}$$

The constraints are defined similarly to the main objective: each constraint imposes a bound on an average cost related to a different per-stage cost. Each constraint has the following form:

$$c = \lim_{K \to \infty} \frac{1}{K} E\left\{ \sum_{k=0}^{K} c\left(x_k, u_k\right) \right\} \le \beta \tag{17}$$

where *c* (*xk*, *uk*) is the real-valued function providing the per-stage cost associated with the constraint bound *β*. Therefore, the constrained average-reward MDP with one constraint is defined as

$$\begin{array}{l} \min \lambda\\ \text{s.t.}\\ c \le \beta \end{array} \tag{18}$$

Given the characteristics of the problem (finite state and action spaces and recurrent Markov chain under every policy), the limits in (15) and (17) exist and are equal to

$$\lambda = \sum\_{i \in S} \sum\_{u \in \mathcal{U}(i)} \operatorname{g} \left( i, u \right) \phi \left( i, u \right) \tag{19}$$

and


$$c = \sum_{i \in S} \sum_{u \in U(i)} c\left(i, u\right) \phi\left(i, u\right) \tag{20}$$

respectively. In addition, the following conditions must hold for the *dual* variables:

$$\sum_{u \in U(j)} \phi\left(j, u\right) = \sum_{i \in S} \sum_{u \in U(i)} p_{i,j}\left(u\right) \phi\left(i, u\right) \tag{21}$$

for all *j* ∈ *S*, which is closely related to the balance equations of the Markov chain and

$$\sum\_{i \in \mathcal{S}} \sum\_{u \in \mathcal{U}(i)} \phi \left( i, u \right) = 1,\tag{22}$$

which, together with *φ* (*i*, *u*) ≥ 0 for *i* ∈ *S* and *u* ∈ *U*(*i*), corresponds to the definition of *φ* (*i*, *u*) as a limiting average state-action frequency. In consequence, the LP for the CMDP has the following formulation

$$\begin{aligned} \min_{\phi} \ & \sum_{i \in S} \sum_{u \in U(i)} g\left(i, u\right) \phi\left(i, u\right) \\ \text{s.t.} \ & \sum_{i \in S} \sum_{u \in U(i)} c\left(i, u\right) \phi\left(i, u\right) \le \beta \\ & \sum_{u \in U(j)} \phi\left(j, u\right) - \sum_{i \in S} \sum_{u \in U(i)} p_{i,j}\left(u\right) \phi\left(i, u\right) = 0 \\ & \sum_{i \in S} \sum_{u \in U(i)} \phi\left(i, u\right) = 1 \\ & \phi\left(i, u\right) \ge 0 \end{aligned} \tag{23}$$

Assuming that the problem is feasible and *φ*∗ is the optimal solution of the LP problem above, the stationary randomized optimal policy *μ*∗ is generated by

$$q_{\mu^*(i)}\left(u\right) = \frac{\phi^*\left(i, u\right)}{\sum_{u' \in U(i)} \phi^*\left(i, u'\right)} \tag{24}$$

for states where the sum in the denominator is nonzero. Otherwise, the state is transient and the control is irrelevant. Note that *qμ*∗(*i*) (*u*) denotes the probability of choosing action *u* at state *i* under policy *μ*∗.

Using the approach above in the problems described in the previous section is straightforward:

The three scenarios are summarized in Table 1.

| **parameter** | **scenario 1** | **scenario 2** | **scenario 3** |
| --- | --- | --- | --- |
| *λ<sup>L</sup>* (calls/h) | 30 | 20 | 10 |
| *λ<sup>U</sup>* (calls/h) | 10 | 20 | 30 |
| *μ<sup>L</sup>* = *μ<sup>U</sup>* (calls/h) | 5 | 5 | 5 |
| *N* | 10 | 10 | 10 |

Table 1. Parameter values for the three scenarios of the priority-based access problem.

If the system accepted every incoming user, the total blocking probability would be given by the well-known Erlang's B formula (see Kleinrock (1975)):

$$E(n, \rho) = \frac{\rho^n / n!}{\sum_{j=0}^{n} \rho^j / j!} \tag{25}$$

where *n* is the number of channels and *ρ* denotes the utilization factor; in our case *ρ* = *λ*/*μ<sup>L</sup>* = *λ*/*μ<sup>U</sup>*, with *λ* the total arrival rate. According to this formula, if the system accepted every incoming user, the total blocking probability would be *E*(10, 8) = 0.12. As we will see, this probability is an upper bound for the blocking probability of the primary users, which are always accepted if the system has any available channel, and a lower bound for that of the secondary users.

First, we show in Fig. 3 the Pareto front obtained by means of an MDP where the blocking costs of licensed and unlicensed users were merged by means of a convex combination. The Pareto front was obtained by solving each MDP problem for 10000 values of the *α* parameter ranging from 0.01 to 1.

Fig. 3. Pareto fronts obtained for the priority-based access in scenario 1 (a), scenario 2 (b) and scenario 3 (c). Each panel plots the LU blocking probability against the SU blocking probability.

All three scenarios receive the same total traffic intensity. However, when the traffic intensity of the primary users is smaller, the Pareto front is closer to both axes, *i.e.* the performance of both the primary and the secondary users improves. This is an expected result, since only the traffic of secondary users is controlled by the access policy. When the optimization affects a higher portion of the total amount of traffic, the improvement is also more noticeable, showing the benefits of the MDP formulation.

The Pareto fronts obtained by means of the CMDP formulation in the previous scenarios are identical to those shown in Fig. 3, showing that both formulations are equivalent in terms of finding the Pareto front for the priority-based access problem. The only difference lies in practical considerations. The CMDP approach allows us to find a policy with a predefined
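The LP in (23) and the policy recovery in (24) can be sketched numerically for the priority-based model, minimizing the LU blocking probability subject to a bound on the average SU blocking cost. This is a minimal sketch under our own assumptions: SciPy's `linprog` (HiGHS) as the solver, the uniformized transition model with rejected arrivals as self-transitions, and illustrative parameter values:

```python
import numpy as np
from scipy.optimize import linprog

def solve_priority_cmdp(N, lam_L, lam_U, mu_L, mu_U, beta):
    """LP of (23): minimize LU blocking s.t. average SU cost <= beta."""
    states = [(k, s) for k in range(N + 1) for s in range(N + 1 - k)]
    idx = {ks: i for i, ks in enumerate(states)}
    NT = len(states)
    gamma = lam_L + lam_U + N * max(mu_L, mu_U)
    P = np.zeros((2, NT, NT))   # uniformized chain; u = 1 rejects SUs
    gL = np.zeros((2, NT))      # objective per-stage cost (LU blocking), cf. (6)
    gU = np.zeros((2, NT))      # constrained per-stage cost (SU blocking), cf. (7)
    for i, (k, s) in enumerate(states):
        full = (k + s == N)
        for u in (0, 1):
            gL[u, i] = 1.0 if full else 0.0
            gU[u, i] = 1.0 if full else float(u)
            if not full:
                P[u, i, idx[(k + 1, s)]] += lam_L / gamma
                if u == 0:
                    P[u, i, idx[(k, s + 1)]] += lam_U / gamma
            if k > 0:
                P[u, i, idx[(k - 1, s)]] += k * mu_L / gamma
            if s > 0:
                P[u, i, idx[(k, s - 1)]] += s * mu_U / gamma
            P[u, i, i] += 1.0 - P[u, i].sum()
    # Variables phi(i, u), flattened as [phi(0,0), phi(0,1), phi(1,0), ...].
    c = np.array([[gL[u, i] for u in (0, 1)] for i in range(NT)]).ravel()
    A_ub = np.array([[gU[u, i] for u in (0, 1)] for i in range(NT)]).ravel()[None, :]
    # Balance equations (21) and normalization (22).
    A_eq = np.zeros((NT + 1, 2 * NT))
    for j in range(NT):
        for i in range(NT):
            for u in (0, 1):
                A_eq[j, 2 * i + u] -= P[u, i, j]
        A_eq[j, 2 * j] += 1.0
        A_eq[j, 2 * j + 1] += 1.0
    A_eq[NT, :] = 1.0
    b_eq = np.zeros(NT + 1)
    b_eq[NT] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=[beta], A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    phi = res.x.reshape(NT, 2)
    # Randomized policy of (24): probability of rejecting SUs in each state.
    with np.errstate(invalid="ignore", divide="ignore"):
        reject_prob = np.where(phi.sum(1) > 0, phi[:, 1] / phi.sum(1), 0.0)
    return res, reject_prob
```

Sweeping the bound *β* and reading off the optimal objective value traces out the same Pareto front as the weighted-sum MDP, which is how the equivalence of the two formulations can be checked in practice.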

