Given the conditions for the limit in (2) to exist, the *optimal* average cost can be obtained by solving the following Bellman's equation

$$h(i) = \min_{u \in U} \left[ g(i, u) - \lambda + \sum_{j=1}^{N} p_{ij}(u)\, h(j) \right], \qquad i \in S \qquad (3)$$

with the condition *h*(*n*) = 0. It is known (see Bertsekas (2007)) that the previous equation has a unique solution and that the stationary policy *μ* attaining the minimum on the right side of (3) is an optimal policy. *h*(*i*) is known as the relative or differential cost of state *i*. It represents the minimum, over all policies, of the difference between the expected cost to reach *n* from *i* for the first time and the cost that would be incurred if the cost per stage were equal to the average *λ* at all states.

There are several computational methods for solving the Bellman equation: the value iteration algorithm, the policy iteration algorithm and the linear programming method provide exact solutions to the problem (see Bertsekas (2007) and Puterman (2005)). However, when the dimension of the sets *S* and *U* is relatively large, the problem becomes so complex that solving it exactly may be computationally intractable. This is known as the *curse of dimensionality* in dynamic programming. In some situations we are not even able to compute all the transition probabilities *pij*(*u*) of the model, so obtaining an exact solution is impossible. For these cases, multiple approximate methods have been developed within the framework of approximate dynamic programming (see Powell (2005)) or reinforcement learning.

There are several variations of the MDP problem. One of the most important concerns the time horizon over which the process is assumed to operate. It may be finite, when the optimization is done over a finite number of stages, or infinite, when the number of stages is assumed to be infinite. The latter type of problem presents some theoretical difficulties, and certain technical conditions must hold for it to be solvable. When these conditions do hold, however, infinite-horizon problems require less computational effort than finite-horizon problems of similar dimension. Sometimes more than one performance objective must be attained. In these cases it is usual to set bounds on all the objectives except one, which is optimized while ensuring that the other objectives remain within their bounds, *i.e.* the remaining objectives constitute constraints on the MDP problem. This strategy is known as a constrained MDP (CMDP). The most usual approaches to solving these problems are to reformulate them as linear programs or to apply Lagrangian relaxation to the constraints. Finally, in some problems the control decision at each state must be taken without complete knowledge of the state. Instead of observing the state directly, the controller observes an additional variable related to the state, from which the probability of each state can be inferred. These problems are known as Partially Observable MDPs (POMDPs) and are tractable, in general, only for problems of small dimension. The more complex versions of MDPs are, in fact, generalizations of the basic problem; as we will see, some problems must be formulated as constrained POMDPs, for which very few results are available so far and which are generally addressed by heuristic methods.

**3. MDP applications in cognitive radio**

MDPs have been frequently applied in the design of MAC protocols for cognitive radio. These protocols can be classified into two classes: decentralized and centralized access protocols. In the decentralized case, each unlicensed user is responsible for performing spectrum sensing and spectrum access, in general with limited, and sometimes unreliable, information about the activity of the licensed users.

**3.1 Decentralized access**

> In Zhao et al. (2007), the activity of a licensed user is modeled as an on-off process represented by a two-state Markov chain. The problem of channel sensing and access in a spectrum overlay system is formulated as a POMDP: the actions consist of sensing and accessing a channel, and the channel sensing result is treated as an observation. The reward is defined as the number of transmitted bits. The objective is to maximize the expected total number of bits transmitted in a given number of time slots, under the constraint that the collision probability with a licensed user remains below a target level.
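In such a formulation the information state is the belief that the channel is idle, maintained by a prediction step (through the Markov chain) and a Bayes correction step (through the sensing outcome). A minimal sketch of this update for a single two-state channel; all probabilities below are illustrative assumptions, not values from Zhao et al. (2007):

```python
# Belief (information-state) update for a two-state on-off primary channel.
# All numeric values are made up for illustration.

def predict(b_idle, p_ii, p_bi):
    """One-step prior that the channel is idle in the next slot.

    b_idle : current belief that the channel is idle
    p_ii   : P(idle -> idle) transition probability
    p_bi   : P(busy -> idle) transition probability
    """
    return b_idle * p_ii + (1.0 - b_idle) * p_bi

def correct(b_idle, observed_idle, p_d, p_fa):
    """Bayes correction of the belief after sensing the channel.

    p_d  : P(sense busy | channel busy)  (detection probability)
    p_fa : P(sense busy | channel idle)  (false-alarm probability)
    """
    if observed_idle:
        num = b_idle * (1.0 - p_fa)
        den = num + (1.0 - b_idle) * (1.0 - p_d)
    else:
        num = b_idle * p_fa
        den = num + (1.0 - b_idle) * p_d
    return num / den

b = 0.5                                   # initial belief: idle w.p. 0.5
b = predict(b, p_ii=0.8, p_bi=0.3)        # prior for the next slot: 0.55
b = correct(b, observed_idle=True, p_d=0.9, p_fa=0.1)  # b is now ~0.917
```

The belief summarizes the whole sensing history, which is what makes the POMDP tractable as an MDP over the belief space.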

> Geirhofer et al. (2008) propose a cognitive radio that can coexist with multiple parallel WLAN channels while operating below a given interference constraint. The coexistence between conventional and cognitive radios is based on predicting the WLAN's behavior by means of a continuous-time Markov chain model. The cognitive MAC is derived from this model by recasting the problem as a constrained Markov decision process (CMDP).
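As noted above, a common way to handle such CMDP formulations is Lagrangian relaxation: the constrained quantity is priced into the objective through a multiplier, which is tuned until the constraint is met. A toy sketch of the idea, with hypothetical candidate policies summarized only by their long-run averages (the numbers are invented, not taken from the paper):

```python
# Lagrangian relaxation for a constrained MDP, in caricature: each candidate
# stationary policy is summarized by its long-run average throughput and
# average interference (made-up numbers); eta prices the constraint.

policies = {                      # name: (throughput, interference)
    "aggressive":   (0.90, 0.20),
    "moderate":     (0.70, 0.08),
    "conservative": (0.40, 0.02),
}
INTERFERENCE_CAP = 0.10           # constraint: average interference <= cap

def best_for(eta):
    """Unconstrained maximizer of throughput - eta * interference."""
    return max(policies, key=lambda p: policies[p][0] - eta * policies[p][1])

# Sweep the multiplier upward until the unconstrained optimum is feasible.
eta, step = 0.0, 0.25
name = best_for(eta)
while policies[name][1] > INTERFERENCE_CAP:
    eta += step
    name = best_for(eta)
```

In a real CMDP each evaluation of `best_for` would itself solve an unconstrained MDP (e.g. by value iteration) for the scalarized cost; here that inner step is abstracted into a lookup table.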

> The goal in Chen et al. (2008) is to maximize the throughput of a secondary user while limiting the probability of colliding with primary users. The access mechanism comprises three basic components: a spectrum sensor that identifies spectrum opportunities, a sensing strategy that determines which channels to sense, and an access strategy that decides whether to access based on potentially erroneous sensing outcomes. This joint design is formulated as a constrained partially observable Markov decision process (POMDP).
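The interplay between erroneous sensing and the collision constraint can be illustrated with a simple Bayes rule: transmit only when the posterior probability that the channel is busy stays below the tolerated collision level. The sensor characteristics and the threshold below are hypothetical values, not parameters from Chen et al.:

```python
# Access rule under imperfect sensing: transmit only if the posterior
# collision probability respects the target. Numbers are illustrative.

P_FA = 0.10   # P(sense "busy" | channel idle)  -- false alarm
P_MD = 0.05   # P(sense "idle" | channel busy)  -- missed detection
ZETA = 0.08   # maximum tolerable collision probability with a primary

def posterior_busy(prior_busy, sensed_idle):
    """P(channel busy | sensing outcome), by Bayes' rule."""
    if sensed_idle:
        num = prior_busy * P_MD
        den = num + (1.0 - prior_busy) * (1.0 - P_FA)
    else:
        num = prior_busy * (1.0 - P_MD)
        den = num + (1.0 - prior_busy) * P_FA
    return num / den

def access(prior_busy, sensed_idle):
    """Transmit only if the collision probability stays below the target."""
    return posterior_busy(prior_busy, sensed_idle) <= ZETA
```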

> The approach in Li et al. (2011) is to maximize the throughput of the secondary user subject to collision constraints imposed by the primary users. The formulation follows a constrained partially observable Markov decision process.
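All of these formulations ultimately rest on solving an average-cost Bellman equation of the form (3). As a concrete illustration of one of the exact methods mentioned earlier, here is a relative value iteration sketch on a toy two-state, two-action model; the transition probabilities and costs are made up for the example:

```python
# Relative value iteration for the average-cost Bellman equation
#   h(i) = min_u [ g(i,u) - lambda + sum_j p_ij(u) h(j) ],  h(ref) = 0.
# Toy two-state, two-action model with invented numbers; for this model
# the method converges to average cost 5/6 with optimal policy [1, 0].

# P[u][i][j] = transition probability from state i to state j under action u
P = [
    [[0.9, 0.1], [0.4, 0.6]],   # action 0
    [[0.2, 0.8], [0.7, 0.3]],   # action 1
]
g = [[2.0, 0.5], [1.0, 3.0]]    # g[i][u]: cost per stage in i under u

N, U, ref = 2, 2, 0             # ref: state whose differential cost is pinned to 0

h = [0.0] * N
for _ in range(1000):
    # Q-values of the dynamic-programming operator applied to h
    q = [[g[i][u] + sum(P[u][i][j] * h[j] for j in range(N))
          for u in range(U)] for i in range(N)]
    v = [min(qi) for qi in q]
    lam = v[ref]                # running estimate of the average cost
    h = [vi - lam for vi in v]  # renormalize so that h(ref) = 0

policy = [min(range(U), key=lambda u: q[i][u]) for i in range(N)]
```

Subtracting `v[ref]` at every sweep keeps the iterates bounded, which is what distinguishes relative value iteration from the plain discounted version.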
