
scheme requires greater computational resources at the user terminal and generally does not achieve globally optimal solutions. On the other hand, distributed schemes imply a smaller communication overhead.

MAC protocols for DSA can also include spectrum trading features. In situations of low spectrum usage, the licensed operator may decide to sell spectrum opportunities to unlicensed users. In order to do this in real time, a protocol is required to support negotiations on access price, channel holding time, *etc.*, between the spectrum owner and secondary users. There are several models for spectrum trading. In this work, we consider the bid-auction model, in which secondary users bid for the spectrum of a single spectrum owner.

This chapter addresses the design of DSA MAC protocols for centralized dynamic spectrum access. We explore the possibilities of a formal design based on a Markov decision process (MDP) formulation. We survey previous works on this issue and propose a design framework to balance the grade of service (e.g., blocking probability) of different user categories and the expected economic revenue. When two or more conflicting objectives must be balanced in an optimization problem, there is no optimal solution in the strict sense, but rather a Pareto front, defined as the set of solutions in which no individual objective can be improved without worsening another. In this work we study the Pareto front solutions for two possible access models. The first one consists of simply giving priority to the licensed users, and the second one is an auction-based model, where unlicensed users offer a bidding price for the spectrum opportunities. In the priority-based access, the centralized policy should balance the blocking probability of each class of users. In the auction-based access, the trade-off appears between the blocking probability of primary users and the expected revenue.

The rest of this chapter is structured as follows. Section 2 provides a brief introduction to Markov Decision Processes. Section 3 reviews previous works using the MDP approach in cognitive radio systems. Section 4 explains the system model and MDP formulation for both DSA procedures considered. Section 5 contains the performance analysis of each model, based on numerical evaluations of practical examples. Section 6 summarizes the conclusions of this work.

**2. Introduction to Markov Decision Processes**

Markov Decision Processes (MDPs) are an application of a more general optimization technique known as dynamic programming (DP). The goal of DP is to find the optimal values of a variable when these values (decisions or actions) must be chosen in consecutive stages. The algorithms to solve DP problems rely on the principle of optimality, which states that in an optimal sequence of decisions, every subsequence must also be optimal. DP is generally applied in the framework of dynamical systems. Several basic concepts must be introduced to understand this framework:

• *State*: The state is determined by the values of the variables that characterize the system.

• *Stage*: In a discrete-time dynamical system, a stage is a single step in the temporal advance of the process followed by the system. At each stage the system performs a transition from one state to an adjacent one. A process may consist of a finite or infinite number of stages.

• *Action*: At each state, there may be one or several variables whose values can be chosen in order to influence the transition performed at the present stage. The values selected constitute the action at this stage.


As can be anticipated from the previous definitions, the goal of DP is to find the optimal *policy* for a given process, where a policy is a rule that specifies which action to take at each state. DP is, in fact, a decomposition strategy for complex optimization problems; in this case, the decomposition exploits the stage-by-stage structure of the process.

Markov Decision Processes are the application of DP to systems described by controlled discrete-time Markov chains, that is, Markov chains whose transition probabilities are determined by a decision variable.

Let the integer *k* denote the *k*-th stage of an MDP. At a given stage, let *i* and *u* denote the state of the system and the action taken, respectively. The set of possible values of the state, the *state space*, is denoted by *S*, therefore *i* ∈ *S*. The *control space* *U* is defined similarly. In general, at each state *i* only a subset of actions *U*(*i*) ⊆ *U* is allowed. We restrict our attention to processes where *S*, *U*(*i*) and *U* are independent of *k*. In this case, the transition probability from state *i* to state *j* is denoted as *pij*(*u*). A policy takes the form *u* = *μ*(*i*), and because it does not depend on *k* it is said to be a *stationary* policy. A policy is admissible if *μ*(*i*) ∈ *U*(*i*) for all *i* ∈ *S*. At each state *i*, the policy provides the probability distribution of the next state as *pij*(*μ*(*i*)), for *j* ∈ *S*.

The cost of each state-action pair is denoted by *g*(*i*, *u*). Sometimes the costs are associated with transitions instead of states. Let *g̃*(*i*, *u*, *j*) denote the cost of the transition from state *i* to state *j* under action *u*. In this case, we use the *expected cost* per stage, defined as:

$$g(i,u) = \sum\_{j \in S} \tilde{g}(i,u,j) \, p\_{ij}(u) \tag{1}$$
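To make the notation concrete, the following minimal Python sketch builds the expected per-stage cost of (1) for a hypothetical two-state, two-action chain. All numerical values are invented for illustration and are not part of the system model of this chapter.

```python
import numpy as np

# Hypothetical toy model: 2 states, 2 actions.
# P[u, i, j] = p_ij(u): transition probability from i to j under action u.
P = np.array([
    [[0.9, 0.1],
     [0.4, 0.6]],   # action u = 0
    [[0.2, 0.8],
     [0.5, 0.5]],   # action u = 1
])

# G[u, i, j] = g~(i, u, j): cost of the transition i -> j under action u.
G = np.array([
    [[1.0, 5.0],
     [2.0, 0.0]],
    [[3.0, 1.0],
     [0.0, 4.0]],
])

# Expected per-stage cost of (1): g(i, u) = sum_j g~(i, u, j) * p_ij(u).
g = (G * P).sum(axis=2)   # g[u, i]
print(g)                  # [[1.4, 0.8], [1.4, 2.0]]
```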

The objective of the MDP is to find the optimal stationary policy *μ* such that the total cost is minimized. The total cost may be defined in several ways; we will focus our attention on average-cost problems. In this case, the cost to be optimized is given by the following equation:

$$\lambda = \lim\_{N \to \infty} \frac{1}{N} E\left\{ \sum\_{k=0}^{N-1} g\left(\mathbf{x}\_k, \mu(\mathbf{x}\_k)\right) \right\} \tag{2}$$

where *xk* represents the system's state at the *k*-th stage. Note that in the definition of the average cost *λ* we are implicitly assuming that its value is independent of the initial state of the system. This is not always true; however, there are certain conditions under which this assumption holds. For example, in our scenario, the value of the per-stage cost is always bounded and both *S* and *U* are finite sets. Moreover, there is at least one state, *n*, that is *recurrent* under every stationary policy. Given these conditions, the limit on the right side of (2) exists and the average cost does not depend on the initial state.
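The independence from the initial state can be checked numerically. The sketch below (our illustration, reusing the hypothetical toy model above with an arbitrary stationary policy) simulates the controlled chain from both initial states; the two long-run averages converge to the same *λ*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model from the previous sketch.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[u, i, j] = p_ij(u)
g = np.array([[1.4, 0.8], [1.4, 2.0]])     # g[u, i] = g(i, u)
mu = np.array([0, 1])                      # arbitrary stationary policy u = mu(i)

def average_cost(start, n_stages=200_000):
    """Monte Carlo estimate of the average cost lambda in (2)."""
    i, total = start, 0.0
    for _ in range(n_stages):
        u = mu[i]
        total += g[u, i]
        i = rng.choice(2, p=P[u, i])       # next state drawn from p_ij(mu(i))
    return total / n_stages

# Both estimates should agree (up to Monte Carlo noise).
print(average_cost(start=0), average_cost(start=1))
```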

Sometimes the system is modeled as a continuous-time Markov chain. In this case, as we shall see, the definition of the average cost is slightly different. In order to solve it by means of the known equations for average-cost MDP problems, we have to construct an auxiliary discrete-time problem whose average cost equals that of the continuous-time problem.
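The standard construction for this is *uniformization*; we sketch it here for completeness (our notation, not taken from this chapter: *qij*(*u*) are the transition rates of the controlled continuous-time chain, *g*(*i*, *u*) is the cost rate per unit time, and Λ is a uniformization constant). The auxiliary discrete-time problem is defined by

$$\Lambda \geq \max\_{i,u} \sum\_{j \neq i} q\_{ij}(u), \qquad \tilde{p}\_{ij}(u) = \frac{q\_{ij}(u)}{\Lambda} \; (j \neq i), \qquad \tilde{p}\_{ii}(u) = 1 - \frac{1}{\Lambda} \sum\_{j \neq i} q\_{ij}(u), \qquad \tilde{g}(i,u) = g(i,u).$$

Stage transitions of the auxiliary chain occur at the epochs of a Poisson process of rate Λ, so each stage lasts 1/Λ on average and the average cost per stage of the discrete-time problem coincides with the average cost per unit time of the continuous-time one.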


Given the conditions for the limit in (2) to exist, the *optimal* average cost can be obtained by solving the following Bellman equation:

$$h(i) = \min\_{u \in U(i)} \left[ g(i, u) - \lambda + \sum\_{j \in S} p\_{ij}(u)\,h(j) \right], \quad i \in S \tag{3}$$

with the condition *h*(*n*) = 0. It is known (see Bertsekas (2007)) that the previous equation has a unique solution and that the stationary policy *μ* attaining the minimum on the right side of (3) is an optimal policy. *h*(*i*) is known as the relative or differential cost of each state *i*. It represents the minimum, over all policies, of the difference between the expected cost to reach *n* from *i* for the first time and the cost that would be incurred if the cost per stage were equal to the average *λ* at all states.

There are several computational methods for solving the Bellman equation: the value iteration algorithm, the policy iteration algorithm and the linear programming method provide exact solutions to the problem (see Bertsekas (2007) and Puterman (2005)). However, when the dimension of the sets *S* and *U* is relatively large, the problem becomes so complex that solving it exactly may be computationally intractable. This is known as the *curse of dimensionality* in dynamic programming. In some situations, we are not able to compute all the transition probabilities *pij*(*u*) of the model, so obtaining an exact solution is impossible. For these cases, multiple approximate methods have been developed within the frameworks of approximate dynamic programming (see Powell (2005)) and reinforcement learning.
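As an illustration of the first method, the following sketch (ours, reusing the hypothetical two-state model introduced earlier) applies relative value iteration, the variant of value iteration suited to average-cost problems, to equation (3), normalizing so that *h*(*n*) = 0:

```python
import numpy as np

# Hypothetical toy model from the earlier sketches.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[u, i, j] = p_ij(u)
g = np.array([[1.4, 0.8], [1.4, 2.0]])     # g[u, i] = g(i, u)
n = 0                                      # reference state, h(n) = 0

h = np.zeros(2)
for _ in range(10_000):
    Q = g + P @ h                          # Q[u, i] = g(i, u) + sum_j p_ij(u) h(j)
    Th = Q.min(axis=0)                     # minimize over actions
    lam = Th[n]                            # current estimate of lambda
    h_new = Th - lam                       # normalize so that h(n) = 0
    if np.max(np.abs(h_new - h)) < 1e-12:
        break
    h = h_new

mu = Q.argmin(axis=0)                      # greedy stationary policy mu(i)
print("lambda:", lam, "h:", h, "policy:", mu)
```

At convergence, *λ* and *h* satisfy (3), and under the recurrence conditions stated above the greedy policy *μ* is optimal.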

There are several variations of the MDP problem. One of the most important concerns the time horizon over which the process is assumed to operate: it may be finite, when the optimization is done over a finite number of stages, or infinite, when the number of stages is assumed to be infinite. The latter type of problem presents some theoretical difficulties, and some technical conditions must hold for it to be solvable. However, when these conditions hold, infinite-horizon problems require less computational effort than finite-horizon problems of similar dimension. Sometimes, more than one performance objective must be attained. In these cases, it is usual to set bounds on all the objectives except one, which is optimized while assuring that the other objectives remain within their bounds, *i.e.*, the remaining objectives constitute constraints on the MDP problem. This strategy is known as a constrained MDP (CMDP). To solve these problems, the most usual approaches are to re-formulate the problem as a linear program or to apply Lagrangian relaxation to the constraints. Finally, in some problems, the control decision at each state must be taken without complete knowledge of the state: instead of directly observing the state, the controller observes an additional variable related to the state, from which the probability of each state can be inferred. These problems are known as partially observable MDPs (POMDPs) and are tractable, in general, only for low-dimensional problems. These more complex versions of the MDP are, in fact, generalizations of the basic problem. As we will see, some problems must be formulated as constrained POMDPs, for which very few results are available so far; they are generally addressed by heuristic methods.
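For concreteness, the linear-programming reformulation mentioned above can be written in its standard occupation-measure form (our sketch; the constraint cost *c*(*i*, *u*) and bound *α* are illustrative symbols, not part of this chapter's model):

$$\min\_{\rho \geq 0} \sum\_{i,u} \rho(i,u)\, g(i,u) \quad \text{s.t.} \quad \sum\_{u} \rho(j,u) = \sum\_{i,u} \rho(i,u)\, p\_{ij}(u) \;\; \forall j \in S, \qquad \sum\_{i,u} \rho(i,u) = 1, \qquad \sum\_{i,u} \rho(i,u)\, c(i,u) \leq \alpha.$$

Here *ρ*(*i*, *u*) is the long-run fraction of stages in which the system is in state *i* and action *u* is taken; an optimal, possibly randomized, stationary policy is recovered by normalizing *ρ*(*i*, ·) to a probability distribution over actions at each state. Without the last constraint, this is the exact linear-programming method cited above for unconstrained MDPs; adding it yields the CMDP solution.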
**3. Previous works**

In the decentralized case, each secondary user must take access decisions based on its own, generally partial, knowledge of the spectrum usage. In consequence, it is usual to find partially observable MDP (POMDP) formulations of the problem, which easily become intractable when the dimension of the problem increases. Moreover, the access of secondary users to the spectrum should have the least possible impact on licensed users; when these restrictions are included in the formulation, the resulting problem is a constrained POMDP. In the centralized case, a central device, generally referred to as a spectrum broker, performs spectrum management, controlling the access of secondary users to idle spectrum channels. It is usually assumed that the spectrum broker has perfect information about the spectrum usage; the problem is therefore formulated as an MDP, or as a CMDP if constraints are included.

**3.1 Decentralized access**

In Zhao et al. (2007), the activity of a licensed user is modeled as an on-off process represented by a two-state Markov chain. The problem of channel sensing and access in a spectrum overlay system was formulated as a POMDP. The actions consist of sensing and accessing a channel, and the channel sensing result is considered an observation. The reward is defined as the number of transmitted bits. The objective is to maximize the expected total number of transmitted bits in a certain number of time slots, under the constraint that the collision probability with a licensed user should be maintained below a target level.

Geirhofer et al. (2008) propose a cognitive radio that can coexist with multiple parallel WLAN channels while operating below a given interference constraint. The coexistence between conventional and cognitive radios is based on the prediction of the WLAN's behavior by means of a continuous-time Markov chain model. The cognitive MAC is derived from this model by recasting the problem as a CMDP.

The goal in Chen et al. (2008) is to maximize the throughput of a secondary user while limiting the probability of colliding with primary users. The access mechanism comprises three basic components: a spectrum sensor that identifies spectrum opportunities, a sensing strategy that determines which channels to sense, and an access strategy that decides whether to access based on potentially erroneous sensing outcomes. This joint design was formulated as a constrained POMDP.

The approach in Li et al. (2011) is to maximize the throughput of the secondary user subject to collision constraints imposed by the primary users. The formulation follows a constrained POMDP.

**3.2 Centralized access**

In Yu et al. (2007), the spectrum broker controls the access of secondary users based on a threshold rule computed by means of an MDP formulation, with the objective of minimizing the blocking probability of secondary users. In order to cope with the non-stationarity of traffic conditions, the authors propose a finite-horizon MDP instead of an infinite-horizon one. The drawback is that the policy cannot be computed off-line, imposing a high computational overhead on the system.

Tang et al. (2009) study several admission control schemes at a centralized spectrum manager. The objective is to meet the traffic demands of secondary users, increasing spectrum utilization efficiency while assuring primary users a grade of service in terms of blocking probability. Among the schemes analyzed, the best performing one is based on a CMDP.