Robust optimization is used in a wide range of applications, such as reducing production time, maximizing profits, or minimizing costs under uncertain parameters. There are numerous robust optimization techniques, such as robust linear programming, robust dynamic programming, robust geometric programming, queuing theory, and risk analysis. One of the main drawbacks of robust optimization is that the worst-case scenario may be too conservative: the bounds provided by worst-case scenarios are not useful in many interesting problems (see the wireless communication example provided below). Distributionally robust optimization, however, is not based on worst-case parameters. The distributional robustness method works with the probability distribution instead of the worst parameters, and the worst-case distribution within a carefully designed distributional uncertainty set may exhibit interesting features. Distributionally robust programming can be used not only to provide a distributionally robust solution to a problem when the true distribution is unknown, but it can also, in many instances, give a general solution that takes some risk into account. The methodology presented here is simple and significantly reduces the dimensionality of the distributionally robust optimization problem. We hope that the designs of distributionally robust programming presented here can help designers, engineers, cost-benefit analysts, and managers solve concrete problems under unknown distributions.

The rest of the chapter is organized as follows. Section 2 presents some preliminary concepts of distributionally robust optimization. A class of constrained distributionally robust optimization problems is presented in Section 3. Section 4 focuses on distributed distributionally robust optimization. Afterwards, illustrative examples in distributed power networks and in wireless communication networks are provided to evaluate the performance of the method. Finally, prior works and concluding remarks are drawn in Section 5.

Notation: Let $\mathbb{R}$ and $\mathbb{R}_+$ denote the sets of real and non-negative real numbers, respectively. Let $(\Omega, d)$ be a separable, completely metrizable topological space, with $d : \Omega \times \Omega \to \mathbb{R}_+$ a metric (distance), and let $\mathcal{P}(\Omega)$ be the set of all probability measures over $\Omega$.

# 2. Distributionally robust optimization

This section introduces distributionally robust optimization models. We will first present a generic formulation of the problem. Then, individual components of the optimization and their solvability issues via equivalent formulations will be discussed.

## 2.1. Model

Consider a decision-maker who wants to select an action $a \in \mathcal{A} \subset \mathbb{R}^n$ in order to optimize her objective $r(a, \omega)$, where $\omega$ is an uncertain parameter. The information structure is the following:

• The true distribution of $\omega$ is not known to the decision-maker.

• The upper/lower bounds (if any) of $\omega$ are unknown to the decision-maker.

• The decision-maker can measure/observe realizations of the random variable $\omega$.
The decision-maker runs several trials and obtains statistical realizations of $\omega$ from measurements. The measurement data can be noisy, imperfect, and erroneous. An empirical distribution (or histogram) $m$ is then built from the realizations of $\omega$. However, $m$ is not the true distribution of the random variable $\omega$, and $m$ may not be a reliable measure due to statistical, bias, measurement, observation, or computational errors. The decision-maker is therefore facing a risk. A risk-sensitive decision-maker should choose an action that improves the performance of $\mathbb{E}_{\tilde{m}}\, r(a, \omega)$ among alternative distributions $\tilde{m}$ within a certain level of deviation $\rho > 0$ from the distribution $m$. The distributionally robust optimization problem is therefore formulated as

$$\sup_{a \in \mathcal{A}} \; \inf_{\tilde{m} \in B_{\rho}(m)} \mathbb{E}_{\omega \sim \tilde{m}}\, r(a, \omega), \tag{1}$$

where $B_{\rho}(m)$ is the uncertainty set of admissible alternative distributions within a certain radius $\rho > 0$ of $m$. Two distributional uncertainty sets are presented: one based on the $f$-divergence and one based on the Wasserstein metric, both defined below.

### 2.1.1. f-divergence


We introduce the notion of $f$-divergence, which will be used to quantify the discrepancy between probability distributions.

Definition 1. Let $m$ and $\tilde{m}$ be two probability measures over $\Omega$ such that $m$ is absolutely continuous with respect to $\tilde{m}$. Let $f$ be a convex function. Then, the $f$-divergence between $m$ and $\tilde{m}$ is defined as follows:

$$D_f(m \| \tilde{m}) := \int_{\Omega} f\!\left(\frac{dm}{d\tilde{m}}\right) d\tilde{m} - f(1),$$

where $\frac{dm}{d\tilde{m}}$ is the Radon–Nikodym derivative of the measure $m$ with respect to the measure $\tilde{m}$.

By Jensen's inequality:

$$\begin{split} D_f(m \| \tilde{m}) &= \int_{\Omega} f\!\left(\frac{dm}{d\tilde{m}}\right) d\tilde{m} - f(1) \\ &\ge f\!\left(\int_{\Omega} \frac{dm}{d\tilde{m}}\, d\tilde{m}\right) - f(1) \\ &= f\!\left(\int_{\Omega} dm\right) - f(1) \\ &= f(1) - f(1) = 0. \end{split} \tag{2}$$

Thus, $D_f(m \| \tilde{m}) \ge 0$ for any convex function $f$. Note, however, that the $f$-divergence $D_f(m \| \tilde{m})$ is not a distance (for example, it does not satisfy the symmetry property). Here, the distributional uncertainty set imposed on the alternative distribution $\tilde{m}$ is given by

$$B_{\rho}(m) = \left\{ \tilde{m} \;\middle|\; \tilde{m}(\cdot) \ge 0, \ \int_{\Omega} d\tilde{m} = \tilde{m}(\Omega) = 1, \ D_{f}(\tilde{m} \| m) \le \rho \right\}.$$


Example 1. From the notion of $f$-divergence one can derive the following important concept:

• α-divergence for

$$f(a) = \begin{cases} \frac{4}{(\alpha+1)(1-\alpha)} \left(1 - a^{\frac{\alpha+1}{2}}\right) & \text{if } \alpha \notin \{-1, +1\}, \\ a \log a & \text{if } \alpha = 1, \\ -\log a & \text{if } \alpha = -1. \end{cases}$$

• In particular, the Kullback–Leibler divergence (or relative entropy) is retrieved as $\alpha$ goes to 1 (a numerical sketch follows below).
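As a quick numerical illustration (ours, not the chapter's), the sketch below evaluates the $\alpha$-divergence family on two discrete distributions; the function names and the example distributions are purely illustrative:

```python
import numpy as np

def f_alpha(a, alpha):
    """Convex generator of the alpha-divergence family (Example 1)."""
    if alpha == 1.0:
        return a * np.log(a)              # Kullback-Leibler generator
    if alpha == -1.0:
        return -np.log(a)                 # reverse-KL generator
    return 4.0 / ((1.0 + alpha) * (1.0 - alpha)) * (1.0 - a ** ((1.0 + alpha) / 2.0))

def f_divergence(m, m_tilde, alpha):
    """D_f(m || m~) = sum_i f(m_i / m~_i) * m~_i - f(1) on a finite space."""
    ratio = m / m_tilde                   # Radon-Nikodym derivative for discrete measures
    return np.sum(f_alpha(ratio, alpha) * m_tilde) - f_alpha(1.0, alpha)

m, m_tilde = np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])
for alpha in (-1.0, 0.0, 0.999, 1.0):
    # All values are non-negative (Jensen); the alpha -> 1 value approaches KL.
    print(alpha, f_divergence(m, m_tilde, alpha))
```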

### 2.1.2. Wasserstein metric

The Wasserstein metric between two probability distributions $\tilde{m}$ and $m$ is defined as follows:

Definition 2. For $m, \tilde{m} \in \mathcal{P}(\Omega)$, let $\Pi(\tilde{m}, m)$ be the set of all couplings between $\tilde{m}$ and $m$. That is,

$$\Pi(\tilde{m}, m) = \left\{\pi \in \mathcal{P}(\Omega \times \Omega) \;\middle|\; \pi(A \times \Omega) = m(A), \ \pi(\Omega \times B) = \tilde{m}(B), \ (A, B) \in \mathcal{B}^2(\Omega)\right\}.$$

$\mathcal{B}(\Omega)$ denotes the measurable sets of $\Omega$. Let $\theta \in [1, \infty)$. The Wasserstein metric between $m$ and $\tilde{m}$ is defined as

$$W_{\theta}(\tilde{m}, m) = \inf_{\pi \in \Pi(\tilde{m}, m)} \|d\|_{L^{\theta}_{\pi}} = \inf_{\pi \in \Pi(\tilde{m}, m)} \left(\int_{\Omega \times \Omega} d^{\theta}(a, b)\, \pi(da, db)\right)^{1/\theta}.$$

It is well known that, for every $\theta \ge 1$, $W_{\theta}(\tilde{m}, m)$ is a true distance, in the sense that it satisfies the following three axioms:

• positive-definiteness,

• the symmetry property,

• the triangle inequality.
Note that $\tilde{m}$ is not necessarily absolutely continuous with respect to $m$. Now the distributional uncertainty/constraint set is the set of all probability distributions within an $L^{\theta}$-Wasserstein distance $\rho$ of $m$:

$$\tilde{B}_{\rho}(m) = \left\{ \tilde{m} \;\middle|\; \int_{\Omega} d\tilde{m} = \tilde{m}(\Omega) = 1, \ W_{\theta}(\tilde{m}, m) \le \rho \right\}.$$

Note that, if $m$ is a random measure (obtained from sampled realizations), we use the expected value of the Wasserstein metric.

Example 2. The $L^{\theta}$-Wasserstein distance between two Dirac measures $\delta_{\omega_0}$ and $\delta_{\tilde{\omega}_0}$ is $W_{\theta}(\delta_{\omega_0}, \delta_{\tilde{\omega}_0}) = d(\omega_0, \tilde{\omega}_0)$. More generally, for $K \ge 2$, the $L^2$-Wasserstein distance between the empirical measures $\mu_K = \frac{1}{K}\sum_{k=1}^{K} \delta_{\omega_k}$ and $\nu_K = \frac{1}{K}\sum_{k=1}^{K} \delta_{\tilde{\omega}_k}$ satisfies $W_2^2(\mu_K, \nu_K) \le \frac{1}{K}\sum_{k=1}^{K} [\omega_k - \tilde{\omega}_k]^2$.
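The bound in Example 2 is easy to check numerically in one dimension, where the optimal coupling between two equally weighted empirical measures sorts the samples. The following sketch (ours; the variable names are illustrative) compares the exact $W_2^2$ with the identity-pairing upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
omega = rng.normal(0.0, 1.0, K)                   # realizations behind mu_K
omega_tilde = omega + rng.normal(0.0, 0.3, K)     # perturbed realizations behind nu_K

# Exact squared L^2-Wasserstein distance between two K-point empirical measures
# on the real line: match the sorted samples (the optimal coupling in 1-D).
w2_sq = np.mean((np.sort(omega) - np.sort(omega_tilde)) ** 2)

# Upper bound from Example 2: couple omega_k with omega_tilde_k directly.
bound = np.mean((omega - omega_tilde) ** 2)

print(w2_sq, bound, w2_sq <= bound + 1e-12)  # True: the pairing is one feasible coupling
```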

We have defined $B_{\rho}(m)$ and $\tilde{B}_{\rho}(m)$. The goal now is to solve (1) under both the $f$-divergence and the Wasserstein metric. One of the difficulties of the problem is the curse of dimensionality: the distributionally robust optimization problem (1) of the decision-maker is an infinite-dimensional robust optimization problem, because $B_{\rho}$ is of infinite dimension. Below we show that (1) can be transformed into an optimization of the form sup inf sup. The latter problem has three alternating terms; solving it requires a triality theory.

## 2.2. Triality theory

We first present the duality gap and develop a triality theory to solve equivalent formulations of (1). Consider uncoupled domains $\mathcal{A}_i$, $i \in \{1, 2, 3\}$. For a general function $r_2$, one has

$$\sup_{a_2 \in \mathcal{A}_2} \inf_{a_1 \in \mathcal{A}_1} r_2(a_1, a_2) \le \inf_{a_1 \in \mathcal{A}_1} \sup_{a_2 \in \mathcal{A}_2} r_2(a_1, a_2),$$

and the difference

$$\min_{a_1 \in \mathcal{A}_1} \max_{a_2 \in \mathcal{A}_2} r_2(a_1, a_2) - \max_{a_2 \in \mathcal{A}_2} \min_{a_1 \in \mathcal{A}_1} r_2(a_1, a_2)$$

is called the duality gap. As is widely known in duality theory, by Sion's theorem [1] (an extension of von Neumann's minimax theorem) the duality gap vanishes, for example, for a convex-concave function, and the value is achieved by a saddle point in the case of a non-empty convex compact domain.

Triality theory focuses on optimization problems of the forms sup inf sup or inf sup inf. The term triality is used here because there are three key alternating terms in these optimizations.

Proposition 1. Let $(a_1, a_2, a_3) \mapsto r_3(a_1, a_2, a_3) \in \mathbb{R}$ be a function defined on the product space $\prod_{i=1}^{3} \mathcal{A}_i$. Then, the following inequalities hold:

$$\begin{aligned} \sup_{a_2 \in \mathcal{A}_2} \; & \inf_{a_1 \in \mathcal{A}_1,\, a_3 \in \mathcal{A}_3} r_3(a_1, a_2, a_3) \le \\ \inf_{a_3 \in \mathcal{A}_3} \; \sup_{a_2 \in \mathcal{A}_2} \; & \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3) \le \\ \inf_{a_1 \in \mathcal{A}_1,\, a_3 \in \mathcal{A}_3} \; & \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3), \end{aligned} \tag{3}$$

and similarly

$$\begin{aligned} \sup_{a_2 \in \mathcal{A}_2,\, a_3 \in \mathcal{A}_3} \; & \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3) \le \\ \sup_{a_3 \in \mathcal{A}_3} \; \inf_{a_1 \in \mathcal{A}_1} \; & \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3) \le \\ \inf_{a_1 \in \mathcal{A}_1} \; & \sup_{a_2 \in \mathcal{A}_2,\, a_3 \in \mathcal{A}_3} r_3(a_1, a_2, a_3). \end{aligned} \tag{4}$$

Proof. Define

$$\hat{g}(a_2, a_3) := \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3).$$

Thus, $\hat{g}(a_2, a_3) \le r_3(a_1, a_2, a_3)$ for all $a_1, a_2, a_3$. It follows that, for any $a_1, a_3$,

$$\sup_{a_2 \in \mathcal{A}_2} \hat{g}(a_2, a_3) \le \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3).$$

Using the definition of $\hat{g}$, one obtains

$$\sup_{a_2 \in \mathcal{A}_2} \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3) \le \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3), \quad \forall\, a_1, a_3.$$

Taking the infimum in $a_1$ yields

$$\sup_{a_2 \in \mathcal{A}_2} \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3) \le \inf_{a_1 \in \mathcal{A}_1} \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3), \quad \forall\, a_3. \tag{5}$$


Now, we use two operations on the variable $a_3$:

• Taking the infimum of inequality (5) in $a_3$ yields

$$\begin{aligned} \inf_{a_3 \in \mathcal{A}_3} \sup_{a_2 \in \mathcal{A}_2} \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3) &\le \inf_{a_3 \in \mathcal{A}_3} \inf_{a_1 \in \mathcal{A}_1} \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3) \\ &= \inf_{(a_1, a_3) \in \mathcal{A}_1 \times \mathcal{A}_3} \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3), \end{aligned}$$

which proves the second part of the inequalities (3). The first part of the inequalities (3) follows immediately from (5).

• Taking the supremum of inequality (5) in $a_3$ yields

$$\sup_{(a_2, a_3) \in \mathcal{A}_2 \times \mathcal{A}_3} \inf_{a_1 \in \mathcal{A}_1} r_3(a_1, a_2, a_3) \le \sup_{a_3 \in \mathcal{A}_3} \inf_{a_1 \in \mathcal{A}_1} \sup_{a_2 \in \mathcal{A}_2} r_3(a_1, a_2, a_3),$$

which proves the first part of the inequalities (4). The second part of the inequalities (4) follows immediately from (5).

This completes the proof.
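On finite action grids, the sup and inf operators become coordinate-wise max and min, so the chains of Proposition 1 can be verified mechanically. A minimal sketch (ours), with $r_3$ a random tensor indexed by $(a_1, a_2, a_3)$, checks (3):

```python
import numpy as np

rng = np.random.default_rng(1)
r3 = rng.normal(size=(4, 5, 6))              # r3[a1, a2, a3] on finite grids A1 x A2 x A3

lhs = r3.min(axis=(0, 2)).max()              # sup_{a2} inf_{a1, a3} r3
middle = r3.min(axis=0).max(axis=0).min()    # inf_{a3} sup_{a2} inf_{a1} r3
rhs = r3.max(axis=1).min()                   # inf_{a1, a3} sup_{a2} r3

print(lhs <= middle <= rhs)                  # True for any tensor, as in (3)
```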

## 2.3. Equivalent formulations

Below we explain how the dimensionality of problem (1) can be significantly reduced using a representation by means of the triality theory inequalities of Proposition 1.

### 2.3.1. f-divergence

Interestingly, the distributionally robust optimization problem (1) under $f$-divergence is equivalent to a finite-dimensional stochastic optimization problem (when $\mathcal{A}$ is of finite dimension). To see this, the original problem needs to be transformed. Let us introduce the likelihood functional $L(\tilde{\omega}) = \frac{d\tilde{m}}{dm}(\tilde{\omega})$, and set

$$L_{\rho}(m) = \left\{ L \;\middle|\; \int_{\Omega} f(L(\tilde{\omega}))\, dm(\tilde{\omega}) - f(1) \le \rho, \quad \int_{\Omega} L(\tilde{\omega})\, dm(\tilde{\omega}) = 1 \right\}.$$

Then, the Lagrangian of the problem is


$$\begin{aligned} \tilde{r}(a, L, \lambda, \mu) &= \int_{\Omega} r(a, \tilde{\omega})\, L(\tilde{\omega})\, dm(\tilde{\omega}) \\ &\quad - \lambda \left(\rho + f(1) - \int_{\Omega} f(L(\tilde{\omega}))\, dm(\tilde{\omega})\right) \\ &\quad - \mu \left(1 - \int_{\Omega} L(\tilde{\omega})\, dm(\tilde{\omega})\right), \end{aligned}$$

where $\lambda \ge 0$ and $\mu \in \mathbb{R}$. The problem becomes

$$\sup_{a \in \mathcal{A}} \; \inf_{L \in L_{\rho}(m)} \; \sup_{\lambda \ge 0,\, \mu \in \mathbb{R}} \tilde{r}(a, L, \lambda, \mu). \tag{6}$$

A full understanding of problem (6) requires a triality theory (not a duality theory). The use of triality theory leads to the following equation:

$$\sup_{a \in \mathcal{A}} \; \inf_{\tilde{m} \in B_{\rho}(m)} \mathbb{E}_{\tilde{m}}[r] = \sup_{a \in \mathcal{A},\, \lambda \ge 0,\, \mu \in \mathbb{R}} \mathbb{E}_{m}\, h, \tag{7}$$

where $h$ is the integrand function
$$h = -\lambda(\rho + f(1)) - \mu - \lambda f^{*}\!\left(\frac{r + \mu}{-\lambda}\right),$$
and $f^{*}$ is the Legendre–Fenchel transform of $f$, defined by

$$f^{*}(\xi) = \sup_{L} \left[\langle L, \xi\rangle - f(L)\right] = -\inf_{L} \left[f(L) - \langle L, \xi\rangle\right]. \tag{8}$$

Note that the right-hand side of (7) is of dimension $n + 2$, which reduces considerably the dimensionality of the original problem (1).
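For the Kullback–Leibler case $f(a) = a \log a$ (so $f^{*}(\xi) = e^{\xi - 1}$), the maximization over $\mu$ in (7) can be carried out in closed form, leaving the classical one-dimensional dual $\sup_{\lambda \ge 0} [-\lambda\rho - \lambda \log \mathbb{E}_m e^{-r/\lambda}]$ for the inner worst case. A minimal sketch (ours; a plain grid search over $\lambda$, for one fixed action $a$):

```python
import numpy as np

def worst_case_expectation_kl(r_samples, rho, lambdas=np.logspace(-3, 3, 2001)):
    """inf over KL(m~||m) <= rho of E_{m~}[r], via the dual reduction of (7).

    With f(a) = a log a, optimizing mu in closed form leaves a concave
    1-D problem: sup_{lam >= 0} -lam*rho - lam*log E_m exp(-r/lam).
    """
    r = np.asarray(r_samples, dtype=float)
    best = -np.inf
    for lam in lambdas:
        z = -r / lam
        log_mean = np.log(np.mean(np.exp(z - z.max()))) + z.max()  # stable log E exp
        best = max(best, -lam * rho - lam * log_mean)
    return best

rng = np.random.default_rng(2)
r_samples = rng.normal(1.0, 0.5, 10_000)   # r(a, omega_i) for one fixed action a
print(np.mean(r_samples), worst_case_expectation_kl(r_samples, rho=0.1))
# The robust value lies below the empirical mean and decreases as rho grows.
```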

### 2.3.2. Wasserstein metric

Similarly, the distributionally robust optimization problem under the Wasserstein metric is equivalent to a finite-dimensional stochastic optimization problem (when $\mathcal{A}$ is a set of finite dimension). If the function $\omega \mapsto r(a, \omega)$ is upper semi-continuous and $(\Omega, d)$ is a Polish space, then the Wasserstein distributionally robust optimization problem is equivalent to

$$\begin{cases} \displaystyle \sup_{a \in \mathcal{A}} \; \inf_{\tilde{m} \in \tilde{B}_{\rho}(m)} \mathbb{E}_{\tilde{m}}[r] = \sup_{a \in \mathcal{A}} \; \sup_{\lambda \ge 0} \mathbb{E}_{m}\big[\tilde{h}\big], \\[6pt] \tilde{h} = \lambda \rho^{\theta} + \mu + \displaystyle\sup_{\hat{\omega} \in \Omega} \big[ r(a, \hat{\omega}) - \mu - \lambda\, d^{\theta}(\omega, \hat{\omega}) \big]. \end{cases} \tag{9}$$
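Note that $\mu$ cancels inside the supremum in $\tilde{h}$, so for a fixed action only $\lambda$ remains. As a concrete (and standard) special case of this duality, the worst-case expectation $\inf_{\tilde{m} \in \tilde{B}_{\rho}(m)} \mathbb{E}_{\tilde{m}}[r] = \sup_{\lambda \ge 0}\{-\lambda\rho^{\theta} + \mathbb{E}_m \inf_{\hat{\omega}}[r(a, \hat{\omega}) + \lambda d^{\theta}(\omega, \hat{\omega})]\}$ can be evaluated on a discretized $\Omega$. The sketch below (ours, with illustrative names and a toy objective) does exactly that:

```python
import numpy as np

def robust_value(r_vals, support, emp, lam, rho, theta=2):
    """-lam*rho^theta + E_m inf_{omega_hat} [ r(a, omega_hat) + lam * d^theta ]."""
    d_theta = np.abs(emp[:, None] - support[None, :]) ** theta
    inner = np.min(r_vals[None, :] + lam * d_theta, axis=1)  # inf over omega_hat, per sample
    return -lam * rho ** theta + np.mean(inner)

support = np.linspace(-4.0, 4.0, 801)                    # discretized Omega (omega_hat grid)
emp = np.random.default_rng(3).normal(0.0, 1.0, 500)     # samples behind the empirical m
r_vals = -(support - 1.0) ** 2                           # toy objective r(a, .) for a fixed a

lams = np.linspace(0.0, 20.0, 401)
print(max(robust_value(r_vals, support, emp, lam, rho=0.2) for lam in lams))
# The infinite-dimensional ball over distributions collapses to a 1-D search in lambda.
```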

The next subsection presents algorithms for computing a distributionally robust solution from the equivalent formulations above.

## 2.4. Learning algorithms

Learning algorithms are crucial for finding approximate solutions to optimization and control problems. They are widely used for seeking the roots/kernel of a function and for finding feasible solutions to variational inequalities. Practically, a learning algorithm generates a certain trajectory (or a set of trajectories) toward a potential approximate solution. Selecting a learning algorithm that has specific properties, such as better accuracy, more stability, less oscillation, and quick convergence, is a challenging task [2–5]. From the calculus-of-variations point of view, a learning algorithm generates curves; therefore, selecting an algorithm among the others leads to an optimal control problem on spaces of curves. Hence, it is natural to use optimal control theory to derive faster algorithms for a family of curves. Bregman-based algorithms and risk-aware versions of them are introduced below to meet specific properties. We start by introducing the Bregman divergence.


Definition 3. The Bregman divergence $d_g : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ is defined through a differentiable, strictly convex function $g : \mathcal{A} \to \mathbb{R}$. For two points $(a, b) \in \mathcal{A}^2$, it measures the gap between $g(a)$ and the first-order Taylor expansion of $g$ around $b$, evaluated at $a$:

$$d_g(a, b) := g(a) - g(b) - \langle \nabla g(b),\, a - b\rangle.$$

Example 3. From the Bregman divergence one gets other known divergences by choosing specific functions $g$:

• If $g(a) = \sum_{i=1}^{n} a_i^2$, then the Bregman divergence $d_g(a, b) = \sum_{i=1}^{n} (a_i - b_i)^2$ is the squared standard Euclidean distance.

• If $g(a) = \sum_{i=1}^{n} a_i \log a_i$, defined on the relative interior of the simplex, i.e., $a \in \{b \mid b \in (0, 1)^n, \ \sum_{i=1}^{n} b_i = 1\}$, then the Bregman divergence $d_g(a, b) = \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i}$ is the Kullback–Leibler divergence (illustrated in the sketch below).
We are now ready to define algorithms for solving the right-hand side of (7) and (9). One of the key approaches for quantifying the error of an algorithm with respect to the distributionally robust optimum is the so-called average regret: when the regret vanishes, one gets close to a distributionally robust optimum.

Definition 4. The average regret of an algorithm that generates the trajectory $a(t) = (\tilde{a}(t), \lambda(t), \mu(t))$ within $[t_0, T]$, $t_0 > 0$, is

$$\mathrm{regret}_T := \frac{1}{T - t_0} \int_{t_0}^{T} \left[ \max_{b \in \mathcal{A} \times \mathbb{R}_+ \times \mathbb{R}} \mathbb{E}_m h(b, \omega) - \mathbb{E}_m h(a(t), \omega) \right] dt.$$

### 2.4.1. Armijo gradient flow

Algorithm 1. The Armijo gradient pseudocode is as follows:

1: procedure ArmijoGradient($a(0)$, $e$, $T$, $g$, $m$, $h$) ⊳ the Armijo gradient flow starting from $a(0)$ within $[0, T]$
2: $a \leftarrow a(0)$
3: while $\mathrm{regret} > e$ and $t \le T$ do ⊳ we have the answer if the regret is 0
4: Compute $a(t)$, the solution of (10)
5: Compute $\mathrm{regret}_t$
6: end while
7: return $a(t)$, $\mathrm{regret}_t$ ⊳ get $a(t)$ and the regret
8: end procedure

Proposition 2. Let $a \mapsto \mathbb{E}_m h(a, \omega) : \mathbb{R}^{n+2} \to \mathbb{R}$ be a concave function that has a unique global maximizer $a^*$. Assume that $a^*$ is a feasible action profile, i.e., $a^* \in \mathcal{A}$. Consider the continuous-time analogue of the Armijo gradient flow [6], which is given by

$$\begin{aligned} \frac{d}{dt}a(t) &= \left[\nabla^2 g\right]^{-1} \nabla_a \mathbb{E}_m h(a(t), \omega), \\ a(0) &= a_0 \in \mathbb{R}^{n+2}, \end{aligned} \tag{10}$$

where $a(0) = a_0$ is the initial point of the algorithm and $g$ is a strictly convex function of $a$. Let $a(t)$ be the solution to (10).

Then the average regret within $[t_0, T]$, $t_0 > 0$, is bounded above by

$$\mathrm{regret}_T := \frac{1}{T - t_0} \int_{t_0}^{T} \mathbb{E}_m[h(a^*, \omega) - h(a(t), \omega)]\, dt \le d_g(a^*, a_0)\, \frac{\log \frac{T}{t_0}}{T - t_0}.$$

Proof. Let

$$W(a(t)) = t\, \mathbb{E}_m[h(a^*, \omega) - h(a(t), \omega)] + d_g(a^*, a(t)),$$

where $a$ is the solution to (10). The function $W$ is positive, and
$$\frac{d}{dt} W = \mathbb{E}_m[h(a^*, \omega) - h(a(t), \omega)] - t \left\langle \mathbb{E}_m \nabla_a h(a, \omega),\, g_{aa}^{-1}\, \mathbb{E}_m \nabla_a h(a(t), \omega)\right\rangle + \frac{d}{dt} d_g(a^*, a(t)).$$
By concavity of $\mathbb{E}_m h(a, \omega)$, one has

$$\langle \mathbb{E}_m \nabla_a h(a, \omega),\, (a^* - a)\rangle \ge \mathbb{E}_m[h(a^*, \omega) - h(a, \omega)], \quad \forall\, a.$$

On the other hand,

$$\begin{split} \frac{d}{dt} d_g(a^*, a(t)) &= -\langle g_a(a), \dot{a}\rangle - \langle g_{aa}\dot{a},\, a^* - a\rangle + \langle g_a(a), \dot{a}\rangle \\ &= -\langle g_{aa}\dot{a},\, a^* - a\rangle = -\langle \mathbb{E}_m \nabla_a h(a, \omega),\, a^* - a\rangle. \end{split} \tag{11}$$

Hence,

$$\begin{split} \frac{d}{dt} W &\le \langle \mathbb{E}_m \nabla_a h(a, \omega),\, (a^* - a)\rangle - t\, \langle \mathbb{E}_m \nabla_a h(a, \omega),\, g_{aa}^{-1}\, \mathbb{E}_m \nabla_a h(a, \omega)\rangle \\ &\quad - \langle \mathbb{E}_m \nabla_a h(a, \omega),\, a^* - a\rangle \\ &= -t\, \langle \mathbb{E}_m \nabla_a h(a, \omega),\, g_{aa}^{-1}\, \mathbb{E}_m \nabla_a h(a, \omega)\rangle \le 0, \end{split} \tag{12}$$

where the last inequality is by convexity of $g$. It follows that $\frac{d}{dt} W(a(t)) \le 0$ along the path of the gradient flow. This decreasing property implies $0 \le W(a(t)) \le W(a(0)) = d_g(a^*, a(0))$. In particular, $0 \le t\, \mathbb{E}_m[h(a^*, \omega) - h(a, \omega)] \le W(a(0)) < +\infty$. Thus, the error to the value $\mathbb{E}_m h(a^*, \omega)$ is bounded by


$$0 \le \mathbb{E}\_{\mathfrak{m}}[h(a^\*, \omega) - h(a, \omega)] \le \frac{W(a(0))}{t}.$$

The announced result on the regret follows by integration over $[t_0, T]$ and by averaging. This completes the proof.

Note that the above regret bound is established without assuming strong convexity of $a \mapsto -\mathbb{E}_m h(a, \omega)$. Also, no Lipschitz continuity bound on the gradient is assumed.
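A minimal Euler-discretized sketch of the flow (10) (ours): we take a toy concave quadratic in place of $\mathbb{E}_m h(a, \omega)$ and $g(a) = \lVert a \rVert^2$, so $[\nabla^2 g]^{-1} = \tfrac{1}{2} I$, and track the average regret of Definition 4:

```python
import numpy as np

a_star = np.array([1.0, -2.0, 0.5])
h = lambda a: -np.sum((a - a_star) ** 2)       # stand-in for E_m h(a, omega), concave
grad_h = lambda a: -2.0 * (a - a_star)

hess_g_inv = 0.5 * np.eye(3)                   # g(a) = ||a||^2  =>  [grad^2 g]^{-1} = I/2

a, dt, t0, T = np.zeros(3), 1e-3, 1e-2, 10.0
regret_int, t = 0.0, 0.0
while t < T:
    a = a + dt * hess_g_inv @ grad_h(a)        # Euler step of (10)
    t += dt
    if t >= t0:
        regret_int += dt * (h(a_star) - h(a))  # accumulate the instantaneous gap
print(regret_int / (T - t0))                   # small average regret, as in Proposition 2
```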

### 2.4.2. Bregman learning algorithms

Algorithm 2. The Bregman learning pseudocode is as follows:

1: procedure Bregman($a(0)$, $e$, $T$, $g$, $\alpha$, $\beta$, $m$, $h$) ⊳ the Bregman learning starting from $a(0)$ within $[0, T]$
2: $a \leftarrow a(0)$
3: while $\mathrm{regret} > e$ and $t \le T$ do ⊳ we have the answer if the regret is 0
4: Compute $a(t)$, the solution of (13)
5: Compute $\mathrm{regret}_t$
6: end while
7: return $a(t)$, $\mathrm{regret}_t$ ⊳ get $a(t)$ and the regret
8: end procedure

Proposition 3. Let $a \mapsto \mathbb{E}_m h(a, \omega) : \mathbb{R}^{n+2} \to \mathbb{R}$ be a concave function that has a unique global maximizer $a^*$. Assume that $a^*$ is a feasible action profile, i.e., $a^* \in \mathcal{A}$. Let $\alpha$ and $\beta$ be two functions such that $\dot{\beta}(t) \le e^{\alpha(t)}$. Consider the following Bregman learning algorithm:

$$\begin{aligned} \frac{d}{dt}\left[ g_a\!\left( a(t) + e^{-\alpha(t)}\, \dot{a}(t) \right) \right] &= e^{\alpha(t) + \beta(t)}\, \nabla_a \mathbb{E}_m h(a(t), \omega), \\ a(0) &\in \mathbb{R}^{n+2}, \quad \dot{a}(0) \in \mathbb{R}^{n+2}, \end{aligned} \tag{13}$$

where $a(0)$ is the initial point of the algorithm and $g$ is a strictly convex function of $a$. Let $a(t)$ be the solution to (13). Then the average regret within $[t_0, T]$, $t_0 > 0$, is bounded above by

$$\mathrm{regret}_T \le \frac{c_0}{T - t_0} \int_{t_0}^{T} e^{-\beta(s)}\, ds, \tag{14}$$

where $c_0 := d_g\!\left(a^*,\, a(0) + e^{-\alpha(0)}\dot{a}(0)\right) + e^{\beta(0)}\, \mathbb{E}_m[h(a^*, \omega) - h(a(0), \omega)] > 0$.

Proof. Let $W(a, \dot{a}, t, a^*) = d_g\!\left(a^*,\, a(t) + e^{-\alpha(t)}\dot{a}(t)\right) + e^{\beta(t)}\, \mathbb{E}_m[h(a^*, \omega) - h(a(t), \omega)]$. It is clear that $W$ is positive. Moreover, $\frac{d}{dt} W(a(t), \dot{a}(t), t, a^*) \le 0$ for $\dot{\beta} \le e^{\alpha}$. Thus $W(a(t), \dot{a}(t), t, a^*) \le W(a(0), \dot{a}(0), 0, a^*) = c_0$. By integration over $[t_0, T]$, it follows that

$$\frac{1}{T - t_0} \int_{t_0}^{T} \mathbb{E}_m[h(a^*, \omega) - h(a(t), \omega)]\, dt \le \frac{c_0}{T - t_0} \int_{t_0}^{T} e^{-\beta(s)}\, ds.$$

This completes the proof.


In particular, for $\beta(s) = -s + e^{s}$, one obtains an error bound to the optimal value as

$$\frac{c_0}{t} \int_0^t e^{-\beta(s)}\, ds = \frac{c_0}{t} \int_0^t e^{s} e^{-e^{s}}\, ds = \frac{c_0\left(\frac{1}{e} - e^{-e^{t}}\right)}{t},$$

and for $\beta(s) = s$, the regret bound becomes

$$\frac{c_0}{t} \int_0^t e^{-\beta(s)}\, ds = \frac{c_0\left(1 - e^{-t}\right)}{t}.$$

Figure 1 illustrates the advantage of algorithm (13) compared with the gradient flow (10). It plots the regret bound $\frac{c_0}{T - t_0}\int_{t_0}^{T} e^{-\beta(s)}\, ds$ for $\beta(s) = s$ against $d_g(a^*, a_0)\, \frac{\log\frac{T}{t_0}}{T - t_0}$, with an initial gap of $c_0 = 25$.

The advantage of algorithms (10) and (13) is that they do not require computing the Hessian of $\mathbb{E}_m h(a, \omega)$, as is the case in the Newton scheme. As a corollary of Proposition 2, the regret vanishes as $T$ grows; thus, it is a no-regret algorithm. However, algorithm (10) may not be sufficiently fast. Algorithm (13) provides a higher-order convergence rate by carefully designing $(\alpha, \beta)$: the average regret decays very quickly to zero [7]. However, it may generate an oscillatory trajectory with a big magnitude. The next subsection presents risk-aware algorithms that reduce the oscillatory phase of the trajectory.

Figure 1. Global regret bound under Bregman vs. gradient. The initial gap is $c_0 = 25$.
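The two curves of Figure 1 can be regenerated from the closed-form bounds: $d_g(a^*, a_0)\frac{\log(T/t_0)}{T - t_0}$ for the gradient flow (Proposition 2) and $\frac{c_0}{T - t_0}\int_{t_0}^{T} e^{-s}\, ds$ for the Bregman dynamics ((14) with $\beta(s) = s$). A minimal sketch (ours), with both initial gaps set to $c_0 = 25$:

```python
import numpy as np

c0, t0 = 25.0, 0.1
T = np.linspace(1.0, 50.0, 200)

gradient_bound = c0 * np.log(T / t0) / (T - t0)             # Proposition 2, d_g(a*, a0) = c0
bregman_bound = c0 * (np.exp(-t0) - np.exp(-T)) / (T - t0)  # (14) with beta(s) = s

print(gradient_bound[-1], bregman_bound[-1])  # the Bregman bound decays much faster
```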

### 2.4.3. Risk-aware Bregman learning algorithm

In order to reduce the oscillatory phase, we introduce a risk-aware Bregman learning algorithm [7], which is a speed-up-and-average version of (13): the mean dynamics $\overline{m}$ of $a$, given by

$$\begin{split} \dddot{\overline{m}} &= -\frac{3}{t}\, \ddot{\overline{m}} - \left(e^{\alpha} - \dot{\alpha}\right)\left(\ddot{\overline{m}} + \frac{2}{t}\, \dot{\overline{m}}\right) \\ &\quad + \frac{e^{2\alpha+\beta}}{t}\, g_{aa}^{-1}\!\left(\overline{m} + \left[t + 2e^{-\alpha}\right]\dot{\overline{m}} + t e^{-\alpha}\, \ddot{\overline{m}}\right) \nabla_a \mathbb{E}_m h\!\left(\overline{m} + t\, \dot{\overline{m}},\, \omega\right), \end{split} \tag{15}$$


with starting vector $\overline{m}(0) = a(0)$, $\dot{\overline{m}}(0)$, $\ddot{\overline{m}}(0)$.

Algorithm 3. The risk-aware Bregman learning pseudocode is as follows:

1: procedure RiskAwareBregman($\overline{m}(0)$, $e$, $T$, $g$, $\alpha$, $\beta$, $m$, $h$) ⊳ the risk-aware Bregman learning starting from $\overline{m}(0)$ within $[0, T]$
2: $\overline{m} \leftarrow \overline{m}(0) = a(0)$, $\dot{\overline{m}}(0)$, $\ddot{\overline{m}}(0)$
3: while $\mathrm{regret} > e$ and $t \le T$ do ⊳ we have the answer if the regret is 0
4: Compute $\overline{m}(t)$, the solution of (15)
5: Compute $\mathrm{regret}_t$
6: end while
7: return $\overline{m}(t)$, $\mathrm{regret}_t$ ⊳ get $\overline{m}(t)$ and the regret
8: end procedure

Proposition 4. The time-average trajectory of the learning algorithm (13) generates the mean dynamics (15).

Proof. We use the average relation $\overline{m}(t) = \frac{1}{t}\int_0^t a(s)\, ds$, where $a$ solves Eq. (13). From the definition of $\overline{m}$, and by l'Hôpital's rule, $\overline{m}(0) = a(0)$. Moreover, $\overline{m}(t)$ and $a(t)$ are related by the following equations:

$$a(t) = \overline{m}(t) + t\, \dot{\overline{m}}(t), \qquad \dot{a}(t) = 2\, \dot{\overline{m}}(t) + t\, \ddot{\overline{m}}(t), \qquad \ddot{a}(t) = 3\, \ddot{\overline{m}}(t) + t\, \dddot{\overline{m}}(t). \tag{16}$$

Substituting these values in Eq. (13) yields the mean dynamics (15). This completes the proof.
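The substitution rests on the identity $a(t) = \overline{m}(t) + t\dot{\overline{m}}(t)$, which follows from differentiating $t\, \overline{m}(t) = \int_0^t a(s)\, ds$. A quick finite-difference check (ours, on an arbitrary smooth trajectory):

```python
import numpy as np

t = np.linspace(1e-3, 5.0, 50_001)
a = np.sin(3.0 * t) + t                        # any smooth trajectory a(t)

# Running average m(t) = (1/t) * integral_0^t a(s) ds via cumulative trapezoid.
integral = np.concatenate(([0.0], np.cumsum(0.5 * (a[1:] + a[:-1]) * np.diff(t))))
m = (integral + a[0] * t[0]) / t               # small correction for the [0, t[0]] stub
m_dot = np.gradient(m, t)

err = np.max(np.abs(a[100:-100] - (m + t * m_dot)[100:-100]))
print(err)  # ~0 up to discretization error: a(t) = m(t) + t*m'(t), as in (16)
```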

The risk-aware Bregman dynamics (15) generates a less oscillatory trajectory due to its averaging nature. The next result provides an accuracy bound for (15).

Proposition 5. The risk-aware Bregman dynamics (15) satisfies

$$0 \le \mathbb{E}_m[h(a^*, \omega) - h(\overline{m}(t), \omega)] \le \frac{c_0}{t} \int_0^t e^{-\beta(s)}\, ds.$$

Proof. Let $\overline{m}(t) = \frac{1}{t}\int_0^t a(s)\, ds$. Then $\overline{m}(t) = \int a(s)\, \left(\frac{1}{t}\mathbb{1}_{[0,t]}(s)\right) ds$; thus $\overline{m}(t) = \mathbb{E}_{\mu(t)}\, a$, where $\mu(t)$ is the measure with density $d\mu(t)[s] = \frac{1}{t}\mathbb{1}_{[0,t]}(s)\, ds$. By convexity of $-\mathbb{E}_m h(a, \omega)$, we apply Jensen's inequality:

$$\mathbb{E}_m h\!\left(\frac{1}{t}\int_0^t a(s)\, ds,\, \omega\right) = \mathbb{E}_m h(\overline{m}(t), \omega) = \mathbb{E}_m h\!\left(\mathbb{E}_{\mu(t)}\, a,\, \omega\right) \ge \mathbb{E}_{\mu(t)}\, \mathbb{E}_m h(a, \omega) = \frac{1}{t}\int_0^t \mathbb{E}_m h(a(s), \omega)\, ds.$$

In view of (14) one has


$$\begin{aligned} 0 \le \mathbb{E}\_m h(a^\*, \omega) &\quad -\mathbb{E}\_m h\left(\frac{1}{t} \int\_0^t a(s)ds, \omega\right) \\\\ \le \frac{1}{t} \int\_0^t [\mathbb{E}\_m h(a^\*, \omega) - \mathbb{E}\_m h(a(s), \omega)] ds \\\\ \le c\_0 &\frac{1}{t} \int\_0^t e^{-\beta(s)} ds, \\\\ 0 \le \mathbb{E}\_m h(a^\*, \omega) - \mathbb{E}\_m h(\overline{m}(t), \omega) \le \frac{c\_0}{t} \int\_0^t e^{-\beta(s)} ds. \end{aligned}$$

This completes the proof.

Definition 5. (Convergence time). Let $\delta > 0$ and $a(t)$ be the trajectory generated by the Bregman algorithm starting from $a\_0$ at time $t\_0$. The convergence time to be within a ball $B\left(\mathbb{E}\_m h(a^\*, \omega), \delta\right)$ of radius $\delta > 0$ from the center $\mathbb{E}\_m h(a^\*, \omega)$ is given by

$$T\_\delta = \inf \{ t \mid \mathbb{E}\_m [h(a^\*, \omega) - h(a(t), \omega)] \le \delta,\ t > t\_0 \}.$$

Proposition 6. Under the assumptions above, the error generated by the algorithm is at most (14), which means that it takes at most $T\_\delta = \beta^{-1} \log\left(\frac{c\_0}{\delta}\right)$ time units for the algorithm to be within a ball $B\left(\mathbb{E}\_m h(a^\*, \omega), \delta\right)$ of radius $\delta > 0$ from the center $\mathbb{E}\_m h(a^\*, \omega)$.

Proof. The proof is immediate. For $\delta > 0$, the average regret bound of Proposition 5,

$$\overline{r}\_T \le \frac{c\_0}{T - t\_0} \int\_{t\_0}^{T} e^{-\beta(s)}\, ds \le \delta, \tag{17}$$

provides the announced convergence time bound. This completes the proof.

See Table 1 for detailed parametric functions on the bound $T\_\delta$.

Table 1. Convergence rate under different sets of functions.
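To give a sense of scale for Proposition 6, the snippet below evaluates $T\_\delta = \beta^{-1}\log(c\_0/\delta)$ for hypothetical constants $c\_0$ and $\beta$; note that the convergence time grows only logarithmically as the target accuracy $\delta$ shrinks.

```python
import math

# Illustrative evaluation of the convergence-time bound of Proposition 6,
# T_delta = (1/beta) * log(c_0 / delta), for sample (assumed) constants.
c0, beta = 2.0, 1.0
for delta in (1e-1, 1e-2, 1e-3):
    T_delta = math.log(c0 / delta) / beta
    print(f"delta = {delta:.0e}  ->  T_delta = {T_delta:.2f} time units")
```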

Example 4. Let $f(y) = y \log y$ be defined on $\mathbb{R}^{\*}\_{+}$. Then $f(1) = 0$, and the derivatives of $f$ are $f'(y) = 1 + \log y$ and $f''(y) = \frac{1}{y} > 0$. The Legendre-Fenchel transform of $f$ is $f^{\*}(\xi) = y^{\*} = e^{\xi - 1}$. Let $(a\_1, a\_2) \mapsto g(a) = \lVert a \rVert\_2^2$, and $(a\_1, a\_2, \omega) \mapsto r(a\_1, a\_2, \omega) = -\left(1 + \sum\_{k=1}^{2} \omega\_k^2 a\_k^2\right)$. The distribution of the coefficient $\omega$ is unknown, but a sampled empirical measure $m$, built from $10^4$ samples, is taken to be close to the uniform distribution on $(0, 1]$. We illustrate the quick convergence rate of the algorithm in this basic example and plot in Figure 2 the trajectories under the standard gradient, the Bregman dynamics, and the risk-aware Bregman dynamics (15). In particular, we observe that the risk-aware Bregman dynamics (15) provides a satisfactory value very quickly. In this particular setup, the accuracy reached by the risk-aware Bregman algorithm (15) at $t = 0.5$ takes the standard Bregman algorithm four times as long ($t = 2$), and gradient ascent forty times as long ($t = 20$), to match. We also observe that the risk-aware Bregman algorithm is less oscillatory, and its amplitude decays very fast compared to the risk-neutral algorithm.

Figure 2. Gradient ascent vs. risk-aware Bregman dynamics for $r = -\left(1 + \sum\_{k=1}^{2} \omega\_k^2 a\_k^2\right)$.
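The Legendre-Fenchel transform claimed in Example 4 can be checked symbolically: maximizing $\xi y - y\log y$ over $y > 0$ gives the stationary point $y^{\*} = e^{\xi - 1}$ and the optimal value $e^{\xi - 1}$. A brief sympy verification:

```python
import sympy as sp

# Symbolic check of the Legendre-Fenchel transform in Example 4:
# f(y) = y*log(y) on (0, inf); the maximizer of xi*y - f(y) is
# y* = exp(xi - 1) and the transform equals f*(xi) = exp(xi - 1).
y = sp.symbols('y', positive=True)
xi = sp.symbols('xi', real=True)
f = y * sp.log(y)
ystar = sp.solve(sp.diff(xi * y - f, y), y)[0]   # xi - log(y) - 1 = 0
fstar = sp.simplify((xi * y - f).subs(y, ystar))
print(ystar, fstar)   # both print as exp(xi - 1)
```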

3. Constrained distributionally robust optimization

In the constrained case, i.e., when $\mathcal{A}$ is a strict subset of $\mathbb{R}^n$, algorithms (10) and (13) present some drawbacks: the trajectory $a(t)$ may not be feasible, i.e., $a(t) \notin \mathcal{A}$, even when it starts in $\mathcal{A}$. In order to design feasible trajectories, the projected gradient has been widely studied in the literature. However, a projection onto $\mathcal{A}$ at each time $t$ involves solving additional optimization problems, and the computation of the projected gradient adds extra complexity to the algorithm. We restrict our attention to the following constraints:

$$\mathcal{A} = \left\{ a \in \mathbb{R}^n \;\middle|\; a\_l \in \left[\underline{a}\_l, \overline{a}\_l\right],\; l \in \{1, \ldots, n\},\; \sum\_{l=1}^{n} c\_l a\_l \le b \right\}.$$

We impose the following feasibility condition: $\underline{a}\_l < \overline{a}\_l$ for every $l \in \{1, \ldots, n\}$, $c\_l > 0$, and $\sum\_{l=1}^{n} c\_l \underline{a}\_l < b$. Under this setting, the constraint set $\mathcal{A}$ is non-empty, convex and compact.

We propose a method to compute a constrained solution that has a full support (whenever it exists). We do not use the projection operator. Instead, we transform the domain: $\left[\underline{a}\_l, \overline{a}\_l\right] = \xi([0,1])$, where $\xi(x\_l) = \underline{a}\_l (1 - x\_l) + \overline{a}\_l x\_l$. The map $\xi$ is one-to-one, with inverse $x\_l = \xi^{-1}(a\_l) = \frac{a\_l - \underline{a}\_l}{\overline{a}\_l - \underline{a}\_l} \in [0,1]$, and the budget constraint becomes

$$\sum\_{l=1}^{n} c\_l \left(\overline{a}\_l - \underline{a}\_l\right) x\_l \;\le\; b - \sum\_{l=1}^{n} c\_l \underline{a}\_l =: \hat{b}.$$
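To make the change of variables concrete, the sketch below (with hypothetical bounds, weights, and budget) maps a point of the unit cube to a feasible action and verifies that the transformed budget constraint is equivalent to the original one; it illustrates the transform only, not the full constrained learning algorithm.

```python
import numpy as np

# Illustrative change of variables for the constraint set A:
# box bounds [lo_l, hi_l], weights c_l > 0, budget b, with sum(c*lo) < b.
# The values below are hypothetical; any data satisfying the feasibility
# condition of Section 3 works.
lo = np.array([0.0, 1.0, 0.5])
hi = np.array([2.0, 3.0, 1.5])
c = np.array([1.0, 0.5, 2.0])
b = 6.0
assert np.all(lo < hi) and np.all(c > 0) and c @ lo < b   # feasibility

def xi(x):
    """Map x in [0,1]^n to the box: a_l = lo_l*(1 - x_l) + hi_l*x_l."""
    return lo * (1 - x) + hi * x

def xi_inv(a):
    """Inverse map: x_l = (a_l - lo_l) / (hi_l - lo_l)."""
    return (a - lo) / (hi - lo)

b_hat = b - c @ lo                  # transformed budget \hat{b}
x = np.array([0.3, 0.6, 0.1])       # any point of the unit cube
a = xi(x)

# The budget constraint sum(c*a) <= b is equivalent to
# sum(c*(hi - lo)*x) <= b_hat in the transformed variables.
assert np.isclose(c @ a - b, (c * (hi - lo)) @ x - b_hat)
print("a =", a, "feasible:", bool(np.all((lo <= a) & (a <= hi)) and c @ a <= b))
```

The point of the transform is that any $x \in [0,1]^n$ satisfying the linear budget in the new variables yields a feasible action directly, so no projection step is needed along the trajectory.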
