**Polynomial Response Surfaces for Probabilistic Risk Assessment and Risk Control via Robust Design**

Sergey Oladyshkin and Wolfgang Nowak *SRC Simulation Technology, Institute of Hydraulic Engineering, University of Stuttgart Germany*

#### **1. Introduction**

20 Will-be-set-by-IN-TECH

316 Novel Approaches and Their Applications in Risk Assessment

Williams, M. & Ebel, E. (2012). Estimating changes in public health following implementation

Williams, M., Ebel, E. & Hoeting, J. (2011). Bayesian analysis for food-safety risk assessment:

Williams, M., Ebel, E. & Vose, D. (2011a). Framework for microbial food-safety risk assessments amenable to bayesian modeling, *Risk Analysis* 31(4): 548–565. Williams, M., Ebel, E. & Vose, D. (2011b). Methodology for determining the appropriateness

of a linear dose-response function, *Risk Analysis* 31(3): 345–350.

industry, *Foodborne Pathogens and Disease* 9: 59-67.

43: 1–14.

of hazard analysis and critical control point in the united states broiler slaughter

Evaluation of dose-response functions within winbugs, *Journal of Statistical Software*

Many engineering systems represent challenging classes of complex dynamic systems. Lacking information about their system properties leads to model uncertainties up to a level where quantification of uncertainties may become the dominant question in modeling, simulation and application tasks. Uncertainty quantification is the prerequisite for probabilistic risk assessment and related tasks. Current numerical simulation models are often too expensive for advanced application tasks that involve accurate uncertainty quantification, risk assessment and robust design. This Chapter will present recent approaches for these challenges based on polynomial response surface techniques, which reduce massively the initial complex model at surprising accuracy. The reduction is achieved via projections on orthonormal polynomial bases, which form a so-called response surface. This way, the model response to changes in uncertain parameters and design or control variables is represented by multi-variate polynomials for each output quantity of interest. This technique is known as polynomial chaos expansion (PCE) in the field of stochastic PDE solutions. The reduced model represented by the response surface is vastly faster than the original complex one, and thus provides a promising starting point for follow-up tasks: global sensitivity analysis, uncertainty quantification and probabilistic risk assessment and as well as robust design and control under uncertainty. Often, the fact that the response surface has known polynomial properties can further simplify these tasks. We will emphasize a more engineering-like language as compared to otherwise intense mathematical derivations found in the literature on PCE. Also we will make use of most recent developments in the theory of stochastic PDE solutions for engineering applications. The current Chapter provides tools based on PCE for global sensitivity analysis, uncertainty quantification and risk analysis as well as design under uncertainty (robust design).

#### **2. Response surface via polynomial chaos expansion**

In the present Chapter, we consider the response surface in a closed polynomial form. Obviously, a response surface can be constructed in different ways, e.g. it can be constructed directly on a dense Cartesian grid of input parameters at extremely high computational efforts. Likewise, conceptually straightforward numerical Monte Carlo (MC) simulation techniques are computationally demanding since the statistical accuracy of their predictions

Control Via Robust Design 3

Probabilistic Risk Assessment and Risk Control via Robust Design

<sup>319</sup> Polynomial Response Surfaces for

The aPC approach provides improved convergence in comparison to classical PCE techniques, when applied to input distributions that fall outside the range of classical PCE (Oladyshkin & Nowak, 2011). The necessity to adapt to arbitrary distribution is discussed in more details in

PCE techniques can mainly be sub-divided into intrusive (Ghanem & Spanos, 1993; Matthies & Keese., 2005; Xiu & Karniadakis, 2003) and non-intrusive (Isukapalli et al., 1998; Keese & Matthies, 2003; Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b) approaches, i.e., methods that require or do not require modifications in the system of governing equations and corresponding changes in simulation codes. Intrusive approaches require symbolic manipulations of the governing equations and can sometimes provide semi-analytical solutions for stochastic analyses of simple problems. The most well-known method from this group is the stochastic Galerkin technique. However, the necessary symbolic manipulations may become very complex and analytically cumbersome, and cannot easily be implemented in commercial codes. For this reason, non-intrusive approaches like sparse quadrature and the probabilistic collocation method (PCM: see Li & Zhang (2007); Oladyshkin, Class, Helmig & Nowak (2011b)) have lately been receiving a quickly increasing attention. In a simple sense, PCM can be interpreted as a smart (mathematically optimal) interpolation rule of model output between different parameter sets. The polynomial interpolation may be interpreted as a response surface of the model. It is based on a minimal and optimally chosen set of model evaluations, each with a defined set of model parameters (called collocation points). The challenge here is to find a compromise between computational effort and a reasonable

In this Section, we will introduce the aPC-based framework. aPC is for arbitrary Polynomial Chaos which is a most recent generalization of PCE methods (see Section 2.2). Let ω = {*ω*1,...,*ωN*} represent the vector of *N* input parameters for some model Ω = *f*(ω). The model Ω(*ω*) may be an explicit or implicit expression (e.g., a partial or ordinary differential equation or a coupled system). To perform sensitivity analysis, uncertainty quantification and risk assessment, we wish to investigate the influence of all parameters ω on the model output Ω.

In the following, we will approximate the model response by a truncated polynomial expansion for each point in space **x** and time *t*. According to polynomial chaos theory (Wiener,

> *M* ∑ *j*=0

where the number *M* of polynomials depends on the total number of analyzed input parameters (*N*) and the order *d* of the polynomial representation, according to the combinatoric formula *M* = (*N* + *d*)!/(*N*!*d*!) − 1. The coefficients *cj* in Eq. (1) quantify the dependence of the model output Ω on the input parameters ω for each desired point in space **x** and time *t*. The symbol Ψ*<sup>j</sup>* is a simplified notation of the multi-variate orthonormal polynomial basis for ω including all cross-terms between different parameters, as explained below. Let us mention that, in the current state of science for polynomial chaos expansion, the random variables have to be statistically independent or may be linearly correlated, which

*cj*(**x**, *t*)Ψ*j*(ω), (1)

1938), the model output Ω can be approximated by polynomials Ψ*j*(ω) as follows:

Ω(**x**, *t*; ω) ≈

The model output can be space and time dependent, i.e. **x** = (*x*1,*x*2,*x*3).

Section 4 of the Chapter.

approximation of the physical processes.

**2.1 Definitions and polynomial chaos expansion**

depend on the number of realizations used. In this Chapter we explore an alternative methodology which demands anly minimum number of model evaluations to construct a response surface (see details in Section 2.3).

Our alternative approach is through the polynomial chaos expansion (PCE) introduced by Wiener (1938). Generally, all PCE techniques can be viewed as an efficient approximation to full-blown stochastic modeling (e.g., exhaustive MC). The basic idea is to represent the response of a model to changes in variables through a response surface that is defined with the help of an orthonormal polynomial basis in the parameter space. In simple words, the dependence of model output on all relevant input parameters is approximated by a high-dimensional polynomial. This projection can be interpreted as an advanced approach to statistical regression. The PCE offers an efficient and accurate high-order way of including non-linear effects in stochastic analysis, see e.g. Fajraoui et al. (2011); Foo & Karniadakis (2010); Zhang & Lu (2004). One of the attractive features of PCE is the high-order approximation of error propagation (Ghanem & Spanos, 1991; 1990) as well as its computational speed when compared to MC (Oladyshkin, Class, Helmig & Nowak, 2011b).

The original PCE concept (Wiener, 1938) can be used only for Gaussian distributed input parameters. In recent years, the classical PCE technique was extended to the generalized polynomial chaos (gPC) (Wan & Karniadakis, 2006) which accommodates for the use of an increased, yet limited number of statistical distributions (Askey & Wilson, 1985). The PCE methods discussed above assume an exact knowledge of the probability density functions of all input parameters and they are optimal only when applied to a finite number of certain parametric probability distributions. Unfortunately, information about the distribution of data or input parameters is very limited in many realistic applications, especially in environmental engineering and sciences. Data that characterize model parameters often indicate a variety of statistical distribution shapes (e.g., bounded, skewed, multi-modal, discontinuous, etc). Also, empirical parameter distributions derived from raw data sets do in general not follow analytically known distribution shapes. For such reasons, application tasks demand further adaptation of the chaos expansion technique to a larger spectrum of distributions.

To accommodate for a wide range of data distributions, a recent generalization of PCE is the arbitrary polynomial chaos (aPC). There are only very few studies that have used aPC before, and they can only be found in mathematical stochastics (Ghanem & Doostan, 2006; Soize & Ghanem, 2004) and aerospace engineering (Witteveen & Bijl, 2006; Witteveen et al., 2007). These studies focused on proofs of existence, constructing the basis by Gram-Schmidt orthogonalization and on related, quite theoretical issues. A notable exception is the very recent study of Li et al. (2011) in the field of petroleum engineering. That study did not discuss, however, the aPC in the light of data availability which plays a critical role in applications such as those presented by Oladyshkin, Class, Helmig & Nowak (2011a); Oladyshkin & Nowak (2011).

Compared to earlier PCE techniques, the aPC adapts to arbitrary probability distribution shapes of input parameters and, in addition, can even work with unknown distribution shapes when only a few statistical moments can be inferred from limited data or from expert elicitation. The arbitrary distributions for the framework can be either discrete, continuous, or discretized continuous. They can be specified either analytically (as probability density/cumulative distribution functions), numerically as histogram or as a raw data sets. 2 Will-be-set-by-IN-TECH

depend on the number of realizations used. In this Chapter we explore an alternative methodology which demands anly minimum number of model evaluations to construct a

Our alternative approach is through the polynomial chaos expansion (PCE) introduced by Wiener (1938). Generally, all PCE techniques can be viewed as an efficient approximation to full-blown stochastic modeling (e.g., exhaustive MC). The basic idea is to represent the response of a model to changes in variables through a response surface that is defined with the help of an orthonormal polynomial basis in the parameter space. In simple words, the dependence of model output on all relevant input parameters is approximated by a high-dimensional polynomial. This projection can be interpreted as an advanced approach to statistical regression. The PCE offers an efficient and accurate high-order way of including non-linear effects in stochastic analysis, see e.g. Fajraoui et al. (2011); Foo & Karniadakis (2010); Zhang & Lu (2004). One of the attractive features of PCE is the high-order approximation of error propagation (Ghanem & Spanos, 1991; 1990) as well as its computational speed when compared to MC (Oladyshkin, Class, Helmig & Nowak, 2011b). The original PCE concept (Wiener, 1938) can be used only for Gaussian distributed input parameters. In recent years, the classical PCE technique was extended to the generalized polynomial chaos (gPC) (Wan & Karniadakis, 2006) which accommodates for the use of an increased, yet limited number of statistical distributions (Askey & Wilson, 1985). The PCE methods discussed above assume an exact knowledge of the probability density functions of all input parameters and they are optimal only when applied to a finite number of certain parametric probability distributions. Unfortunately, information about the distribution of data or input parameters is very limited in many realistic applications, especially in environmental engineering and sciences. Data that characterize model parameters often indicate a variety of statistical distribution shapes (e.g., bounded, skewed, multi-modal, discontinuous, etc). Also, empirical parameter distributions derived from raw data sets do in general not follow analytically known distribution shapes. For such reasons, application tasks demand further adaptation of the chaos expansion technique to a larger spectrum of

To accommodate for a wide range of data distributions, a recent generalization of PCE is the arbitrary polynomial chaos (aPC). There are only very few studies that have used aPC before, and they can only be found in mathematical stochastics (Ghanem & Doostan, 2006; Soize & Ghanem, 2004) and aerospace engineering (Witteveen & Bijl, 2006; Witteveen et al., 2007). These studies focused on proofs of existence, constructing the basis by Gram-Schmidt orthogonalization and on related, quite theoretical issues. A notable exception is the very recent study of Li et al. (2011) in the field of petroleum engineering. That study did not discuss, however, the aPC in the light of data availability which plays a critical role in applications such as those presented by Oladyshkin, Class, Helmig & Nowak (2011a); Oladyshkin & Nowak

Compared to earlier PCE techniques, the aPC adapts to arbitrary probability distribution shapes of input parameters and, in addition, can even work with unknown distribution shapes when only a few statistical moments can be inferred from limited data or from expert elicitation. The arbitrary distributions for the framework can be either discrete, continuous, or discretized continuous. They can be specified either analytically (as probability density/cumulative distribution functions), numerically as histogram or as a raw data sets.

response surface (see details in Section 2.3).

distributions.

(2011).

The aPC approach provides improved convergence in comparison to classical PCE techniques, when applied to input distributions that fall outside the range of classical PCE (Oladyshkin & Nowak, 2011). The necessity to adapt to arbitrary distribution is discussed in more details in Section 4 of the Chapter.

PCE techniques can mainly be sub-divided into intrusive (Ghanem & Spanos, 1993; Matthies & Keese., 2005; Xiu & Karniadakis, 2003) and non-intrusive (Isukapalli et al., 1998; Keese & Matthies, 2003; Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b) approaches, i.e., methods that require or do not require modifications in the system of governing equations and corresponding changes in simulation codes. Intrusive approaches require symbolic manipulations of the governing equations and can sometimes provide semi-analytical solutions for stochastic analyses of simple problems. The most well-known method from this group is the stochastic Galerkin technique. However, the necessary symbolic manipulations may become very complex and analytically cumbersome, and cannot easily be implemented in commercial codes. For this reason, non-intrusive approaches like sparse quadrature and the probabilistic collocation method (PCM: see Li & Zhang (2007); Oladyshkin, Class, Helmig & Nowak (2011b)) have lately been receiving a quickly increasing attention. In a simple sense, PCM can be interpreted as a smart (mathematically optimal) interpolation rule of model output between different parameter sets. The polynomial interpolation may be interpreted as a response surface of the model. It is based on a minimal and optimally chosen set of model evaluations, each with a defined set of model parameters (called collocation points). The challenge here is to find a compromise between computational effort and a reasonable approximation of the physical processes.

#### **2.1 Definitions and polynomial chaos expansion**

In this Section, we will introduce the aPC-based framework. aPC is for arbitrary Polynomial Chaos which is a most recent generalization of PCE methods (see Section 2.2). Let ω = {*ω*1,...,*ωN*} represent the vector of *N* input parameters for some model Ω = *f*(ω). The model Ω(*ω*) may be an explicit or implicit expression (e.g., a partial or ordinary differential equation or a coupled system). To perform sensitivity analysis, uncertainty quantification and risk assessment, we wish to investigate the influence of all parameters ω on the model output Ω. The model output can be space and time dependent, i.e. **x** = (*x*1,*x*2,*x*3).

In the following, we will approximate the model response by a truncated polynomial expansion for each point in space **x** and time *t*. According to polynomial chaos theory (Wiener, 1938), the model output Ω can be approximated by polynomials Ψ*j*(ω) as follows:

$$
\Omega(\mathbf{x}, t; \omega) \approx \sum\_{j=0}^{M} c\_j(\mathbf{x}, t) \Psi\_j(\omega), \tag{1}
$$

where the number *M* of polynomials depends on the total number of analyzed input parameters (*N*) and the order *d* of the polynomial representation, according to the combinatoric formula *M* = (*N* + *d*)!/(*N*!*d*!) − 1. The coefficients *cj* in Eq. (1) quantify the dependence of the model output Ω on the input parameters ω for each desired point in space **x** and time *t*. The symbol Ψ*<sup>j</sup>* is a simplified notation of the multi-variate orthonormal polynomial basis for ω including all cross-terms between different parameters, as explained below. Let us mention that, in the current state of science for polynomial chaos expansion, the random variables have to be statistically independent or may be linearly correlated, which

Control Via Robust Design 5

<sup>321</sup> Polynomial Response Surfaces for

Abramowitz & Stegun (1965); Stieltjes (1884); Witteveen & Bijl (2006)) from the following

⎤ ⎥ ⎥ ⎥ ⎥ ⎦

evident from Eq. (4) that statistical moments are the only required form of information on the input distributions. This property carries over to all other types of Taylor or polynomial chaos

The above orthogonal polynomial basis can be used directly for analysis. However, a normalized basis has further useful properties. For example, the mean and variance of Ω(ω) according to the expansion (1) is given by simple analytical relations (see Eq. (15) in Section 4 of this Chapter), by virtue of the orthonormality property. This follows from the general properties of Fourier expansions, which encompass all expansions in orthonormal bases. The squared coefficients of such expansions are called the spectrum (compare with the Fourier transformation). The sum of squared coefficients is the total energy or variance. Due to orthonormality, most terms cancel out in subsequent steps. For example, if we consider a stochastic process in the probability space (Λ, *A*, Γ) with space of events Λ, *σ*-algebra *A* and

⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣

*p* (*k*) 0,*j p* (*k*) 1,*j* ... *p* (*k*) *k*−1,*j p* (*k*) *k*,*j*

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦

*th* non-central (raw) statistical moments for random variable *ωj*. It becomes

*ωj*∈Λ

Because the square of a polynomial of order *k* yields a polynomial of order 2*k*, normalization

When the input data set is small, the sample moments are only uncertain estimates of real moments. Hence, a direct application of the method presented becomes less robust. In that case, it would be useful to apply some standard methods to achieve robustness in the estimation of moments, such as bootstrapping (e.g. Efron (1987)). Moreover, especially in such cases, expert opinion can be very useful. In the presented approach, an expert will have total freedom of data interpretation (not restricted to the selection among standard PDFs) and can provide much more sophisticated information (e.g. lower and higher moments,

� *P*(*k*) *<sup>j</sup>* (*ω*) �2

=

⎡ ⎢ ⎢ ⎢ ⎢ ⎣

⎤ ⎥ ⎥ ⎥ ⎥ ⎦

. (4)

*<sup>j</sup>* (*ωj*) is an orthonormal basis, then

*d*Γ(*ωj*), (6)

*<sup>j</sup>* for space of events Λ (where

*<sup>j</sup>* (*ωj*)*d*Γ(*ξ*) = *δkl*. (5)

matrix equation:

Here, *μi*,*<sup>j</sup>* are the *i*

by definition of orthogonality:

expansions.

where �*P<sup>k</sup>*

⎡ ⎢ ⎢ ⎢ ⎢ ⎣

Probabilistic Risk Assessment and Risk Control via Robust Design

probability measure <sup>Γ</sup> (see e.g. Grigoriu (2002)) and if *<sup>P</sup>*�(*k*)

�

The orthonormal polynomial basis can be obtained as:

*P*�(*k*) *<sup>j</sup>* (*ωj*) =

*ω<sup>j</sup>* ∈ Λ) with probability measure Γ.

*ωj*∈Λ

*P*(*k*) *j* � � � *P*(*k*) *j* � � � , � � � *P*(*k*) *j* � � � 2 = �

*<sup>j</sup>* � is the normalizing constant of the polynomial *<sup>P</sup><sup>k</sup>*

according to Eq. (6) requires the statistical moments of *ω* up to order 2*d*.

*P*�(*k*) *<sup>j</sup>* (*ωj*)*P*�(*l*)

*μ*0,*<sup>j</sup> μ*1,*<sup>j</sup>* ... *μk*,*<sup>j</sup> μ*1,*<sup>j</sup> μ*2,*<sup>j</sup>* ... *μk*+1,*<sup>j</sup>* ... ... ... ... *<sup>μ</sup>k*−1,*<sup>j</sup> <sup>μ</sup>k*,*<sup>j</sup>* ... *<sup>μ</sup>*2*k*−1,*<sup>j</sup>* 0 0 ... 1

can be removed by adequate linear transformation. Construction of a polynomial basis for statistically dependent random variables beyond linear dependence is a very important issue for future research.

#### **2.2 Data-driven orthonormal basis**

The support interval, weighting function or probability distribution for ω is determined from available information (modeler's experience, expert opinion, general prior information or field data) and reflects the uncertainty or expected range of variation of input parameters. This implies that the polynomial basis and related projections should adapt to the acquired information, such that they approximate the model best where the probability density of the parameters is highest. In order to construct such a data-driven polynomial basis that considers all the available information about the input parameters ω, let us define the set of polynomials {*P*(0) *<sup>j</sup>* , . . ., *<sup>P</sup>*(*d*) *<sup>j</sup>* } of degree *d* for the parameters *ω<sup>j</sup>* as an orthonormal basis in the parameter space. The polynomial *P*(*k*) *<sup>j</sup>* (*ωj*) of degree *k* in an individual parameter *ω<sup>j</sup>* can be written as a simple linear combination of the different powers *i* of *ωj*:

$$P\_j^{(k)}(\\\omega\_j) = \sum\_{i=0}^k p\_{i,j}^{(k)} \omega\_{j\prime}^i \quad k = 0 \dots d, \quad j = 0 \dots N. \tag{2}$$

Here *p* (*k*) *<sup>i</sup>*,*<sup>j</sup>* are the coefficients within the polynomial *<sup>P</sup>*(*k*) *<sup>j</sup>* (*ωj*). Assuming that the input parameters within ω are independent (Ghanem & Spanos, 1991), the multi-dimensional basis can be constructed as a simple product of the corresponding univariate polynomials:

$$\Psi\_k(\omega) = \prod\_{j=1}^N P\_j^{(a\_j^k)}(\omega\_j), \quad \sum\_{j=1}^N a\_j^k \le M, \quad k = 1 \dots N,\tag{3}$$

where *α<sup>k</sup> <sup>j</sup>* is a multivariate index that contains the combinatoric information how to enumerate all possible products of individual univariate basis functions. In other words, the index *α* can be seen as *M* × *N* matrix, which contains the corresponding degree (e.g. 0, 1, 2, etc.) for parameter number *j* in expansion term *k*. The multivariate basis allows to represent the reaction of a model Ω to several (*N*) parameters *ω<sup>j</sup>* (*j* = 0... *N*), as an N-dimensional polynomial response surface, defined by the expansion in Eq. (1).

We will show now, how to construct the data-driven orthogonal polynomial basis for each individual component *ω<sup>j</sup>* from the vector ω. The main idea of the data-driven approach, see Oladyshkin, Class, Helmig & Nowak (2011a); Oladyshkin & Nowak (2011), consists in constructing the coefficients *p* (*k*) *<sup>i</sup>*,*<sup>j</sup>* for Eq. 2 in such a way that the polynomials in Eq. (2) form a basis that is orthonormal in precisely the given input distributions of model parameters. It does so without posing any restrictions to the statistical distribution shapes or weighting functions that available data, expert opinion or modeler experience may assume.

According to Oladyshkin & Nowak (2011), an orthogonal polynomial basis up to order *d* can be constructively defined for any arbitrary probability measure, given that *ω<sup>j</sup>* has finite statistical moments (e.g., mean, variance, skewness, etc) up to order 2*d* − 1. The unknown polynomial coefficients *p* (*k*) *<sup>i</sup>*,*<sup>j</sup>* can be defined (among other available options for construction: 4 Will-be-set-by-IN-TECH

can be removed by adequate linear transformation. Construction of a polynomial basis for statistically dependent random variables beyond linear dependence is a very important issue

The support interval, weighting function or probability distribution for ω is determined from available information (modeler's experience, expert opinion, general prior information or field data) and reflects the uncertainty or expected range of variation of input parameters. This implies that the polynomial basis and related projections should adapt to the acquired information, such that they approximate the model best where the probability density of the parameters is highest. In order to construct such a data-driven polynomial basis that considers all the available information about the input parameters ω, let us define the set of polynomials

parameters within ω are independent (Ghanem & Spanos, 1991), the multi-dimensional basis

*N* ∑ *j*=1 *αk*

all possible products of individual univariate basis functions. In other words, the index *α* can be seen as *M* × *N* matrix, which contains the corresponding degree (e.g. 0, 1, 2, etc.) for parameter number *j* in expansion term *k*. The multivariate basis allows to represent the reaction of a model Ω to several (*N*) parameters *ω<sup>j</sup>* (*j* = 0... *N*), as an N-dimensional

We will show now, how to construct the data-driven orthogonal polynomial basis for each individual component *ω<sup>j</sup>* from the vector ω. The main idea of the data-driven approach, see Oladyshkin, Class, Helmig & Nowak (2011a); Oladyshkin & Nowak (2011), consists in

a basis that is orthonormal in precisely the given input distributions of model parameters. It does so without posing any restrictions to the statistical distribution shapes or weighting

According to Oladyshkin & Nowak (2011), an orthogonal polynomial basis up to order *d* can be constructively defined for any arbitrary probability measure, given that *ω<sup>j</sup>* has finite statistical moments (e.g., mean, variance, skewness, etc) up to order 2*d* − 1. The unknown

functions that available data, expert opinion or modeler experience may assume.

*<sup>j</sup>* is a multivariate index that contains the combinatoric information how to enumerate

can be constructed as a simple product of the corresponding univariate polynomials:

*<sup>j</sup>* } of degree *d* for the parameters *ω<sup>j</sup>* as an orthonormal basis in the parameter

*<sup>j</sup>* (*ωj*) of degree *k* in an individual parameter *ω<sup>j</sup>* can be written as a

*<sup>i</sup>*,*<sup>j</sup>* for Eq. 2 in such a way that the polynomials in Eq. (2) form

*<sup>i</sup>*,*<sup>j</sup>* can be defined (among other available options for construction:

, *k* = 0... *d*, *j* = 0... *N*. (2)

*<sup>j</sup>* (*ωj*). Assuming that the input

*<sup>j</sup>* ≤ *M*, *k* = 1... *N*, (3)

for future research.

{*P*(0)

Here *p*

where *α<sup>k</sup>*

*<sup>j</sup>* , . . ., *<sup>P</sup>*(*d*)

(*k*)

space. The polynomial *P*(*k*)

constructing the coefficients *p*

polynomial coefficients *p*

simple linear combination of the different powers *i* of *ωj*:

*k* ∑ *i*=0 *p* (*k*) *<sup>i</sup>*,*<sup>j</sup> <sup>ω</sup><sup>i</sup> j*

*<sup>i</sup>*,*<sup>j</sup>* are the coefficients within the polynomial *<sup>P</sup>*(*k*)

*N* ∏ *j*=1 *P* (*α<sup>k</sup> j*) *<sup>j</sup>* (*ωj*),

polynomial response surface, defined by the expansion in Eq. (1).

(*k*)

(*k*)

*P*(*k*) *<sup>j</sup>* (*ωj*) =

Ψ*k*(ω) =

**2.2 Data-driven orthonormal basis**

Abramowitz & Stegun (1965); Stieltjes (1884); Witteveen & Bijl (2006)) from the following matrix equation:

$$
\begin{bmatrix}
\mu\_{0,j} & \mu\_{1,j} & \dots & \mu\_{k,j} \\
\mu\_{1,j} & \mu\_{2,j} & \dots & \mu\_{k+1,j} \\
\dots & \dots & \dots & \dots & \dots \\
\mu\_{k-1,j} & \mu\_{k,j} & \dots & \mu\_{2k-1,j} \\
0 & 0 & \dots & 1 \\
\end{bmatrix}
\begin{bmatrix}
p\_{0,j}^{(k)} \\
p\_{1,j}^{(k)} \\
\dots \\
p\_{k-1,j}^{(k)} \\
p\_{k,j}^{(k)} \\
\end{bmatrix} = \begin{bmatrix}
0 \\
0 \\
\dots \\
0 \\
1
\end{bmatrix}.
\tag{4}
$$

Here, *μi*,*<sup>j</sup>* are the *i th* non-central (raw) statistical moments for random variable *ωj*. It becomes evident from Eq. (4) that statistical moments are the only required form of information on the input distributions. This property carries over to all other types of Taylor or polynomial chaos expansions.

The above orthogonal polynomial basis can be used directly for analysis. However, a normalized basis has further useful properties. For example, the mean and variance of Ω(ω) according to the expansion (1) is given by simple analytical relations (see Eq. (15) in Section 4 of this Chapter), by virtue of the orthonormality property. This follows from the general properties of Fourier expansions, which encompass all expansions in orthonormal bases. The squared coefficients of such expansions are called the spectrum (compare with the Fourier transformation). The sum of squared coefficients is the total energy or variance. Due to orthonormality, most terms cancel out in subsequent steps. For example, if we consider a stochastic process in the probability space (Λ, *A*, Γ) with space of events Λ, *σ*-algebra *A* and probability measure <sup>Γ</sup> (see e.g. Grigoriu (2002)) and if *<sup>P</sup>*�(*k*) *<sup>j</sup>* (*ωj*) is an orthonormal basis, then by definition of orthogonality:

$$\int\_{\omega\_{\hat{j}} \in \Lambda} \widehat{P}\_{\hat{j}}^{(k)}(\omega\_{\hat{j}}) \widehat{P}\_{\hat{j}}^{(l)}(\omega\_{\hat{j}}) d\Gamma(\xi) = \delta\_{kl}.\tag{5}$$

The orthonormal polynomial basis can be obtained as:

$$\widehat{P}\_{\dot{j}}^{(k)}(\omega\_{\dot{j}}) = \frac{P\_{\dot{j}}^{(k)}}{\left\| P\_{\dot{j}}^{(k)} \right\|}, \quad \left\| P\_{\dot{j}}^{(k)} \right\|^2 = \int\_{\omega\_{\dot{j}} \in \Lambda} \left[ P\_{\dot{j}}^{(k)}(\omega) \right]^2 d\Gamma(\omega\_{\dot{j}}) \,\tag{6}$$

where �*P<sup>k</sup> <sup>j</sup>* � is the normalizing constant of the polynomial *<sup>P</sup><sup>k</sup> <sup>j</sup>* for space of events Λ (where *ω<sup>j</sup>* ∈ Λ) with probability measure Γ.

Because the square of a polynomial of order *k* yields a polynomial of order 2*k*, normalization according to Eq. (6) requires the statistical moments of *ω* up to order 2*d*.

When the input data set is small, the sample moments are only uncertain estimates of real moments. Hence, a direct application of the method presented becomes less robust. In that case, it would be useful to apply some standard methods to achieve robustness in the estimation of moments, such as bootstrapping (e.g. Efron (1987)). Moreover, especially in such cases, expert opinion can be very useful. In the presented approach, an expert will have total freedom of data interpretation (not restricted to the selection among standard PDFs) and can provide much more sophisticated information (e.g. lower and higher moments,

Control Via Robust Design 7

<sup>323</sup> Polynomial Response Surfaces for

1st order 2nd order 3rd order 4th order

2 4 6 8 10

Parametrical dimensionality, N

points. The full tensor grid can be used for low-order (1*st*, 2*nd*) analysis of few parameters, but for higher-order analysis of many parameters the tensor grid suffers from the curse of dimensionality ((*d* + 1)*<sup>N</sup>* points). In that case, a smart choice of a sparse subset of the tensor grid becomes necessary. For this reason, the collocation approach became more popular in the last years. Probabilistic collocation (Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b) chooses the collocation points from the full tensor grid according to their probability weight, i.e. their importance as specified by the available probability distribution of ω. This simply means to select the collocation points from the most probable regions of the input parameters' distribution (see Oladyshkin, Class, Helmig & Nowak (2011b)) and the modeler

From the practical point of view, the computational costs of the data-driven chaos expansion combined with non-intrusive collocation approach are proportional to the number of terms in the chaos expansion multiplied by the time of a single model evaluation. The number of model evaluations (number of expansion terms *M*) for the construction of response surface via collocation approach depends on crossing order of expansion and presented in Figure 1.

The data-driven polynomial chaos expansion presented in Section 2.2 provides a simple and efficient tool for analysis of stochastic systems. Let us illustrate the efficiency of analysis within the data-driven versus the conventional expansion. For that, we will consider the simple exponential decay differential equation which was already used in Xiu & Karniadakis (2003)

Let *YPC* be the solution obtained using the polynomial chaos expansion (1) for the problem defined in equation (9). We use a Monte Carlo simulation as reference solution and define the relative error between the polynomial chaos expansion solution *YPC* and the Monte Carlo

*dt* <sup>=</sup> <sup>−</sup>*ξPhY*, *<sup>Y</sup>*(0) = <sup>1</sup> (9)

Fig. 1. Computational costs: number of model evaluations for different orders

can extract a lot of information in the main range of the parameter distribution.

0

**2.4 Efficient convergence of data-driven expansion**

*dY*(*t*)

for illustration of the Askey scheme:

solution *YMC* as *�* = |*YPC* − *YMC*| / |*YMC*|.

20

40

60

Number of model evaluations, M

80

100

Probabilistic Risk Assessment and Risk Control via Robust Design

etc.). Such an expert opinion (in a most general sense) will be incorporated directly without any additional transformation or additional subjectivity when translating it to the stochastic numerical framework.

#### **2.3 Non-intrusive determination of the coefficients**

The remaining task is to evaluate the coefficients *cj* in Eq. (1). Therefore, we will use the non-intrusive probabilistic collocation method (PCM) (Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b). The collocation formulation does not require any knowledge of the initial model structure, i.e., of Ω. It only requires knowledge on how to obtain the model output for a given set of input parameters, which allows to treat the model Ω like a "black-box". The distinctive feature of non-intrusive approaches is that any simulation model can be considered a "black-box", i.e. commercial software can be used without any modifications required. The idea of PCM is to evaluate the model exactly *M* times, which allows to directly fit the polynomial representation of Ω, see Eq. (1), with its *M* unknown coefficients *cj* to the obtained *M* model results. The *M* model evaluations are performed with *<sup>M</sup>* different parameter sets {*ω*(*i*) <sup>1</sup> , ..., *<sup>ω</sup>*(*i*) *<sup>N</sup>* }, *i* = 1, .., *M*, called collocation points.

This leads to the following system (Villadsen & Michelsen, 1978) of linear equations:

$$\mathbf{M}\_{\mathbf{V}}(\omega)\mathbf{V}\_{\mathcal{L}}(\mathbf{x},t) = \mathbf{V}\_{\Omega}(\mathbf{x},t;\omega) \tag{7}$$

where **V***<sup>c</sup>* is the *M* × 1 vector of unknown coefficients *cj* in expansion (1), the *M* × 1 vector **V**<sup>Ω</sup> contains the model output for each collocation point, and the *M* × *M* matrix **M**<sup>Ψ</sup> contains the polynomials evaluated at the collocation points:

$$\begin{aligned} \mathbf{M}\_{\mathbf{V}} &= \left\{ \mathbf{V}\_{i}(\omega\_{1}^{(i)}, \dots, \omega\_{N}^{(i)}) \right\}, \quad i = 1 \dots M, j = 1 \dots M; \\ \mathbf{V}\_{\Omega} &= \left\{ \Omega\_{i}(\mathbf{x}, y, z, t, \omega\_{1}^{(i)}, \dots, \omega\_{N}^{(i)}) \right\}, \quad i = 1 \dots M; \\ \mathbf{V}\_{\mathbb{C}} &= \left\{ c\_{i}(\mathbf{x}, y, z, t) \right\}, \quad i = 1 \dots M. \end{aligned} \tag{8}$$

The vectors **V***<sup>c</sup>* and **V**<sup>Ω</sup> are space- and time-dependent, whereas the matrix **M**<sup>Ψ</sup> does not depend on space and time and can be generated once for the given expansion degree and parameter number.

The solution **V***c* of the system (7) depends on the selection of collocation points. According to Villadsen & Michelsen (1978), the optimal choice of collocation points corresponds to the roots of the polynomial of one degree higher (*d* + 1) than the order used in the chaos expansion (*d*). Once the orthonormal polynomial basis is constructed using data (or assumptions on data), the collocation points become as well data-driven and optimally distributed in the space of input parameters. This strategy is based on the theory of Gaussian integration (e.g., Abramowitz & Stegun (1965)), and allows exact numerical integrations of order *d* given *d* + 1 values of the function to be integrated.

The data-driven polynomial basis (see Section 2.2) defines the positions of the collocation points specific to the distribution of input parameters at hand and, thus, indicates what are the optimal parameter sets for model evaluation using all available information about the input parameters.

For multi-parameter analysis, the number of available points from the original optimal integration rule is (*d* + 1)*N*, which is larger than the necessary number *M* of collocation 6 Will-be-set-by-IN-TECH

etc.). Such an expert opinion (in a most general sense) will be incorporated directly without any additional transformation or additional subjectivity when translating it to the stochastic

The remaining task is to evaluate the coefficients *cj* in Eq. (1). Therefore, we will use the non-intrusive probabilistic collocation method (PCM) (Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b). The collocation formulation does not require any knowledge of the initial model structure, i.e., of Ω. It only requires knowledge on how to obtain the model output for a given set of input parameters, which allows to treat the model Ω like a "black-box". The distinctive feature of non-intrusive approaches is that any simulation model can be considered a "black-box", i.e. commercial software can be used without any modifications required. The idea of PCM is to evaluate the model exactly *M* times, which allows to directly fit the polynomial representation of Ω, see Eq. (1), with its *M* unknown coefficients *cj* to the obtained *M* model results. The *M* model evaluations are performed with

*<sup>N</sup>* }, *i* = 1, .., *M*, called collocation points.

**M**Ψ(ω)**V***<sup>c</sup>* (**x**, *t*) = **V**Ω(**x**, *t*; ω) (7)

, *i* = 1... *M*, *j* = 1... *M*;

, *i* = 1... *M*; (8)

numerical framework.

*<sup>M</sup>* different parameter sets {*ω*(*i*)

parameter number.

input parameters.

polynomials evaluated at the collocation points:

**M**<sup>Ψ</sup> =

**V**<sup>Ω</sup> = 

values of the function to be integrated.

 <sup>Ψ</sup>*i*(*ω*(*i*)

**2.3 Non-intrusive determination of the coefficients**

<sup>1</sup> , ..., *<sup>ω</sup>*(*i*)

This leads to the following system (Villadsen & Michelsen, 1978) of linear equations:

<sup>1</sup> , ..., *<sup>ω</sup>*(*i*) *<sup>N</sup>* ) 

<sup>Ω</sup>*i*(*x*, *<sup>y</sup>*, *<sup>z</sup>*, *<sup>t</sup>*, *<sup>ω</sup>*(*i*)

**V***<sup>c</sup>* = {*ci*(*x*, *y*, *z*, *t*)} , *i* = 1... *M*.

where **V***<sup>c</sup>* is the *M* × 1 vector of unknown coefficients *cj* in expansion (1), the *M* × 1 vector **V**<sup>Ω</sup> contains the model output for each collocation point, and the *M* × *M* matrix **M**<sup>Ψ</sup> contains the

> <sup>1</sup> , ..., *<sup>ω</sup>*(*i*) *<sup>N</sup>* )

The vectors **V***<sup>c</sup>* and **V**<sup>Ω</sup> are space- and time-dependent, whereas the matrix **M**<sup>Ψ</sup> does not depend on space and time and can be generated once for the given expansion degree and

The solution **V***c* of the system (7) depends on the selection of collocation points. According to Villadsen & Michelsen (1978), the optimal choice of collocation points corresponds to the roots of the polynomial of one degree higher (*d* + 1) than the order used in the chaos expansion (*d*). Once the orthonormal polynomial basis is constructed using data (or assumptions on data), the collocation points become as well data-driven and optimally distributed in the space of input parameters. This strategy is based on the theory of Gaussian integration (e.g., Abramowitz & Stegun (1965)), and allows exact numerical integrations of order *d* given *d* + 1

The data-driven polynomial basis (see Section 2.2) defines the positions of the collocation points specific to the distribution of input parameters at hand and, thus, indicates what are the optimal parameter sets for model evaluation using all available information about the

For multi-parameter analysis, the number of available points from the original optimal integration rule is (*d* + 1)*N*, which is larger than the necessary number *M* of collocation

Fig. 1. Computational costs: number of model evaluations for different orders

points. The full tensor grid can be used for low-order (1*st*, 2*nd*) analysis of few parameters, but for higher-order analysis of many parameters the tensor grid suffers from the curse of dimensionality ((*d* + 1)*<sup>N</sup>* points). In that case, a smart choice of a sparse subset of the tensor grid becomes necessary. For this reason, the collocation approach became more popular in the last years. Probabilistic collocation (Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b) chooses the collocation points from the full tensor grid according to their probability weight, i.e. their importance as specified by the available probability distribution of ω. This simply means to select the collocation points from the most probable regions of the input parameters' distribution (see Oladyshkin, Class, Helmig & Nowak (2011b)) and the modeler can extract a lot of information in the main range of the parameter distribution.

From the practical point of view, the computational costs of the data-driven chaos expansion combined with non-intrusive collocation approach are proportional to the number of terms in the chaos expansion multiplied by the time of a single model evaluation. The number of model evaluations (number of expansion terms *M*) for the construction of response surface via collocation approach depends on crossing order of expansion and presented in Figure 1.

#### **2.4 Efficient convergence of data-driven expansion**

The data-driven polynomial chaos expansion presented in Section 2.2 provides a simple and efficient tool for analysis of stochastic systems. Let us illustrate the efficiency of analysis within the data-driven versus the conventional expansion. For that, we will consider the simple exponential decay differential equation which was already used in Xiu & Karniadakis (2003) for illustration of the Askey scheme:

$$\frac{dY(t)}{dt} = -\xi\_{Pl}Y\_{\prime} \quad Y(0) = 1\tag{9}$$

Let *YPC* be the solution obtained using the polynomial chaos expansion (1) for the problem defined in equation (9). We use a Monte Carlo simulation as reference solution and define the relative error between the polynomial chaos expansion solution *YPC* and the Monte Carlo solution *YMC* as *�* = |*YPC* − *YMC*| / |*YMC*|.

Control Via Robust Design 9

<sup>325</sup> Polynomial Response Surfaces for

applied to input distributions that fall outside the range of classical PCE (Oladyshkin, Class,

The method only demands the existence of a finite number of moments, and does not require the exact knowledge or even existence of probability density functions. An interesting aspect is that only moments up to twice the order of expansion matter. Therefore, any PDFs fitted to input data will lead to the same statistical moments for model output, if the PDFs coincide in the required moments up to twice the order of expansion. For the same reason, if one still desires to fit a single PDF to the input data set, we recommend maximum entropy or minimum relative entropy methods applied such that the moments relevant for PCE are matched. This differs drastically from fitting low-parametric distributions to only lower moments of the available data, because this would modify the remaining relevant moments up to twice the

Understanding the general role of parameters in models and the impact of varying model parameters on the response of prediction models is a relevant subject in various fields of science and engineering. Characterizing the impact of parameter variations is known as sensitivity analysis and can be subdivided into local and global analysis (Saltelli et al., 2008; Sobol, 1990; Sudret, 2008). In many cases of practical interest, we wish to perform a Global Sensitivity Analysis (GSA) in order to analyze a model as such or to investigate, quantify and rank the effects of parameter variation or parameter uncertainty on the overall model uncertainty. GSA can also be used to: (1) quantify the relative importance of each individual input parameter in the final prediction (Anderson & Burt, 1985; Sobol, 1990; Winter et al., 2006); (2) aid engineers to produce more robust designs; and finally (3) help decision makers

For example, the field of subsurface contaminant hydrology requires uncertainty estimates due to the ubiquitous lack of parameter knowledge caused by spatial heterogeneity of hydraulic properties in combination with incomplete characterization (Dagan, 1989; Rubin, 2003). For such reasons, we need to rely on probabilistic tools to predict contaminant levels and their overall health effects, and to quantify the corresponding uncertainties. Having efficient computational approaches to estimate uncertainty and to perform GSA in hydrogeological applications (and many other fields of science and engineering that feature uncertain dynamic or distributed systems) is desirable. It can inform modelers about the relevance of processes or parameters in the models they compile, and can inform engineers and decision makers about which parameters require most attention and where characterization efforts should be allocated such that prediction uncertainty can be minimized. Hence, there is an ever-increasing demand for having a GSA method that efficiently quantifies uncertainty and parameter relevance in complex and non-linear systems. An important recommendation to keep in mind is that GSA should be global not only in the sense of looking at the entire range of possible parameter variations. It should also be used to assess the importance of parameters on a global, final model output or post-processing result that is relevant to generate new insight, or relevant for final decisions. GSA should not merely be applied to model-internal quantities that are of secondary importance for the scientific or

order of expansion. More details on this issue will be provided in Section 4.

to allocate financial resources towards better uncertainty reduction.

application task at hand (Saltelli et al., 2008).

Helmig & Nowak, 2011a; Oladyshkin & Nowak, 2011).

Probabilistic Risk Assessment and Risk Control via Robust Design

**3. Global sensitivity analysis**

We will apply the chaos expansion to equation (9) with both intrusive, like the Galerkin method (Ghanem & Spanos, 1993; Matthies & Keese., 2005; Xiu & Karniadakis, 2003), and non-intrusive approaches , like the probabilistic collocation method (Isukapalli et al., 1998; Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b). Both approaches depend on the distribution of the random variable *ξ*. If the random variable *ξ* is not distributed in same space as the polynomial basis, an additional conversion is required, such as e.g. Gaussian anamorphosis (Wackernagel, 1998). For each method, we will construct two expansions: conventional PCE mapped to Gaussian and data-driven aPC. Figure 2 demonstrates the convergence of the mean value (at time *t* = 1) for a small selection of exemplary distributions of the model input *ξ* (e.g. Rayleigh, Weibull, Log-normal). A similar situation has been observed for the convergence of the variance. More details are presented in the study by Oladyshkin & Nowak (2011). Chaos expansion in the data-driven (optimal) polynomial basis without transformation shows at least an exponential convergence. Let us mentione, that the data-driven basis provides identical results for both intrusive and non-intrusive methods, because numerical integration is exact to the necessary degree when using the roots of the *d* + 1 order polynomial from the optimal basis, and because no transformation from physical to normal space is necessary. Convergence with a non-optimal basis (here: Hermite) after transformation strongly depends on the nonlinearity of the required transformation.

Fig. 2. Convergence estimationg the mean value based on optimal data-driven basis and non-optimal transformed basis using Galerkin projection (G.) and Collocation approach (C.)

#### **2.5 Conclusions to section 2**

We presented the data-driven approach (aPC) for construction of a response surface based on a global orthonormal polynomial basis for arbitrary distributions. The arbitrary distributions can be either discrete, continuous, or discretized continuous and can be specified either through a few statistical moments, analytically as PDF/CDF, numerically as a histogram, or theoretically through the even more general format of a probability measure. The aPC approach provides improved convergence in comparison to classical PCE techniques, when applied to input distributions that fall outside the range of classical PCE (Oladyshkin, Class, Helmig & Nowak, 2011a; Oladyshkin & Nowak, 2011).

The method only demands the existence of a finite number of moments, and does not require the exact knowledge or even existence of probability density functions. An interesting aspect is that only moments up to twice the order of expansion matter. Therefore, any PDFs fitted to input data will lead to the same statistical moments for model output, if the PDFs coincide in the required moments up to twice the order of expansion. For the same reason, if one still desires to fit a single PDF to the input data set, we recommend maximum entropy or minimum relative entropy methods applied such that the moments relevant for PCE are matched. This differs drastically from fitting low-parametric distributions to only lower moments of the available data, because this would modify the remaining relevant moments up to twice the order of expansion. More details on this issue will be provided in Section 4.

#### **3. Global sensitivity analysis**

8 Will-be-set-by-IN-TECH

We will apply the chaos expansion to equation (9) with both intrusive, like the Galerkin method (Ghanem & Spanos, 1993; Matthies & Keese., 2005; Xiu & Karniadakis, 2003), and non-intrusive approaches , like the probabilistic collocation method (Isukapalli et al., 1998; Li & Zhang, 2007; Oladyshkin, Class, Helmig & Nowak, 2011b). Both approaches depend on the distribution of the random variable *ξ*. If the random variable *ξ* is not distributed in same space as the polynomial basis, an additional conversion is required, such as e.g. Gaussian anamorphosis (Wackernagel, 1998). For each method, we will construct two expansions: conventional PCE mapped to Gaussian and data-driven aPC. Figure 2 demonstrates the convergence of the mean value (at time *t* = 1) for a small selection of exemplary distributions of the model input *ξ* (e.g. Rayleigh, Weibull, Log-normal). A similar situation has been observed for the convergence of the variance. More details are presented in the study by Oladyshkin & Nowak (2011). Chaos expansion in the data-driven (optimal) polynomial basis without transformation shows at least an exponential convergence. Let us mentione, that the data-driven basis provides identical results for both intrusive and non-intrusive methods, because numerical integration is exact to the necessary degree when using the roots of the *d* + 1 order polynomial from the optimal basis, and because no transformation from physical to normal space is necessary. Convergence with a non-optimal basis (here: Hermite) after

transformation strongly depends on the nonlinearity of the required transformation.

**Rayleigh distribution (**σ**=1)**

1 2 3 4 5 6

1 2 3 4 5 6

**Weibull distribution (**λ**=1, k=2)**

Degree of expansion, d

10−15

10−10

10−5

Error

100

Degree of expansion, d

Fig. 2. Convergence estimationg the mean value based on optimal data-driven basis and non-optimal transformed basis using Galerkin projection (G.) and Collocation approach (C.)

We presented the data-driven approach (aPC) for construction of a response surface based on a global orthonormal polynomial basis for arbitrary distributions. The arbitrary distributions can be either discrete, continuous, or discretized continuous and can be specified either through a few statistical moments, analytically as PDF/CDF, numerically as a histogram, or theoretically through the even more general format of a probability measure. The aPC approach provides improved convergence in comparison to classical PCE techniques, when

Transformed basis (C.) Transformed basis (G.) Data−driven basis (C. & G.)

10−15

1 2 3 4 5 6

**Log−normal distribution (**μ**=0,**σ**=1/2)**

Degree of expansion, d

**2.5 Conclusions to section 2**

10−10

10−8

10−6

Error

10−4

10−2

100

10−10

10−5

Error

100

Understanding the general role of parameters in models and the impact of varying model parameters on the response of prediction models is a relevant subject in various fields of science and engineering. Characterizing the impact of parameter variations is known as sensitivity analysis and can be subdivided into local and global analysis (Saltelli et al., 2008; Sobol, 1990; Sudret, 2008). In many cases of practical interest, we wish to perform a Global Sensitivity Analysis (GSA) in order to analyze a model as such or to investigate, quantify and rank the effects of parameter variation or parameter uncertainty on the overall model uncertainty. GSA can also be used to: (1) quantify the relative importance of each individual input parameter in the final prediction (Anderson & Burt, 1985; Sobol, 1990; Winter et al., 2006); (2) aid engineers to produce more robust designs; and finally (3) help decision makers to allocate financial resources towards better uncertainty reduction.

For example, the field of subsurface contaminant hydrology requires uncertainty estimates due to the ubiquitous lack of parameter knowledge caused by spatial heterogeneity of hydraulic properties in combination with incomplete characterization (Dagan, 1989; Rubin, 2003). For such reasons, we need to rely on probabilistic tools to predict contaminant levels and their overall health effects, and to quantify the corresponding uncertainties. Having efficient computational approaches to estimate uncertainty and to perform GSA in hydrogeological applications (and many other fields of science and engineering that feature uncertain dynamic or distributed systems) is desirable. It can inform modelers about the relevance of processes or parameters in the models they compile, and can inform engineers and decision makers about which parameters require most attention and where characterization efforts should be allocated such that prediction uncertainty can be minimized. Hence, there is an ever-increasing demand for having a GSA method that efficiently quantifies uncertainty and parameter relevance in complex and non-linear systems. An important recommendation to keep in mind is that GSA should be global not only in the sense of looking at the entire range of possible parameter variations. It should also be used to assess the importance of parameters on a global, final model output or post-processing result that is relevant to generate new insight, or relevant for final decisions. GSA should not merely be applied to model-internal quantities that are of secondary importance for the scientific or application task at hand (Saltelli et al., 2008).

Control Via Robust Design 11

<sup>327</sup> Polynomial Response Surfaces for

Classical PCE has already been used for sensitivity analysis in different fields of applications (Crestaux et al., 2009; Sudret, 2008). Within the data-driven PCE framework (de Barros, Oladyshkin & Nowak, 2011; Oladyhskin et al., 2011) used in this work, the so-called Sobol indices for sensitivity estimation (Sobol, 1990) can be computed analytically based on the PCE

(see Crestaux et al. (2009); Plischke (2010); Sudret (2008)) using Eq. (10):

, *χ<sup>j</sup>* =

be traced back to the joint contributions of the parameters *ωi*<sup>1</sup> , ..., *ωis*

*ST*

*<sup>j</sup>* = ∑

1, *i f α<sup>k</sup>*

0, *i f α<sup>k</sup>*

where *Si*1,...,*is* is the Sobol index that indicates what fraction of the of total variance of Ω can

operator *χ<sup>j</sup>* indicates where the chosen parameters *ω* numbered as *i*1, ..., *is* (i.e., *ωi*<sup>1</sup> , ..., *ωis*

have simultaneous contributions within the overall expansion. In plain words, it enumerates all polynomial terms that contain the specified combination *i*1, ..., *is* of model parameters. A complementing metric for sensitivity analysis is the *Total Index* introduced by Homma & Saltelli (1996). It expresses the total contribution to the variance of the model output Ω due to the uncertainty of an individual parameter *ω<sup>j</sup>* in all cross-combinations with other parameters:

(*i*1,...,*is*):*j*∈(*i*1,...,*is*)

The weighted sensitivity index introduced by Oladyshkin, de Barros & Nowak (2011) reflects the square slope *∂*Ω/*∂ωj*, but averaged over the statistical distributions or weighing functions

> *∂*Ω(ω) *∂ωj*

So, the weighted global index for GSA using an arbitrary probability measure Γ can be used for reflection of the expected magnitude of parameter variation. Compared to Sobol indices (see Section 3.1) which analyze the spectral energy (variance) contribution of individual parameters to the model Ω, the new weighted measure looks at the spectral energy within the derivatives *∂*Ω/*∂ωj*. The averaged sensitivity index *Sω<sup>j</sup>* can be explicitly expressed as:

> *b* (*α<sup>k</sup> <sup>j</sup>* −1) *i*

*<sup>j</sup>* simply sums up all Sobol indices in which the variable *ω<sup>j</sup>* appears, both as univariate

2

2 *P*(*i*)

*Si*1,...,*is*

*<sup>j</sup>* > 0, ∀*j* ∈ (*i*1, ..., *is*)

 ,

, (10)

*d*Γ(*ω*1)...*d*Γ(*ωN*) (11)

*<sup>j</sup>* (*ωj*). (12)

. The index selection

)

*<sup>j</sup>* = 0, ∃*j* ∈ (*i*1, ..., *is*)

**3.1 Sobol sensitivity indices**

where *S<sup>T</sup>*

of *ω*1, ..., *ωN*:

and joint influences.

**3.2 Weighted sensitivity indices**

*S*2 *<sup>ω</sup><sup>j</sup>* = 

*ω*1∈Λ

*S*2 *<sup>ω</sup><sup>j</sup>* =

...

*M* ∑ *k*=0 *c*2 *k αk <sup>j</sup>* −1 ∑ *i*=0

*ω<sup>N</sup>* ∈Λ

*Si*1,...,*is* =

*M* ∑ *j*=1 *χjc*<sup>2</sup> *j*

Probabilistic Risk Assessment and Risk Control via Robust Design

*M* ∑ *j*=1 *c*2 *j*

In this Section, we tackle GSA based on the aPC technique, following the line of work on aPC by Oladyshkin, Class, Helmig & Nowak (2011a); Oladyshkin & Nowak (2011) and synthesizing the GSA approach by Oladyshkin, de Barros & Nowak (2011). Because the presented framework accounts for arbitrary bounds or weighting functions for input parameters, it provides a weighted global sensitivity. In some sense, the novel sensitivity indexes introduced by Oladyshkin, de Barros & Nowak (2011) can be perceived as a generalization of the Morris method (Morris, 1991) to weighted analysis. Up to presence the Morris method considers a uniform importance of input parameters within pre-defined intervals. We also generalize the Sobol indices (Sobol, 1990) for GSA to the aPC context, and provide a novel GSA measure which resembles a weighted square norm of sensitivities. Compared to Sobol indices, the new weighted sensitivity measure is absolute rather than relative. The advantage of an absolute index over a relative one is that it is a quantitative expression for the (averaged) derivative (slope), and hence keeps the original meaning of a sensitivity as known from linear, local analysis.

Performing GSA requires to evaluate the model at many points in the space of the input parameters. The correct choice of such points within the parameter space is important for adequate and efficient assessment of sensitivity. The aPC approach explicitly offers a method for optimal choice of these points, based on the generalized mathematical theory of Gaussian integration (e.g., Abramowitz & Stegun (1965)), see Section 2.

Foglia et al. (2007) pointed out that, according to their experience, one can get 70% of the insight from 2% of the model runs when using local sensitivity analysis methods versus global methods. The big advantage of aPC-based GSA (or more generally: GSA based on any PCE technique), is that one can obtain global sensitivity information at computational costs that are hardly larger than those for local analysis. The reason is the following: Local methods use infinitesimally small spacing between parameter sets for model evaluation to get numerical derivatives evaluated at a single point. The aPC based-method places the parameter sets for model evaluation at an optimized spacing in parameter space. This can be interpreted as fitting secants (or polynomials for non-linear analysis) to the model response. These secants (polynomials) approximate the model over the entire parameter space in a weighted least-square sense (compare with the best unbiased ensemble linearization approach described by Nowak (2009)). This is more beneficial to computing a tangent or local second derivatives (compare FORM, SORM methods, e.g., Jang et al. (1994)) that approximate the model well just around one point in the parameter space.

This Section provides an alternative procedure to perform GSA that is computationally efficient and highly flexible. In particular, due to aPC, the GSA introduced here and in Oladyshkin, de Barros & Nowak (2011) can be interpreted as exploiting a smart (mathematically optimal) interpolation rule of model output between optimally chosen sets of input parameters, where the number of model evaluations is minimal. Compared to earlier works that related GSA to classical PCE the presented approach: (1) emphasizes a more engineering-like language as compared to otherwise intense mathematical derivations and is based on a clear 3-step procedure to perform sensitivity analysis, (2) provides easy-to-use semi-analytical expressions for frequent use in applications, (3) generalizes PCE-based GSA to arbitrary probability distributions of the investigated parameters, and moreover, (4) allows to align the complexity and order of analysis with the reliability and detail level of statistical information on the input parameters.

#### **3.1 Sobol sensitivity indices**

10 Will-be-set-by-IN-TECH

In this Section, we tackle GSA based on the aPC technique, following the line of work on aPC by Oladyshkin, Class, Helmig & Nowak (2011a); Oladyshkin & Nowak (2011) and synthesizing the GSA approach by Oladyshkin, de Barros & Nowak (2011). Because the presented framework accounts for arbitrary bounds or weighting functions for input parameters, it provides a weighted global sensitivity. In some sense, the novel sensitivity indexes introduced by Oladyshkin, de Barros & Nowak (2011) can be perceived as a generalization of the Morris method (Morris, 1991) to weighted analysis. Up to presence the Morris method considers a uniform importance of input parameters within pre-defined intervals. We also generalize the Sobol indices (Sobol, 1990) for GSA to the aPC context, and provide a novel GSA measure which resembles a weighted square norm of sensitivities. Compared to Sobol indices, the new weighted sensitivity measure is absolute rather than relative. The advantage of an absolute index over a relative one is that it is a quantitative expression for the (averaged) derivative (slope), and hence keeps the original meaning of a

Performing GSA requires to evaluate the model at many points in the space of the input parameters. The correct choice of such points within the parameter space is important for adequate and efficient assessment of sensitivity. The aPC approach explicitly offers a method for optimal choice of these points, based on the generalized mathematical theory of Gaussian

Foglia et al. (2007) pointed out that, according to their experience, one can get 70% of the insight from 2% of the model runs when using local sensitivity analysis methods versus global methods. The big advantage of aPC-based GSA (or more generally: GSA based on any PCE technique), is that one can obtain global sensitivity information at computational costs that are hardly larger than those for local analysis. The reason is the following: Local methods use infinitesimally small spacing between parameter sets for model evaluation to get numerical derivatives evaluated at a single point. The aPC based-method places the parameter sets for model evaluation at an optimized spacing in parameter space. This can be interpreted as fitting secants (or polynomials for non-linear analysis) to the model response. These secants (polynomials) approximate the model over the entire parameter space in a weighted least-square sense (compare with the best unbiased ensemble linearization approach described by Nowak (2009)). This is more beneficial to computing a tangent or local second derivatives (compare FORM, SORM methods, e.g., Jang et al. (1994)) that approximate the

This Section provides an alternative procedure to perform GSA that is computationally efficient and highly flexible. In particular, due to aPC, the GSA introduced here and in Oladyshkin, de Barros & Nowak (2011) can be interpreted as exploiting a smart (mathematically optimal) interpolation rule of model output between optimally chosen sets of input parameters, where the number of model evaluations is minimal. Compared to earlier works that related GSA to classical PCE the presented approach: (1) emphasizes a more engineering-like language as compared to otherwise intense mathematical derivations and is based on a clear 3-step procedure to perform sensitivity analysis, (2) provides easy-to-use semi-analytical expressions for frequent use in applications, (3) generalizes PCE-based GSA to arbitrary probability distributions of the investigated parameters, and moreover, (4) allows to align the complexity and order of analysis with the reliability and detail level of statistical

sensitivity as known from linear, local analysis.

integration (e.g., Abramowitz & Stegun (1965)), see Section 2.

model well just around one point in the parameter space.

information on the input parameters.

Classical PCE has already been used for sensitivity analysis in different fields of applications (Crestaux et al., 2009; Sudret, 2008). Within the data-driven PCE framework (de Barros, Oladyshkin & Nowak, 2011; Oladyhskin et al., 2011) used in this work, the so-called Sobol indices for sensitivity estimation (Sobol, 1990) can be computed analytically based on the PCE (see Crestaux et al. (2009); Plischke (2010); Sudret (2008)) using Eq. (10):

$$S\_{\bar{i}\_{1},\ldots,\bar{i}\_{s}} = \frac{\sum\_{j=1}^{M} \chi\_{j} c\_{j}^{2}}{\sum\_{j=1}^{M} c\_{j}^{2}}, \quad \chi\_{j} = \begin{cases} 1, & \text{if } & a\_{j}^{k} > 0, \quad \forall j \in (i\_{1},\ldots,i\_{s}) \\ 0, & \text{if } & a\_{j}^{k} = 0, \quad \exists j \in (i\_{1},\ldots,i\_{s}) \end{cases} \Big| \,\prime$$

where *Si*1,...,*is* is the Sobol index that indicates what fraction of the of total variance of Ω can be traced back to the joint contributions of the parameters *ωi*<sup>1</sup> , ..., *ωis* . The index selection operator *χ<sup>j</sup>* indicates where the chosen parameters *ω* numbered as *i*1, ..., *is* (i.e., *ωi*<sup>1</sup> , ..., *ωis* ) have simultaneous contributions within the overall expansion. In plain words, it enumerates all polynomial terms that contain the specified combination *i*1, ..., *is* of model parameters.

A complementing metric for sensitivity analysis is the *Total Index* introduced by Homma & Saltelli (1996). It expresses the total contribution to the variance of the model output Ω due to the uncertainty of an individual parameter *ω<sup>j</sup>* in all cross-combinations with other parameters:

$$\mathbf{S}\_{\mathbf{j}}^{T} = \sum\_{(\mathbf{i}\_{1}, \dots, \mathbf{i}\_{s}) : \mathbf{j} \in (\mathbf{i}\_{1}, \dots, \mathbf{i}\_{s'})} \mathbf{S}\_{\mathbf{i}\_{1}, \dots, \mathbf{i}\_{s'}} \tag{10}$$

where *S<sup>T</sup> <sup>j</sup>* simply sums up all Sobol indices in which the variable *ω<sup>j</sup>* appears, both as univariate and joint influences.

#### **3.2 Weighted sensitivity indices**

The weighted sensitivity index introduced by Oladyshkin, de Barros & Nowak (2011) reflects the square slope *∂*Ω/*∂ωj*, but averaged over the statistical distributions or weighing functions of *ω*1, ..., *ωN*:

$$S^2\_{\omega\_{\parallel}} = \int\_{\omega\_{\parallel} \in \Lambda} \dots \int\_{\omega\_{N} \in \Lambda} \left[ \frac{\partial \Omega(\omega)}{\partial \omega\_{j}} \right]^2 d\Gamma(\omega\_{1}) \dots d\Gamma(\omega\_{N}) \tag{11}$$

So, the weighted global index for GSA using an arbitrary probability measure Γ can be used for reflection of the expected magnitude of parameter variation. Compared to Sobol indices (see Section 3.1) which analyze the spectral energy (variance) contribution of individual parameters to the model Ω, the new weighted measure looks at the spectral energy within the derivatives *∂*Ω/*∂ωj*. The averaged sensitivity index *Sω<sup>j</sup>* can be explicitly expressed as:

$$S\_{\omega\_{\!\!/\!D}}^{2} = \sum\_{k=0}^{M} c\_{k}^{2} \sum\_{i=0}^{n\_{j}^{k}-1} \left[ b\_{i}^{(a\_{j}^{k}-1)} \right]^{2} P\_{j}^{(i)}(\omega\_{\!\!/\!)}.\tag{12}$$

Control Via Robust Design 13

<sup>329</sup> Polynomial Response Surfaces for

3. The presented algorithm can easily be implemented without any deep knowledge related

4. The model Ω does not have to be modified and no specific properties are required. Hence, it may by given in any arbitrary form for simple analytical solution up to a multi-scale

We will demonstrate the presented GSA methodology for a contaminant transport problem in a 3D heterogeneous aquifer and the resulting human health risk for an exposed population. The aquifer has a hydraulic conductivity tensor **K** (**x**) and constant effective porosity *ne*. For illustration purposes, we consider flow to be incompressible, single-phased, at steady-state, free of boundary effects. The Random Space Functions for the log-conductivity *Y* = ln *K* is

exponential) and its integral scale *IY*,*i*, where *i* = 1, 2 and 3 for **x** = (*x*1, *x*2, *x*3). We will consider *IY* ≡ *IY*,1 = *IY*,2 and *IY*,*<sup>v</sup>* ≡ *IY*,3 with anisotropy ratio *f* = *I*3/*IY*. Statistical stationarity for *Y* is assumed. Flow is uniform-in-the-average with mean velocity �**u** (**x**)� ≡ (*U*, 0, 0). In this test case, we will make use of the semi-closed expressions for one-particle displacement

In most cases, decision makers are interested in quantifying adverse effects in human health due to contaminated groundwater exposure. For this case, the environmental performance metric of interest is human health risk, which depends not only on the hydrogeological parameters but also on the physiological and behavioural parameters of the exposed individual. For illustration purposes, we will consider the increased lifetime cancer risk model from the EPA (USEPA, December 1989;D), although many other risk models exist as discussed in the literature (de Barros, Ezzedine & Rubin, 2011; Maxwell & Kastenberg, 1999; Siirila et al., 2011). The increased lifetime cancer risk formulation for the groundwater ingestion pathway

> 1 *ED*

where *C* is the maximum running average over the exposure duration *ED* (years) defined in Maxwell & Kastenberg (1999) and *a* is an uncertain health parameter (see Oladyshkin,

We will apply the presented framework introduced in Section 3.3 for the human health risk analysis. These sensitivity indices are useful to quantify the simultaneous influence of model parameters, especially when the number of parameters becomes large and visualization of a multivariate model response becomes unfeasible and confusing. We will analyze an uncertain contaminant spill location and uncertain parameters in the model for health risk. Let us

source with the source volume *V*<sup>0</sup> and the slope of the risk model *a* as analyzed parameters *ω<sup>j</sup>*

*t*+*ED* ∑ *t*

*C* (*t*)

<sup>∞</sup>

*t*=0

*<sup>Y</sup>*, longitudinal velocity *U* and position *xs*, *ys*, *zs* of the

, (14)

*<sup>r</sup>* <sup>=</sup> *aC*, *<sup>C</sup>* <sup>=</sup> max

*<sup>Y</sup>* , covariance model *CY* (assumed here

to the theory of chaos expansion and projection techniques.

Probabilistic Risk Assessment and Risk Control via Robust Design

multi-physical simulation software framework.

statistically characterized by its mean �*Y*�, variance *<sup>σ</sup>*<sup>2</sup>

covariance function from Dagan (1988).

de Barros & Nowak (2011) for details).

consider the integral scale *IY*, variance *σ*<sup>2</sup>

**3.4.2 Importance ranking of modeling parameters**

**3.4.1 Problem formulation**

is given by:

in our analysis.

**3.4 Illustration: sensitivity analysis for human health risk**

where the re-collection coefficients *b* (*α<sup>k</sup> <sup>j</sup>* −1) *<sup>i</sup>* are defined as solution of the corresponding linear system:

$$
\begin{bmatrix} b\_0^{(a\_j^k-1)} \\ b\_1^{(a\_j^k-1)} \\ \dots \\ b\_{a\_j^k-1}^{(a\_j^k-1)} \end{bmatrix} \begin{bmatrix} p\_{0,j}^{(0)} & 0 & 0 & 0 \\ p\_{0,j}^{(1)} & p\_{1,j}^{(1)} & \cdots & 0 \\ \dots & \dots & \dots & \dots \\ p\_{0,j}^{(a\_j^k-1)} & p\_{1,j}^{(a\_j^k-1)} & \cdots & p\_{a\_j^k-1,j}^{(a\_j^k-1)} \end{bmatrix}^T = \begin{bmatrix} p\_{1,j}^{(a\_j^k)} \\ p\_{2,j}^{(a\_j^k)} \\ \dots \\ \dots \\ a\_j^k p\_{a\_j^k,j}^{(a\_j^k)} \end{bmatrix}. \tag{13}
$$

This index reflects the influence of a parameter *ω<sup>j</sup>* onto the model output Ω in a similar fashion to the total index defined in Eq. (10). However, the new multivariate index in Eq. (11) does not rely on comparison among different parameters, i.e., it is an absolute measure. This is an advantage over the existing Sobol-based total index which is only a comparative and relative measure.

#### **3.3 Three-step algorithm for data-adaptive global sensitivity analysis**

This Section summarizes the computational algorithm for GSA based on aPC. The important feature in this computational algorithm is that it can be performed for arbitrary distributions of the input data *ω*. The entire algorithm for the desired degree of precision *d* and number of parameters *N* is based on the following 3 steps:

*Step 1. Characterize the model parameters to be investigated*: Compute the raw moments *μk*,*<sup>j</sup>* (*k* = 1...2*d*) of the input data for each input parameter *ω<sup>j</sup>* (*j* = 1... *N*). If a probability density function (PDF) is provided, then we can evaluate the actual theoretical moments *in lieu* of the raw data moments of a given data set. Alternatively, expert elicitation may serve to provide opinions on these moments, e.g., by guessing a distribution shape and using its moments.

*Step 2. Approximate the model response surface by polynomials*: For the specific moments *μk*,*j*, compute the coefficients of the optimal polynomial basis in Eq. (3) using the system of linear equations Eq. (4) and the normalization in Eq. (6). In addition, compute the coefficients of the expansion using the system of linear equations, Eq. (7), to represent the model response.

*Step 3. Compute the desired sensitivity information from the polynomials*: With the original model reduced to a multi-variable polynomial, the sensitivity indices can be obtained analytically without any heavy additional computational efforts. This is achieved by using the relations provided in Eqs.(10) or (12) provided in the two upcoming Sections.

In summary, the algorithm described above has the following advantages:


#### **3.4 Illustration: sensitivity analysis for human health risk**

#### **3.4.1 Problem formulation**

12 Will-be-set-by-IN-TECH

0,*<sup>j</sup>* 000

··· ··· ··· ···

This index reflects the influence of a parameter *ω<sup>j</sup>* onto the model output Ω in a similar fashion to the total index defined in Eq. (10). However, the new multivariate index in Eq. (11) does not rely on comparison among different parameters, i.e., it is an absolute measure. This is an advantage over the existing Sobol-based total index which is only a comparative and relative

This Section summarizes the computational algorithm for GSA based on aPC. The important feature in this computational algorithm is that it can be performed for arbitrary distributions of the input data *ω*. The entire algorithm for the desired degree of precision *d* and number of

*Step 1. Characterize the model parameters to be investigated*: Compute the raw moments *μk*,*<sup>j</sup>* (*k* = 1...2*d*) of the input data for each input parameter *ω<sup>j</sup>* (*j* = 1... *N*). If a probability density function (PDF) is provided, then we can evaluate the actual theoretical moments *in lieu* of the raw data moments of a given data set. Alternatively, expert elicitation may serve to provide opinions on these moments, e.g., by guessing a distribution shape and using its moments.

*Step 2. Approximate the model response surface by polynomials*: For the specific moments *μk*,*j*, compute the coefficients of the optimal polynomial basis in Eq. (3) using the system of linear equations Eq. (4) and the normalization in Eq. (6). In addition, compute the coefficients of the expansion using the system of linear equations, Eq. (7), to represent the model response.

*Step 3. Compute the desired sensitivity information from the polynomials*: With the original model reduced to a multi-variable polynomial, the sensitivity indices can be obtained analytically without any heavy additional computational efforts. This is achieved by using the relations

1. It constructs an optimal orthonormal polynomial basis for any desired distribution of data, whereas previous PCE techniques were limited to a small number of statistical distributions. Since only statistical moments are relevant, the input distributions may be either discrete, continuous, or discretized continuous and can be specified either through some statistical moments, analytically as PDF, numerically as histograms, or theoretically

2. The algorithm performs an optimal projection of the physical model onto a polynomial basis with minimum computational effort. This reduces the original model to a set of polynomials with many useful properties that allow immensely fast evaluation and offer

1,*<sup>j</sup>* ··· 0

(*α<sup>k</sup> <sup>j</sup>* −1) *αk <sup>j</sup>* −1,*j*

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ *T*

=

(1)

(*α<sup>k</sup> <sup>j</sup>* −1) 1,*<sup>j</sup>* ··· *p*

*<sup>i</sup>* are defined as solution of the corresponding linear

⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣

*p* (*α<sup>k</sup> j*) 1,*j* 2*p* (*α<sup>k</sup> j*) 2,*j* ··· *αk j p* (*α<sup>k</sup> j*) *αk j* ,*j*

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦

. (13)

(*α<sup>k</sup> <sup>j</sup>* −1)

where the re-collection coefficients *b*

⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ *b* (*α<sup>k</sup> <sup>j</sup>*−1) 0 *b* (*α<sup>k</sup> <sup>j</sup>*−1) 1 ··· *b* (*α<sup>k</sup> <sup>j</sup>*−1) *αk <sup>j</sup>* −1

parameters *N* is based on the following 3 steps:

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ *p* (0)

*p* (1) 0,*<sup>j</sup> p*

**3.3 Three-step algorithm for data-adaptive global sensitivity analysis**

provided in Eqs.(10) or (12) provided in the two upcoming Sections.

through the even more general format of probability measures.

a list of analytical relations.

In summary, the algorithm described above has the following advantages:

*p* (*α<sup>k</sup> <sup>j</sup>* −1) 0,*<sup>j</sup> p*

system:

measure.

We will demonstrate the presented GSA methodology for a contaminant transport problem in a 3D heterogeneous aquifer and the resulting human health risk for an exposed population. The aquifer has a hydraulic conductivity tensor **K** (**x**) and constant effective porosity *ne*. For illustration purposes, we consider flow to be incompressible, single-phased, at steady-state, free of boundary effects. The Random Space Functions for the log-conductivity *Y* = ln *K* is statistically characterized by its mean �*Y*�, variance *<sup>σ</sup>*<sup>2</sup> *<sup>Y</sup>* , covariance model *CY* (assumed here exponential) and its integral scale *IY*,*i*, where *i* = 1, 2 and 3 for **x** = (*x*1, *x*2, *x*3). We will consider *IY* ≡ *IY*,1 = *IY*,2 and *IY*,*<sup>v</sup>* ≡ *IY*,3 with anisotropy ratio *f* = *I*3/*IY*. Statistical stationarity for *Y* is assumed. Flow is uniform-in-the-average with mean velocity �**u** (**x**)� ≡ (*U*, 0, 0). In this test case, we will make use of the semi-closed expressions for one-particle displacement covariance function from Dagan (1988).

In most cases, decision makers are interested in quantifying adverse effects in human health due to contaminated groundwater exposure. For this case, the environmental performance metric of interest is human health risk, which depends not only on the hydrogeological parameters but also on the physiological and behavioural parameters of the exposed individual. For illustration purposes, we will consider the increased lifetime cancer risk model from the EPA (USEPA, December 1989;D), although many other risk models exist as discussed in the literature (de Barros, Ezzedine & Rubin, 2011; Maxwell & Kastenberg, 1999; Siirila et al., 2011). The increased lifetime cancer risk formulation for the groundwater ingestion pathway is given by:

$$r = a\overline{\mathbb{C}}, \quad \overline{\mathbb{C}} = \max\left[\frac{1}{ED} \sum\_{t}^{t+ED} \mathbb{C}\left(t\right)\right]\_{t=0}^{\infty} \tag{14}$$

where *C* is the maximum running average over the exposure duration *ED* (years) defined in Maxwell & Kastenberg (1999) and *a* is an uncertain health parameter (see Oladyshkin, de Barros & Nowak (2011) for details).

#### **3.4.2 Importance ranking of modeling parameters**

We will apply the presented framework introduced in Section 3.3 for the human health risk analysis. These sensitivity indices are useful to quantify the simultaneous influence of model parameters, especially when the number of parameters becomes large and visualization of a multivariate model response becomes unfeasible and confusing. We will analyze an uncertain contaminant spill location and uncertain parameters in the model for health risk. Let us consider the integral scale *IY*, variance *σ*<sup>2</sup> *<sup>Y</sup>*, longitudinal velocity *U* and position *xs*, *ys*, *zs* of the source with the source volume *V*<sup>0</sup> and the slope of the risk model *a* as analyzed parameters *ω<sup>j</sup>* in our analysis.

Control Via Robust Design 15

<sup>331</sup> Polynomial Response Surfaces for

−2

at computational costs that are almost low as those of linear analysis.

**4. Uncertainty quantification and probabilistic risk assessment**

availability and subjectivity raised in the Section 2.

Fig. 3. Convergence of total (left plot) and weighted (right plot) sensitivity indices for the

The proposed method incorporates the full range of possible simulation outcomes for the investigated model parameters as it approximates the model's full response surface by multivariate polynomials. The existing polynomial-based method to compute Sobol indices was extended to the much more general aPC framework. While Sobol's indices look at the contributions from individual parameters to the energy (variance) of the model, our weighted sensitivity indices look at the energy norm of derivatives of the model with respect to individual parameters. The resulting sensitivity measure shows better convergence properties than Sobol analysis. It can convey the information of a global sensitivity analysis

This Section presents a reasonably accurate method for uncertainty quantification and probabilistic risk-assessment at acceptable computational costs. The lack of information about distributed properties leads to model uncertainties up to a level where the quantification of uncertainties becomes the dominant question in application tasks and may override the influence of secondary physical processes. Often numerical simulation models are inadequate for stochastic simulation techniques based on brute-force Monte Carlo simulation and related approaches, because even single deterministic simulations may require parallel high-performance computing. In the current Section, we suggest and apply a massive stochastic model reduction technique based on non-intrusive polynomial expansion as defined in Section 2. As a second focus for this Section, we continue discussion of data

Unfortunately, precise information on distribution shapes for all uncertain input parameters is very rare in realistic applications, such as underground reservoir simulations, groundwater modeling, etc. Applied research on real-world systems, like modeling CO2 storage, groundwater flow, etc., often faces the problem of immensely limited information about the model parameters involved, e.g. reservoir permeability, porosity, etc. With only limited data available or even in total absence of data, not even probability density functions representing the lack of knowledge can be easily inferred in a justified manner. Moreover, even if some

0

2

4

log Sw

6

8

10

1 2 3 4 5

**Weighted sensitivity indices**

SI Y Sσ Y 2 SU Sx s S y s Sz s Sa

Degree of expansion, d

1 2 3 4 5

**Total indices**

Probabilistic Risk Assessment and Risk Control via Robust Design

Degree of expansion, d

−12

health risk prediction.

−10

−8

−6

log ST

−4

−2

0

According to Step 1 of the algorithm presented in Section 3.3, we first characterize all input parameters. For illustrative purpose we will consider the following distributions of modelling parameters:


These PDFs are used in Step 1 to calculate the corresponding raw moments of input data. According to Step 2 of the algorithm (Section 3.3), we compute the coefficients of the orthonormal polynomial basis using Eq. (4) and Eq. (6) for given raw moments and for the desired degree of expansion *d*. Also, we consider the response surface of the model according to Eq.(1) and Eq.(7). At the last phase of analysis, Step 3 of the algorithm (Section 3.3), the desired quantitative sensitivity information is extracted from the polynomial response surface.

All sensitivity indices can be constructed as it was shown above. For brevity, we will focus on the total and weighted indices only. Figure 3 illustrates the total (left plot) and weighted (right plot) sensitivity indices at different degrees of expansion. This figure shows the overall influence of all model parameters on the total human health effect (final prediction), which implicitly includes a time integral over the concentration history (see Eq. 14). For this particular case, we observe that the risk is more sensitive towards hydrogeological parameters than the health parameters. In the case study related to health risk (Figure 3), the new weighted measure converges faster in comparison to Total indices with increasing order of the expansion. Total indices show the main reaction of the model to the analyzed parameters, but as relative quantities do not provide faster stabilization in the ranking of analyses parameters in comparison to absolute weighted indices. The presented approach for GSA provides a relatively accurate approximation even for moderate orders (see also the convergence study for aPC in general in Oladyshkin & Nowak (2011)). Thus, global and weighted sensitivity analysis can already be performed at computational costs that are only slightly larger than those of local analysis.

#### **3.5 Conclusions to section 3**

We have presented an alternative method for global sensitivity analysis (GSA), which is based on the arbitrary polynomial chaos expansion (aPC). Compared to existing polynomial-based GSA methods, it can accommodate for all types of statistical distributions or weighting functions of the input parameters. This approach is denoted as *data-adaptive* because the aPC method can be applied even in situations where precise statistical information (e.g., known parametric distributions) for the input parameters is not available. If desired, the method can work directly with raw sampled data sets to represent the uncertainty and possible variational ranges of input data. The presented methods allow experts to choose freely of technical constraints the shapes of their statistical assumptions, and they allow to align the complexity and order of analysis with the reliability and detail level of statistical information on the input parameters.

14 Will-be-set-by-IN-TECH

According to Step 1 of the algorithm presented in Section 3.3, we first characterize all input parameters. For illustrative purpose we will consider the following distributions of modelling

These PDFs are used in Step 1 to calculate the corresponding raw moments of input data. According to Step 2 of the algorithm (Section 3.3), we compute the coefficients of the orthonormal polynomial basis using Eq. (4) and Eq. (6) for given raw moments and for the desired degree of expansion *d*. Also, we consider the response surface of the model according to Eq.(1) and Eq.(7). At the last phase of analysis, Step 3 of the algorithm (Section 3.3), the desired quantitative sensitivity information is extracted from the polynomial response surface. All sensitivity indices can be constructed as it was shown above. For brevity, we will focus on the total and weighted indices only. Figure 3 illustrates the total (left plot) and weighted (right plot) sensitivity indices at different degrees of expansion. This figure shows the overall influence of all model parameters on the total human health effect (final prediction), which implicitly includes a time integral over the concentration history (see Eq. 14). For this particular case, we observe that the risk is more sensitive towards hydrogeological parameters than the health parameters. In the case study related to health risk (Figure 3), the new weighted measure converges faster in comparison to Total indices with increasing order of the expansion. Total indices show the main reaction of the model to the analyzed parameters, but as relative quantities do not provide faster stabilization in the ranking of analyses parameters in comparison to absolute weighted indices. The presented approach for GSA provides a relatively accurate approximation even for moderate orders (see also the convergence study for aPC in general in Oladyshkin & Nowak (2011)). Thus, global and weighted sensitivity analysis can already be performed at computational costs that are only slightly larger than

We have presented an alternative method for global sensitivity analysis (GSA), which is based on the arbitrary polynomial chaos expansion (aPC). Compared to existing polynomial-based GSA methods, it can accommodate for all types of statistical distributions or weighting functions of the input parameters. This approach is denoted as *data-adaptive* because the aPC method can be applied even in situations where precise statistical information (e.g., known parametric distributions) for the input parameters is not available. If desired, the method can work directly with raw sampled data sets to represent the uncertainty and possible variational ranges of input data. The presented methods allow experts to choose freely of technical constraints the shapes of their statistical assumptions, and they allow to align the complexity and order of analysis with the reliability and detail level of statistical information on the input

5,6, where *ω*�

<sup>1</sup> following a beta distribution with *α* = 2 and *β* = 2;

5,6 is beta distributed with *α* = 2 and *β* = 2.

<sup>4</sup> is beta distributed with *α* = 2 and *β* = 2;

parameters:

• For *ω*1: *ω*<sup>1</sup> = 1 + 2*ω*�

• For *ω*4: *ω*<sup>4</sup> = 9 + 2*ω*�

those of local analysis.

parameters.

**3.5 Conclusions to section 3**

• For *ω*<sup>5</sup> and *ω*6: *ω*5,6 = −1 + 2*ω*�

<sup>1</sup>, with *ω*�

• For *ω*2: Uniformly distributed within the interval [0.1, 0.7]; • For *ω*3: Log-normally distributed with *μ* = 3.6 and *σ* = 0.3.

<sup>4</sup>, where *ω*�

• For *ω*7: *ω*<sup>7</sup> is lognormal distributed with *μ* = 1.7 and *σ* = 0.14.

Fig. 3. Convergence of total (left plot) and weighted (right plot) sensitivity indices for the health risk prediction.

The proposed method incorporates the full range of possible simulation outcomes for the investigated model parameters as it approximates the model's full response surface by multivariate polynomials. The existing polynomial-based method to compute Sobol indices was extended to the much more general aPC framework. While Sobol's indices look at the contributions from individual parameters to the energy (variance) of the model, our weighted sensitivity indices look at the energy norm of derivatives of the model with respect to individual parameters. The resulting sensitivity measure shows better convergence properties than Sobol analysis. It can convey the information of a global sensitivity analysis at computational costs that are almost low as those of linear analysis.

#### **4. Uncertainty quantification and probabilistic risk assessment**

This Section presents a reasonably accurate method for uncertainty quantification and probabilistic risk-assessment at acceptable computational costs. The lack of information about distributed properties leads to model uncertainties up to a level where the quantification of uncertainties becomes the dominant question in application tasks and may override the influence of secondary physical processes. Often numerical simulation models are inadequate for stochastic simulation techniques based on brute-force Monte Carlo simulation and related approaches, because even single deterministic simulations may require parallel high-performance computing. In the current Section, we suggest and apply a massive stochastic model reduction technique based on non-intrusive polynomial expansion as defined in Section 2. As a second focus for this Section, we continue discussion of data availability and subjectivity raised in the Section 2.

Unfortunately, precise information on distribution shapes for all uncertain input parameters is very rare in realistic applications, such as underground reservoir simulations, groundwater modeling, etc. Applied research on real-world systems, like modeling CO2 storage, groundwater flow, etc., often faces the problem of immensely limited information about the model parameters involved, e.g. reservoir permeability, porosity, etc. With only limited data available or even in total absence of data, not even probability density functions representing the lack of knowledge can be easily inferred in a justified manner. Moreover, even if some

Control Via Robust Design 17

<sup>333</sup> Polynomial Response Surfaces for

Unfortunately, information is not available for all sources of uncertainty. In such cases, additional assumptions become necessary. The classical way would be to introduce a theoretical (parametric) probability distribution, e.g. with an assumed mean and variance. For example, some expert could introduce a lognormal distribution for the leaky-well permeability (see details in Section 4.3.1) with parameters defined from the benchmark values

**Probability density function**

0 1 2 3

Leaky well permeability, [m2

Fig. 4. Assumed stochastic distribution: theoretical PDF (red line), alternative PDFs (green lines) with the same first four moments and Maximum Entropy PDF with the same first four

Establishing a full theoretical probability density function involves a strong assumption on all higher moments up to infinite order, and assumes implicit knowledge of the exact shape, e.g. also of the extreme-value tails. The data-driven approach strongly alleviates this situation, because it can handle a set of moments directly (e.g. the mean, variance, skewness, peakedness), without any further assumptions on higher-order moments and without having to introduce a PDF. In the current Section, we will use this freedom, and obtain only a small number of required moments directly from a large database or via experts, without asking for

For evaluating arbitrary complex output statistics without having a sufficiently large raw data set, we would recommend the Maximum Entropy PDF also shown in Figure 4 to draw a sufficiently large sample for Monte Carlo analysis of the polynomial. For example, a second-order expansion requires knowing the first four moments. Figure 4 illustrates possible distributions of stochastic variables, where the red line represents a theoretical PDF and the green lines represent a small collection of alternative PDFs with the same first four moments. The data-driven method does not require a choice between these alternatives, but only uses common information in the form of the required moments (here: up to order four). Existence of finite moments (up to a certain required order) is a necessary and sufficient condition for the proposed framework. Usually, this condition is easily fulfilled for a large spectrum of

practical applications, especially for moderate degrees of expansion.

0

0.5

1

1.5

PDF

2

2.5

3

Probabilistic Risk Assessment and Risk Control via Robust Design

x 10−12

]

Theoretical Alternative Max.Entropy

**4.2 Problem of data availability**

(Class et al., 2009), see Figure 4.

moments (blue line).

a full PDF.

amount of data is available, the statistical distribution of the corresponding model parameters can be nontrivial, e.g. bounded, skewed, multi-modal or discontinuous. In any case, the attempt to construct probability density functions from samples of limited size or from sparse information introduces additional room for subjectivity into the analysis. Thus, applied tasks demand the direct handling of arbitrary distribution shapes and sparse data sets without additional assumptions.

The purpose of the current Section is to work with a highly parsimonic and yet purely data-driven description of uncertainty, applying the arbitrary polynomial chaos (aPC). It was shown in Section 2 that statistical moments are the only source of information that is propagated in all types of polynomial expansion-based stochastic approaches. Thus, exact probability density functions for uncertain input parameters do not have to be known and do not even have to exist. This avoids the necessity of assuming subjectively or speculating on the exact shapes of probability distributions. The new freedom in distribution shapes gained with aPC opens the path to accessing with PCE even those applications where data samples of limited size merely allow the inference of a few moments, and one would not be able to construct a probability density function without introducing subjective assumptions and hence dangerous sources of bias.

#### **4.1 Evaluation of statistics**

In the presented approach, the statistics of the model output are based directly on the model and the specified moments of input data, see Section 2. If a model output Ω(ω) is expanded in the normalized polynomial basis Eq. (6), then characteristic statistical quantities of Ω(ω) can be evaluated directly. For example, the mean and variance of Ω(ω) are given by the following simple analytical relations:

$$\text{mean}(\Omega) = c\_{1\prime} \quad \text{var}(\Omega) = \sum\_{j=2}^{N} c\_{j\prime}^{2} \tag{15}$$

where the latter is a result of Parseval's Theorem (e.g. Siebert (1986)).

Likewise, all moments of Ω up to the order of expansion can be obtained analytically, based only on expansion coefficients and the used moments of the input parameters. Therefore, all probability distributions of input parameters that share the same 2*d* moments will, in any polynomial chaos expansion of order *d*, lead to the same moments of model output of order *d*. For the same reason, PCE expansions of order *d* are unaffected by subjective choices concerning input parameter distributions that only affect moments beyond the order 2*d*. This is the case, for example, for maximum entropy PDFs.

Cumulative distribution functions (CDF), probability density functions (PDF) and other arbitrary statistics of model output can be evaluated via Monte Carlo analysis of the polynomial response surface that results from the expansion Eq. (1). This is very fast, because the polynomial surface is much faster to evaluate than the original model equations. However, the latter approach will ask for full knowledge of the input probability density function involved, which we try to avoid. If Monte Carlo cannot be avoided, e.g. for computing exceedance probability in risk analysis, we suggest using maximum entropy PDFs, or alternatively, a sufficiently large data set (if available) that can be used directly as Monte-Carlo realizations of input parameters.

#### **4.2 Problem of data availability**

16 Will-be-set-by-IN-TECH

amount of data is available, the statistical distribution of the corresponding model parameters can be nontrivial, e.g. bounded, skewed, multi-modal or discontinuous. In any case, the attempt to construct probability density functions from samples of limited size or from sparse information introduces additional room for subjectivity into the analysis. Thus, applied tasks demand the direct handling of arbitrary distribution shapes and sparse data sets without

The purpose of the current Section is to work with a highly parsimonic and yet purely data-driven description of uncertainty, applying the arbitrary polynomial chaos (aPC). It was shown in Section 2 that statistical moments are the only source of information that is propagated in all types of polynomial expansion-based stochastic approaches. Thus, exact probability density functions for uncertain input parameters do not have to be known and do not even have to exist. This avoids the necessity of assuming subjectively or speculating on the exact shapes of probability distributions. The new freedom in distribution shapes gained with aPC opens the path to accessing with PCE even those applications where data samples of limited size merely allow the inference of a few moments, and one would not be able to construct a probability density function without introducing subjective assumptions and

In the presented approach, the statistics of the model output are based directly on the model and the specified moments of input data, see Section 2. If a model output Ω(ω) is expanded in the normalized polynomial basis Eq. (6), then characteristic statistical quantities of Ω(ω) can be evaluated directly. For example, the mean and variance of Ω(ω) are given by the following

mean(Ω) = *c*1, var(Ω) =

Likewise, all moments of Ω up to the order of expansion can be obtained analytically, based only on expansion coefficients and the used moments of the input parameters. Therefore, all probability distributions of input parameters that share the same 2*d* moments will, in any polynomial chaos expansion of order *d*, lead to the same moments of model output of order *d*. For the same reason, PCE expansions of order *d* are unaffected by subjective choices concerning input parameter distributions that only affect moments beyond the order 2*d*. This

Cumulative distribution functions (CDF), probability density functions (PDF) and other arbitrary statistics of model output can be evaluated via Monte Carlo analysis of the polynomial response surface that results from the expansion Eq. (1). This is very fast, because the polynomial surface is much faster to evaluate than the original model equations. However, the latter approach will ask for full knowledge of the input probability density function involved, which we try to avoid. If Monte Carlo cannot be avoided, e.g. for computing exceedance probability in risk analysis, we suggest using maximum entropy PDFs, or alternatively, a sufficiently large data set (if available) that can be used directly as

where the latter is a result of Parseval's Theorem (e.g. Siebert (1986)).

is the case, for example, for maximum entropy PDFs.

Monte-Carlo realizations of input parameters.

*N* ∑ *j*=2 *c*2

*<sup>j</sup>* , (15)

additional assumptions.

hence dangerous sources of bias.

**4.1 Evaluation of statistics**

simple analytical relations:

Unfortunately, information is not available for all sources of uncertainty. In such cases, additional assumptions become necessary. The classical way would be to introduce a theoretical (parametric) probability distribution, e.g. with an assumed mean and variance. For example, some expert could introduce a lognormal distribution for the leaky-well permeability (see details in Section 4.3.1) with parameters defined from the benchmark values (Class et al., 2009), see Figure 4.

Fig. 4. Assumed stochastic distribution: theoretical PDF (red line), alternative PDFs (green lines) with the same first four moments and Maximum Entropy PDF with the same first four moments (blue line).

Establishing a full theoretical probability density function involves a strong assumption on all higher moments up to infinite order, and assumes implicit knowledge of the exact shape, e.g. also of the extreme-value tails. The data-driven approach strongly alleviates this situation, because it can handle a set of moments directly (e.g. the mean, variance, skewness, peakedness), without any further assumptions on higher-order moments and without having to introduce a PDF. In the current Section, we will use this freedom, and obtain only a small number of required moments directly from a large database or via experts, without asking for a full PDF.

For evaluating arbitrary complex output statistics without having a sufficiently large raw data set, we would recommend the Maximum Entropy PDF also shown in Figure 4 to draw a sufficiently large sample for Monte Carlo analysis of the polynomial. For example, a second-order expansion requires knowing the first four moments. Figure 4 illustrates possible distributions of stochastic variables, where the red line represents a theoretical PDF and the green lines represent a small collection of alternative PDFs with the same first four moments. The data-driven method does not require a choice between these alternatives, but only uses common information in the form of the required moments (here: up to order four). Existence of finite moments (up to a certain required order) is a necessary and sufficient condition for the proposed framework. Usually, this condition is easily fulfilled for a large spectrum of practical applications, especially for moderate degrees of expansion.

Control Via Robust Design 19

<sup>335</sup> Polynomial Response Surfaces for

0

0.02

0.04

0.06

CO2 Leakage, [%]

Fig. 5. Estimation of mean value (left plot) and standard deviation (right plot) of the CO2 leakage rate: expert opinions (dashed lines) and data-driven approach (solid lines)

Similar to GSA in Section 3 the procedure for quantifying uncertainty in CO2 storage can be divided into three main steps. The first step is to construct the polynomial basis according to the data considered in Section 4.3.1. The second step is to set up the chaos expansion and obtain the required coefficients *ci* for expansion (1) using the non-intrusive probabilistic collocation method. In the third step, we evaluate mean and variance of output statistics

The three steps mentioned above are straightforward and exploit the input data directly, i.e. in a data-driven manner. Involving additional assumptions on input data is possible, but can be dangerous and will import more room for subjectivity inti the analysis. To illustrate the drastic impact of subjectivity that can be introduced into analysis, we performed the following simple experiment. The sample data (reservoir permeability and porosity) are distributed to five different and independent experts. The task of each expert consisted in constructing a theoretical probability density function for each parameter, which in their opinion would describe the statistics of the raw data best. As a common way of data description, all experts provided two-parametrical distributions for each input parameter (permeability and porosity). All experts proposed very different assumptions and techniques to match the permeability and porosity distributions. The responses of all experts were collected and used as input for modeling the benchmark problem defined in Sections 4.3.1. The results of this experiment are illustrated in Figure 5 using second-order polynomial chaos expansion. It shows the resulting mean value (left plot) and standard deviation (right plot) of the CO2 leakage rate over time. Here, dashed lines correspond to the results based on subjective expert opinions, and solid lines correspond to the results from our purely data-driven approach. All test cases presented in Figure 5 were performed under the same conditions in all other aspects. The differences in predicting leakage rates based on the different experts are a consequence of their different interpretations and opinions on how to treat the data. This example clearly demonstrates that the room for subjectivity (when assigning theoretical

0.08

0.1

0.12

0 20 40 60 80 100

Time, [days]

**Standard Deviation of CO2**

 **Leakage**

Data−driven Expert 1 Expert 2 Expert 3 Expert 4 Expert 5

0 20 40 60 80 100

Time, [days]

**4.3.2 Data-driven uncertainty quantification**

according to Section 4.1, see Figure 5.

**Mean Value of CO2**

 **Leakage**

Probabilistic Risk Assessment and Risk Control via Robust Design

Data−driven Expert 1 Expert 2 Expert 3 Expert 4 Expert 5

0

0.05

0.1

0.15

CO2 Leakage, [%]

0.2

0.25

0.3

#### **4.3 Illustration: risk assessment for carbon dioxide storage**

#### **4.3.1 Problem formulation**

We consider the benchmark leakage problem of injected CO2 into overlying formations through a leaky well defined by Class et al. (2009). In the current Section, we present an illustrative example for a homogenized system, i.e. we consider spatial heterogeneity only through zonation according to different geological media. However, the technique presented can be extended to many classes of heterogeneous systems, where spatially correlated heterogeneous parameter fields can be decomposed into their uncorrelated principal components using the KL-expansion (e.g. Li & Zhang (2007)), if heterogeneity does not span over too many scales. CO2 is injected into a deep aquifer, spreads within the aquifer and, upon reaching a leaky abandoned well, rises to a shallower aquifer. The goal of the simulation is to quantify the leakage rate which depends on the pressure build-up in the aquifer due to injection and on the plume evolution. The leaky well is at the center of the domain and the injection well is 100m away. Both aquifers are 30m thick and the separating aquitard has a thickness of 100m. The leaky well is modeled as a porous medium with a higher permeability than the formation. The CO2 leakage rate is defined in the benchmark study as the total CO2 mass flux integrated over a control plane midway between the top and bottom aquifer, divided by the injection rate, in percent.

The benchmark problem assumes that fluid properties such as density and viscosity are constant, all processes are isothermal, CO2 and brine are two separate and immiscible phases, mutual dissolution is neglected, the pressure conditions at the lateral boundaries are constant over time, the formation is isotropic rigid and chemically inert, and capillary pressure is negligible. Within the presented approach, the physical complexity of the problem can be increased to an arbitrary extent because of the non-intrusive black-box conception of the probabilistic collocation method (Isukapalli et al., 1998; Li & Zhang, 2007). All relevant parameters used for the simulation are given in the paper by Oladyshkin, Class, Helmig & Nowak (2011a). In the current study, the benchmark problem is simulated using DuMuX, a multi-scale multi-physics toolbox for the simulation of flow and transport processes in porous media (Flemisch et al., 2007).

To illustrate the proposed methodology, we will consider three parameters uncertain: reservoir absolute permeability, reservoir porosity and permeability of the leaky well. The distributions of absolute permeability and porosity were taken from the U.S. National Petroleum Council Public Database (which includes 1270 reservoirs), see also the work of Kopp et al. (2009). This choice reflects the situation of site screening, where site-specific data and data that allow heterogeneity within geological units to be resolved are not yet available. Instead, we use data sets that represent macroscopic properties of supposedly similar sites as prior knowledge.

Formally, the investigated uncertain parameters correspond to the components of the vector ω = {*ω*1, *ω*2, *ω*3} and represent the input parameters of Equation (1). The model output quantities Ω considered here are pressure and saturation values as a function of space and time, and the CO2 leakage rate through the leaky well as a function only of time.

18 Will-be-set-by-IN-TECH

We consider the benchmark leakage problem of injected CO2 into overlying formations through a leaky well defined by Class et al. (2009). In the current Section, we present an illustrative example for a homogenized system, i.e. we consider spatial heterogeneity only through zonation according to different geological media. However, the technique presented can be extended to many classes of heterogeneous systems, where spatially correlated heterogeneous parameter fields can be decomposed into their uncorrelated principal components using the KL-expansion (e.g. Li & Zhang (2007)), if heterogeneity does not span over too many scales. CO2 is injected into a deep aquifer, spreads within the aquifer and, upon reaching a leaky abandoned well, rises to a shallower aquifer. The goal of the simulation is to quantify the leakage rate which depends on the pressure build-up in the aquifer due to injection and on the plume evolution. The leaky well is at the center of the domain and the injection well is 100m away. Both aquifers are 30m thick and the separating aquitard has a thickness of 100m. The leaky well is modeled as a porous medium with a higher permeability than the formation. The CO2 leakage rate is defined in the benchmark study as the total CO2 mass flux integrated over a control plane midway between the top and

The benchmark problem assumes that fluid properties such as density and viscosity are constant, all processes are isothermal, CO2 and brine are two separate and immiscible phases, mutual dissolution is neglected, the pressure conditions at the lateral boundaries are constant over time, the formation is isotropic rigid and chemically inert, and capillary pressure is negligible. Within the presented approach, the physical complexity of the problem can be increased to an arbitrary extent because of the non-intrusive black-box conception of the probabilistic collocation method (Isukapalli et al., 1998; Li & Zhang, 2007). All relevant parameters used for the simulation are given in the paper by Oladyshkin, Class, Helmig & Nowak (2011a). In the current study, the benchmark problem is simulated using DuMuX, a multi-scale multi-physics toolbox for the simulation of flow and transport processes in porous

To illustrate the proposed methodology, we will consider three parameters uncertain: reservoir absolute permeability, reservoir porosity and permeability of the leaky well. The distributions of absolute permeability and porosity were taken from the U.S. National Petroleum Council Public Database (which includes 1270 reservoirs), see also the work of Kopp et al. (2009). This choice reflects the situation of site screening, where site-specific data and data that allow heterogeneity within geological units to be resolved are not yet available. Instead, we use data sets that represent macroscopic properties of supposedly similar sites as

Formally, the investigated uncertain parameters correspond to the components of the vector ω = {*ω*1, *ω*2, *ω*3} and represent the input parameters of Equation (1). The model output quantities Ω considered here are pressure and saturation values as a function of space and

time, and the CO2 leakage rate through the leaky well as a function only of time.

**4.3 Illustration: risk assessment for carbon dioxide storage**

bottom aquifer, divided by the injection rate, in percent.

**4.3.1 Problem formulation**

media (Flemisch et al., 2007).

prior knowledge.

Fig. 5. Estimation of mean value (left plot) and standard deviation (right plot) of the CO2 leakage rate: expert opinions (dashed lines) and data-driven approach (solid lines)

#### **4.3.2 Data-driven uncertainty quantification**

Similar to GSA in Section 3 the procedure for quantifying uncertainty in CO2 storage can be divided into three main steps. The first step is to construct the polynomial basis according to the data considered in Section 4.3.1. The second step is to set up the chaos expansion and obtain the required coefficients *ci* for expansion (1) using the non-intrusive probabilistic collocation method. In the third step, we evaluate mean and variance of output statistics according to Section 4.1, see Figure 5.

The three steps mentioned above are straightforward and exploit the input data directly, i.e. in a data-driven manner. Involving additional assumptions on input data is possible, but can be dangerous and will import more room for subjectivity inti the analysis. To illustrate the drastic impact of subjectivity that can be introduced into analysis, we performed the following simple experiment. The sample data (reservoir permeability and porosity) are distributed to five different and independent experts. The task of each expert consisted in constructing a theoretical probability density function for each parameter, which in their opinion would describe the statistics of the raw data best. As a common way of data description, all experts provided two-parametrical distributions for each input parameter (permeability and porosity). All experts proposed very different assumptions and techniques to match the permeability and porosity distributions. The responses of all experts were collected and used as input for modeling the benchmark problem defined in Sections 4.3.1. The results of this experiment are illustrated in Figure 5 using second-order polynomial chaos expansion. It shows the resulting mean value (left plot) and standard deviation (right plot) of the CO2 leakage rate over time. Here, dashed lines correspond to the results based on subjective expert opinions, and solid lines correspond to the results from our purely data-driven approach. All test cases presented in Figure 5 were performed under the same conditions in all other aspects. The differences in predicting leakage rates based on the different experts are a consequence of their different interpretations and opinions on how to treat the data. This example clearly demonstrates that the room for subjectivity (when assigning theoretical

Control Via Robust Design 21

Probabilistic Risk Assessment and Risk Control via Robust Design

<sup>337</sup> Polynomial Response Surfaces for

based on non-data-driven polynomial bases. First, the transformation from physical space to normal space introduces additional non-linear behavior into the overall problem, which will require a higher order of expansion to obtain comparable accuracy. Second, for non-intrusive methods that rely on numerical integration to obtain the coefficients *ci*, the accuracy of numerical integration strongly depends on the choice of integration or collocation points. For example, in Gauss-Hermite integration (Abramowitz & Stegun, 1965), the polynomial basis of degree *d* defines the positions of integration points by the roots of the polynomial of degree *d* + 1. These integration points are optimal only if the weighting function (here: the probability measure) and the polynomial basis are in direct correspondence. Any non-linear transformation from *ωPh* to *ω<sup>N</sup>* destroys this direct relation. Thus, using a non-data-driven polynomial basis leads to a non-optimal placement of integration points, which causes a reduced accuracy of the integration (see Oladyshkin & Nowak (2011)). As a consequence, in comparison with the classical PCE, the same order of precision can be achieved with a

In this Section, we presented a data-driven approach for uncertainty quantification and risk assessment, based on polynomial chaos expansion (PCE). The data-driven approach provides a response surface based on a global orthonormal polynomial basis for arbitrary distributions. Thus, the data-driven approach provides freedom for modeling even physical systems with unknown probability distribution functions, when only data sets of very limited size or only little prior knowledge is available. It offers a new approach to defining parameter uncertainty in stochastic analysis that avoids the subjectivity of assigning theoretical probability distributions. In a small fictitious example, we asked independent experts to select and fit parametric distributions to two raw data sets. The example demonstrated that subjectivity in data interpretation can easily lead to prediction bias. Modeling is a chain of many tasks. Like any chain, it is only as strong as its weakest link. Modeling results indicate clearly that the statistical treatment of input data is part of the chain, and that the subjectivity in assuming theoretical curves can weaken that link immensely.

smaller degree of expansion.

**4.4 Conclusions to section 4**

0 0.1 0.2 0.3 0.4 0.5 0.6

approach (left plot) and data-driven (right plot)

Leakage, [%]

CO2

 **Leakage: conventional approach**

Monte Carlo 1st degree 2nd degree 3rd degree 4th degree

0

0.2

0.4

0.6

Probability

Fig. 6. Cumulative distribution function of CO2 leakage rate after 30 days: conventional

0.8

1

**CDF of CO2**

0 0.1 0.2 0.3 0.4 0.5 0.6

Leakage, [%]

CO2

 **Leakage: data−driven approach**

Monte Carlo 1st degree 2nd degree 3rd degree 4th degree

0

0.2

0.4

0.6

Probability

0.8

1

**CDF of CO2**

stochastic distributions to real data and thus modifying their higher-order moments) can lead to significant differences in the predicted values.

The data-driven polynomial chaos expansion proposed here for stochastic analysis is based directly on the moments of sampled data without intermediate steps of data re-interpretation. This avoids the subjectivity usually introduced when choosing among a small limited number of theoretical distributions to represent a natural phenomenon, and so avoids the problem illustrated in Figure 5. Complex models often exhibit stronger non-linearity, which can amplify the impact of subjectivity especially for assumptions on extreme value tails and higher-order moments. Subjectivity within data interpretation can cause the same order of magnitude in uncertainty or error as numerical, conceptual and stochastic approximation error, see details in the paper by Oladyshkin, Class, Helmig & Nowak (2011a).

#### **4.3.3 Efficient probabilistic risk assessment**

The data-driven polynomial chaos expansion provides a simple but powerful tool for stochastic modeling and, in this case study, for the probabilistic risk assessment of CO2 storage. The most integrative characteristic of the overall benchmark problem is the total leakage of CO2. To compare a quantitative characteristic that is most important in probabilistic risk assessment, we computed the cumulative probability function of the CO2 leakage rate after 30 days (Figure 6). The cumulative density function represents the probability that the CO2 leakage is less than or equal to a particular value. In the current work, we also apply the conventional PCE approach for direct comparison with the data-driven approach. The principal difference lies in constructing the polynomial basis. The conventional approach is based on Hermite polynomials, which are optimal for Gaussian random variables. Because the random variables ω are not distributed Gaussian in compliance with the Hermite polynomial basis, an additional conversion is required. A large number of methods are based on transformations of the model variables ω*Ph* from physical space to corresponding normal variables ω*<sup>N</sup>* via Gaussian anamorphosis (Wackernagel, 1998). This would implicitly define an exact PDF, which once again is an unjustifiably strong assumption. In contrast, the data-driven approach is based directly on the considered moments of the distributions or raw data of the uncertain input variables ω in physical space. The left plot in Figure 6 corresponds to the conventional approach, i.e. involving Gaussian anamorphosis and Hermite polynomials. The right plot in Figure 6 corresponds to the data-driven approach without additional transformation. The convergence of both types of chaos expansions (dashed lines) was validated by traditional Monte Carlo simulations (solid lines) with 1270 realizations, where we sampled directly from the available data sets. We repeated the comparison study for different degrees of the chaos expansion, such as first order (4 evaluations), second order (10 evaluations), third order (20 evaluations) and fourth order (35 evaluations). Figure 6 demonstrates that the data-driven approach provides fast convergence. Even small degrees of the data-driven expansion (even the linear one in our specific case study) ensure adequate representation of the stochastic processes in the considered multiphase flow system. This fact can be very useful for fully resolved and complex real-world application challenges, where computational costs are very high even for a single model evaluation.

Convergence with the conventional approach (here: based on Hermite polynomials) strongly depends on the non-linearity of the required transformation from ω*Ph* to ω*<sup>N</sup>* (Oladyshkin & Nowak, 2011). There are two sources of slow convergence (or errors) for chaos expansions based on non-data-driven polynomial bases. First, the transformation from physical space to normal space introduces additional non-linear behavior into the overall problem, which will require a higher order of expansion to obtain comparable accuracy. Second, for non-intrusive methods that rely on numerical integration to obtain the coefficients *ci*, the accuracy of numerical integration strongly depends on the choice of integration or collocation points. For example, in Gauss-Hermite integration (Abramowitz & Stegun, 1965), the polynomial basis of degree *d* defines the positions of integration points by the roots of the polynomial of degree *d* + 1. These integration points are optimal only if the weighting function (here: the probability measure) and the polynomial basis are in direct correspondence. Any non-linear transformation from *ωPh* to *ω<sup>N</sup>* destroys this direct relation. Thus, using a non-data-driven polynomial basis leads to a non-optimal placement of integration points, which causes a reduced accuracy of the integration (see Oladyshkin & Nowak (2011)). As a consequence, in comparison with the classical PCE, the same order of precision can be achieved with a smaller degree of expansion.

#### **4.4 Conclusions to section 4**

20 Will-be-set-by-IN-TECH

stochastic distributions to real data and thus modifying their higher-order moments) can lead

The data-driven polynomial chaos expansion proposed here for stochastic analysis is based directly on the moments of sampled data without intermediate steps of data re-interpretation. This avoids the subjectivity usually introduced when choosing among a small limited number of theoretical distributions to represent a natural phenomenon, and so avoids the problem illustrated in Figure 5. Complex models often exhibit stronger non-linearity, which can amplify the impact of subjectivity especially for assumptions on extreme value tails and higher-order moments. Subjectivity within data interpretation can cause the same order of magnitude in uncertainty or error as numerical, conceptual and stochastic approximation

The data-driven polynomial chaos expansion provides a simple but powerful tool for stochastic modeling and, in this case study, for the probabilistic risk assessment of CO2 storage. The most integrative characteristic of the overall benchmark problem is the total leakage of CO2. To compare a quantitative characteristic that is most important in probabilistic risk assessment, we computed the cumulative probability function of the CO2 leakage rate after 30 days (Figure 6). The cumulative density function represents the probability that the CO2 leakage is less than or equal to a particular value. In the current work, we also apply the conventional PCE approach for direct comparison with the data-driven approach. The principal difference lies in constructing the polynomial basis. The conventional approach is based on Hermite polynomials, which are optimal for Gaussian random variables. Because the random variables ω are not distributed Gaussian in compliance with the Hermite polynomial basis, an additional conversion is required. A large number of methods are based on transformations of the model variables ω*Ph* from physical space to corresponding normal variables ω*<sup>N</sup>* via Gaussian anamorphosis (Wackernagel, 1998). This would implicitly define an exact PDF, which once again is an unjustifiably strong assumption. In contrast, the data-driven approach is based directly on the considered moments of the distributions or raw data of the uncertain input variables ω in physical space. The left plot in Figure 6 corresponds to the conventional approach, i.e. involving Gaussian anamorphosis and Hermite polynomials. The right plot in Figure 6 corresponds to the data-driven approach without additional transformation. The convergence of both types of chaos expansions (dashed lines) was validated by traditional Monte Carlo simulations (solid lines) with 1270 realizations, where we sampled directly from the available data sets. We repeated the comparison study for different degrees of the chaos expansion, such as first order (4 evaluations), second order (10 evaluations), third order (20 evaluations) and fourth order (35 evaluations). Figure 6 demonstrates that the data-driven approach provides fast convergence. Even small degrees of the data-driven expansion (even the linear one in our specific case study) ensure adequate representation of the stochastic processes in the considered multiphase flow system. This fact can be very useful for fully resolved and complex real-world application challenges, where computational costs are very high even for

Convergence with the conventional approach (here: based on Hermite polynomials) strongly depends on the non-linearity of the required transformation from ω*Ph* to ω*<sup>N</sup>* (Oladyshkin & Nowak, 2011). There are two sources of slow convergence (or errors) for chaos expansions

error, see details in the paper by Oladyshkin, Class, Helmig & Nowak (2011a).

to significant differences in the predicted values.

**4.3.3 Efficient probabilistic risk assessment**

a single model evaluation.

In this Section, we presented a data-driven approach for uncertainty quantification and risk assessment, based on polynomial chaos expansion (PCE). The data-driven approach provides a response surface based on a global orthonormal polynomial basis for arbitrary distributions. Thus, the data-driven approach provides freedom for modeling even physical systems with unknown probability distribution functions, when only data sets of very limited size or only little prior knowledge is available. It offers a new approach to defining parameter uncertainty in stochastic analysis that avoids the subjectivity of assigning theoretical probability distributions. In a small fictitious example, we asked independent experts to select and fit parametric distributions to two raw data sets. The example demonstrated that subjectivity in data interpretation can easily lead to prediction bias. Modeling is a chain of many tasks. Like any chain, it is only as strong as its weakest link. Modeling results indicate clearly that the statistical treatment of input data is part of the chain, and that the subjectivity in assuming theoretical curves can weaken that link immensely.

Fig. 6. Cumulative distribution function of CO2 leakage rate after 30 days: conventional approach (left plot) and data-driven (right plot)

Control Via Robust Design 23

<sup>339</sup> Polynomial Response Surfaces for

Stochastic response surface methods (Isukapalli et al., 1998) deal with the characterization of uncertainties in systems describing the dependence of model output on the uncertain input parameters. Usually in many applied tasks, the model parameters can be classified in two classes: design or control variables that can be chosen by the operator of a system, and uncertain parameters that describe our (incomplete) knowledge of the system properties. On the other hand, the system's performance and failure probability will also depend on design parameters. Evidently, the decision-making for design parameters will depend on the interplay between the response to design parameters, system uncertainty and, finally, the probability of failure. In terms of system response, sensitivity and expansion, this distinction is artificial. Therefore, we drop the distinction between uncertain and design parameters. Instead, we project the model response to all design and uncertain parameters onto a single integrative model response surface. It is a multidimensional surface and contains the integral information about the system behavior under all possible conditions at all points in space and time. Thus, the notion of stochastic response surfaces introduced by Isukapalli et al. (1998) is expanded to integrative response surfaces forming an effective basis for robust design under

We investigate the influence of uncertain and design parameters ω on the model output Ω (integrative response surface approach). This means that the considered model has a

where *NU* is the number of uncertain parameters and *ND* is the number of design parameters.

Of course design parameters do not have probabilistic distributions, but suitable weighting functions for such parameters can be described by user-defined feasibility functions that select the feasible range or preferences of the designing engineer concerning the values of design parameters. Feasibility functions provide a freedom for scenario analysis and can be used as

We again consider the problem of CO2 leakage (Class et al., 2009) presented in Section 4.3. In the current Section we will consider the design task of finding an optimal injection regime. As in the Section 4.3 we consider three uncertain parameters: reservoir absolute permeability, reservoir porosity and permeability of the leaky well. For the current case study we also included two design parameters for describing the injection strategy: the CO2 injection rate (fluctuating around 8.87 kg/s) and the size of the screening interval (up to 30m), see details in the paper by Oladyshkin, Class, Helmig & Nowak (2011b). The choice of the design parameters in this study is only exemplary and serves to demonstrate how engineering decision-making can be supported by the approach presented here. Both the injection rate and the screening interval directly affect the ratio of forces in the reservoir during the injection. The choice of feasibility functions is arbitrary, and modelers have full freedom to introduce feasibility functions and to weight them according to their personal experience or preferences. For example, we chose a beta distribution for the size of the screening interval with lower

*ω*1, ..., *ωNU* , *ωNU*+1, ...*ωNU*+*ND* (16)

uncertainty.

multivariable input ω for the expansion Eq. (1):

an entry point for monetary punishment terms.

**5.2.1 Problem formulation**

The total number of input parameters is *N* = *NU* + *ND*.

Probabilistic Risk Assessment and Risk Control via Robust Design

**5.2 Illustration: robust design for carbon dioxide storage**

The data-driven stochastic approach was validated on the basis of Monte Carlo simulation using a common 3D benchmark problem. The proposed approach yields a significant computational speed-up compared with Monte Carlo, and provides faster convergence than conventional polynomial chaos expansions. Even for small degrees of expansion, the data-driven expansion can be very accurate, which can save a lot of computational power for probabilistic risk analysis.

Data-driven polynomial chaos expansion is based directly on raw data or other arbitrary sources of information without auxiliary assumptions. This increases the efficiency of chaos expansion and minimizes subjectivity, providing valuable support for risk-informed decision making as well as for robust design and control, allowing a better assessment and reduction of risk.
