Preface

Econometric analysis forms the basis of empirical economic research. Our education taught us that econometrics is more than compiling data and running an ordinary least squares (OLS) regression to answer a research question; it helps us separate good ideas from bad ones and assess research questions. Econometrics is applied economics that enables us to make sense of a complicated world and to assess the relationships on which policymakers and analysts could base policy prescriptions. *Econometrics – Recent Advances and Applications* uses applied econometric methods to answer interesting research questions. It includes six chapters written by experts in economics and finance and addresses the application of econometric methods to various fields in economics. As such, it is a useful resource for academicians, researchers, and practitioners in economics and finance.

The chapters in this book provide a theoretical framework for applying econometrics to a particular problem. In empirical research, the acquisition of the appropriate data is paramount. The chapters not only present applications of econometric methods but also provide unique datasets that can be used in empirical research.

Chapter 1, "Evaluating DSGE Models: From Calibration to Cointegration" by Bjørnar Karlsen Kivedal, examines the historical development of estimating new Keynesian dynamic stochastic equilibrium (DSGE) models. The author focuses on how cointegration can be used to test and estimate the relationships in these models. An empirical assessment of a model is crucial to validate the theory, and this should be an important step when analyzing DSGE models. The author illustrates different techniques for estimating DSGE models and compares these methods to using cointegration when estimating and evaluating DSGE models.

Chapter 2, "A Primer on Machine Learning Methods for Credit Rating Modeling" by Yixiao Jiang, studies the important features of predicting corporate bond ratings. There is a growing literature on predicting credit ratings via machine learning methods. However, there have been fewer empirical studies using ensemble methods, which refer to the technique of combining the prediction of multiple classifiers. This chapter compares six machine learning models: ordered logit model (OL), neural network (NN), support vector machine (SVM), bagged decision trees (BDTs), random forest (RF), and gradient-boosted machines (GBMs). This chapter may also serve as a primer for empirical researchers who want to learn machine learning methods by providing an intuitive description of each employed method. The author employed Moody's ratings using data collected from 2001 to 2017. Three broad categories of features, including financial ratios, equity risk, and bond issuer's cross-ownership relation with the credit rating agencies, were explored in the modeling phase, using the data before 2016. These models were tested in an evaluation phase, using the most recent data after 2016.

In Chapter 3, "Forecasting Weekly Shipments of Hass Avocados from Mexico to the United States Using Econometric and Vector Autoregression Models", Oral Capps discusses how domestic production cannot meet the US demand for avocados, satisfying only 10% of the national demand. Due to year-round production and longer shelf life, the Hass variety of avocados account for about 85% of avocados consumed in the United States and roughly 95% of total avocado imports, primarily from Mexico. Using weekly data from July 3, 2011, to October 24, 2021, the author estimated econometric and vector autoregression models regarding the seven main shipment sizes of Hass avocados from Mexico to the United States. Both models discern the impacts of inflation-adjusted and exchange-rate-adjusted prices per box, US disposable income, holidays, and events, and seasonality on the level of Hass avocado shipments by size. These impacts are generally robust across the respective models by shipment size. These models also mimic the variability in the level of shipments by size quite well based on goodness-of-fit metrics. Based on absolute percent error, these models provide reasonably accurate forecasts of Hass avocado shipments from Mexico by size associated with a time horizon of 13 weeks. However, neither type of model provides a better forecast performance universally across all avocado shipment sizes.

Chapter 4, "Spatiotemporal Difference-in-Differences: A Dynamic Mechanism of Socio-Economic Evaluation" by Lijia Mo delves into spatial econometrics, an expanding area in econometrics. Advances in econometric modeling and analysis of spatial cross-sectional and spatial panel data assist in revealing the spatiotemporal characteristics behind socioeconomic phenomena and improving prediction accuracy. Difference-in-differences (DID) is frequently used in causality inference and estimation of the treatment effect of the policy intervention in different time and space dimensions. Relying on flexible distributional hypotheses of treatment versus experiment groups on spillover, spatiotemporal DID provides space for innovation and alternatives, taking spatial heterogeneity, dependence, and proximity into consideration. This chapter gives a practical econometric evaluation of the dynamic mechanism in this spatiotemporal context and a toolkit for this fulfillment.

Chapter 5, "The Impact of Inflation Expectations and Public Debt on Taxation in South Africa" by Thobeka Ncanywa and Noko Setati, investigates the impact of inflation expectations and public debt on taxation in South Africa, employing the autoregressive distributive lag model and Granger Causality techniques. The results indicate a long-term positive relationship between inflation expectations and taxation and a significant negative relationship between public debt and taxation. The empirical results reveal that taxable income will also increase when consumers and businesses expect the inflation rate to rise. The public debt–taxation nexus can imply that the South African government finances its debts through borrowing rather than through taxation. Therefore, economic participants must have full knowledge of what can influence taxation.

Finally, Chapter 6, "Incorporating Model Uncertainty in Market Response Models with Multiple Endogenous Variables by Bayesian Model Averaging" by Jonathan Lee and Alex Lenkoski, develops a method to incorporate model uncertainty through model averaging in generalized linear models subject to multiple endogeneity and instrumentation. Their approach builds on a Gibbs sampler for the instrumental variable framework that incorporates model uncertainty in both the outcome and instrumentation stages. Direct evaluation of model probabilities is intractable in this setting. However, the authors show that by nesting model moves inside the Gibbs sampler, model comparison can be performed via conditional Bayes factors, leading to straightforward calculations. This new Gibbs sampler is only slightly more involved than the original algorithm and exhibits no evidence of mixing difficulties. They further show how the same principle may be employed to evaluate the validity of instrumentation choices. The authors conclude with an empirical marketing study estimating opening box office revenue with three endogenous regressors (prerelease advertising, opening screens, and production budget).

While *Econometrics – Recent Advances and Applications* covers many econometric methods, it is not an exhaustive presentation. Each chapter clearly outlines the research development, starting with its research question(s), providing specific data sources that can be used, and describing the econometric method employed. The authors ably explain their empirical results and their meaning, which can inform policy-making and other decision-making contexts. More importantly, carefully explaining empirical results is an important craft, and these chapters will be a good education for all readers, whether they are junior investigators or more advanced researchers refreshing their knowledge of methods and research. However, readers of *Econometrics – Recent Advances and Applications* should be familiar with research design as well as econometric methods (e.g., Stock and Watson (2019) and Wooldridge (2019)) to apply these methods appropriately.

> **Brian W. Sloboda** University of Maryland, Global Campus, Adelphi, Maryland, USA

#### **Chapter 1**

## Evaluating DSGE Models: From Calibration to Cointegration

*Bjørnar Karlsen Kivedal*

#### **Abstract**

This chapter examines the historical development of estimating new Keynesian dynamic stochastic general equilibrium (DSGE) models. I focus, in particular, on how cointegration can be used in order to test and estimate the relationships in these models, using a simple RBC model as an example. Empirical evaluation of a model is critical to validate the theory, and this should be an essential step when analyzing DSGE models. The chapter illustrates the use of various estimation techniques when estimating DSGE models and compares these methods to using cointegration when estimating and evaluating DSGE models.

**Keywords:** DSGE models, calibration, estimation, cointegration, RBC model

#### **1. Introduction**

Some of the first aggregate macroeconometric models describing national business cycles were developed by Jan Tinbergen in the 1930s. A model for the US was published in 1939 [1], estimated recursively by the ordinary least squares method, based on theoretical dynamic business cycle models such as the one developed by [2]. Tinbergen's work was further developed by [3], who discussed testing economic theory by statistical inference using empirical observations. Furthermore, [4] emphasized using a system of simultaneous equations in order to model the economy and suggested using other estimation methods than ordinary least squares on each equation. Several macroeconometric models were constructed for the US following this, most notably from the work by the Cowles Commission for Research in Economics such as the models by [5, 6]. These were followed by a number of other models of the same type. See, for example, [7] for an historical overview of macroeconometric models.

Macroeconometric models such as these were constructed based on historical data, which were used both for estimating the parameters and for determining the model structure. A structural change in the economy could, therefore, lead to the econometric model no longer being relevant. If these models were not invariant to such changes, they would not be usable for policy analysis, as pointed out by [8]. This became known as "the Lucas critique," suggesting that the behavior of the agents in the economy needed to be explained by a structural model instead of aggregate historical relationships. This was needed in order to have a model invariant to policy changes.

In particular, the parameters of the model that determine tastes and preferences should be invariant to policy changes, while the remaining parts of the model should be regarded as stochastic.

In response to the Lucas critique, real business cycle (RBC) models, as introduced by [9], used microeconomic foundations, where consumers and firms optimized their intertemporal utility or profits under rational expectations. Extensions of the model with various rigidities, monopolistic competition, and short-run non-neutrality of monetary policy led to new Keynesian models,<sup>1</sup> which have since become the standard for both forecasting and policy evaluation (see, e.g., [10]). These models are examples of dynamic stochastic general equilibrium (DSGE) models, and they are typically solved by finding the first-order conditions for the optimization problems of the representative agents of the model. The first-order conditions are then expressed in log deviation from the steady state of the model such that a (log) linear model is obtained. This yields a model where the variables are expressed as log deviations from their respective steady-state values, that is, approximating the percentage deviation from steady state. Furthermore, the part of the model based on preferences should be invariant to policy changes since policy changes should be modeled as stochastic. Hence, the structural DSGE model may be tested by imposing the hypotheses from the model as restrictions on a statistical model. This amounts to testing the Lucas critique since the structural part of the model is tested. If it is not rejected, the model may be useful for policy analysis. DSGE models are often used for analyzing monetary policy. Among the most popular models are the medium-scale models in [11, 12], focusing on the US and the euro area, respectively.

The RBC model of [9] can be considered a cornerstone of DSGE models, and DSGE models are typically extended versions of RBC models. RBC models include optimizing agents with rational expectations, and only one shock is sufficient to generate business cycles. This shock is usually a shock to technology or productivity and is modeled as an exogenous variable that enters the production function. In addition to this, DSGE models also include frictions, which account for much of the observed dynamics. Most important is price stickiness, usually modeled as Calvo pricing [13]. Other frictions, such as wage rigidities (see [14, 15]), are also often found in DSGE models. Other shocks and rigidities are also often included in order to allow for more detailed dynamics. However, many of these frictions turn out to be relatively unimportant empirically (see [11]) and are thus not necessary to explain the dynamics found in the data.

In general, a nonlinear DSGE model can be formulated as

$$E\_t\left[f\left(y\_{t+1}, y\_t, y\_{t-1}, u\_t\right)\right] = \mathbf{0} \tag{1}$$

and has a rational expectations solution

$$\mathbf{y}\_t = \mathbf{g}\left(\mathbf{y}\_{t-1}, \mathbf{u}\_t\right). \tag{2}$$

A linear approximation of such a model is usually used. This is given as

$$
\hat{y}\_t = T(\theta)\hat{y}\_{t-1} + R(\theta)u\_t,\tag{3}
$$

<sup>1</sup> Although they were developed at the same time as RBC models.

where *T* and *R* are time-invariant functions of the structural parameters of the model, *ut* ∼ *N*(0, *Q*), and *ŷt* = log(*yt*/*y*). This is then solved for the representative agent with full information about the model and the structural shocks. For more details, see, for example, [16], which much of the presentation in this chapter is inspired by. Other useful sources for more information are [17–19].
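To fix ideas, the following minimal sketch simulates the linear law of motion in eq. (3) for hypothetical 2 × 2 matrices *T* and *R* (placeholder values chosen only for illustration, not derived from any particular model):

```
# Minimal sketch: simulate y_hat_t = T y_hat_{t-1} + R u_t for placeholder
# matrices T and R (illustrative values, not from a specific DSGE model).
set.seed(1)
T_mat <- matrix(c(0.90, 0.10,
                  0.00, 0.95), nrow = 2, byrow = TRUE)
R_mat <- diag(2)
n_per <- 200
yhat  <- matrix(0, n_per, 2)      # log deviations from steady state
for (t in 2:n_per) {
  u <- rnorm(2, sd = 0.01)        # structural shocks u_t
  yhat[t, ] <- T_mat %*% yhat[t - 1, ] + R_mat %*% u
}
matplot(yhat, type = "l", lty = 1, xlab = "t",
        ylab = "log deviation from steady state")
```

Simulations of this kind are the basis for the calibration exercises discussed in Section 3.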

The next section presents a simple RBC model, which is a special case of a DSGE model. The following sections use this model as an example in order to illustrate calibration, generalized method of moments, full information maximum likelihood, and Bayesian methods. Section 7 presents the cointegrated vector autoregressive model and how to test implications of a DSGE model, while the final section concludes. There is also some code relevant for investigating the model shown in Appendix A.

#### **2. A simple RBC model**

If we consider the simple RBC model in [20], we have households that maximize

$$E\_t \sum\_{t=0}^{\infty} \beta^t (\ln c\_t + \gamma(1 - n\_t)) \tag{4}$$

subject to the budget constraint

$$
x\_t + c\_t = w\_t n\_t + r\_t k\_t \tag{5}
$$

and

$$k\_{t+1} = (\mathbf{1} - \delta)k\_t + \mathbf{x}\_t. \tag{6}$$

Here, *ct* is consumption and *nt* labor (hours worked) in time *t*. *γ* is the utility weight, *xt* investment, *wt* real wage, *rt* rental rate of capital, *kt* capital stock, and *δ* the depreciation rate.

This yields the first-order conditions

$$\mathbf{1}/c\_t = \beta \mathbf{E}\_t(\left(\mathbf{1}/c\_{t+1}\right)\left(\mathbf{1} + r\_{t+1} - \delta\right))\tag{7}$$

$$
\gamma c\_t = w\_t, \tag{8}
$$

which provides the optimal choice of *ct*, *nt*, and *kt*+1. Eq. (7) is an Euler equation, while eq. (8) is the marginal rate of substitution condition.

A single good is produced by perfectly competitive firms (who maximize their profits each period)

$$y\_t = z\_t (k\_t)^{\alpha} (n\_t)^{1-\alpha},\tag{9}$$

where 0 <*α* <1, *yt* is output and *zt* is the technology shock. The technology shock follows an exogenous stochastic process

$$
\ln z\_{t+1} = \rho \ln z\_t + \varepsilon\_{t+1},
\tag{10}
$$

where *εt* is independently, identically, and normally distributed with zero mean and variance *σ*<sup>2</sup>.

The firm chooses the input levels (capital and labor) to maximize profits, so the marginal product of labor (capital) equals the real wage (rental rate).

Hence, the competitive equilibrium is the sequence of prices {*wt*, *rt*} and allocations {*ct*, *nt*, *xt*, *kt*+1, *yt*} for *t* = 0, 1, … such that firms maximize profits, agents maximize utility, and all markets clear. The structural parameters of the model are, thus, *β*, *γ*, *δ*, *α*, and *ρ*. These parameters describe behavior, and we are, therefore, interested in assessing their values.

Some steady-state relationships of the model are

$$\frac{k}{n} = \left( (\mathbf{1}/\beta + \delta - \mathbf{1})/\alpha \right)^{1/(\alpha-1)} \tag{11}$$

$$\frac{c}{y} = \mathbf{1} - \frac{\alpha\delta}{\mathbf{1}/\beta + \delta - \mathbf{1}}.\tag{12}$$

Hence, the long-run relationship between capital and hours worked, *k*/*n*, and the long-run relationship between consumption and output, *c*/*y*, may be described by a combination of structural parameters and should, thus, be constant in the long run.
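For completeness, eq. (11) can be obtained by evaluating the Euler equation (7) and the firm's condition that the marginal product of capital equals the rental rate in the steady state (with *z* = 1):

$$
1 = \beta(1 + r - \delta) \;\Rightarrow\; r = 1/\beta + \delta - 1, \qquad r = \alpha (k/n)^{\alpha - 1} \;\Rightarrow\; \frac{k}{n} = \left((1/\beta + \delta - 1)/\alpha\right)^{1/(\alpha-1)}.
$$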

Such a model is often log-linearized (i.e., written in terms of log deviations from the theoretical steady state, *x̂t* ≡ log *xt* − log *x*) in order to have a stationary representation. This yields

$$\begin{aligned} E\_t\hat{c}\_{t+1} &= \hat{c}\_t + \alpha\beta (k/n)^{\alpha-1} \left[ (\alpha-\mathbf{1})E\_t\hat{k}\_{t+1} + (\mathbf{1}-\alpha)E\_t\hat{n}\_{t+1} + E\_t\hat{z}\_{t+1} \right] \\ \hat{n}\_t &= -(\mathbf{1}/\alpha)\hat{c}\_t + \hat{k}\_t + (\mathbf{1}/\alpha)\hat{z}\_t \\ \hat{y}\_t &= \alpha\hat{k}\_t + (\mathbf{1}-\alpha)\hat{n}\_t + \hat{z}\_t \\ \hat{y}\_t &= \left(\mathbf{1} - \delta(k/n)^{1-\alpha}\right)\hat{c}\_t + \delta(k/n)^{1-\alpha}\hat{x}\_t \\ \hat{k}\_{t+1} &= (\mathbf{1}-\delta)\hat{k}\_t + \delta\hat{x}\_t \\ \hat{z}\_{t+1} &= \rho\hat{z}\_t + \varepsilon\_{t+1} \end{aligned} \tag{13}$$

where the log deviations can be interpreted as percentage deviations from the steady state.

The log-linearized model has the solution

$$\begin{aligned} s\_t &= \Phi \xi\_t \\ \xi\_t &= D \xi\_{t-1} + v\_t. \end{aligned} \tag{14}$$

or

$$
\begin{bmatrix} \hat{y}\_{t} \\ \hat{n}\_{t} \\ \hat{c}\_{t} \end{bmatrix} = \begin{bmatrix} \phi\_{yk} & \phi\_{yz} \\ \phi\_{nk} & \phi\_{nz} \\ \phi\_{ck} & \phi\_{cz} \end{bmatrix} \begin{bmatrix} \hat{k}\_{t} \\ \hat{z}\_{t} \end{bmatrix} \tag{15}
$$

$$
\begin{bmatrix} \hat{k}\_{t} \\ \hat{z}\_{t} \end{bmatrix} = \begin{bmatrix} d\_{11} & d\_{12} \\ d\_{21} & d\_{22} \end{bmatrix} \begin{bmatrix} \hat{k}\_{t-1} \\ \hat{z}\_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon\_{t}^{k} \\ \varepsilon\_{t}^{z} \end{bmatrix}.
$$

The eigenvalues and eigenvectors of the matrix in the system with expectational terms are used in order to calculate the Φ matrix.

There are different ways of obtaining values for the parameters {*β*, *γ*, *δ*, *α*, *ρ*} in the model. Using calibration, we choose the values subjectively or objectively in order to use the model for simulation, while we may estimate the values of the parameters from observed economic data {*ct*, *nt*, *yt*} for *t* = 0, … , *T* by using statistical methods. In the next sections, we will compare calibration and estimation using the generalized method of moments (GMM), full information maximum likelihood (FIML), Bayesian methods, and the cointegrated vector autoregressive (CVAR) model.

#### **3. Calibration**

At first, these models were mainly calibrated and simulated, as proposed by [9]. This was done by fixing the values of structural parameters according to empirical studies in microeconomics or moments of the data such as long-run "great ratios" representing historical relationships between the variables.

We can calibrate the RBC model in [20] by setting the parameter values of the model, that is, assigning values to {*β*, *γ*, *δ*, *α*, *ρ*}. This approach was popular before it became technically feasible to estimate large models and was used in order to undertake computational experiments with the model. Note that it is not possible to estimate parameters and test hypotheses regarding them when using calibration.

Calibration is often used in order to back out shocks (shock decomposition) and to compare correlations between simulated variables and the data. Outcomes of the calibrated model may then be compared to descriptive statistics (e.g., the moments) of the data, and the model may be used to forecast or conduct policy analysis. The in-sample forecast performance may be assessed using measures such as root mean squared errors.
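As a concrete illustration, the sketch below assigns commonly used quarterly values to the parameters (the numbers are illustrative only, not taken from [20]) and computes the implied steady-state ratios in eqs. (11) and (12), which could then be compared with the corresponding "great ratios" in the data:

```
# Illustrative calibration of the RBC model in Section 2; the parameter
# values are common quarterly choices used here only as an example.
beta  <- 0.99   # discount factor
delta <- 0.025  # depreciation rate
alpha <- 0.36   # capital share
rho   <- 0.95   # persistence of the technology shock

# Steady-state capital-to-hours ratio, eq. (11)
k_n <- ((1 / beta + delta - 1) / alpha)^(1 / (alpha - 1))

# Steady-state consumption-to-output ratio, eq. (12)
c_y <- 1 - alpha * delta / (1 / beta + delta - 1)

c(k_n = k_n, c_y = c_y)
```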

Hence, it is possible to judge how well the model fits the empirical reality even if we do not estimate it. Calibration may also be useful for getting a first impression of the model before it is completely developed and estimated. However, we are not able to say anything about uncertainty. It may also be a useful approach when data are not available or only small samples can be obtained, which can be relevant for some regions and countries.

#### **4. Generalized method of moments**

Later, estimation using generalized method of moments (GMM) for single equations was conducted in order to estimate some of the parameters in the model. See, for example, [21] or [22] for estimation of the new Keynesian Phillips curve. GMM was introduced by [23] and first applied to DSGE models by [24, 25].

The method consists of minimizing the distance between some functions of the data and the model. Estimation can, therefore, be conducted using the (nonlinear) first-order conditions, so we are not required to solve the model before estimating the parameters. However, we need a set of moment conditions in order to perform GMM, and it is a type of limited information estimation since we only utilize part of the theoretical model and not necessarily observations for all of the variables in the model. In particular, we have no likelihood function but only specific moments of interest that are matched to the data (so-called matching moments or orthogonality conditions).

Hence, we aim to minimize the distance between the observed sample moments and the population moments implied by the model. In general, the estimate of a parameter vector *θ* is

$$\hat{\theta}\_T = \underset{\theta}{\text{arg min }} Q\_T(\theta) \tag{16}$$

where

$$Q\_T = \left[\frac{1}{T} \sum\_{t=1}^T f(\mathbf{y}\_t, \boldsymbol{\theta})\right]^\prime W\_T \left[\frac{1}{T} \sum\_{t=1}^T f(\mathbf{y}\_t, \boldsymbol{\theta})\right]. \tag{17}$$

Here, *WT* is the weighting matrix, which matters when there are more moment conditions than parameters. We, thus, seek to minimize *QT*, a weighted quadratic form in the sample moments, with respect to *θ*. The GMM estimator of *θ* is the value of *θ* that minimizes *QT*.

We may consider the Euler equation in the RBC model in [20], which was

$$\mathbf{1}/\mathbf{c}\_{t} = \beta \mathbf{E}\_{t}((\mathbf{1}/\mathbf{c}\_{t+1})(\mathbf{1} + r\_{t+1} - \delta)).\tag{18}$$

In order to estimate the parameters {*β*, *δ*}, we need two moment conditions, since there are two parameters but only one equation. The first moment condition can be the Euler equation

$$E\_t\left[\beta \frac{c\_{t+1}}{c\_t} (\mathbf{1} + r\_{t+1} - \delta)\right] = \mathbf{0},\tag{19}$$

or, more precisely, that *Et*[*β*(*ct*+1/*ct*)(1 + *rt*+1 − *δ*)] − 1 = 0. The second may be

$$E\_t \left[ \beta \frac{c\_{t+1}}{c\_t} (\mathbf{1} + r\_{t+1} - \delta) \right] \frac{c\_t}{c\_{t-1}} = \mathbf{0},\tag{20}$$

since a condition with expectation zero multiplied by a variable in the agent's information set (an instrument) still has expectation zero.

Hence, the data are {*ct*+1/*ct*, *rt*+1}, and the instruments are {1, *ct*/*ct*−1}; *rt* could also have been used as an instrument. This implies that the averages (first moments) of these series are used in order to estimate the parameter values.
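The following sketch illustrates the mechanics on simulated data: it stacks the two moment conditions above (with instruments 1 and *ct*/*ct*−1), forms the objective in eq. (17) with an identity weighting matrix, and minimizes it numerically. The data-generating values are made up purely for illustration:

```
# Minimal GMM sketch for (beta, delta) based on the Euler-equation moments.
# Simulated consumption growth and rental rates; values are illustrative only.
set.seed(42)
T_obs   <- 500
cgrowth <- exp(rnorm(T_obs + 1, mean = 0.005, sd = 0.01))  # c_{t+1}/c_t
r       <- rnorm(T_obs + 1, mean = 0.035, sd = 0.005)      # rental rate

gmm_obj <- function(par) {
  beta  <- par[1]
  delta <- par[2]
  u <- beta * cgrowth[-1] * (1 + r[-1] - delta) - 1   # Euler residual
  z <- cbind(1, cgrowth[-(T_obs + 1)])                # instruments: 1, c_t/c_{t-1}
  gbar <- colMeans(u * z)                             # sample moments
  sum(gbar^2)                                         # identity weighting matrix
}

fit <- optim(c(0.98, 0.05), gmm_obj, method = "L-BFGS-B",
             lower = c(0.90, 0.001), upper = c(0.999, 0.2))
fit$par   # GMM estimates of beta and delta
```

With made-up data like these, the two parameters are only weakly identified, so the estimates will sit somewhere along a ridge of the objective; the point of the sketch is only the structure of the moment conditions and the objective function.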

When using GMM, the choice of instruments may affect the estimates. We may also have issues with unobserved variables. If analytical moment conditions are impossible or hard to obtain, they can be computed numerically by simulation (often called simulated GMM). This is particularly useful if there are unobservable variables in Euler equations such as the one above or nonlinear functions of steady-state parameters. Further, a large sample is needed for asymptotic theory to apply, and Monte Carlo studies have not been favorable to GMM [26].

#### **5. Full information maximum likelihood**

In order to identify the structural parameters of the system (i.e. the complete theoretical model), the full system should be estimated. Full information maximum likelihood (FIML) estimation, such as, for example, in [26], can be used in order to estimate the parameters of new Keynesian models. Hence, this uses full information from the model rather than the limited information approach used in GMM, where we looked at some moment conditions. When using maximum likelihood, the estimated parameters will be the ones that provide the maximum of the likelihood function or the log of the likelihood function.

In general, the data (*y*) depend on the unknown parameters *θ* through a probability density function

$$
\mathcal{Y} \sim f(\mathcal{Y}; \theta). \tag{21}
$$

The estimator is then ^*θ*, and it is a function of the data

$$
\hat{\boldsymbol{\theta}} = \mathbf{g}(\boldsymbol{y}).\tag{22}
$$

Given the observed *y*0, the estimator is then obtained by the likelihood function

$$\hat{\theta} = \arg\max\_{\theta} \left\{ f(y\_0; \theta) \right\}, \tag{23}$$

or the log of this function. We then get the value of *θ* that maximizes *f*(*y*0; *θ*), that is, the parameters that yield the maximum probability of observing *y*0.

Since the equations in DSGE models are typically nonlinear, we first need to solve the model and obtain a representation in which all endogenous variables are expressed as functions of the exogenous variables and the parameters of the model. In practice, a linear approximation of the model is usually solved instead. The variables are then represented as deviations from the theoretical steady state (see Section 2). The structural parameters are estimated, and the model is assumed to be the true data-generating process; see, for example, [26] or [27].

Almost all log-linearized DSGE models have a state-space representation

$$\begin{aligned} \mathbf{x}\_t &= A\mathbf{x}\_{t-1} + B\boldsymbol{\varepsilon}\_t \\ \mathbf{y}\_t &= \mathbf{C}\mathbf{x}\_t + D\boldsymbol{\eta}\_t, \end{aligned} \tag{24}$$

where *xt* is a vector containing the endogenous and exogenous state variables, and *yt* is a vector containing the observed variables. The second equation is the measurement equation, linking the data to the model. The objective is to estimate the parameters given the observed *yt*. The error term *ηt* ∼ *N*(0, *R*) is independent of *xt*, and *εt* ∼ *N*(0, *Q*) is independent of *x*0, *x*1, … , *xt* and *y*1, … , *yt*. Further, the matrices *A*, *B*, *C*, and *D* contain nonlinear functions of the structural parameters.

If both *xt* and *yt* contain observables, the state-space representation is a restricted VAR(1). If not, we may use a Kalman filter [28] in order to obtain the expected value of the unobservable variables and evaluate the likelihood function. This provides one-step-ahead forecast errors (in-sample) and the recursive variance of these forecast errors. Hence, the Kalman filter gives the expected value of all potentially unobserved variables given the history of the observed variables.
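As a sketch of how the likelihood is built up, the code below evaluates the Gaussian log-likelihood of the state-space form in eq. (24) with a standard Kalman filter. It assumes *εt* ∼ *N*(0, *I*) and *ηt* ∼ *N*(0, *I*), so the innovation variances are *BB*′ and *DD*′; in practice, *A*, *B*, *C*, and *D* would be evaluated as functions of the structural parameters at every trial value of *θ*:

```
# Kalman-filter log-likelihood for the state space in eq. (24), assuming
# eps_t ~ N(0, I) and eta_t ~ N(0, I); A, B, C, D are supplied by the user.
kalman_loglik <- function(y, A, B, C, D) {
  n     <- nrow(A)
  T_obs <- nrow(y)
  x  <- matrix(0, n, 1)        # state mean, initialized at the steady state
  P  <- diag(n) * 10           # fairly diffuse initial state variance
  Q  <- B %*% t(B)             # state innovation variance
  R  <- D %*% t(D)             # measurement error variance
  ll <- 0
  for (t in 1:T_obs) {
    x_pred <- A %*% x                               # one-step-ahead state
    P_pred <- A %*% P %*% t(A) + Q
    v  <- matrix(y[t, ], ncol = 1) - C %*% x_pred   # forecast error
    Ft <- C %*% P_pred %*% t(C) + R                 # forecast error variance
    ll <- ll - 0.5 * (log(det(Ft)) + t(v) %*% solve(Ft) %*% v +
                      length(v) * log(2 * pi))
    K <- P_pred %*% t(C) %*% solve(Ft)              # Kalman gain
    x <- x_pred + K %*% v                           # updated state mean
    P <- P_pred - K %*% C %*% P_pred                # updated state variance
  }
  as.numeric(ll)
}
```

FIML then amounts to maximizing this log-likelihood over the structural parameters *θ*.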

FIML also has some limitations, depending on what the DSGE model looks like. Firstly, we need at least as many shocks as observable variables in order to perform the estimation; otherwise, the model is stochastically singular. We, thus, often need to add shocks or measurement errors to the model in order to utilize the full potential of a data set with many variables. The model in [20] has only one shock, *εt*, but three observables, {*yt*, *ct*, *nt*}. We can, thus, add structural shocks or measurement errors to the model if we want to utilize data on all observables.

When using FIML, we assume that the model is the correct representation of the data generating process (DGP). Hence, FIML is sensitive to misspecification since we estimate the model under this assumption. We often also have partial or weak identification of parameters when using FIML, see, for example, [29]. Both of these issues may lead to "the dilemma of absurd parameter estimates" [30], which implies that FIML estimates of structural parameters can often be at odds with additional information economists may have.

#### **6. Bayesian methods**

Since DSGE models contain many parameters and are often estimated on relatively small samples of quarterly data, the likelihood function typically contains many local maxima and minima and nearly flat surfaces [19], making identification difficult. In order to circumvent this issue, DSGE models are often estimated using Bayesian methods, which combine a prior distribution with likelihood-based estimation such as the FIML approach presented in the previous section. This, thus, addresses some of the problems with maximum likelihood estimation. However, the estimated parameters from Bayesian methods do not necessarily reflect all of the information in the data since the prior distributions will influence the estimates to some extent. Using priors may also hide identification problems, an issue often neglected when estimating DSGE models [29].

The main difference between FIML and Bayesian methods is the way the data are treated or interpreted. In frequentist methods such as FIML, the parameters are fixed and the data are random. This allows us to estimate the variance of the estimator and construct confidence intervals, that is, intervals that contain the true *θ* in 100(1 − *α*) percent of repeated samples. Bayesian inference treats the data as fixed and the parameters as unknown. We may, therefore, focus on the variance of the parameter (rather than the variance of the estimator). Credible intervals in Bayesian estimation show the interval that has the highest probability of including *θ* conditional on the observed data, a prior distribution on *θ*, and a functional form (the DSGE model in our case).

If we have the model *f*(*y*|*θ*) and the prior *f*(*θ*), we want to find the posterior probability density function *f*(*θ*|*y*). Using Bayes' rule, we have

$$f(\theta|\mathbf{y}) = \frac{f(\mathbf{y}|\theta)f(\theta)}{f(\mathbf{y})} \tag{25}$$

$$f(\theta|\mathbf{y}) \propto f(\mathbf{y}|\theta) f(\theta). \tag{26}$$

Hence, the posterior kernel equals the likelihood multiplied by the prior. We can, thus, find the distribution of the unknown parameter *θ*. Additionally, a point estimate can be obtained from the posterior, typically the mean, median, or mode of the posterior distribution, or more generally the minimizer of an expected loss, *θ̃* = arg min *E*[ℒ(*θ̂* − *θ*)]. The squared error loss, the absolute error loss, and the maximum a posteriori criterion yield, respectively, the mean, the median, and the mode of the posterior distribution. Bayesian estimation is, thus, a combination of maximum likelihood estimation and a prior distribution. It is also important to remember that the data are treated as fixed when using Bayesian estimation. We do, therefore, not necessarily seek to use the results for generalizing purposes, whereas with maximum likelihood we try to find the parameter(s) that give the highest probability of observing the data at hand.

In Bayesian estimation, the aim is to find the posterior density function *f*(*θ*|*y*0). This shows how the parameters *θ* depend on the data *y*. *θ* is assumed random, while it was assumed deterministic in the case of maximum likelihood estimation. A prior distribution *f*(*θ*) thus needs to be specified before estimation, as it is combined with likelihood-based estimation as in FIML. Prior distributions may be subjective or objective. Subjective priors reflect subjective opinions, while objective priors can be obtained from microeconomic empirical studies [31]. The weight we put on the prior distribution relative to the likelihood function also needs to be chosen a priori. We, thus, have two extreme cases of Bayesian estimation: 1) no weight on the prior (e.g., flat priors), which is similar to FIML, and 2) full weight on the prior and none on the likelihood, which is consistent with calibration. Hence, Bayesian estimation can be considered a combination of calibration and FIML.

The posterior is simulated by an algorithm such as a Markov chain Monte Carlo method, and the accepted parameter values form a histogram, which can be smoothed to provide the posterior distribution function.
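A minimal random-walk Metropolis-Hastings sketch is given below. It assumes a function loglik(theta) returning the log-likelihood of the solved model (for instance, the Kalman-filter likelihood from the previous section) and a function logprior(theta) returning the log prior density; both are placeholders for whatever model is being estimated:

```
# Random-walk Metropolis-Hastings sketch for the posterior of theta.
# loglik() and logprior() are placeholders to be supplied by the user.
run_mh <- function(loglik, logprior, theta0, n_draws = 10000, step = 0.01) {
  draws <- matrix(NA, n_draws, length(theta0))
  theta <- theta0
  post  <- loglik(theta) + logprior(theta)
  for (i in 1:n_draws) {
    proposal <- theta + rnorm(length(theta), sd = step)   # random-walk proposal
    post_new <- loglik(proposal) + logprior(proposal)
    if (log(runif(1)) < post_new - post) {                # accept or reject
      theta <- proposal
      post  <- post_new
    }
    draws[i, ] <- theta
  }
  draws   # posterior draws (a burn-in should be discarded before use)
}
```

The retained draws approximate the posterior distribution, and their mean, median, or mode gives the point estimates discussed above.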

An advantage of Bayesian estimation is that we may avoid identification issues that often are a problem when using FIML. However, we are also prone to hide these issues, which may be a problem. As argued in [32], Bayesian estimates should be compared to FIML estimates in order to see what role the priors have. Another advantage of Bayesian estimation is that we do not need to assume that the model is the correct DGP as for FIML and GMM.

Using prior information may also be an advantage since available information is then taken into account in the estimation process even if it is not part of the model or the data set used in the estimation. However, if the same data are used for the prior information as for the Bayesian estimation, for example, great ratios, the priors do not add any information. It is also possible to compare different models *via* posterior odds ratios, see [33].

However, it may be difficult to replicate results from Bayesian estimation due to computationally intensive simulation methods (Metropolis-Hastings algorithm) [18]. For an overview of recent developments in Bayesian methods, see [34].

#### **7. The cointegrated VAR model**

DSGE models often contain variables that are nonstationary, such as prices, wages, GDP, and productivity, and we use a log-linearized model with stationary variables in order to estimate the model with FIML or Bayesian methods. The data are then usually filtered by, for example, the Hodrick-Prescott or the band-pass filter in order to separate the trend and the cyclical component of the nonstationary data series, see, for example, [17]. Hence, the cyclical component of a variable in the data should correspond to the deviation from steady state for a variable in the theoretical model and is then used in order to estimate the (log-linearized) DSGE model. While the filtered cycle measures deviations from an estimated trend, the log deviation in the theoretical model measures deviation from the theoretical steady state. Hence, there may be a mismatch between the trend component of the data and the theoretical trending relationships in the model, expressed by the steady-state relationships of the model. This should be taken into account when estimating DSGE models since the steady-state relationships are expected to correspond to the long-run relations of the observed variables.

We saw that the log-linear system may be solved to yield a purely backward-looking solution such that it is represented by a vector autoregressive (VAR) model containing cross-equation restrictions from the DSGE model if all of the variables are observable.<sup>2</sup> An estimated VAR model should, therefore, be similar to the solution of a new Keynesian model if the model is the true data generating process.

Since the solution of the DSGE model takes the form of a restricted VAR model, another approach for estimating such a model is to first estimate an unrestricted VAR model and then impose various restrictions on it from the theoretical DSGE model. This implies going from a general to a specific model, and it allows testing the restrictions as they are imposed on the unrestricted model. If the restrictions are rejected, the theoretical model can be modified such that it is more in line with the empirical observations.

A VAR model with *k* lags may be written as

$$Z\_t = \Pi\_1 Z\_{t-1} + \dots + \Pi\_k Z\_{t-k} + \varepsilon\_t,\tag{27}$$

where *Zt* is a vector of observed variables. A DSGE model has this representation (typically with *k* ¼ 1 lag) if all of its variables are observable as shown in (24). The VAR may be reformulated to a vector error correction model (VECM) such as

$$
\Delta Z\_t = \Gamma\_1 \Delta Z\_{t-1} + \dots + \Gamma\_{k-1} \Delta Z\_{t-k+1} + \alpha \tilde{\boldsymbol{\beta}}^\prime \tilde{Z}\_{t-1} + \boldsymbol{\gamma}\_0 + \boldsymbol{\gamma}\_1 t + \boldsymbol{\varepsilon}\_t,\tag{28}
$$

where *β̃*′ = [*β*, *β*0, *β*1], *Z̃t*−1 = [*Zt*−1, 1, *t*]′, *εt* ∼ *IN*(0, Ω) for *t* = 1, … , *T*, and *Z*−1, *Z*0 are given. *γ*0 is a constant and *γ*1 the coefficient on a linear trend. If there are one or more linear combinations of nonstationary (integrated of order one, *I*(1)) variables that are stationary (integrated of order zero, *I*(0)), they can be considered cointegration relationships. These are found by imposing reduced rank on the estimated VAR and yield the cointegrated vector autoregressive (CVAR) model [36]. The cointegration rank is found through statistical tests and should match what is implied by theory (e.g., the number of steady-state relationships in the DSGE model). Common stochastic trends should cancel through the steady-state relationships if they are driven by unit roots.

Additionally, the data do not need to be pre-filtered when using this approach since assumptions from the theoretical model about the stochastic trends may be tested and imposed. First, we find the number of cointegrating vectors in the data. These represent the long-run properties of the data and should correspond to the steady state of the theoretical model. The long-run properties of the model are then imposed as restrictions on the *β* vectors in the VECM in eq. (28). There should, for example, be a constant relationship between capital *k* and hours worked *n* and between consumption *c* and output *y* in the model in [20], as shown in eq. (11) and eq. (12).

<sup>2</sup> If only some variables are observed, it has a state-space representation in the form of a vector autoregressive moving average (VARMA) model, see, for example, [35].


For an example of this, see [37], which tests several restrictions from the theoretical DSGE model in [27] using the CVAR framework. Similar testing of the long-run properties of DSGE models can be found in [38, 39]. This is in line with using the VAR model as a statistical model and testing theory through the probabilistic approach suggested by [3], see, for example, [40].

Short-run restrictions may also be imposed and tested through cross-equation restrictions on the VAR representation of the data, such as the restrictions implied by the parameters in (15). See [41] for an example of this. Using the CVAR model thereby allows using frequentist methods while dealing with potential misspecification. Hence, we do not need to use Bayesian methods if we would like to relax the assumption that the model is the true DGP as in GMM and FIML; we can instead test it in the CVAR framework. If the restrictions from the DSGE model are rejected when tested in the CVAR model, this may suggest misspecification. The theoretical model can then be modified to be more in line with what we find empirically.

#### **8. Conclusion**

As shown in this chapter, calibration may be useful for assessing the relevance of a theoretical model through, for example, simulations. This is often necessary if data are not available for many of the variables in the model. Calibration may also be used as a preliminary step in modeling and evaluation.

The generalized method of moments does not require that we solve the model before estimation, and we do not need observations on all of the variables in the model. This avoids the problem of stochastic singularity, which is an issue when we use full-information estimation methods. However, this also implies that we usually focus on only a subset of the model and the relevant variables.

Full information maximum likelihood and Bayesian estimation both involve using the complete model (usually in a log-linearized form) and take full advantage of the data. While maximum likelihood may have identification issues for the structural parameters of the model, Bayesian methods can address this by using prior distributions for the parameters. However, the choice of priors and the weight placed on them may influence the estimates such that the data set at hand is not allowed to speak freely.

By using the cointegrated vector autoregressive model, we are able to test the theoretical implications of the model, in particular the long-run implications of a model, rather than assuming that the model is the true data generating process as with full information maximum likelihood or generalized method of moments. We also do not need to filter the data before estimating the model, removing the problem of a potential mismatch between the theoretical steady state and the long-run relationships in the data. Hence, if we would like to take full advantage of the data while also testing the implications from the model, the cointegrated vector autoregressive model is a relevant tool. We may use it as a preliminary step to assess the empirical relevance of a theoretical model or use it as a fully specified macroeconometric model.

#### **Notes**

This chapter is based on the trial lecture titled "Describe and compare different methods for analyzing DSGE models: Calibration, GMM, FIML, Bayesian methods, and CVAR," given in defense of my Ph.D. in Economics at the Norwegian University of Science and Technology, as well as on the introductory chapter of the thesis "Testing economic theory using the cointegrated vector autoregressive model: New Keynesian models and house prices"; see [42].

#### **Appendix**

The methods illustrated in this chapter for evaluating and estimating Hansen's RBC model [20] can be carried out and investigated using publicly available code.

For calibration of the model and estimation using full information maximum likelihood and Bayesian methods, the most convenient approach is perhaps to use Dynare code, available at Johannes Pfeifer's home page on GitHub: https://github.com/johannespfeifer/dsge\_mod [43]. For more information about Dynare, a program that runs under Matlab or Octave, see www.dynare.org. For GMM estimation of Hansen's RBC model, see [44].

In order to test the long-run implications of a DSGE model, I have estimated a cointegrated VAR with quarterly data on output, consumption, hours worked, and capital from 1960 to 2002 using R. The code is shown below. The data set is available at [45] and was used by [37] in order to test the implications of the model in [27]. In the code below, one of the long-run restrictions of Hansen's model, found in the steady state for the output-to-consumption ratio, is tested using commands in the urca package [46] in R. I also include the dummy variables accounting for extraordinary institutional events used in [37] to specify the model.

```
library(urca)  # Johansen cointegration tests (ca.jo, blrtest)

alldata <- read.table(file = "irelanddata.csv",
                      sep = ";", header = TRUE)
logdata <- subset(alldata, select = c(qtr, Ly, Lc, Lh, LCapP))
colnames(logdata)[colnames(logdata) == "LCapP"] <- "Lk"
dummyvar <- read.table(file = "dummies.csv",
                       sep = ";", header = TRUE)
total <- merge(logdata, dummyvar, by = "qtr")
attach(total)
data <- cbind(Ly, Lc, Lh, Lk)   # endogenous variables
dum  <- cbind(Ds7801, Dp7003, Dp7403, Dp7404, Dp7801, Dtr8001)  # dummies
cointd <- ca.jo(data, type = "trace", K = 2,
                season = 4, dumvar = dum)
summary(cointd)
H <- matrix(c(1, 0, 0,
              1, 0, 0,
              0, 1, 0,
              0, 0, 1),
            nrow = 4, ncol = 3, byrow = TRUE)
betarestrictions <- blrtest(z = cointd, H = H, r = 1)
summary(betarestrictions)
```
First, the data are loaded into the object alldata. I then take the natural logarithm of the variables that are used and place them in logdata. The variable for capital is then renamed in order to match the theoretical model, which uses the letter *k* for capital. The dummy variables from a separate dummies.csv file, matching the dummy variables from [37], are loaded into dummyvar, and the data are combined into the data frame total and attached. The data are then separated into the endogenous variables in data and the exogenous dummy variables in dum.

#### **Figure 1.**

*Difference between log of income and log of consumption.*

By using the command ca.jo, I estimate the VAR model and test for the cointegration rank. This is then set to *r* = 2 as in [37], and the restriction of a constant long-run relationship between *Ly* and *Lc* is imposed on the beta matrix.

The restriction of a stationary long-run relationship between consumption and income, (*Ly* − *Lc*) ∼ *I*(0), yields a p-value of 0, indicating that we reject it. This is perhaps not surprising, given the plot of the difference between log of income and log of consumption shown in **Figure 1**, where we observe an upward trend rather than stationarity.

#### **Author details**

Bjørnar Karlsen Kivedal Østfold University College, Halden, Norway

\*Address all correspondence to: bjornar.k.kivedal@hiof.no

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Tinbergen J. Business Cycles in the United States of America, 1919–1932. London: League of Nations, Economic Intelligence Service; 1939

[2] Frisch R. Propagation problems and impulse problems in dynamic economics. In: Economic Essays in Honour of Gustav Cassel. London: Allen and Unwin; 1933. pp. 171-205

[3] Haavelmo T. The probability approach in econometrics. Econometrica: Journal of the Econometric Society. 1944;**12**:1-114

[4] Haavelmo T. The statistical implications of a system of simultaneous equations. Econometrica, Journal of the Econometric Society. 1943;**11**(1):1-12

[5] Klein LR. Economic Fluctuations in the United States, 1921–1941. Vol. 176. New York: Wiley; 1950

[6] Klein LR, Goldberger AS. An Econometric Model of the United States, 1929–1952. Vol. 9. Amsterdam: North-Holland Publishing Company; 1955

[7] Welfe W. Macroeconometric Models. Advanced Studies in Theoretical and Applied Econometrics. Heidelberg: Springer Berlin; 2013

[8] Lucas RE. Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy. 1976;**1**:19-46

[9] Kydland FE, Prescott EC. Time to build and aggregate fluctuations. Econometrica: Journal of the Econometric Society. 1982;**50**(6): 1345-1370

[10] Galí J. Monetary Policy, Inflation, and the Business Cycle: An Introduction to the New Keynesian Framework. Princeton and Oxford: Princeton Univ Pr; 2008

[11] Smets F, Wouters R. Shocks and frictions in US business cycles: A Bayesian DSGE approach. American Economic Review. 2007;**97**(3):586-606

[12] Smets F, Wouters R. An estimated dynamic stochastic general equilibrium model of the euro area. Journal of the European Economic Association. 2003; **1**(5):1123-1175

[13] Calvo GA. Staggered prices in a utility-maximizing framework. Journal of Monetary Economics. 1983;**12**(3): 383-398

[14] Blanchard O, Galí J. Real wage rigidities and the new Keynesian model. Journal of Money, Credit and Banking. 2007;**39**:35-65

[15] Blanchard O, Galí J. Labor markets and monetary policy: A new Keynesian model with unemployment. American Economic Journal: Macroeconomics. 2010;**2**(2):1-30

[16] Canova F. Methods for Applied Macroeconomic Research. Princeton and Oxford: Princeton University Press; 2007

[17] DeJong DM, Dave C. Structural Macroeconometrics. Princeton and Oxford: Princeton University Press; 2007

[18] Tovar C. DSGE models and central banks. Economics: The Open-Access, Open-Assessment E-Journal. 2009;**3**:2009-16. DOI: 10.5018/economics-ejournal.ja.2009-16

[19] Fernández-Villaverde J. The econometrics of DSGE models. SERIEs: Journal of the Spanish Economic Association. 2010;**1**(1):3-49

[20] Hansen GD. Indivisible labor and the business cycle. Journal of Monetary Economics. 1985;**16**(3):309-327

[21] Galí J, Gertler M, López-Salido J. Robustness of the estimates of the hybrid New Keynesian Phillips curve. Journal of Monetary Economics. 2005;**52**(6): 1107-1118

[22] Galí J, Gertler M. Inflation dynamics: A structural econometric analysis. Journal of Monetary Economics. 1999;**44**(2):195-222

[23] Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society. 1982;**50**(4): 1029-1054

[24] Christiano LJ, Eichenbaum M. Current real-business-cycle theories and aggregate labor-market fluctuations. The American Economic Review. 1992;**82**(3): 430-450

[25] Burnside C, Eichenbaum M, Rebelo S. Labor hoarding and the business cycle. Journal of Political Economy. 1993;**101** (2):245-273

[26] Lindé J. Estimating new-Keynesian Phillips curves: A full information maximum likelihood approach. Journal of Monetary Economics. 2005;**52**(6): 1135-1149

[27] Ireland PN. A method for taking models to the data. Journal of Economic Dynamics and Control. 2004;**28**:1205-1226

[28] Kalman RE. A new approach to linear filtering and prediction problems. Journal of Basic Engineering. 1960;**82**: 13-45

[29] Canova F, Sala L. Back to square one: Identification issues in DSGE models. Journal of Monetary Economics. 2009;**56**(4):431-449

[30] An S, Schorfheide F. Bayesian analysis of DSGE models. Econometric Reviews. 2007;**26**(2–4):113-172

[31] Del Negro M, Schorfheide F. Priors from general equilibrium models for VARS. International Economic Review. 2004;**45**(2):643-673

[32] Fukac M, Pagan A. Issues in adopting DSGE models for use in the policy process, Australian National University, Centre for Applied Macroeconomic Analysis. CAMA Working Paper. 2006; **10**:2006

[33] Rabanal P, Rubio-Ramírez JF. Comparing new Keynesian models of the business cycle: A Bayesian approach. Journal of Monetary Economics. 2005;**52**(6):1151-1166

[34] Fernández-Villaverde J, Guerrón-Quintana PA. Estimating DSGE models: Recent advances and future challenges. Annual Review of Economics. 2021;**13**:229-252

[35] Fernández-Villaverde J, Rubio-Ramirez JF, Sargent T, Watson MW. ABCs (and Ds) of understanding VARs. American Economic Review. 2007; **97**(3):1021-1026

[36] Johansen S. Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control. 1988; **12**(2):231-254

[37] Juselius K, Franchi M. Taking a DSGE model to the data meaningfully. Economics: The Open-Access, Open-Assessment E-Journal. 2007;**1**:4

[38] Kivedal BK. A DSGE model with housing in the cointegrated VAR framework. Empirical Economics. 2014;**47**(3):853-880

[39] Kivedal BK. A new Keynesian framework and wage and price dynamics in the USA. Empirical Economics. 2018;**55**(3):1271-1289

[40] Juselius K. The Cointegrated VAR Model. Methodology and Applications. New York: Oxford University Press; 2006

[41] Bårdsen G, Fanelli L. Frequentist evaluation of small DSGE models. Journal of Business & Economic Statistics. 2015;**33**(3):307-322

[42] Kivedal BK. Testing Economic Theory Using the Cointegrated Vector Autoregressive Model: New Keynesian Models and House Prices. PhD thesis. Trondheim: Norwegian University of Science and Technology. Faculty of Social Sciences and Technology Management. Department of Economics; 2013

[43] Johannes Pfeifer's home page on GitHub. https://github.com/johannespfeifer/dsge\_mod [Accessed: March 24, 2023]

[44] Burnside AC. Real Business Cycle Models: Linear Approximation and GMM Estimation. Washington, D.C.: Mimeo, The World Bank; 1999

[45] Juselius K and Franchi M. Taking a DSGE Model to the Data Meaningfully [Dataset]. 2009

[46] Pfaff B, Zivot E, Stigler M, Pfaff MB. Package urca: Unit root and cointegration tests for time series data. R Package Version. 2016:1-2. Available from: http://cran.pau.edu.tr/web/packages/urca/urca.pdf

#### **Chapter 2**

## A Primer on Machine Learning Methods for Credit Rating Modeling

### *Yixiao Jiang*

#### **Abstract**

Using machine learning methods, this chapter studies the features that are important for predicting corporate bond ratings. There is a growing literature on predicting credit ratings via machine learning methods. However, there have been fewer empirical studies using ensemble methods, which refer to the technique of combining the predictions of multiple classifiers. This chapter compares six machine learning models: the ordered logit model (OL), neural network (NN), support vector machine (SVM), bagged decision trees (BDT), random forest (RF), and gradient boosted machines (GBMs). By providing an intuitive description of each employed method, this chapter may also serve as a primer for empirical researchers who want to learn machine learning methods. Moody's ratings were employed, with data collected from 2001 to 2017. Three broad categories of features, including financial ratios, equity risk, and the bond issuer's cross-ownership relation with the credit rating agencies, were explored in the modeling phase, performed with the data prior to 2016. These models were then tested in an evaluation phase, using the most recent data after 2016.

**Keywords:** machine learning, credit ratings, forecasting, random forest, gradient boosted machine

#### **1. Introduction**

An issue of continuing interest to many financial market participants (portfolio risk managers, for example) is predicting corporate bond ratings for unrated issuers. Issuers themselves may seek a preliminary estimate of what their rating might be in order to decide on the ratio of debt to equity financing. Starting with the seminal works of [1, 2], pioneering studies in the finance literature used accounting ratios and other publicly available information in reduced-form models to predict credit ratings. A variety of statistical techniques (OLS, discriminant analysis, and ordered logit/probit models) were employed to identify the most important characteristics for predicting ratings; see [3–5].

Bond rating is, in a way, a classification problem. There is also a growing literature on predicting credit ratings via machine learning (ML) methods [6–11]. As can be seen from **Table 1**, the neural network (NN) and the support vector machine (SVM) have been


*Note: SVM = Support Vector Machine. NN = Neural Network. MDA = Multivariate Discriminant Analysis. RF = Random Forest. RST = Rough Set Theory.*

#### **Table 1.**

*Summary of credit rating predictive studies using machine learning.*

widely employed by prior studies. However, there have been less empirical studies using *ensemble methods*, which refer to the technique of combining the prediction of multiple classifiers. This study attempts to fill the void by employing three ensemble methods to predict credit ratings and contrasting their performance with popular single-classifier ML methods.

The two popular methods for creating accurate ensembles are bootstrap aggregating, or bagging, and boosting. Previous work in the statistics and computer science literature has shown that these methods are very effective for decision trees (DT)<sup>1</sup>, so this chapter considers DT as the basic classification method. [11] employs the random forest (RF) to predict enterprise ratings in Taiwan. To our knowledge, no comparative study using ensemble methods has been carried out for the United States. Other than RF, this study also employs two additional ensemble methods: bagged decision trees (BDT) and gradient boosted machines (GBM).

<sup>1</sup> See, for example, [12–14].

This study is also the first to explore the predictive power of conflicts of interest in forecasting bond ratings. After the collapse of highly rated securities during the 07–09 financial crisis, the role of credit rating agencies (CRAs) as gatekeepers to financial markets has been scrutinized by academia and regulators at an unprecedented level. A number of conflicts of interest, including the issuer-pays business model, cross-ownership [15, 16], non-rating business relationships [17], and transitioning analysts [18], have been identified in the literature as contributing factors to rating inflation.

The type of conflict of interest under study arises from cross-ownership, meaning that the bond issuer and the CRA are owned, in part, by common shareholders. Conflicts of interest between shareholders and managers, at a general level, have a variety of negative impacts on a company [19]. In the context of the rating industry, as noted by [16], companies in which Moody's two large shareholders, Berkshire Hathaway and Davis Selected Advisors, are invested tend to receive more favorable ratings than others. Based on institutional ownership data, [15] constructed an index to capture bond issuers' cross-ownership with Moody's via all common shareholders and found such biases to be more universal.

Motivated by the aforementioned studies, this chapter incorporates several conflict-of-interest measures from the cross-ownership channel to predict Moody's ratings from 2001 to 2017. Since the predictive performance of ML methods is usually context-dependent, we compare the aforementioned tree-based ensemble methods (RF, BDT, and GBM) with three other ML models: the ordered logit model (OL), neural network (NN), and support vector machine (SVM). RF delivers the best results, correctly predicting 73.2% of ratings out of sample. To improve the interpretability of "black box" ML models, we use sensitivity analysis to measure the importance and effect of particular input features on the model's output.

The rest of the chapter is organized as follows. Section 2 describes the empirical rating data and the features (attributes) under study. Section 3 discusses the three ensemble ML methods in the context of predicting credit ratings. Section 4 contains the predictive results and sensitivity analyses, and Section 5 concludes.

#### **2. Data and features**

The objective of this chapter is to predict corporate bond ratings assigned by Moody's, the leading credit rating agency (CRA) in the United States. The empirical sample consists of publicly listed companies covered in either the Center for Research in Security Prices (CRSP) or Compustat. Moody's ratings on bonds issued by these companies are obtained from Mergent's Fixed Income Securities Database (FISD). Since the analysis involves Moody's shareholders, the sampling period runs from January 2001, when Moody's went public, to December 2017.

#### **2.1 Credit rating outcome**

Under Moody's rating scale, the rating outcome falls into seven ordered categories of descending credit quality: *Aaa*, *Aa*, *A*, *Baa*, *Ba*, *B*, and *C*. The first four categories, from *Aaa* to *Baa*, are termed "investment grade," whereas the remaining three are termed "high yield." The distribution of ratings over time is reported in **Table 2**. In 2004, about 50% of bonds in the data received investment-grade ratings, and the proportion of investment-grade bonds trended upward prior to the 07–09 financial crisis. The fact that nearly 90% of bonds received investment-grade ratings in 2008 suggests an obvious inflation of ratings. For the purpose of predicting credit ratings, it is therefore important to include conflict-of-interest measures, which account for this trend.

**Table 2.** *Distribution of ratings.*

A second observation from **Table 2** is that the rating outcome is highly skewed toward the middle. The majority of bonds are rated *A* or *Baa*, and only 2% of bonds received *Aaa* or *C* ratings. This is yet another reason to consider ensemble methods, which are known to be superior to single-classifier ML methods when applied to highly imbalanced data [20, 21].

#### **2.2 Attributes under study**

For each quarter from 2001Q1 to 2017Q4, a total of 20 features/attributes are obtained from a variety of sources to predict ratings. These features can be broadly categorized into three groups: (1) financial ratios, (2) equity risk measures, and (3) the bond issuer's "connectedness" with Moody's shareholders.

#### *2.2.1 Financial ratios*

We follow [22] and employ the following financial ratios in the analysis: (X1) the value of the firm's total assets (*log(asset)*), (X2) long- and short-term debt divided by total assets (*Book\_lev*), (X3) convertible debt divided by total assets (*ConvDe\_assets*), (X4) rental payments divided by total assets (*Rent\_Assets*), (X5) cash and marketable securities divided by total assets (*Cash\_assets*), (X6) long- and short-term debt divided by EBITDA (*Debt\_EBITDA*), (X7) EBITDA divided by interest payments (*EBITA\_int*), (X8) profitability, measured as EBITDA divided by sales (*Profit*), (X9) tangibility, measured as net property, plant, and equipment divided by total assets (*PPE\_assets*), (X10) capital expenditures divided by total assets (*CAPX\_assets*), and (X11) the volatility of profitability (*Vol\_profit*), defined as the standard deviation of profitability over the last 5 years divided by its mean in absolute value. The data on these firm-level financial ratios are obtained from the CRSP-Compustat merged database in Wharton Research Data Services (WRDS).

There is a distinction between the issuer rating and the issue rating for corporate bonds. The former addresses the issuer's overall creditworthiness, whereas the latter refers to a specific debt obligation and considers its ranking in the capital structure, such as secured or subordinated.<sup>2</sup> Since this chapter predicts ratings at the bond level, three bond characteristics are also included: (X12) the log of the issuing amount (*Amt*), (X13) a dummy variable indicating whether the bond is senior (*Seniority*), and (X14) a dummy variable indicating whether the bond is secured (*Security*). The issuing amount affects the maximum financial loss on the investment, whereas the seniority and security status affect the priority of repayment should a default occur. Data on these bond characteristics are obtained from FISD along with the credit ratings.

#### *2.2.2 Equity risk*

As noted by [23], equity risk has been accounting for a greater proportion of the variation in credit rating outcomes among the three leading CRAs in the United States. To obtain measures of a company's equity risk, we estimate a Fama–French three-factor model for each issuer in the sample.<sup>3</sup> The following measures are then obtained: (X15) the firm's beta (*Beta*), which is the stock's market beta estimated annually using the CRSP value-weighted index, and (X16) the firm's idiosyncratic risk (*Idiosyncratic risk*), computed annually as the root mean squared error from the three-factor model.

#### *2.2.3 Cross-ownership with Moody's*

As noted above, conflicts of interest are measured by the "connectedness" (cross-ownership) between Moody's and a bond issuer. To characterize the degree of cross-ownership, I first obtain the list of Moody's shareholders from Thomson Reuters (13F) and calculate their ownership stakes in Moody's (the percentage of Moody's stock that they hold) for each quarter in the sampling period. Next, I examine each shareholder's investment portfolio to determine which bond issuers have the same shareholders as investors. The shareholder's manager number (MGRNO) and the firm's Committee on Uniform Securities Identification Procedures (CUSIP) number are used to match the shareholding data with bond issuers.

<sup>2</sup> The issuer rating usually applies to senior unsecured debt.

<sup>3</sup> The normal estimation window is set to 252 days prior to the rating assignment date. For companies with sparse stock price data, we require at least 126 days.

To characterize the shared-ownership relation between bond issuers and Moody's succinctly, I employ the following measure, termed the *Moody-Firm-Ownership-Index (MFOI)*, proposed by [15]. Suppose Moody's has $j = 1, 2, \cdots, M$ shareholders in a given quarter<sup>4</sup>, and any subset of those shareholders can invest in an issuing firm. Define

$$(X17): \qquad MFOI_i = \sum_{j=1}^{M} b_{ij} s_j \tag{1}$$

where $s_j$ denotes shareholder $j$'s ownership stake in Moody's, and $b_{ij}$ denotes bond issuer $i$'s weight in shareholder $j$'s investment portfolio. Note that $b_{ij} = 0$ means that shareholder $j$ does not invest in bond issuer $i$.

In addition to MFOI, three other measures are included as predictors. The first is the number of common shareholders, defined as

$$(X18): \qquad Num\_SH_i = \sum_{j=1}^{M} \mathbf{1}\{b_{ij} > 0\} \tag{2}$$

The second is the number of large common shareholders (those owning at least 5% of Moody's stock), defined as

$$(X19): \qquad Num\_large\_SH_i = \sum_{j=1}^{M} \mathbf{1}\{b_{ij} > 0\} \times \mathbf{1}\{s_j > 0.05\} \tag{3}$$

The last is a dummy variable capturing whether the bond issuer is invested in by Berkshire Hathaway, Moody's leading shareholder during the sampling period.

$$(X20): \qquad BRK_i = \mathbf{1}\{b_{ik} > 0\}, \quad k = \text{Berkshire Hathaway} \tag{4}$$

Berkshire Hathaway is singled out here because it owns significantly more shares of Moody's than any other large shareholder.
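As a concrete illustration of Eqs. (1)–(4), the sketch below computes the four cross-ownership measures for a single issuer from a toy holdings table; the shareholder names, stakes, and portfolio weights are made up for illustration.

```r
# Hypothetical toy data: 's' is each shareholder's stake in Moody's, 'b' is the
# issuer's weight in that shareholder's portfolio (b = 0 means no investment).
holdings <- data.frame(
  shareholder = c("Berkshire Hathaway", "Davis Selected Advisors", "Fund C"),
  s = c(0.13, 0.06, 0.01),   # ownership stake in Moody's
  b = c(0.02, 0.00, 0.01)    # issuer's weight in the shareholder's portfolio
)

MFOI         <- sum(holdings$b * holdings$s)                  # Eq. (1)
Num_SH       <- sum(holdings$b > 0)                           # Eq. (2)
Num_large_SH <- sum(holdings$b > 0 & holdings$s > 0.05)       # Eq. (3)
BRK          <- as.integer(                                   # Eq. (4)
  holdings$b[holdings$shareholder == "Berkshire Hathaway"] > 0
)
c(MFOI = MFOI, Num_SH = Num_SH, Num_large_SH = Num_large_SH, BRK = BRK)
```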

#### **2.3 Descriptive statistics**

After combining data from multiple sources, the final dataset consists of 6817 bonds issued by 895 firms. The descriptive statistics for the 20 features/attributes are reported in **Table 3**. For asset (X1), EBITDA to interest (X7), profitability (X8), issuing amount (X12), and seniority (X13), there is a clear positive correlation between the rating category and the level of the attribute. For others, such as the book leverage ratio (X2), the debt-to-EBITDA ratio (X6), the tangibility-to-asset ratio (X10), the volatility of profit (X11), and idiosyncratic risk (X16), the correlation is negative. The four conflict-of-interest measures (X17–X20) all decrease as the rating drops.

#### **Table 3.**

*Descriptive statistics by rating categories.*

#### **3. Methods**

The dataset is split into two subsets based on the timing of the rating: a training set, which consists of 5814 ratings (85.3% of the total) assigned before 2016, and a holdout set, which consists of 1000 ratings (14.7%) assigned in 2016–2017. In this section, we discuss the methodological aspects of the three ensemble methods (Random Forest (RF), Bagging, and Gradient Boosting Machines (GBM)) and how they are implemented. The performance of these methods is compared with three other ML models (Ordered Logit Regression (OLR), Support Vector Machine (SVM), and Neural Network (NN)) based on predictive accuracy in the holdout set.

<sup>4</sup> Since all of the variables are time-specific, I drop the time $t$ subscript for notational simplicity.

#### **3.1 Decision trees**

To understand ensemble methods, we must first understand decision trees, the basic classification procedure upon which the ensemble (and the resulting classification) is based<sup>5</sup>. For illustrative purposes, consider a sample decision tree with a categorical outcome *Y* (credit rating) and three predictor variables: firm asset, leverage, and seniority (binary). As displayed in **Figure 1**, the main components of a decision tree model are nodes and branches, while the complexity of the decision tree is governed by splitting, stopping, and pruning.

<sup>5</sup> In this study, we restrict our attention to tree-based ensemble methods because decision trees are extremely fast to train.

**Figure 1.** *Sample decision tree.*

**Nodes** There are three types of nodes. (a) A root node, also called a decision node, represents the most important feature (in this case, the level of log(firm asset)) and leads all subsequent divisions. (b) Leaf nodes, also called end nodes, represent the final predicted rating outcome based on the sequence of divisions. (c) Internal nodes, also called chance nodes, represent the intermediate sequence of features that guide the classification.

**Branches** A decision tree model is formed using a hierarchy of branches, with the more important features displayed closer to the root node. Each path from the root node through internal nodes to a leaf node represents a classification decision sequence. These decision tree pathways can also be represented as "if-then" rules, with the left branch denoting that the binary condition is met. For example, "if the natural log of firm asset is less than 13.5 and the leverage ratio is less than 15%, then the bond is rated Baa."

**Splitting** Measures related to the degree of "purity" of the subsequent nodes (i.e., the proportion with the target condition) are used to choose among the potential input variables; these measures include entropy, the Gini index, classification error, information gain, and the gain ratio. Normally, not all potential input variables will be used to build the decision tree model, and in some cases a specific input variable may be used multiple times at different levels of the tree.

**Stopping and Pruning** An overly complex tree can make each leaf node 100% pure (i.e., all bonds in the node have the same rating) but is likely to suffer from overfitting. To prevent this, one may grow a large tree first and then prune it to an optimal size by removing nodes that provide little additional information. One parameter that controls the complexity is the number of leaf nodes.
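The sketch below shows how a single classification tree of this kind might be grown and pruned in R with the rpart package; the data frame `train` and the column names are hypothetical stand-ins for the rating outcome and the features of Section 2.

```r
library(rpart)
library(rpart.plot)   # optional, for drawing the tree

# Grow a classification tree for the rating outcome (hypothetical columns).
fit <- rpart(rating ~ log_asset + Book_lev + Seniority,
             data = train, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 20))

# Prune back to the complexity parameter with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

rpart.plot(pruned)    # nodes, branches, and leaf predictions, cf. Figure 1
```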

#### **3.2 Bagging**

The decision trees discussed above suffer from high variance, meaning that if the training data are split into multiple parts at random and the same decision tree is applied to each, the predictive results can be quite different. Bootstrap aggregation, or bagging, is a technique used to reduce the variance of predictions by combining the results of multiple classifiers modeled on different subsamples of the same dataset. When applying bagging to decision trees, the trees are usually grown deep and are not pruned. Hence, each individual tree has high variance but low bias. Averaging hundreds or even thousands of trees reduces the variance and improves predictive performance.

In practice, different subsamples are drawn from the training set with replacement (see [24] for a detailed discussion of the bagging sampling approach). Each subsample has the same size as the training set but contains, on average, only about 2/3 of the original observations. The number of bootstrapped samples is therefore a hyperparameter to be tuned. For each bootstrapped sample, we fit a "bushy" deep decision tree with all 20 features considered at each split. Each tree acts as a base classifier to determine the rating of a bond. The final prediction is made via "majority voting": each classifier casts one vote for its predicted rating, and the category with the most votes is used to classify the credit rating.
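The following sketch spells out these bagging mechanics (bootstrap resampling, deep unpruned trees, majority voting) using rpart trees; object names are hypothetical, and in practice the randomForest package used later in the chapter performs these steps directly.

```r
library(rpart)

set.seed(1)
B     <- 200                                   # number of bootstrapped samples
votes <- matrix(NA_character_, nrow(holdout), B)

for (b in seq_len(B)) {
  idx  <- sample(nrow(train), replace = TRUE)  # bootstrap sample, same size as train
  tree <- rpart(rating ~ ., data = train[idx, ], method = "class",
                control = rpart.control(cp = 0, minsplit = 2))  # deep, unpruned tree
  votes[, b] <- as.character(predict(tree, holdout, type = "class"))
}

# Majority voting: the category receiving the most votes is the final prediction.
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == holdout$rating)                   # holdout classification accuracy
```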

#### **3.3 Random forest**

Random forest is another ensemble classification method, developed by [25]. One advantage of random forest (RF) over bagging is that it reduces the correlation among trees by randomizing the selection of features. RF combines the bagging sampling approach of [24] and the random selection of features, introduced independently by [26, 27], to construct a collection of decision trees with controlled variation. Specifically, [25] recommends randomly selecting $m = \lfloor \log_2 p \rfloor + 1$ features at any given split, with $p$ being the total number of features, to grow each individual tree. Moreover, each tree is constructed using a subsample of the training set drawn with replacement.

For the purpose of illustration, **Figure 2** shows an RF populated by three trees similar to the one described in **Figure 1**. Note that the total number of features is 3; in this case, $m = \lfloor \log_2 3 \rfloor + 1 = 2$, so each tree is generated using two features. For a bond with firm asset = 12, seniority = yes, and leverage = 12%, the majority rule returns a predicted rating in the *Ba* category. In practice, the complexity of the random forest is governed by several hyperparameters, such as the number of trees and the maximum number of features considered at each split.

**Figure 2.** *Sample random forest.*
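A minimal sketch of fitting such a forest with the randomForest package follows; the 500 trees mirror the setting reported later in the chapter, while the data objects and column names are hypothetical.

```r
library(randomForest)

# m features tried at each split, following the log2 rule above (about 5 for 20 features).
m  <- floor(log2(ncol(train) - 1)) + 1

rf <- randomForest(rating ~ ., data = train,
                   ntree = 500, mtry = m,
                   importance = TRUE)          # record importance for Section 4.2

pred <- predict(rf, newdata = holdout)         # majority vote across the trees
table(predicted = pred, actual = holdout$rating)
```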

#### **3.4 Gradient boosting machines**

Gradient Boosting Machines (GBMs) are an ensemble method that identifies weak learners and attempts to strengthen them recursively to improve prediction. The key difference between GBM and bagging is that the training stage is parallel for bagging (i.e., each tree is built independently), whereas GBM builds each new tree sequentially. Specifically, when the first tree is generated, the residual errors are calculated and used as the target variable for the next tree. The predictions made by this new tree are combined with the previous model's predictions, and new residuals are calculated from the predicted and actual values. This process is repeated until the errors no longer decrease significantly.

During the prediction stage, bagging and RF simply average the individual predictions (the "majority rule"). In contrast, GBM assigns a new set of weights to each tree, and the final predicted rating is a weighted average of the individual predictions. A tree with a good classification result on the training data is assigned a higher weight than a poor one. There is no consensus on which method is better; the answer depends very much on the data and the researcher's objective. Some scholars have argued that gradient boosted trees can outperform random forests [28, 29]. Others believe boosting tends to aggravate the overfitting problem because repeatedly fitting the residuals can capture noisy information.
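The sketch below illustrates boosting with the gbm package named in Section 4. To keep the example simple and robust, it boosts a binary investment-grade indicator rather than the full seven-category outcome; the object names and tuning values are illustrative.

```r
library(gbm)

# Hypothetical simplification: boost the investment-grade vs. high-yield indicator.
train$inv_grade   <- as.integer(train$rating   %in% c("Aaa", "Aa", "A", "Baa"))
holdout$inv_grade <- as.integer(holdout$rating %in% c("Aaa", "Aa", "A", "Baa"))

boost <- gbm(inv_grade ~ . - rating, data = train,
             distribution = "bernoulli",
             n.trees = 2000,            # trees are added sequentially
             interaction.depth = 3,     # depth of each weak learner
             shrinkage = 0.05,          # learning rate applied to each new tree
             cv.folds = 5)

best_iter <- gbm.perf(boost, method = "cv")    # stop before the errors flatten out
p_hat <- predict(boost, holdout, n.trees = best_iter, type = "response")
mean((p_hat > 0.5) == holdout$inv_grade)       # holdout accuracy for the binary task
```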

#### **4. Results**

In this section, we begin by comparing the three aforementioned ensemble methods (BDT, RF, and GBM) in terms of the out-of-sample predictive accuracy. Three non-ensemble ML methods, the ordered-logit model, support vector machine, and neural network, are also evaluated with the same dataset. For each employed method, we discuss the relevant hyperparameters and how they are tuned empirically.

All ML methods were implemented using the software R. Specifically, BDT and RF were implemented using the *randomForest* package; the number of features considered at each split is fixed at all 20 for BDT, whereas for RF each tree randomly selects $m = \lfloor \log_2 20 \rfloor + 1 = 5$ features. GBM is implemented using the *gbm* package. For the three non-ensemble ML methods, the ordered logit model is implemented using the *polr* function from the *MASS* package, the support vector machine via the *svm* function from the *e1071* package, and the neural network using the *neuralnet* package.
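A minimal setup sketch, assuming a hypothetical data frame `bonds` with a `rating_year` column, might look as follows.

```r
library(randomForest)   # bagged decision trees and random forest
library(gbm)            # gradient boosting machine
library(MASS)           # polr(): ordered logit
library(e1071)          # svm(): support vector machine
library(neuralnet)      # neural network

# Split by the timing of the rating, as described in Section 3.
train   <- subset(bonds, rating_year <  2016)   # 5814 ratings assigned before 2016
holdout <- subset(bonds, rating_year >= 2016)   # 1000 ratings assigned in 2016-2017
```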

#### **4.1 Predictive results**

Bagged Decision Tree (BDT) To evaluate the predictive results, we report the classification matrix in the holdout sample for each employed method. In the case of BDT, the main hyperparameter to be tuned is the number of trees. We run three BDTs, setting the number of trees to 200, 500, and 800, and find that the model with 500 trees has the highest predictive accuracy (69.1%). The full classification matrix is reported in **Table 4**. The horizontal dimension represents the true rating received in the holdout sample, whereas the vertical dimension represents the predicted rating category. Therefore, the entries on the diagonal capture the number of ratings correctly predicted for a particular category. For example, the numbers in the first column are interpreted as follows: 19 Aaa bonds are correctly classified as Aaa, whereas four (14) are misclassified as A (Baa).

#### **Table 4.**

*The classification confusion matrix of BDT with 500 trees in holdout sample.*
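This tuning step can be sketched as follows, where bagging is obtained from randomForest by letting every split consider all 20 features; the data objects are hypothetical.

```r
library(randomForest)

p <- ncol(train) - 1                            # number of features (mtry = p gives bagging)
for (B in c(200, 500, 800)) {
  fit <- randomForest(rating ~ ., data = train, ntree = B, mtry = p)
  acc <- mean(predict(fit, holdout) == holdout$rating)
  cat("trees =", B, " holdout accuracy =", round(acc, 3), "\n")
}

# Confusion matrix for the chosen 500-tree model (rows = predicted, columns = true rating).
bdt <- randomForest(rating ~ ., data = train, ntree = 500, mtry = p)
table(predicted = predict(bdt, holdout), true = holdout$rating)
```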

Random Forest (RF) The next predictive model under evaluation is the random forest. In addition to the bagging technique, RF also randomizes the feature set to further decrease the correlation among the decision trees. As noted above, RF has five hyperparameters that govern the complexity of the model. To select these hyperparameter values, we implement a five-dimensional grid search in which every combination of hyperparameters of interest is assessed. The hyperparameter grid is generated by

$$\mathcal{G} = \{m\} \times \{N\} \times \{n\} \times \{p\} \times \{r\} \tag{5}$$

where each factor is the set of candidate values for one hyperparameter. Consequently, a total of 216 (= 6 × 2 × 3 × 2 × 3) specifications of RF are compared in terms of predictive accuracy in the holdout set. As shown in **Table 5**, the best predictive model consists of 500 trees, with each tree generated from the entire training set ($p = 1$) with replacement and $m = 4$ features randomly selected at each split. The overall classification accuracy on the holdout data turns out to be 73.2%. From the classification confusion matrix in **Table 6**, RF has a reliable predictive performance in almost all rating categories.
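A sketch of such a grid search is given below. The candidate values are illustrative placeholders rather than the ones underlying Table 5; the point is the structure of the search, in which every combination is fit and scored on the holdout set.

```r
library(randomForest)

grid <- expand.grid(
  mtry     = c(2, 4, 6, 8),          # m: features tried at each split
  ntree    = c(200, 500),            # number of trees
  nodesize = c(1, 5, 10),            # minimum terminal node size
  sampfrac = c(0.632, 1)             # fraction of the training set drawn per tree
)

grid$accuracy <- NA
for (i in seq_len(nrow(grid))) {
  fit <- randomForest(rating ~ ., data = train,
                      mtry     = grid$mtry[i],
                      ntree    = grid$ntree[i],
                      nodesize = grid$nodesize[i],
                      sampsize = ceiling(grid$sampfrac[i] * nrow(train)),
                      replace  = TRUE)
  grid$accuracy[i] <- mean(predict(fit, holdout) == holdout$rating)
}
grid[order(-grid$accuracy), ][1:10, ]   # the best specifications, cf. Table 5
```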

#### **Table 5.**

*The 10 best RF models from hyperparameter tuning.*

#### **Table 6.**

*The classification confusion matrix of the best RF in holdout sample.*

To develop some sense of how RF makes predictions, **Figure 3** plots one decision tree from the RF model. A total of six attributes are used in this particular tree; MFOI and idiosyncratic risk appear to be the two most important. From the rightmost terminal node, it is almost certain that bonds with MFOI < 1.7 and idiosyncratic risk > 0.1 can only receive high-yield ratings (25% Ba + 57% B + 10% C = 91% high yield), irrespective of other features. This provides a remarkably parsimonious yet robust decision rule for deciding whether a bond is investment grade or not.

Gradient Boosting Machine (GBM) The classification confusion matrix of GBM is reported in **Table 7**. The overall predictive accuracy is 64.4%, which is 5 percentage points lower than BDT and nearly 10 percentage points lower than RF. As noted by [30], predictive results from boosting methods are usually more volatile. [14] also conjectured that boosting's sensitivity to noise may be partially responsible for its occasional increase in errors. As such, we recommend always using RF or BDT for predicting credit ratings.


**Figure 3.** *Decision tree extracted from the RF model.*


#### **Table 7.**

*The classification confusion matrix of GBM in holdout sample.*

Ordered Logistic Regression (OLR) The OLR is a regression model in which the features affect the rating outcome through a logistic transformation. Let $Z_i = \beta_0 + \sum_{j=1}^{20} x_{ij}\beta_j$ be a linear index summarizing the information in the 20 considered features, where the $\beta$ coefficients are estimated from the data. The predicted probability in OLR for each rating category, $k = 1, \cdots, 7$, can be written as

$$\Pr(Y_i = k \mid x_i) = \frac{1}{1 + \exp(Z_i - \kappa_k)} - \frac{1}{1 + \exp(Z_i - \kappa_{k-1})}$$

where the $\kappa_k$ are threshold points separating the different rating categories, with $\kappa_0 = -\infty$ and $\kappa_7 = \infty$. While the model is easy to interpret, it is quite rigid and cannot accommodate complex nonlinear relationships.
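The arithmetic of this threshold formula can be checked in a few lines of R; the index value and thresholds below are made up for illustration, and in practice the model is fitted with the *polr* function noted above.

```r
# Hypothetical linear index and thresholds kappa_0, ..., kappa_7.
Z     <- 1.2
kappa <- c(-Inf, -2, -1, 0.5, 1.5, 2.5, 3.5, Inf)

# Pr(Y = k) = plogis(kappa_k - Z) - plogis(kappa_{k-1} - Z),
# which is algebraically identical to the expression above.
pr <- plogis(kappa[-1] - Z) - plogis(kappa[-length(kappa)] - Z)
names(pr) <- c("Aaa", "Aa", "A", "Baa", "Ba", "B", "C")
round(pr, 3); sum(pr)    # the seven category probabilities sum to 1

# In practice: fit <- MASS::polr(rating ~ ., data = train, method = "logistic")
```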

The classification matrix of OLR is reported in **Table 8**. The overall classification accuracy is 53.9% for the holdout sample, which is much worse than RF. The model also fails to correctly classify any of the 37 *Aaa* bonds. This is unsurprising: OLR belongs to the family of generalized linear models (the logistic transformation is applied to a linear score of the features), and when fitting a linear trend to the data, the fit is usually worse in the tails of the distribution (**Table 9**).

#### **Table 8.**

*The classification confusion matrix of OLR in holdout sample.*

#### **Table 9.**

*The classification confusion matrix of SVM in holdout sample.*

Support Vector Machine (SVM), developed by [31], seeks the optimal separating hyperplane between binary classes according to the maximum-margin criterion. For multiclass prediction, where the outcome variable takes $k$ distinct categories, one may train $k(k-1)/2$ individual binary classifiers and then use the majority rule to determine the final predicted outcome. To find the separating hyperplane, SVM uses a kernel function to enlarge the feature space through basis functions. Mathematically, SVM can be viewed as the following constrained optimization problem,

$$\min_{a} \qquad \frac{1}{2} a^T Q a - e^T a \tag{6}$$

$$\text{s.t.} \qquad 0 \le a \le Ce, \qquad y^T a = 0 \tag{7}$$

where $e$ is the vector of all ones and $Q$ is an $N \times N$ positive semi-definite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, $K$ being the kernel function.

This chapter follows [9] and employs the radial basis function (RBF) kernel, $K(x_i, x_j) = \exp\{-\gamma \lVert x_i - x_j \rVert^2\}$, where $\gamma$ and $C$ are hyperparameters to be selected. A series of SVMs with $C = 2^c$ and $\gamma = 2^g$ are implemented. Based on a 10-fold cross-validation, the best parameters are $C = 32$ and $\gamma = 0.25$. The overall classification accuracy turns out to be 67.2% for SVM, which lies between OLR and RF.
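A sketch of this tuning step with the e1071 package, which handles the one-vs-one voting for multiclass problems internally, might look as follows; the grid ranges are illustrative.

```r
library(e1071)

# Grid search over C = 2^c and gamma = 2^g with 10-fold cross-validation.
tuned <- tune(svm, rating ~ ., data = train,
              kernel = "radial",
              ranges = list(cost = 2^(0:6), gamma = 2^(-4:2)),
              tunecontrol = tune.control(cross = 10))

tuned$best.parameters                          # e.g. cost = 32, gamma = 0.25
pred <- predict(tuned$best.model, holdout)     # one-vs-one majority vote over 21 classifiers
mean(pred == holdout$rating)                   # holdout accuracy
```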

Neural Network (NN) Artificial neural network (NN) models were proposed by cognitive scientists to mimic the way the brain processes information. As noted by [32], an NN can be viewed as a nonlinear regression model of the following form,

$$f(x, \theta) = \tilde{x}'\alpha + \sum_{s=1}^{q} G(\tilde{x}'\gamma_s)\,\beta_s \tag{8}$$

where $\tilde{x} = (1, x')'$, $q$ is an integer representing the number of hidden neurons, and $G(\cdot)$ is a given nonlinear activation function. NN processes information in a hierarchical manner: the signals from an *input node* $x_j$ ($j = 1, \cdots, 20$) are first amplified or attenuated by $\gamma_{js}$ and arrive at $q$ *hidden* (intermediate) *nodes*. The aggregated signals, in the form of $\tilde{x}'\gamma_s$, are then passed to the seven *output nodes* (i.e., the potential rating outcomes) through the activation function $G(\tilde{x}'\gamma_s)$. As in the previous step, the information at hidden node $s$ is amplified or attenuated by $\beta_s$. In addition to passing through hidden nodes, signals are also allowed to affect the rating outcome directly through the weights $\alpha$.

For simplicity, this study focuses on a three-layer NN and varies the number of nodes in the hidden layer during training. In particular, 5, 10, 15, and 20 hidden nodes are used. For each case, we run the same model with 50 replications to mitigate the impact of bad starting values. In terms of predictive accuracy, we find that the model with five hidden nodes slightly outperforms the rest (57.3, 56.4, 56.3, and 55.4%, respectively). **Table 10** reports the classification matrix for one of the NN models, with the network structure presented in **Figure 4**.

#### **Table 10.**

*The classification confusion matrix of NN in holdout sample.*

#### **Figure 4.**

*NN with five hidden nodes (a darker line means a stronger signal).*
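A minimal sketch of such a network with the neuralnet package follows; since neuralnet expects numeric inputs and outputs, the rating categories are one-hot encoded, and all object names are hypothetical.

```r
library(neuralnet)

# One-hot encode the seven rating categories and standardize the features.
y <- model.matrix(~ rating - 1, data = train)      # seven 0/1 indicator columns
colnames(y) <- paste0("r", seq_len(ncol(y)))
X <- scale(model.matrix(rating ~ . - 1, data = train))

dat <- data.frame(y, X)
rhs <- setdiff(colnames(dat), colnames(y))
f   <- as.formula(paste(paste(colnames(y), collapse = " + "), "~",
                        paste(rhs, collapse = " + ")))

nn <- neuralnet(f, data = dat, hidden = 5,   # one hidden layer with five nodes
                linear.output = FALSE,       # logistic activation at the output layer
                rep = 50)                    # 50 restarts with different starting weights

plot(nn, rep = "best")                       # network diagram, cf. Figure 4
```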

#### **4.2 Sensitivity analysis**

To explore which features are more important than others in predicting ratings, we perform two sensitivity analyses. While these analyses can be applied to any of the aforementioned ML methods, we focus on RF due to its superior predictive performance.

The first analysis uses variable importance plots (VIPs). Loosely speaking, variable importance is the increase in model error when the feature's information is "destroyed." The left panel of **Figure 5** shows the impurity-based measure, in which feature importance is based on the average total reduction of the loss function for a given feature across all trees. The right panel shows the permutation-based importance measure<sup>6</sup>. A feature is "important" if shuffling its values increases the model error, because in that case the model relied on the feature for its predictions.

<sup>6</sup> In the permutation-based approach, the values of each variable are randomly permuted, one at a time, and the accuracy is computed again. The decrease in accuracy resulting from this random shuffling of feature values is averaged over all trees for each predictor [33].


**Figure 5.** *Variable importance plot of each attribute for the RF model. Note: the figure on the left (right) ranks importance based on the Gini-impurity (permutation).*

Both measures consistently identify the two most important attributes as MFOI and the idiosyncratic risk of the bond issuer's stock. According to the permutation-based metric, eliminating the information contained in MFOI decreases the predictive accuracy by about 20%.
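With the randomForest package, both importance measures can be extracted from a fitted forest, provided it was trained with importance = TRUE; a sketch reusing the hypothetical `rf` object from Section 3.3 follows.

```r
library(randomForest)

imp <- importance(rf)   # columns include MeanDecreaseGini and MeanDecreaseAccuracy

# Rank attributes by the permutation-based measure (cf. the right panel of Figure 5).
head(imp[order(-imp[, "MeanDecreaseAccuracy"]), ])

varImpPlot(rf)          # impurity-based and permutation-based plots side by side
```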

The second sensitivity analysis computes the partial dependence (PD) of important attributes. To describe the notion of partial dependence, let $X = \{x_1, x_2, \cdots, x_{20}\}$ represent the set of predictor variables in the RF model, whose prediction function is denoted by $\hat{f}(X)$. The "partial dependence" of $x_1$, for example, is defined as

$$PD(x_1) = \frac{\partial}{\partial x_1} E_{x_c}\left[\hat{f}(x_1, x_c)\right] = \frac{\partial}{\partial x_1} \int \hat{f}(x_1, x_c)\, p_c(x_c)\, dx_c \tag{9}$$

where $x_c = \{x_2, x_3, \cdots, x_{20}\}$ denotes the other predictors and $p_c(x_c)$ is the marginal probability density of $x_c$: $p_c(x_c) = \int p(X)\, dx_1$. This quantity, which resembles a marginal effect, can be estimated from a set of training data by

$$\widehat{PD}(x_1) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial x_1}\, \hat{f}(x_1, x_{c,i}) \tag{10}$$

where $x_{c,i}$ are the values of $x_c$ observed in the training sample; that is, we average out the effects of all the other predictors in the model. **Figure 6** reports the PDs for MFOI and idiosyncratic risk separately. From the left panel, a low value of MFOI has a negative impact on the rating outcome. As MFOI goes above 50, it starts to affect the rating positively: a higher degree of connectedness between Moody's and the issuing firm, as measured by MFOI, translates into a higher predicted rating. The positive impact of MFOI increases with the level of MFOI and plateaus as MFOI goes above 150, which is roughly the 99th percentile of its distribution. Conversely, a larger idiosyncratic risk has an increasingly negative impact on ratings. Both patterns are economically sensible. **Figure 7** presents the joint PD for MFOI and idiosyncratic risk; the negative impact of idiosyncratic risk is pronounced only when MFOI is low.

**Figure 6.**

*Partial dependence plot for MFOI and idiosyncratic risk from the RF model. Note: The black line depicts the PD at specific values of MFOI/idiosyncratic risk. The blue line is the fitted value.*

**Figure 7.**

*Joint partial dependence plot for MFOI and idiosyncratic risk.*
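Univariate and joint partial dependences of this kind can be computed directly from a fitted forest; the sketch below uses randomForest's partialPlot together with the pdp package, with hypothetical object and column names.

```r
library(randomForest)
library(pdp)    # partial(): model-agnostic partial dependence

# Univariate PD for MFOI from the fitted forest (cf. the left panel of Figure 6).
partialPlot(rf, pred.data = as.data.frame(train), x.var = "MFOI")

# pdp computes PDs for a chosen rating class; "Baa" and "Idio_risk" are placeholders.
pd1 <- partial(rf, pred.var = "MFOI", which.class = "Baa", train = train)
pd2 <- partial(rf, pred.var = c("MFOI", "Idio_risk"),
               which.class = "Baa", train = train)   # joint PD, cf. Figure 7
plotPartial(pd2)
```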

#### **4.3 Discussion**

The main message emerging from our empirical exercise is that conflicts of interest, as measured by the bond issuer's connection with Moody's shareholders, have strong predictive power for the credit rating outcome. This observation is consistent with several previous studies. [16] found that Moody's has been assigning more favorable ratings (relative to those of S&P) to issuers related to its two largest shareholders, Berkshire Hathaway and Davis Selected Advisors. [23, 34] showed that such bias is more universal and applies to issuers associated with any large shareholder of Moody's.

Although cross-ownership has been recognized in the literature as an important driver of credit ratings, it has not been explicitly considered as a predictor variable in prior studies that focus on prediction. This study complements that literature by confirming that cross-ownership can be utilized to increase the predictability of credit ratings.

#### **5. Conclusions**

In this chapter, we employ six machine learning methods to predict bond ratings for a sample of US public firms. Beyond the financial ratios employed by previous studies, this chapter expands the feature set to include equity risk measures and the bond issuer's cross-ownership relation with the rating agency. The inclusion of the latter source of information is unprecedented.

Several conclusions emerge from the analysis. (1) Ensemble methods, including the Random Forest, Bagged Decision Trees, and Gradient Boosting Machines, generally outperform ML methods with a single classifier. (2) Among the three ensemble methods, the random forest shows significantly better performance than the others, correctly predicting 5% more bonds than bagging and 10% more bonds than boosting. (3) The sensitivity analyses reveal the firm's idiosyncratic risk and its cross-ownership relation with the rating agency to be the two most important attributes for predicting ratings.


#### **Author details**

Yixiao Jiang† Economics, Christopher Newport University, Newport News, USA

\*Address all correspondence to: yixiao.jiang@cnu.edu

† This research is supported by the Christopher Newport University Faculty Development Fund.

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Altman EI. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance. 1968;**23**(4):589-609

[2] Horrigan JO. The determination of long-term credit standing with financial ratios. Journal of Accounting Research. 1966:44-62

[3] Kaplan RS, Urwitz G. Statistical models of bond ratings: A methodological inquiry. Journal of Business. 1979:231-261

[4] Pinches GE, Mingo KA. A multivariate analysis of industrial bond ratings. The Journal of Finance. 1973; **28**(1):1-18

[5] West RR. An alternative approach to predicting corporate bond ratings. Journal of Accounting Research. 1970: 118-125

[6] Bellotti T, Matousek R, Stewart C. A note comparing support vector machines and ordered choice models' predictions of international banks' ratings. Decision Support Systems. 2011;**51**(3):682-687

[7] Huang Z, Chen H, Hsu C-J, Chen W-H, Wu S. Credit rating analysis with support vector machines and neural networks: A market comparative study. Decision Support Systems. 2004;**37**(4):543-558

[8] Kumar K, Bhattacharya S. Artificial neural network vs linear discriminant analysis in credit ratings forecast. Review of Accounting and Finance. 2006;**5**(3):216-227

[9] Lee Y-C. Application of support vector machines to corporate credit rating prediction. Expert Systems with Applications. 2007;**33**(1):67-74

[10] Sermpinis G, Tsoukas S, Zhang P. Modelling market implied ratings using lasso variable selection techniques. Journal of Empirical Finance. 2018;**48**: 19-35

[11] Yeh C-C, Lin F, Hsu C-Y. A hybrid KMV model, random forests and rough set theory approach for credit rating. Knowledge-Based Systems. 2012;**33**:166-172

[12] Bauer E, Kohavi R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning. 1999;**36**(1): 105-139

[13] Drucker H, et al. Boosting and other machine learning algorithms. In: Machine Learning Proceedings 1994. MA, USA: Morgan Kaufmann; 1994. pp. 53-61

[14] Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: ICML. Vol. 96. Citeseer; 1996. pp. 148-156

[15] Jiang Y. Semiparametric estimation of a corporate bond rating model. Econometrics. 2021;**9**(2):23

[16] Kedia S, Rajgopal S, Zhou XA. Large shareholders and credit ratings. Journal of Financial Economics. 2017;**124**(3): 632-653

[17] Baghai R, Becker B. Non-rating revenue and conflicts of interest. Swedish House of Finance Research Paper, (15-06). 2016

[18] Cornaggia J, Cornaggia KJ, Xia H. Revolving doors on wall street. Journal of Financial Economics. 2016;**120**(2): 400-419

[19] Boubaker S, Sami H. Multiple large shareholders and earnings informativeness. Review of Accounting and Finance. 2011

[20] Khoshgoftaar TM, Golawala M, Van Hulse J. An empirical study of learning from imbalanced data using random forest. In 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007). IEEE. Vol. 2. 2007. pp. 310–317

[21] Muchlinski D, Siroky D, He J, Kocher M. Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data. Political Analysis. 2016:87-103

[22] Baghai RP, Servaes H, Tamayo A. Have rating agencies become more conservative? Implications for capital structure and debt pricing. The Journal of Finance. 2014;**69**(5):1961-2005

[23] Jiang Y. Credit ratings, financial ratios, and equity risk: A decomposition analysis based on Moody's, Standard & Poor's and Fitch's ratings. Finance Research Letters. 2021:102512

[24] Breiman L. Bagging predictors. Machine Learning. 1996;**24**(2):123-140

[25] Breiman L. Random forests. Machine Learning. 2001;**45**(1):5-32

[26] Ho TK. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition. IEEE. Vol. 1, 1995. pp. 278–282

[27] Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Computation. 1997;**9**(7):1545-1588

[28] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer; 2009

[29] Madeh Piryonesi S, El-Diraby TE. Using machine learning to examine impact of type of performance indicator on flexible pavement deterioration modeling. Journal of Infrastructure Systems. 2021;**27**(2):04021005

[30] Opitz D, Maclin R. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research. 1999;**11**: 169-198

[31] Vapnik VN. An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999; **10**(5):988-999

[32] Swanson NR, White H. A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business & Economic Statistics. 1995;**13**(3):265-275

[33] Boehmke B, Greenwell BM. Handson Machine Learning with R. New York, USA: CRC Press; 2019

[34] Gu Z, Jiang Y, Yang S. Estimating unobserved soft adjustment in bond rating models: Before and after the Dodd-Frank Act. Available at SSRN 3277328. 2018

### **Chapter 3**
