3.1.1. Acoustic models $p(\mathbf{X}|\mathbf{S})$

$p(\mathbf{X}|\mathbf{S})$ can be further factorized using the probabilistic chain rule and a Markov assumption as follows:

$$p(\mathbf{X}|\mathbf{S}) = \prod\_{t=1}^{T} p(\mathbf{x}\_t|\mathbf{x}\_1, \dots, \mathbf{x}\_{t-1}, \mathbf{S}) \tag{8}$$

$$\approx \prod\_{t=1}^{T} p(\mathbf{x}\_t|\mathbf{s}\_t) \propto \prod\_{t=1}^{T} \frac{p(\mathbf{s}\_t|\mathbf{x}\_t)}{p(\mathbf{s}\_t)} \tag{9}$$

In Eq. (9), the framewise likelihood function $p(\mathbf{x}_t|\mathbf{s}_t)$ is replaced by the framewise posterior distribution $p(\mathbf{s}_t|\mathbf{x}_t)$ divided by the state prior $p(\mathbf{s}_t)$, which is computed using DNN classifiers via the pseudo-likelihood trick [38]. The Markov assumption in Eq. (9) is too strong: the contexts of the input and hidden states are not considered. This issue can be mitigated by using either recurrent neural networks (RNNs) or DNNs with long-context features. Training the framewise posterior requires a framewise state alignment, which is provided by an HMM/GMM system.
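The pseudo-likelihood trick of Eq. (9) can be sketched as follows: the DNN's softmax output gives framewise posteriors $p(\mathbf{s}_t|\mathbf{x}_t)$, and dividing by the state priors $p(\mathbf{s}_t)$ (in practice, subtracting log priors) yields scaled likelihoods usable in HMM decoding. This is a minimal illustrative sketch, not a specific toolkit's implementation; all names (`scaled_log_likelihoods`, the toy shapes, the prior values) are assumptions for illustration.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-10):
    """Pseudo-likelihood trick (Eq. 9), in the log domain.

    log_posteriors: (T, S) array of framewise log p(s|x_t) from a DNN softmax.
    state_priors:   (S,) array of p(s), e.g. estimated from state-alignment counts.
    Returns a (T, S) array of log p(x_t|s) up to an additive constant
    (the data term log p(x_t) does not depend on the state and can be dropped).
    """
    log_priors = np.log(np.maximum(state_priors, floor))  # floor avoids log(0)
    # Division p(s|x)/p(s) becomes subtraction in the log domain.
    return log_posteriors - log_priors

# Toy example: 3 frames, 4 HMM states (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
# Log-softmax: normalize logits so each row is a valid log posterior.
log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
# Hypothetical priors, as if counted from an HMM/GMM forced alignment.
priors = np.array([0.4, 0.3, 0.2, 0.1])
llk = scaled_log_likelihoods(log_post, priors)
print(llk.shape)  # (3, 4)
```

Note that frequent states (large prior) have their scores pushed down and rare states pushed up, which compensates for the class imbalance the DNN absorbed from the training alignment.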
