#### **4.2 Statistic criterion**

From the definition of domain adaptation, we see that the fundamental goal is to reduce the divergence between the source domain and the target domain so that the function *F<sup>t</sup>* can achieve good performance on the target domain. Therefore, it is important and valuable to use a criterion to measure the divergence between different domains. In other words, we need a measure of the difference between the probability distributions of different datasets.

Maximum Mean Discrepancy (MMD) [15] is a well-known criterion that is widely adopted in deep domain adaptation, such as in [16, 17]. Specifically, MMD computes the squared distance between the empirical means of the two datasets in a feature space, which can be defined as

$$\begin{split} \mathcal{D}\_{\text{MMD}}(\mathbf{X}^{s}, \mathbf{X}^{t}) &= \left\| \frac{1}{n} \sum\_{i=1}^{n} \phi\left(\mathbf{x}\_{i}^{s}\right) - \frac{1}{m} \sum\_{j=1}^{m} \phi\left(\mathbf{x}\_{j}^{t}\right) \right\|^{2} \\ &= \frac{1}{n^{2}} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \phi\left(\mathbf{x}\_{i}^{s}\right)^{\mathsf{T}} \phi\left(\mathbf{x}\_{j}^{s}\right) - \frac{2}{nm} \sum\_{i=1}^{n} \sum\_{j=1}^{m} \phi\left(\mathbf{x}\_{i}^{s}\right)^{\mathsf{T}} \phi\left(\mathbf{x}\_{j}^{t}\right) + \frac{1}{m^{2}} \sum\_{i=1}^{m} \sum\_{j=1}^{m} \phi\left(\mathbf{x}\_{i}^{t}\right)^{\mathsf{T}} \phi\left(\mathbf{x}\_{j}^{t}\right) \\ &= \frac{1}{n^{2}} \sum\_{i=1}^{n} \sum\_{j=1}^{n} k\left(\mathbf{x}\_{i}^{s}, \mathbf{x}\_{j}^{s}\right) - \frac{2}{nm} \sum\_{i=1}^{n} \sum\_{j=1}^{m} k\left(\mathbf{x}\_{i}^{s}, \mathbf{x}\_{j}^{t}\right) + \frac{1}{m^{2}} \sum\_{i=1}^{m} \sum\_{j=1}^{m} k\left(\mathbf{x}\_{i}^{t}, \mathbf{x}\_{j}^{t}\right) \end{split} \tag{4}$$

where *ϕ*(·) denotes the feature-space map. In practice, we can use a kernel function *k*(**x**, **x**′) = *ϕ*(**x**)<sup>T</sup>*ϕ*(**x**′) (e.g., a Gaussian kernel) so that MMD can be computed easily without an explicit feature map.
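As an illustration, the kernel form of Eq. (4) can be estimated directly from two samples. The following sketch uses plain NumPy with a Gaussian kernel; the bandwidth `sigma` and the sample sizes are illustrative assumptions (in practice the bandwidth is often chosen with the median heuristic).

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)) for all pairs of rows."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2(xs, xt, sigma=1.0):
    """Squared MMD of Eq. (4) between a source sample xs (n, d) and a target sample xt (m, d)."""
    k_ss = gaussian_kernel(xs, xs, sigma)   # (1/n^2) sum k(x_i^s, x_j^s)
    k_st = gaussian_kernel(xs, xt, sigma)   # (2/nm)  sum k(x_i^s, x_j^t)
    k_tt = gaussian_kernel(xt, xt, sigma)   # (1/m^2) sum k(x_i^t, x_j^t)
    return k_ss.mean() - 2.0 * k_st.mean() + k_tt.mean()

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(200, 2))   # source sample
xt = rng.normal(2.0, 1.0, size=(200, 2))   # mean-shifted target sample
print(mmd2(xs, xs))  # zero: identical samples
print(mmd2(xs, xt))  # clearly positive: shifted distributions
```

Identical samples give a value of zero, while samples drawn from shifted distributions give a clearly positive value, which is exactly what makes MMD usable as a training loss for domain adaptation.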

*H*-divergence [18] is a more general criterion for measuring domain divergence, which is defined as

$$d\_{\mathcal{H}}(\mathcal{D}\_{s}, \mathcal{D}\_{t}) = 2 \sup\_{h \in \mathcal{H}} \left| \Pr\_{\mathbf{x}\_{s} \sim \mathcal{D}\_{s}} \left[ h(\mathbf{x}\_{s}) = 1 \right] - \Pr\_{\mathbf{x}\_{t} \sim \mathcal{D}\_{t}} \left[ h(\mathbf{x}\_{t}) = 1 \right] \right| \tag{5}$$

where *h* ∈ *H* is a binary classifier (i.e., a hypothesis). For example, in [19], domain-adversarial networks are proposed based on this statistic criterion (note that this method can also be categorized under domain-adversarial learning).
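In practice, the supremum in Eq. (5) is often approximated empirically by training a domain classifier to separate source samples from target samples and plugging its error into the formula (sometimes called the proxy A-distance). The sketch below is an illustrative surrogate in plain NumPy, not the exact procedure of [18, 19]: the linear logistic classifier, learning rate, and step count are all assumptions.

```python
import numpy as np

def proxy_h_divergence(xs, xt, lr=0.1, steps=500):
    """Approximate 2 * sup_h |Pr_s[h=1] - Pr_t[h=1]| with a trained linear domain classifier."""
    x = np.vstack([xs, xt])
    y = np.concatenate([np.zeros(len(xs)), np.ones(len(xt))])  # 0 = source, 1 = target
    x = np.hstack([x, np.ones((len(x), 1))])                   # append a bias column
    w = np.zeros(x.shape[1])
    for _ in range(steps):                                     # plain logistic-regression GD
        p = 1.0 / (1.0 + np.exp(-x @ w))
        w -= lr * x.T @ (p - y) / len(y)
    err = np.mean((x @ w > 0) != y)                            # domain-classification error
    return 2.0 * (1.0 - 2.0 * err)                             # proxy A-distance formula

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(300, 2))
xt = rng.normal(3.0, 1.0, size=(300, 2))
print(proxy_h_divergence(xs, xt))         # near 2: the domains are easy to tell apart
print(proxy_h_divergence(xs, xs.copy()))  # near 0: indistinguishable domains
```

A value near 2 means a classifier separates the domains easily (large divergence), while a value near 0 means the domains are essentially indistinguishable, which is the goal of domain-adversarial training.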

#### **4.3 Parameter regularization**

Note that when fine-tuning networks with the label criterion or the statistic criterion, the weights in the networks are usually shared between the source domain and the target domain. In contrast to these methods, some researchers argue that the weights for each domain should be related but not shared. Based on this idea, the authors in [20] propose a two-stream architecture with a weight regularization method. Two types of regularizers are introduced: an *L*<sup>2</sup> norm and an exponential form:

$$\begin{aligned} r\_w\left(\theta\_j^{s}, \theta\_j^{t}\right) &= \left\| a\_j \theta\_j^{s} + b\_j - \theta\_j^{t} \right\| \\ \text{or} \quad r\_w\left(\theta\_j^{s}, \theta\_j^{t}\right) &= \exp\left( \left\| a\_j \theta\_j^{s} + b\_j - \theta\_j^{t} \right\| \right) - 1 \end{aligned} \tag{6}$$
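A minimal sketch of these two penalties for a single layer, in plain NumPy; treating *a<sub>j</sub>* and *b<sub>j</sub>* as scalars defaulting to the identity transform is an illustrative assumption:

```python
import numpy as np

def l2_reg(theta_s, theta_t, a=1.0, b=0.0):
    """r_w = || a * theta_s + b - theta_t ||  (linear-transform weight regularizer)."""
    return np.linalg.norm(a * theta_s + b - theta_t)

def exp_reg(theta_s, theta_t, a=1.0, b=0.0):
    """r_w = exp(|| a * theta_s + b - theta_t ||) - 1  (exponential form)."""
    return np.exp(l2_reg(theta_s, theta_t, a, b)) - 1.0

theta_s = np.array([0.5, -1.0, 2.0])   # source-stream weights of one layer
theta_t = np.array([0.5, -1.0, 2.0])   # identical target-stream weights
print(l2_reg(theta_s, theta_t))        # 0.0: no penalty when the streams agree
print(exp_reg(theta_s, theta_t))       # 0.0 as well, since exp(0) - 1 = 0
```

Both penalties vanish when the target weights are a linear transform of the source weights; the exponential form grows faster, punishing large deviations between the two streams more aggressively.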

where *a<sub>j</sub>* and *b<sub>j</sub>* are different parameters in each layer. Rather than using two networks for domain adaptation, in [21], a domain-guided method is introduced to drop some weights in the networks directly.

*Advances and Applications in Deep Learning*

**4. Fine-tuning networks**

In the last section, we categorized the main methods for conducting domain adaptation with deep neural networks and gave some high-level information. In this section, we first discuss the details of the four approaches for fine-tuning networks in **Table 1**.

**Figure 3.**
*Categorization of domain adaptation based on feature space. (The image is from Wang [12]).*

**Table 1.**

| Approach | Method | Supervised | Unsupervised |
|---|---|---|---|
| Fine-tuning | Label criterion | ✓ | |
| | Statistic criterion | | ✓ |
| | Parameter regularization | ✓ | ✓ |
| | Sample regularization | ✓ | ✓ |
| Adversarial-domain | Domain classifier | | ✓ |
| | Target data generating | | ✓ |
| Sample-reconstruction | Encoder-decoder-based | | ✓ |
| | GAN-based | | ✓ |

*Categorization of deep domain adaptation based on whether the labels in the target domain are available.*

#### **4.1 Label criterion**

The most basic approach to conducting domain adaptation is to fine-tune a pre-trained network with labeled data from the target domain. Hence, we assume that the labels in the target dataset are available and that we can utilize a supervised learning approach to adjust the weights/parameters in the network. Based on the definition of the task, our target task *T<sup>t</sup>* under the label criterion approach is

$$\mathcal{T}^{t} = \mathcal{L}\left(Y^{t}, \hat{Y}^{t}\right) = \mathcal{L}\left(Y^{t}, F^{t}\left(X^{t}; \Theta\right)\right) \tag{3}$$

where *L* denotes a loss function, such as the cross-entropy loss $\mathcal{L}(Y, \hat{Y}) = -Y \log \hat{Y} - (1 - Y) \log (1 - \hat{Y})$, which is commonly used in many works. Note that Θ is a set of parameters which is normally initialized with weights from the pre-trained model.

As discussed in Section 3.1, a question is how many layers of the neural network we should freeze. In general, two main factors can influence the fine-tuning procedure: the size of the target dataset and its similarity to the source domain. Based on these two factors, some common rules of thumb are introduced in [13]. One typical work is [14], in which a unified supervised method for

#### **4.4 Sample regularization**

Alternatively, instead of adapting the parameters in the networks, we can reweight the data in each layer of a feed-forward neural network. The typical method to reduce internal covariate shift in deep neural networks is to conduct batch normalization during training [22]:

$$
\hat{\mathbf{x}}\_i = \gamma \frac{\mathbf{x}\_i - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta \tag{7}
$$
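As a concrete illustration of Eq. (7), the following sketch computes batch statistics and normalizes a batch of activations in plain NumPy; the batch size, feature dimension, and the small constant `eps` are illustrative choices.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch x of shape (B, d) with its own statistics, as in Eq. (7)."""
    mu = x.mean(axis=0)                    # mu = (1/B) * sum_i x_i
    var = ((x - mu) ** 2).mean(axis=0)     # sigma^2 = (1/B) * sum_i (x_i - mu)^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # whiten each feature
    return gamma * x_hat + beta            # scale and shift with learnable gamma, beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))    # activations far from zero mean / unit variance
out = batch_norm(x)
print(out.mean(axis=0))                   # approximately 0 per feature
print(out.std(axis=0))                    # approximately 1 per feature
```

Because the statistics are computed per batch, the normalized activations of source and target batches land on a comparable scale, which is what the adaptation methods built on this idea exploit.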

Note that **x**<sub>i</sub> usually denotes the hidden activation of an input sample in each layer of a neural network (e.g., the output feature map of each convolutional layer), $\mu = \frac{1}{B} \sum_{i=1}^{B} \mathbf{x}_i$ and $\sigma^2 = \frac{1}{B} \sum_{i=1}^{B} (\mathbf{x}_i - \mu)^2$, where *B* is the batch size and *γ* and *β* are two parameters to learn. Based on this method, [23] propose a revised method for practical domain adaptation, and in [24], researchers adopt instance normalization for stylization.

… is expensive, especially when the target datasets consist of large-size samples such as high-resolution images.

#### **5.2 Domain classifier**

Instead of directly synthesizing labeled data for domain adaptation, an alternative way is to add an extra domain classifier to encourage domain confusion. The role of the domain classifier is similar to that of the discriminator in GANs: it distinguishes data from the source domain and the target domain (the discriminator in GANs is responsible for recognizing fake data from real data). With the help of an adversarial learning approach, the domain classifier can help the network learn a domain-invariant representation of the source domain and the target domain. In other words, the trained model can be directly used for the target/new task. Therefore, the key is how to conduct adversarial learning with the domain classifier. In [8], a gradient reversal layer (GRL) before the domain classifier is introduced to maximize the gradients for encouraging domain confusion (we normally minimize the gradients to reduce the scalar value of a loss function). In [27], a domain confusion loss is proposed alongside the domain classifier loss.

**6. Sample-reconstruction approaches**

The core idea of the data-reconstruction approach is to utilize reconstruction as an auxiliary task for encouraging domain confusion in an unsupervised manner. In this section, we discuss two types of approaches that have mainly been addressed in recent years: the encoder-decoder-based method and the GAN-based method.

#### **6.1 Encoder-decoder-based approaches**

To reconstruct the samples, the basic method is to adopt an auto-encoder framework, in which there is an encoder network and a decoder network. The encoder maps an input sample into a hidden representation, and the decoder reconstructs the input sample from this hidden representation. In particular, encoder-decoder networks for domain adaptation typically involve an encoder shared between the source domain and the target domain so that the encoder can learn some domain-invariant representation. An earlier work can be found in [9], in which a stacked denoising auto-encoder is adopted for sentiment classification. More recently, a typical work called deep reconstruction-classification networks was introduced in [11], in which the encoder and decoder are both implemented with convolutional networks. Specifically, the convolutional encoder is used for supervised classification of the labeled data from the source domain. Meanwhile, it also maps the unlabeled data from the target domain into a hidden representation, which is further decoded by the convolutional decoder to reconstruct the input. By jointly training these networks with data from the source and target domains, the shared encoder can learn common representations from both datasets, which results in domain adaptation. Other similar work based on auto-encoders can also be found in [11, 28].

#### **6.2 GAN-based approaches**

Traditionally, a GAN [6] consists of a generator and a discriminator, where the generator can be seen as a decoder network which can decode some random noise

*Transfer Learning and Deep Domain Adaptation. DOI: http://dx.doi.org/10.5772/intechopen.94072*
