**3. Machine learning methodology**

When a machine learning method has to be selected in radiation oncology, the input and output variables are considered, and the expected analysis results are predicted and checked by accuracy validation. Kang et al. [14] describe the core principles of modeling as follows (**Figure 4**).

**Figure 4.** Core principles for modeling [14].

#### **3.1. Machine learning introduction**

Ethem Alpaydin [8] defines machine learning as programming computers to optimize a performance criterion using data, and Mitchell describes that a computer program is said to learn from experience (E) with respect to a task (T) and a performance measure (P) [9].

A machine learning algorithm can be divided into unsupervised learning and supervised learning [8, 11]. The training and test process differs slightly between the two, as shown in **Figure 5**; the main difference is the feedback loop between training and test in supervised learning, compared in Figure 5(a) and (b).

#### **3.2. Supervised learning**

Supervised learning is a machine learning method that finds a result from training data whose outcomes are known. For example, suppose we know beforehand which samples in a classification task belong to the doughnut group and which to the bagel group. A classifier is built from this training data, and a new sample is then assigned to either the doughnut group or the bagel group. This is an example of supervised learning.

**Figure 5.** Unsupervised learning and supervised learning algorithm process and types. (a) Unsupervised learning process; (b) supervised learning process; and (c) Supervised and unsupervised learning algorithm types.

Generally, the training data consist of input characteristics in vector form, with each vector paired with a wanted result. When the predicted result is continuous, the task is regression; classification is the assignment of an input vector to one of several groups. When a supervised learner is built, the training data have to be evaluated by a proper method to achieve the final goal, and the accuracy of the classification must be counted numerically, with validation, to measure its performance.
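As a minimal sketch of this train-and-test process, the doughnut/bagel example can be written as a nearest-centroid classifier. The two features (diameter and a glaze score) and all of their values are hypothetical, chosen only for illustration:

```python
import numpy as np

# Toy training data with known labels: two hypothetical features per sample
# (diameter in cm, glaze score). All values are invented for illustration.
X_train = np.array([[7.5, 0.9], [8.0, 0.8], [7.0, 1.0],     # doughnuts
                    [10.0, 0.1], [9.5, 0.2], [10.5, 0.0]])  # bagels
y_train = np.array([0, 0, 0, 1, 1, 1])  # 0 = doughnut, 1 = bagel

def fit_centroids(X, y):
    """Training step: compute the mean feature vector of each class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Test step: assign a new sample to the class with the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

centroids = fit_centroids(X_train, y_train)
print(predict(centroids, np.array([7.8, 0.85])))   # near the doughnut group
print(predict(centroids, np.array([10.2, 0.05])))  # near the bagel group
```

The same structure, training on labeled data and then evaluating accuracy on held-out test data, carries over to the algorithms in the following sections.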

#### *3.2.1. Decision tree*

A decision tree consists of nodes and branches. As the hierarchy of nodes becomes more complicated, leaf nodes and branches follow from each decision. Thus, a diagram is formed in which an unknown condition is tested at each node and a "yes" or "no" decision directs the path down the tree. This makes it possible to trace how a hypothesis was created from the results. **Figure 6** shows a decision tree and its rules, whose conditions test patient characteristics such as chemotherapy, cell type, treatment, and sex for radiotherapy (RT).

**Figure 6.** A decision tree and its rules, whose conditions test patient characteristics such as chemotherapy, cell type, treatment, and sex for radiotherapy (RT) (reproduced from Das et al. [7]).

A hyperplane h(x) is defined for points **x** by Eq. (1) [12]:

$$h(\mathbf{x}): \mathbf{w}^T \mathbf{x} + b = 0 \tag{1}$$

where **w** is the weight vector and b is the offset. The generic form of a split point for a numeric attribute X<sub>i</sub> is given in Eq. (2):

$$X_i \le v \tag{2}$$

where v = −b is a value in the domain of X<sub>i</sub>. The decision point X<sub>i</sub> ≤ v thus divides R, the input data space, into two regions R<sub>Y</sub> and R<sub>N</sub>. Each split of R into R<sub>Y</sub> and R<sub>N</sub> also induces a binary partition of the corresponding input data points D. That is, a split point of the form X<sub>i</sub> ≤ v induces the data partition in Eqs. (3) and (4):

$$D_Y = \left\{ \mathbf{x} \mid \mathbf{x} \in D,\; x_i \le v \right\} \tag{3}$$

$$D_N = \left\{ \mathbf{x} \mid \mathbf{x} \in D,\; x_i > v \right\} \tag{4}$$

where D<sub>Y</sub> is the subset of data points that lie in region R<sub>Y</sub> and D<sub>N</sub> is the subset of input points that lie in R<sub>N</sub> [12].
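The partition of Eqs. (3) and (4) can be sketched directly; the small data matrix and the split point below are illustrative values:

```python
import numpy as np

# Axis-parallel split X_i <= v, as in Eqs. (3) and (4): the data D are
# partitioned into D_Y and D_N. Data values and split point are illustrative.
def split(D, i, v):
    """Return (D_Y, D_N): rows with x_i <= v and rows with x_i > v."""
    mask = D[:, i] <= v
    return D[mask], D[~mask]

D = np.array([[1.0, 5.0],
              [2.0, 3.0],
              [4.0, 8.0],
              [6.0, 1.0]])
D_Y, D_N = split(D, i=0, v=3.0)   # decision node: X_0 <= 3?
print(len(D_Y), len(D_N))         # sizes of the two partitions
```

A decision tree applies such splits recursively, choosing at each node the attribute and value that best separate the classes.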

#### *3.2.2. Support vector machine*


A support vector machine (SVM) is a machine learning method for pattern recognition and information analysis; it is generally used for classification and regression analysis. The SVM makes a decision about the input data, determining which category a given data point belongs to. To understand the SVM, the terms data group and hyperplane first have to be defined.

A hyperplane in d dimensions is given as the set of all points **x** ∈ R<sup>d</sup> that satisfy the equation h(x) = 0, where h(x) is the hyperplane function, defined as follows in Eq. (5) [12]:

$$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b \tag{5}$$

Here, **w** is the d-dimensional weight vector and b is a scalar called the bias. For points that lie on the hyperplane, we have Eq. (6):

$$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = 0 \tag{6}$$

The hyperplane is thus the set of all points satisfying **w**<sup>T</sup>**x** = −b. If the input data are linearly separable, then a separating hyperplane h(x) = 0 can be found such that h(x<sub>i</sub>) < 0 for all points labeled y<sub>i</sub> = −1 and h(x<sub>i</sub>) > 0 for all points labeled y<sub>i</sub> = +1:

$$y_i = \begin{cases} +1 & \text{if } h(\mathbf{x}_i) > 0 \\ -1 & \text{if } h(\mathbf{x}_i) < 0 \end{cases} \tag{7}$$

For any two points **a**<sub>1</sub> and **a**<sub>2</sub> that lie on the hyperplane,

$$\mathbf{w}^T(\mathbf{a}_1 - \mathbf{a}_2) = 0 \tag{8}$$

The weight vector **w** therefore points in the direction normal to the hyperplane, whereas the bias b fixes the offset of the hyperplane in the d-dimensional space. Because both **w** and −**w** are normal to the hyperplane, the ambiguity is removed by requiring h(x<sub>i</sub>) > 0 where y<sub>i</sub> = +1 and h(x<sub>i</sub>) < 0 where y<sub>i</sub> = −1.

Thus, let **x**<sub>p</sub> be the orthogonal projection of **x** onto the hyperplane, and let **r**<sub>1</sub> = **x** − **x**<sub>p</sub>:

$$\mathbf{x} = \mathbf{x}_p + \mathbf{r}_1 \tag{9}$$

$$\mathbf{x} = \mathbf{x}_p + r_1 \frac{\mathbf{w}}{\|\mathbf{w}\|} \tag{10}$$

where r<sub>1</sub> is the directed distance of **x** from **x**<sub>p</sub> and **w**/‖**w**‖ is the unit weight vector: r<sub>1</sub> is positive when **r**<sub>1</sub> points in the same direction as **w**, and negative when **r**<sub>1</sub> points in the opposite direction to **w** (**Figure 7**) [12].
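Because h(**x**<sub>p</sub>) = 0, substituting Eq. (10) into h(**x**) = **w**<sup>T</sup>**x** + b gives r<sub>1</sub> = h(**x**)/‖**w**‖, which can be checked numerically. The weight vector and bias here are illustrative values:

```python
import numpy as np

# Directed distance to the hyperplane: from Eq. (10) and h(x_p) = 0,
# r_1 = h(x) / ||w||. The values of w and b are illustrative.
w = np.array([1.0, 1.0])   # weight vector, normal to the hyperplane
b = -4.0                   # bias (offset)

def h(x):
    return w @ x + b       # hyperplane function, Eq. (6)

def directed_distance(x):
    return h(x) / np.linalg.norm(w)

print(directed_distance(np.array([3.0, 3.0])))  # positive: same side as w
print(directed_distance(np.array([1.0, 1.0])))  # negative: opposite side
```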

In the nonlinear case, the classes are not separable by a linear SVM; the data are instead handled with a kernel function, as illustrated in **Figure 8**. Common kernels include the polynomial kernel, the Gaussian kernel, and others.
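As a sketch of one such kernel, the Gaussian (RBF) kernel can be computed directly; the bandwidth σ below is a hypothetical choice:

```python
import numpy as np

# Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)).
# The bandwidth sigma is a hypothetical choice for illustration.
def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 0.0])
print(rbf_kernel(x, x))  # identical points give the maximum value 1.0
print(rbf_kernel(x, z))  # similarity decays as the distance grows
```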

Libraries implementing the support vector machine are available for various programming languages, as listed in **Table 3**.

#### *3.2.3. Neural network*

A neural network example in radiation oncology is shown in **Figure 9**. A three-layer neural network is defined as follows; it has the following model for the approximated function [11]:

**Figure 7.** The support vectors and hyperplane (reproduced from Zaki and Wagner Meira [12]).

**Figure 8.** A nonlinear SVM (reproduced from Zaki and Wagner Meira [12]).


**Table 3.** Various programming languages to implement the SVM algorithm (good, ○; better, ◑; best, ●).

**Figure 9.** Neural network for head and neck cancer of 3-class classification example [17].

$$f(\mathbf{x}) = \mathbf{y}^T \mathbf{w}^{(2)} + b^{(2)} \tag{11}$$

where the elements of **y** are the outputs of the hidden neurons:

$$y_i = s\left(\mathbf{x}^T \mathbf{w}_i^{(1)} + b_i^{(1)}\right) \tag{12}$$

(where **x** is the input vector, **w**<sup>(j)</sup> and b<sup>(j)</sup> are the interconnection weight vector and bias of layer j, and s(·) is the activation function).
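Eqs. (11) and (12) amount to a single forward pass, which can be sketched as follows; the weights, biases, layer sizes, and the choice of a sigmoid activation s(·) are all illustrative assumptions:

```python
import numpy as np

# Forward pass of the three-layer network of Eqs. (11) and (12).
# Weights, biases, and the sigmoid activation are illustrative choices.
def s(a):
    return 1.0 / (1.0 + np.exp(-a))   # sigmoid activation

rng = np.random.default_rng(0)
d, m = 4, 3                    # input dimension, number of hidden neurons
W1 = rng.normal(size=(d, m))   # columns are the w_i^(1) of Eq. (12)
b1 = rng.normal(size=m)        # hidden biases b_i^(1)
w2 = rng.normal(size=m)        # output weights w^(2) of Eq. (11)
b2 = 0.1                       # output bias b^(2)

def f(x):
    y = s(x @ W1 + b1)         # Eq. (12): hidden-layer outputs y_i
    return y @ w2 + b2         # Eq. (11): f(x) = y^T w^(2) + b^(2)

x = np.ones(d)
print(f(x))                    # scalar network output
```

Training such a network adjusts W1, b1, w2, and b2 so that f approximates the target function on the training data.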

#### **3.3. Unsupervised learning**


Unsupervised learning, unlike supervised learning, does not know the specific group information; the learning algorithm infers the structure on its own, as in the doughnut and bagel example. That is, there is no target value in unsupervised learning. It is related to density estimation in statistics. Unsupervised learning is beneficial for analyzing and explaining the characteristics of the data. A typical example is clustering; another is independent component analysis.

#### *3.3.1. Principal component analysis (PCA)*

Zaki and Wagner Meira define PCA as follows:

a. finding the r-dimensional basis that best captures the variance in the data;

b. the direction with the largest projected variance is called the first principal component;

c. the orthogonal direction that captures the second largest projected variance is the second principal component, and so forth.

Also, the mean squared error can be minimized by maximizing the data variance [12].

Principal component analysis (PCA) is applied to the normalized X to identify a set of principal components (PCs) [11]:

$$PC = \mathbf{U}^T \mathbf{X} = \mathbf{\Sigma} \mathbf{V}^T \tag{13}$$

where UΣV<sup>T</sup> is the singular value decomposition of X.
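Eq. (13) can be verified directly with an SVD; the small centered data matrix below (features in rows, samples in columns) is illustrative:

```python
import numpy as np

# PCA via SVD, as in Eq. (13): if X = U S V^T, then PC = U^T X = S V^T.
# The small data matrix is illustrative (columns are samples).
X = np.array([[2.0, 0.0, -2.0],
              [1.0, 0.0, -1.0]])
X = X - X.mean(axis=1, keepdims=True)      # center each feature (row)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
PC = U.T @ X                               # principal components, Eq. (13)

print(np.allclose(PC, np.diag(S) @ Vt))    # U^T X equals Sigma V^T
print(S)                                   # singular values, largest first
```

The rows of PC give the data projected onto the principal directions, with the first row capturing the largest projected variance.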

#### *3.3.2. Clustering*

Clustering is an unsupervised learning method that finds clusters in data without labels. Classification requires both the data and the data labels; thus, different methods are needed for unlabeled data. There are several ways to define a cluster. One simple way is to define data in the same cluster as being close to each other, so that the closest data points are grouped together. k-Means assumes that data in the same cluster are close: each cluster has one center, and a cost can be defined as the distance between the center and each data point. Thus, k-means is an algorithm that reduces and minimizes this cost within the clusters.

Given a clustering C = {C<sub>1</sub>, C<sub>2</sub>, …, C<sub>k</sub>}, a scoring function evaluates its quality. The sum of squared errors (SSE) scoring function is defined as [12]

$$SSE(C) = \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in C_i} \left\| \mathbf{x}_j - \boldsymbol{\mu}_i \right\|^2 \tag{14}$$

where **μ**<sub>i</sub> is the mean of cluster C<sub>i</sub>.

The goal is to find the clustering that minimizes the SSE score, thus,

$$C^* = \arg\min_{C} \left\{ SSE(C) \right\} \tag{15}$$

k-Means employs a greedy iterative approach to find a clustering that minimizes the SSE objective [12].
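A minimal sketch of this greedy iteration (Lloyd's algorithm) follows; the toy data, k = 2, and the iteration count are illustrative choices:

```python
import numpy as np

# A minimal k-means (Lloyd's algorithm) minimizing the SSE of Eq. (14).
# The toy data and k = 2 are illustrative.
def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial centers
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center's cluster.
        labels = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        # Update step: each center moves to the mean of its cluster.
        mu = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    sse = ((X - mu[labels]) ** 2).sum()                 # Eq. (14)
    return labels, mu, sse

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, mu, sse = kmeans(X, k=2)
print(labels, sse)   # two tight groups, small SSE
```

Each iteration can only decrease (or keep) the SSE, which is why the greedy procedure converges, although possibly to a local minimum.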

The advantages and disadvantages of various machine learning algorithms in radiation oncology are summarized in **Table 4**.


**Table 4.** The advantages and disadvantages by various machine learning algorithms in radiation oncology [15].
