**4. Air pollution area division method**

In this section, we label locations by applying K-means clustering to the air pollution dataset we created. Given a predefined number of clusters, the K-means algorithm partitions all data points by computing the center point of each cluster. We used the Scikit-learn Python library [15].


**Figure 1.**

*The air pollution dataset in April 2020.*

Because K-means clusters the data by calculating the distance between data points, we first normalize every feature to a value between 0 and 1 using MinMaxScaling [15] and then perform K-means clustering on the normalized data. Eq. (1) gives the objective: the partition into sets $\mathcal{S}_{1}, \ldots, \mathcal{S}_{k}$ that maximizes the cohesion within each set, that is, minimizes the squared distance between each point $\mathbf{x}$ and the center $\boldsymbol{\mu}_{i}$ of its cluster. Distances within each cluster are measured using the Euclidean distance. The clustering algorithm ends when the average value of this distance no longer changes.

$$\arg\min \sum_{i=1}^{k} \sum_{\mathbf{x} \in \mathcal{S}_{i}} \left\| \mathbf{x} - \boldsymbol{\mu}_{i} \right\|^{2} \tag{1}$$
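As a rough illustration of the scaling and clustering step and of Eq. (1), the following sketch applies Scikit-learn's `MinMaxScaler` and `KMeans` to synthetic data. The random values, the three columns, and the choice of four clusters are assumptions made for this example only; they are not taken from the actual dataset, and the sketch is not the chapter's Pseudocode 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the air pollution dataset (hypothetical values):
# columns are latitude, longitude, and one pollutant concentration.
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.uniform(33.0, 38.5, size=200),    # latitude
    rng.uniform(126.0, 129.5, size=200),  # longitude
    rng.uniform(5.0, 80.0, size=200),     # concentration
])

# Rescale every column to [0, 1] so that no feature dominates the distance.
scaled = MinMaxScaler().fit_transform(data)

# Four clusters is an arbitrary choice for this small example.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)

# Evaluate Eq. (1) directly: squared distance of each sample to the
# center of the cluster it was assigned to, summed over all samples.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
objective = float(np.sum((scaled - assigned_centers) ** 2))
```

The value of `objective` should agree, up to floating-point error, with the `inertia_` attribute that Scikit-learn reports for the fitted model.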

We calculate the inertia value to find an appropriate k for K-means clustering. The inertia is the sum of the squared distances between each data point and the center of its assigned cluster after clustering. **Figure 2** shows the inertia value as a function of k. The optimal k is usually taken at the point where the inertia stops decreasing rapidly and further changes are not significant. However, it is difficult to determine such a point in this graph. Therefore, we set k to 16, focusing on dividing the whole country into 16 provinces.
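The inertia curve described above can be reproduced in outline as follows. This is a minimal sketch on uniformly random, already normalized data; the data, the range of k from 1 to 20, and the variable names are assumptions for illustration, not the settings used for Figure 2.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already-normalized data standing in for the scaled dataset.
rng = np.random.default_rng(0)
scaled = rng.random((300, 3))

# Fit K-means for a range of k and record the inertia of each fit.
k_values = list(range(1, 21))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).inertia_
    for k in k_values
]

# Relative drop in inertia from one k to the next; an elbow would show up
# as a large drop followed by consistently small ones.
drops = [(a - b) / a for a, b in zip(inertias, inertias[1:])]
```

Plotting `inertias` against `k_values` gives a curve of the kind shown in Figure 2. When the data have no strong cluster structure, the drops shrink gradually rather than at one clear elbow, which is the situation in which a domain-driven choice such as k = 16 is reasonable.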


#### **Pseudocode 2.**

*The process of performing scaling and clustering.*

Pseudocode 2 shows the source code that loads the April data and performs the scaling and clustering. **Figure 3** presents the coordinates of the center point of each cluster obtained by clustering the air pollution data for one month. We use the Folium Python library to draw this map [16]. Markers of the same color are center points that belong to the same one of the 16 clusters computed for each day. **Table 1** compares the 16 administrative district labels with the clustering results. For example, label 0 is the Gangwon-do area, and clusters 11, 12, and 15 contain the twelve air pollution stations in this district.

**Figure 2.** *The inertia value as a function of k.*

**Figure 3.** *The coordinates of the center point of each cluster for a month.*

To determine the coordinates of the 16 cluster center points, we perform K-means clustering again. **Figure 4** visualizes the 16 center points on a map to divide regions. The resulting cluster centers are more closely spaced around Seoul, Incheon, and Gyeonggi-do than in other regions, because many of the air pollution monitoring stations in Korea are concentrated in the metropolitan area.

#### **Table 1.**

*Number of stations in each cluster by administrative district.*

**Figure 5** shows the result of classifying the air pollution monitoring stations by calculating the distance from each of the 16 center coordinates to each station. Points on the map mark the locations of the air pollution monitoring stations. Here, we calculated the Euclidean distance using latitude and longitude.
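The classification step can be sketched as a nearest-center assignment, assuming that each station takes the label of the center with the smallest Euclidean distance. The three centers and four stations below are hypothetical coordinates chosen for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical cluster centers and station locations as (latitude, longitude).
centers = np.array([
    [37.55, 126.99],
    [35.18, 129.08],
    [35.16, 126.85],
])
stations = np.array([
    [37.60, 127.03],
    [37.45, 126.70],
    [35.10, 129.03],
    [35.20, 126.90],
])

# Pairwise Euclidean distances, shape (n_stations, n_centers), computed
# directly on the degree-valued coordinates as described in the text.
diff = stations[:, None, :] - centers[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=2))

# Each station takes the label of its closest center.
station_labels = distances.argmin(axis=1)
```

Here the first two stations are assigned to the first center and the remaining two to the second and third centers, respectively. Note that a Euclidean distance on raw latitude and longitude treats one degree in either direction as equal, which is only an approximation of ground distance.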

**Figure 6** visualizes the convex hull polygon obtained by connecting the outermost points of each group of classified measurement stations with line segments [17]. Because classification is based on the locations of the stations, this method classifies accurately even when the points are close to each other. However, an area without a monitoring station is left as a blind spot, where the distribution of air pollution cannot be measured.
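One way to obtain such a polygon is SciPy's `scipy.spatial.ConvexHull`, shown here as a minimal sketch. Whether this matches the implementation in [17] is an assumption, and the five station coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical station coordinates (longitude, latitude) for one cluster.
points = np.array([
    [126.8, 37.4],
    [127.2, 37.4],
    [127.2, 37.7],
    [126.8, 37.7],
    [127.0, 37.55],  # interior station, not part of the outline
])

hull = ConvexHull(points)

# For 2-D input, hull.vertices lists the outermost points in
# counterclockwise order; repeating the first point closes the ring.
outline = points[hull.vertices]
polygon = np.vstack([outline, outline[:1]])
```

The closed `polygon` array can then be drawn on a map as the boundary of the cluster. The interior station does not appear among the hull vertices, and any location outside the polygon remains unassigned, which is the limitation noted above.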

So far, we found the cluster center points using the locations of the air pollution monitoring stations and the concentrations measured at them, in order to divide air pollution areas in a way that reflects the data distribution. The stations are classified based on the center coordinates, and the air pollution areas are delimited using convex hull polygons. However, the resulting air pollution areas did not include areas without air pollution monitoring stations.

Therefore, we use the Voronoi algorithm, which also covers areas without measurement stations and divides areas based on the center points of the clusters [18]. The Voronoi algorithm constructs the perpendicular bisector of the segment between each pair of neighboring points and forms polygons whose vertices are the intersections of these bisectors. **Figure 7** shows the regions divided using the Voronoi algorithm. The dots represent the centers of the classified clusters. The method used in the Voronoi

**Figure 5.** *The results of classifying air pollution monitoring stations by cluster.*

**Figure 6.** *The result using the convex hull polygon algorithm.*

**Figure 7.** *The result using the Voronoi algorithm.*
