**Air Pollution Analysis with a Possibilistic and Fuzzy Clustering Algorithm Applied in a Real Database of Salamanca (México)**

B. Ojeda-Magaña<sup>1</sup>, R. Ruelas<sup>1</sup>, L. Gómez-Barba<sup>1</sup>, M. A. Corona-Nakamura<sup>1</sup>, J. M. Barrón-Adame<sup>2</sup>, M. G. Cortina-Januchs<sup>2</sup>, J. Quintanilla-Domínguez<sup>2</sup> and A. Vega-Corona<sup>2</sup>

<sup>1</sup>*University of Guadalajara*, <sup>2</sup>*University of Guanajuato*, *México*

#### **1. Introduction**

Air pollution is one of the most important environmental problems in both developed and developing countries, and it is associated with significant adverse health effects. It is characterized by the presence of a heterogeneous, complex mixture of gases, liquids, and particulate matter in the air. Pollution is caused by both natural and man-made sources, and it may vary greatly from one region to another according to geography, demography, climate, and topography. For example, pollutant concentrations decrease significantly when an urban area has favorable characteristics such as its topography or a long rainy season (Celik & Kadi, 2007). Natural causes of air pollution include forest fires, volcanic eruptions, wind erosion, pollen dispersal, evaporation of organic compounds, and natural radioactivity. Major man-made sources include industry, transportation, agriculture, power generation, and unplanned urban areas (Fenger, 2009).

Air pollutants exert a wide range of impacts on biological systems, physical systems, and ecosystems. Their effects on human health are of particular concern. The World Health Organization (WHO) considers air pollution a major environmental risk to health and estimates that it causes approximately 2 million premature deaths worldwide per year (WHO, 2008).

Air pollutants are classified as criteria and non-criteria pollutants. The former are considered dangerous to human and animal health; the name derives from the evaluations of air pollution published by the United States Environmental Protection Agency (EPA, 2008). Six criteria pollutants are defined: Nitrogen Dioxide (*NO*<sub>2</sub>), Sulfur Dioxide (*SO*<sub>2</sub>), Carbon Monoxide (*CO*), Particulate Matter (*PM*), Lead (*Pb*), and Ozone (*O*<sub>3</sub>). The objective of this classification is to establish permissible levels that protect human and animal health and preserve the environment. Human health is one of the most important concerns because of the short-term consequences of air pollution, especially in metropolitan areas. Health effects depend on the type of pollutant, its concentration in the air, the length of exposure, and individual susceptibility. Different groups of individuals react differently to air pollution; children and elderly people are the most affected. Among the long-term consequences for the global climate are global warming and the greenhouse effect.


Examining and studying air pollution data is essential for a better understanding of human exposure and its potential impacts on health and welfare.

In recent years, the city of Salamanca has been catalogued as one of the most polluted cities in Mexico (Zuk et al., 2007). Sulfur Dioxide (*SO*<sub>2</sub>) and Particulate Matter (*PM*<sub>10</sub>) are the criteria pollutants with the highest concentrations in Salamanca, where three monitoring stations have been installed to measure the level of air pollution; the records of each monitoring station are handled separately. Currently, an environmental contingency alarm is activated when the daily average pollutant concentration exceeds an established threshold at a single monitoring station.

In this work, we propose to apply the PFCM (*Possibilistic Fuzzy c-Means*) clustering algorithm to the data measured at the three monitoring stations. In this way, a local environmental contingency alarm can be raised according to the pollutant concentration reported by each monitoring station, while general (city-wide) environmental contingency alarms depend on the levels provided by the combined measure. The PFCM algorithm is used to find the prototypes of the patterns that represent the relation between the *SO*<sub>2</sub> and *PM*<sub>10</sub> air pollutants. For this relation analysis we use records from January 2007.

Once the prototypes have been estimated, the average pollution of each monitoring station is compared with the prototypes. This analysis uses a data set covering January to December 2007, which includes the pollutant concentrations (*SO*<sub>2</sub> and *PM*<sub>10</sub>) and the meteorological variables wind speed, wind direction, temperature, and relative humidity.

The impact of the meteorological variables on the dispersion of pollutants is also analyzed by computing correlation coefficients. This correlation analysis is simple, and it is intended to improve decision making in environmental programs. Only the data gathered by the *Nativitas* monitoring station is used for the correlation analysis.

This paper is organized as follows. Section 2 describes the features of the air pollution problem in Salamanca. Section 3 introduces the PFCM (*Possibilistic Fuzzy c-Means*) clustering algorithm and the correlation coefficients. Section 4 presents the results, and Section 5 presents our conclusions.

#### **2. Study case**

Salamanca is located in the state of Guanajuato, Mexico, and has a population of approximately 234,000 inhabitants (INEGI, 2005). The city lies 340 km northwest of Mexico City, at 20°34'22" North latitude and 101°11'39" West longitude. It is located in a valley surrounded by the *Sierra Codornices*, whose elevations have an average height of 2,000 meters Above Mean Sea Level (AMSL).

Salamanca has been one of the Mexican cities with the most significant industrial development over the last fifty years. The refinery and the power generation industry were established in the 1950s and 1970s, respectively. These industries constitute the main energy source for the local, regional, and national economy. However, the growth of the population and of the number of vehicles, the industrial, refinery, and thermoelectric activities, and the orographic and climatic characteristics have led to an increase in *SO*<sub>2</sub> and *PM*<sub>10</sub> concentrations (INE, 2004). The existing orography hinders the dispersion of pollutants by the wind, which produces the worst pollutant concentrations. *SO*<sub>2</sub> emissions are higher than those of the metropolitan areas of Mexico City and Guadalajara, the two biggest cities of Mexico, even though both have a larger population than Salamanca (Cortina-Januchs et al., 2009).


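Fig. 1. Location of monitoring stations in the city of Salamanca. Map labels: Cruz Roja (CR), DIF (DF), and Nativitas (NA) stations; refinery; power generation industry. Population density legend: low (100 - 2,000), medium (2,001 - 4,000), high (4,001 - 8,000).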

Sulfur dioxide is produced mainly by the combustion of fossil fuels, and the energy generation sector is its main source. The industrial sector generates 99.3 % of this pollutant, whereas only about 0.06 % is generated by the transport sector. For particles, electric power generation accounts for 29 % of total emissions, followed by vehicular traffic on unpaved roads with 27 %, agricultural burning with 17 %, and the transport sector with 10 %; the remaining 17 % is emitted by other sub-sectors.

The city authorities have made important efforts to measure and record pollutant concentrations (Zamarripa & Sainez, 2007). In 1999 the *Air Quality Monitoring Patronage* (AQMP) was formed. Since then the AQMP has been in charge of running the *Automatic Environmental Monitoring Network* (AEMN) and disseminating its information. This information is validated by the *Institute of Ecology* (IE), which constantly analyzes the pollutant levels (INE, 2004). The AEMN consists of three fixed stations and one mobile station. The fixed stations are *Cruz Roja* (CR), *Nativitas* (NA), and *DIF* (DF).

The fixed stations cover approximately 80 % of the urban area, while the mobile station covers the remaining 20 %. Fig. 1 illustrates the location of the three fixed stations. Each station has the necessary instrumentation to automatically record pollutant concentrations and meteorological variables every minute. Table 1 summarizes the pollutant concentrations and meteorological variables recorded at each of the three fixed stations.


#### **3. Clustering algorithms**

In this work we take advantage of the qualities of fuzzy and possibilistic clustering algorithms to find $c$ groups in an unlabeled data set $Z = \{z_1, z_2, \ldots, z_k, \ldots, z_N\}$ defined in an M-dimensional space, where the points $z_k$ nearest to a prototype (or group center) $v_i$ belong to group $i$ among the $c$ possible groups. The membership of each $z_k$ in the different groups depends on the kind of partition of the M-dimensional space in which the data set is defined. A *c-partition* can be hard (or crisp), fuzzy, or possibilistic (Bezdek et al., 1999). For a finite data set $Z = \{z_k \mid k = 1, 2, \ldots, N\}$ and $c$ groups, where $2 \le c < N$, the hard *c*-partition is defined by (1), the fuzzy *c*-partition by (2), and the possibilistic *c*-partition by (3).

$$M_{hcm} = \left\{ \mathbf{U} \in \mathfrak{R}^{c \times N} \;\middle|\; \mu_{ik} \in \{0, 1\},\ \forall i \text{ and } k;\ \sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \right\}; \tag{1}$$

$$M_{fcm} = \left\{ \mathbf{U} \in \mathfrak{R}^{c \times N} \;\middle|\; \mu_{ik} \in [0, 1],\ \forall i \text{ and } k;\ \sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \right\}; \tag{2}$$

$$M_{pcm} = \left\{ \mathbf{U} \in \mathfrak{R}^{c \times N} \;\middle|\; \mu_{ik} \in [0, 1],\ \forall i \text{ and } k;\ \forall k,\ \exists i,\ \mu_{ik} > 0;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \right\}. \tag{3}$$
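The three definitions differ only in the constraints placed on the entries of the membership matrix, which can be checked mechanically. The following is a minimal sketch, assuming NumPy and a small numerical tolerance `tol`; the function name `partition_type` and the sample matrices are illustrative assumptions, not part of the original work. Because every hard partition is also fuzzy, and every fuzzy partition is also possibilistic, the function reports the most restrictive type that holds.

```python
import numpy as np

def partition_type(U, tol=1e-9):
    """Return the most restrictive c-partition type that U satisfies.

    U is a c x N array of memberships; the checks follow definitions (1)-(3).
    """
    U = np.asarray(U, dtype=float)
    n_points = U.shape[1]
    row_sums = U.sum(axis=1)
    # Shared by (1)-(3): entries in [0, 1] and 0 < sum_k mu_ik < N for every group i.
    in_unit_interval = bool(np.all((U >= -tol) & (U <= 1.0 + tol)))
    groups_ok = bool(np.all((row_sums > 0.0) & (row_sums < n_points)))
    if not (in_unit_interval and groups_ok):
        return "none"
    # (1) and (2): the memberships of each point sum to one over the groups.
    sums_to_one = bool(np.allclose(U.sum(axis=0), 1.0, atol=tol))
    is_binary = bool(np.all((np.abs(U) <= tol) | (np.abs(U - 1.0) <= tol)))
    if sums_to_one and is_binary:
        return "hard"
    if sums_to_one:
        return "fuzzy"
    # (3): every point has a positive membership in at least one group.
    if np.all(U.max(axis=0) > 0.0):
        return "possibilistic"
    return "none"

# Hypothetical 2 x 3 membership matrices (2 groups, 3 points).
hard = [[1, 0, 1], [0, 1, 0]]
fuzzy = [[0.7, 0.2, 0.5], [0.3, 0.8, 0.5]]
possibilistic = [[0.9, 0.1, 0.6], [0.4, 0.8, 0.7]]
print(partition_type(hard), partition_type(fuzzy), partition_type(possibilistic))
```

In the third matrix the columns no longer sum to one, which is exactly the relaxation that separates (3) from (2).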

#### **3.1 Fuzzy c-Means algorithm**

4 Environmental Monitoring

| Variable | Stations marked √ (measured), out of 3 |
|---|---|
| **Pollutants** | |
| Ozone (*O*<sub>3</sub>) | 3 |
| Sulfur Dioxide (*SO*<sub>2</sub>) | 3 |
| Carbon Monoxide (*CO*) | 3 |
| Nitrogen Dioxide (*NO<sub>x</sub>*) | 3 |
| Particulate Matter less than 10 micrometers in diameter (*PM*<sub>10</sub>) | 2 |
| **Meteorological variables** | |
| Wind Direction (WD) | 3 |
| Wind Speed (WS) | 3 |
| Temperature (T) | 2 |
| Relative Humidity (RH) | 2 |
| Barometric Pressure (BP) | 2 |
| Solar Radiation (SR) | 2 |

Table 1. Pollutant concentrations and meteorological variables recorded in the monitoring stations (Cruz Roja, Nativitas, and DIF).

The Fuzzy *c*-Means (FCM) clustering algorithm was initially developed by Dunn (1973) and later generalized by Bezdek (1981). This algorithm is based on the optimization of the objective function given by (4),

$$J_{fcm}(\mathbf{Z}; \mathbf{U}, \mathbf{V}) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m \|z_k - v_i\|^2, \tag{4}$$

where the membership matrix $U = [\mu_{ik}] \in M_{fcm}$ is a fuzzy *c*-partition of the space where $Z$ is defined, $V = [v_1, v_2, \ldots, v_c]$ is the vector of prototypes of the $c$ groups, $D_{ikA_i}^2 = \|z_k - v_i\|^2$ is a squared inner-product distance norm, and $m \in [1, \infty)$ is a weighting exponent that determines the fuzziness of the partition. The optimal *c*-partition of the Fuzzy *c*-Means algorithm is reached through the pair $(U^*, V^*)$ that locally minimizes the objective function $J_{fcm}$ by *alternating optimization* (AO).

*Theorem* FCM (Bezdek, 1981): If $D_{ikA_i} = \|z_k - v_i\| > 0$ for every $i$ and $k$, $m > 1$, and $Z$ contains at least $c$ distinct data points, then $(U, V) \in M_{fcm} \times \mathfrak{R}^{c \times N}$ may minimize $J_{fcm}$ only if

$$\mu_{ik} = \left(\sum_{j=1}^{c} \left(\frac{D_{ikA_i}}{D_{jkA_i}}\right)^{2/(m-1)}\right)^{-1}, \quad 1 \le i \le c; \quad 1 \le k \le N; \tag{5}$$

$$v_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m z_k}{\sum_{k=1}^{N} \mu_{ik}^m}, \quad 1 \le i \le c. \tag{6}$$
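As a minimal sketch of the necessary conditions (5) and (6), one alternating-optimization sweep can be written in Python with NumPy. The sketch assumes the Euclidean norm (that is, $A_i$ equal to the identity matrix) and $m = 2$; the function names and the six sample points are illustrative assumptions rather than data from Salamanca.

```python
import numpy as np

def fcm_memberships(Z, V, m=2.0):
    """Equation (5) with Euclidean distances D_ik = ||z_k - v_i||."""
    # D has shape (c, N); the theorem assumes every distance is strictly positive.
    D = np.linalg.norm(Z[None, :, :] - V[:, None, :], axis=2)
    # ratio[i, j, k] = (D_ik / D_jk) ** (2 / (m - 1)); sum over j as in (5).
    ratio = (D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

def fcm_prototypes(Z, U, m=2.0):
    """Equation (6): prototypes as membership-weighted means of the data."""
    W = U ** m
    return (W @ Z) / W.sum(axis=1, keepdims=True)

def fcm_objective(Z, U, V, m=2.0):
    """Equation (4)."""
    D2 = ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * D2).sum())

# Hypothetical data: two small groups of points in the plane.
Z = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
V = np.array([[1.0, 1.0], [4.0, 4.0]])   # arbitrary initial prototypes

U = fcm_memberships(Z, V)
J_before = fcm_objective(Z, U, V)
V_new = fcm_prototypes(Z, U)
J_after = fcm_objective(Z, U, V_new)
print(J_before, J_after)
```

For a fixed $U$, the weighted means in (6) minimize (4) over the prototypes, so the objective value after the prototype update cannot exceed the value before it.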

Based on the previous equations, the FCM solution can be reached with the following steps:

#### *FCM-AO-V*

Given the data set $Z$, choose the number of clusters $1 < c < N$, the weighting exponent $m > 1$, and the termination tolerance $\delta > 0$.


**I** Provide an initial value for each of the prototypes $v_i$, $i = 1, \ldots, c$. These values are generally chosen at random.

**II** Calculate the distance from $z_k$ to each of the prototypes $v_i$, using $D_{ikA_i}^2 = (z_k - v_i)^T A_i (z_k - v_i)$, $1 \le i \le c$, $1 \le k \le N$.

**III** Calculate the membership values of the matrix $U = [\mu_{ik}]$, if $D_{ikA_i} > 0$, using equation (5).

**IV** Update the prototypes $v_i$ using equation (6).

**V** Verify whether the error is less than or equal to $\delta$,

$$\|V_{k+1} - V_k\|_{err} \le \delta.$$

If this is true, stop; otherwise, go to step **II**.

The FCM calculates a membership value $\mu_{ik}$ for each point $z_k$ as a function of all the prototypes $v_i$. The membership values of $z_k$ in the $c$ groups must sum to one. However, a problem arises when several points are equidistant from the prototypes of the groups, because the FCM cannot detect noise points or distinguish the points nearest to the prototypes from those furthest away. Pal et al. (2004) show an example with two points located on the boundary between two groups, one near the prototypes and the other far away from them. This case must be handled with care, since the two points are not *equally representative* of the groups even though they have the same membership values. One way to overcome this drawback is to use a possibilistic algorithm.

#### **3.2 Possibilistic c-Means algorithm**

The Possibilistic *c*-Means (PCM) clustering algorithm (Krishnapuram & Keller, 1993) is based on *typicality values* and relaxes the FCM constraint that the membership values of a point in all the $c$ groups must sum to one. Thus, the PCM measures the similarity of each data point to a single prototype $v_i$ through a typicality value in [0, 1]. The data points nearest to the prototypes are considered *typical*, those further away are *atypical*, and data points with zero, or almost zero, typicality values are considered *noise* (Ojeda-Magaña et al., 2009a). The objective function $J_{pcm}$ proposed by Krishnapuram & Keller (1993) for this algorithm is given by

$$J_{pcm}(\mathbf{Z}; \mathbf{T}, \mathbf{V}, \gamma) = \sum_{k=1}^{N} \sum_{i=1}^{c} (t_{ik})^m \|z_k - v_i\|_A^2 + \sum_{i=1}^{c} \gamma_i \sum_{k=1}^{N} (1 - t_{ik})^m, \tag{7}$$

where

$$T \in M_{pcm}, \qquad \gamma_i > 0, \quad 1 \le i \le c. \tag{8}$$

The first term of $J_{pcm}$ is identical to the FCM objective function, which is based on the distances of the points to the prototypes. The second term, which includes a penalty $\gamma_i$, tries to bring $t_{ik}$ toward 1.

*Theorem* PCM (Krishnapuram & Keller, 1993): If $\gamma_i > 0$, $1 \le i \le c$, $m > 1$, and $Z$ has at least $c$ distinct data points, then $(T, V) \in M_{pcm} \times \mathfrak{R}^{c \times N}$ may minimize $J_{pcm}$ only if

$$t_{ik} = \frac{1}{1 + \left(\frac{\|z_k - v_i\|^2}{\gamma_i}\right)^{1/(m-1)}}, \quad 1 \le i \le c; \quad 1 \le k \le N; \tag{9}$$

$$v_i = \frac{\sum_{k=1}^{N} t_{ik}^m z_k}{\sum_{k=1}^{N} t_{ik}^m}, \quad 1 \le i \le c. \tag{10}$$
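The difference between memberships and typicalities can be illustrated with the kind of boundary case attributed to Pal et al. (2004): two points equidistant from two prototypes, one close to them and one far away. The sketch below assumes Euclidean distances, $m = 2$, and a hypothetical penalty $\gamma_i = 1$ for both groups; the coordinates are made up for illustration.

```python
import numpy as np

def memberships(Z, V, m=2.0):
    """FCM memberships, equation (5), with Euclidean distances."""
    D = np.linalg.norm(Z[None, :, :] - V[:, None, :], axis=2)
    ratio = (D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

def typicalities(Z, V, gamma, m=2.0):
    """PCM typicalities, equation (9)."""
    D2 = ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + (D2 / gamma[:, None]) ** (1.0 / (m - 1.0)))

V = np.array([[-1.0, 0.0], [1.0, 0.0]])    # two prototypes
Z = np.array([[0.0, 0.5], [0.0, 10.0]])    # near and far boundary points
gamma = np.array([1.0, 1.0])               # assumed penalties, not from equation (11)

U = memberships(Z, V)
T = typicalities(Z, V, gamma)
print(U)   # both points receive membership 0.5 in each group
print(T)   # the far point receives a much smaller typicality than the near one
```

Both points are indistinguishable in $U$, whereas $T$ separates the representative point from the distant one, which is the property the PCM exploits to flag atypical points and noise.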

Krishnapuram & Keller (1993; 1996) recommend applying the FCM first, so that the initial values of the PCM algorithm can be estimated. They also suggest computing the penalty $\gamma_i$ with equation (11),

$$\gamma_i = K \frac{\sum_{k=1}^{N} \mu_{ik}^m \|z_k - v_i\|_A^2}{\sum_{k=1}^{N} \mu_{ik}^m}, \tag{11}$$

where $K > 0$, although the most common value is $K = 1$, and the membership values $\{\mu_{ik}\}$ are those calculated with the FCM algorithm in order to reduce the influence of noise.

The PCM algorithm is very sensitive to the $\{\gamma_i\}$ values, and the typicality values depend directly on them. For example, if the value of $\gamma_i$ is small, the typicality values $t_{ik}$ of $T$ are also small, whereas if the value of $\gamma_i$ is high, the $t_{ik}$ are also high. In this work, the $\{\gamma_i\}$ values are obtained from equation (11).
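The procedure described above (run the FCM, derive the penalties from (11), then iterate the PCM conditions (9) and (10)) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: it assumes Euclidean distances, $m = 2$, $K = 1$, a deterministic choice of initial prototypes instead of a random one, and synthetic data (two compact groups plus one distant point) in place of the Salamanca records. All function names are hypothetical.

```python
import numpy as np

def sq_dist(Z, V):
    """Squared Euclidean distances ||z_k - v_i||^2, shape (c, N)."""
    return ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)

def weighted_means(Z, W):
    """Prototype update shared by equations (6) and (10)."""
    return (W @ Z) / W.sum(axis=1, keepdims=True)

def fcm_memberships(Z, V, m):
    """Equation (5), written with squared distances; tiny distances are clipped."""
    inv = np.maximum(sq_dist(Z, V), 1e-12) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def fcm(Z, V0, m=2.0, delta=1e-6, max_iter=300):
    """Alternate (5) and (6) until the prototypes move by at most delta."""
    V = np.array(V0, dtype=float)
    for _ in range(max_iter):
        V_new = weighted_means(Z, fcm_memberships(Z, V, m) ** m)
        shift = np.linalg.norm(V_new - V)
        V = V_new
        if shift <= delta:
            break
    return fcm_memberships(Z, V, m), V

def penalties(Z, U, V, m=2.0, K=1.0):
    """Equation (11), using the FCM memberships U."""
    W = U ** m
    return K * (W * sq_dist(Z, V)).sum(axis=1) / W.sum(axis=1)

def typicalities(Z, V, gamma, m):
    """Equation (9)."""
    return 1.0 / (1.0 + (sq_dist(Z, V) / gamma[:, None]) ** (1.0 / (m - 1.0)))

def pcm(Z, V0, gamma, m=2.0, delta=1e-6, max_iter=300):
    """Alternate (9) and (10), starting from the FCM prototypes V0."""
    V = np.array(V0, dtype=float)
    for _ in range(max_iter):
        V_new = weighted_means(Z, typicalities(Z, V, gamma, m) ** m)
        shift = np.linalg.norm(V_new - V)
        V = V_new
        if shift <= delta:
            break
    return typicalities(Z, V, gamma, m), V

# Synthetic data: two groups of 30 points and one distant point in the last row.
rng = np.random.default_rng(0)
Z = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(30, 2)),
    rng.normal([5.0, 5.0], 0.3, size=(30, 2)),
    [[20.0, -20.0]],
])

U, V_fcm = fcm(Z, V0=Z[[0, 30]])   # one initial prototype taken from each group
gamma = penalties(Z, U, V_fcm)
T, V_pcm = pcm(Z, V_fcm, gamma)

print("FCM memberships of the distant point:", U[:, -1])
print("PCM typicalities of the distant point:", T[:, -1])
```

The distant point still receives memberships that sum to one under the FCM, while its typicalities in both groups stay close to zero, so it can be treated as noise. Changing `K` in `penalties` rescales every $\gamma_i$ and therefore every typicality, which is the sensitivity noted in the text.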

The original PCM algorithm has a known problem: the prototypes of different groups sometimes coincide (Hoppener et al., 2000), even if the natural structure of the data has well delimited, different groups. In order to avoid this, Timm *et al.* (Timm et al., 2004; Timm & Kruse, 2002) modified the objective function to include a constraint based on the repulsion among groups, thus avoiding identical groups when they must be different.

The objective of the fuzzy clustering algorithms is to find an internal structure in a numerical data set, dividing it into *n* different subgroups, where the members of each subgroup have a high similarity with its prototype (centroid, cluster center, signature, template, code vector) and a high dissimilarity with the prototypes of the other subgroups. This justifies the existence of each one of the subgroups (Andina & Pham, 2007).

A simplified representation of a numerical data set as *n* subgroups helps us to get a better comprehension and knowledge of the data set (Barron-Adame et al., 2007). Besides, the partitional clustering algorithms (hard, fuzzy, probabilistic or possibilistic) provide, after a learning process, a set of prototypes as the most representative elements of each subgroup.

Ruspini was the first to use fuzzy sets for clustering (Ruspini, 1970). After that, Dunn (1973) developed the first fuzzy clustering algorithm, named Fuzzy *c*-Means (FCM), with a fuzziness parameter *m* equal to 2. Later on, Bezdek (1981) generalized this algorithm. The FCM is an algorithm where the membership degree of each point to each fuzzy set *Ai* is calculated according to its prototype. The sum of the membership degrees of each individual point over all the fuzzy sets must be equal to one.

Krishnapuram & Keller (1993) developed the Possibilistic *c*-Means (PCM) clustering algorithm, whose principal characteristic is the relaxation of the restriction that gives the relative typicality property of the FCM. The PCM provides a similarity degree between the data points and each one of the prototypes, a value known as absolute typicality or simply typicality (Pal et al., 1997). So, the points nearest to a prototype are identified as typical, whereas the furthest points are identified as atypical, or noise (Ojeda-Magaña et al., 2009a; 2009b).


#### **3.3 PFCM clustering algorithm**

Pal *et al.* (Pal et al., 1997) proposed to use the membership degrees as well as the typicality values, looking for a better clustering algorithm. They called it *Fuzzy Possibilistic c-Means* (FPCM). However, the constraint that forces the typicality values to sum to one was the origin of a problem, particularly when the algorithm uses a lot of data. In order to avoid this problem, Pal *et al.* (Pal et al., 2005) proposed to relax this constraint and developed the PFCM clustering algorithm, where the function to be optimized is given by (12)

$$J\_{pfcm}(\mathbf{Z}; \mathbf{U}, \mathbf{T}, \mathbf{V}) = \sum\_{i=1}^{c} \sum\_{k=1}^{N} (a\mu\_{ik}^{m} + b t\_{ik}^{\eta}) \times \|z\_{k} - v\_{i}\|^2 + \sum\_{i=1}^{c} \gamma\_{i} \sum\_{k=1}^{N} (1 - t\_{ik})^{\eta}, \tag{12}$$

and subject to the constraints ∑<sub>*i*=1</sub><sup>*c*</sup> *μik* = 1 ∀*k*; 0 ≤ *μik*, *tik* ≤ 1 and the constants *a* > 0, *b* > 0, *m* > 1 and *η* > 1. The parameters *a* and *b* define the relative importance of the membership degrees and the typicality values. The parameter *μik* in (12) has the same meaning as in the FCM. The same happens for the *tik* values with respect to the PCM algorithm.

*Theorem* PFCM (Pal et al., 2005): If *DikA* = ‖*zk* − *vi*‖ > 0 for every *i*, *k*; *m*, *η* > 1; and Z contains at least *c* different patterns, then (*U*, *T*, *V*) ∈ *Mfcm* × *Mpcm* × ℝ<sup>*p*</sup> and *Jpfcm* can be minimized if and only if

$$
\mu\_{ik} = \left(\sum\_{j=1}^{c} \left(\frac{D\_{ikA\_i}}{D\_{jkA\_i}}\right)^{2/(m-1)}\right)^{-1} \tag{13}
$$

$$1 \le i \le c; \quad 1 \le k \le N$$

$$
t\_{ik} = \frac{1}{1 + \left(\frac{b}{\gamma\_i} D\_{ikA\_i}^2\right)^{1/(\eta-1)}} \tag{14}
$$

$$1 \le i \le c; \quad 1 \le k \le N$$

$$v\_i = \sum\_{k=1}^{N} (a\mu\_{ik}^m + bt\_{ik}^{\eta})z\_k \Big/ \sum\_{k=1}^{N} (a\mu\_{ik}^m + bt\_{ik}^{\eta}), \tag{15}$$

$$1 \le i \le c.$$

The membership degrees are calculated with equation (13), the typicality values with equation (14) and the prototypes with equation (15).

The iterative process of this algorithm follows the next steps:

#### *PFCM-AO-V*

Given the data set *Z*, choose the number of clusters 1 < *c* < *N*, the weighting exponents *m* > 1, *η* > 1, and the values of the constants *a* > 0 and *b* > 0.

**I** Provide an initial value to each one of the prototypes *vi*, *i* = 1, ..., *c*. These values are generally given in a random way.

**II** Run the FCM-AO-V algorithm.

**III** With these results, calculate the penalty parameter *γi* for each cluster *i*. Take *K* = 1.

**IV** Calculate the distance of *zk* to each one of the prototypes *vi* using *D*<sup>2</sup><sub>*ikAi*</sub> = (*zk* − *vi*)<sup>*T*</sup> *Ai* (*zk* − *vi*), 1 ≤ *i* ≤ *c*, 1 ≤ *k* ≤ *N*.

**V** Calculate the membership values of the matrix *U* = [*μik*]; if *Dik***<sub>A</sub>** > 0, use equation (13).

**VI** Calculate the typicality values of the matrix *T* = [*tik*]; if *Dik***<sub>A</sub>** > 0, use equation (14).

**VII** Update the value of the prototypes *vi* using equation (15).

**VIII** Verify if the error is equal to or lower than *δ*,

$$\|V\_{k+1} - V\_k\|\_{err} \le \delta,$$

if this is true, stop. Else, go to step **IV**.
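The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the norm is Euclidean (every *Ai* is the identity), the initial prototypes are sampled from the data, the error is the Frobenius norm of the prototype change, the typicalities in the prototype update are raised to *η* as in the objective (12), and the names `sq_dist`, `fcm`, `pfcm`, `max_iter` and `seed` are hypothetical.

```python
import numpy as np

def sq_dist(Z, V):
    """D2[i, k] = ||z_k - v_i||^2 (assumes A_i = identity)."""
    return ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)

def fcm(Z, V, m=2.0, delta=1e-6, max_iter=200):
    """Plain FCM, used here only for step II."""
    for _ in range(max_iter):
        D2 = np.maximum(sq_dist(Z, V), 1e-12)       # avoid division by zero
        W = D2 ** (-1.0 / (m - 1.0))
        U = W / W.sum(axis=0)                       # memberships, equation (13)
        Um = U ** m
        V_new = Um @ Z / Um.sum(axis=1, keepdims=True)
        done = np.linalg.norm(V_new - V) <= delta
        V = V_new
        if done:
            break
    return U, V

def pfcm(Z, c, a=1.0, b=5.0, m=2.0, eta=2.0, K=1.0,
         delta=1e-6, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    V = Z[rng.choice(len(Z), size=c, replace=False)]    # step I: random prototypes
    U, V = fcm(Z, V, m, delta, max_iter)                # step II
    Um = U ** m
    gamma = K * (Um * sq_dist(Z, V)).sum(axis=1) / Um.sum(axis=1)  # step III, eq. (11)
    for _ in range(max_iter):
        D2 = np.maximum(sq_dist(Z, V), 1e-12)           # step IV
        W = D2 ** (-1.0 / (m - 1.0))
        U = W / W.sum(axis=0)                           # step V, eq. (13)
        T = 1.0 / (1.0 + (b * D2 / gamma[:, None]) ** (1.0 / (eta - 1.0)))  # step VI, eq. (14)
        Wt = a * U ** m + b * T ** eta
        V_new = Wt @ Z / Wt.sum(axis=1, keepdims=True)  # step VII, eq. (15)
        done = np.linalg.norm(V_new - V) <= delta       # step VIII
        V = V_new
        if done:
            break
    return U, T, V
```

A call such as `U, T, V = pfcm(Z, c=2)` on an `N x 2` array `Z` returns the membership matrix, the typicality matrix and the prototypes; the defaults *a* = 1, *b* = 5, *m* = 2 and *η* = 2 mirror the values reported in the Results section.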

#### **3.4 PFCM clustering algorithm in the AEMN**

As is known, partition clustering algorithms require a minimum of two groups. However, in our problem we only have one group, formed by the patterns [*SO*2;*PM*10] of pollutant concentrations. Therefore, a synthetic cloud of patterns is proposed, with the following covariance matrix and vector of centers:

$$\Sigma\_1 = \begin{bmatrix} 400 & 0 \\ 0 & 400 \end{bmatrix}, \qquad v\_1 = \begin{bmatrix} 100 \\ -600 \end{bmatrix}.$$

In this case, the number of patterns (4320) is the same in the synthetic cloud and the pollutant concentration.

Fig. 2. Air pollution and synthetic cloud patterns.

Fig. 2 clearly shows the synthetic cloud (located in the lower part) and the pollutant concentration patterns (located in the upper part). Once the groups are identified, we apply the PFCM clustering algorithm.
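As a rough illustration of how such a synthetic cloud could be generated, the sketch below draws as many points as there are pollutant patterns from a bivariate normal distribution with the center and covariance given above. The text does not state which distribution was used, so the Gaussian choice is an assumption; the function name `add_synthetic_cloud`, the seed and the placeholder `patterns` array are hypothetical and do not contain real measurements.

```python
import numpy as np

def add_synthetic_cloud(patterns, center=(100.0, -600.0), variance=400.0, seed=0):
    """Stack the [SO2, PM10] patterns with an equally sized synthetic cloud.

    Assumes the cloud is Gaussian with covariance variance * I (here 400 * I).
    """
    rng = np.random.default_rng(seed)
    cov = variance * np.eye(2)
    cloud = rng.multivariate_normal(center, cov, size=len(patterns))
    return np.vstack([patterns, cloud]), cloud

# Placeholder for the 4320 measured patterns (zeros, not real data).
patterns = np.zeros((4320, 2))
Z, cloud = add_synthetic_cloud(patterns)   # Z has 8640 rows: data first, cloud second
```

Because the cloud is centered far below the physically possible concentrations (negative PM10 values), it forms a second, clearly separated group, which is what the partition algorithm needs.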


| | *SO*<sub>2</sub> | *PM*<sub>10</sub> |
|---|---|---|
| *SO*<sub>2</sub> | 1 | 0.0731 |
| *PM*<sub>10</sub> | 0.0731 | 1 |
| *WS* | 0.4756 | -0.1385 |
| *WD* | -0.6151 | 0.1478 |
| *T* | -0.0329 | -0.0007 |
| *RH* | -0.0322 | -0.4416 |
| *BP* | 0.1462 | 0.1806 |
| *SR* | -0.021 | -0.1207 |

Table 2. Correlation Coefficient between pollutant concentration and meteorological variables.

#### **3.5 Correlation coefficient**

The correlation coefficient *r* (also called Pearson's product moment correlation, after Karl Pearson; Pérez et al., 2000) is used to determine the strength and direction of the relationship between two variables. This form of correlation requires that both variables are normally distributed interval or ratio variables. The correlation coefficient is calculated by eq. (16):

$$r = \frac{n\sum x\_i y\_i - (\sum x\_i)(\sum y\_i)}{\sqrt{n(\sum x\_i^2) - (\sum x\_i)^2}\sqrt{n(\sum y\_i^2) - (\sum y\_i)^2}}\tag{16}$$

where *n* is the number of data points. The numerical values of the correlation coefficient range from +1 to -1. If two variables move exactly together, the value of the correlation coefficient is 1. This indicates perfect positive correlation. If two variables move exactly opposite to each other, the value of the correlation coefficient is -1. Low numerical values, such as -0.10 or +0.15, indicate little relationship between the two variables.
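Equation (16) translates directly into a few lines of NumPy. The sketch below is only an illustration; the function name `pearson_r` and the short example series are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, term by term as in equation (16)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = (np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
                   * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2))
    return numerator / denominator

# Illustrative series: y_up moves exactly with x, y_down exactly against it.
x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]
y_down = [8.0, 6.0, 4.0, 2.0]
r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
```

The same value can be cross-checked with `np.corrcoef(x, y)[0, 1]`, which computes the coefficient from centered data.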

#### **4. Results**

Fig. 3 shows the distribution of pollutant patterns [*SO*2;*PM*10] at the three monitoring stations (CR, DF and NA). The mesh in Fig. 3 corresponds to the thresholds established by the program to improve the air quality in Salamanca (*ProAire*) (INE, 2004). The thresholds are Pre-contingency, Phase-I contingency and Phase-II contingency. For example, for daily average *SO*<sub>2</sub> concentrations equal to or greater than 145 *ppb* and lower than 225 *ppb*, a level of environmental pre-contingency is declared. Therefore, the spaces between lines in the mesh represent the levels of environmental contingency for *SO*<sub>2</sub> and *PM*<sub>10</sub> concentrations.

In Fig. 3 each symbol (\*, • and �) represents the pollutant patterns at one monitoring station. At the Nativitas monitoring station we observe that the highest *PM*<sub>10</sub> and *SO*<sub>2</sub> pollutant concentrations are not present at the same time. On the other hand, at the Cruz Roja monitoring station we observe that either *SO*<sub>2</sub> or *PM*<sub>10</sub> pollutant concentrations are highest. At the DIF monitoring station we observe the highest *PM*<sub>10</sub> concentrations in the AEMN network.

Fig. 3. Monitoring Network per minute. Plot title: Monitoring station data; axes: *SO*<sub>2</sub> concentration (ppb) and *PM*<sub>10</sub> concentration (μg/m<sup>3</sup>); series: CR, DF and NA.

The main proposal in this work is to apply the PFCM clustering algorithm to the AEMN in Salamanca, as well as to integrate the pollutant measures from the three monitoring stations.

The PFCM initial parameters (*a*, *b*, *m* and *η*) are very important in order to reduce the effects of outliers on the prototypes. Pal *et al.* (Pal et al., 2005) recommend a value of the *b* parameter larger than that of the *a* parameter in order to reduce these effects. On the other hand, a small value for *η* and a value greater than 1 for *m* are recommended. Nevertheless, choosing too high a value of *m* reduces the effect of the membership of data to the clusters, and the algorithm behaves as a simple PCM.
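A small numerical sketch can make the role of *a* and *b* more concrete. In the prototype update, each point is weighted by *a μ<sup>m</sup>* + *b t<sup>η</sup>*; the code below compares the weight of a hypothetical outlier with that of a hypothetical typical point for two choices of *b*. The membership and typicality values are invented for illustration and are held fixed, which is a simplification because the typicalities themselves depend on *b* through equation (14).

```python
def prototype_weight(mu, t, a, b, m=2.0, eta=2.0):
    """Weight a*mu^m + b*t^eta that a point receives in the prototype update."""
    return a * mu ** m + b * t ** eta

def outlier_share(a, b, mu_typ=1.0, t_typ=0.9, mu_out=0.5, t_out=0.01):
    """Ratio of the outlier's weight to a typical point's weight.

    The default memberships and typicalities are hypothetical: an outlier
    halfway between two prototypes keeps a membership of 0.5 but has a
    typicality close to zero.
    """
    w_out = prototype_weight(mu_out, t_out, a, b)
    w_typ = prototype_weight(mu_typ, t_typ, a, b)
    return w_out / w_typ

share_equal = outlier_share(a=1.0, b=1.0)   # a and b weighted equally
share_large_b = outlier_share(a=1.0, b=5.0) # b larger than a
```

Under these assumed values, the outlier's relative weight shrinks when *b* grows with respect to *a*, which is consistent with the recommendation above.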

Taking into account the previous recommendations, the initial parameters for the PFCM clustering algorithm were set as follows: *a* = 1, *b* = 5, *m* = 2 and *η* = 2. The prototypes found are shown in Fig. 4(a) and (b).

Fig. 4(a) presents the daily averages of *SO*<sub>2</sub> concentrations for each monitoring station together with the corresponding prototypes. The Cruz Roja monitoring station registers the highest *SO*<sub>2</sub> concentrations, which is due to its location near the refinery. The prototypes in this case were very low in comparison with the observed *SO*<sub>2</sub> concentrations, because only one station (Cruz Roja) observed high *SO*<sub>2</sub> concentrations. According to the analyzed patterns, the emitted pollutant is only measured by the Cruz Roja monitoring station (see Fig. 4).

Fig. 4(b) shows the daily averages of *PM*<sub>10</sub> concentrations and the resulting prototypes. In this case, the observed averages are very similar at the three monitoring stations. The *PM*<sub>10</sub> pollutant dispersion is more uniform than the *SO*<sub>2</sub> pollutant dispersion in the city.

Table 2 shows the correlation results between the *SO*<sub>2</sub> and *PM*<sub>10</sub> pollutants and the meteorological variables. The database used in the correlation analysis corresponds to the year 2004 at Nativitas. This period was chosen because it contains more meteorological records. The *SO*<sub>2</sub> correlation coefficients show a high positive correlation between the *SO*<sub>2</sub> pollutant and Wind Speed (0.4756), and a high negative correlation between the *SO*<sub>2</sub> pollutant and Wind Direction (-0.6151). The other meteorological variables have no appreciable impact. For the *PM*<sub>10</sub> pollutant, the meteorological variable with the most impact is the Relative Humidity (-0.4416): when the Relative Humidity increases, the pollutant concentration decreases. The *PM*<sub>10</sub> particles are caught and fall to the ground during rain.



Fig. 4. Comparison between air pollutant averages and estimated prototypes: (a) *SO*<sub>2</sub> concentration (ppb) and (b) *PM*<sub>10</sub> concentration (μg/m<sup>3</sup>) versus number of days, for CR, DF, NA and the prototype.

#### **5. Conclusions**

Nowadays, there is a program to improve the air quality in the city of Salamanca, Mexico. This program has established thresholds for several levels of contingency depending on the *SO*<sub>2</sub> and *PM*<sub>10</sub> pollutant concentrations. However, a particular level of contingency for the city is declared taking into account the highest pollutant concentration provided by one of the three monitoring stations. For example, if a pollutant concentration exceeds a given threshold in a single monitoring station, the alarm of contingency applies to the whole city. This value is normally provided by the Cruz Roja station, due to its proximity to the refinery and power generation industries.

Looking for local and general contingency levels in the city, we have proposed to estimate a set of prototypes such that they can represent a calculated measure of pollutant concentrations according to the values measured in the three fixed stations. In such a way, a local alarm of contingency can be activated in the area of impact of the pollution depending on each station, and a general alarm of contingency according to the values provided by the prototypes. Nevertheless, the last case requires adjusting the thresholds, as the actual values would only be used for local contingency because they depend on the measured values of pollutant concentrations, and the general contingency requires thresholds as a function of calculated values.

#### **6. References**

Andina, D. & Pham, D. T. (2007). *Computational Intelligence*, Springer.

Barron-Adame, J. M., Herrera-Delgado, J. A., Cortina-Januchs, M. G., Andina, D. & Vega-Corona, A. (2007). Air pollutant level estimation applying a self-organizing neural network, *Proceedings of the 2nd international work-conference on Nature Inspired Problem-Solving Methods in Knowledge Engineering. IWINAC-07*, pp. 599–607.

Bezdek, J. C. (1981). *Pattern Recognition With Fuzzy Objective Function Algorithms*, Kluwer Academic.

Bezdek, J. C., Keller, J., Krishnapuram, R. & Pal, N. R. (1999). *Fuzzy Models and Algorithms for Pattern Recognition and Image Processing*, first edn, Boston, London.

Celik, M. B. & Kadi, I. (2007). The relation between meteorological factors and pollutants concentration in karabuk city, *G.U. Journal of science* 20(4): 87–95.

Cortina-Januchs, M. G., Barron-Adame, J. M., Vega-Corona, A. & Andina, D. (2009). Prevision of industrial so2 pollutant concentration applying anns, *Proceedings of The 7th IEEE International Conference on Industrial Informatics (INDIN 09)*, pp. 510–515.

Dunn, J. (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters, *Journal of Cybernetics* 3(3): 32–57.

EPA (2008). Air quality and health, chapter Environmental Protection Agency, National Ambient Air Quality Standards (NAAQS).

Fenger, J. (2009). Air pollution in the last 50 years - from local to global, *Journal of Atmospheric Environment* 43(1): 13–22.

Hoppener, F., Klawonn, F., Kruse, R. & Runkler, T. (2000). *Fuzzy Cluster Analysis, Methods for classification, data analysis and image recognition*, Chichester, United Kingdom.

INE (2004). *Programa para mejorar la calidad del aire en Salamanca*, 2 edn, Instituto de Ecología del Estado de Guanajuato, Calle Aldana N.12, Col. Pueblito de Rocha, 36040 Guanajuato, Gto.


**5**

**Real-Time In Situ Measurements of Industrial Hazardous Gas Concentrations and Their Emission Gross**

F.Z. Dong et al.\*

*Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Science Island, Hefei, P. R. China*

\* W.Q. Liu, Y.N. Chu, J.Q. Li, Z.R. Zhang, Y. Wang, T. Pang, B. Wu, G.J. Tu, H. Xia, Y. Yang, C.Y. Shen, Y.J. Wang, Z.B. Ni and J.G. Liu, *Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Science Island, Hefei, P. R. China.*

#### **1. Introduction**

Over the past few decades environmental protection has been of great worldwide concern due to global warming and air quality deterioration, particularly in fast developing countries like China and India (Platt, 1980; Edner, 1991; Sigrist, 1995; Culshaw, 1998; Fried, 1998; Linnerud, 1998; Weibring, 1998; Nelson, 2002; Liu, 2002; Christian, 2003 & 2004; Taslakov, 2006; de Gouw, 2007; Karl, 2007 & 2009; http://www.cnemc.cn). This has resulted in large demands and tremendous efforts for new technology developments to monitor and control industrial gas pollution (Lindinger, 1998; Dong, 2005; Kan, 2006 & 2007; Wang Y.J., 2009; Wang F., 2010; Xia, 2010; Zhang, 2011). CO2, CO, NH3, H2S, HF, HCl, and volatile organic compounds (VOCs) are very important gases generated in many industrial processes; therefore, on-line monitoring of these industrially emitted gases is a key factor for industrial process control. Furthermore, if one can simultaneously measure the path-averaged gas flow velocity and the gas concentrations in a smokestack, all the industrial emissions from the targeted smokestack can be obtained in real time. This could be very beneficial to the administrative implementation of global environmental protection policy on the reduction of gas pollution and environmental management.

Tunable diode laser absorption spectroscopy (TDLAS) is a technology with the advantages of high sensitivity, high selectivity and fast response. It has been widely used in the applications of green-house measurements (Feher, 1995; Nadezhdinskii, 1999; Kan, 2006), hazardous gas leakage detection (May, 1989; Uehara, 1992; Iseki, 2000 & 2004), industry process control (Linnerud, 1998; Deguchi, 2002) and combustion gas measurements (Zhou, 2005; Rieker, 2009). Proton transfer reaction-mass spectrometry (PTR-MS) is a relatively new technology first developed at the University of Innsbruck, Austria, in the 1990s (Hansel, 1995). PTR-MS has been found to be an extremely powerful and promising technology for online detection of VOCs at trace level (Smith, 2005; Jordan, 2009). Optical flow sensor (OFS-2000) based on the concept of optical scintillation to measure airflow velocity (Wang T.I., 1981;

