1. Introduction

A system is a set of interrelated components assembled to accomplish certain objectives or goals. The basic characteristics of a system are its boundaries, interfaces, inputs and outputs, and the methods by which outputs are produced from inputs. The environment of a system includes people, organizations, and other systems that supply data to or receive data from the system.

Problems are usually solved with the systems approach, which takes into account the goals, environment, and internal workings of the system. This method involves the following steps:


© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

iii. Select the best solution and determine whether the solution is working.

An information system (IS) consists of components such as hardware, software, databases, personnel, and procedures that managers can use to make better decisions and to control business operations. ISs are also used to document and monitor the operations of other systems, called target systems, which are a prerequisite for the existence of ISs. From an infrastructure point of view, an information system is an integration of diverse computers, displays and visualizations, databases, storage systems, instruments, sensors, etc. via software and networks to share data and to provide aggregate capabilities.

In business operations, the activities of an organization equipped with an IS are usually of three kinds: operational, tactical, and strategic planning. In this context, a strategy means the determination of the basic long-term goals and objectives of an enterprise and the adoption of courses of action and the allocation of resources necessary for achieving these goals. Operational tasks are the daily activities of the firm in consuming and acquiring resources. These daily transactions produce the basic data for the operational systems.

ISs that provide information for the efficient allocation of resources to achieve business objectives are known as tactical systems. Tactical systems provide middle-level managers with the information they need to monitor and control operational tasks and to allocate their resources effectively. The time frame for tactical activities may be monthly, quarterly, or yearly. Alternatively, ISs that support the strategic plans of the business are known as strategic planning systems. These systems are designed to provide top managers with information that assists them in making long-term planning decisions.

Both strategic planning information systems and tactical information systems may use the same data source, so the distinction between them is not always clear. For example, middle-level and top managers use budgeting information either to allocate resources or to plan long-term and short-term activities; budgeting then becomes a tactical decision activity or a strategic planning activity, respectively. Hence, the difference between the systems is attributed to whom and for what the budgeting data are used.

The top management of the organization carries out strategic planning based on the results of operational tasks, tactical systems, and related external information to decide whether to build new plants or facilities, develop new products, or invest in technology. To make these decisions, strategic planners have to address problems that involve long-range analysis and prediction. The time frame for strategic activities may be months or years.

Some basic business systems that serve the operational level of the organization and record the daily routine transactions necessary to the conduct of the business are called transaction processing systems (TPS). A system that monitors and controls physical processes is called a process control system (PCS). For example, a wastewater treatment plant uses electronic sensors linked to computers to monitor wastewater processes continually and to control the water quality [1]. Similarly, a petroleum refinery uses sensors and computers to monitor chemical processes and make real-time adjustments to the refining process. A process control system comprises the whole range of equipment, computer programs, and operating procedures [2].

A knowledge-based IS that supports the creation, organization, and dissemination of business knowledge to employees and managers throughout a company is called a knowledge management system. In this sense, knowledge management is the deployment of a comprehensive system that enhances the growth of knowledge. Expert systems are the category of artificial intelligence that has been used most successfully in building commercial applications. An expert system is also considered a knowledge-based system that provides expert advice and acts as an expert consultant to users.


A decision support system (DSS) is a computer-based system intended for use by a particular manager or a team of managers at any organizational level in making decisions in the process of solving semi-structured problems. A database management system and a user interface are major components of a DSS. The database consists of information related to production, market and marketing information, research data, financial transactions, and so forth.

The decision-maker must have suitable knowledge and skills in mining these DSS components to address the problems arising and make effective decisions. In traditional approaches to decision-making, scientific expertise together with statistical descriptions is usually needed to support decision-making. Recently, many innovative facilities have been proposed for the decision-making process in enterprises with huge databases, together with several heuristic models.

Management information systems (MIS) are a kind of computer IS that collects and processes information from different sources to support decision-making at the management level [3]. This level contains computer systems intended to assist operational management in monitoring and controlling the transaction processing activities that occur at the clerical level. An MIS provides information in prespecified formats to support business decision-making. The next level in the organizational hierarchy is occupied by low-level managers and supervisors. Therefore, an MIS takes internal data from the system and summarizes it into meaningful and useful forms, such as management reports, to support management activities and decision-making.

MISs encompass a complex and broad topic, which is why MIS boundaries need to be defined to reduce the difficulties of managing the system. Firstly, an MIS contains a vast number of related activities, so it is hard to review all of them; one may discuss a selected sample of activities, depending on the objectives and viewpoint of the researcher, or focus only on firm-level or smaller systems sufficient for the problems being addressed. Secondly, MISs can be defined and described within several frameworks, and only a few of these frameworks are used to discuss the important subject matters. Lastly, MISs are developed with a sense of how these systems have evolved, adapted, and been refined as new technologies have emerged, economic conditions have changed, etc.

To evaluate the performance of an MIS, its output data must be characterized by a set of basic features appropriate to the functions, objectives, and goals of the system. These output data need to be observed repeatedly to evaluate the extent to which the MIS is used to make successful decisions in the organization. Using these observations, methods of data mining from the rough set point of view, statistical analysis, etc. can be applied to evaluate the extent to which MISs are used to make effective decisions for planning purposes [4–7].

2. Evaluation of features and making decision rules

In mathematical modeling, an IS can be modeled by a sample Ω = {ω1, ω2, …, ωn} of n objects ωi, i = 1, 2, …, n. The ith object ωi is observed through instances of m conditional features f1, f2, …, fm, valued as fj(ωi), j = 1, 2, …, m. Additionally, a feature d, the so-called decision feature, characterizes a specific effect of ωi, denoted by d(ωi). In the case of s possible effects for a decision, d takes values d(ωi) = dk with k ∈ {1, 2, …, s}.

Let F = {f1, f2, …, fm}; then (Ω, F∪{d}) is a decision information table (DIT) with n = |Ω| objects, m = |F| conditional features, and a decision d. Objects ω and ω' are indiscernible if and only if the following binary relation RF on Ω with respect to (w.r.t.) F is satisfied:

$$\mathbf{R}\_{\mathbf{F}}:\ \mathbf{f}\_{\mathbf{j}}(\omega) = \mathbf{f}\_{\mathbf{j}}(\omega'),\quad \mathbf{j} = 1, 2, \dots, \mathbf{m} \tag{1}$$

This is an equivalence relation. The equivalence class of ω∈Ω w.r.t. F is:

$$[\omega]\_{\mathbf{F}} = \left\{ \omega' \in \Omega \mid \mathbf{f}\_{\mathbf{j}}(\omega') = \mathbf{f}\_{\mathbf{j}}(\omega),\ \mathbf{j} = 1, 2, \dots, \mathbf{m} \right\} \tag{2}$$

Assume that there are r such equivalence classes, named C1, C2, …, Cr. They are disjoint subsets and form a partition of Ω by RF. Similarly, for the decision feature d, another partition of Ω, D1, D2, …, Ds, is defined by the following equivalence relation:

$$\mathbf{R}\_{\mathbf{d}} : \mathbf{d}(\omega) = \mathbf{d}\_{\mathbf{k}} \text{ } \mathbf{k} = 1, 2, \dots, \text{s} \tag{3}$$

Here, Dk = {ω'∈Ω | d(ω') = dk} is an equivalence class called the kth decision class of the DIT. Letting f(Dk) = |Dk|/n be the frequency of Dk w.r.t. Ω, the information entropy H(d) of the decision feature d is

$$\mathbf{H}(\mathbf{d}) = -\sum\_{\mathbf{k}=1}^{s} \mathbf{f}(\mathbf{D}\_{\mathbf{k}}) \log\_2 \mathbf{f}(\mathbf{D}\_{\mathbf{k}}) \tag{4}$$

On the other hand, let f(Ci) = |Ci|/n be the frequency of Ci and f(Dk|Ci) = |Dk∩Ci|/|Ci| the conditional frequency of Dk given Ci. The conditional entropy H(d|F) of the decision feature d w.r.t. the condition F is determined by

$$\mathbf{H}(\mathbf{d}|\mathbf{F}) = -\sum\_{i=1}^{r} \mathbf{f}(\mathbf{C}\_{i}) \sum\_{k=1}^{s} \mathbf{f}(\mathbf{D}\_{k}|\mathbf{C}\_{i}) \log\_{2} \mathbf{f}(\mathbf{D}\_{k}|\mathbf{C}\_{i}) \tag{5}$$

From Eqs. (4) and (5), the mutual information I(F, d) between F and d is given by

$$\mathbf{I}(\mathbf{F}, \mathbf{d}) = \mathbf{H}(\mathbf{d}) - \mathbf{H}(\mathbf{d}|\mathbf{F}) \tag{6}$$

The mutual information is nonnegative and symmetric, i.e. I(F, d) = I(d, F). In this case, the significance of feature f∈F w.r.t d is defined as

$$\text{Sgnf}(\mathbf{f}, \mathbf{d}) = \text{I}(\mathbf{F}, \mathbf{d}) - \text{I}(\mathbf{F} - \{\mathbf{f}\}, \mathbf{d}) \tag{7}$$

The significance of feature f represents the dependency of the decision attribute d relative to the condition attribute f. This measure reflects the discrimination ability of condition attributes.

The larger Sgnf(f, d), the stronger the dependency between f and the decision attribute d. If Sgnf(f, d) > 0, then f is a core feature of the DIT, or equivalently f satisfies


$$\mathbf{I}(\mathbf{F} - \{\mathbf{f}\}, \mathbf{d}) < \mathbf{I}(\mathbf{F}, \mathbf{d}) \tag{8}$$

Any core feature is significant and may not be eliminated when mining a DIT. Let CFs ⊆ F be the set of all core features. To find CFs, each feature in F must be checked using Eq. (8) to determine whether or not it belongs to CFs.

Example 1: To analyze some features of a service, Table 1 illustrates a DIT consisting of the evaluations of nine clients on four features of the service. Here, d is the decision feature, and f1: capacity for innovation, f2: service capability, f3: product technologies, and f4: solution are the conditional features. The values in Table 1 mean 0: unpleased, 1: acceptable, and 2: very pleased.

Here, F = {f1, f2, f3, f4}. Using Eq. (1), the four equivalence classes w.r.t. F are C1 = {ω1, ω8}, C2 = {ω2, ω7}, C3 = {ω3, ω5, ω9}, C4 = {ω4, ω6}, and from Eq. (3) the two decision classes are D0 = {ω2, ω5, ω7, ω9} and D1 = {ω1, ω3, ω4, ω6, ω8}. From Eq. (4), the information entropy of the decision feature d is H(d) = 0.9911, and H(F) = 0.4976. From Eq. (5), the conditional entropy of d is H(d|F) = 0.3061, so the mutual information between F and d is I(F, d) = 0.6850.


Table 1. A decision information system for evaluating service quality.
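The entropies of Example 1 can be verified numerically from Eqs. (4)–(6); a minimal sketch using the equivalence classes listed above (object indices 1–9):

```python
from math import log2

n = 9
# Equivalence classes C1..C4 w.r.t. F and decision classes D0, D1 (Example 1)
C = [{1, 8}, {2, 7}, {3, 5, 9}, {4, 6}]
D = [{2, 5, 7, 9}, {1, 3, 4, 6, 8}]

# Information entropy H(d), Eq. (4)
H_d = -sum(len(Dk)/n * log2(len(Dk)/n) for Dk in D)

# Conditional entropy H(d|F), Eq. (5)
H_d_F = 0.0
for Ci in C:
    for Dk in D:
        p = len(Dk & Ci) / len(Ci)          # conditional frequency f(Dk|Ci)
        if p > 0:
            H_d_F -= len(Ci)/n * p * log2(p)

# Mutual information I(F, d), Eq. (6)
I_F_d = H_d - H_d_F
print(round(H_d, 4), round(H_d_F, 4), round(I_F_d, 4))  # 0.9911 0.3061 0.685
```

Only the class C3 mixes the two decisions, so it is the sole contributor to H(d|F).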

If the first feature f1 is eliminated, the same H(d) is obtained, but H(F−{f1}) = 0.5144 and H(d|F−{f1}) = 0.7505. These imply I(F−{f1}, d) = 0.2405 < I(F, d), so f1, capacity for innovation, is a core feature. However, Sgnf(f4, d) = I(F, d) − I(F−{f4}, d) = 0, so f4 may be eliminated since it is not significant.

The features F and d can be considered as random quantities whose values are represented in the rows of a DIT. In information theory, the mutual information is a measure of the average information one random quantity receives from the other. Therefore, I(F, d) measures the quantity of average information that the decision feature d receives from the conditional features. That is why it is relevant to the problem of removing redundant conditional features so that the reduced set provides the same effect, e.g., the same quality of classification or decision, as the original.

A coeffect reduced set R of the conditional feature set is a subset of F such that I(R, d) = I(F, d), i.e., R contains conditional features having the same effect as F. Any coeffect reduced set, or reduced set of F for short, can be used in place of the whole F. An algorithm to find a reduced set R based on mutual information is as follows:

ALGORITHM MIBR // Mutual Information Based Reduced set

// Input: DIT = (Ω, F ∪ {d}).

// Output: R // a reduced set of F.

S ≔ ∅; R ≔ CFs; // start from the set of core features

Repeat

S ≔ R; for any f∈F−R, if I(R∪{f}, d) > I(S, d) then S ≔ R∪{f};

R ≔ S; // reassign before doing the next iteration

Until I(R, d) = I(F, d);
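A runnable sketch of MIBR, built directly on Eqs. (4)–(6). The dataset at the bottom is hypothetical (Table 1's raw feature values are not reproduced in this text), and the greedy choice of the feature maximizing I(R∪{f}, d) is one reading of the loop above:

```python
from math import log2

def partition(rows, feats):
    """Partition object indices into equivalence classes w.r.t. feats."""
    blocks = {}
    for i, row in enumerate(rows):
        blocks.setdefault(tuple(row[f] for f in feats), set()).add(i)
    return list(blocks.values())

def mutual_info(rows, feats, d):
    """I(feats, d) = H(d) - H(d|feats), Eqs. (4)-(6); empty feats give I = 0."""
    n = len(rows)
    D = partition(rows, [d])
    H_d = -sum(len(Dk)/n * log2(len(Dk)/n) for Dk in D)
    if not feats:
        return 0.0
    H_cond = 0.0
    for Ci in partition(rows, feats):
        for Dk in D:
            p = len(Dk & Ci) / len(Ci)
            if p > 0:
                H_cond -= len(Ci)/n * p * log2(p)
    return H_d - H_cond

def mibr(rows, F, d):
    """ALGORITHM MIBR: start from the core features, then greedily add
    features until I(R, d) = I(F, d)."""
    I_full = mutual_info(rows, F, d)
    R = [f for f in F  # core features: dropping them strictly lowers I, Eq. (8)
         if mutual_info(rows, [g for g in F if g != f], d) < I_full]
    while mutual_info(rows, R, d) < I_full:
        R.append(max((f for f in F if f not in R),
                     key=lambda f: mutual_info(rows, R + [f], d)))
    return R

# Hypothetical DIT: f3 duplicates f1, so {f1, f2} is one reduced set
rows = [{"f1": 0, "f2": 0, "f3": 0, "d": 0},
        {"f1": 0, "f2": 1, "f3": 0, "d": 1},
        {"f1": 1, "f2": 0, "f3": 1, "d": 1},
        {"f1": 1, "f2": 1, "f3": 1, "d": 1}]
print(mibr(rows, ["f1", "f2", "f3"], "d"))
```

Here f2 is the only core feature (removing either f1 or f3 loses no information because they duplicate each other), and the loop then adds one of them to reach I(F, d).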

Example 2: Using the data in Table 1, the above algorithm proceeds as follows.

Firstly, R = CFs = {f1} and S = R. Then:

i. f2∈F−R and I(R∪{f2}, d) = 0.6850 > I(S, d) = 0.3198, so S = R∪{f2} = {f1, f2};

ii. f3∈F−R and I(R∪{f3}, d) = 0.6850 = I(S, d), so S does not change;

iii. f4∈F−R and I(R∪{f4}, d) = 0.6850 = I(S, d), so S does not change;

R = S = {f1, f2}. By checking, I(R, d) = 0.6850 = I(F, d), so the iteration terminates. It is obtained that R = {f1, f2} is a reduced set of F.

It is noticed that, if steps i and ii of the previous treatment are permuted, then the set R = {f1, f3} is another reduced set of F.

Remark: As shown above, the reduced set R of a DIT is not unique. Finding a minimum reduced set of a DIT is an optimization problem. Several algorithms have been proposed to solve this problem, e.g., the rough set-based feature selection algorithm based on ant colony optimization (RSFSACO) in [8]; cf. [9] for more detail.

Given X, a subset of Ω in a DIT, the lower approximation LFX and the upper approximation UFX of X w.r.t. F are defined by:

$$\mathbf{L}\_{\mathbf{F}}\mathbf{X} = \left\{ \omega \in \Omega \mid [\omega]\_{\mathbf{F}} \subseteq \mathbf{X} \right\},\quad \mathbf{U}\_{\mathbf{F}}\mathbf{X} = \left\{ \omega \in \Omega \mid [\omega]\_{\mathbf{F}} \cap \mathbf{X} \neq \emptyset \right\} \tag{9}$$

It can be shown that LFX ⊆ X ⊆ UFX. Some other relations between these approximations have been illustrated, e.g., in [5]. The difference set BFX = UFX − LFX is called the boundary of X, and Ω − UFX is the outside region of X. X is a rough set if BFX ≠ ∅; otherwise, it is a crisp set.

Example 3: In Example 1, let X = {ω1, ω3, ω5, ω7, ω9}. Then, the approximations of X are LFX = {ω3, ω5, ω9} = C3 and UFX = {ω1, ω2, ω3, ω5, ω7, ω8, ω9} = C1∪C2∪C3. The boundary BFX = UFX − LFX = {ω1, ω2, ω7, ω8} is not empty, so X is a rough set, and C4 is the outside region of X. Figure 1 shows all these sets in Ω.
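Eq. (9) reduces to simple set operations over the equivalence classes; a minimal check of Example 3 (object indices 1–9):

```python
# Equivalence classes w.r.t. F from Example 1, and the set X of Example 3
C = [{1, 8}, {2, 7}, {3, 5, 9}, {4, 6}]
X = {1, 3, 5, 7, 9}

lower = {w for c in C if c <= X for w in c}   # L_F X: classes contained in X
upper = {w for c in C if c & X for w in c}    # U_F X: classes meeting X
boundary = upper - lower                      # B_F X
print(sorted(lower), sorted(upper), sorted(boundary))
```

Only C3 is contained in X, while C1 and C2 merely intersect it, which is exactly what makes X rough.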

Any decision class Dk in Ω/Rd is a subset of Ω, so it has a lower approximation LFDk. Hence, the positive region in Ω w.r.t. d and F is the following subset:

$$\mathbf{P}\_{\mathbf{d}}(\mathbf{F}) = \bigcup\_{\mathbf{k}=1}^{s} \mathbf{L}\_{\mathbf{F}}\mathbf{D}\_{\mathbf{k}} \tag{10}$$

In data analysis, the dependence between attributes is important. The dependency of the decision feature d on the conditional features F is defined by the following ratio:

$$\mathbf{Dep}(\mathbf{d}, \mathbf{F}) = |\mathbf{P}\_{\mathbf{d}}(\mathbf{F})| / |\Omega| \tag{11}$$

By definition, 0 ≤ Dep(d, F) ≤ 1 and if Dep(d, F) = 1, d depends totally on F. If Dep(d, F) = 0, i.e., Pd(F) = ∅, then d does not depend on F. In case of 0 < Dep(d, F) < 1, d depends partially on F.

Figure 1. Approximations of X.


Using the degree of dependency, a coeffect reduced set R of conditional features in a DIT can also be found by means of the condition Dep(d, R) = Dep(d, F).

Example 4: Example 1 gives two decision classes, D0 = {ω2, ω5, ω7, ω9} and D1 = {ω1, ω3, ω4, ω6, ω8}; the lower approximations of these classes are LFD0 = {ω2, ω7} and LFD1 = {ω1, ω4, ω6, ω8}; thus Pd(F) = {ω1, ω2, ω4, ω6, ω7, ω8}, and the degree of dependency, or quality of approximation, is Dep(d, F) = 6/9 = 2/3. Using the coeffect reduced set R = {f1, f2}, it can be shown that all equivalence classes w.r.t. R are the same as in Example 1. Therefore, the above lower approximations and positive region are also the same, i.e., LRD0 = LFD0, LRD1 = LFD1, and Pd(R) = Pd(F).
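The positive region of Eq. (10) and the dependency ratio of Eq. (11) can be checked in the same style as the earlier sketches:

```python
# Equivalence classes w.r.t. F and decision classes D0, D1 from Example 1
C = [{1, 8}, {2, 7}, {3, 5, 9}, {4, 6}]
D = [{2, 5, 7, 9}, {1, 3, 4, 6, 8}]

# P_d(F), Eq. (10): union of the lower approximations of the decision classes
pos = {w for Dk in D for c in C if c <= Dk for w in c}
dep = len(pos) / 9                       # Dep(d, F), Eq. (11)
print(sorted(pos), round(dep, 4))        # [1, 2, 4, 6, 7, 8] 0.6667
```

Every class except C3 lies entirely inside one decision class, so the positive region misses exactly the three objects of C3.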

So far, problems of inducing rules from DITs have been widely studied and developed. The rough set method can be applied to these problems with several advantages [5]. For instance, the lower and upper approximations can be applied to describe the inconsistency of a DIT and to induce corresponding rules dynamically from decision systems [6]. These methods of approximation can also be used to address incomplete input data when inducing decision rules [7]. Such rules can be applied to partition a set of objects into classifications [10].

Given a DIT, let Vf be the range of f∈F. For v∈Vf and ω∈Ω, a proposition like f(ω) = v, or f = v for short, takes the logic value true or false depending on ω. The assignment ϕ ≔ (f = v) defines a logic variable ϕ w.r.t. the proposition f = v. Then, ϕ is true if there exists ω∈Ω such that f(ω) = v, and false otherwise. The set of logic variables on F, together with the logical operations ~ (not), ∧ (and), and ∨ (or), sets up a set of logic expressions called the decision language from F, denoted by L(F). The meaning of ϕ in L(F), denoted by 〈ϕ〉, is the set of ω in Ω for which the proposition ϕ is true. In particular, if ϕ ≔ (f = v) then 〈ϕ〉 = {ω∈Ω | f(ω) = v}, so ϕ takes the set 〈ϕ〉 as its description.

A decision rule allows individuals, team workers, and organizations to choose effectively a specific course of action in response to opportunities and threats. Formally, a decision rule is a logic expression of the form ϕ → ψ, read "if ϕ then ψ", where ϕ ∈ L(F) and ψ ∈ L(d) are referred to as the condition and the decision of the rule, respectively. A decision rule ϕ → ψ is true if 〈ϕ〉 ⊆ 〈ψ〉. ϕ and ψ are equivalent, written ϕ ↔ ψ, if and only if (ϕ→ψ) ∧ (ψ→ϕ).

Assume that 〈ϕ〉 and 〈ψ〉 are nonempty. The support of the rule ϕ → ψ is defined as

$$\text{Supp}(\phi \to \psi \,) = |\langle \phi \rangle \cap \langle \psi \rangle| \tag{12}$$

The larger Supp(ϕ → ψ), the more powerful the rule in the DIT. When 〈ϕ〉 ≠ ∅, the certainty or accuracy of ϕ → ψ, denoted by Cert(ϕ → ψ), is

$$\mathbf{Cert}(\phi \to \psi \,) = |\langle \phi \rangle \cap \langle \psi \rangle| / |\langle \phi \rangle| \tag{13}$$

This is the percentage of objects of 〈ψ〉 present in 〈ϕ〉, i.e., the percentage of objects having property ψ among the objects having property ϕ; thus Cert(ϕ → ψ) shows the confidence of the rule. Consequently, Cert(ϕ → ψ) = 1 is equivalent to ϕ → ψ being true; the rule is then certain or accurate. Alternatively, if 〈ψ〉 ≠ ∅, the coverage of ϕ → ψ is also defined:

$$\mathsf{Cov}(\phi \to \psi) = |\langle \phi \rangle \cap \langle \psi \rangle| / |\langle \psi \rangle| \tag{14}$$

The smaller Cov(ϕ → ψ), the less powerful the rule. Finally, the popularity of ϕ → ψ is measured by the strength of the rule:

$$\text{Strg}(\phi \to \psi) = |\langle \phi \rangle \cap \langle \psi \rangle| / |\Omega| \tag{15}$$

In a given DIT, a coeffect reduced set R of conditional features and the corresponding positive region Pd(R) are set up. Then, the DIT is restricted to a new table with features R, d and the objects of Pd(R). Such a table is called a decision support table (DST). Based on the above measures, decision rules extracted from a DST are verified before using them in prediction decisions.

It is noted that there may be pairs of inconsistent or conflicting decision rules, which have the same condition but different decisions. Such conflicting rules must be excluded. In general, the set ℜ of τ selected decision rules ϕα → ψα needs to satisfy certain consistency properties.


Using the degree of dependency, a coeffect reduced set R of conditional features in a DIT can

Example 4: Example 1 gives two decision classes D0 = {ω2, ω5, ω7, ω9} and D1 = {ω1, ω3, ω4, ω6, ω8}; the lower approximations of these classes are LFD0 = {ω2, ω7} and LFD1 = {ω1, ω4, ω6, ω8}, thus Pd(F) = {ω1, ω2, ω4, ω6, ω7, ω8} and the degree of dependency, or quality of approximation, is Dep(d, F) = 2/3. Using the coeffect reduced set R = {f1, f2}, it can be shown that all equivalence classes w.r.t. R are the same as in Example 1. Therefore, the above lower approximations and positive region are also the same, i.e., LRD0 = LFD0, LRD1 = LFD1 and Pd(R) = Pd(F). The degree of dependency can also be found by means of Dep(d, R) = Dep(d, F).

So far, problems of inducing rules from DITs have been studied and developed. The rough set method can be applied to these problems with several advantages [5]. For instance, the lower and upper approximations are applied to describe the inconsistency of a DIT and to induce corresponding rules dynamically from decision systems [6]. These methods of approximation can be used to address incomplete input data for inducing decision rules [7]. Such rules can be applied to partition a set of objects into classifications [10].

Given a DIT, let Vf be the range of f ∈ F. For a v ∈ Vf and ω ∈ Ω, a proposition like f(ω) = v, or f = v for short, takes a logic value true or false depending on ω. The assignment ϕ ≔ (f = v) defines a logic variable ϕ w.r.t. the proposition f = v. Then, ϕ is true if there exists ω ∈ Ω such that f(ω) = v, and false otherwise. The set of logic variables on F together with the logical operations ~ (not), ∧ (and), ∨ (or) sets up a set of logic expressions called the decision language from F, denoted by L(F). The meaning of ϕ in L(F), denoted by 〈ϕ〉, is the set of ω in Ω for which the proposition ϕ is true. Additionally, if ϕ ≔ (f = v), then 〈ϕ〉 = {ω ∈ Ω | f(ω) = v}, so ϕ takes the set 〈ϕ〉 as its description.

A decision rule allows individuals, team workers, and organizations to effectively choose a specific course of action in response to opportunities and threats. Formally, a decision rule is a logic expression ϕ → ψ, read "if ϕ then ψ", where ϕ ∈ L(F) and ψ ∈ L(d) are referred to as the condition and decision of the rule, respectively. A decision rule ϕ → ψ is true if 〈ϕ〉 ⊆ 〈ψ〉. ϕ and ψ are equivalent, written ϕ ↔ ψ, if and only if (ϕ → ψ) ∧ (ψ → ϕ).

Assume that 〈ϕ〉 and 〈ψ〉 are nonempty. The support of the rule ϕ → ψ is defined as

$$\mathrm{Supp}(\phi \to \psi) = |\langle\phi\rangle \cap \langle\psi\rangle| \tag{12}$$

The larger Supp(ϕ → ψ), the more power the rule has in the DIT. When |〈ϕ〉| ≠ 0, the certainty or accuracy of ϕ → ψ, denoted by Cert(ϕ, ψ), is

$$\mathrm{Cert}(\phi \to \psi) = |\langle\phi\rangle \cap \langle\psi\rangle| \, / \, |\langle\phi\rangle| \tag{13}$$

This is the percentage of objects of 〈ψ〉 present in 〈ϕ〉, i.e., the percentage of objects having property ψ among the objects having property ϕ; Cert(ϕ → ψ) shows the confidence of the rule. In consequence, Cert(ϕ → ψ) = 1 is equivalent to ϕ → ψ being true; the rule is then certain or accurate.

Alternatively, if |〈ψ〉| ≠ 0, the coverage of ϕ → ψ is also defined:

$$\mathrm{Cov}(\phi \to \psi) = |\langle\phi\rangle \cap \langle\psi\rangle| \, / \, |\langle\psi\rangle| \tag{14}$$

The strength of ϕ → ψ is its support relative to the whole universe:

$$\mathrm{Str}(\phi \to \psi) = |\langle\phi\rangle \cap \langle\psi\rangle| \, / \, |\Omega| \tag{15}$$

62 Management of Information Systems
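These rule measures can be illustrated with a short sketch over sets of object indices. The universe and the meanings 〈ϕ〉 and 〈ψ〉 below are hypothetical, sized so that the figures echo the 2nd and 3rd rules of Example 5 (support 2, certainty 1, strength 2/9 ≈ 22.2%):

```python
from fractions import Fraction

def supp(phi, psi):
    """Support: number of objects satisfying both condition and decision."""
    return len(phi & psi)

def cert(phi, psi):
    """Certainty (accuracy): fraction of objects of <phi> that lie in <psi>."""
    return Fraction(len(phi & psi), len(phi))

def cov(phi, psi):
    """Coverage: fraction of objects of <psi> that lie in <phi>."""
    return Fraction(len(phi & psi), len(psi))

def strength(phi, psi, universe):
    """Strength: support relative to the whole universe."""
    return Fraction(len(phi & psi), len(universe))

# Hypothetical nine-object universe echoing Example 5's figures.
universe = set(range(1, 10))   # omega_1 .. omega_9
phi = {1, 4}                   # <phi>: objects matching a rule condition
psi = {1, 3, 4, 6, 8}          # <psi>: objects with d = 1

assert supp(phi, psi) == 2     # support 2
assert cert(phi, psi) == 1     # certain rule: <phi> is a subset of <psi>
print(float(strength(phi, psi, universe)))  # 0.222..., i.e., 22.2%
```

A certain rule (certainty 1) need not cover all of 〈ψ〉: here the coverage is only 2/5.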


Example 5: Consider the coeffect reduced set R = {f1, f2}, whose positive region is Pd(R) = {ω1, ω2, ω4, ω6, ω7, ω8}, as in Example 4. Some decision rules extracted from Table 1, together with the measures of the obtained rules, are presented in Table 2. The supports of the 2nd and 3rd rules are 2; their certainties are equal to 1 and their strengths to 22.2%. So, they can be combined:

$$(\mathbf{f}\_1 = \mathbf{1}) \land [(\mathbf{f}\_2 = \mathbf{1}) \lor (\mathbf{f}\_2 = \mathbf{2})] \to \mathbf{d} = \mathbf{1} \tag{16}$$

The support of this rule is raised to 4, its coverage to 100%, and its strength to 44.4%. This rule is supported by the classes C1 and C4, and can be read as follows: "if capacity for innovation is acceptable and service capability is not pleasing, then the system activity is still acceptable".

The class C3 = {ω3, ω5, ω9} is not in Pd(R), so a rule like (f1 = 1) ∧ (f2 = 0) → (d = 0 or 1) should not be considered: such a rule would be useless, since it yields no definite decision.


Table 2. List of extracted decision rules.

The method of decision-making is also applied to build up decisions for risk warning based on processing historical data. A risk management model includes three sequential basic steps: risk identification, risk measurement, and risk warning. Risk identification should itself be objective; when all risk levels are assessed by experts based on their work experience, the role of historical data is ignored. Such a model does not give enough consideration to the uncertainty and imprecision of risk and will unavoidably lead to some faulty judgments.

Data to identify risk factors often come from the operation, policy, environment, and management of a system. The collected data include a feature for assessing risks, described by the feature d in a DIT. This decision feature d often has six levels, 0: no risk, 1: little, 2: low-grade, 3: middle-grade, 4: distinct, and 5: dangerous. The historical data are collected factually, so there will be some data fields or features which have little impact on the final risk level. If these redundant features are removed, a simplified feature set is produced, which has a positive impact on risk judgment. This is where finding a reduced feature set comes in: unnecessary information is ignored while the nature of the collected data is kept unchanged.

Based on fact-finding of conditional features and observed risk levels in a DIT, decision rules to predict risk levels are extracted. This process is only one step of the training stage in machine learning. To improve the quality of risk prediction, further observations on the DIT and verifications of the rules must be done repeatedly.

Example 6: To evaluate the security risks of a system, three conditional feature types, coming from environmental impact, management structure, and control equipment, are taken into account. These conditional features are notated as E, M, and C, respectively, and the decision feature d is simplified to two levels, either 1: risk-warning or 0: no-warning. The data are shown in Table 3.

From Table 3, there are five equivalence classes C1 = {ω1}, C2 = {ω2, ω5}, C3 = {ω3}, C4 = {ω4}, C5 = {ω6} and two decision classes D1 = {ω4, ω5}, D2 = {ω1, ω2, ω3, ω6}.

Using Eqs. (4)–(6), the information entropy of F = {E, M, C} is H(F) = 2.2516, H(d) = 0.9183, and the mutual information between F and d is I(F, d) = 0.5850. From Eq. (6), I(F − {C}, d) = 0.1258 is less than I(F, d), so C is a core feature with a significance of Sgnf(C, d) = 0.4591.
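These entropy and mutual-information figures can be reproduced directly from the data of Table 3. A minimal sketch, assuming base-2 logarithms (consistent with the values quoted above):

```python
import math
from collections import defaultdict

# Table 3 data: ((E, M, C), d) per object omega_1 .. omega_6.
rows = [((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 2), 1),
        ((0, 1, 0), 0), ((1, 0, 1), 0), ((0, 1, 2), 1)]

def entropy(labels):
    """Shannon entropy (base 2) of the partition induced by equal labels."""
    counts = defaultdict(int)
    for x in labels:
        counts[x] += 1
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_information(idx):
    """I(F', d) = H(d) - H(d | F'), where F' is F projected onto indexes idx."""
    n = len(rows)
    groups = defaultdict(list)
    for f, d in rows:
        groups[tuple(f[i] for i in idx)].append(d)
    h_d_given_f = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy([d for _, d in rows]) - h_d_given_f

h_F = entropy([f for f, _ in rows])           # H(F), F = {E, M, C}
h_d = entropy([d for _, d in rows])           # H(d)
i_full = mutual_information((0, 1, 2))        # I(F, d)
sgnf_C = i_full - mutual_information((0, 1))  # significance of C

print(round(h_F, 4), round(h_d, 4), round(i_full, 4), round(sgnf_C, 4))
# → 2.2516 0.9183 0.585 0.4591
```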


| Object | E | M | C | d |
|---|---|---|---|---|
| ω1 | 0 | 1 | 1 | 1 |
| ω2 | 1 | 0 | 1 | 1 |
| ω3 | 1 | 1 | 2 | 1 |
| ω4 | 0 | 1 | 0 | 0 |
| ω5 | 1 | 0 | 1 | 0 |
| ω6 | 0 | 1 | 2 | 1 |

Table 3. Risk warning data.

Consider F − {M} = {E, C}. From Eq. (5), H(d | F − {M}) = 0.3333 implies I(F − {M}, d) = 0.5850 = I(F, d). Therefore, {E, C} is a coeffect reduced set of F. Hence, there are formally two decision rules:

$$[(\mathbf{E} = \mathbf{0}) \land (\mathbf{C} = \mathbf{0})] \lor [(\mathbf{E} = \mathbf{1}) \land (\mathbf{C} = \mathbf{1})] \to (\mathbf{d} = \mathbf{0})\tag{17}$$

$$[(\mathbf{E} = 1) \land (\mathbf{C} \neq 0)] \lor [(\mathbf{E} = 0) \land (\mathbf{C} \neq 0)] \to (\mathbf{d} = 1) \tag{18}$$

It is noticed that the second expression of the disjunction in the first rule, (E = 1) ∧ (C = 1), is covered by the first expression of the disjunction in the second rule, (E = 1) ∧ (C ≠ 0). Therefore, [(E = 1) ∧ (C = 1)] → [(d = 0) or (d = 1)] may happen. Alternatively, the second rule could be written as (C ≠ 0) → (d = 1); however, if E = 1 and C = 1, the first rule gives d = 0, contrary to this deduced rule. For these reasons, the above rules are reasonably revised as [(E = 1) ∧ (C = 2)] ∨ [(E = 0) ∧ (C ≠ 0)] → (d = 1).

Similarly, F − {E} = {M, C} gives I(F − {E}, d) = I(F, d), thus {M, C} is also a reduced set of F. Then,

$$[(\mathbf{M} = 1) \land (\mathbf{C} = 0)] \lor [(\mathbf{M} = 0) \land (\mathbf{C} = 1)] \to (\mathbf{d} = 0) \tag{19}$$

$$[(\mathbf{M} = 1) \land (\mathbf{C} \neq 0)] \lor [(\mathbf{M} = 0) \land (\mathbf{C} = 1)] \to (\mathbf{d} = 1) \tag{20}$$

It is also noticed that the second expressions of the above disjunctions are identical, and it is necessary to ignore them: if (M = 0) ∧ (C = 1) is true, these rules simultaneously imply d = 0 and d = 1, which is impossible to decide.

Consequently, the second and fourth rules in Table 4 may be used for risk warning w.r.t the collected data in Table 3.
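As a sanity check, the d = 1 warning rules derived above can be replayed against the data of Table 3; every object they match should indeed carry d = 1. A sketch (the rule for {M, C} is the one obtained after dropping the shared term):

```python
# Table 3 data: one dict per object, d = observed risk level.
data = [
    {"E": 0, "M": 1, "C": 1, "d": 1},
    {"E": 1, "M": 0, "C": 1, "d": 1},
    {"E": 1, "M": 1, "C": 2, "d": 1},
    {"E": 0, "M": 1, "C": 0, "d": 0},
    {"E": 1, "M": 0, "C": 1, "d": 0},
    {"E": 0, "M": 1, "C": 2, "d": 1},
]

def rule_ec(r):
    """d = 1 rule from the reduced set {E, C}."""
    return (r["E"] == 1 and r["C"] == 2) or (r["E"] == 0 and r["C"] != 0)

def rule_mc(r):
    """d = 1 rule from the reduced set {M, C}, after dropping the shared term."""
    return r["M"] == 1 and r["C"] != 0

for rule in (rule_ec, rule_mc):
    matched = [r for r in data if rule(r)]
    # every matched object has d = 1, i.e., certainty 1 on this data
    assert matched and all(r["d"] == 1 for r in matched)
```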

The difficulty of choosing decision rules increases with large-scale datasets. To reduce this shortcoming in part and make decision rules more efficient, techniques of machine learning should be used. For instance, in [11], a backpropagation neural network was used to train on the data in a DIT, verifying decision rules in a number of steps to minimize prediction errors based on the decision rules.


Table 4. List of extracted decision rules for risk warning.


#### 3. Evaluation of the extent of MIS using ANOVA

For the outcome extent of an MIS, it is assumed that a reduced set of m features, namely f1, f2, …, fm, is considered and evaluated with real numbers. The probability distribution of fi is assumed to be normal, N(ξi, σi²), with expected mean ξi and variance σi².

ANOVA, or analysis of variance, is a statistical method that uses variances to determine whether expected means are different or equal. It assesses the significance of factors, the so-called features here, by comparing the response means of observation samples at different features. In this chapter, single-stage and multiple-stage ANOVA are introduced to evaluate features from the extent of an MIS.

In doing ANOVA, it is also assumed that all m features fi have the same variance. In a course of consideration, m observation samples at different features are randomly drawn. The ith sample is denoted by {ωij}, j = 1, 2, …, ni, a manifestation of the random variable fi from the population of fi values. The basic characteristics of the ith sample are:

$$\begin{aligned} \overline{\omega}_i &= \left(\sum_{j=1}^{n_i} \omega_{ij}\right)/n_i \ \text{— sample average, an estimate for } \xi_i \\ s_{*i}^2 &= \sum_{j=1}^{n_i}\left[\omega_{ij} - \overline{\omega}_i\right]^2/\mathrm{df}_i \ \text{— sample variance, an estimate for } \sigma^2 \text{ with degrees of freedom } \mathrm{df}_i = n_i - 1 \end{aligned}$$

These calculations are done by using the following three basic sums:

Sum:

$$S_i = \sum_{j=1}^{n_i} \omega_{ij} \tag{21}$$

Sum of squares:

$$SS_i = \sum_{j=1}^{n_i} \omega_{ij}^2 \tag{22}$$

Sum of squares of deviations:

$$SSD_i = \sum_{j=1}^{n_i} \left[\omega_{ij} - \overline{\omega}_i\right]^2 \tag{23}$$

Then, it is implied that ω̄i = Si/ni and SSDi = SSi − Si²/ni, so s∗i² = SSDi/dfi.
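These relations can be checked with a short sketch; the sample values used for illustration are those of the fourth feature in Example 7 below:

```python
def basic_sums(x):
    """S_i, SS_i and SSD_i of Eqs. (21)-(23) for one observation sample."""
    S = sum(x)                      # Eq. (21)
    SS = sum(v * v for v in x)      # Eq. (22)
    SSD = SS - S * S / len(x)       # Eq. (23) via SSD_i = SS_i - S_i^2 / n_i
    return S, SS, SSD

# e.g. the fourth sample of Example 7 below:
sample = [7, 4, 2, 5]
S, SS, SSD = basic_sums(sample)
assert (S, SS, SSD) == (18, 94, 13.0)

# sanity check: SSD_i equals the direct sum of squared deviations
mean = S / len(sample)
assert abs(SSD - sum((v - mean) ** 2 for v in sample)) < 1e-9
```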

To verify the condition that all variances σi² are equal to the same value σ², the Bartlett test based on the χ² probability distribution is used at a level of significance α, valued from 1 to 5%. If the hypothesis on the equality of all variances is correct, with m > 1 and ni > 1 for all i, Bartlett has shown that the statistic χ²cal has approximately a χ²-distribution with m − 1 degrees of freedom:

$$\chi^2_{\text{cal}} = 2.3026\left(\mathrm{df} \times \log s^2 - \sum_{i=1}^{m} \mathrm{df}_i \log s_{*i}^2\right)/c \tag{24}$$

Here, df = ∑i:1..m dfi, c = 1 + (∑i:1..m 1/dfi − 1/df)/[3(m − 1)], and s² = (∑i:1..m dfi·s∗i²)/df = (∑i:1..m SSDi)/df is the pooled variance, an estimate for σ². If the calculated χ²cal is less than the χ²1−α percentile, it is unreasonable to deny that all variances are the same. It is noticed that the χ² approximation is a poor one for dfi ≤ 2.

In the case n1 = n2 = … = nm = n, dfi = n − 1 for every i and Eq. (24) becomes quite simple. Indeed, because log s² = log ∑i:1..m SSDi − log(df) and log s∗i² = log SSDi − log(dfi), a shortened form of Eq. (24) is

Some Methods for Evaluating Performance of Management Information System http://dx.doi.org/10.5772/intechopen.74093 67

$$\chi^2_{\text{cal}} = 2.3026\left(m \times \log s^2 - \sum_{i=1}^{m} \log s_{*i}^2\right)(n-1)/c \tag{25}$$

where c = 1 + (m + 1)/(3m[n − 1]). The value χ²cal in Eq. (25) is calculated by using only the SSDi values.
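A sketch of the Bartlett statistic of Eq. (24) in the general unequal-ni case, applied to the samples of Example 7 below. Up to rounding of the intermediate tabulated values, it reproduces the conclusion of Example 7 that equality of variances is accepted:

```python
import math

def bartlett_chi2(samples):
    """Bartlett statistic of Eq. (24) for samples of possibly unequal sizes."""
    m = len(samples)
    dfs = [len(x) - 1 for x in samples]
    ssds = [sum(v * v for v in x) - sum(x) ** 2 / len(x) for x in samples]
    df = sum(dfs)
    s2 = sum(ssds) / df                            # pooled variance
    si2 = [ssd / d for ssd, d in zip(ssds, dfs)]   # per-sample variances s*_i^2
    c = 1 + (sum(1 / d for d in dfs) - 1 / df) / (3 * (m - 1))
    return 2.3026 * (df * math.log10(s2)
                     - sum(d * math.log10(v) for d, v in zip(dfs, si2))) / c

chi2 = bartlett_chi2([[7, 3, 4], [5, 4, 6], [8, 3, 5], [7, 4, 2, 5]])
assert chi2 < 7.815  # below chi^2_0.95(3): equality of variances is accepted
```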

Setting n = ∑i:1..m ni, ωo = (∑i:1..m ni·ω̄i)/n, ξo = (∑i:1..m ni·ξi)/n, and ηi = ξi − ξo, the following partition is shown:

$$\begin{aligned} \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left[\omega_{ij}-\xi_i\right]^2 &= \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left[\omega_{ij}-\overline{\omega}_i\right]^2 + \sum_{i=1}^{m} n_i\left[\overline{\omega}_i-\xi_i\right]^2 \\ &= \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left[\omega_{ij}-\overline{\omega}_i\right]^2 + \sum_{i=1}^{m} n_i\left[\overline{\omega}_i-\omega_o-\eta_i\right]^2 + n\left[\omega_o-\xi_o\right]^2 \end{aligned} \tag{26}$$

According to the χ²-partition theorem, the sums on the rightmost side of Eq. (26) have χ²-distributions with n − m, m − 1, and 1 degrees of freedom, respectively.
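The partition of Eq. (26) is an exact algebraic identity when ηi = 0, which can be checked numerically on any dataset. Here the samples of Example 7 are used, together with an arbitrary assumed common mean ξo = 5.0:

```python
# Numerical check of Eq. (26) under equal expected means (eta_i = 0, xi_i = xi_o).
samples = [[7, 3, 4], [5, 4, 6], [8, 3, 5], [7, 4, 2, 5]]
xi_o = 5.0  # assumed common expected mean (arbitrary for this check)

n = sum(len(x) for x in samples)
means = [sum(x) / len(x) for x in samples]
w_o = sum(len(x) * mu for x, mu in zip(samples, means)) / n   # grand average

lhs = sum((v - xi_o) ** 2 for x in samples for v in x)
within = sum((v - mu) ** 2 for x, mu in zip(samples, means) for v in x)
between = sum(len(x) * (mu - w_o) ** 2 for x, mu in zip(samples, means))
shift = n * (w_o - xi_o) ** 2

assert abs(lhs - (within + between + shift)) < 1e-9  # Eq. (26) holds exactly
```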

If the expected means of the m populations are the same, then ξi = ξo and ηi = 0 for all i. The first two terms of Eq. (26) are the variations within and between samples, determined in turn as follows:

$$s_1^2 = \left(\sum_{i=1}^{m}\sum_{j=1}^{n_i}\left[\omega_{ij}-\overline{\omega}_i\right]^2\right)/(n-m) = \left(\sum_{i=1}^{m} SSD_i\right)/(n-m) \tag{27}$$

$$s_2^2 = \left(\sum_{i=1}^{m} n_i\left[\overline{\omega}_i-\omega_o\right]^2\right)/(m-1) = \left(\sum_{i=1}^{m} S_i^2/n_i - \left[\sum_{i=1}^{m} S_i\right]^2/n\right)/(m-1) \tag{28}$$

The statistics s1², s2², and s3² = n[ωo − ξo]² are unbiased estimates of σ². In this case, the total variance between observations and population is determined as follows:

$$s^2 = \left(\sum_{i=1}^{m}\sum_{j=1}^{n_i}\left[\omega_{ij}-\omega_o\right]^2\right)/(n-1) = \left(\sum_{i=1}^{m} SS_i - \left[\sum_{i=1}^{m} S_i\right]^2/n\right)/(n-1) \tag{29}$$

In such a case, the variance ratio v²cal = s2²/s1² follows the Fisher probability distribution with m − 1 and n − m degrees of freedom. Therefore, the hypothesis about equality of the m expected means is tested using the Fisher distribution at a given level of significance α, valued from 1 to 5%. If v²cal > F1−α(m − 1, n − m), the hypothesis of equal means is rejected, where F1−α(m − 1, n − m) is the 100(1 − α)% percentile of the Fisher distribution.

It is noticed that the conditions m > 1 and ni > 1 for all i are essential not only for the Bartlett test, but also for doing ANOVA [12]. Conversely, the analysis is trivial when ni = 1 for some i. Also, if m = 1, the analysis is pure inference from a single population [13].

Example 7: Assume that four features need to be tested at the 5% level of significance with the data in Table 5. The calculations are also given in Table 5.

Using Eq. (24), χ²cal = 1.328 is far less than χ²0.95(3) = 7.815, the 95% percentile in the table of χ² probabilities with df = 3. Therefore, the hypothesis on equality of variances is accepted. The variation within the dataset is estimated by the pooled variance, cf. Eq. (27): s² = 36.333/9 = 4.037. Using the basic sums in Table 5, the ANOVA table is presented in Table 6.


| | f1 | f2 | f3 | f4 | Total |
|---|---|---|---|---|---|
| ωi1 | 7 | 5 | 8 | 7 | |
| ωi2 | 3 | 4 | 3 | 4 | |
| ωi3 | 4 | 6 | 5 | 2 | |
| ωi4 | — | — | — | 5 | |
| {1}. ni | 3 | 3 | 3 | 4 | 13 |
| Si | 14 | 15 | 16 | 18 | 63 |
| SSi | 74 | 77 | 98 | 94 | 343 |
| Si²/ni | 65.33 | 75 | 85.33 | 81 | 306.67 |
| SSDi | 8.667 | 2 | 12.67 | 13 | 36.333 |
| {2}. dfi | 2 | 2 | 2 | 3 | 9 |
| 1/dfi | 0.5 | 0.5 | 0.5 | 0.333 | 1.833 |
| s∗i² | 4.333 | 1 | 6.333 | 4.333 | |
| log(s∗i²) | 0.637 | 0 | 0.802 | 0.637 | |
| dfi·log(s∗i²) | 1.274 | 0 | 1.603 | 1.91 | 4.787 |

Auxiliary values: df·log(s²) = 5.455; Σ(Si²/ni) − (ΣSi)²/n = 1.359; s² = 4.037; c = 1.157; χ²cal = 1.328.

Table 5. Calculations for single-stage ANOVA.


| Variation sources | SSD | df | s² | v² |
|---|---|---|---|---|
| Between features | 1.359 | 3 | 0.453 | 0.112 |
| Within features | 36.333 | 9 | 4.037 | |
| Total | 37.692 | 12 | | F0.95(3,9) = 3.86 |

Table 6. Single-stage ANOVA table of Example 7.

The calculated basic sums in the first part of Table 5 are used to set up the ANOVA in Table 6. It is shown that v²cal = 0.453/4.037 = 0.112 < 3.86, the 95% percentile in the table of Fisher probabilities w.r.t. α = 5%. The hypothesis on equality of the expected means would be accepted at the 5% significance level.
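The single-stage computation of Eqs. (27) and (28) for the Table 5 data can be sketched as:

```python
# Table 5 samples (features f1..f4 of Example 7).
samples = [[7, 3, 4], [5, 4, 6], [8, 3, 5], [7, 4, 2, 5]]
m = len(samples)
n = sum(len(x) for x in samples)

S = [sum(x) for x in samples]                                 # Eq. (21)
SS = [sum(v * v for v in x) for x in samples]                 # Eq. (22)
SSD = [ss - s * s / len(x) for ss, s, x in zip(SS, S, samples)]

s1_sq = sum(SSD) / (n - m)                                    # Eq. (27): within features
s2_sq = (sum(s * s / len(x) for s, x in zip(S, samples))
         - sum(S) ** 2 / n) / (m - 1)                         # Eq. (28): between features
v2 = s2_sq / s1_sq

print(round(s1_sq, 3), round(s2_sq, 3), round(v2, 3))  # → 4.037 0.453 0.112
assert v2 < 3.86  # F_0.95(3, 9): equality of the expected means is accepted
```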

If the hypothesis ξ1 = ξ2 = … = ξm is rejected, all possible differences of these means, in the form of linear combinations, are estimated by using confidence intervals. In such a case, there is a probability of 1 − α that all comparisons simultaneously among the expected means satisfy:

$$-\lambda < \sum_{i=1}^{m} \delta_i \overline{\omega}_i - \sum_{i=1}^{m} \delta_i \xi_i < \lambda \tag{30}$$

Here, ∑i=1…m δi = 0 and λ² = s² · F1−α(m − 1, n − m) · (m − 1) · ∑i=1…m (δi²/ni), where F1−α(m − 1, n − m) is the 100(1 − α)% percentile of the Fisher probability distribution.

For instance, if m = 3, ni = 4, ω̄1 = 2.25, ω̄2 = 4.0, ω̄3 = 4.5 and s² = 4.41, then F0.95(2, 3·4 − 3) = 4.26. Using Eq. (30), some 95% confidence intervals are calculated as follows:

δ1 = 1 = −δ2, δ3 = 0, λ = 4.297; the confidence interval of ξ1 − ξ2 is −1.75 ± 4.297, or (−6.047, 2.547).

δ1 = 0, δ2 = 1 = −δ3; similarly, the confidence interval of ξ2 − ξ3 is −0.5 ± 4.297, or (−4.797, 3.797).

δ1 = ½ = δ2, δ3 = −1, λ = 3.721. The 95% confidence interval of ½ξ1 + ½ξ2 − ξ3 is −1.375 ± 3.721, or (−5.096, 2.346).
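A sketch of Eq. (30) with the figures of this illustration; the percentile F0.95(2, 9) = 4.26 is hardcoded from the text, and the resulting λ differs from the quoted values only by rounding:

```python
import math

# Figures from the illustration above: m = 3 samples of size 4, s^2 = 4.41.
means = [2.25, 4.0, 4.5]
ni, m, s2, F = 4, 3, 4.41, 4.26   # F = F_0.95(2, 9), hardcoded percentile

def interval(deltas):
    """95% simultaneous confidence interval for sum(delta_i * xi_i), Eq. (30)."""
    assert abs(sum(deltas)) < 1e-12            # contrast coefficients sum to zero
    lam = math.sqrt(s2 * F * (m - 1) * sum(d * d / ni for d in deltas))
    center = sum(d * mu for d, mu in zip(deltas, means))
    return center - lam, center + lam

lo, hi = interval([1, -1, 0])   # contrast xi_1 - xi_2
assert lo < 0 < hi              # the interval covers 0: no significant difference
```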

When several stages need to be tested for equality of the expected means of features, multiple-stage ANOVA is applied. This is the case of evaluating the same given m features in k different stages, denoted by Γν, ν = 1, 2, …, k. To simplify the presentation, without loss of generality, it is assumed that all observed samples in all stages have the same size, i.e., ni = n for all i, and Eq. (25) is used for the Bartlett test.

The notations are similar, but an index ν is added to the observations of each νth stage. The sums in Eqs. (21)–(23) are renotated as Sνi, SSνi, SSDνi. Then, ω̄νi = Sνi/n and s∗νi² = SSDνi/(n − 1) are the average and variance of the ith sample in the νth stage. All computations in each stage are similar to single-stage ANOVA. The results from the stage computations are then combined, as shown in the end part of Table 7, to form the multistage ANOVA table.

Example 8: Given the two-stage dataset of three features in the first five rows of Table 7, calculations are illustrated in the parts of the table notated {1} and {2}, which present the schemes for finding the basic sums and the terms of the Bartlett test and ANOVA.


Table 7. Calculations for two-stage ANOVA.


Calculations for the Bartlett test in part {2} of Table 7 show that χ²cal = 1.194 < χ²0.95(5) = 11.07, so the hypothesis that the population variance is the same for all features is accepted at α = 5%. An estimate of the population variance is s1² = 21.33/(2·3·[3 − 1]) = 1.778, cf. Table 8. Part {3} of Table 7 is the calculation scheme for the terms in Table 8, where Subtotal equals Total minus Within stages, or the sum of Between features within stages, Between stages, and Interaction.

The ratio of the variation between stages to that within features is v² = s3²/s1² = 14.222/1.778 = 8.0, which far exceeds the 95% percentile of the Fisher distribution, F0.95(1,12) = 4.75. That means the expected means differ significantly between stages; in other words, the effects between stages are significantly discriminated.

Similarly, comparing the variation within features and between features within stages, Table 8 shows that v² = s2²/s1² = 3.105/1.778 = 1.747 < F0.95(2,12) = 3.89. This shows that the difference between the expected means of features within stages is not significant, or the effects between features within stages are almost the same.

Besides the above effects, the interaction between stages and features is also a factor that needs to be considered. The ratio v² = 0.006/0.012 = 0.50 indicates that such an interaction is not present in the given dataset. Thus, the lines labeled "Interaction" and "Within stages" give the same unbiased estimates of σ², so a combination of these lines can improve the estimate of σ². The residual mean square is a sum of the variations of the Interaction and Within stages. This leads to an updated population variance of 1.525, less than s1² = 1.778 in Table 8, but it obviously increases the v² ratios. Table 9 analyzes Example 8 without the interaction term.


Table 8. Two-stage ANOVA table of Example 8.


Table 9. ANOVA table—two-stage without interaction.

The ratio v² = s2²/s1² = 3.105/1.525 = 2.036 < F0.95(2,14) = 3.74, so the effects between features within stages are the same. Meanwhile, v² = s3²/s1² = 14.222/1.525 = 9.328, which also far exceeds F0.95(1,14) = 4.60, so the effects between stages are again significantly discriminated, cf. Table 8.
