68 Reverse Engineering – Recent Advances and Applications

**ASSOCIATES** <<Transformation-Tag>> <<Transformation-Transformation>> <<Transformation-Rule>> <<Transformation-TypeModel>>

**AXIOMS** ass1: <<Transformation-Transformation>>; ass2: <<Transformation-Rule>>; t: Transformation; …

size (get\_extends (ass1, t)) = 1 implies includes (get\_rule (ass2, get\_extends (ass1, t)), get\_rule (ass1, t))

**END-CLASS**

**CLASS** TypedModel **IMPORTS** EMOF::Package

**IS-SUBTYPE-OF** EMOF::NamedElement

**ASSOCIATES** <<Transformation-TypeModel>> <<TypeModel-Package>> <<Domain-TypeModel>> <<TypeModel-TypeModel>>

**END-CLASS**

**CLASS** Domain

**IS-SUBTYPE-OF** EMOF::NamedElement

**ASSOCIATES** <<Rule-Domain>> <<Domain-TypeModel>>

**DEFERRED ATTRIBUTES**

isCheckable: Domain -> Boolean

isEnforceable: Domain -> Boolean

**END-CLASS**

**CLASS** Rule

**IS-SUBTYPE-OF** EMOF::NamedElement

**ASSOCIATES** <<Rule-Domain>> <<Rule-Rule>> <<Transformation-Rule>>

**END-CLASS**

**ASSOCIATION** Transformation-Transformation

**IS** Unidirectional-2 [Transformation: class1; Transformation: class2; extendedBy: role1; extends: role2; \*: mult1; 0..1: mult2; +: visibility1; +: visibility2]

**END-ASSOCIATION**

**ASSOCIATION** Transformation-Rule

**IS** Composition-2 [Transformation: class1; Rule: class2; transformation: role1; rule: role2; 1: mult1; \*: mult2; +: visibility1; +: visibility2]

**END-ASSOCIATION... END-PACKAGE**

NEREUS can be integrated with object-oriented languages such as Eiffel. The article (Favre, 2005) describes a forward engineering process from UML static models to object-oriented code. More information related to the NEREUS approach may be found in (Favre, 2010) and (Favre, 2009). However, we would like to remark that here NEREUS is used as an intermediate formal notation to communicate the essentials of an MDA reverse engineering approach.

**4. Reverse engineering of object-oriented code**

In this section we analyze traditional reverse engineering techniques based on static and dynamic analysis. In particular, we show how to reverse engineer object-oriented code into models. Static analysis extracts static information that describes the structure of the software as reflected in the software documentation (e.g., the text of the source code), while dynamic analysis extracts information that describes the run-time behavior. Static information can be extracted by using techniques and tools based on compiler techniques such as parsing and data flow algorithms. On the other hand, dynamic information can be extracted by using debuggers, event recorders and general tracer tools.

Figure 3 shows the different phases. The source code is parsed to obtain an abstract syntax tree (AST) associated with the grammar of the source programming language. Next, a metamodel extractor extracts a simplified, abstract version of the language that ignores all instructions that do not affect the data flows, for instance all control flow constructs such as conditionals and loops.

The information represented according to this metamodel allows building the Object Flow Graph (OFG, see Section 4.1) for a given source code, as well as conducting all other analyses that do not depend on the graph. The idea is to derive information statically by propagating data through the graph. Different kinds of analysis propagate different kinds of information in the data-flow graph, extracting the different kinds of diagrams that are included in a model.
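To make the idea of deriving information by propagation more concrete, the following is a minimal sketch in Java (assuming Java 17 or later). It propagates allocation-site labels through a small data-flow graph, following the *in*/*out*/*gen*/*kill* scheme detailed in Section 4.1. The class name `OfgPropagation`, the `Edge` record, the location names and the labels `A1` and `B1` are illustrative assumptions, not part of the original approach.

```java
import java.util.*;

class OfgPropagation {
    // A data flow from one program location to another (hypothetical helper type).
    record Edge(String from, String to) {}

    // Forward propagation to a fixpoint; returns out[n] for every node n.
    static Map<String, Set<String>> propagate(List<Edge> edges,
                                               Map<String, Set<String>> gen,
                                               Map<String, Set<String>> kill) {
        // N: every location mentioned by an edge or by gen.
        Set<String> nodes = new TreeSet<>(gen.keySet());
        Map<String, Set<String>> pred = new HashMap<>();
        for (Edge e : edges) {
            nodes.add(e.from());
            nodes.add(e.to());
            pred.computeIfAbsent(e.to(), k -> new TreeSet<>()).add(e.from());
        }
        // Initialization: in[n] = {}, so out[n] = gen[n].
        Map<String, Set<String>> in = new HashMap<>();
        Map<String, Set<String>> out = new TreeMap<>();
        for (String n : nodes) {
            in.put(n, new TreeSet<>());
            out.put(n, new TreeSet<>(gen.getOrDefault(n, Set.of())));
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String n : nodes) {
                // in[n] = union of out[p] over all predecessors p of n.
                Set<String> newIn = new TreeSet<>();
                for (String p : pred.getOrDefault(n, Set.of())) {
                    newIn.addAll(out.get(p));
                }
                // out[n] = gen[n] U (in[n] - kill[n]).
                Set<String> newOut = new TreeSet<>(newIn);
                newOut.removeAll(kill.getOrDefault(n, Set.of()));
                newOut.addAll(gen.getOrDefault(n, Set.of()));
                if (!newIn.equals(in.get(n)) || !newOut.equals(out.get(n))) {
                    in.put(n, newIn);
                    out.put(n, newOut);
                    changed = true;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Flows for: a = new A(); b = a; c = b; d = new B(); c = d;
        List<Edge> edges = List.of(
            new Edge("A.A.this", "a"), new Edge("a", "b"), new Edge("b", "c"),
            new Edge("B.B.this", "d"), new Edge("d", "c"));
        Map<String, Set<String>> gen = Map.of(
            "A.A.this", Set.of("A1"), "B.B.this", Set.of("B1"));
        System.out.println(propagate(edges, gen, Map.of()).get("c")); // prints [A1, B1]
    }
}
```

In this sketch the location `c` ends up holding both labels because the analysis ignores the order in which the two assignments to `c` are executed.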

The static analysis is based on classical compiler techniques (Aho, Sethi & Ullman, 1985) and abstract interpretation (Jones & Nielson, 1995). The generic flow propagation algorithms are specializations of classical flow analysis techniques. Because there are many possible executions, it is usually not reasonable to consider all states of the program. Thus, static analysis is based on abstract models of the program state that are easier to manipulate, although they lose some information. Abstract interpretation of the program state makes it possible to obtain automatically as much information as possible about program executions without having to run the program on all input data, thereby ensuring computability or tractability.
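As a small illustration of how an abstract model trades precision for tractability, the sketch below uses the textbook sign abstraction of integers. This is not the analysis used in this chapter; the class `SignAbstraction` and its operations are assumptions chosen only to show where information is lost.

```java
class SignAbstraction {
    // Each abstract value stands for a whole set of concrete integers.
    enum Sign { NEG, ZERO, POS, TOP }

    // Abstraction of a single concrete integer.
    static Sign abs(int v) {
        return v < 0 ? Sign.NEG : v == 0 ? Sign.ZERO : Sign.POS;
    }

    // Abstract multiplication: the sign of a product is always determined.
    static Sign mul(Sign a, Sign b) {
        if (a == Sign.ZERO || b == Sign.ZERO) return Sign.ZERO;
        if (a == Sign.TOP || b == Sign.TOP) return Sign.TOP;
        return a == b ? Sign.POS : Sign.NEG;
    }

    // Abstract addition: adding values of opposite sign yields TOP ("unknown").
    static Sign add(Sign a, Sign b) {
        if (a == Sign.ZERO) return b;
        if (b == Sign.ZERO) return a;
        return a == b ? a : Sign.TOP;
    }

    public static void main(String[] args) {
        System.out.println(mul(abs(-3), abs(-7))); // prints POS
        System.out.println(add(abs(5), abs(-2)));  // prints TOP
    }
}
```

The second result shows the loss of information: the concrete sum is positive, but the abstract model can only answer "unknown" without examining concrete inputs.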

The static analysis builds a partial model (PIM or PSM) that must be refined by dynamic analysis. Dynamic analysis is based on testing and profiling. Execution tracer tools generate execution model snapshots that allow us to deduce complementary information. Execution models, programs and UML models coexist in this process. An object-oriented execution model has the following components: a set of objects; a set of attributes for each object; a location for each object; a value of an object type, to which each object refers; and a set of messages, each of which includes a name selector and may include one or more arguments. Additionally, types are available for describing the types of attributes and of the parameters of methods or constructors. On the other hand, an object-oriented program model has a set of classes, a set of attributes for each class, a set of operations for each class, and a generalization hierarchy over classes.
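The two kinds of model can be contrasted with a few data types. The following is a sketch only, assuming Java 17 or later; the names `ClassDecl`, `ObjectSnapshot`, `Message` and the check `conforms` are hypothetical and merely mirror the components listed above.

```java
import java.util.List;
import java.util.Map;

class ModelSketch {
    // Program model: a class with attributes, operations and a generalization link.
    record ClassDecl(String name, String superclass,
                     List<String> attributes, List<String> operations) {}

    // Execution model: an object at a location, with a type and attribute values.
    record ObjectSnapshot(String location, String type,
                          Map<String, String> attributeValues) {}

    // A message has a name selector and zero or more arguments.
    record Message(String selector, List<String> arguments) {}

    // True if the snapshot has the class's type and only uses declared attributes.
    static boolean conforms(ObjectSnapshot o, ClassDecl c) {
        return o.type().equals(c.name())
            && c.attributes().containsAll(o.attributeValues().keySet());
    }

    public static void main(String[] args) {
        ClassDecl account = new ClassDecl("Account", "Object",
            List.of("balance"), List.of("deposit"));
        ObjectSnapshot snap = new ObjectSnapshot("obj#1", "Account",
            Map.of("balance", "100"));
        Message msg = new Message("deposit", List.of("50"));
        System.out.println(conforms(snap, account)); // prints true
        System.out.println(msg.selector());          // prints deposit
    }
}
```

A check such as `conforms` hints at how snapshots produced by a tracer could be related back to a statically recovered class.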

The combination of static and dynamic analysis can enrich the reverse engineering process. The two can be combined in different ways, for instance by performing static analysis first and then dynamic analysis, or by iterating between static and dynamic analysis.

#### **4.1 Static analysis**

The concepts and algorithms of data flow analysis described in (Aho, Sethi & Ullman, 1985) are adapted for reverse engineering object-oriented code. Data flow analysis infers information about the behavior of a program by analyzing only the text of the source code. The basic representation of this static analysis is the Object Flow Graph (OFG), which allows tracing information about object interactions from object creation, through object assignment to variables and attributes, to their use in messages (method invocations). The OFG is defined as an oriented graph that represents all data flows linking objects.

The static analysis is data flow sensitive, but control flow insensitive. This means that programs with different control flows and the same data flows are associated with the same analysis results. The choice of this program representation is motivated by the computational complexity of the algorithms involved. On the one hand, control flow sensitive analysis is computationally intractable and, on the other hand, data flow sensitive analysis is aligned with the "nature" of object-oriented programs, whose execution models impose more constraints on the data flows than on the control flows. For example, the sequence of method invocations may change when moving from one application that uses a class to another, while the possible ways to copy and propagate object references remain more stable.

A consequence of the control flow insensitivity is that the construction of the OFG can be described with reference to a simplified, abstract version of the object-oriented language in which instructions related to flow control are ignored. A generic flow propagation algorithm working on the OFG processes object information. In the following, we describe the three essential components of the common analysis framework: the simplified abstract object-oriented language, the data flow graph and the flow propagation algorithm.

Fig. 3. Static and dynamic analysis

MDA-Based Reverse Engineering 71

All instructions that refer to data flows are represented in the abstract language, while all control flow instructions, such as conditionals and the different iteration constructs, are ignored. To avoid name conflicts, all identifiers are given fully scoped names that include the list of enclosing packages, classes and methods. The abstract syntax of the simplified language (Tonella & Potrich, 2005) is as follows:

(1) P ::= D\*S\*

(2) D ::= a

(3) | m (p1,p2,…,pj)

(4) | cons (p1,p2,…,pj)

(5) S ::= x = new c (a1,a2,…,aj)

(6) | x = y

(7) | [x = ] y.m (a1,a2,…,aj)


Some notational conventions are considered: non-terminals are denoted by upper case letters; *a* is a class attribute name; *m* is a method name; p1, p2,…,pj are formal parameters; a1, a2,…,aj are actual parameters; *cons* is a class constructor and *c* is a class name. *x* and *y* are program locations, i.e. globally identifiable data objects with an address in memory, such as variables, class attributes and method parameters.

A program P consists of zero or more declarations (D\*) concatenated with zero or more statements (S\*). The order of declarations and statements is irrelevant. The nesting structure of packages, classes and statements is flattened, i.e. statements belonging to different methods are identified by the fully scoped names of their identifiers.

There are three types of declarations: attribute declarations (2), method declarations (3) and constructor declarations (4). An attribute declaration is defined by its scope, determined by the list of enclosing packages and classes, followed by the attribute identifier. A method declaration consists of its name followed by a list of formal parameters (p1,p2,…,pj). Constructors have a similar declaration.

There are three types of statements: allocation statements (5), assignments (6) and method invocations (7). The left-hand side and the right-hand side of every statement are program locations. The target of a method invocation is also a program location.
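The declarations and statements described above can be written down as plain data types. The following is a minimal sketch, assuming Java 17 or later; the type names and the tiny `Library` example program are hypothetical and serve only to show the shape of the abstract representation, with every identifier carrying its fully scoped name.

```java
import java.util.List;

class AbstractSyntax {
    interface Declaration {}
    record AttributeDecl(String name) implements Declaration {}                          // (2)
    record MethodDecl(String name, List<String> formals) implements Declaration {}      // (3)
    record ConstructorDecl(String name, List<String> formals) implements Declaration {} // (4)

    interface Statement {}
    record Allocation(String lhs, String className, List<String> actuals)
        implements Statement {}                                                          // (5)
    record Assignment(String lhs, String rhs) implements Statement {}                   // (6)
    // lhs is null when the returned value is not assigned.
    record Invocation(String lhs, String target, String method, List<String> actuals)
        implements Statement {}                                                          // (7)

    // (1) A program is a flat list of declarations plus a flat list of statements.
    record Program(List<Declaration> declarations, List<Statement> statements) {}

    // Abstract form of a hypothetical class Library with attribute loan and method add(l).
    static Program example() {
        return new Program(
            List.of(
                new AttributeDecl("Library.loan"),
                new ConstructorDecl("Library.Library", List.of()),
                new MethodDecl("Library.add", List.of("Library.add.l"))),
            List.of(
                new Allocation("Main.main.lib", "Library", List.of()),
                new Assignment("Library.loan", "Library.add.l"),
                new Invocation(null, "Main.main.lib", "Library.add",
                    List.of("Main.main.ln"))));
    }

    public static void main(String[] args) {
        Program p = example();
        System.out.println(p.declarations().size() + " declarations, "
            + p.statements().size() + " statements"); // prints 3 declarations, 3 statements
    }
}
```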

The transformation of an object-oriented program into this simplified language can be easily automated.
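One way such an automated transformation might look is sketched below, assuming Java 17 or later. A toy source tree with conditionals and loops is flattened into a list of assignments with scoped names, while conditions and nesting are discarded. The node types `Assign`, `If` and `While` are hypothetical and far simpler than a real Java AST, and for brevity every name is treated as local to the enclosing method.

```java
import java.util.ArrayList;
import java.util.List;

class FlattenSketch {
    // Toy source-level nodes (illustrative only).
    interface Node {}
    record Assign(String lhs, String rhs) implements Node {}
    record If(String condition, List<Node> thenBranch, List<Node> elseBranch) implements Node {}
    record While(String condition, List<Node> body) implements Node {}

    // Keeps the data flows (assignments) and ignores conditions and nesting.
    static void flatten(String scope, List<Node> nodes, List<String> out) {
        for (Node n : nodes) {
            if (n instanceof Assign a) {
                out.add(scope + "." + a.lhs() + " = " + scope + "." + a.rhs());
            } else if (n instanceof If i) {
                flatten(scope, i.thenBranch(), out);
                flatten(scope, i.elseBranch(), out);
            } else if (n instanceof While w) {
                flatten(scope, w.body(), out);
            }
        }
    }

    static List<String> flatten(String scope, List<Node> nodes) {
        List<String> out = new ArrayList<>();
        flatten(scope, nodes, out);
        return out;
    }

    public static void main(String[] args) {
        // x = y; if (c) { z = x; } else { z = y; } while (d) { y = z; }
        List<Node> body = List.of(
            new Assign("x", "y"),
            new If("c", List.of(new Assign("z", "x")), List.of(new Assign("z", "y"))),
            new While("d", List.of(new Assign("y", "z"))));
        flatten("pkg.C.m", body).forEach(System.out::println);
    }
}
```

Two method bodies that contain the same assignments under different conditions or loops produce the same flattened list, which is what control flow insensitivity means in practice.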

The Object Flow Graph (OFG) is a pair (N, E) where N is a set of nodes and E is a set of edges. A node is added for each program location (e.g. a variable, formal parameter or attribute). Edges represent the data flows appearing in the program. They are added to the OFG according to the rules specified in (Tonella & Potrich, 2005, p. 26). Next, we describe the rules for constructing the OFG from Java statements:

(1) P ::= D\*S\* { }

(2) D ::= a { }

(3) | m (p1,p2,…,pj) { }

(4) | cons (p1,p2,…,pj) { }

(5) S ::= x = new c (a1,a2,…,aj) {(a1,p1) ∈ E, …, (aj,pj) ∈ E, (cons.this,x) ∈ E}

(6) | x = y {(y,x) ∈ E}

(7) | [x = ] y.m (a1,a2,…,aj) {(y, m.this) ∈ E, (a1,p1) ∈ E, …, (aj,pj) ∈ E, (m.return,x) ∈ E}

When a constructor or method is invoked, edges are built that connect each actual parameter ai to the respective formal parameter pi. In the case of a constructor invocation, the newly created object, referenced by *cons.this*, is paired with the left-hand side *x* of the related assignment. In the case of a method invocation, the target object *y* becomes *m.this* inside the called method, generating the edge (*y, m.this*), and the value returned by method *m* (if any) flows to the left-hand side *x* (pair (*m.return, x*)).

Some edges in the OFG may be related to object flows that are external to the analyzed code. Examples of external flows are those related to the usage of class libraries, dynamic loading (through reflection) or access to modules written in another programming language. Since these external flows can all be treated in a similar way, we next show how the usage of class libraries affects the OFG.

Each time a library class introduces a data flow from a variable *x* to a variable *y*, an edge *(x,y)* must be included in the OFG. Containers are an example of library classes that introduce external data flows, for instance any Java class implementing the interface *Collection* or the interface *Map*. Object containers provide two basic operations affecting the OFG: insert, for adding an object to a container, and extract, for accessing an object in a container. In the abstract program representation, insertion and extraction methods are associated with container objects.

Next, we show the pseudo-code of a generic forward propagation algorithm that is a specific instance of the algorithms applied to control flow graphs described in (Aho, Sethi & Ullman, 1985):

```
// initialization: no incoming information yet
for each node n ∈ N
  in[n] = {};
  out[n] = gen[n] ∪ (in[n] - kill[n])
endfor
// iterate until a fixpoint is reached
while any in[n] or out[n] changes
  for each node n ∈ N
    in[n] = ∪ { out[p] : p ∈ pred(n) };
    out[n] = gen[n] ∪ (in[n] - kill[n])
  endfor
endwhile
```
Let *gen[n]* and *kill[n]* be two sets associated with each basic node *n* ∈ N. *gen[n]* is the set of flow information entities generated by *n*. *kill[n]* is the set of definitions outside of *n* that define entities that also have definitions within *n*. Two sets of equations, called data-flow equations, relate the incoming and outgoing flow information inside the sets:

$$\mathit{in}[n] = \bigcup_{p \in \mathit{pred}(n)} \mathit{out}[p]$$

$$\mathit{out}[n] = \mathit{gen}[n] \cup (\mathit{in}[n] - \mathit{kill}[n])$$

Each node *n* stores the incoming and outgoing flow information inside the sets *in[n]* and *out[n]*, which are initially empty. Each node *n* generates the set of flow information entities included in the *gen[n]* set, and prevents the elements of the *kill[n]* set from being further propagated after node *n*. In forward propagation, *in[n]* is obtained from the predecessors of node *n* as the union of their respective *out* sets.

The OFG based on the previous rules is "object insensitive"; this means that it is not possible to distinguish two locations (e.g. two class attributes) when they belong to different class instances. An object sensitive OFG might improve the analysis results. It can be built by giving all non-static program locations an object scope instead of a class scope, where objects are identified statically by their allocation points. Thus, in an object sensitive OFG, non-static class attributes and methods, with their parameters and local variables, are replicated for every statically identified object.

**4.2 Dynamic analysis**

Dynamic analysis operates by generating execution snapshots to collect life cycle traces of object instances and observing the executions to extract information. Ernst (2003) argues that whereas the chief challenge of static analysis is choosing a good abstract interpretation, the chief challenge of performing good dynamic analysis is selecting a representative set of test cases. A test case can help to detect properties of the program, but it can be difficult to determine whether the results of a test are true program properties or properties of a particular execution context. The main limitation of dynamic analysis is related to the quality of the test cases used to produce diagrams.

Integrating dynamic and static analysis seems to be beneficial. The static and dynamic information could be shown as separate views or merged in a single view. In general, the outcome of the dynamic analysis could be visualized as a set of diagrams, each one associated with one execution trace of a test case. Although the construction of these diagrams can be automated, their analysis requires human intervention in most cases.

Maoz and Harel (2010) present a powerful technique for the visualization and exploration of execution traces of models, which differs from previous approaches that consider execution traces at the code level. This technique belongs to the domain of model-based dynamic analysis, adapting classical visualization paradigms and techniques to the specific needs of dynamic analysis. It allows relating the system execution traces to its models in different tasks, such as testing whether a system run satisfies model properties. We consider that these results allow us to address reverse engineering challenges in the context of model-driven development.

Dynamic analysis depends on the quality of the test cases.

**4.3 An example: Recovering class diagram**

In this section we describe how to extract class diagrams from Java code. A class diagram is a representation of the static view that shows a collection of static model elements, such as
