#### **4.2. Workflow, tools and data**

#### **Workflow**

Now we describe the workflow used as our case study. The objective of the workflow was to identify differentially expressed genes in human kidney and liver cancerous cells [47, 60], with fragments of genes sequenced with Illumina technology [35]. The workflow consists of four phases (Figure 3): (i) mapping the input sequences onto the 24 human chromosome sequences; (ii) converting the format from SAM (the Bowtie output) to BED (the input format of the coverageBed program); (iii) generating fixed intervals for all chromosomes based on their length, since these intervals are the input for the coverageBed program; and (iv) executing the coverageBed program, which generates histograms showing the number of mappings for each interval.

The *mapping phase* has the objective of identifying the region of the reference genome where each input sequence is located. A set of sequences mapping to the same region allows the inference that these sequences have the same structural organization as the reference genome.

The *coverageBed* program [54] allowed us to study the expression level of the cancerous genes using histograms of the input sequences mapped onto the human reference genome, so that genes differentially expressed between kidney and liver cancer could be identified.
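To make the data flow through the four phases concrete, the sketch below chains them for a single chromosome on one machine. It is only an illustration: the exact command-line options, file names and script arguments are assumptions, and in the case study these jobs were actually distributed by BioNimbus over the federated clouds.

```java
import java.io.IOException;
import java.util.List;

/**
 * Minimal sketch of the four workflow phases chained on one machine.
 * Tool options, file names and script arguments are illustrative
 * assumptions; all tools must be available on the PATH.
 */
public class WorkflowSketch {

    // Runs one phase as an external process and waits for it to finish.
    private static void run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("Phase failed: " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        // (i) Map the Illumina reads onto one chromosome with Bowtie (SAM output).
        run(List.of("bowtie", "-S", "chr1_index", "kidney_reads.fq", "kidney_chr1.sam"));

        // (ii) Convert the SAM output to BED with the SAM2BED Perl script.
        run(List.of("perl", "SAM2BED.pl", "kidney_chr1.sam", "kidney_chr1.bed"));

        // (iii) Generate fixed intervals along the chromosome with genome2interval.
        run(List.of("perl", "genome2interval.pl", "chr1.length", "chr1_intervals.bed"));

        // (iv) Count mappings per interval with coverageBed (BEDTools);
        //      the histogram is written to standard output.
        run(List.of("coverageBed", "-a", "chr1_intervals.bed", "-b", "kidney_chr1.bed"));
    }
}
```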

**Figure 3.** The workflow investigated the expression level of liver and kidney genes (sequenced with Illumina) by mapping them onto the 24 human chromosome sequences.

#### **Tools**

Four tools were installed in each cloud provider: Bowtie [44], SAM2BED and genome2interval (two Perl scripts implementing the workflow), and coverageBed (part of the BEDTools suite) [54].

#### **Data**


The 24 human chromosome sequences were downloaded from HG19:

ftp://ftp.ncbi.nih.gov/genomes/H\_sapiens/Assembled\_chromosomes/seq/

The reference site is http://www.ncbi.nlm.nih.gov/genome/assembly/293148/, and the names of the files followed the format hs\_ref\_GRCh37.p5\_chr\*.fa.gz.
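As an illustration of this naming convention, the sketch below fetches the 24 files from the FTP directory given above. The chromosome labels (1-22, X and Y) are assumed from the file-name pattern; they are not listed explicitly in the text.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of downloading the 24 human chromosome sequences (GRCh37.p5)
 * used as the reference. The chromosome labels are assumed from the
 * pattern hs_ref_GRCh37.p5_chr*.fa.gz.
 */
public class ReferenceDownload {

    private static final String BASE =
            "ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/";

    public static void main(String[] args) throws Exception {
        List<String> chromosomes = new ArrayList<>();
        for (int i = 1; i <= 22; i++) chromosomes.add(String.valueOf(i));
        chromosomes.add("X");
        chromosomes.add("Y");

        for (String chr : chromosomes) {
            String file = "hs_ref_GRCh37.p5_chr" + chr + ".fa.gz";
            try (InputStream in = new URL(BASE + file).openStream()) {
                Files.copy(in, Path.of(file));   // save locally under the same name
            }
        }
    }
}
```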

#### **4.3. Implementation details**

A message module allowed the communication among the services. It was created using the Netty communication library [36], which is responsible for managing TCP connection events. Messages were serialized in JSON format [23] using the Jackson library [21], and file transfer was accomplished through the HTTP GET and PUT methods. Message and file communications used a single TCP port, which avoided the need to create complex firewall rules; besides, the message module is capable of multiplexing message and file traffic. A simplified version of the *Chord* protocol [67] was implemented for the P2P network and the plug-ins. We developed plug-in prototypes for Apache Hadoop and Sun Grid Engine. Java was the language used to implement the BioNimbus prototype.
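The sketch below illustrates how a message can be serialized with the Jackson library before being written to the single TCP port managed by the message module. The JobStartReq fields shown here are assumptions made for the example; the actual BioNimbus message schema is not described in this chapter.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Sketch of JSON message serialization with Jackson, as used by the
 * message module before a message is sent over the Netty channel.
 * The fields of this message type are illustrative assumptions.
 */
public class MessageSerialization {

    // Hypothetical message asking that a job be started.
    public static class JobStartReq {
        public String jobId;
        public String tool;        // e.g. "bowtie"
        public String inputFile;   // file identifier in the federation file table

        public JobStartReq() { }   // Jackson needs a no-arg constructor

        public JobStartReq(String jobId, String tool, String inputFile) {
            this.jobId = jobId;
            this.tool = tool;
            this.inputFile = inputFile;
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Serialize to JSON before writing the message to the channel.
        String json = mapper.writeValueAsString(
                new JobStartReq("job-42", "bowtie", "kidney_reads.fq"));
        System.out.println(json);

        // Deserialize on the receiving side.
        JobStartReq req = mapper.readValue(json, JobStartReq.class);
        System.out.println(req.tool + " for " + req.jobId);
    }
}
```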

Next, we briefly describe some features of the services implemented in our BioNimbus prototype (Figure 4).



**Figure 4.** The services implemented for our case study, noting that we used two cloud providers in the BioNimbus prototype.

• Monitoring service: there is a thread responsible for waiting for the user requests and for the responses received from other services of the infrastructure. When a request to initiate a job (JobStartReq) is received, this thread generates a unique identifier for the job and saves this information in the PendingJobs map. Next, it calls the scheduling policy, which returns a mapping between the jobs and the plug-ins that can execute them. When the jobs are all scheduled, this thread sends requests to create the tasks (TaskStartReq) that have to be executed in the cloud providers, and waits for their corresponding responses (TaskStartReply). When a response is received, the service removes the job from the PendingJobs map and creates an entry in the RunningJobs map, with information about the job and its corresponding tasks, removing a job when all its tasks finish. As previously mentioned, a new *DynamicAHP* algorithm was implemented in BioNimbus [12], based on a decision-making strategy proposed by [61].

Another thread in the monitoring service, executed at intervals of 15 seconds, is responsible for following up on the jobs. First, it sends status requests (TaskStatusReq) for each job registered in RunningJobs. The response (TaskStatusReply), treated by the previously described thread, can trigger the scheduling service again according to some parameters; in that case, cancelling messages (TaskCancelReq and TaskCancelReply) are sent, and the job is reinserted in the PendingJobs map and removed from RunningJobs. This thread also verifies whether there are pending jobs in PendingJobs, initiating another scheduling process in this case, and sends query messages to the discovery service (CloudReq) and to the storage service (ListReq), whose responses (CloudReply and ListReply) are received by the first thread and used by the scheduling policy when needed. (A condensed sketch of the monitoring service appears after these service descriptions.)

• Storage service: two threads were used for its implementation. The first one waits for the requests sent by other services. To handle a request to save a file (StoreReq), the storage service executes the storage policy adopted in BioNimbus; for this case study, we used a round-robin over the plug-ins that reported having enough space to store the file. When a cloud is chosen, a response (StoreReply) is sent to the requesting service, which then sends the file to the cloud indicated by the storage service. When this transfer finishes, the plug-in receiving the file sends a special message (StoreAck), which contains the information needed to keep the federation file table correct. In the case study, a simple backend was implemented to maintain the federation file table.

Every time a new confirmation is received by the storage service, it adds an entry to the map file, keyed by the file identifier and containing information such as name, size and storage cloud. This mapping is stored in JSON format [23] in a file in the federation file system of the cloud where the service is executed. When initiating its execution, the storage service verifies whether the map file exists and loads into memory the last status of the federation file table.

The other two message types treated by the first thread are file listing (ListReq) and file localization (GetReq). For the first, the thread builds a response (ListReply) with the mapping loaded in memory. For localization, it builds a response (GetReply) by searching for the cloud information in the map using the request identifier.

Finally, another thread is executed at intervals of 30 seconds, requesting from the discovery service the current configuration of the federation (CloudReq message). The received information is used by the storage policy. (A condensed sketch of the storage service appears below.)
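The following sketch condenses the monitoring-service logic described above: the request-handling thread that keeps the PendingJobs and RunningJobs maps, and a second thread that runs every 15 seconds. The SchedulingPolicy interface, the JobInfo class and the method names are assumptions for illustration, not the BioNimbus API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Condensed sketch of the monitoring service; types and names are assumptions. */
public class MonitoringServiceSketch {

    private final Map<String, JobInfo> pendingJobs = new ConcurrentHashMap<>();
    private final Map<String, JobInfo> runningJobs = new ConcurrentHashMap<>();
    private final SchedulingPolicy policy;   // e.g. DynamicAHP
    private final ScheduledExecutorService followUp =
            Executors.newSingleThreadScheduledExecutor();

    public MonitoringServiceSketch(SchedulingPolicy policy) {
        this.policy = policy;
        // Second thread: every 15 seconds, check running jobs and reschedule pending ones.
        followUp.scheduleAtFixedRate(this::followJobs, 15, 15, TimeUnit.SECONDS);
    }

    // First thread: handle a JobStartReq coming from a user.
    public void onJobStartReq(JobInfo job) {
        String id = UUID.randomUUID().toString();      // unique job identifier
        pendingJobs.put(id, job);
        // The scheduling policy maps pending jobs to the plug-ins able to run them.
        policy.schedule(pendingJobs)
              .forEach((jobId, plugin) -> sendTaskStartReq(plugin, jobId));
    }

    // First thread: a TaskStartReply confirms the tasks were created in a cloud.
    public void onTaskStartReply(String jobId) {
        JobInfo job = pendingJobs.remove(jobId);
        if (job != null) {
            runningJobs.put(jobId, job);
        }
    }

    private void followJobs() {
        runningJobs.keySet().forEach(this::sendTaskStatusReq);
        if (!pendingJobs.isEmpty()) {
            policy.schedule(pendingJobs)
                  .forEach((jobId, plugin) -> sendTaskStartReq(plugin, jobId));
        }
    }

    // Placeholders for calls to the message module.
    private void sendTaskStartReq(String plugin, String jobId) { /* ... */ }
    private void sendTaskStatusReq(String jobId) { /* ... */ }

    interface SchedulingPolicy { Map<String, String> schedule(Map<String, JobInfo> pending); }
    static class JobInfo { }
}
```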
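Similarly, the next sketch condenses the storage-service behaviour: a round-robin choice among the plug-ins that reported enough free space, and a federation file table persisted as JSON with Jackson. The PluginInfo and FileEntry types are assumptions for illustration. Persisting the table after every StoreAck keeps the last known state on disk, which is what allows the service to reload the federation file table when it restarts.

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Condensed sketch of the storage service; types and names are assumptions. */
public class StorageServiceSketch {

    private final ObjectMapper mapper = new ObjectMapper();
    private final File tableFile = new File("federation-file-table.json");
    private Map<String, FileEntry> fileTable = new HashMap<>();
    private int next = 0;   // round-robin cursor

    // On start-up, reload the last known state of the federation file table.
    public void start() throws IOException {
        if (tableFile.exists()) {
            fileTable = mapper.readValue(tableFile,
                    new TypeReference<Map<String, FileEntry>>() { });
        }
    }

    // StoreReq: pick, in round-robin order, a plug-in that reported enough space.
    public PluginInfo chooseCloud(List<PluginInfo> plugins, long fileSize) {
        for (int i = 0; i < plugins.size(); i++) {
            PluginInfo candidate = plugins.get((next + i) % plugins.size());
            if (candidate.freeSpace >= fileSize) {
                next = (next + i + 1) % plugins.size();
                return candidate;   // returned to the requester in the StoreReply
            }
        }
        throw new IllegalStateException("No cloud has enough space for the file");
    }

    // StoreAck: record the new file and persist the table as JSON.
    public void onStoreAck(String fileId, FileEntry entry) throws IOException {
        fileTable.put(fileId, entry);
        mapper.writeValue(tableFile, fileTable);
    }

    public static class PluginInfo { public String cloudId; public long freeSpace; }
    public static class FileEntry { public String name; public long size; public String cloudId; }
}
```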

#### **4.4. Results**


We executed the workflow at the University of Brasilia (UnB), on Amazon EC2, and on both cloud providers (Table 1).


| Cloud providers | Execution time (hour:minute:second) |
|---|---|
| University of Brasilia (UnB) | 1:11:47 |
| Amazon EC2 | 1:18:44 |
| Both clouds (UnB and EC2) | 1:09:07 |

**Table 1.** Workflow execution time on each cloud and on the BioNimbus federated cloud.

We measured how the file transfer time affected the total job execution time. Table 2 and Figure 5 show the total and file transfer times of the 18 longest jobs of the workflow, as well as the percentage of the file transfer time relative to the total time. These percentages show that file transfer represents at least 50% of the total execution time of these jobs. This means that, in federated clouds executing data-driven bioinformatics applications, storage services have to be specially designed to minimize huge file transfers as much as possible.

**Table 2.** Total and file transfer times of the longest jobs executed in BioNimbus.

**Figure 5.** Comparing the total and file transfer times of the longest jobs executed in BioNimbus. The file transfer time is colored red, while its percentage relative to the total time is shown in blue.

We also investigated how the execution time of a job was affected when it was sent for execution in a cloud provider, took a long time, was cancelled and returned to the list of pending jobs to be executed again. Seven jobs were cancelled, the first seven jobs in Table 2, those with the longest execution times. In an experiment without cancelling jobs, we obtained longer times than in the experiment with cancelling, since the cancelled jobs were resubmitted to clouds that were almost idle.

We mention now some points that can affect BioNimbus performance: (i) the scheduler does not consider jobs being transferred and identifies CPUs involved in these transfers as idle; (ii) the input files are all downloaded simultaneously, i.e., there are no priorities for downloads; (iii) jobs are currently cancelled based only on the wait time in the pending jobs list, i.e., the file transfer time is not considered; and (iv) jobs with small input files that were sent to a cloud provider after jobs with large input files were executed earlier, while the latter were still downloading their input data.



Table 3 and Figure 6 show the number of jobs executed in a single cloud provider and on both. Note that, including the transfer time, jobs with smaller inputs execute faster on two cloud providers, since the possibility of cancelling delayed running jobs and scheduling them again lowered the total execution time. Besides, when files are small the transfer time is short, while when they are large the transfer time strongly affects the total execution time (as shown in Table 2). Thus, for large files, the storage policy has to be very carefully designed, using replication and fragmentation, in order to significantly decrease file transfer time.


| Cloud providers | Until 200 seconds | Between 200 and 1000 seconds | Above 1000 seconds |
|---|---|---|---|
| University of Brasilia (UnB) | 34 | 30 | 32 |
| Amazon EC2 | 37 | 27 | 32 |
| UnB and EC2 | 64 | 8 | 24 |

**Table 3.** Number of executed jobs, where time includes the file transfer time.

**Figure 6.** Comparing the number of executed jobs in BioNimbus, where time (in seconds) includes the file transfer time.

**5. Related work**

In this section, we discuss cloud projects designed to accelerate execution and increase the amount of storage available to bioinformatics applications. When compared to BioNimbus, these projects are dedicated to particular applications or are executed in a single cloud environment. BioNimbus intends to integrate public and private centers offering bioinformatics applications in one single platform using the hybrid federated cloud paradigm.
