

running with 8 processes on 8 cores. From previous work (Osthoff et al., 2011a), we know that the scalability of the OLAM MPI implementation on a two-socket Quad-Core Xeon E5520 system is limited by memory contention, and that the hotspot routine presents the highest number of instruction cache misses. We observe that, for this system, the hotspot routine does not have the highest number of cache misses. The hotspot routine on the single-node system is a CPU-intensive calculation routine, so we conclude that instruction cache misses have a lower performance impact on this processor architecture. We also observe that the hotspot routine has no MPI communication overhead, which makes it a good candidate for a future GPU implementation. From the Vtune analysis, we further observe that part of the overhead of the hotspot routines is due to the execution time of MPI communication routines. We conclude that the speed-up is limited by, in decreasing order of importance: multi-core CPU processing power, multi-core memory contention, and MPI communication routine overhead.
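The cache-miss and CPI figures above were obtained with Vtune. As a rough illustration of what is being measured, the sketch below reads the equivalent hardware counters around a stand-in loop with the PAPI low-level API; the event selection, loop body and variable names are illustrative assumptions, not OLAM code or the authors' methodology.

```cpp
// Hypothetical sketch: reading cycles, instructions and L2 misses around a
// hot loop with PAPI, to derive CPI = cycles / instructions.
// Not OLAM code; event availability depends on the processor.
#include <papi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_CYC);   // total cycles
    PAPI_add_event(evset, PAPI_TOT_INS);   // total instructions
    PAPI_add_event(evset, PAPI_L2_TCM);    // L2 total cache misses

    const int n = 1 << 20;
    double *a = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) a[i] = i;

    PAPI_start(evset);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += a[i] * a[i];  // stand-in for a hotspot loop
    long long counts[3];
    PAPI_stop(evset, counts);

    printf("sum=%g  CPI=%.2f  L2 misses=%lld\n",
           sum, (double)counts[0] / (double)counts[1], counts[2]);
    free(a);
    return 0;
}
```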

**3.4.2 The hybrid MPI/OpenMP implementation's analysis**

The hybrid OLAM MPI/OpenMP implementation starts one MPI process on the system, which then spawns OpenMP threads on the cores of the platform. The OpenMP threads are created on nine *do* loops of the routine with the highest number of cache misses. With the Vtune performance analyzer we observed that the execution time and the number of cache misses of this routine decrease by up to 50% in comparison with the MPI-only implementation. These results show that the use of OpenMP improved OLAM's memory usage on this architecture. On the other hand, we observe that the CPI (clocks per instruction) increases from 1.2 in the MPI implementation to 2.7 in the hybrid MPI/OpenMP one, due to the overhead of the FORTRAN compiler's OpenMP runtime routines. We conclude that the OLAM MPI/OpenMP implementation still needs optimization in order to reach the desired speedup.
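As an illustration of this structure, the sketch below shows an MPI process parallelizing a loop over the cores with an OpenMP parallel-for region. The array names and loop body are illustrative assumptions, not the actual OLAM (FORTRAN) routine, and the sketch is written in C++ for brevity.

```cpp
// Minimal hybrid MPI/OpenMP sketch (illustrative, not OLAM code):
// one MPI rank per node, OpenMP threads over the cores inside the rank.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    int provided;
    // FUNNELED is enough when only the master thread calls MPI.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                       // local grid points (assumed)
    std::vector<double> field(n, 1.0), tend(n, 0.0);

    // Each cache-heavy "do" loop would become one such parallel region.
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < n - 1; ++i)
        tend[i] = 0.5 * (field[i - 1] - 2.0 * field[i] + field[i + 1]);

    // Inter-node communication stays in MPI, outside the threaded loops.
    double local = tend[n / 2], global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads=%d sample=%g\n", size, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}
```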

**3.4.3 The hybrid MPI/CUDA implementation's analysis**

The hybrid MPI/CUDA implementation starts MPI processes, and each process launches threads on the GPU device. As mentioned before, due to development time constraints, we implemented CUDA kernels for only two of the nine *do* loops of the routine with the highest number of cache misses. Therefore, each MPI process launches two kernels on the GPU. In order to explain this implementation's results, we instrumented the two CUDA kernels to run with the Compute Visual Profiler<sup>7</sup>, which
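The division of labour described here, with each MPI rank offloading a few loops to the GPU as CUDA kernels, can be sketched as below. The kernel body, array names and launch configuration are illustrative assumptions and not the actual OLAM kernels.

```cpp
// Minimal hybrid MPI/CUDA sketch (illustrative, not OLAM code):
// each MPI rank offloads one former "do" loop to the GPU as a kernel.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void tendency(const double *f, double *t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        t[i] = 0.5 * (f[i - 1] - 2.0 * f[i] + f[i + 1]);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *h_f = new double[n], *h_t = new double[n];
    for (int i = 0; i < n; ++i) h_f[i] = i;

    double *d_f, *d_t;
    cudaMalloc((void **)&d_f, n * sizeof(double));
    cudaMalloc((void **)&d_t, n * sizeof(double));
    cudaMemcpy(d_f, h_f, n * sizeof(double), cudaMemcpyHostToDevice);

    // One kernel launch per offloaded loop; two such launches per rank in the text.
    tendency<<<(n + 255) / 256, 256>>>(d_f, d_t, n);
    cudaMemcpy(h_t, d_t, n * sizeof(double), cudaMemcpyDeviceToHost);

    // Inter-process exchange of results stays on the CPU via MPI.
    double local = h_t[n / 2], global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("sample=%g\n", global);

    cudaFree(d_f); cudaFree(d_t);
    delete[] h_f; delete[] h_t;
    MPI_Finalize();
    return 0;
}
```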

Fig. 19. Speed-up for the three OLAM implementations.


<sup>6</sup> http://www.intel.com
<sup>7</sup> http://www.nvidia.com

**4. Related work**

The scalability of parallel applications is the focus of several papers in the literature. In (Michalakes et al., 2008) the authors execute the high-resolution *WRF* weather forecast code over 15k cores and conclude that one of the greatest bottlenecks is data storage. The same code was used in (Seelam et al., 2009) to evaluate file-system caching and pre-fetching optimizations for many-core concurrent I/O, achieving improvements of 100%. I/O contention on multi-core systems is also a known issue in the literature, and a few strategies to mitigate the performance loss can be found. The work of (Wolfe, 2009) presents the lessons learned from porting part of the Weather Research and Forecasting Model (WRF) to the PGI Accelerator. Like OLAM, the application used in our work, the WRF model is a large application written in FORTRAN. The authors ported the WSM5 ice microphysics module and compared the performance of the PGI Accelerator version with a multi-core implementation and with a hand-written CUDA implementation. Finally, the work of (Govett et al., 2010) runs the weather model from the Earth System Research Laboratory (ESRL) on GPUs and relies on the CPUs for model initialization, I/O and inter-processor communication. They have shown that the part of the code that computes the dynamics of the model runs 34 times faster on a single GPU than on the CPU.

<sup>8</sup> Case study examples: http://www.culatools.com/features/performance





A scalability study with the NFS file system showed that the OLAM model's performance is limited by I/O operations (Osthoff et al., 2010). The work of (Schepke et al., 2010) presents an evaluation of OLAM MPI with the Vtune Analyzer and identifies a large number of cache misses when using 8 cores. Experiments comparing the use of local disks (no distributed file system) against PVFS then made clear that the scalability of these operations is not a problem when using the local disks (Boito et al., 2011). The concurrency in the access to the shared file system, the size of the requests and the large number of small files created were pointed out as responsible for the poor performance with PVFS and NFS. Recent work from (Kassick et al., 2011) presented a trace-based visualization of OLAM's I/O performance and showed that, when using the NFS file system to store the results on a multi-core cluster, most of the time spent in the output routines was spent in the close operation. Because of that, the authors propose delaying the close operation until the next output phase in order to increase the I/O performance. A new implementation of OLAM using MPI and OpenMP was proposed in order to reduce the intra-node concurrency and the number of generated files. Results have shown that this implementation performs better than the MPI-only one. The I/O performance was also increased, but the scalability of these operations remains a problem. We also observe that the use of OpenMP instead of MPI inside the nodes of the cluster improves the application's memory usage (Osthoff et al., 2011a).
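The delayed-close strategy mentioned above can be illustrated with a small sketch: instead of closing the history file inside the output routine that wrote it, the handle is cached and the close is deferred to the next output phase (or to finalization). The function and variable names below are illustrative assumptions, not the actual OLAM output code. On NFS, closing a file forces the client to flush and commit its dirty pages to the server, which is why deferring the close moves that cost out of the measured output phase.

```cpp
// Illustrative sketch of delaying the costly close() until the next output
// phase, instead of paying for it inside the time-critical output routine.
#include <cstdio>
#include <string>

static FILE *pending = nullptr;   // file left open from the previous phase

// Hypothetical per-phase output routine (not OLAM's actual interface).
void write_output_phase(int phase, const double *data, size_t n) {
    if (pending) {                // close the previous phase's file now,
        fclose(pending);          // off the critical path of the phase that wrote it
        pending = nullptr;
    }
    std::string name = "hist-" + std::to_string(phase) + ".bin";
    FILE *f = fopen(name.c_str(), "wb");
    if (!f) return;
    fwrite(data, sizeof(double), n, f);
    fflush(f);                    // data is pushed out, but close is deferred
    pending = f;                  // closed lazily at the next output phase
}

void finalize_output() {          // make sure the last file is closed
    if (pending) { fclose(pending); pending = nullptr; }
}
```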

**5. Conclusion**

This work evaluated the Ocean-Land-Atmosphere Model (OLAM) in multi-core environments: a single multi-core node and a cluster of multi-core nodes. We discussed three implementations of the model, using different levels of parallelism (MPI, OpenMP and CUDA): (1) an MPI implementation; (2) a hybrid MPI/OpenMP implementation; and (3) a hybrid MPI/CUDA implementation.

We have shown that a hybrid MPI/OpenMP implementation can improve the performance of OLAM on a multi-core cluster using local disks, a single-server shared file system (NFS), or a parallel file system (PVFS). We observe that, as we increase the number of nodes, the hybrid MPI/OpenMP implementation performs better than the MPI one. The main reason is that the hybrid MPI/OpenMP implementation decreases the number of output files, resulting in better performance for the I/O operations. We also confirm that the global-memory advantages of OpenMP improve the application's memory usage in a multi-core system.

In the experiments on a single multi-core node, we observed that, as we increase the number of cores, the MPI/OpenMP implementation performs better than the other implementations. The MPI/OpenMP implementation's bottleneck was observed to be its low degree of routine-level parallelism; in order to improve this implementation's speed-up, we plan to further parallelize OLAM's most CPU-consuming routine. Finally, we observed that the MPI/CUDA implementation's performance decreases for more than 2 processes because the workload per process shrinks as the parallelism grows. Therefore, we plan to further evaluate the performance with resolutions higher than 40 km.

We also applied trace-based performance visualization in order to better understand OLAM's I/O performance. The libRastro library was used to instrument the application and obtain the traces, which were then analyzed and visualized with the Pajé tool. We have shown that, when using the shared file system to store the results, most of the time spent in the output routines was spent in the close operation. We therefore proposed a modification that delays this operation, obtaining a performance increase of up to 37%.

As future work, we plan to further parallelize OLAM's timestep routines in order to improve both the MPI/OpenMP and the MPI/CUDA implementations. We also plan to study OLAM's performance at higher resolutions and with unbalanced workloads, both on single nodes and on multi-core/many-core clusters.

In order to include further I/O optimizations, we also intend to evaluate a parallel I/O library, aiming to reduce the overhead of file creation and improve the performance of the output phases.
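One direction such an evaluation could take, sketched below under the assumption that MPI-IO is the parallel I/O library of choice (the text does not specify one), is to have all ranks write collectively into a single shared file per output phase, removing the per-process file-creation cost.

```cpp
// Hypothetical MPI-IO sketch: every rank writes its block of an output phase
// into one shared file, so only one file is created per phase.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local_n = 1024;                     // values owned by this rank
    std::vector<double> local(local_n, (double)rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "history-0000.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes at its own offset; the collective call lets the
    // MPI-IO layer aggregate requests instead of issuing many small ones.
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    MPI_File_write_at_all(fh, offset, local.data(), local_n,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```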

**6. References**

Adcroft, C.H.A. & Marshall, J. (1997). Representation of topography by shaved cells in a height coordinate ocean model, *Monthly Weather Review*, 125:2293-2315.

Boito, F.Z.; Kassick, R.V.; Pilla, L.L.; Barbieri, N.; Schepke, C.; Navaux, P.; Maillard, N.; Denneulin, Y.; Osthoff, C.; Grunmann, P.; Dias, P. & Panetta, J. (2011). I/O Performance of a Large Atmospheric Model using PVFS, *Proceedings of Renpar20 / SympA14 / CFSE8*, INRIA, Saint-Malo, France.

Cotton, W.R.; Pielke, R.; Walko, R.; Liston, G.; Tremback, C.; Harrington, J. & Jiang, H. (2003). RAMS 2000: Current status and future directions, *Meteorol. and Atmos. Phys.*, 82, pp. 5-29.

Govett, M. et al. (2010). Running the NIM Next-Generation Weather Model on GPUs, *Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing*, Melbourne, Australia, pp. 729-796.

Michalakes, J.; Hacker, J.; Loft, R.; McCracken, M.O.; Snavely, A.; Wright, N.J.; Spelce, T.; Gorda, B. & Walkup, R. (2008). WRF Nature Run, *Journal of Physics: Conference Series*, 125(1):012022, URL http://stacks.iop.org/1742-6596/125/i=1/a=012022.

Ohta, K.; Matsuba, H. & Ishikawa, Y. (2009). Improving Parallel Write by Node-Level Request Scheduling, *Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid*, IEEE Computer Society, Washington, DC, USA, pp. 196-203.

Osthoff, C.; Schepke, C.; Panetta, J.; Grunmann, P.; Maillard, N.; Navaux, P.; Silva Dias, P.L. & Lopes, P.P. (2010). I/O Performance Evaluation on Multicore Clusters with Atmospheric Model Environment, *Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing*, IEEE Computer Society, Petropolis, Brazil, pp. 49-55.

Osthoff, C.; Grunmann, P.; Boito, F.; Kassick, R.; Pilla, L.; Navaux, P.; Schepke, C.; Panetta, J.; Maillard, N. & Silva Dias, P.L. (2011a). Improving Performance on Atmospheric Models through a Hybrid OpenMP/MPI Implementation, *Proceedings of the 9th IEEE International Symposium on Parallel and Distributed Processing with Applications*, IEEE Computer Society, Busan, Korea, pp. 69-75.

Pielke, R.A. et al. (1992). A Comprehensive Meteorological Modeling System - RAMS, *Meteorology and Atmospheric Physics*, 49(1), pp. 69-91.

Schepke, C.; Maillard, N.; Osthoff, C.; Dias, P. & Panetta, J. (2010). Performance Evaluation of an Atmospheric Simulation Model on Multi-Core Environments, *Proceedings of the Latin American Conference on High Performance Computing (CLCAR)*, Gramado, Brazil, pp. 330-332.


**2**

**Applications of Mesoscale Atmospheric Models in Short-Range Weather Predictions During Satellite Launch Campaigns in India**

D. Bala Subrahamanyam<sup>1</sup> and Radhika Ramachandran<sup>2</sup>

*<sup>1</sup>Space Physics Laboratory, Vikram Sarabhai Space Centre, Indian Space Research Organization, Department of Space, Government of India, Thiruvananthapuram*
*<sup>2</sup>Indian Institute of Space Science and Technology (IIST), Department of Space, Government of India, Thiruvananthapuram*
*India*

**1. Introduction**

Knowledge of meteorology forms the basis of scientific weather forecasting, which revolves around predicting the state of the atmosphere for a given location. Weather forecasting as practiced by humans is an example of having to make judgments in the presence of uncertainty. Weather forecasts are often made by collecting quantitative data about the current state of the atmosphere and using scientific understanding of atmospheric processes to project how the atmosphere will evolve in the future. Over the last few years the necessity of increasing our knowledge about the cognitive process in weather forecasting has been recognized. For its human practitioners, forecasting the weather is a task whose details can be uniquely personal, although most human forecasters share approaches based on the science of meteorology to deal with the challenges of the task. The chaotic nature of the atmosphere, the massive computational power required to solve the equations that describe the atmosphere, errors involved in measuring the initial conditions, and an incomplete understanding of atmospheric processes mean that forecasts become less accurate as the difference between the current time and the time for which the forecast is made (the range of the forecast) increases (Doswell, 2004; Ramachandran et al., 2006; Subrahamanyam et al., 2006; 2008).

In the Indian scenario, where most of the agricultural industries depend on the summer monsoon rainfall, several atmospheric models are run at different organizations to deliver regular forecasts to the common man and the media, with a special emphasis on predicting the onset of the monsoon and the expected amount of rainfall. However, the operational forecast models run by meteorological departments are not meant for prediction of local weather with high spatial and temporal resolution within a specific time window. Depending on the users' requirements and event-based management, atmospheric models are tuned for delivering the forecast products. During satellite launch operations, accurate weather predictions and reliable information on the winds, wind shears and thunderstorm activity over the launch site are of paramount importance for the efficient management of launch-time operations (Manobianco et al., 1996; Rakesh et al., 2007; Ramachandran et al., 2006;

