**Acknowledgements**

The authors thank CNPq (grants 308317/2009-2 and 300192/2012-6) and FAPERJ (grant E-26/102.025/2009) for their support.

<sup>14</sup> The careful reader will note that this index calculation leads to an efficient coalesced memory access pattern [1, 32].
