What precaution has to be taken when nested loops are used?
Answer:
To speed up data-intensive programs, two complementary techniques should be considered: nested loop parallelization and data locality optimization. Effective parallelization distributes the computation and the data it needs across different processors, whereas data locality optimization requires that the data accessed by a computation reside on the same processor. Consequently, locality and parallelization may demand different loop transformations, and an integrated approach that combines the two can produce much better results than either approach alone. This paper proposes a unified approach that integrates these two techniques to obtain an appropriate loop transformation. Applying this transformation yields coarse-grain parallelism by exploiting the largest possible groups of outer permutable loops, as well as data locality by satisfying dependences at inner loops. These groups can be further tiled to improve data locality by exploiting data reuse in multiple dimensions.
I. Introduction
The aim is to speed up data-intensive programs for execution on multiprocessor systems with hierarchical memory. To this end, there are two complementary techniques, namely data locality optimization and nested loop parallelization. Data locality optimization techniques attempt to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy and to decrease costly off-chip accesses. Loop parallelization, on the other hand, aims to speed up loop execution by distributing independent loop iterations across multiple processors. Both data locality and loop parallelization can be achieved by applying appropriate loop transformations [1]. However, the two techniques place inherently conflicting demands on data: effective parallelization distributes the computation and the data it needs across different processors, whereas locality optimization keeps the data accessed by a computation on the same processor. An integrated approach that combines the two can therefore generate much better results, in terms of program speedup, than using either technique individually [2].
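The following minimal C sketch, written for this answer rather than taken from the cited work, makes the conflict concrete: the same loop nest is shown once in a locality-friendly order and once in a parallelism-friendly order, and neither order is best for both goals.

```c
enum { N = 1024 };
static double A[N][N];

/* The only dependence is A[i][j] on A[i-1][j], carried by the i loop,
 * so the j loop iterations are independent. */

/* (a) Locality-oriented order: with j innermost, accesses to the row-major
 *     array A are stride-1, but the only dependence-free loop (j) is now
 *     the inner one, so at best fine-grain parallelism is available. */
void locality_version(void) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i - 1][j] + 1.0;
}

/* (b) Parallelism-oriented order: interchanging the loops makes the
 *     dependence-free j loop outermost, so its iterations can be
 *     distributed across processors (coarse-grain parallelism), but each
 *     processor then walks a column of a row-major array with stride N,
 *     losing spatial locality. */
void parallel_version(void) {
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)
            A[i][j] = A[i - 1][j] + 1.0;
}
```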
Multidimensional affine scheduling can successfully model complex sequences of loop transformations as a single optimization step [3]. This paper presents a unified approach, based on the polyhedral model, that computes a multidimensional affine schedule to increase parallelism and improve data locality for nested loops. The polyhedral model makes it possible to handle more complex loop nests and facilitates the composition and application of multiple transformations as one multidimensional schedule [3]. To obtain an appropriate schedule, our proposed method moves dependence satisfaction to the inner levels of the transformed space to the furthest extent possible. At the same time, to obtain a tileable iteration space, we are interested in obtaining fully permutable groups of loops. As a result, we obtain the largest possible groups of outer fully permutable loops while satisfying dependences at inner loops as much as possible.
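As a small illustration of what a multidimensional affine schedule can achieve, the sketch below applies the classic skewing schedule (t, t + i) to a one-dimensional stencil; this is a textbook example chosen for exposition, not the specific schedule produced by the algorithm proposed here.

```c
enum { T = 512, N = 1024 };
static double A[T][N];

/* Original nest: dependence distance vectors are (1,-1), (1,0) and (1,1).
 * The negative i-component of (1,-1) means the (t, i) loops are not fully
 * permutable, so the iteration space cannot be rectangularly tiled. */
void stencil_original(void) {
    for (int t = 1; t < T; t++)
        for (int i = 1; i < N - 1; i++)
            A[t][i] = A[t - 1][i - 1] + A[t - 1][i] + A[t - 1][i + 1];
}

/* After the affine schedule (t, s) = (t, t + i), the distances become
 * (1,0), (1,1) and (1,2): all components are non-negative, so both loops
 * form a fully permutable band that can legally be tiled, exploiting data
 * reuse along both the t and i dimensions. */
void stencil_skewed(void) {
    for (int t = 1; t < T; t++)
        for (int s = t + 1; s < t + N - 1; s++) {
            int i = s - t;                    /* recover the original index */
            A[t][i] = A[t - 1][i - 1] + A[t - 1][i] + A[t - 1][i + 1];
        }
}
```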
The main contributions of this paper are as follows:
We propose a unified approach that achieves coarse-grain parallelism and data locality improvement simultaneously.
To overcome the combinatorial nature of the optimization search space, we propose an objective function such that the multidimensional schedule can be determined dimension by dimension.
We separate the ratio of the tile sides, which determines the shape of a tile, from the actual tile size. The former is chosen to minimize inter-tile communication, and the latter is chosen according to the amount of data that fits in the local cache (see the sketch after this list).
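The last contribution can be pictured with a small, purely illustrative C helper. The function name, its parameters, and the footprint formula below are our own assumptions used to show the idea that the shape (side ratio) is fixed first and the absolute sizes are then scaled to the cache capacity; they are not the paper's actual model.

```c
#include <math.h>
#include <stddef.h>

/* Given a tile-shape ratio r = BI / BJ (chosen elsewhere to minimize
 * inter-tile communication) and the local cache capacity, pick the largest
 * BI x BJ tile whose working set still fits in the cache.
 * Footprint model (an assumption): arrays_per_tile * BI * BJ * elem_bytes. */
void choose_tile_sizes(double ratio, size_t cache_bytes, size_t elem_bytes,
                       int arrays_per_tile, int *BI, int *BJ) {
    /* From cache_bytes >= arrays_per_tile * (ratio * BJ) * BJ * elem_bytes:
     * BJ <= sqrt(cache_bytes / (arrays_per_tile * elem_bytes * ratio)). */
    double bj = sqrt((double)cache_bytes /
                     ((double)arrays_per_tile * (double)elem_bytes * ratio));
    *BJ = bj < 1.0 ? 1 : (int)bj;
    *BI = (int)(ratio * (*BJ));
    if (*BI < 1) *BI = 1;
}
```

A tiled loop nest would then iterate over blocks of BI x BJ iterations, so changing the cache size only rescales the tile while the communication-minimizing shape is preserved.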
The remaining parts of this paper are organized as follows. Section II discusses related work. Section III presents the underlying polyhedral model on which our approach is based. Section IV then shows the overall steps of a loop transformation framework in the polyhedral model. Section V explains our proposed algorithm for nested loop parallelization and locality improvement. Experimental results are presented in Section VI. Finally, the conclusion and future work are given in Section VII.