Architecture-aware Taylor shift by 1

Anatole D. Ruslanov

doi:10.17918/etd-1226

We introduce register tiling for optimizing series of multiprecision additions. Our new tile method for designing an architecture-aware classical Taylor shift by 1 algorithm-a low-level operation important to the monomial bases variant of the Descartes method for polynomial real root isolation-obtains up to 7 times faster performance over standard implementations that call the efficient integer addition routines from the GNU Multiple Precision Arithmetic Library [44]. Our tile method for Taylor shift by 1 algorithm requires more word additions but it reduces the number of cycles per word addition by decreasing memory traffic and the number of carry computations. To enable standard compilers to tile the algorithm, we introduce signed digits, suspended normalization, radix reduction, and delayed carry propagation. The performance of our tile method depends on several parameters that can be modeled for and tuned to the underlying architecture. We show how such modeling can guide automatic code generation and automatic experimentation to adapt an algorithm to the underlying architecture for better ILP and pipeline utilization. We automatically generate our tile method for Taylor shift by 1 in a high-level language and tune it to four different processor architectures. The architecture-aware tile method outperforms four asymptotically fast methods up to degree 6000 on the four hardware platforms. We analyze feasibility of constructing high-performance architecture-aware fast methods. Using our register tiling technique, we automatically generate and tune de Castel-jau's algorithm, an operation with similar pattern of additions. The algorithm- "probably the most fundamental computation in the field of curve and surface design"(G. Farin [33])-is the main subalgorithm of the Bernstein bases variant of the Descartes method. We obtain similar performance gains. Applying our architecture-aware algorithms, we compare performance of several implementations of the monomial bases and Bernstein bases variants of the Descartes method on four processor architectures and for three classes of input polynomials. All variants have the same asymptotic computing time bound. The comparison shows that the best absolute computing times are obtained on an Opteron processor platform using the Bernstein-bases variant of the Descartes method with register tiling.

Architecture-aware Taylor shift by 1

Files and links (1)

Abstract

Metrics

Details

Architecture-aware Taylor shift by 1

Files and links (1)

Abstract

Metrics

Details

Drexel University Social media