We introduce register tiling for optimizing series of multiprecision additions. Our new tile method for designing an architecture-aware classical Taylor shift by 1 algorithm-a low-level operation important to the monomial bases variant of the Descartes method for polynomial real root isolation-obtains up to 7 times faster performance over standard implementations that call the efficient integer addition routines from the GNU Multiple Precision Arithmetic Library [44]. Our tile method for Taylor shift by 1 algorithm requires more word additions but it reduces the number of cycles per word addition by decreasing memory traffic and the number of carry computations. To enable standard compilers to tile the algorithm, we introduce signed digits, suspended normalization, radix reduction, and delayed carry propagation. The performance of our tile method depends on several parameters that can be modeled for and tuned to the underlying architecture. We show how such modeling can guide automatic code generation and automatic experimentation to adapt an algorithm to the underlying architecture for better ILP and pipeline utilization. We automatically generate our tile method for Taylor shift by 1 in a high-level language and tune it to four different processor architectures. The architecture-aware tile method outperforms four asymptotically fast methods up to degree 6000 on the four hardware platforms. We analyze feasibility of constructing high-performance architecture-aware fast methods. Using our register tiling technique, we automatically generate and tune de Castel-jau's algorithm, an operation with similar pattern of additions. The algorithm- "probably the most fundamental computation in the field of curve and surface design"(G. Farin [33])-is the main subalgorithm of the Bernstein bases variant of the Descartes method. We obtain similar performance gains. Applying our architecture-aware algorithms, we compare performance of several implementations of the monomial bases and Bernstein bases variants of the Descartes method on four processor architectures and for three classes of input polynomials. All variants have the same asymptotic computing time bound. The comparison shows that the best absolute computing times are obtained on an Opteron processor platform using the Bernstein-bases variant of the Descartes method with register tiling.
Metrics
21 File views/ downloads
21 Record Views
Details
Title
Architecture-aware Taylor shift by 1
Creators
Anatole D. Ruslanov - DU
Contributors
Jeremy Russell Johnson (Advisor) - Drexel University (1970-)
Werner Krandick (Advisor) - Drexel University (1970-)
Awarding Institution
Drexel University
Degree Awarded
Doctor of Philosophy (Ph.D.)
Publisher
Drexel University; Philadelphia, Pennsylvania
Resource Type
Dissertation
Language
English
Academic Unit
College of Arts and Sciences; Drexel University; Mathematics
Other Identifier
1226; 991014632834904721
Research Home Page
Browse by research and academic units
Learn about the ETD submission process at Drexel
Learn about the Libraries’ research data management services