Automated performance tuning
Conference proceeding

Jeremy Johnson
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, pp. 20-21
21 Jul 2010

Abstract

Keywords: autotuning; code generation and optimization; high-performance computing; parallelism; vectorization
This tutorial presents automated techniques for implementing and optimizing numeric and symbolic libraries on modern computing platforms, including SSE, multicore, and GPU. Obtaining high performance requires effective use of the memory hierarchy, short vector instructions, and multiple cores. Highly tuned implementations are difficult to obtain and are platform dependent. For example, the Intel Core i7 980 XE has a peak floating-point performance of over 100 GFLOPS and the NVIDIA Tesla C870 has a peak floating-point performance of over 500 GFLOPS; however, achieving close to peak performance on such platforms is extremely difficult. Consequently, automated techniques are now being used to tune and adapt high-performance libraries such as ATLAS (math-atlas.sourceforge.net), PLASMA (icl.cs.utk.edu/plasma), and MAGMA (icl.cs.utk.edu/magma) for dense linear algebra, OSKI (bebop.cs.berkeley.edu/oski) for sparse linear algebra, FFTW (www.fftw.org) for the fast Fourier transform (FFT), and SPIRAL (www.spiral.net) for a wide class of digital signal processing (DSP) algorithms. Intel currently uses SPIRAL to generate parts of its MKL and IPP libraries.
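
The idea of empirical tuning mentioned above can be illustrated with a toy sketch: generate several variants of a kernel, time each one on the target machine, and keep the fastest. The example below is only an assumption-laden illustration, not the method used by ATLAS, FFTW, SPIRAL, or any other library named in the abstract. The function names `blocked_matmul` and `autotune_block_size`, the candidate tile sizes, and the use of pure Python lists are all hypothetical choices made for brevity.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one tile of A by one tile of B and accumulate into C.
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune_block_size(n, candidates, repeats=3):
    """Time each candidate tile size on this machine; return (best_block, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            # Keep the minimum over repeats to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    best_block = min(timings, key=timings.get)
    return best_block, timings


if __name__ == "__main__":
    best, timings = autotune_block_size(n=64, candidates=[4, 8, 16, 32, 64])
    for block, seconds in sorted(timings.items()):
        print(f"block={block:3d}  time={seconds:.4f}s")
    print("selected block size:", best)
```

The selected tile size will vary between machines and runs, which is the point: the search adapts the kernel to the platform instead of relying on a hand-chosen constant. Production autotuners search far larger spaces (loop orders, unrolling, vectorization, threading) and typically generate compiled code rather than interpreting Python.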
