Cooley-Tukey FFT algorithm

The Cooley-Tukey algorithm, named after J.W. Cooley and John Tukey, is the most common fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively, in order to reduce the computation time to O(N log N) for highly-composite N (smooth numbers). Because of the algorithm's importance, specific variants and implementation styles have become known by their own names, as described below.

Because the Cooley-Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT. For example, Rader's or Bluestein's algorithm can be used to handle large prime factors that cannot be decomposed by Cooley-Tukey, or the prime-factor algorithm can be exploited for greater efficiency in separating out relatively prime factors.

See also the fast Fourier transform for information on other FFT algorithms, specializations for real and/or symmetric data, and accuracy in the face of finite floating-point precision.


This algorithm, including its recursive application, was invented around 1805 by Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but his work was not widely recognized (being published only posthumously and in neo-Latin) . Gauss did not analyze the asymptotic computational time, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries. FFTs became popular after J. W. Cooley of IBM and John W. Tukey of Princeton published a paper in 1965 reinventing the algorithm and describing how to perform it conveniently on a computer .

Tukey reportedly came up with the idea during a meeting of a US presidential advisory committee discussing ways to detect nuclear-weapon tests in the Soviet Union. Another participant at that meeting, Richard Garwin of IBM, recognized the potential of the method and put Tukey in touch with Cooley, who implemented it for a different (and less-classified) problem: analyzing 3d crystallographic data (see also: multidimensional FFTs). Cooley and Tukey subsequently published their joint paper, and wide adoption quickly followed.

The fact that Gauss had described the same algorithm (albeit without analyzing its asymptotic cost) was not realized until several years after Cooley and Tukey's 1965 paper. Their paper cited as inspiration only work by I. J. Good on what is now called the prime-factor FFT algorithm (PFA), but it was not realized until later that PFA is a quite different algorithm (only working for sizes that have relatively prime factors, unlike any composite size for Cooley-Tukey).

The radix-2 DIT case

A radix-2 decimation-in-time (DIT) FFT is the simplest and most common form of the Cooley-Tukey algorithm, although highly optimized Cooley-Tukey implementations typically use other forms of the algorithm as described below. Radix-2 DIT divides a DFT of size N into two interleaved DFTs (hence the name "radix-2") of size N/2 with each recursive stage.

The DFT is defined by the formula:

X_k = sum_{n=0}^{N-1} x_n e^{-frac{2pi i}{N} nk}
where k is an integer ranging from 0 to N-1.

Radix-2 DIT first computes the Fourier transforms of the even-indexed numbers x_{2m} (x_0, x_2, ldots, x_{N-2}) and of the odd-indexed numbers x_{2m+1} (x_1, x_3, ldots, x_{N-1}), and then combines those two results to produce the Fourier transform of the whole sequence. This idea can then be performed recursively to reduce the overall runtime to O(N log N). This simplified form assumes that N is a power of two; since the number of sample points N can usually be chosen freely by the application, this is often not an important restriction.

More explicitly, let us write M=N/2 and denote the DFT of the even-indexed numbers x_{2m} by E_j and the DFT of the odd-indexed numbers x_{2m+1} by O_j (m=0,...,M-1, j=0,...,M-1). Then it follows:


X_k & = & sum_{m=0}^{frac{N}{2}-1} x_{2m} e^{-frac{2pi i}{N} (2m)k} + sum_{m=0}^{frac{N}{2}-1} x_{2m+1} e^{-frac{2pi i}{N} (2m+1)k}

& = & sum_{m=0}^{M-1} x_{2m} e^{-frac{2pi i}{M} mk} + e^{-frac{2pi i}{N}k} sum_{m=0}^{M-1} x_{2m+1} e^{-frac{2pi i}{M} mk}

& = & left{ begin{matrix} E_k + e^{-frac{2pi i}{N}k} O_k & mbox{if } k E_{k-M} - e^{-frac{2pi i}{N}(k-M)} O_{k-M} & mbox{if } k geq M. end{matrix} right.

end{matrix} Here we have used the critical fact that E_{k+M}=E_k and O_{k+M}=O_k, so that these DFTs, in addition to having only M sample points, needs only be evaluated for M values of k. The original DFT has thus been divided into two DFTs of size N/2.

This process is an example of the general technique of divide and conquer algorithms; in many traditional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadth-first fashion.

The above re-expression of a size-N DFT as two size-N/2 DFTs is sometimes called the Danielson-Lanczos lemma, since the identity was noted by those two authors in 1942 (influenced by Runge's 1903 work). They applied their lemma in a "backwards" recursive fashion, repeatedly doubling the DFT size until the transform spectrum converged (although they apparently didn't realize the linearithmic asymptotic complexity they had achieved). The Danielson-Lanczos work predated widespread availability of computers and required hand calculation (possibly with mechanical aides such as adding machines); they reported a computation time of 140 minutes for a size-64 DFT operating on real inputs to 3-5 significant digits. Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size-2048 complex DFT on an IBM 7094 (probably in 36-bit single precision, ~8 digits). Rescaling the time by the number of operations, this corresponds roughly to a speedup factor of around 800,000. (140 minutes for size 64 may sound like a long time, but it corresponds to an average of at most 16 seconds per floating-point operation, around 20% of which are multiplications...this is a fairly impressive rate for a human being to sustain for over two hours, especially considering the bookkeeping overhead.)

General factorizations

More generally, Cooley-Tukey algorithms recursively re-express a DFT of a composite size N = N1N2 as:

  1. Perform N1 DFTs of size N2.
  2. Multiply by complex roots of unity called twiddle factors.
  3. Perform N2 DFTs of size N1.

Typically, either N1 or N2 is a small factor (not necessarily prime), called the radix (which can differ between stages of the recursion). If N1 is the radix, it is called a decimation in time (DIT) algorithm, whereas if N2 is the radix, it is decimation in frequency (DIF, also called the Sande-Tukey algorithm). The version presented above was a radix-2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/- combination (butterfly) of the even and odd transforms is a size-2 DFT. (The radix's small DFT is sometimes known as a butterfly, so-called because of the shape of the dataflow diagram for the radix-2 case.)

There are many other variations on the Cooley-Tukey algorithm. Mixed-radix implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(N2) algorithm for the prime base cases of the recursion. Split radix merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve the lowest known arithmetic operation count for power-of-two sizes. (On present-day computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size.) Another way of looking at the Cooley-Tukey algorithm is that it re-expresses a size N one-dimensional DFT as an N1 by N2 two-dimensional DFT (plus twiddles), where the output matrix is transposed. The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices. If, instead of using a small radix, one employs a radix of roughly √N and explicit input/output matrix transpositions, it is called a four-step algorithm (or six-step, depending on the number of transpositions), initially proposed to improve memory locality, e.g. for cache optimization or out-of-core operation, and was later shown to be an optimal cache-oblivious algorithm.

The general Cooley-Tukey factorization rewrites the indices k and n as k = N_2 k_1 + k_2 and n = N_1 n_2 + n_1, respectively, where the indices ka and na run from 0..Na-1 (for a of 1 or 2). That is, it re-indexes the input (n) and output (k) as N1 by N2 two-dimensional arrays in column-major and row-major order, respectively; the difference between these indexings is a transposition, as mentioned above. When this re-indexing is substituted into the DFT formula for nk, the N_1 n_2 N_2 k_1 cross term vanishes (its exponential is unity), and the remaining terms give

X_{N_2 k_1 + k_2} =
sum_{n_1=0}^{N_1-1} sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1} e^{-frac{2pi i}{N_1 N_2} cdot (N_1 n_2 + n_1) cdot (N_2 k_1 + k_2) }
sum_{n_1=0}^{N_1-1} left[e^{-frac{2pi i}{N} n_1 k_2 } right] left(sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1} e^{-frac{2pi i}{N_2} n_2 k_2 } right) e^{-frac{2pi i}{N_1} n_1 k_1 }

where the inner sum is a DFT of size N2, the outer sum is a DFT of size N1, and the [...] bracketed term is the twiddle factor.

An arbitrary radix r (as well as mixed radices) can be employed, as was shown by both Cooley and Tukey as well as Gauss (who gave examples of radix-3 and radix-6 steps). Cooley and Tukey originally assumed that the radix butterfly required O(r2) work and hence reckoned the complexity for a radix r to be O(r2 N/r logrN) = O(N log2(Nr/log2r); from calculation of values of r/log2r for integer values of r from 2 to 12 the optimal radix is found to be 3 (the closest integer to e, which minimizes r/log2r). This analysis was erroneous, however: the radix-butterfly is also a DFT and can be performed via an FFT algorithm in O(r log r) operations, hence the radix r actually cancels in the complexity O(r log(rN/r logrN), and the optimal r is determined by more complicated considerations. In practice, quite large r (32 or 64) are important in practice in order to effectively exploit e.g. the large number of processor registers on modern processors, and even a unbounded radix r=√N also achieves O(N log N) complexity and has theoretical and practical advantages for large N as mentioned above.

Data reordering, bit reversal, and in-place algorithms

Although the abstract Cooley-Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an in-place algorithm that overwrites its input with its output data using only O(1) auxiliary storage.

The most well-known reordering technique involves explicit bit reversal for in-place radix-2 algorithms. Bit reversal is the permutation where the data at an index n, written in binary with digits b4b3b2b1b0 (e.g. 5 digits for N=32 inputs), is transferred to the index with reversed digits b0b1b2b3b4 . Consider the last stage of a radix-2 DIT algorithm like the one presented above, where the output is written in-place over the input: when E_k and O_k are combined with a size-2 DFT, those two values are overwritten by the outputs. However, the two output values should go in the first and second halves of the output array, corresponding to the most significant bit b4 (for N=32); whereas the two inputs E_k and O_k are interleaved in the even and odd elements, corresponding to the least significant bit b0. Thus, in order to get the output in the correct place, these two bits must be swapped in the input. If you include all of the recursive stages of a radix-2 DIT algorithm, all the bits must be swapped and thus one must pre-process the input with a bit reversal to get in-order output. Correspondingly, the reversed (dual) algorithm is radix-2 DIF, and this takes in-order input and produces bit-reversed output, requiring a bit-reversal post-processing step. Alternatively, some applications (such as convolution) work equally well on bit-reversed data, so one can do radix-2 DIF without bit reversal, followed by processing, followed by the radix-2 DIT inverse DFT without bit reversal to produce final results in the natural order.

Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time, even though bit reversal can be done in O(N) time and has been the subject of much research). Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley-Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.

The problem is greatly simplified if it is out-of-place: the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available. The Stockham auto-sort algorithm performs every stage of the FFT out-of-place, typically writing back and forth between two arrays, transposing one "digit" of the indices with each stage, and has been especially popular on SIMD architectures. Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the Pease algorithm, which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(N log N) storage. One can also directly apply the Cooley-Tukey factorization definition with explicit (depth-first) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step and can be argued to have cache-oblivious locality benefits on systems with hierarchical memory

A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data


Search another word or see Cooley-Tukey_FFT_algorithmon Dictionary | Thesaurus |Spanish
Copyright © 2015, LLC. All rights reserved.
  • Please Login or Sign Up to use the Recent Searches feature