The 2 June 2011 issue of The Economist contains an article on parallel programming. The article points out what is obvious to some, mostly because they have learned it the hard way.

You might expect a six-core machine to be six times faster than a machine with a single-core microprocessor. Yet for most tasks it is not.

Indeed. Every once in a while, if you pick the right problem and the right approach, you can get a significant speedup on a multi-core box. I’ll give you a simple example.

The problem is that old chestnut, computing the nth Fibonacci number, for large n. By “large n” I mean n=1,000,000 or so. By then fib(n) has enough digits to fill a book. For example, fib(1,000,000) has 208,988 digits and fib(2,000,000) has 417,975 digits.

To have a basis for comparison, here is a standard approach to computing the nth Fibonacci number.

(defn std-fib "A slow but simple way to compute nth Fibonacci number." [n] (loop [a 0 b 1 k n] (if (zero? k) a (recur b (+ a b) (dec k)))))

The algorithm is familiar, but a couple of points are worth noting. The first is that computing the nth Fibonacci number this way will require n additions. The second is that though this approach is tail recursive, it does **not** use constant space. The space usage will be linear in n, for the following reason. The number fib(n) grows exponentially with n. The standard ways to represent a number n grow logarithmically with n. The log and the exponential functions more or less cancel, and the space required will grow proportionally to n.

We’ll turn the iterative *std-fib* into a parallel computation in two steps. The first step is to rethink our approach, and switch to a matrix computation. Then we’ll tweak the matrix approach to make it parallel.

Here’s the quick review of matrix multiplication:

Because it is tedious to work with LaTeX in WordPress, as above, I am going to change the notation a bit and write the 2 x 2 matrix A as [a b c d], etc. This simpler notation works well with both WordPress and Clojure.

Here’s the basis of computing Fibonacci numbers with matrices. Consider the matrix

F = [0 1 1 1] = [fib(0) fib(1) fib(1) fib(2)]

The matrix powers of F have a nice property. For integers n > 0:

F^{n+1}= [fib(n) fib(n+1) fib(n+1) fib(n+2)]

So we’ll compute the nth Fibonacci number by computing F^{n+1} and extracting its first entry. At first glance that might seem crazy. Multiplying two matrices will involve several multiplications and several additions, whereas *std-fib* did only a single addition. How can the proposed matrix approach possibly be faster?

Suppose we want to compute F^{1000}. We will not perform 1000 matrix multiplications. Instead we’ll note that F^{1000} is the square of F^{500}. Suddenly half of our multiplications are gone. And of course F^{500} is the square of F^{250}, and so on, recursively.

We’ll compute matrix powers by squaring when we can, and multiplying when we have to.

(defn mat-power "2x2 matrix M raised to nth power" [M n] (cond (= 0 n) [1 0 0 1] (even? n) (recur (mat-sq M ) (/ n 2)) :else (mat-mult M (mat-power M (dec n)))))

The *mat-power* function is defined in terms of two other functions, *mat-mult* and *mat-sq*. Let’s take a look at them.

(defn mat-mult "Matrix multiplication for 2x2 matrices, both arguments are powers of the same symmetric matrix. Note that this version of multiplication is _not_ correct in the general case. In particular, the product of symmetric matrices need not be symmetric. This function should be used only to compute a positive integer power of a symmetric matrix. Such a product is guaranteed to be symmetric." [[a b _ d] [e f _ h]] (let [[ae bf dh af bh] (map * [a b d a b] [e f h f h]) [w x y] (map + [ae af bf] [bf bh dh])] [w x x y])) (defn mat-sq "Square of a symmetric 2x2 matrix" [[a b b c]] (let [[aa bb cc x] (map * [a b c b ] [a b c (+ a c) ]) [w y] (map + [aa bb] [bb cc])] [w x x y])) (defn fib "Compute the nth Fibonacci number." [n] (let [F [0 1 1 1] M (mat-power F (inc n))] (M 0)))

Above I exploit a feature of the problem at hand to reduce the amount of work done. We are interested in computing powers of the symmetric matrix F = [0 1 1 1]. A positive integer power of a symmetric matrix is guaranteed to be symmetric. The two off-diagonal elements must be equal, so instead of computing four entries, I can get away with computing only three, cutting the work to be done by 25% or so.

Let’s see how the matrix approach compares to *std-fib* when timed on largish values of n. On my machine, I see

```
fast-fib.fib> (time (do (std-fib 1000000) nil))
"Elapsed time: 44284.767864 msecs"
fast-fib.fib> (time (do (fib 1000000) nil))
"Elapsed time: 3208.876229 msecs"
```

Down from 44 seconds to 3 seconds; *std-fib* took 13.8 times as long as *fib*. Nice. Let’s go a little higher.

```
fast-fib.fib> (time (do (std-fib 2000000) nil))
"Elapsed time: 168510.795379 msecs"
fast-fib.fib> (time (do (fib 2000000) nil))
"Elapsed time: 12722.213307 msecs"
```

Again, *std-fib* took over 13 times as long as *fib*.

Now let’s see how to use those extra cores. We’ll write parallel versions of our earlier functions, and distinguish them by prefixing the function name with a “p”.

(defn pmat-mult "Matrix multiplication for 2x2 matrices, both powers of a single symmetric matrix" [[a b _ d] [e f _ h]] (let [[ae bf dh af bh] (pmap * [a b d a b] [e f h f h]) [w x y] (pmap + [ae af bf] [bf bh dh])] [w x x y])) (defn pmat-sq "Square of symmetric 2x2 matrix" [[a b b c]] (let [[aa bb cc x] (pmap * [a b c b ] [a b c (+ a c) ]) [w y] (pmap + [aa bb] [bb cc])] [w x x y])) (defn pmat-power "Symmetric 2x2 matrix M raised to nth power" [M n] (cond (= 0 n) [1 0 0 1] (even? n) (recur (pmat-sq M ) (/ n 2)) :else (pmat-mult M (pmat-power M (dec n))))) (defn pfib "The k-th Fibonacci number, parallel version." [k] (let [F [0 1 1 1] M (pmat-power F (inc k))] (M 0)))

Basically, I replaced *map* with *pmap*. That was the plan all along, and the reason I wrote the matrix multiplication functions using map. So does it work? Well, I have a four core box. Check it out.

```
fast-fib.fib> (time (do (fib 1000000) nil))
"Elapsed time: 3184.672283 msecs"
fast-fib.fib> (time (do (pfib 1000000) nil))
"Elapsed time: 1029.386271 msecs"
```

That’s a ratio of about 3.1. Let’s double n.

```
fast-fib.fib> (time (do (fib 2000000) nil))
"Elapsed time: 12660.566862 msecs"
fast-fib.fib> (time (do (pfib 2000000) nil))
"Elapsed time: 3617.482009 msecs"
```

That’s a ratio of 3.5. We see *fib* taking about three times as long as *pfib*. Recall that we are computing three numbers. A speedup of three is about all we could hope for. If you want to grab the code and play around with it yourself, you can get it here.

I wrote this code because I wanted an unequivocal example of the four cores on my machine speeding up a computation. An earlier attempt at much the same computation was misconceived. I tried a similar approach, but the devil is in the details. I used the matrix Fibonacci computation, but explicitly computed what binary powers of F would need to be multiplied together, computed them using only squaring, and then used a home-brew parallel reduce operator to multiply the resultant matrices in parallel. This was no faster than *fib*, and more complicated. I was taken in by a too-clever scheme which focused on making the wrong thing parallel. Multiplying whole matrices in parallel did not work nearly so well as the simpler approach I showed you here.

Final thought: I raced through the explanation of the matrix approach to calculating Fibonacci numbers. The method is well known, and there are detailed explanations available online. I like one I ran across not too long ago, on Code Overflow.