The sum of all the multiples
So, this week I had a dreadful night, I just couldn’t sleep at all. In the search for something to do with a sleepless night I decided to dissect a simple (perhaps the simplest) problem at Project Euler: Problem 1.
The problem is stated as this:
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
Anyways, the idea of this post is to explore this problem, starting from a very naive implementation building up to alternatives, gathering knowledge of the problem and it’s solution, until we reach a good implementation.
So, here I started. Doing it in scheme, I just figured I should do it in a functional way. I ended up with the code below:
(require-extension srfi-1) (define (div-by-3-or-5 x) (or (zero? (modulo x 3)) (zero? (modulo x 5)))) (let ((result (fold + 0 (filter div-by-3-or-5 (iota 1000000 1))))) (format #T &quot;Result: ~a~%&quot; result))
It’s a simple program, obviously. A simple run-down of what it does is given below:
- First the iota function generates a lists that has all the numbers from 1 to 1000000;
- This list is then passed on to filter which returns a list that is composed of all members from it’s input that are multiples of 3 or 5;
- Now, we iterate (through the function fold) on this new list applying + (the sum operator) to it’s contents;
- The result of that is the result of our computation.
Running this program with the help of the UNIX utility time, tells us that our program isn’t exactly the fastest kid in the class:
Result: 233334166668 real 0m2.018s
It takes 2 whole seconds to do it’s job. But when we look at it again, there’s a reason for that. Our code is, right now, very stupid: It creates a whole list of 1000000 numbers in memory; then it creates a smaller list, with thousands of numbers; then it runs through this list adding the values.
We could make it so that the lists aren’t created, using streams (or iterators, or generators… pick your name). So that’s what I did:
(require-extension srfi-41) (define (div-by-3-or-5 x) (or (zero? (modulo x 3)) (zero? (modulo x 5)))) (let ((result (stream-fold + 0 (stream-filter div-by-3-or-5 (stream-range 1 1000000))))) (display result))
I substituted the list-creating primitives for their stream-creating counterparts. Unfortunately as you can see from the results below, either the implementation of Streams from Chicken Scheme is oddly slow or I’m doing something very wrong, because it actually got much slower. I’ll ask the guys from Chicken what am I doing wrong and will fix it later. For now, here are the results:
Result: 233333166668 real 0m5.019s
Of course, doing it the functional way isn’t the only way of doing it in Scheme, and it’s easier to make an iterative version that doesn’t use streams nor builds the whole list. And so I did it, with the following code:
(let loop ((i 3) (acc 0)) (if (&gt; i 1000000) (display (format &quot;Result: ~a~%&quot; acc)) (loop (+ i 1) (if (or (= (modulo i 5) 0) (= (modulo i 3) 0)) (+ acc i) acc))))
Which is simply a loop from 1 to 1000000 adding to an accumulator the sum of all the numbers that are multiples of either 3 or 5. And now we start seeing an evolution in speed! Check the results:
Result: 233334166668 real 0m1.262s
That is a mild improvement, but it’s an improvement nonetheless — the kind of improvement I expected to get from the stream based version. There are some problems with this approach, but let’s try to keep it naive, and deal with a small attempt at optimization.
The target here is that iterating through a million numbers, and permforming two million modulo operations, is bound to be slow. But figuring out what are the numbers we want is simple: we know that we can list all the multiples of a number if we start iterating from this number and add itself to the iteration every step.
That is, if we want to find all the multiples of 3, we have to do 3, 3 + 3, 3 + 3 + 3, and so on. So, so to sum out all the numbers that are multiples of 3 or 5, all we have to do is to sum all the multiples of 3 with all the multiples of 5.
Almost that simple: we are summing some of the numbers twice. You see, 30 is a multiple of 3 and of 5, and thus it’s being added by both sides of the sum. The solution for that is simple, really: We have to figure out what numbers are counted twice, but since 3 and 5 are primes we can guarantee that any number divisible by both, is also divisible by their least common multiple, which is 15.
So our algorithm is:
- Sum all the multiples of 3 from 1 to 1000000;
- Sum all the multiples of 5 from 1 to 1000000;
- Add those two sums;
- Subtract from it all the multiples of 15 from 1 to 1000000.
The code for it is as simple as the algorithm, see below:
(define (compute-sum-for n) (let loop ((i n) (acc 0)) (if (&gt; i 1000000) acc (loop (+ i n) (+ acc i))))) (let ((result-3 (compute-sum-for 3)) (result-5 (compute-sum-for 5)) (result-15 (compute-sum-for 15))) (format #T &quot;Result: ~a~%&quot; (- (+ result-3 result-5) result-15)))
The auxiliary function compute-sum-for makes it much more readable, and the result is much faster than before:
Result: 233334166668 real 0m0.366s
Now that’s the kind of performance we’re after! Right? … Well, no! It’s about three to four times faster than the best previous attempt, I give you that, but we’re talking about computers. They should be faster! From now on I can see two paths to further speed out this code: The naive and the ingenious (which should be apparent to anyone with some knowledge of math by now).
The naive option is simple: We’re in the era of multi-core computers. We have three independent sums going on, why don’t we dispatch them in three different threads and cut the processing time by a third of it?
Well, so I did it with the following code:
(require-extension srfi-18) (define (make-worker n) (lambda () (let loop ((i n) (acc 0)) (if (&gt; i 1000000) acc (loop (+ i n) (+ acc i)))))) (let ((thread-3 (thread-start! (make-thread (make-worker 3)))) (thread-5 (thread-start! (make-thread (make-worker 5)))) (thread-15 (thread-start! (make-thread (make-worker 15))))) (let ((result-15 (thread-join! thread-15)) (result-5 (thread-join! thread-5)) (result-3 (thread-join! thread-3))) (format #T &quot;Result: ~a~%&quot; (- (+ result-3 result-5) result-15))))
We build three threads, using a worker function not much different from our compute-sum-for function from before, start them, and then get their results before doing the final sum. And the speed is…
Result: 233334166668 real 0m0.445s
… slightly inferior! You see, you can’t simply assume adding threads to a computation is going to make it faster. Our computation so far was so simple that adding threads to it gave more in terms of overhead than in terms of performance gain. Keep that in mind.
That leaves us with the only option, finally adding some thought to this and stop being naive. Let’s look at what we’re doing, shall we? We detailed the algorithm before, and it’s basic block is a sum of all numbers in a sequence with fixed increment steps.
It just so happens that we have a name for those sequences. It is called an Arithmetic Progression and as it turns out we know how to sum these!
So we could modify our compute-sum-for function to use this knowledge, like in the code below:
;;;; Returns the sum of the arithmetic progression from start, up to the first ;;;; item which is bigger than the limit (exclusive), with the given step. (define (sum-ap start limit step) (let ((n (+ 1 (floor (/ (- limit start) step))))) (* (/ n 2) ( + (* 2 start) (* step (- n 1)))))) (define (compute-sum-for n) (sum-ap 0 1000000 n)) (let ((result-3 (compute-sum-for 3)) (result-5 (compute-sum-for 5)) (result-15 (compute-sum-for 15))) (format #T &quot;Result: ~a~%&quot; (- (+ result-3 result-5) result-15)))
Just added the sum-ap function which figures out from the parameters start, limit and step how many elements from our progression we have in a given range, and then uses the formula (described in that wikipedia link above) for the sum of an arithmetic progression to figure out the values. Here are the resulting time spent in this computation:
Result: 233334166668.0 real 0m0.011s
And now that’s the speed we’re looking for!
And with that we reach where I wanted to go: Naive implementations of the solution to a problem are, most of the times, great ways to get more insight into the nature of the problem and, therefore, come up with better implementations.
Please give some feedback on whatever you like: on the content, the style, the length and formatting or whatever. I kinda feel this was too long for such a simple post, but it really felt smaller in my head, when I was writing the programs.