I was recently looking at the presentation ZOMG WHY IS THIS CODE SO SLOW given by Aaron Patterson (aka tenderlove) at RubyConf 2010. His talk was about benchmarking, performance and optimization in ruby, examples being extracted from his recent work on ActiveRecord and ARel. I must confess being a bit embarrassed about what I'm going to say here (Aaron is a really famous ruby hacker, member of both ruby-core and rails-core teams, developer of excellent ruby tools like ARel, mechanize, and so on.) but... I'm a bit... how to say?... hum, surprised!

The reason is that some examples in the slides can be very misleading IMHO. Amazingly, Aaron himself writes "Don't believe me" and "Think critically". Let's do that: I dare this writing.

## Benchmarking and asserting execution time

The example below is given at slide 104 (click here for the slides, when Aaron introduces the minitest/benchmark library (I must add that, at the time of writing, a very similar example is given in minitest's own README.

```
require 'rubygems'
require 'minitest/autorun'
require 'minitest/benchmark'
class BenchFib < MiniTest::Unit::TestCase
def fib n
a, b = 0, 1
n.times do
a, b = b, a + b
end
a
end
def bench_fib
assert_performance_linear 0.99 do |n|
n.times do
fib(1000)
end
end
end
end
```

The example is quite easy to understand:

- The method
`fib`

computes the n-th fibonacci number, - As its name suggests, the method
`bench_fib`

aims at benchmarking and asserting the performance of (the implementation of)`fib`

. - It relies on
`assert_performance_linear`

(provided by minitest/benchmark) which will invoke the block with increasing values for`n`

(1, 10, 100, 1000, and 10000). - By measuring the execution time of each of these invocations, minitest is able to check whether the execution time is a linear function of
`n`

or not.

As you can see below, the test passes. The execution time exhibits a beautiful linear progression! Can we say therefore that `fib`

is a linear function? Not at all!

```
blambeau@yemana % ruby fib1_test.rb
# Running benchmarks:
BenchFib 1 10 100 1000 10000
bench_fib 0.000612 0.006107 0.076857 0.750271 7.524193
Finished benchmarks in 8.365573s, 0.1195 tests/s, 0.1195 assertions/s.
1 tests, 1 assertions, 0 failures, 0 errors, 0 skips
```

## Trivially true or simply wrong?

Not at all, because the assertion is trivially true. Therefore the test is useless, if not simply wrong.

```
assert_performance_linear 0.99 do |n|
n.times do
fib(1000)
end
end
```

Here is why:

- The call to
`fib(1000)`

itself has a*constant*execution time*T*(but see later), - One call to
`fib(1000)`

takes*T*ms., ten calls take*10xT*ms., ..., ten thousand calls take*10000xT*ms. - Therefore, the linear progression is implied by
`assert_performance_linear`

only, and is absolutely independent of`fib`

In other words, from the points of view of the assertion at least (vs. the actual execution time, which differs) the code given in the example is somewhat equivalent to (and the test passes, of course):

```
assert_performance_linear 0.99 do |n|
n.times do
# everything with constant execution time
# nothing, for example
end
end
```

The assertion is trivially true as well as the test, and the benchmarking approach itself is misleading, if not wrong.

## Really?

Unfortunately if you want to go deeper in your understanding, things get more complicated. The truth is that answering whether the benchmarking is wrong or not depends on what you want to assert precisely. In this respect, the example at hand is amazing because:

It proves (or at least asserts) a lot of things, but

certainly notthat`fib`

has a linear execution time (even if it does)

If you do not trust me, modify `fib`

as below. The modified implementation (does not compute a fibonacci number anymore and) presents an exponential execution time with respect to `n`

.

```
def fib n
a, b = 0, 1
(2**n).times do
a, b = b, a + b
end
a
end
```

Execute the test now (I've only replaced 1000 by 10 here to have something that ends in a reasonable amount of time. Sorry for the confusion it could introduce):

```
assert_performance_linear 0.99 do |n|
n.times do
fib(10)
end
end
```

```
blambeau@yemana % ruby fib2_test.rb
# Running benchmarks:
BenchFib 1 10 100 1000 10000
bench_fib 0.000646 0.006415 0.079595 0.775503 7.777460
Finished benchmarks in 8.647361s, 0.1156 tests/s, 0.1156 assertions/s.
1 tests, 1 assertions, 0 failures, 0 errors, 0 skips
```

The test passes and the performance is linear (the execution time is really similar to what we had before, but it is only because `10**2 == 1024 ~= 1000`

). The problem is that such a test would pass and would assert a linear performance whatever the `fib`

you choose, and whatever it's internal complexity: constant, linear, polynomial, or even exponential!

Contrarily to what is argued by Aaron (see the video around 18:00, this kind of test will never detect the fact that your mate have replaced your linear function by an exponential one.

So what is asserted exactly? Is the assertion trivially true, really? No exactly. In fact, the test asserts that:

- The method
`Integer#times(n)`

has a linear execution time with respect to`n`

and - The call to
`fib(1000)`

has a constant execution time, which implies that... - ... it does not (seem to) depend on the number of times the function has been called before and
- ... it does not (seem to) depend on another hidden parameter unless the latter is itself constant and
- ... so on.

## What could /should it be?

At this step, you could argue that it is a typo on the slide and that the bench method was intended to be as shown below:

```
assert_performance_linear 0.99 do |n|
n.times do
fib(n)
end
end
```

But no way! The test even fails, and *it has to fail*: `Integer#times`

is linear in `n`

and calls `fib`

at each iteration, which is linear in `n`

as well. The correct computation is therefore `n*n`

, which is quadratic, not linear. Moreover, a similar mistake would be repeated on many other slides (see slides 100-102, 104-106, 109-114, 165-166, ...).

What about the example below?

```
assert_performance_linear 0.99 do |n|
n.times do |i|
fib(i)
end
end
```

Well, this time, the exact complexity formula is `n*(n+1)/2`

, which is bounded by `n^2`

and is therefore quadratic, not linear either. Contrast this with

```
assert_performance_linear 0.99 do |n|
100.times do
fib(n)
end
end
```

Or simply:

```
assert_performance_linear 0.99 do |n|
fib(n)
end
```

The last two examples are right, and correctly assert that `fib`

itself has a linear complexity. The first one repeats the computation 100 times, to have better statistical results. The latter point is extremely important, to avoid having tests that wrongly success or fail (see later).

## So what?

So what? What happened? Is it important? For me, the main problem with Aaron's talk is that it mixes two different concerns: *benchmarking* and *asserting complexity*. And keeping these concerns separate is certainly important! Let's have a quick look at the differences.

### Benchmarking

What Aaron is actually interested in his talk is *benchmarking*. Benchmarking allows *measuring and comparing execution times*. In that case, executing the same constant code (e.g. `fib(1000)`

) over and over again perfectly makes sense. It is what Aaron is actually doing on all slides showing linear curves (slides 153, 163, 166, 174, 180, 183, 187, and so on.): it compares different implementations of similar specifications. The fact that the curves are linear is misleading: it does not say anything about the internal complexity of the different implementations. But it makes the job: the delta between the different curves (in mathematical words: the difference of the slope of the curves) is representative of the efficiency difference between these implementations!

Even if it makes sense, remember that conducting a benchmarking experiment on one example only (e.g. `fib(1000)`

) is extremely insignificant, statistically speaking. Varying parameters and computing the average execution time is certainly better that executing the code with the same parameter, even a billion times. I must confess that correctly designing such a benchmarking experiment is also a bit harder. Aaron showed you the easiest and the very pragmatic way, very arguable but working (as demonstrated)!

### Asserting complexity

Asserting the internal complexity of an algorithm is another beast, and is probably much harder than benchmarking. Asserting complexity is asserting about the *performance of an algorithm with respect to the size of the problem it solves*. Well, it is (somewhat) obvious that asserting complexity is not Aaron's motivation. Instead he mainly uses `assert_linear_performance`

as a pragmatic way of measuring execution time and reporting it on a plot (see what he says at 18:15). All examples with `assert_linear_performance`

are actually flawed, as I've discussed earlier.

To reason about the complexity of your algorithm you *have to* design an experiment where the size of the problem varies. In the `fib`

example, you have to measure the execution time for different (well chosen) values of `n`

. Executing the algorithm many times (say 10 times) for each parameter value and computing the average allows obtaining better results with respect to statistical significance. If you report the measured time against the size of the problem in a graph, you obtain the complexity profile (see Aaron's slides 229-233 for typical profiles).

Asserting the complexity of the algorithm from such measures is even harder than simply visualizing it (isn't even undecidable??? please drop me an email if you know the answer). The idea (and what minitest/benchmark does) is to fit the measures with a given mathematical profile (linear, polynomial, exponential), and to check whether the fitting error is under a given threshold. In my opinion, designing and conducting such an experiment rigorously is difficult and requires time (both for the design and the execution of the experiment) and a lot of experience. For this reason, I personally doubt about the pertinence of having such assertions in an automated test suite. Anyway, if you plan to use them, remember to also output graphs and to have a look at it! Human inspection is sometimes much more pragmatic and rigorous than flawed automations ;-)

## Conclusion

I'll conclude this post with Aaron's own response to the mail I've sent to him about this:

I think your post is accurate. The benchmark shows a constant amount of work in a linear progression. Did I say something else? If so, I apologize.

And with my girlfriend's:

The benchmark shows that (unlike me) your computer is not tired of computing the 1000-th fibonacci number, even after a million times.

She's right!