Ruby vs. Python comes down to the for loop

Ruby vs Python comes down to the for loop.
Contrasting how each language handles iteration helps you understand how to work effectively in either.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

2 Likes


These design differences are also why Python is faster than Ruby in any code I’ve seen (I wonder how Python’s PyPy JIT compares to Ruby’s new JIT, hmm).

1 Like

The speed difference was always tangible when I first compared the two - not sure what it’s like now, but Ruby is a lot faster than it used to be.

1 Like

Python is also a lot faster than it used to be, and it has a few JITs available for it now - I’ve never benchmarked those against each other, hmmm.


@AstonJ Since you know Ruby well - I could write code in it, but I’m unsure just how ‘natural’ that code would be - do you think you could come up with a simple enough test, like some math or so? Otherwise I might find something online. I’m curious how they differ nowadays, lol.

1 Like

Wow, Ruby’s non-official variants are hard to install/build.

Managed to get the main build 3.1.0-preview1 installed, but couldn’t get JRuby, mruby, GraalVM Ruby and a few others built at all - their build scripts are baaaaaad o.O

Though I think the Ruby backend ecosystem is a bit more uniform and universal around the main build, unlike Python with its many well-used ‘other’ implementations? The main Ruby even has a JIT now, whereas mainline Python likely never will, as it’s designed to be simple and efficient rather than fast (for example, CPython is easy to embed as a scripting language, like Lua).

Is this good as a normal, standard Ruby-style loop over numbers for a benchmark? I want to benchmark the actual language interpreter/JIT, not any C code in either language (since those can be equally performant, and Python probably has an edge there because of all its scientific libraries), and I want it to be written the way someone would actually write it:

require "benchmark"

def bench(count)
	ret = 0
	(1..count).each do |n|
		ret += ret * 2
	end
	ret
end

result = Benchmark.realtime do
  bench(10_000_000)
end

puts("Time Taken for loop of 10_000_000: #{result}s")

And here’s the Python version:

from timeit import timeit

def bench(count):
	ret = 0
	for i in range(1, count):
		ret += ret * 2
	return ret

result = timeit(lambda: bench(10_000_000), number=1)

print(f"Time Taken for loop of 10_000_000: {result}s")

And running them:

❯ make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.44509852200280875s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.06898683903273195s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.014560726936906576s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5772202990483493s

So it looks like Ruby by default is still slower than Python, though not by near the margin it used to be. Python’s two most popular JIT’d implementations, GraalPython and PyPy, are both significantly faster, with PyPy the fastest (by a large margin even over Graal - I’m impressed by that).

However, Ruby’s times are so slow that I’m not sure it’s using the JIT - let’s see, does that need some special argument to enable? Hrmm, Google shows the new Ruby JIT isn’t terribly performant and actually slows down some cases (PyPy sure doesn’t seem to do that for Python), but still, pure math should be something a JIT can do very well and easily. Ah, --yjit, let’s add that:


❯ make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.425267347949557s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.06571533100213856s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.01419509295374155s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5761694150278345s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.49059942609164864s

Well, that barely helped. Let’s see... wait, it has two JITs?! Adding in the other one:

❯ make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.4192896030144766s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.05795672698877752s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.021143854944966733s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5799980269512162s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --jit loop.rb
Time Taken for loop of 10_000_000: 0.593747540959157s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.4838587270351127s

Huh, so the normal JIT (--jit, i.e. MJIT) actually runs slower than no JIT at all. Let’s do some warmups in the benches, changing the code to... oh wait, I don’t know if Ruby has unlimited-length integers like Python, so let’s change the actual benchmark to not depend on that as well:
Ruby:

require "benchmark"

def bench(count)
	ret = 0
	(1..count).each do |i|
		ret += i * 2
	end
	ret
end

# Warmups
(1..10_000).each do |i|
	bench(i)
end

result = Benchmark.realtime do
  bench(10_000_000)
end

puts("Time Taken for loop of 10_000_000: #{result}s")

Python:

from timeit import timeit

def bench(count):
	ret = 0
	for i in range(1, count):
		ret += i * 2
	return ret

# Warmups
for i in range(10_000):
	bench(i)

result = timeit(lambda: bench(10_000_000), number=1)

print(f"Time Taken for loop of 10_000_000: {result}s")

And results:

❯ make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.6883266480872408s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.19750940497033298s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.01244226808194071s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5437580200377852s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --jit loop.rb
Time Taken for loop of 10_000_000: 0.44931442802771926s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.4761481450404972s

Well hey, the normal Ruby JIT finally (barely) sped up over the Ruby interpreter, and it beats the ‘new’ YJIT too, and all of them are (barely) faster than non-JIT Python - but JIT’d Python is still utterly blowing away all the Rubies, JIT’d or not, nowhere near the same ballpark. Let me keep working on installing Graal Ruby, I’d be most curious to see how it compares to GraalPython... Yeah, it doesn’t like where it’s installing, like at all, and I can’t figure out why:

ERROR: cannot install TruffleRuby to /home/overminddl1/.asdf/installs/ruby/truffleruby+graalvm-21.3.0, which does not look like a valid TruffleRuby prefix
TruffleRuby only supports being installed to a not existing directory or replacing an existing TruffleRuby installation

This is an amazingly unhelpful message. That directory does indeed not exist, so what’s the issue? I love when error messages are just wrong >.>

Well, let’s take a look at mruby then since that’s one I’ve heard a lot about:

BUILD FAILED (Ubuntu 21.10 using ruby-build 20211109)

Inspect or clean up the working tree at /tmp/ruby-build.20211116102321.569674.TUm8SR
Results logged to /tmp/ruby-build.20211116102321.569674.TUm8SR

❯ cat /tmp/ruby-build.20211116102321.569674.TUm8SR                                         
cat: /tmp/ruby-build.20211116102321.569674.TUm8SR: Is a directory
...

Wait, I need Ruby to install mruby? What on earth kind of recursion is this?!?
Uuuuugh, fine then... Aaaand mruby installed! Let’s see if that fixed Graal Ruby too... and nope, same useless message. Well, let’s bench with mruby:

❯ make bench_loop                             
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.6626679109176621s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.2004202629905194s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.013142568059265614s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5442068020347506s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --jit loop.rb
Time Taken for loop of 10_000_000: 0.4532001350307837s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.5024946819758043s
/home/overminddl1/.asdf/installs/ruby/mruby-3.0.0/bin/ruby loop.rb
trace (most recent call last):
loop.rb:1: undefined method 'require' (NoMethodError)
make: *** [Makefile:18: bench_loop] Error 1

Uhhhhh, well, that’s... how do I debug a missing stdlib module in mruby?!? O.o (I’m guessing mruby’s minimal core just doesn’t ship require at all, since it’s built for embedding.)

1 Like

And just to add a baseline comparison from a compiled language:

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
	let mut ret = 0;
	for i in 1..count {
		ret += i * 2;
	}
	ret
}

fn bench_iter(count: usize) -> usize {
	(1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
	const COUNT: usize = 10_000_000;
	c.bench_function("for", |b| b.iter(|| bench_loop(COUNT)));
	c.bench_function("iter", |b| b.iter(|| bench_iter(COUNT)));
}

criterion_group!(benches, bench);
criterion_main!(benches);

Which runs as:

❯ cargo criterion
   Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
    Finished bench [optimized] target(s) in 2.23s
for                     time:   [315.03 ps 315.42 ps 315.83 ps]                
                        change: [-0.7200% -0.2908% +0.1634%] (p = 0.21 > 0.05)
                        No change in performance detected.

iter                    time:   [315.22 ps 316.68 ps 319.05 ps]                 
                        change: [-0.9833% -0.4288% +0.2130%] (p = 0.15 > 0.05)
                        No change in performance detected.

Uhhh, yeah, ~315 ps per call means the compiler optimized out the entire loop >.>
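(What LLVM is doing there is collapsing the whole loop into its closed form. A minimal sketch of the equivalent - hypothetical function, not anything in the benchmark - so each “iteration” is really just one multiply:)

pub fn bench_closed_form(count: usize) -> usize {
	// Sum of i * 2 for i in 1..count collapses to count * (count - 1).
	// saturating_sub keeps count = 0 from underflowing.
	count * count.saturating_sub(1)
}

Changed the code to this: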

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
	let mut ret = 0;
	for i in 1..count {
		ret += i * 2;
	}
	ret
}

fn bench_iter(count: usize) -> usize {
	(1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
	const COUNT: usize = 10_000_000;
	c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
	c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);

And now:

❯ cargo criterion
   Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
    Finished bench [optimized] target(s) in 2.17s
for                     time:   [1.1803 ns 1.1827 ns 1.1850 ns]                 
                        change: [+273.19% +274.47% +275.65%] (p = 0.00 < 0.05)
                        Performance has regressed.

iter                    time:   [1.1963 ns 1.2008 ns 1.2058 ns]                  
                        change: [+277.17% +279.55% +281.87%] (p = 0.00 < 0.05)
                        Performance has regressed.

There we go, much better, much slower - however it’s still optimizing the loop out to straight math instead of an actual loop, lol (nanoseconds for 10 million iterations). Let’s black_box something inside the loop to keep that from being turned into math too. Code is now:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
	let mut ret = 0;
	for i in 1..count {
		ret += black_box(i * 2);
	}
	ret
}

fn bench_iter(count: usize) -> usize {
	(1..count).map(|i| black_box(i * 2)).sum()
}

fn bench(c: &mut Criterion) {
	const COUNT: usize = 10_000_000;
	c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
	c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);

And now runs as:

❯ cargo criterion
   Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
    Finished bench [optimized] target(s) in 2.19s
for                     time:   [3.7198 ms 3.7250 ms 3.7302 ms]                 
                        change: [+314567311% +315475093% +316458081%] (p = 0.00 < 0.05)
                        Performance has regressed.

iter                    time:   [3.6822 ms 3.6891 ms 3.6961 ms]                  
                        change: [+307082143% +308377678% +309613221%] (p = 0.00 < 0.05)
                        Performance has regressed.

Soooo, never seen such a huge change amount before, lol - this looks correct for an actual loop now. Of the Python/Ruby implementations the best time was ~13.14ms (PyPy), and here it’s ~3.7ms, roughly 3.5x faster. So I’ve learned both that the JITs are not capable of optimizing the loop away into straight math like a good compiler can, and that they can’t optimize the loop itself very well even at that.

Hmm, let’s split the actual loop functions into a standalone library, the code of which is just:

use criterion::black_box;

pub fn bench_loop(count: usize) -> usize {
	let mut ret = 0;
	for i in 1..count {
		ret += black_box(i * 2);
	}
	ret
}

pub fn bench_iter(count: usize) -> usize {
	(1..count).map(|i| black_box(i * 2)).sum()
}

I kept the black_box so the loop doesn’t get optimized down to just math. The benchmark function, in a different binary that links in the library, is:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use pyby::*;

fn bench(c: &mut Criterion) {
	const COUNT: usize = 10_000_000;
	c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
	c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);

And running that:

❯ cargo criterion
for                     time:   [3.5874 ms 3.6058 ms 3.6273 ms] 
iter                    time:   [3.5454 ms 3.5532 ms 3.5672 ms]

So about the same.
Let’s dump the assembly of the code though:

❯ cargo +nightly asm pyby::bench_loop --rust
 pub fn bench_loop(count: usize) -> usize {
 push    rax
     cmp     rdi, 2
     jb      .LBB0_1
     lea     r8, [rdi, -, 1]
     add     rdi, -2
     mov     ecx, r8d
     and     ecx, 3
     cmp     rdi, 3
     jae     .LBB0_8
     mov     edx, 1
     xor     eax, eax
     jmp     .LBB0_4
.LBB0_1:
     xor     eax, eax
 }
 pop     rcx
 ret
.LBB0_8:
     and     r8, -4
     neg     r8
     mov     edx, 1
     mov     edi, 8
     xor     eax, eax
.LBB0_9:
     lea     rsi, [rdi, -, 6]
     mov     qword, ptr, [rsp], rsi
 ret += black_box(i * 2);
 add     rax, qword, ptr, [rsp]
     lea     rsi, [rdi, -, 4]
     mov     qword, ptr, [rsp], rsi
 add     rax, qword, ptr, [rsp]
     lea     rsi, [rdi, -, 2]
     mov     qword, ptr, [rsp], rsi
 add     rax, qword, ptr, [rsp]
 mov     qword, ptr, [rsp], rdi
 add     rax, qword, ptr, [rsp]
     add     rdi, 8
     lea     rsi, [r8, +, rdx]
     add     rsi, 4
     add     rdx, 4
     cmp     rsi, 1
     jne     .LBB0_9
.LBB0_4:
     test    rcx, rcx
     je      .LBB0_7
     add     rdx, rdx
.LBB0_6:
     mov     qword, ptr, [rsp], rdx
 add     rax, qword, ptr, [rsp]
     add     rdx, 2
     add     rcx, -1
     jne     .LBB0_6
.LBB0_7:
 }
 pop     rcx
 ret

And for the iterator version:

❯ cargo +nightly asm pyby::bench_iter --rust
 pub fn bench_iter(count: usize) -> usize {
 push    rax
     cmp     rdi, 2
     jb      .LBB1_1
     lea     r8, [rdi, -, 1]
     add     rdi, -2
     mov     ecx, r8d
     and     ecx, 3
     cmp     rdi, 3
     jae     .LBB1_4
     mov     edx, 1
     xor     eax, eax
     jmp     .LBB1_6
.LBB1_1:
     xor     eax, eax
 }
 pop     rcx
 ret
.LBB1_4:
     and     r8, -4
     neg     r8
     mov     edx, 1
     mov     edi, 8
     xor     eax, eax
.LBB1_5:
     lea     rsi, [rdi, -, 6]
     mov     qword, ptr, [rsp], rsi
     add     rax, qword, ptr, [rsp]
     lea     rsi, [rdi, -, 4]
     mov     qword, ptr, [rsp], rsi
     add     rax, qword, ptr, [rsp]
     lea     rsi, [rdi, -, 2]
     mov     qword, ptr, [rsp], rsi
     add     rax, qword, ptr, [rsp]
     mov     qword, ptr, [rsp], rdi
     add     rax, qword, ptr, [rsp]
     add     rdi, 8
     lea     rsi, [r8, +, rdx]
     add     rsi, 4
     add     rdx, 4
     cmp     rsi, 1
     jne     .LBB1_5
.LBB1_6:
     test    rcx, rcx
     je      .LBB1_9
     add     rdx, rdx
.LBB1_8:
     mov     qword, ptr, [rsp], rdx
     add     rax, qword, ptr, [rsp]
     add     rdx, 2
     add     rcx, -1
     jne     .LBB1_8
.LBB1_9:
 pop     rcx
 ret

And they seem to be precisely identical, as expected, though with no SSE vectorization. I wonder what changes if we compile for small generated code instead of fast generated code (opt-level = "s" in the Cargo profile):

 pub fn bench_iter(count: usize) -> usize {
 push    rax
     cmp     rdi, 2
     jb      .LBB1_1
     dec     rdi
     mov     ecx, 2
     xor     eax, eax
.LBB1_3:
     mov     qword, ptr, [rsp], rcx
     add     rcx, 2
     add     rax, qword, ptr, [rsp]
     dec     rdi
     jne     .LBB1_3
 }
 pop     rcx
 ret
.LBB1_1:
 xor     eax, eax
 pop     rcx
 ret

Cool, so it did just end up becoming a single simple loop - that is indeed much smaller code, the inner loop being the block under the .LBB1_3 label. Notice it’s actually counting down from count to zero, since the compiler determined the order doesn’t matter and it’s one instruction smaller that way.
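Roughly, that size-optimized assembly corresponds to something like this (a hypothetical hand-decompilation, not real source):

pub fn bench_iter_as_compiled(count: usize) -> usize {
	if count < 2 {
		return 0; // the early-out branch at .LBB1_1
	}
	let mut acc = 0; // rax
	let mut doubled = 2; // rcx, holds i * 2 directly
	let mut remaining = count - 1; // rdi, counts down to zero
	while remaining > 0 {
		acc += doubled; // the black_box store/load through [rsp]
		doubled += 2;
		remaining -= 1;
	}
	acc
}

Let’s see how much slower the benchmark itself is: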

for                     time:   [5.4088 ms 5.4163 ms 5.4242 ms]                 
                        change: [+49.304% +50.211% +51.020%] (p = 0.00 < 0.05)
                        Performance has regressed.

iter                    time:   [5.4249 ms 5.4436 ms 5.4740 ms]                  
                        change: [+51.652% +52.344% +53.307%] (p = 0.00 < 0.05)
                        Performance has regressed.

About 50% slower, which is huge, but still much faster than either of the scripting languages in the prior post, lol.

Hmm, I wonder how fast it is when run in full debug mode - I’m betting this will be impressively slow, lol. Benching:

Benchmarking for: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 37.7s, or reduce sample count to 10.
for                     time:   [376.34 ms 376.67 ms 377.02 ms]                
                        change: [+10561% +10580% +10599%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.6s, or reduce sample count to 10.
iter                    time:   [566.35 ms 567.12 ms 567.91 ms]                 
                        change: [+15795% +15837% +15883%] (p = 0.00 < 0.05)
                        Performance has regressed.

Wooo, that is indeed a LOT slower! Like on par with interpreted Python/Ruby code kind of slow!
Let’s see what the code looks like - the iterator version first, since the two definitely are not compiling to the same code now:

❯ cargo +nightly asm --build-type debug pyby::bench_iter --rust
 pub fn bench_iter(count: usize) -> usize {
 sub     rsp, 56
 mov     qword, ptr, [rsp, +, 48], rdi
 (1..count).map(|i| black_box(i * 2)).sum()
 mov     qword, ptr, [rsp, +, 32], 1
 mov     qword, ptr, [rsp, +, 40], rdi
 mov     rdi, qword, ptr, [rsp, +, 32]
 mov     rsi, qword, ptr, [rsp, +, 40]
 call    qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3map17h12723980faf40533E@GOTPCREL]
 mov     qword, ptr, [rsp, +, 16], rax
 mov     qword, ptr, [rsp, +, 24], rdx
 mov     rsi, qword, ptr, [rsp, +, 24]
 mov     rdi, qword, ptr, [rsp, +, 16]
 call    qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3sum17hf22cfa1d0bd6d00dE@GOTPCREL]
 mov     qword, ptr, [rsp, +, 8], rax
 mov     rax, qword, ptr, [rsp, +, 8]
 }
 add     rsp, 56
 ret

Indeed, lots of method calls in the loop that didn’t need to exist before - at opt-level 0 nothing gets inlined, so every iterator method is a real call.
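For reference, a for loop desugars to roughly this, and in a debug build each of these calls actually happens (a hypothetical desugared function, not real source):

use criterion::black_box;

pub fn bench_loop_desugared(count: usize) -> usize {
	let mut ret = 0;
	let mut iter = (1..count).into_iter(); // the call to IntoIterator::into_iter
	while let Some(i) = iter.next() { // the call to Range::next every iteration
		ret += black_box(i * 2); // plus the call to black_box itself
	}
	ret
}

And now the basic for loop version: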

❯ cargo +nightly asm --build-type debug pyby::bench_loop --rust
 pub fn bench_loop(count: usize) -> usize {
 sub     rsp, 136
 mov     qword, ptr, [rsp, +, 104], rdi
 let mut ret = 0;
 mov     qword, ptr, [rsp, +, 48], 0
 for i in 1..count {
 mov     qword, ptr, [rsp, +, 56], 1
 mov     qword, ptr, [rsp, +, 64], rdi
 mov     rdi, qword, ptr, [rsp, +, 56]
 mov     rsi, qword, ptr, [rsp, +, 64]
 call    qword, ptr, [rip, +, _ZN63_$LT$I$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17ha1c7c7c0ba434465E@GOTPCREL]
 mov     qword, ptr, [rsp, +, 32], rax
 mov     qword, ptr, [rsp, +, 40], rdx
 mov     rax, qword, ptr, [rsp, +, 40]
 mov     rcx, qword, ptr, [rsp, +, 32]
 mov     qword, ptr, [rsp, +, 72], rcx
 mov     qword, ptr, [rsp, +, 80], rax
.LBB12_2:
 mov     rax, qword, ptr, [rip, +, _ZN4core4iter5range101_$LT$impl$u20$core..iter..traits..iterator..Iterator$u20$for$u20$core..ops..range..Range$LT$A$GT$$GT$4next17h413bbe735b9fe1d5E@GOTPCREL]
 lea     rdi, [rsp, +, 72]
 call    rax
 mov     qword, ptr, [rsp, +, 96], rdx
 mov     qword, ptr, [rsp, +, 88], rax
 mov     rax, qword, ptr, [rsp, +, 88]
 test    rax, rax
 je      .LBB12_5
 jmp     .LBB12_12
.LBB12_12:
 jmp     .LBB12_6
 ud2
.LBB12_5:
 }
 mov     rax, qword, ptr, [rsp, +, 48]
 add     rsp, 136
 ret
.LBB12_6:
 for i in 1..count {
 mov     rax, qword, ptr, [rsp, +, 96]
 mov     qword, ptr, [rsp, +, 112], rax
 mov     qword, ptr, [rsp, +, 120], rax
 for i in 1..count {
 mov     qword, ptr, [rsp, +, 128], rax
 ret += black_box(i * 2);
 mov     ecx, 2
 mul     rcx
 mov     qword, ptr, [rsp, +, 24], rax
 seto    al
 test    al, 1
 jne     .LBB12_8
 mov     rdi, qword, ptr, [rsp, +, 24]
 ret += black_box(i * 2);
 call    qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
 mov     qword, ptr, [rsp, +, 16], rax
 jmp     .LBB12_9
.LBB12_8:
 ret += black_box(i * 2);
 lea     rdi, [rip, +, str.1]
 lea     rdx, [rip, +, .L__unnamed_2]
 mov     rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
 mov     esi, 33
 call    rax
 ud2
.LBB12_9:
 mov     rax, qword, ptr, [rsp, +, 16]
 ret += black_box(i * 2);
 add     rax, qword, ptr, [rsp, +, 48]
 mov     qword, ptr, [rsp, +, 8], rax
 setb    al
 test    al, 1
 jne     .LBB12_11
 mov     rax, qword, ptr, [rsp, +, 8]
 mov     qword, ptr, [rsp, +, 48], rax
 for i in 1..count {
 jmp     .LBB12_2
.LBB12_11:
 ret += black_box(i * 2);
 lea     rdi, [rip, +, str.0]
 lea     rdx, [rip, +, .L__unnamed_3]
 mov     rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
 mov     esi, 28
 call    rax
 ud2

A lot more inlined, but still a lot of setup for the 1..count iterator. Let’s add a pure number loop as well - added this function to the library:

pub fn bench_manual(count: usize) -> usize {
    let mut ret = 0;
    let mut i = 0;
    while i < count {
        ret += black_box(i * 2);
        i += 1;
    }
    ret
}

And added this to the benchmark:

c.bench_function("manual", |b| b.iter(|| bench_manual(black_box(COUNT))));

And benchmarked as normal:

for                     time:   [3.5597 ms 3.5656 ms 3.5712 ms]                 
iter                    time:   [3.5255 ms 3.5312 ms 3.5369 ms]                  
manual                  time:   [3.5953 ms 3.6012 ms 3.6080 ms]

And again in debug mode:

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.0s, or reduce sample count to 10.
for                     time:   [378.34 ms 378.85 ms 379.38 ms]                
                        change: [+10502% +10525% +10548%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.8s, or reduce sample count to 10.
iter                    time:   [566.02 ms 566.90 ms 567.91 ms]                 
                        change: [+15919% +15954% +15988%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking manual: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.9s, or reduce sample count to 40.
manual                  time:   [107.72 ms 108.03 ms 108.36 ms]                   
                        change: [+2889.5% +2899.8% +2912.2%] (p = 0.00 < 0.05)
                        Performance has regressed.

And the disassembly of that new function in release mode:

 pub fn bench_manual(count: usize) -> usize {
 push    rax
 while i < count {
 test    rdi, rdi
 je      .LBB0_1
 lea     rax, [rdi, -, 1]
 mov     r8d, edi
 and     r8d, 3
 cmp     rax, 3
 jae     .LBB0_8
 xor     eax, eax
 xor     edx, edx
 jmp     .LBB0_4
.LBB0_1:
 xor     eax, eax
 }
 pop     rcx
 ret
.LBB0_8:
 while i < count {
 and     rdi, -4
 mov     esi, 6
 xor     eax, eax
 xor     edx, edx
.LBB0_9:
 lea     rcx, [rsi, -, 6]
 mov     qword, ptr, [rsp], rcx
 ret += black_box(i * 2);
 add     rax, qword, ptr, [rsp]
     lea     rcx, [rsi, -, 4]
     mov     qword, ptr, [rsp], rcx
 add     rax, qword, ptr, [rsp]
     lea     rcx, [rsi, -, 2]
     mov     qword, ptr, [rsp], rcx
 add     rax, qword, ptr, [rsp]
 mov     qword, ptr, [rsp], rsi
 add     rax, qword, ptr, [rsp]
 i += 1;
 add     rdx, 4
 while i < count {
 add     rsi, 8
 cmp     rdi, rdx
 jne     .LBB0_9
.LBB0_4:
 test    r8, r8
 je      .LBB0_7
 add     rdx, rdx
.LBB0_6:
 mov     qword, ptr, [rsp], rdx
 ret += black_box(i * 2);
 add     rax, qword, ptr, [rsp]
 while i < count {
 add     rdx, 2
 add     r8, -1
 jne     .LBB0_6
.LBB0_7:
 }
 pop     rcx
 ret

And the disassembly of it in debug mode:

 pub fn bench_manual(count: usize) -> usize {
 sub     rsp, 72
 mov     qword, ptr, [rsp, +, 40], rdi
 mov     qword, ptr, [rsp, +, 64], rdi
 let mut ret = 0;
 mov     qword, ptr, [rsp, +, 48], 0
 let mut i = 0;
 mov     qword, ptr, [rsp, +, 56], 0
.LBB12_1:
 mov     rax, qword, ptr, [rsp, +, 40]
 while i < count {
 cmp     qword, ptr, [rsp, +, 56], rax
 jb      .LBB12_3
 }
 mov     rax, qword, ptr, [rsp, +, 48]
 add     rsp, 72
 ret
.LBB12_3:
 ret += black_box(i * 2);
 mov     rax, qword, ptr, [rsp, +, 56]
 mov     ecx, 2
 mul     rcx
 mov     qword, ptr, [rsp, +, 32], rax
 seto    al
 test    al, 1
 jne     .LBB12_5
 mov     rdi, qword, ptr, [rsp, +, 32]
 ret += black_box(i * 2);
 call    qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
 mov     qword, ptr, [rsp, +, 24], rax
 jmp     .LBB12_6
.LBB12_5:
 ret += black_box(i * 2);
 lea     rdi, [rip, +, str.1]
 lea     rdx, [rip, +, .L__unnamed_2]
 mov     rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
 mov     esi, 33
 call    rax
 ud2
.LBB12_6:
 mov     rax, qword, ptr, [rsp, +, 24]
 ret += black_box(i * 2);
 add     rax, qword, ptr, [rsp, +, 48]
 mov     qword, ptr, [rsp, +, 16], rax
 setb    al
 test    al, 1
 jne     .LBB12_8
 mov     rax, qword, ptr, [rsp, +, 16]
 mov     qword, ptr, [rsp, +, 48], rax
 i += 1;
 mov     rax, qword, ptr, [rsp, +, 56]
 add     rax, 1
 mov     qword, ptr, [rsp, +, 8], rax
 setb    al
 test    al, 1
 jne     .LBB12_10
 jmp     .LBB12_9
.LBB12_8:
 ret += black_box(i * 2);
 lea     rdi, [rip, +, str.0]
 lea     rdx, [rip, +, .L__unnamed_3]
 mov     rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
 mov     esi, 28
 call    rax
 ud2
.LBB12_9:
 mov     rax, qword, ptr, [rsp, +, 8]
 i += 1;
 mov     qword, ptr, [rsp, +, 56], rax
 while i < count {
 jmp     .LBB12_1
.LBB12_10:
 i += 1;
 lea     rdi, [rip, +, str.0]
 lea     rdx, [rip, +, .L__unnamed_4]
 mov     rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
 mov     esi, 28
 call    rax
 ud2

So as a raw number loop it runs the fastest of all of them in debug mode, since it doesn’t have to make the function calls that normally get optimized out. In release compilations they all have basically the same running time, although the generated code has quite a few differences - probably because the manual version isn’t allowed to change ordering, so it has to count up in order instead of being allowed to count down to 0. Let’s make that change manually. Changed the new function to this (also effectively starting at 1 instead of 0 to match the Python/Ruby above - won’t make much difference though, if any):

pub fn bench_manual(mut count: usize) -> usize {
	let mut ret = 0;
	while count > 0 {
		ret += black_box(count * 2);
		count -= 1;
	}
	ret
}

Never an optimization you’d usually want to do manually, since the compiler can do it for you with normally-written code, but still - benchmarking:

for                     time:   [3.5657 ms 3.5742 ms 3.5837 ms]
iter                    time:   [3.5695 ms 3.5802 ms 3.5918 ms]
manual                  time:   [3.2158 ms 3.2211 ms 3.2267 ms]

Although now it’s even better, since less work has to be pessimized around the black_box call (that call specifically prevents optimizations being done to its argument). We can just as easily change the other two to be a reverse iteration as well, their code now being:

use criterion::black_box;

pub fn bench_manual(mut count: usize) -> usize {
	let mut ret = 0;
	while count > 0 {
		ret += black_box(count * 2);
		count -= 1;
	}
	ret += black_box(0 * 2); // Needed to match prior count
	ret
}

pub fn bench_loop(count: usize) -> usize {
	let mut ret = 0;
	for i in (1..count).rev() {
		ret += black_box(i * 2);
	}
	ret
}

pub fn bench_iter(count: usize) -> usize {
	(1..count).rev().map(|i| black_box(i * 2)).sum()
}

And the full benchmark being:

for                     time:   [3.3039 ms 3.3084 ms 3.3128 ms]
iter                    time:   [3.3149 ms 3.3235 ms 3.3332 ms]
manual                  time:   [3.2541 ms 3.2624 ms 3.2721 ms]

And of course now they’re nearly on par with the manual version, since once the black_box has evened things out the compiler can apply mostly the same optimizations to all three loops.

2 Likes