And just to add a baseline comparison from a compiled language:
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(COUNT)));
    c.bench_function("iter", |b| b.iter(|| bench_iter(COUNT)));
}

criterion_group!(benches, bench);
criterion_main!(benches);
Which runs as:
❯ cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.23s
for time: [315.03 ps 315.42 ps 315.83 ps]
change: [-0.7200% -0.2908% +0.1634%] (p = 0.21 > 0.05)
No change in performance detected.
iter time: [315.22 ps 316.68 ps 319.05 ps]
change: [-0.9833% -0.4288% +0.2130%] (p = 0.15 > 0.05)
No change in performance detected.
Uhhh, yeah let’s not optimize out the entire loop… >.>
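Times of ~315 ps are a dead giveaway that the whole loop got folded into pure arithmetic at compile time. As a sketch of what the optimizer can do here (my own closed-form derivation, not the exact code LLVM emits): the body sums 2*i for i in 1..count, which collapses to count * (count - 1).

```rust
// Summing 2*i over 1..count has the closed form
// 2 * (1 + 2 + ... + (count-1)) = count * (count - 1),
// so the entire loop can be replaced with one multiply.
fn summed_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn closed_form(count: usize) -> usize {
    // saturating_sub keeps count == 0 from underflowing; both sides give 0.
    count.saturating_sub(1) * count
}

fn main() {
    for count in [0, 1, 2, 10, 10_000_000] {
        assert_eq!(summed_loop(count), closed_form(count));
    }
    println!("loop matches closed form");
}
```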
Changed the code to this:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And now:
❯ cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.17s
for time: [1.1803 ns 1.1827 ns 1.1850 ns]
change: [+273.19% +274.47% +275.65%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [1.1963 ns 1.2008 ns 1.2058 ns]
change: [+277.17% +279.55% +281.87%] (p = 0.00 < 0.05)
Performance has regressed.
There we go, much better (well, much slower), however it’s still optimizing the loop down to straight math instead of an actual loop, lol. Let’s black_box something inside the loop body to keep that from being reduced to math too… Code is now:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += black_box(i * 2);
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| black_box(i * 2)).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And now runs as:
❯ cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.19s
for time: [3.7198 ms 3.7250 ms 3.7302 ms]
change: [+314567311% +315475093% +316458081%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [3.6822 ms 3.6891 ms 3.6961 ms]
change: [+307082143% +308377678% +309613221%] (p = 0.00 < 0.05)
Performance has regressed.
Soooo, I’ve never seen such a huge change percentage before, lol, but this looks correct for an actual loop now. The best time among the python/ruby runs was 13.14 ms, and here the time is ~3.7 ms, so I’ve learned two things: the JITs are not capable of optimizing the loop down to straight math to remove it entirely like a good language’s compiler can, and they can’t seem to optimize the loop itself very well even at that.
Hmm, let’s split the actual loop functions into a standalone library, the code of which is just:
use criterion::black_box;

pub fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += black_box(i * 2);
    }
    ret
}

pub fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| black_box(i * 2)).sum()
}
I kept the black_box so the loop doesn’t get optimized down to plain math. The benchmark itself lives in a different binary that links in the library:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use pyby::*;

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And running that:
❯ cargo criterion
for time: [3.5874 ms 3.6058 ms 3.6273 ms]
iter time: [3.5454 ms 3.5532 ms 3.5672 ms]
So about the same.
Let’s dump the assembly of the code though:
❯ cargo +nightly asm pyby::bench_loop --rust
pub fn bench_loop(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB0_1
lea r8, [rdi, -, 1]
add rdi, -2
mov ecx, r8d
and ecx, 3
cmp rdi, 3
jae .LBB0_8
mov edx, 1
xor eax, eax
jmp .LBB0_4
.LBB0_1:
xor eax, eax
}
pop rcx
ret
.LBB0_8:
and r8, -4
neg r8
mov edx, 1
mov edi, 8
xor eax, eax
.LBB0_9:
lea rsi, [rdi, -, 6]
mov qword, ptr, [rsp], rsi
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 4]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 2]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rdi
add rax, qword, ptr, [rsp]
add rdi, 8
lea rsi, [r8, +, rdx]
add rsi, 4
add rdx, 4
cmp rsi, 1
jne .LBB0_9
.LBB0_4:
test rcx, rcx
je .LBB0_7
add rdx, rdx
.LBB0_6:
mov qword, ptr, [rsp], rdx
add rax, qword, ptr, [rsp]
add rdx, 2
add rcx, -1
jne .LBB0_6
.LBB0_7:
}
pop rcx
ret
And for the iterator version:
❯ cargo +nightly asm pyby::bench_iter --rust
pub fn bench_iter(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB1_1
lea r8, [rdi, -, 1]
add rdi, -2
mov ecx, r8d
and ecx, 3
cmp rdi, 3
jae .LBB1_4
mov edx, 1
xor eax, eax
jmp .LBB1_6
.LBB1_1:
xor eax, eax
}
pop rcx
ret
.LBB1_4:
and r8, -4
neg r8
mov edx, 1
mov edi, 8
xor eax, eax
.LBB1_5:
lea rsi, [rdi, -, 6]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 4]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 2]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rdi
add rax, qword, ptr, [rsp]
add rdi, 8
lea rsi, [r8, +, rdx]
add rsi, 4
add rdx, 4
cmp rsi, 1
jne .LBB1_5
.LBB1_6:
test rcx, rcx
je .LBB1_9
add rdx, rdx
.LBB1_8:
mov qword, ptr, [rsp], rdx
add rax, qword, ptr, [rsp]
add rdx, 2
add rcx, -1
jne .LBB1_8
.LBB1_9:
pop rcx
ret
And they seem to be precisely identical, as expected, though notably with no SSE. I wonder what changes if we compile for small generated code (e.g. opt-level = "s" in the profile) instead of fast generated code:
pub fn bench_iter(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB1_1
dec rdi
mov ecx, 2
xor eax, eax
.LBB1_3:
mov qword, ptr, [rsp], rcx
add rcx, 2
add rax, qword, ptr, [rsp]
dec rdi
jne .LBB1_3
}
pop rcx
ret
.LBB1_1:
xor eax, eax
pop rcx
ret
Cool, so it did just end up becoming a single simple loop, which is indeed much smaller code; the inner loop is the part under the .LBB1_3 label (notice it’s actually looping from count down to zero, since the compiler determined the order doesn’t matter and counting down is an instruction smaller). Let’s see how much slower the benchmark itself is:
for time: [5.4088 ms 5.4163 ms 5.4242 ms]
change: [+49.304% +50.211% +51.020%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [5.4249 ms 5.4436 ms 5.4740 ms]
change: [+51.652% +52.344% +53.307%] (p = 0.00 < 0.05)
Performance has regressed.
About 50% slower, which is huge, but still much faster than either of the scripting languages in the prior post, lol.
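That count-down rewrite the compiler did is easy to sanity-check yourself: addition is commutative, so the direction of iteration can’t change the sum. A quick sketch of my own just to confirm the reordering is legal:

```rust
// Iterating 1..count forward or in reverse sums exactly the same terms,
// which is why the compiler is free to run the counter down to zero,
// where the loop test is a cheaper "decrement and branch if nonzero".
fn forward(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn reverse(count: usize) -> usize {
    (1..count).rev().map(|i| i * 2).sum()
}

fn main() {
    for count in [0, 1, 2, 1000, 10_000_000] {
        assert_eq!(forward(count), reverse(count));
    }
    println!("forward and reverse sums agree");
}
```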
Hmm, I wonder how fast it is when run in full debug mode; I’m betting this will be impressively slow, lol. Benching:
Benchmarking for: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 37.7s, or reduce sample count to 10.
for time: [376.34 ms 376.67 ms 377.02 ms]
change: [+10561% +10580% +10599%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.6s, or reduce sample count to 10.
iter time: [566.35 ms 567.12 ms 567.91 ms]
change: [+15795% +15837% +15883%] (p = 0.00 < 0.05)
Performance has regressed.
Wooo that is indeed a LOT slower! Like on par with interpreted python/ruby code kind of slow!
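Part of the debug-mode cost, beyond nothing being inlined, is that debug builds keep integer overflow checks enabled by default while release builds wrap silently. A small illustration using the explicit checked form, which behaves the same in any build mode:

```rust
// In a debug build, plain arithmetic like `i * 2` panics on overflow;
// checked_mul exposes that same check explicitly regardless of build mode.
fn main() {
    assert_eq!(3usize.checked_mul(2), Some(6));
    // Plain `usize::MAX * 2` would panic here in a debug build:
    assert_eq!(usize::MAX.checked_mul(2), None);
    println!("overflow checks demonstrated");
}
```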
Let’s see what the code looks like (the iterator version first, since they definitely are not compiling to the same code now!):
❯ cargo +nightly asm --build-type debug pyby::bench_iter --rust
pub fn bench_iter(count: usize) -> usize {
sub rsp, 56
mov qword, ptr, [rsp, +, 48], rdi
(1..count).map(|i| black_box(i * 2)).sum()
mov qword, ptr, [rsp, +, 32], 1
mov qword, ptr, [rsp, +, 40], rdi
mov rdi, qword, ptr, [rsp, +, 32]
mov rsi, qword, ptr, [rsp, +, 40]
call qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3map17h12723980faf40533E@GOTPCREL]
mov qword, ptr, [rsp, +, 16], rax
mov qword, ptr, [rsp, +, 24], rdx
mov rsi, qword, ptr, [rsp, +, 24]
mov rdi, qword, ptr, [rsp, +, 16]
call qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3sum17hf22cfa1d0bd6d00dE@GOTPCREL]
mov qword, ptr, [rsp, +, 8], rax
mov rax, qword, ptr, [rsp, +, 8]
}
add rsp, 56
ret
Indeed, lots of method calls in the loop that didn’t need to exist before. And now the basic for loop version:
❯ cargo +nightly asm --build-type debug pyby::bench_loop --rust
pub fn bench_loop(count: usize) -> usize {
sub rsp, 136
mov qword, ptr, [rsp, +, 104], rdi
let mut ret = 0;
mov qword, ptr, [rsp, +, 48], 0
for i in 1..count {
mov qword, ptr, [rsp, +, 56], 1
mov qword, ptr, [rsp, +, 64], rdi
mov rdi, qword, ptr, [rsp, +, 56]
mov rsi, qword, ptr, [rsp, +, 64]
call qword, ptr, [rip, +, _ZN63_$LT$I$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17ha1c7c7c0ba434465E@GOTPCREL]
mov qword, ptr, [rsp, +, 32], rax
mov qword, ptr, [rsp, +, 40], rdx
mov rax, qword, ptr, [rsp, +, 40]
mov rcx, qword, ptr, [rsp, +, 32]
mov qword, ptr, [rsp, +, 72], rcx
mov qword, ptr, [rsp, +, 80], rax
.LBB12_2:
mov rax, qword, ptr, [rip, +, _ZN4core4iter5range101_$LT$impl$u20$core..iter..traits..iterator..Iterator$u20$for$u20$core..ops..range..Range$LT$A$GT$$GT$4next17h413bbe735b9fe1d5E@GOTPCREL]
lea rdi, [rsp, +, 72]
call rax
mov qword, ptr, [rsp, +, 96], rdx
mov qword, ptr, [rsp, +, 88], rax
mov rax, qword, ptr, [rsp, +, 88]
test rax, rax
je .LBB12_5
jmp .LBB12_12
.LBB12_12:
jmp .LBB12_6
ud2
.LBB12_5:
}
mov rax, qword, ptr, [rsp, +, 48]
add rsp, 136
ret
.LBB12_6:
for i in 1..count {
mov rax, qword, ptr, [rsp, +, 96]
mov qword, ptr, [rsp, +, 112], rax
mov qword, ptr, [rsp, +, 120], rax
for i in 1..count {
mov qword, ptr, [rsp, +, 128], rax
ret += black_box(i * 2);
mov ecx, 2
mul rcx
mov qword, ptr, [rsp, +, 24], rax
seto al
test al, 1
jne .LBB12_8
mov rdi, qword, ptr, [rsp, +, 24]
ret += black_box(i * 2);
call qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
mov qword, ptr, [rsp, +, 16], rax
jmp .LBB12_9
.LBB12_8:
ret += black_box(i * 2);
lea rdi, [rip, +, str.1]
lea rdx, [rip, +, .L__unnamed_2]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 33
call rax
ud2
.LBB12_9:
mov rax, qword, ptr, [rsp, +, 16]
ret += black_box(i * 2);
add rax, qword, ptr, [rsp, +, 48]
mov qword, ptr, [rsp, +, 8], rax
setb al
test al, 1
jne .LBB12_11
mov rax, qword, ptr, [rsp, +, 8]
mov qword, ptr, [rsp, +, 48], rax
for i in 1..count {
jmp .LBB12_2
.LBB12_11:
ret += black_box(i * 2);
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_3]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
A lot more is inlined here, but there’s still a lot of setup for the 1..count iterator. Let’s add a pure-integer loop as well; added this function to the library:
pub fn bench_manual(count: usize) -> usize {
    let mut ret = 0;
    let mut i = 0;
    while i < count {
        ret += black_box(i * 2);
        i += 1;
    }
    ret
}
And added this to the benchmark:
c.bench_function("manual", |b| b.iter(|| bench_manual(black_box(COUNT))));
And benchmarked as normal:
for time: [3.5597 ms 3.5656 ms 3.5712 ms]
iter time: [3.5255 ms 3.5312 ms 3.5369 ms]
manual time: [3.5953 ms 3.6012 ms 3.6080 ms]
And again in debug mode:
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.0s, or reduce sample count to 10.
for time: [378.34 ms 378.85 ms 379.38 ms]
change: [+10502% +10525% +10548%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.8s, or reduce sample count to 10.
iter time: [566.02 ms 566.90 ms 567.91 ms]
change: [+15919% +15954% +15988%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking manual: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.9s, or reduce sample count to 40.
manual time: [107.72 ms 108.03 ms 108.36 ms]
change: [+2889.5% +2899.8% +2912.2%] (p = 0.00 < 0.05)
Performance has regressed.
And dumped the assembly of that new function in release mode:
pub fn bench_manual(count: usize) -> usize {
push rax
while i < count {
test rdi, rdi
je .LBB0_1
lea rax, [rdi, -, 1]
mov r8d, edi
and r8d, 3
cmp rax, 3
jae .LBB0_8
xor eax, eax
xor edx, edx
jmp .LBB0_4
.LBB0_1:
xor eax, eax
}
pop rcx
ret
.LBB0_8:
while i < count {
and rdi, -4
mov esi, 6
xor eax, eax
xor edx, edx
.LBB0_9:
lea rcx, [rsi, -, 6]
mov qword, ptr, [rsp], rcx
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
lea rcx, [rsi, -, 4]
mov qword, ptr, [rsp], rcx
add rax, qword, ptr, [rsp]
lea rcx, [rsi, -, 2]
mov qword, ptr, [rsp], rcx
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
i += 1;
add rdx, 4
while i < count {
add rsi, 8
cmp rdi, rdx
jne .LBB0_9
.LBB0_4:
test r8, r8
je .LBB0_7
add rdx, rdx
.LBB0_6:
mov qword, ptr, [rsp], rdx
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
while i < count {
add rdx, 2
add r8, -1
jne .LBB0_6
.LBB0_7:
}
pop rcx
ret
And again in debug mode:
pub fn bench_manual(count: usize) -> usize {
sub rsp, 72
mov qword, ptr, [rsp, +, 40], rdi
mov qword, ptr, [rsp, +, 64], rdi
let mut ret = 0;
mov qword, ptr, [rsp, +, 48], 0
let mut i = 0;
mov qword, ptr, [rsp, +, 56], 0
.LBB12_1:
mov rax, qword, ptr, [rsp, +, 40]
while i < count {
cmp qword, ptr, [rsp, +, 56], rax
jb .LBB12_3
}
mov rax, qword, ptr, [rsp, +, 48]
add rsp, 72
ret
.LBB12_3:
ret += black_box(i * 2);
mov rax, qword, ptr, [rsp, +, 56]
mov ecx, 2
mul rcx
mov qword, ptr, [rsp, +, 32], rax
seto al
test al, 1
jne .LBB12_5
mov rdi, qword, ptr, [rsp, +, 32]
ret += black_box(i * 2);
call qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
mov qword, ptr, [rsp, +, 24], rax
jmp .LBB12_6
.LBB12_5:
ret += black_box(i * 2);
lea rdi, [rip, +, str.1]
lea rdx, [rip, +, .L__unnamed_2]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 33
call rax
ud2
.LBB12_6:
mov rax, qword, ptr, [rsp, +, 24]
ret += black_box(i * 2);
add rax, qword, ptr, [rsp, +, 48]
mov qword, ptr, [rsp, +, 16], rax
setb al
test al, 1
jne .LBB12_8
mov rax, qword, ptr, [rsp, +, 16]
mov qword, ptr, [rsp, +, 48], rax
i += 1;
mov rax, qword, ptr, [rsp, +, 56]
add rax, 1
mov qword, ptr, [rsp, +, 8], rax
setb al
test al, 1
jne .LBB12_10
jmp .LBB12_9
.LBB12_8:
ret += black_box(i * 2);
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_3]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
.LBB12_9:
mov rax, qword, ptr, [rsp, +, 8]
i += 1;
mov qword, ptr, [rsp, +, 56], rax
while i < count {
jmp .LBB12_1
.LBB12_10:
i += 1;
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_4]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
So as a raw number loop it runs the fastest of all of them in debug mode, since it doesn’t have to make the function calls that normally get optimized out. In release compilations they all have basically the same running time, although the generated code has quite a few differences, probably because this version isn’t allowed to change the iteration order, so it has to go forward instead of being allowed to count down to 0. Let’s make that change manually; the new function code is now this (also starting at 1 instead of 0 to match the python/ruby above; it won’t make much difference though, if any):
pub fn bench_manual(mut count: usize) -> usize {
    let mut ret = 0;
    while count > 0 {
        ret += black_box(count * 2);
        count -= 1;
    }
    ret
}
Never an optimization you’d usually want to do by hand, as the compiler can do it for you with normally written code, but still, benchmarking:
for time: [3.5657 ms 3.5742 ms 3.5837 ms]
iter time: [3.5695 ms 3.5802 ms 3.5918 ms]
manual time: [3.2158 ms 3.2211 ms 3.2267 ms]
And now it’s even better, since the black_box call (which specifically prevents optimizations being applied through it) no longer has to be worked around by the loop transformation. We can just as easily change the other two to a reverse iteration as well, their code now being:
use criterion::black_box;

pub fn bench_manual(mut count: usize) -> usize {
    let mut ret = 0;
    while count > 0 {
        ret += black_box(count * 2);
        count -= 1;
    }
    ret += black_box(0 * 2); // Needed to match prior count
    ret
}

pub fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in (1..count).rev() {
        ret += black_box(i * 2);
    }
    ret
}

pub fn bench_iter(count: usize) -> usize {
    (1..count).rev().map(|i| black_box(i * 2)).sum()
}
And the full benchmark being:
for time: [3.3039 ms 3.3084 ms 3.3128 ms]
iter time: [3.3149 ms 3.3235 ms 3.3332 ms]
manual time: [3.2541 ms 3.2624 ms 3.2721 ms]
And of course now they’re right there with the manual version: once the black_box usage has been equalled out, the for and iterator forms are free to pick up whatever other optimizations the compiler can find, since it’s that black_box that was preventing them.
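Since black_box did all the heavy lifting in these benchmarks, it’s worth noting what it actually is: an identity function the optimizer must treat as opaque. Criterion’s version behaves like std::hint::black_box (a sketch; criterion’s exact implementation may differ):

```rust
use std::hint::black_box;

// black_box returns its argument unchanged, but the optimizer has to
// assume the value could be anything, so it can't constant-fold through
// the call or delete computations feeding into it.
fn main() {
    let n = black_box(21); // the compiler must treat 21 as unknown here
    let doubled = n * 2;   // so this multiply can't be folded away
    assert_eq!(doubled, 42); // the runtime value itself is unaffected
    println!("{doubled}");
}
```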