And just to add a baseline comparison from a compiled language:
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(COUNT)));
    c.bench_function("iter", |b| b.iter(|| bench_iter(COUNT)));
}

criterion_group!(benches, bench);
criterion_main!(benches);
Which runs as:
❯ cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.23s
for time: [315.03 ps 315.42 ps 315.83 ps]
change: [-0.7200% -0.2908% +0.1634%] (p = 0.21 > 0.05)
No change in performance detected.
iter time: [315.22 ps 316.68 ps 319.05 ps]
change: [-0.9833% -0.4288% +0.2130%] (p = 0.15 > 0.05)
No change in performance detected.
Uhhh, yeah let’s not optimize out the entire loop… >.>
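Times of ~315 ps are a dead giveaway that the whole loop got folded into pure arithmetic at compile time. As a sketch of what the optimizer can do here (my own closed-form derivation, not the exact code LLVM emits): the body sums 2*i for i in 1..count, which collapses to count * (count - 1).

```rust
// Summing 2*i over 1..count has the closed form
// 2 * (1 + 2 + ... + (count-1)) = count * (count - 1),
// so the entire loop can be replaced with one multiply.
fn summed_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn closed_form(count: usize) -> usize {
    // saturating_sub keeps count == 0 from underflowing; both sides give 0.
    count.saturating_sub(1) * count
}

fn main() {
    for count in [0, 1, 2, 10, 10_000_000] {
        assert_eq!(summed_loop(count), closed_form(count));
    }
    println!("loop matches closed form");
}
```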
Changed the code to this:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And now:
❯ cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.17s
for time: [1.1803 ns 1.1827 ns 1.1850 ns]
change: [+273.19% +274.47% +275.65%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [1.1963 ns 1.2008 ns 1.2058 ns]
change: [+277.17% +279.55% +281.87%] (p = 0.00 < 0.05)
Performance has regressed.
There we go, much better (well, much slower), however it’s still optimizing the loop down to straight math instead of an actual loop, lol. Let’s black_box something inside the loop body to keep that from being reduced to math too… Code is now:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += black_box(i * 2);
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| black_box(i * 2)).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And now runs as:
❯ cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.19s
for time: [3.7198 ms 3.7250 ms 3.7302 ms]
change: [+314567311% +315475093% +316458081%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [3.6822 ms 3.6891 ms 3.6961 ms]
change: [+307082143% +308377678% +309613221%] (p = 0.00 < 0.05)
Performance has regressed.
Soooo, I’ve never seen such a huge change percentage before, lol, but this looks correct for an actual loop now. The best time among the python/ruby runs was 13.14 ms, and here the time is ~3.7 ms, so I’ve learned two things: the JITs are not capable of optimizing the loop down to straight math to remove it entirely like a good language’s compiler can, and they can’t seem to optimize the loop itself very well even at that.
Hmm, let’s split the actual loop functions into a standalone library, the code of which is just:
use criterion::black_box;

pub fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += black_box(i * 2);
    }
    ret
}

pub fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| black_box(i * 2)).sum()
}
I kept the black_box so the loop doesn’t get optimized down to plain math. The benchmark itself lives in a different binary that links in the library:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use pyby::*;

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And running that:
❯ cargo criterion
for time: [3.5874 ms 3.6058 ms 3.6273 ms]
iter time: [3.5454 ms 3.5532 ms 3.5672 ms]
So about the same.
Let’s dump the assembly of the code though:
❯ cargo +nightly asm pyby::bench_loop --rust
pub fn bench_loop(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB0_1
lea r8, [rdi, -, 1]
add rdi, -2
mov ecx, r8d
and ecx, 3
cmp rdi, 3
jae .LBB0_8
mov edx, 1
xor eax, eax
jmp .LBB0_4
.LBB0_1:
xor eax, eax
}
pop rcx
ret
.LBB0_8:
and r8, -4
neg r8
mov edx, 1
mov edi, 8
xor eax, eax
.LBB0_9:
lea rsi, [rdi, -, 6]
mov qword, ptr, [rsp], rsi
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 4]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 2]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rdi
add rax, qword, ptr, [rsp]
add rdi, 8
lea rsi, [r8, +, rdx]
add rsi, 4
add rdx, 4
cmp rsi, 1
jne .LBB0_9
.LBB0_4:
test rcx, rcx
je .LBB0_7
add rdx, rdx
.LBB0_6:
mov qword, ptr, [rsp], rdx
add rax, qword, ptr, [rsp]
add rdx, 2
add rcx, -1
jne .LBB0_6
.LBB0_7:
}
pop rcx
ret
And for the iterator version:
❯ cargo +nightly asm pyby::bench_iter --rust
pub fn bench_iter(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB1_1
lea r8, [rdi, -, 1]
add rdi, -2
mov ecx, r8d
and ecx, 3
cmp rdi, 3
jae .LBB1_4
mov edx, 1
xor eax, eax
jmp .LBB1_6
.LBB1_1:
xor eax, eax
}
pop rcx
ret
.LBB1_4:
and r8, -4
neg r8
mov edx, 1
mov edi, 8
xor eax, eax
.LBB1_5:
lea rsi, [rdi, -, 6]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 4]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 2]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rdi
add rax, qword, ptr, [rsp]
add rdi, 8
lea rsi, [r8, +, rdx]
add rsi, 4
add rdx, 4
cmp rsi, 1
jne .LBB1_5
.LBB1_6:
test rcx, rcx
je .LBB1_9
add rdx, rdx
.LBB1_8:
mov qword, ptr, [rsp], rdx
add rax, qword, ptr, [rsp]
add rdx, 2
add rcx, -1
jne .LBB1_8
.LBB1_9:
pop rcx
ret
And they seem to be precisely identical, as expected, though notably with no SSE. I wonder what changes if we compile for small generated code (e.g. opt-level = "s" in the profile) instead of fast generated code:
pub fn bench_iter(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB1_1
dec rdi
mov ecx, 2
xor eax, eax
.LBB1_3:
mov qword, ptr, [rsp], rcx
add rcx, 2
add rax, qword, ptr, [rsp]
dec rdi
jne .LBB1_3
}
pop rcx
ret
.LBB1_1:
xor eax, eax
pop rcx
ret
Cool, so it did just end up becoming a single simple loop, which is indeed much smaller code; the inner loop is the part under the .LBB1_3 label (notice it’s actually looping from count down to zero, since the compiler determined the order doesn’t matter and counting down is an instruction smaller). Let’s see how much slower the benchmark itself is:
for time: [5.4088 ms 5.4163 ms 5.4242 ms]
change: [+49.304% +50.211% +51.020%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [5.4249 ms 5.4436 ms 5.4740 ms]
change: [+51.652% +52.344% +53.307%] (p = 0.00 < 0.05)
Performance has regressed.
About 50% slower, which is huge, but still much faster than either of the scripting languages in the prior post, lol.
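That count-down rewrite the compiler did is easy to sanity-check yourself: addition is commutative, so the direction of iteration can’t change the sum. A quick sketch of my own just to confirm the reordering is legal:

```rust
// Iterating 1..count forward or in reverse sums exactly the same terms,
// which is why the compiler is free to run the counter down to zero,
// where the loop test is a cheaper "decrement and branch if nonzero".
fn forward(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn reverse(count: usize) -> usize {
    (1..count).rev().map(|i| i * 2).sum()
}

fn main() {
    for count in [0, 1, 2, 1000, 10_000_000] {
        assert_eq!(forward(count), reverse(count));
    }
    println!("forward and reverse sums agree");
}
```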
Hmm, I wonder how fast it is when run in full debug mode; I’m betting this will be impressively slow, lol. Benching:
Benchmarking for: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 37.7s, or reduce sample count to 10.
for time: [376.34 ms 376.67 ms 377.02 ms]
change: [+10561% +10580% +10599%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.6s, or reduce sample count to 10.
iter time: [566.35 ms 567.12 ms 567.91 ms]
change: [+15795% +15837% +15883%] (p = 0.00 < 0.05)
Performance has regressed.
Wooo that is indeed a LOT slower! Like on par with interpreted python/ruby code kind of slow!
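Part of the debug-mode cost, beyond nothing being inlined, is that debug builds keep integer overflow checks enabled by default while release builds wrap silently. A small illustration using the explicit checked form, which behaves the same in any build mode:

```rust
// In a debug build, plain arithmetic like `i * 2` panics on overflow;
// checked_mul exposes that same check explicitly regardless of build mode.
fn main() {
    assert_eq!(3usize.checked_mul(2), Some(6));
    // Plain `usize::MAX * 2` would panic here in a debug build:
    assert_eq!(usize::MAX.checked_mul(2), None);
    println!("overflow checks demonstrated");
}
```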
Let’s see what the code looks like (the iterator version first, since they definitely are not compiling to the same code now!):
❯ cargo +nightly asm --build-type debug pyby::bench_iter --rust
pub fn bench_iter(count: usize) -> usize {
sub rsp, 56
mov qword, ptr, [rsp, +, 48], rdi
(1..count).map(|i| black_box(i * 2)).sum()
mov qword, ptr, [rsp, +, 32], 1
mov qword, ptr, [rsp, +, 40], rdi
mov rdi, qword, ptr, [rsp, +, 32]
mov rsi, qword, ptr, [rsp, +, 40]
call qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3map17h12723980faf40533E@GOTPCREL]
mov qword, ptr, [rsp, +, 16], rax
mov qword, ptr, [rsp, +, 24], rdx
mov rsi, qword, ptr, [rsp, +, 24]
mov rdi, qword, ptr, [rsp, +, 16]
call qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3sum17hf22cfa1d0bd6d00dE@GOTPCREL]
mov qword, ptr, [rsp, +, 8], rax
mov rax, qword, ptr, [rsp, +, 8]
}
add rsp, 56
ret
Indeed, lots of method calls in the loop that didn’t need to exist before. And now the basic for loop version:
❯ cargo +nightly asm --build-type debug pyby::bench_loop --rust
pub fn bench_loop(count: usize) -> usize {
sub rsp, 136
mov qword, ptr, [rsp, +, 104], rdi
let mut ret = 0;
mov qword, ptr, [rsp, +, 48], 0
for i in 1..count {
mov qword, ptr, [rsp, +, 56], 1
mov qword, ptr, [rsp, +, 64], rdi
mov rdi, qword, ptr, [rsp, +, 56]
mov rsi, qword, ptr, [rsp, +, 64]
call qword, ptr, [rip, +, _ZN63_$LT$I$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17ha1c7c7c0ba434465E@GOTPCREL]
mov qword, ptr, [rsp, +, 32], rax
mov qword, ptr, [rsp, +, 40], rdx
mov rax, qword, ptr, [rsp, +, 40]
mov rcx, qword, ptr, [rsp, +, 32]
mov qword, ptr, [rsp, +, 72], rcx
mov qword, ptr, [rsp, +, 80], rax
.LBB12_2:
mov rax, qword, ptr, [rip, +, _ZN4core4iter5range101_$LT$impl$u20$core..iter..traits..iterator..Iterator$u20$for$u20$core..ops..range..Range$LT$A$GT$$GT$4next17h413bbe735b9fe1d5E@GOTPCREL]
lea rdi, [rsp, +, 72]
call rax
mov qword, ptr, [rsp, +, 96], rdx
mov qword, ptr, [rsp, +, 88], rax
mov rax, qword, ptr, [rsp, +, 88]
test rax, rax
je .LBB12_5
jmp .LBB12_12
.LBB12_12:
jmp .LBB12_6
ud2
.LBB12_5:
}
mov rax, qword, ptr, [rsp, +, 48]
add rsp, 136
ret
.LBB12_6:
for i in 1..count {
mov rax, qword, ptr, [rsp, +, 96]
mov qword, ptr, [rsp, +, 112], rax
mov qword, ptr, [rsp, +, 120], rax
for i in 1..count {
mov qword, ptr, [rsp, +, 128], rax
ret += black_box(i * 2);
mov ecx, 2
mul rcx
mov qword, ptr, [rsp, +, 24], rax
seto al
test al, 1
jne .LBB12_8
mov rdi, qword, ptr, [rsp, +, 24]
ret += black_box(i * 2);
call qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
mov qword, ptr, [rsp, +, 16], rax
jmp .LBB12_9
.LBB12_8:
ret += black_box(i * 2);
lea rdi, [rip, +, str.1]
lea rdx, [rip, +, .L__unnamed_2]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 33
call rax
ud2
.LBB12_9:
mov rax, qword, ptr, [rsp, +, 16]
ret += black_box(i * 2);
add rax, qword, ptr, [rsp, +, 48]
mov qword, ptr, [rsp, +, 8], rax
setb al
test al, 1
jne .LBB12_11
mov rax, qword, ptr, [rsp, +, 8]
mov qword, ptr, [rsp, +, 48], rax
for i in 1..count {
jmp .LBB12_2
.LBB12_11:
ret += black_box(i * 2);
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_3]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
A lot more is inlined here, but there’s still a lot of setup for the 1..count iterator. Let’s add a pure-integer loop as well; added this function to the library:
pub fn bench_manual(count: usize) -> usize {
    let mut ret = 0;
    let mut i = 0;
    while i < count {
        ret += black_box(i * 2);
        i += 1;
    }
    ret
}
And added this to the benchmark:
c.bench_function("manual", |b| b.iter(|| bench_manual(black_box(COUNT))));
And benchmarked as normal:
for time: [3.5597 ms 3.5656 ms 3.5712 ms]
iter time: [3.5255 ms 3.5312 ms 3.5369 ms]
manual time: [3.5953 ms 3.6012 ms 3.6080 ms]
And again in debug mode:
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.0s, or reduce sample count to 10.
for time: [378.34 ms 378.85 ms 379.38 ms]
change: [+10502% +10525% +10548%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.8s, or reduce sample count to 10.
iter time: [566.02 ms 566.90 ms 567.91 ms]
change: [+15919% +15954% +15988%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking manual: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.9s, or reduce sample count to 40.
manual time: [107.72 ms 108.03 ms 108.36 ms]
change: [+2889.5% +2899.8% +2912.2%] (p = 0.00 < 0.05)
Performance has regressed.
And dumped the assembly of that new function in release mode:
pub fn bench_manual(count: usize) -> usize {
push rax
while i < count {
test rdi, rdi
je .LBB0_1
lea rax, [rdi, -, 1]
mov r8d, edi
and r8d, 3
cmp rax, 3
jae .LBB0_8
xor eax, eax
xor edx, edx
jmp .LBB0_4
.LBB0_1:
xor eax, eax
}
pop rcx
ret
.LBB0_8:
while i < count {
and rdi, -4
mov esi, 6
xor eax, eax
xor edx, edx
.LBB0_9:
lea rcx, [rsi, -, 6]
mov qword, ptr, [rsp], rcx
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
lea rcx, [rsi, -, 4]
mov qword, ptr, [rsp], rcx
add rax, qword, ptr, [rsp]
lea rcx, [rsi, -, 2]
mov qword, ptr, [rsp], rcx
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
i += 1;
add rdx, 4
while i < count {
add rsi, 8
cmp rdi, rdx
jne .LBB0_9
.LBB0_4:
test r8, r8
je .LBB0_7
add rdx, rdx
.LBB0_6:
mov qword, ptr, [rsp], rdx
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
while i < count {
add rdx, 2
add r8, -1
jne .LBB0_6
.LBB0_7:
}
pop rcx
ret
And again in debug mode:
pub fn bench_manual(count: usize) -> usize {
sub rsp, 72
mov qword, ptr, [rsp, +, 40], rdi
mov qword, ptr, [rsp, +, 64], rdi
let mut ret = 0;
mov qword, ptr, [rsp, +, 48], 0
let mut i = 0;
mov qword, ptr, [rsp, +, 56], 0
.LBB12_1:
mov rax, qword, ptr, [rsp, +, 40]
while i < count {
cmp qword, ptr, [rsp, +, 56], rax
jb .LBB12_3
}
mov rax, qword, ptr, [rsp, +, 48]
add rsp, 72
ret
.LBB12_3:
ret += black_box(i * 2);
mov rax, qword, ptr, [rsp, +, 56]
mov ecx, 2
mul rcx
mov qword, ptr, [rsp, +, 32], rax
seto al
test al, 1
jne .LBB12_5
mov rdi, qword, ptr, [rsp, +, 32]
ret += black_box(i * 2);
call qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
mov qword, ptr, [rsp, +, 24], rax
jmp .LBB12_6
.LBB12_5:
ret += black_box(i * 2);
lea rdi, [rip, +, str.1]
lea rdx, [rip, +, .L__unnamed_2]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 33
call rax
ud2
.LBB12_6:
mov rax, qword, ptr, [rsp, +, 24]
ret += black_box(i * 2);
add rax, qword, ptr, [rsp, +, 48]
mov qword, ptr, [rsp, +, 16], rax
setb al
test al, 1
jne .LBB12_8
mov rax, qword, ptr, [rsp, +, 16]
mov qword, ptr, [rsp, +, 48], rax
i += 1;
mov rax, qword, ptr, [rsp, +, 56]
add rax, 1
mov qword, ptr, [rsp, +, 8], rax
setb al
test al, 1
jne .LBB12_10
jmp .LBB12_9
.LBB12_8:
ret += black_box(i * 2);
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_3]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
.LBB12_9:
mov rax, qword, ptr, [rsp, +, 8]
i += 1;
mov qword, ptr, [rsp, +, 56], rax
while i < count {
jmp .LBB12_1
.LBB12_10:
i += 1;
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_4]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
So as a raw number loop it runs the fastest of all of them in debug mode, since it doesn’t have to make the function calls that normally get optimized out. In release compilations they all have basically the same running time, although the generated code has quite a few differences, probably because this version isn’t allowed to change the iteration order, so it has to go forward instead of being allowed to count down to 0. Let’s make that change manually; the new function code is now this (also starting at 1 instead of 0 to match the python/ruby above; it won’t make much difference though, if any):
pub fn bench_manual(mut count: usize) -> usize {
    let mut ret = 0;
    while count > 0 {
        ret += black_box(count * 2);
        count -= 1;
    }
    ret
}
Never an optimization you’d usually want to do by hand, as the compiler can do it for you with normally written code, but still, benchmarking:
for time: [3.5657 ms 3.5742 ms 3.5837 ms]
iter time: [3.5695 ms 3.5802 ms 3.5918 ms]
manual time: [3.2158 ms 3.2211 ms 3.2267 ms]
And now it’s even better, since the black_box call (which specifically prevents optimizations being applied through it) no longer has to be worked around by the loop transformation. We can just as easily change the other two to a reverse iteration as well, their code now being:
use criterion::black_box;

pub fn bench_manual(mut count: usize) -> usize {
    let mut ret = 0;
    while count > 0 {
        ret += black_box(count * 2);
        count -= 1;
    }
    ret += black_box(0 * 2); // Needed to match prior count
    ret
}

pub fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in (1..count).rev() {
        ret += black_box(i * 2);
    }
    ret
}

pub fn bench_iter(count: usize) -> usize {
    (1..count).rev().map(|i| black_box(i * 2)).sum()
}
And the full benchmark being:
for time: [3.3039 ms 3.3084 ms 3.3128 ms]
iter time: [3.3149 ms 3.3235 ms 3.3332 ms]
manual time: [3.2541 ms 3.2624 ms 3.2721 ms]
And of course now they’re right there with the manual version: once the black_box usage has been equalled out, the for and iterator forms are free to pick up whatever other optimizations the compiler can find, since it’s that black_box that was preventing them.
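Since black_box did all the heavy lifting in these benchmarks, it’s worth noting what it actually is: an identity function the optimizer must treat as opaque. Criterion’s version behaves like std::hint::black_box (a sketch; criterion’s exact implementation may differ):

```rust
use std::hint::black_box;

// black_box returns its argument unchanged, but the optimizer has to
// assume the value could be anything, so it can't constant-fold through
// the call or delete computations feeding into it.
fn main() {
    let n = black_box(21); // the compiler must treat 21 as unknown here
    let doubled = n * 2;   // so this multiply can't be folded away
    assert_eq!(doubled, 42); // the runtime value itself is unaffected
    println!("{doubled}");
}
```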