Ruby vs Python comes down to the for loop.
Contrasting how each language handles iteration helps understand how to work effectively in either.
Read in full here:
This thread was posted by one of our members via one of our news source trackers.
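A minimal sketch of the contrast in question, from the Python side: Python's `for` loop is sugar over the iterator protocol (`iter()` / `next()`), whereas Ruby's idiomatic loop is a method call (`each`) that takes a block. Spelling out the desugaring:

```python
# Python's "for n in obj: ..." is sugar over the iterator protocol;
# Ruby's closest idiom is obj.each { |n| ... }, a method call with a block.
nums = [1, 2, 3]

total = 0
it = iter(nums)          # what the for statement does first
while True:
    try:
        n = next(it)     # advance the iterator explicitly
    except StopIteration:
        break            # the for statement catches this to end the loop
    total += n

assert total == sum(nums)  # identical to: for n in nums: total += n
```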
These design differences are also why Python is faster than Ruby in any code I've seen (I wonder how Python's PyPy JIT compares to Ruby's new JIT, hmm…)
The speed difference was always tangible when I first compared the two - not sure what it's like now, but Ruby is a lot faster than it used to be.
Python is also a lot faster than it used to be, and it has a few JITs available for it now. I've never benchmarked those against each other, hmmm…
@AstonJ Since you know Ruby well (and although I could write code in it, I'm unsure just how "natural" that code would be), think you could come up with a simple enough test for doing some math or so? Otherwise I might find some online. I'm curious how they differ nowadays, lol.
Wow, Ruby's non-official variants are hard to install/build…
Managed to get the main build 3.1.0-preview1 installed, but couldn't get JRuby, mruby, GraalVM Ruby and a few others built at all; their build scripts are baaaaaad… o.O
Though I think the Ruby ecosystem is a bit more uniform and universal around the main build, unlike Python with its several well-used "other" implementations? The main Ruby even has a JIT now, but the main Python likely never will, as it's designed to be simple and efficient rather than fast (which is also why Python is easy to embed as a scripting language, like Lua).
Is this good as a normal, standard Ruby-style loop over numbers for a benchmark? I want to benchmark the actual language interpreter/JIT, not any C code in either language (since those can be equally performant, and Python probably has an edge there because of all its scientific libraries), and I want it to be how someone would actually write it:
require "benchmark"

def bench(count)
  ret = 0
  (1..count).each do |n|
    ret += ret * 2
  end
end

result = Benchmark.realtime do
  bench(10_000_000)
end
puts("Time Taken for loop of 10_000_000: #{result}s")
And here was Python's:
from timeit import timeit

def bench(count):
    ret = 0
    for i in range(1, count):
        ret += ret * 2
    return ret

result = timeit(lambda: bench(10_000_000), number=1)
print(f"Time Taken for loop of 10_000_000: {result}s")
And running them:
⯠make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.44509852200280875s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.06898683903273195s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.014560726936906576s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5772202990483493s
So it looks like Ruby is by default still slower than Python, though not by near the margin it used to be. Python's two most popular JITs, Graal and PyPy, are both significantly faster, with PyPy being the fastest (by a large margin even over Graal; I'm impressed by that…).
However, Ruby's times are so slow that I'm not sure it's using the JIT. Let's see, does that need some special argument to enable… Hrmm, Google shows… the new Ruby JIT isn't terribly performant and actually slows down some cases (PyPy sure doesn't seem to do that for Python…), but still, pure math should be something a JIT can do very, very well and easily. Ah, --yjit, let's add that…
⯠make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.425267347949557s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.06571533100213856s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.01419509295374155s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5761694150278345s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.49059942609164864s
Well, that barely helped… Let's see… Wait, it has two JITs?! Adding in the other one:
⯠make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.4192896030144766s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.05795672698877752s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.021143854944966733s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5799980269512162s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --jit loop.rb
Time Taken for loop of 10_000_000: 0.593747540959157s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.4838587270351127s
Huh, so the normal one actually runs slower… Let's do some warmups in the benches, changing the code to… oh wait, I don't know if Ruby has unlimited-length integers like Python, so I'm changing the actual benchmark to not care about that as well:
Ruby:
require "benchmark"

def bench(count)
  ret = 0
  (1..count).each do |i|
    ret += i * 2
  end
end

# Warmups
(1..10_000).each do |i|
  bench(i)
end

result = Benchmark.realtime do
  bench(10_000_000)
end
puts("Time Taken for loop of 10_000_000: #{result}s")
Python:
from timeit import timeit

def bench(count):
    ret = 0
    for i in range(1, count):
        ret += i * 2
    return ret

# Warmups
for i in range(10_000):
    bench(i)

result = timeit(lambda: bench(10_000_000), number=1)
print(f"Time Taken for loop of 10_000_000: {result}s")
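On the integer-size worry: Ruby's Integer, like Python's int, auto-promotes to arbitrary precision, so neither language overflows here. It also turns out the original `ret += ret * 2` body sidestepped the question entirely, since starting from 0, `ret` never leaves 0. A quick Python check:

```python
# The original body never grows: starting from 0, ret += ret * 2 keeps ret at 0.
ret = 0
for _ in range(1, 1000):
    ret += ret * 2
assert ret == 0

# Starting from 1 instead, ret triples each step and Python's int just grows:
ret = 1
for _ in range(100):
    ret += ret * 2          # i.e. ret *= 3
assert ret == 3 ** 100      # arbitrary precision, no overflow
```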
And results:
⯠make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.6883266480872408s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.19750940497033298s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.01244226808194071s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5437580200377852s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --jit loop.rb
Time Taken for loop of 10_000_000: 0.44931442802771926s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.4761481450404972s
Well hey, the normal Ruby JIT finally (barely) sped up over the Ruby interpreter, and is better than the "new" YJIT too, and all of them are (barely) faster than non-JIT Python. But JIT'd Python is still utterly blowing away all the Rubys, JIT'd or not, nowhere within the same ballpark… Let's see if I can work on installing GraalVM Ruby some more; I'd be most curious to see how it compares to GraalPython… Yeah, it doesn't like where it's installing, at all, and I can't figure out why:
ERROR: cannot install TruffleRuby to /home/overminddl1/.asdf/installs/ruby/truffleruby+graalvm-21.3.0, which does not look like a valid TruffleRuby prefix
TruffleRuby only supports being installed to a not existing directory or replacing an existing TruffleRuby installation
This is an amazingly unhelpful message. That directory does indeed not exist, so what's the issue… I love when error messages are just wrong… >.>
Well, let's take a look at mruby then, since that's one I've heard a lot about:
BUILD FAILED (Ubuntu 21.10 using ruby-build 20211109)
Inspect or clean up the working tree at /tmp/ruby-build.20211116102321.569674.TUm8SR
Results logged to /tmp/ruby-build.20211116102321.569674.TUm8SR
⯠cat /tmp/ruby-build.20211116102321.569674.TUm8SR
cat: /tmp/ruby-build.20211116102321.569674.TUm8SR: Is a directory
...
Wait, I need Ruby… to install mruby… what on earth kind of recursion is this?!?
Uuuuugh, fine then… Aaaand mruby installed! Let's see if that fixed GraalVM Ruby too… and nope, same useless message. Well, let's bench with mruby:
⯠make bench_loop
/home/overminddl1/.asdf/installs/python/3.11-dev/bin/python loop.py
Time Taken for loop of 10_000_000: 0.6626679109176621s
/home/overminddl1/.asdf/installs/python/graalpython-21.3.0/bin/python loop.py
Time Taken for loop of 10_000_000: 0.2004202629905194s
/home/overminddl1/.asdf/installs/python/pypy3.8-7.3.7/bin/python loop.py
Time Taken for loop of 10_000_000: 0.013142568059265614s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby loop.rb
Time Taken for loop of 10_000_000: 0.5442068020347506s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --jit loop.rb
Time Taken for loop of 10_000_000: 0.4532001350307837s
/home/overminddl1/.asdf/installs/ruby/3.1.0-preview1/bin/ruby --yjit loop.rb
Time Taken for loop of 10_000_000: 0.5024946819758043s
/home/overminddl1/.asdf/installs/ruby/mruby-3.0.0/bin/ruby loop.rb
trace (most recent call last):
loop.rb:1: undefined method 'require' (NoMethodError)
make: *** [Makefile:18: bench_loop] Error 1
Uhhhhh, well, that's… how do I debug a missing stdlib module for mruby?!? O.o
And just to add a baseline comparison to a compiled language:
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(COUNT)));
    c.bench_function("iter", |b| b.iter(|| bench_iter(COUNT)));
}

criterion_group!(benches, bench);
criterion_main!(benches);
Which runs as:
⯠cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.23s
for time: [315.03 ps 315.42 ps 315.83 ps]
change: [-0.7200% -0.2908% +0.1634%] (p = 0.21 > 0.05)
No change in performance detected.
iter time: [315.22 ps 316.68 ps 319.05 ps]
change: [-0.9833% -0.4288% +0.2130%] (p = 0.15 > 0.05)
No change in performance detected.
Uhhh, yeah, let's not optimize out the entire loop… >.>
Changed the code to this:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += i * 2;
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| i * 2).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And now:
⯠cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.17s
for time: [1.1803 ns 1.1827 ns 1.1850 ns]
change: [+273.19% +274.47% +275.65%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [1.1963 ns 1.2008 ns 1.2058 ns]
change: [+277.17% +279.55% +281.87%] (p = 0.00 < 0.05)
Performance has regressed.
There we go, much better, much slower. However, it's still optimizing the loop down to some math instead of an actual loop, lol. Let's black_box something in the loop to keep that from being optimized to math too… Code is now:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += black_box(i * 2);
    }
    ret
}

fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| black_box(i * 2)).sum()
}

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And now runs as:
⯠cargo criterion
Compiling pyby v0.1.0 (/home/overminddl1/Projects/pyby)
Finished bench [optimized] target(s) in 2.19s
for time: [3.7198 ms 3.7250 ms 3.7302 ms]
change: [+314567311% +315475093% +316458081%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [3.6822 ms 3.6891 ms 3.6961 ms]
change: [+307082143% +308377678% +309613221%] (p = 0.00 < 0.05)
Performance has regressed.
Soooo, never seen such a huge change amount before, lol; this looks correct for a loop now. So of the Python/Ruby runs the best time was 13.14ms, and here the time is ~3.7ms. I've learned both that the JITs are not capable of optimizing down to straight math to remove the loop entirely like a good language's compiler can, and that they can't seem to optimize the loops very well even at that.
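For reference, the "straight math" an optimizer can reduce this loop to is a closed form: summing 2·i for i from 1 to count−1 gives count·(count−1). A quick Python check of that identity against the benchmark's loop bounds:

```python
def bench(count):
    # The loop as benchmarked above (bounds: 1 up to count, exclusive).
    ret = 0
    for i in range(1, count):
        ret += i * 2
    return ret

def closed_form(count):
    # sum_{i=1}^{count-1} 2*i  ==  2 * (count-1)*count / 2  ==  count*(count-1)
    return count * (count - 1)

for n in (2, 10, 1_000, 100_000):
    assert bench(n) == closed_form(n)

# The whole 10_000_000-iteration benchmark collapses to one multiplication:
print(closed_form(10_000_000))  # 99999990000000
```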
Hmm, let's split the actual loop functions into a standalone library, the code of which is just:
use criterion::black_box;

pub fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in 1..count {
        ret += black_box(i * 2);
    }
    ret
}

pub fn bench_iter(count: usize) -> usize {
    (1..count).map(|i| black_box(i * 2)).sum()
}
I kept the black_box so the loop doesn't optimize down to just math. The benchmark function, in a different binary that links in the library, is:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use pyby::*;

fn bench(c: &mut Criterion) {
    const COUNT: usize = 10_000_000;
    c.bench_function("for", |b| b.iter(|| bench_loop(black_box(COUNT))));
    c.bench_function("iter", |b| b.iter(|| bench_iter(black_box(COUNT))));
}

criterion_group!(benches, bench);
criterion_main!(benches);
And running that:
⯠cargo criterion
for time: [3.5874 ms 3.6058 ms 3.6273 ms]
iter time: [3.5454 ms 3.5532 ms 3.5672 ms]
So about the same.
Let's dump the assembly of the code though:
⯠cargo +nightly asm pyby::bench_loop --rust
pub fn bench_loop(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB0_1
lea r8, [rdi, -, 1]
add rdi, -2
mov ecx, r8d
and ecx, 3
cmp rdi, 3
jae .LBB0_8
mov edx, 1
xor eax, eax
jmp .LBB0_4
.LBB0_1:
xor eax, eax
}
pop rcx
ret
.LBB0_8:
and r8, -4
neg r8
mov edx, 1
mov edi, 8
xor eax, eax
.LBB0_9:
lea rsi, [rdi, -, 6]
mov qword, ptr, [rsp], rsi
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 4]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 2]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rdi
add rax, qword, ptr, [rsp]
add rdi, 8
lea rsi, [r8, +, rdx]
add rsi, 4
add rdx, 4
cmp rsi, 1
jne .LBB0_9
.LBB0_4:
test rcx, rcx
je .LBB0_7
add rdx, rdx
.LBB0_6:
mov qword, ptr, [rsp], rdx
add rax, qword, ptr, [rsp]
add rdx, 2
add rcx, -1
jne .LBB0_6
.LBB0_7:
}
pop rcx
ret
And for the iterator version:
⯠cargo +nightly asm pyby::bench_iter --rust
pub fn bench_iter(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB1_1
lea r8, [rdi, -, 1]
add rdi, -2
mov ecx, r8d
and ecx, 3
cmp rdi, 3
jae .LBB1_4
mov edx, 1
xor eax, eax
jmp .LBB1_6
.LBB1_1:
xor eax, eax
}
pop rcx
ret
.LBB1_4:
and r8, -4
neg r8
mov edx, 1
mov edi, 8
xor eax, eax
.LBB1_5:
lea rsi, [rdi, -, 6]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 4]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
lea rsi, [rdi, -, 2]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rdi
add rax, qword, ptr, [rsp]
add rdi, 8
lea rsi, [r8, +, rdx]
add rsi, 4
add rdx, 4
cmp rsi, 1
jne .LBB1_5
.LBB1_6:
test rcx, rcx
je .LBB1_9
add rdx, rdx
.LBB1_8:
mov qword, ptr, [rsp], rdx
add rax, qword, ptr, [rsp]
add rdx, 2
add rcx, -1
jne .LBB1_8
.LBB1_9:
pop rcx
ret
And they seem to be precisely identical, as expected, but no SSE. I wonder what changes if we compile for small generated code instead of fast generated code:
pub fn bench_iter(count: usize) -> usize {
push rax
cmp rdi, 2
jb .LBB1_1
dec rdi
mov ecx, 2
xor eax, eax
.LBB1_3:
mov qword, ptr, [rsp], rcx
add rcx, 2
add rax, qword, ptr, [rsp]
dec rdi
jne .LBB1_3
}
pop rcx
ret
.LBB1_1:
xor eax, eax
pop rcx
ret
Cool, so it did just end up becoming a single simple loop; that is indeed much smaller code, the inner loop being the one under the LBB1_3 label (notice it's actually counting down toward zero, since the compiler determined the order doesn't matter and it's an instruction smaller that way). Let's see how much slower the benchmark itself is:
for time: [5.4088 ms 5.4163 ms 5.4242 ms]
change: [+49.304% +50.211% +51.020%] (p = 0.00 < 0.05)
Performance has regressed.
iter time: [5.4249 ms 5.4436 ms 5.4740 ms]
change: [+51.652% +52.344% +53.307%] (p = 0.00 < 0.05)
Performance has regressed.
About 50% slower, which is huge, but still much faster than either of the scripting languages in the prior post, lol.
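Incidentally, the compiler's count-down rewrite is valid because this particular sum is order-independent; the same transformation, checked in Python terms:

```python
# Summing i*2 forward or reversed over the same range gives the same total,
# which is what lets the compiler loop down toward zero instead of up.
n = 10_000
forward = sum(i * 2 for i in range(1, n))
backward = sum(i * 2 for i in reversed(range(1, n)))
assert forward == backward == n * (n - 1)
```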
Hmm, I wonder how fast it is when run in full debug mode; I'm betting this will be impressively slow, lol. Benching:
Benchmarking for: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 37.7s, or reduce sample count to 10.
for time: [376.34 ms 376.67 ms 377.02 ms]
change: [+10561% +10580% +10599%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.6s, or reduce sample count to 10.
iter time: [566.35 ms 567.12 ms 567.91 ms]
change: [+15795% +15837% +15883%] (p = 0.00 < 0.05)
Performance has regressed.
Wooo that is indeed a LOT slower! Like on par with interpreted python/ruby code kind of slow!
Let's see what the code looks like, the iterator version first (since they definitely are not compiling to the same code now!):
⯠cargo +nightly asm --build-type debug pyby::bench_iter --rust
pub fn bench_iter(count: usize) -> usize {
sub rsp, 56
mov qword, ptr, [rsp, +, 48], rdi
(1..count).map(|i| black_box(i * 2)).sum()
mov qword, ptr, [rsp, +, 32], 1
mov qword, ptr, [rsp, +, 40], rdi
mov rdi, qword, ptr, [rsp, +, 32]
mov rsi, qword, ptr, [rsp, +, 40]
call qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3map17h12723980faf40533E@GOTPCREL]
mov qword, ptr, [rsp, +, 16], rax
mov qword, ptr, [rsp, +, 24], rdx
mov rsi, qword, ptr, [rsp, +, 24]
mov rdi, qword, ptr, [rsp, +, 16]
call qword, ptr, [rip, +, _ZN4core4iter6traits8iterator8Iterator3sum17hf22cfa1d0bd6d00dE@GOTPCREL]
mov qword, ptr, [rsp, +, 8], rax
mov rax, qword, ptr, [rsp, +, 8]
}
add rsp, 56
ret
Indeed, lots of method calls in the loop that didn't need to exist before. And now the basic for loop version:
⯠cargo +nightly asm --build-type debug pyby::bench_loop --rust
pub fn bench_loop(count: usize) -> usize {
sub rsp, 136
mov qword, ptr, [rsp, +, 104], rdi
let mut ret = 0;
mov qword, ptr, [rsp, +, 48], 0
for i in 1..count {
mov qword, ptr, [rsp, +, 56], 1
mov qword, ptr, [rsp, +, 64], rdi
mov rdi, qword, ptr, [rsp, +, 56]
mov rsi, qword, ptr, [rsp, +, 64]
call qword, ptr, [rip, +, _ZN63_$LT$I$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17ha1c7c7c0ba434465E@GOTPCREL]
mov qword, ptr, [rsp, +, 32], rax
mov qword, ptr, [rsp, +, 40], rdx
mov rax, qword, ptr, [rsp, +, 40]
mov rcx, qword, ptr, [rsp, +, 32]
mov qword, ptr, [rsp, +, 72], rcx
mov qword, ptr, [rsp, +, 80], rax
.LBB12_2:
mov rax, qword, ptr, [rip, +, _ZN4core4iter5range101_$LT$impl$u20$core..iter..traits..iterator..Iterator$u20$for$u20$core..ops..range..Range$LT$A$GT$$GT$4next17h413bbe735b9fe1d5E@GOTPCREL]
lea rdi, [rsp, +, 72]
call rax
mov qword, ptr, [rsp, +, 96], rdx
mov qword, ptr, [rsp, +, 88], rax
mov rax, qword, ptr, [rsp, +, 88]
test rax, rax
je .LBB12_5
jmp .LBB12_12
.LBB12_12:
jmp .LBB12_6
ud2
.LBB12_5:
}
mov rax, qword, ptr, [rsp, +, 48]
add rsp, 136
ret
.LBB12_6:
for i in 1..count {
mov rax, qword, ptr, [rsp, +, 96]
mov qword, ptr, [rsp, +, 112], rax
mov qword, ptr, [rsp, +, 120], rax
for i in 1..count {
mov qword, ptr, [rsp, +, 128], rax
ret += black_box(i * 2);
mov ecx, 2
mul rcx
mov qword, ptr, [rsp, +, 24], rax
seto al
test al, 1
jne .LBB12_8
mov rdi, qword, ptr, [rsp, +, 24]
ret += black_box(i * 2);
call qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
mov qword, ptr, [rsp, +, 16], rax
jmp .LBB12_9
.LBB12_8:
ret += black_box(i * 2);
lea rdi, [rip, +, str.1]
lea rdx, [rip, +, .L__unnamed_2]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 33
call rax
ud2
.LBB12_9:
mov rax, qword, ptr, [rsp, +, 16]
ret += black_box(i * 2);
add rax, qword, ptr, [rsp, +, 48]
mov qword, ptr, [rsp, +, 8], rax
setb al
test al, 1
jne .LBB12_11
mov rax, qword, ptr, [rsp, +, 8]
mov qword, ptr, [rsp, +, 48], rax
for i in 1..count {
jmp .LBB12_2
.LBB12_11:
ret += black_box(i * 2);
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_3]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
A lot more inlined, but still a lot of setup for the 1..count iterator. Let's add in a pure number loop as well; I added this function to the library:
pub fn bench_manual(count: usize) -> usize {
    let mut ret = 0;
    let mut i = 0;
    while i < count {
        ret += black_box(i * 2);
        i += 1;
    }
    ret
}
And added this to the benchmark:
c.bench_function("manual", |b| b.iter(|| bench_manual(black_box(COUNT))));
And benchmarked as normal:
for time: [3.5597 ms 3.5656 ms 3.5712 ms]
iter time: [3.5255 ms 3.5312 ms 3.5369 ms]
manual time: [3.5953 ms 3.6012 ms 3.6080 ms]
And again in debug mode:
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.0s, or reduce sample count to 10.
for time: [378.34 ms 378.85 ms 379.38 ms]
change: [+10502% +10525% +10548%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking iter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 56.8s, or reduce sample count to 10.
iter time: [566.02 ms 566.90 ms 567.91 ms]
change: [+15919% +15954% +15988%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking manual: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.9s, or reduce sample count to 40.
manual time: [107.72 ms 108.03 ms 108.36 ms]
change: [+2889.5% +2899.8% +2912.2%] (p = 0.00 < 0.05)
Performance has regressed.
And disassembled that new function in release mode:
pub fn bench_manual(count: usize) -> usize {
push rax
while i < count {
test rdi, rdi
je .LBB0_1
lea rax, [rdi, -, 1]
mov r8d, edi
and r8d, 3
cmp rax, 3
jae .LBB0_8
xor eax, eax
xor edx, edx
jmp .LBB0_4
.LBB0_1:
xor eax, eax
}
pop rcx
ret
.LBB0_8:
while i < count {
and rdi, -4
mov esi, 6
xor eax, eax
xor edx, edx
.LBB0_9:
lea rcx, [rsi, -, 6]
mov qword, ptr, [rsp], rcx
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
lea rcx, [rsi, -, 4]
mov qword, ptr, [rsp], rcx
add rax, qword, ptr, [rsp]
lea rcx, [rsi, -, 2]
mov qword, ptr, [rsp], rcx
add rax, qword, ptr, [rsp]
mov qword, ptr, [rsp], rsi
add rax, qword, ptr, [rsp]
i += 1;
add rdx, 4
while i < count {
add rsi, 8
cmp rdi, rdx
jne .LBB0_9
.LBB0_4:
test r8, r8
je .LBB0_7
add rdx, rdx
.LBB0_6:
mov qword, ptr, [rsp], rdx
ret += black_box(i * 2);
add rax, qword, ptr, [rsp]
while i < count {
add rdx, 2
add r8, -1
jne .LBB0_6
.LBB0_7:
}
pop rcx
ret
And disassembled that new function in debug mode:
pub fn bench_manual(count: usize) -> usize {
sub rsp, 72
mov qword, ptr, [rsp, +, 40], rdi
mov qword, ptr, [rsp, +, 64], rdi
let mut ret = 0;
mov qword, ptr, [rsp, +, 48], 0
let mut i = 0;
mov qword, ptr, [rsp, +, 56], 0
.LBB12_1:
mov rax, qword, ptr, [rsp, +, 40]
while i < count {
cmp qword, ptr, [rsp, +, 56], rax
jb .LBB12_3
}
mov rax, qword, ptr, [rsp, +, 48]
add rsp, 72
ret
.LBB12_3:
ret += black_box(i * 2);
mov rax, qword, ptr, [rsp, +, 56]
mov ecx, 2
mul rcx
mov qword, ptr, [rsp, +, 32], rax
seto al
test al, 1
jne .LBB12_5
mov rdi, qword, ptr, [rsp, +, 32]
ret += black_box(i * 2);
call qword, ptr, [rip, +, _ZN9criterion9black_box17h51d9c2b7d11cc19eE@GOTPCREL]
mov qword, ptr, [rsp, +, 24], rax
jmp .LBB12_6
.LBB12_5:
ret += black_box(i * 2);
lea rdi, [rip, +, str.1]
lea rdx, [rip, +, .L__unnamed_2]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 33
call rax
ud2
.LBB12_6:
mov rax, qword, ptr, [rsp, +, 24]
ret += black_box(i * 2);
add rax, qword, ptr, [rsp, +, 48]
mov qword, ptr, [rsp, +, 16], rax
setb al
test al, 1
jne .LBB12_8
mov rax, qword, ptr, [rsp, +, 16]
mov qword, ptr, [rsp, +, 48], rax
i += 1;
mov rax, qword, ptr, [rsp, +, 56]
add rax, 1
mov qword, ptr, [rsp, +, 8], rax
setb al
test al, 1
jne .LBB12_10
jmp .LBB12_9
.LBB12_8:
ret += black_box(i * 2);
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_3]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
.LBB12_9:
mov rax, qword, ptr, [rsp, +, 8]
i += 1;
mov qword, ptr, [rsp, +, 56], rax
while i < count {
jmp .LBB12_1
.LBB12_10:
i += 1;
lea rdi, [rip, +, str.0]
lea rdx, [rip, +, .L__unnamed_4]
mov rax, qword, ptr, [rip, +, _ZN4core9panicking5panic17h50b51d19800453c0E@GOTPCREL]
mov esi, 28
call rax
ud2
So as a raw number loop it runs the fastest in debug mode of all of them, since it doesn't have to make the function calls that normally get optimized out; but they all have basically the same running time in release compilations. The generated code still has quite a few differences, probably because the compiler isn't allowed to change the ordering here, so it has to go in the original order instead of being allowed to count down to 0. Let's make that change manually; I changed the new function's code to this (also starting at 1 instead of 0 to match the Python/Ruby above, though that won't make much difference, if any):
pub fn bench_manual(mut count: usize) -> usize {
    let mut ret = 0;
    while count > 0 {
        ret += black_box(count * 2);
        count -= 1;
    }
    ret
}
Not an optimization you'd usually want to do manually, as the compiler can do it for you with normal code, but still, benchmarking:
for time: [3.5657 ms 3.5742 ms 3.5837 ms]
iter time: [3.5695 ms 3.5802 ms 3.5918 ms]
manual time: [3.2158 ms 3.2211 ms 3.2267 ms]
And now it's even better (since the call to black_box doesn't have to be pessimized around; that call specifically prevents optimizations being applied to it). We can just as easily change the other two to a reverse iteration as well, their code now being:
use criterion::black_box;

pub fn bench_manual(mut count: usize) -> usize {
    let mut ret = 0;
    while count > 0 {
        ret += black_box(count * 2);
        count -= 1;
    }
    ret += black_box(0 * 2); // Needed to match prior count
    ret
}

pub fn bench_loop(count: usize) -> usize {
    let mut ret = 0;
    for i in (1..count).rev() {
        ret += black_box(i * 2);
    }
    ret
}

pub fn bench_iter(count: usize) -> usize {
    (1..count).rev().map(|i| black_box(i * 2)).sum()
}
And the full benchmark being:
for time: [3.3039 ms 3.3084 ms 3.3128 ms]
iter time: [3.3149 ms 3.3235 ms 3.3332 ms]
manual time: [3.2541 ms 3.2624 ms 3.2721 ms]
And of course now they are on par with the manual version, since they can still get other optimizations the manual one can't, now that they've been equalized by that black_box preventing optimizations.