Golang’s most important feature is invisible

Golang’s most important feature is invisible.
What I have been surprised about is how little fanfare has been given to what I consider Go’s most important feature.

Read in full here:

https://blog.devgenius.io/golangs-most-important-feature-is-invisible-6be9c1e7249b?gi=21e47786496b

This thread was posted by one of our members via one of our news source trackers.


Did you see it? There isn’t a lot of code there.

Uh, yeah, that looks like the horror of global variables hidden behind a singleton interface (there is a global net/http Server instance, I know, and you can construct one individually, but global-variable access as the default? Wow, that’s bad…). You are registering a route handler on a global mux, then starting a single global HTTP server… What happens if, say, an event library does the same in its own code?

To better highlight the feature let’s look at an example of similar code written in Java.

Yeah, that’s because Java is horrifyingly verbose. Let’s try it in a modern Rust HTTP framework instead (axum, the tokio one):

use axum::{routing::get, Router};
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    let app = Router::new().route("/", get(|| async { "Goodbye, World!" }));
    axum::Server::bind(&SocketAddr::from(([0, 0, 0, 0], 8080)))
        .serve(app.into_make_service())
        .await
        .unwrap();
}

Or a random C++ framework that I happen to have here (which has a few globalisms I’m not a fan of, but at least it’s not in the stdlib, so it can at least be fixed, unlike things in a stdlib, which can’t change their APIs):

#include <drogon/drogon.h>

using namespace drogon;
int main()
{
    app()
        .addListener("0.0.0.0", 8080)
        .registerHandler("/", [](const HttpRequestPtr& req, std::function<void (const HttpResponsePtr &)> &&callback) {
            auto resp = HttpResponse::newHttpResponse();
            resp->setBody("Goodby, World!");
            callback(resp);
        })
        .run();
}

They are both very low-level languages, and yes, they are just as succinct as the Go code, or more so.

The feature I’m talking about is the Go runtime’s handling of blocking goroutines.

Built-in CSP transformations, yeah. Rust does the same via async, C++ has something similar built in as of C++20 (and there have been libraries doing this in both C and C++ since the ’90s), and we can’t forget languages like Erlang, where it’s the very nature of everything in the language.

To get this performance out of Java you would need to add threadpools, futures or some other async library.

(Geez, this site is horrifyingly bad to copy/paste from; it keeps wiping my selection with JavaScript before I can copy it. Horrible site… Killing JS, of course, fixes it, and wow does it load faster, though the code blocks vanish then, hallmark of a great page design >.>)

Though Java has those built in, even a global default threadpool in addition to actual async calls, and you could absolutely write a library in Java that looks like Go’s http library, with very similar code. By default, though, the Java ecosystem goes for incredibly unreadable verbosity, for reasons I’ve yet to determine. That’s not really a language issue (other than that Java’s design encourages it), more an ecosystem one, and it is not the case for, dare I say, the vast majority of languages out there.

But don’t take my word for it on performance, let’s run a quick performance test.

Yeah, this is a very flawed benchmark. ab against such a simple HTTP responder is going to be testing the TCP stack far more than the code in the server itself, and the Java server used here talks TCP in an older style than modern HTTP frameworks do. (There may well be Java HTTP frameworks that use the modern, faster methods of talking TCP with the kernel, just not the incredibly old server used in the article’s code, which is specifically not fast and was never designed to be.)

They should compare against the above C++ or Rust code, which does use the modern methods and will likely be at or above the speed of the given Go code, as both have less memory pressure in addition to other efficiencies over the Go code. (And though the TechEmpower benchmarks are incredibly flawed and shouldn’t ever be relied on, both the above axum and drogon frameworks bench faster there, especially once you get into more complex functionality where Go is doing even more allocating in comparison.)

To make a long story short, the Java version hits ~21K requests per second while the Go version hits ~36K requests per second.

There’s… no way they should be that close, nor that slow, which makes me think they are hitting hardware limitations in their benchmark; Go’s should be much faster than that just from the CPU-core scaling it does that Java doesn’t. Let’s try this on my old, old desktop, with their same code and their same benchmark command line. Here’s Go’s first:

❯ ab -c 8 -n 100000 -s 3 -k http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        16 bytes

Concurrency Level:      8
Time taken for tests:   0.829 seconds
Complete requests:      100000
Failed requests:        0
Keep-Alive requests:    100000
Total transferred:      15700000 bytes
HTML transferred:       1600000 bytes
Requests per second:    120592.79 [#/sec] (mean)
Time per request:       0.066 [ms] (mean)
Time per request:       0.008 [ms] (mean, across all concurrent requests)
Transfer rate:          18489.32 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.0      0       2
Waiting:        0    0   0.0      0       2
Total:          0    0   0.0      0       2

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      0
 100%      2 (longest request)

And here’s the java one:

❯ ab -c 8 -n 100000 -s 3 -k http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      8
Time taken for tests:   6.426 seconds
Complete requests:      100000
Failed requests:        0
Keep-Alive requests:    100000
Total transferred:      14800000 bytes
HTML transferred:       1500000 bytes
Requests per second:    15562.74 [#/sec] (mean)
Time per request:       0.514 [ms] (mean)
Time per request:       0.064 [ms] (mean, across all concurrent requests)
Transfer rate:          2249.30 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    1   4.3      0      98
Waiting:        0    0   0.3      0      96
Total:          0    1   4.3      0      98

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      2
 100%     98 (longest request)

So in short:

  • Go: Requests per second: 120592.79 [#/sec] (mean)
  • Java: Requests per second: 15562.74 [#/sec] (mean)

Yeah, this is a LOT more of what I expect. Hmm, for comparison, in Rust:

❯ ab -c 8 -n 100000 -s 3 -k http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      8
Time taken for tests:   0.783 seconds
Complete requests:      100000
Failed requests:        0
Keep-Alive requests:    100000
Total transferred:      15600000 bytes
HTML transferred:       1500000 bytes
Requests per second:    127642.85 [#/sec] (mean)
Time per request:       0.063 [ms] (mean)
Time per request:       0.008 [ms] (mean, across all concurrent requests)
Transfer rate:          19445.59 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.0      0       1
Waiting:        0    0   0.0      0       0
Total:          0    0   0.0      0       1

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      0
 100%      1 (longest request)

And for the C++ one (whoops, accidentally ran it in debug mode, got almost 111k requests/sec, and was wondering why it was so slow, lol; here’s the release run):

❯ ab -c 8 -n 100000 -s 3 -k http://localhost:8080/                          
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        drogon/1.7.4
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        14 bytes

Concurrency Level:      8
Time taken for tests:   0.742 seconds
Complete requests:      100000
Failed requests:        0
Keep-Alive requests:    100000
Total transferred:      17600000 bytes
HTML transferred:       1400000 bytes
Requests per second:    134825.40 [#/sec] (mean)
Time per request:       0.059 [ms] (mean)
Time per request:       0.007 [ms] (mean, across all concurrent requests)
Transfer rate:          23173.12 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.0      0       1
Waiting:        0    0   0.0      0       1
Total:          0    0   0.0      0       1

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      0
 100%      1 (longest request)

And adding it to the prior chart:

  • Go: Requests per second: 120592.79 [#/sec] (mean)
  • Java: Requests per second: 15562.74 [#/sec] (mean)
  • Rust: Requests per second: 127642.85 [#/sec] (mean)
  • C++: Requests per second: 134825.40 [#/sec] (mean)

The C++ one is using a library that’s a surprisingly thin layer over very fast TCP handling. The Rust and Go ones are higher-level mappings over TCP handling, with more abstractions for ease of use, so they are a touch slower (though the C++ one, if a couple of lines wordier than you might expect, is still quite easy to use and readable in my opinion, and very easily tied into generative outputs like JSON or HTML generators). Lower-level libraries in those languages would of course be a bit faster too, but that abstraction overhead won’t matter much once your business logic is added.

The Java one, meanwhile, is a very, very old library: no modern patterns, not even inherently multi-threaded (and I’ve never seen it handle multi-threading well for throughput; it uses threads for other concurrency purposes, annoyingly not for increasing throughput, so I’m unsure why such an ancient Sun library was picked), and it doesn’t support keepalive (last I saw, and yep, just confirmed, it still doesn’t).

First of all, the code as shown in the article was needlessly long when the Go version (and my Rust and C++) were inline, so let’s make Java’s inline and remove the useless calls. It’s now this long/short:

import java.io.OutputStream;
import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpServer;

public class HelloWorld {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", (t) -> {
            String response = "Goodbye, World!";
            t.sendResponseHeaders(200, response.length());
            OutputStream os = t.getResponseBody();
            os.write(response.getBytes());
            os.close();
        });
        server.start();
    }
}

And running ab again with keepalive disabled (removing the -k), you now see it run at:

❯ ab -c 8 -n 100000 -s 3 http://localhost:8080/ 
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      8
Time taken for tests:   4.378 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      11000000 bytes
HTML transferred:       1500000 bytes
Requests per second:    22842.06 [#/sec] (mean)
Time per request:       0.350 [ms] (mean)
Time per request:       0.044 [ms] (mean, across all concurrent requests)
Transfer rate:          2453.74 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:     0    0   0.1      0       3
Waiting:        0    0   0.1      0       3
Total:          0    0   0.1      0       3

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      1
  99%      1
 100%      3 (longest request)

So now the chart is:

  • Go: Requests per second: 120592.79 [#/sec] (mean)
  • Java with keepalive: Requests per second: 15562.74 [#/sec] (mean)
  • Java without keepalive: Requests per second: 22842.06 [#/sec] (mean)
  • Rust: Requests per second: 127642.85 [#/sec] (mean)
  • C++: Requests per second: 134825.40 [#/sec] (mean)

So of course the ancient, horribly ancient Java Sun HttpServer library is still the slowest, but that’s partly because it’s single-threaded where the others are natively multi-threaded. So what if the others were single-threaded as well? Let’s try. Here are the single-threaded results (enough of the shell output, it’s too much, but I can run it again if anyone really wants to see it, or run it yourself). Not only did I specifically set single-threaded mode in the APIs (which can enable some enhancements, as in the Java one, since some assumptions can be made), I also forced them onto a single CPU core via taskset so they couldn’t cheat. Again, it’s the best of multiple runs, as the author does:

  • ST: Go with keepalive: Requests per second: 48756.25 [#/sec] (mean)
  • ST: Go without keepalive: Requests per second: 21141.35 [#/sec] (mean)
  • ST: Java with keepalive: Requests per second: 19864.77 [#/sec] (mean)
  • ST: Java without keepalive: Requests per second: 19400.72 [#/sec] (mean)
  • ST: Rust with keepalive: Requests per second: 80947.80 [#/sec] (mean)
  • ST: Rust without keepalive: Requests per second: 27702.41 [#/sec] (mean)
  • ST: C++ with keepalive: Requests per second: 103743.37 [#/sec] (mean)
  • ST: C++ without keepalive: Requests per second: 23611.98 [#/sec] (mean)

So yeah: without keepalive, Go is about the same speed as Java, and Rust and C++ are both faster. With keepalive (remember the Java library doesn’t support keepalives properly; it’s too old, its design might even predate keepalives working in browsers… I don’t know why the author picked such an old, non-standard Java HTTP library when there are much better ones), C++ is fastest, as that library is very low level and uses some interesting kernel primitives; Rust is pretty close, with the overhead probably being its async virtual calls; and the Go version with keepalive is half the speed of the Rust one, because its CSP is surprisingly slower than Rust’s in a lot of ways (Go channels, not used here because they are so slow, are very slow in comparison).

So no, this is a bad example by the author, seemingly very cherry-picked, even though it’s not something anyone would really touch nowadays. Plus it doesn’t show any CSP happening at all in the Go code, which is really weird… Maybe if they showed CSP code instead of just returning a static string, they’d see some more interesting performance losses, especially compared to C++ or Rust, where both also have CSP transformations now (C++ coroutines and Rust’s async keyword both do CSP transformations, and since they are keywords you only incur that cost where it’s needed, not for all code; plus it’s good documentation, making obvious which code might suspend and which won’t).

In languages like C you might use a library like libevent to get this type of behavior.

C also has some surprisingly simple API libraries. They mentioned libevent, and doing this on libevent might look like the following (using libevhtp, a thin layer over libevent that adds HTTP handling; it’s also very old, over 10 years, so no one would use it anymore either, and there are still better HTTP libraries for C, but they mentioned libevent, so here’s libevent via libevhtp, even though it lacks modern support for many things):

#include <evhtp.h>
#include <string.h>

static void goodbye_world(evhtp_request_t *req, void *user) {
    const char *body = "Goodbye, World!";
    evbuffer_add(req->buffer_out, body, strlen(body));
    evhtp_send_reply(req, EVHTP_RES_OK);
}

int main(int argc, char **argv) {
    struct event_base *ev = event_base_new(); // initialize libevent
    evhtp_t *htp = evhtp_new(ev, NULL); // initialize the http interface on top of libevent
    evhtp_set_cb(htp, "/", goodbye_world, NULL);
    evhtp_enable_flag(htp, EVHTP_FLAG_ENABLE_ALL); // Enable reuseport and all such interesting flags
    evhtp_use_threads_wexit(htp, NULL, NULL, 16, NULL); // Set thread count, no callbacks set
    evhtp_bind_socket(htp, "0.0.0.0", 8080, 2048); // Bind to port 8080
    event_base_loop(ev, 0); // Start the libevent loop and thus run everything attached to it
    evhtp_unbind_socket(htp); // Unbind socket since closing libevent
    evhtp_safe_free(htp, evhtp_free); // Free evhtp data
    evhtp_safe_free(ev, event_base_free); // Free libevent data
    return 0; // And exit
}

And some results:

  • C with keepalive: Requests per second: 148561.48 [#/sec] (mean)
  • C without keepalive: Requests per second: 24792.85 [#/sec] (mean)
  • ST:C with keepalive: Requests per second: 97511.41 [#/sec] (mean)
  • ST:C without keepalive: Requests per second: 22799.91 [#/sec] (mean)

So basically raw libevent with a thin HTTP layer (one lacking a lot of security checks; don’t actually use it in production, or public-facing, or really anywhere, this is purely a ‘fast’ example) is about the fastest, mostly because it’s polling kernel events as fast as possible while running very little code. It is of course faster than Go, and it does not support CSP transformations, though you can emulate them with libraries that get surprisingly close to looking like CSP transformation (macros, macros everywhere!).

Something closer to Go might be libdill or libmill, which directly try to copy Go’s channels in C: one looks traditionally C, the other adds a macro-macro-macro layer to make it look more like Go while still being C, lol. That gets you those fake CSP transformations you might well enjoy. I’m unsure whether it’s faster than Go; the stack shuffling I think it does could hurt, but it could also be faster since memory is managed manually. Maybe someone can test, lol.
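For reference, the Go-side pattern those C libraries are imitating is just this (my sketch): an unbuffered channel where each send blocks until the receiver is ready, giving you synchronized hand-offs between coroutines:

```go
package main

import "fmt"

func main() {
	ch := make(chan int) // unbuffered: each send rendezvouses with a receive
	go func() {
		for i := 0; i < 3; i++ {
			ch <- i // blocks until main is ready to receive
		}
		close(ch)
	}()
	for v := range ch { // range ends when the channel is closed
		fmt.Println("got", v)
	}
	// prints: got 0, got 1, got 2
}
```

That rendezvous is exactly the part the Go runtime makes cheap-ish and invisible, and the part libdill/libmill have to rebuild by hand with context switching in C.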


In short, CSP is nice, but having it be optional via a keyword is not just better for performance; it’s also better documentation of what might suspend. Also, their benchmark was extremely disingenuous: comparing a modern HTTP stack against an extremely old HTTP server that lacks basic functionality (and supports basically nothing modern, including the performance-enhancing aspects of modern-ish HTTP) is… very weird… Especially when they are supposed to be showing off the CSP transformations of the language but don’t demonstrate them whatsoever (they’re entirely unused in the example…).
