Failing Big with Elixir and LiveView - A Post-Mortem

Here’s the story of how one of the world’s first production deployments of LiveView came to be - and how trying to improve it almost caused a political party in Germany to cancel its convention.

I wrote this post just a few days after the event took place. As annoying as it was, it was a good teachable moment. And soon I’ll write an update with a tutorial on how to scale to 5,000 concurrent LiveView users on a single VPS :slight_smile:

5 Likes

Nice one Philipp - I enjoyed reading your story and I look forward to the follow-up! :+1:

3 Likes


Oooo, this looks like an interesting read!

  1. Participants poll the GenServer for updates every second.

/me twitches

That seems… inefficient compared to just pushing updates as they happen instead of polling, perhaps with a debouncer? Phoenix makes it easy to push updates to a channel from any process, bypassing the majority of the message passing costs. This is foreboding, lol.
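
For anyone reading along, a minimal sketch of that push-instead-of-poll idea with Phoenix.PubSub (the module, topic, and message names below are made up for illustration, not taken from the app in the article):

```elixir
defmodule ConventionApp.Participants do
  @moduledoc "Hypothetical context module: broadcasts changes instead of being polled."

  @topic "participants"

  def topic, do: @topic

  def participant_joined(participant) do
    # ...update the database / GenServer state here...

    # Any process can publish; subscribers get the message pushed to them.
    Phoenix.PubSub.broadcast(
      ConventionApp.PubSub,
      @topic,
      {:participant_joined, participant}
    )
  end
end
```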

Everything was great - except for one problem: The party kept growing, and thus the number of participants in these events kept growing, too.

And yep, this seems to confirm…

The frequent polling intervals of the first iteration ended up maxing out all eight CPU cores of a t3a.2xlarge AWS EC2 instance.

And yep, that seems even heavier than expected for just polling on the BEAM; I wonder what other costs were involved…

So I decided to switch from constant polling to a Pub/Sub model. This is also quite easy to do with Elixir and Phoenix: Phoenix comes with its own easy-to-use PubSub module.

Yay! Hopefully straight to the socket processes and not re-rendering with LiveView (which does it so incredibly inefficiently compared to certain other libraries).

A three-day convention packed with votes and almost 3,000 eligible members in Germany.

Didn’t stress test it first?!? Still, though, 3k doesn’t sound like much; I’ve stress-tested Drab at work to over 40k connections on a single core without issues.

It was like watching a trainwreck: As soon as the server was up again, RAM usage immediately started climbing, and climbing … until the inevitable out-of-memory crash.

Oooo I can see so many possible causes…

The LiveView controller process would then receive these messages, set the @participants assign and render an updated view:

…oh wow, right, LiveView stores the state inside each LiveView process instead of using shared data or just pushing it to the client to handle like you can in Drab (I still say Drab is better designed overall than LiveView; it’s trivial to avoid this kind of issue in Drab, whereas LiveView encourages it…)…
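
For reference, the receiving side being described here, a LiveView that subscribes to the topic and re-assigns on every message, looks roughly like this (a sketch with hypothetical module, topic, and function names; `ConventionApp.Participants.list/0` is assumed, and this is not the author’s actual code):

```elixir
defmodule ConventionAppWeb.AdminLive do
  use Phoenix.LiveView

  def mount(_params, _session, socket) do
    # Subscribe only once the WebSocket is connected, not on the initial static render.
    if connected?(socket) do
      Phoenix.PubSub.subscribe(ConventionApp.PubSub, "participants")
    end

    {:ok, assign(socket, :participants, ConventionApp.Participants.list())}
  end

  # Every PubSub broadcast lands here; updating @participants triggers a
  # re-render and a diff pushed down the socket for each connected admin.
  def handle_info({:participant_joined, participant}, socket) do
    {:noreply, update(socket, :participants, &[participant | &1])}
  end

  def render(assigns) do
    ~L"""
    <ul><%= for p <- @participants do %><li><%= p.name %></li><% end %></ul>
    """
  end
end
```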

With dozens of these updates happening per second as participants were joining the convention, messages were piling up in the inbox of the LiveView admin controller processes faster than they could be handled.

Eh, I wouldn’t think so. When a process on the BEAM sends a message to another local process there is a degree of backpressure: if the receiver’s mailbox grows, the sender gets penalized and scheduled less and less often until it is practically paused… Though if PubSub were used to talk to intermediary processes I could see issues…
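
Either way, it’s cheap to check whether mailboxes really are the problem: `Process.info/2` exposes the message queue length of any process. A quick diagnostic to paste into a remote IEx shell:

```elixir
# Top 10 processes on the current node by mailbox size.
Process.list()
|> Enum.map(&{&1, Process.info(&1, :message_queue_len)})
|> Enum.flat_map(fn
  {pid, {:message_queue_len, len}} -> [{pid, len}]
  {_pid, nil} -> []  # process exited between Process.list/0 and Process.info/2
end)
|> Enum.sort_by(fn {_pid, len} -> len end, :desc)
|> Enum.take(10)
```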

My laptop crashed, the theory had been confirmed!

  1. Why on earth would the laptop crash from a single process consuming excess memory?!? What OS was being used?!
  2. No, I still think it was something other than the mailbox… like LiveView re-rendering huge swaths of the page instead of a better, Drab-like model of pushing updates to the client to handle. The changed data should still have been debounced, which Drab would have done automatically by broadcasting straight to the clients from the process that owns the change, instead of going through an intermediary process per client, each with its own memory, mailbox, stack and all.

I then wanted the LiveView process to occasionally check if this other assign had been modified and, if so, also update @participants.

More polling? Why not a timeout message when a change comes in? Or better yet, broadcast straight to the clients instead of going through intermediary processes per client (that sounds so heavy for shared data…).

With thousands of updates coming in at the same time, neither Firefox nor Chromium stood a chance.

Debouncing and batching!

I implemented a mechanism to do so at most once every second.

Close enough to debouncing, though more costly when no updates are happening, lol.

  1. Avoid large payloads in Phoenix.PubSub if possible

Yep, best to send only the changes, and let the PubSub message go straight to the client socket process to be handled on the client instead of through intermediary re-rendering processes.

  1. Throttle PubSub events at the sender level to avoid clogged process inboxes

Yeah, PubSub doesn’t apply as much backpressure as one would hope; this is why sending directly to the socket processes would be far better (they use PubSub internally anyway, and still debounce your data!).

  1. Using assign/3 in LiveView always causes an update via Websocket, even if no changes were made

And LiveView has no way to push updates to the client without sending updated DOM, unless you want to hand-craft JavaScript and all; it really needs to adopt a few of Drab’s features (especially since Drab predated LiveView by about two years! I still don’t know why LiveView was created instead of just building on Drab…).

5 Likes

Probably a topic suited for another thread, but I’d be interested in knowing what the differences are now that LV has been around a while (from my understanding it was leaner/more performance-centric?).

2 Likes

I hope it’s gotten more performance-centric since I last touched it; it was horribly, horribly inefficient compared to Drab. With Drab you could quite literally send the minimal tiny bit of changes: if you want to extend a list or update a field, it’s a single call that sends a tiny bit of data, whereas LiveView had to re-render the HTML, send it over, and use morphdom (or something like that) to merge its changes in, which was incredibly heavy!

2 Likes

Really? I’m surprised! (Maybe there’s something you could send some PRs for?)

Could be a good experiment for you @wmnnd - build a version of your app with Drab to see if there’s much difference on the performance front :smiley:

2 Likes

Thanks for your super detailed feedback, that was very interesting to read!

A timeout message at which level? At the LiveView level? And how could changes be broadcast directly to the clients without going through the LV process?

How would you implement debouncing then? In my current solution, I update the state to keep track of whether an update is needed, so this call that happens once every second is not really costly at all :smiley:

True, using some kind of diffing would obviously be ideal here. But again, I’m using LiveView, so it kinda has to go through that. Can you recommend a way to do diffing in Elixir?

I might have overstated what happened by using the term “crash” :smiley: It froze for a few seconds until the OOM killer came in.

3 Likes

It’s not really a PR-style thing; it’s more of a how-it-was-designed thing. Drab was designed to be much lower-level than LiveView: its LiveView-like functionality is just one of a host of modules built on top of it, so it’s easy to drop down to the lower-level functionality for more efficient updates and so forth. LiveView doesn’t really have a good way to ‘escape’ out of how it works.

Drab hasn’t been updated much since LiveView came out; the author lost interest after it felt like LiveView stole the limelight from what Drab already did and somehow got promoted at the official Elixir/Phoenix level. I have commit privileges so I can work on it, but it’s been similar for me as well: mostly I just use it at work, where it works very well for my needs, though I know others are using it too.

As in: when you first get an update, have a message sent back to you, say, a second later. Anything else that arrives in the meantime just gets batched together, and when your timeout message arrives, then you send the updated information to the browser; when more information comes in after that, you start the timeout again. It means you’re delayed by up to a second, but it allows easy batching/debouncing and keeps the actual retrieval of the new data fast.

It’s pretty trivial in Drab; it has functions designed for many-connection broadcasting. For this case you’d have a new connection be sent all of the data on join, and then, as new data comes in to the main process that holds it all, instead of sending it to the individual connection processes you’d “broadcast” it to everything connected on the given channel to inject the new element at the given location, perhaps pruning old elements as well; it all depends on how you are displaying it. I haven’t used LiveView for anything beyond minor tests, but I’d be surprised and disheartened if it didn’t have a similar capability?

When new data comes in, just send a message to yourself 1 second in the future. When that message arrives, broadcast the data out and clear the timeout flag; then, when more data comes in and the timeout isn’t set, set it (a simple boolean) and have a timeout message sent back to you a second later. There’s a built-in way to delay a message like that in OTP (Process.send_after/3).
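
A rough sketch of that pattern in the process that owns the data (everything below is assumed for illustration, it is not the code from the post): the first change arms a one-second timer via `Process.send_after/3`, later changes just accumulate, and the flush does a single broadcast.

```elixir
defmodule ConventionApp.ParticipantBroadcaster do
  use GenServer

  @flush_after :timer.seconds(1)
  @topic "participants"

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def participant_changed(participant) do
    GenServer.cast(__MODULE__, {:changed, participant})
  end

  @impl true
  def init(:ok), do: {:ok, %{pending: [], timer_armed?: false}}

  @impl true
  def handle_cast({:changed, participant}, state) do
    state = %{state | pending: [participant | state.pending]}

    # Arm the timer only on the first change of a batch.
    state =
      if state.timer_armed? do
        state
      else
        Process.send_after(self(), :flush, @flush_after)
        %{state | timer_armed?: true}
      end

    {:noreply, state}
  end

  @impl true
  def handle_info(:flush, state) do
    # At most one broadcast per second, no matter how many changes came in.
    Phoenix.PubSub.broadcast(
      ConventionApp.PubSub,
      @topic,
      {:participants_changed, Enum.reverse(state.pending)}
    )

    {:noreply, %{state | pending: [], timer_armed?: false}}
  end
end
```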

The cost I was referring to was not the main process that holds all the data polling once a second, which would be cheap enough as it’s only one process, but all the connections doing it, which suddenly makes for a noisy internal mesh of 1-second timers firing all the time.

From the sound of it, you were essentially just sending presence updates for the connected people (so Presence would be useful). I don’t know whether the votes were done in secret or in public; public would be the worst case, but even then, in Drab you’d just broadcast a ‘vote’ one way or the other to the given user’s element (which Drab would handle internally by sending just that change). It really depends on precisely what data was being held and how it was displayed, but there are always methods to minimize the data transmitted.

Lol, ah, big difference yeah. ^.^;
On most OSes it’s pretty easy nowadays to restrict the memory a process is allowed to use (like cgroups on Linux), which is great for debugging such failures much more quickly. :slight_smile:

3 Likes

I’d agree it would have been nice if Drab could have been used, expanded or adopted officially, but as you mentioned they are quite different - so it’s possible that the Phoenix team’s vision for such a tool was just a bit too different from Drab’s.

It’s difficult to comment on things like this because we don’t know what the specific reasoning was, or even details like timelines - I think I heard that Chris started on LV quite early on but didn’t get around to working more on it (so the idea pre-dates Drab). There was similar functionality in Volt, an isomorphic Ruby framework that shared a number of other features with Phoenix too (such as performant websockets thanks to JRuby and Vert.x, and a focus on real-time apps) - in many ways I saw Phoenix, and I think a lot of others did too, as an even more performant version of that. In fact I noticed a lot of people who were interested in Volt become interested in Phoenix after it was announced (including Drab’s creator - I am sure he was somewhat active/interested in Volt too) :smiley:

I wrote a bit about Volt here: Ruby is about to get red hot. Again. – (via @AstonJ)

And this is a good explanation:

(I’d love to see more of Volt in Phoenix - like easier ways to build apps as a series of components :nerd_face:)


Sorry @wmnnd, we can split these posts into a separate thread if you prefer :smiley:

1 Like

Ah, this is actually where one of my lessons came from. I tried doing that: setting a flag in the LV process when a new update arrived and doing the debouncing at that level. However, that’s when I discovered that LV always sends updates via WebSocket even if the flag I was using wasn’t present in any template. That’s why I switched to doing the debouncing at the sender level.

So you mean rendering the update once and then broadcasting the actual DOM changes only? I don’t think LV can do that.

Ah no, the connections don’t do any polling unless they have been messaged with an update.

I don’t think Presence would work because most of the connected users don’t need to receive these updates and I think Presence broadcasts updates to everyone in the channel. Also, the participant data being broadcast is not merely ephemeral; some of it is also persisted.

2 Likes

They stated it’s because they didn’t know about Drab when they made LiveView, even though there was quite a long post about it on the forums they frequented; but if so, then that’s why…

Really?! Why would it re-render and send updates even when there are no changes?!? o.O

Exactly, it’s some of the basic core functionality in Drab, it seems really odd if LiveView can’t support that… Without that capability it would be quite difficult to scale…

Presence has both registration and subscription. When you register, you are added to the presence list, but you don’t get updates or information about it, and you are automatically unregistered when your process dies. When you subscribe, you get all updates on it, whether you are registered or not (so you can watch the presence list of something you aren’t part of just fine). You can also call list to get the current entries. So when you need a presence list of which people/connections exist, it’s great at that! :slight_smile:

Presence can also hold some data per connection; just be careful with it, as changes to that data get broadcast out as well, so you generally want to keep it small (though I guess it doesn’t matter as much if only a few ‘admins’ see the list).
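
For completeness, a rough sketch of those pieces in code (assuming a standard `Phoenix.Presence` module added to the supervision tree; every name below is a placeholder):

```elixir
# Defined once per app and started under the application supervisor:
defmodule ConventionAppWeb.Presence do
  use Phoenix.Presence,
    otp_app: :convention_app,
    pubsub_server: ConventionApp.PubSub
end

# 1. Registration: each participant process tracks itself and is removed
#    automatically when it dies. Keep the metadata map small.
participant_id = "participant-123"  # placeholder
ConventionAppWeb.Presence.track(self(), "convention:lobby", participant_id, %{name: "Jane Doe"})

# 2. Subscription: only the processes that care (e.g. the admin LiveView)
#    subscribe to the topic; they then receive "presence_diff" broadcasts,
#    whether or not they are tracked themselves.
Phoenix.PubSub.subscribe(ConventionApp.PubSub, "convention:lobby")

# 3. Listing: the current presence entries can be fetched at any time.
ConventionAppWeb.Presence.list("convention:lobby")
```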

1 Like

Could be they started on it before Drab - at least that’s what I thought was said: LV was started a long time ago but was put on the back burner.

Maybe a natural evolution for Drab could be to go the other way, into something like Volt (isomorphic framework) :smiley:

1 Like

It’s something else…

2 Likes

Oh, he stepped away back in 2019; I hadn’t realized that had happened, that really sucks! :frowning:

2 Likes

@wmnnd and @OvermindDL1

What you would normally do here is put a shield between what is sent to the LV differ and what your ‘process state’ is.
You typically manage this by having a separate assign, for example use_this (which is what you use in your template as @use_this), while your ‘real’ state that receives the PubSub updates is held in a real_data assign.

Then you can have some plumbing, maybe a “did this change within the last second”-style timestamp assign, to drive the logic of when you copy real_data into use_this, which then kicks off a re-render since the assign used in ~L changed.
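
A minimal sketch of that shield pattern, reusing the use_this / real_data names from above (the topic, message shape, and other names are assumptions; whether an otherwise-unused assign still pushes an empty diff over the wire depends on the LiveView version, as discussed earlier in the thread):

```elixir
defmodule ConventionAppWeb.AdminLive do
  use Phoenix.LiveView

  @refresh_every :timer.seconds(1)

  def mount(_params, _session, socket) do
    if connected?(socket) do
      Phoenix.PubSub.subscribe(ConventionApp.PubSub, "participants")
      Process.send_after(self(), :refresh, @refresh_every)
    end

    participants = ConventionApp.Participants.list()
    {:ok, assign(socket, use_this: participants, real_data: participants, dirty?: false)}
  end

  # PubSub updates only touch the "shield" assigns; @use_this, the assign
  # the template actually renders, is left alone here.
  def handle_info({:participants_changed, changes}, socket) do
    {:noreply,
     assign(socket,
       real_data: changes ++ socket.assigns.real_data,
       dirty?: true
     )}
  end

  # At most once per second, copy the shield state into the rendered assign,
  # and only if something actually changed.
  def handle_info(:refresh, socket) do
    Process.send_after(self(), :refresh, @refresh_every)

    socket =
      if socket.assigns.dirty? do
        assign(socket, use_this: socket.assigns.real_data, dirty?: false)
      else
        socket
      end

    {:noreply, socket}
  end

  def render(assigns) do
    ~L"""
    <ul><%= for p <- @use_this do %><li><%= p.name %></li><% end %></ul>
    """
  end
end
```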

I will fully admit that it’s not super clear to the average new user that this is a good design pattern. However, I assume that problem will solve itself with more adoption.

2 Likes