Heavy memory usage difference between 0.24 and 0.25 #480

Open
Totodore opened this issue Jan 26, 2025 · 9 comments

@Totodore
Contributor

One of the users of socketioxide, a socket.io implementation built with tokio-tungstenite, reported really high memory usage (around 3 GB for 50K sockets) when benchmarking a basic ping/pong socket.io server.

I checked a bit, and it appears there is a big difference between versions 0.24 and 0.25 of tungstenite/tokio-tungstenite. I took the official echo example and benchmarked it with 10K sockets echoing messages.

Here is the repository with all the details and results: https://github.com/Totodore/tokio-tungstenite-mem-usage

Recap:

tokio-tungstenite 0.24.0

  • Calls to allocation functions: 341,152 (3,694/s)
  • Temporary allocations: 60,021 (17.59%, 649/s)
  • Peak heap memory usage: 106.1MB
  • Peak RSS: 119.6MB
  • Total runtime: 1min32s

tokio-tungstenite 0.25.0

  • Calls to allocation functions: 324,172 (3,064/s)
  • Temporary allocations: 68,341 (21.08%, 646/s)
  • Peak heap memory usage: 908MB
  • Peak RSS: 936MB
  • Total runtime: 1min45s

Tokio-tungstenite 0.25 eats 908MB of RAM vs 106.1MB for 0.24!

Possible causes

0.25 introduces the Bytes type in the Message API. Since it shares data directly with what is received from the internal socket, I suspect the cause lies around this feature. According to the heaptrack trace, the allocations come from the FrameCodec in_buffer, which seems to grow when it doesn't have enough space. But no reference to the message content is kept, so the FrameCodec should not grow that much.

I also tried without echoing anything and the results are the same:

    // Requires futures_util::{StreamExt, TryStreamExt} in scope.
    let (_write, mut read) = ws_stream.split();
    while let Some(_msg) = read.try_next().await.unwrap() {}

[heaptrack trace screenshot]

@Zarathustra2

Hi, I noticed this as well when upgrading axum. I think the reason for this is #468

Can you retry and set read_buffer_size to something lower and see if the issue persists?
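For reference, here is a minimal echo-server sketch of how such a configuration could be wired in with tokio-tungstenite's accept_async_with_config. The address, the 16 KiB value, and the error handling are illustrative only, and whether read_buffer_size is set as a public field or via a builder-style setter depends on the exact tungstenite release, so adjust to the WebSocketConfig API of the version you use:

    // Echo server sketch with a smaller per-connection read buffer.
    // Assumes tokio-tungstenite 0.25+; `read_buffer_size` may be a public field
    // or a builder-style setter depending on the release.
    use futures_util::StreamExt;
    use tokio::net::TcpListener;
    use tokio_tungstenite::{accept_async_with_config, tungstenite::protocol::WebSocketConfig};

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("127.0.0.1:8080").await?;

        // Shrink the read buffer from the 64/128 KiB default to 16 KiB per connection.
        let mut config = WebSocketConfig::default();
        config.read_buffer_size = 16 * 1024;

        while let Ok((stream, _addr)) = listener.accept().await {
            let config = config.clone();
            tokio::spawn(async move {
                let ws = accept_async_with_config(stream, Some(config))
                    .await
                    .expect("websocket handshake failed");
                let (write, read) = ws.split();
                // Echo every received message back to the client.
                if let Err(e) = read.forward(write).await {
                    eprintln!("connection error: {e}");
                }
            });
        }
        Ok(())
    }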

@agalakhov
Member

Can you git bisect?

@Totodore
Contributor Author

Totodore commented Jan 26, 2025

  • With read_buffer_size set to 64 KB, memory peaks at 660 MB (6 times the 0.24 baseline), with 618 MB coming from FrameCodec::read_frame, whereas previously it was shared with FrameCodec::new.
  • With read_buffer_size set to 16 KB, memory peaks at 169 MB, with 163 MB coming from FrameCodec::read_frame.

I don't quite understand why the memory tradeoff for 3% more speed is worth it in #468...
And there is still a significant difference even with read_buffer_size set to 64 KB.

@Totodore
Contributor Author

Can you git bisect?

I can't reproduce the issue with the tungstenite example; it seems the heavy memory use happens in async contexts with tokio-tungstenite.

@agalakhov
Member

Did you try 50k sockets with plain tungstenite as well? With 64 kB per connection and 50k connections, high memory usage is quite expected. 660 MB is not that much, since 64k × 50k = 3.2 GB.

The actual "speed vs memory" trade-off is not as simple as it seems. Speed is not just speed, it is the number of concurrent connections being needed. Being faster effectively means having less clients in the wait queue and, for many applications, this is equivalent to less connections in parallel. 1 GB RAM out of 64 GB RAM (1.5% RAM) for 3% more performance is a fair trade. On the other hand, many applications do not benefit at all from the increased buffer size since due to the nature of their communication. They just waste RAM and get nothing in exchange.

TL;DR: there is no "one-size-fits-all" buffer size. The configuration depends on the actual requirements.
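To make the arithmetic above concrete, here is a small back-of-the-envelope sketch (plain Rust, no tungstenite APIs involved) that tabulates the aggregate read-buffer footprint for 50k connections at various buffer sizes; the helper name and the decimal-GB rounding are just illustrative:

    // Rough aggregate read-buffer footprint: buffer size x connection count.
    fn read_buffer_footprint_gb(buffer_size: usize, connections: usize) -> f64 {
        (buffer_size * connections) as f64 / 1e9 // decimal GB, as in "64k * 50k = 3.2G"
    }

    fn main() {
        for &buf in &[4 * 1024, 16 * 1024, 64 * 1024, 128 * 1024] {
            println!(
                "{:>3} KiB buffer x 50k connections ≈ {:.2} GB",
                buf / 1024,
                read_buffer_footprint_gb(buf, 50_000)
            );
        }
    }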

@daniel-abramov
Member

daniel-abramov commented Jan 26, 2025

@Zarathustra2 is right in assuming that it was the increase in the size of the read buffer that caused higher memory consumption; however, it's not the linked PR that caused such a stark difference.

The aforementioned PR only changed it from 64 KiB to 128 KiB. The real difference was introduced earlier, when our internal custom ReadBuffer was replaced with BytesMut:

  • ReadBuffer (0.24.0) had an initial capacity of 4 KiB (and it also read from the stream in 4 KiB chunks).
  • The newly introduced BytesMut buffer (0.25.0+) was created with an initial capacity of 64 KiB (16 times larger).
  • #468 additionally increased it to 128 KiB (32 times larger than in 0.24.0).

@Totodore, so once you set read_buffer_size to 4096 (4 KiB), the memory consumption in the latest version will match that of 0.24.0 (I've just locally benchmarked it). I'd also suggest using the latest version (as of now, 0.26.1) for cleaner results.


According to the heaptrack trace, the allocations come from the FrameCodec in_buffer, which seems to grow when it doesn't have enough space. But no reference to the message content is kept, so the FrameCodec should not grow that much.

That particular function should not appear in the stack trace at all if we're talking about the k6 configuration that is available in your repository. I suppose the screenshot refers to a different benchmark (I've noticed that the filename in your README does not match the config name in the repository).


PS: I agree that our current default read buffer configuration might be too large for a general use case. I believe the buffer size was set to such a high value because our benchmarks showed better results with larger buffer sizes. Perhaps we should tune it down to 16 KiB or something like that.
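For completeness, here is a minimal client-side sketch of the read_buffer_size = 4096 suggestion above, assuming tokio-tungstenite 0.25+ and its connect_async_with_config helper; the URL is a placeholder and, as with the server sketch earlier, whether read_buffer_size is a public field or a builder-style setter depends on the release:

    // Client sketch: connect with a 4 KiB read buffer to mimic the 0.24.0 behaviour.
    use futures_util::{SinkExt, StreamExt};
    use tokio_tungstenite::{
        connect_async_with_config,
        tungstenite::{protocol::WebSocketConfig, Message},
    };

    #[tokio::main]
    async fn main() {
        let mut config = WebSocketConfig::default();
        config.read_buffer_size = 4096; // 4 KiB, matching the old ReadBuffer capacity

        // Third argument: whether to disable Nagle's algorithm on the TCP socket.
        let (mut ws, _response) =
            connect_async_with_config("ws://127.0.0.1:8080", Some(config), false)
                .await
                .expect("failed to connect");

        ws.send(Message::text("ping")).await.expect("send failed");
        if let Some(Ok(msg)) = ws.next().await {
            println!("echoed back: {msg}");
        }
    }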

@Totodore
Contributor Author

Totodore commented Jan 26, 2025

Thanks for the details, I understand this much better now.
I will try setting the input buffer to 4 KB on my side and check everything.

The call traces are generated with the k6 client provided in the repo (there is just a typo in the README). I don't understand why they should not be visible; the k6 client is doing TCP calls the same way a classic ws implementation would, no?

Even if the in_buffer_size is not changed, I think it would be nice to have more details on the doc front page about this memory/performance tradeoff.

@daniel-abramov
Member

The call traces are generated with the k6 client provided in the repo (there is just a typo in the README). I don't understand why they should not be visible; the k6 client is doing TCP calls the same way a classic ws implementation would, no?

Right. However, the config in the repository lets each client connection send a single short message. Such a message would perfectly fit in the read buffer without needing an additional re-allocation inside the read_frame() function (though I must admit, I'm referring to 0.26.1; we might have had some suboptimality in 0.25.0).

@alexheretic
Contributor

alexheretic commented Jan 27, 2025

I think we can improve the read_buffer_size docs to clarify the tradeoff and perhaps tweak the default. The current default is good for high-load read performance (in particular, what is benchmarked is a high load of small messages).

Here is a comparison of different read buffer sizes to the 100k small message read bench:

group                                       256k                    128k                    64k                     32k                     16k                     8k                      4k                      zero
-----                                       ----                    ----                    ---                     ---                     ---                     --                      --                      ----
read 100k small messages (client)           1.00      5.5±0.08ms    1.01      5.5±0.01ms    1.01      5.5±0.02ms    1.03      5.6±0.20ms    1.11      6.1±0.02ms    1.19      6.5±0.05ms    1.39      7.6±0.08ms    70.27   384.9±6.24ms
read+unmask 100k small messages (server)    1.00      5.7±0.00ms    1.02      5.8±0.02ms    1.04      5.9±0.01ms    1.09      6.2±0.02ms    1.18      6.7±0.03ms    1.36      7.8±0.04ms    1.71      9.7±0.06ms    88.22   503.1±1.87ms

128k may be a good default for users with a low number of ws connections (common for client usage?), as the performance improvement is desirable and the absolute extra memory isn't much when it isn't multiplied across many connections. But it's perhaps not as good for users expecting many connections, as is common for servers. I suppose it is a poor default for write-heavy servers that don't expect to read much at all.

Some potential actions:

  • Pick a new single default that balances memory & perf differently.
  • Use different defaults for client & server usage (lower for servers).
  • Document read_buffer_size with speed/memory tradeoff.

Note that the write buffer setting has less impact, as its capacity is used more lazily and may not be fully allocated when flushing eagerly, whereas the read buffer is fully allocated on init so it is available for reading into.
