Heavy memory usage difference between 0.24 and 0.25 #480

Open
Totodore opened this issue Jan 26, 2025 · 9 comments

@Totodore
Contributor

One of the users of socketioxide, a socket.io implementation built with tokio-tungstenite, reported really high memory usage (around 3 GB for 50K sockets) when benchmarking a basic ping/pong socket.io server.

I checked a bit, and it appears there is a big difference between versions 0.24 and 0.25 of tungstenite/tokio-tungstenite. I took the official echo example and benchmarked it with 10K sockets echoing messages.

Here is the repository with all the details and results: https://github.com/Totodore/tokio-tungstenite-mem-usage

Recap:

tokio-tungstenite 0.24.0

  • Calls to allocation functions: 341,152 (3,694/s)
  • Temporary allocations: 60,021 (17.59%, 649/s)
  • Peak heap memory usage: 106.1MB
  • Peak RSS: 119.6MB
  • Total runtime: 1min32s

tokio-tungstenite 0.25.0

  • Calls to allocation functions: 324,172 (3,064/s)
  • Temporary allocations: 68,341 (21.08%, 646/s)
  • Peak heap memory usage: 908MB
  • Peak RSS: 936MB
  • Total runtime: 1min45s

Tokio-tungstenite 0.25 eats 908MB of RAM vs 106.1MB for 0.24!

Possible causes

0.25 introduces the Bytes type in the Message API. Since it shares data directly with what is received from the internal socket, I suspect the cause lies around this feature. According to the heaptrack trace, the allocations come from the FrameCodec in_buffer, which seems to grow when it doesn't have enough space. But no reference to the message content is kept, so the FrameCodec should not grow that much.

I also tried without echoing anything and the results are the same:

    // Requires futures_util::{StreamExt, TryStreamExt} in scope.
    let (_write, mut read) = ws_stream.split();
    while let Some(_msg) = read.try_next().await.unwrap() {}

[heaptrack trace screenshot]

@Zarathustra2

Hi, I noticed this as well when upgrading axum. I think the reason for this is #468

Can you retry and set read_buffer_size to something lower and see if the issue persists?
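For reference, here is a minimal echo-server sketch of how such a configuration could be wired in with tokio-tungstenite's accept_async_with_config. The address, the 16 KiB value, and the error handling are illustrative only, and whether read_buffer_size is set as a public field or via a builder-style setter depends on the exact tungstenite release, so adjust to the WebSocketConfig API of the version you use:

    // Echo server sketch with a smaller per-connection read buffer.
    // Assumes tokio-tungstenite 0.25+; `read_buffer_size` may be a public field
    // or a builder-style setter depending on the release.
    use futures_util::StreamExt;
    use tokio::net::TcpListener;
    use tokio_tungstenite::{accept_async_with_config, tungstenite::protocol::WebSocketConfig};

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("127.0.0.1:8080").await?;

        // Shrink the read buffer from the 64/128 KiB default to 16 KiB per connection.
        let mut config = WebSocketConfig::default();
        config.read_buffer_size = 16 * 1024;

        while let Ok((stream, _addr)) = listener.accept().await {
            let config = config.clone();
            tokio::spawn(async move {
                let ws = accept_async_with_config(stream, Some(config))
                    .await
                    .expect("websocket handshake failed");
                let (write, read) = ws.split();
                // Echo every received message back to the client.
                if let Err(e) = read.forward(write).await {
                    eprintln!("connection error: {e}");
                }
            });
        }
        Ok(())
    }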

@agalakhov
Member

Can you git bisect?

@Totodore
Contributor Author

Totodore commented Jan 26, 2025

  • With read_buffer_size set to 64 KB, memory peaks at 660 MB (6 times the 0.24 baseline), with 618 MB coming from FrameCodec::read_frame, whereas previously it was shared with FrameCodec::new.
  • With read_buffer_size set to 16 KB, memory peaks at 169 MB, with 163 MB coming from FrameCodec::read_frame.

I don't quite understand why the memory tradeoff for 3% more speed is worth it in #468...
And there is still a significant difference even with read_buffer_size set to 64 KB.

@Totodore
Contributor Author

Can you git bisect?

I can't reproduce the issue with the tungstenite example; it seems the heavy memory use happens in async contexts with tokio-tungstenite.

@agalakhov
Member

Did you try 50k sockets with plain tungstenite as well? With 64 kB per connection and 50k connections, high memory usage is quite expected. 660 MB is not that much, since 64k × 50k = 3.2 GB.

The actual "speed vs memory" trade-off is not as simple as it seems. Speed is not just speed, it is the number of concurrent connections being needed. Being faster effectively means having less clients in the wait queue and, for many applications, this is equivalent to less connections in parallel. 1 GB RAM out of 64 GB RAM (1.5% RAM) for 3% more performance is a fair trade. On the other hand, many applications do not benefit at all from the increased buffer size since due to the nature of their communication. They just waste RAM and get nothing in exchange.

TL;DR: there is no "one-size-fits-all" buffer size. The configuration depends on the actual requirements.
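To make the arithmetic above concrete, here is a small back-of-the-envelope sketch (plain Rust, no tungstenite APIs involved) that tabulates the aggregate read-buffer footprint for 50k connections at various buffer sizes; the helper name and the decimal-GB rounding are just illustrative:

    // Rough aggregate read-buffer footprint: buffer size x connection count.
    fn read_buffer_footprint_gb(buffer_size: usize, connections: usize) -> f64 {
        (buffer_size * connections) as f64 / 1e9 // decimal GB, as in "64k * 50k = 3.2G"
    }

    fn main() {
        for &buf in &[4 * 1024, 16 * 1024, 64 * 1024, 128 * 1024] {
            println!(
                "{:>3} KiB buffer x 50k connections ≈ {:.2} GB",
                buf / 1024,
                read_buffer_footprint_gb(buf, 50_000)
            );
        }
    }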

@daniel-abramov
Member

daniel-abramov commented Jan 26, 2025

@Zarathustra2 is right in assuming that it was the increase in the size of the read buffer that caused higher memory consumption; however, it's not the linked PR that caused such a stark difference.

The aforementioned PR only changed it from 64 KiB to 128 KiB. The real difference was introduced earlier, when our internal custom ReadBuffer was replaced with BytesMut:

  • ReadBuffer (0.24.0) had an initial capacity of 4 KiB (and it also read from the stream in 4 KiB chunks).
  • The newly introduced BytesMut buffer (0.25.0+) was created with an initial capacity of 64 KiB (16 times larger).
  • #468 additionally increased it to 128 KiB (32 times larger than in 0.24.0).

@Totodore, so once you set read_buffer_size to 4096 (4 KiB), the memory consumption in the latest version will match that of 0.24.0 (I've just locally benchmarked it). I'd also suggest using the latest version (as of now, 0.26.1) for cleaner results.


According to the heaptrack trace, the allocations come from the FrameCodec in_buffer, which seems to grow when it doesn't have enough space. But no reference to the message content is kept, so the FrameCodec should not grow that much.

That particular function should not appear in the stack trace at all if we're talking about the k6 configuration that is available in your repository. I suppose the screenshot refers to a different benchmark (I've noticed that the filename in your README does not match the config name in the repository).


PS: I agree that our current default read buffer configuration might be too large for a general use case. I believe the buffer size was set to such a high value because our benchmarks showed better results with larger buffer sizes. Perhaps we should tune it down to 16 KiB or something like that.
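For completeness, here is a minimal client-side sketch of the read_buffer_size = 4096 suggestion above, assuming tokio-tungstenite 0.25+ and its connect_async_with_config helper; the URL is a placeholder and, as with the server sketch earlier, whether read_buffer_size is a public field or a builder-style setter depends on the release:

    // Client sketch: connect with a 4 KiB read buffer to mimic the 0.24.0 behaviour.
    use futures_util::{SinkExt, StreamExt};
    use tokio_tungstenite::{
        connect_async_with_config,
        tungstenite::{protocol::WebSocketConfig, Message},
    };

    #[tokio::main]
    async fn main() {
        let mut config = WebSocketConfig::default();
        config.read_buffer_size = 4096; // 4 KiB, matching the old ReadBuffer capacity

        // Third argument: whether to disable Nagle's algorithm on the TCP socket.
        let (mut ws, _response) =
            connect_async_with_config("ws://127.0.0.1:8080", Some(config), false)
                .await
                .expect("failed to connect");

        ws.send(Message::text("ping")).await.expect("send failed");
        if let Some(Ok(msg)) = ws.next().await {
            println!("echoed back: {msg}");
        }
    }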

@Totodore
Contributor Author

Totodore commented Jan 26, 2025

Thanks for the details, I understand this much better now.
I will try setting the input buffer to 4 KB on my side and check everything.

The call traces are generated with the k6 client provided in the repo (there is just a typo in the README). I don't understand why they should not be visible; the k6 client is doing TCP calls the same way a classic ws implementation would, no?

Even if the in_buffer_size is not changed, I think it would be nice to have more details on the doc front page about this memory/performance tradeoff.

@daniel-abramov
Member

The call traces are generated with the k6 client provided in the repo (there is just a typo in the README). I don't understand why they should not be visible; the k6 client is doing TCP calls the same way a classic ws implementation would, no?

Right. However, the config in the repository lets each client connection send a single short message. Such a message would perfectly fit in the read buffer without needing an additional re-allocation inside the read_frame() function (though I must admit, I'm referring to 0.26.1; we might have had some suboptimality in 0.25.0).

@alexheretic
Contributor

alexheretic commented Jan 27, 2025

I think we can improve the read_buffer_size docs to clarify the tradeoff and perhaps tweak the default. The current default is good for high-load read performance (in particular, what is benchmarked is a high load of small messages).

Here is a comparison of different read buffer sizes to the 100k small message read bench:

group                                       256k                    128k                    64k                     32k                     16k                     8k                      4k                      zero
-----                                       ----                    ----                    ---                     ---                     ---                     --                      --                      ----
read 100k small messages (client)           1.00      5.5±0.08ms    1.01      5.5±0.01ms    1.01      5.5±0.02ms    1.03      5.6±0.20ms    1.11      6.1±0.02ms    1.19      6.5±0.05ms    1.39      7.6±0.08ms    70.27   384.9±6.24ms
read+unmask 100k small messages (server)    1.00      5.7±0.00ms    1.02      5.8±0.02ms    1.04      5.9±0.01ms    1.09      6.2±0.02ms    1.18      6.7±0.03ms    1.36      7.8±0.04ms    1.71      9.7±0.06ms    88.22   503.1±1.87ms

128k may be a good default for users with a low number of ws connections (common for client usage?), as the performance improvement is desirable and the absolute extra memory isn't much when it isn't multiplied across many connections. But it's perhaps not as good for users expecting many connections, as is common for servers. I suppose it is a poor default for write-heavy servers that don't expect to read much at all.

Some potential actions:

  • Pick a new single default that balances memory & perf differently.
  • Use different defaults for client & server usage (lower for servers).
  • Document read_buffer_size with speed/memory tradeoff.

Note that the write buffer setting has less impact, as its capacity is used more lazily and may not be fully allocated when flushing eagerly, whereas the read buffer is fully allocated on init so it is available for reading into.
