
Huge GC pressure when serving files with Cohttp_lwt_unix.Server.respond_file #207

Open
mfp opened this issue Nov 26, 2014 · 16 comments

@mfp
mfp commented Nov 26, 2014

respond_file reads the file in 16 KB chunks using Lwt_io.read. This allocates lots of 16 KB strings that go directly to the major heap and become garbage right away, and is bound to trigger major collection fests. See ocsigen/ocsigenserver#49 for an example of what can happen in a similar case...
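Roughly, the pattern in question looks like this (a simplified sketch for illustration, not the actual cohttp code):

```ocaml
(* Simplified sketch: each chunk is a fresh 16 KB string.  Strings of that
   size are allocated directly in the major heap and become garbage as soon
   as they have been written out, hence the GC pressure. *)
let read_file_as_stream (ic : Lwt_io.input_channel) : string Lwt_stream.t =
  let chunk_size = 16384 in
  Lwt_stream.from (fun () ->
    Lwt.bind (Lwt_io.read ~count:chunk_size ic) (fun chunk ->
      if String.length chunk = 0 then Lwt.return_none
      else Lwt.return (Some chunk)))
```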

@rgrinberg
Member

We should probably just be using send_file in this case, at least where it's available. Since you're knowledgeable on this topic, what would you recommend?

@mfp
Author

mfp commented Nov 26, 2014

I'm not especially knowledgeable on this, but for sure serving files is what sendfile(2) was made for. It would only help for plain (non-SSL) connections, which is an important limitation.

There are a few complications though:

  1. portability: both sendfile(2) and the accompanying TCP_CORK option are Linux-only
  2. it doesn't seem to fit Cohttp's current (stream-based) model

Regarding (2), after a quick look (this is the first time I've looked at cohttp's code, so take it
with a grain of salt) I believe it might be possible to change both Cohttp_lwt_body.t (so that it
can be a reference to a file reifiable as a regular stream) and the output channels used elsewhere
in order to perform a sort of stream fusion, replacing the read-file-into-stream and
write-stream-to-socket operations with a single sendfile loop.

This sounds quite involved, so for the time being simply allocating a string buffer and reusing it
(with Lwt_io.read_into) would help a lot. I assume that nobody expects the strings returned by the stream to be immutable (this could be documented in the corresponding interface).
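A minimal sketch of the reuse idea, assuming the stream is changed to yield the shared buffer plus a length (names are illustrative, not a proposed API):

```ocaml
(* Sketch: one buffer allocated up front and refilled in place with
   Lwt_io.read_into; the stream yields the buffer plus the number of valid
   bytes, so consumers must use the data before forcing the next element. *)
let read_file_as_chunks (ic : Lwt_io.input_channel) : (Bytes.t * int) Lwt_stream.t =
  let buf = Bytes.create 16384 in
  Lwt_stream.from (fun () ->
    Lwt.bind (Lwt_io.read_into ic buf 0 (Bytes.length buf)) (fun n ->
      if n = 0 then Lwt.return_none
      else Lwt.return (Some (buf, n))))
```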

@pqwy

pqwy commented Nov 26, 2014

Another alternative would be to use Cstruct.t. sendfile(2) avoids copying the file into OCaml memory altogether, sure, but just using cstructs mostly bypasses the GC. They are backed by a foreign-allocated Bigarray and are represented on the heap as a 3-slot record, so they are unlikely to get undue promotion out of the young generation.
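For illustration (a minimal sketch, nothing cohttp-specific):

```ocaml
(* The 16 KB payload lives in a Bigarray outside the OCaml heap, while the
   Cstruct.t values themselves are small {buffer; off; len} records, so
   slicing is cheap and copy-free. *)
let () =
  let buf = Cstruct.create 16384 in
  let half = Cstruct.sub buf 0 8192 in  (* zero-copy view into the same Bigarray *)
  assert (Cstruct.len half = 8192)
```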

@rgrinberg
Member

We don't even need cstruct if we adopt core_kernel for every backend, since Bigsubstring is analogous AFAIK.

@pqwy

pqwy commented Nov 26, 2014

Yeah. Anything that wraps Bigarray with offset/pointer so the bulk of it is allocated off the GC heap and it's not expensive to cut.

We use Cstruct.t in TLS pervasively for these reasons.

@dbuenzli

I'd really prefer cstruct over a pervasive core_kernel dependency; we want portability and fast build times on constrained platforms.

@hcarty
Contributor

hcarty commented Nov 26, 2014

I second the preference for cstruct over core_kernel.

@pqwy

pqwy commented Nov 26, 2014

@dbuenzli, @hcarty:
That's also true. I can't see myself switching TLS and all the related libs to core just for the buffer type anytime soon.

The ecosystem could really use a common bigarray wrapper with zero-copy slicing and a few handy fixed-size extractor funs that everybody agrees on. Like cstruct, sans syntax extensions, but one that everyone is using...

@avsm
Member

avsm commented Nov 26, 2014

Cstruct now only optionally depends on camlp4 since 1.5.0, and has an adapter to Async.

@mfp
Author

mfp commented Nov 26, 2014

On Wed, Nov 26, 2014 at 09:54:52AM -0800, David Kaloper wrote:

Another alternative would be to use Cstruct.t. sendfile(2) avoids
copying the file into OCaml memory altogether, sure, but just using cstructs
mostly bypasses the GC. They are backed by a foreign-allocated Bigarray
and are represented on the heap as a 3-slot record, so they are unlikely to
get undue promotion out of the young generation.

I agree that cstructs (or even plain old 1-dimensional bigarrays, maybe with
minimal wrapping for offset/length) should be more prevalent in this kind of
thing (IO buffers and such). The main advantage is that they're not going to
be moved around by the GC, so no copying to a temp buffer is needed when
performing IO after releasing the runtime.

They are not a panacea however. For instance, non-mmapped bigarrays simply use
malloc to reserve the memory and free() it on GC of the bigarray, which
cannot be controlled precisely at a reasonable cost -- you have to do
Gc.full_major (). This does not generate extra GC load, but can easily cause
memory fragmentation (and high RSS peaks), since unlike OCaml's major heap
there's no compaction. So you're going to need some care (and maybe a buffer
pool) and have to avoid allocating too much anyway --- not unlike with plain
old strings, although for different reasons.

Going back to the present issue, if Cohttp_lwt_body.t is changed from
Stream of string Lwt_stream.t to Stream of cstruct_like Lwt_stream.t,
you definitely want to reuse the cstruct_like buffer. Also, for compatibility
with Lwt_io, which cohttp_lwt* builds upon, I think simple 1-dimensional
bigarrays (= Lwt_bytes.t) would be best, as in time Lwt_io channels could
become able to dump them directly to the underlying fd (if applicable).
However, neither Lwt_io.read_into_bytes nor Lwt_io.write_from_bytes exist at the moment, so... in fact a single plain old string (+ length) reused throughout the life of the stream is going to be close to optimal in Lwt_io's current state.

Mauricio Fernández

@pqwy

pqwy commented Nov 26, 2014

Cstructs are a minimal wrapping of bigarrays with offset/length!

As for Lwt_io.read_into_bytes and its counterpart, they are called Lwt_bytes.read / write, and are what Lwt_cstruct.read / write build upon. It should be relatively straightforward to switch Cohttp_lwt_body.t and friends to cstruct internally, with the damage being api breakage or duplication.
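For example, a rough sketch of a copy loop over raw file descriptors with a single reused cstruct (illustrative only):

```ocaml
(* Rough sketch: copying between Lwt_unix.file_descr values using a single
   reused Cstruct.t via Lwt_cstruct.read/write; partial writes are retried
   by shifting the view over the same buffer. *)
let copy_fd ~(src : Lwt_unix.file_descr) ~(dst : Lwt_unix.file_descr) =
  let buf = Cstruct.create 16384 in
  let rec loop () =
    Lwt.bind (Lwt_cstruct.read src buf) (fun n ->
      if n = 0 then Lwt.return_unit
      else
        let rec write_all cs =
          if Cstruct.len cs = 0 then loop ()
          else
            Lwt.bind (Lwt_cstruct.write dst cs) (fun written ->
              write_all (Cstruct.shift cs written))
        in
        write_all (Cstruct.sub buf 0 n))
  in
  loop ()
```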

Re: imprecise deallocation and fragmentation.. I have no idea how it would play out. I think those questions are best answered by measurements.

@mfp
Author

mfp commented Nov 26, 2014

On Wed, Nov 26, 2014 at 11:55:11AM -0800, David Kaloper wrote:

Cstructs are a minimal wrapping of bigarrays with offset/length!

I'm not very lucid today; what I meant was really non-opaque wrapping, for
maximal compatibility/interoperability.

As for Lwt_io.read_into_bytes and its counterpart, they are called Lwt_bytes.read / write, and are what Lwt_cstruct.read / write build upon. It should be relatively straightforward to switch Cohttp_lwt_body.t and friends to cstruct internally, with the damage being api breakage or duplication.

Lwt_bytes.read/write operate on Lwt_unix.file_descr, not Lwt_io channels
(as needed to read from/write to TLS channels via Lwt_ssl).

We want to avoid allocation in respond_file's stream, but at some point
you're going to want to write it to a Lwt_io channel if you're living in the
Lwt ecosystem.

This is why, when serving a file with respond_file, you're going to run into
this issue as well: mirage/ocaml-conduit#27

Refer also to ocsigen/lwt#98 (as you can see I went
on a little bug reporting spree).

Re: imprecise deallocation and fragmentation.. I have no idea how it would play out.
I think those questions are best answered by measurements.

Allocating lots of bigarrays and hoping the GC will release them is just
asking for trouble. There's a constant in the bigarray lib used to compute the
major slice size which will cause up to 1 GB worth of bigarray data to be
allocated before a full major GC cycle is completed. The heap is going to
become gruyère cheese if you don't try to reuse the bigarrays (and triggering
GC runs manually would generate as much work as plain old strings being
allocated in the major heap, but that at least adjusts the GC speed
automatically and prevents fragmentation...).

Mauricio Fernández

@Chris00
Contributor

Chris00 commented Jan 26, 2015

Lwt_bytes.read/write operate on Lwt_unix.file_descr, not Lwt_io channels (as needed to read from/write to TLS channels via Lwt_ssl).

It looks like a first move would be to request Lwt to have “allocation free” Lwt_io.read and Lwt_io.write (working on channels) — it would be useful for other purposes too. Async already has that. Then one can ask Ssl to directly support Bytes.t.
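Purely as an illustration of what I mean, the requested additions could look something like this (hypothetical signatures; they do not exist in Lwt today and the names are made up):

```ocaml
(* Hypothetical signatures only -- illustrating the kind of allocation-free,
   channel-level operations being requested. *)
module type LWT_IO_NO_ALLOC = sig
  val read_into_bigarray :
    Lwt_io.input_channel -> Lwt_bytes.t -> int -> int -> int Lwt.t
  val write_from_bigarray :
    Lwt_io.output_channel -> Lwt_bytes.t -> int -> int -> int Lwt.t
end
```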

@Chris00
Contributor

Chris00 commented Jan 26, 2015

BTW, while we are at it, IO.read should report EOF and error conditions differently — IMHO it is very important to distinguish the two, as one may have to rely on EOF to make sure the complete input was read...

@avsm
Member

avsm commented Jan 26, 2015

I've opened up #207 to track the IO.read comment -- it's helpful to keep these bug reports separate where possible.

@mfp
Author

mfp commented Aug 17, 2015

I'm thinking of implementing a trivial buffer pool to prevent allocation of large buffers (both Bytes.t and Lwt_bytes.t/Cstruct.t), starting from conduit and working upwards to cohttp and ocsigenserver. I did it in ocsigenserver to great effect (up to a ~30% speed up).

This is the API I have in mind (as you can see in the .ml, the implementation is trivial).
Reusing buffers entails the risk of accidentally using one after it's been explicitly released, but the places where they are used are clear enough and the speed gains so great it makes it worthwhile IMO (for instance, the buffer passed to Lwt_io.make can be released in the ~close function).
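A minimal sketch of the kind of pool I mean (illustrative only; the names and details differ from the actual code linked above):

```ocaml
(* Trivial buffer pool sketch: fixed-size buffers are kept on a free list;
   acquire hands out an idle buffer or allocates a new one, release returns
   it for reuse (dropping it if the pool is already full). *)
module Pool : sig
  type 'a t
  val create : alloc:(unit -> 'a) -> max_idle:int -> 'a t
  val acquire : 'a t -> 'a
  val release : 'a t -> 'a -> unit
end = struct
  type 'a t = { alloc : unit -> 'a; max_idle : int; mutable free : 'a list }

  let create ~alloc ~max_idle = { alloc; max_idle; free = [] }

  let acquire t =
    match t.free with
    | b :: rest -> t.free <- rest; b
    | [] -> t.alloc ()

  let release t b =
    if List.length t.free < t.max_idle then t.free <- b :: t.free
end

(* e.g. a pool of 16 KB Lwt_bytes buffers shared by the IO layer *)
let io_buffers = Pool.create ~alloc:(fun () -> Lwt_bytes.create 16384) ~max_idle:64
```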

Initially, I was going to just add the buffer pool to conduit, and then reuse it in cohttp and ocsigenserver (at some point, the API could be enriched and generalized, and split off into a separate package). One doubt I have is whether it'd be acceptable to expose conduit's buffer pool (I mean the pool "value" itself, not the implementation) so that cohttp and ocsigenserver can reuse it, or whether it'd be deemed better to have each one use its own internally (without exposing it), while reusing the implementation, of course.

In preliminary benchmarks of the ocsigenserver cohttp(_rebased) branch, I've seen it run at between ~50% of the speed of stock ocsigenserver 2.6 (for very small responses) and about the same speed, with the slowdown shrinking as response size grows (about ~5% for 16 KB files); that's against stock 2.6 without any of the allocation prevention work mentioned earlier. I think the higher per-connection overhead can easily be explained by cohttp using 16 KB buffers (vs. 4 KB in stock ocsigenserver).

Also, regarding the precise problem that motivated this issue, it's similar to this one relative to ocsigenserver, so I'd expect about the same speed increase (>2X faster serving of large files) once addressed. This would build on the buffer pool work for extra effectiveness.
