Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy & embed any string/constant pool entry in the serialized output #1537

Open
eregon opened this issue Sep 16, 2023 · 9 comments
Open

Copy & embed any string/constant pool entry in the serialized output #1537

eregon opened this issue Sep 16, 2023 · 9 comments
Labels
enhancement New feature or request
Milestone

Comments

@eregon
Copy link
Member

eregon commented Sep 16, 2023

I think we should copy & embed any string/constant pool entry in the serialized output and therefore not need the source at all when deserializing.
That would mean larger serialized size but no need to read the source at all which could be a nice speedup in the case the serialized form is saved on disk and newer/equal the source file.
I think the serialized size wouldn't be worse than before + eager loading constant pool and strings, it would be about the same, but gaining to not have to read the source.
That might actually be fairly easy, and would be less temporal memory than serialized + source at the same time in memory (but that's not a big deal).

This would also remove the need for accessing owned when serializing the constant pool: #1449 (comment)

Originally posted by @eregon in #1479 (comment)

@kddnewton
Copy link
Collaborator

I'm not sure about this. It makes it so that all serialization consumers now get a much bigger payload each time, even if they have the source available. This would make the JS one, for example, much bigger while it always has the source available.

@enebo what do you think?

@eregon
Copy link
Member Author

eregon commented Dec 6, 2023

There are two gains:

  1. It would make loading from the serialized AST on disk quite a bit faster, by not needing to read the source (and only check its mtime to know if up to date).

  2. It will be important when JRuby & TruffleRuby do lazy deserialization of method literals, so there is no need to keep the Source around for the whole program duration when it would otherwise not be needed.
    I think that's a really valuable optimization to have and we already have some pieces ready in Prism (the offset on DefNode), but it's something for later, in 2024.

Also currently we need to hold onto the Source even after deserializing the entire Prism AST at once, for things like IntegerNode ( 😓 we should probably do something about that). That's rather suboptimal in memory usage, especially with lazy method compilation/translation (!= lazy deserialization, TruffleRuby already uses this with Prism), as then both the full AST and the source are kept for the entire program duration while maybe only a single/few methods of the file are used.

@enebo
Copy link
Collaborator

enebo commented Dec 7, 2023

There are two levels here to consider: loaded_tree and serialized_blob.

For loaded tree the only source we depend on at this point is Integer, Float, and Rational. They need to be moved into the tree and not loaded from source. We just haven't made a decision on how to do this. Keeping source a round AND the tree is a huge memory cost over the legacy parser. I assume we 100% will be doing this soon. We are so close to not having source here.

For serialization I think there is a pretty big carrot to moving constants into serialization: pre-compilation. It would be nice to eventually have a stable serialized format we can load and not bother to load from source at all (assuming it is trusted). We have plans to replace our AOT serialization format with this since Loader requires no C.

As @eregon said we both do lazy method loading and whatever allows us to use as little memory as possible (also helps for startup time). Ultimately, what we would have for methods would be a slice of the constant pool and a slice of the offsets representing the method body (whether we actually slice this or keep whole serialized blob is up to experimentation). The time saved will be not loading huge portions of the tree until it is used. As some Rails commands I have see up to 80% of all method definitions loaded are not called.

The last bit also mentioned by @eregon but he only mentions memory...it is faster to make this pool in C than have us create it in Java when Java is just starting up and still somewhat cold. If we have to depend on C we may as well have C do as much heavy lifting for us as we can get.

@kddnewton I have said this many times now but there is a pretty large dual concern in prism and we keep trying to compromise to serve both. I think it is still working but there will always be some tradeoff so long as we use a single format. A larger serialized format is less memory for compilation but more memory for syntax consumers.

A question would be how much larger would moving constant pool into serialization actually be? We will swap start+length for pool_offset + bytes in pool but for things like identifiers we will have quite a bit of sharing. So in those sorts of cases we may actually reduce size. Strings will be larger period. So, I do think it will be larger but I am unsure how much larger.

@kddnewton
Copy link
Collaborator

Thanks @enebo for your thoughts here.

I imagine it's okay to slow down the serialization API as it's only being used for Java and JS at the moment. The only real thing I'm concerned about here in terms of size would be strings. That could be quite a large serialized blob if the files have large strings in them. We would also need the DATA constant to be stored in the file as well, which could end up being large.

If you all think it is worth it, I'm definitely game to try after we ship 3.3.

@enebo
Copy link
Collaborator

enebo commented Dec 7, 2023

@kddnewton yeah I am not sure what percentage of strings are within files but some files will end up much larger being pushed into serialization (although smaller than having both serialized+source). For me, this is a nice to have thing and something I definitely want but not at the cost of having prism being 100% usable. So deferring this is ok in my opinion.

I would like the three numeric types to have values in the AST though. Carrying around source during build is some code I would love to delete and it will make a significant change in memory size (since we still do lazy method but just with AST+source). Removing source would get us near in memory to what we do with the legacy parser.

@kddnewton
Copy link
Collaborator

@enebo I added the integer value storage to the next milestone so we'll do that. From reading I think I understand how to do that. However I am very unclear on how to do floats/rationals. I'm not tackling it today so it's not pressing, but if you have some resources I'm going to need them.

@kddnewton kddnewton added this to the Ruby 3.4.0 milestone Dec 7, 2023
@enebo
Copy link
Collaborator

enebo commented Dec 7, 2023

@kddnewton an interim solution may just be for us to copy the number strings to the constant pool and then entertain the native storage later.

@eregon
Copy link
Member Author

eregon commented Dec 8, 2023

We would also need the DATA constant to be stored in the file as well

True, although this would only be for the main file/script. I think Prism doesn't know about this currently, but it's a boolean we could pass in, and we'd only need Prism to serialize DATA in that case.

an interim solution may just be for us to copy the number strings to the constant pool and then entertain the native storage later.

Yeah, this would be an easy way for now. OTOH that might be suboptimal for non-serializing use cases?

I was thinking we might be able to just special-case them in Loader.java to store an extra byte[] for them (and declare that extra field in IntegerNode and FloatNode). Not the cleanest, but if the extra type: constant field for IntegerNode and FloatNode is undesirable then it might be a good trade-off.

@enebo
Copy link
Collaborator

enebo commented Dec 8, 2023

@eregon yeah I thought about us just adding stuff not in config.yml but it feels messy. If you do it though I won't complain. We just need to revisit this later when the real solutions occurs.

@kddnewton kddnewton added enhancement New feature or request and removed suggestion labels Dec 14, 2023
@kddnewton kddnewton modified the milestones: CRuby unblocked, Long-term Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants