
[rust] fix accessing ConstantId slices #1449

Merged · 1 commit · Sep 12, 2023

Conversation

froydnj
Contributor

@froydnj froydnj commented Sep 12, 2023

When I wrote ConstantId::as_slice, I misunderstood how yp_constant_id_t works in concert with yp_constant_pool_t: it's not a straight index into the pool of constants.

This implementation is, of course, suboptimal in the sense that we're taking what should be a cheap operation and making it linear in the number of constant ids the program contains, but it is at least correct. I don't have a test for this, as the example that I was seeing this on doesn't fail as a standalone test in the testsuite, for reasons I don't fully understand.

It would not be entirely surprising to me if everybody using the C API gets bitten by this. Happy to rewrite the constant pool implementation so it works more like one would expect.

@kddnewton
Collaborator

Happy to merge this fix.

The people using the constant pool will typically build a new table that is ordered correctly for their use case. The reason it is the way it is today is that it's actually a hash table under the hood, in order to make lookups fast.

You can see how this is done in other places. For example:

@kddnewton kddnewton merged commit 41430f4 into ruby:main Sep 12, 2023
46 checks passed
@froydnj
Contributor Author

froydnj commented Sep 12, 2023

The people using the constant pool will typically build a new table that is ordered correctly for their use case. The reason it is the way it is today is that it's actually a hash table under the hood, in order to make lookups fast.

Yes, I see that, but the constant pool layout makes things like ConstantId::as_slice require something like this PR, because you don't necessarily know the table up front. I guess you could define a ConstantIdTable trait and force as_slice to take that, but that feels...excessive?

Why not set things up so that you have fast access from a yp_constant_id_t -> yp_constant_t and hash indexing for names as well? It would not be that much more code. Would you consider a patch for that?

@kddnewton
Collaborator

How would you do that without linearly increasing the memory footprint?

@froydnj
Contributor Author

froydnj commented Sep 12, 2023

The straightforward way to do things is to have:

typedef struct {
    const uint8_t *start;
    size_t length;
    size_t hash;
} yp_constant_t;

struct constant_pool_slot {
    uint32_t id : 31;
    uint32_t owned: 1;
};

typedef struct {
    // Allocated with `capacity` entries; all entries are potentially valid, but
    // `id == 0` entries have no data associated with them.
    struct constant_pool_slot *slots;
    // Allocated with `capacity` entries, but only the first `size` entries are
    // valid.  Indexed by `yp_constant_id_t`.  Does not have to be resized
    // alongside `slots` if the hash table gets too full, unless desired for
    // consistency.
    yp_constant_t *constants;
    size_t size;
    size_t capacity;
} yp_constant_pool_t;

constants holds your yp_constant_t values; slots acts as your hash table. This layout does mean that a hit takes an extra load, since you need to access constants[slot->id].length to verify that your entries match, but that is probably compensated for by making the hash table itself denser and the yp_constant_t values smaller (no more wasted space!). Resizing is also cheaper (admittedly a minor thing), though freeing a constant pool becomes slightly more expensive.

There are several refinements one could do:

  • Allocate slots and constants in one big chunk;
  • Use 32-bit hashes and store them in the slots, which makes the slots bigger, but yp_constant_t much smaller. (Admittedly, this could be done with the current scheme.)

@kddnewton
Collaborator

If you'd like to investigate this approach, could you open a PR with benchmarks on memory and speed? If it's an improvement on both, I'll merge immediately. If it's an improvement on only one, then we'll have to weigh the pros/cons.

@froydnj
Contributor Author

froydnj commented Sep 12, 2023

Sure! Do you have standardized benchmarks for measuring these sorts of things?

@kddnewton
Collaborator

Ugh, I'm working on it, but no, I don't have anything in particular. Honestly, if you want to run it against Stripe's codebase as you have been, I think that would be fine.

@eregon
Member

eregon commented Sep 16, 2023

yp_constant_id_t -> yp_constant_t

This is also needed for #1533.

@eregon
Member

eregon commented Sep 16, 2023

I think compact hashes as described in #1449 (comment) are a well-known solution perf-wise and memory-wise, and crucially they make yp_constant_id_t -> yp_constant_t lookups possible, which is necessary.
It's very similar to how CRuby changed the representation of Hash, and also to https://blog.toit.io/hash-maps-that-dont-hate-you-1a96150b492a (ignore the part about removing entries, which we don't care about here; the pictures are good for explaining the design of compact hashes).
The difference for the constant pool is that insertion order is not strictly necessary, but it seems better, e.g. for efficiency when serializing the constant pool (i.e. reading memory linearly vs. pretty much random reads).

An alternative would be to return a yp_constant_t* from yp_constant_pool_insert_*() and have a yp_constant_id_t field in yp_constant_t. That would increase the footprint of type: constant fields from 4 to 8 bytes in C node structs.

@froydnj
Contributor Author

froydnj commented Sep 16, 2023

The difference for the constant pool is that insertion order is not strictly necessary, but it seems better, e.g. for efficiency when serializing the constant pool (i.e. reading memory linearly vs. pretty much random reads).

This turns out to require a little more work, because in a scheme where you have separate arrays for each of:

struct yp_hash_bucket {
    uint32_t id : 31;
    uint32_t owned : 1;
    uint32_t hash;
};

struct yp_constant {
    const uint8_t *start;
    size_t len;
};

and yp_constant_id_t indexes into an array of struct yp_constant, you still need the owned bit when serializing, so you can't do truly linear accesses. You could solve this by moving the owned bit (which could then become a bool) to struct yp_constant and changing len to be uint32_t, which seems very reasonable (who has 4GB+ identifiers, or even strings, if strings ever wind up in the constant pool?).

An alternative would be to return a yp_constant_t* from yp_constant_pool_insert_*() and have a yp_constant_id_t field in yp_constant_t. That would increase the footprint of type: constant fields from 4 to 8 bytes in C node structs.

I don't think you want to use pointers because the pointers are unstable in the face of constant pool resizes, whereas indices are not; the size increase is not desirable either.

@eregon
Member

eregon commented Sep 16, 2023

you still need the owned bit when serializing

Actually, I think we would not need it when serializing with the alternative in #1479 (comment) -> #1537,
i.e. we would embed all constant pool entries in the serialized output, so it's irrelevant whether an entry is owned or not; we'd serialize as if they were all owned by the constant pool.

Indeed, this bit could also just be moved somewhere else if that's necessary/easier/faster.

I don't think you want to use pointers because the pointers are unstable in the face of constant pool resizes, whereas indices are not

Ah true, great point, I missed that (invalidating a pointer is something that can't happen in Java or Ruby).
