disable field norms #476

Draft: wants to merge 2 commits into master

@missinglink (Member) commented Feb 23, 2021

this DRAFT PR isn't meant to be merged, I'm just curious as to what a planet build would look like with norms: false on all the fields.

it's been a while since we last looked at this in #323

I suspect that since setting norms: false will disable 'field length', it will:

  • fix the issue we have with aliases counting towards the field length and therefore scoring lower when more aliases exist
  • at the same time will have a negative impact on exact matching queries where the shorter field length allowed them to score higher
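
To make the two bullet points above concrete, here is a minimal sketch using the textbook BM25 formula. It is not Lucene's actual implementation: the token counts are hypothetical, and "norms disabled" is modelled by treating every document as if it had average field length, which is a simplifying assumption.

```python
def bm25_term_score(tf, field_len, avg_field_len, idf=1.0, k1=1.2, b=0.75, norms=True):
    """Score a single matching term with the textbook BM25 formula.

    With norms=False the per-document field length is ignored and every
    document is treated as average length (an assumption of this sketch).
    """
    length_ratio = field_len / avg_field_len if norms else 1.0
    return idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * length_ratio))


# Hypothetical token counts for the same place, indexed without and with aliases.
AVG_FIELD_LEN = 4
PLAIN_LEN = 2         # just the primary name
WITH_ALIASES_LEN = 8  # primary name plus several aliases

for norms in (True, False):
    plain = bm25_term_score(1, PLAIN_LEN, AVG_FIELD_LEN, norms=norms)
    aliased = bm25_term_score(1, WITH_ALIASES_LEN, AVG_FIELD_LEN, norms=norms)
    print(f"norms={norms}: plain={plain:.3f} with_aliases={aliased:.3f}")
```

With norms enabled, the record carrying aliases scores lower for the same term match; with norms disabled the two scores are identical, which also removes the advantage a short exact-match field had.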

the thing I'm curious about is how much effect the second point has in practice. there is actually an integration test which regresses as part of this commit, but I suspect that population / popularity scoring may, to some degree, resolve some of the exact matching issues.

my hope is that it shows this could potentially be workable, although I'm not willing to bet on it 😆

see: http://makble.com/what-is-lucene-norms

@missinglink (Member, Author)

I was expecting the build size to be reduced, since it's no longer storing the 1 byte per document for the norms.
The reduction is not significant compared to the rest of the index:

Screenshot 2021-02-24 at 09 33 13

@missinglink (Member, Author) commented Feb 23, 2021

Some examples of improvements: in both cases the more popular, yet wordier, names are now being scored higher than the exact-matching or succinct names.

Screenshot 2021-02-24 at 10 14 21

Screenshot 2021-02-24 at 10 13 31

Note: 'Angkor Wat Putt' is a mini-golf course ;) It actually has a popularity score of 6600, compared to 2200 for the ticket office. This is not great, but it's not the fault of the similarity algo; we can fix that either in the data or in the population calculation algo.

More testing to come...

@missinglink changed the title from "disable field norms across the board" to "disable field norms" on Feb 24, 2021
@missinglink (Member, Author) commented Feb 24, 2021

Surprisingly, the testing was fairly favourable. As expected, it had the positive effect of fixing the field length scoring discrepancy introduced by adding aliases, and it produced better sorting in many autocomplete cases, with few regressions there.

For /v1/search and /v1/search/structured specifically, I don't think it's necessarily all roses: the query /v1/search/structured?neighbourhood=Chelsea used to return Chelsea, London, England, United Kingdom first and now returns Chelsea Heights, Atlantic City, NJ, USA first.

This is kinda what I thought we wanted (because the USA result has a higher population), but upon reflection I don't think it's the behaviour we want from the /v1/search* endpoints. I think for those we want to rank exact matches higher, because the user asked for Chelsea, not Chelsea%.

My current thesis:

"field length is an important tool for scoring exact matches better" but also "autocomplete by nature doesn't always favour exact matches and so maybe field length is less/not important there"

I've pushed a second commit which only sets norms: false on ngram fields; let's see what that looks like.
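
As a rough illustration of the difference between the two approaches (norms disabled on every field versus only on ngram fields), here is a minimal sketch that operates on an Elasticsearch-style `properties` dict. The field names, analyzer names, and the rule "an analyzer whose name contains `gram` marks an ngram field" are all hypothetical and not taken from the actual schema; only the `norms: false` mapping parameter itself is real.

```python
import copy


def disable_norms(properties, only_ngram=True):
    """Return a copy of `properties` with norms disabled on text fields.

    A field is treated as an ngram field when its analyzer name contains
    "gram" (a naming convention assumed purely for this sketch).
    Nested properties and multi-fields are not handled.
    """
    result = copy.deepcopy(properties)
    for field in result.values():
        if field.get("type") != "text":
            continue  # norms only matter for analysed text fields here
        is_ngram = "gram" in field.get("analyzer", "").lower()
        if is_ngram or not only_ngram:
            field["norms"] = False
    return result


# Hypothetical mapping: names are made up for illustration.
mapping = {
    "name_full": {"type": "text", "analyzer": "full_token_analyzer"},
    "name_ngram": {"type": "text", "analyzer": "edge_ngram_analyzer"},
    "layer": {"type": "keyword"},
}

all_fields = disable_norms(mapping, only_ngram=False)  # norms off everywhere
ngram_only = disable_norms(mapping, only_ngram=True)   # norms off on ngram fields only
```

In the `ngram_only` variant the full-token field keeps its norms, so field length still contributes to scoring wherever that field is queried.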

@missinglink (Member, Author) commented Feb 25, 2021

This is one more screenshot of the dev build with norms=false on all fields; the query is /v1/autocomplete?text=statue of liberty:

Screenshot 2021-02-25 at 22 55 35

@missinglink (Member, Author)

I put the newer build on dev (this is the build which only disabled norms on the ngram fields, not the other ones) and there's no noticeable difference from master.

This is pretty much what I was suspecting because the ngram indices are usually only used for the last token entered, so the 'damage has been done' already by that point.
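
A toy sketch of the split described above, assuming that every token except the last is matched as a complete word and only the final token is matched against the ngram index. The function name and the whitespace tokenisation are hypothetical simplifications, not the real query logic.

```python
def split_autocomplete_input(text):
    """Split input into (completed tokens, final partial token).

    Assumes simple whitespace tokenisation; the last token is the one the
    user may still be typing, so it is the only one sent to the ngram index.
    """
    tokens = text.split()
    if not tokens:
        return [], None
    return tokens[:-1], tokens[-1]


completed, partial = split_autocomplete_input("statue of liberty")
print("matched as complete words (norms still apply):", completed)
print("matched against the ngram index (norms disabled):", partial)
```

Under this assumption, most of a multi-word query is still scored on fields that keep their norms, which is consistent with the ngram-only build showing no noticeable difference.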
