Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add normalizer for keyword fields #415

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

add normalizer for keyword fields #415

wants to merge 1 commit into from

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Dec 13, 2019

cherry-picked from #412 and based on #414.

This PR adds a normalizer which is the nearest thing to an analyzer for keyword fields.
more info here: elastic/elasticsearch#18064

This allows us to perform some basic normalization to fields such as layer, source and category, forcing them to be lowercased and doing some ICU normalization.

One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable despite there being a test which was covering this behaviour.

Note that not all keyword fields should have a normalizer specified, for instance, verbatim fields such as bounding_box and addendum are probably best left with the default null normalizer.

Normalizers are applied both at index-time and at query-time.

I would like to add some additional filters such as trim and unique but they are not available until version 6.4 of elasticsearch and so will come in a subsequent PR which can be merged independently of this.

@Joxit
Copy link
Member

Joxit commented Dec 13, 2019

Is this normalizer necessary? The API is already doing the lowercase transformation for layer and source (category too ?) and there are also some check for ids 🤔

The code looks good 😄

@missinglink
Copy link
Member Author

Good point, I was more thinking about trying to prevent bugs by ensuring the tokens were normalized.
This normalization needs to be done both at index-time and query-time and I could see bugs easily being introduced, especially in the pelias/api code.

It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.

@missinglink
Copy link
Member Author

missinglink commented Dec 16, 2019

I found this old test case super confusing because it's asserting that the keyword field is case-sensitive even though it should never be the case 🤷‍♂

[admission of guilt] it was written by me 😝

@Joxit
Copy link
Member

Joxit commented Dec 16, 2019

Ha ha ha, it's ok, 4 years ago, the statute of limitations has passed :p

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants