
Make to_ascii_lowercase optional #63

Open
technic opened this issue Nov 18, 2019 · 4 comments

Comments

@technic

technic commented Nov 18, 2019

Hi, thanks for the cool crate!

Could you remove to_ascii_lowercase, or make it optional? I think such pre-processing should be done on the library client side, since it is simple (`.map(|doc| doc.to_ascii_lowercase())`) and is not required for the main, heavier tokenization fitting and transform logic. I would prefer to call it myself when needed.
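The client-side preprocessing suggested here can be sketched as follows (the document collection and the `lowercase_docs` helper are hypothetical; only the `.map` call mirrors the comment):

```rust
// Lowercase each document on the client side before handing the
// collection to the vectorizer, instead of relying on the crate
// to do it internally.
fn lowercase_docs(documents: Vec<String>) -> Vec<String> {
    documents
        .into_iter()
        .map(|doc| doc.to_ascii_lowercase())
        .collect()
}

fn main() {
    let docs = vec!["The Quick Brown Fox".to_string()];
    let lowered = lowercase_docs(docs);
    assert_eq!(lowered[0], "the quick brown fox");
}
```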

@rth
Owner

rth commented Nov 19, 2019

I agree it should definitely be optional. CountVectorizerParams already has a parameter for it, but currently it's not used.

I think such pre-processing should be done on the library client side

Well, the to_ascii_lowercase used for now is indeed fast, but proper Unicode lowercasing with str::to_lowercase is significantly slower (rust-lang/rust#26244 (comment)) and could also benefit from being run in that parallel pipeline.
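A minimal illustration of the difference between the two (not a benchmark): `to_ascii_lowercase` only folds ASCII letters and leaves other characters untouched, while `str::to_lowercase` applies full Unicode case mapping:

```rust
fn main() {
    let s = "ÉTÉ"; // French "summer", with non-ASCII capitals

    // ASCII-only folding leaves 'É' untouched:
    assert_eq!(s.to_ascii_lowercase(), "ÉtÉ");

    // Full Unicode lowercasing handles it, at a higher per-character cost:
    assert_eq!(s.to_lowercase(), "été");
}
```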

@technic
Author

technic commented Nov 21, 2019

Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this itself due to its &str -> &str signature. Actually, it may be useful for the tokenizer to hold internal state, like a buffer which can be shared between two iterator instances. But for that you have to make tokenize take &mut self as its first argument, and create a new tokenizer instance in each thread.
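One way the optional lambda could look (a sketch, not the crate's actual API; `SimpleTokenizer` and its field are hypothetical): the closure returns an owned `String`, which sidesteps the `&str -> &str` constraint, since a borrowed return cannot point at newly allocated text:

```rust
// Hypothetical tokenizer carrying an optional preprocessing closure.
struct SimpleTokenizer {
    preprocess: Option<Box<dyn Fn(&str) -> String>>,
}

impl SimpleTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        let owned; // holds the preprocessed copy, if any
        let text = match &self.preprocess {
            Some(f) => {
                owned = f(text);
                owned.as_str()
            }
            None => text, // no closure: zero-copy pass-through
        };
        text.split_whitespace().map(str::to_string).collect()
    }
}

fn main() {
    let tok = SimpleTokenizer {
        preprocess: Some(Box::new(|s| s.to_ascii_lowercase())),
    };
    assert_eq!(tok.tokenize("Hello World"), vec!["hello", "world"]);
}
```

Note the `None` branch still avoids the copy, so callers who do their own preprocessing pay nothing extra.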

@rth
Owner

rth commented Nov 21, 2019

Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this itself due to its &str -> &str signature.

A PR would be welcome. Initially it was &str to avoid memory copies, but realistically there may be no way around a copy with a somewhat generic API.

Actually, it may be useful for the tokenizer to hold internal state, like a buffer which can be shared between two iterator instances. But for that you have to make tokenize take &mut self as its first argument, and create a new tokenizer instance in each thread.

What's the use case for internal state in tokenizers? RegexpTokenizer does have internal state inside Regex (and creating that is slow), but I would still think that Tokenizer.tokenize shouldn't change the internal state in general.
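The buffered design under discussion might look like this (names are hypothetical): `tokenize` takes `&mut self` so it can reuse a scratch allocation across calls, which in turn means parallel use needs one instance per thread:

```rust
use std::thread;

// Hypothetical tokenizer reusing a scratch buffer across calls;
// tokenize needs &mut self, so each thread gets its own instance.
#[derive(Clone, Default)]
struct BufferedTokenizer {
    buf: String, // reused allocation for the lowercased copy
}

impl BufferedTokenizer {
    fn tokenize(&mut self, text: &str) -> Vec<String> {
        self.buf.clear(); // keep capacity, drop old contents
        self.buf.push_str(&text.to_ascii_lowercase());
        self.buf.split_whitespace().map(str::to_string).collect()
    }
}

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|i| {
            let mut tok = BufferedTokenizer::default(); // fresh per thread
            thread::spawn(move || tok.tokenize(&format!("Doc Number {i}")))
        })
        .collect();
    for h in handles {
        println!("{:?}", h.join().unwrap());
    }
}
```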

@technic
Author

technic commented Nov 22, 2019

RegexpTokenizer does have an internal state inside Regex

I think Regex uses RefCell internally to maintain some cache and hide the mutability from the caller. Well, that is the other option.
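The interior-mutability pattern described here can be sketched with a toy cache (this is an illustration of the pattern, not what the regex crate actually does internally; note `RefCell` is not `Sync`, so a thread-safe variant would need a `Mutex` or a per-thread pool):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Tokenizer that keeps a &self signature while mutating a cache internally.
struct CachingTokenizer {
    cache: RefCell<HashMap<String, Vec<String>>>,
}

impl CachingTokenizer {
    fn new() -> Self {
        Self { cache: RefCell::new(HashMap::new()) }
    }

    fn tokenize(&self, text: &str) -> Vec<String> {
        if let Some(hit) = self.cache.borrow().get(text) {
            return hit.clone(); // cache hit: skip retokenization
        }
        let toks: Vec<String> =
            text.split_whitespace().map(str::to_string).collect();
        self.cache.borrow_mut().insert(text.to_string(), toks.clone());
        toks
    }
}

fn main() {
    let tok = CachingTokenizer::new();
    assert_eq!(tok.tokenize("a b"), vec!["a", "b"]);
    assert_eq!(tok.tokenize("a b"), vec!["a", "b"]); // second call hits the cache
}
```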
