
Make to_ascii_lowercase optional #63

Open
technic opened this issue Nov 18, 2019 · 4 comments

Comments

@technic

technic commented Nov 18, 2019

Hi, thanks for the cool crate!

Could you remove to_ascii_lowercase, or make it optional? I think such pre-processing should be done on the library client side, since it is simple (`.map(|doc| doc.to_ascii_lowercase())`) and is not required for the main, heavier tokenization fitting and transform logic. I would prefer to call it myself when needed.
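The client-side preprocessing suggested here can be sketched as follows (the document collection and the `lowercase_docs` helper are hypothetical; only the `.map` call mirrors the comment):

```rust
// Lowercase each document on the client side before handing the
// collection to the vectorizer, instead of relying on the crate
// to do it internally.
fn lowercase_docs(documents: Vec<String>) -> Vec<String> {
    documents
        .into_iter()
        .map(|doc| doc.to_ascii_lowercase())
        .collect()
}

fn main() {
    let docs = vec!["The Quick Brown Fox".to_string()];
    let lowered = lowercase_docs(docs);
    assert_eq!(lowered[0], "the quick brown fox");
}
```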

@rth
Owner

rth commented Nov 19, 2019

I agree it should definitely be optional. CountVectorizerParams already has a parameter for it, but currently it's not used.

I think such pre-processing should be done on the library client side

Well, the to_ascii_lowercase used for now is indeed fast, but proper Unicode lowercasing with str::to_lowercase is significantly slower (rust-lang/rust#26244 (comment)) and could also benefit from being run in that parallel pipeline.
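A minimal illustration of the difference between the two (not a benchmark): `to_ascii_lowercase` only folds ASCII letters and leaves other characters untouched, while `str::to_lowercase` applies full Unicode case mapping:

```rust
fn main() {
    let s = "ÉTÉ"; // French "summer", with non-ASCII capitals

    // ASCII-only folding leaves 'É' untouched:
    assert_eq!(s.to_ascii_lowercase(), "ÉtÉ");

    // Full Unicode lowercasing handles it, at a higher per-character cost:
    assert_eq!(s.to_lowercase(), "été");
}
```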

@technic
Author

technic commented Nov 21, 2019

Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this itself due to its &str -> &str signature. Actually, it may be useful for the tokenizer to hold internal state, like a buffer which can be shared between two iterator instances. But for that you have to make tokenize take &mut self as its first argument, and create a new tokenizer instance in each thread.
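One way the optional lambda could look (a sketch, not the crate's actual API; `SimpleTokenizer` and its field are hypothetical): the closure returns an owned `String`, which sidesteps the `&str -> &str` constraint, since a borrowed return cannot point at newly allocated text:

```rust
// Hypothetical tokenizer carrying an optional preprocessing closure.
struct SimpleTokenizer {
    preprocess: Option<Box<dyn Fn(&str) -> String>>,
}

impl SimpleTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        let owned; // holds the preprocessed copy, if any
        let text = match &self.preprocess {
            Some(f) => {
                owned = f(text);
                owned.as_str()
            }
            None => text, // no closure: zero-copy pass-through
        };
        text.split_whitespace().map(str::to_string).collect()
    }
}

fn main() {
    let tok = SimpleTokenizer {
        preprocess: Some(Box::new(|s| s.to_ascii_lowercase())),
    };
    assert_eq!(tok.tokenize("Hello World"), vec!["hello", "world"]);
}
```

Note the `None` branch still avoids the copy, so callers who do their own preprocessing pay nothing extra.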

@rth
Owner

rth commented Nov 21, 2019

Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this itself due to its &str -> &str signature.

A PR would be welcome. Initially it was &str to avoid memory copies, but realistically there may be no way around a copy with a somewhat generic API.

Actually, it may be useful for the tokenizer to hold internal state, like a buffer which can be shared between two iterator instances. But for that you have to make tokenize take &mut self as its first argument, and create a new tokenizer instance in each thread.

What's the use case for internal state in tokenizers? RegexpTokenizer does have internal state inside Regex (and creating that is slow), but I would still think that Tokenizer.tokenize shouldn't change the internal state in general.
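The buffered design under discussion might look like this (names are hypothetical): `tokenize` takes `&mut self` so it can reuse a scratch allocation across calls, which in turn means parallel use needs one instance per thread:

```rust
use std::thread;

// Hypothetical tokenizer reusing a scratch buffer across calls;
// tokenize needs &mut self, so each thread gets its own instance.
#[derive(Clone, Default)]
struct BufferedTokenizer {
    buf: String, // reused allocation for the lowercased copy
}

impl BufferedTokenizer {
    fn tokenize(&mut self, text: &str) -> Vec<String> {
        self.buf.clear(); // keep capacity, drop old contents
        self.buf.push_str(&text.to_ascii_lowercase());
        self.buf.split_whitespace().map(str::to_string).collect()
    }
}

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|i| {
            let mut tok = BufferedTokenizer::default(); // fresh per thread
            thread::spawn(move || tok.tokenize(&format!("Doc Number {i}")))
        })
        .collect();
    for h in handles {
        println!("{:?}", h.join().unwrap());
    }
}
```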

@technic
Author

technic commented Nov 22, 2019

RegexpTokenizer does have an internal state inside Regex

I think Regex uses RefCell internally to maintain some cache and hide the mutability from the caller. Well, that is the other option.
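The interior-mutability pattern described here can be sketched with a toy cache (this is an illustration of the pattern, not what the regex crate actually does internally; note `RefCell` is not `Sync`, so a thread-safe variant would need a `Mutex` or a per-thread pool):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Tokenizer that keeps a &self signature while mutating a cache internally.
struct CachingTokenizer {
    cache: RefCell<HashMap<String, Vec<String>>>,
}

impl CachingTokenizer {
    fn new() -> Self {
        Self { cache: RefCell::new(HashMap::new()) }
    }

    fn tokenize(&self, text: &str) -> Vec<String> {
        if let Some(hit) = self.cache.borrow().get(text) {
            return hit.clone(); // cache hit: skip retokenization
        }
        let toks: Vec<String> =
            text.split_whitespace().map(str::to_string).collect();
        self.cache.borrow_mut().insert(text.to_string(), toks.clone());
        toks
    }
}

fn main() {
    let tok = CachingTokenizer::new();
    assert_eq!(tok.tokenize("a b"), vec!["a", "b"]);
    assert_eq!(tok.tokenize("a b"), vec!["a", "b"]); // second call hits the cache
}
```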
