fix: 1.x - nltk upgrade, use `nltk.download('punkt_tab')` #8256

vblagoje · 2024-08-20T09:43:41Z

We needed to update a few more dependencies to get a green CI.
We needed to skip the nltk preprocessing tests that load pickle models (loading them seems to be forbidden in nltk 3.9).
Fixes "Upgrade Haystack 1.x to NLTK 3.9" #8238.

vblagoje · 2024-08-20T11:53:50Z

I've managed to get the CI to pass. Note the changes in dependencies: the CI could not pass without them, and we need to pin a few more dependencies, which is OK.

The nltk tests that were failing are related to the inability to load old models from pickle files, which I think is now forbidden in nltk 3.9.x.

I'll upgrade this PR draft into a PR
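
For illustration, skipping the pickle-dependent tests could look roughly like the sketch below. It uses the standard-library `unittest` only to stay self-contained (the real test suite may well use a different runner), and the class name, test names, and skip reason are all hypothetical.

```python
import unittest

# Hypothetical reason string; the wording is an assumption based on this thread.
PICKLE_SKIP_REASON = "nltk 3.9 appears to refuse loading old pickled tokenizer models"


class PreProcessorTokenizerTests(unittest.TestCase):
    """Hypothetical test case; names do not match the real test suite."""

    @unittest.skip(PICKLE_SKIP_REASON)
    def test_split_with_custom_pickle_model(self):
        # Would load a pickled tokenizer model, so it is skipped for now.
        self.fail("should never run while the skip marker is in place")

    def test_split_with_default_model(self):
        # Placeholder for a test that does not depend on pickle files.
        self.assertEqual(len("One. Two.".split(". ")), 2)


def run_suite() -> unittest.TestResult:
    """Run the test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PreProcessorTokenizerTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The skipped test stays in the file with an explicit reason, so it is easy to find again if custom model loading is ever reworked.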

julian-risch

Disabling custom tokenizers is a bigger limitation but for now it's our best option in my opinion. We don't want to re-write how the PreProcessor loads custom models now. Users can still choose to not upgrade to the next Haystack 1.26.x release.

anakin87 · 2024-08-20T14:15:31Z

I would make this limitation a bit more evident.

If we don't want to remove the `tokenizer_model_folder` parameter, we can log a clear warning.
Let's also add an upgrade entry in the release note.

vblagoje · 2024-08-29T08:16:28Z

> I would make this limitation a bit more evident.
>
> If we don't want to remove the `tokenizer_model_folder` parameter, we can log a clear warning.
>
> Let's also add an upgrade entry in the release note.

I opted for always None-ing `tokenizer_model_folder` and logging a warning that includes the resolution path. This way we don't have to touch the codebase much and we avoid unintended consequences. LMK if you have a better proposal @julian-risch @anakin87
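
A minimal sketch of that approach, not the actual diff: `PreProcessorSketch`, the logger name, and the warning text are hypothetical stand-ins, and only the parameter handling is shown.

```python
import logging
from typing import Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("preprocessor_sketch")


class PreProcessorSketch:
    """Hypothetical stand-in for the PreProcessor; everything else is omitted."""

    def __init__(self, language: str = "en", tokenizer_model_folder: Optional[str] = None):
        if tokenizer_model_folder is not None:
            # Tell the user the value is ignored and how to keep the old behaviour.
            logger.warning(
                "Custom NLTK tokenizer models are not supported with nltk>=3.9; "
                "ignoring tokenizer_model_folder=%r and using the default tokenizer. "
                "To keep using custom models, stay on an earlier Haystack 1.x release.",
                tokenizer_model_folder,
            )
        self.language = language
        # Always None, regardless of what the caller passed in.
        self.tokenizer_model_folder = None
```

Keeping the parameter in the signature means existing pipelines and configs still load, and the warning is the only visible change.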

anakin87

> I opted for always None-ing tokenizer_model_folder and logging the warning with resolution path.

I agree with your approach.

I left some comments to better understand...

anakin87

OK for me.

I would prefer that @julian-risch also take a look.

vblagoje · 2024-08-29T09:28:49Z

> OK for me.
>
> I would prefer that @julian-risch also take a look.

Makes sense 🙏

julian-risch

We should change https://github.com/deepset-ai/haystack/blob/nltk_update_exp1/haystack/nodes/preprocessor/preprocessor.py#L932 and https://github.com/deepset-ai/haystack/blob/nltk_update_exp1/haystack/nodes/preprocessor/preprocessor.py#L939 to use the following instead:

    from nltk.tokenize.punkt import PunktTokenizer
    tokenizer = PunktTokenizer(language_name)

Just like it is done here nltk/nltk@496515e
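
To make the suggestion concrete, here is a hedged sketch of loading the tokenizer this way. The `ISO_TO_NLTK` mapping, both helper names, and the fallback to English are assumptions made for illustration; only `PunktTokenizer(language_name)` and `nltk.download('punkt_tab')` come from this PR. NLTK is imported lazily so the module itself imports without it.

```python
from typing import Dict, List

# Hypothetical, deliberately small mapping from ISO 639-1 codes to NLTK language names.
ISO_TO_NLTK: Dict[str, str] = {"en": "english", "de": "german", "fr": "french"}


def nltk_language_name(iso_code: str, default: str = "english") -> str:
    """Return the NLTK language name for an ISO code, or the default if unknown."""
    return ISO_TO_NLTK.get(iso_code.lower(), default)


def split_sentences(text: str, iso_code: str = "en") -> List[str]:
    """Split text into sentences with the punkt_tab-based tokenizer (needs nltk>=3.9)."""
    # Lazy imports: only required when sentences are actually split.
    import nltk
    from nltk.tokenize.punkt import PunktTokenizer

    language_name = nltk_language_name(iso_code)
    try:
        tokenizer = PunktTokenizer(language_name)
    except LookupError:
        # punkt_tab data is not available locally yet: download it once and retry.
        nltk.download("punkt_tab")
        tokenizer = PunktTokenizer(language_name)
    return tokenizer.tokenize(text)
```

With NLTK installed, `split_sentences("Hello there. How are you?", "en")` should return the two sentences as a list; the first call may download the `punkt_tab` resource.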

This is also how I understand the first part of the comment by @sagarneeldubey in #8238 (comment).
You could reach out to them directly to understand what changes they made in their custom preprocessor component, and whether this PR can replace it.

vblagoje added 3 commits August 20, 2024 10:47

Use nltk.download('punkt_tab'), pin nltk>=3.9

b4071d2

Add reno note

119080d

Pin urllib3<2.0.0

dfb9bfa

github-actions bot added topic:preprocessing topic:dependencies topic:build/distribution labels Aug 20, 2024

vblagoje added 5 commits August 20, 2024 12:00

Exp with deps

f99f341

Updates

aa5b896

Updates

934310e

Pin python-pptx<=1.0

954eefc

Skip tests with old nltk pickle model files

d8c5dc9

github-actions bot added the topic:tests label Aug 20, 2024

vblagoje mentioned this pull request Aug 20, 2024

chore: 1.x - nltk upgrade, use nltk.download('punkt_tab') #8254

Closed

vblagoje changed the title ~~draft: Nltk update exp1~~ fix: 1.x - nltk upgrade, use nltk.download('punkt_tab') Aug 20, 2024

vblagoje marked this pull request as ready for review August 20, 2024 11:56

vblagoje requested review from a team as code owners August 20, 2024 11:56

vblagoje requested review from dfokina, Amnah199, anakin87, julian-risch and silvanocerza and removed request for a team and Amnah199 August 20, 2024 11:56

Update reno note

7c160cb

julian-risch reviewed Aug 20, 2024

silvanocerza removed their request for review August 22, 2024 07:58

vblagoje added 3 commits August 29, 2024 09:40

Ignore tokenizer_model_folder nltk parameter

3ced1ac

Add upgrade section for the release note

181e52d

mypy fixes

401344e

vblagoje requested a review from julian-risch August 29, 2024 08:16

anakin87 reviewed Aug 29, 2024

vblagoje and others added 2 commits August 29, 2024 10:57

Update pyproject.toml

ca15147

Co-authored-by: Stefano Fiorucci <[email protected]>

Update release notes

fb0abb6

anakin87 self-requested a review August 29, 2024 09:25

anakin87 approved these changes Aug 29, 2024

julian-risch requested changes Aug 29, 2024

Use PunktTokenizer instead of nltk.data.load

06399e8

julian-risch approved these changes Aug 29, 2024

vblagoje merged commit 8c95fab into v1.26.x Aug 29, 2024
57 checks passed

vblagoje deleted the nltk_update_exp1 branch August 29, 2024 13:31

vblagoje mentioned this pull request Aug 29, 2024

Upgrade Haystack 1.x to NLTK 3.9 #8238

Closed
