fix: PyPDFToDocument correctly serializes custom converters, deprecate DefaultConverter #8430

Merged
shadeMe merged 2 commits into deepset-ai:main from fix/pypdf-converter-serde
Oct 1, 2024

Collaborator

@shadeMe shadeMe commented Oct 1, 2024

Proposed Changes:

The PyPDFToDocument component was incorrectly serializing its default converter. This PR fixes the serialization and deprecates DefaultConverter.

Utility methods were added to aid the serialization and deserialization (serde) of custom classes that implement from_dict and to_dict methods.
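As a rough illustration of what such serde helpers can look like, here is a minimal sketch. It is not the code from this PR: the function bodies, the `type`/`data` envelope keys, and the error messages are all assumptions made for the example. It only handles classes defined at module top level.

```python
import importlib
from typing import Any, Dict


def serialize_class_instance(obj: Any) -> Dict[str, Any]:
    """Wrap obj.to_dict() in an envelope that records the class import path."""
    if not hasattr(obj, "to_dict"):
        raise TypeError(f"{type(obj).__name__} does not implement to_dict()")
    cls = type(obj)
    # Assumed envelope layout: dotted class path plus the class's own payload.
    return {"type": f"{cls.__module__}.{cls.__name__}", "data": obj.to_dict()}


def deserialize_class_instance(payload: Dict[str, Any]) -> Any:
    """Import the recorded class and rebuild the instance through from_dict()."""
    for key in ("type", "data"):
        if key not in payload:
            raise ValueError(f"Missing '{key}' in serialized class instance")
    module_name, _, class_name = payload["type"].rpartition(".")
    try:
        cls = getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError) as exc:
        raise ValueError(f"Could not import class '{payload['type']}'") from exc
    if not hasattr(cls, "from_dict"):
        raise TypeError(f"{class_name} does not implement from_dict()")
    return cls.from_dict(payload["data"])
```

Recording the class path alongside the payload is what lets a component restore a user-supplied converter without knowing its type in advance.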

How did you test it?

Unit tests

Notes for the reviewer

  • This is the follow-up PR to this one.
  • If you have better names for the utility classes, I'm all ears.

@shadeMe shadeMe requested review from a team as code owners October 1, 2024 11:01
@shadeMe shadeMe requested review from dfokina and Amnah199 and removed request for a team October 1, 2024 11:01
@github-actions github-actions bot added type:documentation Improvements on the docs topic:tests and removed type:documentation Improvements on the docs labels Oct 1, 2024
@shadeMe shadeMe removed the request for review from Amnah199 October 1, 2024 11:02
@coveralls
Collaborator

coveralls commented Oct 1, 2024

Pull Request Test Coverage Report for Build 11127151741

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 11 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 90.248%

Files with Coverage Reduction | New Missed Lines | %
components/converters/pypdf.py | 11 | 83.58%

Totals
Change from base Build 11122967174: -0.03%
Covered Lines: 7413
Relevant Lines: 8214

💛 - Coveralls

@wochinge
Contributor

wochinge commented Oct 1, 2024

Thanks, @shadeMe! The more detailed errors will also be a great help to users.

@github-actions github-actions bot added the type:documentation Improvements on the docs label Oct 1, 2024
@shadeMe shadeMe requested a review from anakin87 October 1, 2024 12:43
Member

@julian-risch julian-risch left a comment

Thanks for working on this so quickly and thoroughly! The changes look very good to me, and so does the naming of the util methods. My only suggestion is to also add tests for the new util methods auto_serialize_class_instance and auto_deserialize_class_instance. What do you think, @silvanocerza?

I would also suggest updating the class docstring to "If no converter is provided, uses a default text extraction converter implementation." or something like that. This makes it clearer that there is no default converter class anymore.
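To make the docstring suggestion concrete, here is a minimal sketch of a component that falls back to built-in extraction when no converter is passed. `PdfToTextComponent`, its `run` signature, and the list-of-page-strings input are hypothetical stand-ins for illustration, not the actual PyPDFToDocument code.

```python
from typing import Any, List, Optional


class PdfToTextComponent:
    """
    Hypothetical stand-in for a PDF converter component.

    If no converter is provided, uses a default text extraction
    converter implementation.
    """

    def __init__(self, converter: Optional[Any] = None):
        # None means "use the built-in extraction below"; no separate
        # default converter class is needed.
        self.converter = converter

    def _default_convert(self, pages: List[str]) -> str:
        # Assumed default behaviour: join page texts with form feeds.
        return "\f".join(pages)

    def run(self, pages: List[str]) -> str:
        if self.converter is not None:
            return self.converter.convert(pages)
        return self._default_convert(pages)
```

Keeping the default inside the component means there is no extra object whose serialization can go wrong.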

@anakin87
Member

anakin87 commented Oct 1, 2024

@shadeMe I'm probably missing some context.

I understand that there are some serialization issues.

But why did we decide to deprecate DefaultConverter?

@shadeMe
Collaborator Author

shadeMe commented Oct 1, 2024

> @shadeMe I'm probably missing some context.
>
> I understand that there are some serialization issues.
>
> But why did we decide to deprecate DefaultConverter?

Because it honestly didn't have a good reason to exist outside the component, which was the primary reason the serialization bug crept in.

@silvanocerza
Contributor

silvanocerza commented Oct 1, 2024

> Thanks for working on this so quickly and thoroughly! The changes look very good to me. The naming of the util methods too. Only suggestion I have is to add tests for the new util methods auto_serialize_class_instance and auto_deserialize_class_instance too. What do you think @silvanocerza ?

@julian-risch My only concern about the methods is the auto_ prefix. @shadeMe and I talked a bit about it, and the main worry is that without the prefix the names would be too generic. Given that we already have other methods to handle serde, it might get confusing. I would still argue for removing it, though.

> But why did we decide to deprecate DefaultConverter?

@anakin87 Mainly the assumption that converters don't need state most of the time, so they can be simple functions. I'm unsure about that, though: I kinda remember that converters can be configured, and if we treat them as callables we'd lose that possibility.

I remember that you briefly worked on this to change the converter backend so I thought you'd know more if that's the case or not.
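The trade-off between plain callables and configurable converters can be sketched as follows. `join_pages` and `SeparatorConverter` are hypothetical names invented for this example, and pages are modelled as plain strings rather than real PDF page objects.

```python
from typing import Any, Dict, List


def join_pages(pages: List[str]) -> str:
    """Stateless converter: a plain function is enough when nothing is configurable."""
    return "\n".join(pages)


class SeparatorConverter:
    """Configurable converter: the separator is state that must survive serialization."""

    def __init__(self, separator: str = "\n"):
        self.separator = separator

    def convert(self, pages: List[str]) -> str:
        return self.separator.join(pages)

    def to_dict(self) -> Dict[str, Any]:
        return {"separator": self.separator}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SeparatorConverter":
        return cls(**data)


# The configuration round-trips through a plain dict.
original = SeparatorConverter(separator=" | ")
restored = SeparatorConverter.from_dict(original.to_dict())
```

A bare function has nowhere to keep the separator, so configuration like this would have to live somewhere else if converters were treated as callables only.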

@anakin87
Member

anakin87 commented Oct 1, 2024

Now I better understand the motivation and am OK with deprecating/removing DefaultConverter.
(The only reason it might be useful is as an example of implementing a PyPDFConverter, but this can also be inferred from the Protocol.)
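As a sketch of how a custom converter can be inferred from a protocol alone, consider the example below. `PageConverter` is a hypothetical protocol written for illustration; its method names and signatures are assumptions and may not match the actual PyPDFConverter Protocol in Haystack.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class PageConverter(Protocol):
    """Hypothetical protocol: what a component expects from a converter."""

    def convert(self, pages: List[str]) -> str: ...

    def to_dict(self) -> Dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PageConverter": ...


class FirstPageOnly:
    """Satisfies the protocol structurally; no inheritance is needed."""

    def convert(self, pages: List[str]) -> str:
        return pages[0] if pages else ""

    def to_dict(self) -> Dict[str, Any]:
        return {}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "FirstPageOnly":
        return cls()


# runtime_checkable only verifies that the members exist, not their signatures.
is_converter = isinstance(FirstPageOnly(), PageConverter)
```

The protocol's method list doubles as documentation, which is why a reference implementation is not strictly needed as an example.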

In general, this component has always seemed a bit tricky to me from the UX point of view, and I would be happy if we improve it in the future.

Member

@julian-risch julian-risch left a comment

LGTM! 👍

@shadeMe
Collaborator Author

shadeMe commented Oct 1, 2024

Test failure is unrelated; merging.

@shadeMe shadeMe merged commit ee89f6a into deepset-ai:main Oct 1, 2024
17 of 18 checks passed
@shadeMe shadeMe deleted the fix/pypdf-converter-serde branch October 1, 2024 14:35
julian-risch pushed a commit that referenced this pull request Oct 1, 2024
…ate `DefaultConverter` (#8430)

* fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter`

* Remove `auto` prefix from serde util function names, add unit tests