Revert _local_exists to mirror fsspec behavior #4186

Open
deepyaman wants to merge 1 commit into main
Conversation

@deepyaman deepyaman (Member) commented Sep 23, 2024

Description

The current implementation of _local_exists differs from the exists method exposed by fsspec implementations in that it also checks parent directories. This prevents the code path from ever reaching NotADirectoryError (e.g. if you try saving a versioned dataset on top of an unversioned one), resulting in an error message that is inconsistent with the one raised by the fsspec-based dataset implementations.
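
For reference, a minimal sketch of the behavioral difference, assuming the pre-revert check treats a file at any parent of the path as "existing" (the exact code in kedro/io/core.py may differ slightly):

```python
from pathlib import Path


# Pre-revert behavior (assumed sketch): a plain *file* sitting at any
# parent of the path also counts as "exists", so a versioned save fails
# with a generic DataSetError before the filesystem can raise
# NotADirectoryError.
def _local_exists_old(filepath: str) -> bool:
    path = Path(filepath)
    return path.exists() or any(parent.is_file() for parent in path.parents)


# Post-revert behavior, mirroring fsspec's LocalFileSystem.exists:
# only the path itself is checked.
def _local_exists(filepath: str) -> bool:
    return Path(filepath).exists()
```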

Development notes

This essentially reverts a very old change in b0581f1, which predates fsspec use.

Existing dataset implementations (usually?) define their own exists_function using fsspec, so this doesn't normally get hit.
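
As a rough illustration of that pattern, an fsspec-backed dataset typically hands the filesystem's own exists to the base class, so the fallback _local_exists is never consulted. The class name, storage format, and method bodies below are illustrative only, not copied from kedro-datasets:

```python
from __future__ import annotations

import posixpath
from pathlib import PurePosixPath

import fsspec
from kedro.io.core import AbstractVersionedDataset, Version


class MyTextDataset(AbstractVersionedDataset):
    """Illustrative versioned dataset that stores plain text via fsspec."""

    def __init__(self, filepath: str, version: Version | None = None):
        protocol, path = fsspec.core.split_protocol(filepath)
        self._protocol = protocol or "file"
        self._fs = fsspec.filesystem(self._protocol)
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,  # fsspec's exists, not _local_exists
            glob_function=self._fs.glob,
        )

    def _load(self) -> str:
        with self._fs.open(str(self._get_load_path()), "r") as f:
            return f.read()

    def _save(self, data: str) -> None:
        save_path = str(self._get_save_path())
        # Create the per-version directory before writing.
        self._fs.makedirs(posixpath.dirname(save_path), exist_ok=True)
        with self._fs.open(save_path, "w") as f:
            f.write(data)

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath), "version": self._version}
```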

I ran into this while trying to implement basic versioning for ibis.FileDataset, because test_versioning_existing_dataset was failing with a different error message than expected (due to the NotADirectoryError behavior noted above).

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@deepyaman deepyaman marked this pull request as ready for review September 23, 2024 21:39
@noklam noklam (Contributor) commented Sep 24, 2024

Thanks for catching this, do we know why it was implemented in that way? That surely looks a bit weird to me and the change looks correct.

@deepyaman deepyaman (Member Author)
Thanks for catching this, do we know why it was implemented in that way? That surely looks a bit weird to me and the change looks correct.

I can't really figure it out beyond the commit I linked above; maybe you can figure it out from the QB-internal Jira, if KED-715 still exists?

@deepyaman deepyaman enabled auto-merge (squash) September 24, 2024 16:53
@noklam noklam (Contributor) commented Sep 30, 2024

Versioning failed if the same dataset name already exists in the file path directory

Acceptance criteria

  • Documentation about what happens when you switch on versioning after you've run the pipeline, let users know that it will throw an error because the file path already exists
  • [If possible] Better error message in AbstractDataSet to show you how to resolve the problem i.e. delete the file

That's as far as I can dig.

@noklam noklam (Contributor) commented Sep 30, 2024

Found a bit more here:

Motivation and Context
Original issue: quantumblack/asset-kernel-ai#683

What has been done:

  • _local_exists function in kedro/io/core.py now also checks for the existence of any files within filepath parents. This allows us to throw a DataSetError if versioned dataset attempts to save after unversioned one has already been saved.
  • I did not change the error message since it applies not only to local datasets, but to S3 too, for example. S3 doesn't have this issue - it allows having a file and a 'folder' with the same name. WDYT?

  • This allows us to throw a DataSetError if versioned dataset attempts to save after unversioned one has already been saved.

I think this is the main reason, but as far as I remember this doesn't work nicely when a versioned dataset is blocked by an existing unversioned dataset.
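
For context, a minimal local-filesystem reproduction of that blocking scenario (the paths are made up for illustration): the unversioned save leaves a plain file at the dataset's filepath, and the subsequent versioned save needs a directory with the same name, which the OS rejects with NotADirectoryError; with the old parent check, the save instead failed earlier with the generic DataSetError quoted above.

```python
from pathlib import Path

# Unversioned save: writes a plain file at the dataset's filepath.
unversioned = Path("data/out.csv")
unversioned.parent.mkdir(parents=True, exist_ok=True)
unversioned.write_text("a,b\n1,2\n")

# Versioned save: needs data/out.csv to be a *directory* that holds
# per-version subdirectories, which the existing file makes impossible.
versioned = Path("data/out.csv/2024-09-23T00.00.00.000Z/out.csv")
try:
    versioned.parent.mkdir(parents=True, exist_ok=True)
    versioned.write_text("a,b\n1,2\n")
except NotADirectoryError as exc:
    print(f"versioned save blocked: {exc}")
```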
