Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Ignore non-conforming commit files in Delta log directory #3552

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

vkorukanti
Copy link
Collaborator

Description

Support for coordinated commit introduced an issue where a Delta file in _delta_log that doesn't conform to the expected file name format (%020d.json) is considered part of the Delta history.

We use the following method to test whether a file is a commit file.

Before coordinated commit support:

  public static boolean isCommitFile(String filePath) {
    String fileName = ..
    return Pattern.compile("\\d+\\.json").matcher(fileName).matches();
  }

After the coordinated commit support:

  public static boolean isCommitFile(String filePath) {
    String fileName = ..
    return Pattern.compile("\\d+\\.json").matcher(fileName).matches() || // backfilled commit file
        Pattern.compile("(\\d+)\\.([^\\.]+)\\.json").matcher(fileName).matches() // un-backfilled commit file
  }

For example, if the _delta_log has a file with name 0000x.uuid.json, it will be considered to be part of the delta file, but it is not. The commit files of format %20d.uuid.json are considered part of the table history only if the parent directory name is _commit.

Update the above method as follows (this is in line with the Delta-Spark)

  public static boolean isCommitFile(String filePath) {
    String fileName = ...
    String parentDirName = ...
    return Pattern.compile("\\d+\\.json").matcher(fileName).matches() || // backfilled commit file
        (parentDirName == "_commit" &&
            Pattern.compile("(\\d+)\\.([^\\.]+)\\.json").matcher(fileName).matches()) // un-backfilled commit file
  }

How was this patch tested?

Unit test that reproes the issue without the fix

assert(logSegment.isPresent)
checkLogSegment(
logSegment.get(),
expectedVersion = 14,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without the fix, this returns the latest snapshot version as 19.

public static boolean isCommitFile(String filePathStr) {
Path filePath = new Path(filePathStr);
String fileName = filePath.getName();
String fileParentName = filePath.getParent().getName();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Can we fetch the fileParentName only when it is needed? These Path/URI operations are slow and may cause slowness when applied on large number of objects.

if (DELTA_FILE_PATTERN.matcher(fileName).matches()) {
  return true
} else {
  String fileParentName = filePath.getParent().getName();
  return COMMIT_SUBDIR.equals(fileParentName)
    && UUID_DELTA_FILE_REGEX.matcher(fileName).matches());
}

Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of tests are failing as not a commit file?

long version = -1;
if (FileNames.isCommitFile(fileName)) {
if (FileNames.isCommitFile(filePath)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create #3586 to unify these utility APIs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants