"dangling symbolic link" flakes after upgrading to Bazel 7 #20886

Closed
dbolduc opened this issue Jan 13, 2024 · 58 comments

@dbolduc

dbolduc commented Jan 13, 2024

Description of the bug:

googleapis/google-cloud-cpp#13444

After upgrading to Bazel 7, we have started seeing transient failures in our CI. These have all been from io_opentelemetry_cpp.

ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/src/common/BUILD:6:11: output 'external/io_opentelemetry_cpp/sdk/src/common/_virtual_includes/random/src/common/random.h' is a dangling symbolic link
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/src/trace/BUILD:6:11: output 'external/io_opentelemetry_cpp/sdk/src/trace/_virtual_includes/trace/src/trace/span.h' is a dangling symbolic link
...etc...
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/api/BUILD:13:11: Symlinking virtual headers for api failed: not all outputs were created or valid
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/BUILD:6:11: Symlinking virtual headers for headers failed: not all outputs were created or valid

My naive guess is that it has something to do with how that repo uses include_prefix: https://github.com/open-telemetry/opentelemetry-cpp/blob/c4f39f2be8109fd1a3e047677c09cf47954b92db/sdk/src/trace/BUILD#L10

Which category does this issue belong to?

External Dependency

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Not sure, but I can supply more logs and test solutions (within reason).

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

It probably has to do with "build without the bytes"

Have you found anything relevant by searching the web?

#19143 seems like a similar issue.

Any other information, logs, or outputs that you want to share?

No response

@tjgq
Contributor

tjgq commented Jan 15, 2024

Can you share the Bazel flags you're using?

We've seen a few symlink-related issues due to the --incompatible_sandbox_hermetic_tmp flag flip. Does setting --noincompatible_sandbox_hermetic_tmp change anything?

tjgq added the "more data needed" and "awaiting-user-response" labels Jan 15, 2024
sgowroji removed the "awaiting-user-response" label Jan 16, 2024
@sgowroji
Member

Hi @dbolduc, could you please take a look at the above comment?

@alevenberg

ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/api/BUILD:13:11: output 'external/io_opentelemetry_cpp/api/_virtual_includes/api/opentelemetry/version.h' is a dangling symbolic link
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/src/trace/BUILD:6:11: output 'external/io_opentelemetry_cpp/sdk/src/trace/_virtual_includes/trace/src/trace/span.h' is a dangling symbolic link
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/src/trace/BUILD:6:11: Symlinking virtual headers for trace failed: not all outputs were created or valid
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/api/BUILD:13:11: Symlinking virtual headers for api failed: not all outputs were created or valid
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/src/common/BUILD:6:11: output 'external/io_opentelemetry_cpp/sdk/src/common/_virtual_includes/random/src/common/random.h' is a dangling symbolic link
ERROR: /h/.cache/bazel/_bazel_root/eab0d61a99b6696edb3d2aff87b585e8/external/io_opentelemetry_cpp/sdk/src/common/BUILD:6:11: Symlinking virtual headers for random failed: not all outputs were created or valid
Analyzing: 859 targets (0 packages loaded, 12520 targets configured)

coverage-ci https://console.cloud.google.com/cloud-build/builds;region=us-east1/62bf2763-5a18-4983-87fb-47254671c10c?project=936212892354

@dbolduc
Author

dbolduc commented Jan 18, 2024

> Not sure, but I can supply more logs and test solutions (within reason).

I lied.

> Does setting --noincompatible_sandbox_hermetic_tmp change anything?

I reverted our builds to use Bazel 6.4.0 before testing this. 🤷

phst added a commit to phst/rules_elisp that referenced this issue Jan 21, 2024
This is primarily to work around bazelbuild/bazel#20886,
but also helps on Windows if symbolic links aren’t supported.
@moroten
Contributor

moroten commented Jan 24, 2024

#20408 (comment) reports that this is still a problem in Bazel 7.0.0. Although PR #19739 (which fixed #19143 in 6.4) mentions that HEAD was working at the time, before 7.0 was branched, 7.0.0 is still transiently broken.

@Gormo

Gormo commented Jan 27, 2024

This occurs quite frequently in our environment, and even more frequently when Bazel module support is enabled.
We are using remote execution and build-without-the-bytes, but applying the fix in #19739 does not seem to change anything.

I have tried to reproduce this in a minimal sandbox without success. My guess is that it is not related to remote execution, since it appears on targets like this:

cc_library(
    name = "foo",
    hdrs = glob(["**/*.h"]),
    srcs = glob(["**/*.c"]),
    strip_include_prefix = "include",
)

Here, strip_include_prefix triggers the symlink actions, which occasionally fail locally even though the target files must be available locally, since the glob found them in the repository.
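For context on where these symlinks end up, here is a rough Python sketch of how the `_virtual_includes` output path appears to be derived. The layout is inferred only from the error messages in this thread, and the function name `virtual_include_path` is hypothetical; this is not Bazel's actual implementation.

```python
from pathlib import PurePosixPath


def virtual_include_path(package, target, header, strip_include_prefix="", include_prefix=""):
    """Guess the _virtual_includes path of one header (illustrative only).

    `package` and `header` are relative to the repository root;
    `strip_include_prefix` is relative to the package.
    """
    # Path of the header inside its package.
    rel = PurePosixPath(header).relative_to(package)
    # Drop the stripped prefix so the header is included by a shorter path.
    rel = rel.relative_to(strip_include_prefix)
    return str(PurePosixPath(package, "_virtual_includes", target, include_prefix, rel))


# A target like the one above, declared in the root package of its repository.
print(virtual_include_path("", "foo", "include/foo.h", strip_include_prefix="include"))
```

Each such path is a symlink back to the original header, which is why one cc_library with strip_include_prefix or include_prefix produces a symlink action per header.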

This is a major issue for us and blocks our migration to Bazel modules.

Bazel version used: 7.0.2

@uhlajs

uhlajs commented Jan 30, 2024

Same when using rules_foreign_cc. But using --noincompatible_sandbox_hermetic_tmp as @tjgq suggests here fixed the issue.

@tjgq
Contributor

tjgq commented Jan 30, 2024

@uhlajs Is this with 7.0.0, or 7.0.2? The latter contains some fixes for --incompatible_sandbox_hermetic_tmp.

@uhlajs

uhlajs commented Jan 30, 2024

With 7.0.2.

@fmeum
Collaborator

fmeum commented Jan 30, 2024

@uhlajs rules_foreign_cc, when used with cmake, has some actual issues with hermetic tmp that I think come from cmake install not following symlinks. I would recommend filing an issue in the rules_foreign_cc repo if that's what you are running into - it's not a Bazel bug. Feel free to mention me on it.

@tjgq
Contributor

tjgq commented Jan 30, 2024

Thanks for rerouting this, @fmeum. Feel free to ping or reopen if it turns out to be a Bazel issue and not just a rules_foreign_cc one.

tjgq closed this as not planned Jan 30, 2024
@Gormo

Gormo commented Jan 30, 2024

@tjgq the dangling symbolic link issues we were seeing were not related to rules_foreign_cc and are definitely a Bazel issue. I don't think this issue should be closed since this needs to be fixed.

@moroten
Contributor

moroten commented Jan 30, 2024

@Gormo did not use cmake or rules_foreign_cc and he was using 7.0.2 (see his comment above). It is therefore also an issue in Bazel itself when using a native cc_library.

tjgq reopened this Jan 30, 2024
@tjgq
Contributor

tjgq commented Jan 30, 2024

Sorry, I misread the thread.

@moroten
Contributor

moroten commented Jan 30, 2024

@tjgq no problem. Thank you for reopening it.

@fmeum
Collaborator

fmeum commented Jan 30, 2024

@Gormo Since it appears to be difficult to produce a reproducer for this, could you perhaps try to bisect this down to the breaking Bazel commit using Bazelisk's --bisect feature? Does this still reproduce with hermetic tmp disabled?

@Gormo

Gormo commented Feb 12, 2024

An update from our side:
We have tried to reproduce this issue in a controlled manner, without success.
Observations:

  • Occurs more frequently with Bazel modules enabled.
  • Occurs more frequently on our CI machines, which use VMware with virtual disks.
  • Setting --noincompatible_sandbox_hermetic_tmp did not decrease the frequency of the flakes.
  • Disabling Skymeld decreases the failure rate significantly but does not remove the flakes completely.

Our current theory is that this is an internal race condition between threads that is triggered when I/O access is delayed on heavily loaded disks.

We tried to reproduce this locally by using a disk loader to simulate high I/O load, but unfortunately without success.

@fmeum
Collaborator

fmeum commented Feb 12, 2024

@Gormo Could you check whether the issue still occurs with 48ea3d2 (a commit in the 7.1.0 branch that should be available with Bazelisk)? That commit changes how the C++ header symlinks are created in Bazel.

@Gormo

Gormo commented Feb 12, 2024

@fmeum I have now tried cherry-picking 48ea3d2 on top of 7.0.2, but that didn't really change anything, since we also get the errors on actions generated by ctx.actions.symlink().
Then I tried adding a patch to always use "use_exec_root_for_source", but that didn't remove any flakiness either:

diff --git a/src/main/java/com/google/devtools/build/lib/analysis/starlark/StarlarkActionFactory.java b/src/main/java/com/google/devtools/build/lib/analysis/starlark/StarlarkActionFactory.java
index 44ff502fdf..3d037752bb 100644
--- a/src/main/java/com/google/devtools/build/lib/analysis/starlark/StarlarkActionFactory.java
+++ b/src/main/java/com/google/devtools/build/lib/analysis/starlark/StarlarkActionFactory.java
@@ -291,9 +291,7 @@ public class StarlarkActionFactory implements StarlarkActionFactoryApi {
     if (useExecRootForSourceObject != Starlark.UNBOUND) {
       BuiltinRestriction.failIfCalledOutsideAllowlist(thread, PRIVATE_STARLARKIFICATION_ALLOWLIST);
     }
-    boolean useExecRootForSource =
-        !Starlark.UNBOUND.equals(useExecRootForSourceObject)
-            && (Boolean) useExecRootForSourceObject;
+    boolean useExecRootForSource = true;
 
     RuleContext ruleContext = getRuleContext();

@lberki
Contributor

lberki commented Feb 12, 2024

@Wyverald FYI if you haven't already seen this thread

@lberki
Contributor

lberki commented Feb 12, 2024

@Gormo can you check what the paths look like about which Bazel complains that they are dangling symlinks? I'm curious if they stay dangling symlinks at the end of the build and if so, how exactly they are dangling, i.e. in what step of the symlink resolution does the "file not found" error occur? (e.g. simply the target doesn't exist even though the directory that contains it does? Does the symlink point to a file under a directory that should exist, but it doesn't? Something more complicated?)

It's not immediately obvious how this could happen: AFAIU FileStateValues in external repositories depend on the pertinent RepositoryDirectoryValue and that could only be created after fetching the repository in question.

Does --spawn_strategy=standalone fix the issue? If so, that limits the possibly buggy places to the sandbox implementations, if not, it's something else.
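A minimal sketch of how the "how exactly is it dangling" question could be answered for a single link, assuming plain Python run from the execroot. The helper name `diagnose_symlink` is made up for illustration, and it only follows one hop rather than a chain of links.

```python
import os


def diagnose_symlink(link):
    """Return (target, deepest_existing_ancestor) for a symlink.

    The second element is None when the target exists, i.e. the link
    is not dangling. Only a single hop is followed.
    """
    target = os.readlink(link)
    if not os.path.isabs(target):
        # Relative targets are resolved against the directory holding the link.
        target = os.path.join(os.path.dirname(link), target)
    target = os.path.normpath(target)
    if os.path.exists(target):
        return target, None
    ancestor = os.path.dirname(target)
    # Walk upwards until we reach a directory that really exists.
    while not os.path.isdir(ancestor) and os.path.dirname(ancestor) != ancestor:
        ancestor = os.path.dirname(ancestor)
    return target, ancestor
```

If the returned ancestor is the target's immediate parent, only the file itself is missing; if it is further up, a whole directory was never created or was removed.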

@Gormo

Gormo commented Feb 12, 2024

Here is an (IP-mangled) output from the symlinks:

ERROR: /Top-bazel/478b2cbff2254079381e27d1a245fab2/external/FOO/BUILD.bazel:7:12: output 'external/FOO/lib/libfoo.a' is a dangling symbolic link
ERROR: /Top-bazel/478b2cbff2254079381e27d1a245fab2/external/FOO/BUILD.bazel:7:12: Creating symlink bazel-out/pclinux64-fastbuild/bin/external/FOO/lib/libfoo.a failed: not all outputs were created or valid
Target @@FOO//:FOO-pkg failed to build
INFO: Elapsed time: 231.573s, Critical Path: 0.90s
INFO: 5581 processes: 5386 remote cache hit, 192 internal, 3 remote.
ERROR: Build did NOT complete successfully


bar@build:/Top$ ls -alh bazel-out/pclinux64-fastbuild/bin/external/FOO/lib/libfoo.a
lrwxrwxrwx 1 bar 1002 206 Feb 12 16:25 bazel-out/pclinux64-fastbuild/bin/external/FOO/lib/libfoo.a -> /Top-bazel/478b2cbff2254079381e27d1a245fab2/execroot/_main/bazel-out/aarch64_qnx-opt-ST-d3c7f1d3749b/bin/external/Foo/Configuration/Bazel/Bar/Platform/Foo_proxy_library_stripped.a

bar@build:/Top$ ls /Top-bazel/478b2cbff2254079381e27d1a245fab2/execroot/_main/bazel-out/aarch64_qnx-opt-ST-d3c7f1d3749b/bin/external/Foo/Configuration/Bazel/Bar/Platform/Foo_proxy_library_stripped.a
ls: cannot access '/Top-bazel/478b2cbff2254079381e27d1a245fab2/execroot/_main/bazel-out/aarch64_qnx-opt-ST-d3c7f1d3749b/bin/external/Foo/Configuration/Bazel/Bar/Platform/Foo_proxy_library_stripped.a': No such file or directory

I also tried with "--spawn_strategy=standalone" but that didn't affect anything.

@lberki
Contributor

lberki commented Feb 12, 2024

If it still fails with --spawn_strategy=standalone, that exonerates --incompatible_sandbox_hermetic_tmp.

Can you check which directories exist and which do not among the ancestors of /Top-bazel/478b2cbff2254079381e27d1a245fab2/execroot/_main/bazel-out/aarch64_qnx-opt-ST-d3c7f1d3749b/bin/external/Foo/Configuration/Bazel/Bar/Platform/Foo_proxy_library_stripped.a? And if all of them exist, where does that symlink point to, and which directories among the ancestors of the target exist?

@Gormo

Gormo commented Feb 22, 2024

This error also occurs on Windows:

09:55:02 ERROR: E:/bazel/ypqoi5y7/external/extrepo/BUILD.bazel:5:11: output 'external/extrepo/_virtual_includes/Foo/foo.h' is a dangling symbolic link
09:55:03 ERROR: E:/bazel/ypqoi5y7/external/extrepo/BUILD.bazel:5:11: Symlinking virtual headers for Foo failed: not all outputs were created or valid

Looking at the disk, I can see:

myuser@computer MINGW64 /e/folder/project
$ stat bazel-out/k8-fastbuild/bin/external/extrepo/_virtual_includes/Foo/foo.h
  File: bazel-out/k8-fastbuild/bin/external/extrepo/_virtual_includes/Foo/foo.h -> /e/bazel/ypqoi5y7/execroot/AT/external/extrepo/include/foo.h
  Size: 87              Blocks: 0          IO Block: 65536  symbolic link
Device: 769dcb8ch/1990052748d   Inode: 47850746041798830  Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (297610/myuser)   Gid: (297121/ UNKNOWN)
Access: 2024-02-22 09:55:02.963725500 +0100
Modify: 2024-02-22 09:55:02.963725500 +0100
Change: 2024-02-22 09:55:02.963725500 +0100
Birth: 2024-02-22 09:55:02.963725500 +0100

myuser@computer MINGW64 /e/folder/project
$ stat /e/bazel/ypqoi5y7/execroot/AT/external/extrepo/include/foo.h
  File: /e/bazel/ypqoi5y7/execroot/AT/external/extrepo/include/foo.h
  Size: 231955          Blocks: 228        IO Block: 65536  regular file
Device: 769dcb8ch/1990052748d   Inode: 26740122787614710  Links: 1
Access: (0644/-rw-r--r--)  Uid: (297610/myuser)   Gid: (297121/ UNKNOWN)
Access: 2024-02-19 11:48:02.498995600 +0100
Modify: 2024-02-19 11:48:02.498995600 +0100
Change: 2024-02-19 11:48:02.498995600 +0100
Birth: 2024-02-19 11:48:02.495996800 +0100

The build event stream posted the event after the symlink was created:

{
  "eventTime": {
    "seconds": "1708592102",  // 2024-02-22 09:55:02
    "nanos": 970000000
  },
  "buildEvent": {
    "id": {
      "actionCompleted": {
        "primaryOutput": "bazel-out/k8-fastbuild/bin/external/extrepo/_virtual_includes/Foo/foo.h",
        "label": "@@extrepo//:Foo",
        "configuration": {
          "id": "6c01f9acdccc727d9bd9c32e7940d9d1ef9fff9fc411e9c028468eb14fcbb30d"
        }
      }
    },
    "action": {
      "exitCode": 1,
      "label": "@@extrepo//:Foo",
      "configuration": {
        "id": "6c01f9acdccc727d9bd9c32e7940d9d1ef9fff9fc411e9c028468eb14fcbb30d"
      },
      "type": "Symlink",
      "failureDetail": {
        "message": "not all outputs were created or valid",
        "execution": {
          "code": "ACTION_OUTPUTS_NOT_CREATED"
        }
      }
    }
  }
}

@Gormo

Gormo commented Feb 26, 2024

Here is an equivalent output from Linux:

09:03:48 ERROR: .../358a92b74c6f0e9ec4d7be91768fb6e5/external/extrepo/BUILD.bazel:22:21: output 'external/extrepo/_virtual_includes/Foo/foo.h' is a dangling symbolic link
09:03:48 ERROR: .../358a92b74c6f0e9ec4d7be91768fb6e5/external/extrepo/BUILD.bazel:22:21: Symlinking virtual headers for Foo failed: not all outputs were created or valid

Looking at the disk, I can see:

$ pwd
/.../358a92b74c6f0e9ec4d7be91768fb6e5/execroot/WS
$ stat bazel-out/k8-fastbuild/bin/external/extrepo/_virtual_includes/Foo/foo.h
  File: bazel-out/k8-fastbuild/bin/external/extrepo/_virtual_includes/Foo/foo.h -> /.../358a92b74c6f0e9ec4d7be91768fb6e5/execroot/WS/external/extrepo/include/foo.h
  Size: 150             Blocks: 8          IO Block: 4096   symbolic link
Device: fd01h/64769d    Inode: 5246275     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: ( 1234/ myuser)   Gid: ( 1002/ mygrp)
Access: 2024-02-26 08:29:45.042242594 +0000   <--- Sorry, I probably read the link before running stat.
Modify: 2024-02-26 08:03:48.657096677 +0000
Change: 2024-02-26 08:03:48.657096677 +0000
Birth: -
$ stat external/extrepo/include/foo.h
  File: external/extrepo/include/foo.h
  Size: 3620            Blocks: 8          IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 2369368     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ myuser)   Gid: ( 1002/ mygrp)
Access: 2024-02-26 06:44:07.327746161 +0000
Modify: 2024-02-26 06:43:38.415307171 +0000
Change: 2024-02-26 06:43:38.415307171 +0000
Birth: -

The build event stream posted the event after the symlink was created:

{
  "eventTime": {
    "seconds": "1708934628",  // 2024-02-26 08:03:48
    "nanos": 744000000
  },
  "buildEvent": {
    "id": {
      "actionCompleted": {
        "primaryOutput": "bazel-out/k8-fastbuild/bin/external/extrepo/_virtual_includes/Foo/foo.h",
        "label": "@@extrepo//:Foo",
        "configuration": {
          "id": "288d5a05823d74872ae8a6af12afac9f87f011ad2951eb028990ca9116f858ad"
        }
      }
    },
    "action": {
      "exitCode": 1,
      "label": "@@extrepo//:Foo",
      "configuration": {
        "id": "288d5a05823d74872ae8a6af12afac9f87f011ad2951eb028990ca9116f858ad"
      },
      "type": "Symlink",
      "failureDetail": {
        "message": "not all outputs were created or valid",
        "execution": {
          "code": "ACTION_OUTPUTS_NOT_CREATED"
        }
      }
    }
  }
}

So it seems that the symlink targets always exist, but they are sometimes not populated into the sandbox.
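To make the ordering of the two timestamps explicit, the following Python sketch converts the `eventTime` from the build event and the symlink's `Modify` time from the Linux `stat` output to UTC and compares them. The helper names are hypothetical; the values are copied from the output above.

```python
from datetime import datetime, timedelta, timezone


def bep_event_time(seconds, nanos):
    """Convert a build event's eventTime (seconds + nanos since the epoch) to UTC."""
    base = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return base + timedelta(microseconds=nanos // 1000)


def stat_time(text):
    """Parse a `stat` timestamp such as '2024-02-26 08:03:48.657096677 +0000'."""
    stamp, zone = text.rsplit(" ", 1)
    whole, fraction = stamp.split(".")
    # strptime's %f accepts at most six digits, so drop the sub-microsecond part.
    return datetime.strptime(f"{whole}.{fraction[:6]} {zone}", "%Y-%m-%d %H:%M:%S.%f %z")


symlink_mtime = stat_time("2024-02-26 08:03:48.657096677 +0000")
event = bep_event_time("1708934628", 744000000)
gap_ms = (event - symlink_mtime) / timedelta(milliseconds=1)
print(f"event posted {gap_ms:.1f} ms after the symlink was modified")
```

For the Linux run this gives a gap of roughly 87 ms, so the failure was reported after the symlink had already been written.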

@freeformstu

> @Gormo Could you check whether the issue still occurs with 48ea3d2 (a commit in the 7.1.0 branch that should be available with Bazelisk)? That commit changes how the C++ header symlinks are created in Bazel.

We were hitting this issue very consistently and 7.1.0rc1 appears to have fixed the issue for us.

phst added a commit to phst/rules_elisp that referenced this issue Mar 5, 2024
This reverts commit 578ce77.

Copying doesn't appear to help with
bazelbuild/bazel#20886,
and it's going to be fixed in Bazel 7.1 anyway.
@Gormo

Gormo commented Mar 15, 2024

@freeformstu, The issue still occurs frequently on 7.1.0 when running with bazel modules enabled.

@fmeum
Collaborator

fmeum commented Mar 15, 2024

Has anyone been able to reproduce this issue locally with Bazel 7.1.0 and --noincompatible_sandbox_hermetic_tmp? This issue is still pretty mysterious to me and a local reproducer, even if not deterministically triggering the bug, would be very helpful.

@chrisabbott

chrisabbott commented Mar 16, 2024

> Has anyone been able to reproduce this issue locally with Bazel 7.1.0 and --noincompatible_sandbox_hermetic_tmp? This issue is still pretty mysterious to me and a local reproducer, even if not deterministically triggering the bug, would be very helpful.

@fmeum I was able to reproduce the same issue under Bazel 7.1.0 with rules_foreign_cc version 0.10.1.

I had to add both --noincompatible_sandbox_hermetic_tmp and --noexperimental_merged_skyframe_analysis_execution to resolve this.

/edit In trying to reproduce this, it looks like --noincompatible_sandbox_hermetic_tmp is all I needed after all

@fmeum
Collaborator

fmeum commented Mar 16, 2024

@chrisabbott Could you share an example with which you observe this behavior?

@chrisabbott

chrisabbott commented Mar 16, 2024

> @chrisabbott Could you share an example with which you observe this behavior?

@fmeum Yep! Here you go. Bear in mind that I didn't actually need both flags as I mentioned above, so it may or may not be useful to you.

.bazelversion

7.1.0

MODULE.bazel

module(
    name = "example",
    version = "0.0",
)

http_archive = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

bazel_dep(
    name = "rules_foreign_cc",
    version = "0.10.1"
)

_ALL_CONTENT = """\
filegroup(
    name = "all_srcs",
    srcs = glob(["**"]),
    visibility = ["//visibility:public"],
)
"""

# Add spdlog 1.13.0 as a third-party dependency.
SPDLOG_VERSION = "1.13.0"
SPDLOG_INTEGRITY = "sha256-n2dju3b/99s3H1czYmyDNS7dfFeJlQGrACSPr62cxQQ="

http_archive(
    name = "spdlog",
    build_file_content = _ALL_CONTENT,
    strip_prefix = "spdlog-{}".format(SPDLOG_VERSION),
    urls = [
        "https://github.com/gabime/spdlog/archive/v{}.zip".format(SPDLOG_VERSION),
    ],
    integrity = SPDLOG_INTEGRITY,
)

third_party/BUILD.bazel

load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")

cmake(
    name = "spdlog",
    lib_source = "@spdlog//:all_srcs",
    out_static_libs = ["libspdlog.a"],
)

@fmeum
Collaborator

fmeum commented Mar 16, 2024

@chrisabbott Sorry, I didn't see your edit before: If that flag fixes the issue, I'm pretty sure it's the deterministic failure described in #21215.

@Gormo

Gormo commented Mar 19, 2024

@fmeum, we have --noincompatible_sandbox_hermetic_tmp set on 7.1.0, but we still see the error frequently in CI when Bazel modules are enabled. (It's actually so frequent that it's stopping us from migrating to Bazel modules.) It's nondeterministic and difficult to understand what triggers it, but if you have proposals for relevant log points I can enable those, or even patch Bazel for a test if needed.

@fmeum
Collaborator

fmeum commented Mar 20, 2024

@Gormo Could you try to bisect this down to a smaller commit range by setting USE_BAZEL_VERSION to commits between 6.4.0 and 7.0.0 with Bazelisk? That could help us understand which kind of change caused this regression.
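Because the failure is flaky, a single build per commit is probably not enough to tell good commits from bad ones. Below is a rough Python sketch of measuring a failure rate per commit instead. It assumes `bazelisk` is on PATH and that a clean build is enough to re-run the symlink actions; the helper names are hypothetical, and only the `USE_BAZEL_VERSION` variable comes from this thread.

```python
import os
import subprocess


def flake_rate(run_build, attempts):
    """Run `run_build()` `attempts` times and return the fraction that failed."""
    failures = sum(0 if run_build() else 1 for _ in range(attempts))
    return failures / attempts


def bazel_build(commit, targets):
    """Return a callable that builds `targets` with Bazelisk pinned to `commit`."""
    env = dict(os.environ, USE_BAZEL_VERSION=commit)

    def run():
        # Clean first so that every attempt has to redo the build (assumption).
        subprocess.run(["bazelisk", "clean"], env=env, check=False)
        return subprocess.run(["bazelisk", "build", *targets], env=env).returncode == 0

    return run
```

Something like `flake_rate(bazel_build("<commit>", ["//..."]), 20)` could then be compared across candidate commits, with `<commit>` replaced by a real Bazel commit hash.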

@linzhp
Contributor

linzhp commented Apr 22, 2024

One thing that we noticed in the past few months after upgrading to Bazel 7 was files in Bazel's output_base mysteriously missing from CI machines, leading to errors like:

ERROR: no such package '@@com_github_aws_aws_sdk_go//service/cloudfront/sign': BUILD file not found in directory 'service/cloudfront/sign' of external repository @@com_github_aws_aws_sdk_go. Add a BUILD file to a directory to mark it as a package.

or

failed to fetch com_github_bazelbuild_remote_apis: fetch_repo: /<truncated>/external/go_sdk/bin/go /<truncated>/external/go_sdk/bin/go mod download -json -modcacherw github.com/bazelbuild/[email protected]: fork/exec /<truncated>/external/go_sdk/bin/go: no such file or directory

The error messages above are just examples. These kinds of errors can happen to any external repo, but only to external repos (files missing under output_base/external). We do have a process on CI machine to clean cache directories, including Bazel output_base when the disk is close to full. However, the cleaning process is to rotate the whole cache directory, not deleting files individually. So it shouldn't cause partial deletion of output_base.

When this error happens, all subsequent builds from that CI machine would fail with similar errors until we run bazel clean --expunge there.

I can imagine that the missing files under output_base/external could cause the dangling symbolic links discussed in this ticket.

@fmeum
Collaborator

fmeum commented Apr 23, 2024

The issues described in this thread that aren't fixed by --noincompatible_sandbox_hermetic_tmp are consistent with #22073: Any kind of symlink pointing into an external repo's directory under the execroot could be dangling due to the race in that issue. I would thus recommend that everyone in this thread try out the fix when it lands.

I don't know what could be causing the issue @linzhp described though.
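For anyone who wants to check a machine after a failed build, this is a small Python sketch that lists every symlink under a directory whose target is missing. The function name is made up, and which directory to scan (for example the execroot under the path printed by `bazel info output_base`) is an assumption rather than something prescribed in this thread.

```python
import os


def find_dangling_symlinks(root):
    """Yield (link, target) for every symlink under `root` whose target is missing."""
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories in dirnames, all others in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # exists() follows symlinks, so it is False for a dangling link.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)
```

Note that a link can be dangling only briefly, so an empty result after the build has finished does not rule out the race.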

@rdesgroppes

rdesgroppes commented Apr 24, 2024

I faced the issue with jansson:

  1. deps.bzl:
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

def deps():
    ref = "2.14"
    http_archive(
        name = "jansson",
        build_file = "//my:jansson.BUILD",
        sha256 = "5798d010e41cf8d76b66236cfb2f2543c8d082181d16bc3085ab49538d4b9929",
        strip_prefix = "jansson-{}".format(ref),
        url = "https://github.com/akheron/jansson/releases/download/v{}/jansson-{}.tar.gz".format(ref, ref),
    )
  2. jansson.BUILD:
load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")
filegroup(name = "lib_source", srcs = glob(["**"]))
cmake(name = "libjansson", lib_source = ":lib_source", visibility = ["//visibility:public"])
  3. error (as of Bazel 7.1.1):
ERROR: /my/external/jansson/BUILD.bazel:9:6: Error while validating output TreeArtifact File:[[<execution_root>]bazel-out/k8-fastbuild/bin]external/jansson/libjansson/include : Child jansson.h of tree artifact /my/execroot/__main__/bazel-out/k8-fastbuild/bin/external/jansson/libjansson/include is a dangling symbolic link
ERROR: /my/external/jansson/BUILD.bazel:9:6: Foreign Cc - CMake: Building libjansson failed: not all outputs were created or valid
Target @@jansson//:libjansson failed to build
  4. workaround:
         build_file = "//my:jansson.BUILD",
+        patches = ["//my:jansson.patch"],  # https://github.com/bazelbuild/bazel/issues/20886
         sha256 = "5798d010e41cf8d76b66236cfb2f2543c8d082181d16bc3085ab49538d4b9929",
  5. jansson.patch:
--- CMakeLists.txt
+++ CMakeLists.txt
@@ -271,2 +271,3 @@
-file (COPY ${CMAKE_CURRENT_SOURCE_DIR}/src/jansson.h
-           DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/include/)
+configure_file (${CMAKE_CURRENT_SOURCE_DIR}/src/jansson.h
+                ${CMAKE_CURRENT_BINARY_DIR}/include/jansson.h
+                COPYONLY)
@@ -298 +299 @@
-   ${CMAKE_CURRENT_SOURCE_DIR}/src/jansson.h)
+   ${CMAKE_CURRENT_BINARY_DIR}/include/jansson.h)

💡 --noincompatible_sandbox_hermetic_tmp also works, but then facing occurrences of:

@fmeum
Collaborator

fmeum commented Apr 26, 2024

Since it's pretty likely that this is fixed by 52adf0b, I will close this issue.

If you can still reproduce your issue with a version of Bazel including this commit (currently last_green, but the fix is in the process of being merged into the 7.2.0 branch, which also has all its commits available with USE_BAZEL_VERSION after a short wait):
