
"Failed to find previous kopia snapshot manifests" for velero backup using csi snapshot data mover #8222

Open
dharanui opened this issue Sep 17, 2024 · 12 comments


@dharanui

dharanui commented Sep 17, 2024

What steps did you take and what happened:
Velero version 1.12 (also tried upgrading to 1.14.1), AWS plugin 1.10.1.

A scheduled backup runs every day with the CSI data mover.

Backups intermittently end up in PartiallyFailed status, with a few DataUploads in "Failed" state.
Describing the failed DataUploads shows the following error:

data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c": failed to get blob with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c: BLOB not found

What did you expect to happen:
All DataUploads are in Completed state and backups succeed every day.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help
bundle-2024-09-17-11-32-20.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:
Production environment

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

It looks like the repo content with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c was missing, so subsequent backups for the snapshot represented by this content always failed.

Take the volume datadir-active-tracking-mongodb-secondary-0 as an example:

  • A backup completed on 2024-09-16T02:14:24Z with snapshot ff8935066e0e428d4f2905013353fb14
  • On 2024-09-17T02:09:58Z the problem happened, complaining that the parent snapshot manifest blob was not found
  • However, on 2024-09-17T04:46:14Z another backup succeeded for the same volume with the same parent snapshot ff8935066e0e428d4f2905013353fb14

Therefore, it looks like the content was not found at one time but was found at another.

The "BLOB not found" error was reported by the object store, so please share the details of your object store. You can also check whether object qc9e6d67bcf92c91ed614a2d63a27ce00-s670d20f3fffaf54212c exists in your object store.
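
For reference, one way to check from the S3 side is the aws CLI; this is only a rough sketch with placeholder values (the exact object key layout depends on your BackupStorageLocation bucket and prefix):

  # list kopia objects for the volume namespace and look for the blob ID (bucket, prefix and blob ID are placeholders)
  aws s3 ls s3://<bucket>/<prefix>/kopia/<volume-namespace>/ --recursive | grep <blob-id>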

@dharanui
Author

dharanui commented Sep 18, 2024

"Therefore, looks like the content was not found at one time but found at the other time." -> True, I have triggered a backup almost immediately after the backup is failed at "2024-09-17T02:09:58Z".

We use AWS S3 as object store.
I am not sure how to check the object in object store. But if it was not present how will the subsequent backup succeed?

Backup repository example:

apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  creationTimestamp: "2024-08-29T11:06:27Z"
  generateName: active-tracking-default-kopia-
  generation: 480
  labels:
    velero.io/repository-type: kopia
    velero.io/storage-location: default
    velero.io/volume-namespace: active-tracking
  name: active-tracking-default-kopia-svqtc
  namespace: velero
  resourceVersion: "218166016"
  uid: 5175b990-5122-4dc6-8813-11e3db56ebb6
spec:
  backupStorageLocation: default
  maintenanceFrequency: 1h0m0s
  repositoryType: kopia
  resticIdentifier: s3:s3-eu-west-1.amazonaws.com/ot3-qa-patch-velero-backup/restic/active-tracking
  volumeNamespace: active-tracking
status:
  lastMaintenanceTime: "2024-09-18T08:16:56Z"
  phase: Ready

@Lyndon-Li
Contributor

I also think the object should have existed the whole time, but the object store returned 404 (reported as "BLOB not found" by Kopia) at the time the problem happened. I think that is the problem.

@dharanui
Author

corresponding log example from velero pod: "level=error msg="data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c": failed to get blob with ID qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c: BLOB not found, plugin: velero.io/csi-pvc-backupper" backup=velero/daily-bkp-20240918033004 logSource="pkg/controller/backup_controller.go:663"

Is there any way to tell the plugin to retry in case of a 404?
Or what other places should I check?

@dharanui
Author

This seems to happen quite frequently in multiple clusters. We tried Velero versions 1.12 and 1.14, as well as different AWS plugin versions.

@blackpiglet
Contributor

Another thing worth checking: please verify whether the snapshots can be shared across clusters.

@dharanui
Author

The VolumeSnapshots will be deleted after the DataUpload is successful, right?
However, these (volumes and snapshots) can be shared across clusters; we test this on a daily basis.

@blackpiglet
Contributor

The VolumeSnapshots are reset to retain the underlying snapshot before deletion, so that is not the problem.
I meant to ask whether the cloud provider supports sharing snapshots between clusters, but you already confirmed in the previous reply that this is not an issue in your environment.

I think we can use the kopia CLI to check whether the snapshot blob q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c exists in the object store.
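
Something like the following rough sketch could work (all values are placeholders and need to be adjusted to your bucket, prefix, and repository password; connecting read-only avoids modifying anything in the repository):

  # connect to the kopia repository used by Velero for this namespace (placeholder values)
  kopia repository connect s3 --bucket=<bucket> --prefix=<prefix>/kopia/<volume-namespace>/ --access-key=<access-key> --secret-access-key=<secret-key> --password=<repo-password> --readonly
  # then check whether the blob is present
  kopia blob list | grep q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c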

@dharanui
Author

We had the same error again today:
data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/glowroot/data-glowroot-cassandra-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/glowroot/data-glowroot-cassandra-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID qfc53d34093075fb10774b2cba835cae2-s63aafb7e87d5212a12c: BLOB not found

I connected to the repository and tried to find the blob with ID qfc53d34093075fb10774b2cba835cae2-s63aafb7e87d5212a12c from the error using kopia blob list, but there is no blob with that ID.

So the question is: why does this issue not occur for every backup?

@dharanui
Author

dharanui commented Sep 24, 2024

@blackpiglet / @Lyndon-Li do we have any workaround for this at the moment?

I found the following warning on the node-agent; is it related?

node-agent time="2024-09-24T02:02:58Z" level=warning msg="active indexes [xr0_7_5fc616defe00f04ddd4cbb1444f7e8c7-sc4455fdf2d353fa5-c1 xn9_07592f6e33cbfb35b171e26f3450747c-sc5a932c3e062da0f12c-c1 xn9_10f827950e25c36b2405e63cf75a226b-s65ef87e5c616798712c-c1 xn9_14ddd7437f5720c24b237bc0deb22bf8-se901f10811af4f5f12d-c1 xn9_16e195a514cbba289a0ff3ef06db5d6f-s9614bd11b68aa6b212c-c1 xn9_1b0cb5d825399323a430a121f34476eb-s588095dfbac520c412c-c1 xn9_1e9549c8004c8c7268bad88419beab36-s9e4c96b059eddf3712d-c1 xn9_363917e4ac30525628976b596ef064a6-s5db7801afbadff7212c-c1 xn9_3a694490dfde37d62dae07b9b44ed0c2-sd1bc0aaf5b64e17112d-c1 xn9_3c0d542715d47c6c98902b309a90b251-sa8791d496c562fc112c-c1 xn9_3e6be1beb10d8d4e1c912caab27c3e5d-s755589140036a22012c-c1 xn9_53b9bf91278eb5c12346df5d63674faf-sc9f612ebdbd2a0ce12c-c1 xn9_639b5b6a8082b62c25c4a3935d50b6d7-sde1a45d58364560812d-c1 xn9_65fab6b9e1f26eaec6aa67521b1f78af-s093e4037cd9fc77b12c-c1 xn9_7ce07bd6999c9b6c1faa68af67987f87-s035cfd62c5d3644e12d-c1 xn9_81ac4312bb80f0d1186b231087b23f05-s7b90ce81329cb91812c-c1 xn9_8209b3b790a161761c84389a5f194002-s62f35ccd1cb1a0cb12c-c1 xn9_cd23b9ad0846ddf02f3d760bf5ace8dc-s7cdcedd28e8b24a612c-c1 xn9_cda8b06ba4a871a6075bae991eda8111-s0d0bce2d983eadcc12c-c1 xn9_d17d9345eea8d30e44182a388c48efcd-s9cdf0b0cdbd99eaa12d-c1 xn9_d3257f5446f4da22f224f753f68ee432-sdc539f6c6c87654012c-c1 xn9_dd532a5b43313a79639213de2464e8b8-s0db90ab1c3d9fb2412d-c1 xn9_e41bfe5b368eadec73cca2640ca5a9d2-s4d49b708bddaff8312c-c1 xn9_e838905b2a045727172d1518c04d9037-s43808c5f5d632ecf12c-c1 xn9_e9fe6494f98a8bc416517fe754a40383-sa2068559f11620a612d-c1 xn9_fb57dc5c187a4de3eec07fb7911b5b3d-secb786f3edaf429e12c-c1 xn9_fe21260253354ede021f822921b376c1-s0c06d9592eb3d69b12c-c1 xn10_00609770d0c78ff21286ce008d6b8ed7-sfe0a567df5eff63b12d-c1 xn10_08418a307bad786065e08baa65bb8d74-s494cde92e1b5f13012d-c1 xn10_092752e1ffa9226f61fce1308201541e-sfcd5cedcfd2cb4d212d-c1 xn10_1dce7427ad64f225d65692ccdedd1e3a-s2a85c940c8f73af612d-c1 xn10_3819707c2f4c6e6c34114cd409345f34-sb168d866e512d21312d-c1 xn10_4670b1f09e6f5394fff9c8ff45071509-sa2cad73ac0e9fdb512d-c1 xn10_4d74081a46fdbc4903964b5a23e7f921-s2fa4528c987552e712d-c1 xn10_4ff2621b3827a3ceb351ad7ba37d0a9e-s4db7aa7af554dc7a12d-c1 xn10_5bfb3687c48dd1abbc6add70c42d1b3f-sd41468b08298693c12d-c1 xn10_89b031dbcc0eec199c40d95eeaba16ee-s3650b95ad5c3489e12d-c1 xn10_8a7222cc2486e07f8cdf1e3a783da8e8-se76da0199f2b97cc12d-c1 xn10_95349a5f4bed81e32ffef280da1a14e7-scfaa896c1d9d776a12d-c1 xn10_9c586fe9a6e4a504bd9077b1620b2c21-s74c14ec8fe45d41212d-c1 xn10_a82e80af983db869d21fa7288b868f9f-s4f92e15e30b2f9b512d-c1 xn10_ad008aaf537636d340ebb263e914cbae-sbe46290e8f673d2c12d-c1 xn10_b0e2829139c853137c673e3a86635fcf-sd31964a47c2bd32d12d-c1 xn10_bb905a8b8e266cd6994271d6065700f4-sbae8b502911ff4c312d-c1 xn10_dd76752723fb1d4deda22600d2868270-se11374fe4c1a0d8412d-c1 xn10_df433df44424bd5c167d7eee7948921a-scb663fe2e899520a12d-c1 xn10_ee77f8f18fa5207191d5e1d8b718bad5-sd823c3c30eae664a12d-c1] deletion watermark 2024-09-22 20:59:27 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20240924020030-9qz56 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error

@Lyndon-Li
Contributor

The warning message is not relevant.
The weird thing is that the problem happens in your environment only (we don't see other users reporting it), and even in your environment the problem doesn't always happen.

So we need more info to get to the root cause. Please try to answer the questions below:

  1. Have you enabled any advanced features of AWS S3?
  2. Can you restore from every backup when you see this problem?
  3. Had you deleted any backups before the problem happened?
  4. Even if a blob is deleted, e.g., due to backup deletion, it is not removed from the object store immediately; instead, it is removed by a maintenance job after a safe enough time (e.g., 24 hours). So can you check when the missing snapshot was generated? (One way to inspect the repository's maintenance settings is sketched below.)
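
If it helps, once connected to the repository (e.g. read-only via the kopia CLI as sketched earlier), the maintenance configuration can be inspected like this; a minimal sketch, nothing here is specific to Velero:

  # show the maintenance schedule and parameters of the connected repository
  kopia maintenance info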

@dharanui
Author

dharanui commented Sep 24, 2024

Hi @Lyndon-Li

  1. No advanced features are enabled for the S3 bucket.
  2. Yes.
  3. Yes, we have deleted multiple backups: we created test backups to debug a few issues and deleted them once done. Also, our backups are retained for 14 days, so anything older than 14 days is deleted via the TTL setting.
  4. I guess the maintenance job for Kopia runs every 1 hour by default, not every 24 hours (see the check below). Could that be a potential problem?
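
For reference, a quick way to check the current value on the BackupRepository CR shared earlier (name and namespace taken from that example; this only reads the field and changes nothing):

  # print the maintenance frequency of the example repository CR
  kubectl -n velero get backuprepository active-tracking-default-kopia-svqtc -o jsonpath='{.spec.maintenanceFrequency}'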
