
"Failed to find previous kopia snapshot manifests" for velero backup using csi snapshot data mover #8222

Open
dharanui opened this issue Sep 17, 2024 · 12 comments


@dharanui

dharanui commented Sep 17, 2024

What steps did you take and what happened:
Velero version 1.12 (also tried upgrading to 1.14.1), AWS plugin 1.10.1.

A scheduled backup runs every day with the CSI data mover.

Backups intermittently end up in PartiallyFailed status, with a few DataUploads in "Failed" state.
Describing the failed DataUploads shows the following error:

data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c": failed to get blob with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c: BLOB not found

What did you expect to happen:
All DataUploads are in Completed state and backups succeed every day.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help
bundle-2024-09-17-11-32-20.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:
Production environment

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

It looks like the repo content with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c was missing, so subsequent backups for the snapshot represented by this content always failed.

Take the volume datadir-active-tracking-mongodb-secondary-0 as an example:

  • A backup completed on 2024-09-16T02:14:24Z with snapshot ff8935066e0e428d4f2905013353fb14
  • On 2024-09-17T02:09:58Z the problem happened, complaining that the parent snapshot manifest blob was not found
  • However, on 2024-09-17T04:46:14Z another backup succeeded for the same volume with the same parent snapshot ff8935066e0e428d4f2905013353fb14

Therefore, it looks like the content was not found at one time but was found at another.

The "BLOB not found" error was reported by the object store, so please share the details of your object store. You can also check whether object qc9e6d67bcf92c91ed614a2d63a27ce00-s670d20f3fffaf54212c exists in your object store.
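
For reference, one way to check from the S3 side is the aws CLI; this is only a rough sketch with placeholder values (the exact object key layout depends on your BackupStorageLocation bucket and prefix):

  # list kopia objects for the volume namespace and look for the blob ID (bucket, prefix and blob ID are placeholders)
  aws s3 ls s3://<bucket>/<prefix>/kopia/<volume-namespace>/ --recursive | grep <blob-id>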

@dharanui
Author

dharanui commented Sep 18, 2024

"Therefore, looks like the content was not found at one time but found at the other time." -> True, I have triggered a backup almost immediately after the backup is failed at "2024-09-17T02:09:58Z".

We use AWS S3 as object store.
I am not sure how to check the object in object store. But if it was not present how will the subsequent backup succeed?

Backup repository example:

apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  creationTimestamp: "2024-08-29T11:06:27Z"
  generateName: active-tracking-default-kopia-
  generation: 480
  labels:
    velero.io/repository-type: kopia
    velero.io/storage-location: default
    velero.io/volume-namespace: active-tracking
  name: active-tracking-default-kopia-svqtc
  namespace: velero
  resourceVersion: "218166016"
  uid: 5175b990-5122-4dc6-8813-11e3db56ebb6
spec:
  backupStorageLocation: default
  maintenanceFrequency: 1h0m0s
  repositoryType: kopia
  resticIdentifier: s3:s3-eu-west-1.amazonaws.com/ot3-qa-patch-velero-backup/restic/active-tracking
  volumeNamespace: active-tracking
status:
  lastMaintenanceTime: "2024-09-18T08:16:56Z"
  phase: Ready

@Lyndon-Li
Contributor

I also think the object should have existed the whole time, but the object store returned 404 (reported as "BLOB not found" by Kopia) at the time the problem happened. I think that is the problem.

@dharanui
Author

corresponding log example from velero pod: "level=error msg="data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c": failed to get blob with ID qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c: BLOB not found, plugin: velero.io/csi-pvc-backupper" backup=velero/daily-bkp-20240918033004 logSource="pkg/controller/backup_controller.go:663"

Is there any way to tell the plugin to retry in case of a 404?
Or what other places should I check?

@dharanui
Author

This seems to happen quite frequently in multiple clusters. We tried Velero versions 1.12 and 1.14, as well as different AWS plugin versions.

@blackpiglet
Contributor

Another thing worth checking: please verify whether the snapshots can be shared across clusters.

@dharanui
Author

The VolumeSnapshots will be deleted after the DataUpload is successful, right?
However, these (volumes and snapshots) can be shared across clusters; we test this on a daily basis.

@blackpiglet
Contributor

The VolumeSnapshots are reset to retain the underlying snapshot before deletion, so that is not the problem.
I meant to ask whether the cloud provider supports sharing snapshots between clusters, but you already confirmed in the previous reply that this is not an issue in your environment.

I think we can use the kopia CLI to check whether the snapshot blob q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c exists in the object store.
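
Something like the following rough sketch could work (all values are placeholders and need to be adjusted to your bucket, prefix, and repository password; connecting read-only avoids modifying anything in the repository):

  # connect to the kopia repository used by Velero for this namespace (placeholder values)
  kopia repository connect s3 --bucket=<bucket> --prefix=<prefix>/kopia/<volume-namespace>/ --access-key=<access-key> --secret-access-key=<secret-key> --password=<repo-password> --readonly
  # then check whether the blob is present
  kopia blob list | grep q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c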

@dharanui
Author

We had the same error again today:
data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/glowroot/data-glowroot-cassandra-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/glowroot/data-glowroot-cassandra-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID qfc53d34093075fb10774b2cba835cae2-s63aafb7e87d5212a12c: BLOB not found

I connected to the repository and tried to find the blob with ID qfc53d34093075fb10774b2cba835cae2-s63aafb7e87d5212a12c from the error using kopia blob list, but there is no blob with that ID.

So the question is: why does this issue not occur for every backup?

@dharanui
Author

dharanui commented Sep 24, 2024

@blackpiglet / @Lyndon-Li do we have any workaround for this at the moment?

I found the following warning on the node-agent; is it related?

node-agent time="2024-09-24T02:02:58Z" level=warning msg="active indexes [xr0_7_5fc616defe00f04ddd4cbb1444f7e8c7-sc4455fdf2d353fa5-c1 xn9_07592f6e33cbfb35b171e26f3450747c-sc5a932c3e062da0f12c-c1 xn9_10f827950e25c36b2405e63cf75a226b-s65ef87e5c616798712c-c1 xn9_14ddd7437f5720c24b237bc0deb22bf8-se901f10811af4f5f12d-c1 xn9_16e195a514cbba289a0ff3ef06db5d6f-s9614bd11b68aa6b212c-c1 xn9_1b0cb5d825399323a430a121f34476eb-s588095dfbac520c412c-c1 xn9_1e9549c8004c8c7268bad88419beab36-s9e4c96b059eddf3712d-c1 xn9_363917e4ac30525628976b596ef064a6-s5db7801afbadff7212c-c1 xn9_3a694490dfde37d62dae07b9b44ed0c2-sd1bc0aaf5b64e17112d-c1 xn9_3c0d542715d47c6c98902b309a90b251-sa8791d496c562fc112c-c1 xn9_3e6be1beb10d8d4e1c912caab27c3e5d-s755589140036a22012c-c1 xn9_53b9bf91278eb5c12346df5d63674faf-sc9f612ebdbd2a0ce12c-c1 xn9_639b5b6a8082b62c25c4a3935d50b6d7-sde1a45d58364560812d-c1 xn9_65fab6b9e1f26eaec6aa67521b1f78af-s093e4037cd9fc77b12c-c1 xn9_7ce07bd6999c9b6c1faa68af67987f87-s035cfd62c5d3644e12d-c1 xn9_81ac4312bb80f0d1186b231087b23f05-s7b90ce81329cb91812c-c1 xn9_8209b3b790a161761c84389a5f194002-s62f35ccd1cb1a0cb12c-c1 xn9_cd23b9ad0846ddf02f3d760bf5ace8dc-s7cdcedd28e8b24a612c-c1 xn9_cda8b06ba4a871a6075bae991eda8111-s0d0bce2d983eadcc12c-c1 xn9_d17d9345eea8d30e44182a388c48efcd-s9cdf0b0cdbd99eaa12d-c1 xn9_d3257f5446f4da22f224f753f68ee432-sdc539f6c6c87654012c-c1 xn9_dd532a5b43313a79639213de2464e8b8-s0db90ab1c3d9fb2412d-c1 xn9_e41bfe5b368eadec73cca2640ca5a9d2-s4d49b708bddaff8312c-c1 xn9_e838905b2a045727172d1518c04d9037-s43808c5f5d632ecf12c-c1 xn9_e9fe6494f98a8bc416517fe754a40383-sa2068559f11620a612d-c1 xn9_fb57dc5c187a4de3eec07fb7911b5b3d-secb786f3edaf429e12c-c1 xn9_fe21260253354ede021f822921b376c1-s0c06d9592eb3d69b12c-c1 xn10_00609770d0c78ff21286ce008d6b8ed7-sfe0a567df5eff63b12d-c1 xn10_08418a307bad786065e08baa65bb8d74-s494cde92e1b5f13012d-c1 xn10_092752e1ffa9226f61fce1308201541e-sfcd5cedcfd2cb4d212d-c1 xn10_1dce7427ad64f225d65692ccdedd1e3a-s2a85c940c8f73af612d-c1 xn10_3819707c2f4c6e6c34114cd409345f34-sb168d866e512d21312d-c1 xn10_4670b1f09e6f5394fff9c8ff45071509-sa2cad73ac0e9fdb512d-c1 xn10_4d74081a46fdbc4903964b5a23e7f921-s2fa4528c987552e712d-c1 xn10_4ff2621b3827a3ceb351ad7ba37d0a9e-s4db7aa7af554dc7a12d-c1 xn10_5bfb3687c48dd1abbc6add70c42d1b3f-sd41468b08298693c12d-c1 xn10_89b031dbcc0eec199c40d95eeaba16ee-s3650b95ad5c3489e12d-c1 xn10_8a7222cc2486e07f8cdf1e3a783da8e8-se76da0199f2b97cc12d-c1 xn10_95349a5f4bed81e32ffef280da1a14e7-scfaa896c1d9d776a12d-c1 xn10_9c586fe9a6e4a504bd9077b1620b2c21-s74c14ec8fe45d41212d-c1 xn10_a82e80af983db869d21fa7288b868f9f-s4f92e15e30b2f9b512d-c1 xn10_ad008aaf537636d340ebb263e914cbae-sbe46290e8f673d2c12d-c1 xn10_b0e2829139c853137c673e3a86635fcf-sd31964a47c2bd32d12d-c1 xn10_bb905a8b8e266cd6994271d6065700f4-sbae8b502911ff4c312d-c1 xn10_dd76752723fb1d4deda22600d2868270-se11374fe4c1a0d8412d-c1 xn10_df433df44424bd5c167d7eee7948921a-scb663fe2e899520a12d-c1 xn10_ee77f8f18fa5207191d5e1d8b718bad5-sd823c3c30eae664a12d-c1] deletion watermark 2024-09-22 20:59:27 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20240924020030-9qz56 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error

@Lyndon-Li
Contributor

The warning message is not relevant.
The weird thing is that the problem happens in your environment only (we don't see other users reporting it), and even in your environment the problem doesn't always happen.

So we need more info to get to the root cause. Please try to answer the questions below:

  1. Have you enabled any advanced features of AWS S3?
  2. Can you restore from every backup when you see this problem?
  3. Had you deleted any backups before the problem happened?
  4. Even if a blob is deleted, e.g., due to backup deletion, it is not removed from the object store immediately; instead, it is removed by a maintenance job after a safe enough time (e.g., 24 hours). So can you check when the missing snapshot was generated? (One way to inspect the repository's maintenance settings is sketched below.)
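
If it helps, once connected to the repository (e.g. read-only via the kopia CLI as sketched earlier), the maintenance configuration can be inspected like this; a minimal sketch, nothing here is specific to Velero:

  # show the maintenance schedule and parameters of the connected repository
  kopia maintenance info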

@dharanui
Author

dharanui commented Sep 24, 2024

Hi @Lyndon-Li

  1. No advanced features are enabled for the S3 bucket.
  2. Yes.
  3. Yes, we have deleted multiple backups: we created test backups to debug a few issues and deleted them once done. Also, our backups are retained for 14 days, so anything older than 14 days is deleted via the TTL setting.
  4. I guess the maintenance job for Kopia runs every 1 hour by default, not every 24 hours (see the check below). Could that be a potential problem?
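
For reference, a quick way to check the current value on the BackupRepository CR shared earlier (name and namespace taken from that example; this only reads the field and changes nothing):

  # print the maintenance frequency of the example repository CR
  kubectl -n velero get backuprepository active-tracking-default-kopia-svqtc -o jsonpath='{.spec.maintenanceFrequency}'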
