Troubleshoot Snapshots
When a snapshot fails, a support bundle is collected and stored automatically. Because the bundle is a point-in-time collection of all logs and system state at the time of the failed snapshot, it is a good place to start troubleshooting.
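To view the Velero logs directly, outside of a support bundle, you can query the Velero deployment with kubectl. The following assumes Velero is installed in its default velero namespace:
# Stream the logs from the Velero server deployment
kubectl logs deployment/velero -n velero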
Velero is Crashing
If Velero is crashing and not starting, some common causes are:
Invalid Cloud Credentials
Symptom
You see the following error message from Velero when trying to configure a snapshot.
time="2020-04-10T14:22:24Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-04-10T14:22:24Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-04-10T14:22:27Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-04-10T14:22:31Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-04-10T14:22:31Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
An error occurred: some backup storage locations are invalid: backup store for location "default" is invalid: rpc error: code = Unknown desc = NoSuchBucket: The specified bucket does not exist
status code: 404, request id: BEFAE2B9B05A2DCF, host id: YdlejsorQrn667ziO6Xr6gzwKJJ3jpZzZBMwwMIMpWj18Phfii6Za+dQ4AgfzRcxavQXYcgxRJI=
Cause
If the cloud access credentials are invalid, or do not have access to the location in the configuration, Velero crash loops. The Velero logs are included in a support bundle, and the error appears as shown above.
Solution
Replicated recommends that you validate the access key and secret, or the service account JSON, and confirm that the credentials have access to the configured storage location.
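As a starting point, you can inspect the credentials Velero is currently using and test them against the bucket. The secret and key names below assume a default Velero installation, and the aws command assumes an S3-compatible location with a hypothetical bucket named example-bucket; adjust both for your environment:
# Print the credentials stored in Velero's default credentials secret
kubectl get secret cloud-credentials -n velero -o jsonpath='{.data.cloud}' | base64 -d
# Confirm the credentials can list the configured bucket
# (replace example-bucket with the bucket from your storage location)
aws s3 ls s3://example-bucket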
Invalid Top-level Directories
Symptom
You see the following error message when Velero is starting:
time="2020-04-10T14:12:42Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-04-10T14:12:42Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-04-10T14:12:44Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-04-10T14:12:44Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-04-10T14:12:44Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
An error occurred: some backup storage locations are invalid: backup store for location "default" is invalid: Backup store contains invalid top-level directories: [other-directory]
Cause
This error occurs when Velero attempts to start while configured to use a bucket that has been reconfigured or reused. The bucket that Velero uses cannot contain any other data; if it does, Velero crashes.
Solution
Configure Velero to use a bucket that does not contain other data.
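To identify the offending directories, you can list the top level of the bucket with your cloud provider's CLI. The example below assumes an S3-compatible location with a hypothetical bucket named example-bucket; the directories reported in the error appear in the listing:
# List the top-level contents of the bucket used by the backup storage location
aws s3 ls s3://example-bucket/
Either remove the unrelated data or point Velero at an empty bucket. If the bucket must be shared, Velero can also store backups under a dedicated prefix within the bucket.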
Node Agent is Crashing
If the node-agent Pod is crashing and not starting, some common causes are:
Metrics Server is Failing to Start
Symptom
You see the following error in the node-agent logs.
time="2023-11-16T21:29:44Z" level=info msg="Starting metric server for node agent at address []" logSource="pkg/cmd/cli/nodeagent/server.go:229"
time="2023-11-16T21:29:44Z" level=fatal msg="Failed to start metric server for node agent at []: listen tcp :80: bind: permission denied" logSource="pkg/cmd/cli/nodeagent/server.go:236"
Cause
This is a result of a known issue in Velero 1.12.0 and 1.12.1 where the port is not set correctly when the metrics server starts. The metrics server then fails to start with a permission denied error in environments that do not run MinIO and that have Host Path, NFS, or internal storage destinations configured. When the metrics server fails to start, the node-agent Pod crashes. For more information, see the related issue in the Velero GitHub repository.
Solution
Replicated recommends that you either upgrade to Velero 1.12.2 or later, or downgrade to a version earlier than 1.12.0.
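To check which Velero version is currently deployed before upgrading or downgrading, you can use either of the following. The velero command assumes the Velero CLI is installed on the machine where you run it:
# Report the client and server versions
velero version
# Or read the image tag directly from the deployment
kubectl get deployment velero -n velero -o jsonpath='{.spec.template.spec.containers[0].image}'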
Snapshot Creation is Failing
Timeout Error when Creating a Snapshot
Symptom
You see a backup error that includes a timeout message when attempting to create a snapshot. For example:
Error backing up item
timed out after 12h0m0s
Cause
This error message appears when the node-agent (restic) Pod operation timeout limit is reached. In Velero v1.4.2 and later, the default timeout is 240 minutes.
Restic is an open-source backup tool. Velero integrates with Restic to provide a solution for backing up and restoring Kubernetes volumes. For more information about the Velero Restic integration, see File System Backup in the Velero documentation.
Solution
Use the kubectl Kubernetes command-line tool to patch the Velero deployment to increase the timeout:
Velero version 1.10 and later:
kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--fs-backup-timeout=TIMEOUT_LIMIT"}]'
Velero versions less than 1.10:
kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=TIMEOUT_LIMIT"}]'
Replace TIMEOUT_LIMIT with a length of time for the node-agent (restic) Pod operation timeout in hours, minutes, and seconds. Use the format 0h0m0s. For example, 48h30m0s.
The timeout value reverts to the default if you rerun the velero install command.
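For example, the following sets a 48-hour, 30-minute timeout on Velero 1.10 or later and then confirms that the argument was applied:
# Set the file system backup timeout to 48h30m0s
kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--fs-backup-timeout=48h30m0s"}]'
# Verify the new argument on the Velero container
kubectl get deployment velero -n velero -o jsonpath='{.spec.template.spec.containers[0].args}'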