Storage System Alerts and Recommended Actions
This document details the various alert rules configured within the system, providing information on their impact, potential causes, and recommended actions.
ALERT_1401: [SECOR]: High Disk Usage Detected
Severity: critical
Affected System: Secor Storage
Impact Summary: "High disk usage in Secor can delay saving processed data into the system
Causes:
The system is running out of disk space.
Volume autoscaling might be disabled.
Volume autoscaling may have failed due to threshold limits or cloud provider restrictions on how frequently a volume can be scaled.
A high volume of data is being written to the persistent storage.
Old or unused data has accumulated in the Persistent Volume (PV).
Actions:
Enable volume autoscaling.
Increase the volume size of the Secor system.
Raise the volume autoscaler's threshold percentage so that volumes are expanded only at a higher utilization level, preventing frequent scaling operations.
Increase the PV size if needed.
For more assistance, contact administrative support.
Component: Storage: Secor
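To help triage this alert, the Python sketch below queries Prometheus for per-PVC utilization using the standard kubelet volume metrics and flags claims above a threshold. This is a minimal sketch: the Prometheus URL and the 80% threshold are illustrative assumptions, not values configured in this system.

# Sketch: flag PersistentVolumeClaims whose usage exceeds a threshold.
# Assumes a reachable Prometheus that scrapes kubelet volume stats;
# the URL and threshold below are placeholders, not system defaults.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed endpoint
USAGE_THRESHOLD = 0.80  # flag PVCs above 80% usage

QUERY = (
    "kubelet_volume_stats_used_bytes"
    " / kubelet_volume_stats_capacity_bytes"
)

def list_full_pvcs():
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        usage = float(result["value"][1])
        if usage >= USAGE_THRESHOLD:
            print(
                f"{labels.get('namespace')}/{labels.get('persistentvolumeclaim')}: "
                f"{usage:.0%} used"
            )

if __name__ == "__main__":
    list_full_pvcs()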
ALERT_1402: [PERSISTENT VOLUMES]: Storage volume resize request not processed by autoscaler
Severity: critical
Affected System: Persistent Volumes Storage
Impact Summary: The volume autoscaler failed to process the request to increase storage capacity, which may lead to system instability.
Causes:
Persistent Volume (PV) resizing is ignored by the volume autoscaler.
The available storage in the cluster is insufficient to resize PV.
The volume autoscaler might not be enabled.
Actions:
Review the ignored PVs and verify whether they are intentionally excluded from autoscaling.
Check whether the autoscaler is enabled and configured properly.
Ensure there is enough free storage in the cluster.
For more assistance, contact administrative support.
Component: Storage: Persistent volumes
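When resize requests appear to be ignored, one quick check is whether the StorageClass backing the affected PVCs permits expansion at all. The Python sketch below, using the official Kubernetes client, lists StorageClasses that do not set allowVolumeExpansion; cluster access via the local kubeconfig is assumed.

# Sketch: list StorageClasses that do not allow volume expansion.
# Assumes the kubernetes Python client and a kubeconfig with cluster access.
from kubernetes import client, config

def storage_classes_without_expansion():
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    storage_api = client.StorageV1Api()
    for sc in storage_api.list_storage_class().items:
        if not sc.allow_volume_expansion:
            print(f"StorageClass {sc.metadata.name}: allowVolumeExpansion is not enabled")

if __name__ == "__main__":
    storage_classes_without_expansion()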
ALERT_1403: [PERSISTENT VOLUMES]: Failed to automatically expand storage volume
Severity: critical
Affected System: Persistent Volumes Storage
Impact Summary: The volume autoscaler failed to expand the storage space (PV) due to certain limitations of the cloud provider. As a result, the system may become unhealthy.
Causes:
Volume Autoscaler failed to resize the Persistent Volume.
Available storage in the cluster is insufficient to resize PV.
Volume autoscaler might be misconfigured.
Actions:
Check if there is enough free storage in the cluster and allocate more if needed, or enable autoscaling.
Verify that the volume autoscaler is enabled and correctly configured.
For more assistance, contact administrative support.
Component: Storage: Persistent volumes
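When an expansion attempt fails, the Kubernetes events on the affected PVC usually record the reason, such as a cloud provider quota or resize-frequency limit. The sketch below lists recent Warning events for PVCs in a given namespace; the namespace name is a placeholder assumption.

# Sketch: print recent Warning events attached to PersistentVolumeClaims.
# Assumes the kubernetes Python client and a kubeconfig with cluster access;
# the namespace below is a placeholder.
from kubernetes import client, config

NAMESPACE = "default"  # replace with the namespace of the affected PVCs

def pvc_warning_events():
    config.load_kube_config()
    core = client.CoreV1Api()
    events = core.list_namespaced_event(
        NAMESPACE, field_selector="involvedObject.kind=PersistentVolumeClaim"
    )
    for event in events.items:
        if event.type == "Warning":
            print(f"{event.involved_object.name}: {event.reason} - {event.message}")

if __name__ == "__main__":
    pvc_warning_events()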
STORAGE_PG_BACKUP_004: [CRITICAL][POSTGRESQL BACKUP]: PostgreSQL database backup not found
Severity: critical
Affected System: PostgreSQL backup Storage
Impact Summary: Missing PostgreSQL database backup can risk data loss during failures or rollbacks, impacting data recovery.
Causes:
The backup job may have failed.
The PostgreSQL service is down.
The connection credentials could be incorrect.
The job scheduler configuration may be incorrect.
Actions:
Review the logs of the backup job to identify any errors or failures.
Ensure that the PostgreSQL service is running and accessible. Restart it if necessary.
Confirm that the database credentials used for the backup are correct.
Check if the job scheduler is configured correctly and running as expected.
Ensure there is enough storage available for the backup process.
For more assistance, contact administrative support.
Component: Storage: PostgreSQL backup
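A quick way to confirm whether a recent backup exists is to check the modification time of the newest object under the backup prefix in object storage. The sketch below uses boto3 against S3; the bucket name, prefix, and 24-hour freshness window are assumptions for illustration, not values defined by this system.

# Sketch: verify that a PostgreSQL backup object newer than MAX_AGE exists in S3.
# The bucket, prefix, and age window are placeholders, not values from this system.
from datetime import datetime, timedelta, timezone
import boto3

BUCKET = "example-backup-bucket"      # assumed bucket name
PREFIX = "postgresql/"                # assumed backup prefix
MAX_AGE = timedelta(hours=24)         # assumed freshness window

def latest_backup_is_fresh():
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = response.get("Contents", [])
    if not objects:
        print("No backup objects found under the prefix.")
        return False
    newest = max(objects, key=lambda obj: obj["LastModified"])
    age = datetime.now(timezone.utc) - newest["LastModified"]
    print(f"Newest backup: {newest['Key']} ({age} old)")
    return age <= MAX_AGE

if __name__ == "__main__":
    latest_backup_is_fresh()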
STORAGE_SECOR_UNIQUE_005: [CRITICAL][SECOR UNIQUE]: Kafka data backup not found
Severity: critical
Affected System: Secor - Unique Storage
Impact Summary: Missing Secor unique data backup can lead to permanent loss of Kafka messages, affecting data reliability and historical analysis in downstream systems.
Causes:
The Secor service might not be running or has crashed.
Insufficient storage space for backups.
The connection to the storage destination (e.g., S3) is unavailable.
Actions:
Review logs to identify specific errors causing the failure.
Ensure that Secor is running and restart if necessary.
Ensure that the target storage location has enough space and is accessible.
Review Secor configuration files for misconfigurations.
For more assistance, contact administrative support.
Component: Storage: Secor - Unique
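Before digging into storage-side issues, it is worth confirming that the Secor pods themselves are running. The sketch below lists pods matching a label selector and reports any that are not in the Running phase; the namespace and label selector are assumptions and should be adapted to your deployment. The same check applies to the Secor ingestion and denorm backup alerts below.

# Sketch: report Secor pods that are not in the Running phase.
# Assumes the kubernetes Python client; the namespace and label selector
# are placeholders for this illustration.
from kubernetes import client, config

NAMESPACE = "kafka"                  # assumed namespace
LABEL_SELECTOR = "app=secor"         # assumed label on Secor pods

def check_secor_pods():
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    if not pods.items:
        print("No Secor pods found; the service may not be deployed or running.")
    for pod in pods.items:
        if pod.status.phase != "Running":
            print(f"Pod {pod.metadata.name} is in phase {pod.status.phase}")

if __name__ == "__main__":
    check_secor_pods()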
STORAGE_SECOR_006: PostgreSQL database backup failed
Severity: critical
Affected System: Secor - Unique Storage
Impact Summary: The PostgreSQL database backup has not occurred within the expected timeframe, requiring immediate investigation to prevent potential data loss.
Causes:
The PostgreSQL database backup did not complete within the expected timeframe.
Actions:
Investigate the missed backup immediately.
For more assistance, contact administrative support.
Component: Storage: Secor - Unique
STORAGE_SECOR_INGESTION_005: [CRITICAL][SECOR INGESTION]: Kafka data backup not found
Severity: critical
Affected System: Secor - Ingestion Storage
Impact Summary: Missing Secor ingestion data backup can lead to permanent loss of Kafka messages, affecting data reliability and historical analysis in downstream systems.
Causes:
The Secor service might not be running or has crashed.
Insufficient storage space for backups.
The connection to the storage destination (e.g., S3) is unavailable.
Actions:
Review logs to identify specific errors causing the failure.
Ensure that Secor is running and restart if necessary.
Ensure that the target storage location has enough space and is accessible.
Review Secor configuration files for misconfigurations.
For more assistance, contact administrative support.
Component: Storage: Secor - Ingestion
STORAGE_SECOR_DENORM_005: [CRITICAL][SECOR DENORM]: Kafka data backup not found
Severity: critical
Affected System: Secor - Denorm Storage
Impact Summary: Missing Secor denorm data backup can lead to permanent loss of Kafka messages, affecting data reliability and historical analysis in downstream systems.
Causes:
The Secor service might not be running or has crashed.
Insufficient storage space for backups.
The connection to the storage destination (e.g., S3) is unavailable.
Actions:
Review logs to identify specific errors causing the failure.
Ensure that Secor is running and restart if necessary.
Ensure that the target storage location has enough space and is accessible.
Review Secor configuration files for misconfigurations.
For more assistance, contact administrative support.
Component: Storage: Secor - Denorm
ALERT_1408: [VELERO]: Kubernetes cluster backup not found
Severity: critical
Affected System: Velero backup Storage
Impact Summary: The backup process for the Kubernetes cluster failed; as a result, the system and its associated data may not be recoverable in the event of a failure or outage, impacting availability, disaster recovery, and rollback capabilities.
Causes:
The backup process has stopped due to a service failure.
The Velero backup job has exceeded the allowed time or encountered an error.
The backup destination is full.
Storage credentials provided are incorrect.
Velero is not able to connect to the S3 storage.
Actions:
Ensure Velero is up and running.
Review the Velero logs to identify specific errors.
Ensure the backup destination has enough space and is accessible.
Confirm correct credentials and backup settings in Velero.
For more assistance, contact administrative support.
Component: Storage: Velero backup
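Velero records each backup as a Backup custom resource, so the age and phase of the most recent backup can be checked directly from the cluster. The sketch below uses the Kubernetes custom objects API; the "velero" namespace is a common default but is an assumption here and should match your installation.

# Sketch: report the most recent Velero Backup custom resource and its phase.
# Assumes the kubernetes Python client and that Velero runs in the "velero"
# namespace (an assumption; adjust to your installation).
from kubernetes import client, config

VELERO_NAMESPACE = "velero"

def latest_velero_backup():
    config.load_kube_config()
    custom = client.CustomObjectsApi()
    backups = custom.list_namespaced_custom_object(
        group="velero.io", version="v1",
        namespace=VELERO_NAMESPACE, plural="backups",
    )
    items = backups.get("items", [])
    if not items:
        print("No Velero Backup resources found.")
        return
    newest = max(items, key=lambda b: b["metadata"]["creationTimestamp"])
    phase = newest.get("status", {}).get("phase", "Unknown")
    print(f"Latest backup: {newest['metadata']['name']} "
          f"created {newest['metadata']['creationTimestamp']}, phase: {phase}")

if __name__ == "__main__":
    latest_velero_backup()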