Storage System Alerts and Recommended Actions
This document details the various alert rules configured within the system, providing information on their impact, potential causes, and recommended actions.
ALERT_1401: [SECOR]: High Disk Usage Detected
Severity: critical
Affected System: Secor Storage
Impact Summary: "High disk usage in Secor can delay saving processed data into the system
Causes:
The system is running out of disk space.
Volume autoscaling might be disabled.
Volume autoscaling may have failed due to threshold limits or cloud provider restrictions on how frequently a volume can be scaled.
A high volume of data is being written to the persistent storage.
Old or unused data has accumulated in the Persistent Volume (PV).
Actions:
Enable volume autoscaling.
Increase the volume size of the Secor system.
Raise the volume autoscaler's threshold percentage so that volumes are expanded only at a higher utilization level, preventing frequent scaling operations.
Increase the PV size if needed.
For more assistance, contact administrative support.
Component: Storage: Secor
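To help triage this alert, the Python sketch below queries Prometheus for per-PVC utilization using the standard kubelet volume metrics and flags claims above a threshold. This is a minimal sketch: the Prometheus URL and the 80% threshold are illustrative assumptions, not values configured in this system.

# Sketch: flag PersistentVolumeClaims whose usage exceeds a threshold.
# Assumes a reachable Prometheus that scrapes kubelet volume stats;
# the URL and threshold below are placeholders, not system defaults.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed endpoint
USAGE_THRESHOLD = 0.80  # flag PVCs above 80% usage

QUERY = (
    "kubelet_volume_stats_used_bytes"
    " / kubelet_volume_stats_capacity_bytes"
)

def list_full_pvcs():
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        usage = float(result["value"][1])
        if usage >= USAGE_THRESHOLD:
            print(
                f"{labels.get('namespace')}/{labels.get('persistentvolumeclaim')}: "
                f"{usage:.0%} used"
            )

if __name__ == "__main__":
    list_full_pvcs()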
ALERT_1402: [PERSISTENT VOLUMES]: Storage volume resize request not processed by autoscaler
Severity: critical
Affected System: Persistent Volumes Storage
Impact Summary: The volume autoscaler failed to process the request to increase storage capacity, which may lead to system instability.
Causes:
Persistent Volume (PV) resizing is ignored by the volume autoscaler.
The available storage in the cluster is insufficient to resize PV.
The volume autoscaler might not be enabled.
Actions:
Review the ignored PVs and verify whether they are intentionally excluded from autoscaling.
Check whether the autoscaler is enabled and configured properly.
Ensure there is enough free storage in the cluster.
For more assistance, contact administrative support.
Component: Storage: Persistent volumes
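When resize requests appear to be ignored, one quick check is whether the StorageClass backing the affected PVCs permits expansion at all. The Python sketch below, using the official Kubernetes client, lists StorageClasses that do not set allowVolumeExpansion; cluster access via the local kubeconfig is assumed.

# Sketch: list StorageClasses that do not allow volume expansion.
# Assumes the kubernetes Python client and a kubeconfig with cluster access.
from kubernetes import client, config

def storage_classes_without_expansion():
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    storage_api = client.StorageV1Api()
    for sc in storage_api.list_storage_class().items:
        if not sc.allow_volume_expansion:
            print(f"StorageClass {sc.metadata.name}: allowVolumeExpansion is not enabled")

if __name__ == "__main__":
    storage_classes_without_expansion()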
ALERT_1403: [PERSISTENT VOLUMES]: Failed to automatically expand storage volume
Severity: critical
Affected System: Persistent Volumes Storage
Impact Summary: The volume autoscaler failed to expand the storage space (PV) due to certain limitations of the cloud provider. As a result, the system may become unhealthy.
Causes:
Volume Autoscaler failed to resize the Persistent Volume.
Available storage in the cluster is insufficient to resize PV.
Volume autoscaler might be misconfigured.
Actions:
Check if there is enough free storage in the cluster and allocate more if needed, or enable autoscaling.
Verify that the volume autoscaler is enabled and correctly configured.
For more assistance, contact administrative support.
Component: Storage: Persistent volumes
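When an expansion attempt fails, the Kubernetes events on the affected PVC usually record the reason, such as a cloud provider quota or resize-frequency limit. The sketch below lists recent Warning events for PVCs in a given namespace; the namespace name is a placeholder assumption.

# Sketch: print recent Warning events attached to PersistentVolumeClaims.
# Assumes the kubernetes Python client and a kubeconfig with cluster access;
# the namespace below is a placeholder.
from kubernetes import client, config

NAMESPACE = "default"  # replace with the namespace of the affected PVCs

def pvc_warning_events():
    config.load_kube_config()
    core = client.CoreV1Api()
    events = core.list_namespaced_event(
        NAMESPACE, field_selector="involvedObject.kind=PersistentVolumeClaim"
    )
    for event in events.items:
        if event.type == "Warning":
            print(f"{event.involved_object.name}: {event.reason} - {event.message}")

if __name__ == "__main__":
    pvc_warning_events()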
STORAGE_PG_BACKUP_004: [CRITICAL][POSTGRESQL BACKUP]: PostgreSQL database backup not found
Severity: critical
Affected System: PostgreSQL backup Storage
Impact Summary: Missing PostgreSQL database backup can risk data loss during failures or rollbacks, impacting data recovery.
Causes:
The backup job may have failed.
The PostgreSQL service is down.
The connection credentials could be incorrect.
The job scheduler configuration may be incorrect.
Actions:
Review the logs of the backup job to identify any errors or failures.
Ensure that the PostgreSQL service is running and accessible. Restart it if necessary.
Confirm that the database credentials used for the backup are correct.
Check if the job scheduler is configured correctly and running as expected.
Ensure there is enough storage available for the backup process.
For more assistance, contact administrative support.
Component: Storage: PostgreSQL backup
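A quick way to confirm whether a recent backup exists is to check the modification time of the newest object under the backup prefix in object storage. The sketch below uses boto3 against S3; the bucket name, prefix, and 24-hour freshness window are assumptions for illustration, not values defined by this system.

# Sketch: verify that a PostgreSQL backup object newer than MAX_AGE exists in S3.
# The bucket, prefix, and age window are placeholders, not values from this system.
from datetime import datetime, timedelta, timezone
import boto3

BUCKET = "example-backup-bucket"      # assumed bucket name
PREFIX = "postgresql/"                # assumed backup prefix
MAX_AGE = timedelta(hours=24)         # assumed freshness window

def latest_backup_is_fresh():
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = response.get("Contents", [])
    if not objects:
        print("No backup objects found under the prefix.")
        return False
    newest = max(objects, key=lambda obj: obj["LastModified"])
    age = datetime.now(timezone.utc) - newest["LastModified"]
    print(f"Newest backup: {newest['Key']} ({age} old)")
    return age <= MAX_AGE

if __name__ == "__main__":
    latest_backup_is_fresh()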
STORAGE_SECOR_UNIQUE_005: [CRITICAL][SECOR UNIQUE]: Kafka data backup not found
Severity: critical
Affected System: Secor - Unique Storage
Impact Summary: Missing Secor unique data backup can lead to permanent loss of Kafka messages, affecting data reliability and historical analysis in downstream systems.
Causes:
The Secor service might not be running or has crashed.
Insufficient storage space for backups.
The connection to the storage destination (e.g., S3) is unavailable.
Actions:
Review logs to identify specific errors causing the failure.
Ensure that Secor is running and restart if necessary.
Ensure that the target storage location has enough space and is accessible.
Review Secor configuration files for misconfigurations.
For more assistance, contact administrative support.
Component: Storage: Secor - Unique
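Before digging into storage-side issues, it is worth confirming that the Secor pods themselves are running. The sketch below lists pods matching a label selector and reports any that are not in the Running phase; the namespace and label selector are assumptions and should be adapted to your deployment. The same check applies to the Secor ingestion and denorm backup alerts below.

# Sketch: report Secor pods that are not in the Running phase.
# Assumes the kubernetes Python client; the namespace and label selector
# are placeholders for this illustration.
from kubernetes import client, config

NAMESPACE = "kafka"                  # assumed namespace
LABEL_SELECTOR = "app=secor"         # assumed label on Secor pods

def check_secor_pods():
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    if not pods.items:
        print("No Secor pods found; the service may not be deployed or running.")
    for pod in pods.items:
        if pod.status.phase != "Running":
            print(f"Pod {pod.metadata.name} is in phase {pod.status.phase}")

if __name__ == "__main__":
    check_secor_pods()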
STORAGE_SECOR_006: PostgreSQL database backup failed
Severity: critical
Affected System: Secor - Unique Storage
Impact Summary: The PostgreSQL database backup has not occurred within the expected timeframe, requiring immediate investigation to prevent potential data loss.
Causes:
The PostgreSQL database backup did not complete within the expected timeframe.
Actions:
Investigate the missed backup immediately.
For more assistance, contact administrative support.
Component: Storage: Secor - Unique
STORAGE_SECOR_INGESTION_005: [CRITICAL][SECOR INGESTION]: Kafka data backup not found
Severity: critical
Affected System: Secor - Ingestion Storage
Impact Summary: Missing Secor ingestion data backup can lead to permanent loss of Kafka messages, affecting data reliability and historical analysis in downstream systems.
Causes:
The Secor service might not be running or has crashed.
Insufficient storage space for backups.
The connection to the storage destination (e.g., S3) is unavailable.
Actions:
Review logs to identify specific errors causing the failure.
Ensure that Secor is running and restart if necessary.
Ensure that the target storage location has enough space and is accessible.
Review Secor configuration files for misconfigurations.
For more assistance, contact administrative support.
Component: Storage: Secor - Ingestion
STORAGE_SECOR_DENORM_005: [CRITICAL][SECOR DENORM]: Kafka data backup not found
Severity: critical
Affected System: Secor - Denorm Storage
Impact Summary: Missing Secor denorm data backup can lead to permanent loss of Kafka messages, affecting data reliability and historical analysis in downstream systems.
Causes:
The Secor service might not be running or has crashed.
Insufficient storage space for backups.
The connection to the storage destination (e.g., S3) is unavailable.
Actions:
Review logs to identify specific errors causing the failure.
Ensure that Secor is running and restart if necessary.
Ensure that the target storage location has enough space and is accessible.
Review Secor configuration files for misconfigurations.
For more assistance, contact administrative support.
Component: Storage: Secor - Denorm
ALERT_1408: [VELERO]: Kubernetes cluster backup not found
Severity: critical
Affected System: Velero backup Storage
Impact Summary: The backup process for the Kubernetes cluster failed; as a result, the system and its associated data may not be recoverable in the event of a failure or outage, impacting availability, disaster recovery, and rollback capabilities.
Causes:
The backup process has stopped due to a service failure.
The Velero backup job has exceeded the allowed time or encountered an error.
The backup destination is full.
Storage credentials provided are incorrect.
Velero is not able to connect to the S3 storage.
Actions:
Ensure Velero is up and running.
Review the Velero logs to identify specific errors.
Ensure the backup destination has enough space and is accessible.
Confirm correct credentials and backup settings in Velero.
For more assistance, contact administrative support.
Component: Storage: Velero backup
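Velero records each backup as a Backup custom resource, so the age and phase of the most recent backup can be checked directly from the cluster. The sketch below uses the Kubernetes custom objects API; the "velero" namespace is a common default but is an assumption here and should match your installation.

# Sketch: report the most recent Velero Backup custom resource and its phase.
# Assumes the kubernetes Python client and that Velero runs in the "velero"
# namespace (an assumption; adjust to your installation).
from kubernetes import client, config

VELERO_NAMESPACE = "velero"

def latest_velero_backup():
    config.load_kube_config()
    custom = client.CustomObjectsApi()
    backups = custom.list_namespaced_custom_object(
        group="velero.io", version="v1",
        namespace=VELERO_NAMESPACE, plural="backups",
    )
    items = backups.get("items", [])
    if not items:
        print("No Velero Backup resources found.")
        return
    newest = max(items, key=lambda b: b["metadata"]["creationTimestamp"])
    phase = newest.get("status", {}).get("phase", "Unknown")
    print(f"Latest backup: {newest['metadata']['name']} "
          f"created {newest['metadata']['creationTimestamp']}, phase: {phase}")

if __name__ == "__main__":
    latest_velero_backup()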