Infra Alerts and Recommended Actions

This document details the various alert rules configured within the system, providing information on their impact, potential causes, and recommended actions.

ALERT_1001: [KAFKA]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Kafka Data Processing

Impact Summary: High CPU usage in Kafka may disrupt the data ingestion process, causing delays in data being processed and affecting the real-time flow of data into the dataset.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Kafka[]
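The auto-scaling action above can be sketched as a Kubernetes HorizontalPodAutoscaler. This is a minimal illustration, not the deployed configuration: the namespace, workload name, replica bounds, and CPU threshold are all assumptions to adapt to your cluster.

```yaml
# Hypothetical HPA for the Kafka workload -- names and thresholds are
# placeholders, not actual deployment values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-hpa
  namespace: kafka            # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: kafka               # assumed workload name
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # scale out above 75% average CPU
```

The same pattern applies to the other CPU alerts below wherever the underlying workload is safe to scale horizontally; stateful services such as Kafka brokers may need manual capacity planning instead.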


ALERT_1002: [VALKEY DENORM]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Valkey Caching Service

Impact Summary: High CPU usage in Valkey may stop the caching of new data, causing delays in processing real-time data and preventing new data from being queried.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Valkey[]


ALERT_1003: [VALKEY DEDUPE]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Valkey Caching Service

Impact Summary: High CPU usage in Valkey may stop the caching of new data, causing delays in processing real-time data and preventing new data from being queried.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Valkey[]


ALERT_1004: [DRUID HISTORICAL]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Druid Historical Nodes

Impact Summary: High CPU usage in Druid Historicals can affect the ability to access old segments, leading to delayed or incomplete query results.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Historicals]


ALERT_1005: [DRUID INDEXER]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Druid Indexer Nodes

Impact Summary: High CPU usage in Druid Indexer can interrupt data ingestion, making real-time data unavailable for querying.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Indexer]


ALERT_1006: [DRUID OVERLORD]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Druid Overlord Node

Impact Summary: High CPU usage in Druid Overlord can interrupt the handling of ingestion tasks, leading to delays or loss of new data, and impacting real-time data querying.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Overlord]


ALERT_1007: [DRUID BROKER]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Druid Broker Nodes

Impact Summary: High CPU usage in Druid Broker can cause query routing issues, leading to delays while querying the data.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Broker]


ALERT_1008: [PROMETHEUS]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Prometheus Monitoring Service

Impact Summary: High CPU usage in Prometheus can cause delays in collecting and processing metrics, potentially leading to slower response times for queries and incomplete data for monitoring and alerting.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Prometheus[]
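Alerts of this shape are typically defined as Prometheus alerting rules. The sketch below shows what a high-CPU rule might look like; the metric names are the standard cAdvisor and kube-state-metrics ones, but the namespace, threshold, and duration are assumptions, not the rule actually deployed here.

```yaml
# Illustrative alerting rule -- threshold, namespace, and labels are
# placeholders to adapt to the real rule files.
groups:
  - name: infra-cpu
    rules:
      - alert: HighCpuUsage
        expr: |
          sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m]))
            /
          sum by (pod) (kube_pod_container_resource_limits{namespace="monitoring", resource="cpu"})
            > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage Detected. System Under Heavy Load."
```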


ALERT_1009: [GRAFANA]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Grafana Visualization Service

Impact Summary: High CPU usage in Grafana can lead to delayed alerting, affecting the monitoring and analysis of system metrics.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Grafana[]


[LOKI]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Loki Logging Service

Impact Summary: High CPU usage in Loki can delay log ingestion and querying, impacting real-time log visibility and troubleshooting.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Loki[]


[POSTGRES]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: PostgreSQL Database Service

Impact Summary: High CPU usage in PostgreSQL can lead to slow query performance and potential application timeouts, affecting data access and overall system responsiveness.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Postgres[]
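For the PostgreSQL alert, identifying the root cause usually starts with finding the expensive queries. A sketch using the standard pg_stat_activity view (no extensions assumed):

```sql
-- List the longest-running active queries to spot CPU-heavy statements.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC
LIMIT 10;
```

If the pg_stat_statements extension is installed, its per-statement totals give a longer-term view of which queries dominate CPU time.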


[API SERVICE]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: API Service

Impact Summary: High CPU usage in the API service can cause slower response times and failed requests, potentially affecting service availability.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: API Service[]


INFRA_SECOR_012: [SECOR BACKUP]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Secor Backup Service

Impact Summary: High CPU usage in Secor can delay Kafka topic backups to cloud storage, increasing the risk of data loss during outages or failures.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Secor[]


INFRA_KAFKA_MSG_EXPORTER_013: [KAFKA MESSAGE EXPORTER]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Kafka Message Exporter Service

Impact Summary: High CPU usage in Kafka Message Exporter can result in delayed or missing metric exports to Prometheus, impacting monitoring accuracy.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Kafka message exporter[]


ALERT_1010: [UNIFIED PIPELINE]: High CPU Usage Detected. System Under Heavy Load.

Severity: warning

Affected System: Unified Data Processing Pipeline

Impact Summary: High CPU usage in the Unified pipeline can interrupt the processing of real-time data, making it unavailable for querying.

Causes:

  • Insufficient resource allocation, leaving the system short of CPU.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Manually allocate more CPU if necessary.

  • Enable auto-scaling if it is not enabled.

  • Identify the root cause of the high CPU consumption.

  • Monitor CPU usage closely.

  • For further assistance, contact administrative support.

Component: Infra: Unified pipeline[]


ALERT_1011: [KAFKA]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Kafka Data Processing

Impact Summary: High Memory usage in Kafka may disrupt the data ingestion process, causing delays in data being processed and affecting the real-time flow of data into the dataset.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Kafka[]
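Allocating more memory usually means raising the container's resource requests and limits. An illustrative stanza; the sizes are placeholders and should be set from observed usage, not copied verbatim.

```yaml
# Illustrative container resources stanza -- placeholder sizes.
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "8Gi"
```

For JVM-based services such as Kafka, remember to raise the heap settings (e.g. KAFKA_HEAP_OPTS) alongside the container limit, or the extra memory will go unused.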


ALERT_1012: [VALKEY DENORM]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Valkey Caching Service

Impact Summary: High Memory usage in Valkey may stop the caching of new data, causing delays in processing real-time data and preventing new data from being queried.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Valkey[]


ALERT_1013: [VALKEY DEDUPE]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Valkey Caching Service

Impact Summary: High Memory usage in Valkey may stop the caching of new data, causing delays in processing real-time data and preventing new data from being queried.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Valkey[]


ALERT_1014: [DRUID HISTORICAL]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Druid Historical Nodes

Impact Summary: High Memory usage in Druid Historicals can affect the ability to access old segments, leading to delayed or incomplete query results.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Historicals]


ALERT_1015: [DRUID INDEXER]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Druid Indexer Nodes

Impact Summary: High Memory usage in Druid Indexer can interrupt data ingestion, making real-time data unavailable for querying.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Indexer]


ALERT_1016: [DRUID OVERLORD]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Druid Overlord Node

Impact Summary: High memory usage in Druid Overlord can interrupt the handling of ingestion tasks, leading to delays or loss of new data, and impacting real-time data querying.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Overlord]


ALERT_1017: [DRUID BROKER]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Druid Broker Nodes

Impact Summary: High memory usage in Druid Broker can cause query routing issues, leading to delays while querying the data.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Druid[Broker]


ALERT_1018: [PROMETHEUS]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Prometheus Monitoring Service

Impact Summary: High Memory usage in Prometheus can cause delays in collecting and processing metrics, potentially leading to slower response times for queries and incomplete data for monitoring and alerting.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Prometheus[]


ALERT_1019: [GRAFANA]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Grafana Visualization Service

Impact Summary: High Memory usage in Grafana can lead to delayed alerting, affecting the monitoring and analysis of system metrics.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Grafana[]


INFRA_LOKI_023: [LOKI]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Loki Logging Service

Impact Summary: High memory usage in Loki can delay log ingestion and querying, impacting real-time log visibility and troubleshooting.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Loki[]


ALERT_1020: [UNIFIED PIPELINE]: High Memory Usage Detected. System Could Become Unstable.

Severity: warning

Affected System: Unified Data Processing Pipeline

Impact Summary: High memory usage in the Unified pipeline can interrupt the processing of real-time data, making it unavailable for querying.

Causes:

  • Insufficient resource allocation, leaving the system short of memory.

  • Autoscaling may not be enabled.

  • A high volume of data is being processed and queried.

Actions:

  • Allocate more memory if necessary, or enable auto-scaling if it is not enabled.

  • Monitor memory usage closely.

  • Identify the root cause of the high memory consumption.

  • For further assistance, contact administrative support.

Component: Infra: Unified Pipeline[]


ALERT_1021: [KUBERNETES NODE]: High CPU usage is detected in the Cluster Nodes.

Severity: warning

Affected System: Kubernetes Nodes

Impact Summary: Persistent high CPU usage can put Kubernetes nodes under resource pressure, leading to pod throttling, evictions, or failed launches, and impacting overall cluster health.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Kubernetes Node[]
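Node-level CPU is usually watched through the standard node-exporter metric. A hedged sketch of the expression such an alert might use; the 80% threshold is an assumption, not the configured value.

```yaml
# Illustrative node CPU expression (node-exporter metric names).
expr: |
  100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
```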


ALERT_1022: [KUBERNETES NODE]: High memory usage is detected in the Cluster Nodes.

Severity: warning

Affected System: Kubernetes Nodes

Impact Summary: Persistent high memory usage can lead to resource exhaustion, causing performance degradation and potential service failures.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Kubernetes Node[]
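The "check the Loki instance logs" action can be sketched as a LogQL query, run from Grafana's Explore view. The label names below are assumptions about how logs are tagged in this cluster; substitute the namespace and app of the affected workload.

```logql
# Hypothetical labels -- adjust namespace/app to the affected workload.
{namespace="monitoring", app="node-exporter"} |= "error"
```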


ALERT_1023: [KAFKA]: System has restarted. May Affect Ingestion of data

Severity: critical

Affected System: Kafka Data Processing

Impact Summary: Frequent restarts in the Kafka system may disrupt the data ingestion process, causing delays in data being processed and affecting the real-time flow of data into the dataset.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Kafka[]
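Restart alerts like this one are commonly driven by the kube-state-metrics restart counter. A sketch of such an expression; the namespace, window, and threshold are assumptions, not the deployed rule.

```yaml
# Illustrative restart-detection expression (kube-state-metrics).
expr: increase(kube_pod_container_status_restarts_total{namespace="kafka"}[15m]) > 2
labels:
  severity: critical
```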


ALERT_1024: [DRUID HISTORICALS]: System has Restarted. May Affect Querying.

Severity: critical

Affected System: Druid Historical Nodes

Impact Summary: Frequent restarts in Druid Historicals can affect the ability to access old segments, leading to delayed or incomplete query results.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Druid[Historicals]


ALERT_1025: [DRUID INDEXER]: System has Restarted. May Affect Querying.

Severity: critical

Affected System: Druid Indexer Nodes

Impact Summary: Frequent restarts in Druid Indexer can interrupt data ingestion. As a result, real-time data is unavailable for querying.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Druid[Indexer]


ALERT_1026: [DRUID OVERLORD]: System has Restarted. May Affect Querying.

Severity: critical

Affected System: Druid Overlord Node

Impact Summary: Frequent restarts in Druid Overlord can interrupt the handling of ingestion tasks, leading to delays or loss of new data, and impacting real-time data querying.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Druid[Overlord]


ALERT_1027: [DRUID BROKER]: System has Restarted. May Affect Querying.

Severity: critical

Affected System: Druid Broker Node

Impact Summary: Frequent restarts in Druid Broker can cause query routing issues, leading to delays while querying the data.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Druid[Broker]


ALERT_1028: [DRUID COORDINATOR]: System has Restarted. May Affect Querying.

Severity: critical

Affected System: Druid Coordinator Node

Impact Summary: Frequent restarts in Druid Coordinator can disrupt segment balancing and the availability of historical data, leading to incomplete query results.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Druid[Coordinator]


ALERT_1029: [API SERVICE]: System has restarted. May Affect API Availability.

Severity: critical

Affected System: API service

Impact Summary: Frequent restarts in the API service can disrupt dataset management, making it difficult to perform operations on the dataset.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: API Service[]


ALERT_1030: [PROMETHEUS]: System has restarted. May Affect Monitoring.

Severity: critical

Affected System: Prometheus Monitoring Service

Impact Summary: Frequent restarts of Prometheus can interrupt metric scraping and storage, causing missing data points and delays in alerting and monitoring.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Prometheus[]


ALERT_1031: [GRAFANA]: System has restarted. May Affect Monitoring.

Severity: critical

Affected System: Grafana Monitoring Service

Impact Summary: Frequent restarts of Grafana can affect monitoring, resulting in delays or issues with alerting about the system.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Grafana[]


ALERT_1032: [LOKI]: System has restarted. May Affect Monitoring.

Severity: critical

Affected System: Loki Monitoring Service

Impact Summary: Frequent restarts in Loki can disrupt the log ingestion process, causing delays in storing logs and potentially leading to issues when analyzing them.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Loki[]


ALERT_1033: [POSTGRES]: System has restarted. May Affect Database Connectivity.

Severity: critical

Affected System: PostgreSQL Database Service

Impact Summary: Frequent restarts in PostgreSQL can disrupt dataset management, affecting read/write operations and potentially leading to failed dataset transactions.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Postgres[]


ALERT_1034: [VALKEY]: System has restarted. May Affect Data Enrichment.

Severity: critical

Affected System: Valkey Caching Service

Impact Summary: A system restart in Valkey may stop the caching of new data, causing delays in processing real-time data and preventing new data from being queried.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the relevant logs.

Component: Infra: Valkey[]


ALERT_1035: [UNIFIED PIPELINE]: System has restarted. May Affect Dataset.

Severity: critical

Affected System: Unified Data Processing Pipeline

Impact Summary: Frequent restarts in the Unified pipeline can interrupt the processing of real-time data, making it unavailable for querying.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Unified Pipeline[]


ALERT_1036: [CACHE INDEXER]: System has restarted. May Affect Monitoring.

Severity: critical

Affected System: Cache Indexer Processing Pipeline

Impact Summary: A system restart in the Cache Indexer job may produce inaccurate processing results and mark all datasets unhealthy, causing delays in processing real-time data and preventing new data from being queried.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Cache Indexer[]


ALERT_1037: [SECOR]: System has restarted. May Affect Services.

Severity: critical

Affected System: Secor Kafka Backup Service

Impact Summary: A system restart in Secor may affect the backup of ingested data, potentially leading to delays or inconsistencies in data availability.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Secor[]


ALERT_1038: [WEB CONSOLE]: System has restarted. May Affect Data Analytics View in Web Console.

Severity: critical

Affected System: Web console UI service

Impact Summary: Frequent web console restarts can disrupt user interactions, affecting visualization of data for all the datasets.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Web Console[]


ALERT_1039: [PROMTAIL]: System has restarted. May Affect Log Ingestion.

Severity: critical

Affected System: Promtail Service

Impact Summary: Frequent restarts in Promtail can disrupt the log ingestion process, causing delays in storing logs and potentially leading to issues when analyzing them.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Promtail[]


ALERT_1040: [KEYCLOAK]: System has restarted. May Affect User Authentication and Login Flow

Severity: critical

Affected System: Keycloak Authentication Service

Impact Summary: Frequent Keycloak restarts can disrupt authentication and authorization flows, causing login failures and access issues in the system.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Keycloak[]
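After a Keycloak restart, readiness and the login flow can be spot-checked over HTTP. A minimal sketch, assuming a hypothetical internal URL; note that Keycloak only serves `/health/ready` when health checks are enabled (`KC_HEALTH_ENABLED=true`), and on recent versions the health endpoints may live on a separate management port:

```shell
# Hedged sketch: the endpoint URL and realm name are assumptions.
KC_URL=http://keycloak.example.internal:8080   # hypothetical endpoint

# Confirm Keycloak reports itself ready to serve authentication requests
curl -fsS "$KC_URL/health/ready"

# Exercise the auth stack end to end via the OIDC discovery document
curl -fsS "$KC_URL/realms/master/.well-known/openid-configuration" | head -c 200
```

A successful discovery-document fetch is a stronger signal than the readiness probe alone, since it confirms the realm configuration loaded correctly.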


ALERT_1041: [SUPERSET]: System has restarted. May Affect Superset Dashboard Access

Severity: critical

Affected System: Superset service

Impact Summary: Frequent Superset restarts may lead to dashboard unavailability, impacting access to analytical insights for the datasets.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Superset[]


ALERT_1042: [S3 EXPORTER]: System has restarted. May Affect Data export to S3.

Severity: critical

Affected System: S3 Exporter Monitoring Service

Impact Summary: Frequent restarts of the S3 Exporter may interrupt the export of S3 metrics, potentially impacting the monitoring of cloud storage health and performance.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: S3 Exporter[]


ALERT_1043: [MANAGEMENT API]: System has restarted. May Affect dataset management.

Severity: critical

Affected System: Management API service

Impact Summary: Frequent restarts in the Management API service can disrupt dataset management, making it difficult to perform operations on the dataset.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Management API[]


ALERT_1044: [VELERO]: System has restarted. May Affect cluster backup.

Severity: critical

Affected System: Velero Cluster Backup Service

Impact Summary: A system restart in Velero may affect the Kubernetes cluster backup, potentially making backup and restore services temporarily unavailable.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Velero[]
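After a Velero restart, the key question is whether any backups failed while the server was down. A minimal sketch using the standard `velero` CLI, assuming it is installed and configured against the affected cluster (the default `velero` namespace is an assumption):

```shell
# Hedged sketch: assumes the velero CLI is configured for this cluster
# and that Velero runs in the default "velero" namespace.

# Confirm the Velero server pod is healthy again
kubectl get pods -n velero

# List recent backups and their phases to spot Failed or PartiallyFailed runs
velero backup get

# Inspect the logs of a suspect backup
# (replace <backup-name> with an actual name from the list above)
velero backup logs <backup-name>
```

Any backup stuck in `InProgress` from before the restart will not resume; it should be investigated and re-run.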


ALERT_1045: [VOLUME AUTOSCALER]: System has restarted. Automatic Volume Scaling May Be Disrupted.

Severity: critical

Affected System: Persistent Volume Storage

Impact Summary: Frequent restarts of the Volume Autoscaler may delay persistent volume resizing, potentially leading to disruptions in system operations.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: Volume Autoscaler[]


ALERT_1046: [SYSTEM RULES INGESTOR]: System has restarted. May Affect System Health

Severity: critical

Affected System: System Health Monitoring

Impact Summary: Frequent restarts of the System Rules Ingestor may affect the monitoring of the system, causing delays in tracking system health.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: System Rules Ingestor[]


ALERT_1047: [POSTGRES MIGRATION]: System has restarted. May Affect dataset management.

Severity: critical

Affected System: PostgreSQL Service

Impact Summary: Frequent restarts of the PostgreSQL migration process may disrupt the migration flow, affecting dataset management and causing delays in database table creation.

Causes:

  • The system could be running out of resources (CPU, memory, or disk).

  • The system might be encountering functional errors.

  • Invalid configurations could be causing the system to restart.

Actions:

  • Monitor CPU, memory, and disk usage closely.

  • Check the Loki instance logs for any errors.

  • Manually restart the service or pod and check whether the issue persists.

  • Ensure all required variables and configurations are correctly set.

  • For further assistance, contact administrative support and include the logs.

Component: Infra: PostgreSQL Migration[]

