Querying System Alerts and Recommended Actions

This document describes the alert rules configured in the system. For each alert, it lists the impact, potential causes, and recommended actions.

ALERT_1301: [DRUID HISTORICAL]: High Disk Usage Detected

Severity: critical

Affected System: Druid Historical Nodes

Impact Summary: High disk usage on Druid Historical nodes can prevent older data from being queried, leading to incomplete query results and reduced data availability.

Causes:

  • The system is running out of disk space.

  • Volume autoscaling might be disabled.

  • Volume autoscaling may have failed due to threshold limits, subject to cloud provider limitations on scaling frequency.

  • A high volume of data is being written to the persistent storage.

  • A large amount of old or unused data has accumulated in the persistent volume (PV).

Actions:

  • Enable volume autoscaling.

  • Raise the volume autoscaler's threshold percentage so that the volume is expanded at a higher utilization level, which reduces how often scaling is triggered.

  • Increase the PV size if needed.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Historicals]
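
To confirm the disk-usage condition behind this alert, the following minimal Python sketch reports how full a volume is. The mount path and the 80% threshold are assumptions chosen for illustration; they are not values taken from this system.

```python
import shutil

# Hypothetical values for illustration; replace them with the mount path
# and threshold that apply to your deployment.
VOLUME_PATH = "/"                # assumed mount point of the persistent volume
USAGE_THRESHOLD_PERCENT = 80.0   # assumed alerting threshold


def usage_percent(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def volume_report(path: str, threshold: float = USAGE_THRESHOLD_PERCENT) -> dict:
    """Measure disk usage at `path` and flag it when it reaches `threshold`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    percent = usage_percent(usage.total, usage.used)
    return {
        "path": path,
        "used_percent": round(percent, 1),
        "free_gib": round(usage.free / 2**30, 2),
        "above_threshold": percent >= threshold,
    }


print(volume_report(VOLUME_PATH))
```

Run the script on the node (or inside the pod) that mounts the volume, since `shutil.disk_usage` only sees local file systems.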


ALERT_1302: [DRUID INDEXER]: High Disk Usage Detected

Severity: critical

Affected System: Druid Indexer Nodes

Impact Summary: High disk usage on Druid Indexer nodes can interrupt data ingestion, leaving real-time data unavailable for querying.

Causes:

  • The system is running out of disk space.

  • Volume autoscaling might be disabled.

  • Volume autoscaling may have failed due to threshold limits, subject to cloud provider limitations on scaling frequency.

  • A high volume of data is being written to the persistent storage.

  • A large amount of old or unused data has accumulated in the persistent volume (PV).

Actions:

  • Enable volume autoscaling.

  • Raise the volume autoscaler's threshold percentage so that the volume is expanded at a higher utilization level, which reduces how often scaling is triggered.

  • Increase the PV size if needed.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexers]
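
The effect of raising the autoscaler threshold can be illustrated with a toy model. The sketch below assumes a hypothetical autoscaler that expands the volume by a fixed percentage each time usage reaches the threshold; the real autoscaler and the cloud provider may behave differently, and all numbers are made up for illustration.

```python
def scale_events(data_gib: float, volume_gib: float,
                 threshold_pct: float, growth_pct: float):
    """Count the expansions needed until `data_gib` sits below the threshold.

    Toy model (an assumption, not the real autoscaler): whenever usage
    reaches `threshold_pct` of the volume, the volume grows by `growth_pct`.
    """
    events = 0
    while data_gib >= volume_gib * threshold_pct / 100.0:
        volume_gib *= 1 + growth_pct / 100.0
        events += 1
    return events, round(volume_gib, 1)


# Hypothetical scenario: 500 GiB of data on a volume that started at 100 GiB,
# growing by 20% per expansion, compared at two threshold settings.
for threshold in (70, 90):
    events, final_size = scale_events(500, 100, threshold, 20)
    print(f"threshold {threshold}%: {events} expansions, final size {final_size} GiB")
```

In this model the higher threshold needs fewer expansions for the same data volume, but it also leaves less free headroom between expansions, so the threshold should not be set so high that a scaling delay lets the disk fill up.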


QUERY_DRUID_INDEXER_003: [CRITICAL][DRUID INDEXER]: Detected a high impact on real-time querying performance.

Severity: critical

Affected System: Druid Indexer Nodes

Impact Summary: Failure to query data disrupts access to real-time data, affecting downstream analytics that rely on timely insights.

Causes:

  • Druid Indexer might be down.

  • The datasource might not be fully available for query.

  • The datasource is unavailable in Druid.

  • Resources allocated are insufficient.

Actions:

  • Restart the Druid Indexer and check logs for errors.

  • Check that the datasource segments are fully available and loaded.

  • Verify that the datasource exists in Druid, and restart the Druid supervisor if needed.

  • Enable autoscaling if it is not already enabled.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexers]
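
The segment-availability and datasource checks above can be sketched with the Druid Coordinator load-status API, which returns the percentage of segments loaded per datasource. The Coordinator address and the datasource names below are assumptions for illustration only.

```python
import json
from urllib.request import urlopen

# Assumption: the Coordinator is reachable at this address.
COORDINATOR_URL = "http://localhost:8081"


def fetch_load_status(coordinator_url: str = COORDINATOR_URL) -> dict:
    """Fetch {datasource: percent of segments loaded} from the Coordinator."""
    url = f"{coordinator_url}/druid/coordinator/v1/loadstatus"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def classify_datasources(expected: list, load_status: dict) -> dict:
    """Label each expected datasource as 'missing', 'partial', or 'loaded'."""
    result = {}
    for name in expected:
        if name not in load_status:
            result[name] = "missing"   # not reported by the Coordinator
        elif load_status[name] < 100.0:
            result[name] = "partial"   # some segments are not loaded yet
        else:
            result[name] = "loaded"
    return result


# Offline example with a hypothetical response. Against a live cluster,
# pass fetch_load_status() as the second argument instead.
sample_status = {"orders": 100.0, "clicks": 87.5}
print(classify_datasources(["orders", "clicks", "payments"], sample_status))
```

A datasource labelled "missing" is not reported in the load status, which usually means the Coordinator knows of no published segments for it; a datasource labelled "partial" exists but is not yet fully queryable.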


QUERY_DRUID_HISTORICAL_004: [CRITICAL][DRUID HISTORICAL]: Detected a high impact on real-time querying performance.

Severity: critical

Affected System: Druid Historical Nodes

Impact Summary: Failure to query data disrupts access to real-time data, affecting downstream analytics that rely on timely insights.

Causes:

  • Historical Node might be down.

  • The datasource might not be fully available for query.

  • The datasource is unavailable in Druid.

  • Resources allocated are insufficient.

Actions:

  • Restart the Historical Node and verify segment loading status.

  • Check that the datasource segments are fully available and loaded.

  • Verify that the datasource exists in Druid, and restart the Druid supervisor if needed.

  • Enable autoscaling if it is not already enabled.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Historicals]


ALERT_1305: [API]: The Data Query API is encountering more failures to retrieve the data

Severity: critical

Affected System: API Querying

Impact Summary: Query failures are preventing access to the dataset, so data cannot be retrieved as expected.

Causes:

  • An invalid API request might have been sent, resulting in increased failures.

  • The API service might be unable to handle the payload.

  • The API service might be down or experiencing frequent restarts.

Actions:

  • Check the API service pod status to ensure it is running.

  • Check the API service logs for errors.

  • Check whether the datasource exists in Druid.

  • For more assistance, contact administrative support.

Component: Querying: API[]


ALERT_1306: [API]: The Data Query API is facing delays in retrieving data

Severity: warning

Affected System: API Querying

Impact Summary: Slow queries are delaying data retrieval from the dataset.

Causes:

  • Too many concurrent queries might be affecting performance.

  • The Query API service is running low on CPU or memory.

  • Queries are queued due to excessive load and resource limitations.

Actions:

  • Monitor the API service to ensure it is running and responding correctly.

  • Check service logs for errors or failures.

  • Increase CPU and memory resources if the service is experiencing high load.

  • For more assistance, contact administrative support.

Component: Querying: API[]


ALERT_1307: [DRUID HISTORICAL]: Detected higher amount of query lag than expected

Severity: warning

Affected System: Druid Indexer Nodes Querying

Impact Summary: High indexer lag indicates that processing of new data is delayed, so recent data is not yet available when querying the dataset.

Causes:

  • Druid Indexer lag is high.

  • Too many query requests are executed simultaneously.

  • Queries are queuing due to excessive load and resource limitations.

  • Inefficient segment partitioning may lead to uneven query load balancing.

Actions:

  • Monitor the lag and investigate its cause.

  • Scale Broker and Historical nodes based on load.

  • Enable autoscaling if necessary.

  • Increase the number of partitions if query performance is impacted by large segment sizes.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexer]


ALERT_1308: [DRUID INDEXER]: Detected higher amount of unparseable data.

Severity: warning

Affected System: Druid Indexer Nodes Querying

Impact Summary: Unparseable data has been detected in the system and cannot be processed. As a result, queries on this dataset may return incomplete or inaccurate data until the issue is resolved.

Causes:

  • Timestamp Parsing Errors – Incorrect or missing timestamp fields, or an unsupported date format.

  • Schema Mismatch – Required fields are missing, or field data types do not align with the ingestion schema.

  • Incorrect Data Format – The ingested data does not match the expected format (e.g., JSON, Avro, Parquet).

  • Incorrect JSON Path Expressions – Misconfigured JSON path in ingestion specifications.

Actions:

  • Check the Druid ingestion task logs for specific error messages related to unparseable events.

  • Verify that the ingested data matches the expected format and schema, ensuring there are no missing fields, type mismatches, or encoding issues.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexer]
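
The causes listed above can be reproduced offline by validating a sample of the incoming records before they reach Druid. The sketch below is a simplified stand-in for Druid's own parsing, not a replica of it: it assumes newline-delimited JSON, ISO 8601 timestamps, and hypothetical field names (`timestamp`, `device_id`).

```python
import json
from datetime import datetime

# Hypothetical schema for illustration; use the fields from your ingestion spec.
REQUIRED_FIELDS = ("timestamp", "device_id")
TIMESTAMP_FIELD = "timestamp"


def parse_failure_reason(raw_line: str,
                         required=REQUIRED_FIELDS,
                         ts_field: str = TIMESTAMP_FIELD):
    """Return why a record would be rejected, or None if it looks valid."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return "incorrect data format: not valid JSON"
    if not isinstance(record, dict):
        return "incorrect data format: expected a JSON object"
    missing = [field for field in required if field not in record]
    if missing:
        return f"schema mismatch: missing fields {missing}"
    value = record.get(ts_field)
    if not isinstance(value, str):
        return "timestamp error: expected an ISO 8601 string"
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return "timestamp error: unsupported date format"
    return None


sample_lines = [
    '{"timestamp": "2024-05-01T10:00:00Z", "device_id": "a1"}',
    '{"timestamp": "01/05/2024", "device_id": "a1"}',
    '{"device_id": "a1"}',
    'not json at all',
]
for line in sample_lines:
    print(parse_failure_reason(line) or "ok", "<-", line)
```

Running a check like this against a small sample from the source stream helps narrow down which of the listed causes applies before reading through the full task logs.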


ALERT_1309: [DRUID INDEXER]: Druid supervisor is in an unhealthy state

Severity: critical

Affected System: Druid Indexer Nodes Querying

Impact Summary: The associated Druid supervisor is in an unhealthy state, preventing Druid ingestion tasks from running. As a result, real-time data cannot be queried.

Causes:

  • Frequent task failures are causing the supervisor to become unhealthy.

  • Insufficient resources allocated.

  • Autoscaling is either disabled or has failed to scale up.

Actions:

  • Ensure adequate resources are allocated.

  • Enable autoscaling if necessary.

  • Check task logs for detailed errors.

  • Restart the supervisor if necessary.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexer]
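
Before restarting a supervisor, it can help to read its reported state from the Druid Overlord supervisor status API. The following is a minimal sketch; the Overlord address and the supervisor ID are assumptions, and the sample payload is hypothetical and trimmed to the fields used here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the Overlord is reachable at this address.
OVERLORD_URL = "http://localhost:8090"


def fetch_supervisor_status(supervisor_id: str, overlord_url: str = OVERLORD_URL) -> dict:
    """Fetch the status document for one supervisor from the Overlord."""
    url = f"{overlord_url}/druid/indexer/v1/supervisor/{quote(supervisor_id, safe='')}/status"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize_supervisor(status: dict) -> dict:
    """Extract the health-related fields from a supervisor status document."""
    payload = status.get("payload", {})
    return {
        "id": status.get("id"),
        "healthy": payload.get("healthy", False),
        "state": payload.get("state"),
        "detailed_state": payload.get("detailedState"),
        "recent_error_count": len(payload.get("recentErrors", [])),
    }


# Offline example with a hypothetical, trimmed status document. Against a
# live cluster, pass fetch_supervisor_status("<supervisor id>") instead.
sample_status = {
    "id": "clicks",
    "payload": {
        "healthy": False,
        "state": "UNHEALTHY_TASKS",
        "detailedState": "UNHEALTHY_TASKS",
        "recentErrors": [{"exceptionClass": "java.lang.RuntimeException"}],
    },
}
print(summarize_supervisor(sample_status))
```

The `detailed_state` value usually indicates whether the problem lies with the supervisor itself or with its tasks, which helps decide between a restart and a look at the task logs.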


ALERT_1310: [DRUID INDEXER]: Druid tasks are in an unhealthy state

Severity: critical

Affected System: Druid Indexer Nodes Querying

Impact Summary: The Druid ingestion tasks are in an unhealthy state, causing data ingestion delays and failures. As a result, real-time data may not be available for querying.

Causes:

  • Druid middle managers may be overloaded.

  • Druid tasks may be failing due to incorrect configuration, such as an invalid ingestion specification.

  • The incoming data might not match the expected ingestion schema.

  • The tasks might have encountered invalid or unparseable data.

  • Insufficient resources allocated.

  • Autoscaling is either disabled or has failed to scale up.

Actions:

  • Check task logs for detailed errors.

  • Restart the failed supervisor if required.

  • Ensure enough system resources are available.

  • Enable autoscaling if necessary.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexer]
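
As a starting point for the task-log check, recently completed tasks can be listed through the Druid Overlord tasks API and grouped by error message. This is a minimal sketch; the Overlord address is an assumption, and the sample task entries are hypothetical and trimmed.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumption: the Overlord is reachable at this address.
OVERLORD_URL = "http://localhost:8090"


def fetch_completed_tasks(overlord_url: str = OVERLORD_URL, limit: int = 50) -> list:
    """Fetch the most recently completed tasks from the Overlord."""
    url = f"{overlord_url}/druid/indexer/v1/tasks?state=complete&max={limit}"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def failure_summary(tasks: list) -> list:
    """Count failed tasks per (datasource, error message), most frequent first."""
    counts = Counter()
    for task in tasks:
        if task.get("statusCode") == "FAILED":
            key = (task.get("dataSource"), task.get("errorMsg") or "no error message")
            counts[key] += 1
    return counts.most_common()


# Offline example with hypothetical entries. Against a live cluster,
# pass fetch_completed_tasks() instead.
sample_tasks = [
    {"id": "t1", "dataSource": "clicks", "statusCode": "FAILED", "errorMsg": "Out of disk space"},
    {"id": "t2", "dataSource": "clicks", "statusCode": "SUCCESS", "errorMsg": None},
    {"id": "t3", "dataSource": "clicks", "statusCode": "FAILED", "errorMsg": "Out of disk space"},
]
for (datasource, error), count in failure_summary(sample_tasks):
    print(f"{count} failed task(s) for {datasource}: {error}")
```

Grouping by error message shows whether the failures share one cause (for example, a configuration problem) or are scattered, which suggests a resource problem instead.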


ALERT_1311: [DRUID INDEXER]: Druid task slot utilization is out of expected range

Severity: critical

Affected System: Druid Indexer Nodes Querying

Impact Summary: High Druid task slot utilization can delay new task assignments, causing ingestion bottlenecks and impacting data freshness for analytics.

Causes:

  • Insufficient task slots configured in the system.

  • Stuck or failed tasks occupying task slots.

Actions:

  • Review and adjust the task slot configuration.

  • Monitor resource usage (CPU, memory) to ensure adequate capacity.

  • Investigate and resolve any stuck or failed tasks.

  • For more assistance, contact administrative support.

Component: Querying: Druid[Indexer]
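
Task slot utilization can be estimated from the Druid Overlord workers API, which reports each worker's slot capacity and the capacity currently in use. The sketch below assumes the Overlord address shown and uses a hypothetical, trimmed response for the offline example.

```python
import json
from urllib.request import urlopen

# Assumption: the Overlord is reachable at this address.
OVERLORD_URL = "http://localhost:8090"


def fetch_workers(overlord_url: str = OVERLORD_URL) -> list:
    """Fetch the list of workers known to the Overlord."""
    url = f"{overlord_url}/druid/indexer/v1/workers"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def slot_utilization(workers: list) -> float:
    """Return used task slots as a percentage of total slots across workers."""
    total = sum(w.get("worker", {}).get("capacity", 0) for w in workers)
    used = sum(w.get("currCapacityUsed", 0) for w in workers)
    if total == 0:
        return 0.0  # no capacity reported; avoid division by zero
    return 100.0 * used / total


# Offline example with a hypothetical, trimmed response. Against a live
# cluster, pass fetch_workers() instead.
sample_workers = [
    {"worker": {"host": "indexer-0:8091", "capacity": 4}, "currCapacityUsed": 3},
    {"worker": {"host": "indexer-1:8091", "capacity": 4}, "currCapacityUsed": 4},
]
print(f"task slot utilization: {slot_utilization(sample_workers):.1f}%")
```

If utilization stays high while few tasks make progress, stuck tasks are the more likely cause; if all slots are busy with healthy tasks, the slot configuration itself is probably too small.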

