Processing System Alerts and Recommended Actions
This document details the various alert rules configured within the system, providing information on their impact, potential causes, and recommended actions.
ALERT_1201: [VALKEY DENORM]: High Disk Usage Detected
Severity: critical
Affected System: Valkey Caching Service
Impact Summary: High disk usage in Valkey may delay data enrichment, slowing real-time processing and potentially causing queries to return inaccurate data.
Causes:
- The system is running out of disk space.
- Volume autoscaling might be disabled.
- Volume autoscaling may have failed due to threshold limits, subject to cloud provider limitations on scaling frequency.
- A high volume of data is being written to the persistent storage.
- Old or unused data has accumulated in the persistent volume (PV).
Actions:
- Enable volume autoscaling.
- Increase the Valkey volume size.
- Raise the volume autoscaler's threshold percentage so that the volume is expanded at a higher utilization level, which prevents frequent scaling.
- Increase the PV size if needed.
- For more assistance, contact administrative support.
Component: Processing: Valkey[]
ALERT_1202: [VALKEY DEDUPE]: High Disk Usage Detected
Severity: critical
Affected System: Valkey Caching Service
Impact Summary: High disk usage in Valkey may lead to duplicate data being processed, resulting in inaccurate query results.
Causes:
- The system is running out of disk space.
- Volume autoscaling might be disabled.
- Volume autoscaling may have failed due to threshold limits, subject to cloud provider limitations on scaling frequency.
- A high volume of data is being written to the persistent storage.
- Old or unused data has accumulated in the persistent volume (PV).
Actions:
- Enable volume autoscaling.
- Increase the Valkey volume size.
- Raise the volume autoscaler's threshold percentage so that the volume is expanded at a higher utilization level, which prevents frequent scaling.
- Increase the PV size if needed.
- For more assistance, contact administrative support.
Component: Processing: Valkey[]
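As a quick way to confirm how full a volume is before resizing it, the check behind the two disk usage alerts above can be sketched in Python. This is a minimal sketch, not the system's own alert rule: the function names and the 80% default threshold are assumptions chosen for illustration, and the path passed in must be the mount point of the Valkey volume in your deployment.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_attention(percent_used: float, threshold: float = 80.0) -> bool:
    """True when usage is at or above the threshold (80% is an assumed default)."""
    return percent_used >= threshold
```

Calling `needs_attention(disk_usage_percent(mount_path))` from inside the pod that mounts the volume gives a yes/no answer that can be compared with what the alert reports.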
ALERT_1203: [DATASET]: Detected high rate of invalid data than expected
Severity: critical
Affected System: Dataset Processing
Impact Summary: Invalid data has been ingested into the system and cannot be processed. As a result, queries on this dataset may not return accurate data.
Causes:
- The source system may be producing invalid data that fails dataset schema validation.
- If the data is failing during extraction, then the extraction configuration could be invalid.
- An invalid data schema might have been configured.
Actions:
- Check the Kafka backups to find out why the data failed to process.
- Review the schema and modify if necessary.
- Correct the source system to generate the data as per the expected schema format.
- If data processing failed during extraction, update the extraction configuration if it has been overwritten by the admin.
- For more assistance, contact administrative support.
Component: Processing: Dataset[]
PROCESS_VALKEY_003: Duplicate events found during extraction
Severity: warning
Affected System: Dataset Processing
Impact Summary: Duplicate events during extraction can lead to inflated processing and storage.
Causes:
- The extractor job may have failed, causing events to be extracted more than once.
Actions:
- Check the logs in Kafka for errors related to the extractor job failure.
Component: Processing: Unified pipeline[]
ALERT_1204: [DATASET]: Detected higher rate of duplicate data than expected
Severity: warning
Affected System: Dataset Processing
Impact Summary: Duplicate data has been ingested into the system, preventing it from being processed. As a result, queries on this dataset may not return accurate data.
Causes:
- The source system may be generating duplicate data, causing multiple records with the same deduplication key (unique identifier) to be ingested.
- If consumer offsets are not saved and the system restarts, data may be reprocessed, leading to a high rate of duplicate records.
- If cache keys expire and the same data is processed again after expiration, an increased number of duplicates may be detected.
Actions:
- Check Kafka backups to identify duplicate records that were ingested.
- If the source system is otherwise producing correct data, ensure that it generates unique records.
- Extend the retention period of the deduplication storage system to prevent duplicates.
- Lower the checkpoint commit frequency if it is set too high (useful when the system restarts).
- For further assistance, contact the administrative support team.
Component: Processing: Dataset[]
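The link between cache key expiry and duplicates can be illustrated with a toy in-memory deduplication store. This is only a sketch under stated assumptions: `DedupCache` is a hypothetical stand-in for the deduplication storage system, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class DedupCache:
    """Toy in-memory stand-in for the deduplication store (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._seen: dict[str, float] = {}  # dedup key -> expiry time

    def is_duplicate(self, key: str, now: float) -> bool:
        """Return True if `key` was seen and its entry has not yet expired."""
        expiry = self._seen.get(key)
        if expiry is not None and now < expiry:
            return True
        # First sighting, or the earlier entry expired: record the key again.
        self._seen[key] = now + self.ttl_seconds
        return False
```

With a 60 second retention, a record replayed after 90 seconds is no longer recognized as a duplicate and is treated as new data, whereas a longer retention period would still catch it. This is the reasoning behind extending the retention period of the deduplication store.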
PROCESS_DATASET_004: Failed to validate the ingested data
Severity: warning
Affected System: Dataset Processing
Impact Summary: Data validation failures can lead to data processing errors and inconsistencies.
Actions:
- Check whether the ingested events are failing to validate against the provided data schema.
Component: Processing: Unified pipeline[]
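To reproduce a validation failure outside the pipeline, a sample event can be checked against the expected fields by hand. The sketch below is a deliberately simplified illustration and not the validator the system uses: `validate_event` and the field-to-type mapping are hypothetical, and a real dataset schema will usually express more than required fields and types.

```python
def validate_event(event: dict, required: dict[str, type]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected_type in required.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    return errors
```

Running this over a handful of failed events usually shows quickly whether the source is sending a wrong type or omitting a field, or whether the configured schema itself is too strict.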
ALERT_1205: [DATASET]: Detected higher incidence of failures during data enrichment.
Severity: critical
Affected System: Dataset Processing
Impact Summary: The data ingested into the system is failing the enrichment process, which may cause queries on this dataset to return inaccurate data.
Causes:
- The ingested data may be missing the primary key required for enrichment.
- The primary key to enrich the data might be unavailable in the cache system.
Actions:
- Check the Kafka backups to find out why enrichment of the data failed.
- Extend the retention period of the denormalization storage system to ensure primary keys are available for enrichment.
- For more assistance, contact administrative support.
Component: Processing: Dataset[]
ALERT_1206: [DATASET]: Detected higher incidence of failures during data transformations.
Severity: critical
Affected System: Dataset Processing
Impact Summary: The data ingested into the system is failing the data transformation process, which may cause queries on this dataset to return inaccurate data.
Causes:
- The ingested data may be missing the required key for transformations.
- Invalid transformation logic might have been configured.
Actions:
- Check the Kafka backups to find out why the transformation on the data failed.
- For more assistance, contact administrative support.
Component: Processing: Dataset[]
PROCESS_DATASET_006: The keys are expiring in redis
Severity: warning
Affected System: Valkey Caching Service
Impact Summary: Key expirations can lead to cache misses and increased latency.
Causes:
- Keys have a short TTL (Time-To-Live) set.
- Redis memory limit might have been reached.
- Expiration policy is misconfigured.
Actions:
- Check TTL values using TTL <key>, and adjust them with EXPIRE <key> <seconds> if they are too short.
- Increase Redis memory allocation if needed.
- For more assistance, contact administrative support.
Component: Processing: Valkey[Redis]
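The TTL command replies with an integer: the remaining lifetime in seconds, -1 when the key has no expiry, and -2 when the key does not exist. A minimal sketch for checking several keys at once follows. It assumes a client object that exposes a `ttl(key)` method, as the redis-py client does; the function names and the 60 second "short" cutoff are illustrative assumptions.

```python
def classify_ttl(ttl_seconds: int, short_ttl: int = 60) -> str:
    """Interpret the integer reply of the TTL command."""
    if ttl_seconds == -2:
        return "missing"    # key does not exist (expired or never set)
    if ttl_seconds == -1:
        return "no-expiry"  # key exists but has no TTL
    if ttl_seconds < short_ttl:
        return "short"      # below the assumed cutoff
    return "ok"


def report_ttls(client, keys, short_ttl: int = 60) -> dict:
    """Map each key to a TTL classification using client.ttl(key)."""
    return {key: classify_ttl(client.ttl(key), short_ttl) for key in keys}
```

Many keys reported as "short" or "missing" would point toward the TTL configuration rather than the memory limit.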
PROCESS_DATASET_007: There is high number of connections open to redis
Severity: critical
Affected System: Valkey Caching Service
Impact Summary: A high number of open connections can exhaust Redis resources and impact performance.
Causes:
- Connections might not be closing properly.
- Redis is reaching resource limits, affecting performance.
Actions:
- Monitor the connection count.
- Restart Redis if stuck connections persist.
- Scale Redis by increasing resources.
- For more assistance, contact administrative support.
Component: Processing: Valkey[Redis]
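Monitoring the connection count is more informative when it is compared with the configured limit. The sketch below assumes a redis-py style client whose `info("clients")` result contains `connected_clients` and whose `config_get("maxclients")` returns the limit. The function name is hypothetical, and some managed services disable the CONFIG command, in which case the limit has to be taken from the service settings instead.

```python
def connection_utilization(client) -> float:
    """Return open connections as a fraction of the maxclients limit."""
    connected = int(client.info("clients")["connected_clients"])
    max_clients = int(client.config_get("maxclients")["maxclients"])
    return connected / max_clients
```

A value that stays close to 1.0 after traffic has dropped suggests that connections are not being closed, rather than a genuine load increase.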
ALERT_1207: [RDBMS]: A high number of open connections to PostgreSQL has been detected.
Severity: critical
Affected System: PostgreSQL Database Service
Impact Summary: A high number of open connections to PostgreSQL can disrupt dataset management, affecting read/write operations and potentially leading to failed dataset transactions.
Causes:
- Open connections may not have been closed.
- Functional issues may be present.
- The external system may have accessed the metadata storage.
Actions:
- Monitor the connection count in the database.
- Restart PostgreSQL service if connections remain stuck.
- For more assistance, contact administrative support.
Component: Processing: RDBMS[]
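One way to monitor the connection count is to group the rows of PostgreSQL's `pg_stat_activity` view by connection state. The sketch below assumes a DB-API cursor (for example from psycopg) that is already connected with sufficient privileges; the function and constant names are illustrative.

```python
CONNECTIONS_BY_STATE_SQL = """
SELECT COALESCE(state, 'unknown') AS state, count(*) AS connections
FROM pg_stat_activity
GROUP BY 1
ORDER BY connections DESC
"""


def connections_by_state(cursor) -> dict[str, int]:
    """Return a mapping of connection state to the number of connections."""
    cursor.execute(CONNECTIONS_BY_STATE_SQL)
    return {state: int(count) for state, count in cursor.fetchall()}
```

A large number of `idle` or `idle in transaction` connections typically indicates clients that are holding connections open without using them.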
PROCESS_RDBMS_007: [RDBMS]: Metadata queries are running slower than expected
Severity: warning
Affected System: PostgreSQL Database Service
Impact Summary: Slow queries can delay data retrieval and impact API responsiveness.
Causes:
- Queries are stuck in a wait state due to locks held by other transactions.
- An excessive number of simultaneous transactions is slowing down the system.
- The database server is experiencing CPU or memory shortages, impacting performance.
Actions:
- Monitor the database for long-running queries.
- Identify and resolve blocked queries.
- For more assistance, contact administrative support.
Component: Processing: RDBMS[]
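Blocked queries can be identified with the built-in `pg_blocking_pids()` function, available in PostgreSQL 9.6 and later. This is a sketch assuming a connected DB-API cursor; the function name, column aliases, and result shape are choices made for illustration.

```python
BLOCKED_QUERIES_SQL = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - query_start   AS running_for,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY running_for DESC
"""


def blocked_queries(cursor) -> list[dict]:
    """Return one dict per session that is waiting on a lock held by another."""
    cursor.execute(BLOCKED_QUERIES_SQL)
    columns = ("pid", "blocked_by", "running_for", "query")
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

The `blocked_by` list names the sessions holding the locks, which is where to look when deciding how to resolve the block.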
PROCESSS_UNIFIED_PIPELINE_008: [UNIFIED PIPELINE]: Detected delays in data processing than expected
Severity: warning
Affected System: Unified Data Processing Pipeline
Impact Summary: Delays in data processing can result in late data availability for enrichment, storage, or triggering downstream workflows.
Causes:
- The system might have received a higher volume of data.
- Errors during data processing are causing retries and slowdowns.
- Allocated resources might be insufficient.
- Expensive transformation logic may be causing delays.
- Autoscaling is either disabled or has failed to scale up.
Actions:
- Check the pipeline job pod logs for any errors.
- Restart the pipeline job pods.
- Increase CPU and memory resources if necessary.
- Enable auto scaling if it is not enabled.
- For further assistance, contact administrative support.
Component: Processing: Unified pipeline[Preprocessor job, Extractor job, Transformation job, Denormalizer job]
ALERT_1208: [UNIFIED PIPELINE]: Detected higher amount of processing lag than expected
Severity: warning
Affected System: Unified Data Processing Pipeline
Impact Summary: High pipeline lag in the dataset indicates processing of new data is delayed. Because of this delay, new data isn’t available when querying the dataset.
Causes:
- The system might have received a higher volume of data.
- The allocated resources are insufficient.
- Autoscaling is either disabled or has failed to scale up.
Actions:
- Monitor the processing lag closely.
- Allocate additional resources if required.
- Enable auto scaling if it is not enabled.
- For further assistance, contact administrative support.
Component: Processing: Unified pipeline[]
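Processing lag is commonly measured as the gap between the latest offset in each Kafka partition and the offset the pipeline has committed. The sketch below only shows that arithmetic: it takes both sets of offsets as plain dictionaries keyed by partition, and obtaining them (for example from Kafka's consumer group tooling) is left out. Treating a partition with no committed offset as starting from 0 is an assumption made for this example.

```python
def partition_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Return the number of unprocessed messages per partition."""
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def total_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> int:
    """Return the total number of unprocessed messages across partitions."""
    return sum(partition_lag(end_offsets, committed).values())
```

Sampling this value at intervals shows whether lag is growing, which points to insufficient resources, or shrinking after a temporary spike in volume.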
ALERT_1209: [UNIFIED PIPELINE]: No data has been received for past hour.
Severity: warning
Affected System: Unified Data Processing Pipeline
Impact Summary: The dataset has not received any new data, which will impact real-time data processing.
Causes:
- The source system may not be generating or sending data.
- The connector may have failed to receive data from the connector source.
- The connector might have ignored or failed to process the data.
Actions:
- Check the health status of the source connector.
- Review the connector logs to determine whether it is running but failing to process the received data.
- For further assistance, contact administrative support.
Component: Processing: Unified pipeline[]
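The condition behind this alert amounts to comparing the time of the most recent event with the current time. A minimal sketch follows, assuming the timestamp of the last received event is available from logs or metrics; the function name is hypothetical and the one hour window is taken from the alert title.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    last_event_at: datetime,
    now: datetime,
    max_silence: timedelta = timedelta(hours=1),
) -> bool:
    """True when no event has arrived within the allowed silence window."""
    return now - last_event_at > max_silence
```

For a live check, pass `datetime.now(timezone.utc)` as `now` and make sure `last_event_at` is timezone-aware as well.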
PROCESS_UNIFIED_PIPELINE_011: [UNIFIED PIPELINE]: Detected unexpected load in the system
Severity: warning
Affected System: Unified Data Processing Pipeline
Impact Summary: Unexpected load can slow down data processing, increase latency, and potentially lead to job failures or restarts.
Causes:
- A sudden spike in incoming data from connectors was detected.
- There are multiple or incorrect connector sources sending data.
Actions:
- Monitor and validate the data received from connectors.
- Remove any unnecessary or misconfigured data sources.
- For further assistance, contact administrative support.
Component: Processing: Unified pipeline[]
ALERT_1210: [VALKEY]: Detected higher memory usage than expected
Severity: critical
Affected System: Valkey Caching Service (Master datasets)
Impact Summary: High memory usage in Valkey can cause the data processing system to pause. As a result, no new data will be processed or available for querying.
Causes:
- There might be a large number of keys stored in Valkey.
Actions:
- Allocate more memory to Valkey to accommodate the increased data load.
- For further assistance, contact administrative support.
Component: Processing: Valkey[]
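Before allocating more memory, it helps to see how close usage is to the configured limit. The sketch below assumes a redis-py style client whose `info("memory")` result includes the `used_memory` and `maxmemory` fields, both in bytes; the function name is hypothetical. A `maxmemory` of 0 means no limit is configured, so no ratio can be computed.

```python
from typing import Optional


def memory_utilization(client) -> Optional[float]:
    """Return used memory as a fraction of maxmemory, or None if unlimited."""
    memory = client.info("memory")
    used = int(memory["used_memory"])
    limit = int(memory["maxmemory"])
    if limit == 0:
        return None  # no maxmemory configured
    return used / limit
```

If the function returns None, the effective limit is whatever the container or host allows, and usage should be compared against that instead.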