Sanity Checklist

After installation, run the following sanity checks to validate the deployment.

Category
Check Item
Status (✔/✘)

Ingestion

All ingestion connectors running with expected replicas

(✔/✘)

Data flowing from all expected upstream sources

(✔/✘)

No ingestion backlog in Kafka topics

(✔/✘)

Schema validation passing for incoming messages

(✔/✘)

No ingestion error messages in the connector pod logs

(✔/✘)

Resource configurations match the environment and expected load

(✔/✘)
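
The Kafka backlog check above comes down to comparing each partition's log end offset with the consumer group's committed offset. The following is a minimal sketch of that arithmetic, assuming the offsets have already been collected (for example from a Kafka client or the consumer-group tooling). The topic name and offset values are made up for illustration, and the zero threshold is an assumption to adjust per environment.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def partition_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return consumer lag per partition: log end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # A partition with no committed offset is treated as fully unread.
        committed = committed_offsets.get(tp, 0)
        lag[tp] = max(end - committed, 0)
    return lag


def backlog_partitions(
    lag: Dict[TopicPartition, int], threshold: int = 0
) -> List[TopicPartition]:
    """Return the partitions whose lag exceeds the threshold, sorted."""
    return sorted(tp for tp, value in lag.items() if value > threshold)


if __name__ == "__main__":
    # Made-up offsets for a hypothetical topic named "ingest".
    end = {("ingest", 0): 1200, ("ingest", 1): 950}
    committed = {("ingest", 0): 1200, ("ingest", 1): 900}
    print(backlog_partitions(partition_lag(end, committed)))
```

An empty result means no partition is above the threshold, which corresponds to a ✔ for the backlog item.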

Processing

The unified pipeline, cache-indexer, and lakehouse-connector jobs are in RUNNING state with the expected replica configuration

(✔/✘)

Checkpointing active and stable

(✔/✘)

0% failed events (no schema-validation failures or duplicate events) and no elevated consumer lag

(✔/✘)

Kafka partition counts and Flink job configurations are consistent with each other and sized for the load and environment

(✔/✘)

No errors in the pod logs

(✔/✘)
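
The job-state check above can be scripted against the Flink REST API, whose `/jobs/overview` endpoint returns a `jobs` list with a `name` and `state` for each job. The sketch below only inspects an already-parsed response; fetching it from the JobManager is left out. The job names in the example are placeholders taken from this checklist and may differ from the names used in a given deployment.

```python
from typing import Any, Dict, Iterable, List


def jobs_not_running(
    overview: Dict[str, Any], expected_jobs: Iterable[str]
) -> List[str]:
    """Return the expected job names that have no instance in RUNNING state."""
    running = {
        job["name"]
        for job in overview.get("jobs", [])
        if job.get("state") == "RUNNING"
    }
    return [name for name in expected_jobs if name not in running]


if __name__ == "__main__":
    # Trimmed, made-up response in the shape returned by /jobs/overview.
    sample = {
        "jobs": [
            {"jid": "a1", "name": "unified-pipeline", "state": "RUNNING"},
            {"jid": "b2", "name": "cache-indexer", "state": "RESTARTING"},
        ]
    }
    expected = ["unified-pipeline", "cache-indexer", "lakehouse-connector"]
    # Lists jobs that are either missing or not RUNNING.
    print(jobs_not_running(sample, expected))
```

A job that is missing from the response is reported the same way as one that is restarting or failed, since either case fails the check.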

Querying

Druid ingestion tasks running and segments published

(✔/✘)

Hudi datasets up-to-date and queryable

(✔/✘)

Query APIs responding within acceptable latency

(✔/✘)

Real-time and historical data can be queried from both Hudi and Druid

(✔/✘)

Spot checks return correct and fresh data

(✔/✘)
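
For the freshness spot check above, one option is to compare the newest event timestamp returned by a query (for instance the maximum of Druid's `__time` column) with the current time. This is a minimal sketch under that assumption; the five-minute limit and the sample timestamps are illustrative only, not recommended values.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if timestamp.endswith("Z"):
        timestamp = timestamp[:-1] + "+00:00"
    parsed = datetime.fromisoformat(timestamp)
    # Timestamps without an offset are treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_fresh(latest_event: str, now: datetime, max_age: timedelta) -> bool:
    """Return True if the newest event is no older than max_age."""
    return now - parse_utc(latest_event) <= max_age


if __name__ == "__main__":
    # Made-up values: the newest event is two minutes older than "now".
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(is_fresh("2024-01-01T11:58:00.000Z", now, timedelta(minutes=5)))
```

Passing `now` explicitly keeps the function deterministic, so the same check can be replayed against recorded query results.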

Storage

Velero backups completed successfully

(✔/✘)

Kafka/Druid/Hudi backups available

(✔/✘)

Secor backup service is running and healthy

(✔/✘)

Secor backup files for dataset events are available in blob storage

(✔/✘)

No errors or elevated lag in the Secor service

(✔/✘)

Restore test performed in staging (optional)

(✔/✘)

Monitoring

All key metrics collected (Kafka, Flink, Druid, Hudi, APIs)

(✔/✘)

Grafana dashboards rendering without gaps

(✔/✘)

No abnormal spikes in error rates, latency, or usage

(✔/✘)

Alerts

All alerting rules enabled and targeting correct channels

(✔/✘)

Test alerts sent and acknowledged

(✔/✘)

Critical alert thresholds correctly configured

(✔/✘)

Management Console

Management console is accessible

(✔/✘)

All the datasets are healthy

(✔/✘)

CPU, memory, and volume usage are within normal ranges

(✔/✘)

All service pods are in Running state with no unexpected restarts

(✔/✘)
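
The pod check above can be automated from the JSON that `kubectl get pods -o json` prints, where each item carries `status.phase` and a per-container `restartCount`. The sketch below works on that parsed JSON. The `max_restarts` parameter is a hypothetical threshold for what counts as an expected number of restarts, and the pod names are made up.

```python
from typing import Any, Dict, List, Tuple


def unhealthy_pods(
    pod_list: Dict[str, Any], max_restarts: int = 0
) -> List[Tuple[str, str, int]]:
    """Return (name, phase, restarts) for pods that fail the check."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Sum restarts over all containers in the pod.
        restarts = sum(
            c.get("restartCount", 0)
            for c in status.get("containerStatuses", [])
        )
        # Note: pods of completed Jobs report "Succeeded" and are flagged too.
        if phase != "Running" or restarts > max_restarts:
            problems.append((name, phase, restarts))
    return problems


if __name__ == "__main__":
    # Trimmed, made-up pod list in the shape of `kubectl get pods -o json`.
    sample = {
        "items": [
            {"metadata": {"name": "api-0"},
             "status": {"phase": "Running",
                        "containerStatuses": [{"restartCount": 0}]}},
            {"metadata": {"name": "worker-0"},
             "status": {"phase": "Running",
                        "containerStatuses": [{"restartCount": 4}]}},
        ]
    }
    print(unhealthy_pods(sample, max_restarts=2))
```

An empty list corresponds to a ✔ for this item; anything else names the pods to inspect first.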

Final

End-to-end data flow verified (Ingestion → Processing → Storage → Query)

(✔/✘)
