Sanity Checklist
After installation, a sanity test must be performed to validate the deployment.
Ingestion
All ingestion connectors running with expected replicas
(✔/✘)
Data flowing from all expected upstream sources
(✔/✘)
No ingestion backlog in Kafka topics
(✔/✘)
Schema validation passing for incoming messages
(✔/✘)
No ingestion error messages in the connector pod logs
(✔/✘)
Resource configurations are correct for the environment and load
(✔/✘)
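The backlog check above can be expressed as a lag calculation: for each partition, lag is the log end offset minus the consumer group's committed offset. The following is a minimal sketch, assuming the offsets have already been collected by a Kafka client or CLI; the topic name `ingest` and the sample offsets are hypothetical.

```python
from typing import Dict, Tuple

# Maps (topic, partition) to an offset.
Offsets = Dict[Tuple[str, int], int]


def partition_lag(end_offsets: Offsets, committed_offsets: Offsets) -> Offsets:
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # A partition with no committed offset is counted from offset 0,
        # which overstates lag if retention has already deleted old records.
        committed = committed_offsets.get(tp, 0)
        lag[tp] = max(end - committed, 0)
    return lag


def backlog(end_offsets: Offsets, committed_offsets: Offsets, max_lag: int = 0) -> Offsets:
    """Return only the partitions whose lag exceeds max_lag."""
    lags = partition_lag(end_offsets, committed_offsets)
    return {tp: n for tp, n in lags.items() if n > max_lag}


# Hypothetical sample values for illustration only.
end = {("ingest", 0): 120, ("ingest", 1): 80}
committed = {("ingest", 0): 120, ("ingest", 1): 50}
print(backlog(end, committed))  # {('ingest', 1): 30}
```

An empty result means no partition is behind by more than `max_lag`. A small non-zero threshold is usually more practical than 0 on a system that is actively receiving data.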
Processing
The unified pipeline, cache-indexer and lakehouse-connector jobs are in RUNNING state with the expected replica configuration
(✔/✘)
Checkpointing active and stable
(✔/✘)
0% failed events (no schema-validation failures or duplicate events) and no elevated lag
(✔/✘)
Kafka partition counts match the Flink job configs and are correct for the load and environment
(✔/✘)
No errors in the pod logs
(✔/✘)
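The job-state check can be scripted against the Flink REST API, whose `GET /jobs/overview` endpoint returns a `jobs` array with a `name` and `state` for each job. The sketch below assumes the response has already been fetched and parsed; the job names are assumptions about how the jobs are registered in Flink, and if each job runs in its own Flink cluster the check would be repeated per cluster.

```python
from typing import Any, Dict, Iterable, List


def jobs_not_running(overview: Dict[str, Any], expected: Iterable[str]) -> List[str]:
    """Return expected job names that are missing or not in RUNNING state."""
    # Names of all jobs the cluster currently reports as RUNNING.
    running = {
        job["name"]
        for job in overview.get("jobs", [])
        if job.get("state") == "RUNNING"
    }
    return sorted(name for name in expected if name not in running)


# Hypothetical response body in the shape of GET /jobs/overview.
sample = {
    "jobs": [
        {"jid": "a1", "name": "unified-pipeline", "state": "RUNNING"},
        {"jid": "b2", "name": "cache-indexer", "state": "RESTARTING"},
    ]
}
expected_jobs = ["unified-pipeline", "cache-indexer", "lakehouse-connector"]
print(jobs_not_running(sample, expected_jobs))  # ['cache-indexer', 'lakehouse-connector']
```

An empty list means every expected job is present and RUNNING.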
Querying
Druid ingestion tasks running and segments published
(✔/✘)
Hudi datasets up-to-date and queryable
(✔/✘)
Query APIs responding within acceptable latency
(✔/✘)
Real-time and historical data can be queried from both Hudi and Druid
(✔/✘)
Spot checks return correct and fresh data
(✔/✘)
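A freshness spot check on Druid can be reduced to one SQL query for the newest `__time` value, sent to the Druid SQL endpoint (`POST /druid/v2/sql`), followed by a comparison against the current time. This is a minimal sketch, assuming the default object result format and ISO 8601 timestamps in the response; the datasource name and the five-minute threshold are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def freshness_query(datasource: str) -> str:
    """Build the JSON request body for the Druid SQL endpoint."""
    sql = f'SELECT MAX(__time) AS latest FROM "{datasource}"'
    return json.dumps({"query": sql})


def is_fresh(rows: List[Dict[str, Any]], now: datetime, max_age: timedelta) -> bool:
    """Return True if the newest event in the rows is no older than max_age."""
    if not rows or rows[0].get("latest") is None:
        # An empty datasource has no newest event, so it cannot be fresh.
        return False
    # Normalise a trailing "Z" so datetime.fromisoformat accepts the value.
    latest = datetime.fromisoformat(rows[0]["latest"].replace("Z", "+00:00"))
    return now - latest <= max_age


# Hypothetical datasource name and response rows for illustration only.
body = freshness_query("telemetry-events")
rows = [{"latest": "2024-05-01T11:58:00.000Z"}]
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(rows, now, timedelta(minutes=5)))  # True
```

The same comparison applies to Hudi if the query engine in front of it can return the maximum event timestamp for the table.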
Storage
Velero backups completed successfully
(✔/✘)
Kafka/Druid/Hudi backups available
(✔/✘)
Secor backup service is running and healthy
(✔/✘)
Secor backup files for dataset events are available in blob storage
(✔/✘)
No errors or elevated lag in the Secor service
(✔/✘)
Restore test performed in staging (optional)
(✔/✘)
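Velero records each backup as a `Backup` custom resource whose `status.phase` reaches `Completed` on success, so the backup check can be scripted from the JSON listing of those resources. The sketch below assumes the listing was produced by something like `kubectl get backups.velero.io -n velero -o json`; the namespace and the backup names are assumptions.

```python
from typing import Any, Dict


def unsuccessful_backups(backup_list: Dict[str, Any]) -> Dict[str, str]:
    """Map backup name to phase for every backup that is not Completed."""
    problems = {}
    for item in backup_list.get("items", []):
        name = item["metadata"]["name"]
        # A backup that has not been processed yet may have no status at all.
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Completed":
            problems[name] = phase
    return problems


# Hypothetical listing in the shape of a Kubernetes List of Backup resources.
listing = {
    "items": [
        {"metadata": {"name": "daily-001"}, "status": {"phase": "Completed"}},
        {"metadata": {"name": "daily-002"}, "status": {"phase": "PartiallyFailed"}},
        {"metadata": {"name": "daily-003"}},
    ]
}
print(unsuccessful_backups(listing))
# {'daily-002': 'PartiallyFailed', 'daily-003': 'Unknown'}
```

A backup that is still `InProgress` is also reported, so the check is best run after the backup window has closed.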
Monitoring
All key metrics collected (Kafka, Flink, Druid, Hudi, APIs)
(✔/✘)
Grafana dashboards rendering without gaps
(✔/✘)
No abnormal spikes in error rates, latency, or usage
(✔/✘)
Alerts
All alerting rules enabled and targeting correct channels
(✔/✘)
Test alerts sent and acknowledged
(✔/✘)
Critical alert thresholds correctly configured
(✔/✘)
Management Console
Management console is accessible
(✔/✘)
All the datasets are healthy
(✔/✘)
CPU, memory, and volume usage are within normal ranges
(✔/✘)
All service pods are in Running state with restart counts within expected limits
(✔/✘)
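The pod-state check can be automated from the output of `kubectl get pods -o json`, where each pod carries `status.phase` and a `restartCount` per container. The following is a minimal sketch, assuming the JSON has already been loaded; the pod names and the restart threshold are hypothetical.

```python
from typing import Any, Dict


def unhealthy_pods(pod_list: Dict[str, Any], max_restarts: int = 0) -> Dict[str, str]:
    """Map pod name to a reason for pods not Running or restarting too often."""
    problems = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        # Sum restarts across all containers in the pod.
        restarts = sum(
            c.get("restartCount", 0) for c in status.get("containerStatuses", [])
        )
        if phase != "Running":
            problems[name] = f"phase={phase}"
        elif restarts > max_restarts:
            problems[name] = f"restarts={restarts}"
    return problems


# Hypothetical pod listing for illustration only.
pods = {
    "items": [
        {"metadata": {"name": "api-0"},
         "status": {"phase": "Running", "containerStatuses": [{"restartCount": 0}]}},
        {"metadata": {"name": "druid-0"},
         "status": {"phase": "Running", "containerStatuses": [{"restartCount": 4}]}},
        {"metadata": {"name": "secor-0"}, "status": {"phase": "Pending"}},
    ]
}
print(unhealthy_pods(pods, max_restarts=2))
# {'druid-0': 'restarts=4', 'secor-0': 'phase=Pending'}
```

Pods created by completed Jobs report the `Succeeded` phase and would be flagged here, so they may need to be filtered out before the check.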
Final
End-to-end data flow verified (Ingestion → Processing → Storage → Query)
(✔/✘)