Autoscaling Components
This guide explains how to configure and manage autoscaling for Obsrv components.
Overview
Obsrv components can be scaled using KEDA (Kubernetes Event-driven Autoscaling). The default configuration supports autoscaling for Flink Task Managers and Druid components based on various metrics like CPU usage, query latency, and Kafka lag.
Understanding Scaling Scenarios
Flink Autoscaling
Flink Task Managers need to scale when:
Kafka consumer lag grows beyond acceptable thresholds
Processing backpressure increases
Event processing latency rises
Key considerations:
Scale up quickly to handle sudden data bursts
Match scaling with Kafka partition count for optimal parallelism
Scale down conservatively to avoid job restarts
Consider checkpoint completion times during scaling
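For illustration, a KEDA Prometheus trigger keyed on consumer lag might look like the sketch below; the Prometheus address, metric name, consumer group, and threshold are assumptions and depend on which Kafka exporter feeds your Prometheus.

```yaml
# Hypothetical trigger for Flink Task Managers, keyed on Kafka consumer lag.
# Metric name, consumer group and threshold are illustrative -- adjust them to
# the Kafka exporter and topics in your deployment.
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
      query: sum(kafka_consumergroup_lag{consumergroup="unified-pipeline-group"})
      threshold: "10000"
```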
Druid Autoscaling
Druid components scale for different reasons:
Historical Nodes:
Query response times exceed thresholds
Segment load times increase
Available capacity for new segments decreases
High CPU utilization impacts query performance
Broker Nodes:
High query queuing or wait times
Increased number of concurrent queries
CPU utilization affects query routing efficiency
Key considerations:
Historical scaling impacts data availability
Broker scaling affects query routing and caching
Both require careful monitoring of query patterns
Consider time of day and workload patterns
Enabling Autoscaling
Enable autoscaling in autoscaling.yaml:
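A minimal sketch of what the relevant entries might look like; the component keys and values are illustrative and may not match the exact schema of the Obsrv charts:

```yaml
# Illustrative only -- check the chart's values schema for the exact keys.
unified-pipeline-taskmanager:
  enabled: true
  minReplicaCount: 1
  maxReplicaCount: 4

druid-historical:
  enabled: true
  minReplicaCount: 2
  maxReplicaCount: 4
```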
Install the autoscaling rules:
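Assuming the rules ship as a Helm chart, installation might look like the following; the release name, chart path, and namespace are placeholders:

```bash
# Placeholder release name, chart path and namespace -- substitute your own.
helm upgrade --install obsrv-autoscaling ./autoscaling \
  -n obsrv \
  -f autoscaling.yaml
```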
Configuration Components
Common Parameters
enabled: Enable/disable autoscaling for the component
kind: Kubernetes resource type (Deployment/StatefulSet)
minReplicaCount: Minimum number of replicas
maxReplicaCount: Maximum number of replicas
pollingInterval: How often to check metrics (in seconds)
cooldownPeriod: Minimum time between scaling operations (in seconds)
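Taken together, a single component entry might combine these parameters as follows (values are illustrative):

```yaml
enabled: true
kind: Deployment          # or StatefulSet (e.g. Druid historicals)
minReplicaCount: 1
maxReplicaCount: 4
pollingInterval: 30       # check metrics every 30 seconds
cooldownPeriod: 300       # wait 5 minutes between scaling operations
```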
HorizontalPodAutoscalerConfig
The horizontalPodAutoscalerConfig section controls scaling behavior:
Key Timing Parameters
stabilizationWindowSeconds: Duration the metrics should be in the scaling range before scaling occurs
Longer windows prevent oscillation but reduce responsiveness
Typically longer for scale-down than scale-up
periodSeconds: Minimum time between scaling operations for a specific policy
Should be >= stabilizationWindowSeconds
Longer periods provide more stability
cooldownPeriod: Global cooldown between any scaling operations
Prevents rapid scaling changes
Should be longer than periodSeconds
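A sketch of how these fields fit together (values are illustrative); in the rendered KEDA ScaledObject this block lands under spec.advanced.horizontalPodAutoscalerConfig:

```yaml
horizontalPodAutoscalerConfig:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
```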
Example Configurations
1. Flink Task Manager (Unified Pipeline)
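A sketch of such an entry, consistent with the explanation below; the component key, replica counts, and timing values are illustrative assumptions:

```yaml
unified-pipeline-taskmanager:
  enabled: true
  kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 4                     # match the Kafka partition count of the input topic
  pollingInterval: 30
  cooldownPeriod: 300
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 30   # react quickly to lag spikes
        policies:
          - type: Percent
            value: 100                   # allow doubling per period
            periodSeconds: 60
      scaleDown:
        stabilizationWindowSeconds: 300  # conservative 5-minute scale-down
        policies:
          - type: Percent
            value: 25
            periodSeconds: 300
  # triggers omitted -- see the KEDA Triggers section below
```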
Explanation:
Uses percentage-based scaling for exponential growth
Quick scale-up (30s) for responsive lag handling
Conservative scale-down (5m) to prevent oscillation
Matches Kafka partitions for maximum parallelism
2. Druid Historicals
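A sketch consistent with the explanation below; the component key, replica counts, and timing values are illustrative assumptions:

```yaml
druid-historical:
  enabled: true
  kind: StatefulSet
  minReplicaCount: 2
  maxReplicaCount: 4
  pollingInterval: 60
  cooldownPeriod: 600
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 300   # allow time for segment loading
        policies:
          - type: Pods
            value: 1                      # add one historical at a time
            periodSeconds: 600
      scaleDown:
        stabilizationWindowSeconds: 7200  # retain pods for 2 hours before scaling down
        policies:
          - type: Pods
            value: 1
            periodSeconds: 1800
  # triggers omitted -- see the KEDA Triggers section below
```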
Explanation:
Conservative scaling due to stateful nature
Long pod retention (2h) for stability
Pod-based scaling for precise control
Considers segment loading time
3. Druid Brokers
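A sketch consistent with the explanation below; the component key, replica counts, and timing values are illustrative assumptions:

```yaml
druid-broker:
  enabled: true
  kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30
  cooldownPeriod: 300
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 120   # faster than historicals, slower than Flink
        policies:
          - type: Pods
            value: 1
            periodSeconds: 300
      scaleDown:
        stabilizationWindowSeconds: 3600  # 1-hour retention for query and cache stability
        policies:
          - type: Pods
            value: 1
            periodSeconds: 600
  # triggers omitted -- see the KEDA Triggers section below
```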
Explanation:
Moderate scaling speed (faster than historicals, slower than Flink)
1-hour pod retention for query stability
Pod-based scaling for controlled growth
Balances query routing and cache warmup needs
KEDA Triggers
KEDA supports various trigger types. In our configuration, we use:
Prometheus Triggers:
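The general shape of a Prometheus trigger is shown below; the server address, query, and threshold are illustrative:

```yaml
- type: prometheus
  metadata:
    serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
    query: avg(rate(container_cpu_usage_seconds_total{pod=~"druid-broker.*"}[5m]))
    threshold: "0.8"
```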
Cron Triggers (for night-time scaling):
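A cron trigger holds the workload at a fixed replica count during a time window; the timezone and schedule below are illustrative:

```yaml
- type: cron
  metadata:
    timezone: Asia/Kolkata   # illustrative timezone
    start: 0 0 * * *         # window start: 00:00
    end: 0 6 * * *           # window end: 06:00
    desiredReplicas: "1"
```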
For more trigger types and configurations, refer to the KEDA documentation.
Important Notes
Druid Component Cleanup
Historical Nodes
When scaling down Druid historicals, manual cleanup may be required:
Delete PVCs for unused replicas:
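For example (PVC names and namespace are placeholders):

```bash
kubectl get pvc -n <namespace> | grep historical   # identify PVCs of removed replicas
kubectl delete pvc <pvc-name> -n <namespace>
```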
Delete corresponding PVs:
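Released PVs are not removed automatically when their reclaim policy is Retain:

```bash
kubectl get pv | grep Released
kubectl delete pv <pv-name>
```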
Delete cloud provider disks:
AWS: Delete EBS volumes
Azure: Delete Azure Disks
GCP: Delete Persistent Disks
⚠️ Warning: Always verify that the historical node is not in use and data is replicated before cleanup.
Broker Nodes
When scaling down Druid brokers:
Ensure query drain:
Monitor active queries on the broker
Wait for existing queries to complete
Verify no new queries are being routed
Cache considerations:
Scaling down a broker discards its query cache
New brokers will need time to warm up their cache
Consider gradual scale-down during off-peak hours
⚠️ Warning: Sudden broker scale-down may temporarily impact query performance until cache is rebuilt.
Infrastructure Requirements
Node and IP Requirements
Before enabling autoscaling, ensure your cluster has:
Sufficient Nodes:
Available nodes with required resources (CPU/Memory)
Node autoscaling enabled if using cloud providers
Appropriate node labels/taints if using node affinity
IP Address Availability:
Enough free IPs in the subnet
Consider IP per pod requirements
Reserve IPs for maximum scale scenario
Resource Quotas:
Namespace resource quotas allow for max pods
Cluster-wide limits accommodate scaling
Storage class has sufficient quota (for Historicals)
⚠️ Warning: Scaling operations will fail if infrastructure requirements are not met.
Troubleshooting
Checking Scaling Status
View KEDA ScaledObject status:
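For example (namespace and object names are placeholders):

```bash
kubectl get scaledobjects -n <namespace>
kubectl describe scaledobject <name> -n <namespace>
```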
Check HPA status:
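KEDA typically names the generated HPA keda-hpa-<scaledobject-name>; names below are placeholders:

```bash
kubectl get hpa -n <namespace>
kubectl describe hpa keda-hpa-<scaledobject-name> -n <namespace>
```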
Monitor Kubernetes Events:
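For example:

```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```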
Common Scaling Issues
Scaling Not Triggered:
Verify KEDA metrics:
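One way is to query the external metrics API that KEDA's metrics adapter registers:

```bash
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
```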
Check Prometheus query results directly
Verify trigger thresholds are appropriate
Scaling Fails:
Check for resource constraints:
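For example (kubectl top requires metrics-server):

```bash
kubectl top nodes
kubectl describe resourcequota -n <namespace>
```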
Verify IP availability:
Look for PVC/Storage issues (for Historicals):
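For example:

```bash
kubectl get pvc -n <namespace>
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed
```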
Pods Stuck in Pending:
Check node resources:
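The scheduling events of the pending pod and per-node allocations usually point to the bottleneck:

```bash
kubectl describe pod <pending-pod> -n <namespace>        # see scheduling events
kubectl describe nodes | grep -A 8 "Allocated resources"
```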
Verify node affinity rules
Check for PVC binding issues
Unexpected Scaling Behavior:
Review KEDA logs:
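Assuming a default KEDA install (namespace keda, deployment keda-operator):

```bash
kubectl logs -n keda deployment/keda-operator --tail=100
```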
Check stabilization windows and cooldown periods
Verify metric values over time in Prometheus
Using Events for Monitoring
Set up event monitoring:
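For example, watch HPA-related events as they occur:

```bash
kubectl get events -n <namespace> -w \
  --field-selector involvedObject.kind=HorizontalPodAutoscaler
```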
Important event types to monitor:
SuccessfulRescale: Successful scaling operations
FailedRescale: Failed scaling attempts
FailedGetMetrics: Metric collection issues
FailedComputeMetricsReplicas: Scaling computation issues
Event patterns to watch for:
Repeated scaling attempts
Resource constraint messages
Metric collection failures
PVC/PV binding issues
Best Practices
Start Conservative:
Begin with longer stabilization windows
Use pod-based scaling for precise control
Gradually reduce timing as you understand patterns
Monitor Metrics:
Watch for scaling oscillations
Monitor resource usage patterns
Track query performance impact
Resource Planning:
Ensure cluster has capacity for max replicas
Consider node affinity rules
Plan for storage requirements
Testing:
Test scaling behavior in non-production first
Verify cleanup procedures
Monitor data consistency during scaling