Autoscaling Components

This guide explains how to configure and manage autoscaling for Obsrv components.

Overview

Obsrv components can be scaled using KEDA (Kubernetes Event-driven Autoscaling). The default configuration supports autoscaling for Flink Task Managers and Druid components based on various metrics like CPU usage, query latency, and Kafka lag.

Understanding Scaling Scenarios

Flink Autoscaling

Flink Task Managers need to scale when:

  • Kafka consumer lag grows beyond acceptable thresholds

  • Processing backpressure increases

  • Event processing latency rises

Key considerations:

  • Scale up quickly to handle sudden data bursts

  • Match scaling with Kafka partition count for optimal parallelism

  • Scale down conservatively to avoid job restarts

  • Consider checkpoint completion times during scaling

Druid Autoscaling

Druid components scale for different reasons:

Historical Nodes:

  • Query response times exceed thresholds

  • Segment load times increase

  • Available capacity for new segments decreases

  • High CPU utilization impacts query performance

Broker Nodes:

  • High query queuing or wait times

  • Increased number of concurrent queries

  • CPU utilization affects query routing efficiency

Key considerations:

  • Historical scaling impacts data availability

  • Broker scaling affects query routing and caching

  • Both require careful monitoring of query patterns

  • Consider time of day and workload patterns

Enabling Autoscaling

  1. Enable autoscaling in autoscaling.yaml

  2. Install the autoscaling rules
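The exact layout of autoscaling.yaml depends on your Obsrv release, so treat the following as an illustrative sketch built from the common parameters described in the next section; the component path and values are placeholders:

```yaml
# autoscaling.yaml (illustrative): enable autoscaling for one component
flink:
  task-manager:
    enabled: true            # turn autoscaling on for this component
    kind: Deployment         # Kubernetes resource type being scaled
    minReplicaCount: 1
    maxReplicaCount: 4
    pollingInterval: 30      # seconds between metric checks
```

Installing the rules is typically a Helm upgrade (or kubectl apply) of the chart that ships the KEDA ScaledObjects; the release name, chart path, and namespace below are placeholders:

```bash
helm upgrade --install obsrv-autoscaling ./autoscaling \
  --namespace obsrv \
  --values autoscaling.yaml
```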

Configuration Components

Common Parameters

  • enabled: Enable/disable autoscaling for the component

  • kind: Kubernetes resource type (Deployment/StatefulSet)

  • minReplicaCount: Minimum number of replicas

  • maxReplicaCount: Maximum number of replicas

  • pollingInterval: How often to check metrics (in seconds)

  • cooldownPeriod: How long KEDA waits after the last active trigger before scaling to zero (in seconds); only used when minReplicaCount is 0

HorizontalPodAutoscalerConfig

The horizontalPodAutoscalerConfig section controls scaling behavior:

Key Timing Parameters

  1. stabilizationWindowSeconds:

    • Window over which recent scaling recommendations are considered; the most conservative recommendation within the window is applied

    • Longer windows prevent oscillation but reduce responsiveness

    • Typically longer for scale-down than for scale-up

  2. periodSeconds:

    • Length of the time window a single scaling policy applies to (for example, "add at most 100% more pods per 60 seconds")

    • Longer periods limit how fast replicas can change and provide more stability

  3. cooldownPeriod:

    • KEDA-level setting: how long to wait after the last trigger reports inactive before scaling the workload to zero

    • Only applies when minReplicaCount is 0; otherwise scale-down timing is governed by the HPA behavior settings above

    • Prevents rapid flapping between zero and non-zero replicas
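A rough illustration of where these settings sit in a KEDA ScaledObject; the field names follow the KEDA and Kubernetes HPA APIs, and the values are arbitrary examples rather than recommendations:

```yaml
spec:
  cooldownPeriod: 300                     # KEDA: wait 5m after the last active trigger
                                          # before scaling to zero (minReplicaCount: 0 only)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0   # react immediately to sustained load
          policies:
            - type: Percent
              value: 100                  # at most +100% of current replicas...
              periodSeconds: 60           # ...per 60-second window
        scaleDown:
          stabilizationWindowSeconds: 300 # consider the last 5m of recommendations
          policies:
            - type: Pods
              value: 1                    # remove at most one pod...
              periodSeconds: 300          # ...per 5-minute window
```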

Example Configurations
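1. Flink Task Managers

The original configuration is not reproduced here, so the following is a hypothetical sketch consistent with the explanation below; the namespace, workload name, Prometheus address, and lag query are placeholders to adapt to your deployment:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: flink-taskmanager-scaler
  namespace: flink                     # placeholder
spec:
  scaleTargetRef:
    kind: Deployment
    name: flink-taskmanager            # placeholder
  minReplicaCount: 1
  maxReplicaCount: 4                   # keep <= partition count of the input Kafka topic
  pollingInterval: 30
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Percent
              value: 100               # double replicas per window when lag spikes
              periodSeconds: 30        # quick scale-up (30s)
        scaleDown:
          stabilizationWindowSeconds: 300   # conservative scale-down (5m)
          policies:
            - type: Percent
              value: 50
              periodSeconds: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090   # placeholder
        query: sum(kafka_consumergroup_lag{consumergroup="obsrv-pipeline"})  # placeholder metric/labels
        threshold: "10000"
```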

Explanation:

  • Uses percentage-based scaling for exponential growth

  • Quick scale-up (30s) for responsive lag handling

  • Conservative scale-down (5m) to prevent oscillation

  • Matches Kafka partitions for maximum parallelism

2. Druid Historicals
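The original configuration is not shown here; the sketch below is a hypothetical layout consistent with the explanation that follows (names, namespace, Prometheus address, metric, and threshold are placeholders). Note that the HPA caps stabilizationWindowSeconds at 3600 seconds, so the roughly two-hour retention is approximated with the maximum window plus a very slow scale-down policy:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: druid-historical-scaler
  namespace: druid                     # placeholder
spec:
  scaleTargetRef:
    kind: StatefulSet
    name: druid-historicals            # placeholder
  minReplicaCount: 2
  maxReplicaCount: 4
  pollingInterval: 60
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 600
          policies:
            - type: Pods
              value: 1                 # one pod at a time: segment loading is expensive
              periodSeconds: 600
        scaleDown:
          stabilizationWindowSeconds: 3600   # maximum the HPA allows
          policies:
            - type: Pods
              value: 1                 # remove at most one pod per 30 minutes
              periodSeconds: 1800
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090   # placeholder
        query: avg(druid_query_time_ms{service="druid/historical"})          # placeholder metric
        threshold: "500"
```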

Explanation:

  • Conservative scaling due to stateful nature

  • Long pod retention (2h) for stability

  • Pod-based scaling for precise control

  • Considers segment loading time

3. Druid Brokers
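Again, the original configuration is not reproduced, so this is a hypothetical sketch consistent with the explanation below (names, namespace, and the Prometheus query are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: druid-broker-scaler
  namespace: druid                     # placeholder
spec:
  scaleTargetRef:
    kind: Deployment
    name: druid-brokers                # placeholder
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 120   # faster than historicals, slower than Flink
          policies:
            - type: Pods
              value: 1                      # pod-based scaling for controlled growth
              periodSeconds: 120
        scaleDown:
          stabilizationWindowSeconds: 3600  # 1-hour pod retention for query stability
          policies:
            - type: Pods
              value: 1
              periodSeconds: 600
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090   # placeholder
        query: avg(druid_query_wait_time_ms{service="druid/broker"})         # placeholder metric
        threshold: "100"
```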

Explanation:

  • Moderate scaling speed (faster than historicals, slower than Flink)

  • 1-hour pod retention for query stability

  • Pod-based scaling for controlled growth

  • Balances query routing and cache warmup needs

KEDA Triggers

KEDA supports various trigger types. In our configuration, we use:

  1. Prometheus Triggers

  2. Cron Triggers (for night-time scaling)
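Both trigger types are sketched below with placeholder addresses, queries, and schedules; the field names follow the KEDA trigger specifications:

```yaml
triggers:
  # Prometheus trigger: scale on the result of a PromQL query
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090   # placeholder
      query: sum(kafka_consumergroup_lag{consumergroup="obsrv-pipeline"})  # placeholder
      threshold: "10000"

  # Cron trigger: maintain a set replica count during a recurring window
  # (here, a placeholder overnight schedule)
  - type: cron
    metadata:
      timezone: Asia/Kolkata        # placeholder timezone
      start: 0 22 * * *             # 22:00 every day
      end: 0 6 * * *                # 06:00 every day
      desiredReplicas: "1"
```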

For more trigger types and configurations, refer to KEDA Documentation.

Important Notes

Druid Component Cleanup

Historical Nodes

When scaling down Druid historicals, manual cleanup may be required:

  1. Delete PVCs for unused replicas (see the example commands after this list)

  2. Delete the corresponding PVs

  3. Delete cloud provider disks:

  • AWS: Delete EBS volumes

  • Azure: Delete Azure Disks

  • GCP: Delete Persistent Disks
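A hypothetical cleanup sequence, assuming the historicals run as a StatefulSet named druid-historicals in the druid namespace; adjust names and replica indexes to your installation, and double-check each resource before deleting:

```bash
# List PVCs left behind by scaled-down StatefulSet replicas
kubectl get pvc -n druid | grep druid-historicals

# Delete the PVC of a replica that no longer exists (placeholder name/index)
kubectl delete pvc data-druid-historicals-3 -n druid

# Find and delete the released PV that backed it (placeholder name)
kubectl get pv | grep data-druid-historicals-3
kubectl delete pv pvc-<volume-id>

# Finally, remove the underlying disk in your cloud provider's console/CLI
# (EBS volume, Azure Disk, or GCP Persistent Disk) if the reclaim policy is Retain
```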

⚠️ Warning: Always verify that the historical node is not in use and data is replicated before cleanup.

Broker Nodes

When scaling down Druid brokers:

  1. Ensure query drain:

    • Monitor active queries on the broker

    • Wait for existing queries to complete

    • Verify no new queries are being routed

  2. Cache considerations:

    • Scaling down a broker discards its query cache

    • New brokers will need time to warm up their cache

    • Consider gradual scale-down during off-peak hours

⚠️ Warning: Sudden broker scale-down may temporarily impact query performance until the cache is rebuilt.

Infrastructure Requirements

Node and IP Requirements

Before enabling autoscaling, ensure your cluster has:

  1. Sufficient Nodes:

    • Available nodes with required resources (CPU/Memory)

    • Node autoscaling enabled if using cloud providers

    • Appropriate node labels/taints if using node affinity

  2. IP Address Availability:

    • Enough free IPs in the subnet

    • Consider IP per pod requirements

    • Reserve IPs for maximum scale scenario

  3. Resource Quotas:

    • Namespace resource quotas allow for max pods

    • Cluster-wide limits accommodate scaling

    • Storage class has sufficient quota (for Historicals)

⚠️ Warning: Scaling operations will fail if infrastructure requirements are not met.
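A few quick checks before enabling autoscaling (standard kubectl commands; IP availability itself is verified on the cloud side, for example free addresses in the node subnets):

```bash
# Allocatable CPU/memory and current requests per node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Namespace resource quotas (if any) that could block new pods
kubectl get resourcequota -A

# Storage classes and existing PVCs (relevant for Druid historicals)
kubectl get storageclass
kubectl get pvc -A
```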

Troubleshooting

Checking Scaling Status

  1. View KEDA ScaledObject status

  2. Check HPA status

  3. Monitor Kubernetes events
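Typical commands for these three checks (the namespace and object names are placeholders):

```bash
# KEDA ScaledObject status and conditions
kubectl get scaledobject -n obsrv
kubectl describe scaledobject flink-taskmanager-scaler -n obsrv

# The HPA that KEDA creates on your behalf
kubectl get hpa -n obsrv

# Recent scaling-related events
kubectl get events -n obsrv --sort-by=.lastTimestamp
```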

Common Scaling Issues

  1. Scaling Not Triggered:

    • Verify KEDA metrics (see the diagnostic commands after this list)

    • Check Prometheus query results directly

    • Verify trigger thresholds are appropriate

  2. Scaling Fails:

    • Check for resource constraints

    • Verify IP availability

    • Look for PVC/Storage issues (for Historicals)

  3. Pods Stuck in Pending:

    • Check node resources

    • Verify node affinity rules

    • Check for PVC binding issues

  4. Unexpected Scaling Behavior:

    • Review KEDA logs

    • Check stabilization windows and cooldown periods

    • Verify metric values over time in Prometheus
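Illustrative commands for the checks above; the keda namespace and operator deployment name assume a default KEDA installation, and the other namespaces and names are placeholders:

```bash
# Verify that KEDA is serving external metrics
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"

# Resource constraints and pending pods
kubectl get pods -n obsrv --field-selector=status.phase=Pending
kubectl top nodes          # requires metrics-server

# PVC / storage issues for historicals
kubectl get pvc -n druid
kubectl describe pvc <pvc-name> -n druid

# KEDA operator logs for scaling decisions and errors
kubectl logs -n keda deployment/keda-operator --tail=200
```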

Using Events for Monitoring

  1. Set up event monitoring (an example command follows this list)

  2. Important event types to monitor:

    • SuccessfulRescale: Successful scaling operations

    • FailedRescale: Failed scaling attempts

    • FailedGetMetrics: Metric collection issues

    • FailedComputeMetricsReplicas: Scaling computation issues

  3. Event patterns to watch for:

    • Repeated scaling attempts

    • Resource constraint messages

    • Metric collection failures

    • PVC/PV binding issues
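One way to watch scaling events as they happen (the namespace is a placeholder; the event reasons are the standard HPA reasons listed above):

```bash
# Stream scaling events in real time
kubectl get events -n obsrv --watch

# Or filter for a specific event reason
kubectl get events -n obsrv --field-selector reason=SuccessfulRescale
```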

Best Practices

  1. Start Conservative:

    • Begin with longer stabilization windows

    • Use pod-based scaling for precise control

    • Gradually reduce timing as you understand patterns

  2. Monitor Metrics:

    • Watch for scaling oscillations

    • Monitor resource usage patterns

    • Track query performance impact

  3. Resource Planning:

    • Ensure cluster has capacity for max replicas

    • Consider node affinity rules

    • Plan for storage requirements

  4. Testing:

    • Test scaling behavior in non-production first

    • Verify cleanup procedures

    • Monitor data consistency during scaling
