AWS Installation Guide

Installation Steps

1. Clone the Obsrv Automation Repository

Start by cloning the Obsrv automation repository and check out either the latest release tag or the main branch.

git clone git@github.com:Sanketika-Obsrv/obsrv-automation.git
cd obsrv-automation
git checkout <latest_release_tag>   # or: git checkout main

2. Configure the Kubernetes Cluster

Execute the following steps to bring up the Kubernetes cluster in the configured region of your AWS environment.

  1. Navigate to the Configuration Directory:

    cd ./obsrv-automation/terraform/aws/vars
  2. Update Configuration Files:

    • Open cluster_overrides.tf and modify the configuration values to match your environment.

    building_block = "obsrv"
    env = "dev"
    region = "us-east-2"
    availability_zones = ["us-east-2a", "us-east-2b", "us-east-2c"]
    timezone = "UTC"
    create_kong_ingress_ip = "false"  # Set to "true" if Kong service type is LoadBalancer, otherwise set to "false" for NodePort.
    create_vpc = "false"
    create_velero_user = "false"
    eks_node_group_instance_type = ["t2.xlarge"] # Choose an instance type based on your CPU and memory requirements
    eks_node_group_capacity_type = "ON_DEMAND"
    eks_node_group_scaling_config = { desired_size = 5, max_size = 5, min_size = 1 } # Adjust the scaling configuration based on your expected load
    eks_node_disk_size = 100
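
    If you need to confirm the availability zones for your chosen region, they can be listed with the AWS CLI, for example:

    aws ec2 describe-availability-zones --region us-east-2 --query "AvailabilityZones[].ZoneName"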
    
  3. Configure S3 for Cluster State:

    • Open obsrv.conf in the obsrv-automation/infra-setup directory and update your AWS credentials and bucket names.

    💡 Note: If the EC2 instance is configured with an assumed identity (IAM role), there is no need to define the AWS credentials.

    AWS_ACCESS_KEY_ID=<your_access_key_id>
    AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
    AWS_DEFAULT_REGION="us-east-2"
    KUBE_CONFIG_PATH="$HOME/.kube/obsrv-kube-config.yaml"
    AWS_TERRAFORM_BACKEND_BUCKET_NAME="obsrv-tfstate"
    AWS_TERRAFORM_BACKEND_BUCKET_REGION="us-east-2"
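
    If the Terraform backend bucket named above does not already exist in your account, it can be created with the AWS CLI (the bucket name and region below are the example values from obsrv.conf):

    aws s3 mb s3://obsrv-tfstate --region us-east-2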

3. Run the Installation Script

  1. Make the Script Executable: The file is located in the obsrv-automation/infra-setup directory.

  2. Run the Installation:

    • To start the installation, run the script (a sketch is shown after this list).

    • If you want the installer to automatically handle dependencies, set install_dependencies=true.
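
A minimal sketch of these two steps, assuming the installation script in infra-setup is named setup.sh (check the directory for the actual file name):

cd obsrv-automation/infra-setup
chmod +x setup.sh                  # make the script executable (setup.sh is an assumed name)
export install_dependencies=true   # optional: let the installer handle dependencies automatically
./setup.sh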

4. Verify the Cluster

Once the installation completes, verify that your Kubernetes cluster is up and running:
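
For example, using the kubeconfig path configured earlier in obsrv.conf:

export KUBECONFIG="$HOME/.kube/obsrv-kube-config.yaml"
kubectl get nodes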

The result of the above command should show the nodes in your Kubernetes cluster.


Helm Chart Configuration

1. Navigate to the Helm Chart Directory

2. Update AWS Cloud Configuration

Modify global-cloud-values-aws.yaml with the appropriate values for your environment:

3. Configure Domain

Update the global-values.yaml file, replacing <domain> with your actual domain, Elastic IP, or cluster node IP and port, depending on your Kong service type:

  • LoadBalancer: If Kong's service type is LoadBalancer, retrieve the Elastic IP from the AWS console and use the following format: Domain: <eip>.sslip.io

  • NodePort: If Kong's service type is NodePort, use the external IP of the cluster node where Kong is deployed, along with Kong's NodePort, in this format: Domain: <Cluster node external IP>:<node-port-of-kong>
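
Kong's service can be inspected to find these values; for the NodePort case, the node port appears in the PORT(S) column:

kubectl get svc -n kong-ingress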

Important: Security Group Update (NodePort Only)

This step is only required if Kong's service type is NodePort. You must update the security group's inbound rules.

Instructions for Modifying Security Group Inbound Rules:

  1. Add a new inbound rule.

  2. Set the Type to "Custom TCP."

  3. For Port Range, specify the port used by Kong. Retrieve this port by running: kubectl get svc -n kong-ingress. This command will show you Kong's NodePort.

  4. To restrict access, set the Source to your organization's IP address in /32 CIDR format. Ensure the Port Range matches Kong's NodePort.

  5. Additionally, set the Source to the external IPs of all nodes in your EKS cluster (also in /32 CIDR format). Again, the Port Range must be Kong's NodePort. This ensures proper cluster access restriction.
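
For reference, an equivalent inbound rule can also be added with the AWS CLI (the security group ID, port, and CIDR below are placeholders):

aws ec2 authorize-security-group-ingress \
  --group-id <node-security-group-id> \
  --protocol tcp \
  --port <kong-nodeport> \
  --cidr <your-org-ip>/32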

4. Clone the Obsrv Client Automation Repository

Start by cloning the Obsrv client automation repository and check out either the latest release tag or the main branch.

5. Install Obsrv

Make the script executable, set the required environment variables, and run the installation. The enterprise.sh file is located in the /obsrv-scripts-infy/automation-v2/enterprise-automation/kitchen/ directory.
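
A minimal sketch of this step (the environment variables required by enterprise.sh depend on your deployment and are not listed here):

cd /obsrv-scripts-infy/automation-v2/enterprise-automation/kitchen/
chmod +x enterprise.sh
# export the environment variables required for your deployment, then run:
./enterprise.sh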


Post-Installation Verification

After completing the installation, follow these steps to verify that all components are running correctly:

1. Check Kubernetes Components

  1. Verify all pods are running:

    All pods should be in Running state (see the commands after this list). Common namespaces to check:

    • flink: Core Pipeline

    • monitoring: Monitoring stack

    • dataset-api: Dataset APIs

    • web-console: Dataset Management console

  2. Check Services:

    Verify that essential services have external IPs assigned, particularly the Kong service.

If any component fails these checks, refer to the component-specific logs.
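
A minimal set of verification commands (the namespace and pod name in the last command are placeholders):

# list all pods across namespaces and confirm they are in Running state
kubectl get pods -A

# check that essential services, particularly Kong, have external IPs or node ports assigned
kubectl get svc -n kong-ingress

# inspect the logs of a specific component if a check fails
kubectl logs -n <namespace> <pod-name>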


By following these steps, you will ensure a successful installation and configuration of Obsrv on AWS.

Sanity Checklist

After installation, perform a sanity test to validate the deployment. Mark each check item as ✔ or ✘.

Ingestion

  • All ingestion connectors running with the expected replicas (✔/✘)

  • Data flowing from all expected upstream sources (✔/✘)

  • No ingestion backlog in Kafka topics (✔/✘)

  • Schema validation passing for incoming messages (✔/✘)

  • No ingestion error messages in the connector pod logs (✔/✘)

  • Resource configurations are correct for the environment and load (✔/✘)

Processing

  • The unified pipeline, cache-indexer, and lakehouse-connector jobs in RUNNING state with the expected replica configurations (✔/✘)

  • Checkpointing active and stable (✔/✘)

  • 0% failed events (no schema-validation or duplicate-event failures) and no elevated lag (✔/✘)

  • Kafka partition counts and Flink job configurations are correct for the load and environment (✔/✘)

  • No errors in the pod logs (✔/✘)

Querying

  • Druid ingestion tasks running and segments published (✔/✘)

  • Hudi datasets up to date and queryable (✔/✘)

  • Query APIs responding within acceptable latency (✔/✘)

  • Able to query real-time and historical data from both Hudi and Druid (✔/✘)

  • Spot checks return correct and fresh data (✔/✘)

Storage

  • Velero backups completed successfully (✔/✘)

  • Kafka/Druid/Hudi backups available (✔/✘)

  • Secor backup service running and healthy (✔/✘)

  • Secor backup files for dataset events available in blob storage (✔/✘)

  • No errors or elevated lag in the Secor service (✔/✘)

  • Restore test performed in staging (optional) (✔/✘)

Monitoring

  • All key metrics collected (Kafka, Flink, Druid, Hudi, APIs) (✔/✘)

  • Grafana dashboards rendering without gaps (✔/✘)

  • No abnormal spikes in error rates, latency, or usage (✔/✘)

Alerts

  • All alerting rules enabled and targeting the correct channels (✔/✘)

  • Test alerts sent and acknowledged (✔/✘)

  • Critical alert thresholds correctly configured (✔/✘)

Management Console

  • Management console is accessible (✔/✘)

  • All datasets are healthy (✔/✘)

  • CPU, memory, and volume usage are not abnormal (✔/✘)

  • All service pods in Running state with the expected number of restarts (✔/✘)

Final

  • End-to-end data flow verified (Ingestion → Processing → Storage → Query) (✔/✘)
