Migration Guide: Obsrv 1.x to Obsrv 2.x
This documentation provides detailed steps to migrate Obsrv from version 1.x to version 2.x.
Overview
This document outlines the migration strategy from Obsrv 1.x to Obsrv 2.x, with a focus on data integrity, minimal downtime, and operational continuity.
You have two migration options:
Method 1: Stop the 1.x ingestion system and upgrade everything in one go (downtime is required).
When to choose this method:
If a few hours of downtime is acceptable (meaning real-time data won’t be available for querying during that period, but historical data will still be accessible).
If you want the simplest upgrade process (🚀 the go-to option for a quick, low-complexity migration).
Method 2: Using the Kafka Metadata Sync tool to replicate metadata (topics, consumer offsets, etc.) live between the old and new Kafka clusters (minimal downtime).
When to choose this method:
If downtime for real-time querying must not exceed a few minutes.
If you are comfortable setting up a tool that synchronizes Kafka metadata between two clusters. (Note: steps to set up the metadata synchronization tool are provided at the end of this section.)
Method 1 – Stop Ingestion & Upgrade
Step-by-step
1) Stop data ingestion
Identify all ingestion jobs/connectors that send data to Obsrv (e.g., Kafka Connect, Debezium, Neo4j, API jobs, etc.).
Scale down all the connectors to prevent any new events from entering the system.
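For example, if the connectors run as Kubernetes Deployments (the namespace and deployment names below are placeholders for your environment):

```bash
# Scale the ingestion connector deployments down to zero replicas
# (placeholder names; list your actual connector deployments)
kubectl scale deployment kafka-connect debezium-connector \
  --replicas=0 -n <connectors-namespace>

# Confirm that no connector pods are left running
kubectl get pods -n <connectors-namespace>
```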
2) Clear processing lag
Allow the following services to clear all their lag:
Flink jobs
Druid ingestion tasks
Hudi writers
Monitor the consumer lag until all groups display a value of 0.
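One way to watch the drain, assuming access to the stock Kafka CLI tools (the bootstrap address is a placeholder for the Kafka 3.6 cluster):

```bash
# Describe all consumer groups every 10 seconds and watch the LAG column drain to 0
watch -n 10 'kafka-consumer-groups.sh \
  --bootstrap-server <kafka-3.6-broker>:9092 \
  --describe --all-groups'
```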
3) Take a backup (for disaster recovery)
Why: If anything breaks, you can roll back quickly.
Create a Velero backup of the Obsrv namespace:
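For example, with the Velero CLI (the backup name and namespace are placeholders):

```bash
# Back up every resource in the Obsrv namespace before upgrading
velero backup create obsrv-pre-2x-upgrade \
  --include-namespaces <obsrv-namespace> \
  --wait

# Verify the backup completed successfully
velero backup describe obsrv-pre-2x-upgrade
```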
4) Verify Kafka 3.6 consumer groups have zero lag
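A quick check that prints only the rows with non-zero lag (the bootstrap address is a placeholder; the awk filter assumes the standard kafka-consumer-groups.sh output where LAG is the sixth column):

```bash
# Print only rows whose LAG column is a number greater than zero;
# no output means every consumer group is fully caught up
kafka-consumer-groups.sh \
  --bootstrap-server <kafka-3.6-broker>:9092 \
  --describe --all-groups \
  | awk '$6 ~ /^[0-9]+$/ && $6 > 0'
```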
No output → all lag has been cleared.
If any rows are displayed, there is still lag; wait until it drops to 0.
5) Deploy Obsrv 2.0
Update environment values in the 2.0.0 manifests (secrets, resource configuration, etc.).
Apply the changes and verify the health of the pods.
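For example (the namespace is a placeholder for wherever Obsrv is deployed):

```bash
# Check that all Obsrv pods are Running/Ready after applying the 2.0.0 manifests
kubectl get pods -n <obsrv-namespace>

# Inspect events and recent logs for any pod that is not healthy
kubectl describe pod <pod-name> -n <obsrv-namespace>
kubectl logs <pod-name> -n <obsrv-namespace> --tail=100
```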
6) Support for Existing Datasets
By default, new datasources point to the managed Kafka version, so no manual update is needed after creation.
For existing datasources, you can either update Postgres manually, use the Datasource Update API to modify the ingestion spec with the latest Kafka URL, or simply edit and republish the datasets; republished datasets will then pick up the latest configured Kafka URL.
7) Sanity
Keep ingestion disabled at first.
Run sanity tests:
Open the Obsrv console UI and verify the health of datasets
Run basic queries either in Druid or using the Query APIs (an example query is shown at the end of this step).
Check Druid, Hudi, and Pipeline health status.
The detailed sanity checklists are defined below in the tabular format.
Once verified, gradually re-enable the ingestion connectors, monitor the logs for errors, and confirm that data is being ingested into the datastore.
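For example, a basic sanity query against the Druid SQL API (the router host/port and datasource name are placeholders for your environment):

```bash
# Run a simple count query against the Druid SQL endpoint to confirm
# the datasource is queryable (replace the host and datasource name)
curl -s -X POST http://<druid-router>:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT COUNT(*) AS total FROM \"<datasource-name>\""}'
```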
Method 2 – Live Kafka Sync (Low Downtime)
Step-by-step
1). Upgrade to Obsrv 2.0.0-GA (pre-release)
Upgrade the existing Obsrv deployment from 1.x to 2.0.0-GA.
This version supports syncing metadata from the old Kafka cluster to the new Kafka 4.0 cluster.
Before upgrading, update all environment-specific configurations (secrets, resource configuration, etc.).
2). Install the Kafka Sync Operator Tool
This step sets up the tooling used to sync Kafka metadata from the old cluster to the new one.
Create a namespace for MM2:
Install Strimzi:
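For example, using the Strimzi quick-start installation (the mm2 namespace name is a placeholder; a Helm install of strimzi-kafka-operator works equally well):

```bash
# Create a dedicated namespace for MirrorMaker 2
kubectl create namespace mm2

# Install the Strimzi operator into that namespace
kubectl create -f 'https://strimzi.io/install/latest?namespace=mm2' -n mm2

# Wait for the operator pod to come up
kubectl get pods -n mm2 -w
```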
3). Prepare MirrorMaker 2 config
Create mm2.yaml with the source and target Kafka clusters defined:
source: the old Kafka 3.6 cluster
target: the new Kafka 4.0 cluster
Make sure topicsPattern and groupsPattern are set to .* so that everything is replicated.
Use IdentityReplicationPolicy to keep topic names unchanged.
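A minimal sketch of the KafkaMirrorMaker2 resource (the resource name, bootstrap addresses, Kafka version, and replication factors are placeholders to adjust for your clusters):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: obsrv-mm2
  namespace: mm2
spec:
  version: 3.6.0                     # Kafka version used by the MM2 connect cluster (placeholder)
  replicas: 1
  connectCluster: "target"           # MM2 runs against the new (target) cluster
  clusters:
    - alias: "source"                # old Kafka 3.6 cluster
      bootstrapServers: <kafka-3.6-bootstrap>:9092
    - alias: "target"                # new Kafka 4.0 cluster
      bootstrapServers: <kafka-4.0-bootstrap>:9092
      config:
        config.storage.replication.factor: 1
        offset.storage.replication.factor: 1
        status.storage.replication.factor: 1
  mirrors:
    - sourceCluster: "source"
      targetCluster: "target"
      topicsPattern: ".*"            # replicate every topic
      groupsPattern: ".*"            # replicate every consumer group
      sourceConnector:
        config:
          replication.factor: 1
          offset-syncs.topic.replication.factor: 1
          sync.topic.configs.enabled: "true"
          # keep topic names identical on the target cluster
          replication.policy.class: "org.apache.kafka.connect.mirror.IdentityReplicationPolicy"
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 1
          sync.group.offsets.enabled: "true"
          replication.policy.class: "org.apache.kafka.connect.mirror.IdentityReplicationPolicy"
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 1
```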
4). Deploy MirrorMaker 2
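Apply the manifest prepared above (assuming the resource is named obsrv-mm2 in the mm2 namespace):

```bash
# Deploy the MirrorMaker 2 custom resource and wait for it to become ready
kubectl apply -f mm2.yaml -n mm2
kubectl wait kafkamirrormaker2/obsrv-mm2 --for=condition=Ready -n mm2 --timeout=600s
```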
This will start:
SourceConnector → copies data from old to new topics.
CheckpointConnector → copies consumer offsets.
HeartbeatConnector → keeps track of connectivity.
5). Verify topic and offset sync
Syncing all data from the source Kafka cluster to the target generally takes approximately 15 to 30 minutes.
On the target cluster:
You should see all topics from the source.
Check consumer groups:
Offsets should match or be close to the source.
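The standard Kafka CLI tools can be used for both checks (the target bootstrap address is a placeholder):

```bash
# List topics on the new Kafka 4.0 cluster; all source topics should appear
kafka-topics.sh --bootstrap-server <kafka-4.0-bootstrap>:9092 --list

# Describe consumer groups on the target and compare offsets with the source
kafka-consumer-groups.sh --bootstrap-server <kafka-4.0-bootstrap>:9092 \
  --describe --all-groups
```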
6). Test data flow
If messages are flowing into a source Kafka (3.6) topic, they should be replicated to the corresponding topic in the newer Kafka (4.0) cluster.
Consume messages from the same topic on the target cluster; if the messages appear there, the sync is working.
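A quick end-to-end check with the console producer and consumer (the topic name and bootstrap addresses are placeholders):

```bash
# Produce a test message to a topic on the source (Kafka 3.6) cluster
echo '{"event":"mm2-sync-test"}' | kafka-console-producer.sh \
  --bootstrap-server <kafka-3.6-bootstrap>:9092 --topic <test-topic>

# Consume from the same topic on the target (Kafka 4.0) cluster;
# the test message should appear within a few seconds
kafka-console-consumer.sh \
  --bootstrap-server <kafka-4.0-bootstrap>:9092 \
  --topic <test-topic> --from-beginning --timeout-ms 30000
```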
7). Upgrade to Obsrv 2.0
Update environment configs, resource configurations, etc. before performing the Obsrv upgrade.
Deploy Obsrv 2.0.0
Once the system is upgraded, all data will flow to the newer version of Kafka (4.0).
8). Upgrade to Obsrv 2.0.1
Once data is fully flowing into the target Kafka (4.0) and verified, decommission the source Kafka (3.x) by upgrading Obsrv to version 2.0.1. Ensure all required configurations are in place and validated before initiating the 2.0.1 upgrade.
Sanity Checklist
Ingestion
| Check | Status |
| --- | --- |
| All ingestion connectors running with expected replicas | (✔/✘) |
| Data flowing from all expected upstream sources | (✔/✘) |
| No ingestion backlog in Kafka topics | (✔/✘) |
| Schema validation passing for incoming messages | (✔/✘) |
| No ingestion error messages from the connector pods | (✔/✘) |
| Resource configurations are correct for the environment and load | (✔/✘) |
Processing
| Check | Status |
| --- | --- |
| The unified pipeline, cache-indexer, and lakehouse-connector jobs are in RUNNING state with the expected replica configurations | (✔/✘) |
| Checkpointing active and stable | (✔/✘) |
| 0% failed events (no schema validation or deduplication failures) and no elevated lag | (✔/✘) |
| Kafka partition counts and Flink job configurations are correct for the load and environment | (✔/✘) |
| No errors in the pod logs | (✔/✘) |
Querying
| Check | Status |
| --- | --- |
| Druid ingestion tasks running and segments published | (✔/✘) |
| Hudi datasets up to date and queryable | (✔/✘) |
| Query APIs responding within acceptable latency | (✔/✘) |
| Able to query real-time and historical data from both Hudi and Druid | (✔/✘) |
| Spot checks return correct and fresh data | (✔/✘) |
Storage
| Check | Status |
| --- | --- |
| Velero backups completed successfully | (✔/✘) |
| Kafka/Druid/Hudi backups available | (✔/✘) |
| Secor backup service is running and healthy | (✔/✘) |
| Dataset event backup files from Secor are available in blob storage | (✔/✘) |
| No errors or elevated lag in the Secor service | (✔/✘) |
| Restore test performed in staging (optional) | (✔/✘) |
Monitoring
| Check | Status |
| --- | --- |
| All key metrics collected (Kafka, Flink, Druid, Hudi, APIs) | (✔/✘) |
| Grafana dashboards rendering without gaps | (✔/✘) |
| No abnormal spikes in error rates, latency, or usage | (✔/✘) |
Alerts
| Check | Status |
| --- | --- |
| All alerting rules enabled and targeting correct channels | (✔/✘) |
| Test alerts sent and acknowledged | (✔/✘) |
| Critical alert thresholds correctly configured | (✔/✘) |
Management Console
| Check | Status |
| --- | --- |
| Management console is accessible | (✔/✘) |
| All datasets are healthy | (✔/✘) |
| CPU, memory, and volume usage are within normal ranges | (✔/✘) |
| All service pods in Running state with the expected restart counts | (✔/✘) |
Final
| Check | Status |
| --- | --- |
| End-to-end data flow verified (Ingestion → Processing → Storage → Query) | (✔/✘) |