Create a Dataset
This page provides a detailed, step-by-step overview of how to create a dataset in Obsrv.
🧾 Introduction
Creating datasets is the essential first step to making your data usable in obsrv. This guide details the process using the user interface wizard. You'll learn how to define your dataset's structure, configure ingestion and processing rules, select storage options, and finally publish it for active use.
1. Navigate to Dataset Creation

From the main Dashboard, locate the left-hand navigation menu.
Click on the Dataset Creation option.
2. Initiate New Dataset

You will land on the "New Dataset" page.
Click the Create New Dataset button to begin the setup wizard.
3. Connector Selection (Optional)

The wizard starts at the Connector step (Step 1).
This step allows you to choose specific data connectors if needed. For a basic setup using the default API or manual uploads later, you can skip this.
Click the Skip button to proceed with the default configuration; connectors can be added later if required.
4. Configure Ingestion - Dataset Details

You are now on the Ingestion step (Step 2).
Dataset Name: Enter a unique, descriptive name for your dataset (e.g., demo-dataset). Follow the guideline: use only letters and avoid special characters so that a clean Dataset ID can be generated.
Dataset ID: This field is typically auto-generated based on the Dataset Name. You can leave it as is.
Dataset Type: Choose the type that best describes your data:
Event/Telemetry Data: For ongoing, append-only data like logs or sensor readings (selected in the example).
Data Changes (Updates or Transactions): For data that involves updates or changes to existing records (CDC).
Master Data: For reference data used to enrich other datasets (denormalization).
Select the appropriate radio button for your Dataset Type; the illustrative records sketched below show what each type can look like.
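For orientation, here is a minimal sketch of what records of each type might look like; the field names are purely illustrative and not taken from Obsrv or the example dataset.

```python
# Illustrative record shapes for the three dataset types (field names are hypothetical).

# Event/Telemetry Data: append-only facts, each record describes something that happened.
event_record = {
    "event_id": "e-1001",
    "event_ts": "2024-11-05T10:15:30Z",
    "sensor_id": "s-42",
    "temperature": 21.7,
}

# Data Changes (CDC): describes an update to an existing record, not a new fact.
cdc_record = {
    "op": "UPDATE",
    "table": "orders",
    "primary_key": "order-77",
    "changed_fields": {"status": "SHIPPED"},
    "change_ts": "2024-11-05T10:16:02Z",
}

# Master Data: slowly changing reference data used to enrich (denormalize) other datasets.
master_record = {
    "sensor_id": "s-42",
    "location": "warehouse-3",
    "manufacturer": "Acme",
}
```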
5. Configure Ingestion - Upload Data/Schema

Scroll down to the Upload Data section.
You have two options:
Upload Sample Data: Drag and drop your JSONL sample file onto the designated area, or click Choose a JSON File to browse. This is used for schema inference; a minimal sketch of preparing such a file follows this step. The example shows nyt_nov_apr_sample.json being uploaded.
Upload Schema File: If you have a predefined JSON schema, you can upload it here instead.
Once your sample file is successfully uploaded (indicated by a progress bar reaching 100% and the file listed under "Files Uploaded"), click the Proceed button.
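If you do not have a sample file handy, the following sketch (with an entirely made-up file name and fields) writes a small JSONL file, i.e. one JSON object per line, that could be uploaded for schema inference.

```python
import json

# Hypothetical trip records; a real sample should reflect your own data.
records = [
    {"trip_id": "t-001", "pickup_time": "2024-11-01T08:12:00Z", "trip_distance": 3,   "fare": 12.50},
    {"trip_id": "t-002", "pickup_time": "2024-11-01T09:05:00Z", "trip_distance": 7.4, "fare": 24.00},
    {"trip_id": "t-003", "pickup_time": "2024-11-01T09:40:00Z", "trip_distance": 1.2, "fare": 6.75},
]

# JSONL format: one JSON object per line, no enclosing array.
with open("sample_data.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```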
6-1. Configure Ingestion - Review Schema Details

Obsrv infers the schema from your sample data and displays it.
Review the generated Schema Details:
Fields: The names of the fields detected in your data.
Arrival Format: The format detected in the JSON (e.g., number, text).
Data Type: The proposed data type for storage and querying (e.g., integer, string, date-time, double).
Required: A toggle to mark whether the field must be present in every record.
6-2. Configure Ingestion - Address Recommendations/Conflicts

Obsrv may suggest Recommended Changes for data types (e.g., changing a number identified as potentially double to integer, as seen for trip_distance).
Sometimes, a Must-Fix conflict might appear if the sample data suggests conflicting types for the same field (e.g., trip_distance appearing as both double and integer in different records).
Click the red warning icon next to a field with a conflict (such as trip_distance) to see the details.
Choose an appropriate resolution: either accept the recommendation, manually change the Data Type, or Mark as resolved if you agree with the initially inferred type despite the conflict warning.

Repeat until all recommendations and conflicts are addressed and show as Resolved. The sketch below illustrates how such a type conflict can arise in sample data.
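To see why a field such as trip_distance gets flagged, the stdlib-only sketch below (not Obsrv's actual inference logic) reports the JSON arrival types observed per field in a JSONL sample; a field that arrives as both integer and double is exactly the kind of Must-Fix conflict described above.

```python
import json
from collections import defaultdict

def arrival_types(path):
    """Report which JSON types each top-level field arrives as across a JSONL sample."""
    seen = defaultdict(set)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            for field, value in record.items():
                # bool is a subclass of int in Python, so check it first.
                if isinstance(value, bool):
                    seen[field].add("boolean")
                elif isinstance(value, int):
                    seen[field].add("integer")
                elif isinstance(value, float):
                    seen[field].add("double")
                elif isinstance(value, str):
                    seen[field].add("text")
                else:
                    seen[field].add(type(value).__name__)
    return seen

for field, types in arrival_types("sample_data.json").items():
    flag = "  <-- conflict" if len(types) > 1 else ""
    print(f"{field}: {sorted(types)}{flag}")

# With the sample sketched earlier, trip_distance reports ['double', 'integer'] --
# the kind of mismatch that surfaces as a Must-Fix conflict in the schema review.
```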
7. Configure Ingestion - Add Fields (Optional)

Scroll to the bottom of the Schema Details page.
If your sample data didn't include all necessary fields, you can manually add them using the Add New Field section by specifying the Field path, New field name, Arrival format, and Data type.
Click + Add new field to add it to the schema.
Once the schema review is complete, click Proceed.
8. Configure Processing

You are now on the Processing step (Step 3). This section defines how data is validated, enriched, and transformed before storage.
Allow Additional Fields: Choose Yes or No. This decides whether records containing fields not defined in your schema are accepted (Yes) or rejected (No). The default is often No.
Data Denormalization: Configure joins with Master Datasets here if needed. Requires pre-existing Master Datasets.
Data Privacy: Define rules for masking or encrypting sensitive fields (PII). Click + Add Sensitive Field to configure.
Data Transformations: Apply custom transformations using JSONata (e.g., filtering, restructuring, calculations). Click + Add Transformation.
Derived Fields: Create new fields based on calculations or transformations applied to existing fields using JSONata. Click + Add Derived Field.
Data Deduplication: Enable this to prevent duplicate records based on a unique key. Toggle Enable Deduplication and select the unique key field from the dropdown (e.g., dedupKey); the sketch after this step illustrates the idea.
Configure these options as needed for your use case. For a basic setup, you might leave many of these at their defaults or disabled.
Click Proceed.
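Obsrv performs deduplication inside its processing pipeline; the stdlib-only sketch below merely illustrates the idea of keeping the first record seen for each value of the dedup key (the field name dedupKey follows the example above).

```python
def deduplicate(records, key="dedupKey"):
    """Keep only the first record seen for each value of the dedup key."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value in seen:
            continue  # duplicate: a record with this key was already accepted
        seen.add(value)
        unique.append(record)
    return unique

events = [
    {"dedupKey": "e-1", "reading": 10},
    {"dedupKey": "e-2", "reading": 12},
    {"dedupKey": "e-1", "reading": 10},  # duplicate of the first event
]
print(deduplicate(events))  # the third record is dropped
```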
9. Configure Storage

You are now on the Storage step (Step 4).
Configure Storage Type: Select the storage system(s) where your data will reside. Options often include:
Real-time Store (Druid): Optimized for fast aggregations and real-time analytics (checked in the example).
Data Lakehouse (Hudi): For cost-effective storage, batch analytics, and data science workloads.
Cache Store (Redis): For rapid lookups, often used with Master Datasets.
Check the box(es) for your desired storage type(s).
Configure Storage Keys: Specify key fields for indexing and optimization based on the selected storage types:
Primary Key: A unique identifier for each record (required for Lakehouse/Cache, useful for updates).
Timestamp Key: The primary time field used for time-based partitioning and querying (required for Druid). Select the appropriate date/time field from your schema (the default is Event Arrival Time); the example shows the user selecting it from the dropdown. A conceptual sketch of time-based partitioning follows this step.
Partition Key: Field(s) used for partitioning data in the Lakehouse.
Select the appropriate fields from the dropdowns based on your schema and chosen storage types.
Click Proceed.
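The timestamp key matters because the real-time store partitions and prunes data by time. The following sketch is a conceptual illustration of bucketing records by the day of their timestamp key; it is not how Druid actually stores segments, and the field names are illustrative.

```python
from collections import defaultdict
from datetime import datetime

def partition_by_day(records, timestamp_key="pickup_time"):
    """Group records into daily buckets based on the chosen timestamp key."""
    buckets = defaultdict(list)
    for record in records:
        # Normalize the trailing 'Z' so fromisoformat accepts it on older Python versions.
        ts = datetime.fromisoformat(record[timestamp_key].replace("Z", "+00:00"))
        buckets[ts.date().isoformat()].append(record)
    return buckets

records = [
    {"trip_id": "t-001", "pickup_time": "2024-11-01T08:12:00Z"},
    {"trip_id": "t-002", "pickup_time": "2024-11-02T09:05:00Z"},
]
for day, rows in partition_by_day(records).items():
    print(day, len(rows))
```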
10. Preview Configuration

You are now on the Preview step (Step 5).
This screen summarizes all the configurations you've made across the previous steps (Connector, Ingestion, Processing, Storage).
Expand each section (Connector, Ingestion, Processing, Storage) by clicking on it.
Carefully review the details: Dataset name, schema, processing rules (like Add New Fields, Deduplication), storage types, and keys.
Ensure everything matches your requirements. If you need to make changes, use the Back button or click the specific step number in the stepper at the top.
11. Save the Dataset

If you are satisfied with the configuration preview, click the Save Dataset button at the bottom right.
A confirmation dialog will appear asking "Are you sure you want to save the dataset?".
Click Agree.
12. View and Publish the Dataset

You will be redirected to the All Datasets page.
Your newly created dataset (e.g., demo-dataset) will appear in the list with the status Ready To Publish.
To make the dataset active and ready to receive data, you need to publish it.
Locate your dataset in the list and click the three-dot menu (...) on the right side.
Select Publish from the options menu.

A loading indicator will appear while the system processes the request.
Once published, the dataset status will change (e.g., to LIVE-Running or Active), and it will be ready for data ingestion based on the configured connectors or API endpoints.
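Once the dataset is live, events can be pushed to it over HTTP. The host, endpoint path, and payload envelope below are assumptions for illustration only; check the API documentation of your Obsrv deployment for the exact values. Only the requests usage itself is standard Python.

```python
import requests

# Assumption: the gateway host and the data-in path below are placeholders, not the
# confirmed Obsrv API. Verify both against your instance's API documentation.
OBSRV_HOST = "http://localhost:8080"            # hypothetical gateway address
DATASET_ID = "demo-dataset"                     # the Dataset ID created above
url = f"{OBSRV_HOST}/data/v1/in/{DATASET_ID}"   # hypothetical data-in endpoint

event = {
    "trip_id": "t-010",
    "pickup_time": "2024-11-03T07:30:00Z",
    "trip_distance": 2.4,
}

# The payload envelope ({"data": {"event": ...}}) is also an assumption for illustration.
response = requests.post(url, json={"data": {"event": event}}, timeout=10)
response.raise_for_status()
print(response.status_code, response.text)
```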
