Skip to main content

Epilot Datalake

Welcome to the documentation for our Data Lake feature, which serves as the centralized repository for real-time event streams of entity operations, snapshots of workflow executions and journey analytics data. This feature empowers users to access and analyze essential data generated by the 360 portal, including changes to entities such as orders, opportunities, contacts, accounts, products, interactions from user journeys and more.

Our Data Lake is seamlessly integrated with Clickhouse for data warehousing, enabling the users to leverage Business Intelligence (BI) tools and create insightful reports. This documentation will guide you through the key components of the Data Lake feature, including data schemas, usage, and credential management.

Feel free to contact our customer support or sales for help in enabling this datalake feature for your organization.

Data Schemas​

Our Data Lake features two distinct datasets:

1. Entity Operations Event Streams​

This dataset captures real-time event streams of entity operations.

The schema for entity operations is as follows:

{
"activity_id": "string", // ID for the individual entity operation/activity
"entity_id": "string", // ID of the entity being mutated
"org_id": "string", // Organization ID
"operation": "string", // Operation can be "createEntity," "updateEntity," or "deleteEntity"
"schema": "string", // Schema of the entity
"timestamp": "DateTime", // Timestamp of the entity mutation
"detail": "string", // Stringified JSON payload containing entity data
}

Fields of interest in this schema include:

Operation: Describes the type of entity operation, including creation, update, or deletion.

  • createEntity: This operation type is recorded when a new entity is created.
  • updateEntity: This operation type is recorded when an existing entity is updated.
  • deleteEntity: This operation type is recorded when an entity is deleted.

timestamp: Provides a timestamp for building time series reports.

activity_id and entity_id: These fields contain unique identifiers that can be used to track individual entity operations (activity_id) and identify the specific entity that was mutated (entity_id).

detail: The detail field contains a stringified JSON payload that includes the full entity data at the time of the operation. This payload can be parsed and used to access detailed information about the entity's state during the operation.

Additionally, we offer a simplified view called {org_id}_current_entities_final, which displays only the latest state of currently active entities in the organization, excluding deleted entities

2: Workflow Execution Snapshots​

In addition to entity operations, we ingest real-time updates from workflow executions, allowing users to access current snapshots of workflow data. This dataset comprises five tables:

  • {org_id}_workflow_executions_final This table contains all the attributes of workflow execution data.

  • {org_id}_workflow_phases_final Here, you can find attributes of the various phases involved in a workflow execution.

  • {org_id}_workflow_tasks_final This table stores attributes related to individual steps or tasks within a workflow execution.

  • {org_id}_workflow_cancellations_final This table records data regarding the reasons for workflow execution cancellations.

  • {org_id}_workflow_contexts_final Contains information about related entities associated with a workflow execution.

To know more about workflow execution details, please refer here

3. Journey Analytics​

This dataset showcases the user sessions created for each journey and the interactions that occur within those sessions.

The schema for journey sessions is as follows:

{
"id": "string", // Unique identifier for the session (UUID)
"org_id": "string", // Unique identifier for the organization associated with the session (UUID)
"session_id": "string", //
"journey_id": "string", // Unique identifier for the user journey to which the session belongs (UUID)
"type": "datetime", // Timestamp indicating when the session started
"details": "string", // Additional details about the session, in JSON format
"created_at": "datetime" // Timestamp indicating when the session ended
}

Fields of interest in this schema include:

details: This field possibly contains basic user's browser session details

  • deviceType: The type of device used by the user (e.g., mobile, desktop).
  • osType: The operating system of the device (e.g., iOS, Android, Windows).
  • browserTypeAndVersion: The browser name and version used by the user (e.g., Chrome 92, Firefox 89).
  • screenResolution: The screen resolution of the user's device (e.g., 1920x1080).
  • viewportSize: The size of the viewport in the browser (i.e., the visible area of the web page).
  • colorDepth: The color depth of the user's display (e.g., 24).
  • languagePreference: The preferred language set in the user's browser (e.g., en-US, fr-FR).
  • ip: The IP address of the user's device when the event was recorded.

The schema for journey events is as follows:

{
"id": "string", // Unique identifier for the event (UUID)
"org_id": "string", // Unique identifier for the organization associated with the event (UUID)
"session_id": "string", // Unique identifier for the session in which the event occurred (UUID)
"journey_id": "string", // Unique identifier for the user journey to which the event belongs (UUID)
"type": "string", // Type of event (e.g., "page_navigation", "journey_submit", "journey_exit")
"details": "string", // Additional details about the event, typically in JSON format
"created_at": "datetime" // Timestamp indicating when the event was recorded
}

Fields of interest in this schema include:

type: This field contains the type of user interaction or event that occurs on the session

  • journey_load_time: The time it takes for the journey to load, typically measured from journey API response time.
  • page_navigation: The event triggered when a user navigates from one page to another within the journey..
  • journey_reset: The event triggered when a user resets the journey, potentially clearing previous inputs or starting the journey over.
  • journey_submit: The event triggered when a user completes and submits the journey, often marking the end of the user interaction.
  • journey_exit: The event triggered when a user exits the journey, possibly by closing the journey tab in the browser.

details: This field contains additional details about the type of event that occurs. Basically from and to pages on page navigation event, the step on which journey reset occurs, the time of journey submission and exit, time taken for journeys to load etc.

  • {org_id}_journey_sessions_final This table stores the above attributes related to journey sessions.

  • {org_id}_journey_events_final This table stores the above attributes related to journey events.

Journey analytics db schema

Setting up Datalake​

Generate datalake credentials​

  • Navigate to the "Settings" menu on the top right in your organization's portal. Then locate the "Data Lake" section from the drop down menu. Datalake menu

  • The Data Lake page will display active credentials for your Data Lake. Here, you can revoke existing credentials or create new ones as needed. Datalake page

  • To create new credentials, select the Generate Credentials button.

  • A new set of credentials will be generated. Note that the password will be visible only once at the time of creation for security reasons, so it's crucial to save it securely. Datalake Credentials

  • Utilize the generated username, host, port, database and password details to connect to the Data Lake from any BI tool or other data sources.

Querying the Data​

Once you've set up the necessary credentials, you can connect to the Clickhouse data warehouse and query the data. There are two primary ways to interact with the data:

Directly Quering via SQL​

You can write SQL queries to retrieve the data directly from Clickhouse using SQL functions. Below are examples illustrating how to query the data for insights.

Example 1: Reporting Opportunities Created Over Time Suppose you need to create a report showing opportunities created over time, grouped by journey source, with a time granularity of months. You can use SQL to accomplish this task:

select
count(*) as count,
JSONExtractString(detail, 'payload', 'source', 'title') as journey_source,
toStartOfYear (timestamp) as year,
toStartOfMonth(timestamp) as month
from
entity_operations
where
and schema = 'opportunity'
and operation = 'createEntity'
group by
journey_source,
year,
month;

This SQL query retrieves data about opportunities created over time, extracts relevant information from the JSON payload, and aggregates it by year and month, providing insights into opportunities created during different periods.

Datalake page

Example 2: Reporting Journey Sessions Created Over Time Suppose you need to create a report showing journey sessions created over time for a specific journey. You can use SQL to accomplish this task:

select
journey_id,
start_time,
details
from
{org_id}_journey_sessions_final
where
journey_id = {your_journey_id}
and
start_time > '2024-08-01 00:00:00'

This SQL query retrieves the journey sessions created over time for a specific journey

Datalake page

You can use any SQL client to connect to the Clickhouse Data Warehouse (DWH) using the credentials provided. For more detailed information, please refer to this link.

Connecting to BI Tools​

Alternatively, you can connect to Business Intelligence (BI) tools of your choice to load the data from Clickhouse and build reports and dashboards. Many BI tools support direct integration with Clickhouse, allowing you to create visually appealing and interactive reports based on your data or alternatively you can make use of official Clickhouse ODBC driver to establish the connection setup with Clickhouse.

Example - we will walk you through an example of connecting to Power BI to create a demo BI report for Raven Energy GmbH, an energy utility company with two distinct journeys and workflow processes for handling Wallbox and Energieausweis sales digitally. Each use case involves two products. We will leverage our Data Lake to set up a BI report for this scenario.

To get started with the various ways to establish a connection between Clickhouse and PowerBI, please refer to [this link] (https://clickhouse.com/docs/en/integrations/powerbi).

Steps to Create a Power BI Report:

  • Open Power BI: Once you have installed all the necessary dependencies to set up an ODBC connection as specified in the link above, you can add a new data source by selecting the ODBC option from the list of supported connectors. Power BI ODBC

  • Enter Connection Details: Enter the required connection details, including the ODBC connection, username, password, and the specific database. Then, click "OK" to establish the connection.

Example ODBC Connection String - Driver={ClickHouse ODBC Driver (ANSI)};Server={replace-it-with-host};Port=8443;Database=datawarehouse;

Power BI Connection

  • Select Data: Once connected, Power BI will display the available schemas and tables. You can easily locate the "entity_operations" table within the "datawarehouse" schema, along with the "{org_id}_current_entities_final" simplified view for easier data access.

Power BI Entity Operations

  • Load or Transform Data: You have two options for data handling. You can click "Load Data" to load all the raw data directly into Power BI. Alternatively, you can click "Transform" to access Power BI's query editor, where you can define your custom transformation logic to preprocess and shape the data to your specific requirements.

Power BI Current Entities

  • Data Modeling: After loading or transforming the data, you can begin setting up data models. Create relationships between different tables to enable seamless data exploration and reporting. For example, you can establish relationships between the entity_operations data and other relevant tables in your dataset.

Power BI Data Model

  • Report Creation: With the data modeling in place, you can start creating visually appealing reports within Power BI. Utilize various chart types, tables, and visuals to convey insights effectively.

Power BI Report Power BI Report

Interactive PowerBI report using Epilot Datalake​

Our Data Lake feature provides a powerful way to capture and analyze real-time entity operations, enabling you to gain valuable insights from your data. By understanding the data schema and following the steps to set up credentials, you can leverage this feature to build reports, perform analytics, and make data-driven decisions for your organization.

Interactive PowerBI report for Journey Analytics using Epilot Datalake​

Our Journey Analytics Data lake feature offers a comprehensive solution to track and analyze user journeys in real-time, providing you with deep insights into how users interact with your journeys. By capturing detailed event data, such as page navigations, form submissions, and user exits, you can gain a clear understanding of user behavior and identify opportunities for optimization.

You can also refer to the following link to establish connection to different BI tools.

If you have any further questions or need assistance with data queries, please reach out to our team.