
Building a Standard Claims Data Model With the Cube Semantic Layer and Databricks

Effective claims data modeling is essential for empowering the claims team to make informed, high-quality decisions. By implementing a robust data model through a semantic layer, like Cube, insurance companies can create a unified, consistent foundation for their data. This model bridges the gap between raw data and business intelligence tools, such as Power BI or Tableau, enabling the Chief Claims Officer to better forecast potential losses, optimize cash flow management, and make strategic decisions. With features like real-time query performance, integration with Excel, and secure authentication, the right data model ensures the claims team has reliable, actionable insights, driving improved decision-making and business outcomes.

In this article, I'm going to take you through how to model the bordereaux claims data we've ingested into our Databricks Delta table with the Cube semantic layer. The advantage here is that we're providing standardized, complete information in our Information Layer for the knowledge work in the claims division.

Towards a Standard Claims Data Model

The Cube semantic layer enhances knowledge workflow efficiency by offering a unified, governed data access platform that ensures consistent, trustworthy models and metrics across the organization. One advantage of Cube is that it integrates seamlessly with tools like Power BI and Excel, enabling faster, more reliable query performance and reducing common errors from SQL translations. With features such as the Cube DAX API and enterprise-friendly authentication (including SSO), Cube not only simplifies data consumption but also accelerates decision-making by providing a single source of truth. This makes it a compelling choice for improving data accessibility and supporting impactful business decisions in a dynamic environment.

Cube, like OLAP systems, organizes data into a multidimensional model that supports efficient aggregation and analysis across various levels of detail; a minimal sketch follows the list of key components below.

Key components:

  • Measures
    Quantitative values used for analysis, such as:
    • Totals
    • Averages
    • Counts
    • Ratios
  • Dimensions
    Categories used to group, filter, and explore measures, such as:
    • Time (e.g., year, quarter, month)
    • Geography (e.g., region, city)
    • Product or business unit
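
To make these components concrete, here is a minimal, hypothetical Cube definition in YAML with one dimension of each kind and two measures. The table and column names are illustrative only, not part of the claims model we build later:

cubes:
  - name: sales
    sql: SELECT * FROM analytics.sales  # illustrative table

    dimensions:
      - name: region          # geography dimension for grouping and filtering
        sql: "{CUBE}.region"
        type: string

      - name: order_date      # time dimension enabling year/quarter/month rollups
        sql: "{CUBE}.order_date"
        type: time

    measures:
      - name: order_count     # a count measure
        type: count

      - name: total_revenue   # a total (sum) measure
        sql: "{CUBE}.amount"
        type: sum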

This structure allows users to quickly navigate complex datasets using familiar business hierarchies, enabling deeper and faster insights. We want the Information Layer to provide complete, quick-to-access information up to the Knowledge Work Layer. Let's quickly review the Knowledge Work Architecture concepts before we move on to connecting Cube to our Databricks data lakehouse.

The Information Layer

In the diagram below we see the knowledge work architecture abstractions of the knowledge work layer, the information layer, and the infrastructure layer.

Knowledge Work Layer

  • Business processes using information
  • People using expertise with information
  • Convert information into revenue

Information Layer

  • Data model definitions
  • Semantic layer
  • APIs: SQL, REST, GraphQL

Infrastructure Layer

  • Cloud infrastructure
  • Databases, ETL pipelines, servers
  • Data governance, security

At Patterson Consulting we frame every data platform project and tool within this architecture to keep outcomes aligned with impact on the line of business through the knowledge work layer.

Given that the line-of-business tools (Excel, Power BI, etc.) in the knowledge work layer operate on information, not raw data, we need to build a logical materialized view of the claims information needed (our claims data model) in the information layer.

In this example we are building our materialized view of the bordereaux claims data in the Cube semantic layer. This materialized view of the information needed at the knowledge work layer will be served from a low-latency cache.
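
In Cube, that low-latency cache is provided by pre-aggregations, which materialize rollups of the model into Cube Store. Here is a minimal sketch; the measure and column names anticipate the claims cube we define later in this article:

cubes:
  - name: claims_summary_current
    sql: SELECT * FROM claims_workspace.bronze.bordereaux_data

    # Materialized monthly rollup served from Cube's cache (Cube Store)
    pre_aggregations:
      - name: claims_by_month
        measures:
          - CUBE.count
        time_dimension: CUBE.date_reported
        granularity: month
        refresh_key:
          every: 1 hour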

Now that we've gotten the background framing out of the way, let's connect the Cube platform to our Databricks data lakehouse.

Connecting Cube to Databricks

Head over to Cube.dev and sign up for a free account to start your first Cube deployment. From there, click on "Create Deployment" on the main account portal and you should see a screen similar to the one below for creating a new deployment.

[Image: creating a new deployment in the Cube Cloud portal]

You will need to provide some configuration information from your Databricks workspace, such as:

  • The Databricks workspace URL (server hostname)
  • The HTTP path of the SQL warehouse you want to connect to
  • A personal access token for authentication
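
If you are self-hosting Cube Core instead of using Cube Cloud, the same connection details are supplied as environment variables. Here is a sketch with placeholder values; the variable names follow Cube's Databricks (JDBC) driver documentation, so verify them against the current docs:

# .env for a self-hosted Cube instance (placeholder values throughout)
CUBEJS_DB_TYPE=databricks-jdbc
# The JDBC URL embeds the workspace host and the SQL warehouse HTTP path
CUBEJS_DB_DATABRICKS_URL=jdbc:databricks://<workspace-host>:443;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<warehouse-id>
# Personal access token generated in your Databricks workspace
CUBEJS_DB_DATABRICKS_TOKEN=<personal-access-token>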

Once you've established connectivity with your Databricks workspace's serverless SQL warehouse, you can start building out your data model. You can use the visual guide to build a data model (as shown below), but the easier way will be to skip this part and use the data model definition we include as part of this series once your deployment is completely set up.

[Image: the visual data model builder]

Once you are beyond the visual data modeling UX screen and your deployment is ready, you should see a screen similar to the one below.

[Image: the deployment overview screen]

You are now ready to start building out data models over your Databricks Delta Lake tables.

Our Claims Data Model

To define a standardized claims model, I want to start with a list of the areas we want to cover in this data model:

  • Claims Summary at Current
  • KPIs
  • Claims Lag
  • FNOL Reported by Month
  • Aged Pending
  • No Reserve
  • Reserve Change
  • Large Loss
  • Open to Reported & Case Reserve by Age
  • CAT Claim Summary
  • All Claim Summary

If we start with just one of the above parts of our standardized claims data model, we want to build out the "Claims Summary at Current" part. This part of the data model is defined as "a comprehensive snapshot of all active claims at any specific point in time, showing status, reserves, and key metrics across the entire claims portfolio." The code listing below (also provided in our GitHub repo) shows how we implement this part of the claims model in the semantic layer on Cube.dev.

  
cubes:
  - name: claims_summary_current
    sql: SELECT * FROM claims_workspace.bronze.bordereaux_data
    title: Claims Summary at Current
    description: A comprehensive snapshot of all active claims at any specific point in time, showing status, reserves, and key metrics across the entire claims portfolio.
    joins: []
    dimensions:
    - name: policy_number
      sql: "{CUBE}.policy_number"
      type: string
      description: The policy number of the claims record.

    - name: months_insured
      sql: "{CUBE}.months_insured"
      type: number
      description: The number of months the customer has been insured.

    - name: has_claims
      sql: "{CUBE}.has_claims"
      type: boolean
      description: Whether the policy has any associated claims.

    - name: items_insured
      sql: "{CUBE}.items_insured"
      type: number
      description: Number of items insured under the policy.

    - name: claim_reference
      sql: "{CUBE}.claim_reference"
      type: string
      description: A unique reference for the claim.

    - name: insured_name
      sql: "{CUBE}.insured_name"
      type: string
      description: Name of the insured party.

    - name: policy_start_date
      sql: "{CUBE}.policy_start_date"
      type: time
      description: Start date of the policy coverage.

    - name: date_of_loss
      sql: "{CUBE}.date_of_loss"
      type: time
      description: The date on which the loss occurred.

    - name: date_reported
      sql: "{CUBE}.date_reported"
      type: time
      description: The date the loss was reported.

    - name: claim_status
      sql: "{CUBE}.claim_status"
      type: string
      description: Status of the claim (e.g., Open, Closed, Reopened).

    - name: loss_type
      sql: "{CUBE}.loss_type"
      type: string
      description: Type of loss associated with the claim.

    - name: paid_amount
      sql: "{CUBE}.paid_amount"
      type: number
      description: Amount paid on the claim to date.

    - name: reserve_amount
      sql: "{CUBE}.reserve_amount"
      type: number
      description: Reserve amount set aside for future payments.

    - name: total_incurred
      sql: "{CUBE}.total_incurred"
      type: number
      description: Total cost incurred for the claim (paid + reserve).

    - name: claim_propensity
      sql: "{CUBE}.claim_propensity"
      type: number
      description: Model-driven score estimating the likelihood of a claim.

    measures:
      - name: count
        type: count

      # Alias of count; gives BI tools a business-friendly measure name
      - name: total_claims
        type: count

      - name: open_claims
        type: count
        filters:
          - sql: "{CUBE}.claim_status != 'Closed'"

      - name: total_reserves
        sql: "{CUBE}.reserve_amount"
        type: sum
        filters:
          - sql: "{CUBE}.claim_status != 'Closed'"

      - name: total_paid
        sql: "{CUBE}.paid_amount"
        type: sum

      - name: total_incurred_all
        sql: "{CUBE}.total_incurred"
        type: sum

      - name: average_claim_size
        # NULLIF guards against division by zero when no claims match
        sql: "{CUBE.total_incurred_all} / NULLIF({CUBE.total_claims}, 0)"
        type: number
  

Here are the steps to update the main branch of your claims data model in your Cube deployment:

  • Click on "Data Model" on the left side toolbar
  • Click on "Dev Model" at the top, and create a new branch to work from
  • Add the file contents above to a new file named "claims_summary.yml" in the data model, via the file toolbar area on the left side
  • Save the changes, wait a few seconds, and then merge the changes back into the "master" branch

Most BI tools want to work against a Cube "view", so we've provided a simple view of the claims_summary_current data model as well:

  

views:
  - name: claims_summary

    cubes:
      - join_path: claims_summary_current
        includes:
          - total_claims
          - open_claims
          - total_reserves
          - total_paid
          - total_incurred_all
          - average_claim_size
  

Add the above YAML to the "views" subfolder in your data model and save it.
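
With the view in place, any SQL-capable BI tool can query it through Cube's SQL API. Here is a sketch of what that looks like, assuming Cube's Postgres-compatible SQL API, where measures are selected via the MEASURE() function:

-- Query the claims_summary view over Cube's SQL API
SELECT
  MEASURE(total_claims)   AS total_claims,
  MEASURE(total_reserves) AS total_reserves
FROM claims_summary;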

Once you've merged your updates to your Cube.dev claims data model, you are ready to head over to the Cube.dev "Playground" to test out your data model.

Demo in Cube Playground

Let's run a quick test with our claims data model and generate a report on the companies that reported a loss in the past week. Click on "Playground" in the left-hand toolbar of the main Cube deployment dashboard. Once in the "Playground" you can visually compose test queries against your newly deployed data model. Select the dimension "insured_name", a measure such as "count", and the time dimension "date_of_loss" with "This week" as the date range. It should look like the Cube.dev UX image below.

[Image: the Playground query in the Cube.dev UX]

If your query runs and returns results similar to what you see above, congratulations: you now have an operational claims data model for your property insurance data lakehouse.
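
For reference, the Playground query above corresponds roughly to this SQL API query against the cube (a sketch; the week-filter expression is an assumption and may vary with your client's SQL dialect):

-- Companies with a loss this week: group the count measure by insured_name
SELECT
  insured_name,
  MEASURE(count) AS claims
FROM claims_summary_current
WHERE date_of_loss >= date_trunc('week', now())
GROUP BY insured_name;

Let's move on to thinking about how to integrate this semantic layer into the rest of our knowledge work tools.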

Integration with Knowledge Work Tools

Cube provides out-of-the-box integrations for the following:

[Image: Cube's out-of-the-box integrations]

This allows our data model to efficiently integrate with many of today's core knowledge work tools as shown below.

[Image: integration with core knowledge work tools]

In the next article, we'll dig into how to use our data models directly in a Google Sheet.

Next in Series

Connecting to Your Claims Data Model from GSheets with Cube

Quickly using your claims data models in a GSheets spreadsheet with the Cube connector.

Read next article in series