Author: Josh Patterson
Date: March 21st, 2023
DBT (Data Build Tool) is an open-source command-line tool that gives data analysts and data engineers a way to manage a collection of data transformations written in SQL or Python for analytics and data science. It provides a framework for creating and executing complex SQL queries and transforms, and it helps manage the entire data transformation process.
More simply, DBT is focused on transforming data — the ‘T’ in ELT.
DuckDB is an embedded database (think “SQLite”), but designed for OLAP-style analytics instead of OLTP. DuckDB is exceptionally fast and lets you work directly with data stored in files such as CSV and Parquet, without forcing you to run any load operations first.
Together with the dbt-duckdb project, they form a “Modern Data Stack In A Box” — or a simple and powerful data lakehouse (sans Java and Scala).
This is Part 1 in a 3-part series on using DBT and DuckDB.
In Part 1 of 3 of this tutorial you will:
Further, you’ll learn:
We’ll build our DBT “Hello World” project with synthetic patient data.
In our “hello world” scenario, we want to build a summarization of how often a set of patients get their yearly physical checkup. The hypothetical company here is an insurance company (“InsurCo”) that is evaluating a set of customers and building machine learning forecasts to predict how much each customer is likely to cost in healthcare spend while covered under a policy.
The Github repository we’ll use for this project is located at:
https://github.com/pattersonconsulting/dbt_hello_world_duckdb
The data engineering team has requested that the analysts pull the data from the (cloud) data warehouse and provide a consistent per-customer calculation for use in their modeling experiments as a feature.
In this exercise you’ll take synthetic patient checkup history data (which insurance companies are able to access) and use DBT to build data models and produce the patient checkup summary data model (i.e., a cached table in the database). We’ll use DuckDB for our database as it’s quick and easy to use locally and keeps us from having to provision cloud infrastructure for a simple Hello World example.
The Synthetic Patient Doctor Visit Data is shown below:
Date,PatientID
2016-09-05,pid-001
2015-08-17,pid-002
2013-07-18,pid-003
2015-09-22,pid-003
2015-10-13,pid-003
2019-01-08,pid-003
2021-07-24,pid-003
So let’s dig into building a local data stack.
Our tasks for this exercise are listed below:
Let’s start off by building a new environment in Conda for our project.
To create an environment in Conda, you can follow these steps:
Open your terminal (or Anaconda Prompt on Windows).
Type the following command to create a new environment named my_env (you can replace my_env with your preferred name):
conda create --name my_env
Conda will ask you to confirm the installation of any additional packages that are required to create the environment. Type y and press Enter to proceed.
Once the environment is created, activate it by typing:
conda activate my_env
Your prompt should now show the name of the environment in parentheses, indicating that it’s active.
You can now install any packages you need for your project using conda install or pip install.
When you’re done working in the environment, you can deactivate it by typing:
conda deactivate
Now that we have an environment to work in, let’s get started with using DuckDB.
DuckDB sits in an interesting spot as it offers an excellent way to solve complex problems using SQL while keeping things simple. DuckDB is a columnar, in-memory SQL database designed for analytical workloads. Its key features are:
DuckDB is compelling because you only have to point it at the data file you’d like to work with, giving you a shortcut to running SQL against CSV, Parquet, and other data files. Ease of use and time to insight are two key reasons why DuckDB has become a key tool for experimenting in the data “lab”.
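For instance, a minimal sketch of that “point at the file” workflow might look like the following (the file names here are hypothetical placeholders, not files from this project):

-- Query a CSV file in place; DuckDB infers column names and types on the fly.
SELECT * FROM './some_export.csv';

-- The same pattern works for Parquet, including glob patterns over many files.
SELECT count(*) FROM './warehouse_dumps/*.parquet';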
In a conversation with me, JD Long commented that he prefers DuckDB “when the data fits on a single machine so that he doesn’t have to stand up a cluster for Spark”.
As stated by the DuckDB Website:
DuckDB is an embedded database, similar to SQLite, but designed for OLAP-style analytics. It is crazy fast and allows you to read and write data stored in CSV and Parquet files directly, without requiring you to load them into the database first.
A recent article at dlthub.com, “As DuckDB crosses 1M downloads / month, what do its users do?”, had some interesting notes on how DuckDB is being used. Among the use cases, they were seeing DuckDB being used as a processing engine for local data workflows.
They also called out how many users (“Normies”) enjoyed the simplicity of the DuckDB user experience. I found both of those usage trends worthy of note.
With those points noted, let’s move on and install DBT and the dbt-duckdb connector.
dbt is the best way to manage a collection of data transformations written in SQL or Python for analytics and data science. dbt-duckdb is the project that ties DuckDB and dbt together, allowing you to create a Modern Data Stack In A Box or a simple and powerful data lakehouse: no Java or Scala required.
You can also read further about the connector on dbt’s website as well:
https://docs.getdbt.com/reference/warehouse-setups/duckdb-setup
This project is hosted on PyPI so we can use pip to install the connector and dependencies:
pip3 install dbt-duckdb
Note: Ideally we’d use conda to install here, but currently only pip-based installs are supported.
Once that is installed, we’ll need the DuckDB CLI to work directly with DuckDB from the local command line to load our dataset.
The DuckDB CLI (Command Line Interface) is a single, dependency-free executable.
Download it from:
https://duckdb.org/docs/api/cli.html
Make sure to get the version of the CLI that matches the version of DuckDB you installed locally via pip.
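If you’re not sure the versions line up, pip show duckdb will report the Python package version, and once the CLI is running (next step) you can ask the engine directly from the DuckDB prompt, roughly like this:

-- Report the DuckDB library version this CLI was built with.
PRAGMA version;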
Now let’s load some data for analysis.
Let’s crank up the local DuckDB CLI with the command:
./duckdb dbt_patient_visits.duckdb
Note: if you don’t give it a database file name at start, DuckDB runs as an in-memory database and will not let you save that database to disk later, so name it up front.
Once we’re inside the CLI, the first command you want to know about is how to quit:
.q [enter]
Typing .q or .quit will quit the CLI and get you back to your shell.
Now, back inside the CLI (just relaunch it with the same command), let’s see if we can access our local patient data in CSV form with the command:
select * from './patient_checkup_logs.csv';
This should show:
┌────────────┬────────────┐
│ Date │ Patient ID │
│ date │ varchar │
├────────────┼────────────┤
│ 2016-09-05 │ pid-001 │
│ 2015-08-17 │ pid-002 │
│ 2013-07-18 │ pid-003 │
│ 2015-09-22 │ pid-003 │
│ 2015-10-13 │ pid-003 │
│ · │ · │
│ · │ · │
│ · │ · │
│ 2019-12-19 │ pid-1336 │
│ 2020-01-14 │ pid-1336 │
│ 2020-09-23 │ pid-1336 │
├────────────┴────────────┤
│ 6808 rows (40 shown) │
└─────────────────────────┘
The interesting part about the command we just executed is that we used a SQL command to query a raw file as we would a table. DuckDB is quite flexible like that.
Let’s quickly load the checkup visit CSV file into DuckDB and let DuckDB automatically infer the schema with the read_csv_auto() command:
CREATE TABLE patient_visits AS SELECT * FROM read_csv_auto('./patient_checkup_logs.csv');
select * from patient_visits;
┌────────────┬────────────┐
│ Date │ PatientID │
│ date │ varchar │
├────────────┼────────────┤
│ 2016-09-05 │ pid-001 │
│ 2015-08-17 │ pid-002 │
│ 2013-07-18 │ pid-003 │
│ 2015-09-22 │ pid-003 │
│ 2015-10-13 │ pid-003 │
│ · │ · │
│ · │ · │
│ · │ · │
│ 2019-12-11 │ pid-1338 │
│ 2020-02-26 │ pid-1338 │
│ 2020-10-02 │ pid-1338 │
├────────────┴────────────┤
│ 6808 rows (40 shown) │
└─────────────────────────┘
We now have a table loaded into our DuckDB database, as we can see when we type the following describe command:
describe table patient_visits;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ Date │ DATE │ YES │ │ │ │
│ PatientID │ VARCHAR │ YES │ │ │ │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
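With the raw visits table in place, it’s worth running a quick ad hoc query as a sanity check. The sketch below is not the DBT model we’ll build in Part 2, just a rough preview of the kind of per-patient, per-year checkup count our summary will be based on:

-- Rough preview: count checkups per patient per calendar year.
SELECT
    PatientID,
    date_part('year', "Date") AS visit_year,
    count(*) AS checkup_count
FROM patient_visits
GROUP BY PatientID, date_part('year', "Date")
ORDER BY PatientID, visit_year;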
Lastly, let’s configure our DBT profile so DBT can connect to DuckDB for data processing.
To configure DBT to connect to a database, you need to create a configuration file called profiles.yml. This file should contain the necessary connection information for each database you want to connect to.
On OSX, for example, the DBT profiles.yml file lives at:
~/.dbt/profiles.yml
Let’s now edit our profiles.yml file. In the code listing below, we can see some sample contents of a typical profiles.yml file:
...
dbt_duckdb_patient_visits:
  outputs:
    dev:
      type: duckdb
      path: /Users/josh/Documents/PattersonConsulting/workspaces/dbt_demos/dbt_hello_world/patient_db.duckdb
  target: dev
...
The connection we’ll use for this demo is dbt_duckdb_patient_visits, and we’ll reference that in our dbt project in a moment. Let’s now create a new DBT project.
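Note: once we have a DBT project that references the dbt_duckdb_patient_visits profile (we’ll create the project in Part 2), running dbt debug from the project directory is a handy way to confirm that profiles.yml parses and that DBT can actually open the DuckDB database file.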
Now that we have our raw data loaded into DuckDB, we can start building our DBT pipeline in part 2 of 3 in this blog series.
In this blog post we showed you how to set up a local data stack environment based on DBT and DuckDB. For more information on DBT, check out their documentation. For more information on cloud infrastructure, check out the rest of the articles in our blog.
To continue in this series, check out Part 2, where we start building our DBT pipeline.
We also offer private workshops for companies on topics such as creating and running DBT pipelines (on multiple platforms such as Snowflake and Fivetran). Please feel free to reach out if you’d like to discuss attending one of our workshops.