
dbt

open source tool for analytics/feature engineering

@oliver


Disclosure

  • I don't receive any form of compensation from Fishtown Analytics (the company behind dbt).
  • It might seem like I'm on their sales team or DevRel team. I'm not!
  • I just think it's an amazing tool that will give our analytics teams superpowers.


What superpower do you want? (DA edition)

  1. Flying: deliver analysis fast
  2. X-ray vision: generated data dictionary gives quick insight into your tables
  3. Extra-sensory perception (ESP): monitoring and notifications with data-driven tests
  4. Hive mind: gain context quickly from what your teammates have already done


Overview

  • What FinServ was doing before
  • What dbt does
  • 🤯 features
  • How we're using dbt


What FinServ was doing before

  • PayLater: mostly Domo; R1R2 used a DIY pipeline tool (because we didn't know about dbt)
  • Insurance: BQ scheduler
  • Payment: Databricks/Redshift


Before: problems

  • Domo
    • easy to do things in an unsustainable way – single layer with incredible complexity or complex dataflows that are difficult to trace
    • discoverability of teammates' work was next to impossible – a massive problem when people leave the org
    • scheduled jobs are difficult to find; easy for costs to explode
    • no version control – we can't know what was running previously, even earlier the same day
    • hard to identify PII data (no easy way to search column names)


Before: problems

  • DIY pipeline tool
    • extremely limited feature set – anything else we needed had to be built ourselves
    • queries run in series – slooow for the ~50 queries per run in the feature pipeline


Before: problems

  • BQ Scheduler
    • Dependencies between queries can't be properly handled
    • No easy way to have version control and CI/CD
  • Databricks/Redshift
    • Slower to develop than BQ
    • More that needs to be tuned: number of nodes, indexes
    • Different tech stack than the rest of FinServ


What dbt does

  • dbt's tagline: Analytics engineering tools designed for analysts
  • In essence:
    • pip install dbt
    • set up some configuration (sketch below)
    • write your queries in .sql files
    • dbt run
    • ...
    • profit!
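
The "configuration" step is mostly one file. A minimal sketch of a BQ profile (profiles.yml; the project and dataset names are illustrative, not our real ones):

finserv:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: my_dataset
      threads: 4

The threads setting here is what drives the multi-threaded runs described later.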


What dbt does

  • dbt will take all of your .sql files and send them to the BQ API to be run
  • Along the way, there are many useful things that dbt can do
  • There's also a paid cloud version that can handle scheduling, monitoring, etc., which would be an Airflow replacement
  • But we're just using the CLI to avoid giving another third party access to all of our data
  • Besides BQ, dbt also fully supports Snowflake, Redshift and PostgreSQL


🤯 automagically inferring the DAG

Jinja templating is used, so it's recommended to refer to tables using ref:

SELECT
  *
FROM
  {{ ref('table_name') }}
  • If this is done consistently, dbt can infer the full DAG (small example below)
  • You don't need to work out the dependencies between queries – which query needs to run before others
  • The DAG file for Airflow will be much simpler.
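
A small example of the inference (model names hypothetical): when one model refs another, dbt knows which must run first.

-- models/daily_txn.sql (reads a raw table directly)
SELECT
  *
FROM
  `my-project.raw.transactions`

-- models/txn_summary.sql (this ref makes dbt run daily_txn first)
SELECT
  user_id,
  COUNT(*) AS txn_count
FROM
  {{ ref('daily_txn') }}
GROUP BY
  user_id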


🤯 running parts of DAGs

Suppose you're working on an update of table_x.sql:

  • Of course, you can just run that: dbt run --model table_x
  • Need all upstream dependencies to be refreshed? dbt run --model +table_x
  • Want to see the effect on downstream dependencies? dbt run --model table_x+
  • Or run it all at once: dbt run --model +table_x+


🤯 multi-threaded runs

  • Since dbt infers the DAG, it knows what queries can run in parallel
  • BQ scales horizontally exceptionally well, and each query is just an HTTP call from our worker pod
  • We're using 4 threads, but more should be fine


🤯 incremental materialisations

{{
  config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    partitions=get_list_of_day("\'" + var('ds') + "\'", -4),
    partition_by={"field": "partition_date", "data_type": "date"},
  )
}}
  • Each .sql file can have configuration at the top as a Jinja template
  • A query can be materialized as incremental, table or ephemeral
  • With incremental using the insert_overwrite strategy, dbt will drop and rewrite only the indicated partition(s) rather than rebuilding the whole table
  • What if there's a query with a join that we want to be incremental?


Missed join on incremental materialisation


🤯 Kevin's trick

-- Join over a wider (5-day) window so that rows whose join partner
-- arrived slightly earlier or later are not missed
WITH t_join AS (
  SELECT *
  FROM
    {{ ref('table_a') }} JOIN {{ ref('table_b') }} USING(user_id)
  {% if is_incremental() %}
    WHERE
    a_date BETWEEN DATE_SUB(DATE('{{ var("ds") }}'), INTERVAL 5 DAY)
           AND DATE('{{ var("ds") }}')
    AND b_date BETWEEN DATE_SUB(DATE('{{ var("ds") }}'), INTERVAL 5 DAY)
               AND DATE('{{ var("ds") }}')
  {% endif %}
)

-- Then keep only the narrower (3-day) window we actually want to rewrite
SELECT * FROM t_join
WHERE
  a_date BETWEEN DATE_SUB(DATE('{{ var("ds") }}'), INTERVAL 3 DAY)
         AND DATE('{{ var("ds") }}')


🤯 Fast backfills

  • Using the is_incremental() macro/flag, we can force it to return false with: dbt run --model table_x --full-refresh
  • Instead of doing backfills with Airflow daily jobs that can take several hours or more, most backfills can be finished in a single query
  • dbt will first delete the existing table and then run the full refresh query


🤯 testing data

  • A great article introducing Great Expectations talked about a specific type of technical debt: pipeline debt
  • The central idea is that unit tests and integration tests (at pipeline deploy time) are very limited in how much they can help with pipeline debt
  • What is needed are pipeline tests that test the data every time the pipelines run
  • dbt has:
    • "schema" tests that can check things like whether columns contain nulls or whether a column is unique, to catch changes in the characteristics of our source data (sketch below)
    • "data" tests that can be used for business logic
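
A minimal schema-test sketch (table and column names hypothetical); unique and not_null tests ship with dbt:

models:
  - name: table_x
    columns:
      - name: user_id
        tests:
          - unique
          - not_null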


🤯 testing data

  • Insurance has set up dbt test with Airflow
  • If the lower bound of insurance issuance is breached, a Slack message is sent to a channel that includes them and the insurance PMs (sketch of such a test below)
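
A data test is just a query that should return zero rows; any returned row is a failure. A sketch of what the issuance check might look like (file name, table name and threshold are all illustrative):

-- tests/assert_issuance_above_lower_bound.sql
SELECT issuance_date, COUNT(*) AS n_policies
FROM {{ ref('insurance_issuance') }}
WHERE issuance_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY issuance_date
HAVING COUNT(*) < 100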


🤯 developer workflow

  • We've overridden the standard macros that build table names from ref('table_x') (sketch below)
  • When I'm developing on my machine (in a dev environment), the table will be written to:

tvlk-data-financial-dev.oliver_chen.finserv_l2_table_x

  • When dbt is run in staging/prod, the table will be written to:

tvlk-data-financial-stg/prod.finserv_l2.table_x

  • In this way, DAs can work independently and won't conflict with each other's work.
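
One possible shape of such an override (a sketch of dbt's generate_schema_name macro; not our exact code):

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
  {%- if target.name == 'dev' -%}
    {{ target.schema }}  {# e.g. oliver_chen #}
  {%- else -%}
    {{ custom_schema_name | trim }}  {# e.g. finserv_l2 #}
  {%- endif -%}
{%- endmacro %}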


🤯 data catalog

dbt allows us to specify table and column descriptions in a YAML file:

models:
  - name: table_x
    description: "Description of table_x"
    columns:
      - name: field_a
        description: "Description of field_a"
  • These descriptions are pushed into the BQ console's table metadata (see the persist_docs sketch below) and can be used to generate a data catalog:

dbt docs generate
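
Pushing the descriptions into BQ's table metadata is driven by dbt's persist_docs config; a minimal sketch (in dbt_project.yml):

models:
  +persist_docs:
    relation: true
    columns: true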


🤯 data catalog


🤯 data catalog (lineage)


🤯 all text!!

  • ⌘F is a low-latency search tool
  • A table is named after the .sql file that generates it, so ⌘P (in VS Code) doubles as a table-name search
  • Use the editor of your choice
  • Version control


How we're using dbt

  • Everything in GitHub – close to 400 queries
  • On pull request: black and flake8 run on dags/ folder.
  • Merge to master: build & push docker image with queries to staging tag, sync dags/ folder to staging dag bag.
  • Tag a release: build & push docker image with queries to prod tag, sync dags/ folder to prod dag bag.
  • In Airflow: dbt run using KubernetesPodOperator on the same GKE cluster that Airflow runs on.
  • One DAG per BU, with different tasks for L2, L3 and L4 tables.


How we're using dbt

In the future??

  • If we can just use dbt run to run ALL the queries, then Composer is just a very expensive cron job
  • In theory, we could use a managed cron service to trigger dbt run. It could be completely "serverless" if it's triggered to run on Cloud Build (sketch below).
  • Pros: much cheaper, no need to manage Composer, no persistent infra that has permissions to everything
  • Cons: No Airflow UI
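
A sketch of what that trigger could look like (cloudbuild.yaml; the dbt builder image is hypothetical – we'd have to build and push one ourselves):

steps:
  - name: "gcr.io/$PROJECT_ID/dbt"
    args: ["run", "--profiles-dir", "."]
timeout: 3600s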


Resources


Quiz Question 4

Why are we recommending dbt CLI but not dbt Cloud?

  1. We're cheap
  2. We like the product (dbt) but not the company (Fishtown Analytics)
  3. We don't trust Fishtown Analytics
  4. We don't want our work to be too easy


Quiz Question 5

Which pair of statements are both correct?

  1. (SQL is not Turing-complete) & (SQL is a declarative language)
  2. (SQL is Turing-complete) & (SQL is a declarative language)
  3. (SQL is not Turing-complete) & (SQL is a procedural language)
  4. (SQL is Turing-complete) & (SQL is a procedural language)


Quiz Question 6

A linear ordering of the vertices in a Directed Acyclic Graph (DAG) is said to be topologically sorted if for every directed edge u→v in the DAG, vertex u comes before v in the ordering.

What is the computational complexity to topologically sort a DAG that has V number of vertices and E number of edges?

  1. O(V + E)
  2. O(V * E)
  3. O(V^2 * E)
  4. O(V^2 * E^2)
