
dbt

open source tool for analytics/feature engineering

@oliver


Disclosure

  • I don't receive any form of compensation from Fishtown Analytics (the company behind dbt).
  • It might seem like I'm on their sales team or DevRel team. I'm not!
  • I just think it's an amazing tool that will give our analytics teams superpowers.


What superpower do you want? (DA edition)

  1. Flying: deliver analysis fast
  2. X-ray vision: generated data dictionary gives quick insight into your tables
  3. Extra-sensory perception (ESP): monitoring and notifications with data-driven tests
  4. Hive mind: gain context quickly from what your teammates have already done


Overview

  • What FinServ was doing before
  • What dbt does
  • 🤯 features
  • How we're using dbt


What FinServ was doing before

  • PayLater: mostly Domo; R1R2 used a DIY pipeline tool (because we didn't know about dbt)
  • Insurance: BQ scheduler
  • Payment: Databricks/Redshift


Before: problems

  • Domo
    • easy to do things in an unsustainable way – single layer with incredible complexity or complex dataflows that are difficult to trace
    • discoverability of teammates' work was next to impossible – a massive problem when people leave the org
    • scheduled jobs are difficult to find; easy for costs to explode
    • no version control – we can't know what was running previously, even earlier the same day
    • hard to identify PII data (no easy way to search column names)


Before: problems

  • DIY pipeline tool
    • extremely limited feature set – anything else we needed had to be built ourselves
    • queries run in series – slooow for the ~50 queries per run in the feature pipeline


Before: problems

  • BQ Scheduler
    • Dependencies between queries can't be properly handled
    • No easy way to have version control and CI/CD
  • Databricks/Redshift
    • Slower to develop than BQ
    • More that needs to be tuned: number of nodes, indexes
    • Different tech stack than the rest of FinServ


What dbt does

  • dbt's tagline: Analytics engineering tools designed for analysts
  • In essence:
    • pip install dbt
    • set up some configuration (sketch below)
    • write your queries in .sql files
    • dbt run
    • ...
    • profit!
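
The "configuration" step is mostly one file. A minimal sketch of a BQ profile (profiles.yml; the project and dataset names are illustrative, not our real ones):

finserv:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: my_dataset
      threads: 4

The threads setting here is what drives the multi-threaded runs described later.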


What dbt does

  • dbt will take all of your .sql files and send them to the BQ API to be run
  • Along the way, there are many useful things that dbt can do
  • There's also a paid cloud version that can handle scheduling, monitoring, etc., which would be an Airflow replacement
  • But we're just using the CLI to avoid giving another third party access to all of our data
  • Besides BQ, dbt also fully supports Snowflake, Redshift and PostgreSQL


🤯 automagically inferring the DAG

Jinja templating is used, so it's recommended to refer to tables using ref:

SELECT
  *
FROM
  {{ ref('table_name') }}
  • If this is done consistently, dbt can infer the full DAG (small example below)
  • You don't need to work out the dependencies between queries – which query needs to run before others
  • The DAG file for Airflow will be much simpler.
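
A small example of the inference (model names hypothetical): when one model refs another, dbt knows which must run first.

-- models/daily_txn.sql (reads a raw table directly)
SELECT
  *
FROM
  `my-project.raw.transactions`

-- models/txn_summary.sql (this ref makes dbt run daily_txn first)
SELECT
  user_id,
  COUNT(*) AS txn_count
FROM
  {{ ref('daily_txn') }}
GROUP BY
  user_id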


🤯 running parts of DAGs

Suppose you're working on an update of table_x.sql:

  • Of course, you can just run that: dbt run --model table_x
  • Need all upstream dependencies to be refreshed? dbt run --model +table_x
  • Want to see the effect on downstream dependencies? dbt run --model table_x+
  • Or run it all at once: dbt run --model +table_x+


🤯 multi-threaded runs

  • Since dbt infers the DAG, it knows what queries can run in parallel
  • BQ scales horizontally exceptionally well, and each query is just an HTTP call from our worker pod
  • We're using 4 threads, but more should be fine


🤯 incremental materialisations

{{
  config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    partitions=get_list_of_day("\'" + var('ds') + "\'", -4),
    partition_by={"field": "partition_date", "data_type": "date"},
  )
}}
  • Each .sql file can have configuration at the top as a Jinja template
  • A query can be materialized as incremental, table or ephemeral
  • With incremental using the insert_overwrite strategy, dbt will drop and rewrite only the indicated partition(s) rather than rebuilding the whole table
  • What if there's a query with a join that we want to be incremental?


Missed join on incremental materialisation


🤯 Kevin's trick

-- Join over a wider (5-day) window so that rows whose join partner
-- arrived slightly earlier or later are not missed
WITH t_join AS (
  SELECT *
  FROM
    {{ ref('table_a') }} JOIN {{ ref('table_b') }} USING(user_id)
  {% if is_incremental() %}
    WHERE
    a_date BETWEEN DATE_SUB(DATE('{{ var("ds") }}'), INTERVAL 5 DAY)
           AND DATE('{{ var("ds") }}')
    AND b_date BETWEEN DATE_SUB(DATE('{{ var("ds") }}'), INTERVAL 5 DAY)
               AND DATE('{{ var("ds") }}')
  {% endif %}
)

-- Then keep only the narrower (3-day) window we actually want to rewrite
SELECT * FROM t_join
WHERE
  a_date BETWEEN DATE_SUB(DATE('{{ var("ds") }}'), INTERVAL 3 DAY)
         AND DATE('{{ var("ds") }}')


🤯 Fast backfills

  • Using the is_incremental() macro/flag, we can force it to return false with: dbt run --model table_x --full-refresh
  • Instead of doing backfills with Airflow daily jobs that can take several hours or more, most backfills can be finished in a single query
  • dbt will first delete the existing table and then run the full refresh query


🤯 testing data

  • A great article introducing Great Expectations talked about a specific type of technical debt: pipeline debt
  • The central idea is that unit tests and integration tests (at pipeline deploy time) are very limited in how much they can help with pipeline debt
  • What is needed are pipeline tests that test the data every time the pipelines run
  • dbt has:
    • "schema" tests that can check things like whether columns contain nulls or whether a column is unique, to catch changes in the characteristics of our source data (sketch below)
    • "data" tests that can be used for business logic
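
A minimal schema-test sketch (table and column names hypothetical); unique and not_null tests ship with dbt:

models:
  - name: table_x
    columns:
      - name: user_id
        tests:
          - unique
          - not_null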


🤯 testing data

  • Insurance has set up dbt test with Airflow
  • If the lower bound of insurance issuance is breached, a Slack message is sent to a channel that includes them and the insurance PMs (sketch of such a test below)
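
A data test is just a query that should return zero rows; any returned row is a failure. A sketch of what the issuance check might look like (file name, table name and threshold are all illustrative):

-- tests/assert_issuance_above_lower_bound.sql
SELECT issuance_date, COUNT(*) AS n_policies
FROM {{ ref('insurance_issuance') }}
WHERE issuance_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY issuance_date
HAVING COUNT(*) < 100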


🤯 developer workflow

  • We've overridden the standard macros that build table names from ref('table_x') (sketch below)
  • When I'm developing on my machine (in a dev environment), the table will be written to:

tvlk-data-financial-dev.oliver_chen.finserv_l2_table_x

  • When dbt is run in staging/prod, the table will be written to:

tvlk-data-financial-stg/prod.finserv_l2.table_x

  • In this way, DAs can work independently and won't conflict with each other's work.
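
One possible shape of such an override (a sketch of dbt's generate_schema_name macro; not our exact code):

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
  {%- if target.name == 'dev' -%}
    {{ target.schema }}  {# e.g. oliver_chen #}
  {%- else -%}
    {{ custom_schema_name | trim }}  {# e.g. finserv_l2 #}
  {%- endif -%}
{%- endmacro %}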


🤯 data catalog

dbt allows us to specify table and column descriptions in a YAML file:

models:
  - name: table_x
    description: "Description of table_x"
    columns:
      - name: field_a
        description: "Description of field_a"
  • These descriptions are pushed into the BQ console's table metadata (see the persist_docs sketch below) and can be used to generate a data catalog:

dbt docs generate
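
Pushing the descriptions into BQ's table metadata is driven by dbt's persist_docs config; a minimal sketch (in dbt_project.yml):

models:
  +persist_docs:
    relation: true
    columns: true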


🤯 data catalog


🤯 data catalog (lineage)


🤯 all text!!

  • ⌘F is a low-latency search tool
  • A table is named after the .sql file that generates it, so ⌘P (in VS Code) doubles as a table-name search
  • Use the editor of your choice
  • Version control


How we're using dbt

  • Everything in GitHub – close to 400 queries
  • On pull request: black and flake8 run on dags/ folder.
  • Merge to master: build & push docker image with queries to staging tag, sync dags/ folder to staging dag bag.
  • Tag a release: build & push docker image with queries to prod tag, sync dags/ folder to prod dag bag.
  • In Airflow: dbt run using KubernetesPodOperator on the same GKE cluster that Airflow runs on.
  • One DAG per BU, with different tasks for L2, L3 and L4 tables.


How we're using dbt

In the future??

  • If we can just use dbt run to run ALL the queries, then Composer is just a very expensive cron job
  • In theory, we could use a managed cron service to trigger dbt run. It could be completely "serverless" if it's triggered to run on Cloud Build (sketch below).
  • Pros: much cheaper, no need to manage Composer, no persistent infra that has permissions to everything
  • Cons: No Airflow UI
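
A sketch of what that trigger could look like (cloudbuild.yaml; the dbt builder image is hypothetical – we'd have to build and push one ourselves):

steps:
  - name: "gcr.io/$PROJECT_ID/dbt"
    args: ["run", "--profiles-dir", "."]
timeout: 3600s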


Resources


Quiz Question 4

Why are we recommending dbt CLI but not dbt Cloud?

  1. We're cheap
  2. We like the product (dbt) but not the company (Fishtown Analytics)
  3. We don't trust Fishtown Analytics
  4. We don't want our work to be too easy


Quiz Question 5

Which pair of statements are both correct?

  1. (SQL is not Turing-complete) & (SQL is a declarative language)
  2. (SQL is Turing-complete) & (SQL is a declarative language)
  3. (SQL is not Turing-complete) & (SQL is a procedural language)
  4. (SQL is Turing-complete) & (SQL is a procedural language)


Quiz Question 6

A linear ordering of the vertices in a Directed Acyclic Graph (DAG) is said to be topologically sorted if for every directed edge u→v in the DAG, vertex u comes before v in the ordering.

What is the computational complexity to topologically sort a DAG that has V number of vertices and E number of edges?

  1. O(V + E)
  2. O(V * E)
  3. O(V^2 * E)
  4. O(V^2 * E^2)
