Improve ergonomics of updating expected row counts #4333

Open
@zaneselvans

Overview

Updating the expected row counts for dbt can be frustrating. Some issues that have come up:

Needing to run the full ETL locally

  • You have to re-run the entire ETL to be 100% confident that you've got the correct expected row counts.
  • Not everybody's computer is up to doing this task, or it takes too long.
  • Ideally we would get reliable row-count expectations just from rematerializing a subset of the assets locally.
  • Alternatively, we could generate new row-count expectations by running a full ETL in CI via a workflow_dispatch trigger.
  • After the build completes, the row count update script could be run on all tables, and the resulting CSV(s) would get uploaded along with all the other outputs to gs://builds.catalyst.coop/build-id/, where they could be downloaded.
  • (Note that right now only files in PUDL_OUTPUT get uploaded, so if a change is detected, we'd need to copy the full row counts CSV over there for it to be saved.)
  • Under normal circumstances, the row counts should not change at all.
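A minimal sketch of the "copy the CSV into PUDL_OUTPUT only if something changed" step described above. The function name and all three path arguments are hypothetical and are not part of the existing script; the comparison is a plain byte-for-byte check of the regenerated CSV against the committed one.

```python
import shutil
from pathlib import Path
from typing import Optional


def stage_row_counts_if_changed(
    new_csv: Path, committed_csv: Path, output_dir: Path
) -> Optional[Path]:
    """Copy new_csv into output_dir only if it differs from committed_csv.

    All names here are illustrative. Returns the path of the staged copy,
    or None when the regenerated row counts match the committed ones.
    """
    # Byte-for-byte comparison: any difference counts as a change.
    if committed_csv.exists() and new_csv.read_bytes() == committed_csv.read_bytes():
        return None
    output_dir.mkdir(parents=True, exist_ok=True)
    # output_dir stands in for wherever PUDL_OUTPUT points in the build.
    return Path(shutil.copy2(new_csv, output_dir / new_csv.name))
```

In the normal case this returns None and nothing extra is uploaded, so the presence of the CSV among the build outputs would itself signal that expectations need updating.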

Using non-standard partition columns

  • We have some default columns that are used to partition tables so that we get more granular row count expectations.
  • These are not always the right partitioning columns, so we have the option of specifying other columns in the tests.
  • However, when adding a new table and row count expectations for the first time, the script doesn't let you specify which non-standard columns you would like to use. So you have to add it, then edit the test spec in schema.yml, and then go back and regenerate the expectations. Having a direct
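To illustrate what specifying the partition column up front could look like, here is a rough sketch that counts rows per partition value for a caller-supplied column. The function, its `partition_column` argument, and the `table_name` / `partition` / `row_count` record keys are assumptions for illustration, not the current script's interface.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Optional


def count_rows_by_partition(
    rows: Iterable[Mapping[str, object]],
    table_name: str,
    partition_column: Optional[str] = None,
) -> List[Dict[str, object]]:
    """Return one row-count record per distinct value of partition_column.

    With partition_column=None the whole table is counted as a single
    partition labeled with an empty string (an illustrative convention).
    """
    counts = Counter(
        "" if partition_column is None else str(row[partition_column])
        for row in rows
    )
    # Sort by partition value so the output is stable between runs.
    return [
        {"table_name": table_name, "partition": part, "row_count": n}
        for part, n in sorted(counts.items())
    ]
```

For example, calling `count_rows_by_partition(rows, "my_table", partition_column="state")` on a first run would produce per-state expectations directly, without the add, edit, regenerate round trip.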

Removing obsolete partition values

(Maybe this is/was already fixed by @jdangerx's PR making row count checks stricter?)

  • When updating row counts, if there's an old partition value that's no longer relevant (e.g. because the partitioning column has been changed) the script will add new records to the row count CSV, but doesn't remove the obsolete records.
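One way to avoid leaving stale partitions behind is to replace all of a table's records wholesale instead of appending to them. The sketch below assumes row-count records are dicts with `table_name`, `partition`, and `row_count` keys; those keys and the function name are illustrative, not taken from the existing script.

```python
from typing import Dict, Iterable, List


def replace_table_row_counts(
    existing: Iterable[Dict[str, object]],
    table_name: str,
    new_counts: Iterable[Dict[str, object]],
) -> List[Dict[str, object]]:
    """Drop every existing record for table_name, then add new_counts.

    Records for other tables are left untouched. Because the old records
    are discarded rather than merged, partition values that no longer
    occur in the table disappear from the result.
    """
    kept = [rec for rec in existing if rec["table_name"] != table_name]
    merged = kept + list(new_counts)
    # Sort for a stable, diff-friendly ordering in the written CSV.
    return sorted(merged, key=lambda r: (str(r["table_name"]), str(r["partition"])))
```

The trade-off is that this only behaves correctly when `new_counts` covers the whole table; after a partial rematerialization it would need to be limited to the tables that were actually rebuilt.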

Labels

  • cli: Scripts and other command line interfaces to PUDL.
  • dbt: Issues related to the data build tool aka dbt
  • testing: Writing tests, creating test data, automating testing, etc.