Overview
Updating the expected row counts for dbt can be frustrating. Some issues that have come up:
Needing to run the full ETL locally
- You have to re-run the entire ETL to be 100% confident that you've got the correct expected row counts.
- Not everybody's computer is up to the task, and even when it is, it takes too long.
- Ideally we would get reliable row-count expectations just from rematerializing a subset of the assets locally.
- Alternatively, we could generate new row-count expectations by running a full ETL using a `workflow_dispatch` trigger.
- After the build completes, the row count update script could be run on all tables, and the resulting CSV(s) would get uploaded along with all the other outputs to `gs://builds.catalyst.coop/build-id/`, where they could be downloaded. (Note that right now only files in `PUDL_OUTPUT` get uploaded, so if a change is detected, we'd need to copy the full row counts CSV over there to be saved; see the sketch after this list.)
- Under normal circumstances, there should be no change to the row count expectations at all.
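A minimal sketch of that copy step, assuming the regenerated expectations CSV lives in the dbt project and that the `PUDL_OUTPUT` environment variable points at the directory that gets uploaded to the build bucket (the script name and CSV path here are hypothetical, not the actual repo layout):

```python
"""Copy the regenerated row-count expectations into PUDL_OUTPUT.

A minimal sketch: the CSV location is an assumption, not the actual
layout of the PUDL repo.
"""
import os
import shutil
from pathlib import Path

# Hypothetical location of the regenerated expectations CSV.
ROW_COUNTS_CSV = Path("dbt/seeds/etl_full_row_counts.csv")


def publish_row_counts() -> None:
    """Copy the row counts CSV into PUDL_OUTPUT so the nightly build uploads it."""
    pudl_output = Path(os.environ["PUDL_OUTPUT"])
    destination = pudl_output / ROW_COUNTS_CSV.name
    shutil.copy2(ROW_COUNTS_CSV, destination)
    print(f"Copied {ROW_COUNTS_CSV} -> {destination}")


if __name__ == "__main__":
    publish_row_counts()
```

Something like this could run as a final step of the nightly build job, so the refreshed CSV lands in the bucket alongside the other outputs.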
Using non-standard partition columns
- We have default columns that are used to partition tables, so that the row count expectations are more granular.
- These aren't always the right partitioning columns, so we have the option of specifying other columns in the tests.
- However, when adding a new table and its row count expectations for the first time, the script doesn't let you specify which non-standard columns you'd like to use. So you have to add the table, edit the test spec in `schema.yml`, and then go back and regenerate the expectations. Having a direct way to specify the partition columns when generating the expectations would avoid that round trip; a sketch of what such an option might look like follows.
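A sketch of such an option, assuming the update script is a small CLI; the flag name, default column, and function below are illustrative, not the actual PUDL interface:

```python
"""Hypothetical CLI addition letting you set partition columns up front.

The flag and function names are illustrative only; the real row count
update script may be structured differently.
"""
import argparse


def update_row_counts(table: str, partition_column: str) -> None:
    """Placeholder for regenerating the expectations for one table.

    In the real script this would query the materialized table, group by
    ``partition_column``, and write the per-partition counts to the CSV.
    """
    print(f"Would regenerate row counts for {table}, partitioned by {partition_column}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Update expected row counts.")
    parser.add_argument("--table", required=True, help="Table to (re)generate expectations for.")
    parser.add_argument(
        "--partition-column",
        default="report_year",  # assumed default partition column
        help="Column to partition row counts by, instead of the default.",
    )
    args = parser.parse_args()
    update_row_counts(args.table, args.partition_column)


if __name__ == "__main__":
    main()
```

With something like this, the `schema.yml` test spec and the generated expectations could be kept in sync in a single pass.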
Removing obsolete partition values
(maybe this is/was already fixed by @jdangerx's PR making row count checks more strict?)
- When updating row counts, if there's an old partition value that's no longer relevant (e.g. because the partitioning column has been changed), the script will add new records to the row count CSV but doesn't remove the obsolete ones. A sketch of one way to prune them follows.
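One way to avoid the stale rows, sketched below under the assumption that the expectations CSV has `table_name`, `partition`, and `row_count` columns (an assumption about the file layout): drop every existing row for the table being regenerated and replace them wholesale instead of appending.

```python
"""Replace, rather than append to, a table's rows in the row count CSV.

A minimal sketch; the CSV schema (table_name, partition, row_count) is an
assumption about how the expectations file is laid out.
"""
import pandas as pd


def replace_table_row_counts(
    csv_path: str, table_name: str, new_counts: pd.DataFrame
) -> None:
    """Drop all existing rows for ``table_name`` and write the fresh ones.

    Dropping first means stale partition values (e.g. from an old
    partitioning column) disappear instead of lingering in the CSV.
    """
    existing = pd.read_csv(csv_path)
    kept = existing[existing["table_name"] != table_name]
    updated = pd.concat([kept, new_counts], ignore_index=True)
    updated.sort_values(["table_name", "partition"]).to_csv(csv_path, index=False)
```

Replacing a table's rows wholesale makes the CSV a pure function of the current partitioning configuration, so obsolete partition values can't linger.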