Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alias column selection in group_by_columns to enable timestamp wrapping #946

Open
macklin-fluehr opened this issue Sep 2, 2024 · 1 comment
Labels
enhancement New feature or request triage

Comments

@macklin-fluehr
Copy link

macklin-fluehr commented Sep 2, 2024

Describe the feature

group_by_columns does not allow you to group by higher aggregates. For example, if I have a timestamp column, I might want to group by date_trunc('week', timestamp) to monitor that values are expected each week.

Right now, if my yml is set to:

- dbt_utils.not_null_proportion: 
      at_least: 0.95
      group_by_columns: [date_trunc('week', timestamp)]

The code compiles to:

select
      count(*) as failures,
      count(*) != 0 as should_warn,
      count(*) != 0 as should_error
    from (
 
with validation as (
  select
    date_trunc('week' ,timestamp), 
    sum(case when visit_reasons is null then 0 else 1 end) / cast(count(*) as numeric) as not_null_proportion
  from <my_table>
  group by date_trunc('week',timestamp)
),
validation_errors as (
  select
    date_trunc('week' ,timestamp), 
    not_null_proportion
  from validation
  where not_null_proportion < 0.95 or not_null_proportion > 1
)
select
  *
from validation_errors
    ) dbt_internal_test

which, as you can see, in the validation_errors cte does not work. That said, if we just aliased the column in the validation cte and referenced it in validation_errors, we'd be all set. My suggested change in macros/generic_tests/not_null_proportion.sql:

with validation as (
  select
    {% for c in select_gb_cols %}
    {{select_gb_cols}} as group_col_{{loop.counter}}{% if not loop.last %},{%endif%}
    {% endear %}
    sum(case when {{ column_name }} is null then 0 else 1 end) / cast(count(*) as {{ dbt.type_numeric() }}) as not_null_proportion
  from {{ model }}
  {{groupby_gb_cols}}
),
validation_errors as (
  select
   {% for c in select_gb_cols %}
    group_col_{{loop.counter}}{% if not loop.last %},{%endif%}
    {% endear %}
    not_null_proportion
  from validation
  where not_null_proportion < {{ at_least }} or not_null_proportion > {{ at_most }}

Describe alternatives you've considered

I could make my own custom test to do this

Additional context

Don't think database will impact anything here

Who will this benefit?

anybody trying to group by time

Are you interested in contributing this feature?

I hope the above is enough to get started

Let us know if you want to contribute the feature, and whether would need a hand getting started

@macklin-fluehr macklin-fluehr added enhancement New feature or request triage labels Sep 2, 2024
@marinewign
Copy link

My team and I would need this feature to assess data quality on a day-by-day basis for data that we receive every hour thanks to a DATE_TRUNC('day', received_date) !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

2 participants