Features#

Before exploring the recipes for defining features in Triage, remember that features are time-dependent—they are calculated relative to specific prediction dates (as of date).

Before cooking features#

Things you need to know:

  • Feature configurations are defined under the features_aggregations section in Triage’s main configuration yaml file

  • All features in Triage are organized into groups called feature groups

    • Each feature group is identified by a prefix (prefix in the yaml file). That prefix will be part of the name for every feature in that group

    • Each feature group must specify the input data it will use to generate features. This is defined in from_obj, which creates a table in the database named {feature group}_from_obj

    • Each feature group must also specify the column in the input data that indicates when the event of interest occurred. This is defined in knowledge_date_column

  • You can define general imputation rules at the feature group level and/or specify imputation rules for individual features within a group

  • The date on which the prediction is made is specified as collate_date in the features configuration section of the yaml file

  • Individual feature names are generated by Triage using the following structure: {feature group name}_{entity_id}_{time period}_{feature name}_{metric} (e.g., b_ambulance_entity_id_3month_suic_attmpt_total)

  • Each feature defines all the temporal aggregations to generate by specifying metrics and intervals

  • If a feature is imputed, an additional binary column with the suffix _imp is created that allows the model to use the missingness of that feature as a predictor

  • Features generated by Triage are stored in the features schema, in a table named {feature group}_aggregation_imputed (e.g., b_ambulance_aggregation_imputed)

Note

The number of features generated in each feature group depends on the number of metrics and intervals defined.


The following recipes come from the context of an Early Warning System that helps identify people at higher risk of a severe behavioral health episode, so they can be prioritized for outreach. The data we use includes things like diagnoses, ambulance runs, past behavioral health episodes, and basic demographics such as age, gender, and race.

Easy recipes#

Counting events#

This recipe generates features for counting the number of times an event happens on a given period of time, i.e., How many diagnoses an individual had in the last 6 months given a prediction date as of date.

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined feature group

  • A defined knowledge_date_column

👩‍🍳 How to cook

Within a feature group definition in the aggregates section define:

  • A quantity type, name it total with the value start * (it means it will count all occurrencies)

  • A metrics of type count

  • If you have defined a general default imputation rule for the whole feature group, then it’s optional to define an imputation rule specific for the feature, but it’s recommended.

In the following code snippet we count total number of events on a group, as well as events of specific types.

feature_aggregations:
  -
    prefix: 'b_ambulance' #features for ambulance runs (count) feature group
    # Query that defines where to get the input data to generate the
    # feature. In this case from the semantic.ambulance_runs table 
    # getting all individuals that have a joid and retrieving different 
    # behavioral health flags 
    from_obj: |
      (SELECT ar.joid::INT AS entity_id,
             ar.event_date AS knowledge_date,
             ar.drug_use_flag::INT,
             ar.alcohol_use_flag::INT,
             ar.suicidal_flag::INT,
             ar.suicide_attempt_flag::INT,
             ar.other_mental_crisis_flag::INT,
             ar.psychosis_nonsubstance_flag::INT,
             ar.psychosis_substance_flag::INT,
             ar.overdose_flag::INT
      FROM semantic.ambulance_runs ar where ar.joid is not null) AS runs
    # Identifies the column that holds the information of when the event happened
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: ar.event_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date' 
    # General rule to impute any of the features from this feature group. 
    # In this case if there is a null it will add a zero indicating 
    # that no event happend, not that data was missing. Refer to 
    # imputation strategies in https://dssg.github.io/triage/dirtyduck/triage_intro/#imputation for a complete explanation.
    aggregates_imputation:
       all:
         type: 'zero_noflag'
    # the following section defines individual features within the
    # ambulance runs feature group
    aggregates:
      -
        # counting all occurrencies of an ambulance run
        quantity: 
          total: '*' 
        metrics:
          - 'count' 
        # defining the imputation rule of adding a 0 in case of missing
        # values. In this case, it means no events. 
        imputation:
          count:
            type: 'constant'
            value: 0
      -
        # counting specific ocurrencies of the event ambulance run 
        # for suicide attempt. Since suicide attempt is a flag with 
        # values 0 or 1 for abscence or presence of the event, 
        # we are adding all the values 1 to count how many 
        # we got for this specific condition.
        quantity: 
          sui_attmpt: 'suicide_attempt_flag'
        metrics:
          - 'sum'
      -
        # counting specific ocurrencies of the event other mental crisis
        # Since other mental crisis is a flag with values 0 or 1 
        # for abscence or presence of the event, we are adding all the 
        # values 1 to count how many we got this specific condition. 
        quantity: 
          oth_mnt_crisis: 'other_mental_crisis_flag'
        metrics:
          - 'sum'
    # The periods for which this feature will be generated.
    intervals:
      - 'all'    # All available historical data prior to the prediction date
      - '1month' # Last month prior to the prediction date 
      - '3month' # Last three months prior to the prediction date
      - '1year'  # Last year prior to the prediction date

🍲 What to look for

Once Triage has run, look in the features schema for the table generated by Triage named b_ambulance_aggregation_imputed.

This table should have the following columns on it:

  • entity_id

  • as_of_date

  • b_ambulance_entity_id_all_total_count

  • b_ambulance_entity_id_1month_total_count

  • b_ambulance_entity_id_3month_total_count

  • b_ambulance_entity_id_1year_total_count

  • b_ambulance_entity_id_all_sui_attmpt_sum

  • b_ambulance_entity_id_1month_sui_attmpt_sum

  • b_ambulance_entity_id_3month_sui_attmpt_sum

  • b_ambulance_entity_id_1year_sui_attmpt_sum

  • b_ambulance_entity_id_all_oth_mnt_crisis_sum

  • b_ambulance_entity_id_1month_oth_mnt_crisis_sum

  • b_ambulance_entity_id_3month_oth_mnt_crisis_sum

  • b_ambulance_entity_id_1year_oth_mnt_crisis_sum

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_ambulance_entity_id_3month_oth_mnt_crisis_sum is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_ambulance_entity_id_3month_oth_mnt_crisis_sum
	from features.b_ambulance_aggregation_imputed
	where as_of_date = '2022-01-01'
	and b_ambulance_entity_id_3month_oth_mnt_crisis_sum = 2
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --we need to check on the original column
	sum(other_mental_crisis_flag::int) as manual_calculation, 
  --we are comparing with what Triage calculated 
	max(b_ambulance_entity_id_3month_oth_mnt_crisis_sum) as triage_calculation
from semantic.ambulance_runs a 
join examples b 
on a.joid = b.entity_id
--the event must have occurred within 3 months prior to the as of date
and event_date between '2022-01-01'::date - interval '3month' and '2022-01-01'::date 
group by joid
order by 3

Age as feature#

This recipe will generate an age feature that has the age in years an individual has at a particular prediction date.

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • A column that has the date of birth (dob)

👩‍🍳 How to cook

feature_aggregations:
-
    prefix: 'b_age' #features for age
    # Query that defines where to get the input data to generate the
    # feature. In this case from the semantic.demographics table 
    # getting all individuals that have a date of birth and 
    # have an id.
    from_obj: |
      (SELECT d.joid::INT AS entity_id, 
          d.event_date AS knowledge_date,
          demographics_value AS dob
      FROM semantic.demographics d
      WHERE demographics_type = 'dob' and d.joid is not null) AS age
    # Identifies the column that holds the information of when the event happened
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: d.event_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date'
    # General rule to impute any of the features from this feature group. 
    # In this case if there is a null it will use the average age of the cohort 
    # Refer to imputation strategies in 
    # https://dssg.github.io/triage/dirtyduck/triage_intro/#imputation for a complete explanation. 
    aggregates_imputation:
      all:
        type: 'mean'
    aggregates:
      -
        # to get the age of an individual up to a particular date, we subtract 
        # the date of the prediction (collate_date), to the date of birth registered. 
        # In this case we are using postgres function AGE to subtract two dates 
        # and then formating the timestamp output to years in order to get 
        # and integer. Then we average the age of an individual from all the 
        # registers we have. 
        quantity:
          age: "EXTRACT(YEAR FROM AGE('{collate_date}'::DATE, dob::DATE))"
        metrics:
          - 'avg'
    # The periods for which this feature will be generated. 
    intervals:
      - 'all' # All available historical data prior to the prediction date 

🍲 What to look for

Once Triage has run, look in the features schema for the table generated by Triage named b_age_aggregation_imputed.

This table should have the following columns on it:

  • entity_id

  • as_of_date

  • b_age_entity_id_all_age_avg

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_age_entity_id_all_age_avg is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_age_entity_id_all_age_avg
	from features.b_age_aggregation_imputed
	where as_of_date = '2022-01-01'
	and b_age_entity_id_all_age_avg > 0
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --we need to check on the original column
  avg(extract(year from age('2022-01-01'::date, demographics_value::date))) as manual_calculation, 
  --we are comparing with what Triage calculated 
	max(b_age_entity_id_all_age_avg) as triage_calculation
from semantic.demographics a 
join examples b 
on a.joid = b.entity_id
where demographics_type = 'dob'
--from all the available historical data prior to the prediction date 
and event_date < '2022-01-01'::date 
group by joid
order by 3

Flag events features#

This type of features let us know if an event happened or not —flagged or not.

To generate this type of features in Triage you can follow this recipe:

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • A column of binary type (0, 1)

👩‍🍳 How to cook

feature_aggregations:
# AMBULANCE RUN FEATURES 
  -
    prefix: 'b_ambulance' #features for ambulance runs (count)
    # Query that defines where to get the input data to generate the
    # feature. In this case from the semantic.ambulance_runs table 
    # getting all individuals that have a joid and retrieving different 
    # behavioral health flags 
    from_obj: |
      (SELECT ar.joid::INT AS entity_id,
             ar.event_date AS knowledge_date,
             ar.drug_use_flag::INT,
             ar.alcohol_use_flag::INT,
             ar.suicidal_flag::INT,
             ar.suicide_attempt_flag::INT,
             ar.other_mental_crisis_flag::INT,
             ar.psychosis_nonsubstance_flag::INT,
             ar.psychosis_substance_flag::INT,
             ar.overdose_flag::INT
      FROM semantic.ambulance_runs ar where ar.joid is not null) AS runs
    # Identifies the column that holds the information of when the event happened
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: ar.event_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date'
    # General rule to impute any of the features from this feature group. 
    # In this case if there is a null it will add a zero indicating 
    # that no event happend, not that data was missing. Refer to 
    # imputation strategies in https://dssg.github.io/triage/dirtyduck/triage_intro/#imputation for a complete explanation.
    aggregates_imputation:
       all:
         type: 'zero_noflag'
    # the following section defines individual features within the
    # ambulance runs feature group
    aggregates:
      -
        quantity: 
          drug_use: 'drug_use_flag'
        metrics:
          # If the event of having an ambulance run because of 
          # drug used happend, it will have a 1. With the max
          # metric we are making sure that we are picking it up. 
          # In case the event didn't happend the max will be a 0.
          - 'max'
      -
        quantity: 
          # column name from the from_obj table with the event of 
          # interest. In this case, a column that flags whether
          # an ambulance run was related to alcohol use. 
          alcohol: 'alcohol_use_flag'
        metrics:
          # If an ambulance run occurred due to alcohol use,
          # this column will be 1. Using the max ensures we capture 
          # that event (1 if it happened, 0 otherwise). 
          - 'max'
        # there is no imputation rule for this column. The general 
        # imputation defined in the aggregates_imputation is applied
        # in case of missing values.         
      -
        quantity: 
          # column name from the from_obj table with the event of
          # interest. In this case, a column that flags whether  
          # an ambulance run was related to a suicidal event. 
          suicidal: 'suicidal_flag'
        metrics:
          # If an ambulance run occurred due to a suicidal event,
          # this column will be 1. Using the max ensures we capture 
          # that event (1 if it happened, 0 otherwise). 
          - 'max'
        # there is no rule imputation for this column. The general 
        # imputation defined in the aggregates_imputation is applied
        # in case of missing values.
    # The periods for which this feature will be generated.
    intervals:
      - 'all'     # All available historical data prior to the prediction date
      - '1month'  # Last month prior to the prediction date
      - '3month'  # Last three months prior to the prediction date
      - '1year'   # Last year prior to the prediction date 

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named b_ambulance_aggregation_imputed.

It should have the following columns on it:

  • entity_id

  • as_of_date

  • b_ambulance_entity_id_all_drug_use_max

  • b_ambulance_entity_id_1month_drug_use_max

  • b_ambulance_entity_id_3month_drug_use_max

  • b_ambulance_entity_id_1year_drug_use_max

  • b_ambulance_entity_id_all_alcohol_max

  • b_ambulance_entity_id_1month_alcohol_max

  • b_ambulance_entity_id_3month_alcohol_max

  • b_ambulance_entity_id_1year_alcohol_max

  • b_ambulance_entity_id_all_suicidal_max

  • b_ambulance_entity_id_1month_suicidal_max

  • b_ambulance_entity_id_3month_suicidal_max

  • b_ambulance_entity_id_1year_suicidal_max

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_ambulance_entity_id_1year_drug_use_sum is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_ambulance_entity_id_1year_drug_use_max
	from features.b_ambulance_aggregation_imputed
	where as_of_date = '2022-01-01'
	and b_ambulance_entity_id_1year_drug_use_max = 1
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --we need to check on the original column
	max(drug_use_flag::int) as manual_calculation,
  --we are comparing with what Triage calculated 
	max(b_ambulance_entity_id_1year_drug_use_max) as triage_calculation
from semantic.ambulance_runs a 
join examples b 
on a.joid = b.entity_id
--the event must have occurred within 1 year prior to the as of date
and event_date between '2022-01-01'::date - interval '1year' and '2022-01-01'::date 
group by joid
order by 3

Categorical features#

When you have categorical features, you can either specify to Triage the different values it has/or the ones you are interested on (more on this on the [Advanced recipes]) using the choices parameter, or a query on how to get those using the choice_query parameter. Triage will generate a One-hot encoding setup, where each defined value will be a feature.

Note

If we don’t specify all the different values, the ones that were not included will be imputed.

You can define a generic imputation rule in the categoricals_imputation section, and/or you can have specific imputation rules for each feature using the imputation section within the definition of a feature. For categorical imputation rules defined in Triage you can look here.

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • A column of categorical type

👩‍🍳 How to cook

feature_aggregations:
  # features generated based on different services 
  -
    prefix: 'd_svcs' # features for doco services from a variety of sources
    # Query that defines where to get the input data to generate the
    # feature. In this case from the semi_clean.joco110hsccclientservice2
    # table getting all individuals that have a joid 
    from_obj: 
      (SELECT ce.joid::INT AS entity_id,
             ce.event_date AS knowledge_date,
             ce.servicetype, ce.serviceproviderorg
      FROM semi_clean.joco110hsccclientservice2 ce where ce.joid is not null) AS svcs
    # what is the name of the column that stores the date in which the
    # event (an ambulance run) happened. In this case the name comes 
    # from the second line of the query defined on the from_obj
    knowledge_date_column: 'knowledge_date'
    # defining the categorical features 
    categoricals:
      -
        column: 'servicetype'
        metrics:
          # counting the number of times each type of service happened 
          - 'sum'
          # one-hot encoding for each type of service defined on 
          # choices section for the column servicetype
          - 'max'
        # defining the different type of services we want to convert
        # to one hot encoding style        
        choices: ['Clinic Services', 'Community Corrections', 'Emergency', 'Parole','Medication','Outpatient','Case Management','Supervision']
        # defining the specific imputation rule for this particular feature
        imputation:
          all:
            type: 'zero_noflag' # if no services received, count as 0
    # defining the different periods of time that we would like to 
    # check for the different values each servicetype could have
    intervals:
      - '1month'  # Last month prior to the prediction date
      - '6month'  # Last six months prior to the prediction date
      - '1year'   # Last year prior to the prediction date 
      - 'all'     # All available historical data prior to the prediction date

  # feature group race
  -
    prefix: 'b_race' #features for race
    # Query that defines where to get the input data to generate the
    # feature. In this case from the semantic.demographics table
    # getting all individuals that have a joid
    from_obj: |
     (SELECT d.joid::INT AS entity_id, 
            d.event_date AS knowledge_date,
            demographics_type, demographics_value
     FROM semantic.demographics d
     WHERE demographics_type = 'race' and d.joid is not null) AS race
    # what is the name of the column that stores the date in which the
    # event (an ambulance run) happened. In this case the name comes 
    # from the second line of the query defined on the from_obj
    knowledge_date_column: 'knowledge_date'
    # general imputation rule for all categorical features in this 
    # feature group
    categoricals_imputation: 
      all:
        # in case of null value, flag it on the null category columns
        type: 'null_category'
    # defining the categorical features to be generated from race
    categoricals:
     -
      column: 'demographics_value'
      metrics:
        # one-hot encoding of the column demographics_value
        # for the different values found by the query defined in
        # choice section.
         - 'max'
      # letting know Triage how to get the different values this 
      # column could have. We could do more advance things in this 
      # query, more on this in the "Advanced recipes" section. 
      choice_query: select distinct demographics_value from semantic.demographics d where demographics_type = 'race'
    # defining the period of time that we would like to 
    # check for the different values each demographics_value could have
    intervals:
     - 'all' # All available historical data prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named d_svcs_aggregation_imputed and the table b_race_aggregation_imputed.

Table d_svcs_aggregation_imputed should have the following columns:

  • entity_id

  • as_of_date

  • d_svcs_entity_id_1month_servicetype_Clinic Services_max

  • d_svcs_entity_id_1month_servicetype_Clinic Services_sum

  • d_svcs_entity_id_6month_servicetype_Clinic Services_max

  • d_svcs_entity_id_6month_servicetype_Clinic Services_sum

  • d_svcs_entity_id_1year_servicetype_Clinic Services_max

  • d_svcs_entity_id_1year_servicetype_Clinic Services_sum

  • d_svcs_entity_id_all_servicetype_Clinic Services_max

  • d_svcs_entity_id_all_servicetype_Clinic Services_sum

  • d_svcs_entity_id_NULL_sum

  • d_svcs_entity_id_NULL_max

If there were imputations then columns with the suffix _imp will also be generated on that table.

And the same columns with suffix _max and _sum for each of the other 7 different values specified on the choices parameter.

For table b_race_aggregation_imputed, since we specified a choice_query the different values we got from the query were: A, B, H, I, O, P, and W

  • entity_id

  • as_of_date

  • b_race_entity_id_all_demographics_value_A_max

  • b_race_entity_id_all_demographics_value_B_max

  • b_race_entity_id_all_demographics_value_H_max

  • b_race_entity_id_all_demographics_value_I_max

  • b_race_entity_id_all_demographics_value_O_max

  • b_race_entity_id_all_demographics_value_P_max

  • b_race_entity_id_all_demographics_value_W_max

  • b_race_entity_id_all_demographics_value_NULL_max

It is always a good idea to check that the values registered on those columns are correct. To do that, you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify for the feature group d_svcs if the feature d_svcs_entity_id_6month_servicetype_Emergency_max is correct for a given as_of_date, and for the feature group b_race if the feature b_race_entity_id_all_demographics_value_B_max is correct for a given as_of_date.

For feature group d_svcs:

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_ambulance_entity_id_1year_drug_use_max
	from features.d_svcs_aggregation_imputed
	where as_of_date = '2022-01-01'
	and d_svcs_entity_id_6month_servicetype_Emergency_max = 1
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --we need to check on the original column
	max(drug_use_flag::int) as manual_calculation,
  --we are comparing with what Triage calculated 
	max(b_ambulance_entity_id_1year_drug_use_max) as triage_calculation
from semantic.ambulance_runs a 
join examples b 
on a.joid = b.entity_id
--the event must have occurred within 1 year prior to the as of date
and event_date between '2022-01-01'::date - interval '1year' and '2022-01-01'::date 
group by joid
order by 3

For feature group b_race:

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 "b_race_entity_id_all_demographics_value_B_max"
	from features.b_race_aggregation_imputed
	where as_of_date = '2022-01-01'
	and "b_race_entity_id_all_demographics_value_B_max" = 1
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --we need to check on the original column
	max(
    case
      when demographics_value = 'B' then 1 
      else 0 
    end 
  ) as manual_calculation,
  --we are comparing with what Triage calculated 
	max("b_race_entity_id_all_demographics_value_B_max") as triage_calculation
from semantic.demographics a 
join examples b 
on a.joid = b.entity_id
where demographics_type = 'race'
--the event must have occurred prior to the as of date
and event_date <'2022-01-01'::date 
group by joid
order by 3

Summary statistics for events features#

These features hold an aggregation of the type:

  • Maximum value of

  • Minimum value of

  • Average number of

  • Standard deviation of

  • Sum of

  • Count of (See recipe Counting events)

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • A column of numerical type

👩‍🍳 How to cook

feature-aggregations:
-
  prefix: 'b_bkgs' # features for days in jail
  # Query that defines where to get the input data to generate the
  # feature. In this case from the semantic.jail_bookings
  # table getting all individuals that have a joid. This means 
  # all the people that had a registered booking in the system. 
  from_obj: 
    (SELECT jb.joid::INT as entity_id,
            jb.start_date AS knowledge_date,
            jb.end_date as end_date
    FROM semantic.jail_bookings jb where jb.joid is not null) AS bookings
  # what is the name of the column that stores the date in which the
  # event (having a booking) happened. In this case the name comes 
  # from the second line of the query defined on the from_obj
  knowledge_date_column: 'knowledge_date'
  # General rule to impute any of the features from this feature group. 
  # In this case if there is a null it will raise an error. Refer to 
  # imputation strategies in https://dssg.github.io/triage/dirtyduck/triage_intro/#imputation for a complete explanation
  aggregates_imputation:
    all:
      type: 'error'
  # the following section defines individual features in the
  # bookings feature group
  aggregates: 
  -
    # days in jail, calculated by subtracting from the 
    # release date (end_date) the date of the event (being booked)
    # If there is no end date, it means it hasn't been released
    # so we subtract the date of the event (being booked)
    # to the prediction date (collate_date).
    quantity:
      d_in_jail: | 
        case 
          when end_date is null then '{collate_date}'::DATE - knowledge_date::DATE
          else end_date::DATE - knowledge_date::DATE
        end
    metrics:
      - 'sum'     # total number of days in jail
      - 'max'     # maximum number of days in jail 
      - 'min'     # minimum number of days in jail
      - 'stddev'  # the standard deviation of the number of days in jail
    # if there is no end date, the value will be 0 
    imputation:
      avg:
        type: 'zero'
  # defining the period of time that we would like to 
  # check for the different values each demographics_value could have
  intervals:
    - 'all'     # All available historical data prior to the prediction date      
    - '3month'  # Last three months prior to the prediction date
    - '6month'  # Last six months prior to the prediction date
    - '1year'   # Last year prior to the prediction date
    - '3years'  # Last three years prior to the prediction date
    - '5years'  # Last five years prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named b_bkgs_aggregation_imputed.

It should have the following columns on it:

  • entity_id

  • as_of_date

  • b_bkgs_entity_id_all_d_in_jail_sum

  • b_bkgs_entity_id_all_d_in_jail_max

  • b_bkgs_entity_id_all_d_in_jail_min

  • b_bkgs_entity_id_all_d_in_jail_avg

  • b_bkgs_entity_id_all_d_in_jail_stddev

  • b_bkgs_entity_id_3month_d_in_jail_sum

  • b_bkgs_entity_id_3month_d_in_jail_max

  • b_bkgs_entity_id_3month_d_in_jail_min

  • b_bkgs_entity_id_3month_d_in_jail_avg

  • b_bkgs_entity_id_3month_d_in_jail_stddev

  • b_bkgs_entity_id_6month_d_in_jail_sum

  • b_bkgs_entity_id_6month_d_in_jail_max

  • b_bkgs_entity_id_6month_d_in_jail_min

  • b_bkgs_entity_id_6month_d_in_jail_avg

  • b_bkgs_entity_id_6month_d_in_jail_stddev

  • b_bkgs_entity_id_1year_d_in_jail_sum

  • b_bkgs_entity_id_1year_d_in_jail_max

  • b_bkgs_entity_id_1year_d_in_jail_min

  • b_bkgs_entity_id_1year_d_in_jail_avg

  • b_bkgs_entity_id_1year_d_in_jail_stddev

  • b_bkgs_entity_id_3years_d_in_jail_sum

  • b_bkgs_entity_id_3years_d_in_jail_max

  • b_bkgs_entity_id_3years_d_in_jail_min

  • b_bkgs_entity_id_3years_d_in_jail_avg

  • b_bkgs_entity_id_3years_d_in_jail_stddev

  • b_bkgs_entity_id_5years_d_in_jail_sum

  • b_bkgs_entity_id_5years_d_in_jail_max

  • b_bkgs_entity_id_5years_d_in_jail_min

  • b_bkgs_entity_id_5years_d_in_jail_avg

  • b_bkgs_entity_id_5years_d_in_jail_stddev

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_bkgs_entity_id_1year_d_in_jail_avg is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_bkgs_entity_id_1year_d_in_jail_avg
	from features.b_bkgs_aggregation_imputed
	where as_of_date = '2022-01-01'
  and b_bkgs_entity_id_1year_d_in_jail_avg > 0
	limit 3
)

select 
	joid, 
  --we need to check on the original column and average 
	avg(
    case
      when end_date is null then '2022-01-01'::DATE - start_date::DATE
      else end_date::DATE - start_date::DATE
    end 
  ) as manual_calculation,
  --we are comparing with what Triage calculated 
	max(b_bkgs_entity_id_1year_d_in_jail_avg) as triage_calculation
from semantic.jail_bookings a 
join examples b 
on a.joid = b.entity_id
--the event must have occurred in the last year prior to the prediction date
where start_date between '2022-01-01'::date - interval '1year' and '2022-01-01'
group by joid
order by 3

Intermediate recipes#

Days since last event (DSL)#

This recipe will help you to define behavioral features that care about how long it has passed since the last time something of interest has happened. For example:

  • How many months have passed since the last severe behavioral health episode?

  • How many days have passed since the last ambulance run for suicidal thoughts happened?

  • How many months have passed since the last time an individual interacted with any of the homelessness services on the county?

You can generate these type of features for different time units that make sense on your context (e.g., days, weeks, months, years) as well as different intervals (e.g., last 3 months, last year, etc.).

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • An event registered with a timestamp

👩‍🍳 How to cook

feature_aggregations:
- 
  prefix: 'b_ambulance' #features for ambulance runs
      # Query that defines the input for generating the features. In this case, all
      # individuals with an ID that had an ambulance run.
      from_obj: |
        (SELECT ar.joid::INT AS entity_id,
              ar.event_date AS knowledge_date,
              ar.drug_use_flag::INT,
              ar.alcohol_use_flag::INT,
              ar.suicidal_flag::INT,
              ar.suicide_attempt_flag::INT,
              ar.other_mental_crisis_flag::INT,
              ar.psychosis_nonsubstance_flag::INT,
              ar.psychosis_substance_flag::INT,
              ar.overdose_flag::INT
        FROM semantic.ambulance_runs ar where ar.joid is not null) AS runs
      # Identifies the column that holds the information of when the event happened
      # In this case the name comes from the second line of the query defined 
      # on the from_obj: ar.event_date renamed as knowledge_date
      knowledge_date_column: 'knowledge_date'
      # General rule to impute any of the features from this feature group. 
      # In this case if there is a null it will add a zero indicating 
      # that no event happend, not that data was missing. Refer to 
      # imputation strategies in https://dssg.github.io/triage/dirtyduck/triage_intro/#imputation for a complete explanation.
      aggregates_imputation:
        all:
          type: 'zero_noflag'
      aggregates:
        -
          quantity: 
          # Calculating days since last event, collate_date is the prediction 
          # date (called as_of_date). When subtracting the event of the date 
          # from the prediction date we get the number of days that has passed. 
          # Beacause an individual can have multiple events on different days, 
          # when defining the metric as "min" we are selecting the least number
          # of days, identifying when was the last event. 
            dsl: "'{collate_date}'::DATE - knowledge_date::DATE" 
          metrics:
            - 'min'
          imputation:
          # specific imputation rule to apply when there's no event date
          # We use a very big number to indicate that the event hasn´t 
          # happened (happened a long time ago!). If we were using 0 we 
          # would be defining that the last event of interest had JUST happened. 
            min:
              type: 'constant'
              value: 999999
      # The periods for which this feature will be generated.
      intervals:
      - 'all'    # All available historical data prior to the prediction date
      - '1month' # Last month prior to the prediction date
      - '3month' # Last three months prior to the prediction date
      - '6month' # Last six months prior to the prediction date
      - '1year'  # Last year prior to the prediction date
      - '2years' # Last two years prior to the prediction date
      - '5years' # Last five years prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named b_ambulance_aggregation_imputed.

It should have the following columns on it:

  • entity_id

  • as_of_date

  • b_ambulance_entity_id_all_dsl_min

  • b_ambulance_entity_id_1month_dsl_min

  • b_ambulance_entity_id_3month_dsl_min

  • b_ambulance_entity_id_6month_dsl_min

  • b_ambulance_entity_id_1year_dsl_min

  • b_ambulance_entity_id_2years_dsl_min

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_ambulance_entity_id_2years_dsl_min is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_ambulance_entity_id_2years_dsl_min
	from features.b_ambulance_aggregation_imputed
	where as_of_date = '2022-01-01'
	and b_ambulance_entity_id_2years_dsl_min = 1
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --subtract from the prediction date the event date 
	min('2022-01-01'::date - event_date) as manual_calculation,
  --we are comparing with what Triage calculated 
	b_ambulance_entity_id_2years_dsl_min as triage_calculation
from semantic.ambulance_runs a 
join examples b 
on a.joid = b.entity_id
--the event must have occurred within two years prior to the as of date
where event_date between '2022-01-01'::date - interval '2year' and '2022-01-01'
group by 1, 3

Days since first event (DSF)#

This recipe will help you to define behavioral features that care about how long it has passed since the first time something of interest has happened. For example:

  • How many months have passed since the first severe behavioral health episode?

  • How many days have passed since the first ambulance run for suicidal thoughts happened?

  • How many months have passed since the first time an individual interacted with any of the homelessness services on the county?

You can generate these type of features for different time units that make sense on your context (e.g., days, weeks, months, years) as well as different intervals (e.g., last 3 months, last year, etc.).

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • An event registered with a timestamp

👩‍🍳 How to cook

feature_aggregations:
-
    prefix: 'b_bkgs' # features for bookings
    from_obj: 
    # Query that defines the input for generating the features. In this case, all
    # individuals with an ID in the jail bookings data source.
      (SELECT jb.joid::INT as entity_id,
              jb.start_date AS knowledge_date,
              jb.end_date as end_date
      FROM semantic.jail_bookings jb where jb.joid is not null) AS bookings
    # Identifies the column that holds the information of when the event happened 
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: jb.start_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date'
    # General imputation rule to apply when no specific rule is defined 
    # for a specific feature. In this case, if there is a missing value
    # an error is raised.
    aggregates_imputation: 
      all:
        type: 'error' 
    aggregates:
      -
        # Defining the feature days since first event calculated by 
        # subtracting from the prediction datye the event when the event 
        # happend (when a booking into jail occurred). Because an individual 
        # can have multiple bookings, to identify the first time a booking 
        # happened in a particular perior of time we use the metric max: 
        # the longest number of days. 
        quantity: 
            dsf: "'{collate_date}'::DATE - knowledge_date::DATE"
        metrics:
          - 'max'
        imputation:
          # specific imputation rule in case we didn't find an event date
          # which means that the first time the event happend was a long 
          # time ago. Notice that if we put 0, it would mean that the booking
          # have just happened. 
          max:
            type: 'constant'
            value: 999999
    # The periods for which this feature will be generated. 
    intervals:
      - 'all'     # All available historical data prior to the prediction date
      - '1month'  # Last month prior to the prediction date
      - '3month'  # Last three months prior to the prediction date
      - '6month'  # Last six months prior to the prediction date
      - '1year'   # Las year prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named b_bkgs_aggregation_imputed.

It should have the following columns on it:

  • entity_id

  • as_of_date

  • b_bkgs_entity_id_all_dsf_max

  • b_bkgs_entity_id_1month_dsf_max

  • b_bkgs_entity_id_3month_dsf_max

  • b_bkgs_entity_id_6month_dsf_max

  • b_bkgs_entity_id_1year_dsf_max

  • b_bkgs_entity_id_2years_dsf_max

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_bkgs_entity_id_3month_dsf_max is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction.
--we can check for imputations also, but in here we are 
--focusing on those that had a value.
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_bkgs_entity_id_3month_dsf_max
	from features.b_bkgs_aggregation_imputed
	where as_of_date = '2022-01-01'
	and b_bkgs_entity_id_3month_dsf_max > 0 
  and b_bkgs_entity_id_3month_dsf_max < 999999
	limit 3
)

--with those joids check that the calculations match.
select 
	joid, 
  --subtract the event date from the prediction date
	max(
    case 
      when start_date is not null then '2022-01-01'::date - start_date
      else 999999
    end
  ) as manual_calculation,
  --we are comparing with what Triage calculated 
	b_bkgs_entity_id_3month_dsf_max as triage_calculation
from semantic.jail_bookings a 
join examples b 
on a.joid = b.entity_id
where start_date between '2022-01-01'::date - interval '3month' and '2022-01-01'
group by 1, 3

Days between events (DBE)#

This recipe will help you to define behavioral features that care about how long it has passed between events, think about it as measuring the gap of time between events of interest. For example:

  • What’s the average number of months between severe behavioral health episodes?

  • How many days passed between the last two ambulance runs related to suicide attempts?

  • What’s the minimum number of weeks between calls to the mental health center?

You can generate these type of features for different time units that make sense on your context (e.g., days, weeks, months, years) as well as different intervals (e.g., last 3 months, last year, etc.).

🥕 Ingredients

You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • An event registered with a timestamp

👩‍🍳 How to cook

feature_aggregations:
  -
    prefix: 'b_all_event_gaps' #features for gap between different events
    # Query that defines the input for generating the features. In this case, all
    # individuals that had an event and an ID.
    # we are also calculating the lag between events (window function) 
    # More on window functions can be found here: 
    # https://www.postgresql.org/docs/current/functions-window.html
    from_obj: |
      (SELECT ce.joid::INT AS entity_id,
             ce.event_date AS knowledge_date,
             LAG(knowledge_date) OVER (PARTITION BY entity_id ORDER BY knowledge_date) as lagged_date
      FROM semantic.client_events ce where ce.joid is not null) AS event_gaps
    # Identifies the column that holds the information of when the event happened 
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: ce.event_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date'
    # General imputation rule to apply when no specific rule is defined 
    # for a specific feature. In this case, if there is a missing value
    # the average of the cohort will be used. 
    aggregates_imputation:
      all:
        type: 'mean'
    aggregates:
      -
        # Defining the feature days between events calculated   
        # by subtracting from the event date the date for the 
        # previous event.
        quantity:
          days_btwn: | 
            (knowledge_date - lagged_date)::INT
        # We are interested on getting the max, min, and average 
        # number of days between events. 
        metrics:
          - 'avg'
          - 'max'
          - 'min'   
    # The periods for which this feature will be generated.
    intervals:
      - 'all'     # All available historical data prior to the prediction date 
      - '1month'  # Last month prior to the prediction date
      - '3month'  # Last three months prior to the prediction date
      - '6month'  # Last six months prior to the prediction date
      - '1year'   # Last year prior to the prediction date
      - '3years'  # Last three years prior to the prediction date
      - '5years'  # Last five years prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named b_all_event_gaps_aggregation_imputed.

It should have the following columns on it:

  • entity_id

  • as_of_date

  • b_all_event_gaps_entity_id_all_days_btwn_avg

  • b_all_event_gaps_entity_id_all_days_btwn_max

  • b_all_event_gaps_entity_id_all_days_btwn_min

  • b_all_event_gaps_entity_id_1month_days_btwn_avg

  • b_all_event_gaps_entity_id_1month_days_btwn_max

  • b_all_event_gaps_entity_id_1month_days_btwn_min

  • b_all_event_gaps_entity_id_3month_days_btwn_avg

  • b_all_event_gaps_entity_id_3month_days_btwn_max

  • b_all_event_gaps_entity_id_3month_days_btwn_min

  • b_all_event_gaps_entity_id_6month_days_btwn_avg

  • b_all_event_gaps_entity_id_6month_days_btwn_max

  • b_all_event_gaps_entity_id_6month_days_btwn_min

  • b_all_event_gaps_entity_id_1year_days_btwn_avg

  • b_all_event_gaps_entity_id_1year_days_btwn_max

  • b_all_event_gaps_entity_id_1year_days_btwn_min

  • b_all_event_gaps_entity_id_3years_days_btwn_avg

  • b_all_event_gaps_entity_id_3years_days_btwn_max

  • b_all_event_gaps_entity_id_3years_days_btwn_min

  • b_all_event_gaps_entity_id_5years_days_btwn_avg

  • b_all_event_gaps_entity_id_5years_days_btwn_max

  • b_all_event_gaps_entity_id_5years_days_btwn_min

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify the feature b_all_event_gaps_entity_id_3year_days_btwn_avg is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction.
--we can check for imputations also, but in here we are 
--focusing on those that had a value.
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 b_all_event_gaps_entity_id_3years_days_btwn_avg
	from features.b_all_event_gaps_aggregation_imputed
	where as_of_date = '2022-01-01'
	and b_all_event_gaps_entity_id_3years_days_btwn_avg > 0 
	limit 3
),

--with those joids check that the calculations match.
manual_calculation as (
  select 
    joid, 
    --generate the lag between dates for each individual,  
    --calculate the number of days that have passed between events 
    (event_date - LAG(event_date) OVER (PARTITION BY joid ORDER BY event_date))::INT as manual_calculation,
    --we are comparing with what Triage calculated 
    b_all_event_gaps_entity_id_3years_days_btwn_avg as triage_calculation
  from semantic.client_events a 
  join examples b 
  on a.joid = b.entity_id
  where event_date between '2022-01-01'::date - interval '3years' and '2022-01-01'
)

--get the average of number of days calculated for 
--each individual. And get the max of Triage's 
--calculation (it is the same number).
select 
  joid, 
  avg(manual_calculation) as manual_calculation, 
  max(triage_calculation) as triage_calculation
from manual_calculation 
group by 1;

Adding conditions within a feature definition#

Often we found ourselves with the necessity to generate features based on specific conditions that allows us to extract richer information. For example, when we try to generate a flag indicator when particular behavioral health county services has been used by an individual. In this case, we can add conditions to generate these flags (and other type of features) outside of the from_obj.

Flag features#

🥕 Ingredients You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • A defined from_obj

👩‍🍳 How to cook

feature_aggregations:
  - 
    prefix: 'd_svcs' # features for doco services from a variety of sources
    # Query that defines the input for generating the features. In this case, all
    # individuals with an ID that had any type of service. 
    from_obj: 
      (SELECT ce.joid::INT AS entity_id,
             ce.event_date AS knowledge_date,
             ce.servicetype, ce.serviceproviderorg
      FROM semi_clean.joco110hsccclientservice2 ce where ce.joid is not null) AS svcs
    # Identifies the column that holds the information of when the event happened
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: ce.event_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date'
    aggregates:
      - 
        quantity:
          # We are defining a feature sum_cjs where we would like to 
          # signal if an individual interacted with spcific county services. 
          # To do that, we are adding a condition to the table generated from the from_obj
          # to identify those that interacted with specific services of interest 
          # like Community corrections, Parole, and Supervision. 
          # If the individaul interacted with any of those services then we are 
          # assigning a 1, 0 otherwise. 
          sum_cjs: |
                      case when (servicetype = 'Community Corrections' or servicetype = 'Parole' 
                      or servicetype = 'Supervision') then 1 else 0 end
        metrics: 
          # We are going to count how many times an individual interacted with those type
          # of services (sum all ones).
          - 'sum'
        # We are defining the imputation rule to follow in case the servicetype column is empty
        # to 0, which means that the individual didn't have an interaction with any of the 
        # services of interest.
        imputation:
          sum:
            type: 'zero'
    # The periods for which this feature will be generated. 
    intervals:
    - '1month'  # Last month prior to the prediction date
    - '6month'  # Last six months prior to the prediction date 
    - '1year'   # Last year prior to the prediction date
    - 'all'     # All available historical data prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named d_svcs_aggregation_imputed and look for the following columns:

  • entity_id

  • as_of_date

  • d_svcs_entity_id_1month_sum_cjs_sum

  • d_svcs_entity_id_6month_sum_cjs_sum

  • d_svcs_entity_id_1year_sum_cjs_sum

  • d_svcs_entity_id_all_sum_cjs_sum

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that, you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify for the feature group d_svcs if the feature d_svcs_entity_id_6month_sum_cjs_sum is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 d_svcs_entity_id_6month_sum_cjs_sum
	from features.d_svcs_aggregation_imputed
	where as_of_date = '2022-01-01'  --date of prediction, aligned with the time splits defines on your experiment setup timechops
	-- the specific feature we want to verify, in this case we are asking 
  -- for individuals that had a sum of 2 days in jail in the last 6 months
  and d_svcs_entity_id_6month_sum_cjs_sum = 2
	limit 3
)

--with those joids check that the calculations match.
select 
  joid, 
  --we need to check on the original column, in the same way we defined
  --how to calculated on Triage.
  sum(
    case 
      when (servicetype = 'Community Corrections' or servicetype = 'Parole' 
                        or servicetype = 'Supervision') then 1 
      else 0 
    end
  ) as sum_cjs_manully, 
  d_svcs_entity_id_6month_sum_cjs_sum as sum_cjs_triage_calculation 
from semi_clean.joco110hsccclientservice2 a --from from_obj statement
join examples b 
on a.joid = b.entity_id
--the event must have occurred within 1 year prior to the as of date
and event_date between '2022-01-01'::date - interval '6month' and '2022-01-01'::date
group by 1, 3

Numeric features#

🥕 Ingredients You will need:

  • A configuration file with the features_aggregations section

  • A defined group feature

  • A defined knowledge_date_column

  • A defined from_obj

👩‍🍳 How to cook

feature_aggregations:
  -
    prefix: 'j_juv' #features for services in johnson county
    # Query that defines the input for generating the features. In this case, all
    # individuals with an ID from the inmates data source. 
    from_obj: |
      (select j.joid::INT as entity_id,
        j.event_date as knowledge_date, *
        from semi_clean.jocojimsjuvinmatedata j where j.joid is not null ) AS juv_bkgs
    # Identifies the column that holds the information of when the event happened
    # In this case the name comes from the second line of the query defined 
    # on the from_obj: j.event_date renamed as knowledge_date
    knowledge_date_column: 'knowledge_date'
    # General imputation rule to apply in case a specific feature doesn't have 
    # its own imputation rule defined. In this case, if there's a missing value
    # it will indicate that is an error. Usefull when doing sanity checks on 
    # the data used on the model.
    aggregates_imputation:
        all:
          type: error
    aggregates:
      - 
        # This feature called jtime calculates the amount ouf days that an individual has been at jail. 
        # It could happen that we don't have the release date, in that case we use the current date as 
        # the release date and then subtract the date of booking to get the total number of days an 
        # individual was in jail. 
        quantity: 
          jtime: "CASE WHEN release_date_88 = '' THEN CURRENT_DATE - bk_dt_40::date ELSE release_date_88::date - bk_dt_40::date END "
        # We are interested in getting the average number of days an individual has been in jail
        # which covers the cases on which the same individual has been more than once in jail, and 
        # also the total amount of days spent in jail for a given period of time. 
        metrics:
          - 'avg' 
          - 'sum' 
        imputation:
          # specific imputation rule when calculating the average. In case of missing 
          # values, use 0 (no jail days)
          avg:
            type: 'constant'
            value: 0
          # specific imputation rule when calculating the total number of days in jail. 
          # In case of missing values, use 0 (no jail in days)
          sum:
            type: 'constant'
            value: 0
    # The periods for which this feature will be generated. 
    intervals:
      - '1month' # Last month prior to the prediction date
      - '6month' # Last six months prior to the prediction date
      - '1year'  # Last year prior to the prediction date
      - 'all'    # All available historical data prior to the prediction date

🍲 What to look for

Once Triage run, look in the features schema for the table generated by Triage named j_services_aggregation_imputed and look for the following columns:

  • entity_id

  • as_of_date

  • j_juv_entity_id_all_jtime_avg

  • j_juv_entity_id_all_jtime_sum

  • j_juv_entity_id_1month_jtime_avg

  • j_juv_entity_id_1month_jtime_sum

  • j_juv_entity_id_6month_jtime_avg

  • j_juv_entity_id_6month_jtime_sum

  • j_juv_entity_id_1year_jtime_avg

  • j_juv_entity_id_1year_jtime_sum

If there were imputations then columns with the suffix _imp will also be generated on that table.

It is always a good idea to check that the values registered on those columns are correct. To do that, you get a couple of entities from your from_obj table and check if the calculations match when calculating manually. We would like to verify for the feature group j_juv if the feature j_juv_entity_id_6month_jtime_avg is correct for a given as_of_date.

--get a couple of joids where the feature of interest has 
--a value greater than 0 with one specific date of prediction. 
with examples as(
	select 
	 entity_id,
	 as_of_date,
	 j_juv_entity_id_6month_jtime_avg
	from features.j_juv_aggregation_imputed
	where as_of_date = '2022-01-01'  --date of prediction, aligned with the time splits defines on your experiment setup timechops
	-- the specific feature we want to verify, in this case we are asking 
  -- for individuals that had been in jail (their average number of days in jail is greater than 0)
  and j_juv_entity_id_6month_jtime_avg > 0
	limit 3
)

--with those joids check that the calculations match.
select 
  joid, 
  --we need to check on the original column, in the same way we defined
  --how to calculated on Triage.
  avg(
    case 
      when release_date_88 = '' then CURRENT_DATE - bk_dt_40::date 
      else release_date_88::date - bk_dt_40::date 
    end
  ) as jtime_avg_manually, 
  j_juv_entity_id_6month_jtime_avg as avg_jtime_triage_calculation 
from semi_clean.jocojimsjuvinmatedata a  --from from_obj statement
join examples b 
on a.joid = b.entity_id
--the event must have occurred within 1 year prior to the as of date
and event_date between '2022-01-01'::date - interval '6month' and '2022-01-01'::date
group by 1, 3

Advanced recipes#

Most recent value#

Here you can find an example on how to calculate the most recent value of a variable outside of Triage, and then use it in Triage.