Machine learning governance#
When triage executes the experiment, it creates a series of new schemas for storing the copious output of the experiment. These schemas are test_results, train_results, and triage_metadata. They store the metadata of the trained models (the features, parameters, and hyperparameters used in their training), as well as the predictions and evaluations of the models on the test sets.
The schema triage_metadata is composed of the following tables:
\dt triage_metadata.*
List of relations

Schema | Name | Type | Owner
---|---|---|---
triage_metadata | experiment_matrices | table | food_user
triage_metadata | experiment_models | table | food_user
triage_metadata | experiments | table | food_user
triage_metadata | list_predictions | table | food_user
triage_metadata | matrices | table | food_user
triage_metadata | model_groups | table | food_user
triage_metadata | models | table | food_user
The tables contained in test_results are:
\dt test_results.*
List of relations

Schema | Name | Type | Owner
---|---|---|---
test_results | aequitas | table | food_user
test_results | evaluations | table | food_user
test_results | individual_importances | table | food_user
test_results | predictions | table | food_user
Lastly, if you are interested in how the models performed on the training data sets, you can consult train_results:
\dt train_results.*
List of relations

Schema | Name | Type | Owner
---|---|---|---
train_results | aequitas | table | food_user
train_results | evaluations | table | food_user
train_results | feature_importances | table | food_user
train_results | predictions | table | food_user
What are all the results tables about?#
model_groups stores the algorithm (model_type), the hyperparameters (hyperparameters), and the features shared by a particular set of models. models contains data specific to each model: its model group (use model_group_id to link a model to its model group), temporal information (like train_end_time), and the train matrix UUID (train_matrix_uuid). This UUID is important because it is the name of the file in which the matrix is stored.
Lastly, {train, test}_results.predictions contains all the scores generated by every model for every entity, and {train, test}_results.evaluations stores the value of every metric specified in the scoring section of the config file, for every model.
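For example, here is a minimal sketch of how these tables link together (using the triage_metadata schema created by this run):

select
    mg.model_group_id,
    mg.model_type,
    m.model_id,
    m.train_end_time,
    m.train_matrix_uuid
from
    triage_metadata.model_groups as mg
    join triage_metadata.models as m using (model_group_id);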
triage_metadata.experiments#
This table has two columns: experiment_hash and config.
\d triage_metadata.experiments
Table "modelmetadata.experiments" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
experimenthash | character varying | not null | ||
config | jsonb | |||
Indexes: | ||||
"experimentspkey" PRIMARY KEY, btree (experimenthash) | ||||
Referenced by: | ||||
TABLE "modelmetadata.experimentmatrices" CONSTRAINT "experimentmatricesexperimenthashfkey" FOREIGN KEY (experimenthash) REFERENCES modelmetadata.experiments(experimenthash) | ||||
TABLE "modelmetadata.experimentmodels" CONSTRAINT "experimentmodelsexperimenthashfkey" FOREIGN KEY (experimenthash) REFERENCES modelmetadata.experiments(experimenthash) | ||||
TABLE "modelmetadata.matrices" CONSTRAINT "matricesbuiltbyexperimentfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) | ||||
TABLE "modelmetadata.models" CONSTRAINT "modelsexperimenthashfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) |
experiment_hash contains the hash of the configuration file that we used for our triage run.1 config contains that experiment configuration file, stored as jsonb.
select experiment_hash,
config -> 'user_metadata' as user_metadata
from triage_metadata.experiments;
experiment_hash | user_metadata
---|---
67a1d564d31811b9c20ca63672c25abd | {"org": "DSaPP", "team": "Tutorial", "author": "Adolfo De Unanue", "etl_date": "2019-02-21", "experiment_type": "test", "label_definition": "failed_inspection"}
A useful tip: if we are interested in all the models that resulted from a certain config, we can look up that config in triage_metadata.experiments and then use its experiment_hash on other tables to find all the models that resulted from that configuration.
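Here is a sketch of that lookup (the built_by_experiment column of triage_metadata.models stores this hash; the hash value below is the one from the run shown above):

select
    m.model_id,
    m.model_group_id,
    m.model_type
from
    triage_metadata.experiments as e
    join triage_metadata.models as m
      on m.built_by_experiment = e.experiment_hash
where
    e.experiment_hash = '67a1d564d31811b9c20ca63672c25abd';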
triage_metadata.model_groups#
Do you remember how we defined in grid_config the different classifiers that we want triage to train? For example, a configuration file could contain the following:
'sklearn.tree.DecisionTreeClassifier':
criterion: ['entropy']
max_depth: [1, 2, 5, 10]
random_state: [2193]
By doing so, we are saying that we want to train 4 decision trees (max_depth is one of 1, 2, 5, 10). However, remember that we are using temporal cross-validation to build our models, so we are going to have different temporal slices that we train models on, e.g., 2010-2011, 2011-2012, etc.
We are therefore going to train our four decision trees on each temporal slice: the trained model (the instance of that configuration) changes across temporal splits, but the configuration remains the same. This table keeps track of the different configurations (model_groups) and gives each one an id (model_group_id). We can use the model_group_id to find all the models that were trained using the same config but on different slices of time.
In our simple test configuration file we have:
'sklearn.dummy.DummyClassifier':
strategy: [most_frequent]
Therefore, if we run the following query:
select
model_group_id,
model_type,
hyperparameters,
model_config -> 'feature_groups' as feature_groups,
model_config -> 'cohort_name' as cohort,
model_config -> 'label_name' as label,
model_config -> 'label_definition' as label_definition,
model_config -> 'experiment_type' as experiment_type,
model_config -> 'etl_date' as etl_date
from
triage_metadata.model_groups;
model_group_id | model_type | hyperparameters | feature_groups | cohort | label | label_definition | experiment_type | etl_date
---|---|---|---|---|---|---|---|---
1 | sklearn.dummy.DummyClassifier | {"strategy": "most_frequent"} | ["prefix: results", "prefix: risks", "prefix: inspections"] | "test_facilities" | "failed_inspections" | "failed_inspection" | "test" | "2019-02-21"
You can see that a model group is defined by the classifier (model_type), its hyperparameters (hyperparameters), the features (feature_list, not shown), and the model_config.
The field model_config is created using information from the model_group_keys block. In our test configuration file that block is:
model_group_keys:
- 'class_path'
- 'parameters'
- 'feature_names'
- 'feature_groups'
- 'cohort_name'
- 'state'
- 'label_name'
- 'label_timespan'
- 'training_as_of_date_frequency'
- 'max_training_history'
- 'label_definition'
- 'experiment_type'
- 'org'
- 'team'
- 'author'
- 'etl_date'
What can we learn from that? For example, if we add a new feature and rerun triage, triage will create a new model_group, even if the classifier and the hyperparameters are the same as before.
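As a quick check (a sketch using only the columns described in this section), you can count how many trained models each model group ended up with:

select
    mg.model_group_id,
    mg.model_type,
    count(m.model_id) as number_of_models
from
    triage_metadata.model_groups as mg
    left join triage_metadata.models as m using (model_group_id)
group by
    mg.model_group_id,
    mg.model_type;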
triage_metadata.models#
This table stores the information about our actual models, i.e., instances of our classifiers trained on specific temporal slices.
\d triage_metadata.models
Table "modelmetadata.models" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
modelid | integer | not null | nextval('modelmetadata.modelsmodelidseq'::regclass) | |
modelgroupid | integer | |||
modelhash | character varying | |||
runtime | timestamp without time zone | |||
batchruntime | timestamp without time zone | |||
modeltype | character varying | |||
hyperparameters | jsonb | |||
modelcomment | text | |||
batchcomment | text | |||
config | json | |||
builtbyexperiment | character varying | |||
trainendtime | timestamp without time zone | |||
test | boolean | |||
trainmatrixuuid | text | |||
traininglabeltimespan | interval | |||
modelsize | real | |||
Indexes: | ||||
"modelspkey" PRIMARY KEY, btree (modelid) | ||||
"ixresultsmodelsmodelhash" UNIQUE, btree (modelhash) | ||||
Foreign-key constraints: | ||||
"matrixuuidformodels" FOREIGN KEY (trainmatrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
"modelsexperimenthashfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) | ||||
"modelsmodelgroupidfkey" FOREIGN KEY (modelgroupid) REFERENCES modelmetadata.modelgroups(modelgroupid) | ||||
Referenced by: | ||||
TABLE "testresults.evaluations" CONSTRAINT "evaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "trainresults.featureimportances" CONSTRAINT "featureimportancesmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "testresults.individualimportances" CONSTRAINT "individualimportancesmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "modelmetadata.listpredictions" CONSTRAINT "listpredictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "testresults.predictions" CONSTRAINT "predictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "trainresults.evaluations" CONSTRAINT "trainevaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "trainresults.predictions" CONSTRAINT "trainpredictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) |
Noteworthy columns are:
- model_id: The id of the model (i.e., the instance). We will use this ID to trace a model evaluation back to a model_group and vice versa.
- model_group_id: The id of the model's model group, which we encountered above.
- model_hash: The hash of our model. We can use this hash to load the actual model; it gets stored under TRIAGE_OUTPUT_PATH/trained_models/{model_hash}. We are going to use this later to look at a trained decision tree.
- run_time: Time when the model was trained.
- model_type: The algorithm used for training.
- model_comment: Literally the text in the model_comment block of the configuration file.
- hyperparameters: Hyperparameters used for the model configuration.
- built_by_experiment: The hash of our experiment. We encountered this value in the triage_metadata.experiments table before.
- train_end_time: When building the training matrix, we included training samples up to this date.
- train_matrix_uuid: The hash of the matrix that we used to train this model. The matrix gets stored as csv under TRIAGE_OUTPUT_PATH/matrices/{train_matrix_uuid}.csv. This is helpful when trying to inspect the matrix and features that were used for training.
- training_label_timespan: How big was our window for getting the labels of our training matrix? For example, a training_label_timespan of 1 year means that, for a given date in the training matrix, we look one year into the future to find the label for that training sample.
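For instance, here is a small sketch that lists, for each trained model, where its pickle and its training matrix live on disk (the relative paths assume the TRIAGE_OUTPUT_PATH layout described above):

select
    model_id,
    model_group_id,
    train_end_time,
    'trained_models/' || model_hash as model_file,
    'matrices/' || train_matrix_uuid || '.csv' as train_matrix_file
from
    triage_metadata.models
order by
    train_end_time;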
triage_metadata.matrices#
This table contains information about the matrices used for training and testing the models. You can use this information to debug your models. Important columns are matrix_uuid (the matrix gets stored as TRIAGE_OUTPUT_PATH/matrices/{matrix_uuid}.csv), matrix_type (indicates whether the matrix was used for training models or for testing them), lookback_duration and feature_start_time (which give information about the temporal setting of the features), and num_observations (the size of the matrix).
\d triage_metadata.matrices
Table "modelmetadata.matrices" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
matrixid | character varying | |||
matrixuuid | character varying | not null | ||
matrixtype | character varying | |||
labelingwindow | interval | |||
numobservations | integer | |||
creationtime | timestamp with time zone | now() | ||
lookbackduration | interval | |||
featurestarttime | timestamp without time zone | |||
matrixmetadata | jsonb | |||
builtbyexperiment | character varying | |||
Indexes: | ||||
"matricespkey" PRIMARY KEY, btree (matrixuuid) | ||||
"ixmodelmetadatamatricesmatrixuuid" UNIQUE, btree (matrixuuid) | ||||
Foreign-key constraints: | ||||
"matricesbuiltbyexperimentfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) | ||||
Referenced by: | ||||
TABLE "testresults.evaluations" CONSTRAINT "evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "trainresults.evaluations" CONSTRAINT "evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "modelmetadata.models" CONSTRAINT "matrixuuidformodels" FOREIGN KEY (trainmatrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "testresults.predictions" CONSTRAINT "matrixuuidfortestpred" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "trainresults.predictions" CONSTRAINT "matrixuuidfortrainpred" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "trainresults.predictions" CONSTRAINT "trainpredictionsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) |
{test, train}_results.evaluations#
These tables let us analyze how well our models are doing. Based on the config that we used for our triage run, triage calculates metrics and stores them in these tables, e.g., our model's precision at the top 10%.
\d test_results.evaluations
Table "testresults.evaluations" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
modelid | integer | not null | ||
evaluationstarttime | timestamp without time zone | not null | ||
evaluationendtime | timestamp without time zone | not null | ||
asofdatefrequency | interval | not null | ||
metric | character varying | not null | ||
parameter | character varying | not null | ||
value | numeric | |||
numlabeledexamples | integer | |||
numlabeledabovethreshold | integer | |||
numpositivelabels | integer | |||
sortseed | integer | |||
matrixuuid | text | |||
Indexes: | ||||
"evaluationspkey" PRIMARY KEY, btree (modelid, evaluationstarttime, evaluationendtime, asofdatefrequency, metric, parameter) | ||||
Foreign-key constraints: | ||||
"evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
"evaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) |
Its columns are:
- model_id: Our beloved model_id that we have encountered before.
- evaluation_start_time: After training the model, we evaluate it on a test matrix. This column tells us the earliest time that an example in our test matrix could have.
- evaluation_end_time: Similarly, this column tells us the latest time that an example in our test matrix could have.
- metric: Indicates which metric we are evaluating, e.g., precision@.
- parameter: Indicates at which parameter we are evaluating our metric, e.g., a metric of precision@ and a parameter of 100.0_pct gives us the precision@100pct.
- value: The value observed for our metric@parameter.
- num_labeled_examples: The number of labeled examples in our test matrix. Why does it matter? It could be the case that we have entities that have no label for the test timeframe (for example, not all facilities will have an inspection). We still want to make predictions for these entities, but we can't include them when calculating performance metrics.
- num_labeled_above_threshold: How many of the examples above our threshold were labeled?
- num_positive_labels: The number of rows that had positive labels.
A look at the table shows that we have multiple rows for each model, each showing a different performance metric.
select
evaluation_end_time,
model_id,
metric || parameter as metric,
value,
num_labeled_examples,
num_labeled_above_threshold,
num_positive_labels
from
test_results.evaluations
where
parameter = '100.0_pct';
evaluation_end_time | model_id | metric | value | num_labeled_examples | num_labeled_above_threshold | num_positive_labels
---|---|---|---|---|---|---
2016-01-01 00:00:00 | 1 | precision@100.0_pct | 0.6666666666666666 | 3 | 3 | 2
2016-01-01 00:00:00 | 1 | recall@100.0_pct | 1.0 | 3 | 3 | 2
2017-01-01 00:00:00 | 2 | precision@100.0_pct | 0.3333333333333333 | 3 | 3 | 1
2017-01-01 00:00:00 | 2 | recall@100.0_pct | 1.0 | 3 | 3 | 1
Remember that at 100%, the recall should be 1, and the precision is equal to the base rate. If these two things don't match, there are problems in your data, pipeline, or ETL. You must get this correct!
What does this query tell us? We can now see how the different instances of a model group (trained on different temporal slices, but with the same model parameters) perform over time. Note that all of the models shown belong to our model group 1.
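If you had several model groups, a sketch of how you could restrict the evaluations to a single one (joining through triage_metadata.models) looks like this:

select
    e.evaluation_end_time,
    e.model_id,
    e.metric || e.parameter as metric,
    e.value
from
    test_results.evaluations as e
    join triage_metadata.models as m using (model_id)
where
    m.model_group_id = 1
    and e.parameter = '100.0_pct'
order by
    e.evaluation_end_time;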
{test, train}_results.aequitas#
Standard evaluation metrics don't tell us the entire story: What are the biases in our models? Which is the fairest model?
Given the bias_audit_config block in the experiment config, in which we defined which protected attributes we care about (e.g., ethnicity) and the specific thresholds at which our model is going to be used, Triage uses Aequitas to generate a bias report for each model and matrix, alongside the standard evaluation metrics.
The aequitas tables will have a row for each combination of:
- model_id
- subset_hash
- tie_breaker (e.g. best, worst)
- evaluation_start_time
- evaluation_end_time
- parameter (e.g. 25_abs, similar to evaluation metric thresholds)
- attribute_name (e.g. 'facility_type')
- attribute_value (e.g. 'kids_facility', 'restaurant')
For each row Aequitas calculates the following group metrics:
Metric | Formula | Description
---|---|---
Predicted Positive (PP) | PP_g | The number of entities within a group for which the decision (predicted label) is positive.
Total Predicted Positive (K) | K = sum of PP_g over all groups | The total number of entities predicted positive across the groups defined by the attribute.
Predicted Negative (PN) | PN_g | The number of entities within a group for which the decision is negative.
Predicted Prevalence (Pprev) | PP_g / group size | The fraction of entities within a group that were predicted as positive.
Predicted Positive Rate (PPR) | PP_g / K | The fraction of all entities predicted as positive that belong to a certain group.
False Positive (FP) | FP_g | The number of entities of the group predicted positive whose true label is negative.
False Negative (FN) | FN_g | The number of entities of the group predicted negative whose true label is positive.
True Positive (TP) | TP_g | The number of entities of the group predicted positive whose true label is positive.
True Negative (TN) | TN_g | The number of entities of the group predicted negative whose true label is negative.
False Discovery Rate (FDR) | FP_g / PP_g | The fraction of false positives of a group within the predicted positives of the group.
False Omission Rate (FOR) | FN_g / PN_g | The fraction of false negatives of a group within the predicted negatives of the group.
False Positive Rate (FPR) | FP_g / (FP_g + TN_g) | The fraction of false positives of a group within the labeled negatives of the group.
False Negative Rate (FNR) | FN_g / (FN_g + TP_g) | The fraction of false negatives of a group within the labeled positives of the group.
In the context of public policy and social good, we want to avoid providing fewer benefits to specific groups of entities if the intervention is assistive, and to avoid hurting specific groups more if the intervention is punitive. Therefore we define bias as a disparity measure of the group metric values of a given group when compared with a reference group. This reference group can be selected using different criteria. For instance, one could use the majority group (the largest of the groups defined by A), the group with the minimum group metric value, or the traditional approach of fixing a historically favored group (e.g., ethnicity: caucasian).
Each disparity metric for a given group is calculated as the ratio of the group's metric value to the value of that same metric for the reference group; for example, the FDR disparity of group g is FDR_g / FDR_ref.
To read about the bias metrics saved in this table, look at the Aequitas documentation.
Table "test_results.aequitas" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
model_id | integer | not null | ||
subset_hash | character varying | not null | ||
tie_breaker | character varying | not null | ||
evaluation_start_time | timestamp without time zone | not null | ||
evaluation_end_time | timestamp without time zone | not null | ||
matrix_uuid | text | |||
parameter | character varying | not null | ||
attribute_name | character varying | not null | ||
attribute_value | character varying | not null | ||
total_entities | integer | |||
group_label_pos | integer | |||
group_label_neg | integer | |||
group_size | integer | |||
group_size_pct | numeric | |||
prev | numeric | |||
pp | integer | |||
pn | integer | |||
fp | integer | |||
fn | integer | |||
tn | integer | |||
tp | integer | |||
ppr | numeric | |||
pprev | numeric | |||
tpr | numeric | |||
tnr | numeric | |||
for | numeric | |||
fdr | numeric | |||
fpr | numeric | |||
fnr | numeric | |||
npv | numeric | |||
precision | numeric | |||
ppr_disparity | numeric | |||
ppr_ref_group_value | character varying | |||
pprev_disparity | numeric | |||
pprev_ref_group_value | character varying | |||
precision_disparity | numeric | |||
precision_ref_group_value | character varying | |||
fdr_disparity | numeric | |||
fdr_ref_group_value | character varying | |||
for_disparity | numeric | |||
for_ref_group_value | character varying | |||
fpr_disparity | numeric | |||
fpr_ref_group_value | character varying | |||
fnr_disparity | numeric | |||
fnr_ref_group_value | character varying | |||
tpr_disparity | numeric | |||
tpr_ref_group_value | character varying | |||
tnr_disparity | numeric | |||
tnr_ref_group_value | character varying | |||
npv_disparity | numeric | |||
npv_ref_group_value | character varying | |||
Statistical_Parity | boolean | |||
Impact_Parity | boolean | |||
FDR_Parity | boolean | |||
FPR_Parity | boolean | |||
FOR_Parity | boolean | |||
FNR_Parity | boolean | |||
TypeI_Parity | boolean | |||
TypeII_Parity | boolean | |||
Equalized_Odds | boolean | |||
Unsupervised_Fairness | boolean | |||
Supervised_Fairness | boolean |
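For example, a sketch of a query (using the columns above) that compares the false discovery rate and its disparity across groups for a given model:

select
    attribute_name,
    attribute_value,
    group_size,
    fdr,
    fdr_disparity,
    fdr_ref_group_value
from
    test_results.aequitas
where
    model_id = 1
order by
    attribute_name,
    attribute_value;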
{test, train}_results.predictions#
You can think of the previous tables as summaries of the individual predictions our models are making. But where can you find those individual predictions (so you can generate a list from them)? And where can you find the test matrix that the predictions are based on? Let us introduce you to the {test, train}_results.predictions tables.
Here is what the first few rows look like:
select
model_id,
entity_id,
as_of_date,
score,
label_value,
matrix_uuid
from
test_results.predictions
where
model_id = 1
order by score desc;
model_id | entity_id | as_of_date | score | label_value | matrix_uuid
---|---|---|---|---|---
1 | 229 | 2016-01-01 00:00:00 | 1.0 | 1 | cd0ae68d6ace43033b49ee0390c3583e
1 | 355 | 2016-01-01 00:00:00 | 1.0 | 1 | cd0ae68d6ace43033b49ee0390c3583e
1 | 840 | 2016-01-01 00:00:00 | 1.0 | 0 | cd0ae68d6ace43033b49ee0390c3583e
As you can see, the table contains our models' predictions for a given entity and date.
And do you notice the field matrix_uuid? Doesn't it look similar to the fields from above that gave us the names of our training matrices? In fact, it is the same. You can find the test matrix that was used to make these predictions under TRIAGE_OUTPUT_PATH/matrices/{matrix_uuid}.csv.
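Here is a quick sketch of how you could summarize these scores per model, e.g., to check how many entities were scored on each as_of_date:

select
    model_id,
    as_of_date,
    count(*) as number_of_entities,
    avg(score) as average_score
from
    test_results.predictions
group by
    model_id,
    as_of_date
order by
    model_id,
    as_of_date;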
train_results.feature_importances#
This table stores the feature importances of all the models.
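For example, a sketch of how you could look at the most important features of a model (assuming the feature and feature_importance columns this table typically carries):

select
    model_id,
    feature,
    feature_importance
from
    train_results.feature_importances
where
    model_id = 1
order by
    feature_importance desc
limit 10;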
Footnotes#
1 Literally from the configuration file. If you modify something, it will generate a new hash. Handle with care!