Machine learning governance#
When triage executes the experiment, it creates a series of new schemas for storing the copious output of the experiment. These schemas are test_results, train_results, and triage_metadata. They store the metadata of the trained models (the features, parameters, and hyperparameters used in their training), as well as the predictions and evaluations of the models on the test sets.
The schema triage_metadata is composed of the following tables:
\dt triage_metadata.*
List of relations

Schema | Name | Type | Owner
---|---|---|---
triage_metadata | experiment_matrices | table | food_user
triage_metadata | experiment_models | table | food_user
triage_metadata | experiments | table | food_user
triage_metadata | list_predictions | table | food_user
triage_metadata | matrices | table | food_user
triage_metadata | model_groups | table | food_user
triage_metadata | models | table | food_user
The tables contained in test_results are:
\dt test_results.*
List of relations

Schema | Name | Type | Owner
---|---|---|---
test_results | aequitas | table | food_user
test_results | evaluations | table | food_user
test_results | individual_importances | table | food_user
test_results | predictions | table | food_user
Lastly, if you are interested in how the models performed on the training data sets, you can consult train_results:
\dt train_results.*
List of relations

Schema | Name | Type | Owner
---|---|---|---
train_results | aequitas | table | food_user
train_results | evaluations | table | food_user
train_results | feature_importances | table | food_user
train_results | predictions | table | food_user
What are all the results tables about?#
model_groups stores the algorithm (model_type), the hyperparameters (hyperparameters), and the features shared by a particular set of models. models contains data specific to each model: its model group (use model_group_id to link a model to its model group), temporal information (like train_end_time), and the train matrix UUID (train_matrix_uuid). This UUID is important because it is the name of the file in which the matrix is stored.
Lastly, {train, test}_results.predictions contains all the scores generated by every model for every entity, and {train, test}_results.evaluations stores the value of every metric specified in the scoring section of the config file, for every model.
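For example, here is a minimal sketch of how these tables link together (using the triage_metadata schema created by this run):

select
    mg.model_group_id,
    mg.model_type,
    m.model_id,
    m.train_end_time,
    m.train_matrix_uuid
from
    triage_metadata.model_groups as mg
    join triage_metadata.models as m using (model_group_id);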
triage_metadata.experiments#
This table has two columns: experiment_hash and config.
\d triage_metadata.experiments
Table "modelmetadata.experiments" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
experimenthash | character varying | not null | ||
config | jsonb | |||
Indexes: | ||||
"experimentspkey" PRIMARY KEY, btree (experimenthash) | ||||
Referenced by: | ||||
TABLE "modelmetadata.experimentmatrices" CONSTRAINT "experimentmatricesexperimenthashfkey" FOREIGN KEY (experimenthash) REFERENCES modelmetadata.experiments(experimenthash) | ||||
TABLE "modelmetadata.experimentmodels" CONSTRAINT "experimentmodelsexperimenthashfkey" FOREIGN KEY (experimenthash) REFERENCES modelmetadata.experiments(experimenthash) | ||||
TABLE "modelmetadata.matrices" CONSTRAINT "matricesbuiltbyexperimentfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) | ||||
TABLE "modelmetadata.models" CONSTRAINT "modelsexperimenthashfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) |
experiment_hash contains the hash of the configuration file that we used for our triage run.1 config contains that experiment configuration file, stored as jsonb.
select experiment_hash,
config -> 'user_metadata' as user_metadata
from triage_metadata.experiments;
experiment_hash | user_metadata
---|---
67a1d564d31811b9c20ca63672c25abd | {"org": "DSaPP", "team": "Tutorial", "author": "Adolfo De Unanue", "etl_date": "2019-02-21", "experiment_type": "test", "label_definition": "failed_inspection"}
A useful tip: if we are interested in all the models that resulted from a certain config, we can look up that config in triage_metadata.experiments and then use its experiment_hash on other tables to find all the models that resulted from that configuration.
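Here is a sketch of that lookup (the built_by_experiment column of triage_metadata.models stores this hash; the hash value below is the one from the run shown above):

select
    m.model_id,
    m.model_group_id,
    m.model_type
from
    triage_metadata.experiments as e
    join triage_metadata.models as m
      on m.built_by_experiment = e.experiment_hash
where
    e.experiment_hash = '67a1d564d31811b9c20ca63672c25abd';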
triage_metadata.model_groups#
Do you remember how we defined in grid_config the different classifiers that we want triage to train? For example, a configuration file could contain the following:
'sklearn.tree.DecisionTreeClassifier':
criterion: ['entropy']
max_depth: [1, 2, 5, 10]
random_state: [2193]
By doing so, we are saying that we want to train 4 decision trees (max_depth is one of 1, 2, 5, 10). However, remember that we are using temporal cross-validation to build our models, so we are going to have different temporal slices that we train models on, e.g., 2010-2011, 2011-2012, etc.
We are therefore going to train our four decision trees on each temporal slice: the trained model (the instance of that configuration) changes across temporal splits, but the configuration remains the same. This table keeps track of the different configurations (model_groups) and gives each one an id (model_group_id). We can use the model_group_id to find all the models that were trained using the same config but on different slices of time.
In our simple test configuration file we have:
'sklearn.dummy.DummyClassifier':
strategy: [most_frequent]
Therefore, if we run the following query:
select
model_group_id,
model_type,
hyperparameters,
model_config -> 'feature_groups' as feature_groups,
model_config -> 'cohort_name' as cohort,
model_config -> 'label_name' as label,
model_config -> 'label_definition' as label_definition,
model_config -> 'experiment_type' as experiment_type,
model_config -> 'etl_date' as etl_date
from
triage_metadata.model_groups;
model_group_id | model_type | hyperparameters | feature_groups | cohort | label | label_definition | experiment_type | etl_date
---|---|---|---|---|---|---|---|---
1 | sklearn.dummy.DummyClassifier | {"strategy": "most_frequent"} | ["prefix: results", "prefix: risks", "prefix: inspections"] | "test_facilities" | "failed_inspections" | "failed_inspection" | "test" | "2019-02-21"
You can see that a model group is defined by the classifier (model_type), its hyperparameters (hyperparameters), the features (feature_list, not shown), and the model_config.
The field model_config is created using information from the model_group_keys block. In our test configuration file that block is:
model_group_keys:
- 'class_path'
- 'parameters'
- 'feature_names'
- 'feature_groups'
- 'cohort_name'
- 'state'
- 'label_name'
- 'label_timespan'
- 'training_as_of_date_frequency'
- 'max_training_history'
- 'label_definition'
- 'experiment_type'
- 'org'
- 'team'
- 'author'
- 'etl_date'
What can we learn from that? For example, if we add a new feature and rerun triage, triage will create a new model_group, even if the classifier and the hyperparameters are the same as before.
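As a quick check (a sketch using only the columns described in this section), you can count how many trained models each model group ended up with:

select
    mg.model_group_id,
    mg.model_type,
    count(m.model_id) as number_of_models
from
    triage_metadata.model_groups as mg
    left join triage_metadata.models as m using (model_group_id)
group by
    mg.model_group_id,
    mg.model_type;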
triage_metadata.models#
This table stores the information about our actual models, i.e., instances of our classifiers trained on specific temporal slices.
\d triage_metadata.models
Table "modelmetadata.models" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
modelid | integer | not null | nextval('modelmetadata.modelsmodelidseq'::regclass) | |
modelgroupid | integer | |||
modelhash | character varying | |||
runtime | timestamp without time zone | |||
batchruntime | timestamp without time zone | |||
modeltype | character varying | |||
hyperparameters | jsonb | |||
modelcomment | text | |||
batchcomment | text | |||
config | json | |||
builtbyexperiment | character varying | |||
trainendtime | timestamp without time zone | |||
test | boolean | |||
trainmatrixuuid | text | |||
traininglabeltimespan | interval | |||
modelsize | real | |||
Indexes: | ||||
"modelspkey" PRIMARY KEY, btree (modelid) | ||||
"ixresultsmodelsmodelhash" UNIQUE, btree (modelhash) | ||||
Foreign-key constraints: | ||||
"matrixuuidformodels" FOREIGN KEY (trainmatrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
"modelsexperimenthashfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) | ||||
"modelsmodelgroupidfkey" FOREIGN KEY (modelgroupid) REFERENCES modelmetadata.modelgroups(modelgroupid) | ||||
Referenced by: | ||||
TABLE "testresults.evaluations" CONSTRAINT "evaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "trainresults.featureimportances" CONSTRAINT "featureimportancesmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "testresults.individualimportances" CONSTRAINT "individualimportancesmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "modelmetadata.listpredictions" CONSTRAINT "listpredictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "testresults.predictions" CONSTRAINT "predictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "trainresults.evaluations" CONSTRAINT "trainevaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) | ||||
TABLE "trainresults.predictions" CONSTRAINT "trainpredictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) |
Noteworthy columns are:
- model_id: The id of the model (i.e., the instance). We will use this ID to trace a model evaluation back to a model_group and vice versa.
- model_group_id: The id of the model's model group, which we encountered above.
- model_hash: The hash of our model. We can use this hash to load the actual model; it gets stored under TRIAGE_OUTPUT_PATH/trained_models/{model_hash}. We are going to use this later to look at a trained decision tree.
- run_time: Time when the model was trained.
- model_type: The algorithm used for training.
- model_comment: Literally the text in the model_comment block of the configuration file.
- hyperparameters: Hyperparameters used for the model configuration.
- built_by_experiment: The hash of our experiment. We encountered this value in the triage_metadata.experiments table before.
- train_end_time: When building the training matrix, we included training samples up to this date.
- train_matrix_uuid: The hash of the matrix that we used to train this model. The matrix gets stored as csv under TRIAGE_OUTPUT_PATH/matrices/{train_matrix_uuid}.csv. This is helpful when trying to inspect the matrix and features that were used for training.
- training_label_timespan: How big was our window for getting the labels of our training matrix? For example, a training_label_timespan of 1 year means that, for a given date in the training matrix, we look one year into the future to find the label for that training sample.
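For instance, here is a small sketch that lists, for each trained model, where its pickle and its training matrix live on disk (the relative paths assume the TRIAGE_OUTPUT_PATH layout described above):

select
    model_id,
    model_group_id,
    train_end_time,
    'trained_models/' || model_hash as model_file,
    'matrices/' || train_matrix_uuid || '.csv' as train_matrix_file
from
    triage_metadata.models
order by
    train_end_time;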
triage_metadata.matrices#
This table contains information about the matrices used for training and testing the models. You can use this information to debug your models. Important columns are matrix_uuid (the matrix gets stored as TRIAGE_OUTPUT_PATH/matrices/{matrix_uuid}.csv), matrix_type (indicates whether the matrix was used for training models or for testing them), lookback_duration and feature_start_time (which give information about the temporal setting of the features), and num_observations (the size of the matrix).
\d triage_metadata.matrices
Table "modelmetadata.matrices" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
matrixid | character varying | |||
matrixuuid | character varying | not null | ||
matrixtype | character varying | |||
labelingwindow | interval | |||
numobservations | integer | |||
creationtime | timestamp with time zone | now() | ||
lookbackduration | interval | |||
featurestarttime | timestamp without time zone | |||
matrixmetadata | jsonb | |||
builtbyexperiment | character varying | |||
Indexes: | ||||
"matricespkey" PRIMARY KEY, btree (matrixuuid) | ||||
"ixmodelmetadatamatricesmatrixuuid" UNIQUE, btree (matrixuuid) | ||||
Foreign-key constraints: | ||||
"matricesbuiltbyexperimentfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash) | ||||
Referenced by: | ||||
TABLE "testresults.evaluations" CONSTRAINT "evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "trainresults.evaluations" CONSTRAINT "evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "modelmetadata.models" CONSTRAINT "matrixuuidformodels" FOREIGN KEY (trainmatrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "testresults.predictions" CONSTRAINT "matrixuuidfortestpred" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "trainresults.predictions" CONSTRAINT "matrixuuidfortrainpred" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
TABLE "trainresults.predictions" CONSTRAINT "trainpredictionsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) |
{test, train}_results.evaluations#
These tables let us analyze how well our models are doing. Based on the config that we used for our triage run, triage calculates metrics and stores them in these tables, e.g., our model's precision at the top 10%.
\d test_results.evaluations
Table "testresults.evaluations" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
modelid | integer | not null | ||
evaluationstarttime | timestamp without time zone | not null | ||
evaluationendtime | timestamp without time zone | not null | ||
asofdatefrequency | interval | not null | ||
metric | character varying | not null | ||
parameter | character varying | not null | ||
value | numeric | |||
numlabeledexamples | integer | |||
numlabeledabovethreshold | integer | |||
numpositivelabels | integer | |||
sortseed | integer | |||
matrixuuid | text | |||
Indexes: | ||||
"evaluationspkey" PRIMARY KEY, btree (modelid, evaluationstarttime, evaluationendtime, asofdatefrequency, metric, parameter) | ||||
Foreign-key constraints: | ||||
"evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid) | ||||
"evaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid) |
Its columns are:
- model_id: Our beloved model_id that we have encountered before.
- evaluation_start_time: After training the model, we evaluate it on a test matrix. This column tells us the earliest time that an example in our test matrix could have.
- evaluation_end_time: Similarly, this column tells us the latest time that an example in our test matrix could have.
- metric: Indicates which metric we are evaluating, e.g., precision@.
- parameter: Indicates at which parameter we are evaluating our metric, e.g., a metric of precision@ and a parameter of 100.0_pct gives us the precision@100pct.
- value: The value observed for our metric@parameter.
- num_labeled_examples: The number of labeled examples in our test matrix. Why does it matter? It could be the case that we have entities that have no label for the test timeframe (for example, not all facilities will have an inspection). We still want to make predictions for these entities, but we can't include them when calculating performance metrics.
- num_labeled_above_threshold: How many of the examples above our threshold were labeled?
- num_positive_labels: The number of rows that had positive labels.
A look at the table shows that we have multiple rows for each model, each showing a different performance metric.
select
evaluation_end_time,
model_id,
metric || parameter as metric,
value,
num_labeled_examples,
num_labeled_above_threshold,
num_positive_labels
from
test_results.evaluations
where
parameter = '100.0_pct';
evaluation_end_time | model_id | metric | value | num_labeled_examples | num_labeled_above_threshold | num_positive_labels
---|---|---|---|---|---|---
2016-01-01 00:00:00 | 1 | precision@100.0_pct | 0.6666666666666666 | 3 | 3 | 2
2016-01-01 00:00:00 | 1 | recall@100.0_pct | 1.0 | 3 | 3 | 2
2017-01-01 00:00:00 | 2 | precision@100.0_pct | 0.3333333333333333 | 3 | 3 | 1
2017-01-01 00:00:00 | 2 | recall@100.0_pct | 1.0 | 3 | 3 | 1
Remember that at 100%, the recall should be 1, and the precision is equal to the base rate. If these two things don't match, there are problems in your data, pipeline, or ETL. You must get this correct!
What does this query tell us? We can now see how the different instances of a model group (trained on different temporal slices, but with the same model parameters) perform over time. Note that all of the models shown belong to our model group 1.
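If you had several model groups, a sketch of how you could restrict the evaluations to a single one (joining through triage_metadata.models) looks like this:

select
    e.evaluation_end_time,
    e.model_id,
    e.metric || e.parameter as metric,
    e.value
from
    test_results.evaluations as e
    join triage_metadata.models as m using (model_id)
where
    m.model_group_id = 1
    and e.parameter = '100.0_pct'
order by
    e.evaluation_end_time;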
{test, train}_results.aequitas#
Standard evaluation metrics don't tell us the entire story: What are the biases in our models? Which is the fairest model?
Given the bias_audit_config block in the experiment config, in which we defined which protected attributes we care about (e.g., ethnicity) and the specific thresholds at which our model is going to be used, Triage uses Aequitas to generate a bias report for each model and matrix, alongside the standard evaluation metrics.
The aequitas tables will have a row for each combination of:
- model_id
- subset_hash
- tie_breaker (e.g. best, worst)
- evaluation_start_time
- evaluation_end_time
- parameter (e.g. 25_abs, similar to evaluation metric thresholds)
- attribute_name (e.g. 'facility_type')
- attribute_value (e.g. 'kids_facility', 'restaurant')
For each row Aequitas calculates the following group metrics:
Metric | Formula | Description
---|---|---
Predicted Positive (PP) | PP_g | The number of entities within a group for which the decision (predicted label) is positive.
Total Predicted Positive (K) | K = sum of PP_g over all groups | The total number of entities predicted positive across the groups defined by the attribute.
Predicted Negative (PN) | PN_g | The number of entities within a group for which the decision is negative.
Predicted Prevalence (Pprev) | PP_g / group size | The fraction of entities within a group that were predicted as positive.
Predicted Positive Rate (PPR) | PP_g / K | The fraction of all entities predicted as positive that belong to a certain group.
False Positive (FP) | FP_g | The number of entities of the group predicted positive whose true label is negative.
False Negative (FN) | FN_g | The number of entities of the group predicted negative whose true label is positive.
True Positive (TP) | TP_g | The number of entities of the group predicted positive whose true label is positive.
True Negative (TN) | TN_g | The number of entities of the group predicted negative whose true label is negative.
False Discovery Rate (FDR) | FP_g / PP_g | The fraction of false positives of a group within the predicted positives of the group.
False Omission Rate (FOR) | FN_g / PN_g | The fraction of false negatives of a group within the predicted negatives of the group.
False Positive Rate (FPR) | FP_g / (FP_g + TN_g) | The fraction of false positives of a group within the labeled negatives of the group.
False Negative Rate (FNR) | FN_g / (FN_g + TP_g) | The fraction of false negatives of a group within the labeled positives of the group.
In the context of public policy and social good, we want to avoid providing fewer benefits to specific groups of entities if the intervention is assistive, and to avoid hurting specific groups more if the intervention is punitive. Therefore we define bias as a disparity measure of the group metric values of a given group when compared with a reference group. This reference group can be selected using different criteria. For instance, one could use the majority group (the largest of the groups defined by A), the group with the minimum group metric value, or the traditional approach of fixing a historically favored group (e.g., ethnicity: caucasian).
Each disparity metric for a given group is calculated as the ratio of the group's metric value to the value of that same metric for the reference group; for example, the FDR disparity of group g is FDR_g / FDR_ref.
To read about the bias metrics saved in this table, look at the Aequitas documentation.
Table "test_results.aequitas" | ||||
---|---|---|---|---|
Column | Type | Collation | Nullable | Default |
model_id | integer | not null | ||
subset_hash | character varying | not null | ||
tie_breaker | character varying | not null | ||
evaluation_start_time | timestamp without time zone | not null | ||
evaluation_end_time | timestamp without time zone | not null | ||
matrix_uuid | text | |||
parameter | character varying | not null | ||
attribute_name | character varying | not null | ||
attribute_value | character varying | not null | ||
total_entities | integer | |||
group_label_pos | integer | |||
group_label_neg | integer | |||
group_size | integer | |||
group_size_pct | numeric | |||
prev | numeric | |||
pp | integer | |||
pn | integer | |||
fp | integer | |||
fn | integer | |||
tn | integer | |||
tp | integer | |||
ppr | numeric | |||
pprev | numeric | |||
tpr | numeric | |||
tnr | numeric | |||
for | numeric | |||
fdr | numeric | |||
fpr | numeric | |||
fnr | numeric | |||
npv | numeric | |||
precision | numeric | |||
ppr_disparity | numeric | |||
ppr_ref_group_value | character varying | |||
pprev_disparity | numeric | |||
pprev_ref_group_value | character varying | |||
precision_disparity | numeric | |||
precision_ref_group_value | character varying | |||
fdr_disparity | numeric | |||
fdr_ref_group_value | character varying | |||
for_disparity | numeric | |||
for_ref_group_value | character varying | |||
fpr_disparity | numeric | |||
fpr_ref_group_value | character varying | |||
fnr_disparity | numeric | |||
fnr_ref_group_value | character varying | |||
tpr_disparity | numeric | |||
tpr_ref_group_value | character varying | |||
tnr_disparity | numeric | |||
tnr_ref_group_value | character varying | |||
npv_disparity | numeric | |||
npv_ref_group_value | character varying | |||
Statistical_Parity | boolean | |||
Impact_Parity | boolean | |||
FDR_Parity | boolean | |||
FPR_Parity | boolean | |||
FOR_Parity | boolean | |||
FNR_Parity | boolean | |||
TypeI_Parity | boolean | |||
TypeII_Parity | boolean | |||
Equalized_Odds | boolean | |||
Unsupervised_Fairness | boolean | |||
Supervised_Fairness | boolean |
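For example, a sketch of a query (using the columns above) that compares the false discovery rate and its disparity across groups for a given model:

select
    attribute_name,
    attribute_value,
    group_size,
    fdr,
    fdr_disparity,
    fdr_ref_group_value
from
    test_results.aequitas
where
    model_id = 1
order by
    attribute_name,
    attribute_value;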
{test, train}_results.predictions#
You can think of the previous tables as summaries of the individual predictions our models are making. But where can you find those individual predictions (so you can generate a list from them)? And where can you find the test matrix that the predictions are based on? Let us introduce you to the {test, train}_results.predictions tables.
Here is what the first few rows look like:
select
model_id,
entity_id,
as_of_date,
score,
label_value,
matrix_uuid
from
test_results.predictions
where
model_id = 1
order by score desc;
model_id | entity_id | as_of_date | score | label_value | matrix_uuid
---|---|---|---|---|---
1 | 229 | 2016-01-01 00:00:00 | 1.0 | 1 | cd0ae68d6ace43033b49ee0390c3583e
1 | 355 | 2016-01-01 00:00:00 | 1.0 | 1 | cd0ae68d6ace43033b49ee0390c3583e
1 | 840 | 2016-01-01 00:00:00 | 1.0 | 0 | cd0ae68d6ace43033b49ee0390c3583e
As you can see, the table contains our models' predictions for a given entity and date.
And do you notice the field matrix_uuid? Doesn't it look similar to the fields from above that gave us the names of our training matrices? In fact, it is the same. You can find the test matrix that was used to make these predictions under TRIAGE_OUTPUT_PATH/matrices/{matrix_uuid}.csv.
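Here is a quick sketch of how you could summarize these scores per model, e.g., to check how many entities were scored on each as_of_date:

select
    model_id,
    as_of_date,
    count(*) as number_of_entities,
    avg(score) as average_score
from
    test_results.predictions
group by
    model_id,
    as_of_date
order by
    model_id,
    as_of_date;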
train_results.feature_importances#
This table stores the feature importances of all the models.
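For example, a sketch of how you could look at the most important features of a model (assuming the feature and feature_importance columns this table typically carries):

select
    model_id,
    feature,
    feature_importance
from
    train_results.feature_importances
where
    model_id = 1
order by
    feature_importance desc
limit 10;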
Footnotes#
1 Literally from the configuration file. If you modify something, it will generate a new hash. Handle with care!