Machine learning governance#

When triage executes the experiment, it creates a series of new schemas for storing the copious output of the experiment: test_results, train_results, and triage_metadata. These schemas store the metadata of the trained models, the features, and the parameters and hyperparameters used in their training. They also store the predictions and evaluations of the models on the test sets.

The schema triage_metadata is composed of the following tables:

\dt triage_metadata.*
List of relations
     Schema      |        Name         | Type  |   Owner
-----------------+---------------------+-------+-----------
 triage_metadata | experiment_matrices | table | food_user
 triage_metadata | experiment_models   | table | food_user
 triage_metadata | experiments         | table | food_user
 triage_metadata | list_predictions    | table | food_user
 triage_metadata | matrices            | table | food_user
 triage_metadata | model_groups        | table | food_user
 triage_metadata | models              | table | food_user

The tables contained in test_results are:

\dt test_results.*
List of relations
    Schema    |          Name          | Type  |   Owner
--------------+------------------------+-------+-----------
 test_results | aequitas               | table | food_user
 test_results | evaluations            | table | food_user
 test_results | individual_importances | table | food_user
 test_results | predictions            | table | food_user

Lastly, if you are interested in how the models performed on the training data sets, you can consult train_results:

\dt train_results.*
List of relations
    Schema     |        Name         | Type  |   Owner
---------------+---------------------+-------+-----------
 train_results | aequitas            | table | food_user
 train_results | evaluations         | table | food_user
 train_results | feature_importances | table | food_user
 train_results | predictions         | table | food_user

What are all the results tables about?#

model_groups stores the algorithm (model_type), the hyperparameters (hyperparameters), and the features shared by a particular set of models. models contains data specific to a model: the model_group (you can use model_group_id for linking the model to a model group), temporal information (like train_end_time), and the train matrix UUID (train_matrix_uuid). This UUID is important because it's the name of the file in which the matrix is stored.
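
For example, a minimal sketch of this linkage (using only the columns just described) joins the two tables on model_group_id:

select
    mg.model_group_id,
    mg.model_type,
    mg.hyperparameters,
    m.model_id,
    m.train_end_time,
    m.train_matrix_uuid
from
    triage_metadata.model_groups as mg
    join triage_metadata.models as m using (model_group_id)
order by
    mg.model_group_id, m.train_end_time;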

Lastly, {train, test}_results.predictions contains all the scores generated by every model for every entity. {train, test}_results.evaluations stores the values of all the metrics for every model, as specified in the scoring section of the config file.

triage_metadata.experiments#

This table has two columns: experiment_hash and config.

\d triage_metadata.experiments
Table "modelmetadata.experiments"
Column Type Collation Nullable Default
experimenthash character varying not null
config jsonb
Indexes:
"experimentspkey" PRIMARY KEY, btree (experimenthash)
Referenced by:
TABLE "modelmetadata.experimentmatrices" CONSTRAINT "experimentmatricesexperimenthashfkey" FOREIGN KEY (experimenthash) REFERENCES modelmetadata.experiments(experimenthash)
TABLE "modelmetadata.experimentmodels" CONSTRAINT "experimentmodelsexperimenthashfkey" FOREIGN KEY (experimenthash) REFERENCES modelmetadata.experiments(experimenthash)
TABLE "modelmetadata.matrices" CONSTRAINT "matricesbuiltbyexperimentfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash)
TABLE "modelmetadata.models" CONSTRAINT "modelsexperimenthashfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash)

experiment_hash contains the hash of the configuration file that we used for our triage run.1 config contains that experiment configuration file itself, stored as jsonb.

select experiment_hash,
config ->  'user_metadata' as user_metadata
from triage_metadata.experiments;
         experiment_hash          |                                                               user_metadata
----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------
 67a1d564d31811b9c20ca63672c25abd | {"org": "DSaPP", "team": "Tutorial", "author": "Adolfo De Unanue", "etl_date": "2019-02-21", "experiment_type": "test", "label_definition": "failed_inspection"}

A useful pattern: if we are interested in all the models that resulted from a certain config, we can look up that config in triage_metadata.experiments and then use its experiment_hash in other tables to find all the models built from that configuration.
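
For example, a small sketch of that lookup, using the experiment hash shown above and the built_by_experiment column of triage_metadata.models (described later in this section):

select
    m.model_id,
    m.model_group_id,
    m.model_type,
    m.train_end_time
from
    triage_metadata.models as m
where
    m.built_by_experiment = '67a1d564d31811b9c20ca63672c25abd'
order by
    m.train_end_time;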

triage_metadata.model_groups#

Do you remember how we defined in grid_config the different classifiers that we want triage to train? For example, a configuration file could contain the following:

    'sklearn.tree.DecisionTreeClassifier':
        criterion: ['entropy']
        max_depth: [1, 2, 5, 10]
        random_state: [2193]

By doing so, we are saying that we want to train 4 decision trees (max_depth is one of 1, 2, 5, or 10). However, remember that we are using temporal cross-validation to build our models, so we are going to have different temporal slices that we train models on, e.g., 2010-2011, 2011-2012, etc.

We are therefore going to train our four decision trees on each temporal slice. The trained model (or the instance of that model) will change across temporal splits, but the configuration will remain the same. This table lets us keep track of the different configurations (model_groups) and gives us an id for each configuration (model_group_id). We can leverage the model_group_id to find all the models that were trained using the same config but on different slices of time.
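
A minimal sketch of that lookup (model_group_id = 1 is just an illustrative value; it is the id shown in the output further below):

select
    model_id,
    model_group_id,
    train_end_time,
    train_matrix_uuid
from
    triage_metadata.models
where
    model_group_id = 1
order by
    train_end_time;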

In our simple test configuration file we have:

    'sklearn.dummy.DummyClassifier':
        strategy: [most_frequent]

Therefore, if we run the following

select
    model_group_id,
    model_type,
    hyperparameters,
    model_config -> 'feature_groups' as feature_groups,
    model_config -> 'cohort_name' as cohort,
    model_config -> 'label_name' as label,
    model_config -> 'label_definition' as label_definition,
    model_config -> 'experiment_type' as experiment_type,
    model_config -> 'etl_date' as etl_date
from
    triage_metadata.model_groups;
 model_group_id |          model_type           |        hyperparameters        |                        feature_groups                        |      cohort       |        label         |  label_definition   | experiment_type |   etl_date
----------------+-------------------------------+-------------------------------+---------------------------------------------------------------+-------------------+----------------------+---------------------+-----------------+--------------
              1 | sklearn.dummy.DummyClassifier | {"strategy": "most_frequent"} | ["prefix: results", "prefix: risks", "prefix: inspections"]    | "test_facilities" | "failed_inspections" | "failed_inspection" | "test"          | "2019-02-21"

You can see that a model group is defined by the classifier (model_type), its hyperparameters (hyperparameters), the features (feature_list, not shown), and the model_config.

The field model_config is created using information from the block model_group_keys. In our test configuration file the block is:

model_group_keys:
  - 'class_path'
  - 'parameters'
  - 'feature_names'
  - 'feature_groups'
  - 'cohort_name'
  - 'state'
  - 'label_name'
  - 'label_timespan'
  - 'training_as_of_date_frequency'
  - 'max_training_history'
  - 'label_definition'
  - 'experiment_type'
  - 'org'
  - 'team'
  - 'author'
  - 'etl_date'

What can we learn from that? For example, if we add a new feature and rerun triage, triage will create a new model_group even if the classifier and the hyperparameters are the same as before.

triage_metadata.models#

This table stores the information about our actual models, i.e., instances of our classifiers trained on specific temporal slices.

\d triage_metadata.models
Table "modelmetadata.models"
Column Type Collation Nullable Default
modelid integer not null nextval('modelmetadata.modelsmodelidseq'::regclass)
modelgroupid integer
modelhash character varying
runtime timestamp without time zone
batchruntime timestamp without time zone
modeltype character varying
hyperparameters jsonb
modelcomment text
batchcomment text
config json
builtbyexperiment character varying
trainendtime timestamp without time zone
test boolean
trainmatrixuuid text
traininglabeltimespan interval
modelsize real
Indexes:
"modelspkey" PRIMARY KEY, btree (modelid)
"ixresultsmodelsmodelhash" UNIQUE, btree (modelhash)
Foreign-key constraints:
"matrixuuidformodels" FOREIGN KEY (trainmatrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
"modelsexperimenthashfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash)
"modelsmodelgroupidfkey" FOREIGN KEY (modelgroupid) REFERENCES modelmetadata.modelgroups(modelgroupid)
Referenced by:
TABLE "testresults.evaluations" CONSTRAINT "evaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)
TABLE "trainresults.featureimportances" CONSTRAINT "featureimportancesmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)
TABLE "testresults.individualimportances" CONSTRAINT "individualimportancesmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)
TABLE "modelmetadata.listpredictions" CONSTRAINT "listpredictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)
TABLE "testresults.predictions" CONSTRAINT "predictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)
TABLE "trainresults.evaluations" CONSTRAINT "trainevaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)
TABLE "trainresults.predictions" CONSTRAINT "trainpredictionsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)

Noteworthy columns are:

  • model_id: The id of the model (i.e., instance…). We will use this ID to trace a model evaluation to a model_group and vice versa.
  • model_group_id: The id of the model's model group, which we encountered above.
  • model_hash: The hash of our model. We can use the hash to load the actual model. It gets stored under TRIAGE_OUTPUT_PATH/trained_models/{model_hash}. We are going to use this later to look at a trained decision tree (see also the query sketch after this list).
  • run_time: Time when the model was trained.
  • model_type: The algorithm used for training.
  • model_comment: Literally the text in the model_comment block in the configuration file
  • hyperparameters: Hyperparameters used for the model configuration.
  • built_by_experiment: The hash of our experiment. We encountered this value in the triage_metadata.experiments table before.
  • train_end_time: When building the training matrix, we included training samples up to this date.
  • train_matrix_uuid: The hash of the matrix that we used to train this model. The matrix gets stored as csv under TRIAGE_OUTPUT_PATH/matrices/{train_matrix_uuid}.csv. This is helpful when trying to inspect the matrix and features that were used for training.
  • training_label_timespan: How big was our window to get the labels for our training matrix? For example, a training_label_timespan of 1 year means that we look one year into the future from a given date in the training matrix to find the label for that training sample.
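
Putting a few of these columns together (a small sketch; the exact on-disk layout depends on where you configured triage to store its output):

select
    model_id,
    model_group_id,
    model_hash,        -- the trained model lives under trained_models/{model_hash}
    train_end_time,
    train_matrix_uuid  -- the training matrix lives under matrices/{train_matrix_uuid}.csv
from
    triage_metadata.models
order by
    train_end_time, model_group_id;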

triage_metadata.matrices#

This table contains information about the matrices used in the models' training. You can use this information to debug your models. Important columns are matrix_uuid (the matrix gets stored as TRIAGE_OUTPUT_PATH/matrices/{matrix_uuid}.csv), matrix_type (indicates whether the matrix was used for training models or for testing them), lookback_duration and feature_start_time (which describe the temporal settings of the features), and num_observations (the size of the matrix). A small example query follows the table description below.

\d triage_metadata.matrices
Table "modelmetadata.matrices"
Column Type Collation Nullable Default
matrixid character varying
matrixuuid character varying not null
matrixtype character varying
labelingwindow interval
numobservations integer
creationtime timestamp with time zone now()
lookbackduration interval
featurestarttime timestamp without time zone
matrixmetadata jsonb
builtbyexperiment character varying
Indexes:
"matricespkey" PRIMARY KEY, btree (matrixuuid)
"ixmodelmetadatamatricesmatrixuuid" UNIQUE, btree (matrixuuid)
Foreign-key constraints:
"matricesbuiltbyexperimentfkey" FOREIGN KEY (builtbyexperiment) REFERENCES modelmetadata.experiments(experimenthash)
Referenced by:
TABLE "testresults.evaluations" CONSTRAINT "evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
TABLE "trainresults.evaluations" CONSTRAINT "evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
TABLE "modelmetadata.models" CONSTRAINT "matrixuuidformodels" FOREIGN KEY (trainmatrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
TABLE "testresults.predictions" CONSTRAINT "matrixuuidfortestpred" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
TABLE "trainresults.predictions" CONSTRAINT "matrixuuidfortrainpred" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
TABLE "trainresults.predictions" CONSTRAINT "trainpredictionsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)

{test, train}_results.evaluations#

These tables let us analyze how well our models are doing. Based on the config that we used for our triage run, triage calculates metrics and stores them in these tables, e.g., our model's precision in the top 10%.

\d test_results.evaluations
Table "testresults.evaluations"
Column Type Collation Nullable Default
modelid integer not null
evaluationstarttime timestamp without time zone not null
evaluationendtime timestamp without time zone not null
asofdatefrequency interval not null
metric character varying not null
parameter character varying not null
value numeric
numlabeledexamples integer
numlabeledabovethreshold integer
numpositivelabels integer
sortseed integer
matrixuuid text
Indexes:
"evaluationspkey" PRIMARY KEY, btree (modelid, evaluationstarttime, evaluationendtime, asofdatefrequency, metric, parameter)
Foreign-key constraints:
"evaluationsmatrixuuidfkey" FOREIGN KEY (matrixuuid) REFERENCES modelmetadata.matrices(matrixuuid)
"evaluationsmodelidfkey" FOREIGN KEY (modelid) REFERENCES modelmetadata.models(modelid)

Its columns are:

  • model_id: Our beloved model_id that we have encountered before.
  • evaluation_start_time: After training the model, we evaluate it on a test matrix. This column tells us the earliest time that an example in our test matrix could have.
  • evaluation_end_time: After training the model, we evaluate it on a test matrix. This column tells us the latest time that an example in our test matrix could have.
  • metric: Indicates which metric we are evaluating, e.g., precision@.
  • parameter: Indicates at which threshold we are evaluating our metric, e.g., a metric of precision@ and a parameter of 100.0_pct gives us the precision in the top 100%.
  • value: The value observed for our metric@parameter.
  • num_labeled_examples: The number of labeled examples in our test matrix. Why does it matter? It could be the case that we have entities that have no label for the test timeframe (for example, not all facilities will have an inspection). We still want to make predictions for these entities but can't include them when calculating performance metrics.
  • num_labeled_above_threshold: How many examples above our threshold were labeled?
  • num_positive_labels: The number of rows in the test matrix whose (ground-truth) label is positive.

A look at the table shows that we have multiple rows for each model, each showing a different performance metric.

select
    evaluation_end_time,
    model_id,
    metric || parameter as metric,
    value,
    num_labeled_examples,
    num_labeled_above_threshold,
    num_positive_labels
from
    test_results.evaluations
where
    parameter = '100.0_pct';
 evaluation_end_time | model_id |       metric        |       value        | num_labeled_examples | num_labeled_above_threshold | num_positive_labels
---------------------+----------+---------------------+--------------------+----------------------+-----------------------------+---------------------
 2016-01-01 00:00:00 |        1 | precision@100.0_pct | 0.6666666666666666 |                    3 |                           3 |                   2
 2016-01-01 00:00:00 |        1 | recall@100.0_pct    |                1.0 |                    3 |                           3 |                   2
 2017-01-01 00:00:00 |        2 | precision@100.0_pct | 0.3333333333333333 |                    3 |                           3 |                   1
 2017-01-01 00:00:00 |        2 | recall@100.0_pct    |                1.0 |                    3 |                           3 |                   1

Remember that at 100%, the recall should be 1 and the precision should equal the base rate. If these two things don't match, there are problems in your data, pipeline, or ETL. You must get this right!

What does this query tell us?

We can now see how the different instances of a model group (trained on different temporal slices, but with the same model parameters) perform over time. Note that all the models shown here belong to model group 1.
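
To verify the 100% sanity check mentioned above directly from this table, you could run something like the following (a minimal sketch; precision@ and recall@ are the metric names used in this run):

select
    model_id,
    evaluation_end_time,
    max(value) filter (where metric = 'recall@') as recall_at_100_pct,
    max(value) filter (where metric = 'precision@') as precision_at_100_pct,
    max(num_positive_labels)::numeric
        / nullif(max(num_labeled_examples), 0) as base_rate
from
    test_results.evaluations
where
    parameter = '100.0_pct'
group by
    model_id, evaluation_end_time
order by
    evaluation_end_time, model_id;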

{test, train}_results.aequitas#

Standard evaluation metrics don't tell us the entire story: what are the biases in our models? Which is the fairest model?

Given the bias_audit_config block in the experiment config, where we define which protected attributes we care about (e.g., ethnicity) and the specific thresholds at which our model is going to be used, triage uses Aequitas to generate a bias report for each model and matrix, analogous to the standard evaluation metrics. The aequitas tables will have a row for each combination of:

  • model_id
  • subset_hash
  • tie_breaker (e.g., best, worst)
  • evaluation_start_time
  • evaluation_end_time
  • parameter (e.g., 25_abs, similar to evaluation metric thresholds)
  • attribute_name (e.g., facility_type)
  • attribute_value (e.g., kids_facility, restaurant)

For each row Aequitas calculates the following group metrics:

| Metric | Description |
|--------|-------------|
| Predicted Positive (pp) | The number of entities within a group for which the decision is positive, i.e., the predicted label is 1. |
| Total Predicted Positive | The total number of entities predicted positive across the groups defined by the protected attribute. |
| Predicted Negative (pn) | The number of entities within a group for which the decision is negative, i.e., the predicted label is 0. |
| Predicted Prevalence (pprev) | The fraction of entities within a group which were predicted as positive. |
| Predicted Positive Rate (ppr) | The fraction of the entities predicted as positive that belong to a certain group. |
| False Positive (fp) | The number of entities of the group with a positive prediction and a negative label. |
| False Negative (fn) | The number of entities of the group with a negative prediction and a positive label. |
| True Positive (tp) | The number of entities of the group with a positive prediction and a positive label. |
| True Negative (tn) | The number of entities of the group with a negative prediction and a negative label. |
| False Discovery Rate (fdr) | The fraction of false positives of a group within the predicted positives of the group. |
| False Omission Rate (for) | The fraction of false negatives of a group within the predicted negatives of the group. |
| False Positive Rate (fpr) | The fraction of false positives of a group within the labeled negatives of the group. |
| False Negative Rate (fnr) | The fraction of false negatives of a group within the labeled positives of the group. |

In the context of public policy and social good, we want to avoid providing fewer benefits to specific groups of entities if the intervention is assistive, as well as avoid disproportionately hurting specific groups if the intervention is punitive. Therefore we define bias as a disparity measure: the group metric value of a given group compared with that of a reference group. The reference group can be selected using different criteria. For instance, one could use the majority group (the largest of the groups defined by the attribute), the group with the minimum group metric value, or the traditional approach of fixing a historically favored group (e.g., ethnicity: caucasian).

Each disparity metric for a given group is calculated as the ratio of the group's metric value to the reference group's value; for example, a group's fdr_disparity is its false discovery rate divided by the false discovery rate of the reference group.

To read about the bias metrics saved in this table, look at the Aequitas documentation.

Table "test_results.aequitas"
Column Type Collation Nullable Default
model_id integer not null
subset_hash character varying not null
tie_breaker character varying not null
evaluation_start_time timestamp without time zone not null
evaluation_end_time timestamp without time zone not null
matrix_uuid text
parameter character varying not null
attribute_name character varying not null
attribute_value character varying not null
total_entities integer
group_label_pos integer
group_label_neg integer
group_size integer
group_size_pct numeric
prev numeric
pp integer
pn integer
fp integer
fn integer
tn integer
tp integer
ppr numeric
pprev numeric
tpr numeric
tnr numeric
for numeric
fdr numeric
fpr numeric
fnr numeric
npv numeric
precision numeric
ppr_disparity numeric
ppr_ref_group_value character varying
pprev_disparity numeric
pprev_ref_group_value character varying
precision_disparity numeric
precision_ref_group_value character varying
fdr_disparity numeric
fdr_ref_group_value character varying
for_disparity numeric
for_ref_group_value character varying
fpr_disparity numeric
fpr_ref_group_value character varying
fnr_disparity numeric
fnr_ref_group_value character varying
tpr_disparity numeric
tpr_ref_group_value character varying
tnr_disparity numeric
tnr_ref_group_value character varying
npv_disparity numeric
npv_ref_group_value character varying
Statistical_Parity boolean
Impact_Parity boolean
FDR_Parity boolean
FPR_Parity boolean
FOR_Parity boolean
FNR_Parity boolean
TypeI_Parity boolean
TypeII_Parity boolean
Equalized_Odds boolean
Unsupervised_Fairness boolean
Supervised_Fairness boolean
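
For example, to compare false discovery rates across the values of one protected attribute for a single model, a query could look like the following (a sketch; facility_type is just the illustrative attribute mentioned above):

select
    model_id,
    evaluation_end_time,
    parameter,
    attribute_name,
    attribute_value,
    fdr,
    fdr_disparity,
    fdr_ref_group_value
from
    test_results.aequitas
where
    model_id = 1
    and attribute_name = 'facility_type'
order by
    evaluation_end_time, attribute_value;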

{test, train}_results.predictions#

You can think of the previous tables, {test, train}_results.evaluations, as a summary of the predictions our models are making. But where can you find the individual predictions themselves, so you can generate a list from them? And where can we find the test matrix that the predictions are based on? Let us introduce you to the {test, train}_results.predictions tables.

Here is what its first few rows look like:

select
    model_id,
    entity_id,
    as_of_date,
    score,
    label_value,
    matrix_uuid
from
    test_results.predictions
where
    model_id = 1
order by score desc;
 model_id | entity_id |     as_of_date      | score | label_value |           matrix_uuid
----------+-----------+---------------------+-------+-------------+----------------------------------
        1 |       229 | 2016-01-01 00:00:00 |   1.0 |           1 | cd0ae68d6ace43033b49ee0390c3583e
        1 |       355 | 2016-01-01 00:00:00 |   1.0 |           1 | cd0ae68d6ace43033b49ee0390c3583e
        1 |       840 | 2016-01-01 00:00:00 |   1.0 |           0 | cd0ae68d6ace43033b49ee0390c3583e

As you can see, the table contains our models' predictions for a given entity and date.

And do you notice the field matrix_uuid? Doesn't it look similar to the fields from above that gave us the names of our training matrices? In fact, it is the same. You can find the test matrix that was used to make this prediction under TRIAGE_OUTPUT_PATH/matrices/{matrix_uuid}.csv.
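
So, to actually generate a list from these predictions, a minimal sketch would be to take the top-scoring entities for a model (the cutoff of 100 is only illustrative):

select
    entity_id,
    as_of_date,
    score,
    label_value
from
    test_results.predictions
where
    model_id = 1
order by
    score desc
limit 100;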

{test, train}_results.feature_importances#

These tables store the feature importances of all the models.
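
For example, to look at the most important features of a model, a sketch could be the following (we assume columns named feature and feature_importance; check \d train_results.feature_importances for the exact names in your triage version):

select
    model_id,
    feature,             -- assumed column name
    feature_importance   -- assumed column name
from
    train_results.feature_importances
where
    model_id = 1
order by
    feature_importance desc
limit 10;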

Footnotes#

1 Literally from the configuration file. If you modify something it will generate a new hash. Handle with care!