Experiment Architecture#
This document is aimed at people wishing to contribute to Triage development. It explains the design and architecture of the Experiment class.
Dependency Graphs#
For a general overview of how the parts of an experiment depend on each other, refer to the graphs below.
Experiment (high-level)#
The FeatureGenerator section of the graph above hides some details to keep the overall flow concise. To support feature grouping, several more operations happen between feature table creation and matrix building. The relevant section of the dependency graph is expanded below, with the output that each component passes to the next labeled on the arrows.
Feature Dependency Details#
Component List and Input/Output#
The Experiment's work is carried out by components grouped into three packages; these are where the interesting data science work is done:
- Timechop (temporal cross-validation)
- Architect (design matrix creation)
- Catwalk (modeling)
Timechop#
Timechop does the necessary temporal math to set up temporal cross-validation. It 'chops' time according to config into train-test split definitions, which other components use.
Input
- `temporal_config` in experiment config
Output
- Time splits containing the temporal cross-validation definition, including each `as_of_date` to be included in the matrices for each time split
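For orientation, here is a hedged sketch of what a `temporal_config` block might look like once the experiment config is loaded into Python; the key names follow typical triage example configs but may vary between versions.

```python
# Illustrative temporal_config (shown as a Python dict after the YAML config
# is loaded); dates, frequencies, and key names are examples only.
temporal_config = {
    "feature_start_time": "2010-01-01",          # earliest data used for features
    "feature_end_time": "2018-01-01",
    "label_start_time": "2014-01-01",            # earliest date with reliable labels
    "label_end_time": "2018-01-01",
    "model_update_frequency": "1y",              # how often a new train/test split starts
    "training_as_of_date_frequencies": ["1month"],
    "test_as_of_date_frequencies": ["1month"],
    "max_training_histories": ["1y"],
    "test_durations": ["3month"],
    "label_timespans": ["6month"],               # how far ahead each label looks
}
```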
Entity-Date Table Generator#
The EntityDateTableGenerator manages entity-date tables (including cohort and subset tables) by running the configured query for a number of different `as_of_dates`. Alternatively, if no query is configured, it will retrieve all unique entities and dates from the labels table.
Input
- All unique `as_of_dates` needed by matrices in the experiment, as provided by Timechop
- Query and name from `cohort_config` or `subsets` in the `scoring` section of an experiment config
- Entity-date table name that the caller wants to use
Output
- An entity-date table in the database, consisting of entity ids and dates
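As a concrete illustration, a cohort query is parameterized on an as-of-date placeholder that the generator fills in for each date it runs. The sketch below assumes a hypothetical `semantic.facilities` table; only the `{as_of_date}` placeholder convention is meaningful here.

```python
# Illustrative cohort_config: the query must return entity_ids that are in the
# cohort as of the placeholder date, which the EntityDateTableGenerator
# substitutes for each as_of_date it is given.
cohort_config = {
    "name": "active_facilities",  # hypothetical cohort name
    "query": """
        select entity_id
        from semantic.facilities
        where opened_date <= '{as_of_date}'::date
    """,
}
```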
Label Generator#
The LabelGenerator manages a labels table by running the configured label query for a number of different as_of_dates and label_timespans.
Input
- All unique `as_of_dates` and `label_timespans` needed by matrices in the experiment, as provided by Timechop
- Query and name from `label_config` in experiment config
Output
- A labels table in the database, consisting of entity ids, dates, and boolean labels
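Similarly, a label query uses `{as_of_date}` and `{label_timespan}` placeholders that the LabelGenerator fills in for each combination it runs. The table and outcome logic below are hypothetical.

```python
# Illustrative label_config: the query returns an entity_id and a binary
# outcome computed over the window [as_of_date, as_of_date + label_timespan).
label_config = {
    "name": "failed_inspection",  # hypothetical label name
    "query": """
        select entity_id, bool_or(result = 'fail')::integer as outcome
        from semantic.inspections
        where date >= '{as_of_date}'::date
          and date < '{as_of_date}'::date + interval '{label_timespan}'
        group by entity_id
    """,
}
```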
Feature Generator#
The FeatureGenerator manages a number of feature tables by converting the configured feature_aggregations into collate `SpacetimeAggregation` objects, and then running the queries generated by collate. For each feature_aggregation, it runs a few passes:
1. Optionally, convert a complex from object (e.g. the `FROM` part of the configured aggregation query) into an indexed table for speed.
2. Create a number of empty tables at different `GROUP BY` levels (always `entity_id` in triage) and run inserts individually for each `as_of_date`. These inserts are split up into individual tasks and parallelized for speed.
3. Roll up the `GROUP BY` tables from step 2 to the `entity_id` level with a single `LEFT JOIN` query.
4. Use the cohort table to find all members of the cohort not present in the table from step 3, and create a new table containing all members of the cohort, with null values filled in based on the rules in the `feature_aggregations` config.
Input
- All unique `as_of_dates` needed by matrices in the experiment, and the start time for features, as provided by Timechop
- The populated cohort table, as provided by Entity-Date Table Generator
- `feature_aggregations` in experiment config
Output
- Populated feature tables in the database, one for each `feature_aggregation`
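For reference, a single `feature_aggregation` entry might look roughly like the sketch below; the table and column names are hypothetical, and the exact set of supported keys depends on the collate/triage version.

```python
# Illustrative feature_aggregations entry (collate-style config).
feature_aggregations = [
    {
        "prefix": "inspections",                  # prefix for generated tables/columns
        "from_obj": "semantic.inspections",       # the FROM object (table or subquery)
        "knowledge_date_column": "date",          # guards against using future data
        "aggregates": [
            {"quantity": "(result = 'fail')::int", "metrics": ["count", "avg"]},
        ],
        "intervals": ["3month", "1y", "all"],     # lookback windows per as_of_date
        "groups": ["entity_id"],                  # GROUP BY levels
        "aggregates_imputation": {"all": {"type": "zero"}},  # null-filling rules
    }
]
```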
Feature Dictionary Creator#
Summarizes the feature tables created by FeatureGenerator into a dictionary more easily usable for feature grouping and serialization purposes. Does this by querying the database's information_schema.
Input
- Names of feature tables and the index of each table, as provided by Feature Generator
Output
- A master feature dictionary, consisting of each populated feature table and all of its feature column names.
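The resulting dictionary is simply a mapping from each populated feature table to its feature column names; the names below are made up for illustration.

```python
# Rough shape of the master feature dictionary produced by the
# FeatureDictionaryCreator (table and column names are hypothetical).
master_feature_dictionary = {
    "inspections_aggregation_imputed": [
        "inspections_entity_id_3month_result_fail_count",
        "inspections_entity_id_1y_result_fail_avg",
    ],
    "permits_aggregation_imputed": [
        "permits_entity_id_all_permit_count",
    ],
}
```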
Feature Group Creator#
Creates feature groups by taking the configured feature grouping rules and applying them to the master feature dictionary, to create a collection of smaller feature dictionaries.
Input
- Master feature dictionary, as provided by Feature Dictionary Creator
- `feature_group_definition` in experiment config
Output
- List of feature dictionaries, each representing one feature group
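A hedged example of a grouping rule, assuming the two feature prefixes used in the sketches above:

```python
# Illustrative feature_group_definition: group feature columns by table
# prefix. Other grouping keys may be available depending on the version.
feature_group_definition = {
    "prefix": ["inspections", "permits"],
}
```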
Feature Group Mixer#
Combines feature groups into new ones based on the configured rules (e.g. leave-one-out, leave-one-in).
Input
- List of feature dictionaries, as provided by Feature Group Creator
- `feature_group_strategies` in experiment config
Output
- List of feature dictionaries, each representing one or more feature groups.
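With the two prefix groups sketched above, a strategy list like the one below would yield the "all features" dictionary plus one dictionary per left-out group:

```python
# Illustrative feature_group_strategies. "leave-one-out" produces one feature
# dictionary omitting each group in turn; "all" combines every group.
feature_group_strategies = ["leave-one-out", "all"]
```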
Planner#
Mixes time split definitions and feature groups to create the master list of matrices that are required for modeling to proceed.
Input
- List of feature dictionaries, as provided by Feature Group Mixer
- List of matrix split definitions, as provided by Timechop
- `user_metadata` in experiment config
- `feature_start_time` from `temporal_config` in experiment config
- Cohort name from `cohort_config` in experiment config
- Label name from `label_config` in experiment config
Output
- List of serializable matrix build tasks, consisting of everything needed to build a single matrix:
- list of as-of-dates
- a label name
- a label type
- a feature dictionary
- matrix uuid
- matrix metadata
- matrix type (train or test)
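Loosely, one such task can be pictured as the dictionary below; the key names are paraphrased from the list above rather than taken from the actual task objects, so treat them as illustrative.

```python
# Rough sketch of a single matrix build task assembled by the Planner.
matrix_build_task = {
    "as_of_times": ["2016-01-01", "2016-02-01"],          # dates whose rows go in the matrix
    "label_name": "failed_inspection",
    "label_type": "binary",
    "feature_dictionary": {"inspections_aggregation_imputed": ["..."]},
    "matrix_uuid": "a1b2c3",                              # hash of the matrix metadata
    "matrix_metadata": {"matrix_type": "train"},          # plus many more metadata keys
    "matrix_type": "train",
}
```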
Matrix Builder#
Takes matrix build tasks from the Planner and builds them if they don't already exist.
Input
- A matrix build task, as provided by Planner
- `include_missing_labels_in_train_as` from `label_config` in experiment config
- The experiment's MatrixStorageEngine
Output
- The built matrix saved in the MatrixStorageEngine
- A row describing the matrix saved in the database's `triage_metadata.matrices` table
ModelTrainTester#
A meta-component of sorts. Encompasses all of the other catwalk components.
Input
- One temporal split, as provided by Timechop
- `grid_config` in experiment config
- Fully configured ModelTrainer, Predictor, ModelEvaluator, and Individual Importance Calculator objects
Output
- All of its components are run, resulting in trained models, predictions, evaluation metrics, and individual importances
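The model grid itself comes from `grid_config`, which maps importable classifier paths to hyperparameter grids, for example:

```python
# Illustrative grid_config: each key is an importable classifier path, each
# value maps hyperparameter names to the lists of values to grid over.
grid_config = {
    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [100, 500],
        "max_depth": [5, 10],
    },
    "sklearn.tree.DecisionTreeClassifier": {
        "max_depth": [1, 3, 5],
    },
}
```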
ModelGrouper#
Assigns a model group to each model based on its metadata.
Input
- `model_group_keys` in experiment config
- All the data about a particular model needed to decide a model group for it: classifier name, hyperparameter list, and matrix metadata, as provided by ModelTrainer
Output
- A model group id corresponding to a row in the `triage_metadata.model_groups` table, either a matching row that already existed or one that it autoprovisioned.
ModelTrainer#
Trains a model, stores it, and saves its metadata (including model group information and feature importances) to the database. Each model to be trained is expressed as a serializable task so that it can be parallelized.
Input
- an instance of the ModelGrouper class.
- the experiment's ModelStorageEngine
- a MatrixStore object
- an importable classifier path and a set of hyperparameters
Output
- Rows in the database: one in `triage_metadata.model_groups`, one in `triage_metadata.models`, and one in `train_results.feature_importances` for each feature.
- the trained model persisted in the ModelStorageEngine
Predictor#
Generates predictions for a given model and matrix, both returning them for immediate use and saving them to the database.
Input
- The experiment's Model Storage Engine
- A model id corresponding to a row from the database
- A MatrixStore object
Output
- The predictions as an array
- Each prediction saved to the database, unless configured not to. The table they are stored in depends on the type of matrix (e.g. `test_results.predictions` or `train_results.predictions`)
Protected Group Table Generator#
Generates a table containing protected group attributes (e.g. race, sex, age).
Input
- A cohort table name and its configuration's unique hash
- Bias audit configuration, specifically a from object (either a table or query), and column names in the from object for protected attributes, knowledge date, and entity id.
- A name for the protected groups table
Output
- A protected groups table, containing all rows from the cohort and any protected group information present in the from object, as well as the cohort hash so multiple cohorts can live in the same table.
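A hedged sketch of the bias-audit source configuration described above (a from object plus column names for the protected attributes, knowledge date, and entity id); the key names are illustrative and may not match the exact config schema.

```python
# Illustrative source portion of bias_audit_config; table and column names
# are hypothetical.
bias_audit_source = {
    "from_obj_table": "semantic.demographics",
    "attribute_columns": ["race", "sex", "age_bucket"],
    "knowledge_date_column": "event_date",
    "entity_id_column": "entity_id",
}
```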
ModelEvaluator#
Generates evaluation metrics for a given model and matrix over the entire matrix and for any subsets.
Input
- `scoring` in experiment config
- Array of predictions
- The MatrixStore and model_id that the predictions were generated from
- The subset to be evaluated (or `None` for the whole matrix)
- The reference group and thresholding rules from `bias_audit_config` in experiment config
- The protected group generator object (for retrieving protected group data)
Output
- A row in the database for each evaluation metric for each subset. The table they are stored in depends on the type of matrix (e.g. `test_results.evaluations` or `train_results.evaluations`).
- A row in the database for each Aequitas bias report, in either `test_results.aequitas` or `train_results.aequitas`.
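A hedged sketch of a `scoring` block; the metric names and threshold structure follow common triage example configs, though the available metrics depend on the catwalk version.

```python
# Illustrative scoring config: metrics ending in '@' are computed at each
# configured threshold (percentile or absolute top-n cutoffs).
scoring = {
    "testing_metric_groups": [
        {
            "metrics": ["precision@", "recall@"],
            "thresholds": {"percentiles": [1.0, 5.0], "top_n": [100]},
        },
        {"metrics": ["roc_auc"]},
    ],
    "training_metric_groups": [
        {"metrics": ["accuracy"]},
    ],
}
```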
Individual Importance Calculator#
Generates the top n feature importances for each entity in a given model.
Input
- `individual_importance_config` in experiment config
- Model id
- a MatrixStore object for a test matrix
- an as-of-date
Output
- Rows in the `test_results.individual_importances` table for the model, date, and matrix, based on the configured method and number of top features per entity.
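A hedged sketch of the configuration block, with commonly seen keys; defaults and available methods may vary by version.

```python
# Illustrative individual_importance_config: which calculation methods to run
# and how many top-ranked features to store per entity.
individual_importance_config = {
    "methods": ["uniform"],  # simplest built-in method
    "n_ranks": 5,            # number of top features saved per entity
}
```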
General Class Design#
The Experiment class is designed to have all work done by component objects that reside as attributes on the instance. The purpose of this is to maximize the reuse potential of the components outside of the Experiment, as well as avoid excessive class inheritance within the Experiment.
The inheritance tree of the Experiment is reserved for execution concerns, such as switching between singlethreaded, multiprocess, or cluster execution. To enable these different execution contexts without excessive duplicated code, the components that cover computationally or memory-intensive work generally implement methods to generate a collection of serializable tasks to perform later, on either that same object or perhaps another one running in another process or machine. The subclasses of Experiment then differentiate themselves by implementing methods to execute a collection of these tasks using their preferred method of execution, whether it be a simple loop, a process pool, or a cluster.
The components are created and experiment configuration is bound to them at Experiment construction time, so that the instance methods can have concise call signatures that only cover the information passed by other components mid-experiment.
Data reuse/replacement is handled within components. The Experiment generally just hands the replace flag to each component at object construction, and at runtime each component uses that and determines whether or not the needed work has already been done.
I'm trying to find some behavior. Where does it reside?#
If you're looking to change the behavior of the Experiment:
- When possible, the logic resides in one of the components, and the component list above should hopefully help you find the lines between components.
- Logic that specifically relates to parallel execution is in one of the experiment subclasses (see parallelization section below).
- Everything else is in the Experiment base class. This is where the public interface (`.run()`) resides; it follows a template method pattern to define the skeleton of the Experiment: instantiating components based on experiment configuration and runtime inputs, and passing output from one component to another.
I want to add a new option. Where should I put it?#
Generally, the experiment configuration is where any new options go that change any data science-related functionality; in other words, if you could conceivably get better precision from the change, it should make it into experiment configuration. This is so the hashed experiment config is meaningful and the experiment can be audited by looking at the experiment configuration rather than requiring the perusal of custom code. The blind spot in this is, of course, the state of the database, which can always change results, but it's useful for database state to continue to be the only exception to this rule.
On the other hand, new options that affect only runtime concerns (e.g. performance boosts) should go as arguments to the Experiment. For instance, changing the number of cores to use for matrix building, or telling it to skip predictions won't change the answer you're looking for; options like these just help you potentially get to the answer faster. Once an experiment is completed, runtime flags like these should be totally safe to ignore in analysis.
Storage Abstractions#
Another important part of enabling different execution contexts is being able to pass large, persisted objects (e.g. matrices or models) by reference to another process or cluster. To achieve this, as well as provide the ability to configure different storage mediums (e.g. S3) and formats (e.g. HDF) without changes to the Experiment class, all references to these large objects within any components are handled through an abstraction layer.
Matrix Storage#
All interactions with individual matrices and their bundled metadata are handled through MatrixStore objects. The storage medium is handled through a base Store object that is an attribute of the MatrixStore. The storage format is handled through inheritance on the MatrixStore: Each subclass, such as CSVMatrixStore or HDFMatrixStore, implements the necessary methods (save, load, head_of_matrix) to properly persist or load a matrix from its storage.
In addition, the MatrixStore provides a variety of methods to retrieve data from either the base matrix itself or its metadata. For instance (this is not meant to be a complete list):
- `matrix` - the raw matrix
- `metadata` - the raw metadata dictionary
- `exists` - whether or not it exists in storage
- `columns` - the column list
- `labels` - the label column
- `uuid` - the matrix's UUID
- `as_of_dates` - the matrix's list of as-of-dates
One MatrixStorageEngine exists at the Experiment level, and roughly corresponds with a directory wherever matrices are stored. Its only interface is to provide a MatrixStore object given a matrix UUID.
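A hedged sketch of how a component might use these accessors; the `get_store` call is assumed from the description above and the attribute names follow the list, but exact signatures may differ from the real code.

```python
def summarize_matrix(matrix_storage_engine, matrix_uuid):
    """Illustrative only: report a few facts about a stored matrix."""
    store = matrix_storage_engine.get_store(matrix_uuid)  # assumed engine method
    if not store.exists:
        return None
    return {
        "n_rows": len(store.matrix),       # the raw matrix
        "n_features": len(store.columns),  # the column list
        "as_of_dates": store.as_of_dates,  # dates covered by the matrix
        "matrix_type": store.metadata.get("matrix_type"),
    }
```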
Model Storage#
Model storage is handled similarly to matrix storage, although the interactions with it are far simpler so there is no single-model class akin to the MatrixStore. One ModelStorageEngine exists at the Experiment level, configured with the Experiment's storage medium, and through it trained models can be saved or loaded. The ModelStorageEngine uses joblib to save and load compressed pickles of the model.
Miscellaneous Project Storage#
Both the ModelStorageEngine and MatrixStorageEngine are based on a more general storage abstraction that is suitable for any other auxiliary objects (e.g. graph images) that need to be stored. That is the ProjectStorage object, which roughly corresponds to a directory on some storage medium where we store everything. One of these exists as an Experiment attribute, and its interface .get_store can be used to persist or load whatever is needed.
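A hedged sketch of persisting an auxiliary artifact this way; the `get_store(directories, filename)` signature and the `write` method on the returned store are assumptions based on the description above, not quoted from the code.

```python
def save_report(project_storage, report_bytes):
    """Illustrative only: persist a report under the project's storage root."""
    store = project_storage.get_store(["reports"], "summary.html")  # assumed signature
    store.write(report_bytes)  # assumed write method on the returned store object
    return store
```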
Parallelization/Subclassing Details#
In the Class Design section above, we introduced tasks for parallelization and subclassing for execution changes. In this section, we expand on them to provide a guide to working with them.
Currently, there are three methods that a subclass of Experiment must implement in order to be fully functional.
Abstract Methods#
- `process_query_tasks` - Run feature generation queries. Receives a list of tasks. Each task actually represents a table, and is split into three lists of queries to enable the implementation to avoid deadlocks: `prepare` (table creation), `inserts` (a collection of INSERT INTO ... SELECT queries), and `finalize` (indexing). `prepare` needs to be run before the inserts, and `finalize` is best run after the inserts, so it is advised that only the inserts are parallelized. The subclass should run each individual batch of queries by calling `self.feature_generator.run_commands([list of queries])`, which will run all of the queries serially, so the implementation can send a batch of queries to each worker instead of having each individual query be on a new worker.
- `process_matrix_build_tasks` - Run matrix build tasks (these assume all the necessary label/cohort/feature tables have been built). Receives a dictionary of tasks. Each key is a matrix UUID, and each value is a dictionary that has all the necessary keyword arguments to call `self.matrix_builder.build_matrix` to build one matrix.
- `process_train_test_batches` - Run model train/test task batches (these assume all matrices are built). Receives a list of `triage.component.catwalk.TaskBatch` objects, each of which has a list of tasks, a description of those tasks, and whether or not that batch is safe to run in parallel. Within this, each task is a dictionary that has all the necessary keyword arguments to call `self.model_train_tester.process_task` to train and test one model. Each task covers model training, prediction (on both test and train matrices), model evaluation (on both test and train matrices), and saving of global and individual feature importances.
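To make the division of labor concrete, here is a hedged sketch of a minimal serial subclass using only the calls named above. The `ExperimentBase` import, the `tasks` attribute on a batch, and the exact shape of each task are assumptions, so treat this as illustrative rather than as the actual SingleThreadedExperiment source.

```python
from triage.experiments import ExperimentBase  # assumed import path/name


class MySerialExperiment(ExperimentBase):
    """Illustrative subclass that runs every kind of task in a simple loop."""

    def process_query_tasks(self, query_tasks):
        # Each task represents one table: run prepare, then inserts, then finalize.
        for task in query_tasks:
            self.feature_generator.run_commands(task.get("prepare", []))
            self.feature_generator.run_commands(task.get("inserts", []))
            self.feature_generator.run_commands(task.get("finalize", []))

    def process_matrix_build_tasks(self, matrix_build_tasks):
        # Keys are matrix UUIDs; values are keyword arguments for build_matrix.
        for matrix_uuid, build_kwargs in matrix_build_tasks.items():
            self.matrix_builder.build_matrix(**build_kwargs)

    def process_train_test_batches(self, batches):
        # Each batch holds tasks; each task is keyword arguments for process_task.
        for batch in batches:
            for task in batch.tasks:
                self.model_train_tester.process_task(**task)
```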
Reference Implementations#
- SingleThreadedExperiment is a barebones implementation that runs everything serially.
- MultiCoreExperiment utilizes local multiprocessing to run tasks through a worker pool. Reading this is helpful to see the minimal implementation needed for some parallelization.
- RQExperiment utilizes an RQ worker cluster to allow tasks to be parallelized either locally or distributed across other machines. It does not take care of spawning a cluster or any other infrastructural concerns: it expects that the cluster is running somewhere and reading from the same Redis instance that is passed to the RQExperiment. The RQExperiment simply enqueues tasks and waits for them to be completed. Reading this is helpful as a simple example of how to enable distributed computing.