Skip to content

Dirty duckling: the quick start guide#

workflow

This quickstart guide follows the workflow explained here. The goal is to show you an instance of that workflow using the Chicago Food Inspections dataset data source.

We packed this a sample of Chicago Food Inspections data source as part of the dirty duck tutorial. Just run in the folder that contains the triage local repository:

./tutorial.sh up

from you command line. This will start the database.

1. Install Triage: Check!#

We also containerized triage, so, in this tutorial it is already installed for you! Just run

./tutorial.sh bastion

The prompt in your command line should change to something like

[triage@dirtyduck$:/dirtyduck]#

Type triage, if no error. You completed this step!

Now you have triage installed, with all its power at the point of your fingers.

2. Structure your data: Events (and entities)#

As mentioned in the quickstart workflow, at least you need one table that contains events, i.e. something that happened to your entities of interest somewhere at sometime. So you need at least three columns in your data: entity_id, event_id, date (and location if you have it would be a nice addition).

In dirtyduck, we provide you with two tables: semantic.entities and semantic.events. The latter is the required minimum. We added the semantic.entities table as a good practice.

This is the simplest way to structure your data: as a series of events connected to your entity of interest (people, organization, business, etc.) that take place at a certain time. Each row of the data will be an event.

For this quickstart tutorial, you don't need to interact manually with the database, but, if you are curious you can peek inside it, and verify how the events table look like.

Inside bastion you can connect to the database typing

psql $DATABASE_URL

This will change the prompt one more time to

food=#

Now, type (or copy-paste) the following

select
  event_id
  entity_id,
  date,
  zip_code,
  type
  from
      semantic.events
 where random() < 0.001
 limit 5;
entity_id date zip_code type
1092838 2014-02-27 60657 license
1325036 2014-05-19 60612 canvass
1385431 2014-06-25 60651 complaint
1395315 2014-01-08 60707 canvass
1395916 2014-02-03 60641 canvass

Each row in this table is an event with event_id and entity_id (which links to the entity it happened to), a date (when it happened), as well a location (the zip_code column). The event will have attributes that describe it in its particularity, in this case we are just showing one of those attributes: the type of the inspection (type)

And, if you also want to see the entities in your data

select
  entity_id, license_num, facility, facility_type, activity_period
  from
      semantic.entities
 where random() < 0.001 limit 5;
entity_id license_num facility facility_type activity_period
2218 1223576 loretto hospital hospital [2014-02-27,)
2353 1804587 subway restaurant [2014-03-05,)
636 2002788 duck walk restaurant [2014-01-17,2016-02-29)
3748 1904141 zaragoza restaurant restaurant [2014-04-03,)
5118 2224978 saint cajetan school [2014-05-06,)

Triage needs a field named entity_id (that needs to be of type integer) to refer to the primary entities of interest in our project.

When you're done exploring the database, you can exit the postgres command line interface by typing \q

3. Set up Dirty duck's triage configuration file#

The configuration file sets up the modeling process to mirror the operational scenario the models will be used in. This involves defining the cohort to train/predict on, the outcome we're predicting, how far in the future we're predicting, how often will the model be updated, how often will the predicted list be used for interventions, what are the resources available to intervene to define the evaluation metric, etc.

Here's the sample configuration file called dirty-duckling.yaml

If you wish, you can check the content of the file with cat experiments/dirty-ducking.yaml

config_version: 'v8'

model_comment: 'dirtyduck-quickstart'

random_seed: 1234

temporal_config:
    label_timespans: ['3months']

label_config:
  query: |
    select
    entity_id,
    bool_or(result = 'fail')::integer as outcome
    from semantic.events
    where '{as_of_date}'::timestamp <= date
    and date < '{as_of_date}'::timestamp + interval '{label_timespan}'
    group by entity_id
  name: 'failed_inspections'

feature_aggregations:
  -
    prefix: 'inspections'
    from_obj: 'semantic.events'
    knowledge_date_column: 'date'

    aggregates_imputation:
      count:
        type: 'zero_noflag'

    aggregates:
      -
        quantity:
          total: "*"
        metrics:
          - 'count'

    intervals: ['all']

model_grid_preset:  'quickstart'

scoring:
    testing_metric_groups:
        -
          metrics: [precision@]
          thresholds:
            percentiles: [10]


    training_metric_groups:
      -
          metrics: [precision@]
          thresholds:
            percentiles: [10]

This is the minimum configuration file, and it still has a lot of sections (ML is a complex business!).

Warning

If you use the minimum configuration file several parameters will fill up using defaults. Most of the time those defaults are not the values that your modeling of the problem needs! Please check here to see which values are being used and act accordingly.

triage uses/needs a data connection in order to work. The connection will be created using the database credentials information (name of the database, server, username, and password).

You could use a database configuration file here's an example database configuation file or you can setup an environment variable named $DATABASE_URL, this is the approach taken in the dirtyduck tutorial, its value inside bastion is

   postgresql://food_user:some_password@food_db/food

For the quick explanation of the sections check the quickstart workflow guide. For a detailed explanation about each section of the configuration file look here

4. Run triage#

Now we are ready for run something! First we will validate the configuration files by running:

triage experiment experiments/dirty-duckling.yaml --validate-only

If everything was OK (it should!), you will see this in your screen:

2020-08-20 16:55:34 - SUCCESS Experiment validation ran to completion with no errors
2020-08-20 16:55:34 - SUCCESS Experiment (a336de4800cec8964569d051dc56f85d)'s configuration file is OK!

Now you can run the experiment with:

triage experiment experiments/dirty-duckling.yaml
That's it! If you see this message in your screen:

 2020-08-20 16:56:56 - SUCCESS Training, testing and evaluating models completed
 2020-08-20 16:56:56 - SUCCESS All matrices that were supposed to be build were built. Awesome!
 2020-08-20 16:56:56 - SUCCESS All models that were supposed to be trained were trained. Awesome!
 2020-08-20 16:56:56 - SUCCESS Experiment (a336de4800cec8964569d051dc56f85d) ran through completion

it would mean that triage actually built (in this order) cohort (table cohort_all_entities...), labels (table labels_failed_inspections...), features (schema features), matrices (table model_metdata.matrices and folder matrices), models (tables triage_metadata.models and triage_metadata.model_groups; folder trained_models), predictions (table test_results.predictions) and evaluations (table test_results.evaluations).

5. Look at results of your duckling!#

Next, let's quickly check the tables in the schemas triage_metadata and test_results to make sure everything worked. There you will find a lot of information related to the performance of your models.

Still connected to the bastion docker container, you can connect to the database by typing:

psql $DATABASE_URL

Again, you should see the postgreSQL prompt:

food=#

Tables in the triage_metadata schema have some general information about experiments that you've run and the models they created. The quickstart model grid preset should have built 3 models. Let's check with:

select
  model_id, model_group_id, model_type
  from
      triage_metadata.models;

This should give you a result that looks something like:

model_id model_group_id model_type
1 1 triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression
2 2 sklearn.tree.DecisionTreeClassifier
3 3 sklearn.dummy.DummyClassifier

If you want to see predictions for individual entities, you can check out test_results.predictions, for instance:

select
  model_id, entity_id, as_of_date, score, label_value
  from
      test_results.predictions
  where entity_id = 15596
  order by model_id;

This will give you something like:

model_id entity_id as_of_date score label_value
1 15596 2017-09-29 00:00:00 0.21884 0
2 15596 2017-09-29 00:00:00 0.22831 0
3 15596 2017-09-29 00:00:00 0.25195 0

Finally, test_results.evaluations holds some aggregate information on model performance. In our config above, we only focused on precision in the top ten percent, so let's see how the models are doing based on this:

select
  model_id, metric, parameter,
  round(stochastic_value,3) as stochatic_value
  from
      test_results.evaluations
  where metric = 'precision@'
        and parameter='10_pct'
  order by model_id;
model_id metric parameter stochastic_value
1 precision@ 10_pct 0.287
2 precision@ 10_pct 0.292
3 precision@ 10_pct 0.237

Not great! But then again, these were just a couple of overly simple model specifications to get things up and running...

Feel free to explore some of the other tables in these schemas (note that there's also a train_results schema with performance on the training set as well as feature importances, where defined). When you're done exploring the database, you can exit the postgres command line interface by typing \q

With a real modeling run you could (should) do model selection, postmodeling, bias audit, etc. triage provides tools for doing all of that, but we the purpose of this little experiment was just to get things up and running. If you have successfully arrived to this point, you are all set to do your own modeling (here's a good place to start), but if you want to go deeper in this example and learn about these other triage functions, continue reading our in-depth tutorial.

6. What's next?#

Take a deeper look at triage through this example#

Get started with your own project and data#