{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding Input Data\n",
    "\n",
    "Input data has slightly different requirements depending on whether you are using Aequitas via the webapp, CLI or Python package. In general, input data is a single table with the following columns:\n",
    "\n",
    "* `score`\n",
    "* `label_value`\n",
    "* at least one attribute. e.g. `race`, `sex` and `age`. (Attribute categories are defined by users.)\n",
    "\n",
    "\n",
    "| score     | label_value| race | sex | age | income|\n",
    "| --------- |------------| -----| --- | ------- | ----|\n",
    "|   0       | 1          | African-American | Male | 25 | 180000 |\n",
    "|   1       | 1          | Caucasian | Male | 37 | 34000|\n",
    "\n",
    "## Input data for Webapp\n",
    "\n",
    "The webapp requires a single CSV with columns for a binary `score`, a binary `label_value` and an arbitrary number of attribute columns. Each row is associated with a single observation.\n",
    "\n",
    "![](_static/webapp_input.jpg)\n",
    "\n",
    "### `score`\n",
    "\n",
    "Aequitas webapp assumes the `score` column is a binary decision (0 or 1).\n",
    "\n",
    "### `label_value`\n",
    "\n",
    "This is the ground truth value of a binary decision. The data again must be binary 0 or 1.\n",
    "\n",
    "### attributes e.g. `race`, `sex`, `age`,`income`\n",
    "\n",
    "Group columns can be categorical or continuous. If categorical, Aequitas will produce crosstabs with bias metrics for each group_level. If continuous, Aequitas will first bin the data into quartiles and then create crosstabs with the newly defined categories.\n",
    "\n",
    "\n",
    "## Input data for CLI\n",
    "\n",
    "The CLI accepts csv files and also accomodates database calls defined in Configuration files.\n",
    "\n",
    "![](_static/CLI_input.jpg)\n",
    "\n",
    "### `score`\n",
    "By default, Aequitas CLI assumes the `score` column is a binary decision (0 or 1). Alternatively, the `score` column can contain the score (e.g. the output from a logistic regression applied to the data). In this case, the user sets a threshold to determine the binary decision. See [configurations](./config.html) for more on thresholds.\n",
    "\n",
    "### `label_value`\n",
    "\n",
    "As with the webapp, this is the ground truth value of a binary decision. The data must be binary 0 or 1.\n",
    "\n",
    "### attributes e.g. `race`, `sex`, `age`,`income`\n",
    "\n",
    "Group columns can be categorical or continuous. If categorical, Aequitas will produce crosstabs with bias metrics for each group_level. If continuous, Aequitas will first bin the data into quartiles.\n",
    "\n",
    "### `model_id`\n",
    "\n",
    "`model_id` is an identifier tied to the output of a specific model. With a `model_id` column you can test the bias of multiple models at once. This feature is available using the CLI or the Python package.\n",
    "\n",
    "\n",
    "### Reserved column names:\n",
    "\n",
    "* `id`\n",
    "* `model_id`\n",
    "* `entity_id`\n",
    "* `rank_abs`\n",
    "* `rank_pct`\n",
    "\n",
    "\n",
    "\n",
    "## Input data for Python package\n",
    "\n",
    "Python input data can be handled identically to CLI by using `preprocess_input_df()`. Otherwise, you must discretize continuous attribute columns prior to passing the data to `Group().get_crosstabs()`.\n",
    "\n",
    "```{python}\n",
    "\n",
    "from Aequitas.preprocessing import preprocess_input_df()\n",
    "\n",
    "# *input_data* matches CLI input data norms.\n",
    "df, _ = preprocess_input_df(*input_data*)\n",
    "\n",
    "```\n",
    "![](_static/python_input.jpg)\n",
    "\n",
    "\n",
    "### `score`\n",
    "See CLI. Threshholds are set in a dictionary passed to `get_crosstabs()`.\n",
    "\n",
    "### `label_value`\n",
    "See CLI. \n",
    "\n",
    "### attributes e.g. `race`, `sex`, `age`,`income` \n",
    "\n",
    "See CLI. If you plan to bin or discritize continuous features manually, note that `get_crosstabs()` expects attribute columns to be type string. This excludes pandas 'categorical' data type, which is the default output of certain pandas discritizing functions. You can recast 'categorical' columns to strings as follows:\n",
    "\n",
    "```\n",
    "df['categorical_type'] = df['categorical_type'].astype(str)\n",
    "```\n",
    "\n",
    "### `model_id`\n",
    "\n",
    "See CLI.\n",
    "\n",
    "### Reserved column names:\n",
    "\n",
    "* `id`\n",
    "* `model_id`\n",
    "* `entity_id`\n",
    "* `rank_abs`\n",
    "* `rank_pct`\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}