
Tech Setup

  1. Make sure you are on the CMU VPN (Full VPN group).
  2. Connect to the class server, mlpolicylab.dssg.io, from your command line/terminal/PuTTY: type ssh your_andrew_id@server.mlpolicylab.dssg.io
  3. Connect to the database server, mlpolicylab.db.dssg.io. Once you're on the class server, type psql -h database.mlpolicylab.dssg.io -U YOUR_ANDREW_ID group_students_database
  4. Set up dbeaver or dbvisualizer (a visual IDE for the database); instructions are here.

Detailed instructions are available here and will be covered at the first Wednesday tech session.

Tech Session Materials:

ssh

ssh your_andrew_id@server.mlpolicylab.dssg.io

ssh is what you'll use to connect to the class server, which is where you will do all of your work. You will need to give us your public ssh key (using the instructions we sent), and then you'll be good to go. Depending on your operating system, you can google for the best tool to use (command line, terminal, PuTTY, etc.).
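If you find yourself typing the full command often, you can add an entry to your local ~/.ssh/config so that a plain ssh mlpolicylab works. This is just a convenience sketch: the Host alias is made up, and you should point IdentityFile at wherever your key actually lives.

# in ~/.ssh/config on your local machine
Host mlpolicylab
    HostName server.mlpolicylab.dssg.io
    User your_andrew_id
    IdentityFile ~/.ssh/id_ed25519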

Linux Command Line (Bash)

If you're not too familiar with working at the command line, we have a quick overview and intro here

A couple of quick pointers that might be helpful:
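For instance, a few commands you'll use constantly (a quick illustration, not a substitute for the intro linked above):

pwd                       # print the directory you're currently in
ls -lh                    # list files, with sizes and permissions
cd my_project_repo        # change into a directory (hypothetical name)
less README.md            # page through a file; press q to quit
grep -i "error" run.log   # search a file for a pattern, ignoring case (hypothetical file)
man grep                  # read the manual page for any command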

github

We'll use github to collaborate on the code all semester. You will have a project repository based on your project assignment.

common (extremely simple) workflow

A more advanced cheatsheet is also available. Another useful tutorial is here, and you might want to check out this interactive walk-through (though some of the concepts it focuses on go beyond what you'll need for class).
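As a rough sketch, the day-to-day version of that workflow looks something like this (the branch and file names here are made up):

git pull                                   # get the latest changes from your team
git checkout -b label-cleanup              # start a branch for your change (hypothetical name)
# ... edit files ...
git status                                 # see what you've changed
git add clean_labels.py                    # stage the files you changed (hypothetical file)
git commit -m "Clean up label generation"  # commit with a descriptive message
git push origin label-cleanup              # push the branch and open a pull request on github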

PostgreSQL

If you're not too familiar with SQL or would like a quick review, we have an overview and intro here.

Additionally, check out these notes and tips about using the course database.
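To give a flavor of the queries you'll be writing, here is a small example. The schema, table, and column names are made up; your project database will have its own:

-- count inspections per facility per year (hypothetical schema and columns)
SELECT facility_id,
       date_part('year', inspection_date) AS year,
       count(*) AS num_inspections
FROM semantic.inspections
GROUP BY facility_id, date_part('year', inspection_date)
ORDER BY num_inspections DESC
LIMIT 20;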

psql

psql is a command line tool to connect to the postgresql database server we're using for class. You will need to be on the class server (via ssh) first, and then type psql -h database.mlpolicylab.dssg.io -U YOUR_ANDREW_ID databasename, where databasename is the database for your project that you will receive after your project assignment. To test it, you can use psql -h mlpolicylab.db.dssg.io -U YOUR_ANDREW_ID group_students_database (make sure to change YOUR_ANDREW_ID).

A couple quick usage pointers:
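For example, a few psql meta-commands come up all the time (type \? inside psql for the full list):

\dn                          list the schemas in your database
\dt schema_name.*            list the tables in a schema
\d schema_name.table_name    show a table's columns and types
\x                           toggle expanded output (handy for wide rows)
\timing                      report how long each query takes
\q                           quit psql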

dbeaver

dbeaver is a free tool that gives you a slightly nicer, visual interface to the database. Instructions for installing and setting it up are here.

Connecting to the database from python

The sqlalchemy module provides an interface to connect to a postgres database from python (you'll also need to install psycopg2 in order to talk to postgres specifically). You can install both in your virtualenv with:

pip install psycopg2-binary sqlalchemy

(Note that psycopg2-binary comes packaged with its dependencies, so you should install it rather than the base psycopg2 module)

A simple usage pattern might look like:

from sqlalchemy import create_engine

# read parameters from a secrets file, don't hard-code them!
db_params = get_secrets('db')
engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{dbname}'.format(
  host=db_params['host'],
  port=db_params['port'],
  dbname=db_params['dbname'],
  user=db_params['user'],
  password=db_params['password']    
))
result_set = engine.execute("SELECT * FROM your_table LIMIT 100;")
for record in result_set:
  process_record(record)

# Close communication with the database
engine.dispose()

If you're changing data in the database, note that you may need to use engine.execute("COMMIT") to ensure that changes persist.

Note that the engine object can also be used with other utilities that interact with the database, such as ohio or pandas (though the latter can be very inefficient/slow)
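For instance, a quick way to pull a query result into a dataframe for exploration might look like the following (fine for small result sets; for bulk loads, ohio will generally be much faster):

import pandas as pd

# reuse the engine created above; the table name is just a placeholder
df = pd.read_sql("SELECT * FROM your_table LIMIT 100;", engine)
print(df.head())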

For a more detailed walk-through of using python and postgresql together, check out the Python+SQL tech session notebook

Jupyter Notebooks

Although not a good environment for running your ML pipeline and models, jupyter notebooks can be useful for exploratory data analysis as well as for visualizing modeling results. Since the data needs to stay in the AWS environment, you'll need to run a notebook server on the remote machine and create an SSH tunnel (because the course server can only be accessed via the SSH protocol) so you can access it via your local browser.
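In practice that looks something like the following; the port 8888 is only an example (pick one your teammates aren't already using), and the details linked below cover the class-specific setup:

# on the class server: start a notebook server without opening a browser
jupyter notebook --no-browser --port 8888

# on your local machine: forward local port 8888 to the server's port 8888
ssh -N -L 8888:localhost:8888 your_andrew_id@server.mlpolicylab.dssg.io

# then open http://localhost:8888 in your local browser and paste the token
# that the notebook server printed when it started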

One important note: be sure to explicitly shut down the kernels when you're done working with a notebook, as "zombie" notebook sessions can end up using a lot of the server's memory and processing power!

You can find some details about using jupyter with the class server here

Handling Secrets

You'll need access to various secrets (such as database credentials) in your code, but keeping these secrets out of the code itself is an important part of keeping your infrastructure and data secure. You can find a few tips about different ways to do so here
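As one illustration (a sketch of one approach, not the only one): keep credentials in a YAML file outside your git repository, or in environment variables, and read them at runtime. The get_secrets('db') call in the example above could be implemented along these lines; the file path and structure here are assumptions:

import os
import yaml  # pip install pyyaml

def get_secrets(section, path=os.path.expanduser('~/.mlpolicylab_secrets.yaml')):
    """Read one section (e.g. 'db') from a YAML file kept outside the repo."""
    with open(path) as f:
        return yaml.safe_load(f)[section]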

Triage Pointers

We'll be using triage as a machine learning pipeline tool for this class. Below are a couple of links to resources that you might find helpful as you explore and use triage for your project:

Also, here are a few tips as you're working on your project:
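As a very rough sketch, kicking off a triage experiment from python looks something like the code below; check the triage documentation for the current API and configuration format, since the config dict, database URL, and project path here are all placeholders:

from sqlalchemy import create_engine
from triage.experiments import SingleThreadedExperiment

experiment = SingleThreadedExperiment(
    config=experiment_config,               # dict loaded from your experiment config yaml
    db_engine=create_engine(db_url),        # connection to your project database
    project_path='/path/to/project/output'  # where matrices and trained models get stored
)
experiment.run()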