DSSG 2015 Bootcamp

Coalition Impact

405 W Superior Street, Chicago, IL 60654
May 27 - June 1, 2015
9:00 am - 5:30 pm

General Information

The goal of the DSSG Bootcamp is to ensure that every fellow can perform essential data science tasks with a standard set of tools. Fellows will get more done in less time and with less pain if they all know basic skills for data management, analysis, and visualization. This hands-on, collaborative workshop will cover basic concepts and tools as teams of learners work on a real-world data project for the City of Chicago. Participants are encouraged to help one another throughout the training and to apply what they have learned to their summer projects. We have adapted much of the material in this workshop from the fantastic course materials developed by Software Carpentry, Data Carpentry, and the Coalition for Open Data Education.

Instructors: Matt Gee, Rayid Ghani, Joe Walsh

Helpers: Dav Clark, Nick Eng, Carl Chan, Eric Potash

Who: The course is aimed at Data Science for Social Good fellows and other researchers.

Requirements: Participants must bring a laptop with a few specific software packages installed (listed below).

Contact: Please email mattgee@uchicago.edu for more information.


Resources

The main bootcamp task and resource list can be found on this hackpad.

The bootcamp project description and data are in this GitHub repository.

The example code for each of the learning modules is in this GitHub repository.

The team list is here.


Schedule

Wednesday 8:00 Optional computer setup session
9:00 Donuts and Coffee
9:30 Intro to Learning at DSSG
10:00 Introduction to Project Inspector Gadget (Project Github Repository)
10:30 Data Science Project Lifecycle
11:00 Break
11:15 Principles and Tools for Collaboration in Data Science
11:15 Team Breakout: Getting Started with Github
12:30 Lunch break
1:30 Getting data
2:00 Working with data on the command line
2:30 Working with data in databases using SQL
4:30 Daily wrapup / War Stories
Thursday 9:30 Recap of data science lifecycle / SQL Review
9:45 Data munging in Python
10:45 Data munging in R
11:45 Team breakout: Data description and exploration
12:45 Lunch break
1:30 Code review on github
2:00 Feature brainstorming
2:30 Team breakout: Feature generation
3:30 Modeling Part 1
5:00 Code Review / Daily Wrapup
Friday 9:00 Morning kickoff / Recap of data science lifecycle
9:30 Modeling Part 2: Introduction to evaluation
10:00 Team breakout: Evaluation
11:00 Code Review / Plan for model improvement
12:00 Lunch Speaker
1:00-8:00 Scavenger Hunt!
Monday 9:30 Morning Kickoff / Recap of data science lifecycle
10:00 Intro to AWS / Deploying models in production
11:00 Team breakout: Deploying models
12:00 Lunch break
1:00 Intro to visualization and communication
1:30 Team breakouts: Visualization and Communication
2:15 Intro to project management and documentation
2:45 Team breakouts: Project documentation
3:30 Break
3:45-5:30 Project Deep Dives

Setup

To participate in the DSSG Data Bootcamp, you will need working copies of the software described below. Please make sure to install everything (or at least to download the installers) before the start of your bootcamp.

Overview of the tools

Editor

When you're writing code, it's nice to have a text editor that is optimized for writing code, with features like automatic color-coding of keywords (syntax highlighting).

The Bash Shell

Bash is a commonly used shell. Working in a shell lets you combine and automate tasks, so you can do more with your computer in less time.

Python

Python is becoming very popular in scientific computing, and it's a great language for teaching general programming concepts due to its easy-to-read syntax. We teach with Python version 2.7, since it is still the most widely used. Installing all the scientific packages for Python individually can be a bit difficult, so we recommend an all-in-one installer.
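Once an all-in-one installer has finished, a quick way to confirm that the scientific packages are usable is to try importing them. The sketch below is only an illustration: the function name check_packages and the package list (numpy, pandas, matplotlib) are assumptions, not an official bootcamp requirement list, so adjust the names to whatever your instructors ask for.

```python
from __future__ import print_function

import importlib
import sys


def check_packages(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Assumed package list, for illustration only.
    wanted = ["numpy", "pandas", "matplotlib"]
    print("Python version:", sys.version.split()[0])
    for name, ok in sorted(check_packages(wanted).items()):
        print(name, "OK" if ok else "MISSING")
```

If any package is reported as MISSING, the installer probably did not complete or a different Python is first on your system path.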

IPython Notebook

The IPython Notebook is a web-based interface for interactive computing with Python. Individual notebooks are composable, executable, and sharable documents that mix text, code, data, and visualizations. The IPython Notebook comes pre-loaded with many all-in-one Python installers, such as Anaconda CE.

SQL

SQL is a specialized programming language used with databases. SQL is a declarative language: you describe (declare) the data you want from the database rather than the steps needed to retrieve it. At DSSG, we primarily use an open source database called PostgreSQL. You can learn more about it here.
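To make the idea of a declarative query concrete, here is a minimal sketch that runs one from Python. It uses the sqlite3 module from the Python standard library as a stand-in for PostgreSQL so that it runs without a database server. The inspections table and its rows are made up for illustration and are not the bootcamp data.

```python
from __future__ import print_function

import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (id INTEGER, ward INTEGER, result TEXT)")
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?)",
    [(1, 42, "pass"), (2, 42, "fail"), (3, 7, "fail"), (4, 7, "fail")],
)

# Declarative: the query states which rows and counts we want,
# and the database decides how to compute them.
query = """
    SELECT ward, COUNT(*) AS failures
    FROM inspections
    WHERE result = 'fail'
    GROUP BY ward
    ORDER BY failures DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # prints [(7, 2), (42, 1)]
conn.close()
```

Note that the query contains no loop and no counter variable. The same SELECT statement should also work in psql against a PostgreSQL table with these columns.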

Windows Installation

Python

  • Download and install Anaconda CE.
  • Use all of the defaults for installation, but make sure to check "Make Anaconda the default Python".

Editor

Notepad++ is a popular free code editor for Windows. Be aware that you must add its installation directory to your system path in order to launch it from the command line (or have other tools like Git launch it for you). Please ask your instructor to help you do this.
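To see whether a program's directory is already on your system path, you can search the path entries yourself. The following is a rough sketch, not an official setup step: the helper name find_on_path is hypothetical, and the program name "notepad++" is an assumption about how the executable is named on your machine.

```python
from __future__ import print_function

import os


def find_on_path(program, path=None):
    """Return the full path of `program` if a PATH directory contains it, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables usually carry an extension such as .exe.
    extensions = [""]
    if os.name == "nt":
        extensions += os.environ.get("PATHEXT", ".EXE").split(os.pathsep)
    for directory in path.split(os.pathsep):
        for ext in extensions:
            candidate = os.path.join(directory, program + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    # "notepad++" is the assumed program name; adjust it if yours differs.
    location = find_on_path("notepad++")
    print(location if location else "notepad++ was not found on PATH")
```

If the script reports that the program was not found, its installation directory still needs to be added to the path before the editor can be launched from the command line.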

Mac OS X Installation

Python

  • Download and install Anaconda CE.
  • Use all of the defaults for installation, but make sure to check "Make Anaconda the default Python".

Editor

We recommend TextWrangler or Sublime Text. In a pinch, you can use nano or vi, which should be pre-installed.

DBeaver

In addition to using psql from the command line, we will use DBeaver, a great open source tool for GUI-based interaction with databases. To install it, go here: