What to Expect
While the specifics will vary from project to project, the summer will follow roughly the structure in our high level summer plan.
Before the Summer
Prior to your arrival, we'll provide you with the prerequisites so you can familiarize yourself with the tools you’ll use over the summer, install them on your local machine, and equip yourself with the knowledge to be able to follow along with the curriculum. You'll receive a longer list of software to install before the first day of orientation, programming languages you should brush up on, and tools we suggest you use to manage your ML/Data workflow.
Learning about Projects and Partners
Staff at the Data Science and Public Policy team at CMU and DSSG alum volunteers have worked hard to recruit partners and scope projects. This is a lengthy, complicated process with plenty of logistical hurdles (think legal data sharing agreements and data transfer challenges), which means the list of project partners is usually not finalized until we get close to the beginning of the fellowship. We'll usually give you a list of potential projects a couple of weeks before the start and ask you to give us a list of your project preferences. We'll use that, the needs of each project, and other things we want to balance across the team members (skills, backgrounds, experience, disciplines) to create project teams (typically of four fellows each) during the first week of the program. Fellows typically get their project assignments at the beginning of week 2.
We want you to meet the people you’re working with face to face and get to know them in the beginning of the summer so you can work with them effectively for the rest of the program. We ask all our project partners to come meet their team in the second week of the summer. During partner visits, you’ll spend a lot of time talking with them about the problem and the data, and they will give a presentation to the fellowship. This also gives all of the fellows a chance to hear about all of the projects and for the partners to meet other project partners and the other fellows.
After orientation, you will spend the first part of the summer getting to know your project partner and their unique challenges. While the projects have already been (initially) scoped, you will almost certainly need to refine that scope throughout the summer. For example, we may know your partner’s goal is to find violations of a particular law. Your team would then work with the partner to narrow that to: (1) locations at risk of violations in general; (2) locations at risk of multiple violations; or (3) locations with the most impactful violations.
We believe it is important for you to thoroughly understand the problem, how the project partner is currently tackling it, the challenges associated with it, as well as the people and the processes that generate the data we are getting before we get deeper. A better understanding of your partner and the problems they face is crucial to knowing what the rows and columns in your data really mean. Defining your outcome and evaluation metrics will depend heavily upon this understanding. While we know you are eager to dig into the data, you will find that at least as much of your time is devoted to talking about the data as manipulating it. This is a good thing.
You will also be working with real data, which is messy! You will encounter missing values and things that don’t seem to make sense. Talking to your partners about their existing process and sharing with them what you see from analyzing their data will help you judge your own understanding of their problem, data, and processes, reconcile inconsistencies (or carry on being aware of them), and identify whether the errors lie in the data itself or in the information you were provided.
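As a rough illustration of the kind of first check that tends to raise useful questions for a partner, the sketch below counts missing values per column. It is only an example: the function name `missing_value_counts`, the set of markers treated as missing, and the sample records are all hypothetical and not taken from any DSSG project.

```python
from collections import Counter


def missing_value_counts(rows, missing_markers=(None, "", "NA", "N/A")):
    """Count, per column, how many rows hold a value treated as missing."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value in missing_markers:
                counts[column] += 1
    return dict(counts)


# Hypothetical inspection records, invented for illustration only.
records = [
    {"site_id": "A1", "inspection_date": "2021-03-04", "violations": 2},
    {"site_id": "A2", "inspection_date": "", "violations": None},
    {"site_id": "A3", "inspection_date": "NA", "violations": 0},
]

summary = missing_value_counts(records)
print(summary)
```

A count like this does not say why a value is missing. Whether an empty `violations` field means "no violations found" or "never inspected" is exactly the kind of question to bring to your partner.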
Working on Your Project
Although we aim to have all the data from project partners ready in advance of the fellowship, there are inevitably data transfer delays and partial data that will continue to be augmented throughout - and sometimes after - the fellowship. As you explore, you'll find holes in the provided information, or identify potential new useful sources of data, and will need to work with your partner to decide whether it’s possible to acquire the data you need in the time that you have; that’s the reality of working with real world partners and sensitive data!
Fellows drive the work of every project, learning about the domain and problem they're tackling in depth, writing code, analyzing data, and collaborating with their project and community partners to develop a useful and usable system. Over the course of the summer, your team will:
- Explore the (real) data your partners have provided
- Design your project workflow based on what tools you'll use and how your team works together
- Identify user stories to make sure what you're creating has a real purpose
- Develop a machine learning pipeline to turn raw data into analysis that can inform decisions
- Build relevant models that reflect the subject you're analyzing as closely as possible
- Add features to your model based on subject matter expertise, available data, and exploratory analysis
- Evaluate model performance using the metrics that make sense for your project
- Create an interface for your partner to use your results (API, dashboard applications, etc)
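To make the evaluation step above more concrete, here is a minimal sketch of one metric that can fit projects where a partner can only act on a limited number of cases: the precision among the top k highest-scored items. The choice of metric, the function name `precision_at_k`, and the scores and labels are assumptions made for illustration; the right metric for your project depends on how your partner will use the results.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored items whose label is 1."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of items")
    # Rank items from highest to lowest score, keeping each label attached.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / k


# Hypothetical model scores and true outcomes (1 = violation found).
scores = [0.91, 0.15, 0.78, 0.40, 0.66]
labels = [1, 0, 0, 0, 1]

print(precision_at_k(scores, labels, k=2))
print(precision_at_k(scores, labels, k=3))
```

Here k stands in for the partner's capacity, for example the number of sites they can visit in a month, which is why the metric has to be chosen together with the partner rather than picked by default.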
Presentations
We believe that our work has impact if we are able to communicate what we do and why it’s important to our partners, peers, and the general public. As such, an important piece of this training program is learning to present the work you do.
Each week, a member of your team will give a 3-5 minute update to the entire fellowship, outlining your recent progress and findings, giving shoutouts to other fellows or staff members who have helped you along the way, and flagging things you're stuck on and need help with. Two teams per week (so each team presents 2-3 times throughout the summer) will give a longer 20-minute "deep dive" presentation, outlining more technical components of your project and seeking feedback from other fellows and mentors.
At the end of the summer…
At the end of the summer, your team will develop two polished presentations: one 5-minute presentation for the DataFest final event, and one 20-minute technical presentation for use at a local tech meetup at the end of the summer and for future presentations at conferences or your home institution.
In the last few weeks of the fellowship, each team will present at a local meetup. Each team will elect one team member to deliver the short final presentation at DataFest; however, all team members should feel comfortable delivering all presentations. Our communications staff will work with you on both of these presentations, brainstorming ways to present your work and providing feedback on your delivery.
Wrap-Up and Handovers
To make sure the work you do this summer has real impact, a lot more work needs to be done after the official end of the fellowship; some of this will be done by your project partner, and some will be done by the Data Science and Public Policy team at Carnegie Mellon University. You will need to transition the work over to your project partners so they can validate, implement, and extend your work. To do this, you will have to document your work throughout the summer and wrap it up neatly at the end of the summer.
We ask that you prepare a poster to be displayed at DSSG events and for potential conference poster sessions, a technical report, and an outline of a paper. This makes it easier to collaborate once you and your teammates are no longer working at the same desk every day. We also ask that all your code is tested to run on a new machine, and that there is sufficient documentation for someone else to replicate and understand your work.
Curriculum
Our number one goal is to train the fellows to do data science for social good work. Here is some insight into how we accomplish this throughout the summer.
To look through all of our curriculum materials, please see the curriculum section.
Orientation
We expect that every incoming fellow has experience programming in Python, a basic working knowledge of statistics and social science, and an interest in doing social good work. However, we understand that everyone comes from a different background, so to ensure that everyone is able to contribute as a productive member of the team and the fellowship, we start the first few weeks off with an intensive orientation (see sample from 2022), getting everyone "up to speed" with the basic skills and tools they'll need.
Typical schedule
- Week One
- Software Setup
- Pipelines and Project Workflow
- Git and Github
- Making the Fellowship
- Skills You Need to do DSSG
- Command Line Tools
- Project Management, Partners, and Communications
- Data Exploration in Python
- Project Scoping Intro
- Week Two
- Week Three
- Reproducible ETL
- The Work We Do
- Record Linkage
- Databases
- Quantitative Social Science
Ongoing Curriculum
Training continues throughout the summer in the form of "lunch and learns" - less formal lessons over lunch - and teachouts by staff or fellows who have relevant specializations. Sometimes we ask for volunteers to do a teachout on a topic we think is important, like data visualization or inference with observational data, and a few fellows will work together to put together a lesson. Sometimes a DSSGer will suggest a topic that they have a pet interest in, or that they think will be relevant to one or more of the summer projects. We have lunch and learns scheduled twice a week through the summer, and some fellows choose to offer optional teachouts at the end of the workday.
Although we don't expect all teams to be working in unison, there is a general structure to the summer that guides how we pace the remaining curriculum - we try to schedule topics so that fellows know about them with enough time to incorporate them into their projects, but not so early that they've forgotten about what they learned by the time the knowledge would be useful. As we get nearer to the end of the summer, there are fewer required topics, so there are more open time slots for fellows to do teachouts.
- Example topics for the Rest of the Summer
- Educational Data and Testing
- Social Good Business Models
- Basic Web Scraping
- ML Pipelines and Evaluation
- Feature Generation Workshop
- Test, Test, Test
- Beyond the Deep Learning Hype
- Causal Inference with Observational Data
- Model Evaluation
- Spatial Analysis Tools
- Operations Research
- Theory and Theorizing in the Social Sciences
- NLP
- Image Analysis
- Presentation Skills
- Data Visualization
- Natural Language Processing
- Opening Closed Data
- Dealing with Fairness and Bias in ML systems
- Explainability and its use in ML and Social Good problems
- Responsible AI/ML
- Deeper dive into social issues including poverty, homelessness, food insecurity, education, public health.
- Working with community members to develop ML systems that affect them