Getting and Keeping Data
Data comes in many forms, from many sources:
- you may get a database dump or backup directly from a project partner
- you may get a load of Excel files
- you may get a set of CSVs (see the tutorial on Getting data from CSVs into a Database)
- you may need to scrape data from the web (see Basic Web Scraping)
Regardless of what you get, once you have your hands on some data, you'll need to bring it into a consolidated database and start formatting it so that you can use it for analysis. Command Line Tools will start to come in handy here. If your data is in a format that resembles CSV, these instructions will be helpful.
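As a rough illustration of bringing CSV-like data into a database, here is a minimal sketch using only Python's standard library. It assumes SQLite as a stand-in for whatever database your project actually uses, and the function name `load_csv`, the table name `raw_payments`, and the sample data are all hypothetical. Every column is loaded as text, so type conversion is left to a later cleaning step.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Load rows from an open CSV file into a new SQLite table; return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first line is assumed to hold the column names
    # Quote identifiers so column names containing spaces or quotes survive.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = '"{}"'.format(table.replace('"', '""'))
    conn.execute("CREATE TABLE {} ({})".format(quoted_table, columns))
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders), rows
    )
    conn.commit()
    return len(rows)


# Hypothetical sample data standing in for a real file on disk.
sample = io.StringIO("id,name,amount\n1,alice,10.5\n2,bob,7\n")
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
loaded = load_csv(conn, "raw_payments", sample)
total = conn.execute(
    'SELECT SUM(CAST(amount AS REAL)) FROM "raw_payments"'
).fetchone()[0]
print(loaded, total)
```

Loading everything as text first keeps the raw values intact, so any cleaning or casting can happen afterwards as an explicit, repeatable step rather than silently at load time.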
You'll definitely want to keep track of the steps you took to go from raw data to model-ready data (Reproducible ETL) so you can incorporate that as part of your larger pipeline.
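One way to keep track of those steps is to write each transformation as a small named function and run them in a fixed order, logging what each one did. The sketch below is only an illustration of that idea, not a prescribed ETL framework: the step functions, the `amount` column, and the sample rows are all hypothetical.

```python
import hashlib
import json


def strip_whitespace(rows):
    """Trim surrounding whitespace from every value."""
    return [{k: v.strip() for k, v in row.items()} for row in rows]


def drop_missing_amount(rows):
    """Remove rows whose amount is empty."""
    return [row for row in rows if row["amount"] != ""]


def amount_to_float(rows):
    """Convert the amount column from text to float."""
    return [dict(row, amount=float(row["amount"])) for row in rows]


# The ordered list of steps is the record of how raw data becomes clean data.
STEPS = [strip_whitespace, drop_missing_amount, amount_to_float]


def run_pipeline(raw_rows, steps=STEPS):
    """Apply each step in order and return (clean_rows, log)."""
    # Fingerprint the raw input so a later run can confirm it started from the same data.
    fingerprint = hashlib.sha256(
        json.dumps(raw_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()
    log = [{"step": "raw", "rows": len(raw_rows), "sha256": fingerprint}]
    rows = raw_rows
    for step in steps:
        rows = step(rows)
        log.append({"step": step.__name__, "rows": len(rows)})
    return rows, log


# Hypothetical raw rows, as they might come out of a CSV load.
raw = [
    {"id": "1", "amount": " 10.5 "},
    {"id": "2", "amount": ""},
    {"id": "3", "amount": "7"},
]
clean, log = run_pipeline(raw)
for entry in log:
    print(entry["step"], entry["rows"])
```

Because the steps live in code rather than in someone's memory, rerunning the script on the same raw data should reproduce the same model-ready output, and the per-step row counts make it easier to spot where records were dropped.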
DSSG projects often involve sensitive data, so it's important to be aware of some basic principles of data security as well (see Data Security Primer).