# Reproducible ETL
## Motivation
- Understanding what you did: Save all the steps you took so you can tell how you got here.
- Using what you did: Use and re-use code for this project or for others. Fix errors easily. Import new data with confidence. GET IT IMPLEMENTED!
This session builds on what you learned last week in the CSV to DB session.
## Potential Teachouts
- Proprietary-database transfers (e.g. SQL Server, Oracle)
## Concepts
Many people associate data science with fancy machine-learning algorithms, but ETL is arguably more important.
ETL:
- Extract: Get data from the source, e.g. a CSV the partner gave you.
- Transform: Get the data into the format you want/need, e.g. standardize missing values (see the sketch after this list).
- Load: Get the data into the database.
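To make the transform step concrete, here is a minimal sketch in SQL. The table name, column name, and the -9999 missing-value code are assumptions for illustration, not a real schema from this session.

```sql
-- Hypothetical transform step: standardize missing values.
-- The table, column, and -9999 sentinel are illustrative assumptions.
UPDATE weather_raw
SET temperature = NULL
WHERE temperature = -9999;  -- replace the source's missing-value code with SQL NULL
```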
There are two reasons why ETL matters so much:
- The rest of your analysis depends on your ETL. For example, you might ignore some of the most important cases if you drop rows with missing values.
- Better data can help more than better methods.
So you should do ETL well:
- Reliably
- Understandably
- Preferably automatically
Tools:
- Code is typically better than GUIs. Code can be automated.
- All else being equal, command-line tools are good choices. They are time tested and efficient.
  - `make` (a command-line tool written to compile software efficiently)
- If you can't save the code, save the notes, e.g. record how you used Pentaho to transfer an Oracle database to PostgreSQL.
## Examples
### Hitchhiker's Guide Weather Example
Remember the weather example? Let's make sure it's reproducible. I stored the code in two files: a `Drakefile`, which tells `drake` what to run, and `jwalsh_table.sql`, which drops `jwalsh_schema.jwalsh_table` if it exists, creates the table using our statement from the CSV-to-DB session, then copies the data.
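A rough sketch of what such a script could look like (the column names, types, and CSV filename here are made up; the real `jwalsh_table.sql` may differ):

```sql
-- Hypothetical sketch of jwalsh_table.sql; columns and filename are assumptions.
DROP TABLE IF EXISTS jwalsh_schema.jwalsh_table;

CREATE TABLE jwalsh_schema.jwalsh_table (
    station_id text,
    obs_date   date,
    tmax       numeric,
    tmin       numeric
);

-- \copy is a psql meta-command, so run this script with psql -f
\copy jwalsh_schema.jwalsh_table from 'weather.csv' with csv header
```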
To run it, make sure you specify the PostgreSQL environment variables in `default_profile`, then type `drake` while in this directory.
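`drake` looks for a `Drakefile` in the working directory. A hypothetical one for this example might be as small as the following (the `weather.csv` and marker-file names are assumptions):

```
; Hypothetical Drakefile; the real one may differ.
; Re-run the SQL script whenever it or the source CSV changes.
jwalsh_table.loaded <- jwalsh_table.sql, weather.csv
    psql -f jwalsh_table.sql && touch $OUTPUT
```

Because `drake` compares the timestamps of inputs and outputs, re-running it when nothing has changed does no work, and changing either input triggers a rebuild.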
I've run this code many times without error, and I feel pretty confident that it will continue to run without error for a while.
Because we wrote some decent ETL code, we don't need to start from scratch. We can borrow the code for this project. (Some of this code originated with the lead project.)
Let's say NOAA changes the format of the weather file. This code will throw an error when we try to run it. We don't need to start from scratch. We can simply modify `jwalsh_table.sql` to match the new format, re-run the code without error, and enjoy the up-to-date data.
### ETL Tests
You run the risk of losing or corrupting data with each step. To ensure that you extracted, transformed, and loaded the data correctly, you can run simple checks. Here are a few (a sketch of example queries follows this list):
- If you're copying a file, check the hash before and after. They should match.
- Count rows. If they should match before and after, do they?
- Check aggregate statistics, such as the sum of a column. If they should match before and after, do they?
- If you receive a database dump, count the number of schemas/tables/sequences/etc. in the origin and destination databases. Do they match? For example, we ask partners who use Oracle to run the `count_rows_oracle.sql` script, which gives the number of rows and columns in the origin database. It's one (necessary but not sufficient) way to check that we got all the data.
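As a sketch of what the destination-side checks could look like in PostgreSQL (table and column names are made up), run queries like these and compare the results with the same figures computed on the source, or with the partner's script:

```sql
-- Hypothetical checks; compare against the same figures computed on the source.
SELECT count(*) FROM jwalsh_schema.jwalsh_table;   -- row count

SELECT sum(tmax) AS tmax_sum,
       avg(tmin) AS tmin_avg
FROM jwalsh_schema.jwalsh_table;                   -- aggregate statistics
```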
### Lead Project
Well-developed ETL in `input/`.
### Sanergy Project
`input/Drakefile` calls an R script. The repository is here.
### What If You Have to Use Points and Clicks?
See last year's police team Oracle directions.