# Command Line Tools
## Motivation

As data scientists, we often receive data in text-based files. We need to explore these files to understand what they contain, we need to manipulate and clean them, and we need to handle them on our file system. The most robust way to do this, even with large files, is the command line.

Command line tools are the data scientist's swiss army knife. They are versatile, portable, and have plenty of functions that you may not be quite sure how to use, but that you're sure will be useful at some point. From helping you obtain, clean, and explore your data, to helping you build models and manage your workflow, command line tools are essential to every well-built data science pipeline. They will be used throughout DSSG and should be your starting point as you build your data science toolkit.
## Slides
Here's the presentation from a previous workshop.
## The basics

### Where am I?

- `pwd`: "print working directory"; prints the name of the current working directory
- `cd ..`: changes directory to one level/folder up
- `cd ~/`: goes to the home directory
- `cd -`: returns to the previous directory
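For example, a quick round trip (the paths here are purely illustrative):

```bash
$ pwd        # /home/you/projects   (illustrative path)
$ cd ..      # up one level
$ pwd        # /home/you
$ cd -       # back to the previous directory
$ pwd        # /home/you/projects
```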
### What's in my folder?

- `ls`: lists the contents of your current directory
- `ls -l`: "long listing" format (`-l`) shows the file size, date of last change, and file permissions
- `ls -la`: "long listing" format (`-l`), shows all files (`-a`) including hidden dotfiles
- `tree`: lists the contents of the current directory and all sub-directories as a tree structure (great for peeking into folder structures!)
- `tree -L 2`: limits the tree expansion to 2 levels
- `tree -hs`: shows file sizes (`-s`) in human-readable format (`-h`)
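The long-listing output takes some practice to read. Here is a made-up example line with the columns labelled:

```bash
$ ls -lh 2016.csv
-rw-r--r-- 1 you staff 1.1G Jun 14 09:30 2016.csv
# permissions, link count, owner, group, size, last-modified date, name
```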
### What's in my file?

- `head -n10 $f`: shows the "head" of the file, in this case the top 10 lines
- `tail -n10 $f`: shows the "tail" of the file
- `watch -n1 tail -n10 $f`: watches the tail of the file for changes, re-running every second (`-n1`); note that `watch` takes the command as an argument rather than reading from a pipe
- `tail -f -n10 $f`: follows (`-f`) the tail of the file every time it changes, useful if you are checking the log of a running program
- `less $f`: paginated viewer for the contents of a text file
- `wc $f`: counts lines, words, and characters in a file (get the separate counts with `-l`, `-w`, or `-c`)
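These compose nicely. For instance, a sketch for peeking at lines 11 to 20 of a file:

```bash
$ head -n20 $f | tail -n10   # first 20 lines, then the last 10 of those: lines 11-20
```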
### Where is my file?

- `find -name "<lost_file_name>" -type f`: finds files by name
- `find -name "<lost_dir_name>" -type d`: finds directories by name
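`find` can also run a command on every match. For example, a sketch that line-counts every CSV under the current directory:

```bash
$ find . -name "*.csv" -type f -exec wc -l {} +   # runs wc -l on all matched files
```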
### Renaming files

Rename files with `rename`. For example, to replace all spaces with underscores:

`rename 's/ /_/g' space\ bars\ .txt`

This command substitutes (`s`) spaces (`/ /`) with underscores (`/_/`) in the entire file name (globally, `g`). (The 3 slashes can be replaced by any sequence of 3 characters, so `'s# #_#g'` would also work and can sometimes be more legible, for example when you need to escape a special character with a backslash.)

You can replace multiple characters at a time by using a character class such as `[ |?]`, which matches every space, pipe, or question mark. (Note that inside square brackets `|` is a literal pipe rather than a logical OR, so `[ ?]` is enough if you only care about spaces and question marks.)

`rename 's/[ |?]/_/g' space\ bars?.txt`

(The file will be renamed to `space_bars_.txt`.)

Bonus points:

- `rename 'y/A-Z/a-z/'` renames files to all-lowercase
- `rename 'y/a-z/A-Z/'` renames files to all-uppercase
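The perl-based `rename` found on many systems also accepts `-n` ("no act"), which prints what would be renamed without changing anything; check `man rename` to see if yours supports it:

```bash
$ rename -n 's/ /_/g' *.txt   # preview the renames; nothing is touched
```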
## Some useful things to know

- Be careful what you wish for: the command line is very powerful and will do exactly what you ask. This can be dangerous when you're running commands like `rm` (remove) or `mv` (move). You can `echo` your commands to print the command text without actually running the command (see the sketch after this list).
- Use tab completion to type commands faster and find filenames: press the `tab` key whilst typing to see suggestions.
- Prepend `man` to a command to read the manual, for example `man rm`.
- You can use `ctrl + r` to search the command line history for previously run commands, or type `history` to see the history.
- Beware of spaces when creating filenames; this is generally not good practice, but if you must, you can use the `\` escape character to add blank spaces in a file name, for example `touch space\ bars\ .txt`. If you run `touch space bars .txt`, this will create three files: `space`, `bars`, and `.txt`.
- Have a look into using `screen` or `tmux` for keeping processes alive and working with multiple terminals (see further reading living-in-the-terminal).
- Use `htop` for monitoring the usage of your instance (usage guide).
- Have a go at learning the basics of `vim`, since it is ubiquitous on unix servers (see further reading living-in-the-terminal).
- Check out some tutorials on regular expressions if you are not already familiar with them.
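To make the first point above concrete, here is the `echo` trick applied to a destructive command:

```bash
$ echo rm *.csv   # prints the command with the glob expanded; deletes nothing
$ rm *.csv        # run for real only once the expansion looks right
```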
## Command Line for Data Science - Let's talk about the weather
As an exercise, let's take a shot at creating our own weather predictions using data from NOAA.
You can find daily data from 2016 for the US here:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
(The documentation is here)
### Getting Data from the Command Line

First we have to get the data. For that we're going to use `curl`.
$ curl ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
Whoa! The terminal is going crazy! This may impress your less savvy friends, but it's not going to help you answer your question. We need to stop this process. Try control-c. This is the universal escape command in the terminal.
We obviously didn't use curl right. Let's look up the manual for the command using `man`.
$ man curl
Looks like if we want to write this to a file, we've got to pass the `-O` argument.
$ curl -O ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
Let's check to see if it worked.
$ ls -lah
Great. Now we need to know the file format so we know what tool to use to unpack it.
$ file 2016.csv.gz
Looks like it's a gzip, so we'll have to use `gunzip`.
$ gunzip 2016.csv.gz
$ ls -lah
Now we've got a .csv file we can start playing with. Let's see how big it is using `wc`.
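For example, counting lines gives a quick sense of its size:

```bash
$ wc -l 2016.csv   # number of lines in the file
```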
### Viewing Data from the Command Line
The simplest streaming command is `cat`. This dumps the whole file, line by line, into standard out and prints it.
$ cat 2016.csv
That's a bit much. Let's see if we can slow it down by viewing the file page by page using `more` or `less`.
$ less 2016.csv
Great. But let's say I just want to see the top of the file to get a sense of its structure. We can use `head` for that.
$ head 2016.csv
$ head -n 3 2016.csv
Similarly, if I'm only interested in viewing the end of the file, I can use `tail`.
$ tail 2016.csv
These commands all print things out raw and bunched together. I want to take advantage of the fact that I know this is a csv to get a prettier view of the data. This is where `csvkit` starts to shine. The first command we'll use from csvkit is `csvlook`.
$ csvlook 2016.csv
But that's everything again. We just want to see the top. If only we could take the output from `head` and send it to `csvlook`.
We can! It's called piping, and you do it like this:
$ head 2016.csv | csvlook
The output from `head` was sent to `csvlook` for processing. Piping and redirection (more on that later) are two of the most important concepts to keep in mind when using command line tools. Because most commands use text as the interface, you can chain commands together to create simple and powerful data processing pipelines!
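As a taste of what pipelines can do, here is a sketch that tallies the most common station IDs in the first 1,000 rows, using `cut` (introduced in the next section) to pull out the first comma-separated field:

```bash
$ head -n 1000 2016.csv | cut -d , -f 1 | sort | uniq -c | sort -rn | head
```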
### Filtering Data from the Command Line
It looks like in order for us to make sense of the weather dataset, we're going to need to figure out what these station numbers mean. Let's grab the station dictionary from NOAA and take a look at it.
$ curl -O https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
$ head ghcnd-stations.txt
Looks like the station description might come in handy. We want to look at just the stations in Chicago.
$ grep CHICAGO ghcnd-stations.txt | csvlook -H
Let's pick OHARE as the station we'll use for now. Its ID is 'USW00094846'.
Let's take a look at just the ID column from the weather file. We can do this using `cut`.
$ cut -f 1 2016.csv
Looks like `cut` isn't smart enough to know that we're using a csv. We can either use `csvcut`, or pass a delimiter argument that specifies comma.
$ cut -d , -f 1 2016.csv | head
Now let's filter out just the observations from OHARE.
$ cut -d , -f 1 2016.csv | grep USW00094846 | head
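`grep`'s `-c` flag counts matching lines instead of printing them, which makes for a quick sanity check:

```bash
$ grep -c USW00094846 2016.csv   # how many O'Hare records are in the file?
```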
Another powerful tool that can do filtering (and much more) is `awk`. `awk` treats every file as a set of row-based records and allows you to create condition/{action} pairs for the records in the file. The default {action} in `awk` is to print the records that meet the condition. Let's try reproducing the above statement using `awk`.
$ cut -d , -f 1 2016.csv | awk '/USW00094846/' | head
`awk` requires familiarity with regular expressions for conditions and has its own language for actions, so `man` and Stack Overflow will be your friends if you want to go deep with `awk`.
### Editing and Transforming Data

Let's say we want to replace values in the files. PRCP is confusing. Let's change PRCP to RAIN.

To do this, we use `sed`. `sed` stands for stream editor, and it is very useful for editing large text files because it doesn't have to load all the data into memory to make changes. Here's how we can use `sed` to replace a string.
$ sed s/PRCP/RAIN/ 2016.csv | head
Notice the strings have changed!
But when we look at the source file
$ head 2016.csv
Nothing has changed. That's because we didn't write it to a file. In fact, none of the changes we've made have been written anywhere.
$ sed s/PRCP/RAIN/ 2016.csv > 2016_clean.csv
$ head 2016_clean.csv
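(GNU `sed` can also edit a file in place with `-i`, but there is no undo, so redirecting to a new file as above is the safer habit. `somefile.csv` below is just a placeholder.)

```bash
$ sed -i s/PRCP/RAIN/ somefile.csv   # placeholder file; edited in place, no undo
```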
We can also use awk for substitution. This time, let's replace "WSFM" with "WINDSPEED" in the weather file, using awk's `gsub` function for the global substitution. Once again, Stack Overflow is your friend here.

$ awk '{gsub(/WSFM/, "WINDSPEED"); print;}' 2016.csv | head
### Group Challenges

For group challenges, log onto the training ec2 instance and change directories to /mnt/data/training/yourusername. This should be your working directory for all the exercises.

1. Create a final weather file that just has weather data from OHARE airport for days when it rained, and change PRCP to RAIN. Save the sequence of commands to a shell script so it's replicable by your teammate, and push it to a training repository you've created on github.
2. Create a separate file with just the weather from OHARE for days when the temperature was above 70 degrees F. (Hint: try using csvgrep to filter a specific column on a range of values.)
3. Get ready to explore the relationship between weather and crime in Chicago. Using crime data from 2016 (below), parse the json and convert it to a csv. Explore the fields and cut the dataset down to just day, location, and crime type. Then subset the dataset to just homicides and save as a new file.
   https://data.cityofchicago.org/resource/6zsd-86xi.json
4. Using just command line tools, can you use the lat and long coordinates of the weather stations to rapidly identify which weather station is closest to the DSSG building?
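For challenge 1, a minimal shell-script skeleton might look like this (the pipeline itself is left to you):

```bash
#!/bin/bash
set -euo pipefail   # stop on errors, unset variables, and broken pipes

# 1. keep only the OHARE (USW00094846) rows for days when it rained
# 2. change PRCP to RAIN
# 3. write the result to a new file
```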
## Cheat Sheet

We're going to cover a variety of command line tools that help you obtain, parse, scrub, and explore your data. (The first few steps toward being an OSEMN data scientist.) Here's a list of commands and concepts we'll cover:

- Getting to know you: navigating files and directories in the command line
  - `cd`, `mkdir`, `ls`, `file`, `mv`, `cp`, `rm`, `find` (bonus)
- Getting, unpacking, and parsing Data
  - `curl`, `wget` (bonus), `gunzip`, `tar` (bonus), `in2csv`, `json2csv` (bonus)
- Exploring Data
  - `wc`, `cat`, `less`, `head`, `tail`, `csvlook`
- Filtering, Slicing, and Transforming
  - `grep`, `cut`, `sed`, `awk`, `csvgrep`, `csvcut`, `csvjoin` (bonus), `jq` (bonus; sed for JSON)
- Exploring & Summarizing
  - `csvstat`
- Writing shell scripts
## Further Resources

Jeroen Janssens literally wrote the book on data science in the command line. Also, check out his post on 7 essential command line tools for data scientists.
For command line basics, Learning CLI the Hard Way is, as always, a great resource.
## Potential Teachouts

`tmux`: Getting your command line organized. `tmux` is a great way to manage many environments at once. Give it a shot!
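A few commands to get started (the session name is just an example):

```bash
$ tmux new -s dssg      # start a named session
#   ...work, then detach with ctrl-b followed by d...
$ tmux attach -t dssg   # reattach later, even from a new ssh connection
```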