The *nix command line, although invented decades ago, is an amazing environment for doing data science. By combining small, yet powerful, command-line tools we can really explore our data and quickly hack together prototypes. The recent addition of tools such as GNU Parallel, jq, and Drake further enables us to be more productive and more efficient data scientists. Installing these command-line tools and setting up an efficient environment is, unfortunately, not straightforward.
In the first part of this talk I will describe how the command line can be used for doing data science. The focus will be on common operations for obtaining, scrubbing, and exploring data. I will walk through an example where we scrape a data set from a Wikipedia page using more modern command-line tools.
In the second part I will present a new open-source project called the Data Science Toolbox (http://datasciencetoolbox.org), which is a virtual environment that allows you to get started doing data science in minutes. It comes with commonly used software for data science and allows for easy installation of additional tools. Because the Data Science Toolbox runs on top of VirtualBox, it can be installed not only on Linux, but also on Mac OS X and Microsoft Windows.
Once you have a solid environment, it is worthwhile to further customize it to your own needs. In the third part of the talk I will explain how to (1) make your environment more efficient and (2) create reusable command-line tools from one-off commands or from existing code in, for example, Python and R.
By the end of this talk you will have a solid understanding of how to leverage the power of the command line for your next data science project.