Data Science at the Command Line

January 21st, 2015
Data Science Consulting
Héloïse Nonne
@heloisenonne, hnonne@quantmetry.com
Paris Data Geek @ AXA

$ Let’s stop clicking and start typing
The CLI is 40 years old and counting
A modern tool!

$ Why use the command line?
• Agile and interactive: read–eval–print loop (REPL) vs. edit-compile-run-debug cycle
• Close to the filesystem
• A good tool for the lazy: automates repetitive tasks
• Good for integration with other technologies: C/C++, Python, Perl, R, Ruby, etc.
• Create your own tools
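"Create your own tools" can be as simple as wrapping a pipeline you keep retyping in a shell function. A minimal sketch; the function name `topn` and the sample input are hypothetical, chosen only for illustration:

```shell
# Hypothetical helper: print the N most frequent lines read from standard input
# (defaults to 10 when no argument is given).
topn() {
  sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage: find the single most frequent value in a small made-up stream.
most=$(printf 'b\na\nb\nc\nb\na\n' | topn 1 | awk '{print $2}')
echo "$most"
```

Once defined in a shell startup file, such a function behaves like any other filter and can sit in the middle of a longer pipeline.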

Do we really need Hadoop to process a few GB of data?
1.75 GB of data: 2 million chess games
• Hadoop: 26 minutes (1.14 MB/sec)
• Bash, local: 12 seconds (270 MB/sec)
Source: Adam Drake, aadrake.com 2014
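The kind of pipeline behind such a benchmark can be sketched in a few lines. This is an illustration only, not the exact commands from the cited article: it assumes PGN-style input in which each game carries a `[Result "..."]` line, and the sample file below is made up.

```shell
# Hypothetical sample data: three games in a PGN-like format (made up for illustration).
workdir=$(mktemp -d)
cat > "$workdir/games.pgn" <<'EOF'
[Event "Game 1"]
[Result "1-0"]
[Event "Game 2"]
[Result "0-1"]
[Event "Game 3"]
[Result "1-0"]
EOF

# Keep only the result lines, then tally white wins, black wins and draws in one awk pass.
summary=$(grep -h 'Result' "$workdir"/*.pgn |
  awk '/1-0/ {white++} /0-1/ {black++} /1\/2-1\/2/ {draw++}
       END {print white+0, black+0, draw+0}')
echo "$summary"   # white wins, black wins, draws
```

Everything is streamed: no stage needs the whole data set in memory, which is a large part of why such a pipeline stays fast on a few GB.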

Streaming at the command line
• Spout: <
• Bolt: |
• Sink: >
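The three operators combine in a single command. A minimal sketch, assuming hypothetical file names `input.txt` and `output.txt` created in a temporary directory:

```shell
workdir=$(mktemp -d)
printf 'banana\napple\ncherry\n' > "$workdir/input.txt"

# "<" feeds the file in (spout), "|" hands the stream to the next tool (bolt),
# ">" writes the final result to a file (sink).
tr 'a-z' 'A-Z' < "$workdir/input.txt" | sort > "$workdir/output.txt"
cat "$workdir/output.txt"
```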

MapReduce at the command line: word count
• Mapper: grep -oE '[a-zA-Z]{2,}'
• (Shuffle) & Sort: sort
• Reduce: uniq -c
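Chained together, the three stages give a complete word count. A minimal sketch on a made-up sentence; the final `sort -rn`, which ranks words by frequency, is an addition for readability and not part of the slide:

```shell
text='the cat and the dog and the bird'

# Map: emit one word (2+ letters) per line.
# Shuffle/sort: bring identical words next to each other.
# Reduce: count each run of identical words.
counts=$(printf '%s\n' "$text" | grep -oE '[a-zA-Z]{2,}' | sort | uniq -c | sort -rn)
echo "$counts"
```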

We have many jobs to run and 4 CPUs
• Naive parallelization
• GNU parallel: spawns a new process when one finishes, so all CPUs remain active
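The "start a new job as soon as one finishes" behaviour can be sketched as below. To keep the example runnable without extra installs it uses `xargs -P` as a stand-in, with the GNU parallel form shown in a comment; the eight job files are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical jobs: eight small input files, one compression job per file.
for i in 1 2 3 4 5 6 7 8; do
  printf 'line\n' > "$workdir/job$i.txt"
done

# Run at most 4 jobs at once; a new one starts as soon as a slot frees up.
# With GNU parallel installed, the equivalent would be:
#   parallel -j 4 gzip ::: "$workdir"/job*.txt
printf '%s\n' "$workdir"/job*.txt | xargs -P 4 -n 1 gzip
ls "$workdir"
```

Splitting the jobs into four fixed batches instead would leave CPUs idle whenever one batch finishes early, which is the weakness of the naive approach.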

csvcut, csvsort, csvstack, csvjoin, csvstat, gnuplot
lowercase, regex, sed, tr, csvcut, cut, awk, sort, uniq
curl, in2csv, sql2csv, scrape, jq
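Several of the standard tools in this list chain naturally into a quick scrub-and-count. A minimal sketch on a made-up CSV file (the data and column layout are hypothetical); the csvkit tools are left out here only because they need a separate install:

```shell
workdir=$(mktemp -d)
# Hypothetical data: name,city with inconsistent capitalisation.
cat > "$workdir/people.csv" <<'EOF'
name,city
Ann,Paris
Bob,PARIS
Cid,Lyon
EOF

# Drop the header, keep column 2, lowercase it, then count rows per city.
cities=$(sed '1d' "$workdir/people.csv" | cut -d, -f2 | tr 'A-Z' 'a-z' |
  sort | uniq -c | awk '{print $2 "=" $1}')
echo "$cities"
```

Note that `cut -d,` splits on every comma, so this only holds for simple files without quoted fields.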

$ Machine learning at the command line

mlpack
$ linear_regression --input_file dataset.csv --test_file predict.csv -v

dbacl
Don’t Be Afraid of the Command Line?

Online learning with Vowpal Wabbit

The command line is good for
• Starting a data project before moving to Hadoop, Spark, …
• Data discovery
• Data cleaning
• Doing some efficient machine learning (online, C)
• Model / feature discovery

What next?
• Online learning
• Benchmark with bigger data
• Hadoop (Hive) vs CLI
• Benchmark of Machine learning at the CLI
• CLI tools vs Python / R

$ Bibliography
• William Shotts
• Jeroen Janssens
• The man pages!
