This workshop presents how to combine command line tools to quickly query, transform and model data. The goal is to show that command line tools handle reasonably sized data efficiently and can accelerate the data science
process. We will show that in many instances, command line processing ends up being much faster than 'big data' solutions. The content
of the workshop is derived from the book of the same name (http://datascienceatthecommandline.com/). In addition, we will cover
vowpal wabbit (https://github.com/JohnLangford/vowpal_wabbit) as a versatile command line tool for modeling large datasets.
1. Data science at the command line
Rapid prototyping and reproducible science
Sharat Chikkerur
sharat@alum.mit.edu
Principal Data Scientist
Nanigans Inc.
5. About me
• UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu)
• MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu)
• Senior Software Engineer, Google AdWords modeling
• Senior Software Engineer, Microsoft Machine learning
• Principal data scientist, Nanigans Inc
6. About the workshop
• Based on a book by Jeroen Janssens 1
• Vowpal wabbit 2
1. http://datascienceatthecommandline.com/
2. https://github.com/JohnLangford/vowpal_wabbit
7. POSIX command line
• Exposes operating system functionalities through a shell
• Example shells include bash, zsh, tcsh, fish etc.
• Comprises a large number of utility programs
• Examples: grep, awk, sed, etc.
• GNU user space programs
• Common API: input through stdin, output to stdout
• stdin and stdout can be redirected during execution
• Pipes (|) allow composition (chaining) by redirecting the output of one command to be the input of another
ls -lh | more
cat file | sort | uniq -c
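The pipe-composition pattern above can be seen end to end with standard POSIX tools; a minimal sketch (the fruit list is invented sample input):

```shell
# Count word frequencies in a stream: sort groups duplicates so uniq -c
# can count them, then sort -rn orders buckets by count, descending.
printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' \
  | sort | uniq -c | sort -rn
```

Each stage is a small single-purpose program; the pipeline as a whole does the aggregation.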
8. Why the command line?
• REPL allows rapid iteration (through immediate feedback)
• Allows composition of scripts and commands using pipes
• Automation and scaling
• Reproducibility
• Extensibility
• R, Python, Perl and Ruby scripts can be invoked like command line utilities
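A short sketch of this extensibility, assuming python3 is on the PATH: the one-liner reads stdin and writes stdout, so it slots into a pipeline like any other utility.

```shell
# A Python one-liner used as a command-line filter: sum the numbers
# flowing through the pipe and print the total.
seq 5 | python3 -c 'import sys; print(sum(int(l) for l in sys.stdin))'
```

The same pattern works for any script language that can read stdin, which is what makes the shell a glue layer rather than a walled garden.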
9. Data science workflow
A typical workflow: the OSEMN model
• Obtaining data
• Scrubbing data
• Exploring data
• Modeling data
• Interpreting data
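The five steps can be sketched with standard tools on a toy dataset; the CSV here is generated inline as a stand-in for a real download, and a per-state total stands in for a real model:

```shell
# Obtain: a stand-in for curl, generating a small CSV inline
printf 'state,qty\nNE, 1\nNE,2\nIA,3\n\n' > osemn.csv
# Scrub: drop blank lines and stray spaces after commas
sed '/^$/d; s/, /,/g' osemn.csv > clean.csv
# Explore: how many rows per state?
tail -n +2 clean.csv | cut -d, -f1 | sort | uniq -c
# Model (a stand-in): total quantity per state
awk -F, 'NR > 1 { s[$1] += $2 } END { for (k in s) print k, s[k] }' clean.csv
```

Interpretation is the one step that stays with the analyst; everything before it is scriptable.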
11. Workflow example: Boston housing dataset
https://github.com/sharatsc/cdse/blob/master/boston-housing
12. Boston housing dataset
Python workflow 3
import urllib.request
import pandas as pd
# Obtain data
url = 'https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv'
urllib.request.urlretrieve(url, 'boston.csv')
df = pd.read_csv('boston.csv')
# Scrub data
df = df.fillna(0)
# Model data
import statsmodels.formula.api as smf
formula = 'medv ~ ' + ' + '.join(df.columns.difference(['medv']))
model = smf.ols(formula=formula, data=df)
res = model.fit()
res.summary()
Command line workflow
URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv"
curl $URL | Rio -e 'model = lm("medv~.", df); model'
3. https://github.com/sharatsc/cdse/blob/master/boston-housing
14. Obtaining data from the web: curl
• curl (curl.haxx.se) is a cross-platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP and SFTP
• Supports cookies and user+password authentication
• Can be used to get data from RESTful APIs 4
• http GET
curl http://www.google.com
• ftp GET
curl ftp://catless.ncl.ac.uk
• scp COPY
curl -u username: --key key_file --pass password scp://example.com/~/file.txt
4. www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
16. Scrubbing web data: Scrape
• Scrape is a Python command line tool to parse HTML documents
• Queries can be made in CSS selector or XPath syntax
htmldoc=$(cat << EOF
<div id=a>
<a href="x.pdf">x</a>
</div>
<div id=b>
<a href="png.png">y</a>
<a href="pdf.pdf">y</a>
</div>
EOF
)
# Select links that end with pdf and are within the div with id=b (CSS3 selector)
echo $htmldoc | scrape -e '#b a[href$=pdf]'
# Select all anchors (use XPath)
echo $htmldoc | scrape -e "//a"
<a href="pdf.pdf">y</a>
17. CSS selectors
.class          selects all elements with class='class'
div p           selects all <p> elements inside <div> elements
div > p         selects <p> elements whose parent is a <div>
[target=blank]  selects all elements with target="blank"
[href^=https]   selects URLs beginning with https
[href$=pdf]     selects URLs ending with pdf
More examples at
https://www.w3schools.com/cssref/css_selectors.asp
18. XPath Query
author                selects all <author> elements at the current level
//author              selects all <author> elements at any level
//author[@class='x']  selects <author> elements with class='x'
//book//author        selects all <author> elements below a <book> element
//author/*            selects all children of <author> nodes
More examples at
https://msdn.microsoft.com/en-us/library/ms256086
20. Scrubbing JSON data: JQ 5
• jq is a portable command line utility to manipulate and filter JSON data
• Filters can be defined to access individual fields, transform records or produce derived objects
• Filters can be composed and combined
• Provides builtin functions and operators
Example:
curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’
5. https://stedolan.github.io/jq/
22. Object construction
Object construction allows you to derive new objects out of
existing ones.
# Field selection
echo '{"foo": "F", "bar": "B", "baz": "Z"}' | jq -c '{"foo": .foo}'
{"foo": "F"}
# Array expansion
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq -c '{"foo": .foo, "bar": .bar[]}'
{"foo": "A", "bar": "X"}
{"foo": "A", "bar": "Y"}
# Expression evaluation: both key and value can be substituted
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq -c '{(.foo): .bar[]}'
{"A": "X"}
{"A": "Y"}
23. Operators
Addition
• Numbers are added by normal arithmetic.
• Arrays are added by being concatenated into a larger array.
• Strings are added by being joined into a larger string.
• Objects are added by merging, that is, inserting all the
key-value pairs from both objects into a single combined
object.
# Adding numbers
echo '{"foo": 10}' | jq '.foo + 1'
11
# Adding arrays
echo '{"foo": [1,2,3], "bar": [11,12,13]}' | jq -c '.foo + .bar'
[1,2,3,11,12,13]
26. CSV
• CSV (comma separated values) is the common denominator for data exchange
• Tabular data with ',' as the separator
• Can be ingested by R, Python, Excel etc.
• No explicit specification of data types (ARFF supports type annotation)
Example
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
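Even before reaching for dedicated CSV tools, a file like the sample above can be aggregated with plain awk; a minimal sketch:

```shell
# Sum the quantity column per state (NR > 1 skips the header row).
printf 'state,county,quantity\nNE,ADAMS,1\nNE,BUFFALO,1\nNE,THURSTON,1\n' \
  | awk -F, 'NR > 1 { total[$1] += $3 } END { for (s in total) print s, total[s] }'
```

This works because the sample has no quoted fields; embedded commas are exactly the case where csvkit (next slide) earns its keep.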
27. CSVKit (Groskopf and contributors [2016])
csvkit 6 is a suite of command line tools for converting to and working with CSV, the de facto standard for tabular file formats.
Example use cases:
• Importing data from Excel and SQL
• Selecting a subset of columns
• Reordering columns
• Merging multiple files (row- and column-wise)
• Summary statistics
6. https://csvkit.readthedocs.io/
28. Importing data
# Fetch data in XLS format: the (LESO) 1033 Program dataset, which describes
# how surplus military arms have been distributed. This data was widely cited
# in the aftermath of the Ferguson, Missouri protests.
curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_1033_data.xls
# Convert to csv
in2csv ne_1033_data.xls > data.csv
# Inspect the columns
csvcut -n data.csv
# Inspect the data in specific columns
csvcut -c county,quantity data.csv | csvlook
29. CSVKit: Examining data
csvstat provides a summary view of the data, similar to the summary() function in R.
# Get summary for county, and cost
csvcut -c county,acquisition_cost,ship_date data.csv | csvstat
1. county
Text
Nulls: False
Unique values: 35
Max length: 10
5 most frequent values:
DOUGLAS: 760
DAKOTA: 42
CASS: 37
HALL: 23
LANCASTER: 18
2. acquisition_cost
Number
Nulls: False
Min: 0.0
Max: 412000.0
Sum: 5430787.55
Mean: 5242.072924710424710424710425
Median: 6000.0
Standard Deviation: 13368.07836799839045093904423
Unique values: 75
5 most frequent values:
6800.0: 304
30. CSVKit: searching data
csvgrep can be used to search the content of the CSV file.
Options include
• Exact match -m
• Regex match -r
• Invert match -i
• Search specific columns -c columns
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook
| county | item_name | total_cost |
| --------- | ------------------------------ | ---------- |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
31. CSVKit: Power tools
• csvjoin can be used to combine columns from multiple files
csvjoin -c join_column data.csv other_data.csv
• csvsort can be used to sort the file based on specific columns
csvsort -c total_cost data.csv | csvlook | head
• csvstack allows you to merge multiple files together (row-wise)
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/realdata/ks_1033_data.csv
in2csv ne_1033_data.xls > ne_1033_data.csv
csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv
32. CSVKit: SQL
• csvsql allows querying against one or more CSV files.
• The results of the query can be inserted back into a database.
Examples
• Import from csv into a table
# Inserts into a specific table
csvsql --db postgresql:///test --table data --insert data.csv
# Inserts each file into a separate table
csvsql --db postgresql:///test --insert examples/*_tables.csv
• Regular SQL query
csvsql --query "select count(*) from data" data.csv
35. GNU parallel
• GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines.
• It can be used to parallelize across arguments, lines and files.
Examples
# Parallelize across lines
seq 1000 | parallel "echo {}"
# Parallelize across file content (columns split on ',')
cat input.csv | parallel -C, "mv {1} {2}"
cat input.csv | parallel -C, --header : "mv {source} {dest}"
36. Parallel (cont.)
• By default, parallel runs one job per CPU core
• Concurrency can be controlled by --jobs or -j option
seq 100 | parallel -j2 "echo number: {}"
seq 100 | parallel -j200% "echo number: {}"
Logging
• Output from each parallel job can be captured separately
using --results
seq 10 | parallel --results data/outdir "echo number: {}"
find data/outdir
37. Parallel (cont.)
• Remote execution
parallel --nonall --slf instances hostname
# --nonall: run the command once per host, with no arguments
# --slf: use the file 'instances' as the list of ssh logins
• Distributing data
# Split 1..1000 into chunks of 100 lines and pipe them to remote instances
seq 1000 | parallel -N100 --pipe --slf instances "wc -l"
# Transmit, retrieve and clean up:
# --basefile sends the jq binary to all instances;
# --trc transmits each input file, retrieves the result into {.}.csv and cleans up
ls *.gz | parallel -v --basefile jq --trc {.}.csv "zcat {} | ./jq -c '.' > {.}.csv"
39. Vowpal wabbit: Overview
• Fast, online, scalable learning system
• Supports out-of-core execution with an in-memory model
• Scalable (terascale):
• 1000 nodes
• Billions of examples
• Trillions of unique features
• Actively developed
https://github.com/JohnLangford/vowpal_wabbit
42. Optimization
VW solves an optimization of the form

    min_w Σ_i l(w^T x_i; y_i) + R(w)

Here, l(·) is convex and R(w) = λ1 ||w||_1 + λ2 ||w||_2^2.
VW supports a variety of loss functions:

Linear regression      (y − w^T x)^2
Logistic regression    log(1 + exp(−y w^T x))
SVM (hinge loss)       max(0, 1 − y w^T x)
Quantile regression    τ (w^T x − y) I(y < w^T x) + (1 − τ)(y − w^T x) I(y ≥ w^T x)
43. Detour: Feature hashing
• Feature hashing can be used to reduce the dimension of sparse features.
• Unlike random projections, it retains sparsity.
• It approximately preserves dot products (random projections preserve distances).
• The hashed model can fit in memory.
• Unsigned hashing: given a hash function h : [0 . . . N] → [0 . . . m], with m ≪ N,

    φ_i(x) = Σ_{j : h(j) = i} x_j

• Signed hashing: given additionally a sign hash ξ : [0 . . . N] → {−1, +1},

    φ_i(x) = Σ_{j : h(j) = i} ξ(j) x_j
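A toy illustration of unsigned hashing in awk, using a simple polynomial hash as a stand-in for the murmurhash that VW actually uses (the bucket indices it prints are an artifact of this toy hash, not VW's):

```shell
# Hash name:value features into m=8 buckets; colliding features simply add,
# which is what approximately preserves dot products.
echo 'price:2 sqft:3 rooms:1 price_sq:4' | awk '
{
  m = 8                                   # number of hash buckets
  for (i = 1; i <= NF; i++) {
    split($i, kv, ":")                    # kv[1] = name, kv[2] = value
    h = 0
    for (c = 1; c <= length(kv[1]); c++)  # toy polynomial rolling hash
      h = (h * 31 + index("abcdefghijklmnopqrstuvwxyz_", substr(kv[1], c, 1))) % m
    phi[h] += kv[2]
  }
  for (b = 0; b < m; b++)
    if (b in phi) print b, phi[b]
}'
```

Whatever the collisions, the total mass of the feature vector (here 2+3+1+4 = 10) is preserved; only its placement across buckets depends on the hash.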
44. Detour: Generalized linear models
A generalized linear predictor specifies
• A linear predictor of the form η(x) = wT x
• A mean estimate µ
• A link function g such that g(µ) = η(x), relating the mean estimate to the linear predictor.
This framework supports a variety of regression problems:

Linear regression      µ = w^T x
Logistic regression    log(µ / (1 − µ)) = w^T x
Poisson regression     log(µ) = w^T x
45. Input format
Label [Importance] [Tag]|namespace feature ... |namespace feature ...

namespace = String[:Float]
feature = String[:Float]

Examples:
• 1 | 1:0.01 32:-0.1
• example|namespace normal text features
• 1 3 tag|ad-features ad description |user-features name address age
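The format is easy to generate from a CSV with awk; in this sketch the column names are illustrative, and the final vw invocation (left as a comment) assumes vowpal wabbit is installed:

```shell
# Convert label,feat1,feat2 CSV rows into VW input lines:
#   label |namespace name:value name:value
printf 'medv,crim,rm\n24.0,0.006,6.5\n21.6,0.027,6.4\n' \
  | awk -F, '
      NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
      {
        printf "%s |f", $1
        for (i = 2; i <= NF; i++) printf " %s:%s", name[i], $i
        print ""
      }' > data.vw
cat data.vw
# With vowpal wabbit installed, the file trains directly:
#   vw data.vw -f model.vw
```

Because everything after the `|` is plain text, the same awk pattern covers both numeric and bag-of-words features.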
48. Output options
• Examining feature construction --audit
• Generating prediction --predictions or -p
• Unnormalized predictions --raw_predictions
• Testing only --testonly or -t
49. Model options
• Model size --bit_precision or -b. The number of coefficients is limited to 2^b
• Update existing model --initial_regressor or -i.
• Final model destination --final_regressor or -f
• Readable model definition --readable_model
• Readable feature values --invert_hash
• Snapshot model every pass --save_per_pass
• Weight initialization
--initial_weight or --random_weights
56. References
Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/.
O. Tange. GNU Parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42-47, Feb 2011. doi: 10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel.