Data science at the command line
Rapid prototyping and reproducible science
Sharat Chikkerur
sharat@alum.mit.edu
Principal Data Scientist
Nanigans Inc.
1
Outline
Introduction
Motivation
Data science workflow
Obtaining data
Scrubbing data
Exploring data
Managing large workflows
Modeling using vowpal wabbit
2
Introduction
Setup
• Follow instructions at
https://github.com/sharatsc/cdse
• Install virtualbox virtualbox.org
• Install vagrant vagrantup.com
• Initialize virtual machine
mkdir datascience
cd datascience
vagrant init data-science-toolbox/data-science-at-the-command-line
vagrant up && vagrant ssh
3
About me
• UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu)
• MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu)
• Senior Software Engineer, Google AdWords modeling
• Senior Software Engineer, Microsoft Machine learning
• Principal data scientist, Nanigans Inc
4
About the workshop
• Based on a book by Jeroen Janssens 1
• Vowpal wabbit 2
1
http://datascienceatthecommandline.com/
2
https://github.com/JohnLangford/vowpal_wabbit
5
POSIX command line
• Exposes operating system functionalities through a shell
• Example shells include bash, zsh, tcsh, fish etc.
• Comprises a large number of utility programs
• Examples: grep, awk, sed, etc.
• GNU user space programs
• Common API: Input through stdin and output to stdout
• stdin and stdout can be redirected during execution
• Pipes (|) allow composition (chaining): the output of one
command is redirected to become the input of another
ls -lh | more
cat file | sort | uniq -c
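For instance, the second pipeline above can be run verbatim on a small inline input (the sample lines are made up):

```shell
# Count duplicate lines by composing three small programs:
# sort groups identical lines, uniq -c counts them, sort -rn ranks them
printf 'b\na\nb\n' | sort | uniq -c | sort -rn
```

The most frequent line ('b', seen twice) sorts to the top.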
6
Why the command line?
• REPL allows rapid iteration (through immediate feedback)
• Allows composition of scripts and commands using pipes
• Automation and scaling
• Reproducibility
• Extensibility
• R, python, perl, ruby scripts can be invoked like command
line utilities
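As a sketch of that last point, a short Python filter (a hypothetical count_words.py) drops straight into a pipeline like any other utility:

```shell
# count_words.py: a minimal, hypothetical Python filter reading stdin
cat > count_words.py << 'EOF'
import sys

# Print the total number of whitespace-separated words on stdin
print(sum(len(line.split()) for line in sys.stdin))
EOF

# Invoke it like any other command-line utility
echo "data science at the command line" | python3 count_words.py   # prints 6
```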
7
Data science workflow
A typical workflow: the OSEMN model
• Obtaining data
• Scrubbing data
• Exploring data
• Modeling data
• Interpreting data
8
Workshop outline
• Background
• Getting data: Curl, scrape
• Scrubbing: jq, csvkit
• Exploring: csvkit
• Modeling: vowpal wabbit
• Scaling: parallel
Hands on exercises
• Obtaining data walkthrough
• JQ walkthrough
• CSVKit walkthrough
• Vowpal wabbit
9
Workflow example: Boston housing dataset
https://github.com/sharatsc/cdse/blob/master/boston-housing
10
Boston housing dataset
Python workflow 3
import urllib.request
import pandas as pd
# Obtain data
url = 'https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv'
urllib.request.urlretrieve(url, 'boston.csv')
df = pd.read_csv('boston.csv')
# Scrub data
df = df.fillna(0)
# Model data
from statsmodels.formula import api as smf
formula = 'medv ~ ' + ' + '.join(df.columns.drop('medv'))
model = smf.ols(formula=formula, data=df)
res = model.fit()
res.summary()
Command line workflow
URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv"
curl "$URL" | Rio -e 'model=lm("medv~.", df); model'
3
https://github.com/sharatsc/cdse/blob/master/boston-housing
11
Obtaining data
Obtaining data from the web: curl
• CURL (curl.haxx.se)
• Cross platform command line tool that supports data
transfer using
HTTP, HTTPS, FTP, IMAP, SCP, SFTP
• Supports cookies, user+password authentication
• Can be used to get data from RESTful APIs 4
• http GET
curl http://www.google.com
• ftp GET
curl ftp://catless.ncl.ac.uk
• scp COPY
curl -u username: --key key_file --pass password scp://example.com/~/file.txt
4
www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
12
Scrubbing data
Scrubbing web data: Scrape
• Scrape is a Python command-line tool for parsing HTML
documents
• Queries can be made in CSS selector or XPath syntax
htmldoc=$(cat << EOF
<div id=a>
<a href="x.pdf">x</a>
</div>
<div id=b>
<a href="png.png">y</a>
<a href="pdf.pdf">y</a>
</div>
EOF
)
# Select links that end with pdf within the div with id=b (use CSS3 selector)
echo $htmldoc | scrape -e '#b a[href$=pdf]'
# Select all anchors (use XPath)
echo $htmldoc | scrape -e "//a"
<a href="pdf.pdf">y</a>
13
CSS selectors
.class selects all elements with class=’class’
div p selects all <p> elements inside div elements
div > p selects <p> elements where parent is <div>
[target=blank] selects all elements with target="blank"
[href^=https] selects urls beginning with https
[href$=pdf] selects urls ending with pdf
More examples at
https://www.w3schools.com/cssref/css_selectors.asp
14
XPath Query
author                selects all <author> elements at the current level
//author              selects all <author> elements at any level
//author[@class='x']  selects <author> elements with class='x'
//book//author        all <author> elements below a <book> element
//author/*            all children of <author> nodes
More examples at
https://msdn.microsoft.com/en-us/library/ms256086
15
Getting data walkthrough
https://github.com/sharatsc/cdse/tree/master/curl
16
Scrubbing JSON data: JQ 5
• JQ is a portable command line utility to manipulate and
filter JSON data
• Filters can be defined to access individual fields, transform
records or produce derived objects
• Filters can be composed and combined
• Provides built-in functions and operators
Example:
curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’
5
https://stedolan.github.io/jq/
17
Basic filters
# Identity '.'
echo '"Hello world"' | jq '.'
"Hello world"
# Examples: '.foo', '.foo.bar', '.foo|.bar', '.["foo"]'
echo '{"foo": 42, "bar": 10}' | jq '.foo'
42
echo '{"foo": {"bar": 10, "baz": 20}}' | jq '.foo.bar'
10
# Arrays: '.[]', '.[0]', '.[1:3]'
echo '["foo", "bar"]' | jq '.[0]'
"foo"
echo '["foo", "bar", "baz"]' | jq '.[1:3]'
["bar", "baz"]
# Pipe: '.foo|.bar', '.[]|.bar'
echo '[{"f": 10}, {"f": 20}, {"f": 30}]' | jq '.[] | .f'
10
20
30
18
Object construction
Object construction allows you to derive new objects out of
existing ones.
# Field selection
echo '{"foo": "F", "bar": "B", "baz": "Z"}' | jq '{"foo": .foo}'
{"foo": "F"}
# Array expansion
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{"foo": .foo, "bar": .bar[]}'
{"foo": "A", "bar": "X"}
{"foo": "A", "bar": "Y"}
# Expression evaluation: both key and value can be computed
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{(.foo): .bar[]}'
{"A": "X"}
{"A": "Y"}
19
Operators
Addition
• Numbers are added by normal arithmetic.
• Arrays are added by being concatenated into a larger array.
• Strings are added by being joined into a larger string.
• Objects are added by merging, that is, inserting all the
key-value pairs from both objects into a single combined
object.
# Adding fields
echo ’{"foo": 10}’ | jq ’.foo + 1’
11
# Adding arrays
echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’
[1,2,3,11,12,13]
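The object-merge behavior of `+` can be checked the same way; a small sketch (guarded in case jq is not installed):

```shell
# Merging two objects with '+': later keys win, others are combined
if command -v jq >/dev/null; then
  echo '{"a": 1} {"a": 2, "b": 3}' | jq -c -s '.[0] + .[1]'
fi
```

With jq available this prints `{"a":2,"b":3}`: the `-s` (slurp) flag collects the two input objects into an array so they can be addressed as `.[0]` and `.[1]`.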
20
JQ walkthrough
https:
//github.com/sharatsc/cdse/tree/master/jq
21
Exploring data
CSV
• CSV (comma-separated values) is the common denominator
for data exchange
• Tabular data with ’,’ as a separator
• Can be ingested by R, python, excel etc.
• No explicit specification of data types (ARFF supports type
annotation)
Example
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
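Even before reaching for csvkit, plain POSIX tools can summarize such a file; a sketch over the sample above:

```shell
# Recreate the sample CSV
cat > sample.csv << 'EOF'
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
EOF

# Sum the quantity column per state, skipping the header line
tail -n +2 sample.csv | awk -F, '{sum[$1] += $3} END {for (s in sum) print s "," sum[s]}'
# prints NE,3
```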
22
CSVKit (Groskopf and contributors [2016])
csvkit 6 is a suite of command-line tools for converting to and
working with CSV, the de facto standard for tabular file formats.
Example use cases
• Importing data from Excel and SQL
• Selecting a subset of columns
• Reordering columns
• Merging multiple files (row- and column-wise)
• Summary statistics
6
https://csvkit.readthedocs.io/
23
Importing data
# Fetch data in XLS format: the (LESO) 1033 Program dataset, which describes
# how surplus military arms have been distributed.
# This data was widely cited in the aftermath of the Ferguson, Missouri protests.
curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_1033_data.xls
# Convert to csv
in2csv ne_1033_data.xls > data.csv
# Inspect the columns
csvcut -n data.csv
# Inspect the data in specific columns
csvcut -c county,quantity data.csv | csvlook
24
CSVKit: Examining data
csvstat provides a summary view of the data similar to
summary() function in R.
# Get summary for county, and cost
csvcut -c county,acquisition_cost,ship_date data.csv | csvstat
1. county
Text
Nulls: False
Unique values: 35
Max length: 10
5 most frequent values:
DOUGLAS: 760
DAKOTA: 42
CASS: 37
HALL: 23
LANCASTER: 18
2. acquisition_cost
Number
Nulls: False
Min: 0.0
Max: 412000.0
Sum: 5430787.55
Mean: 5242.072924710424710424710425
Median: 6000.0
Standard Deviation: 13368.07836799839045093904423
Unique values: 75
5 most frequent values:
6800.0: 304
25
CSVKit: searching data
csvgrep can be used to search the content of the CSV file.
Options include
• Exact match -m
• Regex match -r
• Invert match -i
• Search specific columns -c columns
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook
| county | item_name | total_cost |
| --------- | ------------------------------ | ---------- |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
26
CSVKit: Power tools
• csvjoin can be used to combine columns from multiple files
csvjoin -c join_column data.csv other_data.csv
• csvsort can be used to sort the file based on specific
columns
csvsort -c total_population | csvlook | head
• csvstack allows you to merge multiple files together
(row-wise)
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re
in2csv ne_1033_data.xls > ne_1033_data.csv
csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv
27
CSVKit: SQL
• csvsql allows querying against one or more CSV files.
• The results of the query can be inserted back into a database
Examples
• Import from csv into a table
# Inserts into a specific table
csvsql --db postgresql:///test --table data --insert data.csv
# Inserts each file into a separate table
csvsql --db postgresql:///test --insert examples/*_tables.csv
• Regular SQL query
csvsql --query "select count(*) from data" data.csv
28
CSVKit Walkthrough
https://github.com/sharatsc/cdse/tree/master/csvkit
29
Managing large workflows
GNU parallel
• GNU parallel (Tange [2011]) is a tool for executing jobs in
parallel on one or more machines.
• It can be used to parallelize across arguments, lines and
files.
Examples
# Parallelize across lines
seq 1000 | parallel "echo {}"
# Parallelize across file content
cat input.csv | parallel -C, "mv {1} {2}"
cat input.csv | parallel -C, --header : "mv {source} {dest}"
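A runnable sketch of per-line parallelism (guarded in case GNU parallel is not installed; `xargs -P` offers similar per-argument parallelism as a fallback):

```shell
# Run one echo per input line, two jobs at a time; sort to make output deterministic
if command -v parallel >/dev/null; then
  seq 3 | parallel -j2 "echo number: {}" | sort
else
  seq 3 | xargs -P2 -I{} echo "number: {}" | sort
fi
```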
30
Parallel (cont.)
• By default, parallel runs one job per cpu core
• Concurrency can be controlled by --jobs or -j option
seq 100 | parallel -j2 "echo number: {}"
seq 100 | parallel -j200% "echo number: {}"
Logging
• Output from each parallel job can be captured separately
using --results
seq 10 | parallel --results data/outdir "echo number: {}"
find data/outdir
31
Parallel (cont.)
• Remote execution
parallel --nonall --slf instances hostname
# nonall - no argument command to follow
# slf - uses ~/.parallel/sshloginfile as the list of sshlogins
• Distributing data
# Split 1-1000 into sections of 100 and pipe it to remote instances
seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)"
# Transmit, retrieve and clean up (--trc):
# sends the jq binary to all instances along with each input file,
# retrieves the results into {.}.csv and cleans up remote files
ls *.gz | parallel -v --slf instances --basefile jq --trc {.}.csv "zcat {} | ./jq '.' > {.}.csv"
32
Modeling using vowpal wabbit
Overview
• Fast, online, scalable learning system
• Supports out-of-core execution with an in-memory model
• Scalable (terascale)
• 1000 nodes
• Billions of examples
• Trillions of unique features
• Actively developed
https://github.com/JohnLangford/vowpal_wabbit
33
Swiss army knife of online algorithms
• Binary classification
• Multiclass classification
• Linear regression
• Quantile regression
• Topic modeling (online LDA)
• Structured prediction
• Active learning
• Recommendation (Matrix factorization)
• Contextual bandit learning (explore/exploit algorithms)
• Reductions
34
Features
• Flexible input format: Allows free form text, identifying tags,
multiple labels
• Speed: Supports online learning
• Scalable
• Streaming removes row limit.
• Hashing removes dimension limit.
• Distributed training: models merged using AllReduce
operation.
• Cross product features: Allows multi-task learning
35
Optimization
VW solves optimizations of the form

min_w Σ_i l(w^T x_i; y_i) + λ R(w)

Here, l() is convex and R(w) = λ1 |w|_1 + λ2 ||w||_2^2.
VW supports a variety of loss functions:

Linear regression    (y − w^T x)^2
Logistic regression  log(1 + exp(−y w^T x))
Hinge loss (SVM)     max(0, 1 − y w^T x)
Quantile regression  τ (w^T x − y) I(w^T x > y) + (1 − τ)(y − w^T x) I(y > w^T x)
36
Detour: Feature hashing
• Feature hashing can be used to reduce the dimension of
sparse features.
• Unlike random projections, it retains sparsity.
• Approximately preserves dot products (random projections
preserve distances).
• The model can fit in memory.
• Unsigned version: given a hash function
h : [0 . . . N] → [0 . . . m], with m ≪ N,

φ_i(x) = Σ_{j : h(j) = i} x_j

• Signed version: consider additionally a sign hash
ξ : [0 . . . N] → {−1, +1}, and set

φ_i(x) = Σ_{j : h(j) = i} ξ(j) x_j
37
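A toy sketch of the unsigned variant, run from the shell via Python (the bucket count m and feature values are made up, and Python's built-in hash() stands in for a real hash function):

```shell
python3 - << 'EOF'
# Unsigned hashing trick: fold arbitrarily many named features into m buckets
m = 16

def hash_features(feats, m=m):
    phi = [0.0] * m
    for name, value in feats.items():
        phi[hash(name) % m] += value  # collisions simply add up
    return phi

phi = hash_features({"price": 0.5, "sqft": 0.25})
print(len(phi), sum(phi))  # dimension is m; total feature mass is preserved
EOF
```

This prints `16 0.75`: whatever N names appear, the output vector stays m-dimensional.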
Detour: Generalized linear models
A generalized linear predictor specifies
• A linear predictor of the form η(x) = wT x
• A mean estimate µ
• A link function g(µ) such that g(µ) = η(x) that relates the
mean estimate to the linear predictor.
This framework supports a variety of regression problems
Linear regression    µ = w^T x
Logistic regression  log(µ / (1 − µ)) = w^T x
Poisson regression   log(µ) = w^T x
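Inverting the logistic link gives the familiar sigmoid form of the mean estimate:

```latex
g(\mu) = \log\frac{\mu}{1-\mu} = w^T x
\quad\Longrightarrow\quad
\mu = \frac{1}{1 + e^{-w^T x}}
```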
38
Input format
[Label] [Importance] [Tag]|namespace feature ... |namespace feature ...
namespace = String[:Float]
feature = String[:Float]
Examples:
• 1 | 1:0.01 32:-0.1
• example|namespace normal text features
• 1 3 tag|ad-features ad description |user-features name
address age
39
Input options
• data file -d datafile
• network --daemon --port <port=26542>
• compressed data --compressed
• stdin cat <data> | vw
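Putting the format and input options together, a minimal training run might look like this (the file name and features are made up; the training step is guarded in case vw is not installed):

```shell
# Two examples in VW input format: label | features, each feature as name:value
cat > train.vw << 'EOF'
1 | price:0.23 sqft:0.25 age:0.05
-1 | price:0.18 sqft:0.09 age:0.40
EOF

# Train a model, reading from a data file (-d) and saving the regressor with -f
if command -v vw >/dev/null; then
  vw -d train.vw -f model.vw --loss_function logistic
fi
```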
40
Manipulation options
• ngrams --ngram
• skips --skips
• quadratic interaction -q args, e.g. -q ab
• cubic interaction --cubic args, e.g. --cubic ccc
41
Output options
• Examining feature construction --audit
• Generating prediction --predictions or -p
• Unnormalized predictions --raw_predictions
• Testing only --testonly or -t
42
Model options
• Model size --bit_precision or -b . Number of
coefficients limited to 2^b
• Update existing model --initial_regressor or -i.
• Final model destination --final_regressor or -f
• Readable model definition --readable_model
• Readable feature values --invert_hash
• Snapshot model every pass --save_per_pass
• Weight initialization
--initial_weight or --random_weights
43
Regression (Demo)
https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/linear-regression
• Linear regression --loss_function square
• Quantile regression
--loss_function quantile --quantile_tau <tau=0.5>
44
Binary classification (Demo)
https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/classification
• Note: a linear regressor can be used as a classifier as well
• Logistic loss
--loss_function logistic, --link logistic
• Hinge loss (SVM loss function)
--loss_function hinge
• Report binary loss instead of logistic loss --binary
45
Multiclass classification (Demo)
https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/multiclass
• One against all --oaa <k>
• Error correcting tournament --ect <k>
• Online decision trees --log_multi <k>
• Cost sensitive one-against-all --csoaa <k>
46
LDA options (Demo)
https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/lda
• Number of topics --lda
• Prior on per-document topic weights --lda_alpha
• Prior on topic distributions --lda_rho
• Estimated number of documents --lda_D
• Convergence parameter for topic estimation
--lda_epsilon
• Mini batch size --minibatch
47
Daemon mode (Demo)
https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/daemon-respond
https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/daemon-request
• Loads model and answers any prediction request coming
over the network
• Preferred way to deploy a VW model
• Options
• --daemon. Enables daemon mode
• --testonly or -t. Does not update the model in
response to requests
• --initial_regressor or -i. Model to load
• --port <arg>. Port to listen on for requests
• --num_children <arg>. Number of threads listening to
requests 48
References
Christopher Groskopf and contributors. csvkit, 2016. URL
https://csvkit.readthedocs.org/.
O. Tange. GNU Parallel - The Command-Line Power Tool. ;login:
The USENIX Magazine, 36(1):42-47, Feb 2011. doi:
http://dx.doi.org/10.5281/zenodo.16303. URL
http://www.gnu.org/s/parallel.
48

Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
OSCON 2011 CouchApps
OSCON 2011 CouchAppsOSCON 2011 CouchApps
OSCON 2011 CouchApps
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
OSCON 2011 Learning CouchDB
OSCON 2011 Learning CouchDBOSCON 2011 Learning CouchDB
OSCON 2011 Learning CouchDB
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
You got schema in my json
You got schema in my jsonYou got schema in my json
You got schema in my json
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
GraphConnect 2014 SF: From Zero to Graph in 120: Scale
GraphConnect 2014 SF: From Zero to Graph in 120: ScaleGraphConnect 2014 SF: From Zero to Graph in 120: Scale
GraphConnect 2014 SF: From Zero to Graph in 120: Scale
 
Performing Data Science with HBase
Performing Data Science with HBasePerforming Data Science with HBase
Performing Data Science with HBase
 
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D...
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D...Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D...
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D...
 
AI Development with H2O.ai
AI Development with H2O.aiAI Development with H2O.ai
AI Development with H2O.ai
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Data science at the command line

  • 1. Data science at the command line
    Rapid prototyping and reproducible science
    Sharat Chikkerur
    sharat@alum.mit.edu
    Principal Data Scientist
    Nanigans Inc.
    1
  • 2. Outline
    Introduction
    Motivation
    Data science workflow
    Obtaining data
    Scrubbing data
    Exploring data
    Managing large workflows
    Modeling using vowpal wabbit
    2
  • 4. Setup
    • Follow instructions at https://github.com/sharatsc/cdse
    • Install virtualbox: virtualbox.org
    • Install vagrant: vagrantup.com
    • Initialize the virtual machine:
      mkdir datascience
      cd datascience
      vagrant init data-science-toolbox/data-science-at-the-command-line
      vagrant up && vagrant ssh
    3
  • 5. About me
    • UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu)
    • MIT Alumni, 2010 (Ph.D., EECS Dept, cbcl.mit.edu)
    • Senior Software Engineer, Google AdWords modeling
    • Senior Software Engineer, Microsoft Machine learning
    • Principal Data Scientist, Nanigans Inc
    4
  • 6. About the workshop
    • Based on the book by Jeroen Janssens 1
    • Vowpal wabbit 2
    1 http://datascienceatthecommandline.com/
    2 https://github.com/JohnLangford/vowpal_wabbit
    5
  • 7. POSIX command line
    • Exposes operating system functionality through a shell
    • Example shells include bash, zsh, tcsh, fish, etc.
    • Comprises a large number of utility programs
      • Examples: grep, awk, sed, etc.
      • GNU user-space programs
    • Common API: input through stdin, output to stdout
      • stdin and stdout can be redirected during execution
    • Pipes (|) allow composition (chaining) of commands
      • The output of one command becomes the input of another
      ls -lh | more
      cat file | sort | uniq -c
    6
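The stdin/stdout redirection that pipes perform can also be driven programmatically. A minimal sketch (standard library only, assuming `sort` and `uniq` are on the PATH) that recreates `cat file | sort | uniq -c` by wiring one process's stdout into the next one's stdin:

```python
import subprocess

# Recreate `cat file | sort | uniq -c`: each process's stdout is
# redirected into the next process's stdin, exactly as a shell pipe does.
lines = b"apple\nbanana\napple\n"

sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout, stdout=subprocess.PIPE)

sort.stdin.write(lines)
sort.stdin.close()              # signal EOF so `sort` can emit its output
out = uniq.communicate()[0].decode()
print(out)
```

The same chain in the shell is just `printf 'apple\nbanana\napple\n' | sort | uniq -c`; the Python version makes the redirection explicit.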
  • 8. Why command line?
    • REPL allows rapid iteration (through immediate feedback)
    • Allows composition of scripts and commands using pipes
    • Automation and scaling
    • Reproducibility
    • Extensibility
      • R, python, perl, ruby scripts can be invoked like command line utilities
    7
  • 9. Data science workflow
    A typical workflow: the OSEMN model
    • Obtaining data
    • Scrubbing data
    • Exploring data
    • Modeling data
    • Interpreting data
    8
  • 10. Workshop outline
    • Background
    • Getting data: curl, scrape
    • Scrubbing: jq, csvkit
    • Exploring: csvkit
    • Modeling: vowpal wabbit
    • Scaling: parallel
    Hands-on exercises
    • Obtaining data walkthrough
    • JQ walkthrough
    • CSVKit walkthrough
    • Vowpal wabbit
    9
  • 11. Workflow example: Boston housing dataset
    https://github.com/sharatsc/cdse/blob/master/boston-housing
    10
  • 12. Boston housing dataset
    Python workflow 3
    import urllib
    import pandas as pd
    # Obtain data
    urllib.urlretrieve('https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv', 'boston.csv')
    df = pd.read_csv('boston.csv')
    # Scrub data
    df = df.fillna(0)
    # Model data
    from statsmodels import regression
    from statsmodels.formula import api as smf
    formula = 'medv~' + ' + '.join(df.columns.difference(['medv']))
    model = smf.ols(formula=formula, data=df)
    res = model.fit()
    res.summary()
    Command line workflow
    URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv"
    curl $URL | Rio -e 'model=lm("medv~.", df); model'
    3 https://github.com/sharatsc/cdse/blob/master/boston-housing
    11
  • 14. Obtaining data from the web: curl
    • CURL (curl.haxx.se)
    • Cross-platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP
    • Supports cookies, user+password authentication
    • Can be used to get data from RESTful APIs 4
    • http GET: curl http://www.google.com
    • ftp GET: curl ftp://catless.ncl.ac.uk
    • scp COPY: curl -u username: --key key_file --pass password scp://example.com/~/file.txt
    4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
    12
  • 16. Scrubbing web data: Scrape
    • Scrape is a python command line tool to parse html documents
    • Queries can be made in CSS selector or XPath syntax
    htmldoc=$(cat << EOF
    <div id=a>
      <a href="x.pdf">x</a>
    </div>
    <div id=b>
      <a href="png.png">y</a>
      <a href="pdf.pdf">y</a>
    </div>
    EOF
    )
    # Select links that end with pdf and are within the div with id=b (CSS3 selector)
    echo $htmldoc | scrape -e "#b a[href$=pdf]"
    <a href="pdf.pdf">y</a>
    # Select all anchors (XPath)
    echo $htmldoc | scrape -e "//a"
    13
  • 17. CSS selectors
    .class          selects all elements with class="class"
    div p           selects all <p> elements inside <div> elements
    div > p         selects <p> elements whose parent is a <div>
    [target=blank]  selects all elements with target="blank"
    [href^=https]   selects urls beginning with https
    [href$=pdf]     selects urls ending with pdf
    More examples at https://www.w3schools.com/cssref/css_selectors.asp
    14
  • 18. XPath Query
    author                selects all <author> elements at the current level
    //author              selects all <author> elements at any level
    //author[@class='x']  selects <author> elements with class='x'
    //book//author        selects all <author> elements that are below a <book> element
    //author/*            selects all children of <author> nodes
    More examples at https://msdn.microsoft.com/en-us/library/ms256086
    15
  • 20. Scrubbing JSON data: JQ 5
    • JQ is a portable command line utility to manipulate and filter JSON data
    • Filters can be defined to access individual fields, transform records or produce derived objects
    • Filters can be composed and combined
    • Provides builtin functions and operators
    Example:
    curl 'https://api.github.com/repos/stedolan/jq/commits' | jq '.[0]'
    5 https://stedolan.github.io/jq/
    17
  • 21. Basic filters
    # Identity: '.'
    echo '"Hello world"' | jq '.'
    "Hello world"
    # Field access: '.foo', '.foo.bar', '.foo|.bar', '.["foo"]'
    echo '{"foo": 42, "bar": 10}' | jq '.foo'
    42
    echo '{"foo": {"bar": 10, "baz": 20}}' | jq '.foo.bar'
    10
    # Arrays: '.[]', '.[0]', '.[1:3]'
    echo '["foo", "bar"]' | jq '.[0]'
    "foo"
    echo '["foo", "bar", "baz"]' | jq '.[1:3]'
    ["bar", "baz"]
    # Pipe: '.foo|.bar', '.[]|.bar'
    echo '[{"f": 10}, {"f": 20}, {"f": 30}]' | jq '.[] | .f'
    10
    20
    30
    18
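A useful way to check what a jq filter selects is to reproduce it with Python's json module. A small sketch mirroring the filters above:

```python
import json

# jq '.foo.bar' -- nested field access
doc = json.loads('{"foo": {"bar": 10, "baz": 20}}')
nested = doc["foo"]["bar"]
print(nested)                      # 10

# jq '.[1:3]' -- array slicing
arr = json.loads('["foo", "bar", "baz"]')
sliced = arr[1:3]
print(sliced)                      # ['bar', 'baz']

# jq '.[] | .f' -- iterate over the array, then select a field
records = json.loads('[{"f": 10}, {"f": 20}, {"f": 30}]')
fields = [r["f"] for r in records]
print(fields)                      # [10, 20, 30]
```

The jq filters stream one JSON value per line, whereas the Python version collects results into a list; the selected values are the same.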
  • 22. Object construction
    Object construction allows you to derive new objects out of existing ones.
    # Field selection
    echo '{"foo": "F", "bar": "B", "baz": "Z"}' | jq '{"foo": .foo}'
    {"foo": "F"}
    # Array expansion
    echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{"foo": .foo, "bar": .bar[]}'
    {"foo": "A", "bar": "X"}
    {"foo": "A", "bar": "Y"}
    # Expression evaluation; both key and value can be substituted
    echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{(.foo): .bar[]}'
    {"A": "X"}
    {"A": "Y"}
    19
  • 23. Operators
    Addition
    • Numbers are added by normal arithmetic.
    • Arrays are added by being concatenated into a larger array.
    • Strings are added by being joined into a larger string.
    • Objects are added by merging: all the key-value pairs from both objects are inserted into a single combined object.
    # Adding fields
    echo '{"foo": 10}' | jq '.foo + 1'
    11
    # Adding arrays
    echo '{"foo": [1,2,3], "bar": [11,12,13]}' | jq '.foo + .bar'
    [1,2,3,11,12,13]
    20
  • 26. CSV
    • CSV (comma separated values) is the common denominator for data exchange
    • Tabular data with ',' as a separator
    • Can be ingested by R, python, excel etc.
    • No explicit specification of data types (ARFF supports type annotation)
    Example
    state,county,quantity
    NE,ADAMS,1
    NE,BUFFALO,1
    NE,THURSTON,1
    22
  • 27. CSVKit (Groskopf and contributors [2016])
    csvkit 6 is a suite of command line tools for converting and working with CSV, the de facto standard for tabular file formats.
    Example use cases
    • Importing data from excel, sql
    • Selecting a subset of columns
    • Reordering columns
    • Merging multiple files (row- and column-wise)
    • Summary statistics
    6 https://csvkit.readthedocs.io/
    23
  • 28. Importing data
    # Fetch data in XLS format: the (LESO) 1033 Program dataset, which describes
    # how surplus military arms have been distributed. This data was widely cited
    # in the aftermath of the Ferguson, Missouri protests.
    curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_1033_data.xls
    # Convert to csv
    in2csv ne_1033_data.xls > data.csv
    # Inspect the columns
    csvcut -n data.csv
    # Inspect the data in specific columns
    csvcut -c county,quantity data.csv | csvlook
    24
  • 29. CSVKit: Examining data
    csvstat provides a summary view of the data similar to the summary() function in R.
    # Get a summary for county, acquisition_cost and ship_date
    csvcut -c county,acquisition_cost,ship_date data.csv | csvstat
    1. county
       Text. Nulls: False. Unique values: 35. Max length: 10
       5 most frequent values: DOUGLAS: 760, DAKOTA: 42, CASS: 37, HALL: 23, LANCASTER: 18
    2. acquisition_cost
       Number. Nulls: False
       Min: 0.0, Max: 412000.0, Sum: 5430787.55
       Mean: 5242.072924710424710424710425, Median: 6000.0
       Standard Deviation: 13368.07836799839045093904423
       Unique values: 75
       5 most frequent values: 6800.0: 304
    25
  • 30. CSVKit: searching data
    csvgrep can be used to search the content of the CSV file. Options include
    • Exact match: -m
    • Regex match: -r
    • Invert match: -i
    • Search specific columns: -c columns
    csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook
    | county    | item_name             | total_cost |
    | --------- | --------------------- | ---------- |
    | LANCASTER | RIFLE,5.56 MILLIMETER | 120        |
    | LANCASTER | RIFLE,5.56 MILLIMETER | 120        |
    | LANCASTER | RIFLE,5.56 MILLIMETER | 120        |
    26
  • 31. CSVKit: Power tools
    • csvjoin can be used to combine columns from multiple files
      csvjoin -c join_column data.csv other_data.csv
    • csvsort can be used to sort the file based on specific columns
      csvsort -c total_population | csvlook | head
    • csvstack allows you to merge multiple files together (row-wise)
      curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re
      in2csv ne_1033_data.xls > ne_1033_data.csv
      csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv
    27
  • 32. CSVKit: SQL
    • csvsql allows querying against one or more CSV files.
    • The results of the query can be inserted back into a db
    Examples
    • Import from csv into a table
      # Inserts into a specific table
      csvsql --db postgresql:///test --table data --insert data.csv
      # Inserts each file into a separate table
      csvsql --db postgresql:///test --insert examples/*_tables.csv
    • Regular SQL query
      csvsql --query "select count(*) from data" data.csv
    28
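Conceptually, `csvsql --query` loads the CSV into a database table and runs the query against it. A rough sketch of that round trip with the standard library (an in-memory SQLite table standing in for the database; csvsql's actual machinery differs):

```python
import csv
import io
import sqlite3

# The small CSV example used earlier in the deck.
data = "state,county,quantity\nNE,ADAMS,1\nNE,BUFFALO,1\nNE,THURSTON,1\n"
rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]

# Load the rows into an in-memory table, then query it with SQL,
# mirroring: csvsql --query "select count(*) from data" data.csv
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (state TEXT, county TEXT, quantity INTEGER)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)", body)
count = con.execute("SELECT count(*) FROM data").fetchone()[0]
print(count)   # 3
```

The point of csvsql is that this load-then-query step happens transparently, for any number of files, with the schema inferred from the data.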
  • 35. GNU parallel
    • GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines.
    • It can be used to parallelize across arguments, lines and files.
    Examples
    # Parallelize across lines
    seq 1000 | parallel "echo {}"
    # Parallelize across file content
    cat input.csv | parallel -C, "mv {1} {2}"
    cat input.csv | parallel -C, --header : "mv {source} {dest}"
    30
  • 36. Parallel (cont.)
    • By default, parallel runs one job per cpu core
    • Concurrency can be controlled by the --jobs or -j option
      seq 100 | parallel -j2 "echo number: {}"
      seq 100 | parallel -j200% "echo number: {}"
    Logging
    • Output from each parallel job can be captured separately using --results
      seq 10 | parallel --results data/outdir "echo number: {}"
      find data/outdir
    31
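The fan-out pattern behind `seq 100 | parallel -j2 "echo number: {}"` can be sketched in Python with a worker pool: a fixed number of workers (the -j setting) consume the argument stream, and results are collected in input order.

```python
from multiprocessing import Pool

def job(n):
    # Stand-in for the per-argument command, e.g. `echo number: {}`.
    return "number: %d" % n

# Pool(processes=2) plays the role of -j2: at most two jobs run at once.
with Pool(processes=2) as pool:
    results = pool.map(job, range(1, 101))

print(results[0], results[-1])   # number: 1 number: 100
```

Unlike the shell pipeline, the pool version keeps results ordered by input; GNU parallel offers the same guarantee with its --keep-order flag.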
  • 37. Parallel (cont.)
    • Remote execution
      parallel --nonall --slf instances hostname
      # nonall - no-argument command to follow
      # slf - uses ~/.parallel/sshloginfile as the list of sshlogins
    • Distributing data
      # Split 1-1000 into sections of 100 and pipe it to remote instances
      seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)"
      # transmit, retrieve and cleanup:
      # sends jq to all instances, transmits the input file,
      # retrieves the results into {.}.csv and cleans up
      ls *.gz | parallel -v --basefile jq --trc {.}.csv
    32
  • 39. Overview
    • Fast, online, scalable learning system
    • Supports out-of-core execution with an in-memory model
    • Scalable (Terascale)
      • 1000 nodes (??)
      • Billions of examples
      • Trillions of unique features
    • Actively developed: https://github.com/JohnLangford/vowpal_wabbit
    33
  • 40. Swiss army knife of online algorithms
    • Binary classification
    • Multiclass classification
    • Linear regression
    • Quantile regression
    • Topic modeling (online LDA)
    • Structured prediction
    • Active learning
    • Recommendation (Matrix factorization)
    • Contextual bandit learning (explore/exploit algorithms)
    • Reductions
    34
  • 41. Features
    • Flexible input format: allows free form text, identifying tags, multiple labels
    • Speed: supports online learning
    • Scalable
      • Streaming removes the row limit.
      • Hashing removes the dimension limit.
    • Distributed training: models merged using an AllReduce operation.
    • Cross product features: allows multi-task learning
    35
  • 42. Optimization
    VW solves an optimization of the form
      min_w  ∑_i l(w^T x_i; y_i) + λR(w)
    Here, l() is convex and R(w) = λ1 |w|_1 + λ2 ||w||_2^2.
    VW supports a variety of loss functions:
    Linear regression     (y − w^T x)^2
    Logistic regression   log(1 + exp(−y w^T x))
    SVM (hinge) loss      max(0, 1 − y w^T x)
    Quantile regression   τ(y − w^T x)·I(y > w^T x) + (1 − τ)(w^T x − y)·I(y ≤ w^T x)
    36
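The four losses in the table can be written out directly. In the sketch below, p stands for the raw linear prediction w^T x and y for the label (±1 for the classification losses, real-valued for the regression losses):

```python
import math

def squared(p, y):
    # Linear regression loss: (y - w^T x)^2
    return (y - p) ** 2

def logistic(p, y):
    # Logistic loss for y in {-1, +1}
    return math.log(1 + math.exp(-y * p))

def hinge(p, y):
    # SVM (hinge) loss for y in {-1, +1}
    return max(0.0, 1 - y * p)

def quantile(p, y, tau=0.5):
    # Pinball loss: under-predictions weighted by tau, over-predictions by 1-tau
    return tau * (y - p) if y > p else (1 - tau) * (p - y)

print(squared(1.0, 3.0))   # 4.0
print(hinge(0.5, 1))       # 0.5
print(quantile(2.0, 5.0, 0.9))
```

With tau = 0.5 the quantile loss reduces to half the absolute error, which is why minimizing it estimates the median.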
  • 43. Detour: Feature hashing
    • Feature hashing can be used to reduce the dimension of sparse features.
    • Unlike random projections, it retains sparsity
    • Preserves dot products (random projection preserves distances).
    • The model can fit in memory.
    • Unsigned: consider a hash function h(x) : [0 . . . N] → [0 . . . m], m << N.
      φ_i(x) = ∑_{j:h(j)=i} x_j
    • Signed: consider additionally a hash function ξ(x) : [0 . . . N] → {1, −1}.
      φ_i(x) = ∑_{j:h(j)=i} ξ(j) x_j
    37
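A minimal sketch of the signed variant: each feature name is hashed to a bucket in [0, m), and a second hash picks the sign ξ(j). The hash choices here (CRC32 with a salted copy for the sign) are illustrative only, not what VW uses internally.

```python
import zlib

def hashed_features(pairs, m):
    """pairs: iterable of (feature_name, value). Returns the length-m vector phi."""
    phi = [0.0] * m
    for name, value in pairs:
        h = zlib.crc32(name.encode())                        # bucket hash h(j)
        xi = 1 if zlib.crc32(b"sign:" + name.encode()) % 2 == 0 else -1  # sign hash xi(j)
        phi[h % m] += xi * value   # colliding features add up; signs cancel in expectation
    return phi

x = [("age", 1.0), ("city=boston", 1.0), ("clicks", 3.0)]
print(hashed_features(x, 8))
```

The random signs are what make the hashed dot product unbiased: collision terms carry ξ(j)ξ(k) factors that are ±1 with equal probability.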
  • 44. Detour: Generalized linear models
    A generalized linear predictor specifies
    • A linear predictor of the form η(x) = w^T x
    • A mean estimate µ
    • A link function g(µ) such that g(µ) = η(x), relating the mean estimate to the linear predictor.
    This framework supports a variety of regression problems:
    Linear regression     µ = w^T x
    Logistic regression   log(µ/(1 − µ)) = w^T x
    Poisson regression    log(µ) = w^T x
    38
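At prediction time the link is applied in reverse: the inverse link maps the linear predictor η = w^T x back to the mean µ. A sketch of the three inverse links from the table:

```python
import math

def identity_inverse(eta):
    # Linear regression: mu = eta
    return eta

def logit_inverse(eta):
    # Logistic regression: log(mu/(1-mu)) = eta  =>  mu = 1/(1+exp(-eta))
    return 1.0 / (1.0 + math.exp(-eta))

def log_inverse(eta):
    # Poisson regression: log(mu) = eta  =>  mu = exp(eta)
    return math.exp(eta)

print(logit_inverse(0.0))   # 0.5
print(log_inverse(0.0))     # 1.0
```

This is what VW's --link option does: --link logistic applies logit_inverse to the raw score so predictions land in (0, 1).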
  • 45. Input format
    Label Importance [Tag]|namespace Feature . . . |namespace Feature . . .
    namespace = String[:Float]
    feature = String[:Float]
    Examples:
    • 1 | 1:0.01 32:-0.1
    • example|namespace normal text features
    • 1 3 tag|ad-features ad description |user-features name address age
    39
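To make the grammar concrete, here is a small helper (hypothetical, not part of VW) that assembles an example line in this format: label, optional importance weight, optional tag, then one "|namespace features" block per namespace.

```python
def vw_line(label, namespaces, importance=None, tag=""):
    """namespaces: list of (namespace_name, [feature, ...]) pairs."""
    head = str(label)
    if importance is not None:
        head += " %g" % importance
    head += " " + tag   # the tag sits directly before the first '|'
    blocks = " |".join("%s %s" % (ns, " ".join(feats)) for ns, feats in namespaces)
    return head + "|" + blocks

line = vw_line(1,
               [("ad-features", ["ad", "description"]),
                ("user-features", ["name", "address", "age"])],
               importance=3, tag="tag")
print(line)  # 1 3 tag|ad-features ad description |user-features name address age
```

Feeding such lines to `vw` over stdin (or to a daemon socket) is all the integration the input side needs.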
  • 46. Input options
    • data file: -d datafile
    • network: --daemon --port <port=26542>
    • compressed data: --compressed
    • stdin: cat <data> | vw
    40
  • 47. Manipulation options
    • ngrams: --ngram
    • skips: --skips
    • quadratic interaction: -q args, e.g. -q ab
    • cubic interaction: --cubic args, e.g. --cubic ccc
    41
  • 48. Output options
    • Examining feature construction: --audit
    • Generating predictions: --predictions or -p
    • Unnormalized predictions: --raw_predictions
    • Testing only: --testonly or -t
    42
  • 49. Model options
    • Model size: --bit_precision or -b. The number of coefficients is limited to 2^b
    • Update an existing model: --initial_regressor or -i
    • Final model destination: --final_regressor or -f
    • Readable model definition: --readable_model
    • Readable feature values: --invert_hash
    • Snapshot the model every pass: --save_per_pass
    • Weight initialization: --initial_weight or --random_weights
    43
  • 50. Regression (Demo)
    https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/linear-regression
    • Linear regression: --loss_function square
    • Quantile regression: --loss_function quantile --quantile_tau <=0.5>
    44
  • 51. Binary classification (Demo)
    https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/classification
    • Note: a linear regressor can be used as a classifier as well
    • Logistic loss: --loss_function logistic, --link logistic
    • Hinge loss (SVM loss function): --loss_function hinge
    • Report binary loss instead of logistic loss: --binary
    45
  • 52. Multiclass classification (Demo)
    https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/multiclass
    • One against all: --oaa <k>
    • Error correcting tournament: --ect <k>
    • Online decision trees: --log_multi <k>
    • Cost sensitive one-against-all: --csoaa <k>
    46
  • 53. LDA options (Demo)
    https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/lda
    • Number of topics: --lda
    • Prior on per-document topic weights: --lda_alpha
    • Prior on topic distributions: --lda_rho
    • Estimated number of documents: --lda_D
    • Convergence parameter for topic estimation: --lda_epsilon
    • Mini batch size: --minibatch
    47
  • 54. Daemon mode (Demo)
    https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/daemon-respond
    https://github.com/sharatsc/cdse/tree/master/vowpal-wabbit/daemon-request
    • Loads a model and answers any prediction request coming over the network
    • The preferred way to deploy a VW model
    • Options
      • --daemon. Enables daemon mode
      • --testonly or -t. Does not update the model in response to requests
      • --initial_regressor or -i. Model to load
      • --port <arg>. Port to listen on for requests
      • --num_children <arg>. Number of threads listening for requests
    48
  • 56. References
    Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/.
    O. Tange. GNU Parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42-47, Feb 2011. doi: 10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel.
    48