Spark Hearts GraphLab Create
COPYRIGHT © 2015 PROKARMA INC. Copyrights, trademarks, and registered trademarks for all technology described in this document are owned by the respective companies.
OVERVIEW
SERVICES
•  Strategy
•  Technology Verticals
•  Industry Verticals
•  Platform Solutions
PRESENCE
•  21 Offices including 11 Global Delivery Centers in the US, India, Argentina & Peru
CUSTOMERS
•  150+ Customers
•  40 Fortune 1000 Companies
PEOPLE
•  Doing business since 2000
•  Global staff of 2000+ employees
What should we talk about?
•  problem formulation -> tool chain construction
•  data science with notebooks
•  the data set
•  Demo: Spark + GraphLab Create (RDD -> SFrame)
•  ML on PySpark
•  introduction to boosted trees in GraphLab Create
•  Demo: Spark + GraphLab Create (PySpark ML with GraphLab Create)
Moving down the data pipeline from raw to results. How best to quickly move through the pipeline to:
1.  Show value of work
2.  Communicate results
3.  Move models into the production pipeline
problem formulation -> tool chain construction
Use lots of different data sources (ETL) -> Clean up your data -> Run sophisticated analytics -> Discover insights! -> Integrate results into pipelines -> Communicate results and value
…NOTEBOOKS
Based on diagram source: Paco Nathan
problem formulation -> tool chain construction
“Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful.” (Gavin Schmidt, in his excellent TED Talk)
Data science with notebooks allows data science teams to move quickly from exploration > transformation > modeling > visualization > export to pipeline.
Many startups are being created to fill the need for a single-product solution with collaboration and containerization. In the meantime: PySpark + IPython + GraphLab Create.
example data set: transportation safety
Fatality Analysis Reporting System (FARS)
National Highway Traffic Safety Administration
•  Publicly available
•  Historical time-series; currently available: 1975 – 2013
•  Raw, rich, relevant
•  Time-series, geolocation
•  Human-recorded events transcribed into annual databases (DBF, SQL, SAS)
•  Measurable outcomes for modeling (updated on a yearly basis)
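For a feel of the first step, here is a minimal sketch of pulling one year of FARS data into an SFrame for exploration. The CSV path and the FATALS column are hypothetical stand-ins: the raw annual files ship as DBF/SQL/SAS extracts and need converting first.

import graphlab as gl

# Hypothetical: one FARS annual accident file, already converted from DBF to CSV.
accidents = gl.SFrame.read_csv('fars/accident_2013.csv')

# Quick look before any modeling.
print accidents.num_rows()
print accidents['FATALS'].sketch_summary()  # hypothetical fatalities-per-accident column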
data science showcase
•  Proof-of-concept demonstration for a customer concerned with transportation safety
•  Application of the scientific method to diverse data sets
•  Evolving real-world data sets for advanced analytics workshops and training sessions
•  Visual and conceptual presentation of a scientific approach to computational analysis
[Figure: map of fatalities in fatal accidents, 1975−2012, plotted by longitude/latitude with a color scale for deaths from 0 to 160,000]
•  Scientific Method
•  Predictive Modeling
•  Hidden Insights
notes about my setup
Hardware:
•  MacBook Pro (late 2012)
•  ~36 GB free disk space
•  8 GB RAM
•  2 cores
•  Not exactly a blazingly fast, top-of-the-line machine…
Software:
•  Spark 1.1.0 for Hadoop 2.4
•  GraphLab Create 1.2.1
•  Hadoop 2.4
•  Scala 2.10.4
•  Python 2.7.9 on Anaconda 1.9.1
working with glc in pyspark
One more step…
gl.get_spark_integration_jar_path()
$SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client my_awesome_code.py
Or…
$SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
Works with --master local OR --master yarn-client.
Then…it works :)
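Once the JAR is on the driver classpath, moving data between the two systems is a single call in each direction. A minimal sketch, assuming the from_rdd/to_rdd helpers that ship with the GraphLab Create 1.2 Spark integration; the toy data is made up.

import graphlab as gl
from pyspark import SparkContext

sc = SparkContext()  # in the pyspark shell, reuse the provided `sc` instead

# Made-up toy RDD of (user, rating) pairs.
rdd = sc.parallelize([('alice', 4), ('bob', 5), ('carol', 3)])

# RDD -> SFrame: collect the distributed data into a GraphLab Create SFrame.
sf = gl.SFrame.from_rdd(rdd)

# SFrame -> RDD: hand results back to Spark for the rest of the pipeline.
rdd_out = sf.to_rdd(sc)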
Demo time!
Graphic courtesy Dato
GraphLab Create
•  Recommender system: factorization-based methods, neighborhood-based item similarity, popularity-based methods
•  Classification: deep learning, boosted trees, support vector machine, logistic regression
•  Regression: boosted trees, deep learning, linear regression
•  Text analysis: topic modeling (LDA); featurization utilities (bm25, tf_idf, remove stop words, etc.)
•  Image analysis: deep learning; image load, resize, mean computation
•  Clustering: k-means
•  Graph analytics
•  Nearest neighbors
•  Vowpal Wabbit wrapper
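Most of these toolkits share the same create() entry point. As one example, a minimal sketch of the recommender toolkit; the ratings SFrame and its column names are made up.

import graphlab as gl

# Made-up ratings data.
ratings = gl.SFrame({'user_id': ['a', 'a', 'b', 'c'],
                     'item_id': ['x', 'y', 'x', 'z'],
                     'rating':  [5, 3, 4, 2]})

# Let GraphLab Create choose a recommender model suited to the data.
model = gl.recommender.create(ratings, user_id='user_id',
                              item_id='item_id', target='rating')

# Top-2 recommendations per user.
print model.recommend(k=2)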
ML in PySpark…even better now
MLlib + GraphX
•  Classification and regression: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (random forests and gradient-boosted trees)
•  Collaborative filtering: alternating least squares (ALS)
•  Clustering: k-means
•  Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
•  Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS)
•  Graph analytics
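For comparison, a minimal MLlib sketch as the Python API looked in Spark 1.1; the toy labeled points are made up.

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext()

# Made-up training data: a label followed by two features.
data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                       LabeledPoint(1.0, [1.0, 0.0]),
                       LabeledPoint(1.0, [1.0, 1.0])])

# Train logistic regression with stochastic gradient descent.
model = LogisticRegressionWithSGD.train(data, iterations=100)

print model.predict([1.0, 0.5])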
Gradient Boosted Trees with GraphLab Create
[Diagram: a boosted ensemble of decision trees t1 … tT, each made of split nodes and leaf nodes. Source: Dato, ICCV 2009 tutorial]
Demo time!
Graphic courtesy Dato
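The demo trains a gradient boosted trees model on the FARS data. A minimal sketch of the boosted trees classifier API in GraphLab Create 1.x; the input file, 'fatal' target, and train/test split are hypothetical stand-ins for the demo's real setup.

import graphlab as gl

# Hypothetical: accident records with engineered features and a binary outcome.
data = gl.SFrame.read_csv('fars/accidents_features.csv')
train, test = data.random_split(0.8)

# Gradient boosted trees: an additive ensemble of shallow decision trees,
# where each new tree fits the residual errors of the ensemble so far.
model = gl.boosted_trees_classifier.create(train,
                                           target='fatal',
                                           max_iterations=10)

# Held-out evaluation.
print model.evaluate(test)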
Questions?
http://goo.gl/forms/y8LYl53hje
Amanda Casari
acasari@prokarma.com
humor from xkcd
