Spark Hearts GraphLab Create
COPYRIGHT © 2015 PROKARMA INC. Copyrights, trademarks, and registered trademarks for all technology described in this document are owned by the respective companies.
OVERVIEW
SERVICES
•  Strategy
•  Technology Verticals
•  Industry Verticals
•  Platform Solutions
PRESENCE
•  21 Offices including 11 Global Delivery Centers in the US, India, Argentina & Peru
CUSTOMERS
•  150+ Customers
•  40 Fortune 1000 Companies
PEOPLE
•  Doing business since 2000
•  Global staff of 2000+ employees
What should we talk about?
•  problem formulation -> tool chain construction
•  data science with notebooks
•  the data set
•  Demo: Spark + GraphLab Create (RDD -> SFrame)
•  ML on PySpark
•  introduction to boosted trees in GraphLab Create
•  Demo: Spark + GraphLab Create (PySpark ML with GraphLab Create)
Moving down the data pipeline from raw to results. How best to quickly move through the pipeline to:
1.  Show value of work
2.  Communicate results
3.  Move models into the production pipeline
problem formulation -> tool chain construction
Use lots of different data sources (ETL) -> Clean up your data -> Run sophisticated analytics -> Discover insights! -> Integrate results into pipelines -> Communicate results and value
…NOTEBOOKS
Based on diagram source: Paco Nathan
problem formulation -> tool chain construction
“Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful.” (Gavin Schmidt, in his excellent TED Talk)
Data science with notebooks allows data science teams to move quickly from exploration > transformation > modeling > visualization > export to pipeline.
Many startups are being created to fill the need for a single-product solution with collaboration and containerization. In the meantime: PySpark + IPython + GraphLab Create.
example data set: transportation safety
Fatality Analysis Reporting System (FARS)
National Highway Traffic Safety Administration
•  Publicly available
•  Historical time-series; currently available: 1975 – 2013
•  Raw, rich, relevant
•  Time-series, geolocation
•  Human-recorded events transcribed into annual databases (DBF, SQL, SAS)
•  Measurable outcomes for modeling (updated on a yearly basis)
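For a feel of the first step, here is a minimal sketch of pulling one year of FARS data into an SFrame for exploration. The CSV path and the FATALS column are hypothetical stand-ins: the raw annual files ship as DBF/SQL/SAS extracts and need converting first.

import graphlab as gl

# Hypothetical: one FARS annual accident file, already converted from DBF to CSV.
accidents = gl.SFrame.read_csv('fars/accident_2013.csv')

# Quick look before any modeling.
print accidents.num_rows()
print accidents['FATALS'].sketch_summary()  # hypothetical fatalities-per-accident column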
data science showcase
•  Proof-of-concept demonstration for a customer concerned with transportation safety
•  Application of the scientific method to diverse data sets
•  Evolving real-world data sets for advanced analytics workshops and training sessions
•  Visual and conceptual presentation of a scientific approach to computational analysis
[Figure: map of fatalities in fatal accidents, 1975−2012, plotted by longitude/latitude with a color scale for deaths from 0 to 160,000]
•  Scientific Method
•  Predictive Modeling
•  Hidden Insights
notes about my setup
Hardware:
•  MacBook Pro (late 2012)
•  ~36 GB free disk space
•  8 GB RAM
•  2 cores
•  Not exactly a blazingly fast, top-of-the-line machine…
Software:
•  Spark 1.1.0 for Hadoop 2.4
•  GraphLab Create 1.2.1
•  Hadoop 2.4
•  Scala 2.10.4
•  Python 2.7.9 on Anaconda 1.9.1
working with glc in pyspark
One more step…
gl.get_spark_integration_jar_path()
$SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client my_awesome_code.py
Or…
$SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
Works with --master local OR --master yarn-client.
Then…it works :)
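Once the JAR is on the driver classpath, moving data between the two systems is a single call in each direction. A minimal sketch, assuming the from_rdd/to_rdd helpers that ship with the GraphLab Create 1.2 Spark integration; the toy data is made up.

import graphlab as gl
from pyspark import SparkContext

sc = SparkContext()  # in the pyspark shell, reuse the provided `sc` instead

# Made-up toy RDD of (user, rating) pairs.
rdd = sc.parallelize([('alice', 4), ('bob', 5), ('carol', 3)])

# RDD -> SFrame: collect the distributed data into a GraphLab Create SFrame.
sf = gl.SFrame.from_rdd(rdd)

# SFrame -> RDD: hand results back to Spark for the rest of the pipeline.
rdd_out = sf.to_rdd(sc)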
Demo time!
Graphic courtesy Dato
GraphLab Create
•  Recommender system: factorization-based methods, neighborhood-based item similarity, popularity-based methods
•  Classification: deep learning, boosted trees, support vector machine, logistic regression
•  Regression: boosted trees, deep learning, linear regression
•  Text analysis: topic modeling (LDA); featurization utilities (bm25, tf_idf, remove stop words, etc.)
•  Image analysis: deep learning; image load, resize, mean computation
•  Clustering: k-means
•  Graph analytics
•  Nearest neighbors
•  Vowpal Wabbit wrapper
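Most of these toolkits share the same create() entry point. As one example, a minimal sketch of the recommender toolkit; the ratings SFrame and its column names are made up.

import graphlab as gl

# Made-up ratings data.
ratings = gl.SFrame({'user_id': ['a', 'a', 'b', 'c'],
                     'item_id': ['x', 'y', 'x', 'z'],
                     'rating':  [5, 3, 4, 2]})

# Let GraphLab Create choose a recommender model suited to the data.
model = gl.recommender.create(ratings, user_id='user_id',
                              item_id='item_id', target='rating')

# Top-2 recommendations per user.
print model.recommend(k=2)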
ML in PySpark…even better now
MLlib + GraphX
•  Classification and regression: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (random forests and gradient-boosted trees)
•  Collaborative filtering: alternating least squares (ALS)
•  Clustering: k-means
•  Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
•  Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS)
•  Graph analytics
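For comparison, a minimal MLlib sketch as the Python API looked in Spark 1.1; the toy labeled points are made up.

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext()

# Made-up training data: a label followed by two features.
data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                       LabeledPoint(1.0, [1.0, 0.0]),
                       LabeledPoint(1.0, [1.0, 1.0])])

# Train logistic regression with stochastic gradient descent.
model = LogisticRegressionWithSGD.train(data, iterations=100)

print model.predict([1.0, 0.5])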
Gradient Boosted Trees with GraphLab Create
[Diagram: a boosted ensemble of decision trees t1 … tT, each made of split nodes and leaf nodes. Source: Dato, ICCV 2009 tutorial]
Demo time!
Graphic courtesy Dato
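The demo trains a gradient boosted trees model on the FARS data. A minimal sketch of the boosted trees classifier API in GraphLab Create 1.x; the input file, 'fatal' target, and train/test split are hypothetical stand-ins for the demo's real setup.

import graphlab as gl

# Hypothetical: accident records with engineered features and a binary outcome.
data = gl.SFrame.read_csv('fars/accidents_features.csv')
train, test = data.random_split(0.8)

# Gradient boosted trees: an additive ensemble of shallow decision trees,
# where each new tree fits the residual errors of the ensemble so far.
model = gl.boosted_trees_classifier.create(train,
                                           target='fatal',
                                           max_iterations=10)

# Held-out evaluation.
print model.evaluate(test)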
Questions?
http://goo.gl/forms/y8LYl53hje
Amanda Casari
acasari@prokarma.com
humor from xkcd
