Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"


As the need for data science as a key differentiator grows in all industries, from large corporations to startups, sharing ideas and methods in the community makes it possible to get to results quickly. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of its practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos), along with their benefits and limitations, using examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?

Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher, she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal, she works with customers in life science and healthcare, building models to derive insight and business value from their data.

Published in: Technology

  1. BUILT FOR THE SPEED OF BUSINESS
  2. Data Science as a Commodity: How to use MADlib, R, and other Publicly Available and Open Source Tools for Data Science
     Pivotal OSS Meetups
     Sarah Aerni, Pivotal Senior Data Scientist
     @itweetsarah, saerni@gopivotal.com
     January 28, 2014
     © Copyright 2014 Pivotal. All rights reserved.
  3. What we will cover in today’s Meetup
     • What is data science, big data, buzzword, buzzword?
     • What are some examples of data science in action?
     • What do I do at Pivotal?
     • Who are our data scientists?
     • Why is open source software important for data science?
     • What do I do with loads of data?
     • How can I create good models?
     • What types of open source tools can I use to build models?
     • How can I build a quick app?
     • What can I do to get started analyzing text data?
     • Which tools exist to create visualizations of my data that I can understand?
     • What tools does our team use? For NLP? For optimization? For regression?
  4. What we will not cover #notdatascience
  5. Instead: Practical Data Science Tools #useful
     – Kaushik Das
     http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science
  7. Instead: Practical Data Science Tools #useful
     “At companies where there is no framework for operationalization of the models, PowerPoint is where models go to die!”
     – Hulya Farinas
     http://venturebeat.com/2013/12/03/how-torevolutionize-healthcare-get-data-scientists-andapp-developers-together/
     “The use of statistical and machine learning techniques on big multistructured data — in a distributed computing environment — to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.”
     – Annika Jimenez
     http://blog.gopivotal.com/news-2/annika-jimenez-ondisruptive-data-science-at-the-strata-conference
  8. DATA IS THE NEW CENTER OF GRAVITY
     Data > Application!
     “BIG DATA IS THE NEW NORMAL”
     “‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”
  9. What Can “Small Data” Scientists Bring on Their “Big Data” Journey?
     http://factspy.net/the-differencebetween-geeks-vs-nerds/
  10. What Can “Small Data” Scientists Bring on Their “Big Data” Journey?
      Many tools and approaches are being adapted to big data technologies
      [Diagram labels: Small Data, Big Data, Databases, Flat files, In-memory model building, MapReduce, HDFS, Cloud computing, Distributed computing, Command-line tools]
  13. Basic DS Tools: From Command-line to GUI
      • Quick-and-dirty tricks using command-line tools
        – Fast feedback - interactive
        – Fast to process
        – Easy to write, hard to read
        – Background processing (screen)
      • Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
      • Python and R
        – RStudio
        – IPython (IPython Notebook)
      Ian Huston, Alex Kagoshima, Ronert Obst
  14. Favorite Python and R packages and resources
      • Python
        – NumPy
        – SciPy
        – scikit-learn – machine learning package
        – statsmodels
        – pandas
        – PyMC
        – IPython (IPython Notebook)
        – matplotlib
      Ian Huston, Alex Kagoshima, Ronert Obst
  15. Favorite Python and R packages, resources, and more
      • R
        – ggplot
        – reshape
        – plyr
        – Shiny
        – Good support for time series analyses
        – RStudio ( weave )
        – foreach, parallel
        – taskviews
        – parboost
      Ian Huston, Alex Kagoshima, Ronert Obst
  16. What do I do at Pivotal?
      A New Platform for a New Era
      DATA-DRIVEN APPLICATION DEVELOPMENT
      • App Fabric: “The new Middleware”
      • Data Fabric: “The new Database”
      • Cloud Fabric: “The new OS”
      • ...ETC: “The new Hardware”
  17. Pivotal Big Data Technology: HAWQ
      Think of it as multiple PostgreSQL servers
      • Master
      • Segments/Workers
      Rows are distributed across segments by a particular field (or randomly)
      Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
  18. Performance Through Parallelism
      • Automatic parallelization
        – Load and query like any database
        – Tables are automatically distributed across nodes
      • Analytics-oriented query optimization
      • Scalable MPP architecture
        – All nodes can scan and process in parallel
        – Linear scalability by adding nodes
  19. Data Science Tools for Big Data
      COMMERCIAL
      OPEN SOURCE (OR FREE)
      PL/R, PL/Python, PL/Java
  23. Making sense of your “big data”
      • Large volumes of data may be difficult to understand
        – ~100 tables
        – Tens of thousands of columns
      • How do you build models that use all the data? Score all the data?
      • Where do you focus your effort?
        – Getting a rapid grasp of relevant fields is important
        – Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data
        – Columns with little or no variation or only null values
      • These functions exist in MADlib
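The idea of quickly flagging columns with little or no variation, or only null values, can be sketched in plain Python. This is an illustration on a small in-memory table and not MADlib's own summary functionality; the function names `profile_columns` and `uninformative_columns` and the sample rows are hypothetical.

```python
def profile_columns(rows):
    """Return per-column null fraction and distinct non-null count.

    `rows` is a list of dicts that all share the same keys
    (a stand-in for a database table; hypothetical sample data).
    """
    if not rows:
        return {}
    profile = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        profile[col] = {
            "null_fraction": 1 - len(non_null) / len(values),
            "distinct": len(set(non_null)),
        }
    return profile


def uninformative_columns(rows):
    """Columns that are entirely null or take a single value."""
    return sorted(
        col for col, stats in profile_columns(rows).items()
        if stats["distinct"] <= 1
    )


rows = [
    {"site": "A", "depth": 10.5, "unit": "m", "notes": None},
    {"site": "B", "depth": 12.0, "unit": "m", "notes": None},
    {"site": "A", "depth": None, "unit": "m", "notes": None},
]
print(uninformative_columns(rows))
```

On a real system the same counts would be computed inside the database so that the data never has to be pulled into memory.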
  24. MADlib In-Database Functions
      Predictive Modeling Library
        Generalized Linear Models
          • Linear Regression
          • Logistic Regression
          • Multinomial Logistic Regression
          • Cox Proportional Hazards Regression
          • Elastic Net Regularization
          • Sandwich Estimators (Huber-White, clustered, marginal effects)
        Matrix Factorization
          • Singular Value Decomposition (SVD)
          • Low-Rank
        Machine Learning Algorithms
          • Principal Component Analysis (PCA)
          • Association Rules (Affinity Analysis, Market Basket)
          • Topic Modeling (Parallel LDA)
          • Decision Trees
          • Ensemble Learners (Random Forests)
          • Support Vector Machines
          • Conditional Random Field (CRF)
          • Clustering (K-means)
          • Cross Validation
        Linear Systems
          • Sparse and Dense Solvers
      Descriptive Statistics
        Sketch-based Estimators
          • CountMin (Cormode-Muthukrishnan)
          • FM (Flajolet-Martin)
          • MFV (Most Frequent Values)
        Correlation
        Summary
      Support Modules
        Array Operations
        Sparse Vectors
        Random Sampling
        Probability Functions
  25. MADlib in Action: Regression on Billions of Rows
      • Input Data
        – 10s of millions of rows from data collected at multiple drill testing sites
        – Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit and more
      • Data Massaging and Review
        – Rapid summarization of many columns of data to identify outliers and missing data and remove them from analysis
        – Used window functions to construct a moving average (smoothing) of all the features and dependent variable
      • Model
        – Linear regression on the complete dataset
        – K-means clustering to determine similarities of sites
      Rashmi Raghu
      [Image: Drilling into the San Andreas Fault at Parkfield California. Credit: Stephen H. Hickman, USGS]
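The smoothing step mentioned above can be illustrated with a short Python sketch of a moving average. The slide does not say which window frame was used, so a trailing window is assumed here; the function name and the sample readings are hypothetical.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average over the last `window` values.

    Mirrors the idea of a SQL window frame such as
    ROWS BETWEEN (window - 1) PRECEDING AND CURRENT ROW:
    early rows average over however many values are available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    running_sum = 0.0
    smoothed = []
    for v in values:
        if len(recent) == window:
            running_sum -= recent[0]  # value about to fall out of the frame
        recent.append(v)
        running_sum += v
        smoothed.append(running_sum / len(recent))
    return smoothed


rate_of_penetration = [10.0, 12.0, 11.0, 30.0, 12.0]  # hypothetical readings
print(moving_average(rate_of_penetration, window=3))
```

Keeping a running sum means each reading is touched once, which is the same property that makes window functions cheap to evaluate in the database.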
  26. Linear Regression: Streaming Algorithm
      • Finding linear dependencies between variables
      • How to compute with a single scan?
  27. Linear Regression: Parallel Computation
      $X^T y = \sum_i x_i^T y_i$
  28. Linear Regression: Parallel Computation
      $\underbrace{X_1^T y_1}_{\text{Segment 1}} + \underbrace{X_2^T y_2}_{\text{Segment 2}} = \underbrace{X^T y}_{\text{Master}}$
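The segment-and-master decomposition can be sketched in plain Python. Each segment scans its own rows once and returns partial sums, and the master adds them. The slides show only the $X^T y$ term; accumulating $X^T X$ the same way and solving the normal equations is the standard completion and is an assumption here, as are all function names and the sample data.

```python
def segment_partials(rows):
    """Compute one segment's contribution: (X^T X, X^T y).

    `rows` is a list of (x, y) pairs where x is a feature list.
    Each row is visited once, so a single scan suffices.
    """
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge_partials(a, b):
    """Master step: add two segments' partial sums element-wise."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X^T X) beta = X^T y by Gauss-Jordan elimination with pivoting."""
    k = len(xty)
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical data following y = 2 + 3 * x exactly, split over two segments.
segment_1 = [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)]
segment_2 = [([1.0, 2.0], 8.0), ([1.0, 3.0], 11.0)]

combined = merge_partials(segment_partials(segment_1), segment_partials(segment_2))
beta = solve(*combined)
print(beta)
```

Because the partial sums are small (k by k and k by 1) regardless of the number of rows, only these summaries need to travel from the segments to the master.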
  30. Performing a linear regression on 10 million rows in seconds
      Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
  31. Calling MADlib Functions: Fast Training, Scoring
      • MADlib allows users to easily create models without moving data out of the system
        – Model generation
        – Model validation
        – Scoring (evaluation of) new data
      • All the data can be used in one model
      • Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
      • Open-source lets you tweak and extend methods, or build your own
      MADlib model function:
          SELECT madlib.linregr_train(
              'houses',                     -- table containing training data
              'houses_linregr',             -- table in which to save results
              'price',                      -- column containing dependent variable
              'ARRAY[1, tax, bath, size]'   -- features included in the model
          );
  32. Calling MADlib Functions: Fast Training, Scoring
      MADlib model function with grouping:
          SELECT madlib.linregr_train(
              'houses',                     -- table containing training data
              'houses_linregr',             -- table in which to save results
              'price',                      -- column containing dependent variable
              'ARRAY[1, tax, bath, size]',  -- features included in the model
              'bedroom'                     -- create multiple output models (one for each value of bedroom)
          );
  33. Calling MADlib Functions: Fast Training, Scoring
      MADlib model scoring function:
          SELECT houses.*,
                 madlib.linregr_predict(ARRAY[1, tax, bath, size], m.coef) AS predict
          FROM houses,            -- table with data to be scored
               houses_linregr m;  -- table containing model
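For a linear model, scoring a row comes down to a dot product between the fitted coefficients and that row's feature vector. The Python sketch below illustrates the arithmetic only; the function name, the coefficient values, and the sample house are hypothetical and are not taken from a fitted MADlib model.

```python
def linear_predict(coef, features):
    """Dot product of fitted coefficients and one row's feature vector."""
    if len(coef) != len(features):
        raise ValueError("coef and features must have the same length")
    return sum(c * f for c, f in zip(coef, features))


# Hypothetical coefficients for [intercept, tax, bath, size].
coef = [1000.0, 20.0, 5000.0, 50.0]
house = {"tax": 590, "bath": 1, "size": 770}

# The leading 1 pairs with the intercept, as in ARRAY[1, tax, bath, size].
features = [1, house["tax"], house["bath"], house["size"]]
print(linear_predict(coef, features))
```

The constant 1 in the feature array is what lets the first coefficient act as the intercept.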
  34. PivotalR: Bringing MADlib and HAWQ to a familiar R interface
      • Challenge: Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
      • Simple solution: Translate R code into SQL
      PivotalR:
          d <- db.data.frame("houses")
          houses_linregr <- madlib.lm(price ~ tax + bath + size, data = d)
      SQL Code:
          SELECT madlib.linregr_train(
              'houses',
              'houses_linregr',
              'price',
              'ARRAY[1, tax, bath, size]'
          );
      http://gopivotal.github.io/PivotalR/
      Woo Jung
  35. PivotalR: Bringing MADlib and HAWQ to a familiar R interface
      PivotalR:
          # Build a regression model with a different
          # intercept term for each state
          # (state=1 as baseline).
          # Note that PivotalR supports automated
          # indicator coding a la as.factor()!
          d <- db.data.frame("houses")
          houses_linregr <- madlib.lm(price ~ as.factor(state) + tax + bath + size, data = d)
      http://gopivotal.github.io/PivotalR/
      Woo Jung
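Indicator coding with a baseline level, as referred to in the comment above, can be sketched in Python. This illustrates the general technique and is not PivotalR's implementation; the function name and the sample state codes are hypothetical.

```python
def indicator_columns(values, baseline=None):
    """Dummy-code a categorical column, dropping the baseline level.

    Returns (levels, rows): `levels` are the non-baseline categories in
    sorted order, and each row holds one 0/1 indicator per level.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    if baseline not in levels:
        raise ValueError("baseline must be one of the observed values")
    levels = [lv for lv in levels if lv != baseline]
    rows = [[1 if v == lv else 0 for lv in levels] for v in values]
    return levels, rows


states = [1, 2, 3, 2, 1]  # hypothetical state codes
levels, rows = indicator_columns(states, baseline=1)
print(levels)
for row in rows:
    print(row)
```

Rows belonging to the baseline level get all zeros, so each remaining indicator's coefficient acts as a shift of the intercept relative to the baseline.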
  36. PivotalR Design Overview
      • Call MADlib’s in-DB machine learning functions directly from R
      • Syntax is analogous to native R functions
      • Data doesn’t need to leave the database
      • All heavy lifting, including model estimation & computation, is done in the database
      [Diagram: PivotalR (R → SQL, no data here) sends SQL to execute through RPostgreSQL to the database w/ MADlib (data lives here); computation results come back]
      http://gopivotal.github.io/PivotalR/
      Woo Jung
  37. PivotalR: Current Features
      • MADlib Functionality
        – Linear Regression
        – Logistic Regression
        – Elastic Net
        – ARIMA
        – Marginal Effects
        – Cross Validation
        – Bagging
        – summary on model objects
        – Automated Indicator Variable Coding (as.factor)
        – predict
      • And more ... (SQL wrapper)
        – Arithmetic: + - * / %% %/% ^
        – Comparison and logic: == != > < >= <= & | !
        – Indexing and assignment: $ [ [[ $<- [<- [[<-
        – dim, names, by, merge, sort, is.na, preview, content, c
        – mean, sum, sd, var, min, max, length, colMeans, colSums
        – db.data.frame, as.db.data.frame
        – db.connect, db.disconnect, db.list, db.objects, db.existsObject, delete
      http://gopivotal.github.io/PivotalR/
  39. http://www.rstudio.com/shiny/
      http://gopivotal.github.io/PivotalR/
      Woo Jung
  40. Shiny Showcase: Example Web Apps in R
      • Users can choose input parameters with sliders, drop-downs, and text fields.
      • HTML/JavaScript knowledge not required.
      http://www.rstudio.com/shiny/
  43. D3 Data-Driven Documents
      http://d3js.org/
  45. PyMADlib
      • Python wrapper for MADlib
      http://nbviewer.ipython.org/gist/vatsan/5275846
  47. Procedural Languages in Big Data Science
      • HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel – embarrassingly parallel tasks
      • Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
      • Examples of “data parallel” problems:
        – Counting words in documents
        – Genome-Wide Association Study
        – Studying network anomalies
      [Diagram: SQL & R → Master Servers → Network Interconnect → Segment Servers; on the segments, Doc1, Doc2, ... DocM → Stem1, Stem2, ... StemM → Count1, Count2, ... CountM]
      http://gopivotal.github.io/gp-r/
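The word-counting example can be sketched in Python to show why it is data parallel: each document is counted without reference to any other, and the partial counts are merged only at the end. A thread pool stands in for the segment servers here purely to show the structure (Python threads do not speed up CPU-bound work); the function names and sample documents are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(document):
    """Count words in one document; needs nothing from other documents."""
    return Counter(document.lower().split())


def count_all(documents, workers=4):
    """Count each document independently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, documents))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return partial_counts, total


documents = [  # hypothetical documents
    "big data is the new normal",
    "data is the new center of gravity",
]
per_document, overall = count_all(documents)
print(overall.most_common(3))
```

In a PL/R or PL/Python setting the per-document function would run on whichever segment holds the row, with no communication between segments.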
  48. Structure of input table for PL/R function
      Column             Description
      Network ID         ID of the network. 300K in total.
      Topology           Array of integers defining the topology tree.
      Network Readings   Array of readings from network terminal points over (say) a week.
      • Topology: Hubs connected to multiple terminal points
      • Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments
      • Detecting anomalies within subnetworks on future observations
      [Diagram: topology tree with nodes 0, A, B, C, D; label: Terminal readings]
      Vivek Ramamurthy
  49. Performance Analysis
      Number of networks   Time/network (ms)   Total time (seconds)
      500                  6.604               3.30
      1000                 3.637               3.64
      5000                 2.822               14.11
      10,000               2.356               23.56
      50,000               2.160               108.02
      100,000              2.142               214.20
      150,000              2.162               324.29
      200,000              2.142               428.48
      250,000              2.138               534.69
      300,000              2.132               639.85
      [Plot: Execution time v/s number of networks; x-axis: Number of networks (in thousands); y-axis: Time (seconds)]
      Vivek Ramamurthy
  52. Performance Analysis
      R package used                 optim       quadprog   Rsymphony   Rglpk
      Single network in R (time)     ~60 s       6.3 s      0.145 s     0.181 s
      300K networks in PL/R (time)   ~84 hrs     5.87 hrs   10.7 min    14.6 min
      Time per network in PL/R       1005.2 ms   70.44 ms   2.13 ms     2.92 ms
      COIN-OR: Computational Infrastructure for Operations Research
      http://www.coin-or.org/
        – Libraries for linear and non-linear programming, integer programming
        – SYMPHONY: Callable library in COIN-OR for solving mixed integer linear programs
      GLPK: GNU Linear Programming Kit
      http://www.gnu.org/software/glpk/
        – Used for large-scale LPs, MIPs and related problems
      Vivek Ramamurthy
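The totals in the table follow from the per-network times by simple multiplication, which the short Python check below reproduces. The numbers are copied from the table; the function and variable names are hypothetical.

```python
NETWORKS = 300_000  # "300K networks" from the table

# Time per network in PL/R (milliseconds), as listed in the table.
ms_per_network = {
    "optim": 1005.2,
    "quadprog": 70.44,
    "Rsymphony": 2.13,
    "Rglpk": 2.92,
}


def total_seconds(ms_each, count=NETWORKS):
    """Total run time in seconds for `count` networks at `ms_each` ms apiece."""
    return ms_each * count / 1000.0


for package, ms in ms_per_network.items():
    seconds = total_seconds(ms)
    print(f"{package:10s} {seconds / 60:9.1f} min  {seconds / 3600:7.2f} hrs")
```

The results agree with the "300K networks in PL/R" row to within rounding: roughly 84 hours for optim, 5.87 hours for quadprog, and about 10.7 and 14.6 minutes for Rsymphony and Rglpk.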
  53. Natural language processing
      Common tasks/tools in NLP
      • Data sources
        – Text sources: Documents, books, emails
        – Speech: Phone logs, conversations
      • NLP processing pipeline
        – Sentence detection
        – Tokenization
        – Morphological stemming
        – Stop word removal
        – Word-sense disambiguation
        – Part-of-Speech tagging
        – Syntactic parsing
        – Semantic role labeling
        – Entity recognition
        – Reference resolution
        – Event processing
      • Applications
        – Word clouds
        – Topic modeling
        – Sentiment analysis
        – Machine translation
        – Document classification
        – Document summarization
        – Language generation
        – Search
        – Question answering
        – Information Extraction
        – …
      Niels Kasch
  57. Open source tools for common NLP tasks
      WORD CLOUDS
        Relevant NLP tools: Tokenization, Stemming/lemmatization, Stop word removal
        Open source software:
          • GPText
          • Apache UIMA
          • OpenNLP (Java)
          • NLTK (Python)
          • WordNet
          • Pytagcloud
      TOPIC MODELING / TEXT CLASSIFICATION
        Relevant NLP tools: Tokenization, Stemming/lemmatization, Stop word removal, Language detection
        Open source software:
          • MADlib (PLDA)
          • gensim (LSA & LDA package for Python)
          • https://code.google.com/p/language-detection/
      INFORMATION EXTRACTION
        Relevant NLP tools: Sentence detection, Tokenization, Language detection, Relationship extraction, Syntactic parsing, Entity extraction
        Open source software:
          • GPText and MADlib
          • OpenNLP
          • NLTK
          • Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)
      Niels Kasch
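The preprocessing steps that recur in the list above (tokenization, stop word removal, stemming) can be sketched with the Python standard library alone. This is a toy stand-in for what toolkits such as NLTK or OpenNLP provide: the stop word list, the suffix-stripping rule, and the function names are all hypothetical simplifications.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real toolkits ship much larger ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_frequencies(text):
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(naive_stem(t) for t in tokens)


text = "The models are scoring the data, and the model scores are the results."
print(term_frequencies(text).most_common(3))
```

The resulting term counts are the kind of input a word cloud or a topic model would start from.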
  59. Topic Analysis – MADlib pLDA
      • Natural Language Processing - GPText
        – Social Media Tokenizer
        – Filter relevant content
        – Align Data
        – Stemming, frequency filtering
        – Prepare dataset for Topic Modeling
      • MADlib Topic Model
        – Topic Graph
        – Topic composition
        – Topic Clouds
      Srivatsan Ramanujam
  60. Is there more? What’s next?
      blog.gopivotal.com/tag/data-science
      blog.gopivotal.com/tag/data-science-tech
  61. BUILT FOR THE SPEED OF BUSINESS