Hands-on Data Science
and OSS
Driving Business
Value with
Open Source and
Data Science
Kevin Crocker,
@_K_C_Pivotal
#datascience, #oscon
VM info
Everything is ‘oscon2014’
User:password –> oscon2014:oscon2014
PostgreSQL 9.2.8 dbname -> oscon2014
Root password -> oscon2014
Installed software: postgresql 9.2.8, R, MADlib, pl/pythonu,
pl/pgsql, anaconda (/home/oscon2014/anaconda), pgadmin3,
Rstudio, pyMADlib, and more to come in v1.1
Objective of Data Science
DRIVE AUTOMATED
LOW LATENCY ACTIONS
IN RESPONSE TO
EVENTS OF INTEREST
What Matters: Apps. Data. Analytics.
Apps power businesses, and
those apps generate data
Analytic insights from that data
drive new app functionality,
which in turn drives new data
The faster you can move
around that cycle, the faster
you learn, innovate & pull
away from the competition
What Matters: OSS at the core
The same cycle, now with open source software at its core: apps power businesses and generate data; analytic insights from that data drive new app functionality, which in turn drives new data; and the faster you move around that cycle, the faster you learn, innovate, and pull away from the competition.
End Game: Drive Business Value with OSS
Solve interesting problems that can't easily be solved with current technology
Use (find) the right tool for the job
- If the right tools don't exist, create them
- Prefer OSS if it fits the need
- Drive business value through distributed, MPP analytics
- Operationalization (O16n) of your Analytics
Create interesting solutions that drive business value
PIVOTAL DATA SCIENCE TOOLKIT: a large and varied tool box!
1. Find Data. Platforms: Pivotal Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS
2. Write Code
3. Run Code. Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, Putty, Winscp
4. Write Code for Big Data
5. Implement Algorithms
6. Show Results
7. Collaborate. Sharing Tools: Chorus, Confluence, Socialcast, Github, Google Drive & Hangouts
Toolkit?
This image was created by Swami Chandrasekaran, Enterprise Architect, IBM.
He has a great article about what it takes to be a Data Scientist: Road Map to Data Scientist
http://nirvacana.com/thoughts/becoming-a-data-scientist/
We need the right technology for every step
Open Source At Pivotal
 Pivotal has a lot of open source projects, and a lot of people involved in Open Source
 PostgreSQL, Apache Hadoop (4)
 MADlib (16), PivotalR (2), pyMADlib (4), Pandas via SQL (3),
 Spring (56), Groovy (3), Grails (3)
 Apache Tomcat (2) and HTTP Server (1)
 Redis (1)
 Rabbit MQ (4)
 Cloud Foundry (90)
 Open Chorus
 We use a combination of our commercial software and OSS to drive business value through
Data Science
Motivation
 Our story starts with SQL – so naturally we try to use SQL
for everything! Everything?
 SQL is great for many things, but it’s not nearly enough
–Straightforward way to query data
–Not necessarily designed for
data science
 Data Scientists know other
languages – R, Python, …
Our challenge
 MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL
 R / Python
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science
 Want to leverage both the performance benefits of MADlib and the
usability of languages like R and Python
How Pivotal Data Scientists Select Which
Tool to Use
Optimized for algorithm performance,
scalability, & code overhead
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
MADlib
 MAD stands for: Magnetic, Agile, Deep (from the "MAD Skills" paper)
 lib stands for library of:
– advanced (mathematical, statistical, machine learning)
– parallel & scalable
– in-database functions
 Mission: to foster widespread development of scalable
analytic skills, by harnessing efforts from commercial
practice, academic research, and open-source development
MADlib: A Community Project
Open Source: BSD License
• Developed as a partnership with multiple universities
– University of California-Berkeley
– University of Wisconsin-Madison
– University of Florida
• Compatible with Postgres, Greenplum Database, and Hadoop
via HAWQ
• Designed for Data Scientists to provide Scalable, Robust
Analytics capabilities for their business problems.
• Homepage: http://madlib.net
• Documentation: http://doc.madlib.net
• Source: https://github.com/madlib
• Forum: http://groups.google.com/group/madlib-user-forum
MADlib: Architecture
• Core Methods: Generalized Linear Models, Matrix Factorization, Machine Learning Algorithms, Linear Systems, Probability Functions
• Support Modules: Random Sampling, Array Operations, Sparse Vectors
• C++ Database Abstraction Layer: Data Type Mapping, Exception Handling, Logging and Reporting, Linear Algebra, Memory Management, Boost Support
• Database Platform Layer: User Defined Functions, User Defined Types, User Defined Aggregates, User Defined Operators, OLAP Window Functions, OLAP Grouping Sets
MADlib: Diverse User Experience (SQL, Python, R, Open Chorus)
from pymadlib.pymadlib import *
conn = DBConnect()
lreg = LinearRegression(conn)
lreg.train(input_table, indepvars, depvar)
cursor = lreg.predict(input_table, depvar)
scatterPlot(actual, predicted, dataset)
psql> select madlib.linregr_train('abalone',
'abalone_linregr',
'rings',
'array[1,diameter,height]');
psql> select coef, r2 from abalone_linregr;
-[ RECORD 1 ]----------------------------------------------
coef | {2.39392531944631,11.7085575219689,19.8117069108094}
r2 | 0.350379630701758
MADlib In-Database
Functions
Predictive Modeling Library
Linear Systems
•Sparse and Dense Solvers
Matrix Factorization
•Singular Value Decomposition (SVD)
•Low-Rank
Generalized Linear Models
•Linear Regression
•Logistic Regression
•Multinomial Logistic Regression
•Cox Proportional Hazards Regression
•Elastic Net Regularization
•Sandwich Estimators (Huber-White,
clustered, marginal effects)
Machine Learning Algorithms
•Principal Component Analysis (PCA)
•Association Rules (Affinity Analysis, Market
Basket)
•Topic Modeling (Parallel LDA)
•Decision Trees
•Ensemble Learners (Random Forests)
•Support Vector Machines
•Conditional Random Field (CRF)
•Clustering (K-means)
•Cross Validation
Descriptive Statistics
Sketch-based Estimators
•CountMin (Cormode-Muthukrishnan)
•FM (Flajolet-Martin)
•MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Calling MADlib Functions: Fast Training, Scoring
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]',
                             'bedroom');
Arguments to the MADlib model function:
– 'houses': table containing training data
– 'houses_linregr': table in which to save results
– 'price': column containing the dependent variable
– 'ARRAY[1, tax, bath, size]': features included in the model
– 'bedroom': grouping column, which creates multiple output models (one for each value of bedroom)
 MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
 All the data can be used in one model
 Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)
 Open source lets you tweak and extend methods, or build your own
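Because the 'bedroom' grouping argument was supplied, the output table holds one fitted model per group. A hedged peek at the per-group results (column names follow MADlib's linregr_train output table; exact columns vary by version):

-- one row per distinct bedroom value, each with its own coefficients
SELECT bedroom, coef, r2
FROM houses_linregr
ORDER BY bedroom;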
Calling MADlib Functions: Fast Training, Scoring
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');
SELECT houses.*,
       madlib.linregr_predict(ARRAY[1,tax,bath,size],
                              m.coef
       ) AS predict
FROM houses, houses_linregr m;
The second statement is the MADlib model scoring function: houses is the table with data to be scored, and houses_linregr is the table containing the model.
 MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
 All the data can be used in one model
 Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)
 Open source lets you tweak and extend methods, or build your own
K-Means Clustering
Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In the k-means variant, given n points x_1, …, x_n ∈ ℝ^d, the goal is to position k centroids c_1, …, c_k ∈ ℝ^d so that the sum of distances between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest.
So, we are trying to find the centroids which minimize the total distance between all the points and the centroids.
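In symbols, with dist the chosen distance function (the slides leave the metric open; squared Euclidean distance is the common choice), k-means seeks

\min_{c_1,\ldots,c_k \in \mathbb{R}^d} \; \sum_{i=1}^{n} \; \min_{1 \le j \le k} \mathrm{dist}(x_i, c_j)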
K-means Clustering
Example Use Cases:
Which Blogs are Spam Blogs?
Given a user’s preferences, which
other blog might she/he enjoy?
What are our customers saying about
us?
What are our customers saying about us?
 Discern trends and categories in on-line conversations
- Search for relevant blogs
- ‘Fingerprinting’ based on word
frequencies
- Similarity Measure
- Identify ‘clusters’ of documents
What are our customers saying about us?
Method
• Construct document histograms
• Transform histograms into document “fingerprints”
• Use clustering techniques to discover similar
documents.
What are our customers saying about us?
Constructing document histograms
 Parsing and extracting HTML files
 Using natural language processing for tokenization and stemming
 Cleansing inconsistencies
 Transforming unstructured data into structured data
What are our customers saying about us?
“Fingerprinting”
- Term frequency of words within a document vs. the frequency with which those words occur in all documents
- Term frequency-inverse document frequency (tf-idf) weight
- Easily calculated based on formulas over the document histograms
- The result is a vector in n-dimensional Euclidean space
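For reference, the usual tf-idf weight (the slides do not fix a variant, so take this as the textbook definition): for a term t in document d, with N documents in total,

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\lvert \{ d' : t \in d' \} \rvert}

Each document's fingerprint is then its vector of tf-idf weights over the vocabulary.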
K-Means Clustering – Training Function
The k-means algorithm can be invoked in four ways, depending on the source of the initial set of centroids:
1. Use the random centroid seeding method.
2. Use the kmeans++ centroid seeding method.
3. Supply an initial centroid set in a relation identified by the rel_initial_centroids argument.
4. Provide an initial centroid set as an array expression in the initial_centroids argument.
Random Centroid seeding method
kmeans_random( rel_source,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
Kmeans++ centroid seeding method
kmeanspp( rel_source,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
Initial Centroid set in a relation
kmeans( rel_source,
expr_point,
rel_initial_centroids, -- this is the relation
expr_centroid,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
Initial centroid as an array
kmeans( rel_source,
expr_point,
initial_centroids, -- this is the array
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
K-Means Clustering – Cluster Assignment
After training, the cluster assignment for each data point can be computed with the help of the following function:
closest_column( m, x )
Assessing the quality of the clustering
A popular method to assess the quality of the clustering is the silhouette
coefficient, a simplified version of which is provided as part of the k-means
module. Note that for large data sets, this computation is expensive.
The silhouette function has the following syntax:
simple_silhouette( rel_source,
expr_point,
centroids,
fn_dist
)
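Putting the pieces together for the document-clustering use case: a minimal sketch, assuming a hypothetical table docs(id, fingerprint) where fingerprint is a float8[] tf-idf vector. Function names follow the MADlib k-means module shown above, but argument defaults vary across MADlib versions.

-- 1. Train: k = 10 clusters with kmeans++ seeding and squared Euclidean distance
CREATE TABLE doc_centroids AS
SELECT * FROM madlib.kmeanspp(
    'docs',                       -- rel_source
    'fingerprint',                -- expr_point
    10,                           -- k
    'madlib.squared_dist_norm2',  -- fn_dist
    'madlib.avg',                 -- agg_centroid
    50,                           -- max_num_iterations
    0.001                         -- min_frac_reassigned
);

-- 2. Assign each document to its closest centroid
SELECT d.id,
       (madlib.closest_column(c.centroids, d.fingerprint)).column_id AS cluster_id
FROM docs d, doc_centroids c;

-- 3. Assess clustering quality with the simplified silhouette coefficient
SELECT madlib.simple_silhouette(
    'docs',
    'fingerprint',
    (SELECT centroids FROM doc_centroids),
    'madlib.squared_dist_norm2');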
What are our customers saying about us?
 innovation
 leader
 design
•speed
•graphics
•improvement
•bug
•installation
•download
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
Pivotal & R Interoperability
 In a traditional analytics workflow using R:
–Datasets are transferred from a data source
–Modeled or visualized
–Model scoring results are pushed back to the data source
 Such an approach works well when:
–The amount of data can be loaded into memory, and
–The transfer of large amounts of data is inexpensive and/or fast
 PivotalR explores the situation involving large data sets where
these two assumptions are violated and you have an R
background
Enter PivotalR
 Challenge
Want to harness the familiarity of R’s interface and the performance &
scalability benefits of in-DB analytics
 Simple solution:
Translate R code into SQL
PivotalR Design Overview
1. R to SQL: PivotalR translates R calls into SQL (no data on the R side)
2. SQL to execute: the SQL is shipped via RPostgreSQL to the database/Hadoop cluster running MADlib (the data lives there)
3. Computation results: only results come back to R
PivotalR Design Overview
 Call MADlib’s in-database machine learning functions
directly from R
 Syntax is analogous to native R functions – for example,
madlib.lm() mimics the syntax of the native lm() function
 Data does not need to leave the database
 All heavy lifting, including model estimation & computation,
is done in the database
PivotalR Design Overview
 Manipulate database tables directly from R without needing
to be familiar with SQL
 Perform the equivalent of SQL's 'select' statements (including joins) in a syntax that is similar to R's data.frame operations
 For example: R's 'merge' corresponds to SQL's 'join'
PivotalR: Current Functionality
MADlib functionality:
•Linear Regression
•Logistic Regression
•Elastic Net
•ARIMA
•Marginal Effects
•Cross Validation
•Bagging
•summary on model objects
•Automated indicator variable coding: as.factor
•predict
data.frame operations and SQL wrappers:
•$ [ [[ $<- [<- [[<-
•is.na
•+ - * / %% %/% ^
•& | !
•== != > < >= <=
•merge
•by
•db.data.frame
•as.db.data.frame
•preview
•sort
•c mean sum sd var min max length colMeans colSums
•db.connect db.disconnect db.list db.objects db.existsObject delete
•dim names
•content
And more ... (SQL wrapper)
http://github.com/gopivotal/PivotalR/
PivotalR Example
 Load the PivotalR package
– > library('PivotalR')
 Get help for a function
– > help(db.connect)
 Connect to a database
– > db.connect(host = "dca.abc.com", user = "student01", dbname =
"studentdb", password = "studentpw", port = 5432, madlib =
"madlib", conn.pkg = "RPostgreSQL", default.schemas = NULL)
 List connections
– > db.list()
PivotalR Example
 Connect to a table via db.data.frame function (note that the data remains in the
database and is not loaded into memory)
– > y <- db.data.frame("test.abalone", conn.id = 1, key =
character(0), verbose = TRUE, is.temp = FALSE)
 Fit a linear regression model (one model for each gender) and display it
– > fit <- madlib.lm(rings ~ . - id | sex, data = y)
– > fit # view the result
 Apply the model to data in another table (i.e. x) and compute mean-square-error
– > lookat(mean((x$rings - predict(fit, x))^2))
PivotalR
 PivotalR is an R package you can download from CRAN.
- http://cran.r-project.org/web/packages/PivotalR/index.html
- Using Rstudio, you can install it with: install.packages("PivotalR")
 GitHub has the latest, greatest code and features but is less stable.
- https://github.com/gopivotal/PivotalR
 R front end to PostgreSQL and all PostgreSQL-compatible databases.
 R wrapper around MADlib, the open source library for in-database scalable analytics
 Mimics regular R syntax for manipulating R’s “data.frame”
 Provides R functionality to Big Data stored in-database or Apache Hadoop.
 Demo code: https://github.com/gopivotal/PivotalR/wiki/Example
 Training Video: https://docs.google.com/file/d/0B9bfZ-YiuzxQc1RWTEJJZ2V1TWc/edit
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
SQL & R
PL/R on Pivotal
 Procedural Language (PL/X)
– X includes R, Python, pgSQL, Java, Perl, C, etc.
– needs to be installed on each database
 PL/R enables you to write PostgreSQL
and DB functions in the R language
 R installed on each segment of the
Pivotal cluster
 Parsimonious – R piggy-backs on
Pivotal’s parallel architecture
 Minimize data movement
PL/R on Pivotal
 Allows most of R’s capabilities. Basic guide: “PostgreSQL Functions by Example”
http://www.joeconway.com/presentations/function_basics.pdf
 In PostgreSQL and GPDB/PHD, check which PL languages are installed in the database:
 PL/R is an "untrusted" language – only database superusers have the ability to create
UDFs with PL/R (see the "lanpltrusted" column in the pg_language table)
select * from pg_language;
lanname | lanispl | lanpltrusted | lanplcallfoid | lanvalidator | lanacl
-----------+---------+--------------+---------------+--------------+--------
internal | f | f | 0 | 2246 |
c | f | f | 0 | 2247 |
sql | f | t | 0 | 2248 |
plpgsql | t | t | 10885 | 10886 |
plpythonu | t | f | 16386 | 0 |
plr | t | f | 18975 | 0 |
(6 rows)
PL/R Example
 Consider the census dataset below (each row represents an individual):
– h_state = integer encoding which state they live in
– earns = their income
– hours = how many hours per week they work
– … and other features
 Suppose we want to build a model of income for each state separately
PL/R Example
 Prepare table for PL/R by converting it into array form
-- Create array version of table
DROP TABLE IF EXISTS use_r.census1_array_state;
CREATE TABLE use_r.census1_array_state AS(
SELECT
h_state::text h_state,
array_agg(h_serialno::float8) h_serialno, array_agg(earns::float8) earns,
array_agg(hours::float8) hours, array_agg((earns/hours)::float8) wage,
array_agg(hsdeg::float8) hsdeg, array_agg(somecol::float8) somecol,
array_agg(associate::float8) associate, array_agg(bachelor::float8) bachelor,
array_agg(masters::float8) masters, array_agg(professional::float8) professional,
array_agg(doctorate::float8) doctorate, array_agg(female::float8) female,
array_agg(rooms::float8) rooms, array_agg(bedrms::float8) bedrms,
array_agg(notcitizen::float8) notcitizen, array_agg(rentshouse::float8) rentshouse,
array_agg(married::float8) married
FROM use_r.census1
GROUP BY h_state
) DISTRIBUTED BY (h_state);
PL/R Example
SQL & R (diagram): data for each state – TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA – is processed in parallel, yielding one model per state.
PL/R Example
 Run linear regression to predict income in each state
–Define the output data type
–Create the PL/R function: a SQL wrapper around a function body written in R
PL/R Example
 Execute PL/R function
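The deck shows the function only as an image; here is a minimal sketch of what such a PL/R function could look like, using a hypothetical subset of the census features. The output type, R body, and SQL wrapper follow the slide's callouts but are assumptions, not the speaker's exact code.

-- Define the output type: one (variable, coefficient) pair per model term
CREATE TYPE use_r.lm_coefs AS (variable text, coef float8);

-- PL/R function: a SQL wrapper around an R body that fits lm()
CREATE OR REPLACE FUNCTION use_r.lm_by_state(
    earns float8[], hours float8[], bachelor float8[])
RETURNS SETOF use_r.lm_coefs AS
$$
    d <- data.frame(earns = earns, hours = hours, bachelor = bachelor)
    m <- lm(earns ~ hours + bachelor, data = d)
    data.frame(variable = names(coef(m)), coef = coef(m))
$$ LANGUAGE plr;

-- Execute: one regression per state, running in parallel across segments
SELECT h_state, (use_r.lm_by_state(earns, hours, bachelor)).*
FROM use_r.census1_array_state;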
PL/R
 PL/R is not installed – I have to download the source and compile it
 Instructions can be found here
 http://www.joeconway.com/plr/doc/plr-install.html
 CHALLENGE: Download PL/R, compile it, and install it in PostgreSQL
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
Pivotal & Python Interoperability
 In a traditional analytics workflow using Python:
–Datasets are transferred from a data source
–Modeled or visualized
–Model scoring results are pushed back to the data source
 Such an approach works well when:
–The amount of data can be loaded into memory, and
–The transfer of large amounts of data is inexpensive and/or fast
 pyMADlib explores the situation involving large data sets where
these two assumptions are violated and you have a Python
background
Enter pyMADlib
 Challenge
Want to harness the familiarity of Python’s interface and the
performance & scalability benefits of in-DB analytics
 Simple solution:
Translate Python code into SQL
pyMADlib Design Overview
1. Python to SQL: pyMADlib translates Python calls into SQL (no data on the Python side)
2. SQL to execute: the SQL is shipped over ODBC/JDBC to the database/Hadoop cluster running MADlib (the data lives there)
3. Computation results: only results come back to Python
Simple solution: Translate Python code
into SQL
 All data stays in DB and all model estimation and heavy lifting done in DB by MADlib
 Only strings of SQL and model output transferred across ODBC/JDBC
 Best of both worlds: the number-crunching power of MADlib along with the rich set of
visualizations of Matplotlib, NetworkX, and all your other favorite Python libraries. Let
MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you
program in your favorite language – Python.
Hands-on Exploration
PyMADlib Tutorial – IPython Notebook Viewer Link
http://nbviewer.ipython.org/5275846
Where do I get it?
$ pip install pymadlib
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
PL/Python on Pivotal
 Syntax is like normal Python function with function definition line replaced by
SQL wrapper
 Alternatively like a SQL User Defined Function with Python inside
 Name in SQL is plpythonu
– 'u' means untrusted, so you need to be a superuser to create functions
CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer
AS $$
if a > b:
return a
return b
$$ LANGUAGE plpythonu;
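Calling the function is then plain SQL:

SELECT pymax(3, 7);  -- returns 7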
Returning Results
 Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
 Composite types can be returned by creating a composite type in the database:
CREATE TYPE named_value AS (
name text,
value integer
);
 Then you can return a list, tuple, or dict (not a set) that matches the structure of the type:
CREATE FUNCTION make_pair (name text, value integer)
RETURNS named_value
AS $$
return [ name, value ]
# or alternatively, as tuple: return ( name, value )
# or as dict: return { "name": name, "value": value }
# or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;
 For functions which return multiple rows, prefix “setof” before the return type
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set),
an iterator or a generator:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
return ([ name, 1 ], [ name, 2 ], [ name, 3])
$$ LANGUAGE plpythonu;
Alternatively, as a generator:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value AS $$
for i in range(3):
yield (name, i)
$$ LANGUAGE plpythonu;
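Either form is invoked the same way and yields one row per returned element:

SELECT * FROM make_pair('point');  -- three (name, value) rows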
Accessing Packages
 In an MPP environment: To be available, packages must be installed on
every individual segment node.
–Can use “parallel ssh” tool gpssh to conda/pip install
 Then just import as usual inside function:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
import numpy as np
return ((name, int(i)) for i in np.arange(3))
$$ LANGUAGE plpythonu;
Benefits of PL/Python
 Easy to bring your code to the data
 When SQL falls short leverage your Python (or R/Java/C)
experience quickly
 Apply Python across terabytes of data with minimal
overhead or additional requirements
 Results are already in the database system, ready for
further analysis or storage
Spring
What it is: Application framework introduced as open source in 2003
Intention: Build enterprise-class Java applications more easily.
Outcomes:
1. Streamlined architecture, speeding application development by 2x and accelerating time to value.
2. Portable, since Spring applications are identical for every platform and portable across multiple app servers.
Spring Ecosystem
WEB: Controllers, REST, WebSocket
INTEGRATION: Channels, Adapters, Filters, Transformers
BATCH: Jobs, Steps, Readers, Writers
BIG DATA: Ingestion, Export, Orchestration, Hadoop
DATA: Relational, Non-Relational
CORE: Framework, Groovy, Security, Reactor
GRAILS: Full-stack, Web
XD: Stream, Taps, Jobs
BOOT: Bootable, Minimal, Ops-Ready
http://projects.spring.io/spring-xd/ http://projects.spring.io/spring-data/
http://spring.io
Spring XD - Tackling Big Data Complexity
 One stop shop for
–Data Ingestion
–Real-time Analytics
–Workflow Orchestration
–Data Export
 Built on existing Spring
Assets
–Spring Integration, Batch, Data
 XD = 'eXtreme Data'
Big Data Programming Model (diagram): ingest from files, sensors, mobile, and social sources into HDFS; compute via jobs and workflows; export to RDBMS, GemFire, Redis, and OLAP systems; analytics throughout.
Groovy and Grails
 Dynamic Language for the JVM
 Inspired by Smalltalk, Python, and Ruby
 Integrated with the Java language & platform at every level
“Cloud”
 Means many things to many people.
 Distributed applications accessible over a network
 Typically, but not necessarily, The Internet
 An application and/or its platform
 Resources on demand
 Inherently virtualized
 Can run in-house (private cloud) as well
 Hardware and/or software sold as a commodity
Pivotal Speakers at OSCON 2014
10:40am Tuesday, Global Scaling at the New York Times using RabbitMQ, F150
Alvaro Videla (RabbitMQ), Michael Laing (New York Times)
Cloud
11:30am Tuesday, The Full Stack Java Developer, D136
Joshua Long (Pivotal), Phil Webb (Pivotal)
Java & JVM | JavaScript - HTML5 - Web
1:40pm Tuesday, A Recovering Java Developer Learns to Go, E142
Matt Stine (Pivotal)
Emerging Languages | Java & JVM
Pivotal Speakers at OSCON 2014
2:30pm Tuesday, Unicorns, Dragons, Open Source Business Models And Other Mythical
Creatures, PORTLAND BALLROOM, Main Stage
Andrew Clay Shafer (Pivotal)
11:30am Wednesday, Building a Recommendation Engine with Spring and Hadoop, D136
Michael Minella (Pivotal)
Java & JVM
1:40pm Wednesday, Apache Spark: A Killer or Savior of Apache Hadoop?, E143
Roman Shaposhnik (Pivotal)
Sponsored Sessions
Pivotal Speakers at OSCON 2014
10:00am Thursday, Developing Micro-services with Java and Spring, D139/140
Phil Webb (Pivotal)
Java & JVM | Tools & Techniques
11:00am Thursday, Apache HTTP Server; SSL from End-to-End, D136
William A Rowe Jr (Pivotal)
Security
Data Science At Pivotal
 Drive business value by operationalizing Data Science models using a combination of our
commercial software (based on open source) and open source software.
 Open Source is at the core of what we do
Thank You!
Kevin Crocker
kcrocker@gopivotal.com
@_K_C_Pivotal
Data Science Education Lead
BUILT FOR THE SPEED OF BUSINESS

More Related Content

What's hot

AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
jeykottalam
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
Xiangrui Meng
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
Ted Dunning
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
Databricks
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
Spark Summit
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataBenjamin Bengfort
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Spark Summit
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
Anass Bensrhir - Senior Data Scientist
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
Turi, Inc.
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Benjamin Bengfort
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
Sri Ambati
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015
Yves Raimond
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
Databricks
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
Jake Mannix
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
jeykottalam
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Databricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 

What's hot (20)

AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 

Similar to Data science and OSS

A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
Databricks
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
Impetus Technologies
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
Ganesan Narayanasamy
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Ahmed Elsayed
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path Forward
Dan Mallinger
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanycOpen Analytics
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
Amazon Web Services
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0vithakur
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
Robert Grossman
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Revolution Analytics
 
E05312426
E05312426E05312426
E05312426
IOSR-JEN
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
Dipendra Kusi
 
2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY
2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY
2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY
GeorgeDiamandis11
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
Mark Kerzner
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
MLconf
 

Similar to Data science and OSS (20)

A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path Forward
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
E05312426
E05312426E05312426
E05312426
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY
2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY
2.DATAMANAGEMENT-DIGITAL TRANSFORMATION AND STRATEGY
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 

Recently uploaded

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 

Recently uploaded (20)

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 

Data science and OSS

  • 1. Hands-on Data Science and OSS Driving Business Value with Open Source and Data Science Kevin Crocker, @_K_C_Pivotal #datascience, #oscon
  • 2.
  • 3. VM info Everything is ‘oscon2014’ User:password –> oscon2014:oscon2014 PostgreSQL 9.2.8 dbname -> oscon2014 Root password -> oscon2014 Installed software: postgresql 9.2.8, R, MADlib, pl/pythonu, pl/pgpsl, anaconda ( /home/oscon2014/anaconda), pgadmin3, Rstudio, pyMADlib, and more to come in v1.1
  • 4. Objective of Data Science DRIVE AUTOMATED LOW LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
  • 5. What Matters: Apps. Data. Analytics. Apps power businesses, and those apps generate data Analytic insights from that data drive new app functionality, which in-turn drives new data The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition
  • 6. What Matters: OSS at the core Apps power businesses, and those apps generate data Analytic insights from that data drive new app functionality, which in-turn drives new data The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition
  • 7. End Game: Drive Business Value with OSS interesting problems that can’t easily be solved with current technology Use (find) the right tool for the job - If they don’t exist, create them - Prefer OSS if it fits the need - Drive business value through distributed, MPP analytics - Operationalization (O16n) of your Analytics Create interesting solutions that drive business value
  • 8. 1 Find Data Platforms •Pivotal Greenplum DB •Pivotal HD •Hadoop (other) •SAS HPA •AWS 2 Write Code 3 Run Code Interfaces •pgAdminIII •psql •psycopg2 •Terminal •Cygwin •Putty •Winscp 4 Write Code for Big Data 5 Implement Algorithms 6 Show Results 7 Collaborate Sharing Tools •Chorus •Confluence •Socialcast •Github •Google Drive & Hangouts PIVOTAL DATA SCIENCE TOOLKIT A large and varied tool box!
  • 9. Toolkit? This image was created by Swami Chandresekaran, Enterprise Architect, IBM. He has a great article about what it takes to be a Data Scientist: Road Map to Data Scientist http://nirvacana.com/tho ughts/becoming-a-data- scientist/
  • 10. We need the right technology for every step
  • 11. Open Source At Pivotal  Pivotal has a lot of open source projects (and people) involved in Open Source  PostgreSQL, Apache Hadoop (4)  MADlib (16), PivotalR (2), pyMADlib (4), Pandas via SQL (3),  Spring (56), Groovy (3), Grails (3)  Apache Tomcat (2) and HTTP Server (1)  Redis (1)  Rabbit MQ (4)  Cloud Foundry (90)  Open Chorus  We use a combination of our commercial software and OSS to drive business value through Data Science
  • 12. Motivation  Our story starts with SQL – so naturally we try to use SQL for everything! Everything?  SQL is great for many things, but it’s not nearly enough –Straightforward way to query data –Not necessarily designed for data science  Data Scientists know other languages – R, Python, …
  • 13. Our challenge  MADlib – Open source – Extremely powerful/scalable – Growing algorithm breadth – SQL  R / Python – Open source – Memory limited – High algorithm breadth – Language/interface purpose-designed for data science  Want to leverage both the performance benefits of MADlib and the usability of languages like R and Python
  • 14. How Pivotal Data Scientists Select Which Tool to Use Optimized for algorithm performance, scalability, & code overhead
  • 15. Pivotal, MADlib, R, and Python  Pivotal & MADlib & R Interoperability –PivotalR –PL/R  Pivotal & MADlib & Python Interoperability –pyMADlib –PL/Python
  • 16. MADlib  MAD stands for:  lib stands for library of: – advanced (mathematical, statistical, machine learning) – parallel & scalable – in-database functions  Mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development
  • 17. MADlib: A Community Project Open Source: BSD License • Developed as a partnership with multiple universities – University of California-Berkeley – University of Wisconsin-Madison – University of Florida • Compatibile with Postgres, Greenplum Database, and Hadoop via HAWQ • Designed for Data Scientists to provide Scalable, Robust Analytics capabilities for their business problems. • Homepage: http://madlib.net • Documentation: http://doc.madlib.net • Source: https://github.com/madlib • Forum: http://groups.google.com/group/madlib-user-forum Community
  • 18. Database Platform Layer MADlib: Architecture C++ Database Abstraction Layer User Defined FunctionsUser Defined Functions User Defined TypesUser Defined Types User Defined AggregatesUser Defined Aggregates OLAP Window FunctionsOLAP Window Functions User Defined OperatorsUser Defined Operators OLAP Grouping SetsOLAP Grouping Sets Data Type MappingData Type Mapping Exception HandlingException Handling Logging and ReportingLogging and Reporting Linear AlgebraLinear Algebra Memory ManagementMemory Management Boost SupportBoost Support Support Modules Random SamplingRandom Sampling Array OperationsArray OperationsSparse VectorsSparse Vectors Core Methods Generalized Linear Models Generalized Linear Models Matrix FactorizationMatrix Factorization Machine Learning Algorithms Machine Learning Algorithms Linear SystemsLinear Systems Probability FunctionsProbability Functions
  • 19. MADlib: Diverse User Experience SQL Python Open Chorus R from pymadlib.pymadlib import * conn = DBConnect() mdl = LinearRegression(conn) lreg.train(input_table, indepvars, depvar) cursor = lreg.predict(input_table, depvar) scatterPlot(actual,predicted, dataset) psql> madlib.linregr_train('abalone', 'abalone_linregr', 'rings', 'array[1,diameter,height]'); psql> select coef, r2 from abalone_linregr; -[ RECORD 1 ]---------------------------------------------- coef | {2.39392531944631,11.7085575219689,19.8117069108094} r2 | 0.350379630701758
  • 20. MADlib In-Database Functions Predictive Modeling Library Linear Systems •Sparse and Dense Solvers Matrix Factoriization •Single Value Decomposition (SVD) •Low-Rank Generalized Linear Models •Linear Regression •Logistic Regression •Multinomial Logistic Regression •Cox Proportional Hazards •Regression •Elastic Net Regularization •Sandwich Estimators (Huber white, clustered, marginal effects) Machine Learning Algorithms •Principal Component Analysis (PCA) •Association Rules (Affinity Analysis, Market Basket) •Topic Modeling (Parallel LDA) •Decision Trees •Ensemble Learners (Random Forests) •Support Vector Machines •Conditional Random Field (CRF) •Clustering (K-means) •Cross Validation Descriptive Statistics Sketch-based Estimators •CountMin (Cormode- Muthukrishnan) •FM (Flajolet-Martin) •MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions
  • 21. Calling MADlib Functions: Fast Training, Scoring

SELECT madlib.linregr_train(
    'houses',                     -- table containing training data
    'houses_linregr',             -- table in which to save results
    'price',                      -- column containing the dependent variable
    'ARRAY[1, tax, bath, size]',  -- features included in the model
    'bedroom');                   -- create multiple output models (one for each value of bedroom)

 MADlib lets users easily create models without moving data out of the system – Model generation – Model validation – Scoring (evaluation of) new data  All the data can be used in one model  Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)  Open source lets you tweak and extend methods, or build your own
  • 22. Calling MADlib Functions: Fast Training, Scoring

SELECT madlib.linregr_train(
    'houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]');

SELECT houses.*,
       madlib.linregr_predict(ARRAY[1,tax,bath,size], m.coef) AS predict
FROM houses,             -- table with data to be scored
     houses_linregr m;   -- table containing the model
-- madlib.linregr_predict is the MADlib model scoring function

 MADlib lets users easily create models without moving data out of the system – Model generation – Model validation – Scoring (evaluation of) new data  All the data can be used in one model  Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)  Open source lets you tweak and extend methods, or build your own
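Putting the two slides together, a minimal end-to-end sketch – assuming the 'houses' table has the columns shown above – trains one model per bedroom count and then scores each row against its own group's model (MADlib carries the grouping column into the output model table, so a plain join lines each row up with the right coefficients):

-- Train one linear model per distinct value of bedroom
SELECT madlib.linregr_train(
    'houses',                     -- source table
    'houses_linregr_grp',         -- output model table
    'price',                      -- dependent variable
    'ARRAY[1, tax, bath, size]',  -- independent variables
    'bedroom');                   -- grouping column

-- Score: join each row to the model for its bedroom group
SELECT h.*,
       madlib.linregr_predict(ARRAY[1, h.tax, h.bath, h.size], m.coef) AS predict
FROM houses h
JOIN houses_linregr_grp m USING (bedroom);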
  • 23. K-Means Clustering  Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In the k-means variant, given n points x₁, …, xₙ ∈ ℝᵈ, the goal is to position k centroids c₁, …, cₖ ∈ ℝᵈ so that the sum of distances between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest. So, we are trying to find the centroids which minimize the total distance between all the points and the centroids.
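Stated as a formula (a standard formulation; MADlib's default distance function is the squared Euclidean norm), the algorithm seeks centroids that minimize the within-cluster sum of distances:

    min over c₁, …, cₖ ∈ ℝᵈ of   Σ_{i=1..n}  min_{j=1..k}  dist(xᵢ, cⱼ)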
  • 24. K-means Clustering  Example Use Cases: Which blogs are spam blogs? Given a user's preferences, which other blogs might he or she enjoy? What are our customers saying about us?
  • 25. What are our customers saying about us?  How do we discern trends and categories in online conversations? - Search for relevant blogs - 'Fingerprinting' based on word frequencies - Similarity measure - Identify 'clusters' of documents
  • 26. What are our customers saying about us? Method • Construct document histograms • Transform histograms into document “fingerprints” • Use clustering techniques to discover similar documents.
  • 27. What are our customers saying about us? Constructing document histograms  Parsing and extracting HTML files  Using natural language processing for tokenization and stemming  Cleansing inconsistencies  Transforming unstructured data into structured data
  • 28. What are our customers saying about us? "Fingerprinting" - Term frequency of words within a document vs. the frequency with which those words occur in all documents - Term frequency-inverse document frequency (the tf-idf weight) - Easily calculated with formulas over the document histograms - The result is a vector in n-dimensional Euclidean space.
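For reference, one standard formulation (the slides don't pin down the exact weighting variant): for a term t in document d over a corpus of N documents,

    tf-idf(t, d) = tf(t, d) × log( N / df(t) )

where tf(t, d) is the count of t in d's histogram and df(t) is the number of documents containing t. Computing it needs only the per-document histograms plus one corpus-wide document count per term.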
  • 29. K-Means Clustering – Training Function  The k-means algorithm can be invoked in four ways, depending on the source of the initial set of centroids (the four signatures follow, with an example invocation after them): 1. Use the random centroid seeding method. 2. Use the kmeans++ centroid seeding method. 3. Supply an initial centroid set in a relation identified by the rel_initial_centroids argument. 4. Provide an initial centroid set as an array expression in the initial_centroids argument.
  • 30. Random Centroid seeding method kmeans_random( rel_source, expr_point, k, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
  • 31. Kmeans++ centroid seeding method kmeanspp( rel_source, expr_point, k, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
  • 32. Initial Centroid set in a relation kmeans( rel_source, expr_point, rel_initial_centroids, -- this is the relation expr_centroid, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
  • 33. Initial centroid as an array kmeans( rel_source, expr_point, initial_centroids, -- this is the array fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
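As a concrete sketch of the kmeans++ form – the table and column names here ('doc_fingerprints', 'tfidf_vector') are illustrative, not from the slides – a run over the document fingerprints might look like:

SELECT * FROM madlib.kmeanspp(
    'doc_fingerprints',           -- rel_source: table of points
    'tfidf_vector',               -- expr_point: array-valued column
    5,                            -- k: number of centroids
    'madlib.squared_dist_norm2',  -- fn_dist: squared Euclidean distance
    'madlib.avg',                 -- agg_centroid: centroid aggregate
    20,                           -- max_num_iterations
    0.001                         -- min_frac_reassigned
);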
  • 34. K-Means Clustering – Cluster Assignment  After training, the cluster assignment for each data point can be computed with the help of the following function: closest_column( m, x )
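In practice (continuing the illustrative names from the sketch above), you store the clustering result and join each point to its nearest centroid; closest_column returns a composite value whose column_id field is the cluster index:

-- Store the clustering result, then assign each point to a cluster
CREATE TABLE km_result AS
    SELECT * FROM madlib.kmeanspp(
        'doc_fingerprints', 'tfidf_vector', 5,
        'madlib.squared_dist_norm2', 'madlib.avg', 20, 0.001);

SELECT f.*,
       (madlib.closest_column(centroids, tfidf_vector)).column_id AS cluster_id
FROM doc_fingerprints f, km_result;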
  • 35. Assessing the quality of the clustering A popular method to assess the quality of the clustering is the silhouette coefficient, a simplified version of which is provided as part of the k-means module. Note that for large data sets, this computation is expensive. The silhouette function has the following syntax: simple_silhouette( rel_source, expr_point, centroids, fn_dist )
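Continuing the same illustrative example, the average simplified silhouette for the clustering above would be computed as:

SELECT * FROM madlib.simple_silhouette(
    'doc_fingerprints',
    'tfidf_vector',
    (SELECT centroids FROM km_result),
    'madlib.squared_dist_norm2');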
  • 36–48. What are our customers saying about us? (animated clustering sequence)
  • 49. What are our customers saying about us?  Example keywords surfaced by the clustering:  innovation, leader, design • speed, graphics, improvement, bug, installation, download
  • 50. Pivotal, MADlib, R, and Python  Pivotal & MADlib & R Interoperability –PivotalR –PL/R  Pivotal & MADlib & Python Interoperability –pyMADlib –PL/Python
  • 51. Pivotal & R Interoperability  In a traditional analytics workflow using R: –Datasets are transferred from a data source –Modeled or visualized –Model scoring results are pushed back to the data source  Such an approach works well when: –The amount of data can be loaded into memory, and –The transfer of large amounts of data is inexpensive and/or fast  PivotalR addresses the situation in which these two assumptions are violated by large data sets, for analysts with an R background
  • 52. Enter PivotalR  Challenge: we want to harness the familiarity of R's interface and the performance & scalability benefits of in-DB analytics  Simple solution: translate R code into SQL
  • 53. PivotalR Design Overview  1. R → SQL (via RPostgreSQL)  2. SQL to execute  3. Computation results  PivotalR runs on the client, where no data lives; the data stays in the Database/Hadoop system with MADlib
  • 54. PivotalR Design Overview  Call MADlib’s in-database machine learning functions directly from R  Syntax is analogous to native R functions – for example, madlib.lm() mimics the syntax of the native lm() function  Data does not need to leave the database  All heavy lifting, including model estimation & computation, is done in the database
  • 55. PivotalR Design Overview  Manipulate database tables directly from R without needing to be familiar with SQL  Perform the equivalent of SQL's 'select' statements (including joins) in a syntax that is similar to R's data.frame operations  For example: R's 'merge' → SQL's 'join'
  • 56. PivotalR: Current Functionality (SQL wrapper)
MADlib functionality: Linear Regression, Logistic Regression, Elastic Net, ARIMA, Marginal Effects, Cross Validation, Bagging, summary on model objects, automated indicator variable coding (as.factor), predict
Operators and functions on db.data.frame objects: $ [ [[ $<- [<- [[<- ; is.na ; + - * / %% %/% ^ ; & | ! ; == != > < >= <= ; merge ; by ; sort ; c mean sum sd var min max length colMeans colSums ; dim names content preview
Database utilities: db.connect db.disconnect db.list db.objects db.existsObject delete db.data.frame as.db.data.frame
And more: http://github.com/gopivotal/PivotalR/
  • 57. PivotalR Example  Load the PivotalR package – > library('PivotalR')  Get help for a function – > help(db.connect)  Connect to a database – > db.connect(host = "dca.abc.com", user = "student01", dbname = "studentdb", password = "studentpw", port = 5432, madlib = "madlib", conn.pkg = "RPostgreSQL", default.schemas = NULL)  List connections – > db.list()
  • 58. PivotalR Example  Connect to a table via the db.data.frame function (note that the data remains in the database and is not loaded into memory) – > y <- db.data.frame("test.abalone", conn.id = 1, key = character(0), verbose = TRUE, is.temp = FALSE)  Fit a linear regression model (one model for each gender) and display it – > fit <- madlib.lm(rings ~ . - id | sex, data = y) – > fit # view the result  Apply the model to data in another table (i.e. x) and compute the mean squared error – > lookat(mean((x$rings - predict(fit, x))^2))
  • 60. PivotalR  PivotalR is an R package you can download from CRAN. - http://cran.r-project.org/web/packages/PivotalR/index.html - Using RStudio, you can install it with: install.packages("PivotalR")  GitHub has the latest, greatest code and features but is less stable. - https://github.com/gopivotal/PivotalR  R front end to PostgreSQL and PostgreSQL-based databases  R wrapper around MADlib, the open source library for in-database scalable analytics  Mimics regular R syntax for manipulating R's "data.frame"  Brings R functionality to Big Data stored in-database or in Apache Hadoop  Demo code: https://github.com/gopivotal/PivotalR/wiki/Example  Training video: https://docs.google.com/file/d/0B9bfZ-YiuzxQc1RWTEJJZ2V1TWc/edit
  • 61. Pivotal, MADlib, R, and Python  Pivotal & MADlib & R Interoperability –PivotalR –PL/R  Pivotal & MADlib & Python Interoperability –pyMADlib –PL/Python
  • 62. PL/R on Pivotal  Procedural Language (PL/X) – X includes R, Python, pgSQL, Java, Perl, C, etc. – needs to be installed on each database  PL/R enables you to write PostgreSQL and Greenplum DB functions in the R language  R is installed on each segment of the Pivotal cluster  Parsimonious – R piggy-backs on Pivotal's parallel architecture  Minimizes data movement
  • 63. PL/R on Pivotal  Allows most of R's capabilities. Basic guide: "PostgreSQL Functions by Example" http://www.joeconway.com/presentations/function_basics.pdf  In PostgreSQL and GPDB/PHD, check which PL languages are installed in the database: select * from pg_language;  PL/R is an "untrusted" language – only database superusers have the ability to create UDFs with PL/R (see the "lanpltrusted" column in the pg_language table)

 lanname   | lanispl | lanpltrusted | lanplcallfoid | lanvalidator | lanacl
-----------+---------+--------------+---------------+--------------+--------
 internal  | f       | f            | 0             | 2246         |
 c         | f       | f            | 0             | 2247         |
 sql       | f       | t            | 0             | 2248         |
 plpgsql   | t       | t            | 10885         | 10886        |
 plpythonu | t       | f            | 16386         | 0            |
 plr       | t       | f            | 18975         | 0            |
(6 rows)
  • 64. PL/R Example  Consider the census dataset below (each row represents an individual): – h_state = integer encoding which state they live in – earns = their income – hours = how many hours per week they work – … and other features  Suppose we want to build a model of income for each state separately
  • 66. PL/R Example  Prepare the table for PL/R by converting it into array form (one row per state):

-- Create array version of table
DROP TABLE IF EXISTS use_r.census1_array_state;
CREATE TABLE use_r.census1_array_state AS (
    SELECT h_state::text h_state,
           array_agg(h_serialno::float8) h_serialno,
           array_agg(earns::float8) earns,
           array_agg(hours::float8) hours,
           array_agg((earns/hours)::float8) wage,
           array_agg(hsdeg::float8) hsdeg,
           array_agg(somecol::float8) somecol,
           array_agg(associate::float8) associate,
           array_agg(bachelor::float8) bachelor,
           array_agg(masters::float8) masters,
           array_agg(professional::float8) professional,
           array_agg(doctorate::float8) doctorate,
           array_agg(female::float8) female,
           array_agg(rooms::float8) rooms,
           array_agg(bedrms::float8) bedrms,
           array_agg(notcitizen::float8) notcitizen,
           array_agg(rentshouse::float8) rentshouse,
           array_agg(married::float8) married
    FROM use_r.census1
    GROUP BY h_state
) DISTRIBUTED BY (h_state);
  • 67. PL/R Example  (Diagram: per-state data partitions – TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA – each flow through the SQL & R function to produce a per-state model: TN Model, CA Model, and so on.)
  • 68. PL/R Example  Run linear regression to predict income in each state: –Define the output data type –Create the PL/R function (the body of the function is in R, surrounded by a SQL wrapper)
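The original slide shows this code as an image; here is a minimal sketch of what it could look like. The type name, function name, and the reduced feature set (hours and bachelor only) are illustrative, not taken from the slides:

-- 1. Define the output data type for per-state coefficients
CREATE TYPE use_r.lm_state_type AS (
    h_state  text,
    variable text,
    coef     float8
);

-- 2. Create the PL/R function: an R body inside a SQL wrapper
CREATE OR REPLACE FUNCTION use_r.lm_census(
    h_state text,
    earns float8[],
    hours float8[],
    bachelor float8[]
) RETURNS SETOF use_r.lm_state_type AS
$$
    # Ordinary R: fit a linear model on this state's arrays
    m <- lm(earns ~ hours + bachelor)
    data.frame(h_state  = h_state,
               variable = names(coef(m)),
               coef     = as.numeric(coef(m)))
$$ LANGUAGE 'plr';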
  • 69. PL/R Example  Execute PL/R function
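Again the slide's code is an image; under the same illustrative names as above, executing the function once per state could look like:

-- One regression per state, run in parallel across segments
SELECT (r).h_state, (r).variable, (r).coef
FROM (
    SELECT use_r.lm_census(h_state, earns, hours, bachelor) AS r
    FROM use_r.census1_array_state
) t;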
  • 70. PL/R  PL/R is not pre-installed – you have to download the source and compile it  Instructions can be found here: http://www.joeconway.com/plr/doc/plr-install.html  CHALLENGE: Download PL/R, compile it, and install it in PostgreSQL
  • 71. Pivotal, MADlib, R, and Python  Pivotal & MADlib & R Interoperability –PivotalR –PL/R  Pivotal & MADlib & Python Interoperability –pyMADlib –PL/Python
  • 72. Pivotal & Python Interoperability  In a traditional analytics workflow using Python: –Datasets are transferred from a data source –Modeled or visualized –Model scoring results are pushed back to the data source  Such an approach works well when: –The amount of data can be loaded into memory, and –The transfer of large amounts of data is inexpensive and/or fast  pyMADlib addresses the situation in which these two assumptions are violated by large data sets, for analysts with a Python background
  • 73. Enter pyMADlib  Challenge: we want to harness the familiarity of Python's interface and the performance & scalability benefits of in-DB analytics  Simple solution: translate Python code into SQL
  • 74. pyMADlib Design Overview  1. Python → SQL (via ODBC/JDBC)  2. SQL to execute  3. Computation results  pyMADlib runs on the client, where no data lives; the data stays in the Database/Hadoop system with MADlib
  • 75. Simple solution: Translate Python code into SQL  All data stays in the DB, and all model estimation and heavy lifting is done in the DB by MADlib  Only strings of SQL and model output are transferred across ODBC/JDBC  Best of both worlds: the number-crunching power of MADlib along with the rich visualizations of Matplotlib, NetworkX, and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.
  • 76. Hands-on Exploration PyMADlib Tutorial – IPython Notebook Viewer Link http://nbviewer.ipython.org/5275846
  • 77. Where do I get it? $ pip install pymadlib
  • 78. Pivotal, MADlib, R, and Python  Pivotal & MADlib & R Interoperability –PivotalR –PL/R  Pivotal & MADlib & Python Interoperability –pyMADlib –PL/Python
  • 79. PL/Python on Pivotal  Syntax is like a normal Python function, with the function definition line replaced by a SQL wrapper  Alternatively: like a SQL User Defined Function with Python inside  The name in SQL is plpythonu – the 'u' means untrusted, so you need to be a superuser to create functions

CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer AS
$$
    # normal Python between the SQL wrapper lines
    if a > b:
        return a
    return b
$$ LANGUAGE plpythonu;
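For instance, a trivial check you can run in psql after creating the function:

SELECT pymax(3, 7);  -- returns 7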
  • 80. Returning Results  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL, etc.)  Composite types can be returned by creating a composite type in the database:

CREATE TYPE named_value AS (
    name  text,
    value integer
);

 Then you can return a list, tuple, or dict (not sets) which references the same structure as the type:

CREATE FUNCTION make_pair (name text, value integer)
RETURNS named_value AS
$$
    return [ name, value ]
    # or alternatively, as a tuple: return ( name, value )
    # or as a dict: return { "name": name, "value": value }
    # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;

 For functions which return multiple rows, prefix "setof" before the return type
  • 81. Returning more results  You can return multiple results by wrapping them in a sequence (tuple, list, or set), an iterator, or a generator:

-- As a sequence
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value AS
$$
    return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;

-- As a generator
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value AS
$$
    for i in range(3):
        yield (name, i)
$$ LANGUAGE plpythonu;
  • 82. Accessing Packages  In an MPP environment: to be available, packages must be installed on every individual segment node –Can use the "parallel ssh" tool gpssh to conda/pip install  Then just import as usual inside the function (note the SETOF return type, since the generator yields multiple rows):

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value AS
$$
    import numpy as np
    return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
  • 83. Benefits of PL/Python  Easy to bring your code to the data  When SQL falls short leverage your Python (or R/Java/C) experience quickly  Apply Python across terabytes of data with minimal overhead or additional requirements  Results are already in the database system, ready for further analysis or storage
  • 84. Spring  What it is: application framework introduced as open source in 2003  Intention: build enterprise-class Java applications more easily  Outcomes: 1. Streamlined architecture, speeding application development by 2x and accelerating time to value. 2. Portable: Spring applications are identical for every platform and run across multiple app servers.
  • 85. Spring Ecosystem
WEB: Controllers, REST, WebSocket
INTEGRATION: Channels, Adapters, Filters, Transformers
BATCH: Jobs, Steps, Readers, Writers
BIG DATA: Ingestion, Export, Orchestration, Hadoop
DATA: Relational, Non-Relational
CORE: Framework, Groovy, Security, Reactor
GRAILS: Full-stack, Web
XD: Stream, Taps, Jobs
BOOT: Bootable, Minimal, Ops-Ready
http://projects.spring.io/spring-xd/ http://projects.spring.io/spring-data/ http://spring.io
  • 86. Spring XD - Tackling Big Data Complexity  One stop shop for –Data Ingestion –Real-time Analytics –Workflow Orchestration –Data Export  Built on existing Spring assets –Spring Integration, Batch, Data  XD = 'eXtreme Data'  (Diagram: a big data programming model – ingest from files, sensors, social, and mobile sources into HDFS; run compute jobs and workflows; export to RDBMS, GemFire, Redis, OLAP.)
  • 87. Groovy and Grails  Dynamic Language for the JVM  Inspired by Smalltalk, Python, and Ruby  Integrated with the Java language & platform at every level
  • 88. “Cloud”  Means many things to many people  Distributed applications accessible over a network  Typically, but not necessarily, the Internet  An application and/or its platform  Resources on demand  Inherently virtualized  Can run in-house (private cloud) as well  Hardware and/or software sold as a commodity
  • 89. Pivotal Speakers at OSCON 2014 10:40am Tuesday, Global Scaling at the New York Times using RabbitMQ, F150 Alvaro Videla (RabbitMQ), Michael Laing (New York Times) Cloud 11:30am Tuesday, The Full Stack Java Developer, D136 Joshua Long (Pivotal), Phil Webb (Pivotal) Java & JVM | JavaScript - HTML5 - Web 1:40pm Tuesday, A Recovering Java Developer Learns to Go, E142 Matt Stine (Pivotal) Emerging Languages | Java & JVM
  • 90. Pivotal Speakers at OSCON 2014 2:30pm Tuesday, Unicorns, Dragons, Open Source Business Models And Other Mythical Creatures, PORTLAND BALLROOM, Main Stage Andrew Clay Shafer (Pivotal) 11:30am Wednesday, Building a Recommendation Engine with Spring and Hadoop, D136 Michael Minella (Pivotal) Java & JVM 1:40pm Wednesday, Apache Spark: A Killer or Savior of Apache Hadoop?, E143 Roman Shaposhnik (Pivotal) Sponsored Sessions
  • 91. Pivotal Speakers at OSCON 2014 10:00am Thursday, Developing Micro-services with Java and Spring, D139/140 Phil Webb (Pivotal) Java & JVM | Tools & Techniques 11:00am Thursday, Apache HTTP Server; SSL from End-to-End, D136 William A Rowe Jr (Pivotal) Security
  • 92. Data Science At Pivotal  Drive business value by operationalizing Data Science models using a combination of our commercial software (based on open source) and open source software.  Open Source is at the core of what we do.
Thank You!
Kevin Crocker, Data Science Education Lead
kcrocker@gopivotal.com @_K_C_Pivotal
  • 93. BUILT FOR THE SPEED OF BUSINESS

Editor's Notes

  1. Access MADlib’s functions through SQL – but also more than just SQL … can access MADlib functions through R (with PivotalR package) and Python (with pyMADlib library) as well as the Data Science Studio in Chorus.
  2. http://spring.io