© 2016 IBM Corporation
Scaling R using Hadoop and Spark
Virender S. Thakur- Big Data Specialist @IBM
Open Source Analytics Meetup – NYC
March 8th, 2016
IBM’s Framework for Getting Value out of Big Data
• All agree on Big Data’s potential, but wide divergence on how to exploit it
• Pioneers who have started to harness Big Data have benefited greatly
• We see Big Data adoption as a continual process – maturity levels
• IBM’s approach enables faster adoption of Big Data technologies
  - Open source innovation (Hadoop, Spark)
  - Standards-based technologies (ODP, SQL, R)
  - Familiar interfaces and integration with established tools (IBM innovations)
  - Advanced analytics (IBM innovations)
• IBM’s commitment for continued innovation
IBM is Committed to Open Source
• Open source technologies are the base for IBM software and solutions
• IBM’s long history of deep open source commitment
  - Apache Software Foundation: Founding member in 1999
  - Cloud Foundry: #1 contributor; Basis for Bluemix
  - OpenStack: #4 contributor; Basis for IBM’s IaaS
  - Linux: #3 contributor; IBM first enterprise backer of Linux
  - Hadoop/Spark: Extensive investment in open source contribution; Integration with Analytics software
[Diagram labels: Infrastructure, Systems, Application]
IBM has the largest investment in Spark of any company in the world
Visit www.spark.tc for more information
IBM Spark Technology Center
Top committer/contributor
300+ inventors
Commitment to educate 1 million data scientists
Contributed SystemML
Founding member of AMPLab
Partnerships in the ecosystem
IBM is all-in on its commitment to Spark
Contribute to the Core
• Launch Spark Technology Center (STC), 300 engineers
• Open source SystemML
• Partner with Databricks
Foster Community
• Educate 1M+ data scientists and engineers via online courses
• Sponsor AMPLab, creators and evangelists of Spark
Infuse the Portfolio
• Integrate Spark throughout portfolio
• 3,500 employees working on Spark-related topics
• Spark however customers want it: standalone, platform or products
"It's like Spark just got blessed by the enterprise rabbi." Ben Horowitz, Andreessen Horowitz
Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
Open Data Platform Initiative
Why is IBM involved?
• Strong history of leadership in open source & standards
• Supports our commitment to open source currency in all future releases
• Accelerates our innovation within Hadoop & surrounding applications
Open Data Platform (ODP) vs. Apache Software Foundation (ASF)
• ODP supports the ASF mission
• ASF provides a governance model around individual projects without looking at ecosystem
• ODP aims to provide a vendor-led consistent packaging model for core Apache components as an ecosystem
All Standard Apache Open Source Components
HDFS
YARN
MapReduce
Ambari HBase
Spark
Flume
Hive Pig
Sqoop
HCatalog
Solr/Lucene
ODP
IBM BigInsights for Apache Hadoop
• IBM BigInsights Analyst: Big SQL, BigSheets
• IBM BigInsights Data Scientist: Text Analytics, Machine Learning on Big R, Big R, Big SQL, BigSheets
• IBM BigInsights Enterprise Management: POSIX Distributed Filesystem; Multi-workload, Multi-tenant scheduling
IBM Open Platform with Apache Hadoop – all open source
HDFS
YARN
MapReduce
Ambari HBase
Spark
Flume
Hive Pig
Sqoop
HCatalog
Solr/Lucene
Zookeeper Oozie Knox Slider
IBM BigInsights for Apache Hadoop
Initial ODP Scope
Cloud On Prem Appliance
What Makes Us Different?
• Open: Open Data Platform
• Insight: Tools and accelerators to visualize, filter, and analyze large data sets
• Anywhere: Cloud, On Premise, Appliance
Key Benefits
» BigSQL = Makes Hive faster and more secure
» BigSheets for visualization and exploration
» Text Analytics from IBM Research
IBM Hadoop
Distribution
IBM
Hadoop Ecosystem
ODP Core
100% Open
source Apache
Hadoop
distribution
including Spark
SQL
Security
Business
Intelligence
Predictive
Analytics
Streaming
Text Analytics
Data
Management
MDM
Visualization
Workload
Optimization
BigSQL
GPFS
HA
Integration with IBM Portfolio
» Analytics: SPSS, Cognos, Streams
» Data Warehouse: Netezza, DB2
» Governance + Security: Optim, Guardium, Information
Governance
» + Data Integration + Security Intelligence + Watson Explorer
+ MDM + Data Replication
Big R Overview
Scalable in-Hadoop Analytics
Challenges with Running Large-Scale Analytics
• Traditional approach: Analyze small subsets of information
• Big Data approach: Analyze all information
[Figure: in the traditional approach the analyzed information is a small part of all available information; in the Big Data approach all available information is analyzed]
What is Big R?
[Diagram components: R Clients, Scalable Statistics Engine, Data Sources, Embedded R Execution, R Packages; callouts 1 to 3 match the numbered points that follow]
1. Explore, visualize, transform, and model big data using familiar R syntax and paradigm (no MapReduce code)
2. Scale out R
• Partitioning of large data (“divide”)
• Parallel cluster execution of pushed down R code (“conquer”)
• All of this from within the R environment (Jaql, Map/Reduce are hidden from you)
• Almost any R package can run in this environment
3. Scalable machine learning
• A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R
“End-to-end integration of R-Project with BigInsights”
Pull data
(summaries) to
R client
Or, push R
functions
right on the
data
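The "divide" and "conquer" steps in point 2 can be pictured with a small stand-alone sketch. It is plain Python, not Big R: `group_apply`, the sample records and the worker count are hypothetical names chosen for illustration. It only mimics the idea of partitioning data by a key and running a user-supplied function on each partition in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_apply(rows, key, func, workers=4):
    """Partition rows by key ("divide"), run func on each partition in
    parallel ("conquer"), and return {key value: result}."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {k: pool.submit(func, part) for k, part in partitions.items()}
        return {k: f.result() for k, f in futures.items()}


def total_amount(part):
    """User function pushed down to every partition."""
    return sum(r["amount"] for r in part)


# Hypothetical records standing in for a large distributed data set.
sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 20},
    {"region": "west", "amount": 5},
]
result = group_apply(sales, "region", total_amount)
print(result)
```

On a real cluster the partitions would live on different nodes and only the small per-partition results would travel back to the client.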
User Experience for Big R
Connect to BI cluster
Data frame proxy to large data file
Data transformation step
Run scalable linear regression on cluster
Job Summary (Ambari)
Rich Functionality in Big R
Big R functions by category
• Connection: connect, disconnect, …
• HDFS: listfs, rmfs
• Types & Functions
  - Types: bigr.frame, bigr.vector
  - Functions: dim, nrow, colnames, coltypes, head, tail, na.string, na.omit, sort, summary
  - Coercion and Casting: as.bigr.frame, as.data.frame, ….vector, as.integer, as.logical, as.numeric
• Built-in Functions
  - Arithmetic: +, -, *, /, ^
  - Mathematical: abs, acos, asin, atan, ceiling, floor, exp, …
  - String: grepl, substr
  - Statistical: cor, cov, mean, sd
  - Miscellaneous: attach, pull, random, sample, ifelse
• Visualization: histogram
• Apply R functions: groupApply, tableApply, rowApply
• Run scalable algorithms: bigr.lm, bigr.svm, bigr. … (see subsequent slide)
How Big R compares with other R solutions
Other solutions offer an R API for writing MapReduce from R.
Example: Compute the mean departure delay for each airline on a monthly basis*.
[Code panels: RHIPE implementation, RHadoop implementation, Big R implementation]
*Dataset: “airline”. Scheduled flights in US 1987-2009.
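To make the example task concrete, here is a minimal sketch of the computation written in a map/reduce style. It is plain Python, not the RHIPE, RHadoop or Big R code from the panels, and the rows are made-up sample values, not figures from the airline data set.

```python
from collections import defaultdict

# Hypothetical sample rows: (carrier, year, month, departure delay in minutes).
records = [
    ("AA", 2008, 1, 12.0),
    ("AA", 2008, 1, 4.0),
    ("AA", 2008, 2, 30.0),
    ("UA", 2008, 1, 0.0),
    ("UA", 2008, 1, None),  # missing delay, skipped in the map phase
]


def map_phase(rows):
    """Emit ((carrier, month), (delay, 1)) for every row with a known delay."""
    for carrier, _year, month, delay in rows:
        if delay is not None:
            yield (carrier, month), (delay, 1)


def reduce_phase(pairs):
    """Sum delays and counts per key, then divide to get the mean."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, (delay, count) in pairs:
        totals[key][0] += delay
        totals[key][1] += count
    return {key: total / n for key, (total, n) in totals.items()}


mean_delay = reduce_phase(map_phase(records))
print(mean_delay)
```

Carrying a (sum, count) pair instead of a running mean is what lets partial results from different partitions be combined safely.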
Machine learning with Big R
• Based on SystemML (IBM Research, now open source)
• Scalability for large data sets
• R API inspired by R’s ML libraries
Big R function       | Inspired by R’s | Algorithm
bigr.lm()            | lm()            | Linear regression
bigr.glm()           | glm()           | Generalized Linear Models
. . .                | . . .           | . . .
bigr.kmeans()        | kmeans()        | K-means clustering
bigr.naive.bayes()   | naiveBayes()    | Naïve Bayes classifier
bigr.sample()        | sample()        | Uniform sample by percentage, exact number of samples, or partitioned sampling
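The three sampling modes listed for bigr.sample() can be illustrated with a short sketch. This is plain Python and one plausible reading of the slide, not Big R's implementation: the function names, the seed and the 70/30 split are assumptions made for the example.

```python
import random


def sample_by_fraction(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [r for r in rows if rng.random() < fraction]


def sample_exact(rows, n, rng):
    """Draw exactly n rows uniformly without replacement."""
    return rng.sample(rows, n)


def sample_partitioned(rows, fractions, rng):
    """Split rows into disjoint parts whose sizes follow `fractions`."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so no row is lost to rounding.
        is_last = i == len(fractions) - 1
        end = len(shuffled) if is_last else start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(100))
subset = sample_by_fraction(rows, 0.1, rng)
exact = sample_exact(rows, 10, rng)
train, test = sample_partitioned(rows, [0.7, 0.3], rng)
print(len(subset), len(exact), len(train), len(test))
```

Sampling by percentage returns a subset whose size varies from run to run, while the exact and partitioned modes fix the sizes in advance.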
Arbitrarily Large Data Structures: bigr.frame
> data <- bigr.frame (dataPath = "/user/bigr/email_data_virender.csv",
+ dataSource="DEL",
+ delimiter=",",
+ header=TRUE,
+ coltypes=ifelse(1:10 %in% c(2,4,7,9), "numeric", "character"),
+ useMapReduce=TRUE)
> eval(parse(text=paste0(paste0("data$newColumn",1:100),"<- as.integer(bigr.random()
* 100) + 1")))
> dim(data)
[1] 7897921 110
> summary(data[,c("TITLE_CD","newColumn99")])
TITLE_CD newColumn99
A Min. :JC Min. :1
B Max. :MW Max. :100
C Mean :50
Open Source R: glm2 Logistic Regression
> modelBuildTime1 <- proc.time()
> lg <- glm2(Target_JC_oo ~.,
+ data=data2,
+ family=binomial,
+ x=FALSE,
+ y=TRUE)
> modelBuildTime2 <- proc.time()
> resultTimeModel1 <- modelBuildTime2 - modelBuildTime1
> print(resultTimeModel1)
user system elapsed
1.90 0.19 2.09
Big R: bigr.glm Logistic Regression
> modelBuildTime0 <- proc.time()
> glmModel <- bigr.glm(Target_JC_oo ~ .,
+ data=train,
+ family=binomial(logit),
+ neg.binomial.class=1,
+ intercept=TRUE,
+ shiftAndRescale=TRUE,
+ directory="/user/bigr/glm/glm.model")
> modelBuildTime1 <- proc.time()
> resultTimeModel0 <- modelBuildTime1 - modelBuildTime0
> print(resultTimeModel0)
user system elapsed
4.06 0.05 20.66
Big R’s Scalability Beyond R and Aggregate Cluster Memory
[Figure labels: HDFS instead of Memory; Spills to disk over course of a job; PDL 500 GB Dataset]
Scaling Beyond Aggregate Memory
• Automatically “spills to disk”:
  - If more data than what fits into memory
  - e.g., IBM Research 4 TB Dataset: 6.6X More Data than RAM
Arbitrarily Large Data Structures
bf1 <- bigr.frame(...arbitrarily_large_data...)
df1 <- as.data.frame(…small_data…)
bf2 <- as.bigr.frame(…data.frame…)
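As a rough worked example of the figures on this slide, the sketch below derives the cluster memory implied by the 4 TB data set. It assumes that "6.6X More Data than RAM" means the ratio of data set size to aggregate cluster RAM; the slide does not state the RAM size itself.

```python
# Figures quoted on the slide.
dataset_tb = 4.0          # IBM Research data set size, in TB
data_to_ram_ratio = 6.6   # "6.6X More Data than RAM"

# Assumption: the ratio compares data set size with aggregate cluster RAM.
implied_ram_tb = dataset_tb / data_to_ram_ratio
beyond_memory_tb = dataset_tb - implied_ram_tb

print(f"implied aggregate RAM: {implied_ram_tb:.2f} TB")
print(f"data that cannot be held in memory at once: {beyond_memory_tb:.2f} TB")
```

Under that reading, roughly 0.6 TB of memory handles a job whose input is several times larger, which is the point of spilling to disk.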
Scalable Machine Learning Algorithms in Big R
Category               | Description                                                                                    | Big R Function
Descriptive Statistics | Univariate                                                                                     | bigr.univariateStats()
Descriptive Statistics | Bivariate                                                                                      | bigr.bivariateStats()
Descriptive Statistics | Stratified Bivariate                                                                           | bigr.bivariateStats()
Classification         | Logistic Regression (multinomial)                                                              | bigr.logistic.regression()
Classification         | Multi-Class SVM                                                                                | bigr.svm()
Classification         | Naïve Bayes (multinomial)                                                                      | bigr.naive.bayes()
Clustering             | k-Means                                                                                        | bigr.kmeans()
Regression             | Linear Regression: system of equations                                                         | bigr.lm()
Regression             | Linear Regression: CG (conjugate gradient descent)                                             | bigr.lm()
Regression             | Generalized Linear Models (GLM): distributions Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial and Bernoulli | bigr.glm()
Regression             | GLM: links for all distributions: identity, log, sq. root, inverse, 1/μ^2                      | bigr.glm()
Regression             | GLM: links for Binomial / Bernoulli: logit, probit, cloglog, cauchit                           | bigr.glm()
Predict                | Scoring                                                                                        | bigr.predict()
Transformation         | dummy coding, binning, scaling, missing value imputation                                       | bigr.transform()
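The four Binomial / Bernoulli links named in the table differ only in how a linear predictor is turned into a probability. The sketch below writes out the standard inverse link functions in plain Python; it illustrates the textbook definitions and is not taken from bigr.glm().

```python
import math


def inv_logit(eta):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))


def inv_cauchit(eta):
    """Standard Cauchy CDF."""
    return 0.5 + math.atan(eta) / math.pi


INVERSE_LINKS = {
    "logit": inv_logit,
    "probit": inv_probit,
    "cloglog": inv_cloglog,
    "cauchit": inv_cauchit,
}

for name, inverse in INVERSE_LINKS.items():
    print(name, [round(inverse(eta), 4) for eta in (-2.0, 0.0, 2.0)])
```

Logit, probit and cauchit are symmetric around a probability of 0.5 at a linear predictor of zero; cloglog is not, which is why it is often chosen for rare events.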
Big R Machine Learning -- Scalability and Performance
[Charts for bigr.lm]
• Performance (data fit in memory): 28X speedup
• Scalability (data larger than aggr. memory): R runs out of memory; Big R scales beyond cluster memory
Apache Spark R
Spark includes a set of core libraries that enable various
analytic methods which can process data from many sources
• Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework
• Data sources: large variety of data sources and formats can be supported, both on-premise or cloud
  - BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others
  - IBM CLOUD, OTHER CLOUD, CLOUD APPS, ON-PREMISE
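The micro-batch idea mentioned for Spark Streaming can be sketched in a few lines. This is plain Python, not the Spark Streaming API: the `micro_batches` helper and the event values are hypothetical, and a fixed event count stands in for the time interval that Spark uses to cut batches.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size events from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical event stream: sensor readings arriving one at a time.
events = [3, 5, 4, 8, 6, 7, 2]

# Each small batch is processed with ordinary batch logic (here, a mean).
batch_means = [sum(b) / len(b) for b in micro_batches(events, batch_size=3)]
print(batch_means)
```

Treating a stream as a sequence of small batches is what lets the same engine and the same code paths serve both batch and streaming jobs.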
Relationship to Hadoop – Spark both competes and
coexists with Hadoop MapReduce
• Standalone on HDFS: Run side-by-side with MapReduce, leverages current Hadoop stack
• Hadoop Yarn Deployment: Simply run on Yarn without any admin access required
• Spark in MapReduce: Launch Spark jobs inside of MapReduce
Spark is a superset of MapReduce and is based on similar distributed computing principles
Source: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
Spark Technology Center
• Focal point for IBM investment in Spark
  - Code contributions to Apache Spark project
  - Build industry solutions using Spark
  - Evangelize Spark technology inside/outside IBM
• Agile engagement across IBM divisions
  - Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark
  - Analytics: IBM Analytics software will exploit Spark processing
  - Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improve systems that execute Spark) the Spark stack
Goal: To be the #1 contributor and adopter in the Spark ecosystem
SparkR: R on Spark
• Opens Apache Spark to the world of R users
• Exposes Spark functionality in R-friendly syntax via DataFrames API
• V1.4 supports operations like selection, filtering, aggregation, etc.
• Roadmap for SparkR ~v1.5 (v1.6 is here too)
  - Extend machine learning capabilities into SparkR (e.g., SparkML, MLlib)
library(SparkR)
sc <- sparkR.init("local[*]")
# Create the SQL context that read.df expects (SparkR 1.x API)
sqlCtx <- sparkRSQL.init(sc)
demo <- read.df(sqlCtx, "/pathtofile/customerdata.json", "json")
printSchema(demo)
head(demo)
Spark DataFrame
DataFrame = RDD + Schema + DSL
• A tabular data structure
• Distributed collection of rows organized into named columns
• Unified interface for reading and writing data (I/O)
• Abstractions for filtering, slicing & dicing, aggregation, visualization
• Catalyst optimizer: plan optimization + execution
Simple data structure. Less code. Language Performance Parity.
© 2016 IBM Corporation29
Code Execution (1)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
File: sparkQuotes.txt
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
• ‘spark-shell’ provides Spark context as ‘sc’
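To trace what the transformations and the action on this slide do to the five lines of sparkQuotes.txt, the same chain can be replayed without a cluster. The sketch below is plain Python, not Spark: generator expressions stand in for lazy RDD transformations, and consuming them stands in for the action.

```python
# Contents of sparkQuotes.txt as shown on the slide.
lines = [
    "DAN Spark is cool",
    "BOB Spark is fun",
    "BRIAN Spark is great",
    "DAN Scala is awesome",
    "BOB Scala is flexible",
]

# Transformations: generator expressions are lazy, so nothing runs yet.
dan_quotes = (line for line in lines if line.startswith("DAN"))
dan_topics = (line.split(" ")[1] for line in dan_quotes)
spark_only = (word for word in dan_topics if "Spark" in word)

# Action: consuming the generator pulls each line through the whole chain.
count = sum(1 for _ in spark_only)
print(count)  # prints 1
```

Two lines start with "DAN", their second words are "Spark" and "Scala", and only one of those contains "Spark", so the count is 1.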
Spark DataFrame Execution
[Diagram: the Scala/Java DF, Python DF and R DF APIs are simple wrappers to create a logical plan, the intermediate representation for computation; the Catalyst optimizer turns the logical plan into physical execution]
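The flow from logical plan to optimizer to physical execution can be pictured with a toy model. The sketch below is plain Python and does not reproduce Catalyst: the plan format, the single "merge adjacent filters" rule and the sample rows are all assumptions made for illustration.

```python
def optimize(plan):
    """Toy optimizer rule: merge adjacent filter steps into one."""
    optimized = []
    for op, arg in plan:
        if op == "filter" and optimized and optimized[-1][0] == "filter":
            previous = optimized.pop()[1]
            merged = lambda row, p=previous, q=arg: p(row) and q(row)
            optimized.append(("filter", merged))
        else:
            optimized.append((op, arg))
    return optimized


def execute(plan, rows):
    """Physical execution: run each step of the plan over the rows."""
    for op, arg in plan:
        if op == "filter":
            rows = [r for r in rows if arg(r)]
        elif op == "select":
            rows = [{c: r[c] for c in arg} for r in rows]
    return rows


# Logical plan: a description of the work, built without touching the data.
logical_plan = [
    ("filter", lambda r: r["age"] > 30),
    ("filter", lambda r: r["city"] == "NYC"),
    ("select", ["name"]),
]
people = [
    {"name": "Ann", "age": 41, "city": "NYC"},
    {"name": "Bob", "age": 25, "city": "NYC"},
    {"name": "Cy", "age": 52, "city": "SF"},
]
physical_plan = optimize(logical_plan)
result = execute(physical_plan, people)
print(len(physical_plan), result)
```

Because every language front end only builds the plan, the optimizer and executor are shared, which is the reason the deck gives for uniform performance across languages.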
Spark DataFrames have uniform performance across all
languages
MLlib
Library providing machine learning primitives on top of Apache Spark
• Ease of use
  - Use any Hadoop data source (e.g. HDFS, HBase, or local files)
  - Build single application with other Spark components (e.g., GraphX, Spark Streaming, SparkSQL, etc.)
• Performance
  - Contains high-quality iterative algorithms
  - MLlib runs up to 100X faster than MapReduce
• Easy to Deploy
  - Runs on existing Hadoop clusters
  - Standalone mode, EC2, Mesos
MLlib “off-the-shelf algorithms”
Category: Algorithms
• Basic Statistics: Summary statistics, Correlations, Stratified sampling, Hypothesis testing, Random data generation
• Classification and Regression: Linear models (SVMs, logistic regression, linear regression), Naïve Bayes, Decision trees, Ensembles of trees (Random Forests and Gradient-Boosted Trees), Isotonic regression
• Collaborative Filtering: Alternating least squares (ALS)
• Clustering: K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Streaming k-means
MLlib “off-the-shelf algorithms” (cont’d)
Category: Algorithms
• Dimensionality Reduction: Singular value decomposition (SVD), Principal component analysis (PCA)
• Feature Extraction and Transformation: Term frequency-inverse document frequency (TF-IDF), Word2Vec, StandardScaler, Normalizer, ChiSqSelector, ElementwiseProduct
• Frequent Pattern Mining: FP-growth
• Optimization: Stochastic gradient descent, Limited-memory BFGS (L-BFGS)
• PMML Model Export: KMeansModel, LinearRegressionModel, LassoModel, SVMModel, BinaryLogisticRegressionModel
SparkML Pipelines API
Robust systems for end-to-end machine learning pipelines
• Specify pipeline, inspect and debug, re-run on new data, tune parameters
KeystoneML is a research complement to SparkML
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# `training` is assumed to be a DataFrame with "text" and "label" columns
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
[Diagram: Tokenizer, HashingTF and Logistic Regression stages, producing a Logistic Regression Model]
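What the first two pipeline stages do to the data can be shown with a toy stand-in. The sketch below is plain Python and does not use pyspark: the three classes are simplified, hypothetical re-creations that share only their names with the SparkML stages, and the LogisticRegression estimator and the fit step are left out.

```python
import zlib


class Tokenizer:
    """Split each text into lower-case words."""

    def transform(self, texts):
        return [t.lower().split() for t in texts]


class HashingTF:
    """Map each word to a bucket by hashing and count occurrences."""

    def __init__(self, num_features=8):
        self.num_features = num_features

    def transform(self, docs):
        vectors = []
        for words in docs:
            vec = [0] * self.num_features
            for w in words:
                # crc32 is stable across runs, unlike Python's built-in hash().
                vec[zlib.crc32(w.encode("utf-8")) % self.num_features] += 1
            vectors.append(vec)
        return vectors


class Pipeline:
    """Feed the output of each stage into the next one."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


pipeline = Pipeline(stages=[Tokenizer(), HashingTF(num_features=8)])
features = pipeline.transform(["Spark is fast", "spark spark"])
print(features)
```

Chaining stages behind one object is what makes it possible to re-run the whole sequence on new data or tune parameters in one place.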
Project Tungsten: Hardware Exploitation for Apache Spark
• Language frontend: SQL, Python, Scala/Java, R, …
• Intermediate representation: DataFrame, Logical Plan
• Tungsten backend: JVM, LLVM, GPU, NVRAM, …
Asynchronous Design Patterns
Sophisticated communication patterns for sophisticated machine learning
DeepDist “Lightning-Fast Deep Learning on Spark”
• Training deep belief networks requires extensive data and computation
• Asynchronous stochastic gradient descent (convergence happens more quickly)
• Based on Google’s DistBelief project
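The key property of asynchronous training is that workers compute gradients on weights that may already be out of date. The sketch below simulates that with delayed (stale) gradient updates on a one-dimensional quadratic. It is plain Python, not DeepDist or DistBelief code: the exact gradient of a toy loss stands in for a stochastic mini-batch gradient, and the learning rate, delay and step count are assumed values.

```python
def stale_gradient_descent(target, lr=0.1, delay=2, steps=200):
    """Minimize (w - target)**2 using gradients computed on delayed weights."""
    history = [0.0]  # history[t] is the weight after t updates
    for t in range(steps):
        stale_w = history[max(0, t - delay)]  # weight the "worker" last saw
        gradient = 2.0 * (stale_w - target)
        history.append(history[-1] - lr * gradient)
    return history[-1]


w_sync = stale_gradient_descent(target=3.0, delay=0)   # always fresh weights
w_async = stale_gradient_descent(target=3.0, delay=2)  # weights two steps old
print(w_sync, w_async)
```

With a small enough learning rate both runs settle at the target. The sketch shows only that stale updates can still converge; it does not measure the wall-clock gain from workers not waiting for each other.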
Exciting machine
learning ecosystem
developing on and
around Spark:
• GitHub projects
• Simple APIs
• Exploiting hardware
Where to go next?
Spark will be infused throughout IBM products and will be
delivered however the customer wants to access its power
Standalone
Within
Platforms
Within
Solutions
Spark as a Service
(Bluemix)
IBM Open Platform
(w/ Spark)
BigInsights on
Cloud (w/ Spark)
IBM Streams
…many others
underway
Analytics
Commerce
Watson Health
…many others
underway
IBM’s vision for IBM Analytics for Apache Spark (Spark-as-a-Service)
We make Spark
ACCESSIBLE,
INTEGRATED and
POWERFUL
IBM Bluemix (https://console.ng.bluemix.net/)
Data Scientist Workbench (http://www.datascientistworkbench.com/)
Big Data University (http://bigdatauniversity.com/)
IBM Big Data and analytics sample architecture
[Diagram labels]
• All Data Sources: Streaming Data, Text Data, Applications Data, Time Series, Geo Spatial, Relational, Social Network, Video & Image
• Zones: Ingestion and Operational Information; Landing and Archive Zone; Real-time Analytics Zone; Enterprise Warehouse and Mart Zone; Analytic Appliances
• Big Data Platform Capabilities: Information Ingest, Real Time Analytics, Warehouse & Data Marts, Analytic Appliances
• Information Governance, Security and Business Continuity
• Advanced Analytics / New Insights:
  - Cognitive: Learn Dynamically?
  - Prescriptive: Best Outcomes?
  - Predictive: What Could Happen?
  - Descriptive: What Has Happened?
  - Exploration and Discovery: What Do You Have?
• New / Enhanced Applications: Automated Process, Case Management, Analytic Applications, Watson, Cloud Services, ISV Solutions, Alerts
More Related Content

What's hot

Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm DataWorks Summit/Hadoop Summit
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...DataWorks Summit/Hadoop Summit
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Amazon Web Services
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
How to deploy machine learning models into production
How to deploy machine learning models into productionHow to deploy machine learning models into production
How to deploy machine learning models into productionDataWorks Summit
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?J Langley
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 

What's hot (20)

Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Admiral Group
Admiral GroupAdmiral Group
Admiral Group
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
How to deploy machine learning models into production
How to deploy machine learning models into productionHow to deploy machine learning models into production
How to deploy machine learning models into production
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 

Viewers also liked

SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thAlton Alexander
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...Spark Summit
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Spark Summit
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015Alexander DEJANOVSKI
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confusevito jeng
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016clairvoyantllc
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionDataWorks Summit/Hadoop Summit
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...Nicolas Kourtellis
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkRKien Dang
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandApache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandJosh Elser
 
First impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithmFirst impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithmInfoFarm
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
 

Galvanise NYC - Scaling R with Hadoop & Spark. V1.0

  • 1. Scaling R using Hadoop and Spark. Virender S. Thakur, Big Data Specialist @IBM. Open Source Analytics Meetup, NYC, March 8th, 2016. © 2016 IBM Corporation
  • 2. IBM’s Framework for Getting Value out of Big Data
      • All agree on Big Data’s potential, but there is wide divergence on how to exploit it
      • Pioneers who have started to harness Big Data have benefited greatly
      • We see Big Data adoption as a continual process with maturity levels
      • IBM’s approach enables faster adoption of Big Data technologies
          • Open source innovation (Hadoop, Spark)
          • Standards-based technologies (ODP, SQL, R)
          • Familiar interfaces and integration with established tools (IBM innovations)
          • Advanced analytics (IBM innovations)
      • IBM’s commitment to continued innovation
  • 3. IBM is Committed to Open Source
      • Open source technologies are the base for IBM software and solutions
      • IBM’s long history of deep open source commitment
          • Apache Software Foundation: Founding member in 1999
          • Cloud Foundry: #1 contributor; basis for Bluemix
          • OpenStack: #4 contributor; basis for IBM’s IaaS
          • Linux: #3 contributor; IBM was the first enterprise backer of Linux
          • Hadoop/Spark: Extensive investment in open source contribution; integration with Analytics software
      Diagram labels: Infrastructure, Systems, Application
  • 4. IBM Has the Largest Investment in Spark of Any Company in the World
      • IBM Spark Technology Center
      • Top committer/contributor
      • 300+ inventors
      • Commitment to educate 1 million data scientists
      • Contributed SystemML
      • Founding member of AMPLab
      • Partnerships in the ecosystem
      Visit www.spark.tc for more information.
  • 5. IBM is all-in on its commitment to Spark
      • Contribute to the Core
          • Launch Spark Technology Center (STC), 300 engineers
          • Open source SystemML
          • Partner with Databricks
      • Foster Community
          • Educate 1M+ data scientists and engineers via online courses
          • Sponsor AMPLab, creators and evangelists of Spark
      • Infuse the Portfolio
          • Integrate Spark throughout portfolio
          • 3,500 employees working on Spark-related topics
          • Spark however customers want it: standalone, platform or products
      "It's like Spark just got blessed by the enterprise rabbi." Ben Horowitz, Andreessen Horowitz
      Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
  • 6. Open Data Platform Initiative
      Why is IBM involved?
      • Strong history of leadership in open source & standards
      • Supports our commitment to open source currency in all future releases
      • Accelerates our innovation within Hadoop & surrounding applications
      Open Data Platform (ODP) vs. Apache Software Foundation (ASF)
      • ODP supports the ASF mission
      • ASF provides a governance model around individual projects without looking at the ecosystem
      • ODP aims to provide a vendor-led, consistent packaging model for core Apache components as an ecosystem
      Diagram labels: All Standard Apache Open Source Components (HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene); ODP
  • 7. IBM BigInsights for Apache Hadoop
      • IBM Open Platform with Apache Hadoop (all open source): HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene, Zookeeper, Oozie, Knox, Slider
      • Module labels: IBM BigInsights Analyst, IBM BigInsights Data Scientist, IBM BigInsights Enterprise Management
      • Feature labels: Big SQL, BigSheets, Big R, Machine Learning on Big R, Text Analytics, POSIX Distributed Filesystem, Multi-workload and multi-tenant scheduling
      • Other labels: Initial ODP Scope; Cloud, On Prem, Appliance
  • 8. What Makes Us Different?
      • Open: Open Data Platform
      • Insight: Tools and accelerators to visualize, filter, and analyze large data sets
      • Anywhere: Cloud, On Premise, Appliance
      Key Benefits
      • BigSQL: makes Hive faster and more secure
      • BigSheets for visualization and exploration
      • Text Analytics from IBM Research
      Integration with IBM Portfolio
      • Analytics: SPSS, Cognos, Streams
      • Data Warehouse: Netezza, DB2
      • Governance + Security: Optim, Guardium, Information Governance
      • + Data Integration + Security Intelligence + Watson Explorer + MDM + Data Replication
      Diagram labels: IBM Hadoop Distribution; ODP Core; 100% open source Apache Hadoop distribution including Spark; IBM Hadoop Ecosystem; SQL, Security, Business Intelligence, Predictive Analytics, Streaming, Text Analytics, Data Management, MDM, Visualization, Workload Optimization, BigSQL, GPFS, HA
  • 9. Big R Overview: Scalable in-Hadoop Analytics
  • 10. Challenges with Running Large-Scale Analytics
      • Traditional approach: analyze small subsets of information (diagram labels: analyzed information, all available information)
      • Big Data approach: analyze all information (diagram label: all available information analyzed)
  • 11. What is Big R? “End-to-end integration of R-Project with BigInsights”
      Diagram labels: R Clients, Scalable Statistics Engine, Data Sources, Embedded R Execution, R Packages. Pull data (summaries) to the R client, or push R functions right on the data.
      1. Explore, visualize, transform, and model big data using familiar R syntax and paradigm (no MapReduce code)
      2. Scale out R
          • Partitioning of large data (“divide”)
          • Parallel cluster execution of pushed-down R code (“conquer”)
          • All of this from within the R environment (Jaql and Map/Reduce are hidden from you)
          • Almost any R package can run in this environment
      3. Scalable machine learning
          • A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R
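The “divide” and “conquer” steps on slide 11 can be sketched in plain Python. This is only an illustration of the partition-then-apply pattern, not Big R's implementation: the name `group_apply`, the sample rows, and the thread pool standing in for a cluster are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def group_apply(rows, key, func, workers=4):
    """Partition rows by key ("divide"), then run func on each partition in parallel ("conquer")."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)

    keys = list(partitions)
    # A thread pool stands in for the cluster; each partition is processed independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(func, (partitions[k] for k in keys))
    return dict(zip(keys, results))


# Made-up rows for illustration: (group, value)
rows = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 30.0)]
summary = group_apply(
    rows,
    key=lambda r: r[0],
    func=lambda part: mean(v for _, v in part),
)
print(summary)
```

The caller only supplies a key and an ordinary function; how the partitions are distributed stays hidden, which is the point the slide makes about Jaql and Map/Reduce.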
  • 12. User Experience for Big R
      • Connect to BI cluster
      • Data frame proxy to large data file
      • Data transformation step
      • Run scalable linear regression on cluster
  • 13. Job Summary (Ambari)
  • 14. Rich Functionality in Big R
      • Connection: connect, disconnect, …
      • HDFS: listfs, rmfs
      • Types & Functions
          • Types: bigr.frame, bigr.vector
          • Functions: dim, nrow, colnames, coltypes, head, tail, na.string, na.omit, sort, summary
          • Coercion and casting: as.bigr.frame, as.data.frame, ….vector, as.integer, as.logical, as.numeric
      • Built-in Functions
          • Arithmetic: +, -, *, /, ^
          • Mathematical: abs, acos, asin, atan, ceiling, floor, exp, …
          • String: grepl, substr
          • Statistical: cor, cov, mean, sd
          • Miscellaneous: attach, pull, random, sample, ifelse
      • Visualization: histogram
      • Apply R functions: groupApply, tableApply, rowApply
      • Run scalable algorithms: bigr.lm, bigr.svm, bigr. … (see subsequent slide)
  • 15. How Big R compares with other R solutions
      Other solutions offer an R API for writing MapReduce from R.
      Example: Compute the mean departure delay for each airline on a monthly basis*.
      Panels: RHIPE implementation, RHadoop implementation, Big R implementation
      *Dataset: “airline”. Scheduled flights in the US, 1987-2009.
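The transcript does not preserve the three code panels, so here is a small Python sketch of the computation the example describes: mean departure delay per airline per month. The records, field order, and function name are made up for illustration and are not taken from the airline dataset or from any of the three implementations.

```python
from collections import defaultdict

# Made-up records for illustration: (carrier, month, departure delay in minutes).
flights = [
    ("AA", 1, 10.0),
    ("AA", 1, 20.0),
    ("AA", 2, 5.0),
    ("UA", 1, 0.0),
    ("UA", 1, 30.0),
    ("UA", 2, None),  # missing delay, skipped below
]


def mean_delay_by_carrier_month(records):
    """Return {(carrier, month): mean departure delay}, skipping missing delays."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, count]
    for carrier, month, delay in records:
        if delay is None:
            continue
        acc = totals[(carrier, month)]
        acc[0] += delay
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


result = mean_delay_by_carrier_month(flights)
for (carrier, month), avg in sorted(result.items()):
    print(carrier, month, avg)
```

Keeping a running (sum, count) pair per key, rather than a list of delays, is what lets partial results from different partitions be merged, which is the shape a MapReduce-style implementation of this query would also need.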
  • 16. Machine learning with Big R
      • Based on SystemML (IBM Research, now open source)
      • Scalability for large data sets
      • R API inspired by R’s ML libraries
      Big R function        Inspired by R’s    Algorithm
      bigr.lm()             lm()               Linear regression
      bigr.glm()            glm()              Generalized Linear Models
      . . .                 . . .              . . .
      bigr.kmeans()         kmeans()           K-means clustering
      bigr.naive.bayes()    naiveBayes()       Naïve Bayes classifier
      bigr.sample()         sample()           Uniform sample by percentage, exact number of samples, or partitioned sampling
  • 17. Arbitrarily Large Data Structures: bigr.frame
      > data <- bigr.frame(dataPath = "/user/bigr/email_data_virender.csv",
      +                    dataSource = "DEL",
      +                    delimiter = ",",
      +                    header = TRUE,
      +                    coltypes = ifelse(1:10 %in% c(2,4,7,9), "numeric", "character"),
      +                    useMapReduce = TRUE)
      > eval(parse(text=paste0(paste0("data$newColumn",1:100),"<- as.integer(bigr.random() * 100) + 1")))
      > dim(data)
      [1] 7897921     110
      > summary(data[,c("TITLE_CD","newColumn99")])
          TITLE_CD     newColumn99
      A   Min. :JC     Min. :1
      B   Max. :MW     Max. :100
      C                Mean :50
  • 18. Open Source R: glm2 Logistic Regression
      > modelBuildTime1 <- proc.time()
      > lg <- glm2(Target_JC_oo ~ .,
      +            data = data2,
      +            family = binomial,
      +            x = FALSE,
      +            y = TRUE)
      > modelBuildTime2 <- proc.time()
      > resultTimeModel1 <- modelBuildTime2 - modelBuildTime1
      > print(resultTimeModel1)
         user  system elapsed
         1.90    0.19    2.09
  • 19. Big R: bigr.glm Logistic Regression
      > modelBuildTime0 <- proc.time()
      > glmModel <- bigr.glm(Target_JC_oo ~ .,
      +                      data = train,
      +                      family = binomial(logit),
      +                      neg.binomial.class = 1,
      +                      intercept = TRUE,
      +                      shiftAndRescale = TRUE,
      +                      directory = "/user/bigr/glm/glm.model")
      > modelBuildTime1 <- proc.time()
      > resultTimeModel0 <- modelBuildTime1 - modelBuildTime0
      > print(resultTimeModel0)
         user  system elapsed
         4.06    0.05   20.66
  • 20. Big R’s Scalability Beyond R and Aggregate Cluster Memory
      Arbitrarily Large Data Structures
          bf1 <- bigr.frame(...arbitrarily_large_data...)
          df1 <- as.data.frame(…small_data…)
          bf2 <- as.bigr.frame(…data.frame…)
      Scaling Beyond Aggregate Memory
      • Automatically “spills to disk” if there is more data than fits into memory
      • e.g., IBM Research 4 TB dataset: 6.6X more data than RAM
      Chart labels: HDFS instead of memory; spills to disk over the course of a job; PDL 500 GB dataset
  • 21. Scalable Machine Learning Algorithms in Big R
      • Descriptive Statistics
          • Univariate: bigr.univariateStats()
          • Bivariate: bigr.bivariateStats()
          • Stratified Bivariate: bigr.bivariateStats()
      • Classification
          • Logistic Regression (multinomial): bigr.logistic.regression()
          • Multi-Class SVM: bigr.svm()
          • Naïve Bayes (multinomial): bigr.naive.bayes()
      • Clustering
          • k-Means: bigr.kmeans()
      • Regression
          • Linear Regression, system of equations: bigr.lm()
          • Linear Regression, CG (conjugate gradient descent): bigr.lm()
      • Generalized Linear Models (GLM): bigr.glm()
          • Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial and Bernoulli
          • Links for all distributions: identity, log, sq. root, inverse, 1/μ²
          • Links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
      • Predict
          • Scoring: bigr.predict()
      • Transformation
          • Dummy coding, binning, scaling, missing value imputation: bigr.transform()
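Slide 21 lists conjugate gradient (CG) as one of the solvers behind bigr.lm(). As a rough illustration of the technique only, and not of SystemML's or Big R's actual code, the sketch below fits a least-squares line by running CG on the normal equations (XᵀX)w = Xᵀy. All function names and the data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting point x = 0
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def fit_linear(X, y):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    columns = list(zip(*X))
    XtX = [[dot(ci, cj) for cj in columns] for ci in columns]
    Xty = [dot(ci, y) for ci in columns]
    return conjugate_gradient(XtX, Xty)


# Made-up data generated from y = 1 + 2x; the first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit_linear(X, y)
print(w)  # approximately [1.0, 2.0]
```

CG only needs matrix-vector products, so each iteration can be computed as a pass over partitioned data, which is one reason iterative solvers are attractive when the data does not fit on one machine.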
  • 22. Big R Machine Learning: Scalability and Performance (bigr.lm)
      • Performance (data fits in memory): 28X speedup
      • Scalability (data larger than aggregate memory): R runs out of memory; Big R scales beyond cluster memory
  • 23. Apache Spark R
  • 24. Spark includes a set of core libraries that enable various analytic methods which can process data from many sources
      • Spark Core: general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions
      • Spark SQL: executes SQL statements
      • Spark Streaming: performs streaming analytics using micro-batches
      • MLlib (machine learning): common machine learning and statistical algorithms
      • GraphX (graph): distributed graph processing framework
      • A large variety of data sources and formats can be supported, both on-premise and in the cloud: BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others
      Diagram labels: IBM Cloud, Other Cloud, Cloud Apps, On-Premise
  • 25. Relationship to Hadoop: Spark both competes and coexists with Hadoop MapReduce
      • Standalone on HDFS: run side-by-side with MapReduce; leverages the current Hadoop stack
      • Hadoop YARN deployment: simply run on YARN without any admin access required
      • Spark in MapReduce: launch Spark jobs inside of MapReduce
      Spark is a superset of MapReduce and is based on similar distributed computing principles.
      Source: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
  • 26. Spark Technology Center
      • Focal point for IBM investment in Spark
          • Code contributions to Apache Spark project
          • Build industry solutions using Spark
          • Evangelize Spark technology inside/outside IBM
      • Agile engagement across IBM divisions
          • Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark
          • Analytics: IBM Analytics software will exploit Spark processing
          • Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improve systems that execute Spark) the Spark stack
      Goal: To be the #1 contributor and adopter in the Spark ecosystem
  • 27. SparkR: R on Spark
      • Opens Apache Spark to the world of R users
      • Exposes Spark functionality in R-friendly syntax via the DataFrames API
      • V1.4 supports operations like selection, filtering, aggregation, etc.
      • Roadmap for SparkR ~v1.5 (v1.6 is here too)
          • Extend machine learning capabilities into SparkR (e.g., SparkML, MLlib)
      library(SparkR)
      sc <- sparkR.init("local[*]")
      sqlCtx <- sparkRSQL.init(sc)  # SQL context that read.df expects in SparkR 1.x
      demo <- read.df(sqlCtx, "/pathtofile/customerdata.json", "json")
      printSchema(demo)
      head(demo)
  • 28. Spark DataFrame: DataFrame = RDD + Schema + DSL
      • A tabular data structure
      • Distributed collection of rows organized into named columns
      • Unified interface for reading and writing data (I/O)
      • Abstractions for filtering, slicing & dicing, aggregation, visualization
      • Catalyst optimizer: plan optimization + execution
      Simple data structure. Less code. Language performance parity.
  • 29. Code Execution (1)
      • ‘spark-shell’ provides the Spark context as ‘sc’
      // Create RDD
      val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
      // Transformations
      val danQuotes = quotes.filter(_.startsWith("DAN"))
      val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
      // Action
      danSpark.filter(_.contains("Spark")).count()
      File: sparkQuotes.txt
          DAN Spark is cool
          BOB Spark is fun
          BRIAN Spark is great
          DAN Scala is awesome
          BOB Scala is flexible
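To trace what the job on slide 29 returns without a cluster, the same chain can be mimicked in plain Python over the five lines of sparkQuotes.txt. This is an analogy, not Spark code: generator expressions stand in for lazy transformations, and consuming them stands in for the action. The variable names are chosen for this example.

```python
lines = [
    "DAN Spark is cool",
    "BOB Spark is fun",
    "BRIAN Spark is great",
    "DAN Scala is awesome",
    "BOB Scala is flexible",
]

# "Transformations": generator expressions are lazy, so nothing is computed yet.
dan_quotes = (line for line in lines if line.startswith("DAN"))
dan_second_words = (line.split(" ")[1] for line in dan_quotes)

# "Action": consuming the generator runs the whole chain.
count = sum(1 for word in dan_second_words if "Spark" in word)
print(count)  # 1
```

Only two lines start with "DAN", their second words are "Spark" and "Scala", and one of those contains "Spark", so the count is 1.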
  • 30. Spark DataFrame Execution
      • Python DF, Scala/Java DF, R DF: simple wrappers to create the logical plan
      • Logical Plan: intermediate representation for computation
      • Catalyst optimizer -> Physical Execution
  • 31. Spark DataFrames have uniform performance across all languages
  • 32. MLlib: library providing machine learning primitives on top of Apache Spark
      • Ease of use
          • Use any Hadoop data source (e.g. HDFS, HBase, or local files)
          • Build a single application with other Spark components (e.g., GraphX, Spark Streaming, SparkSQL, etc.)
      • Performance
          • Contains high-quality iterative algorithms
          • MLlib runs up to 100X faster than MapReduce
      • Easy to Deploy
          • Runs on existing Hadoop clusters
          • Standalone mode, EC2, Mesos
  • 33. MLlib “off-the-shelf algorithms”
      • Basic Statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
      • Classification and Regression: linear models (SVMs, logistic regression, linear regression), Naïve Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression
      • Collaborative Filtering: alternating least squares (ALS)
      • Clustering: K-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means
  • 34. MLlib “off-the-shelf algorithms” (cont’d)
      • Dimensionality Reduction: singular value decomposition (SVD), principal component analysis (PCA)
      • Feature Extraction and Transformation: term frequency-inverse document frequency (TF-IDF), Word2Vec, StandardScaler, Normalizer, ChiSqSelector, ElementwiseProduct
      • Frequent Pattern Mining: FP-growth
      • Optimization: stochastic gradient descent, limited-memory BFGS (L-BFGS)
      • PMML Model Export: KMeansModel, LinearRegressionModel, LassoModel, SVMModel, BinaryLogisticRegressionModel
  • 35. SparkML Pipelines API: robust systems for end-to-end machine learning pipelines
      • Specify pipeline, inspect and debug, re-run on new data, tune parameters
      KeystoneML is a research complement to SparkML
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.feature import HashingTF, Tokenizer
      # training is assumed to be a DataFrame with "text" and "label" columns
      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
      lr = LogisticRegression(maxIter=10, regParam=0.01)
      pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
      model = pipeline.fit(training)
      Diagram: Tokenizer -> HashingTF -> Logistic Regression -> Logistic Regression Model
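The idea of chaining stages so that the same steps can be re-run on new data can be shown without Spark. The toy classes below are written for this example and only borrow the stage names from slide 35; they are not the pyspark.ml classes, and the estimator stage (logistic regression) is left out to keep the sketch short.

```python
import zlib


class Tokenizer:
    """Transformer: lower-cases text and splits it on whitespace."""

    def transform(self, docs):
        return [doc.lower().split() for doc in docs]


class HashingTF:
    """Transformer: hashes each token into one of num_features buckets and counts hits."""

    def __init__(self, num_features=16):
        self.num_features = num_features

    def transform(self, token_lists):
        vectors = []
        for tokens in token_lists:
            vec = [0] * self.num_features
            for tok in tokens:
                # crc32 is stable across runs, unlike Python's built-in hash() for strings.
                vec[zlib.crc32(tok.encode("utf-8")) % self.num_features] += 1
            vectors.append(vec)
        return vectors


class Pipeline:
    """Chains transformers so the same steps are applied to any new data."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


pipeline = Pipeline([Tokenizer(), HashingTF(num_features=16)])
features = pipeline.transform(["spark is fast", "R on Spark"])
print(features)
```

Because the pipeline object holds the stages, feature extraction for new documents is a single `pipeline.transform(...)` call with identical settings, which is the "re-run on new data" benefit the slide mentions.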
  • 36. Project Tungsten: Hardware Exploitation for Apache Spark
      • Language frontend: SQL, Python, Scala/Java, R, …
      • Intermediate representation: DataFrame Logical Plan
      • Tungsten backend: JVM, LLVM, GPU, NVRAM, …
  • 37. Asynchronous Design Patterns: sophisticated communication patterns for sophisticated machine learning
      DeepDist: “Lightning-Fast Deep Learning on Spark”
      • Training deep belief networks requires extensive data and computation
      • Asynchronous stochastic gradient descent (convergence happens more quickly)
      • Based on Google’s DistBelief project
      Exciting machine learning ecosystem developing on and around Spark:
      • GitHub projects
      • Simple APIs
      • Exploiting hardware
  • 38. Where to go next?
  • 39. Spark will be infused throughout IBM products and will be delivered however the customer wants to access its power
      • Delivery modes: Standalone, Within Platforms, Within Solutions
      • Offerings: Spark as a Service (Bluemix), IBM Open Platform (w/ Spark), BigInsights on Cloud (w/ Spark), IBM Streams, …many others underway
      • Solutions: Analytics, Commerce, Watson Health, …many others underway
  • 40. IBM’s vision for IBM Analytics for Apache Spark (Spark-as-a-Service): We make Spark ACCESSIBLE, INTEGRATED and POWERFUL
  • 41. IBM Bluemix (https://console.ng.bluemix.net/)
  • 42. Data Scientist Workbench (http://www.datascientistworkbench.com/)
  • 43. Big Data University (http://bigdatauniversity.com/)
  • 44. IBM big data • THINK
  • 45. IBM Big Data and analytics sample architecture
      • All Data Sources: Streaming Data, Text Data, Applications Data, Time Series, Geo Spatial, Relational, Social Network, Video & Image
      • Big Data Platform Capabilities: Information Ingest, Real Time Analytics, Warehouse & Data Marts, Analytic Appliances
      • Zones: Ingestion and Operational Information; Landing and Archive Zone; Real-time Analytics Zone; Enterprise Warehouse and Mart Zone; Analytic Appliances; Information Governance, Security and Business Continuity
      • Advanced Analytics / New Insights: Cognitive (Learn Dynamically?), Prescriptive (Best Outcomes?), Predictive (What Could Happen?), Descriptive (What Has Happened?), Exploration and Discovery (What Do You Have?)
      • New / Enhanced Applications: Automated Process, Case Management, Analytic Applications, Watson, Cloud Services, ISV Solutions, Alerts