SlideShare a Scribd company logo
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Apache Spark and R
A (big data) love story?
Mark Sellors - Technical Architect @ Mango Solutions
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
About me.
• Technical Architect
• Design and deploy analytic computing
environments
• Not really an R user but have broad knowledge of
the analytic computing ecosystem
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Overview
• The rise of big data
• Barriers to big data
• Big Data vs R
• Spark
• Spark and Hadoop
• SparkR
• Is it a love story?
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
The rise of ‘big data’
• Storage prices
• Commodity
• compute
infrastructure
• Volume of data
• Hadoop ties the
two together
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Barriers to ‘big data’
• Hadoop is complex ecosystem
• Primary programming paradigm,
Map/Reduce jobs, largely written in Java
• Map/Reduce unsuited to exploratory,
interactive analysis
• Map/Reduce is slow
• RHadoop is built on top of Map/Reduce
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Some problems with ‘big data’
• Many hadoop deployments do not
achieve an appreciable ROI.
• Hard to find the staff with crossover skills
• infrastructure
• analysts
• Existing business processes not fit for
purpose
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Hadoop
• Until recently limited to batch based
operations
• MASSIVE data sets
• easy to add storage/compute capacity
But…
• Map/Reduce operations can be quite slow
• Hard to find/deploy appropriate talent
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
R
• Interactive
• fast
• great for exploratory or batch
But…
• Single threaded
• Limited by available memory
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
The value of your
data is in what you
do with it.
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What is it?
• Open source cluster computing
framework
• Relies heavily on in memory processing
• One of the most contributed-to big data
projects of the past year
• Started in the AMPLab at UC Berkeley in
2009
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What problem does it solve
• In memory makes for very fast data
processing
• minimal disk IO
• High level programming abstraction
reduces the amount of code
• In turn makes it more suitable for
exploratory work.
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
How does it do it?
• Provides a core programming abstraction
called RDD
• The RDD API has been extended to
include DataFrames
• Can deploy ad-hoc processing clusters as
well as integrate with HDFS,
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
“Will Spark replace Hadoop?”
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Hadoop is an
Ecosystem!
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Spark and Hadoop
• Very Complimentary.
• Spark already comes with all the major
Hadoop distributions
• easier to use and faster than map/reduce
• suitable for exploratory work, which
previously was difficult in hadoop
deployments
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
How does this fit with R
• Originally supported languages Scala,
Java and Python
• SparkR was a separate project
• Integrated into Spark as of v1.4
• Support is still evolving - v1.5 released
last week
• MASSIVE data frames
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
SparkR Features
• Designed to be familiar
• Massive DataFrames
• SQL operations on those DataFrames
• Fitting of GLM’s
• Works on top of Hadoop or as a stand
alone cluster
• Load data from a variety of sources
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Spark SQL
• Arbitrary SQL operations on massive in-
memory data frames
• Treats the data frame as though it were a
database table
• Useful for exploring your data set
• Also great for creating subsets
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
# Create the DataFrame
df <- createDataFrame(sqlContext, iris)
# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family =
"gaussian")
# Model coefficients are returned in a similar format to R's native glm().
summary(model)
##$coefficients
## Estimate
##(Intercept) 2.2513930
##Sepal_Width 0.8035609
##Species_versicolor 1.4587432
##Species_virginica 1.9468169
# Make predictions based on the model.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
## Sepal_Length prediction
##1 5.1 5.063856
##2 4.9 4.662076
##3 4.7 4.822788
##4 4.6 4.742432
##5 5.0 5.144212
##6 5.4 5.385281 Source: http://spark.apache.org
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Lowering the barrier to adoption
• Hadoop can be tricky to
get started with.
• Spark can run locally on
your laptop
• Can build ad-hoc
processing clusters
• Supports pulling data
from a variety of sources
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What’s in it for me?
• Currently supports:
• DataFrames
• SparkSQL
• limited subset of MLlib
• Is missing any native R support for:
• Spark Streaming
• GraphX
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Is it a love story?
• It wasn’t originally, but things are heating
up
• Data not getting any smaller
• Dramatically lowers the barrier to entry
• Evolving rapidly
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
@MangoTheCat / @sellorm

More Related Content

What's hot

Optimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScaleOptimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScale
RightScale
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big Data
Lightbend
 
IT HealthCheck
IT HealthCheckIT HealthCheck
IT HealthCheck
RISC Networks
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
VMware Tanzu Korea
 
AWS Outage Analysis
AWS Outage AnalysisAWS Outage Analysis
AWS Outage Analysis
ThousandEyes
 
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud StrategyMulti-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
ThousandEyes
 
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Web Services
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a Jedi
Trevor Parsons
 
Cloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an EcosystemCloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an Ecosystem
Databricks
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
TriNimbus
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
Chris Kernaghan
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
Joseph Arriola
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging Technologies
Databricks
 
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Amazon Web Services
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public Cloud
OVHcloud
 
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
ThousandEyes
 
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
ThousandEyes
 
Informatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both WorldsInformatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both Worlds
Ahmed Tayeh
 
Lambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB TestingLambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB Testing
Trieu Nguyen
 
Complex Data Transformations Made Easy
Complex Data Transformations Made EasyComplex Data Transformations Made Easy
Complex Data Transformations Made Easy
Data Con LA
 

What's hot (20)

Optimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScaleOptimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScale
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big Data
 
IT HealthCheck
IT HealthCheckIT HealthCheck
IT HealthCheck
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
AWS Outage Analysis
AWS Outage AnalysisAWS Outage Analysis
AWS Outage Analysis
 
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud StrategyMulti-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
 
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a Jedi
 
Cloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an EcosystemCloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an Ecosystem
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging Technologies
 
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public Cloud
 
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
 
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
 
Informatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both WorldsInformatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both Worlds
 
Lambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB TestingLambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB Testing
 
Complex Data Transformations Made Easy
Complex Data Transformations Made EasyComplex Data Transformations Made Easy
Complex Data Transformations Made Easy
 

Similar to Apache Spark and R: A (Big Data) Love Story?

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Spark Summit
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
Codecamp Romania
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
Alice Zheng
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
WeCloudData
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
James Serra
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
Uwe Printz
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
Karen Lopez
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data Modeling
DATAVERSITY
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
Institute of Contemporary Sciences
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
Databricks
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
Adam Doyle
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
Ashnikbiz
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
Venu Anuganti
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
Adi Challa
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
Sense Corp
 
Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?
Andreas Buckenhofer
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital era
Vishal Puri
 

Similar to Apache Spark and R: A (Big Data) Love Story? (20)

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data Modeling
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
 
Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital era
 

Recently uploaded

The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 

Recently uploaded (20)

The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 

Apache Spark and R: A (Big Data) Love Story?

  • 1. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Apache Spark and R A (big data) love story? Mark Sellors - Technical Architect @ Mango Solutions
  • 2. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com About me. • Technical Architect • Design and deploy analytic computing environments • Not really an R user but have broad knowledge of the analytic computing ecosystem
  • 3. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 4. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Overview • The rise of big data • Barriers to big data • Big Data vs R • Spark • Spark and Hadoop • SparkR • Is it a love story?
  • 5. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com The rise of ‘big data’ • Storage prices • Commodity • compute infrastructure • Volume of data • Hadoop ties the two together
  • 6. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Barriers to ‘big data’ • Hadoop is complex ecosystem • Primary programming paradigm, Map/Reduce jobs, largely written in Java • Map/Reduce unsuited to exploratory, interactive analysis • Map/Reduce is slow • RHadoop is built on top of Map/Reduce
  • 7. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Some problems with ‘big data’ • Many hadoop deployments do not achieve an appreciable ROI. • Hard to find the staff with crossover skills • infrastructure • analysts • Existing business processes not fit for purpose
  • 8. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 9. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Hadoop • Until recently limited to batch based operations • MASSIVE data sets • easy to add storage/compute capacity But… • Map/Reduce operations can be quite slow • Hard to find/deploy appropriate talent
  • 10. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com R • Interactive • fast • great for exploratory or batch But… • Single threaded • Limited by available memory
  • 11. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com The value of your data is in what you do with it.
  • 12. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 13. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What is it? • Open source cluster computing framework • Relies heavily on in memory processing • One of the most contributed-to big data projects of the past year • Started in the AMPLab at UC Berkeley in 2009
  • 14. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What problem does it solve • In memory makes for very fast data processing • minimal disk IO • High level programming abstraction reduces the amount of code • In turn makes it more suitable for exploratory work.
  • 15. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com How does it do it? • Provides a core programming abstraction called RDD • The RDD API has been extended to include DataFrames • Can deploy ad-hoc processing clusters as well as integrate with HDFS,
  • 16. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com “Will Spark replace Hadoop?”
  • 17. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Hadoop is an Ecosystem!
  • 18. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Spark and Hadoop • Very Complimentary. • Spark already comes with all the major Hadoop distributions • easier to use and faster than map/reduce • suitable for exploratory work, which previously was difficult in hadoop deployments
  • 19. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com How does this fit with R • Originally supported languages Scala, Java and Python • SparkR was a separate project • Integrated into Spark as of v1.4 • Support is still evolving - v1.5 released last week • MASSIVE data frames
  • 20. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com SparkR Features • Designed to be familiar • Massive DataFrames • SQL operations on those DataFrames • Fitting of GLM’s • Works on top of Hadoop or as a stand alone cluster • Load data from a variety of sources
  • 21. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Spark SQL • Arbitrary SQL operations on massive in- memory data frames • Treats the data frame as though it were a database table • Useful for exploring your data set • Also great for creating subsets
  • 22. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com # Create the DataFrame df <- createDataFrame(sqlContext, iris) # Fit a linear model over the dataset. model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") # Model coefficients are returned in a similar format to R's native glm(). summary(model) ##$coefficients ## Estimate ##(Intercept) 2.2513930 ##Sepal_Width 0.8035609 ##Species_versicolor 1.4587432 ##Species_virginica 1.9468169 # Make predictions based on the model. predictions <- predict(model, newData = df) head(select(predictions, "Sepal_Length", "prediction")) ## Sepal_Length prediction ##1 5.1 5.063856 ##2 4.9 4.662076 ##3 4.7 4.822788 ##4 4.6 4.742432 ##5 5.0 5.144212 ##6 5.4 5.385281 Source: http://spark.apache.org
  • 23. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Lowering the barrier to adoption • Hadoop can be tricky to get started with. • Spark can run locally on your laptop • Can build ad-hoc processing clusters • Supports pulling data from a variety of sources
  • 24. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What’s in it for me? • Currently supports: • DataFrames • SparkSQL • limited subset of MLlib • Is missing any native R support for: • Spark Streaming • GraphX
  • 25. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Is it a love story? • It wasn’t originally, but things are heating up • Data not getting any smaller • Dramatically lowers the barrier to entry • Evolving rapidly
  • 26. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com @MangoTheCat / @sellorm