Enable Interactive R and Python Environments on Hadoop
June 27, 2017 • Room #308
Dave Carlson
Data Science Technical Lead
Zurich North America
Ming Yuan
Architect, Big Data & Cloud
Capital One
Disclaimer
The views expressed in this presentation are solely those of
Dave Carlson or Ming Yuan and are not those of Zurich American
Insurance Company or Capital One.
June 25, 2017 1
Goals
• Enable organizations to take advantage of an ecosystem for Big Data analytics
  • Leverage the storage and processing power of Hadoop and the analytical power of R
• Build an interactive and easy-to-use interface for data scientists
• Leverage open source projects to minimize cost
• Overcome technical challenges from Kerberos
Why R and Python?
• Analysts have been transitioning to open source languages like R and Python and away from proprietary languages like SAS and SPSS.
• This is true for both academic and commercial communities.
Python integration uses Anaconda and Jupyter
R integration uses RStudio Server and RHadoop
R on Hadoop
• Install the Hadoop client node
• Install RHadoop/RJDBC
• Install RStudio Server
• Integrate the R libraries with RStudio Server
RHadoop and RJDBC Packages
• rhdfs
  • Provides basic connectivity to HDFS so that R programmers can browse, read, write, and modify files stored in HDFS
  • Installed on the node that will run the R code
• plyrmr
  • Enables R users to perform common data manipulation operations on very large data sets stored on Hadoop
  • Installed on every node in the Hadoop cluster
• rmr2
  • Allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster
  • Installed on every node in the Hadoop cluster
• ravro
  • Enables reading and writing Avro files on local and HDFS file systems, and adds an Avro input format for rmr2
  • Installed only on the node that will run the R code
• rhbase
  • Not used
• RJDBC
  • Provides basic connectivity to Hive through a JDBC driver
  • Installed on the node that will run the R code
  • Depends on the rJava and DBI packages
R and Hadoop Integration
RStudio Server Installation and Configuration
• Install the RStudio Server rpm
• Authenticate users against LDAP
  • sudo cp /etc/pam.d/login /etc/pam.d/rstudio
• Resource allocation
• Integrate with R and its libraries in rserver.conf
  • Specify the R version
    rsession-which-r=/usr/local/bin/R
  • Locate shared libraries
    rsession-ld-library-path=/opt/someapp/lib:/opt/anotherapp/lib
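A typo in either rserver.conf path tends to surface only when a session fails to start, so a quick check before restarting the server can help. The following is a minimal sketch, not part of the original deck: it assumes the file uses plain key=value lines with optional '#' comments, and the helper names parse_rserver_conf and missing_paths are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def parse_rserver_conf(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no '='
            settings[key.strip()] = value.strip()
    return settings


def missing_paths(settings: Dict[str, str]) -> List[str]:
    """Return the configured R binary and library directories that do not exist."""
    candidates: List[str] = []
    r_binary = settings.get("rsession-which-r")
    if r_binary:
        candidates.append(r_binary)
    # The library path is a colon-separated list of directories.
    lib_path = settings.get("rsession-ld-library-path", "")
    candidates.extend(p for p in lib_path.split(":") if p)
    return [p for p in candidates if not Path(p).exists()]
```

Calling missing_paths(parse_rserver_conf(text)) on the contents of the server's rserver.conf returns an empty list when every configured path is present on that machine.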
RStudio Server, R and Hadoop Integration
Python -- Anaconda Cluster
• Head node
  • A system configured to act as the intermediary between the cluster and the outside network
  • Can also be referred to as the master or edge node
• Compute node
  • The machines managed by the head node that all work together to complete a single task
Diagram: Thin Client, Head Node, three Compute Nodes
System Architecture
• RStudio Server
  • Hadoop libraries
  • rhdfs
  • rmr2
  • plyrmr
  • RJDBC
  • SparkR
• Hadoop Data Node
  • Anaconda Compute Node
  • rmr2
  • plyrmr
• Hadoop Management Node (MANAGE)
  • Anaconda Head Node
  • Anaconda Enterprise Notebook
• Thin Client
R Framework
• The R framework requires ‘dplyr 0.5.0’.
• The easiest way to ensure that the correct set of dependencies is installed is to use a fixed MRAN (Microsoft) snapshot of CRAN.
• Add the following lines to your ‘Rprofile.site’ and ‘Renviron.site’ files. If you have a fresh installation of R, you will need to create the files.
  $R_HOME/etc/Rprofile.site
    options(repos=structure(c(CRAN='https://mran.revolutionanalytics.com/snapshot/2017-04-01/')))
  $R_HOME/etc/Renviron.site
    HADOOP_CMD=/usr/bin/hadoop
    USE_KERBEROS=0
    HIVE_SERVER_HOST=localhost
    HIVE_SERVER_PORT=10000
    HIVE_JAR_FOLDERS=/usr/lib/hive/lib
• Now install the ‘honeycomb’ package using the ‘devtools’ library within R.
    devtools::install_github('ZurichPA/rhdfs', subdir='pkg')
    devtools::install_github('ZurichPA/orpheus')
    devtools::install_github('ZurichPA/honeycomb')
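The Hive connection settings on this slide are not tied to R. On the Python side, the same host, port, and Kerberos flag could be passed to Impyla (listed under Additional References). The following is a minimal sketch under several assumptions: the variables are exported into the Python process's environment, USE_KERBEROS=1 means "authenticate with Kerberos", and the Hive service principal is named 'hive'. The helper names hive_connect_kwargs and connect_to_hive are hypothetical.

```python
import os
from typing import Any, Dict, Mapping


def hive_connect_kwargs(settings: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Translate the HIVE_* and USE_KERBEROS settings into Impyla connect() arguments."""
    kwargs: Dict[str, Any] = {
        "host": settings.get("HIVE_SERVER_HOST", "localhost"),
        "port": int(settings.get("HIVE_SERVER_PORT", "10000")),
    }
    # Assumption: "1" enables Kerberos, mirroring the USE_KERBEROS=0 line above.
    if settings.get("USE_KERBEROS", "0") == "1":
        kwargs["auth_mechanism"] = "GSSAPI"
        # Assumption: the HiveServer2 principal uses the service name 'hive'.
        kwargs["kerberos_service_name"] = "hive"
    return kwargs


def connect_to_hive(settings: Mapping[str, str] = os.environ):
    """Open a DB-API connection to HiveServer2 through Impyla."""
    # Imported lazily so the helper above works without the third-party package.
    from impala.dbapi import connect

    return connect(**hive_connect_kwargs(settings))
```

With Kerberos enabled, a valid ticket (for example from kinit) must already exist for the user running the notebook. Without Kerberos, the server may expect a different auth_mechanism than Impyla's default, so that argument may need to be set explicitly for a given cluster.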
DEMO
https://zurichpa.github.io/honeycomb
Additional References
• R
  • RHadoop (https://github.com/RevolutionAnalytics/RHadoop/wiki)
  • Zurich Predictive Analytics (https://github.com/zurichpa)
  • honeycomb (https://github.com/ZurichPA/honeycomb)
• Python
  • Impyla (https://github.com/cloudera/impyla)
Editor's Notes

  1. Install RStudio Server rpm
       $ sudo yum install --nogpgcheck
     Activate your instance
       $ sudo rstudio-server license-manager activate
       $ sudo rstudio-server restart
     Access the server from a web browser
       http://<server-ip>:8787
     Check logs
       /var/log/messages
     RStudio Server depends on multiple configuration files in /etc/rstudio
       rserver.conf -- Core server settings
       rsession.conf -- Settings related to individual R sessions
     Create profiles within /etc/rstudio/profiles
       Global ([*])
       Per-group ([@groupname])
       Per-user ([username])
     Assign system settings to each of the profiles
       [*]
       cpu-affinity = 1-4
       max-processes = 100
       max-memory-mb = 2048
       session-timeout-minutes=60

       [@powerusers]
       cpu-affinity = 5-16
       nice = -10
       max-memory-mb = 4096

       [jsmith]
       r-version = /opt/R/3.1.0
       session-timeout-minutes=360
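To see which limits a given user would end up with, the profile sections can be merged in order. The following is a minimal sketch, assuming the profiles file is read top to bottom and that later matching sections override earlier ones; that ordering is an assumption here and should be checked against the RStudio Server documentation. The function name effective_profile is hypothetical, and PROFILES simply repeats the example settings from the note above.

```python
import configparser
from typing import Dict, Iterable

# The example profiles from the editor's note, kept inline so the sketch is self-contained.
PROFILES = """
[*]
cpu-affinity = 1-4
max-processes = 100
max-memory-mb = 2048
session-timeout-minutes=60

[@powerusers]
cpu-affinity = 5-16
nice = -10
max-memory-mb = 4096

[jsmith]
r-version = /opt/R/3.1.0
session-timeout-minutes=360
"""


def effective_profile(text: str, user: str, groups: Iterable[str]) -> Dict[str, str]:
    """Merge the global, per-group, and per-user sections that apply to one user."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(text)
    wanted = {"*", user} | {"@" + group for group in groups}
    merged: Dict[str, str] = {}
    # Sections are visited in file order; later matches override earlier ones (assumed).
    for section in parser.sections():
        if section in wanted:
            merged.update(parser[section])
    return merged
```

Under that assumption, jsmith as a member of powerusers would keep the global max-processes, take cpu-affinity, nice, and max-memory-mb from the group section, and take r-version and the longer session timeout from the per-user section.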