SlideShare a Scribd company logo
1 of 31
Los Angeles R Users’ Group
Accessing R from Python using RPy2
Ryan R. Rosario
October 19, 2010
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
What is Python?
An interpreted, object-oriented, high-level programming language
with...
1 dynamic typing, yet strongly typed
2 an interactive shell for real time
testing of code
3 good rapid application
development support
4 extensive scripting capabilities
Python is similar in purpose to Perl, with a more well-defined
syntax, and is more readable
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
What is RPy2?
RPy2 is a simple interface to R from Python.
RSPython was the first interface to R from Python (and
Python from R) was RSPython by Duncan Temple Lang. Last
updated in 2005. There is also a version for Perl called
RSPerl.
RPy focused on providing a simple and robust interface to R
from the Python programming language.
RPy2 is a rewrite of RPy, providing for a more user friendly
experience and expanded capabilities.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
So, Why use Python instead of with R?
In the Twitter #rstats community, the consensus is that it is
simply PREFERENCE. Here are some other reasons.
1 primitive data types in Python are more flexible for text
mining.
tuples for paired, associated data. (no equivalent in R)
lists are similar to vectors
dictionaries provide associative arrays; a list with named entries
can provide an equivalent in R.
pleasant string handling (though stringr helps significantly in
R)
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
So, Why use Python instead of with R?
In the Twitter #rstats community, the consensus is that it is
simply PREFERENCE. Here are some other reasons.
2 handles the unexpected better
flexible exceptions; R offers tryCatch.
3 Python has a much larger community with very diverse
interests, and text and web mining are big focuses.
4 state of art algorithms for text mining typically hit Python or
Perl before they hit R.
5 parallelism does not rely on R’s parallelism packages.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
So, Why use Python instead of with R?
In the Twitter #rstats community, the consensus is that it is
simply PREFERENCE. Here are some other reasons.
6 robust regular expression engine.
7 robust processing of HTML (BeautifulSoup), XML (xml),
and JSON (simplejson).
8 web crawling and spidering; dealing with forms, cookies etc.
(mechanize and twill).
9 more access to datastores; not only MySQL and PostgreSQL
but also Redis and MongoDB, with more to come.
10 more natural object-oriented features.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
And, What are the Advantages of R over Python?
In the Twitter #rstats community, the consensus is that it is
simply PREFERENCE. Here are some other reasons.
1 Data frames; closest equivalent in Python is a dictionary,
whose values are the lists of values for each variable, or a list
of tuples, each tuple a row.
2 obviously, great analysis and visualization routines.
3 some genius developers have made some web mining tasks
easier in R than in Python.
4 Factors; Python does not even have an enum primitive data
type.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Using RPy2 to Extract Data in Python and Use in R
The site offtopic.com is the largest standalone discussion forum
site on the Web. Discussion has no restriction and topics include
everything under the sun including topics NSFW. It’s massive size
and diversity provide a lot of opportunities for text mining, NLP
and social network analysis.
The questions:
What are the ages of the participants?
Is the number of posts a user makes related to the number of
days he/she has been an active member?
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Using RPy2 to Extract Data in Python and Use in R
The offtopic.com member list:
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Using RPy2 to Extract Data in Python and Use in R
Let’s use Python to perform the following tasks:
1 Log in to the website using a username and password on an
HTML form (this is required).
2 Navigate to the list of all members (a feature provided by
vBulletin).
3 Scrape the content of all of the members on each page.
4 Repeat the previous step for all 1,195 pages in the list.
5 Parse the HTML tables to get the data and put it into an R
data frame.
and then use R (from Python) to perform the rest:
1 Create some pretty graphics and fit a linear model.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Using RPy2 to Extract Data in Python and Use in R
The specifics for you to explore:
1 I fill the web form and submit it using twill1 a wrapper to
the web browsing Python module mechanize2.
2 I extract the correct HTML table; the one that contains the
list of members, using a regular expression.
3 I navigate through the HTML table and extract the data using
BeautifulSoup3, an HTML parsing library.
1
http://twill.idyll.org/
2
http://wwwsearch.sourceforge.net/mechanize/
3
http://www.crummy.com/software/BeautifulSoup/
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Using RPy2 to Extract Data in Python and Use in R
Brief code walkthrough.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Constructing an R Data Type
To create a dataframe,
1 I keep a list for each variable and store the lists in a dictionary,
using the variable name as a key.
2 In a new dictionary, coerce the lists to R vectors
1 dataframe = {}
2 dataframe[’username ’] = R.StrVector (( data[’username ’]))
3 dataframe[’join_date ’] = R.StrVector (( data[’join_date ’]))
4 dataframe[’posts ’] = R.IntVector (( data[’posts ’]))
5 dataframe[’last_visit ’] = R.StrVector (( data[’last_visit ’
]))
6 dataframe[’birthday ’] = R.StrVector (( data[’birthday ’]))
7 dataframe[’age’] = R.IntVector (( data[’age ’]))
8 MyRDataframe = R.DataFrame(dataframe)
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Constructing an R Data Type
Caveat 1 Python may not keep your columns in the same order in
which they were specified because dictionaries do not have a
concept of order. See the documentation for a workaround using
an ordered dictionary type.
Caveat 2 Subsetting and indexing in RPy2 is not as trivial as it is
in R. In my code, I give a trivial example. For more information,
see the documentation.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Constructing an R Data Type
We can also create a matrix:
1 M = R.r.matrix(robjects.IntVector(range (10)), nrow =5)
Note that constructing objects in R looks vaguely similar to how it
is done in native R.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Plotting a Histogram
Let’s see the distribution of ages of offtopic.com users. It is best
to print the plot to a file. If we do not, the graphic relies on X11
(on Linux) and will disappear when the script ends.
1 #Plot ages
2 hist = R.r.hist
3 R.r.png(’~/Desktop/hist.png ’,width =300 , height =300)
4 hist( MyRDataframe [2], main="" xlab="", br =20)
5 R.r[’dev.off’]()
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Plotting a Bivariate Relationship and Custom R Functions
Next, let’s do some data management and access our own R
function from Python.
1 activity = R.r(r’’’
2 function(x, y) {
3 if (is.na(x) | is.na(y)) NA
4 else {
5 date .1 <- strptime(x, "%m-%d-%Y")
6 date .2 <- strptime(y, "%m-%d-%Y")
7 difftime(date.1, date.2, units=’days ’)
8 }
9 }’’’)
10 as_numeric = R.r[’as.numeric ’]
11 days_active = activity(dataframe[’join_date ’], dataframe[
’last_visit ’])
12 days_active = R.r.abs(days_active)
13 days_active = as_numeric(days_active)
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Plotting a Bivariate Relationship and Custom R Function
1 plot = R.r.plot
2 R.r.png(’~/Desktop/plot.png ’, width =300 , height =300)
3 plot(days_active , MyRDataframe [3], pch=’.’)
4 R.r[’dev.off’]()
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Linear Models
Let’s assume a linear relationship and try to predict posts from
number of days active.
1 fit = R.r.lm(fm1a)
2 summary = R.r.summary(fit)
3 #Now we can program with the results of fit in Python .
4 print summary [3] # Display an R-like summary
5 # element 3 is the R-like summary .
6 # elements 0 and 1 are the a and b coefficients
respectively .
7 intercept = summary [3][0]
8 slope = summary [3][1]
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Wrap Up of RPy2
RPy2 allows the user to easily access R from Python, with just a
few hoops to jump through. Due to time constraints, it is not
possible to discuss this package is more depth. You are encouraged
to check out the excellent documentation:
http://rpy.sourceforge.net/rpy2 documentation.html
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Alternatives to RPy2
There are some other Python packages that allow use of R in
Python:
1 PypeR is a brand new package discussed in the Journal of
Statistical Software4,
2 pyRServe for accessing R running RServe from Python.
4
http://www.jstatsoft.org/v35/c02/paper
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Python’s Natural Language Toolkit (NLTK)
Another reason to call R from Python is to use Python’s awesome NLTK
module. Unfortunately, due to time constraints, I can only list what NLTK is
capable of.
several corpora and labeled corpora for classification tasks and training.
several classifiers including Naive Bayes, and Latent Dirichlet Allocation.
WordNet interface
Part of speech taggers
Word stemmers
Tokenization, entity washing, stop word lists and dictionaries for parsing.
Feature grammar parsing.
and much more
Then, data can be passed to R for analysis and visualization.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Python’s Natural Language Toolkit (NLTK)
To learn more about NLTK:
Natural Language Processing with Python:
Analyzing Text with the Natural Language
Toolkit
by Steven Bird, Ewan Klein, and Edward Loper
The NLTK community at http://www.nltk.org is huge and a
great resource.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Duncan Temple Lang to the Rescue
In my opinion, Python has a cleaner syntax for parsing text and
HTML. Duncan Temple Lang’s XML package provides some HTML
parsing in R:
Can read an HTML table and convert the data to a dataframe
(just like we did here).
Can parse and create HTML and XML documents.
Contains an XPath interpreter.
Many web mining R packages suggest or require this package
(RAmazonS3, RHTMLForms, RNYTimes).
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
RCurl and RJSON
R provides other packages that provide functionality that is
typically explored more in Python and other scripting languages.
RCurl brings the power of Unix curl to R and allows R to act as a
web client. Using RCurl it may be possible to do a lot of the stuff
that twill and mechanize do using getForm() and postForm().
RJSON allows us to parse JSON data (common with REST APIs
and other APIs) into R data structures. JSON is essentially just a
bunch of name/value pairs similar to a Python dictionary. The
Twitter API, which we will see shortly, can spit out JSON data.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
A Parting Example
One can perform a sentiment analysis on the two gubernatorial candidates to see if we
can do a better job than the polls, assuming Twitter is representative of likely voters.
The code snippet below loops through hashtags and pages of results to extract results
from the Twitter Search API.
1 library(RCurl)
2 library(rjson)
3 hashtags <- c("%23 cagov", "%23 cagovdebate", "jerry+brown"
, "meg+whitman", "brown+whitman", "%23 cadebate")
4 for (i in 1: length(hashtags)) {
5 for (j in 1:10) {
6 text <- getURL(paste(c("http://search.twitter.com
/search.json?q=", hashtags[i],"&rpp =100&since
=2010 -10 -13&until =2010 -10 -13&page=", j),
collapse=’’))
7 entry <- fromJSON(text)
8 print(entry) #DO SOMETHING ... here we just print
9 }
10 }
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
A Parting Example
Short demo.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Conclusion
In conclusion...
1 Calling R from Python gives the best of both worlds by
combining a fully functional, beautiful programming language
with an amazing modeling, analysis and visualization system.
2 R developers are making inroads at providing tools for not
only text mining and NLP, but also for extracting and
managing text and web data.
3 The challenge to R developers is to remove that “dirty”
feeling of using R for text mining, with some of its clunky
data structures.
4 The previous bullet is a difficult challenge, but will greatly
reduce the gap between both technologies.
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Keep in Touch!
My email: ryan@stat.ucla.edu
My blog: http://www.bytemining.com
Follow me on Twitter: @datajunkie
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
The End
Questions?
Source: http://www.r-chart.com/2010/10/hadley-on-postage-stamp.html, 10/19/10
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group
Text Mining: Extracting Data Some Interesting Packages for R Conclusion
Thank You!
Ryan R. Rosario
Accessing R from Python using RPy2 Los Angeles R Users’ Group

More Related Content

What's hot

RF Module Design - [Chapter 4] Transceiver Architecture
RF Module Design - [Chapter 4] Transceiver ArchitectureRF Module Design - [Chapter 4] Transceiver Architecture
RF Module Design - [Chapter 4] Transceiver ArchitectureSimen Li
 
Implementation of Wireless Channel Model in MATLAB: Simplified
Implementation of Wireless Channel Model in MATLAB: SimplifiedImplementation of Wireless Channel Model in MATLAB: Simplified
Implementation of Wireless Channel Model in MATLAB: SimplifiedRosdiadee Nordin
 
Deep Multi-agent Reinforcement Learning
Deep Multi-agent Reinforcement LearningDeep Multi-agent Reinforcement Learning
Deep Multi-agent Reinforcement Learningdeawoo Kim
 
Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)홍배 김
 
57277345 cours-matlab
57277345 cours-matlab57277345 cours-matlab
57277345 cours-matlabgeniem1
 
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기NAVER Engineering
 
Image super-resolution via iterative refinement.pptx
Image super-resolution via iterative refinement.pptxImage super-resolution via iterative refinement.pptx
Image super-resolution via iterative refinement.pptxxiaoyiWang32
 
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Anne Nicolas
 
MIMO Channel Capacity
MIMO Channel CapacityMIMO Channel Capacity
MIMO Channel CapacityPei-Che Chang
 
DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit) DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit) ymtech
 
RF Module Design - [Chapter 5] Low Noise Amplifier
RF Module Design - [Chapter 5]  Low Noise AmplifierRF Module Design - [Chapter 5]  Low Noise Amplifier
RF Module Design - [Chapter 5] Low Noise AmplifierSimen Li
 
生成對抗模式 GAN 的介紹
生成對抗模式 GAN 的介紹生成對抗模式 GAN 的介紹
生成對抗模式 GAN 的介紹Yen-lung Tsai
 
2.4 code division multiple access
2.4   code division multiple access2.4   code division multiple access
2.4 code division multiple accessJAIGANESH SEKAR
 
ZYNQ BRAM Implementation
ZYNQ BRAM ImplementationZYNQ BRAM Implementation
ZYNQ BRAM ImplementationVincent Claes
 

What's hot (20)

RF Module Design - [Chapter 4] Transceiver Architecture
RF Module Design - [Chapter 4] Transceiver ArchitectureRF Module Design - [Chapter 4] Transceiver Architecture
RF Module Design - [Chapter 4] Transceiver Architecture
 
Implementation of Wireless Channel Model in MATLAB: Simplified
Implementation of Wireless Channel Model in MATLAB: SimplifiedImplementation of Wireless Channel Model in MATLAB: Simplified
Implementation of Wireless Channel Model in MATLAB: Simplified
 
Deep Multi-agent Reinforcement Learning
Deep Multi-agent Reinforcement LearningDeep Multi-agent Reinforcement Learning
Deep Multi-agent Reinforcement Learning
 
Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)
 
PA linearity
PA linearityPA linearity
PA linearity
 
Mimo dr. morsi
Mimo  dr. morsiMimo  dr. morsi
Mimo dr. morsi
 
DDR Desense Issue
DDR Desense IssueDDR Desense Issue
DDR Desense Issue
 
57277345 cours-matlab
57277345 cours-matlab57277345 cours-matlab
57277345 cours-matlab
 
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
 
Image super-resolution via iterative refinement.pptx
Image super-resolution via iterative refinement.pptxImage super-resolution via iterative refinement.pptx
Image super-resolution via iterative refinement.pptx
 
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
 
MIMO Channel Capacity
MIMO Channel CapacityMIMO Channel Capacity
MIMO Channel Capacity
 
Delphi 6 básico
Delphi 6 básicoDelphi 6 básico
Delphi 6 básico
 
DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit) DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit)
 
Cisco data center training for ibm
Cisco data center training for ibmCisco data center training for ibm
Cisco data center training for ibm
 
RF Module Design - [Chapter 5] Low Noise Amplifier
RF Module Design - [Chapter 5]  Low Noise AmplifierRF Module Design - [Chapter 5]  Low Noise Amplifier
RF Module Design - [Chapter 5] Low Noise Amplifier
 
生成對抗模式 GAN 的介紹
生成對抗模式 GAN 的介紹生成對抗模式 GAN 的介紹
生成對抗模式 GAN 的介紹
 
2.4 code division multiple access
2.4   code division multiple access2.4   code division multiple access
2.4 code division multiple access
 
The pocl Kernel Compiler
The pocl Kernel CompilerThe pocl Kernel Compiler
The pocl Kernel Compiler
 
ZYNQ BRAM Implementation
ZYNQ BRAM ImplementationZYNQ BRAM Implementation
ZYNQ BRAM Implementation
 

Viewers also liked

NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...Ryan Rosario
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Ryan Rosario
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Ryan Rosario
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RJose Luis Lopez Pino
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsEdwin de Jonge
 
Parallel Computing with R
Parallel Computing with RParallel Computing with R
Parallel Computing with RAbhirup Mallik
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsAjay Ohri
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
Python for R Users
Python for R UsersPython for R Users
Python for R UsersAjay Ohri
 

Viewers also liked (10)

NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Parallel Computing with R
Parallel Computing with RParallel Computing with R
Parallel Computing with R
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
 

Similar to Accessing R from Python using RPy2

Accessing r from python using r py2
Accessing r from python using r py2Accessing r from python using r py2
Accessing r from python using r py2Wisdio
 
R Programming Language
R Programming LanguageR Programming Language
R Programming LanguageNareshKarela1
 
R vs python. Which one is best for data science
R vs python. Which one is best for data scienceR vs python. Which one is best for data science
R vs python. Which one is best for data scienceStat Analytica
 
Learning R via Python…or the other way around
Learning R via Python…or the other way aroundLearning R via Python…or the other way around
Learning R via Python…or the other way aroundSid Xing
 
R basics for MBA Students[1].pptx
R basics for MBA Students[1].pptxR basics for MBA Students[1].pptx
R basics for MBA Students[1].pptxrajalakshmi5921
 
Unit1_Introduction to R.pdf
Unit1_Introduction to R.pdfUnit1_Introduction to R.pdf
Unit1_Introduction to R.pdfMDDidarulAlam15
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R ProgrammingIRJET Journal
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
 
1_Introduction.pptx
1_Introduction.pptx1_Introduction.pptx
1_Introduction.pptxranapoonam1
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programminghemasri56
 
Which programming language to learn R or Python - MeasureCamp XII
Which programming language to learn R or Python - MeasureCamp XIIWhich programming language to learn R or Python - MeasureCamp XII
Which programming language to learn R or Python - MeasureCamp XIIMaggie Petrova
 
PPT - Introduction to R.pdf
PPT - Introduction to R.pdfPPT - Introduction to R.pdf
PPT - Introduction to R.pdfssuser65af26
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Introduction to R.pptx
Introduction to R.pptxIntroduction to R.pptx
Introduction to R.pptxAvinabaHandson
 

Similar to Accessing R from Python using RPy2 (20)

Accessing r from python using r py2
Accessing r from python using r py2Accessing r from python using r py2
Accessing r from python using r py2
 
The Great Debate.pdf
The Great Debate.pdfThe Great Debate.pdf
The Great Debate.pdf
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
 
R vs python. Which one is best for data science
R vs python. Which one is best for data scienceR vs python. Which one is best for data science
R vs python. Which one is best for data science
 
Learning R via Python…or the other way around
Learning R via Python…or the other way aroundLearning R via Python…or the other way around
Learning R via Python…or the other way around
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
R basics for MBA Students[1].pptx
R basics for MBA Students[1].pptxR basics for MBA Students[1].pptx
R basics for MBA Students[1].pptx
 
Unit1_Introduction to R.pdf
Unit1_Introduction to R.pdfUnit1_Introduction to R.pdf
Unit1_Introduction to R.pdf
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
1_Introduction.pptx
1_Introduction.pptx1_Introduction.pptx
1_Introduction.pptx
 
Python and r in data science
Python and r in data sciencePython and r in data science
Python and r in data science
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
Reason To learn & use r
Reason To learn & use rReason To learn & use r
Reason To learn & use r
 
Data analytics with R
Data analytics with RData analytics with R
Data analytics with R
 
Which programming language to learn R or Python - MeasureCamp XII
Which programming language to learn R or Python - MeasureCamp XIIWhich programming language to learn R or Python - MeasureCamp XII
Which programming language to learn R or Python - MeasureCamp XII
 
PPT - Introduction to R.pdf
PPT - Introduction to R.pdfPPT - Introduction to R.pdf
PPT - Introduction to R.pdf
 
Chapter I.pptx
Chapter I.pptxChapter I.pptx
Chapter I.pptx
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Introduction to R.pptx
Introduction to R.pptxIntroduction to R.pptx
Introduction to R.pptx
 

Recently uploaded

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 

Recently uploaded (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Accessing R from Python using RPy2

  • 1. Los Angeles R Users’ Group Accessing R from Python using RPy2 Ryan R. Rosario October 19, 2010 Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 2. Text Mining: Extracting Data Some Interesting Packages for R Conclusion What is Python? An interpreted, object-oriented, high-level programming language with... 1 dynamic typing, yet strongly typed 2 an interactive shell for real time testing of code 3 good rapid application development support 4 extensive scripting capabilities Python is similar in purpose to Perl, with a more well-defined syntax, and is more readable Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 3. Text Mining: Extracting Data Some Interesting Packages for R Conclusion What is RPy2? RPy2 is a simple interface to R from Python. RSPython was the first interface to R from Python (and Python from R) was RSPython by Duncan Temple Lang. Last updated in 2005. There is also a version for Perl called RSPerl. RPy focused on providing a simple and robust interface to R from the Python programming language. RPy2 is a rewrite of RPy, providing for a more user friendly experience and expanded capabilities. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 4. Text Mining: Extracting Data Some Interesting Packages for R Conclusion So, Why use Python instead of with R? In the Twitter #rstats community, the consensus is that it is simply PREFERENCE. Here are some other reasons. 1 primitive data types in Python are more flexible for text mining. tuples for paired, associated data. (no equivalent in R) lists are similar to vectors dictionaries provide associative arrays; a list with named entries can provide an equivalent in R. pleasant string handling (though stringr helps significantly in R) Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 5. Text Mining: Extracting Data Some Interesting Packages for R Conclusion So, Why use Python instead of with R? In the Twitter #rstats community, the consensus is that it is simply PREFERENCE. Here are some other reasons. 2 handles the unexpected better flexible exceptions; R offers tryCatch. 3 Python has a much larger community with very diverse interests, and text and web mining are big focuses. 4 state of art algorithms for text mining typically hit Python or Perl before they hit R. 5 parallelism does not rely on R’s parallelism packages. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 6. Text Mining: Extracting Data Some Interesting Packages for R Conclusion So, Why use Python instead of with R? In the Twitter #rstats community, the consensus is that it is simply PREFERENCE. Here are some other reasons. 6 robust regular expression engine. 7 robust processing of HTML (BeautifulSoup), XML (xml), and JSON (simplejson). 8 web crawling and spidering; dealing with forms, cookies etc. (mechanize and twill). 9 more access to datastores; not only MySQL and PostgreSQL but also Redis and MongoDB, with more to come. 10 more natural object-oriented features. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 7. Text Mining: Extracting Data Some Interesting Packages for R Conclusion And, What are the Advantages of R over Python? In the Twitter #rstats community, the consensus is that it is simply PREFERENCE. Here are some other reasons. 1 Data frames; closest equivalent in Python is a dictionary, whose values are the lists of values for each variable, or a list of tuples, each tuple a row. 2 obviously, great analysis and visualization routines. 3 some genius developers have made some web mining tasks easier in R than in Python. 4 Factors; Python does not even have an enum primitive data type. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 8. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Using RPy2 to Extract Data in Python and Use in R The site offtopic.com is the largest standalone discussion forum site on the Web. Discussion has no restriction and topics include everything under the sun including topics NSFW. It’s massive size and diversity provide a lot of opportunities for text mining, NLP and social network analysis. The questions: What are the ages of the participants? Is the number of posts a user makes related to the number of days he/she has been an active member? Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 9. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Using RPy2 to Extract Data in Python and Use in R The offtopic.com member list: Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 10. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Using RPy2 to Extract Data in Python and Use in R Let’s use Python to perform the following tasks: 1 Log in to the website using a username and password on an HTML form (this is required). 2 Navigate to the list of all members (a feature provided by vBulletin). 3 Scrape the content of all of the members on each page. 4 Repeat the previous step for all 1,195 pages in the list. 5 Parse the HTML tables to get the data and put it into an R data frame. and then use R (from Python) to perform the rest: 1 Create some pretty graphics and fit a linear model. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 11. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Using RPy2 to Extract Data in Python and Use in R The specifics for you to explore: 1 I fill the web form and submit it using twill1 a wrapper to the web browsing Python module mechanize2. 2 I extract the correct HTML table; the one that contains the list of members, using a regular expression. 3 I navigate through the HTML table and extract the data using BeautifulSoup3, an HTML parsing library. 1 http://twill.idyll.org/ 2 http://wwwsearch.sourceforge.net/mechanize/ 3 http://www.crummy.com/software/BeautifulSoup/ Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 12. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Using RPy2 to Extract Data in Python and Use in R Brief code walkthrough. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 13. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Constructing an R Data Type To create a dataframe, 1 I keep a list for each variable and store the lists in a dictionary, using the variable name as a key. 2 In a new dictionary, coerce the lists to R vectors 1 dataframe = {} 2 dataframe[’username ’] = R.StrVector (( data[’username ’])) 3 dataframe[’join_date ’] = R.StrVector (( data[’join_date ’])) 4 dataframe[’posts ’] = R.IntVector (( data[’posts ’])) 5 dataframe[’last_visit ’] = R.StrVector (( data[’last_visit ’ ])) 6 dataframe[’birthday ’] = R.StrVector (( data[’birthday ’])) 7 dataframe[’age’] = R.IntVector (( data[’age ’])) 8 MyRDataframe = R.DataFrame(dataframe) Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 14. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Constructing an R Data Type Caveat 1 Python may not keep your columns in the same order in which they were specified because dictionaries do not have a concept of order. See the documentation for a workaround using an ordered dictionary type. Caveat 2 Subsetting and indexing in RPy2 is not as trivial as it is in R. In my code, I give a trivial example. For more information, see the documentation. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 15. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Constructing an R Data Type We can also create a matrix: 1 M = R.r.matrix(robjects.IntVector(range (10)), nrow =5) Note that constructing objects in R looks vaguely similar to how it is done in native R. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 16. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Plotting a Histogram Let’s see the distribution of ages of offtopic.com users. It is best to print the plot to a file. If we do not, the graphic relies on X11 (on Linux) and will disappear when the script ends. 1 #Plot ages 2 hist = R.r.hist 3 R.r.png(’~/Desktop/hist.png ’,width =300 , height =300) 4 hist( MyRDataframe [2], main="" xlab="", br =20) 5 R.r[’dev.off’]() Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 17. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Plotting a Bivariate Relationship and Custom R Functions Next, let’s do some data management and access our own R function from Python. 1 activity = R.r(r’’’ 2 function(x, y) { 3 if (is.na(x) | is.na(y)) NA 4 else { 5 date .1 <- strptime(x, "%m-%d-%Y") 6 date .2 <- strptime(y, "%m-%d-%Y") 7 difftime(date.1, date.2, units=’days ’) 8 } 9 }’’’) 10 as_numeric = R.r[’as.numeric ’] 11 days_active = activity(dataframe[’join_date ’], dataframe[ ’last_visit ’]) 12 days_active = R.r.abs(days_active) 13 days_active = as_numeric(days_active) Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 18. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Plotting a Bivariate Relationship and Custom R Function 1 plot = R.r.plot 2 R.r.png(’~/Desktop/plot.png ’, width =300 , height =300) 3 plot(days_active , MyRDataframe [3], pch=’.’) 4 R.r[’dev.off’]() Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 19. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Linear Models Let’s assume a linear relationship and try to predict posts from number of days active. 1 fit = R.r.lm(fm1a) 2 summary = R.r.summary(fit) 3 #Now we can program with the results of fit in Python . 4 print summary [3] # Display an R-like summary 5 # element 3 is the R-like summary . 6 # elements 0 and 1 are the a and b coefficients respectively . 7 intercept = summary [3][0] 8 slope = summary [3][1] Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 20. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Wrap Up of RPy2 RPy2 allows the user to easily access R from Python, with just a few hoops to jump through. Due to time constraints, it is not possible to discuss this package is more depth. You are encouraged to check out the excellent documentation: http://rpy.sourceforge.net/rpy2 documentation.html Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 21. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Alternatives to RPy2 There are some other Python packages that allow use of R in Python: 1 PypeR is a brand new package discussed in the Journal of Statistical Software4, 2 pyRServe for accessing R running RServe from Python. 4 http://www.jstatsoft.org/v35/c02/paper Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 22. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Python’s Natural Language Toolkit (NLTK) Another reason to call R from Python is to use Python’s awesome NLTK module. Unfortunately, due to time constraints, I can only list what NLTK is capable of. several corpora and labeled corpora for classification tasks and training. several classifiers including Naive Bayes, and Latent Dirichlet Allocation. WordNet interface Part of speech taggers Word stemmers Tokenization, entity washing, stop word lists and dictionaries for parsing. Feature grammar parsing. and much more Then, data can be passed to R for analysis and visualization. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 23. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Python’s Natural Language Toolkit (NLTK) To learn more about NLTK: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper The NLTK community at http://www.nltk.org is huge and a great resource. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 24. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Duncan Temple Lang to the Rescue In my opinion, Python has a cleaner syntax for parsing text and HTML. Duncan Temple Lang’s XML package provides some HTML parsing in R: Can read an HTML table and convert the data to a dataframe (just like we did here). Can parse and create HTML and XML documents. Contains an XPath interpreter. Many web mining R packages suggest or require this package (RAmazonS3, RHTMLForms, RNYTimes). Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 25. Text Mining: Extracting Data Some Interesting Packages for R Conclusion RCurl and RJSON R provides other packages that provide functionality that is typically explored more in Python and other scripting languages. RCurl brings the power of Unix curl to R and allows R to act as a web client. Using RCurl it may be possible to do a lot of the stuff that twill and mechanize do using getForm() and postForm(). RJSON allows us to parse JSON data (common with REST APIs and other APIs) into R data structures. JSON is essentially just a bunch of name/value pairs similar to a Python dictionary. The Twitter API, which we will see shortly, can spit out JSON data. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 26. Text Mining: Extracting Data Some Interesting Packages for R Conclusion A Parting Example One can perform a sentiment analysis on the two gubernatorial candidates to see if we can do a better job than the polls, assuming Twitter is representative of likely voters. The code snippet below loops through hashtags and pages of results to extract results from the Twitter Search API. 1 library(RCurl) 2 library(rjson) 3 hashtags <- c("%23 cagov", "%23 cagovdebate", "jerry+brown" , "meg+whitman", "brown+whitman", "%23 cadebate") 4 for (i in 1: length(hashtags)) { 5 for (j in 1:10) { 6 text <- getURL(paste(c("http://search.twitter.com /search.json?q=", hashtags[i],"&rpp =100&since =2010 -10 -13&until =2010 -10 -13&page=", j), collapse=’’)) 7 entry <- fromJSON(text) 8 print(entry) #DO SOMETHING ... here we just print 9 } 10 } Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 27. Text Mining: Extracting Data Some Interesting Packages for R Conclusion A Parting Example Short demo. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 28. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Conclusion In conclusion... 1 Calling R from Python gives the best of both worlds by combining a fully functional, beautiful programming language with an amazing modeling, analysis and visualization system. 2 R developers are making inroads at providing tools for not only text mining and NLP, but also for extracting and managing text and web data. 3 The challenge to R developers is to remove that “dirty” feeling of using R for text mining, with some of its clunky data structures. 4 The previous bullet is a difficult challenge, but will greatly reduce the gap between both technologies. Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 29. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Keep in Touch! My email: ryan@stat.ucla.edu My blog: http://www.bytemining.com Follow me on Twitter: @datajunkie Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 30. Text Mining: Extracting Data Some Interesting Packages for R Conclusion The End Questions? Source: http://www.r-chart.com/2010/10/hadley-on-postage-stamp.html, 10/19/10 Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group
  • 31. Text Mining: Extracting Data Some Interesting Packages for R Conclusion Thank You! Ryan R. Rosario Accessing R from Python using RPy2 Los Angeles R Users’ Group