My recent talk at the Birmingham R User Meeting (BRUM) was on Big Data in R. Different people have different definitions of big data. For this talk, my definition of big data is:
“Data collections big enough to require you to change the way you store and process them.” - Andy Pryke
I discuss the factors which can limit the size of data analysed using R, and a variety of ways to address them: moving data structures out of RAM and onto disk, using in-database processing/analytics, and harnessing the power of Hadoop for massively parallel R.
1. Big Data,
Bigger Data
&
Big R Data
Birmingham R Users Meeting
23rd April 2013
Andy Pryke
Andy@The-Data-Mine.co.uk / @AndyPryke
2. My Bias…
www.the-data-mine.co.uk
I work in commercial data mining, data analysis and data visualisation.
Background in computing and artificial intelligence.
Use R to write programs which analyse data.
3. What is Big Data?
Depends who you ask.
Answers are often “too big to…”
…load into memory
…store on a hard drive
…fit in a standard database
Plus
“Fast changing”
Not just relational
4. My “Big Data” Definition
“Data collections big
enough to require you to
change the way you
store and process them.”
- Andy Pryke
5. Data Size Limits in R
Standard R packages use a single thread, with data held in memory (RAM).
help("Memory-limits")
• Vectors limited to 2 Billion items
• Memory limit of ~128Tb
Servers with 1Tb+ memory are available
• Also, Amazon EC2 servers up to 244Gb
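A quick base-R sanity check of these limits (a sketch; exact byte counts vary slightly by platform): every numeric element costs 8 bytes, so RAM usually runs out long before the vector-length limit does.

```r
## Each double-precision number costs 8 bytes, so a million of them
## needs roughly 8 MB of RAM -- memory, not the length limit, bites first.
x <- numeric(1e6)
print(object.size(x))        # approximately 8 MB

## The classic vector-length ceiling is 2^31 - 1 (~2 billion) elements
print(.Machine$integer.max)  # 2147483647
```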
6. Overview
• Problems using R with Big Data
• Processing data on disk
• Hadoop for parallel computation and Big
Data storage / access
• “In Database” analysis
• What next for Birmingham R User Group?
7. Background: R matrix class
“matrix”
- Built in (package base)
- Stored in RAM
- “Dense” (takes up memory to store zero values)
Can be replaced by…
8. Sparse / Disk Based Matrices
• Matrix – package Matrix. Sparse. In RAM
• big.matrix – packages bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
• Analysis – packages irlba, bigalgebra, biganalytics (R-Forge list), etc.
More details?
“Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
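A minimal sketch of the dense-vs-sparse trade-off, using only the Matrix package (which ships with standard R distributions; bigmemory works along similar lines but backs the data with a file on disk):

```r
library(Matrix)

## A dense 1000 x 1000 matrix stores all one million cells, zeros included
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[1, 1] <- 1

## The sparse representation stores only the non-zero entries
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense))   # roughly 8 MB
print(object.size(sparse))  # a few KB
```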
9. Commercial Versions of R
Revolution Analytics have specialised versions of R for parallel execution & big data.
I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages.
Plenty more info here.
10. Background: Hadoop
• Parallel data processing environment
based on Google’s “MapReduce” model
• “Map” – divide the data up and send it to multiple nodes for processing
• “Reduce” – combine the results
Plus:
• Hadoop Distributed File System (HDFS)
• HBase – Distributed database like
Google’s BigTable
11. RHadoop – Revolution Analytics
Packages: rmr2, rhbase, rhdfs
• Example code using RMR (R Map-Reduce)
• R and Hadoop – Step by Step Tutorials
• Install and Demo RHadoop (Google for
more of these online)
• Data Hacking with RHadoop
12. RHadoop Example: Functions & Output
wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1)  ## each word occurs once
}
## Output: In, 1
##         the, 1
##         beginning, 1
##         ...

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}
## Output: the, 2345
##         word, 987
##         beginning, 123
##         ...

wordcount <- function(input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = "text",
            map = wc.map,
            reduce = wc.reduce,
            combine = TRUE)
}
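The same map/reduce logic can be imitated in plain R without a cluster, which makes the model easier to see (a sketch of the idea only; the variable names are mine, and tapply stands in for Hadoop's shuffle-and-reduce step):

```r
## "Map": split each line into words and emit a (word, 1) pair per word
lines <- c("In the beginning", "the word")
words <- unlist(strsplit(lines, split = " "))
ones  <- rep(1, length(words))

## "Reduce": group the pairs by word and sum the counts
counts <- tapply(ones, words, sum)
print(counts)  # "the" appears twice, every other word once
```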
13. Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue
- an easy way to distribute CPU-intensive work
- uses Amazon’s Elastic Map Reduce service, which costs money
- not designed for big data, but easy and fun
Example follows…
14. segue Example
# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()

# Average each set of 999 numbers (ignoring the NA)
> outputLocal <- lapply(myList, mean, na.rm = T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = T)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29

## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE

# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
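segue's getMyTestList() isn't available without the package, but the shape of the example is easy to reproduce locally: base R's parallel package offers the same lapply-style pattern across local cores instead of an EMR cluster (a sketch, with my own stand-in for the test data):

```r
library(parallel)  # included with base R

## Stand-in for segue's getMyTestList(): 10 sets of 999 numbers + 1 NA
myList <- replicate(10, c(rnorm(999), NA), simplify = FALSE)

## Serial version
outputLocal <- lapply(myList, mean, na.rm = TRUE)

## parLapply() plays the role emrlapply() plays in segue,
## but across local cores rather than Amazon EMR
cl <- makeCluster(2)
outputPar <- parLapply(cl, myList, mean, na.rm = TRUE)
stopCluster(cl)

print(all.equal(outputPar, outputLocal))  # [1] TRUE
```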
15. Oracle R Connector for Hadoop
• Integrates with Oracle Db, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
• Map-Reduce is very similar to the rmr example
• Documentation lists examples for Linear Regression, k-means and working with graphs, amongst others
• Introduction to Oracle R Connector for Hadoop
• Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)
16. Teradata Integration
Package: teradataR
• Teradata offer in-database analytics, accessible through R
• These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions
17. What Next?
I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.
“R” you interested?