My recent talk at the Birmingham R User Meeting (BRUM) was on Big Data in R. Different people have different definitions of big data. For this talk, my definition of big data is:
“Data collections big enough to require you to change the way you store and process them.” - Andy Pryke
I discuss the factors which can limit the size of data analysed using R, and a variety of ways to address them: moving data structures out of RAM and onto disk, using in-database processing/analytics, and harnessing the power of Hadoop for massively parallel R.
1. Big Data,
Bigger Data
&
Big R Data
Birmingham R Users Meeting
23rd April 2013
Andy Pryke
Andy@The-Data-Mine.co.uk / @AndyPryke
2. My Bias…
www.the-data-mine.co.uk
I work in commercial data mining, data analysis and data visualisation.
Background in computing and artificial intelligence.
Use R to write programs which analyse data.
3. What is Big Data?
Depends who you ask.
Answers are often “too big to…”
…load into memory
…store on a hard drive
…fit in a standard database
Plus
“Fast changing”
Not just relational
4. My “Big Data” Definition
“Data collections big
enough to require you to
change the way you
store and process them.”
- Andy Pryke
5. Data Size Limits in R
Standard R packages use a single thread, with data held in memory (RAM).
help("Memory-limits")
• Vectors limited to 2 Billion items
• Memory limit of ~128Tb
Servers with 1Tb+ memory are available
• Also, Amazon EC2 servers up to 244Gb
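A quick base-R sanity check of these limits (a sketch; exact byte counts vary slightly by platform): every numeric element costs 8 bytes, so RAM usually runs out long before the vector-length limit does.

```r
## Each double-precision number costs 8 bytes, so a million of them
## needs roughly 8 MB of RAM -- memory, not the length limit, bites first.
x <- numeric(1e6)
print(object.size(x))        # approximately 8 MB

## The classic vector-length ceiling is 2^31 - 1 (~2 billion) elements
print(.Machine$integer.max)  # 2147483647
```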
6. Overview
• Problems using R with Big Data
• Processing data on disk
• Hadoop for parallel computation and Big
Data storage / access
• “In Database” analysis
• What next for Birmingham R User Group?
7. Background: R matrix class
“matrix”
- Built in (package base)
- Stored in RAM
- “Dense” (takes up memory to store zero values)
Can be replaced by…
8. Sparse / Disk Based Matrices
• Matrix – package Matrix. Sparse. In RAM
• big.matrix – packages bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
• Analysis – packages irlba, bigalgebra, biganalytics (R-Forge list), etc.
More details?
“Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
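A minimal sketch of the dense-vs-sparse trade-off, using only the Matrix package (which ships with standard R distributions; bigmemory works along similar lines but backs the data with a file on disk):

```r
library(Matrix)

## A dense 1000 x 1000 matrix stores all one million cells, zeros included
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[1, 1] <- 1

## The sparse representation stores only the non-zero entries
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense))   # roughly 8 MB
print(object.size(sparse))  # a few KB
```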
9. Commercial Versions of R
Revolution Analytics have specialised versions of R for parallel execution & big data.
I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages.
Plenty more info here.
10. Background: Hadoop
• Parallel data processing environment
based on Google’s “MapReduce” model
• “Map” – divide the data up and send it to multiple nodes for processing
• “Reduce” – combine the results
Plus:
• Hadoop Distributed File System (HDFS)
• HBase – Distributed database like
Google’s BigTable
11. RHadoop – Revolution Analytics
Packages: rmr2, rhbase, rhdfs
• Example code using RMR (R Map-Reduce)
• R and Hadoop – Step by Step Tutorials
• Install and Demo RHadoop (Google for
more of these online)
• Data Hacking with RHadoop
12. RHadoop Example: Functions & Output
wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1)  ## each word occurs once
}
## Output: In, 1
##         the, 1
##         beginning, 1
##         ...

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}
## Output: the, 2345
##         word, 987
##         beginning, 123
##         ...

wordcount <- function(input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = "text",
            map = wc.map,
            reduce = wc.reduce,
            combine = TRUE)
}
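The same map/reduce logic can be imitated in plain R without a cluster, which makes the model easier to see (a sketch of the idea only; the variable names are mine, and tapply stands in for Hadoop's shuffle-and-reduce step):

```r
## "Map": split each line into words and emit a (word, 1) pair per word
lines <- c("In the beginning", "the word")
words <- unlist(strsplit(lines, split = " "))
ones  <- rep(1, length(words))

## "Reduce": group the pairs by word and sum the counts
counts <- tapply(ones, words, sum)
print(counts)  # "the" appears twice, every other word once
```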
13. Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue
- an easy way to distribute CPU-intensive work
- uses Amazon’s Elastic Map Reduce service, which costs money
- not designed for big data, but easy and fun
Example follows…
14. segue Example
# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()

# Average each set of 999 numbers (ignoring the NA)
> outputLocal <- lapply(myList, mean, na.rm = T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = T)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29

## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE

# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
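segue's getMyTestList() isn't available without the package, but the shape of the example is easy to reproduce locally: base R's parallel package offers the same lapply-style pattern across local cores instead of an EMR cluster (a sketch, with my own stand-in for the test data):

```r
library(parallel)  # included with base R

## Stand-in for segue's getMyTestList(): 10 sets of 999 numbers + 1 NA
myList <- replicate(10, c(rnorm(999), NA), simplify = FALSE)

## Serial version
outputLocal <- lapply(myList, mean, na.rm = TRUE)

## parLapply() plays the role emrlapply() plays in segue,
## but across local cores rather than Amazon EMR
cl <- makeCluster(2)
outputPar <- parLapply(cl, myList, mean, na.rm = TRUE)
stopCluster(cl)

print(all.equal(outputPar, outputLocal))  # [1] TRUE
```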
15. Oracle R Connector for Hadoop
• Integrates with Oracle Db, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
• Map-Reduce is very similar to the rmr example
• Documentation lists examples for Linear Regression, k-means and working with graphs, amongst others
• Introduction to Oracle R Connector for Hadoop
• Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)
16. Teradata Integration
Package: teradataR
• Teradata offer in-database analytics, accessible through R
• These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions
17. What Next?
I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.
“R” you interested?