Big Data, Bigger Data & Big R Data

My recent talk at the Birmingham R User Meeting (BRUM) was on Big Data in R. Different people have different definitions of big data. For this talk, my definition of big data is:
“Data collections big enough to require you to change the way you store and process them.” - Andy Pryke

I discuss the factors which can limit the size of data analysed using R, and a variety of ways to address them: moving data structures out of RAM and onto disk, using in-database processing and analytics, and harnessing the power of Hadoop to allow massively parallel R.

Usage Rights

CC Attribution-ShareAlike License

Big Data, Bigger Data & Big R Data: Presentation Transcript

  • Big Data, Bigger Data & Big R Data. Birmingham R Users Meeting, 23rd April 2013. Andy Pryke, Andy@The-Data-Mine.co.uk / @AndyPryke
  • My Bias… (www.the-data-mine.co.uk) I work in commercial data mining, data analysis and data visualisation. Background in computing and artificial intelligence. I use R to write programs which analyse data.
  • What is Big Data? Depends who you ask. Answers are often “too big to…”: …load into memory; …store on a hard drive; …fit in a standard database. Plus “fast changing” and “not just relational”.
  • My “Big Data” Definition: “Data collections big enough to require you to change the way you store and process them.” - Andy Pryke
  • Data Size Limits in R. Standard R packages use a single thread, with data held in memory (RAM); see help("Memory-limits"). Vectors are limited to 2 billion items, and there is a memory limit of ~128Tb. Servers with 1Tb+ of memory are available, and Amazon EC2 offers servers with up to 244Gb. A quick check of these limits appears below.
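A minimal sketch of inspecting these limits from an R session; the million-element vector is purely illustrative:

    help("Memory-limits")    # R's documentation of its own memory constraints
    .Machine$integer.max     # 2147483647: the classic ~2 billion element limit
    x <- numeric(1e6)                     # one million doubles, held in RAM
    print(object.size(x), units = "Mb")   # ~7.6 Mb: each double costs 8 bytes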
  • Overview. • Problems using R with Big Data • Processing data on disk • Hadoop for parallel computation and Big Data storage / access • “In Database” analysis • What next for Birmingham R User Group?
  • Background: the R matrix class. “matrix” is built in (package base), stored in RAM, and “dense” (it takes up memory to store zero values, as the sketch below shows). It can be replaced by…
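To illustrate why dense storage hurts, a minimal sketch: even an all-zero matrix pays eight bytes per cell.

    ## A dense 1,000 x 1,000 matrix of doubles costs 8 bytes per cell,
    ## even though every value here is zero
    m <- matrix(0, nrow = 1000, ncol = 1000)
    print(object.size(m), units = "Mb")   # ~7.6 Mb of RAM for pure zeros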
  • Sparse / Disk Based Matrices. • Matrix – package Matrix. Sparse, in RAM. • big.matrix – packages bigmemory / bigmemoryExtras & VAM. On disk; VAM allows access from parallel R sessions. • Analysis – packages irlba, bigalgebra, biganalytics (see the R-Forge list), etc. A sketch of both alternatives follows below. More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
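A minimal sketch of the two approaches; the bigmemory backing file names are illustrative:

    library(Matrix)       # sparse matrices, held in RAM
    library(bigmemory)    # memory-mapped matrices, held on disk

    ## Sparse: only the 100 non-zero entries (plus their indices) are stored
    m <- matrix(0, nrow = 1000, ncol = 1000)
    m[sample(length(m), 100)] <- 1
    s <- Matrix(m, sparse = TRUE)   # a dgCMatrix
    object.size(s)                  # a few Kb, versus ~8 Mb for the dense copy

    ## Disk-backed: the data lives in bm.bin, not in RAM; the descriptor
    ## file lets other R sessions attach to the same matrix
    bm <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                                backingfile = "bm.bin",
                                descriptorfile = "bm.desc")
    bm[1, 1] <- 42   # reads and writes go through the memory-mapped file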
  • Commercial Versions of R. Revolution Analytics have specialised versions of R for parallel execution & big data. I believe many if not most components are also available under free open source licences, including the RHadoop set of packages. Plenty more info here.
  • Background: Hadoop. • A parallel data processing environment based on Google’s “MapReduce” model. • “Map” – divide up the data and send it to multiple nodes for processing. • “Reduce” – combine the results. Plus: • Hadoop Distributed File System (HDFS) • HBase – a distributed database like Google’s BigTable. The flow can be mimicked in plain R, as sketched below.
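A toy sketch of the map / shuffle / reduce flow in plain R; real Hadoop runs each stage in parallel across many nodes:

    lines <- c("in the beginning", "the word")

    ## Map: emit a (key, value) pair for each word, here (word, 1)
    pairs <- data.frame(key = unlist(strsplit(lines, " ")), value = 1)

    ## Shuffle + Reduce: group the values by key and combine each group
    counts <- tapply(pairs$value, pairs$key, sum)
    counts   # beginning: 1, in: 1, the: 2, word: 1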
  • RHadoop – Revolution Analytics. Packages: rmr2, rhbase, rhdfs. • Example code using rmr (R Map-Reduce) • R and Hadoop – Step by Step Tutorials • Install and Demo RHadoop (Google for more of these online) • Data Hacking with RHadoop
  • E.g. RHadoop word count (rmr2). The map function emits (word, 1) pairs; the reduce function adds up the counts for each word:

        library(rmr2)

        ## Map: split "lines" of text into a vector of individual "words",
        ## emitting pairs like (In, 1), (the, 1), ..., (beginning, 1)
        wc.map <- function(., lines) {
          words <- unlist(strsplit(x = lines, split = " "))
          keyval(words, 1)   # each word occurs once
        }

        ## Reduce: add up the counts, grouping them by word, producing
        ## pairs like (the, 2345), (word, 987), (beginning, 123), ...
        wc.reduce <- function(word, counts) {
          keyval(word, sum(counts))
        }

        wordcount <- function(input, output = NULL) {
          mapreduce(input = input, output = output,
                    input.format = "text",
                    map = wc.map, reduce = wc.reduce,
                    combine = TRUE)
        }
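A hypothetical invocation; the HDFS path is illustrative, and from.dfs() / keys() / values() are the standard rmr2 helpers for pulling results back into the R session:

    out <- wordcount("/user/andy/books")   # illustrative HDFS input path
    results <- from.dfs(out)               # fetch the (word, count) pairs into R
    head(keys(results))
    head(values(results))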
  • Other Hadoop libraries for R. Other packages: hive, segue, RHIPE… segue – an easy way to distribute CPU-intensive work. Uses Amazon’s Elastic MapReduce service, which costs money. Not designed for big data, but easy and fun. Example follows…
  • segue example:

        # First, let's generate a 10-element list of
        # 999 random numbers + 1 NA each:
        > myList <- getMyTestList()

        # Average each set of 999 numbers, locally and then on the cluster
        > outputLocal <- lapply(myList, mean, na.rm=T)
        > outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
        RUNNING - 2011-01-04 15:16:57
        RUNNING - 2011-01-04 15:17:27
        RUNNING - 2011-01-04 15:17:58
        WAITING - 2011-01-04 15:18:29

        # Check local and cluster results match
        > all.equal(outputEmr, outputLocal)
        [1] TRUE

        # The key is the emrlapply() function. It works just like lapply(),
        # but automagically spreads its work across the specified cluster
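For context, a sketch of how the myCluster object above might be created. This is from memory of segue's API, so the function names (setCredentials, createCluster, stopCluster) should be checked against the package docs, and the credentials are placeholders; it really does start billable EC2 instances:

    library(segue)
    setCredentials("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")  # placeholders
    myCluster <- createCluster(numInstances = 5)  # launch a 5-node EMR cluster ($)
    ## ... run emrlapply() jobs against myCluster ...
    stopCluster(myCluster)   # shut it down to stop the charges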
  • Oracle R Connector for Hadoop. • Integrates with Oracle DB, the “Oracle Big Data Appliance” (sounds expensive!) & HDFS. • Its Map-Reduce code is very similar to the rmr example. • The documentation lists examples for linear regression, k-means and working with graphs, amongst others. • Introduction to Oracle R Connector for Hadoop. • Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview).
  • Teradata Integration. Package: teradataR. • Teradata offer in-database analytics, accessible through R. • These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions. The general idea is sketched below.
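To illustrate the general “in database” pattern (a generic DBI sketch, not teradataR’s own API; the DSN and table names are hypothetical): the aggregation runs inside the database, and only the small summary table crosses the network into R.

    library(DBI)

    ## Hypothetical Teradata connection via ODBC; "teradata_dsn" is a placeholder
    con <- dbConnect(odbc::odbc(), dsn = "teradata_dsn")

    ## The grouping and averaging happen in-database; R receives only the summary
    stats <- dbGetQuery(con, "
      SELECT product_id, COUNT(*) AS n, AVG(price) AS mean_price
      FROM   sales
      GROUP  BY product_id")

    dbDisconnect(con)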
  • What Next? I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc. “R” you interested?