

“Data collections big enough to require you to change the way you store and process them.” - Andy Pryke

I discuss the factors which can limit the size of data analysed using R, and a variety of ways to address them, including moving data structures out of RAM and onto disk, using in-database processing/analytics, and harnessing the power of Hadoop to allow massively parallel R.

Published in:
Technology

License: CC Attribution-ShareAlike License


- 1. Big Data, Bigger Data & Big R Data. Birmingham R Users Meeting, 23rd April 2013. Andy Pryke, Andy@The-Data-Mine.co.uk / @AndyPryke
- 2. My Bias… (www.the-data-mine.co.uk) I work in commercial data mining, data analysis and data visualisation. Background in computing and artificial intelligence. I use R to write programs which analyse data.
- 3. What is Big Data? Depends who you ask. Answers are often “too big to…”: …load into memory; …store on a hard drive; …fit in a standard database. Plus “fast changing” and “not just relational”.
- 4. My “Big Data” Definition: “Data collections big enough to require you to change the way you store and process them.” - Andy Pryke
- 5. Data Size Limits in R. Standard R packages use a single thread, with data held in memory (RAM). See help("Memory-limits"): vectors are limited to 2 billion items, and there is a memory limit of ~128Tb. Servers with 1Tb+ memory are available; Amazon EC2 also offers servers with up to 244Gb.
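The in-memory constraint is easy to see from a plain R session. As a rough sketch (exact sizes vary slightly by platform and R version), each element of a numeric vector costs about 8 bytes of RAM:

```r
# Rough illustration of R's in-memory data model:
# a numeric (double) vector costs ~8 bytes per element, all held in RAM.
x <- numeric(1e6)            # one million doubles
print(object.size(x))        # roughly 8 MB

# Scaling up: 2 billion elements (the classic 2^31 - 1 vector length
# limit) would need ~16 GB of RAM for a single vector.
elements <- 2^31 - 1
gb <- elements * 8 / 1024^3
cat(sprintf("A maximal numeric vector needs ~%.1f GB\n", gb))

# See the limits documented for your own build:
# help("Memory-limits")
```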
- 6. Overview: problems using R with Big Data; processing data on disk; Hadoop for parallel computation and Big Data storage/access; “in database” analysis; what next for the Birmingham R User Group?
- 7. Background: R matrix class. “matrix” is built in (package base), stored in RAM, and “dense” (it takes up memory to store zero values). It can be replaced by…
- 8. Sparse / Disk-Based Matrices. Matrix (package Matrix): sparse, in RAM. big.matrix (packages bigmemory / bigmemoryExtras & VAM): on disk; VAM allows access from parallel R sessions. Analysis: packages irlba, bigalgebra, biganalytics (see the R-Forge list), etc. More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup.
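A small illustration of why sparse storage matters, using the Matrix package (which ships with standard R distributions); the sizes shown are approximate:

```r
library(Matrix)

# A 1000 x 2000 matrix of zeros, stored densely vs. sparsely.
dense  <- matrix(0, nrow = 1000, ncol = 2000)
sparse <- Matrix(0, nrow = 1000, ncol = 2000, sparse = TRUE)

print(object.size(dense))   # ~16 MB: every zero costs 8 bytes
print(object.size(sparse))  # a few KB: only non-zero entries are stored

# Sparse matrices still support ordinary operations:
sparse[1, 1] <- 5
sparse[2, 2] <- 3
colSums(sparse)[1:2]        # column sums: 5 and 3
```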
- 9. Commercial Versions of R. Revolution Analytics have specialised versions of R for parallel execution and big data. I believe many, if not most, components are also available under free open source licences, including the RHadoop set of packages. Plenty more info here.
- 10. Background: Hadoop. A parallel data processing environment based on Google’s “MapReduce” model: “Map” divides up the data and sends it for processing to multiple nodes; “Reduce” combines the results. Plus: the Hadoop Distributed File System (HDFS), and HBase, a distributed database like Google’s BigTable.
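The map/reduce model can be mimicked in plain base R (no Hadoop required), which may help show what the “map” and “reduce” steps on the following slides are doing. This is an illustrative sketch of the model, not Hadoop itself:

```r
# Toy MapReduce in base R: count words across "chunks" of text.
chunks <- list("in the beginning", "the word was", "the end")

# Map: each chunk independently emits (word, 1) pairs.
mapped <- lapply(chunks, function(chunk) {
  words <- unlist(strsplit(chunk, " "))
  setNames(rep(1, length(words)), words)
})

# Shuffle: group all emitted pairs by key (the word).
pairs  <- unlist(mapped)
groups <- split(unname(pairs), names(pairs))

# Reduce: sum the counts within each group.
counts <- sapply(groups, sum)
print(counts)   # "the" appears 3 times
```

In real Hadoop the map calls run on many machines at once and the shuffle moves data across the network; the logic per word, however, is exactly this.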
- 11. RHadoop – Revolution Analytics. Packages: rmr2, rhbase, rhdfs. Example code using RMR (R Map-Reduce); R and Hadoop step-by-step tutorials; install and demo RHadoop (Google for more of these online); Data Hacking with RHadoop.
- 12. RHadoop example: a word-count map and reduce written with rmr2, with example key/value output shown as comments.

```r
## Map: split "lines" of text into a vector of individual "words"
wc.map <- function(., lines) {
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1)              ## each word occurs once
}
## example map output:    In,1   the,1   beginning,1   ...

## Reduce: add up the counts, grouping them by word
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
## example reduce output: the,2345   word,987   beginning,123   ...

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = TRUE)
}
```
- 13. Other Hadoop libraries for R. Other packages: hive, segue, RHIPE… segue is an easy way to distribute CPU-intensive work. It uses Amazon’s Elastic Map Reduce service, which costs money, and it is not designed for big data, but it is easy and fun. Example follows…
- 14. segue example session:

```r
# First, let's generate a 10-element list of 999 random numbers + 1 NA:
> myList <- getMyTestList()
# Take the mean of each set of 999 numbers, locally and on the cluster
> outputLocal <- lapply(myList, mean, na.rm = TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = TRUE)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29
# Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE
# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
```
- 15. Oracle R Connector for Hadoop. Integrates with Oracle DB, the “Oracle Big Data Appliance” (sounds expensive!) and HDFS. Its Map-Reduce interface is very similar to the rmr example. The documentation lists examples for linear regression, k-means and working with graphs, amongst others. See the Introduction to Oracle R Connector for Hadoop. Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview).
- 16. Teradata Integration. Package: teradataR. Teradata offer in-database analytics, accessible through R. These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions.
- 17. What Next? I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc. “R” you interested?
