Big Data,
         Bigger Data
              &
         Big R Data
     Birmingham R Users Meeting
            23rd April 2013
            Andy Pryke
Andy@The-Data-Mine.co.uk / @AndyPryke
My Bias…
www.the-data-mine.co.uk




      I work in commercial data
      mining, data analysis and data
      visualisation
      Background in computing and
      artificial intelligence
      Use R to write programs which
      analyse data
What is Big Data?
      Depends who you ask.
      Answers are often “too big to ….”
        …load into memory
        …store on a hard drive
        …fit in a standard database
      Plus
        “Fast changing”
        Not just relational
My “Big Data” Definition
     “Data collections big
     enough to require you to
     change the way you
     store and process them.”
               - Andy Pryke
Data Size Limits in R
      Standard R packages use a single
      thread, with data held in memory (RAM)
      help("Memory-limits")
               •     Vectors limited to about 2 billion (2^31 - 1) items
               •     Memory limit of ~128 TB
      Servers with 1 TB+ of memory are available
               • Also, Amazon EC2 servers with up to 244 GB
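
      As a rough illustration of these limits, the sketch below uses base R only to estimate how much RAM a numeric vector needs. The helper name vector_size_gb is made up for this example, and the 8-bytes-per-element figure ignores the small fixed header each R object carries.

```r
# Largest index of a classic (non-long) R vector: 2^31 - 1
max_items <- .Machine$integer.max

# Hypothetical helper: approximate RAM (in GB) for a numeric (double) vector,
# at 8 bytes per element, ignoring the small fixed object header
vector_size_gb <- function(n_items) {
  n_items * 8 / 1024^3
}

# RAM needed by a numeric vector at the classic length limit
size_at_limit <- vector_size_gb(max_items)

# Measure a real (small) vector to check the 8-bytes-per-element estimate
x <- numeric(1e6)
bytes_per_item <- as.numeric(object.size(x)) / length(x)
```

      So a single numeric vector at the length limit already needs about 16 GB of RAM, well before the overall memory limit comes into play.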
Overview
      • Problems using R with Big Data
      • Processing data on disk
      • Hadoop for parallel computation and Big
        Data storage / access
      • “In Database” analysis
      • What next for Birmingham R User Group?
Background: R matrix class
      “matrix”
       - Built in (package base).
       - Stored in RAM
       - “Dense” (takes up memory
                 to store zero values)

      Can be replaced by…..
Sparse / Disk Based Matrices
      • Matrix – Package Matrix. Sparse. In RAM
      • big.matrix – Package bigmemory /
        bigmemoryExtras & VAM. On disk. VAM
        allows access from parallel R sessions
      • Analysis – Packages irlba, bigalgebra,
        biganalytics (R-Forge list), etc.
      More details?
        “Large-Scale Linear Algebra with R”, Bryan
          W. Lewis, Boston R Users Meetup
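
      To make the dense versus sparse difference concrete, here is a minimal sketch using the Matrix package. The matrix size and the number of non-zero cells are arbitrary choices for illustration, and the disk-based big.matrix route is not shown.

```r
library(Matrix)  # recommended package, shipped with standard R installs

set.seed(42)
n <- 1000   # matrix is n x n (arbitrary size for illustration)
k <- 5000   # number of randomly placed non-zero entries (about 0.5% of cells)

# Sparse matrix: only the non-zero cells are stored
# (entries that land on the same cell are summed by sparseMatrix())
i <- sample(n, k, replace = TRUE)
j <- sample(n, k, replace = TRUE)
sparse <- sparseMatrix(i = i, j = j, x = 1, dims = c(n, n))

# Dense base-R equivalent: every cell is stored, including the zeros
dense <- as.matrix(sparse)

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

# Both objects hold the same values, so totals agree
totals_match <- isTRUE(all.equal(sum(dense), sum(sparse)))
```

      Comparing dense_bytes with sparse_bytes shows the saving: the dense copy needs about 8 MB regardless of how many cells are zero.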
Commercial Versions of R
      Revolution Analytics have specialised
      versions of R for parallel execution & big data

      I believe many if not most components are
      also available under Free Open Source
      licences, including the RHadoop set of
      packages

      Plenty more info here
Background: Hadoop
      • Parallel data processing environment
         based on Google’s “MapReduce” model
      • “Map” – divides up the data and sends it
         to multiple nodes for processing.
      • “Reduce” – Combine the results
      Plus:
      • Hadoop Distributed File System (HDFS)
      • HBase – Distributed database like
                 Google’s BigTable
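
      The map and reduce phases can be imitated on a single machine in plain R. The sketch below is only a local simulation with made-up input (no Hadoop involved): lapply() stands in for the nodes doing the "map" work, and split() stands in for the grouping Hadoop performs between the two phases.

```r
# Toy input: each element stands in for one line of a text file
lines <- c("in the beginning", "the word", "the end")

# "Map": turn one line into (word, 1) pairs
map_step <- function(line) {
  words <- unlist(strsplit(line, split = " ", fixed = TRUE))
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}

# On a cluster each line (or block of lines) would go to a different node;
# here the calls simply run one after another
mapped <- do.call(rbind, lapply(lines, map_step))

# "Shuffle": group the values by key
grouped <- split(mapped$value, mapped$key)

# "Reduce": combine the values for each key into a single count
counts <- vapply(grouped, sum, numeric(1))
```

      The result, counts, is a named numeric vector with one entry per distinct word.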
RHadoop – Revolution Analytics
       Package: rmr2, rhbase, rhdfs

       • Example code using RMR (R Map-Reduce)
       • R and Hadoop – Step by Step Tutorials
       • Install and Demo RHadoop (Google for
         more of these online)
       • Data Hacking with RHadoop
RHadoop
E.g. function, with its output shown in comments

library(rmr2)

wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1) ## each word occurs once
  ## Output: In, 1
  ##         the, 1
  ##         beginning, 1
  ##         ...
}

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
  ## Output: the, 2345
  ##         word, 987
  ##         beginning, 123
  ##         ...
}

wordcount <- function(input, output = NULL) {
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
}
Other Hadoop libraries for R
 Other packages: hive, segue, RHIPE…

 segue
 – easy way to distribute CPU intensive work
 - Uses Amazon’s Elastic Map Reduce service,
   which costs money.
 - not designed for big data, but easy and fun.

 Example follows…
# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()
# Take the mean of each set of numbers, ignoring the NA
> outputLocal <- lapply(myList, mean, na.rm=TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm=TRUE)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29

## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE

# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
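
 The same "lapply, but spread across workers" pattern can be tried without an Amazon account using the parallel package that ships with R. This is a local analogue, not segue itself: parLapply() and a two-worker cluster on the local machine replace emrlapply() and Elastic MapReduce, and the list built here is an assumed stand-in for getMyTestList().

```r
library(parallel)  # ships with base R

# Assumed stand-in for getMyTestList():
# a 10-element list, each element 999 random numbers plus 1 NA
set.seed(1)
myList <- lapply(1:10, function(i) c(rnorm(999), NA))

# Sequential version
outputLocal <- lapply(myList, mean, na.rm = TRUE)

# Parallel version on a local 2-worker cluster (not Amazon EMR)
cl <- makeCluster(2)
outputPar <- parLapply(cl, myList, mean, na.rm = TRUE)
stopCluster(cl)

# Check sequential and parallel results match
resultsMatch <- all.equal(outputPar, outputLocal)
```

 As with segue, the benefit only shows up when each element takes a noticeable time to process; for a cheap function like mean() the cost of starting the workers dominates.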
Oracle R Connector for Hadoop
 • Integrates with Oracle Db, “Oracle Big Data
   Appliance” (sounds expensive!) & HDFS
 • Map-Reduce is very similar to the rmr example
 • Documentation lists examples for Linear
   Regression, k-means, working with graphs
   amongst others
 • Introduction to Oracle R Connector for Hadoop.
 • Oracle also offer some in-database algorithms
   for R via Oracle R Enterprise (overview)
Teradata Integration
 Package: teradataR
 • Teradata offer in-database analytics, accessible
   through R
 • These include k-means clustering, descriptive
   statistics and the ability to create and call
   in-database user-defined functions
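
 The general idea of in-database analysis can be sketched without Teradata. The example below is not teradataR: it assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real warehouse. The point is that the grouping and averaging run inside the database, and only a small summary comes back to R.

```r
library(DBI)  # assumes the DBI and RSQLite packages are installed

# An in-memory SQLite database stands in for a real warehouse such as Teradata
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a small built-in data set as the "big" table
dbWriteTable(con, "cars", mtcars)

# The aggregation is done by the database engine;
# R only receives one row per group
summary_by_cyl <- dbGetQuery(con, "
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg
  FROM cars
  GROUP BY cyl
  ORDER BY cyl
")

dbDisconnect(con)
```

 With a genuinely large table the saving is in data movement: the raw rows never leave the database server.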
What Next?
 I propose an informal “big data” Special Interest
 Group, where we collaborate to explore big data
 options within R, producing example code etc.


         “R” you interested?
