Sawmill - Integrating R and Large Data Clouds
 

This is a version of a talk that I have given a few times recently.


    Presentation Transcript

    • Sawmill: Some Lessons Learned Running R in Large Data Clouds
      Robert Grossman
      Open Data Group
    • What Do You Do if Your Data is too Big for a Database?
      Give up and invoke sampling.
      Buy a proprietary system and ask for a raise.
      Begin to build a custom system and explain why it is not yet done.
      Use Hadoop.
      Use an alternative large data cloud (e.g. Sector)
    • Basic Idea
      Turn it into a pleasantly parallel problem.
      Use a large data cloud to manage and prepare the data.
      Use a Map/Bucket function to split the job.
      Run R on each piece using Reduce/UDF or streams.
      Use PMML multiple models to glue the pieces together.
    • Why Listen?
      This approach allows you to scale R relatively easily to hundreds of TB to PB.
      The approach is easy.
      (A plus: it may look hard to your colleagues, boss or clients.)
      There is at least an order of magnitude of performance to be gained with the right design.
    • Part 1. Stacks for Big Data
    • The Google Data Stack
      The Google File System (2003)
      MapReduce: Simplified Data Processing… (2004)
      BigTable: A Distributed Storage System… (2006)
    • Map-Reduce Example
      Input is file with one document per record
      User specifies map function
      key = document URL
      value = terms that document contains
      input: (“doc cdickens”, “it was the best of times”)
      map
      output: (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1)
    • Example (cont’d)
      MapReduce library gathers together all pairs with the same key value (shuffle/sort phase)
      The user-defined reduce function combines all the values associated with the same key
      input: key = “it”, values = 1, 1
             key = “was”, values = 1, 1
             key = “best”, values = 1
             key = “worst”, values = 1
      reduce
      output: (“it”, 2), (“was”, 2), (“best”, 1), (“worst”, 1)
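      A minimal sketch of the map and reduce steps above, written in R so the same scripts could be run through Hadoop Streaming (the slides show no code; the function names map_words and reduce_counts are illustrative, and the assumption is that records arrive on stdin and key/value pairs are written tab-separated to stdout):

      # mapper.R -- emit (term, 1) for every term in every input document
      map_words <- function(con = file("stdin", "r")) {
        while (length(line <- readLines(con, n = 1)) > 0) {
          terms <- strsplit(tolower(line), "[^a-z]+")[[1]]
          for (term in terms[terms != ""]) {
            cat(term, "\t1\n", sep = "")
          }
        }
      }

      # reducer.R -- the shuffle/sort phase delivers the pairs sorted by key,
      # so all counts for one term are adjacent and can be summed in one pass
      reduce_counts <- function(con = file("stdin", "r")) {
        current <- NULL; total <- 0
        while (length(line <- readLines(con, n = 1)) > 0) {
          parts <- strsplit(line, "\t")[[1]]
          if (!is.null(current) && parts[1] != current) {
            cat(current, "\t", total, "\n", sep = ""); total <- 0
          }
          current <- parts[1]; total <- total + as.integer(parts[2])
        }
        if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
      }

      Under Hadoop Streaming the two functions would live in separate executable scripts passed with the -mapper and -reducer options; as a later slide notes, Sector streams can play the same role.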
    • Applying MapReduce to the Data in Storage Cloud
      (diagram: the map phase, then the shuffle/reduce phase, applied to the data in the storage cloud)
    • Google’s Large Data Cloud (Google’s Stack)
      Applications
      Compute Services: Google’s MapReduce
      Data Services: Google’s BigTable
      Storage Services: Google File System (GFS)
    • Hadoop’s Large Data Cloud (Hadoop’s Stack)
      Applications
      Compute Services: Hadoop’s MapReduce
      Data Services: NoSQL Databases
      Storage Services: Hadoop Distributed File System (HDFS)
    • Amazon Style Data Cloud
      Load Balancer
      Simple Queue Service
      SDB
      EC2 Instances
      S3 Storage Services
    • Sector’s Large Data Cloud (Sector’s Stack)
      Applications
      Compute Services: Sphere’s UDFs
      Data Services / Storage Services: Sector’s Distributed File System (SDFS)
      Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
    • Apply User Defined Functions (UDF) to Files in Storage Cloud
      (diagram: user defined functions take the place of the map and shuffle/reduce phases, applied directly to files in the storage cloud)
    • Folklore
      MapReduce is great.
      But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
      And often it is easier to use Hadoop streams, Sector streams, etc.
    • Sphere UDF vs. MapReduce
    • Terasort Benchmark
      Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.
    • MalStone
      (diagram: entities visiting sites over time, with time windows dk-2, dk-1, dk)
    • MalStone
      Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records per node on 20 nodes.
    • Part 2. Predictive Model Markup Language
    • Problems Deploying Models
      Models are deployed in proprietary formats
      Models are application dependent
      Models are system dependent
      Models are architecture dependent
      The time required to deploy models and to integrate them with other applications can be long.
    • Predictive Model Markup Language (PMML)
      Based on XML
      Benefits of PMML
      Open standard for Data Mining & Statistical Models
      Not concerned with the process of creating a model
      Provides independence from application, platform, and operating system
      Simplifies use of data mining models by other applications (consumers of data mining models)
    • PMML Document Components
      Data dictionary
      Mining schema
      Transformation Dictionary
      Multiple models, including segments and ensembles.
      Model verification, …
      Univariate Statistics (ModelStats)
      Optional Extensions
    • PMML Models
      polynomial regression
      logistic regression
      general regression
      center based clusters
      density based clusters
      trees
      associations
      neural nets
      naïve Bayes
      sequences
      text models
      support vector machines
      ruleset
    • PMML Producer & Consumers
      (diagram: in the modeling environment, data is pre-processed and a model producer writes a PMML model; in the deployment environment, a model consumer reads the PMML model, scores incoming data, and post-processing turns the output into scores, actions, and rules)
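      The slides do not name a producer-side tool, but one common way for R to act as the PMML producer in this diagram is the pmml package on CRAN, together with the XML package; a minimal sketch under that assumption, with the model and file name purely illustrative (the exact export call varies with the pmml package version):

      library(pmml)   # converts fitted R models to PMML documents
      library(XML)    # provides saveXML() in the classic pmml workflow

      # Fit a simple logistic regression model -- one of the model types PMML covers
      fit <- glm(Species == "setosa" ~ Sepal.Length + Sepal.Width,
                 data = iris, family = binomial)

      # Export the model as PMML; any PMML consumer (scoring engine) can then
      # load "model.pmml" independently of R.
      saveXML(pmml(fit), file = "model.pmml")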
    • Part 3. Sawmill
    • Step 1: Preprocess data using MapReduce or UDF
      Step 2: Invoke R on each segment/bucket and build a PMML model
      Step 3: Gather the models together to form a multiple model PMML file
    • Model building: Step 1: Preprocess data using MapReduce or UDF; Step 2: Build a separate model in each segment using R
      Scoring: Step 1: Preprocess data using MapReduce or UDF; Step 2: Score data in each segment using R
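      Step 2 is where R runs inside the cloud. A minimal sketch of that step as a streaming-style reducer, assuming records arrive on stdin as segment_key<TAB>comma-separated fields (the field names and model formula are illustrative, not from the slides):

      library(pmml)
      library(XML)

      # Fit one model per segment and write one PMML file per segment;
      # the files are later gathered into a single multiple-model PMML document (Step 3).
      fit_segment <- function(key, rows) {
        seg <- read.csv(textConnection(rows), header = FALSE,
                        col.names = c("x1", "x2", "y"))
        model <- glm(y ~ x1 + x2, data = seg, family = binomial)
        saveXML(pmml(model), file = paste0("segment_", key, ".pmml"))
      }

      con <- file("stdin", "r")
      current <- NULL; rows <- character(0)
      while (length(line <- readLines(con, n = 1)) > 0) {
        parts <- strsplit(line, "\t")[[1]]                # "segment_key<TAB>x1,x2,y"
        if (!is.null(current) && parts[1] != current) {   # key change = segment boundary
          fit_segment(current, rows); rows <- character(0)
        }
        current <- parts[1]; rows <- c(rows, parts[2])
      }
      if (!is.null(current)) fit_segment(current, rows)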
    • Sawmill Summary
      Use Hadoop MapReduce or Sector UDFs to preprocess the data
      Use the Hadoop Map phase or Sector buckets to segment the data to gain parallelism
      Build a separate statistical model for each segment using R and Hadoop / Sector streams
      Use the multiple models specification in PMML version 4.0 to specify the segmentation
      Example: use the Hadoop Map function to send all of the data for each web site to a different segment (on a different processor), as sketched below
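      A sketch of that last example: a streaming mapper that keys each record by its web site, so the shuffle sends every record for a given site to the same reducer and hence to the same R process (the input layout, with the site in the first comma-separated field, is an assumption):

      con <- file("stdin", "r")
      while (length(line <- readLines(con, n = 1)) > 0) {
        fields <- strsplit(line, ",")[[1]]
        site   <- fields[1]                     # segment key = web site
        # emit key<TAB>rest-of-record; Hadoop partitions and sorts on the key
        cat(site, "\t", paste(fields[-1], collapse = ","), "\n", sep = "")
      }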
    • Small Example: Scoring Engine written in R
      • R processed a typical segment in 20 minutes
      • Using R to score 2 segments concatenated together took 60 minutes
      • Using R to score 3 segments concatenated together took 140 minutes
      • In other words, scoring time grew faster than linearly with data size, which is why splitting the data into segments and scoring them in parallel pays off
    • With Sawmill Framework
      • 1 month of data, about 50 GB, hundreds of segments
      • 300 mapper keys (segments)
      • Mapping and reducing took under 2 minutes
      • Scoring took about 20 minutes × the maximum number of segments per reducer
      • There were anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer
      • Often ran in under 2 hours.
    • Reducer R Process?
      • There are at least three ways to tie the MapReduce process to the R process.
      • MACHINE: one instance of the R process on each data node (or n per node)
      • REDUCER: one instance of the R process bound to each reducer
      • SEGMENT: instances launched by the reducers as necessary (when keys are reduced)
    • Tradeoffs
      • To prevent bottlenecks, you need to have a general idea of:
      • how long the records for a key take to be reduced
      • how long the application takes to process a segment
      • how many keys are seen per reducer
    • Thank You!
      www.opendatagroup.com