Sawmill - Integrating R and Large Data Clouds


This is a version of a talk that I have given a few times recently.



  1. Sawmill: Some Lessons Learned Running R in Large Data Clouds
     Robert Grossman, Open Data Group
  2. What Do You Do if Your Data Is Too Big for a Database?
     - Give up and invoke sampling.
     - Buy a proprietary system and ask for a raise.
     - Begin to build a custom system and explain why it is not yet done.
     - Use Hadoop.
     - Use an alternative large data cloud (e.g., Sector).
  3. Basic Idea
     - Turn it into a pleasantly parallel problem.
     - Use a large data cloud to manage and prepare the data.
     - Use a Map/Bucket function to split the job.
     - Run R on each piece using Reduce/UDF or streams.
     - Use PMML multiple models to glue the pieces together.
  4. Why Listen?
     - This approach allows you to scale R relatively easily to hundreds of terabytes, even petabytes.
     - The approach is easy. (A plus: it may look hard to your colleagues, boss, or clients.)
     - There is at least an order of magnitude of performance to be gained with the right design.
  5. Part 1: Stacks for Big Data
  6. The Google Data Stack
     - The Google File System (2003)
     - MapReduce: Simplified Data Processing… (2004)
     - BigTable: A Distributed Storage System… (2006)
  7. Map-Reduce Example
     - Input is a file with one document per record.
     - The user specifies a map function: key = document URL, value = the terms the document contains.
     - Example: map("doc cdickens", "it was the best of times") emits ("it", 1), ("was", 1), ("the", 1), ("best", 1), …
  8. Example (cont'd)
     - The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
     - The user-defined reduce function combines all the values associated with the same key.
     - Example: reduce("it", [1, 1]) emits ("it", 2); reduce("was", [1, 1]) emits ("was", 2); reduce("best", [1]) emits ("best", 1); reduce("worst", [1]) emits ("worst", 1).
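The word count on these two slides can be sketched end to end as a single-process simulation of the map, shuffle/sort, and reduce phases (in Hadoop or Sector these would of course run distributed across nodes):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    """Map: emit (term, 1) for each term in the document."""
    return [(term, 1) for term in value.split()]

def reduce_fn(key, values):
    """Reduce: sum the counts for one term."""
    return (key, sum(values))

# One document per record, as on the slide.
records = [("doc cdickens", "it was the best of times it was the worst of times")]

# Map phase: apply map_fn to every record and flatten the output.
pairs = [pair for k, v in records for pair in map_fn(k, v)]

# Shuffle/sort phase: group the pairs by key.
pairs.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

# Reduce phase: combine the values for each key.
counts = dict(reduce_fn(k, vs) for k, vs in grouped)
print(counts)  # {'best': 1, 'it': 2, 'of': 2, 'the': 2, 'times': 2, 'was': 2, 'worst': 1}
```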
  9. Applying MapReduce to the Data in a Storage Cloud
     [Diagram: a map step followed by a shuffle/reduce step, applied to files in the storage cloud]
  10. Google's Large Data Cloud (Google's Stack)
      - Applications
      - Compute Services: Google's MapReduce
      - Data Services: Google's BigTable
      - Storage Services: Google File System (GFS)
  11. Hadoop's Large Data Cloud (Hadoop's Stack)
      - Applications
      - Compute Services: Hadoop's MapReduce
      - Data Services: NoSQL databases
      - Storage Services: Hadoop Distributed File System (HDFS)
  12. Amazon-Style Data Cloud
      [Diagram: a load balancer and the Simple Queue Service in front of many EC2 instances, with SDB and the S3 storage services underneath]
  13. Sector's Large Data Cloud (Sector's Stack)
      - Applications
      - Compute Services: Sphere's UDFs
      - Data / Storage Services: Sector's Distributed File System (SDFS)
      - Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
  14. Apply User-Defined Functions (UDFs) to Files in a Storage Cloud
      [Diagram: UDFs play the roles of map and shuffle/reduce over files in the storage cloud]
  15. Folklore
      - MapReduce is great.
      - But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
      - And often it is easier to use Hadoop streams, Sector streams, etc.
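The appeal of streams is that a Hadoop-streaming reducer is just a program reading sorted, tab-separated key/value lines on stdin and writing the same format to stdout, which is what makes it easy to drop R (or anything else) into the reduce slot. A minimal Python sketch of such a reducer, run here on simulated input rather than inside a live job:

```python
import io

def stream_reduce(lines, out):
    """A Hadoop-streaming-style reducer: input is tab-separated
    key/value lines, already sorted by key; emit one total per key."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        out.write(f"{current}\t{total}\n")

# In a real streaming job this would be stream_reduce(sys.stdin, sys.stdout).
out = io.StringIO()
stream_reduce(["best\t1\n", "it\t1\n", "it\t1\n", "was\t1\n", "was\t1\n"], out)
print(out.getvalue(), end="")
```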
  16. Sphere UDF vs. MapReduce
  17. Terasort Benchmark
      Sector/Sphere 1.24a vs. Hadoop 0.20.1, with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
  18. MalStone
      [Diagram: entities visiting sites over time, with time points dk-2, dk-1, dk marked on the time axis]
  19. MalStone
      Sector/Sphere 1.20 vs. Hadoop 0.18.3, with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.
  20. Part 2: Predictive Model Markup Language
  21. Problems Deploying Models
      - Models are deployed in proprietary formats.
      - Models are application dependent.
      - Models are system dependent.
      - Models are architecture dependent.
      - The time required to deploy models and to integrate them with other applications can be long.
  22. Predictive Model Markup Language (PMML)
      - Based on XML.
      - Benefits of PMML:
        - An open standard for data mining and statistical models.
        - Not concerned with the process of creating a model.
        - Provides independence from application, platform, and operating system.
        - Simplifies the use of data mining models by other applications (consumers of data mining models).
  23. PMML Document Components
      - Data dictionary
      - Mining schema
      - Transformation dictionary
      - Multiple models, including segments and ensembles
      - Model verification, …
      - Univariate statistics (ModelStats)
      - Optional extensions
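To make the component list concrete, here is a minimal PMML document sketch. The element names follow the PMML schema, but the fields, coefficients, and model are illustrative, not taken from the talk:

```xml
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header copyright="example"/>
  <!-- Data dictionary: the fields the model knows about -->
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="example">
    <!-- Mining schema: how each field is used by this model -->
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```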
  24. PMML Models
      - polynomial regression
      - logistic regression
      - general regression
      - center-based clusters
      - density-based clusters
      - trees
      - associations
      - neural nets
      - naïve Bayes
      - sequences
      - text models
      - support vector machines
      - rulesets
  25. PMML Producers & Consumers
      [Diagram: in the modeling environment, data is pre-processed and a model producer exports a PMML model; in the deployment environment, a model consumer imports the PMML model, scores incoming data, and post-processing turns scores into actions and rules]
  32. Part 3: Sawmill
  33. The Sawmill Approach
      - Step 1: Preprocess the data using MapReduce or UDFs.
      - Step 2: Invoke R on each segment/bucket and build a PMML model.
      - Step 3: Gather the models together into a multiple-model PMML file.
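Under stated assumptions (a toy in-memory data set, and a trivial least-squares fit standing in for the per-segment R job), the three steps can be sketched as:

```python
from collections import defaultdict

# Toy records: (site, x, y). In Sawmill the data would live in HDFS or
# Sector and be far too large for a single machine.
records = [
    ("site-a", 1.0, 2.0), ("site-a", 2.0, 4.0),
    ("site-b", 1.0, 3.0), ("site-b", 2.0, 6.0),
]

# Step 1: preprocess/segment -- map each record to a bucket keyed by site.
# In Hadoop this is the Map phase; in Sector, a bucket UDF.
segments = defaultdict(list)
for site, x, y in records:
    segments[site].append((x, y))

# Step 2: build a model per segment. In Sawmill each segment is streamed to
# an R process; a least-squares slope through the origin stands in here.
def fit_segment(pairs):
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, y in pairs)
    return {"slope": sxy / sxx}

# Step 3: gather the per-segment models into one ensemble keyed by segment,
# which is the role PMML's multiple-models mechanism plays.
ensemble = {seg: fit_segment(pairs) for seg, pairs in segments.items()}
print(ensemble)  # {'site-a': {'slope': 2.0}, 'site-b': {'slope': 3.0}}
```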
  34. The same two-step pattern covers both model building and scoring:
      - Building: Step 1, preprocess the data using MapReduce or UDFs; Step 2, build a separate model in each segment using R.
      - Scoring: Step 1, preprocess the data using MapReduce or UDFs; Step 2, score the data in each segment using R.
  35. Sawmill Summary
      - Use Hadoop MapReduce or Sector UDFs to preprocess the data.
      - Use Hadoop Map or Sector buckets to segment the data to gain parallelism.
      - Build a separate statistical model for each segment using R and Hadoop/Sector streams.
      - Use the multiple models specification in PMML version 4.0 to specify the segmentation.
      - Example: use a Hadoop Map function to send all the data for each web site to a different segment (on a different processor).
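The PMML 4.0 multiple-models glue mentioned above can be sketched as a MiningModel whose Segmentation routes each record, via a predicate on the segmentation field (here a hypothetical `site` field, matching the web-site example), to the model built by that segment's R job. The element names follow the PMML 4.0 schema; the predicates and inner models are illustrative:

```xml
<MiningModel functionName="regression">
  <MiningSchema>
    <MiningField name="site"/>
    <MiningField name="x"/>
    <MiningField name="y" usageType="predicted"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="selectFirst">
    <Segment>
      <SimplePredicate field="site" operator="equal" value="site-a"/>
      <!-- model produced by the R job for segment site-a -->
      <RegressionModel functionName="regression">...</RegressionModel>
    </Segment>
    <Segment>
      <SimplePredicate field="site" operator="equal" value="site-b"/>
      <!-- model produced by the R job for segment site-b -->
      <RegressionModel functionName="regression">...</RegressionModel>
    </Segment>
  </Segmentation>
</MiningModel>
```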
  36. Small Example: Scoring Engine Written in R
      - R processed a typical segment in 20 minutes.
      - Using R to score 2 segments concatenated together took 60 minutes.
      - Using R to score 3 segments concatenated together took 140 minutes.
  37. With the Sawmill Framework
      - 1 month of data, about 50 GB, hundreds of segments.
      - 300 mapper keys / segments.
      - Mapping and reducing took under 2 minutes.
      - Scoring took about 20 minutes times the maximum number of segments per reducer.
      - Had anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer.
      - Often ran in under 2 hours.
  38. Reducer R Process?
      - There are at least three ways to tie the MapReduce process to the R process:
      - MACHINE: one instance of the R process on each data node (or n per node).
      - REDUCER: one instance of the R process bound to each reducer.
      - SEGMENT: instances launched by the reducers as necessary (when keys are reduced).
  39. Tradeoffs
      - To prevent bottlenecks, you need to have a general idea of:
      - how long the records for a key take to be reduced;
      - how long the application takes to process a segment;
      - how many keys are seen per reducer.
  40. Thank You!