Sawmill - Integrating R and Large Data Clouds

This is a version of a talk that I have given a few times recently.

Transcript

  • 1. Sawmill: Some Lessons Learned Running R in Large Data Clouds. Robert Grossman, Open Data Group.
  • 2. What Do You Do if Your Data Is Too Big for a Database? Give up and invoke sampling. Buy a proprietary system and ask for a raise. Begin to build a custom system and explain why it is not yet done. Use Hadoop. Use an alternative large data cloud (e.g., Sector).
  • 3. Basic Idea: Turn it into a pleasantly parallel problem. Use a large data cloud to manage and prepare the data. Use a Map/Bucket function to split the job. Run R on each piece using Reduce/UDF or streams. Use PMML multiple models to glue the pieces together.
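To make the pleasantly parallel idea concrete, the following is a minimal in-memory sketch of the pattern in R. It is illustrative only: the data frame and its columns (site, x, y) are invented, and in Sawmill the split is performed by the cloud's map/bucket step rather than in memory.

    # Synthetic data standing in for preprocessed records; 'site' is the segment key.
    df <- data.frame(
      site = rep(c("site-a", "site-b", "site-c"), each = 50),
      x    = rnorm(150),
      y    = rnorm(150)
    )

    # Split by segment key and fit an independent model per segment.
    segments <- split(df, df$site)
    models   <- lapply(segments, function(piece) lm(y ~ x, data = piece))

    # 'models' now holds one fitted model per segment; each can later be exported
    # to PMML and the pieces glued together as a multiple-model PMML file.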
  • 4. Why Listen? This approach allows you to scale R relatively easily from hundreds of TB to PB. The approach is easy. (A plus: it may look hard to your colleagues, boss, or clients.) There is at least an order of magnitude of performance to be gained with the right design.
  • 5. Part 1: Stacks for Big Data.
  • 6. The Google Data Stack: The Google File System (2003); MapReduce: Simplified Data Processing… (2004); BigTable: A Distributed Storage System… (2006).
  • 7. Map-Reduce Example: Input is a file with one document per record. The user specifies the map function: key = document URL, value = terms that the document contains. For example, mapping ("doc cdickens", "it was the best of times") emits ("it", 1), ("was", 1), ("the", 1), ("best", 1), …
  • 8. Example (cont'd): The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase), and the user-defined reduce function combines all the values associated with the same key. For example, key = "it", values = 1, 1 reduces to ("it", 2); likewise ("was", 2), ("best", 1), ("worst", 1).
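The same word-count example can be written as two small R scripts and run over Hadoop streaming, with the mapper and reducer reading stdin and writing tab-separated key/value pairs to stdout. This is an illustrative sketch, not code from the talk; the script names are assumptions.

    #!/usr/bin/env Rscript
    # mapper.R -- emit ("term", 1) for each term of each input document.
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      for (term in strsplit(tolower(line), "[^a-z]+")[[1]]) {
        if (nchar(term) > 0) cat(term, "\t1\n", sep = "")
      }
    }

    #!/usr/bin/env Rscript
    # reducer.R -- streaming delivers pairs sorted by key, so a running count works.
    con <- file("stdin", open = "r")
    key <- NULL; count <- 0
    while (length(line <- readLines(con, n = 1)) > 0) {
      kv <- strsplit(line, "\t")[[1]]
      if (!is.null(key) && kv[1] != key) { cat(key, "\t", count, "\n", sep = ""); count <- 0 }
      key <- kv[1]; count <- count + as.numeric(kv[2])
    }
    if (!is.null(key)) cat(key, "\t", count, "\n", sep = "")

A typical run passes the two scripts to the Hadoop streaming jar via -mapper and -reducer (shipping them with -file); the exact command line depends on the Hadoop version.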
  • 9. Applying MapReduce to the Data in a Storage Cloud (diagram: a map phase followed by a shuffle/reduce phase applied to files in the storage cloud).
  • 10. Google's Large Data Cloud (Google's stack): Applications; Compute Services: Google's MapReduce; Data Services: Google's BigTable; Storage Services: Google File System (GFS).
  • 11. Hadoop's Large Data Cloud (Hadoop's stack): Applications; Compute Services: Hadoop's MapReduce; Data Services: NoSQL databases; Storage Services: Hadoop Distributed File System (HDFS).
  • 12. Amazon-Style Data Cloud (diagram: a load balancer and the Simple Queue Service in front of pools of EC2 instances, with SDB and S3 storage services behind them).
  • 13. Sector's Large Data Cloud (Sector's stack): Applications; Compute Services: Sphere's UDFs; Data Services; Storage Services: Sector's Distributed File System (SDFS); Routing & Transport Services: UDP-based Data Transport Protocol (UDT).
  • 14. Apply User-Defined Functions (UDFs) to Files in the Storage Cloud (diagram: UDFs take the place of both the map and the shuffle/reduce phases over files in the storage cloud).
  • 15. Folklore: MapReduce is great. But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds. And often it is easier to use Hadoop streams, Sector streams, etc.
  • 16. Sphere UDF vs. MapReduce.
  • 17. Terasort Benchmark: Sector/Sphere 1.24a and Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
  • 18. MalStone (diagram: entities visiting sites over time, at times dk-2, dk-1, dk).
  • 19. MalStone: Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.
  • 20. Part 2: Predictive Model Markup Language.
  • 21. Problems Deploying Models: Models are deployed in proprietary formats. Models are application dependent, system dependent, and architecture dependent. The time required to deploy models and to integrate them with other applications can be long.
  • 22. Predictive Model Markup Language (PMML): Based on XML. Benefits of PMML: an open standard for data mining and statistical models; not concerned with the process of creating a model; provides independence from application, platform, and operating system; simplifies the use of data mining models by other applications (consumers of data mining models).
  • 23. PMML Document Components: data dictionary; mining schema; transformation dictionary; multiple models, including segments and ensembles; model verification, …; univariate statistics (ModelStats); optional extensions.
  • 24. PMML Models: polynomial regression, logistic regression, general regression, center-based clusters, density-based clusters, trees, associations, neural nets, naïve Bayes, sequences, text models, support vector machines, rulesets.
  • 25. PMML Producers and Consumers (diagram): in the modeling environment, data is pre-processed and a model producer exports the fitted model as a PMML document; in the deployment environment, a model consumer reads the PMML document, scores incoming data, and post-processing turns the scores into actions and rules.
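R can act as the model producer in this picture. Assuming the CRAN pmml package (and the XML package it builds on) is available, a fitted R model can be written out as a PMML document for a downstream consumer; the toy linear model and file name below are only for illustration.

    library(pmml)   # converts many fitted R model objects to PMML
    library(XML)    # saveXML() writes the generated document to disk

    # Any supported model works; a toy linear model on synthetic data is used here.
    d   <- data.frame(x = rnorm(100), y = rnorm(100))
    fit <- lm(y ~ x, data = d)

    # Produce the PMML document and hand it to whatever consumer does the scoring.
    saveXML(pmml(fit), file = "example-model.xml")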
  • 26. Part 3: Sawmill.
  • 27. Step 1: preprocess the data using MapReduce or a UDF. Step 2: invoke R on each segment/bucket and build a PMML model. Step 3: gather the models together to form a multiple-model PMML file.
  • 28. The same pattern covers both building and scoring. Building: Step 1, preprocess the data using MapReduce or a UDF; Step 2, build a separate model in each segment using R. Scoring: Step 1, preprocess the data using MapReduce or a UDF; Step 2, score the data in each segment using R.
  • 29. Sawmill Summary: Use Hadoop MapReduce or Sector UDFs to preprocess the data. Use Hadoop Map or Sector buckets to segment the data to gain parallelism. Build a separate statistical model for each segment using R and Hadoop/Sector streams. Use the multiple-models specification in PMML version 4.0 to specify the segmentation. Example: use the Hadoop Map function to send all the data for each web site to a different segment (on a different processor).
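As one concrete, hypothetical rendering of the middle step, the reducer below is a Hadoop-streaming R script that collects all records for a segment key, fits a model, and writes a per-segment PMML file; the per-segment files would then be gathered into a single multiple-model PMML document. The input format (key, then comma-separated x and y) and the linear model are assumptions for the sketch, and it corresponds to the REDUCER binding described on the "Reducer R Process?" slide below.

    #!/usr/bin/env Rscript
    # segment_reducer.R -- one R process per reducer; builds one PMML model per segment key.
    # Assumed input lines:  <segment-key> \t <x>,<y>
    library(pmml); library(XML)

    fit_and_save <- function(key, rows) {
      d <- as.data.frame(do.call(rbind, rows))
      names(d) <- c("x", "y")
      saveXML(pmml(lm(y ~ x, data = d)), file = paste0("model-", key, ".xml"))
    }

    con <- file("stdin", open = "r")
    key <- NULL; rows <- list()
    while (length(line <- readLines(con, n = 1)) > 0) {
      kv  <- strsplit(line, "\t")[[1]]
      val <- as.numeric(strsplit(kv[2], ",")[[1]])
      if (!is.null(key) && kv[1] != key) {   # key changed: this segment is complete
        fit_and_save(key, rows); rows <- list()
      }
      key <- kv[1]
      rows[[length(rows) + 1]] <- val
    }
    if (!is.null(key)) fit_and_save(key, rows)  # flush the final segment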
  • 30. Small Example: Scoring Engine Written in R. R processed a typical segment in 20 minutes. Using R to score 2 segments concatenated together took 60 minutes; 3 segments concatenated together took 140 minutes.
  • 31. With the Sawmill Framework: 1 month of data, about 50 GB, hundreds of segments; 300 mapper keys/segments. Mapping and reducing took under 2 minutes. Scoring took 20 minutes times the maximum number of segments per reducer. There were anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer. The job often ran in under 2 hours.
  • 32. Reducer R Process? There are at least three ways to tie the MapReduce process to the R process. MACHINE: one instance of the R process on each data node (or n per node). REDUCER: one instance of the R process bound to each reducer. SEGMENT: instances can be launched by the reducers as necessary (when keys are reduced).
  • 33. Tradeoffs: In order to prevent bottlenecks, you need to have a general idea of how long the records for a key take to be reduced, how long the application takes to process a segment, and how many keys are seen per reducer.
  • 34. Thank You! www.opendatagroup.com
