This is a version of a talk that I have given a few times recently.


- 1. Sawmill: Some Lessons Learned Running R in Large Data Clouds
  Robert Grossman, Open Data Group
- 2. What Do You Do if Your Data is too Big for a Database?
  - Give up and invoke sampling.
  - Buy a proprietary system and ask for a raise.
  - Begin to build a custom system and explain why it is not yet done.
  - Use Hadoop.
  - Use an alternative large data cloud (e.g., Sector).
- 3. Basic Idea
  - Turn it into a pleasantly parallel problem.
  - Use a large data cloud to manage and prepare the data.
  - Use a Map/Bucket function to split the job.
  - Run R on each piece using Reduce/UDF or streams.
  - Use PMML multiple models to glue the pieces together.
- 4. Why Listen?
  - This approach allows you to scale R relatively easily to hundreds of TB to PB.
  - The approach is easy. (A plus: it may look hard to your colleagues, boss, or clients.)
  - There is at least an order of magnitude of performance to be gained with the right design.
- 5. Part 1: Stacks for Big Data
- 6. The Google Data Stack
  - The Google File System (2003)
  - MapReduce: Simplified Data Processing… (2004)
  - BigTable: A Distributed Storage System… (2006)
- 7. Map-Reduce Example
  Input is a file with one document per record. The user specifies a map function:
    key = document URL
    value = terms that the document contains
  map: ("doc cdickens", "it was the best of times") → "it", 1; "was", 1; "the", 1; "best", 1; …
- 8. Example (cont'd)
  The MapReduce library gathers together all pairs with the same key (shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key.
  reduce:
    key = "it", values = 1, 1 → "it", 2
    key = "was", values = 1, 1 → "was", 2
    key = "best", values = 1 → "best", 1
    key = "worst", values = 1 → "worst", 1
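The word-count example on the two slides above can be sketched in a few lines of Python; `map_fn`, `shuffle`, and `reduce_fn` are illustrative stand-ins for the stages that the MapReduce library actually distributes across many machines:

```python
from collections import defaultdict

def map_fn(url, text):
    # Emit (term, 1) for every term in the document.
    return [(term, 1) for term in text.split()]

def shuffle(pairs):
    # Group values by key, as the library's shuffle/sort phase does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Combine all counts emitted for one term.
    return key, sum(values)

docs = [("doc cdickens", "it was the best of times"),
        ("doc cdickens2", "it was the worst of times")]
pairs = [kv for url, text in docs for kv in map_fn(url, text)]
counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
# counts["it"] == 2, counts["best"] == 1, counts["worst"] == 1
```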
- 9. Applying MapReduce to the Data in a Storage Cloud (diagram: map → shuffle/reduce)
- 10. Google's Large Data Cloud (Google's Stack)
  - Applications
  - Compute Services: Google's MapReduce
  - Data Services: Google's BigTable
  - Storage Services: Google File System (GFS)
- 11. Hadoop's Large Data Cloud (Hadoop's Stack)
  - Applications
  - Compute Services: Hadoop's MapReduce
  - Data Services: NoSQL Databases
  - Storage Services: Hadoop Distributed File System (HDFS)
- 12. Amazon-Style Data Cloud
  - Load Balancer
  - Simple Queue Service
  - SDB
  - EC2 Instances
  - S3 Storage Services
- 13. Sector's Large Data Cloud (Sector's Stack)
  - Applications
  - Compute Services: Sphere's UDFs
  - Data Services
  - Storage Services: Sector's Distributed File System (SDFS)
  - Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
- 14. Apply User Defined Functions (UDFs) to Files in a Storage Cloud (diagram: map/UDF → shuffle-reduce/UDF)
- 15. Folklore
  - MapReduce is great.
  - But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
  - And often it is easier to use Hadoop streams, Sector streams, etc.
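A streaming job runs any executable that reads lines on stdin and writes key–tab–value lines to stdout, which is how an R script (or anything else) gets plugged in. A minimal sketch of that contract, written in Python for brevity (the function names are illustrative):

```python
from itertools import groupby

def mapper(lines):
    # Emit "key<TAB>value" lines: here, one count per term.
    for line in lines:
        for term in line.split():
            yield f"{term}\t1"

def reducer(lines):
    # Hadoop streaming delivers mapper output sorted by key, so
    # consecutive lines with the same key form one group.
    parsed = (line.split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

out = list(reducer(sorted(mapper(["it was the best of times",
                                  "it was the worst of times"]))))
```

In a real job each function would sit in its own small script handed to Hadoop streaming via its `-mapper` and `-reducer` options; the same contract lets an R script consume a segment's records from stdin.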
- 16. Sphere UDF vs. MapReduce
- 17. Terasort Benchmark
  Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
- 18. MalStone (diagram: entities, sites, and times dk-2, dk-1, dk)
- 19. MalStone
  Sector/Sphere 1.20, Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.
- 20. Part 2: Predictive Model Markup Language
- 21. Problems Deploying Models
  - Models are deployed in proprietary formats.
  - Models are application dependent.
  - Models are system dependent.
  - Models are architecture dependent.
  - The time required to deploy models and to integrate them with other applications can be long.
- 22. Predictive Model Markup Language (PMML)
  - Based on XML
  - Benefits of PMML:
    - Open standard for data mining and statistical models
    - Not concerned with the process of creating a model
    - Provides independence from application, platform, and operating system
    - Simplifies use of data mining models by other applications (consumers of data mining models)
- 23. PMML Document Components
  - Data dictionary
  - Mining schema
  - Transformation dictionary
  - Multiple models, including segments and ensembles
  - Model verification, …
  - Univariate statistics (ModelStats)
  - Optional extensions
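A minimal sketch of how these components fit together in a PMML document, built here with Python's standard library; the regression model, field names, and coefficients are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A toy PMML document: data dictionary, then one model with its
# mining schema. (Field names and coefficients are made up.)
pmml = ET.Element("PMML", version="4.0")
ET.SubElement(pmml, "Header", description="toy model")

dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="x", optype="continuous", dataType="double")
ET.SubElement(dd, "DataField", name="y", optype="continuous", dataType="double")

model = ET.SubElement(pmml, "RegressionModel",
                      functionName="regression", modelName="toy")
schema = ET.SubElement(model, "MiningSchema")
ET.SubElement(schema, "MiningField", name="x")
ET.SubElement(schema, "MiningField", name="y", usageType="predicted")
table = ET.SubElement(model, "RegressionTable", intercept="1.5")
ET.SubElement(table, "NumericPredictor", name="x", coefficient="0.8")

doc = ET.tostring(pmml, encoding="unicode")
```

Any consumer that understands the schema can then score data against the model without knowing which application produced it.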
- 24. PMML Models
  - polynomial regression
  - logistic regression
  - general regression
  - center-based clusters
  - density-based clusters
  - trees
  - associations
  - neural nets
  - naïve Bayes
  - sequences
  - text models
  - support vector machines
  - rulesets
- 25. PMML Producers & Consumers
  - Modeling environment: data → pre-processing → model producer → PMML model
  - Deployment environment: PMML model → model consumer → post-processing → scores, rules, actions
- 32. Part 3: Sawmill
- 33. Step 1: Preprocess the data using MapReduce or UDFs.
  Step 2: Invoke R on each segment/bucket and build a PMML model.
  Step 3: Gather the models together to form a multiple-model PMML file.
- 34. Building: Step 1: Preprocess data using MapReduce or UDFs. Step 2: Build a separate model in each segment using R.
  Scoring: Step 1: Preprocess data using MapReduce or UDFs. Step 2: Score the data in each segment using R.
- 35. Sawmill Summary
  - Use Hadoop MapReduce or Sector UDFs to preprocess the data.
  - Use Hadoop Map or Sector buckets to segment the data to gain parallelism.
  - Build a separate statistical model for each segment using R and Hadoop/Sector streams.
  - Use the multiple models specification in PMML version 4.0 to specify the segmentation.
  - Example: use a Hadoop Map function to send all data for each web site to a different segment (on a different processor).
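The steps in the summary above can be sketched end to end. In this illustrative Python sketch a per-segment mean stands in for the R model fit, and a plain dictionary stands in for the multiple-model PMML file; the site names and data are invented:

```python
from collections import defaultdict
from statistics import mean

def segment_by_site(records):
    # Map step: key each record by its web site so that all of a
    # site's data lands in the same segment (bucket/reducer).
    segments = defaultdict(list)
    for site, value in records:
        segments[site].append(value)
    return segments

def build_segment_model(values):
    # Stand-in for invoking R on one segment; a real job would fit
    # a statistical model here and export it as PMML.
    return {"mean": mean(values)}

records = [("site-a", 1.0), ("site-b", 4.0), ("site-a", 3.0)]
multiple_model = {site: build_segment_model(vals)
                  for site, vals in segment_by_site(records).items()}
# One model per segment, glued together under a single structure.
```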
- 36. Small Example: Scoring Engine Written in R
  - R processed a typical segment in 20 minutes.
  - Using R to score 2 segments concatenated together took 60 minutes.
  - Using R to score 3 segments concatenated together took 140 minutes.
- 37. With the Sawmill Framework
  - 1 month of data, about 50 GB, hundreds of segments
  - 300 mapper keys / segments
  - Mapping and reducing took less than 2 minutes.
  - Scoring took 20 minutes times the maximum number of segments per reducer.
  - Had anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer.
  - Often ran in under 2 hours.
- 38. Reducer R Process?
  There are at least three ways to tie the MapReduce process to the R process:
  - MACHINE: one instance of the R process on each data node (or n per node)
  - REDUCER: one instance of the R process bound to each reducer
  - SEGMENT: instances can be launched by the reducers as necessary (when keys are reduced)
- 39. Tradeoffs
  You need to have a general idea of:
  - how long the records for a key take to be reduced
  - how long the application takes to process each segment
  - how many keys are seen per reducer
  in order to prevent bottlenecks.
- 40. Thank You!
  www.opendatagroup.com
