The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5

2,641 views

Published on

This is a talk I gave in San Diego on July 29, 2009 explaining some of the impact and some of the opportunities of cloud computing on predictive analytics.

Published in: Technology
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,641
On SlideShare
0
From Embeds
0
Number of Embeds
26
Actions
Shares
0
Downloads
192
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5

  1. 1. From Data to Decisions: New Strategies for Deploying Analytics Using Clouds <br />Robert Grossman<br />Open Data Group<br />July 29, 2009 <br />
  2. 2. Analytic Strategy<br />Overview<br />Analytics <br />Analytic Infrastructure<br />Cloud computing has changed analytic infrastructure and enabled new classes of analytic algorithms. It’s time to rethink your analytic strategy.<br />
  3. 3. Part 1Quick Review of Clouds<br />3<br />
  4. 4. What is a Cloud?<br />Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.<br />No standard definition.<br />Cloud architectures are not new.<br />What is new:<br />Scale<br />Ease of use<br />Pricing model.<br />4<br />
  5. 5. 5<br />Scale is new.<br />
  6. 6. Elastic, Usage Based Pricing Is New<br />6<br />costs the same as<br />1 computer in a rack for 120 hours<br />120 computers in three racks for 1 hour<br /><ul><li> Elastic, usage based pricing turns capex into opex.
  7. 7. Clouds can manage surges in computing needs.</li></li></ul><li>Simplicity Offered By the Cloud is New<br />7<br />+<br />.. and you have a computer ready to work.<br />A new programmer can develop a program to process a container full of data with less than day of training using MapReduce.<br />
  8. 8. Two Types of Clouds<br />On-demand resources & services over a network at the scale of a data center<br />On-demand computing instances (IaaS)<br />IaaS: Amazon EC2, S3, etc.; Eucalyptus<br />supports many Web 2.0 applications/users<br />On-demand cloud services for large data cloud applications (PaaS for large data clouds)<br />GFS/MapReduce/Bigtable, Hadoop, Sector, …<br />Manage and compute with large data (say 10+ TB)<br />8<br />
  9. 9. Cloud Architectures – How Do You Fill a Data Center?<br />on-demand computing capacity<br />App<br />App<br />App<br />App<br />App<br />on-demand computing instances<br />Cloud Data Services (BigTable, etc.) <br />Quasi-relational Data Services<br />App<br />App<br />…<br />Cloud Compute Services (MapReduce & Generalizations)<br />App<br />App<br />App<br />App<br />App<br />Cloud Storage Services<br />
  10. 10. What is Analytic Infrastructure ...<br />10<br />Part 2 <br />… and why you should care.<br />
  11. 11. What is Analytics?<br />Short Definition<br />Using data to make decisions.<br />Longer Definition<br />Using data to take actions and make decisions using models that are empirically derived and statistically valid. <br />It is important to understand the difference between reporting and analytics.<br />11<br />
  12. 12. 12<br />Direct Marketing Models<br />Risk Models<br />Online Models<br />
  13. 13. What is the Size of Your Data?<br />Small<br />Fits into memory<br />Medium<br />Too large for memory<br />But fits into a database<br />N.B. databases are designed for safe writing of rows<br />Large<br />To large for a database<br />But can use specialized file system (column-wise)<br />Or storage cloud (Google File System, Hadoop DFS)<br />13<br />
  14. 14. (Very Simplified) Architectural View<br />14<br />Model Producer<br />PMML<br />Model<br />Data<br />The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org).<br />With PMML, it is easy to move models between applications and platforms.<br />
  15. 15. (Simplified) Architectural View<br />15<br />algorithms to estimate models<br />Model Producer<br />Data<br />Data Pre-processing<br />features<br />PMML also supports XML elements to describe data preprocessing.<br />PMML<br />Model<br />
  16. 16. Three Important Interfaces<br />16<br />Modeling Environment<br />2<br />1<br />1<br />Model Producer<br />Data<br />Data Pre-processing<br />PMML<br />Model<br />Deployment Environment <br />2<br />PMML<br />Model<br />3<br />3<br />1<br />Model Consumer<br />Post Processing<br />data<br />actions<br />scores<br />
  17. 17. Actually, This is a Typically a Component in a Workflow<br />17<br />
  18. 18. With the proper analytic infrastructure, cloud computing can be used for data preprocessing, for scoring, for producing models, and as a platform for other services in the analytic infrastructure.<br />18<br />
  19. 19. Cloud Programming Models for Working With Large Data<br />19<br />Part 3 <br />
  20. 20. Map-Reduce Example<br />Both input & output are (key, value) pairs<br />Input is file with one document per record<br />User specifies map function<br />key = document URL<br />Value = terms that document contains<br />“it”, 1“was”, 1“the”, 1“best”, 1<br />(“doc cdickens”,“it was the best of times”)<br />map<br />
  21. 21. Example (cont’d)<br />MapReduce library gathers together all pairs with the same key value (shuffle/sort phase)<br />The user-defined reduce function combines all the values associated with the same key<br />key = “it”values = 1, 1<br />“it”, 2“was”, 2“best”, 1“worst”, 1<br />key = “was”values = 1, 1<br />reduce<br />key = “best”values = 1<br />key = “worst”values = 1<br />
  22. 22. Using Clouds for Scoring (Model Consumers)<br />22<br />Part 4<br />
  23. 23. What is a Statistical/Data Mining Model?<br />Infrastructure<br />Inputs: data attributes, mining attributes<br />Outputs, targets<br />Transformations<br />Segmented models, ensembles of models<br />Models that are part of a standard<br />Trees, SVMs, neural networks, cluster models, etc.<br />In this case, only need to specify parameters<br />Arbitrary models<br />e.g. arbitrary code that takes inputs to outputs<br />23<br />
  24. 24. From an Architectural Viewpoint<br />In an operational environment in which models are being deployed, it may be useful to “Just so no to viewing models as arbitrary code”<br />The deployment can be much shorter if a scoring engine reads a PMML file instead of integrating a new piece of code containing a model.<br />24<br />
  25. 25. Model Producers/Consumers in Clouds<br />Model Consumers take analytic models and use them to score data<br />Very easy to deploy in a cloud<br />Deploy a scoring engine in a cloud and then simply read PMML files<br />Very easy to scale up with cloud surges<br />Model Producers take data and produce models<br />Data parallel applications can be ported to clouds.<br />Others require weighing several factors.<br />25<br />
  26. 26. 26<br />Modeling can be done in-house.<br />Sometimes it makes sense to the pre-processing in the cloud, especially if the data is there.<br />Model Producer<br />Data<br />Data Pre-processing<br />PMML<br />Model<br />PMML<br />Model<br />Scoring engine deployed in a cloud. <br />Model Consumer<br />Post Processing<br />data<br />actions<br />scores<br />
  27. 27. Summary<br />
  28. 28. For More Information<br />Contact information: Robert Grossman<br />blog.rgrossman.com<br />www.rgrossman.com<br />28<br />www.opendatagroup.com<br />

×