This is a talk I gave in San Diego on July 29, 2009, explaining some of the impact of cloud computing on predictive analytics and some of the opportunities it creates.
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
1. From Data to Decisions: New Strategies for Deploying Analytics Using Clouds. Robert Grossman, Open Data Group. July 29, 2009.
2. Analytic Strategy Overview (diagram labels: Analytics, Analytic Infrastructure). Cloud computing has changed analytic infrastructure and enabled new classes of analytic algorithms. It’s time to rethink your analytic strategy.
4. What is a Cloud? Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center. There is no standard definition. Cloud architectures are not new. What is new: scale, ease of use, and the pricing model.
8. Two Types of Clouds. Both provide on-demand resources and services over a network at the scale of a data center. (1) On-demand computing instances (IaaS): Amazon EC2, S3, etc.; Eucalyptus. This type supports many Web 2.0 applications and users. (2) On-demand cloud services for large data cloud applications (PaaS for large data clouds): GFS/MapReduce/Bigtable, Hadoop, Sector, … This type is used to manage and compute with large data (say 10+ TB).
9. Cloud Architectures – How Do You Fill a Data Center? [Diagram. Labels: on-demand computing capacity; on-demand computing instances; Apps; Cloud Data Services (BigTable, etc.), quasi-relational data services; Cloud Compute Services (MapReduce & generalizations); Cloud Storage Services.]
10. Part 2: What is Analytic Infrastructure … and why you should care.
11. What is Analytics? Short definition: using data to make decisions. Longer definition: using data to take actions and make decisions using models that are empirically derived and statistically valid. It is important to understand the difference between reporting and analytics.
13. What is the Size of Your Data? Small: fits into memory. Medium: too large for memory, but fits into a database (N.B. databases are designed for safe writing of rows). Large: too large for a database, but can use a specialized file system (column-wise) or a storage cloud (Google File System, Hadoop DFS).
14. (Very Simplified) Architectural View. [Diagram: Data → Model Producer → PMML Model.] The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org). With PMML, it is easy to move models between applications and platforms.
15. (Simplified) Architectural View. [Diagram: Data → Data Pre-processing → features → Model Producer (algorithms to estimate models) → PMML Model.] PMML also supports XML elements to describe data preprocessing.
16. Three Important Interfaces. [Diagram. Modeling Environment: Data → Data Pre-processing → Model Producer → PMML Model. Deployment Environment: data and PMML Model → Model Consumer → scores → Post Processing → actions. The three interfaces are marked 1, 2, and 3 on the diagram.]
18. With the proper analytic infrastructure, cloud computing can be used for data preprocessing, for scoring, for producing models, and as a platform for other services in the analytic infrastructure.
20. Map-Reduce Example. Both input and output are (key, value) pairs. The input is a file with one document per record. The user specifies the map function: key = document URL, value = terms that the document contains. Example: map applied to (“doc cdickens”, “it was the best of times”) emits (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1), and so on.
21. Example (cont’d). The MapReduce library gathers together all pairs with the same key (shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key. Example: given key = “it”, values = 1, 1; key = “was”, values = 1, 1; key = “best”, values = 1; key = “worst”, values = 1, reduce emits (“it”, 2), (“was”, 2), (“best”, 1), (“worst”, 1).
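The map, shuffle/sort, and reduce steps in the example above can be sketched in plain Python. This is a single-process illustration only, not the Hadoop or Google MapReduce API. The names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are made up for the sketch, and the input record is extended with "it was the worst of times" (an assumption) so that the counts agree with those on slide 21.

```python
from collections import defaultdict


def map_fn(doc_url, text):
    """Map step: emit a (term, 1) pair for every term in the document."""
    for term in text.lower().split():
        yield term, 1


def shuffle(pairs):
    """Shuffle/sort step: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(term, counts):
    """Reduce step: combine all values associated with one key."""
    return term, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce over (doc_url, text) records."""
    pairs = (pair for url, text in documents for pair in map_fn(url, text))
    return dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())


# One document per record; the key is the document identifier.
# The second clause of the text is an assumed extension of the slide's record.
documents = [
    ("doc cdickens", "it was the best of times it was the worst of times"),
]
counts = word_count(documents)
```

In a real large data cloud the map and reduce calls would run on many nodes and the shuffle would move data between them; here all three steps run in one process so the data flow is easy to follow.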
23. What is a Statistical/Data Mining Model? Infrastructure: inputs (data attributes, mining attributes); outputs and targets; transformations; segmented models and ensembles of models. Models that are part of a standard: trees, SVMs, neural networks, cluster models, etc. In this case, you only need to specify parameters. Arbitrary models: e.g. arbitrary code that maps inputs to outputs.
24. From an Architectural Viewpoint. In an operational environment in which models are being deployed, it may be useful to “just say no to viewing models as arbitrary code.” Deployment can be much faster if a scoring engine reads a PMML file instead of integrating a new piece of code containing a model.
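To illustrate the idea of a scoring engine that reads a model as data rather than as code, here is a minimal sketch in Python. It handles only a linear `RegressionModel` with numeric predictors, and the PMML text is a stripped-down fragment: a real PMML file also declares the PMML XML namespace and a `Header` element, which are omitted here for brevity. The field names (`age`, `income`, `risk`), the coefficients, and the helper names `load_model` and `score` are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free PMML fragment (assumption: real files carry
# an xmlns declaration and a Header element).
PMML_DOC = """<PMML version="3.2">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_model(pmml_text):
    """Read the intercept and coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Score one record: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_model(PMML_DOC)
result = score(model, {"age": 2.0, "income": 0.25})  # 1.0 + 0.5*2.0 + 2.0*0.25
```

Deploying a new model in this setup means shipping a new XML file; the scoring code itself does not change.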
25. Model Producers/Consumers in Clouds. Model Consumers take analytic models and use them to score data. They are very easy to deploy in a cloud: deploy a scoring engine in a cloud and then simply read PMML files. They are also very easy to scale up with cloud surges. Model Producers take data and produce models. Data-parallel applications can be ported to clouds; others require weighing several factors.
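Scoring scales out easily because each record is scored independently of the others, so the data can be split into partitions and handed to as many workers as are available. The sketch below shows that pattern with local threads standing in for cloud instances (an assumption for illustration). `score_record` is a hypothetical stand-in for a scoring engine with made-up coefficients, and `split` and `score_all` are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def score_record(record):
    """Hypothetical scoring function; a real engine would read a PMML model."""
    return 1.0 + 0.5 * record["age"] + 2.0 * record["income"]


def score_partition(partition):
    """Score one partition; needs no information from other partitions."""
    return [score_record(record) for record in partition]


def split(records, n_parts):
    """Split records into at most n_parts contiguous chunks, keeping order."""
    size = max(1, -(-len(records) // n_parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def score_all(records, n_workers=4):
    """Score partitions in parallel and concatenate results in input order."""
    partitions = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scored = list(pool.map(score_partition, partitions))
    return [value for part in scored for value in part]


records = [{"age": float(i), "income": 1.0} for i in range(10)]
scores = score_all(records)
```

Handling a surge amounts to raising the number of workers; the scoring logic is unchanged.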
26. Modeling can be done in-house. Sometimes it makes sense to do the pre-processing in the cloud, especially if the data is there. The scoring engine is deployed in a cloud. [Diagram: Data → Data Pre-processing → Model Producer → PMML Model; data and PMML Model → Model Consumer → scores → Post Processing → actions.]