From Data to Decisions: New Strategies for Deploying Analytics Using Clouds  Robert Grossman Open Data Group July 29, 2009
Overview: Analytic Strategy, Analytics, Analytic Infrastructure. Cloud computing has changed analytic infrastructure and enabled new classes of analytic algorithms. It’s time to rethink your analytic strategy.
Part 1: Quick Review of Clouds
What is a Cloud? Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center. There is no standard definition. Cloud architectures are not new. What is new: scale, ease of use, and the pricing model.
Scale is new.
Elastic, Usage-Based Pricing Is New: 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
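The arithmetic behind the pricing slide can be sketched as follows. This is a minimal illustration, assuming a flat per-computer-hour price; the rate of 10 cents per hour is a made-up figure, not one quoted in the talk, and real providers may price differently.

```python
def computer_hours(machines: int, hours: int) -> int:
    """Total usage billed under a purely usage-based pricing model."""
    return machines * hours


def cost_in_cents(machines: int, hours: int, cents_per_computer_hour: int) -> int:
    """Cost when every computer-hour is billed at the same flat rate."""
    return computer_hours(machines, hours) * cents_per_computer_hour


# Hypothetical flat rate, chosen only for illustration.
RATE_CENTS = 10

one_machine_long_run = cost_in_cents(machines=1, hours=120, cents_per_computer_hour=RATE_CENTS)
many_machines_short_run = cost_in_cents(machines=120, hours=1, cents_per_computer_hour=RATE_CENTS)

print(one_machine_long_run, many_machines_short_run)
```

Both configurations consume 120 computer-hours, so under this assumption they cost the same, while the second finishes in one hour instead of five days.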
Clouds can manage surges in computing needs.
Two Types of Clouds: On-demand resources and services over a network at the scale of a data center. (1) On-demand computing instances (IaaS): Amazon EC2, S3, etc.; Eucalyptus. Supports many Web 2.0 applications/users. (2) On-demand cloud services for large data cloud applications (PaaS for large data clouds): GFS/MapReduce/Bigtable, Hadoop, Sector, … Manage and compute with large data (say 10+ TB).
Cloud Architectures – How Do You Fill a Data Center? [Diagram: on one side, apps running on on-demand computing instances (on-demand computing capacity); on the other, apps layered on Cloud Data Services (BigTable, etc.) and quasi-relational data services, Cloud Compute Services (MapReduce and generalizations), and Cloud Storage Services.]
Part 2: What is Analytic Infrastructure ... and why you should care.
What is Analytics? Short definition: using data to make decisions. Longer definition: using data to take actions and make decisions using models that are empirically derived and statistically valid. It is important to understand the difference between reporting and analytics.
Examples: Direct Marketing Models, Risk Models, Online Models.
What is the Size of Your Data? Small: fits into memory. Medium: too large for memory, but fits into a database (N.B. databases are designed for safe writing of rows). Large: too large for a database, but can use a specialized file system (column-wise) or a storage cloud (Google File System, Hadoop DFS).
(Very Simplified) Architectural View: Data feeds a Model Producer, which outputs a PMML Model. The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org). With PMML, it is easy to move models between applications and platforms.
(Simplified) Architectural View: Data goes through Data Pre-processing to produce features; the Model Producer applies algorithms to estimate models and outputs a PMML Model. PMML also supports XML elements to describe data preprocessing.
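The model-producer side of the architectural view can be sketched as follows. This is a minimal illustration, not the author's tooling: it fits a one-variable least-squares line and writes it out using the PMML element names `RegressionModel`, `RegressionTable`, and `NumericPredictor`. The output is deliberately simplified (no namespace, no `Header`), so it is not schema-valid PMML; the function names `fit_line` and `to_pmml` and the toy data are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def to_pmml(intercept, slope, predictor="x", target="y"):
    """Serialize the fitted line as a simplified PMML-style document."""
    pmml = ET.Element("PMML", version="4.4")
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name in (predictor, target):
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name=predictor)
    ET.SubElement(schema, "MiningField", name=target, usageType="target")
    table = ET.SubElement(model, "RegressionTable", intercept=repr(intercept))
    ET.SubElement(table, "NumericPredictor", name=predictor,
                  coefficient=repr(slope))
    return ET.tostring(pmml, encoding="unicode")


# Toy training data (hypothetical): points on the line y = 1 + 2x.
intercept, slope = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
pmml_text = to_pmml(intercept, slope)
print(pmml_text)
```

The point of the sketch is that the producer's output is a document describing parameters, not executable code, which is what makes the model portable between applications.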
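Three Important Interfaces: [Diagram: in the Modeling Environment, Data flows through Data Pre-processing to a Model Producer, which outputs a PMML Model; in the Deployment Environment, a Model Consumer reads the PMML Model, takes data, and produces scores that Post Processing turns into actions. The labels 1, 2, and 3 mark the three interfaces.]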
Actually, This is Typically a Component in a Workflow
With the proper analytic infrastructure, cloud computing can be used for data preprocessing, for scoring, for producing models, and as a platform for other services in the analytic infrastructure.
Part 3: Cloud Programming Models for Working With Large Data
Map-Reduce Example: Both input and output are (key, value) pairs. The input is a file with one document per record. The user specifies a map function, where key = document URL and value = terms that the document contains. For example, map applied to (“doc cdickens”, “it was the best of times”) emits pairs such as (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1).
Example (cont’d): The MapReduce library gathers together all pairs with the same key (shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key. For example, given key = “it”, values = 1, 1; key = “was”, values = 1, 1; key = “best”, values = 1; key = “worst”, values = 1, the reduce step outputs (“it”, 2), (“was”, 2), (“best”, 1), (“worst”, 1).
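The map, shuffle/sort, and reduce phases described in the example can be sketched in a single process as follows. This is a minimal illustration of the programming model, not the API of Hadoop or any other MapReduce library; the function names are made up for the example, and the second half of the input sentence is an assumption added so that “worst” appears, as it does in the slide's reduce output.

```python
from collections import defaultdict


def map_fn(doc_key, text):
    """Map: emit a (term, 1) pair for every term in the document."""
    for term in text.split():
        yield term, 1


def shuffle(pairs):
    """Shuffle/sort: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(term, counts):
    """Reduce: combine all values for one key into a single count."""
    return term, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce over (key, text) records."""
    mapped = [pair for key, text in documents for pair in map_fn(key, text)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(term, counts) for term, counts in grouped.items())


# One record per document; the key stands in for the document URL.
documents = [
    ("doc cdickens", "it was the best of times it was the worst of times"),
]
counts = word_count(documents)
print(counts)
```

In a real deployment the map and reduce calls run on many machines and the library performs the shuffle over the network; only the two user-defined functions would carry over from this sketch.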
Part 4: Using Clouds for Scoring (Model Consumers)
What is a Statistical/Data Mining Model? Infrastructure: inputs (data attributes, mining attributes); outputs and targets; transformations; segmented models and ensembles of models. Models that are part of a standard: trees, SVMs, neural networks, cluster models, etc. In this case, you only need to specify parameters. Arbitrary models: e.g. arbitrary code that takes inputs to outputs.
From an Architectural Viewpoint: In an operational environment in which models are being deployed, it may be useful to “just say no to viewing models as arbitrary code.” The deployment can be much shorter if a scoring engine reads a PMML file instead of integrating a new piece of code containing a model.
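The idea of a scoring engine that reads a model file rather than integrating model code can be sketched as follows. This is a minimal illustration under stated assumptions: the embedded document uses PMML's `RegressionTable` and `NumericPredictor` element names but omits the namespace and header, so it is a simplified stand-in for a real PMML file, and `load_model` is a hypothetical helper, not part of any PMML library.

```python
import xml.etree.ElementTree as ET

# Simplified PMML-style regression model (assumed for illustration).
PMML_TEXT = """<PMML version="4.4">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_model(pmml_text):
    """Parse the model parameters and return a scoring function."""
    root = ET.fromstring(pmml_text)
    table = root.find("./RegressionModel/RegressionTable")
    if table is None:
        raise ValueError("no RegressionTable found in the model document")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }

    def score(record):
        # Linear score: intercept plus coefficient times each input field.
        return intercept + sum(c * record[name] for name, c in coefficients.items())

    return score


score = load_model(PMML_TEXT)
scores = [score(record) for record in [{"x": 0.0}, {"x": 3.0}]]
print(scores)
```

Deploying a new model here means handing the engine a different document; the engine code itself does not change, which is the reason the slide gives for shorter deployments.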
Model Producers/Consumers in Clouds: Model Consumers take analytic models and use them to score data. They are very easy to deploy in a cloud: deploy a scoring engine in a cloud and then simply read PMML files. They are also very easy to scale up with cloud surges. Model Producers take data and produce models. Data parallel applications can be ported to clouds. Others require weighing several factors.
Modeling can be done in-house. Sometimes it makes sense to do the pre-processing in the cloud, especially if the data is there. [Diagram: Data flows through Data Pre-processing to a Model Producer, which outputs a PMML Model; the PMML Model is read by a scoring engine deployed in a cloud, where the Model Consumer takes data and produces scores that Post Processing turns into actions.]
Summary
For More Information: Contact information: Robert Grossman, blog.rgrossman.com, www.rgrossman.com, www.opendatagroup.com

The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
