The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    1 Favorite

    The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5 - Presentation Transcript

    1. From Data to Decisions: New Strategies for Deploying Analytics Using Clouds
      Robert Grossman
      Open Data Group
      July 29, 2009
    2. Analytic Strategy
      Overview
      Analytics
      Analytic Infrastructure
      Cloud computing has changed analytic infrastructure and enabled new classes of analytic algorithms. It’s time to rethink your analytic strategy.
    3. Part 1Quick Review of Clouds
      3
    4. What is a Cloud?
      Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.
      No standard definition.
      Cloud architectures are not new.
      What is new:
      Scale
      Ease of use
      Pricing model.
      4
    5. 5
      Scale is new.
    6. Elastic, Usage Based Pricing Is New
      6
      costs the same as
      1 computer in a rack for 120 hours
      120 computers in three racks for 1 hour
      • Elastic, usage based pricing turns capex into opex.
      • Clouds can manage surges in computing needs.
    7. Simplicity Offered By the Cloud is New
      7
      +
      .. and you have a computer ready to work.
      A new programmer can develop a program to process a container full of data with less than day of training using MapReduce.
    8. Two Types of Clouds
      On-demand resources & services over a network at the scale of a data center
      On-demand computing instances (IaaS)
      IaaS: Amazon EC2, S3, etc.; Eucalyptus
      supports many Web 2.0 applications/users
      On-demand cloud services for large data cloud applications (PaaS for large data clouds)
      GFS/MapReduce/Bigtable, Hadoop, Sector, …
      Manage and compute with large data (say 10+ TB)
      8
    9. Cloud Architectures – How Do You Fill a Data Center?
      on-demand computing capacity
      App
      App
      App
      App
      App
      on-demand computing instances
      Cloud Data Services (BigTable, etc.)
      Quasi-relational Data Services
      App
      App

      Cloud Compute Services (MapReduce & Generalizations)
      App
      App
      App
      App
      App
      Cloud Storage Services
    10. What is Analytic Infrastructure ...
      10
      Part 2
      … and why you should care.
    11. What is Analytics?
      Short Definition
      Using data to make decisions.
      Longer Definition
      Using data to take actions and make decisions using models that are empirically derived and statistically valid.
      It is important to understand the difference between reporting and analytics.
      11
    12. 12
      Direct Marketing Models
      Risk Models
      Online Models
    13. What is the Size of Your Data?
      Small
      Fits into memory
      Medium
      Too large for memory
      But fits into a database
      N.B. databases are designed for safe writing of rows
      Large
      To large for a database
      But can use specialized file system (column-wise)
      Or storage cloud (Google File System, Hadoop DFS)
      13
    14. (Very Simplified) Architectural View
      14
      Model Producer
      PMML
      Model
      Data
      The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org).
      With PMML, it is easy to move models between applications and platforms.
    15. (Simplified) Architectural View
      15
      algorithms to estimate models
      Model Producer
      Data
      Data Pre-processing
      features
      PMML also supports XML elements to describe data preprocessing.
      PMML
      Model
    16. Three Important Interfaces
      16
      Modeling Environment
      2
      1
      1
      Model Producer
      Data
      Data Pre-processing
      PMML
      Model
      Deployment Environment
      2
      PMML
      Model
      3
      3
      1
      Model Consumer
      Post Processing
      data
      actions
      scores
    17. Actually, This is a Typically a Component in a Workflow
      17
    18. With the proper analytic infrastructure, cloud computing can be used for data preprocessing, for scoring, for producing models, and as a platform for other services in the analytic infrastructure.
      18
    19. Cloud Programming Models for Working With Large Data
      19
      Part 3
    20. Map-Reduce Example
      Both input & output are (key, value) pairs
      Input is file with one document per record
      User specifies map function
      key = document URL
      Value = terms that document contains
      “it”, 1“was”, 1“the”, 1“best”, 1
      (“doc cdickens”,“it was the best of times”)
      map
    21. Example (cont’d)
      MapReduce library gathers together all pairs with the same key value (shuffle/sort phase)
      The user-defined reduce function combines all the values associated with the same key
      key = “it”values = 1, 1
      “it”, 2“was”, 2“best”, 1“worst”, 1
      key = “was”values = 1, 1
      reduce
      key = “best”values = 1
      key = “worst”values = 1
    22. Using Clouds for Scoring (Model Consumers)
      22
      Part 4
    23. What is a Statistical/Data Mining Model?
      Infrastructure
      Inputs: data attributes, mining attributes
      Outputs, targets
      Transformations
      Segmented models, ensembles of models
      Models that are part of a standard
      Trees, SVMs, neural networks, cluster models, etc.
      In this case, only need to specify parameters
      Arbitrary models
      e.g. arbitrary code that takes inputs to outputs
      23
    24. From an Architectural Viewpoint
      In an operational environment in which models are being deployed, it may be useful to “Just so no to viewing models as arbitrary code”
      The deployment can be much shorter if a scoring engine reads a PMML file instead of integrating a new piece of code containing a model.
      24
    25. Model Producers/Consumers in Clouds
      Model Consumers take analytic models and use them to score data
      Very easy to deploy in a cloud
      Deploy a scoring engine in a cloud and then simply read PMML files
      Very easy to scale up with cloud surges
      Model Producers take data and produce models
      Data parallel applications can be ported to clouds.
      Others require weighing several factors.
      25
    26. 26
      Modeling can be done in-house.
      Sometimes it makes sense to the pre-processing in the cloud, especially if the data is there.
      Model Producer
      Data
      Data Pre-processing
      PMML
      Model
      PMML
      Model
      Scoring engine deployed in a cloud.
      Model Consumer
      Post Processing
      data
      actions
      scores
    27. Summary
    28. For More Information
      Contact information: Robert Grossman
      blog.rgrossman.com
      www.rgrossman.com
      28
      www.opendatagroup.com

    + Robert GrossmanRobert Grossman, 3 months ago

    custom

    493 views, 1 favs, 0 embeds more stats

    This is a talk I gave in San Diego on July 29, 2009 more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 493
      • 493 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 1
    • Downloads 49
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories