Analytics in the Cloud

  1. Analytics in the Cloud. Deepak Singh, Ph.D., Sr. Business Development Manager
  2. Via butteryflysha under a CC-BY license
  3. Image: Simon Cockell under CC-BY
  4. our reality
  5. lots and lots and lots and lots and lots of data
  6. Astronomy, Genome Sequencing, Logs, Clickstreams, Fraud, Sensors, Geolocation
  7. harness data
  8. business insights
  9. better decisions
  10. new business/product opportunities
  11. People drive innovation. Credit: Pieter Musterd under a CC-BY-NC-ND license
  12. Image: Drew Conway
  13. challenges
  14. Your Idea, Successful Product
  15. Your Idea, Successful Product
  16. Your Idea, Successful Product
  17. Data Volume, Data Structures, Data Dimensionality
  18. Resource constraints, Tight Budgets, Undifferentiated Heavy Lifting
  19. enter the cloud
  20. Amazon Web Services
  21. infrastructure services
  22. building blocks
  23. Abstract Resources, On Demand, Programmable, Pay As You Go, Secure, Elastic
  24. [Figure: Predicting Infrastructure Needs. Axes: Compute Power over Time. Labels: Predicted Usage, Actual Usage, Waste, Customer Dissatisfaction]
  25. [Figure: Number of EC2 Instances per day, Wednesday 4/22/2009 to Tuesday 4/28/2009, axis ticks at 300 and 3000. Annotations: "3000 CPUs for one firm's risk management processes"; "300 CPUs on weekends"]
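
The usage pattern on slide 25 can be turned into a rough instance-hour comparison. The sketch below is only an illustration: it assumes the instance count is flat within each day (3000 on weekdays, 300 on the weekend), and the names `daily_instances` and `instance_hours` are made up for this example. It compares paying for what is used against provisioning for the peak all week.

```python
HOURS_PER_DAY = 24

# Daily instance counts read off slide 25 (Wed 4/22/2009 to Tue 4/28/2009).
# Assumption: the count is constant within each day.
daily_instances = {
    "Wednesday": 3000,
    "Thursday": 3000,
    "Friday": 3000,
    "Saturday": 300,
    "Sunday": 300,
    "Monday": 3000,
    "Tuesday": 3000,
}


def instance_hours(counts):
    """Total instance-hours for a sequence of per-day instance counts."""
    return sum(n * HOURS_PER_DAY for n in counts)


# Elastic: pay only for the instances running each day.
elastic_hours = instance_hours(daily_instances.values())

# Fixed: provision for the weekly peak on every day.
peak = max(daily_instances.values())
fixed_hours = instance_hours([peak] * len(daily_instances))

saved_fraction = 1 - elastic_hours / fixed_hours

print(f"elastic: {elastic_hours} instance-hours")
print(f"fixed:   {fixed_hours} instance-hours")
print(f"saved:   {saved_fraction:.1%}")
```

Under these assumptions the saving comes entirely from the two weekend days; a workload with a longer idle period would show a larger gap.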
  26. requirements for analytics
  27. scalable storage
  28. scalable compute
  29. on-demand resources
  30. Amazon S3 (99.999999999); Elastic Block Store (Persistent Attached Storage); Amazon EC2 (Elastic Computing)
  31. database
  32. Amazon RDS
  33. what database do you want?
  34. Oracle Certification, Support and Licensing: all products certified on the Oracle Virtual Machine are now certified on Amazon EC2; certified & supported managed OVM; full support from Oracle and AWS; standard licensing policies apply; AMIs exist today for many Oracle products
  35. Biomarker Warehouse: pre-clinical, clinical, 3rd party data and publications. Estimated cost: 10 TB warehouse over 3 years
  36. reporting & business intelligence
  37. SAP BusinessObjects: Recovery.gov Architecture
  38. new paradigms for data analysis
  39. unconstrained tools for unconstrained growth
  40. Amazon Elastic MapReduce
  41. [Figure: Amazon Elastic MapReduce workflow. Labels: Input Data, Input S3 bucket, Deploy Application, Web Console, Command line tools, Elastic MapReduce, Hadoop on Amazon EC2 Instances, Output S3 bucket, Notify, Get Results, End, Amazon S3]
  42. HDFS
  43. mapreduce
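
Slide 43 names the MapReduce model without showing it. As a minimal single-process sketch (not Hadoop or Elastic MapReduce code), the three phases can be illustrated with a word count in plain Python. The function names `map_phase`, `shuffle`, `reduce_phase` and `run_job`, and the sample input, are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical sample input standing in for lines of a log file.
sample_lines = ["logs and clickstreams", "logs and sensors"]
word_counts = run_job(sample_lines)
print(word_counts)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data between them; here everything runs in one process so the data flow is easy to follow.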
  44. Amazon Elastic MapReduce Goals
  45. handle undifferentiated heavy lifting
  46. integrated with other cloud services
  47. It is Hadoop
  48. Hive, Pig, Cascading
  49. Customers are using it for: targeted advertising / clickstream analysis; data warehousing applications; bioinformatics; financial modeling; file processing; web indexing; data mining and BI
  50. CLICKSTREAM ANALYSIS – RAZORFISH AND BESTBUY. Best Buy came to Razorfish: 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads required per day. Example: a user recently purchased a home theater system and is searching for video games; targeted ad (1.7 million per day). Leveraged AWS and Elastic MapReduce: 100 node cluster on demand; processing time dropped from 2+ days to 8 hours; increased ROAS by 500%
  51. Clickstream Analysis - Architecture
  52. another example: Etsy
  53. Etsy by the numbers, October 2010: $29.9 million in sales; 842 million page views; 434 GB web logs; 97 million total favorite listings
  54. Data collection & EMR integration
  55. Cascading.JRuby
      source 'raw_log_data'
      assembly 'raw_log_data' do
        # Helper method to parse fields from raw web logs: session_id, url parameters, etc.
        parse_log :query_fields => ['affiliate_code', 'sale_confirmation_id']
        # Keep only records that carry an affiliate code.
        branch 'affiliate_events' do
          where 'affiliate_code:string != null'
          rename 'created_at' => 'affiliate_timestamp'
        end
        # Keep only records that carry a sale confirmation.
        branch 'sales_events' do
          where 'sale_confirmation_id:string != null'
          rename 'created_at' => 'sale_timestamp'
        end
      end
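
The branching on slide 55 can be pictured with a plain Python sketch. This is not the Cascading API. It only mirrors the logic the slide describes: records with a non-null field go into a branch, and `created_at` is renamed for that branch. The `branch` helper and the sample records are hypothetical.

```python
def branch(records, field, renamed_timestamp):
    """Keep records where `field` is not null and rename `created_at`."""
    selected = []
    for record in records:
        if record.get(field) is not None:
            row = dict(record)  # copy so the input records stay untouched
            row[renamed_timestamp] = row.pop("created_at")
            selected.append(row)
    return selected


# Hypothetical parsed log records; the field names follow the slide.
sample_records = [
    {"created_at": "t1", "affiliate_code": "abc", "sale_confirmation_id": None},
    {"created_at": "t2", "affiliate_code": None, "sale_confirmation_id": "42"},
    {"created_at": "t3", "affiliate_code": None, "sale_confirmation_id": None},
]

affiliate_events = branch(sample_records, "affiliate_code", "affiliate_timestamp")
sales_events = branch(sample_records, "sale_confirmation_id", "sale_timestamp")

print(affiliate_events)
print(sales_events)
```

Each branch reads the same parsed input independently, so a record with both fields set would appear in both outputs.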
  56. data warehousing
  57. WEB-SCALE DATA WAREHOUSING
  58. RESIZE RUNNING JOB FLOWS. Use Case: Increase speed of running job flows. Speed up job flow execution in response to changing requirements; dynamically balance cost versus performance without restarting a job. [Figure: Job Flow. Allocate 4 instances (time remaining: 14 hours); expand to 9 instances (time remaining: 7 hours); expand to 25 instances (time remaining: 3 hours)]
  59. Dynamically Resize Job Flows. Use Case: Agile Data Warehouse Cluster. Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight); leverage flexibility to reduce costs and increase cluster utilization. [Figure: Data Warehouse (Steady State), allocate 9 instances; Data Warehouse (Batch Processing), expand to 25 instances; Data Warehouse (Steady State), shrink to 9 instances]
  60. Ecosystem And Tools. Business Intelligence: MicroStrategy, Pentaho. Analytics: Datameer, Karmasphere. Open source: Beeswax …
  61. some practical considerations
  62. Launch and monitor job flows: AWS Management Console; Command line interface; REST API; 3rd party libraries, e.g. MrJob
  63. Hardware requirements for Use Cases. Data/IO Intensive (m1/m2 instances): Data Warehouse; Data Mining (click stream, logs, events, etc.). Compute/IO Intensive (c1, cc1 instances): Credit Ratings; Fraud Models; Portfolio analysis; VaR calculation
  64. additional resources
  65. http://aws.amazon.com/elasticmapreduce
  66. http://aws.amazon.com/articles/Elastic-MapReduce
  67. http://www.youtube.com/user/AmazonWebServices
  68. in summary
  69. Amazon Web Services
  70. Abstract Resources, On Demand, Programmable, Pay As You Go, Secure, Elastic
  71. many database options
  72. Elastic MapReduce
  73. powerful, flexible, elastic analytics and data warehousing
  74. Your Idea, Successful Product
  75. Your Idea, Successful Product
  76. Image: Jeff Hester under a Creative Commons License
  77. AWS will be at Strata 2011
  78. deesingh@amazon.com. Twitter: @mndoci. Inspiration and material from Matt Wood, Peter Sirota & Larry Lessig. Credit: Oberazzi under a CC-BY-NC-SA license
