Big Data Analytics
Peter Sirota, General Manager, Amazon Elastic MapReduce

Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
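
The session description mentions Amazon Elastic MapReduce and Hadoop, and slides 34 to 40 of the transcript show input data and code flowing into a managed cluster that produces output. As a rough illustration of the map, shuffle, and reduce steps behind that model, here is a minimal word-count sketch in plain Python. It is an assumption-laden teaching example, not the Elastic MapReduce or Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` and the sample input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework runs between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for log lines read from storage.
    sample = ["big data at any scale", "data at rest"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions would run in parallel across many nodes, with the framework handling the shuffle; this sketch runs everything in one process to keep the data flow visible.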

Published in: Technology, Business

Big Data Analytics

  1. Big Data Analytics. Peter Sirota, General Manager, Amazon Elastic MapReduce
  2. Overview: 1. Introducing Big Data; 2. From data to actionable information; 3. Analytics and Cloud Computing; 4. The Big Data ecosystem
  3. Section 1: Introducing Big Data
  4. Generation, Collection & storage, Analytics & computation, Collaboration & sharing
  5. The cost of data generation is falling
  6. Generation: lower cost, higher throughput. Collection & storage, Analytics & computation, Collaboration & sharing
  7. Generation: lower cost, higher throughput. Collection & storage, Analytics & computation, Collaboration & sharing: highly constrained
  8. Data volume: generated data vs. data available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  9. Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
  10. Generation: lower cost, higher throughput. Collection & storage, Analytics & computation, Collaboration & sharing: highly constrained
  11. Generation. Collection & storage, Analytics & computation, Collaboration & sharing: accelerated
  12. Close the gap.
  13. Big Data: technologies and techniques for working productively with data, at any scale.
  14. Section 2: From data to actionable information
  15. “Who buys video games?”
  16. Per day: 3.5 billion records, 13 TB of click stream logs, 71 million unique cookies
  17. Results: 500% return on ad spend, 17,000% reduction in procurement time
  18. “Who is using our service?”
  19. Finding signal in the noise of logs. Identified early mobile usage. Invested heavily in mobile development.
  20. In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions.
  21. Open web index. 3.4 billion records. Available to all.
  22. Full parse for impact of social networks. 300 lines of Ruby code. 14 hours. $100.
  23. Tweeting about Flu. "You Are What You Tweet: Analyzing Twitter for Public Health", M. J. Paul and M. Dredze, 2011
  24. Tweeting about Food: tweets about the price of rice vs. official food price inflation
  25. Section 3: Analytics and Cloud Computing
  26. Generation, Collection & storage, Analytics & computation, Collaboration & sharing
  27. Generation. Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase. Analytics & computation. Collaboration & sharing.
  28. Generation. Collection & storage. Analytics & computation: EC2 & Elastic MapReduce. Collaboration & sharing.
  29. Generation. Collection & storage. Analytics & computation. Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift.
  30. Generation. Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase. Analytics & computation: EC2 & Elastic MapReduce. Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift. AWS Data Pipeline.
  31. Generation. Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase. Analytics & computation: EC2 & Elastic MapReduce. Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift. AWS Data Pipeline.
  32. Elastic MapReduce
  33. Managed Hadoop analytics
  34. Input data: S3, DynamoDB, Redshift
  35. Input data: S3, DynamoDB, Redshift. Code. Elastic MapReduce.
  36. Input data: S3, DynamoDB, Redshift. Code. Elastic MapReduce. Name node.
  37. Input data: S3, DynamoDB, Redshift. Code. Elastic MapReduce. Name node. S3/HDFS. Elastic cluster.
  38. Input data: S3, DynamoDB, Redshift. Code. Elastic MapReduce. Name node. S3/HDFS. Elastic cluster. Queries + BI via JDBC, Pig, Hive.
  39. Input data: S3, DynamoDB, Redshift. Code. Elastic MapReduce. Name node. S3/HDFS. Elastic cluster. Queries + BI via JDBC, Pig, Hive. Output.
  40. Input data: S3, DynamoDB, Redshift. Output.
  41. Feature 1: Elastic clusters
  42. 10 hours
  43. 6 hours
  44. Peak capacity
  45. Feature 2: Rapid, tuned provisioning
  46. Tedious.
  47. Remove undifferentiated heavy lifting.
  48. Feature 3: Hadoop all the way down
  49. Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...
  50. Feature 4: Agility for experimentation
  51. Instance choice. Stay flexible on instance type & number.
  52. Feature 5: Cost optimizations
  53. Built for Spot. Name-your-price supercomputing.
  54. Recap: 1. Elastic clusters; 2. Rapid, tuned provisioning; 3. Hadoop all the way down; 4. Agility for experimentation; 5. Cost optimizations
  55. Vin Sharma, vin.sharma@intel.com. Director, Product Strategy & Marketing, Big Data Software, Intel Corporation
  56. Analysis of Data Can Transform Society. Enhance scientific understanding, drive innovation, and accelerate medical cures. Create new business models and improve organizational processes. Increase public safety and improve energy efficiency with smart grids.
  57. Intel’s Vision to Democratize Big Data: Unlock Value in Silicon; Support Open Platforms; Deliver Software Value
  58. Intel at the Intersection of Big Data. HPC: enabling exascale computing on massive data sets. Cloud: helping enterprises build open and interoperable clouds. Open Source: contributing code and fostering ecosystem.
  59. Intel® Technology at the Heart of the Cloud: Server, Storage, Network
  60. Scale-Out Big Data Compute Platform Optimization. Cost-effective performance: Intel® Advanced Vector Extensions Technology; Intel® Turbo Boost Technology 2.0; Intel® Advanced Encryption Standard New Instructions Technology
  61. Intel® Advanced Vector Extensions Technology. Newest in a long line of processor instruction innovations. Increases floating point operations per clock up to 2X performance [1]. [1] Performance comparison using Linpack benchmark. See backup for configuration details. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more legal information on performance forecasts go to http://www.intel.com/performance
  62. Intel® Turbo Boost Technology 2.0: More Performance. Higher turbo speeds maximize performance for single and multi-threaded applications.
  63. Intel® Advanced Encryption Standard New Instructions. Processor assistance for performing AES encryption: 7 new instructions. Makes enabled encryption software faster and stronger.
  64. The Power of Intel® Platform Solutions: richer user experiences. TeraSort for 1 TB sort: from 4 HRS to 10 MIN. Reductions shown on the chart: 50%, 80%, 50%, 40%. Chart labels: Previous Intel® Xeon® Processor, Intel® Xeon® Processor E5 2600, Solid-State Drive, 10G Ethernet, Intel® Apache Hadoop.
  65. The Virtuous Cycle of User Experience: Clients, Cloud, Intelligent Systems
  66. Section 4: The Big Data Ecosystem
  67. Data, data, everywhere... Data is stored in silos.
  68. S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
  69. “How do I get my data to the cloud?”
  70. Data mobility: Generated and stored in AWS; Inbound data transfer is free; Multipart upload to S3; Physical media; AWS Direct Connect; Regional replication of AMIs and snapshots
  71. “How do I integrate my data for maximum impact?”
  72. S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
  73. S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
  74. S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
  75. S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
  76. S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
  77. AWS Data Pipeline: Orchestration for data-intensive workloads. Announced in November, available now.
  78. AWS Data Pipeline: Data-intensive orchestration and automation; Reliable and scheduled; Easy to use, drag and drop; Execution and retry logic; Map data dependencies; Create and manage temporary compute resources
  79. Anatomy of a pipeline
  80. Additional checks and notifications
  81. Arbitrarily complex pipelines
  82. aws.amazon.com/datapipeline
  83. aws.amazon.com/big-data
  84. Summary: 1. Introducing Big Data; 2. From data to actionable information; 3. Analytics and Cloud Computing; 4. The Big Data ecosystem
  85. Get 600 Hours of free supercomputing time! www.powerof60.com
  86. Thank you! sirota@amazon.com
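
Slide 78 lists mapped data dependencies and execution and retry logic among the features of AWS Data Pipeline. The sketch below illustrates those two ideas in plain Python: steps run only after the steps they depend on, and a failing step is retried a bounded number of times. It is not the AWS Data Pipeline API or its pipeline definition format; `run_pipeline`, `max_attempts`, and the step names are hypothetical, and the code assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    dependencies: Dict[str, Set[str]],
    actions: Dict[str, Callable[[], None]],
    max_attempts: int = 3,
) -> List[str]:
    """Run each step after its dependencies, retrying failed steps.

    `dependencies` maps a step name to the set of steps it depends on.
    Returns the step names in the order they completed.
    """
    completed: List[str] = []
    # static_order() yields every step after all of its dependencies.
    for step in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                actions[step]()
                break  # step succeeded, stop retrying
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts, surface the failure
        completed.append(step)
    return completed


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_job() -> None:
        # Hypothetical step that fails once, then succeeds on retry.
        attempts["count"] += 1
        if attempts["count"] < 2:
            raise RuntimeError("transient failure")

    deps = {
        "copy_logs": set(),
        "run_job": {"copy_logs"},
        "load_report": {"run_job"},
    }
    acts = {
        "copy_logs": lambda: None,
        "run_job": flaky_job,
        "load_report": lambda: None,
    }
    print(run_pipeline(deps, acts))
```

A managed service would add scheduling, notifications, and provisioning of temporary compute resources on top of this ordering-and-retry core; none of that is modelled here.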