AWS Summit 2013 | Auckland - Big Data Analytics
Transcript of "AWS Summit 2013 | Auckland - Big Data Analytics"

  1. Big Data Analytics
     Glenn Gore, Sr. Manager, Solutions Architects, AWS
  2. Overview
     • The Big Data Challenge
     • Big Data tools and what can we do with them?
     • Packetloop – Big Data Security Analytics
     • Intel technology on big data
  3. An engineer's definition
     When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them
  4. The data pipeline: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
  5. Generating, collecting and storing data keeps getting lower cost and higher throughput...
  6. ...while analytics, computation, collaboration and sharing remain highly constrained
  7. Data volume: generated data is growing far faster than the data available for analysis
     Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011";
     IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  8. Amazon Web Services helps remove constraints
  9. Remove constraints = more experimentation
     More experimentation = more innovation
     More innovation = competitive edge
 10. Big Data tools: Elastic MapReduce and Redshift
 11. EMR is Hadoop in the Cloud
 12. What is Amazon Redshift?
     Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
     • Easy to provision and scale
     • No upfront costs, pay as you go
     • High performance at a low price
     • Open and flexible with support for popular BI tools
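To make "easy to provision and scale" concrete, here is a minimal sketch that creates a small Redshift cluster with the AWS SDK for Python (boto3). The cluster identifier, credentials, region and node type are placeholders, not values from the deck; node type names have changed since 2013, so use whatever your region currently offers.

```python
import boto3

redshift = boto3.client("redshift", region_name="ap-southeast-2")

# Provision a small multi-node cluster (all names and credentials are placeholders).
redshift.create_cluster(
    ClusterIdentifier="summit-demo-dw",
    NodeType="dc2.large",               # assumption: pick a node type available in your region
    ClusterType="multi-node",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",  # placeholder; keep real credentials in a secrets store
    PubliclyAccessible=False,
)

# Wait until the cluster is available, then print the endpoint your BI/SQL tools connect to.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="summit-demo-dw")
cluster = redshift.describe_clusters(ClusterIdentifier="summit-demo-dw")["Clusters"][0]
print(cluster["Endpoint"]["Address"], cluster["Endpoint"]["Port"])
```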
 13. Big Data tools: Elastic MapReduce and Redshift
 14. How does EMR work?
     1. Put the data into S3
     2. Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
     3. Launch the cluster using the EMR console, CLI, SDK, or APIs
     4. Get the output from S3
     (You can also store everything in HDFS)
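As a rough illustration of the launch step, the sketch below starts an EMR cluster with the AWS SDK for Python (boto3). The bucket names, key pair, release label and instance types are placeholders, and the roles assume the default EMR roles already exist (for example via `aws emr create-default-roles`).

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")

# Launch a cluster: choose the release (Hadoop distribution), node types and count,
# applications such as Hive/Pig, and where the logs go in S3. All names are placeholders.
response = emr.run_job_flow(
    Name="summit-demo-cluster",
    ReleaseLabel="emr-6.15.0",              # assumption: any current EMR release label
    LogUri="s3://my-emr-logs/summit-demo/",
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,                # 1 master + 9 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-keypair",         # placeholder key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)

cluster_id = response["JobFlowId"]
print("Launched cluster:", cluster_id)
```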
 15. What can you run on EMR…
 16. Resize Nodes
     You can easily add and remove nodes
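One way to resize a running cluster is to change the instance count of one of its instance groups. A minimal sketch with boto3, assuming a placeholder cluster ID; it looks up the CORE group and grows it, and shrinking works the same way on releases that support it.

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID

# Find the CORE instance group of the running cluster.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

# Request a new size; EMR adds nodes (or, on newer releases, gracefully removes them).
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 15}],
)
```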
 17. Resize Nodes with Spot Instances
     Cost without Spot: a 10-node cluster running for 14 hours
     Cost = $1.20 * 10 * 14 = $168
 18. Resize Nodes with Spot Instances
     Cost without Spot: a 10-node cluster running for 14 hours
     Cost = $1.20 * 10 * 14 = $168
     Add 10 nodes on Spot: a 20-node cluster running for 7 hours
     Cost = $1.20 * 10 * 7 = $84 (on-demand) plus $0.60 * 10 * 7 = $42 (Spot)
 19. Resize Nodes with Spot Instances
     Add 10 nodes on Spot: total = $84 + $42 = $126
     25% reduction in price, 50% reduction in time
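The arithmetic above corresponds to bolting a second, Spot-priced task group onto the on-demand cluster. A minimal sketch with boto3, assuming a placeholder cluster ID and the $0.60 bid used in the slide's example:

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")

# Add 10 task nodes purchased on the Spot market to an existing cluster.
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",            # placeholder cluster ID
    InstanceGroups=[
        {
            "Name": "spot-task-nodes",
            "InstanceRole": "TASK",
            "InstanceType": "m5.xlarge",    # assumption: match your core node type
            "InstanceCount": 10,
            "Market": "SPOT",
            "BidPrice": "0.60",             # maximum hourly price, as in the slide's example
        }
    ],
)
```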
 20. Ad-Hoc Clusters – What are they?
     When processing is complete, you can terminate the cluster (and stop paying)
 21. Ad-Hoc Clusters – When to use
     • Not using HDFS
     • Not using the cluster 24/7
     • Transient jobs
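For an ad-hoc (transient) cluster, the essential settings are a step to run and KeepJobFlowAliveWhenNoSteps set to False, so EMR tears the cluster down as soon as the work is done. A sketch with boto3; the JAR and bucket names are placeholders, not anything from the deck.

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")

# Transient cluster: read from S3, run one job, write results back to S3, then terminate.
emr.run_job_flow(
    Name="nightly-adhoc-job",
    ReleaseLabel="emr-6.15.0",
    LogUri="s3://my-emr-logs/adhoc/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the last step finishes
    },
    Steps=[
        {
            "Name": "process-clickstream",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-jobs-bucket/jars/clickstream-job.jar",  # placeholder custom JAR
                "Args": ["s3://my-data-bucket/input/", "s3://my-data-bucket/output/"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```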
 22. "Alive" Clusters – What are they?
     If you run your jobs 24/7, you can also run a persistent cluster and use Reserved Instance (RI) pricing to save costs
 23. "Alive" Clusters – When?
     • Frequently running jobs
     • Dependencies on map-reduce-map outputs
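With a persistent ("alive") cluster, recurring jobs are just extra steps submitted to the same cluster ID. The sketch below submits a Hadoop streaming step via command-runner.jar; the cluster ID, scripts and buckets are placeholders, and the command-runner invocation assumes a recent EMR release.

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")

# Submit another job to an already-running ("alive") cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",        # placeholder: ID of the persistent cluster
    Steps=[
        {
            "Name": "hourly-log-aggregation",
            "ActionOnFailure": "CONTINUE",   # keep the cluster alive even if this step fails
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://my-jobs-bucket/scripts/mapper.py,s3://my-jobs-bucket/scripts/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://my-data-bucket/logs/",
                    "-output", "s3://my-data-bucket/hourly-output/",
                ],
            },
        }
    ],
)
```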
 24. S3 instead of HDFS
     • S3 provides 99.999999999% durability
     • Elastic
     • Versioning protects against failure
     • Run multiple clusters with a single source of truth
     • Quick recovery from failure
     • Continuously resize clusters
 25. S3 and HDFS
     • Keep the master copy of the data in S3
     • Load data from S3 into HDFS using S3DistCp
     • Get all the benefits of S3 and the benefits of HDFS
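To pull the S3 master copy into HDFS before a job, you add an S3DistCp step to the cluster. A sketch using boto3 and command-runner.jar (on the older AMI-based releases the S3DistCp JAR lived under /home/hadoop/lib instead); the bucket names and cluster ID are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")

# Copy the S3 master copy into HDFS so the job gets local HDFS throughput.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",    # placeholder cluster ID
    Steps=[
        {
            "Name": "s3distcp-load-input",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-data-bucket/input/",
                    "--dest", "hdfs:///input/",
                ],
            },
        }
    ],
)
```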
 26. Big Data tools: Elastic MapReduce and Redshift
 27. Pattern 1 – Reporting Data Warehouse
     OLTP / ERP (RDBMS) → Redshift → Reporting and BI
 28. Pattern 2 – Live Archive for (Structured) Big Data
     Web apps (OLTP on DynamoDB) → Redshift → Reporting and BI
 29. Pattern 3 – Cloud ETL for Big Data
     S3 → Elastic MapReduce → Redshift → Reporting and BI
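In the Cloud ETL pattern, the hand-off from S3 into Redshift is typically a COPY command run against the cluster. A minimal sketch using psycopg2; the endpoint, credentials, table name, IAM role ARN and S3 prefix are all placeholders (clusters of the 2013 era used the CREDENTIALS clause with access keys rather than IAM_ROLE).

```python
import psycopg2

# Connect to the Redshift cluster endpoint (placeholders throughout).
conn = psycopg2.connect(
    host="summit-demo-dw.abc123xyz.ap-southeast-2.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="ChangeMe123!",
)

# Bulk-load the EMR output sitting in S3 into a Redshift table.
copy_sql = """
    COPY page_views
    FROM 's3://my-data-bucket/output/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER '|'
    GZIP;
"""

with conn, conn.cursor() as cur:   # the connection context manager commits on success
    cur.execute(copy_sql)

conn.close()
```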
 30. Comparing the tools:

                        Streaming   Hive       Pig        DynamoDB        Redshift
     Unstructured Data  ✓                      ✓
     Structured Data                ✓          ✓          ✓               ✓
     Language Support   Any*        HQL        Pig Latin  Client          SQL
     SQL                                                                  ✓
     SQL-Like                       ✓
     Volume             Unlimited   Unlimited  Unlimited  Relatively Low  1.6 PB
     Latency            Medium      Medium     Medium     Ultra Low       Low
 31. Generation → Collection & storage → Analytics & computation → Collaboration & sharing
     Remove constraints
