Big Data in the Cloud

AWS Summit 2014 Brisbane - Breakout 5

Most organisations face ever-growing volumes of data that need to be stored, processed and, most importantly, analysed to bring value to the business. Big Data appears to offer solutions to these challenges, but the landscape is littered with acronyms and obscure names such as MPP, NoSQL, Hadoop, Hive and HBase. Attend this session to find out:

- What is the value proposition of each of these technologies?
- How do they fit with more traditional Big Data solutions such as data warehouses?
- How can AWS help organisations get maximum value from their data?

Presenter: Russell Nash, Solutions Architect, APAC, Amazon Web Services

Published in: Technology

Big Data in the Cloud

  1. 1. Big Data in the Cloud. Russell Nash, Solutions Architect, Amazon Web Services. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. 2. Big picture slide
  3. 3. MPP, Hadoop, NoSQL, Streaming
  4. 4. [Chart: Size (Small to Large) against Structure (Low to High), positioning Traditional Database, Hadoop, NoSQL and MPP DW]
  5. 5. Comparison table. Columns: MPP, Hadoop, NoSQL. Rows: Structure, Latency, Interfaces.
  6. 6. Background: 2004 – MapReduce; 2006 – Hadoop
  7. 7. Hadoop cluster: Input file + Functions → Output. 1. Very flexible 2. Very scalable 3. Often transient
  8. 8. Input file → map → reduce → Output file
  9. 9. Input file → map → reduce → Output file (pipeline repeated three times)
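
The map and reduce stages on slides 8 and 9 can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the word-count task, the function names and the sample input are assumptions chosen for illustration, and a real cluster would run the map calls on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(input_files):
    # Each input file is mapped independently of the others; on a cluster
    # these map calls would run in parallel (hypothetical local stand-in).
    mapped = chain.from_iterable(
        map_phase(line) for lines in input_files for line in lines
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input: three "files", each a list of lines.
input_files = [
    ["big data in the cloud"],
    ["data in the warehouse"],
    ["the cloud"],
]
result = run_job(input_files)
```

Because every map call only sees its own input and every reduce call only sees one key, the same job can be spread over more files and more nodes without changing the two functions.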
  10. 10. Big Data Verticals and Use Cases
        - Media/Advertising: Targeted Advertising; Image and Video Processing
        - Oil & Gas: Seismic Analysis
        - Retail: Recommendations; Transactions Analysis
        - Life Sciences: Genome Analysis
        - Financial Services: Monte Carlo Simulations; Risk Analysis
        - Security: Anti-virus; Fraud Detection; Image Recognition
        - Social Network/Gaming: User Demographics; Usage Analysis; In-game Metrics
  11. 11. Deployment Options: On-premise; Cloud; Managed on Cloud
  12. 12. Elastic MapReduce: Manageability, Scalability, Cost
  13. 13. 400 GB of logs per day; ~12 terabytes per month
  14. 14. Amazon S3. 1) Load log file data for six months of user search history into Amazon S3.
        Search ID   Search Text   Final Selection
        12423451    westen        Westin
        14235235    wisten        Westin
        54332232    westenn       Westin
  15. 15. Amazon S3 (log files), Amazon EMR (Hadoop cluster). 2) Spin up a 200-node cluster.
  16. 16. Amazon S3, Amazon EMR (Hadoop cluster). 3) 200 nodes simultaneously analyze this data, looking for common misspellings. This takes a few hours.
  17. 17. Amazon S3 (log files), Amazon EMR (Hadoop cluster). 4) New common misspellings and suggestions are loaded back into S3.
  18. 18. Amazon S3 (log files), Amazon EMR. 5) When the job is done, the cluster is shut down.
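
The analysis in step 3 might look roughly like the following sketch, which counts how often a search text differs from the final selection. It runs locally in plain Python rather than on EMR. The first three rows come from the sample table on slide 14; the two extra rows, the function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter

LOG_ROWS = [
    # (search_id, search_text, final_selection) from the slide's sample table.
    ("12423451", "westen", "Westin"),
    ("14235235", "wisten", "Westin"),
    ("54332232", "westenn", "Westin"),
    # Hypothetical extra rows so that one misspelling repeats.
    ("90000001", "westen", "Westin"),
    ("90000002", "Westin", "Westin"),
]


def map_row(row):
    """Map step: emit ((typed_text, selection), 1) when the two differ."""
    _search_id, text, selection = row
    if text.lower() != selection.lower():
        yield (text.lower(), selection), 1


def reduce_counts(pairs):
    """Reduce step: total the occurrences of each (typed_text, selection) pair."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def common_misspellings(rows, min_count=2):
    """Return {misspelling: suggestion} for pairs seen at least min_count times."""
    pairs = (pair for row in rows for pair in map_row(row))
    counts = reduce_counts(pairs)
    return {typo: fix for (typo, fix), n in counts.items() if n >= min_count}


suggestions = common_misspellings(LOG_ROWS)
all_pairs = common_misspellings(LOG_ROWS, min_count=1)
```

In the workflow on the slides, the input rows would be read from S3 and the resulting suggestions written back to S3 before the cluster is shut down.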
  19. 19. The Hadoop Ecosystem
  20. 20. Trends: SQL on Hadoop; Spark
  21. 21. Comparison table (Hadoop column filled in). Hadoop: Structure Any; Latency Mins-Hours; Interfaces Programming, SQL-Like, Tools. MPP and NoSQL columns still empty.
  22. 22. Background: SQL databases for analytical workloads. Performance, Scalability, Ease of Use, Cost
  23. 23. Architecture: BI Tools → Leader Node → three Compute Nodes. 1. SQL 2. High performance 3. Broad toolset
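
The leader and compute node layout on slide 23 can be sketched as a scatter-gather aggregation. This is an illustration only, not how any particular MPP product is implemented: the round-robin distribution, the thread pool standing in for compute nodes, the function names and the sample rows are all assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Leader step: spread rows across compute nodes (round-robin here)."""
    slices = [[] for _ in range(node_count)]
    for index, row in enumerate(rows):
        slices[index % node_count].append(row)
    return slices


def compute_node_aggregate(rows):
    """Compute-node step: partial sum of amount per month on the local slice."""
    partial = Counter()
    for month, amount in rows:
        partial[month] += amount
    return partial


def leader_query(rows, node_count=3):
    """Leader step: fan the work out, then merge the partial results."""
    slices = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_aggregate, slices))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Hypothetical (month, amount) rows.
rows = [
    ("2014-01", 10),
    ("2014-01", 5),
    ("2014-02", 7),
    ("2014-02", 3),
    ("2014-03", 1),
]
totals = leader_query(rows)
```

Each compute node only touches its own slice, so adding nodes shrinks the slice each one has to scan while the leader's merge step stays small.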
  24. 24. Deployment Options: On-premise; Cloud; Managed on Cloud
  25. 25. Amazon Redshift: Manageability, Scalability, Cost
  26. 26. Performance Evaluation on 2B Rows: aggregate by month. Timings shown: 02:08:35; 00:35:46; 00:00:12. Chart label: Traditional SQL Database
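  27. 27. Comparison table (Hadoop and MPP columns filled in). Hadoop: Structure Any; Latency Mins-Hours; Interfaces Programming, SQL-Like, Tools. MPP: Structure Full; Latency Seconds-Minutes; Interfaces SQL, BI Tools. NoSQL column still empty.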
  28. 28. Background: Databases for webscale transactions. Performance, Flexibility
  29. 29. Relational Table
        ID    Age   State
        123   20    CA
        345   25    WA
        678   40    FL
      Non-Relational Table
        ID    Attributes
        123   Age:20, State:CA
        345   Age:25, Country: Australia, Gender: F, Smoker: No
        678   Age:40
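
The contrast on slide 29 can be expressed as two data structures: fixed-schema rows versus items that each carry their own attribute set. The values are those shown on the slide; the type name, the dictionary layout and the `get_attribute` helper are assumptions for illustration and do not represent any database's API.

```python
from typing import NamedTuple


class RelationalRow(NamedTuple):
    """Every row has exactly the same columns."""
    id: int
    age: int
    state: str


relational_table = [
    RelationalRow(123, 20, "CA"),
    RelationalRow(345, 25, "WA"),
    RelationalRow(678, 40, "FL"),
]

# Each item is keyed by ID and may hold a different set of attributes.
non_relational_table = {
    123: {"Age": 20, "State": "CA"},
    345: {"Age": 25, "Country": "Australia", "Gender": "F", "Smoker": "No"},
    678: {"Age": 40},
}


def get_attribute(table, item_id, name, default=None):
    """Look up one attribute, tolerating items that do not define it."""
    return table.get(item_id, {}).get(name, default)


country = get_attribute(non_relational_table, 345, "Country")
missing_state = get_attribute(non_relational_table, 678, "State")
```

Adding a new attribute to one item in the non-relational table needs no schema change, whereas the relational rows would all need a new column.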
  30. 30. Deployment Options: On-premise; Cloud; Managed on Cloud
  31. 31. DynamoDB: Manageability, Scalability, Cost
  32. 32. digital advertising real-time bidding
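  33. 33. Comparison table (all columns filled in). Hadoop: Structure Any; Latency Mins-Hours; Interfaces Programming, SQL-Like, Tools. MPP: Structure Full; Latency Seconds-Minutes; Interfaces SQL, Tools. NoSQL: Structure Semi; Latency Sub-second; Interfaces Programming.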
  34. 34. Streaming Analytics
  35. 35. [Architecture diagram: Data Sources → AWS Endpoint → Amazon Kinesis (Shard 1, Shard 2, Shard N, across three Availability Zones) → App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], App.4 [Machine Learning] → S3, DynamoDB, Redshift, EMR]
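
The shards in the slide 35 diagram can be illustrated with a sketch of how records are routed by partition key. Kinesis hashes each partition key with MD5 into a 128-bit hash key and each shard owns a range of that space. The even split of the range across shards, the function names and the sample records below are assumptions, since a real stream's shard ranges can be uneven.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def route(records, shard_count):
    """Place each (partition_key, payload) record in its shard's list."""
    shards = {index: [] for index in range(shard_count)}
    for key, payload in records:
        shards[shard_for(key, shard_count)].append((key, payload))
    return shards


# Hypothetical click-stream records keyed by user.
records = [
    ("user-1", "click:/home"),
    ("user-2", "click:/search"),
    ("user-1", "click:/checkout"),
    ("user-3", "click:/home"),
]
shards = route(records, shard_count=4)
```

Records that share a partition key always land in the same shard, which is what lets a consuming application such as the sliding window analysis see one key's events in order.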
  36. 36. Sensor networks analytics; Ad network analytics; Log centralization; Click stream analysis; Hardware and software appliance metrics; …more…
  37. 37. Amazon Mobile Analytics. Fast: get your data within an hour. Automatic MAU, DAU, session and retention reports. Design and track custom app events. Data is not mined or sold by Amazon.
  38. 38. Expand your skills with AWS Certification. Exams (aws.amazon.com/certification): validate your proven technical expertise with the AWS platform. On-Demand Resources, Videos & Labs (aws.amazon.com/training/self-paced-labs): get hands-on practice working with AWS technologies in a live environment. Instructor-Led Courses, Training Classes (aws.amazon.com/training): expand your technical expertise to design, deploy, and operate scalable, efficient applications on AWS.
  39. 39. Big Data Tutorials: aws.amazon.com/big-data. Redshift Free Trial: aws.amazon.com/redshift/free-trial
