Big Data in the Cloud

AWS Summit 2014 Melbourne - Breakout 3

Most organisations face ever-growing volumes of data that need to be stored, processed and, most importantly, analysed to bring value to the business. Big Data appears to offer solutions to these challenges, but the landscape is littered with acronyms and obscure names such as MPP, NoSQL, Hadoop, Hive and HBase. Attend this session to find out:

- What the value proposition is for each of these technologies
- How they fit with more traditional Big Data solutions such as data warehouses
- How AWS can help organisations get maximum value from their data

Presenter: Russell Nash, Solutions Architect, APAC, Amazon Web Services

Published in: Technology

Transcript of "Big Data in the Cloud"

  1. Big Data in the Cloud. Russell Nash, Solutions Architect, Amazon Web Services, APAC. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Big picture slide
  3. Hadoop, MPP, NoSQL, Streaming
  4. Chart axes: Structure (High to Low) and Size (Large to Small). Plotted: Traditional Database, Hadoop, NoSQL, MPP DW
  5. Hadoop, MPP, NoSQL. Compared by: Structure, Latency, Interfaces
  6. Background: 2004 – MapReduce; 2006 – Hadoop
  7. Input File; Hadoop cluster; Functions; Output. 1. Very Flexible 2. Very Scalable 3. Often Transient
  8. Input file → map → reduce → Output file
  9. Input file → map → reduce → Output file (shown three times)
  10. Big Data Verticals and Use cases. Media/Advertising: Targeted Advertising, Image and Video Processing. Oil & Gas: Seismic Analysis. Retail: Recommendations, Transactions Analysis. Life Sciences: Genome Analysis. Financial Services: Monte Carlo Simulations, Risk Analysis. Security: Anti-virus, Fraud Detection, Image Recognition. Social Network/Gaming: User Demographics, Usage analysis, In-game metrics.
  11. Deployment Options: On-premise, Cloud, Managed on Cloud
  12. Elastic MapReduce: Manageability, Scalability, Cost
  13. 400 GB of logs per day, ~12 terabytes per month
  14. 1) Load log file data for six months of user search history into Amazon S3
        Search ID   Search Text   Final Selection
        12423451    westen        Westin
        14235235    wisten        Westin
        54332232    westenn       Westin
        (remaining rows repeat the search IDs 12423451, 14235235, 54332232)
  15. 2) Spin up a 200-node cluster (diagram labels: Log Files, Amazon S3, Amazon EMR, Hadoop Cluster)
  16. 3) 200 nodes simultaneously analyze this data looking for common misspellings … this takes a few hours (diagram labels: Amazon S3, Amazon EMR, Hadoop Cluster)
  17. 4) New common misspellings and suggestions loaded back into S3 (diagram labels: Log Files, Amazon S3, Amazon EMR, Hadoop Cluster)
  18. 5) When the job is done, the cluster is shut down. (diagram labels: Log Files, Amazon S3, Amazon EMR)
  19. The Hadoop Ecosystem
  20. Trends: SQL on Hadoop, Spark
  21. Hadoop, MPP, NoSQL. Structure: Any (Hadoop). Latency: Mins-Hours (Hadoop). Interfaces: Programming, SQL-Like, Tools (Hadoop)
  22. Background: SQL Databases for analytical workloads. Performance, Scalability, Ease of Use, Cost
  23. Leader Node; Compute Node (x3); BI Tools. 1. SQL 2. High Performance 3. Broad Toolset
  24. Deployment Options: On-premise, Cloud, Managed on Cloud
  25. Amazon Redshift: Manageability, Scalability, Cost
  26. Performance Evaluation on 2B Rows: Aggregate by month. Traditional SQL Database. Times shown: 02:08:35, 00:35:46, 00:00:12
  27. Hadoop, MPP, NoSQL. Structure: Any (Hadoop), Full (MPP). Latency: Mins-Hours (Hadoop), Seconds-Minutes (MPP). Interfaces: Programming, SQL-Like, Tools (Hadoop); SQL, BI Tools (MPP)
  28. Background: Databases for webscale transactions. Performance, Flexibility
  29. Relational Table
        ID    Age   State
        123   20    CA
        345   25    WA
        678   40    FL
      Non-Relational Table
        ID    Attributes
        123   Age:20, State:CA
        345   Age:25, Country: Australia, Gender: F, Smoker: No
        678   Age:40
  30. Deployment Options: On-premise, Cloud, Managed on Cloud
  31. DynamoDB: Manageability, Scalability, Cost
  32. Digital advertising, real-time bidding
  33. Hadoop, MPP, NoSQL. Structure: Any (Hadoop), Full (MPP), Semi (NoSQL). Latency: Mins-Hours (Hadoop), Seconds-Minutes (MPP), Sub-second (NoSQL). Interfaces: Programming, SQL-Like, Tools (Hadoop); SQL (MPP); Programming, Tools (NoSQL)
  34. Streaming Analytics
  35. Diagram labels: Data Sources; AWS Endpoint; Amazon Kinesis (Shard 1, Shard 2, Shard N; Availability Zones); App.1 [Aggregate & De-Duplicate]; App.2 [Metric Extraction]; App.3 [Sliding Window Analysis]; App.4 [Machine Learning]; S3; DynamoDB; Redshift; EMR
  36. Sensor networks analytics; Ad network analytics; Log centralization; Click stream analysis; Hardware and software appliance metrics; …more…
  37. Amazon Mobile Analytics. Fast: get your data within an hour. Automatic MAU, DAU, session and retention reports. Design and track custom app events. Data is not mined or sold by Amazon.
  38. Expand your skills with AWS. Certification (Exams): Validate your proven technical expertise with the AWS platform. aws.amazon.com/certification. On-Demand Resources (Videos & Labs): Get hands-on practice working with AWS technologies in a live environment. aws.amazon.com/training/self-paced-labs. Instructor-Led Courses (Training Classes): Expand your technical expertise to design, deploy, and operate scalable, efficient applications on AWS. aws.amazon.com/training
  39. Big Data Tutorials: aws.amazon.com/big-data. Redshift Free Trial: aws.amazon.com/redshift/free-trial
  40. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
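
Slides 8 and 9 show the input file → map → reduce → output file flow, and slides 14 to 18 apply it to finding common misspellings in search logs. The sketch below illustrates that idea in a single Python process. It is not the presenter's job and does not use any Hadoop or Amazon EMR API; the function names and the fourth record (ID 99999999) are hypothetical, added only so that one pair occurs more than once.

```python
from collections import defaultdict

# Sample records from slide 14: (search ID, search text, final selection).
LOG_RECORDS = [
    ("12423451", "westen", "Westin"),
    ("14235235", "wisten", "Westin"),
    ("54332232", "westenn", "Westin"),
]

# Hypothetical extra record (not in the slides) so one misspelling repeats.
EXTRA_RECORD = ("99999999", "westen", "Westin")


def map_record(record):
    """Map step: emit ((typed, selected), 1) when typed text differs from the selection."""
    _search_id, typed, selected = record
    if typed.lower() != selected.lower():
        yield (typed.lower(), selected), 1


def reduce_counts(key, values):
    """Reduce step: total the occurrences of one (typed, selected) pair."""
    return key, sum(values)


def run_job(records):
    """Run map, group values by key (the shuffle), then reduce, all in one process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


misspellings = run_job(LOG_RECORDS + [EXTRA_RECORD])
for (typed, selected), count in sorted(misspellings.items()):
    print(f"{typed} -> {selected}: {count}")
```

On a real cluster the map and reduce functions would run on many nodes at once, with the framework handling the grouping step between them.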
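
The figures on slide 13 (400 GB of logs per day, about 12 terabytes per month) can be checked with a short calculation. The sketch assumes a 30-day month and decimal units (1 TB = 1000 GB). The six-month total is derived here for the period mentioned on slide 14 and is not a number quoted in the slides.

```python
GB_PER_DAY = 400        # from slide 13
DAYS_PER_MONTH = 30     # assumption: a 30-day month
GB_PER_TB = 1000        # assumption: decimal units
MONTHS_OF_HISTORY = 6   # period mentioned on slide 14

tb_per_month = GB_PER_DAY * DAYS_PER_MONTH / GB_PER_TB
tb_six_months = tb_per_month * MONTHS_OF_HISTORY

print(f"Per month: {tb_per_month} TB")
print(f"Six months: {tb_six_months} TB")
```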
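
Slide 29 contrasts a relational table, where every row has the same columns, with a non-relational table, where each item carries only the attributes it has. A minimal sketch of that difference using plain Python data structures follows. It does not use the DynamoDB API, and `get_attribute` is a hypothetical helper introduced for illustration.

```python
# Relational table from slide 29: every row has the same fixed columns.
RELATIONAL_COLUMNS = ("ID", "Age", "State")
relational_rows = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
]

# Non-relational table from slide 29: items are keyed by ID and
# each item stores only the attributes it actually has.
non_relational_items = {
    123: {"Age": 20, "State": "CA"},
    345: {"Age": 25, "Country": "Australia", "Gender": "F", "Smoker": "No"},
    678: {"Age": 40},
}


def get_attribute(items, item_id, name, default=None):
    """Hypothetical helper: read one attribute, tolerating items that lack it."""
    return items.get(item_id, {}).get(name, default)


# The set of attribute names varies per item, so it is discovered from the data.
attribute_names = sorted(
    {name for item in non_relational_items.values() for name in item}
)

print(attribute_names)
print(get_attribute(non_relational_items, 345, "Country"))
print(get_attribute(non_relational_items, 678, "State"))
```

Adding a new attribute to one item needs no schema change, which is the flexibility slide 28 refers to; the cost is that readers must handle missing attributes.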
