
BDT202 The Hadoop Ecosystem - AWS re:Invent 2012


The Hadoop ecosystem is blossoming. In this session, learn how to take advantage of tools such as Mesos, Spark, Shark and Mahout on Amazon Elastic MapReduce. Senior Product Manager Jon Einkauf discusses the optimizations that make Hadoop sing on EMR and describes how to use different Hadoop distributions and tools such as HBase and HParser with your big data analytics pipelines.


Transcript of "BDT202 The Hadoop Ecosystem - AWS re:Invent 2012"

  1. © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. EMR is Hadoop in the Cloud. Hadoop is an open-source framework for parallel processing of huge amounts of data on a cluster of machines.
  5. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Put the data into Amazon Simple Storage Service (S3). Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS.
  6. Diagram: EMR Cluster, Amazon S3, EMR. You can easily add and remove nodes.
  7. Diagram: Amazon S3, EMR Cluster. When processing is complete, you can terminate the cluster (and stop paying).
  8. options
  10. Hive
      • Data warehouse for Hadoop
      • SQL-like query language (HiveQL)
      • Initially developed at Facebook
      Pig
      • High-level programming language (Pig Latin)
      • Supports UDFs
      • Ideal for data flow/ETL
  11. HBase
      • Column-oriented database
      • Runs on top of HDFS
      • Ideal for sparse data
      • Random, read/write access
      • Ideal for very large tables (billions of rows, millions of columns)
      Mahout
      • Machine learning library
      • Supports recommendation mining, clustering, classification, and frequent itemset mining
  12. Ganglia
      • Scalable distributed monitoring
      • View performance of the cluster and individual nodes
      • Open source
      R
      • Language and software environment for statistical computing and graphics
      • Open source
  13. Hadoop: elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 5
  14. Hive: ./elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
  15. HBase: elastic-mapreduce --create --hbase --name "$USER HBase Cluster" --num-instances 2 --instance-type cc2.8xlarge
  16. Bootstrap action: elastic-mapreduce --create --bootstrap-action s3://s3bucket/installganglia
  18. Hive
  19. Hive
  20. Hive
  21. Hive
  22. Hive
  25. Diagram labels: Data Quality, Data Masking, Data Exchange, MDM, Data Transformation, Enterprise Data Integration, Identity Resolution, Connectivity.
  26. HParser UI: any format, any complexity, easily, in MapReduce. Diagram labels: real-world data source, Hadoop (M, M, M, R), results.
  29. End-to-End Flow. Construction (Windows): HParser UI, transform definition (in, out). Execution (EMR): Map Reduce, input binary records, output text records.
  30. Real-World Data, HParser: Flat files; Logs; Records; XML, JSON; Industry standards (e.g. FIX, SWIFT, X12, ASN.1); Documents (e.g. PDF, Excel).
  31. Chart: ASN.1 on EMR Cluster. Y-axis: Minutes (0 to 60). X-axis: Nodes (4, 16, 24, 32). Series: 10 GB, 50 GB. Notes: These are only Mappers times; add 60 sec lead time (Start) and 60 sec tail time (Reducer) for each run. Amazon XL (Extra Large) instances: 64-bit, 15GB RAM, 1.5TB Storage.
  32. Chart: ASN.1 on EMR Cluster, 72 Nodes. Y-axis: Minutes (0 to 60). X-axis: File Size (10 GB, 100 GB, 400 GB, 700 GB, 1 TB). Notes: These are only Mappers times; add 60 sec lead time (Start) and 60 sec tail time (Reducer) for each run. Amazon XL (Extra Large) instances: 64-bit, 15GB RAM, 1.5TB Storage.
  37.                      Batch processing    Interactive analysis       Stream processing
      Query runtime        Minutes to hours    Milliseconds to minutes    Never-ending
      Data volume          TBs to PBs          GBs to PBs                 Continuous stream
      Programming model    MapReduce           Queries                    DAG
      Users                Developers          Analysts and developers    Developers
      Google project       MapReduce           Dremel                     -
      Open source project  Hadoop MapReduce    -                          Storm and S4
      Introducing Apache Drill…
  39. Avro IDL
      enum Gender { MALE, FEMALE }
      record User { string name; Gender gender; long followers; }
      JSON
      { "name": "Srivas", "gender": "Male", "followers": 100 }
      { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }
  41. Flexible
      • Pluggable query languages
      • Extensible execution engine
      • Pluggable data formats
      • Column-based and row-based
      • Schema and schema-less
      • Pluggable data sources
      Easy
      • Unzip and run
      • Zero configuration
      • Reverse DNS not needed
      • IP addresses can change
      • Clear and concise log messages
      Dependable
      • No SPOF
      • Instant recovery from crashes
      Fast
      • C/C++ core with Java support
      • Google C++ style guide
      • Min latency and max throughput (limited only by hardware)
  42. No RegionServers; Instant Recovery; High Throughput; No Manual Splits; No Compactions; No Garbage Collection; No Manual Merges; Snapshots; Consistent Low Latency; No Manual Administration; Mirroring; No Practical Scale Limits.
  43. Diagram comparing MapR (Unified) with Other Distributions. Labels: HBase, JVM, DFS, ext3, Disks.
  44. 50B real-time auctions, #1 in audience reach: "M7 is really taking Hadoop to the next level. It allows us to do new things with our data." - Jan Gelin, VP of Technical Operations
      2M+ subscribers, 10B+ records: "I'm really excited about M7 because it will address both the performance and the day-to-day challenges of HBase." - Melinda Graham, Sr. Hadoop Engineer
      Global leader in email intelligence: "M7 is a big win for us. It makes HBase really easy to use. It really helps us make better use of the data we have. It allows us to look at use cases we haven't had the opportunity to in the past." - Andy Sautins, CTO
  46. aws.amazon.com/elasticmapreduce
      • Online Training
        – Videos
        – Articles/tutorials
      • Documentation
        – Getting Started Guide
        – Developer Guide
        – API Reference
      • FAQs
      • Paid Training
        – 3-day Developer Course taught by Think Big Analytics
      • On-Site Consulting
        – EMR Bootcamp (for companies processing 1+ TB per day)
  47. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
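
Slide 39 pairs an Avro IDL schema with two JSON records, one of which carries an extra "zip" field. The following is a minimal sketch, in plain Python with no Avro library, of how such records might be compared against the declared fields. The check_record helper is hypothetical, and matching enum symbols case-insensitively is an assumption (the slide writes the symbols as MALE/FEMALE but the JSON values as "Male"/"Female"); this is not how Avro itself validates data.

```python
import json

# Schema transcribed from the Avro IDL on slide 39.
GENDER_SYMBOLS = {"MALE", "FEMALE"}
USER_FIELDS = {"name": str, "gender": str, "followers": int}


def check_record(record):
    """Return (missing, extra, bad) field-name lists for one JSON record.

    Hypothetical helper for illustration; not Avro's validation logic.
    """
    missing = sorted(f for f in USER_FIELDS if f not in record)
    extra = sorted(f for f in record if f not in USER_FIELDS)
    bad = sorted(
        f for f, expected in USER_FIELDS.items()
        if f in record and not isinstance(record[f], expected)
    )
    # Assumption: enum symbols are compared case-insensitively.
    gender = record.get("gender")
    if isinstance(gender, str) and gender.upper() not in GENDER_SYMBOLS:
        bad.append("gender")
    return missing, extra, bad


# The two JSON records shown on slide 39.
records = [
    json.loads('{"name": "Srivas", "gender": "Male", "followers": 100}'),
    json.loads('{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'),
]

for rec in records:
    missing, extra, bad = check_record(rec)
    print(rec["name"], "missing:", missing, "extra:", extra, "bad:", bad)
```

Under these assumptions the first record matches the schema exactly, while the second is flagged only for the undeclared "zip" field, which is the kind of schema drift the "Schema and schema-less" bullet on slide 41 refers to.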

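
Slide 2 describes Hadoop as a framework for parallel processing, and slide 37 names MapReduce as the batch programming model. As a rough illustration of that model only, here is a single-process word count in plain Python. The function names map_phase, shuffle, reduce_phase and word_count are made up for this sketch and are not Hadoop or EMR APIs; a real job would distribute these phases across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["EMR is Hadoop in the cloud", "Hadoop runs on a cluster"])
print(counts["hadoop"])
```

Each map_phase call depends only on its own line and each reduce_phase call only on its own key, which is what lets a cluster run many of them at once.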