BDT202 The Hadoop Ecosystem - AWS re:Invent 2012

The Hadoop ecosystem is blossoming. In this session, learn how to take advantage of tools such as Mesos, Spark, Shark and Mahout on Amazon Elastic MapReduce. Senior Product Manager Jon Einkauf discusses the optimizations that make Hadoop sing on EMR, and describes how to use different Hadoop distributions and tools such as HBase and HParser with your big data analytics pipelines.

BDT202 The Hadoop Ecosystem - AWS re:Invent 2012 Presentation Transcript

  • © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • EMR is Hadoop in the cloud. Hadoop is an open-source framework for parallel processing of huge amounts of data on a cluster of machines.
  • Put the data into Amazon Simple Storage Service (S3). Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS.
  • You can easily add and remove nodes in the EMR cluster.
  • When processing is complete, you can terminate the cluster (and stop paying).
  • options
  • Hive
      – Data warehouse for Hadoop
      – SQL-like query language (HiveQL)
      – Initially developed at Facebook
    Pig
      – High-level programming language (Pig Latin)
      – Supports UDFs
      – Ideal for data flow/ETL
  • HBase
      – Column-oriented database
      – Runs on top of HDFS
      – Ideal for sparse data
      – Random, read/write access
      – Ideal for very large tables (billions of rows, millions of columns)
    Mahout
      – Machine learning library
      – Supports recommendation mining, clustering, classification, and frequent itemset mining
  • Ganglia
      – Scalable distributed monitoring
      – View performance of the cluster and individual nodes
      – Open source
    R
      – Language and software environment for statistical computing and graphics
      – Open source
  • Hadoop:
        elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 5
  • Hive:
        ./elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
  • HBase:
        elastic-mapreduce --create --hbase --name "$USER HBase Cluster" --num-instances 2 --instance-type cc2.8xlarge
  • Bootstrap action:
        elastic-mapreduce --create --bootstrap-action s3://s3bucket/installganglia
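The launch commands on these slides share a common shape. As a minimal sketch, the hypothetical helper `build_emr_command` (it is not part of the EMR CLI) assembles an argument list using only flags that appear on the slides. It builds the command without running it, and the values passed in are the ones shown on the Hadoop slide.

```python
import shlex


def build_emr_command(name=None, instance_type=None, num_instances=None,
                      alive=False, bootstrap_action=None):
    """Assemble an elastic-mapreduce argument list (hypothetical helper).

    Only flags that appear on the slides are used; the command is built,
    not executed.
    """
    args = ["elastic-mapreduce", "--create"]
    if alive:
        args.append("--alive")
    if name is not None:
        args += ["--name", name]
    if num_instances is not None:
        args += ["--num-instances", str(num_instances)]
    if instance_type is not None:
        args += ["--instance-type", instance_type]
    if bootstrap_action is not None:
        args += ["--bootstrap-action", bootstrap_action]
    return args


# Values taken from the Hadoop slide.
hadoop_cmd = build_emr_command(instance_type="m1.xlarge", num_instances=5,
                               alive=True)
# shlex.join quotes arguments that contain spaces, such as cluster names.
print(shlex.join(hadoop_cmd))
```

Building a list instead of a single string keeps names with spaces, such as "Test Hive", intact as one argument.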
  • Hive
  • Data Quality, Data Masking, Data Exchange, MDM, Data Transformation, Enterprise Data Integration, Identity Resolution, Connectivity
  • HParser UI: any format, any complexity, easily, in MapReduce. Diagram: real-world data (source) passes through Hadoop mappers (M) and a reducer (R) to produce results.
  • End-to-End Flow
      – Construction (Windows): HParser UI, transform definition
      – Execution (EMR): MapReduce; input: binary records in; output: text records out
  • Real-World Data for HParser
      – Flat files, logs, records
      – XML, JSON
      – Industry standards, e.g. FIX, SWIFT, X12, ASN.1
      – Documents, e.g. PDF, Excel
  • Chart: ASN.1 on EMR Cluster. Y-axis: minutes (0 to 60); x-axis: nodes (4, 16, 24, 32); series: 10 GB and 50 GB.
    Notes:
      – These are mapper times only. Add 60 sec lead time (start) and 60 sec tail time (reducer) for each run.
      – Amazon XL (Extra Large) instances: 64-bit, 15 GB RAM, 1.5 TB storage
  • Chart: ASN.1 on EMR Cluster, 72 nodes. Y-axis: minutes (0 to 60); x-axis: file size (10 GB, 100 GB, 400 GB, 700 GB, 1 TB).
    Notes:
      – These are mapper times only. Add 60 sec lead time (start) and 60 sec tail time (reducer) for each run.
      – Amazon XL (Extra Large) instances: 64-bit, 15 GB RAM, 1.5 TB storage
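The chart notes say each plotted value covers mappers only, with 60 seconds of lead time and 60 seconds of tail time to be added per run. A minimal sketch of that adjustment follows; the function name `total_run_minutes` and the sample input are illustrative assumptions, not values read from the charts.

```python
LEAD_SECONDS = 60  # start-up time per run, from the slide notes
TAIL_SECONDS = 60  # reducer time per run, from the slide notes


def total_run_minutes(mapper_minutes):
    """Return end-to-end minutes for one run given its mapper-only time."""
    return mapper_minutes + (LEAD_SECONDS + TAIL_SECONDS) / 60


# 30 is an illustrative mapper time, not a value read from the charts.
print(total_run_minutes(30))
```

The fixed two-minute overhead matters most for short runs, where it is a larger share of the total.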
  • Batch processing, interactive analysis, and stream processing:
                          Batch processing   Interactive analysis        Stream processing
    Query runtime         Minutes to hours   Milliseconds to minutes     Never-ending
    Data volume           TBs to PBs         GBs to PBs                  Continuous stream
    Programming model     MapReduce          Queries                     DAG
    Users                 Developers         Analysts and developers     Developers
    Google project        MapReduce          Dremel                      -
    Open source project   Hadoop MapReduce   Introducing Apache Drill…   Storm and S4
  • Avro IDL:
        enum Gender { MALE, FEMALE }
        record User {
          string name;
          Gender gender;
          long followers;
        }
    JSON:
        { "name": "Srivas", "gender": "Male", "followers": 100 }
        { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }
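The slide contrasts a fixed Avro schema with self-describing JSON records, the second of which carries a `zip` field that the schema does not declare. As a rough sketch of that difference, the hypothetical function `check_user` below tests a decoded JSON object against the three declared fields and reports any extras. It uses only the standard `json` module, not an Avro library, and it assumes enum symbols may be compared case-insensitively because the slide's JSON spells them "Male" and "Female".

```python
import json

# Field names and enum symbols copied from the Avro IDL on the slide.
USER_FIELDS = {"name": str, "gender": str, "followers": int}
GENDER_SYMBOLS = {"MALE", "FEMALE"}


def check_user(record):
    """Return (conforms, extra_fields) for one decoded JSON object.

    Assumption: enum symbols are compared case-insensitively, because the
    slide's JSON uses "Male"/"Female" while the IDL uses MALE/FEMALE.
    """
    conforms = all(
        isinstance(record.get(field), expected)
        for field, expected in USER_FIELDS.items()
    ) and record["gender"].upper() in GENDER_SYMBOLS
    extra = set(record) - set(USER_FIELDS)
    return conforms, extra


# The two JSON records shown on the slide.
records = [
    json.loads('{ "name": "Srivas", "gender": "Male", "followers": 100 }'),
    json.loads('{ "name": "Raina", "gender": "Female", "followers": 200, '
               '"zip": "94305" }'),
]
for record in records:
    print(record["name"], check_user(record))
```

Both records satisfy the declared fields; only the second reports an extra field, which a schema-less reader would simply pass through.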
  • Flexible
      – Pluggable query languages
      – Extensible execution engine
      – Pluggable data formats: column-based and row-based; schema and schema-less
      – Pluggable data sources
    Easy
      – Unzip and run
      – Zero configuration
      – Reverse DNS not needed
      – IP addresses can change
      – Clear and concise log messages
    Dependable
      – No SPOF
      – Instant recovery from crashes
    Fast
      – C/C++ core with Java support
      – Google C++ style guide
      – Min latency and max throughput (limited only by hardware)
  • No RegionServers, Instant Recovery, High Throughput, No Manual Splits, No Compactions, No Garbage Collection, No Manual Merges, Snapshots, Consistent Low Latency, No Manual Administration, Mirroring, No Practical Scale Limits
  • Architecture diagram: storage stacks in other distributions (HBase, JVM, DFS, ext3, disks) compared with MapR (unified, disks).
  • 50B real-time auctions, #1 in audience reach: “M7 is really taking Hadoop to the next level. It allows us to do new things with our data.” - Jan Gelin, VP of Technical Operations
    2M+ subscribers, 10B+ records: “I’m really excited about M7 because it will address both the performance and the day-to-day challenges of HBase.” - Melinda Graham, Sr. Hadoop Engineer
    Global leader in email intelligence: “M7 is a big win for us. It makes HBase really easy to use. It really helps us make better use of the data we have. It allows us to look at use cases we haven’t had the opportunity to in the past.” - Andy Sautins, CTO
  • aws.amazon.com/elasticmapreduce
    • Online Training
      – Videos
      – Articles/tutorials
    • Documentation
      – Getting Started Guide
      – Developer Guide
      – API Reference
    • FAQs
    • Paid Training
      – 3-day Developer Course taught by Think Big Analytics
    • On-Site Consulting
      – EMR Bootcamp (for companies processing 1+ TB per day)
  • We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.