Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Amazon Elastic Map/Reduce" by Simone Brunozzi

Presentation Transcript

  • Making Hadoop Enterprise ready with Amazon Elastic Map/Reduce
    Simone Brunozzi
    Technology Evangelist, Amazon Web Services, APAC
    twitter: @simon
    Blog: www.brunozzi.com
  • AGENDA
    What is Elastic MapReduce
    Use Cases
    Service Features
    New Feature Announcements
    Elastic MapReduce Ecosystem
  • What is Amazon Elastic MapReduce
    Enables customers to easily, securely and cost-effectively process vast amounts of data.
    Hosted Hadoop framework running on the web-scale infrastructure of Amazon.
    Spin up 10s, 100s or even 1000s of instances
    Process 10s or 100s of terabytes of data
    Launch and monitor job flows through:
    • AWS Management Console
    • Command line interface
    • REST API
  • Why use Amazon Elastic MapReduce
    Elastic MapReduce removes MUCK from Big Data processing
    Hard to manage compute clusters
    Hard to tune Hadoop
    Hard to monitor running Job Flows
    Hard to debug Hadoop jobs
    Hadoop issues prevent smooth operation in the cloud
  • Problems customers solve with Elastic MapReduce
    Data mining and BI
    Log processing, click stream analysis, similarities, advertising
    Data warehousing applications
    Bio-informatics (Genome analysis)
    Financial simulation (Monte Carlo simulation)
    File processing (resize jpegs)
    Web indexing
  • Web-Scale Data warehousing
  • ELASTIC MAPREDUCE – SUPPORTED CONFIGURATIONS
    Hadoop 0.20 with Pig 0.6, Hive 0.5, Cascading 1.1
    Hadoop 0.18 with Pig 0.3, Hive 0.4, Cascading 1.1
  • ELASTIC MAPREDUCE – HIVE FEATURES
    Apache Hive
    Batch and Interactive Mode
    Support Hive Steps
    Integration with Elastic MapReduce Client and Management Console
    Load table partitions automatically to/from Amazon S3
    Optimized data writes to Amazon S3
    Reference resources such as streaming scripts located on Amazon S3
    Specify an off-instance metadata store
    Support variables defined directly in Hive script
    Supports JDBC and ODBC connections
  • ELASTIC MAPREDUCE – PIG FEATURES
    Apache Pig
    Batch and interactive mode
    Support Pig Steps
    Integration with Elastic MapReduce Client and Management Console
    Concurrent access to multiple file systems (HDFS, Amazon S3)
    Reference resources in Amazon S3 directly from Pig script
    Several User Defined Functions in Piggy Bank
  • Amazon Elastic MapReduce For Enterprise
    Enterprise customers need more flexibility
    Configuring Clusters
    Running Clusters
    Paying for clusters
    Enterprise customers need more tools
    Application development
    Data analytics
    Enterprise customers need support options
    Forum support is not enough
  • Amazon Elastic MapReduce features
    Bootstrap actions
    Run arbitrary scripts before job flow begins
    Run on all nodes before data processing begins
    Used for
    Hadoop configuration (site-conf, Hadoop-conf, etc.)
    Cluster configuration (memory, swap, etc.)
    Application/packages installation (apt-get install r-base)
    Several pre-defined bootstrap actions available
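
As a rough illustration of the package-installation case above, a bootstrap action is a script stored on Amazon S3 plus a small piece of configuration that points at it. The sketch below is not from the talk: the field names follow the `BootstrapActions` shape of the RunJobFlow request as exposed by present-day boto3, and the bucket, script path, and action name are hypothetical.

```python
# Contents of a hypothetical bootstrap script that would be uploaded to S3.
# It runs on every node before data processing begins.
INSTALL_R_SCRIPT = """#!/bin/bash
set -e
sudo apt-get install -y r-base
"""

# Configuration fragment pointing the job flow at that script.
# Field names assume the RunJobFlow "BootstrapActions" shape (as in boto3);
# the S3 path is a placeholder, not a real location.
bootstrap_actions = [
    {
        "Name": "Install R",  # hypothetical label
        "ScriptBootstrapAction": {
            "Path": "s3://example-bucket/bootstrap/install-r.sh",
            "Args": [],
        },
    }
]

print(bootstrap_actions[0]["ScriptBootstrapAction"]["Path"])
```

The fragment only builds data; it would be passed to whatever client launches the job flow.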
  • Amazon Elastic MapReduce - new features
    Preannounce: Expand running clusters
    Increase number of nodes in a running cluster
    Increase processing speed
    Increasing HDFS size
  • PREANNOUNCE – EXPAND/SHRINK CLUSTERS
    Use Case: Increase speed of running job flows
    Speed up job flow execution in response to changing requirements
    Dynamically balance cost versus performance without restarting a job
    [Diagram: the same job flow at three cluster sizes. Allocate 4 instances: time remaining 14 hours. Expand to 9 instances: time remaining 7 hours. Expand to 25 instances: time remaining 3 hours.]
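
The time-remaining figures on this slide (14, 7 and 3 hours for 4, 9 and 25 instances) can be reproduced with a simple estimate. This is a minimal sketch under two assumptions of mine, not stated in the talk: the remaining work divides evenly across instances, and time is rounded up to whole hours. Real Hadoop jobs rarely scale this cleanly.

```python
import math


def hours_remaining(instance_hours_left: float, instances: int) -> int:
    """Whole hours needed to finish, assuming work splits evenly across instances."""
    return math.ceil(instance_hours_left / instances)


# Starting point from the slide: 4 instances with 14 hours remaining.
work_left = 4 * 14  # 56 instance-hours of work

estimates = {n: hours_remaining(work_left, n) for n in (4, 9, 25)}
print(estimates)  # {4: 14, 9: 7, 25: 3}
```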
  • Amazon Elastic MapReduce - new features
    Shrink running clusters
    Decrease number of nodes in a running job flow
    Different capacity requirements from step to step
    Automatically regulate capacity between steps
  • EXPAND/SHRINK CLUSTERS
    Use Case: Agile Data Warehouse Cluster
    Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
    Leverage flexibility to reduce costs and increase cluster utilization
    [Diagram: Data Warehouse (Steady State), allocate 9 instances; Data Warehouse (Batch Processing), expand to 25 instances; Data Warehouse (Steady State), shrink to 9 instances.]
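
To see why resizing between 9 and 25 instances can reduce cost, compare instance-hours for a resized cluster against one held at peak size all day. The split of 16 daytime hours and 8 overnight hours is a hypothetical assumption for illustration; the talk gives only the instance counts.

```python
def instance_hours(schedule):
    """Sum instance-hours over a list of (hours, instance_count) phases."""
    return sum(hours * count for hours, count in schedule)


# Hypothetical day: 16 hours of steady-state queries on 9 instances,
# then 8 hours of overnight batch processing on 25 instances.
resized = [(16, 9), (8, 25)]
# Alternative: keep the peak-size cluster running for all 24 hours.
fixed = [(24, 25)]

resized_total = instance_hours(resized)  # 144 + 200 = 344
fixed_total = instance_hours(fixed)      # 600
saving = 1 - resized_total / fixed_total

print(f"{resized_total} vs {fixed_total} instance-hours ({saving:.0%} fewer)")
```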
  • Amazon Elastic MapReduce Price
  • What is a Spot Instance?
    A way to purchase and consume EC2 instances based on compute value
    Reduce your computing costs
    Bid for unused EC2 capacity
    Control your costs
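    Differences from On-Demand Instances:
    Request – you specify a maximum price (your bid)
    Spot Price – what you actually pay
    Termination – instances can be terminated when the Spot Price rises above your bid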
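
The relationship between the bid, the Spot Price and termination can be sketched as a small simulation. This is a deliberately simplified model, not Amazon's billing logic: it assumes hourly price changes, charges the Spot Price (not the bid) for each hour run, and stops for good the first time the price exceeds the bid. The price series is made up, and amounts are in cents to avoid floating-point noise.

```python
def simulate_spot(bid_cents, hourly_spot_prices_cents):
    """Return (hours_run, total_cost_cents) under a simplified Spot model.

    The instance runs each hour while the Spot Price is at or below the bid,
    pays the Spot Price for that hour, and is terminated the first time the
    Spot Price rises above the bid.
    """
    hours_run = 0
    total_cost = 0
    for price in hourly_spot_prices_cents:
        if price > bid_cents:
            break  # terminated: Spot Price exceeded the maximum bid
        hours_run += 1
        total_cost += price
    return hours_run, total_cost


# Hypothetical hourly Spot Prices in cents; the bid is 30 cents per hour.
prices = [22, 25, 24, 31, 23]
result = simulate_spot(bid_cents=30, hourly_spot_prices_cents=prices)
print(result)  # (3, 71)
```

With these numbers the instance runs three hours and pays 71 cents in total, less than the 90 cents it would pay if charged the bid.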
  • m2.xlarge instance pricing history
    Amazon EC2 On-Demand price for the same instance is $0.50
  • Amazon Elastic MapReduce – new feature
    Spot pricing support for Elastic MapReduce job flows
    Specify the price you want to pay for instances
    Elastic MapReduce takes care of
    Provisioning
    Node addition and removal to/from the cluster
    Can mix On-Demand and Spot instances in the same cluster
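
A cluster that mixes On-Demand and Spot instances can be described as a list of instance groups, each with its own market. The fragment below is a sketch, not something shown in the talk: the field names assume the `InstanceGroups` shape of the RunJobFlow request as exposed by present-day boto3, and the group names, the split into master/core/task roles, and the bid value are illustrative assumptions.

```python
from collections import Counter

# Assumed RunJobFlow-style instance groups: 4 On-Demand nodes plus 5 Spot nodes.
instance_groups = [
    {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "m2.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "m2.xlarge", "InstanceCount": 3},
    {"Name": "spot-task", "InstanceRole": "TASK", "Market": "SPOT",
     "BidPrice": "0.25",  # hypothetical maximum price per instance-hour, in USD
     "InstanceType": "m2.xlarge", "InstanceCount": 5},
]

# Tally how many instances come from each market.
by_market = Counter()
for group in instance_groups:
    by_market[group["Market"]] += group["InstanceCount"]

print(dict(by_market))  # {'ON_DEMAND': 4, 'SPOT': 5}
```

Placing Spot capacity in a separate group means losing those nodes shrinks the cluster without removing the On-Demand nodes.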
  • Integration with EC2 Spot
    Use Case: Manage cost of running job flows
    Start with 4 On-Demand instances of type m2.xlarge
    Expand the cluster with 5 Spot nodes
    Cost without Spot:
    4 instances * 14 hrs * $0.50 = $28
    Cost with Spot:
    4 instances * 7 hrs * $0.50 = $14 +
    5 instances * 7 hrs * $0.25 = $8.75
    Total = $22.75
    Savings: ~19%
    [Diagram: job flow with 4 instances allocated, time remaining 14 hours; expanded to 9 instances, time remaining 7 hours.]
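
Recomputing this use case from its stated inputs (4 On-Demand instances at $0.50 per hour, 5 Spot instances at $0.25 per hour, 14 hours without Spot versus 7 hours with it) gives the following. The helper name is mine; the rates and hours are the ones on the slide.

```python
ON_DEMAND_RATE = 0.50  # USD per instance-hour, m2.xlarge On-Demand price from the slide
SPOT_RATE = 0.25       # USD per instance-hour, Spot price used in the slide's example


def cost(instances, hours, rate):
    """Cost in USD of running `instances` nodes for `hours` at `rate` per hour."""
    return instances * hours * rate


without_spot = cost(4, 14, ON_DEMAND_RATE)
with_spot = cost(4, 7, ON_DEMAND_RATE) + cost(5, 7, SPOT_RATE)
savings = (without_spot - with_spot) / without_spot

print(f"${without_spot:.2f} vs ${with_spot:.2f}, savings {savings:.2%}")
# $28.00 vs $22.75, savings 18.75%
```

The job also finishes in half the time, so the saving comes on top of a faster result.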
  • Elastic MapReduce Ecosystem
    Ecosystem is growing
    Integrated development environments for Hadoop
    Tools designed for data analytics
    Broad support for Amazon Elastic MapReduce
  • Karmasphere
    Big Data Intelligence software
    Helps developers and analysts work faster and more easily
    Purpose-built for all popular Hadoop distros and versions
    Tightly integrated with Elastic MapReduce (since 2009)
    Built on Karmasphere Application Framework™
    Native Hadoop client-side platform
  • Karmasphere Studio
    Free version from www.karmasphere.com
    Professional Edition
    Analyst Edition
    Rich graphical environment
    Develop, debug and deploy easily
    Visualize, manipulate & diagnose jobs, clusters & file systems
    Broad and deep Elastic MapReduce support
    Rapid development
    Comprehensive profiling
    Rich debugging
    • SQL interface for ad hoc analysis
    • Robust Hive implementation
    • Syntax checking, diagnostics, schema browser, JDBC4 compliance, multi-threaded and concurrent
    • No cluster changes
    • Works over proxies and firewalls
    • Integrated Hadoop monitoring
  • Datameer Analytics Solution
    Big data analytics leveraging native Hadoop
    Extreme scale and performance
    Seamless elastic scale on Amazon Elastic MapReduce
    Empowering business users
    UI driven: no programming, no modeling, no schema, no ETL
  • [Diagram: data sources (Web Logs, Social Media, CRM, Sales, Excel Files, Customer Data) feed the Datameer Analytics Solution, which runs on Amazon Elastic MapReduce.]
  • MicroStrategy is a Global Leader in Business Intelligence
    Corporate Overview
    Founded in 1989
    Largest independent public BI vendor (NASDAQ: MSTR)
    Positioned in the Gartner “Leader Quadrant” for BI Platforms
    Over 1 million business users at over 3,000 organizations
    The MicroStrategy 9 business intelligence platform enables mobile apps, dashboards, reporting and analytics with your business data
    Build once, deliver instantly and securely any time, to any device
  • What can you do with MicroStrategy and Amazon Elastic MapReduce?
    Deliver insights to a broader range of users.
    End users interact with a point-and-click interface to query data without writing HiveQL or MapReduce jobs
    Use cases:
    Mobile Apps: Floor manager accesses order details stored in Amazon Elastic MapReduce through a custom iPhone App
    Dashboards: End user starts with a Dynamic Dashboard populated from data mart or data warehouse. The user then drills to a detail report that executes in Amazon Elastic MapReduce.
    Reporting: Application developer builds a parameterized HiveQL report, then schedules it to execute. Jobs execute against Amazon Elastic MapReduce and MicroStrategy sends out exception-based alerts via email to end users.
    Analysis: Application developer populates a multidimensional cache in MicroStrategy with results of a HiveQL query. End user uses MicroStrategy’s web interface to slice-and-dice through results without going back to Hadoop.
  • How can I learn more?
    Try it!
    Free MicroStrategy software is available at: http://www.microstrategy.com/freereportingsoftware
    Get more information about MicroStrategy solutions for Amazon Elastic MapReduce at: http://aws.amazon.com/solutions/solution-providers/microstrategy
  • Elastic MapReduce - Support