Hadoop and HBase on Amazon Web Services

Introducing big data and analytics with Hadoop, HBase and Amazon Elastic MapReduce.

Hadoop and HBase on Amazon Web Services: Presentation Transcript

  • Hadoop & HBase with Amazon Web Services. Dr. Matt Wood, matthew@amazon.com
  • Thank you.
  • Introducing Hadoop; HBase on AWS; Cost optimization
  • Data for competitive advantage.
  • Using data: Customer segmentation, financial modeling, system analysis, line-of-sight, business intelligence...
  • Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  • Cost of data generation is falling.
  • Generation (lower cost, increased throughput); Collection & storage; Analytics & computation; Collaboration & sharing
  • Generation; HIGHLY CONSTRAINED: Collection & storage, Analytics & computation, Collaboration & sharing
  • Very high barrier to turning data into information.
  • Move from a data generation challenge to analytics challenge.
  • Enter the AWS Cloud.
  • Remove the constraints.
  • Enable data-driven innovation.
  • Move to a distributed data approach.
  • Maturation of two things: Software for distributed storage and analysis. Infrastructure for distributed storage and analysis.
  • Software: Frameworks for data-intensive workloads. Distributed by design.
  • Infrastructure: Platform for data-intensive workloads. Distributed by design.
  • Support the data life cycle.
  • Generation; HIGHLY CONSTRAINED: Collection & storage, Analytics & computation, Collaboration & sharing
  • Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  • Lower the barrier to entry.
  • Accelerate time to market and increase agility.
  • Enable new business opportunities.
  • Washington Post, Pinterest, NASA
  • “AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly.” Michael Miller, Pfizer
  • Introducing Hadoop
  • Maturation of two things: Software for distributed storage and analysis. Infrastructure for distributed storage and analysis.
  • Apache Hadoop: Software for distributed storage and analysis. Implements the map/reduce pattern. Focus on your data.
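
The map/reduce pattern named on this slide can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the grouping step that Hadoop performs across the cluster is done here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (the framework's job in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["focus on your data", "data for competitive advantage"]))
```

Only `map_phase` and `reduce_phase` carry application logic; everything else is distribution machinery, which is the sense in which the framework lets you focus on your data.
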
  • Built for uncertainty: Hadoop provides tools to navigate data. Allows discovery. Query flexibility at scale.
  • Built for flexibility: Java native. Executes code in any language. Just a distribution mechanism.
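
"Executes code in any language" usually refers to Hadoop Streaming, where a mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines to standard output. The sketch below counts HTTP status codes in that style. The log layout (status code as the second whitespace-separated field) and the function names are assumptions made for illustration.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'status<TAB>1' per log line; the field position is an assumption."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum counts per key. Input must be sorted by key, as it is between
    the map and reduce stages of a streaming job."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoke with "map" or "reduce"; reads stdin and writes stdout.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Because the contract is only lines in and lines out, the same two roles could be filled by a shell script or a compiled binary.
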
  • Rich ecosystem: Diverse tools. Machine learning, recommendations, predictive analytics, segmentation, real time analysis. Lots of innovation.
  • But... A very big project. 500k+ lines of code. Challenging to configure and optimize.
  • Undifferentiated heavy lifting
  • Amazon Elastic MapReduce
  • Amazon Elastic MapReduce: Web service for data processing. Hosted Hadoop. Configured and optimized.
  • Amazon Elastic MapReduce: Job flows. Elastic platform. Maintain clusters or run once and terminate. Debugging tools.
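
As a rough sketch of the "maintain clusters or run once and terminate" choice, the function below builds the keyword arguments that the boto3 EMR client's `run_job_flow` call accepts. The talk does not show any client code, so boto3 itself, the release label, the instance types, the role names and the step name are all assumptions. No AWS call is made here.

```python
def build_job_flow_request(name, log_uri, step_args, instance_count=3,
                           keep_alive=False):
    """Build keyword arguments for boto3's EMR `run_job_flow`.

    keep_alive=False asks the cluster to terminate once its steps finish
    (run once and terminate); keep_alive=True keeps it up for more steps.
    """
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",            # assumed release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # assumed instance type
            "SlaveInstanceType": "m5.xlarge",    # assumed instance type
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "Steps": [{
            "Name": "example-step",              # hypothetical step name
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": list(step_args),
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",    # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


# With credentials configured, the request could be submitted as:
#   import boto3
#   boto3.client("emr").run_job_flow(**build_job_flow_request(...))
```

Keeping the request as plain data makes the transient and long-running variants differ by a single flag.
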
  • Input data and code in S3. Elastic MapReduce: name node, elastic cluster, HDFS. Queries + BI via JDBC, Pig, Hive. Output to S3 + SimpleDB.
  • Input data: S3. Output: S3 + SimpleDB.
  • Hadoop all the way down: Amazon Hadoop distribution. HDFS. Streaming interface. Hive, Pig, Mahout, Spark, Shark.
  • Data integration: Optimized and integrated into AWS environment. Reads and writes to S3. Analytics on DynamoDB data. Can process data from any source: Cassandra, MongoDB, CouchDB, Amazon RDS.
  • Data movement: Multi-part upload. Import/Export. AWS Direct Connect. Aspera.
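
Multi-part upload splits a large object into independently uploaded byte ranges. The sketch below only plans those ranges. It relies on two S3 limits: every part except the last must be at least 5 MiB, and an upload may have at most 10,000 parts. The function name, the 8 MiB default and the doubling strategy are illustrative assumptions.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Return a list of (part_number, start_byte, end_byte_exclusive)."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    # Grow the part size until the object fits within the part-count limit.
    while -(-total_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts = []
    start = 0
    number = 1
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts
```

Each planned range can be uploaded in parallel and retried on its own, which is what makes the approach useful for moving large datasets into S3.
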
  • Cluster scalability: Resize running job flows. Add capacity for shorter runs. Remove capacity during off peak hours. Balance scale and cost.
  • Cluster scalability: 14 hours remaining
  • Cluster scalability: 7 hours remaining
  • Cluster scalability: 3 hours remaining
  • Cluster scalability: Steady state. Large batch task. Steady state.
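
The 14, 7 and 3 hour figures are consistent with doubling cluster size at each step. A back-of-the-envelope sketch follows, under the idealized assumption that remaining work divides evenly across nodes. Real jobs rarely scale perfectly, and the 10-node starting size is hypothetical.

```python
def hours_remaining(remaining_node_hours, nodes):
    """Estimated wall-clock hours left, assuming perfectly linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return remaining_node_hours / nodes


if __name__ == "__main__":
    work = 14 * 10  # 14 hours left on a hypothetical 10-node cluster
    for nodes in (10, 20, 40):
        print(nodes, "nodes:", hours_remaining(work, nodes), "hours")
```

Under this model the total node-hours stay constant, so with per-instance-hour billing a larger cluster finishes sooner at a similar cost.
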
  • Cluster availability: Canonical source of data. Anyone in the engineering team. IAM integration. Monitoring.
  • Clickstream analysis for retail: 3.5 billion records. 71 million unique cookies. 1.7 million targeted ads. 13 TB of clickstream logs. Each day.
  • Clickstream analysis for retail: Workflow time from 2 days to 8 hours. Procurement time from 2 months to 5 minutes. $13k per month. 500% increase in return on advertising spend.
  • Log data stored in Amazon S3: Months of user click-through data. Search terms. Ads displayed. Premium listing inventory.
  • Elastic MapReduce spins up a 200-instance cluster. (Diagram labels: Amazon S3, Amazon EMR, Hadoop cluster.)
  • Find patterns across logs. Write results to S3. (Diagram labels: Amazon S3, Amazon EMR, Hadoop cluster.)
  • Hadoop in the AWS Cloud: Elastic MapReduce for hosted Hadoop. Optimized, configured, ready to roll. Focus on the business benefit of data. Hadoop all the way down.
  • Maturation of two things: Software for distributed storage and analysis. Infrastructure for distributed storage and analysis.
  • HBase on AWS
  • Vibrant ecosystem: Mahout for machine learning. Mesos for cluster management. Spark for fast analytics. HBase for unstructured data.
  • HBase: NoSQL data store. Runs on top of HDFS. Scalable. Rapid retrieval across large datasets.
  • Architecture: Huge, distributed map/hash. Distributed. Implements Bloom filters. Sortable.
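
A Bloom filter answers "is this key possibly here?" with no false negatives, which lets a store skip files that cannot contain a requested row. The class below is a minimal teaching sketch, not HBase's implementation. The class name, the default sizes and the use of salted SHA-256 as the hash family are assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with num_hashes salted hash positions per key."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A `True` result may be a false positive, so it only means the underlying data has to be checked. A `False` result is definitive and saves the lookup entirely.
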
  • Column based: Columns are similar to fields. Rows are records.
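
The "sorted map" view of rows and columns can be modeled in memory as row key to columns, with row keys kept in sorted order so that range scans are cheap. `TinyTable` is a hypothetical toy with none of HBase's distribution, versioning or persistence. The `family:qualifier` column names follow HBase's naming convention.

```python
import bisect


class TinyTable:
    """Toy model: row key -> {column name -> value}, row keys kept sorted."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> dict of columns

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])
```

Rows need not share the same columns, and because keys are sorted, choosing row keys with a common prefix (for example `user#`) turns "all rows for this entity type" into one contiguous scan.
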
  • Built for data: Built to scale across billions of rows. The more data, the better the relative performance.
  • But... Large, complex project. Running in production can be challenging. Distributed system.
  • Undifferentiated heavy lifting
  • HBase for Elastic MapReduce
  • Using HBase: Social media firehose. Customer information. Usage and application logs. Hadoop analytics.
  • Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  • Amazon DynamoDB: NoSQL database service. Provisioned throughput. Unlimited storage. Very easy to use.
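
Provisioned throughput is expressed in capacity units. The sketch below estimates them using the unit sizes in current DynamoDB documentation: one read unit covers a strongly consistent read per second of up to 4 KB, one write unit covers a write per second of up to 1 KB, and eventually consistent reads cost half. These sizes may differ from those in effect when the talk was given, and the function names are hypothetical.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second,
                        strongly_consistent=True):
    """Read units needed for a steady read rate at a given item size."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    # Eventually consistent reads consume half the capacity.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Write units needed for a steady write rate at a given item size."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second
```

Item size is rounded up per operation, so keeping items just under a unit boundary can noticeably reduce the throughput that has to be provisioned.
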
  • DynamoDB & Amazon EMR: SQL-like queries. Query flexibility at scale. Integrate queries across datasets. Hive.
  • NoSQL on the AWS Marketplace: CouchDB, Cassandra, MongoDB. aws.amazon.com/marketplace
  • Cost optimization
  • Lowered prices 19 times in the past six years.
  • On-demand
  • Reserved capacity
  • On-demand and reserved capacity (chart, scale to 100%)
  • Spot market
  • $0.08 vs $0.007 (yesterday evening)
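
Reading the two figures as an on-demand price and a spot price per instance-hour (an assumption, since the slide does not label them), the saving is simple arithmetic. The function names below are hypothetical.

```python
def savings_fraction(on_demand_price, spot_price):
    """Fraction saved per instance-hour by paying the spot price instead."""
    if on_demand_price <= 0:
        raise ValueError("on_demand_price must be positive")
    return 1 - spot_price / on_demand_price


def run_cost(price_per_instance_hour, instances, hours):
    """Total cost of running a fixed-size cluster for a number of hours."""
    return price_per_instance_hour * instances * hours


if __name__ == "__main__":
    print(f"saving: {savings_fraction(0.08, 0.007):.2%}")
```

At those prices the saving is a little over 91% per instance-hour. Spot prices fluctuate and spot capacity can be reclaimed, so this suits work that tolerates interruption.
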
  • Reserved Instance Marketplace
  • Introducing Hadoop; HBase on AWS; Cost optimization
  • aws.amazon.com/elasticmapreduce
  • Thank you. aws.amazon.com/rds, matthew@amazon.com, @mza