Apache Hadoop On-Demand on Cloud Infrastructure at Java2Days

Apache Hadoop runs best on bare metal but there are many cases when it makes sense to run clusters on cloud infrastructure. In this talk I discuss advantages, disadvantages and common problems.

At Axemblr.com, our mission is to make provisioning a fully functional cluster on any cloud provider as simple as possible. We are building on Cloudera Manager, and the API we provide is fully compatible with Amazon Elastic MapReduce.

Get in touch with us at hello@axemblr.com

Presentation Transcript

  • Apache™ Hadoop® On-Demand on Cloud Infrastructure: Why? How? Tools? Andrei Savu @ Cloud2Days 2012, asavu@axemblr.com
  • Me
      • Co-Founder of Axemblr.com
      • Co-Organiser of Bucharest JUG (bjug.ro)
      • OSS contributor (Apache Whirr, jclouds)
      • Worked at Adobe, Facebook
      • Connect with me on LinkedIn
  • Outline
      • Quick Introduction
      • 7 Reasons Why
      • 4 Reasons Why Not
      • How (practical lessons)
      • OSS Projects & Resources
      • Q&A
  • Apache™ Hadoop®
      • Stack of Components (HDFS, YARN, MR, Hive, Pig, Mahout etc.)
      • Distributed Data Storage & Processing
      • Designed to Scale to 1000s of nodes
      • Highly available at application level
  • Useful for
      • Click stream analysis, transactions etc.
      • General Business Intelligence
      • Advanced Analytics
      • Machine Learning (data driven features)
      • and many more (data crunching)
  • IaaS
      • Infrastructure As A Service
      • Programmatic access to virtual machines, servers, storage, load balancers, network etc.
      • Pay-as-you-go (PAYG). Good hardware utilisation. Flexible
      • Public & Private
  • IaaS + Hadoop = HaaS (Hadoop-As-A-Service): Parallel Data Processing as a Service
  • 7 Reasons Why
  • #1 Data Locality: “Work near Data” in the same Cloud
  • #2 Low Usage: Nightly processing. Short-lived clusters. Development
  • #3 CAPEX: No upfront investment requirements
  • #4 Multi-Tenancy: Enforced by IaaS mechanisms (hypervisor & network)
  • #5 Security: VPNs, VLANs, Authentication, ACLs
  • #6 Easy to Expand: Standard instance types. “Infinite” capacity. Rapid provisioning.
  • #7 Self-Service Access: Increase Team Agility. Fast experiments.
  • 4 Reasons Why Not
  • #1 Low I/O Performance: Requires more machines for the same workload
  • #2 Assumes Static Infrastructure: Scheduling & Recovery. Works fine for short-lived clusters with a fixed size.
  • #3 Difficult Access to Legacy Systems: Like databases or log files
  • #4 Topology Awareness: Hidden by the IaaS layer
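On the topology point: Hadoop can call an external script to map node addresses to racks (configured through `net.topology.script.file.name` in Hadoop 2; older releases use `topology.script.file.name`). The script receives addresses as arguments and prints one rack path per address. On a cloud the real placement is usually not visible, so the sketch below only approximates it from subnets. The subnet prefixes and rack names are hypothetical and not taken from the talk.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop topology script with a made-up subnet-to-rack table."""
import sys

# Assumption: each subnet roughly corresponds to one failure domain.
# These prefixes and rack names are invented for illustration.
SUBNET_TO_RACK = {
    "10.0.1.": "/zone-a",
    "10.0.2.": "/zone-b",
}
# All rack paths are kept at the same depth, which Hadoop expects.
DEFAULT_RACK = "/default-rack"


def resolve_rack(address):
    """Return the rack path for one IP address, or the default rack."""
    for prefix, rack in SUBNET_TO_RACK.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    # Hadoop passes one or more addresses and expects one rack path
    # per address on standard output, in the same order.
    for address in argv:
        print(resolve_rack(address))


if __name__ == "__main__":
    main(sys.argv[1:])
```

A guessed mapping like this is only as good as the assumption behind it; if the provider exposes availability zones, those are probably a safer source for the table.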
  • How? Provisioning. Configuration. Workflow Management. Ongoing Monitoring
  • Provisioning: Not trivial for 10s or 100s of VMs. Should be: persistent, granular, idempotent. Each provider is like a snowflake (normalisation)
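To make "granular" and "idempotent" concrete, here is a minimal sketch of a provisioning step that can be retried safely. `FakeProvider` and `ensure_cluster` are hypothetical names invented for this example; a real setup would put a library such as jclouds or a provider SDK behind the same small interface, which is one way to read "normalisation".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeProvider:
    """In-memory stand-in for a cloud API (hypothetical, for illustration)."""
    nodes: Dict[str, str] = field(default_factory=dict)  # node name -> cluster

    def list_nodes(self, cluster: str) -> List[str]:
        return sorted(n for n, c in self.nodes.items() if c == cluster)

    def create_node(self, cluster: str, name: str) -> None:
        self.nodes[name] = cluster


def ensure_cluster(provider, cluster: str, size: int) -> List[str]:
    """Create only the missing nodes and return the names that were created."""
    existing = set(provider.list_nodes(cluster))
    created = []
    for index in range(size):
        # Deterministic names let a retry recognise nodes that already exist.
        name = f"{cluster}-node-{index}"
        if name not in existing:
            provider.create_node(cluster, name)  # granular: one node per call
            created.append(name)
    return created


provider = FakeProvider()
first = ensure_cluster(provider, "demo", 3)   # creates three nodes
second = ensure_cluster(provider, "demo", 3)  # creates nothing the second time
```

Because the second call finds all three nodes already present, a run that fails halfway can simply be repeated without creating duplicates.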
  • Configuration: Usual suspects: Puppet, Chef, Bash. Or offload: Cloudera Manager, Ambari (HDP)
  • Workflow Management: In-House, Oozie, Azkaban. API Integration
  • Ongoing Monitoring: Automatic repair by reacting to IaaS events. Ganglia, CloudWatch etc.
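"Automatic repair by reacting to IaaS events" could look roughly like the loop below. This is a sketch under stated assumptions: the event dictionaries, the `"terminated"` type and the `launch_replacement` callback are all hypothetical, and no real CloudWatch or Ganglia API is used.

```python
from typing import Callable, Dict, List, Set


def repair_cluster(nodes: Set[str],
                   events: List[Dict[str, str]],
                   launch_replacement: Callable[[str], str]) -> Set[str]:
    """Return the node set after replacing every node reported as terminated."""
    healthy = set(nodes)
    for event in events:
        # Hypothetical event shape: {"type": ..., "node": ...}
        if event.get("type") == "terminated" and event.get("node") in healthy:
            healthy.discard(event["node"])
            healthy.add(launch_replacement(event["node"]))
    return healthy


def fake_launch(failed_node: str) -> str:
    """Pretend to start a new VM and return its name (illustration only)."""
    return failed_node + "-replacement"


cluster = repair_cluster(
    {"worker-1", "worker-2"},
    [{"type": "terminated", "node": "worker-2"},
     {"type": "rebooted", "node": "worker-1"}],
    fake_launch,
)
```

Only the terminated worker is replaced; other event types are ignored here, although a real system would likely handle more of them.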
  • Tools? Emerging Market. Growth determined by rate of cloud adoption.
  • OSS Projects #1
      • Apache Whirr http://whirr.apache.org/
      • Juju (Ubuntu) https://juju.ubuntu.com/
  • OSS Projects #2
      • Pallet http://palletops.com/
      • Serengeti http://serengeti.cloudfoundry.com/
  • Resources
      • Hadoop in Cloud Infrastructure (Steve Loughran) http://steveloughran.blogspot.ro/2012/03/hadoop-in-cloud-infrastructures.html
      • What factors justify the use of Hadoop http://redmonk.com/sogrady/2011/01/13/apache-hadoop/
  • Thanks! Questions? Andrei Savu / asavu@axemblr.com Twitter: @andreisavu