Apache Hadoop On-Demand on Cloud Infrastructure at Java2Days

Apache Hadoop runs best on bare metal but there are many cases when it makes sense to run clusters on cloud infrastructure. In this talk I discuss advantages, disadvantages and common problems.

At Axemblr.com our mission is to make provisioning a fully functional cluster on any cloud provider as simple as possible. We are building on Cloudera Manager, and the API we provide is fully compatible with Amazon Elastic MapReduce.

Get in touch with us at hello@axemblr.com

Published in: Technology

    1. Apache™ Hadoop® On-Demand on Cloud Infrastructure. Why? How? Tools? Andrei Savu @ Cloud2Days 2012, asavu@axemblr.com
    2. Me
       • Co-Founder of Axemblr.com
       • Co-Organiser of Bucharest JUG (bjug.ro)
       • OSS contributor (Apache Whirr, jclouds)
       • Worked at Adobe, Facebook
       • Connect with me on LinkedIn
    3. Outline
       • Quick Introduction
       • 7 Reasons Why
       • 4 Reasons Why Not
       • How (practical lessons)
       • OSS Projects & Resources
       • Q&A
    4. Apache™ Hadoop®
       • Stack of Components (HDFS, YARN, MR, Hive, Pig, Mahout etc.)
       • Distributed Data Storage & Processing
       • Designed to Scale to 1000s of nodes
       • Highly available at application level
    5. Useful for
       • Click stream analysis, transactions etc.
       • General Business Intelligence
       • Advanced Analytics
       • Machine Learning (data-driven features)
       and many more (data crunching)
    6. IaaS
       • Infrastructure As A Service
       • Programmatic access to virtual machines, servers, storage, load balancers, network etc.
       • Pay-as-you-go (PAYG). Good hardware utilisation. Flexible
       • Public & Private
    7. IaaS + Hadoop = HaaS: Hadoop-As-A-Service. Parallel Data Processing as a Service
    8. 7 Reasons Why
    9. #1 Data Locality: “Work near Data” in the same Cloud
    10. #2 Low Usage: Nightly processing. Short-lived clusters. Development
    11. #3 CAPEX: No upfront investment requirements
    12. #4 Multi-Tenancy: Enforced by IaaS mechanisms (hypervisor & network)
    13. #5 Security: VPNs, VLANs, Authentication, ACLs
    14. #6 Easy to Expand: Standard instance types. “Infinite” capacity. Rapid provisioning.
    15. #7 Self-Service Access: Increase Team Agility. Fast experiments.
    16. 4 Reasons Why Not
    17. #1 Low I/O Performance: Requires more machines for the same workload
    18. #2 Assumes Static Infrastructure: Scheduling & Recovery. Works fine for short-lived clusters with a fixed size.
    19. #3 Difficult Access to Legacy Systems: Like databases or log files
    20. #4 Topology Awareness: Hidden by the IaaS layer
    21. How? Provisioning. Configuration. Workflow Management. Ongoing Monitoring.
    22. Provisioning: Not trivial for 10s or 100s of VMs. Should be: persistent, granular, idempotent. Each provider is like a snowflake (normalisation).
    23. Configuration: Usual suspects: Puppet, Chef, Bash. Or offload: Cloudera Manager, Ambari (HDP)
    24. Workflow Management: In-House, Oozie, Azkaban. API Integration
    25. Ongoing Monitoring: Automatic repair by reacting to IaaS events. Ganglia, CloudWatch etc.
    26. Tools? Emerging Market. Growth determined by rate of cloud adoption.
    27. OSS Projects #1
       • Apache Whirr http://whirr.apache.org/
       • Juju (Ubuntu) https://juju.ubuntu.com/
    28. OSS Projects #2
       • Pallet http://palletops.com/
       • Serengeti http://serengeti.cloudfoundry.com/
    29. Resources
       • Hadoop in Cloud Infrastructure (Steve Loughran) http://steveloughran.blogspot.ro/2012/03/hadoop-in-cloud-infrastructures.html
       • What factors justify the use of Hadoop http://redmonk.com/sogrady/2011/01/13/apache-hadoop/
    30. Thanks! Questions? Andrei Savu / asavu@axemblr.com, Twitter: @andreisavu
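
The provisioning slide asks for a process that is persistent, granular and idempotent, on top of a normalised view of each provider. A minimal Python sketch of that idea follows. Everything in it is an assumption made for illustration: the `Provider` interface, the `InMemoryProvider` stand-in, the node naming scheme and `ensure_cluster` are hypothetical and are not the API of Axemblr, Apache Whirr, jclouds or any real cloud SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Node:
    """Provider-neutral description of one virtual machine."""
    name: str
    role: str


class Provider(Protocol):
    """Hypothetical normalised interface that each cloud adapter would implement."""

    def list_nodes(self, cluster: str) -> List[Node]: ...

    def create_node(self, cluster: str, node: Node) -> None: ...


@dataclass
class InMemoryProvider:
    """Stand-in for a real cloud API so the sketch runs offline."""
    nodes: Dict[str, List[Node]] = field(default_factory=dict)
    create_calls: int = 0

    def list_nodes(self, cluster: str) -> List[Node]:
        return list(self.nodes.get(cluster, []))

    def create_node(self, cluster: str, node: Node) -> None:
        self.create_calls += 1
        self.nodes.setdefault(cluster, []).append(node)


def ensure_cluster(provider: Provider, cluster: str, workers: int) -> List[Node]:
    """Converge on one master plus `workers` workers; return the nodes created."""
    desired = [Node(f"{cluster}-master", "master")]
    desired += [Node(f"{cluster}-worker-{i}", "worker") for i in range(workers)]

    # Persistent: the provider's own inventory is the source of truth,
    # so a re-run after a crash picks up where the last run stopped.
    existing = {n.name for n in provider.list_nodes(cluster)}

    created = []
    for node in desired:  # granular: one VM per call, so failures stay local
        if node.name not in existing:  # idempotent: never create a node twice
            provider.create_node(cluster, node)
            created.append(node)
    return created
```

Calling `ensure_cluster(provider, "demo", 3)` twice against the same provider creates four nodes on the first call and none on the second; raising the worker count later only adds the missing workers. Scaling down and failure handling are left out of this sketch.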
