Apache Hadoop On-Demand on Cloud Infrastructure at Java2Days

Apache Hadoop runs best on bare metal, but there are many cases where it makes sense to run clusters on cloud infrastructure. In this talk I discuss the advantages, disadvantages and common problems.

At Axemblr.com our mission is to make provisioning a fully functional cluster on any cloud provider as simple as possible. We are building on Cloudera Manager, and the API we provide is fully compatible with Amazon Elastic MapReduce.

Get in touch with us at hello@axemblr.com

Published in: Technology
  • Transcript of "Apache Hadoop On-Demand on Cloud Infrastructure at Java2Days"

    1. Apache™ Hadoop® On-Demand on Cloud Infrastructure. Why? How? Tools? Andrei Savu @ Cloud2Days 2012, asavu@axemblr.com
    2. Me
       • Co-Founder of Axemblr.com
       • Co-Organiser of Bucharest JUG (bjug.ro)
       • OSS contributor (Apache Whirr, jclouds)
       • Worked at Adobe, Facebook
       • Connect with me on LinkedIn
    3. Outline
       • Quick Introduction
       • 7 Reasons Why
       • 4 Reasons Why Not
       • How (practical lessons)
       • OSS Projects & Resources
       • Q&A
    4. Apache™ Hadoop®
       • Stack of Components (HDFS, YARN, MR, Hive, Pig, Mahout etc.)
       • Distributed Data Storage & Processing
       • Designed to Scale to 1000s of nodes
       • Highly available at application level
    5. Useful for
       • Click stream analysis, transactions etc.
       • General Business Intelligence
       • Advanced Analytics
       • Machine Learning (data-driven features), and many more (data crunching)
    6. IaaS
       • Infrastructure As A Service
       • Programmatic access to virtual machines, servers, storage, load balancers, network etc.
       • Pay-as-you-go (PAYG). Good hardware utilisation. Flexible
       • Public & Private
    7. IaaS + Hadoop = HaaS (Hadoop-As-A-Service): Parallel Data Processing as a Service
    8. 7 Reasons Why
    9. #1 Data Locality: “Work near Data” in the same Cloud
    10. #2 Low Usage: Nightly processing. Short-lived clusters. Development
    11. #3 CAPEX: No upfront investment requirements
    12. #4 Multi-Tenancy: Enforced by IaaS mechanisms (hypervisor & network)
    13. #5 Security: VPNs, VLANs, Authentication, ACLs
    14. #6 Easy to Expand: Standard instance types. “Infinite” capacity. Rapid provisioning.
    15. #7 Self-Service Access: Increase Team Agility. Fast experiments.
    16. 4 Reasons Why Not
    17. #1 Low I/O Performance: Requires more machines for the same workload
    18. #2 Assumes Static Infrastructure: Scheduling & Recovery. Works fine for short-lived clusters with a fixed size.
    19. #3 Difficult Access to Legacy Systems: Like databases or log files
    20. #4 Topology Awareness: Hidden by the IaaS layer
    21. How? Provisioning. Configuration. Workflow Management. Ongoing Monitoring.
    22. Provisioning: Not trivial for 10s or 100s of VMs. Should be: persistent, granular, idempotent. Each provider is like a snowflake (normalisation).
    23. Configuration: Usual suspects: Puppet, Chef, Bash. Or offload: Cloudera Manager, Ambari (HDP).
    24. Workflow Management: In-House, Oozie, Azkaban. API Integration.
    25. Ongoing Monitoring: Automatic repair by reacting to IaaS events. Ganglia, CloudWatch etc.
    26. Tools? Emerging Market. Growth determined by rate of cloud adoption.
    27. OSS Projects #1
       • Apache Whirr http://whirr.apache.org/
       • Juju (Ubuntu) https://juju.ubuntu.com/
    28. OSS Projects #2
       • Pallet http://palletops.com/
       • Serengeti http://serengeti.cloudfoundry.com/
    29. Resources
       • Hadoop in Cloud Infrastructure (Steve Loughran) http://steveloughran.blogspot.ro/2012/03/hadoop-in-cloud-infrastructures.html
       • What factors justify the use of Hadoop http://redmonk.com/sogrady/2011/01/13/apache-hadoop/
    30. Thanks! Questions? Andrei Savu / asavu@axemblr.com, Twitter: @andreisavu
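
Slides 22 and 25 ask for provisioning that is granular and idempotent, and for automatic repair in response to IaaS events. One way to read those two requirements together is as a reconcile step that can be re-run at any time and only changes the difference between the current and the desired cluster. The sketch below is an illustration only: `FakeCloud` and `reconcile` are hypothetical names, the "cloud" is an in-memory dictionary rather than a real provider SDK, and nothing here is taken from the talk, Apache Whirr or jclouds. Persistence of cluster state is not covered.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for an IaaS API (not a real SDK)."""
    nodes: dict = field(default_factory=dict)   # node id -> "running" or "failed"
    _ids: count = field(default_factory=count)  # source of unique node numbers

    def launch(self, cluster: str) -> str:
        node_id = f"{cluster}-{next(self._ids)}"
        self.nodes[node_id] = "running"
        return node_id

    def terminate(self, node_id: str) -> None:
        self.nodes.pop(node_id, None)


def reconcile(cloud: FakeCloud, cluster: str, desired: int) -> list:
    """Converge `cluster` on `desired` running nodes; safe to call repeatedly."""
    prefix = f"{cluster}-"
    # Snapshot the nodes that belong to this cluster.
    mine = {n: s for n, s in cloud.nodes.items() if n.startswith(prefix)}

    # Repair: remove nodes the provider reports as not running.
    for node_id, state in mine.items():
        if state != "running":
            cloud.terminate(node_id)
    running = sorted(n for n, s in mine.items() if s == "running")

    # Granular: launch or terminate only the difference.
    for _ in range(desired - len(running)):
        running.append(cloud.launch(cluster))
    for node_id in running[desired:]:
        cloud.terminate(node_id)
    return running[:desired]


if __name__ == "__main__":
    cloud = FakeCloud()
    first = reconcile(cloud, "hadoop", 3)    # launches three nodes
    second = reconcile(cloud, "hadoop", 3)   # second call changes nothing
    cloud.nodes[first[0]] = "failed"         # simulate a failure event
    third = reconcile(cloud, "hadoop", 3)    # failed node is replaced
    print(first, second, third)
```

Calling `reconcile` from a scheduler or from a handler for provider events would give both behaviours with one code path; a real implementation would also need to handle API errors, partial launches and per-provider differences (the "snowflake" problem from slide 22).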