Cloud Computing & Apache Hadoop

A talk about infrastructure as a service (challenges and advantages), Apache Hadoop as an ecosystem of tools for distributed storage and data processing, and Apache Whirr for deployment and simple management.

Published in: Technology, Business
  • Transcript

    • 1. Cloud Computing & Hadoop: Data processing tools based on Apache Whirr. Andrei Savu - @andreisavu - Dev:World 2012 - asavu@apache.org
    • 2. About me
      • Software Engineer @ Cloudsoft
      • Apache Whirr PMC member
      • jclouds committer
      • Worked @ Facebook & Adobe
      • Connect with me on LinkedIn
    • 3. @ Cloudsoft: Monterey: Platform for Intelligent Multi-cloud Application Mobility. http://www.cloudsoftcorp.com/
    • 4. From 0 to working pipelines ... in minutes! * log analysis, ETL, crawling, machine learning etc.
    • 5. Manage a 10+ node Hadoop cluster with custom software as a ... cron job.
    • 6. The plan
      • Infrastructure as a Service (context)
      • Apache Hadoop (data processing)
      • Apache Whirr (deployment)
      • Resources (food for thought)
      • Q/A (or asavu@apache.org)
    • 7. What is Infrastructure as a Service? (IaaS)
    • 8. #1 On demand access to infrastructure components (physical or virtual)
    • 9. #2 Pay as you go model
    • 10. #3 API for automation
    • 11. building blocks for truly elastic and highly efficient applications
    • 12. 0% over provisioning
    • 13. self-managing
    • 14. highly available by design
    • 15. serious downside: complexity
    • 16. ... managed using libraries, tools & platforms (PaaS)
    • 17. “All problems in computer science can be solved by another level of indirection” - David Wheeler ... except for the problem of too many levels of indirection
    • 18. to be continued * in a few minutes
    • 19. Apache Hadoop: An ecosystem of components and tools
    • 20. Overview
      • Java, C/C++
      • set of distributed systems (HDFS, MapReduce etc.)
      • platform for distributed data processing
      • simple programming model (map / reduce)
      • can scale to 1000s of machines
      • designed to be highly available at the application level
      • https://hadoop.apache.org/
    • 21. Components
      • HDFS (storage)
      • MapReduce (processing)
      • Hive, Pig (high level languages)
      • HBase (database)
      • ZooKeeper (coordination)
      • Oozie (workflow)
      • Mahout (machine learning)
      • Flume (log streaming)
      • Sqoop (data import)
      • Whirr (deployment)
      • etc.
    • 22. Why run Hadoop on cloud infrastructure? http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    • 23. #1 “work near data” * if you already use cloud storage
    • 24. #2 infrequent jobs: nightly or weekly
    • 25. #3 for better security: platform multi-tenancy
    • 26. #4 no upfront investment: fund from ongoing revenue
    • 27. #5 easier to expand
    • 28. #6 no network setup
    • 29. #7 homogeneous “hardware”
    • 30. How to run Hadoop on Cloud Infrastructure?
    • 31. Apache Whirr http://whirr.apache.org/
    • 32. Overview: Apache Whirr provides a set of libraries for running cloud services:
      • cloud neutral & based on jclouds
      • has a common service API
      • smart defaults
      • available as a command line tool
    • 33. First Steps
      • Home page: https://whirr.apache.org/
      • Whirr in 5 minutes: https://whirr.apache.org/docs/0.7.1/whirr-in-5-minutes.html
      • Quick Start Guide: https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
    • 34. Supported Services
      • Apache Hadoop & YARN
      • CDH from Cloudera
      • Apache Cassandra
      • Chef & Puppet
      • elasticsearch
      • Ganglia
      • Apache Hama (incubating)
      • Apache HBase
      • Apache Mahout
      • Pig
      • Voldemort
      • Apache ZooKeeper
    • 35. Hadoop Recipe
      • whirr.cluster-name=test-hadoop
      • whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
    • 36. with Pig & Mahout
      • whirr.cluster-name=test-hadoop
      • whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client+pig-client,10 hadoop-datanode+hadoop-tasktracker
    • 37. Start / Stop
      • Launch cluster: whirr launch-cluster --config spec.conf
      • Destroy cluster: whirr destroy-cluster --config spec.conf
    • 38. demo #1: Running a cluster on Rackspace Cloud UK
    • 39. demo #2: HadoopClusterExample code walkthrough
    • 40. Thanks! Questions? Andrei Savu - @andreisavu asavu@apache.org
    • 41. Resources & Links
    • 42. Fundamental Papers
      • Google Filesystem (2003): http://research.google.com/archive/gfs.html
      • Google MapReduce (2004): http://research.google.com/archive/mapreduce.html
      • Google BigTable (2006): http://research.google.com/archive/bigtable.html
      • Amazon Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
    • 43. Articles
      • Getting Real About Distributed System Reliability: http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability
      • Towards a Topology of Failure: http://steveloughran.blogspot.com/2011/11/towards-topology-of-failure.html
      • Hadoop in Cloud Infrastructures: http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    • 44. jclouds: “jclouds is an open source library that helps you get started in the cloud and reuse your java and clojure development skills” http://www.jclouds.org/
    • 45. RHadoop: A way of running R scripts on Hadoop. http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html
    • 46. Thanks! Andrei Savu - @andreisavu - asavu@apache.org
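
Slide 20 names a "simple programming model (map / reduce)" without showing it. The following is a minimal single-process sketch of that model in plain Python, using word counting as the job. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the in-memory shuffle stands in for the grouping step that a real framework performs across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count on a list of log lines returns a dictionary from word to count. The point of the model is that map_phase and reduce_phase only ever see one record or one key at a time, which is what lets a framework spread them over many machines.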

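
The recipes on slides 35 and 36 describe a cluster as a list of "count roles" templates. As a rough illustration of how such a value reads, the sketch below splits one into counts and role names. It is not Whirr's own parser: parse_instance_templates and total_nodes are hypothetical helpers, and the code assumes templates are separated by commas and roles by "+".

```python
def parse_instance_templates(value):
    """Split a whirr.instance-templates value into (count, roles) pairs.

    Assumes the form "N role+role,M role+role" (hypothetical helper,
    not part of Whirr).
    """
    templates = []
    for entry in value.split(","):
        count, roles = entry.strip().split(None, 1)
        role_names = [role.strip() for role in roles.split("+")]
        templates.append((int(count), role_names))
    return templates


def total_nodes(templates):
    """Sum the instance counts over all templates."""
    return sum(count for count, _ in templates)
```

For the slide 35 recipe this yields one template with the namenode and jobtracker roles and one with ten datanode and tasktracker instances, eleven machines in total.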