
Cloud Computing & Apache Hadoop


A talk about infrastructure as a service, challenges and advantages, Apache Hadoop as an ecosystem of tools for distributed storage and data processing and Apache Whirr for deployment and simple management.

Published in: Technology, Business
  • Transcript of "Cloud Computing & Apache Hadoop"

    1. Cloud Computing & Hadoop. Data processing tools based on Apache Whirr. Andrei Savu - @andreisavu - Dev:World 2012 - asavu@apache.org
    2. About me
        • Software Engineer @ Cloudsoft
        • Apache Whirr PMC member
        • jclouds committer
        • Worked @ Facebook & Adobe
        • Connect with me on LinkedIn
    3. @ Cloudsoft. Monterey: Platform for Intelligent Multi-cloud Application Mobility. http://www.cloudsoftcorp.com/
    4. From 0 to working pipelines ... in minutes! * log analysis, ETL, crawling, machine learning etc.
    5. Manage a 10+ node Hadoop cluster with custom software as a ... cron job.
    6. The plan
        • Infrastructure as a Service (context)
        • Apache Hadoop (data processing)
        • Apache Whirr (deployment)
        • Resources (food for thought)
        • Q/A (or asavu@apache.org)
    7. What is Infrastructure as a Service? (IaaS)
    8. #1 On-demand access to infrastructure components (physical or virtual)
    9. #2 Pay-as-you-go model
    10. #3 API for automation
    11. building blocks for truly elastic and highly efficient applications
    12. 0% over-provisioning
    13. self-managing
    14. highly available by design
    15. serious downside: complexity
    16. ... managed using libraries, tools & platforms (PaaS)
    17. “All problems in computer science can be solved by another level of indirection” (David Wheeler) ... except for the problem of too many levels of indirection
    18. to be continued * in a few minutes
    19. Apache Hadoop: An ecosystem of components and tools
    20. Overview
        • Java, C/C++
        • set of distributed systems (HDFS, MR etc.)
        • platform for distributed data processing
        • simple programming model (map / reduce)
        • can scale to 1000s of machines
        • designed to be highly available at the application level
        • https://hadoop.apache.org/
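The "simple programming model (map / reduce)" bullet can be illustrated with a word count. This is a single-process sketch in plain Python, not Hadoop's API: the function names `map_words`, `reduce_counts`, and `run_job` are made up for illustration, and the grouping loop stands in for the shuffle step that a real cluster would perform across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, group-by-key (the 'shuffle'), and reduce in one process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)  # collect every value under its key
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, only to show the call shape.
    print(run_job(["error disk full", "warn disk slow", "error net down"]))
```

The map and reduce functions never share state, which is what lets a framework run many copies of each on different machines.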
    21. Components
        • HDFS (storage)
        • MapReduce (processing)
        • Hive, Pig (high level languages)
        • HBase (database)
        • ZooKeeper (coordination)
        • Oozie (workflow)
        • Mahout (machine learning)
        • Flume (log streaming)
        • Sqoop (data import)
        • Whirr (deployment)
        • etc.
    22. Why run Hadoop on cloud infrastructure? http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    23. #1 “work near data” * if you already use cloud storage
    24. #2 infrequent jobs: nightly or weekly
    25. #3 for better security: platform multi-tenancy
    26. #4 no upfront investment: fund from ongoing revenue
    27. #5 easier to expand
    28. #6 no network setup
    29. #7 homogeneous “hardware”
    30. How to run Hadoop on Cloud Infrastructure?
    31. Apache Whirr http://whirr.apache.org/
    32. Overview. Apache Whirr provides a set of libraries for running cloud services:
        * cloud neutral & based on jclouds
        * has a common service API
        * smart defaults
        * available as a command line tool
    33. First Steps
        • Home page: https://whirr.apache.org/
        • Whirr in 5 minutes: https://whirr.apache.org/docs/0.7.1/whirr-in-5-minutes.html
        • Quick Start Guide: https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
    34. Supported Services
        • Apache Hadoop & YARN
        • CDH from Cloudera
        • Apache Cassandra
        • Chef & Puppet
        • elasticsearch
        • Ganglia
        • Apache Hama (incubating)
        • Apache HBase
        • Apache Mahout
        • Pig
        • Voldemort
        • Apache ZooKeeper
    35. Hadoop Recipe
        whirr.cluster-name=test-hadoop
        whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
    36. with Pig & Mahout
        whirr.cluster-name=test-hadoop
        whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client+pig-client,10 hadoop-datanode+hadoop-tasktracker
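To make the recipe syntax concrete, here is a small sketch that reads a `whirr.instance-templates` value and reports how many machines each role lands on. It assumes that template groups are comma-separated, that each group is a count followed by roles joined with `+`, and that roles in one group share the same machines. The helper names `parse_instance_templates` and `nodes_per_role` are hypothetical and are not part of Whirr.

```python
from typing import Dict, List, Tuple


def parse_instance_templates(value: str) -> List[Tuple[int, List[str]]]:
    """Split a template string into (machine_count, [roles]) groups."""
    templates = []
    for group in value.split(","):
        # Each group looks like "<count> <role>+<role>+..."
        count, roles = group.strip().split(None, 1)
        templates.append((int(count), [r.strip() for r in roles.split("+")]))
    return templates


def nodes_per_role(value: str) -> Dict[str, int]:
    """Count how many machines run each role."""
    totals: Dict[str, int] = {}
    for count, roles in parse_instance_templates(value):
        for role in roles:
            totals[role] = totals.get(role, 0) + count
    return totals


if __name__ == "__main__":
    spec = (
        "1 hadoop-namenode+hadoop-jobtracker+mahout-client+pig-client,"
        "10 hadoop-datanode+hadoop-tasktracker"
    )
    groups = parse_instance_templates(spec)
    print("machines:", sum(count for count, _ in groups))
    print(nodes_per_role(spec))
```

Read this way, the recipe with Pig and Mahout describes 11 machines: one master that also carries the two client roles, plus ten workers.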
    37. Start / Stop
        Launch cluster: whirr launch-cluster --config spec.conf
        Destroy cluster: whirr destroy-cluster --config spec.conf
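The two commands above are enough to script the "cluster as a cron job" idea from earlier in the talk: launch, run the work, and always tear down so the pay-as-you-go meter stops. The following is one possible sketch, not the speaker's code. `run_with_cluster` and the `job` callback are hypothetical names, and the only Whirr invocations used are the two shown on the slide.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def whirr_command(action: str, config: str) -> List[str]:
    """Build the argument list for one Whirr CLI call from the slide."""
    return ["whirr", action, "--config", config]


def _run_checked(cmd: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def run_with_cluster(
    config: str, job: Callable[[], None], runner: Runner = _run_checked
) -> None:
    """Launch a cluster, run `job`, and destroy the cluster afterwards.

    The destroy step sits in a `finally` block, so it runs even when the
    job raises. If the launch itself fails, nothing is destroyed here.
    """
    runner(whirr_command("launch-cluster", config))
    try:
        job()
    finally:
        runner(whirr_command("destroy-cluster", config))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_with_cluster(
        "spec.conf",
        job=lambda: print("submit the Hadoop job here"),
        runner=lambda cmd: print(" ".join(cmd)),
    )
```

Passing the runner as a parameter keeps the sketch testable without a cloud account. A scheduled script would use the default runner and supply a real job.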
    38. demo #1: Running a cluster on Rackspace Cloud UK
    39. demo #2: HadoopClusterExample code walkthrough
    40. Thanks! Questions? Andrei Savu - @andreisavu - asavu@apache.org
    41. Resources & Links
    42. Fundamental Papers
        • Google Filesystem (2003): http://research.google.com/archive/gfs.html
        • Google MapReduce (2004): http://research.google.com/archive/mapreduce.html
        • Google BigTable (2006): http://research.google.com/archive/bigtable.html
        • Amazon Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
    43. Articles
        • Getting Real About Distributed System Reliability: http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability
        • Towards a Topology of Failure: http://steveloughran.blogspot.com/2011/11/towards-topology-of-failure.html
        • Hadoop in Cloud Infrastructures: http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    44. jclouds: “jclouds is an open source library that helps you get started in the cloud and reuse your java and clojure development skills” http://www.jclouds.org/
    45. RHadoop: A way of running R scripts on Hadoop. http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html
    46. Thanks! Andrei Savu - @andreisavu - asavu@apache.org