Cloud Computing & Apache Hadoop
 


A talk about infrastructure as a service, challenges and advantages, Apache Hadoop as an ecosystem of tools for distributed storage and data processing and Apache Whirr for deployment and simple management.


Cloud Computing & Apache Hadoop: Presentation Transcript

  • Cloud Computing & Hadoop: Data processing tools based on Apache Whirr. Andrei Savu (@andreisavu, asavu@apache.org), Dev:World 2012
  • About me
      • Software Engineer @ Cloudsoft
      • Apache Whirr PMC member
      • jclouds committer
      • Worked @ Facebook & Adobe
      • Connect with me on LinkedIn
  • @ Cloudsoft. Monterey: Platform for Intelligent Multi-cloud Application Mobility. http://www.cloudsoftcorp.com/
  • From 0 to working pipelines ... in minutes! * log analysis, ETL, crawling, machine learning, etc.
  • Manage a 10+ node Hadoop cluster with custom software as a ... cron job.
  • The plan
      • Infrastructure as a Service (context)
      • Apache Hadoop (data processing)
      • Apache Whirr (deployment)
      • Resources (food for thought)
      • Q/A (or asavu@apache.org)
  • What is Infrastructure as a Service? (IaaS)
  • #1 On-demand access to infrastructure components (physical or virtual)
  • #2 Pay as you go model
  • #3 API for automation
  • building blocks for truly elastic and highly efficient applications
  • 0% over provisioning
  • self-managing
  • highly available by design
  • serious downside: complexity
  • ... managed using libraries, tools & platforms (PaaS)
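The "elastic" and "0% over provisioning" slides above can be made concrete with a little arithmetic. The following is a minimal sketch, not something from the talk: the function names, the per-node capacity of 100 requests per second, and the load figures are all hypothetical. It shows that when capacity follows demand, the idle share is bounded by the size of one node rather than by a peak-load estimate.

```python
import math


def nodes_needed(load: float, capacity_per_node: float, min_nodes: int = 1) -> int:
    """Smallest node count whose combined capacity covers the load."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    return max(min_nodes, math.ceil(load / capacity_per_node))


def over_provisioning(load: float, capacity_per_node: float, nodes: int) -> float:
    """Fraction of the provisioned capacity that sits idle."""
    provisioned = nodes * capacity_per_node
    return (provisioned - load) / provisioned


# Hypothetical load samples (requests per second) and node capacity.
CAPACITY = 100
for load in [120, 80, 950, 400]:
    n = nodes_needed(load, CAPACITY)
    idle = over_provisioning(load, CAPACITY, n)
    print(f"load={load:4d}  nodes={n:2d}  idle={idle:.0%}")
```

In practice an IaaS API call to add or remove instances would sit behind a decision like this; the sketch deliberately leaves that part out because it depends on the provider.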
  • “All problems in computer science can be solved by another level of indirection” (David Wheeler) ... except for the problem of too many levels of indirection
  • to be continued * in a few minutes
  • Apache Hadoop: An ecosystem of components and tools
  • Overview
      • Java, C/C++
      • set of distributed systems (hdfs, mr etc.)
      • platform for distributed data processing
      • simple programming model (map / reduce)
      • can scale to 1000s of machines
      • designed to be highly available at the application level
      • https://hadoop.apache.org/
  • Components
      • HDFS (Storage)
      • MapReduce (Processing)
      • Hive, Pig (high level languages)
      • HBase (database)
      • ZooKeeper (coordination)
      • Oozie (workflow)
      • Mahout (machine learning)
      • Flume (log streaming)
      • Sqoop (data import)
      • Whirr (deployment)
      • etc.
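The Overview slide mentions a "simple programming model (map / reduce)". As a rough illustration of that model only, here is a single-process word count in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch, and a real job would distribute the same three steps across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The point of the model is that `map_phase` and `reduce_phase` are independent per line and per key, which is what lets a framework run them in parallel.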
  • Why run Hadoop on cloud infrastructure? http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
  • #1 “work near data” * if you already use cloud storage
  • #2 infrequent jobs: nightly or weekly
  • #3 for better security: platform multi-tenancy
  • #4 no upfront investment: fund from ongoing revenue
  • #5 easier to expand
  • #6 no network setup
  • #7 homogeneous “hardware”
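Reasons #2 (infrequent jobs) and #4 (no upfront investment) come down to simple arithmetic under a pay-as-you-go model. The sketch below compares a cluster that exists only while a nightly job runs with one that is kept running all month. The price of 0.25 per node-hour, the 2-hour job length and the 730-hour month are hypothetical placeholders, not figures from the talk, and storage and data transfer charges are ignored.

```python
def on_demand_cost(nodes: int, hours_per_run: float, runs_per_month: int,
                   price_per_node_hour: float) -> float:
    """Cost of a cluster that is launched for each run and destroyed afterwards."""
    return nodes * hours_per_run * runs_per_month * price_per_node_hour


def always_on_cost(nodes: int, price_per_node_hour: float,
                   hours_in_month: float = 730) -> float:
    """Cost of keeping the same cluster running for the whole month."""
    return nodes * hours_in_month * price_per_node_hour


# Hypothetical scenario: 10 nodes, a 2-hour job every night for 30 nights.
PRICE = 0.25  # assumed price per node-hour, in an arbitrary currency
nightly = on_demand_cost(nodes=10, hours_per_run=2, runs_per_month=30,
                         price_per_node_hour=PRICE)
permanent = always_on_cost(nodes=10, price_per_node_hour=PRICE)
print(f"on demand: {nightly:.2f}  always on: {permanent:.2f}")
```

The gap narrows as jobs become more frequent or longer, which is why the slide singles out nightly or weekly workloads.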
  • How to run Hadoop on Cloud Infrastructure?
  • Apache Whirr http://whirr.apache.org/
  • Overview: Apache Whirr provides a set of libraries for running cloud services:
      • cloud neutral & based on jclouds
      • has a common service API
      • smart defaults
      • available as a command line tool
  • First Steps
      • Home page: https://whirr.apache.org/
      • Whirr in 5 minutes: https://whirr.apache.org/docs/0.7.1/whirr-in-5-minutes.html
      • Quick Start Guide: https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
  • Supported Services
      • Apache Hadoop & YARN
      • CDH from Cloudera
      • Apache Cassandra
      • Chef & Puppet
      • elasticsearch
      • Ganglia
      • Apache Hama (incubating)
      • Apache HBase
      • Apache Mahout
      • Pig
      • Voldemort
      • Apache ZooKeeper
  • Hadoop Recipe
      whirr.cluster-name=test-hadoop
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
  • with Pig & Mahout
      whirr.cluster-name=test-hadoop
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client+pig-client,10 hadoop-datanode+hadoop-tasktracker
  • Start / Stop
      Launch cluster: whirr launch-cluster --config spec.conf
      Destroy cluster: whirr destroy-cluster --config spec.conf
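The recipes above are plain property files, so they are easy to generate when cluster sizes change often. The following is a small sketch of that idea: `render_recipe` is a hypothetical helper, not part of Whirr, and it emits only the two properties shown on the slides. It assumes the comma-separated form of `whirr.instance-templates`, where each entry is a node count followed by roles joined with `+`. A real recipe typically also needs provider and credential settings, which the slides do not show.

```python
from typing import List, Sequence, Tuple

# One template is (number of nodes, roles installed on each of those nodes).
Template = Tuple[int, Sequence[str]]


def render_recipe(cluster_name: str, templates: List[Template]) -> str:
    """Render the two properties from the slides as a properties-file string."""
    if not templates:
        raise ValueError("at least one instance template is required")
    parts = []
    for count, roles in templates:
        if count < 1 or not roles:
            raise ValueError("each template needs a positive count and a role")
        parts.append(f"{count} {'+'.join(roles)}")
    lines = [
        f"whirr.cluster-name={cluster_name}",
        "whirr.instance-templates=" + ",".join(parts),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    recipe = render_recipe("test-hadoop", [
        (1, ["hadoop-namenode", "hadoop-jobtracker"]),
        (10, ["hadoop-datanode", "hadoop-tasktracker"]),
    ])
    # Writing this text to spec.conf gives a file in the shape the
    # Start / Stop slide passes via --config.
    print(recipe, end="")
```

Keeping the recipe as data (counts and role lists) makes it straightforward to produce the Pig & Mahout variant by appending `mahout-client` and `pig-client` to the first template's roles.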
  • demo #1: Running a cluster on Rackspace Cloud UK
  • demo #2: HadoopClusterExample code walkthrough
  • Thanks! Questions? Andrei Savu - @andreisavu asavu@apache.org
  • Resources & Links
  • Fundamental Papers
      • Google Filesystem (2003): http://research.google.com/archive/gfs.html
      • Google MapReduce (2004): http://research.google.com/archive/mapreduce.html
      • Google BigTable (2006): http://research.google.com/archive/bigtable.html
      • Amazon Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
  • Articles
      • Getting Real About Distributed System Reliability: http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability
      • Towards a Topology of Failure: http://steveloughran.blogspot.com/2011/11/towards-topology-of-failure.html
      • Hadoop in Cloud Infrastructures: http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
  • jclouds: “jclouds is an open source library that helps you get started in the cloud and reuse your java and clojure development skills” http://www.jclouds.org/
  • RHadoop: A way of running R scripts on Hadoop. http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html
  • Thanks! Andrei Savu - @andreisavu asavu@apache.org