CloudStack and BigData


Published on

Thoughts about how to use CloudStack to develop a Big Data Solution in your data center.
Cloud as a virtual machine infrastructure and Big Data are converging as two key technical evolution in the data center. Virtualization enables multi-tenancy, heterogenous Operating systems and added security via isolation. Clouds like AWS EC2, Rackspace, or google GCE are good examples. Big Data tackles the challenge of the increase scale (amount) and complexity (type) of data faced in the enterprise. While Compute cloud show a departure from traditional hardware provisioning and configuration management via virtualization, big data is a departure from traditional relational databases and file systems. These two technical evolutions have been triggered by the new workloads of the internet (search, streaming) and the scale needed to server millions of users and millions/billions of objects to store or serve.
In this talk we show how CloudStack and its support for bare-metal provisioning is compatible with a public cloud. CloudStack being a data center orchestrator that can tackle both traditional enterprise workloads and internet scale/type workloads. Multiple zones can be created for compute cloud or big data. Big data can used as backend store to the compute cloud or as zone type to enabled big data workload on the bare metal hardware.
This hybrid mode of operation is seen as the next evolution of clouds and positions a data center orchestrator has more than a VM management system and a solution to big data management as well.

Published in: Technology
  • Big Data Now: 2012 Edition ---
    Are you sure you want to  Yes  No
    Your message goes here
  • Big Data: Principles and best practices of scalable realtime data systems ---
    Are you sure you want to  Yes  No
    Your message goes here
  • Hadoop Explained ---
    Are you sure you want to  Yes  No
    Your message goes here
  • so nice!
    Are you sure you want to  Yes  No
    Your message goes here
  • Very well described. Just met with a CS user that also has a 300 server Hadoop deployment across multiple zones. Interested in how to orchestrate Hadoop with CS.
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

CloudStack and BigData

  1. 1. CloudStackand Big DataSebastien Goasguen@sebgoaMay 22nd2013LinuxTag, Berlin
  2. 2. Google trends• Cloud computing trending down, while “Big Data”is booming. Virtualization remains “constant”.Start of “Clouds”
  3. 3. BigData on the Trigger• Cloud ComputingGoing down tothe “through ofDisillusionment”• “Big Data” on theTechnologyTrigger
  4. 4. • Big Data
  5. 5. What is Big Data ?• Large scale datasets– From scientific instruments– From Web apps logs– From Health records…• Complex datasets– Not necessarily large.– E.g Unstructured data– E.g Natural Language– E.g IBM Watson
  6. 6. A naturalevolution• From traditional file systems and databases• To large scale object store and nosqlmovement designed to handle massive scaleand concurrency
  7. 7. BigDataand map-reduce• While BigData is often associated with HDFS,Map-Reduce is the algorithm used toparallelize data processing.• BigData ≠ Map-Reduce ≠ HDFS• Map-reduce is a way to expressembarrassingly parallel work easily.• You can do Map-Reduce without HDFS.• E.g Basho map-reduce on riackCS
  8. 8. • CloudStack
  9. 9. How aboutIaaS ?
  10. 10. IaaS is really:• A Data Center Orchestrator– Data storage– Data movement– Data processing• That can:– Handle failures– Support large scale– Be programmed
  11. 11. What is CloudStack ?• Open source Infrastructure as a Service (IaaS)solution.• “Programmable” Data Center orchestrator• Hypervisor agnostic (with addition of baremetal provisioning)• Support scalable storage (Ceph, RIAK CS…)• Support complex enterprise networking (e.gFirewall, load balancer, VPN, VPC…)• Multi-tenant
  12. 12. A bit of History• Original company VMOPs (2008)– Founded by Sheng Liang former lead dev on JVM• Open source (GPLv3) as CloudStack• Acquired by Citrix (July 2011)• Relicensed under ASL v2 April 3, 2012• Accepted as Apache Incubating Project April16, 2012• First Apache (ACS 4.0) released november2012• Top Level Project Since March 2013.
  13. 13. Why ASF ?• Open Sourced CloudStack to:– Build a community– Facilitate the building of an ecosystem– Faster time to market• ASF highly recognized OSS foundation.• ASF clear processes• Individual contributions, companies have nostanding
  14. 14. Monthly Contributors
  15. 15. Companies
  16. 16. MultipleContributorsSungard: Announced lastweek that 6 developerswere joining the ApacheprojectSchuberg Philis: Bigcontribution inbuilding/packaging andNicira supportGo Daddy: MavenbuildingCaringo: Support for ownobject storeBasho: Support forRiackCS
  17. 17. • The Apache SoftwareFoundation
  18. 18. Apache SoftwareFoundation
  19. 19. • 35 projects in incubation:– 11 Hadoop related (including Apache provisonr)– ~30% Big Data related– +jclouds• 116 top level projects:– ~14 cloud or bigdata +10%– Deltacloud, Libcloud, Whirr– Hadoop, couchdb, cassandra– Bigtop, accumulo, lucene, UIMA– CloudStack
  20. 20. HadoopEcosystem• Complex ecosystem to perform data processingon big-data• Software components can be managed in VMsvia CloudStack
  21. 21. • BigData and CloudStack
  22. 22. CloudStack and BigData• Apache CloudStack is a data centerorchestrator• BigData solutions as storage backends forimage catalogue and large scale instancestorage.• BigData solutions as workloads to CloudStackbased clouds.
  23. 23. Storage• Primary Storage:– Anything that can be mounted on the node of acluster.– Cluster LVM, iSCSI, NFS, Ceph– Holds disk images of running VMs and user blockstores.• Secondary Storage:– Available across the zone– Holds snapshots and templates (image repo)– Can use multiple object stores (Gluster , Ceph,riackCS, Swift, Caringo )
  24. 24. Big Dataand CloudStack• “Big Data” solutions can be used assecondary storage (OpenStack swift, Caringo,CephFS, Gluster FS, RiackCS…).• Used to deploy a large scale storage backendto manage user images, and user datavolumes.• Primary intent is not to use it inside the VMsfor data processing.
  25. 25. CloudStackand Baremetal• CS supports baremetal provisioning.• This opens the door to multiple scenarios forBig-Data store, Clouds– Provision Hadoop cluster on baremetal– Operate “Hybrid” cloud: part Hypervisor for VMprovisioning, part baremetal for data store.– Reconfigure entire cloud on-demand
  26. 26. “Traditional”CS deployment• Farm of hypervisors, separate secondarystorage to store VM images and data volumes.
  27. 27. “Bare Metal” Hybriddeployment• Set of hypervisors, stand-alone secondarystorage, bare metal cluster with specializedhardware or software.• Access Big-Data store from VM guests
  28. 28. “Bare metal” cluster as secondarystorage• Use bare-metal provisioning to managelarges-scale secondary storage
  29. 29. “Pure” Big-Data store• Use CS as a traditional data centerprovisioning system and build a Big-Data storeon-demand
  30. 30. Combinations• CloudStack offers the possibility to switchbetween these modes on-demand• An elastic reconfigurable cloud• Just be careful not to override your data 
  31. 31. Big Data as a Workload to theCloudtools and demo…
  32. 32. ApacheWhirr• Big Data Provisioningtool• Deploys Hadoop, cdh,Hbase, Yarn, etc in theCloud• Use jclouds• Works with multiplecloud providersincluding CloudStack
  33. 33. jClouds• Under Incubation at theApache SoftwareFoundation (ASF)• Wrapper to multiplecloud providers
  34. 34. WhirrConfigurationwhirr.cluster-name=myhadoopclusterwhirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktrackerwhirr.provider=cloudstackwhirr.private-key-file=${sys:user.home}/.ssh/id_rsawhirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pubwhirr.env.repo=cdh4whirr.hadoop.install-function=install_cdh_hadoopwhirr.hadoop.configure-function=configure_cdh_hadoopwhirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8whirr.endpoint=<your access key>whirr.credential=<your secret key>
  35. 35. • Demo ?
  36. 36. Other tools• Brooklyn (• Apache Provisionr incubating
  37. 37. Others: Pallet• Clojure basedprovisioning tool• Provisions Hadoopclusters in the cloud.• Equivalent to Whirr butin clojure
  38. 38. CloStack• Clojure client forCloudStack• Uses native CloudStackAPI• Developed by @pyr , aCloudStack based publiccloud providers
  39. 39. More thanhadoop
  40. 40. On-Going Big-Datadevelopment• Hadoop being an Apache project written inJava, there is great potential synergy betweenCloudStack and Hadoop:e.g Develop Elastic Map-Reduce mechanisms toprovide map-reduce processing in CS backed byHDFS. Implementation of AWS EMR API.• Integration of Basho map-reduce (coming in4.2 release)
  41. 41. GSoC• ASF is a mentoring organization for GSoC• CloudStack has several proposals underconsideration– Improved CloudStack support in Apache Whirr andProvisionr– Integration of Apache Mesos with CloudStack
  42. 42. Info• Apache Top Level project•• #cloudstack on• @cloudstack on Twitter•• contributions and feedback, Join the fun !