A look at cloud and big data trends and history. While Big Data arrived first on the scene (Google File System, Hadoop, Dynamo), Cloud was first in the hype cycle, as Google Trends shows clearly. Amazon AWS, however, has already deployed analytics services on its cloud, while open source IaaS solutions are still struggling to deliver an EC2 clone. Cloud and Big Data meet at three points: 1) use an EC2 clone and an S3 clone (Riak CS, GlusterFS, etc.) to build a cloud; 2) use Big Data solutions as a backend to your cloud to provide EBS or a large scale image catalogue; 3) deploy Big Data solutions on your cloud with tools like Apache Whirr, Pallet, and newer DevOps tool chains built on Vagrant and friends.
10. New Distributed systems for:
Large scale datasets
• From scientific instruments
• From web app logs
Complex datasets
• Not necessarily large.
Object stores
• S3 clones
11. BigData and map-reduce
• While BigData is often associated with HDFS,
Map-Reduce is the programming model used to
parallelize data processing.
• BigData ≠ Map-Reduce ≠ HDFS
• Map-reduce is a way to express
embarrassingly parallel work easily.
• You can do Map-Reduce without HDFS.
• e.g. Basho MapReduce on Riak CS
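The point that Map-Reduce needs no HDFS can be illustrated with the classic Unix pipeline word count, a minimal sketch in which each stage plays the map, shuffle, or reduce role:

```shell
# Word count as a Unix pipeline -- the classic map-reduce-without-HDFS example:
#   tr   = map     (emit one word per line)
#   sort = shuffle (group identical keys together)
#   uniq = reduce  (count each group)
printf 'cloud data cloud\n' | tr ' ' '\n' | sort | uniq -c
```

Basho's MapReduce on Riak follows the same model: map functions run where the data lives and a reduce phase aggregates the results, with Riak rather than HDFS providing the storage layer.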
24. Clouds and BigData
• Object store + compute IaaS to build EC2+S3
clone
• BigData solutions as storage backends for
image catalogue and large scale instance
storage.
• BigData solutions as workloads to CloudStack
based clouds.
25. EC2, S3 clone
• An open source IaaS with an EC2
wrapper, e.g. OpenNebula
• Deploy an S3-compatible object store
separately, e.g. Riak CS
• Two independent distributed systems
deployed
Cloud = EC2 + S3
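From a client's point of view this might look as follows; the endpoints and ports below are hypothetical, not taken from any particular deployment, and standard EC2/S3 tooling is simply pointed at each service:

```shell
# Hypothetical endpoints: the EC2 wrapper (e.g. OpenNebula's econe server)
# and the Riak CS object store are two independent services.
aws ec2 describe-instances --endpoint-url http://ec2.cloud.example.com:4567
aws s3 ls --endpoint-url http://s3.cloud.example.com:8080
```

The two commands talk to two separately deployed distributed systems; only the credentials and the client tooling are shared.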
26. Big Data
as IaaS backend
“Big Data” solutions can be used as secondary
storage.
27. Example
• Open source IaaS + EC2 wrapper, e.g.
CloudStack
• Deploy an S3-compatible object store, e.g.
Riak CS, Ceph, or GlusterFS
• Use S3 as the image store
• Your EC2 service is a customer of your
S3 service
• Logstash + Elasticsearch for
logs/monitoring
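The Logstash + Elasticsearch bullet can be sketched as a minimal pipeline for a recent Logstash, assuming a default CloudStack management-server log path and a local Elasticsearch node (both paths and hosts are assumptions, adjust to your deployment):

```shell
# Hypothetical paths/hosts: tail the CloudStack management log and
# index it into a local Elasticsearch node.
cat > /etc/logstash/conf.d/cloudstack.conf <<'EOF'
input  { file { path => "/var/log/cloudstack/management/management-server.log" } }
output { elasticsearch { hosts => ["localhost:9200"] } }
EOF
```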
34. Conclusions
• Big Data is “catching up”
• Tackle the big three head on:
• BigData, Cloud and DevOps
• Add a big data backend to your cloud
from the start
• Provide Big Data services on your cloud
38. Get Involved with Apache
CloudStack
Web: http://cloudstack.apache.org/
Mailing Lists: cloudstack.apache.org/mailing-lists.html
IRC: irc.freenode.net:6667 #cloudstack #cloudstack-dev
Twitter: @cloudstack
LinkedIn: www.linkedin.com/groups/CloudStack-Users-Group-3144859
If it didn’t happen on the mailing list, it didn’t happen.
Editor's Notes
Walmart: 1M customer transactions every hour, DB of 2.5 PB in 2010. http://www.economist.com/node/15557443?story_id=15557443
Square Kilometre Array: 10–500 TB per second, ~1 exabyte per day.
Facebook, June 2012: 100 PB Hadoop cluster, ½ PB per day = 180 PB per year -> ~350 PB now?
CERN: ~20 PB in EOS.
250k cables in Cablegate; War and Peace is ~450k words; 260M words in Cablegate = ~500x War and Peace.