Why are we here? What will we get from this class? Background and our goals.
Teaching method: stories, videos, and pictures to convey the (hidden) concepts and leave an impression. 1. What is Cloud Computing? Tell the story from SaaS, PaaS, and Utility Computing to Cloud Computing. Stories about the key players. User experiences. Slogan: Cloud Computing = 2. Why Cloud Computing? Some key characteristics. 3. What problems does Cloud Computing face? Discussion; imagination for your own dream. People's comments: positive and negative; form an objective view.
Story line: the video 高清《云计算》最浅显解谜云故事.flv. First write down your own opinion about "cloud computing" — whatever comes to mind. Questions: What? Who? Why? How? Pros and cons? The most important question: what is its relation to me? Watch the video with these questions in mind and search for the answers. Let's find the key concepts in the story: SaaS, PaaS, Utility Computing, Cloud Computing.
vs. the traditional EULA (End User License Agreement). Level 4, Scalable, Configurable, Multi-Tenant-Efficient: adds a multi-tier architecture supporting a load-balanced farm of identical application instances running on a variable number of servers. The provider scales the number of servers up and down with user demand, without modifying the application's architecture.
LAMP is the industry standard, but management is a hassle: configuration and tuning; backup, recovery, and disk space management; hardware failures and system crashes; software updates and security patches; log rotation, cron jobs, and much more. A redesign is needed once your database exceeds one box.
It is like letting the user plug into a wall socket: you get the same voltage Microsoft gets, you just use less and pay less. The goal of utility computing is to give computing resources the same kind of service model: a user can tap the computing resources of a Fortune 500 company, just using less and paying less. This is an important aspect of cloud computing.
A Cloud Computing provider needs very large datacenters, large-scale software infrastructure, and the operational expertise to run them. For companies like Google, Amazon, Microsoft, and eBay this is a natural step. Another important driver is that a very large datacenter enjoys a 5-7x economic advantage over a medium-sized datacenter. Building extremely large-scale commodity-computer datacenters is therefore seen as a key enabler of cloud computing.
What would you do with 100,000 PCs? It's really possible now! Calculate the Amazon EC2 service price for this:
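As a back-of-the-envelope answer, here is a minimal sketch in Python. The hourly rate is an assumption, roughly matching 2009-era pricing for a small Linux EC2 instance (about $0.10 per instance-hour); check the current AWS price list for real numbers.

```python
# Rough EC2 cost estimate (illustrative only).
# HOURLY_RATE_USD is an assumed 2009-era small-instance price.
HOURLY_RATE_USD = 0.10

def ec2_cost(num_instances, hours):
    """Total cost of running num_instances for the given hours each."""
    return num_instances * hours * HOURLY_RATE_USD

# 100,000 instance-hours, e.g. 1,000 instances for 100 hours:
print(ec2_cost(1000, 100))
```

At this assumed rate, 100,000 instance-hours cost about $10,000, which is the point of the slide: supercomputer-scale capacity is rentable for a modest budget.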
A cluster built from large numbers of cheap PCs and cheap network connections — but not limited to that: the Data Center Paradigm.
Slides borrowed from "Data-Rich Computing: Where It's At," Phillip B. Gibbons, Intel Research Pittsburgh, Hadoop Summit 2008.
In biology! Walmart transactions! The trend in science: toward data. The trend in data: more of it, more measurements. Once it's about data, the focus starts to fall upon tools for analyzing data. [The sciences are becoming more data-driven; statistics can directly solve science problems.]
Some pages from http://www.umiacs.umd.edu/~jimmylin/
Will the universe: expand faster and faster (look at the sky at night: empty)? Expand forever but slow down (look at the sky at night: about the same; most likely)? Eventually contract into another big bang (look at the sky at night: uh-oh)? Nt = number of time samples in the detector data; Np = number of sky pixels in the maps derived from the detector data; Nb = number of bins of multipole in the power spectra derived from the maps.
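To see why Nt and Np matter, note the standard maximum-likelihood CMB map-making step (a textbook result, not from these slides; it assumes Gaussian detector noise with covariance N and a pointing matrix A of size Nt × Np):

```latex
\hat{m} \;=\; \left(A^{\mathsf{T}} N^{-1} A\right)^{-1} A^{\mathsf{T}} N^{-1} d
```

Here $d$ is the length-$N_t$ time-ordered data vector and $\hat{m}$ the length-$N_p$ map; the angular power spectrum is then estimated from $\hat{m}$ in $N_b$ multipole bins. The sheer size of these matrices is why CMB analysis needs supercomputing-scale resources.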
3 slides from "Data-Intensive Super Computing" (DISC), Randal E. Bryant, Carnegie Mellon University.
4 slides from Michael Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, 2009, in press.
Outline stays the same, map and reduce change to fit the problem
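That fixed outline can be sketched as a single-process Python skeleton. The map → shuffle → reduce driver never changes; only the `mapper` and `reducer` functions are swapped per problem (word count shown here as the classic example). This is an illustrative toy, not Hadoop's actual API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Emit (word, 1) for every word in an input record.
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Sum the counts for one word.
    yield (key, sum(values))

def run(records, mapper, reducer):
    """Fixed outline: map phase, shuffle (group by key), reduce phase."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=itemgetter(0))          # shuffle: bring equal keys together
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return dict(out)

print(run(["the cloud", "the data center"], mapper, reducer))
```

Swapping in a different `mapper`/`reducer` pair (e.g. emitting inverted-index postings, or read alignments as in CloudBurst) changes the problem being solved without touching `run`.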
Use AWS if you: want to use third-party open source software; have existing code; want to transfer your web app to your own machines/servers later on; want to port code to another language; want control (EC2 instances and machine images), e.g. to stress/load test an app, just load up 1000 instances; want features such as messaging, payment services, etc.
Introduction to Cloud Computing. http://net.pku.edu.cn/~course/cs402/2009/ 彭波 (Peng Bo) [email_address] School of Electronics Engineering and Computer Science, Peking University. 6/30/2009
Some Other Voices. "It's stupidity. It's worse than stupidity: it's a marketing hype campaign. Somebody is saying this is inevitable — and whenever you hear somebody saying that, it's very likely to be a set of businesses campaigning to make it true." Richard Stallman, quoted in The Guardian, September 29, 2008. "The interesting thing about Cloud Computing is that we've redefined Cloud Computing to include everything that we already do. ... I don't understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads." Larry Ellison, quoted in the Wall Street Journal, September 26, 2008.
What would you do with 1,000 PCs, or even 100,000 PCs?
Cloud is coming… Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months — faster than Moore's Law. "The Data Center is a Computer." Parallelism everywhere. Massive, Scalable, Reliable. Resource Management. Data Management. Programming Model & Tools.
NERSC user George Smoot wins the 2006 Nobel Prize in Physics. Smoot and Mather's 1992 COBE experiment showed the anisotropy of the CMB. Cosmic Microwave Background (CMB) radiation: an image of the universe at an age of 400,000 years.
Used a web crawler to gather 151M HTML pages weekly, 11 times
Generated 1.2 TB of log information
Analyzed page statistics and change frequencies
“ Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl.”
Fetterly, Manasse, Najork, Wiener (Microsoft, HP), “A Large-Scale Study of the Evolution of Web Pages,” Software-Practice & Experience, 2004
[Figure: subject genome and the short sequencer reads sampled from it — can they be reassembled?]
[Figure: reference sequence with subject reads placed by alignment]
[Figure: subject reads aligned against the reference sequence]
2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started, to support the standalone development of Map-Reduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
October 2006 - Research cluster reaches 600 Nodes
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January 2007 - Research cluster reaches 900 nodes
April 2007 - Research clusters - 2 clusters of 1000 nodes
Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
From Theory to Practice. You ↔ Hadoop Cluster: 1. scp data to the cluster. 2. Move data into HDFS. 3. Develop code locally. 4. Submit the MapReduce job (4a. go back to step 3). 5. Move data out of HDFS. 6. scp data from the cluster.
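The six steps above can be written down as concrete shell commands. The sketch below builds them in Python; the gateway hostname (`hadoop-gw`), paths, and jar name are hypothetical placeholders, not real course infrastructure, though `hadoop fs -put/-get` and `hadoop jar` are the standard Hadoop CLI commands.

```python
def workflow(host="hadoop-gw", local="input.txt",
             hdfs_in="input/", hdfs_out="output"):
    """Return the shell commands for one edit-submit-fetch cycle."""
    return [
        f"scp {local} {host}:/tmp/{local}",                   # 1. scp data to cluster
        f"ssh {host} hadoop fs -put /tmp/{local} {hdfs_in}",  # 2. move data into HDFS
        # 3. develop code locally, then:
        f"ssh {host} hadoop jar wordcount.jar WordCount {hdfs_in} {hdfs_out}",  # 4. submit job
        f"ssh {host} hadoop fs -get {hdfs_out} /tmp/{hdfs_out}",  # 5. move data out of HDFS
        f"scp -r {host}:/tmp/{hdfs_out} ./{hdfs_out}",        # 6. scp data from cluster
    ]

for cmd in workflow():
    print(cmd)
```

Step 4a (go back to step 3) is the loop: only the job submission and fetch steps are re-run after each code change; the input data stays in HDFS.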
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the 19th ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA: ACM Press, 2003.
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada: ACM, 2008.