On March 7, Google Docs suffered a large-scale leak of user documents. U.S. privacy advocacy groups have since petitioned the government to take action against Google and compel it to strengthen the security of its cloud computing products.
Problem of Data Lock-in
Some Other Voices
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign. Somebody is saying this is inevitable — and whenever you hear somebody saying that, it's very likely to be a set of businesses campaigning to make it true." (Richard Stallman, quoted in The Guardian, September 29, 2008)
"The interesting thing about Cloud Computing is that we've redefined Cloud Computing to include everything that we already do. ... I don't understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads." (Larry Ellison, quoted in the Wall Street Journal, September 26, 2008)
What does this have to do with me?!
What would you do with 1,000 PCs, or even 100,000?
Cloud is coming…
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, faster than Moore's Law.
"The Data Center is a Computer"
Parallelism everywhere
Massive, Scalable, Reliable
Resource Management, Data Management, Programming Models & Tools
NERSC user George Smoot wins the 2006 Nobel Prize in Physics.
Smoot and Mather's 1992 COBE experiment showed the anisotropy of the CMB.
Cosmic Microwave Background radiation (CMB): an image of the universe at an age of 400,000 years.
The Current CMB Map
Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.
Extracting these micro-Kelvin fluctuations from inherently noisy data is a serious computational challenge.
Source: J. Borrill, LBNL
Evolution of CMB Data Sets: Cost > O(Np^3)

Experiment        | Nt      | Np     | Nb     | Limiting Data | Notes
COBE (1989)       | 2x10^9  | 6x10^3 | 3x10^1 | Time          | Satellite, workstation
BOOMERanG (1998)  | 3x10^8  | 5x10^5 | 3x10^1 | Pixel         | Balloon, 1st HPC/NERSC (4yr)
WMAP (2001)       | 7x10^10 | 4x10^7 | 1x10^3 | ?             | Satellite, analysis-bound
Planck (2007)     | 5x10^11 | 6x10^8 | 6x10^3 | Time/Pixel    | Satellite, major HPC/DA effort
POLARBEAR (2007)  | 8x10^12 | 6x10^6 | 1x10^3 | Time          | Ground, NG-multiplexing
CMBPol (~2020)    | 10^14   | 10^9   | 10^4   | Time/Pixel    | Satellite, early planning/design

(Nt -> Np -> Nb: successive stages of data compression)
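To see what the O(Np^3) cost means in practice, compare Planck to COBE using the Np values from the table; this is only a back-of-the-envelope estimate, assuming the cubic term dominates:

```latex
\[
\frac{\mathrm{cost}(\mathrm{Planck})}{\mathrm{cost}(\mathrm{COBE})}
\;\sim\; \left(\frac{6\times10^{8}}{6\times10^{3}}\right)^{3} \;=\; 10^{15}
\]
```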
Example: Wikipedia Anthropology
Download entire revision history of Wikipedia
4.7 M pages, 58 M revisions, 800 GB
Analyze editing patterns & trends
Hadoop on 20-machine cluster
Kittur, Suh, Pendleton (UCLA, PARC), "He Says, She Says: Conflict and Coordination in Wikipedia," CHI 2007.
Finding: an increasing fraction of edits goes to work only indirectly related to article content.
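For flavor, here is a minimal Hadoop Streaming job that counts revisions per page, the kind of scan such an analysis begins with. The tab-separated input layout (page_id, timestamp, editor) and the name revcount.py are hypothetical stand-ins for the parsed revision history, not the paper's actual pipeline:

```python
#!/usr/bin/env python
# revcount.py -- minimal Hadoop Streaming job: count revisions per page.
# Cluster:  hadoop jar hadoop-streaming.jar -input revs -output counts \
#             -mapper 'revcount.py map' -reducer 'revcount.py reduce' -file revcount.py
# Local test: cat revisions.tsv | ./revcount.py map | sort | ./revcount.py reduce
# Assumed input layout: page_id \t timestamp \t editor (hypothetical).
import sys

def mapper():
    # Emit one (page_id, 1) pair per revision line.
    for line in sys.stdin:
        page_id = line.split('\t', 1)[0]
        print('%s\t1' % page_id)

def reducer():
    # Input arrives sorted by key; sum the counts per page.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

if __name__ == '__main__':
    mapper() if sys.argv[1] == 'map' else reducer()
```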
Example: Scene Completion
Image Database Grouped by Semantic Content
30 different Flickr.com groups
2.3 M images total (396 GB).
Select Candidate Images Most Suitable for Filling Hole
Classify images with gist scene detector [Torralba]
Local context matching
Index images offline
50 min. scene matching, 20 min. local matching, 4 min. compositing
Reduces to 5 minutes total by using 5 machines
Flickr.com has over 500 million images …
Hays, Efros (CMU), "Scene Completion Using Millions of Photographs," SIGGRAPH 2007.
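The candidate-matching stage is embarrassingly parallel, which is why adding machines pays off. Below is a hedged sketch of fanning the scoring out over worker processes; gist_distance and the descriptor layout are hypothetical stand-ins for the gist scene descriptor comparison [Torralba], with the real 2.3M descriptors assumed precomputed offline:

```python
# Hedged sketch of the parallel matching stage: score every candidate image
# against the query scene on a pool of workers, keep the k best matches.
from multiprocessing import Pool
import heapq

def gist_distance(query_desc, cand_desc):
    # Stand-in metric: sum of squared differences between descriptor vectors.
    return sum((q - c) ** 2 for q, c in zip(query_desc, cand_desc))

def score(args):
    image_id, cand_desc, query_desc = args
    return gist_distance(query_desc, cand_desc), image_id

def best_matches(query_desc, descriptors, k=20, workers=5):
    """descriptors: dict of image_id -> precomputed descriptor vector."""
    jobs = [(i, d, query_desc) for i, d in descriptors.items()]
    with Pool(workers) as pool:
        # imap_unordered streams results back as workers finish.
        return heapq.nsmallest(k, pool.imap_unordered(score, jobs, chunksize=1024))
```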
Example: Web Page Analysis
Used a web crawler to fetch 151M HTML pages weekly, for 11 consecutive weekly crawls
Generated 1.2 TB of log information
Analyze page statistics and change frequencies
"Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl."
Fetterly, Manasse, Najork, Wiener (Microsoft, HP), “A Large-Scale Study of the Evolution of Web Pages,” Software-Practice & Experience, 2004
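A hedged sketch of the change-frequency measurement: fingerprint each page's content on every weekly crawl, then count how often the fingerprint changes. The log layout (url, week number, content hash) is an assumed stand-in for the study's actual crawl logs:

```python
# Sketch: estimate per-page change frequency from weekly content fingerprints.
import hashlib
from collections import defaultdict

def fingerprint(html):
    # Hash the page body; a changed hash means the page changed that week.
    return hashlib.sha1(html.encode('utf-8')).hexdigest()

def change_frequency(log_lines):
    """log_lines: iterable of 'url\tweek\tfingerprint' records."""
    history = defaultdict(list)
    for line in log_lines:
        url, week, fp = line.rstrip('\n').split('\t')
        history[url].append((int(week), fp))
    freq = {}
    for url, snaps in history.items():
        snaps.sort()  # order snapshots by week
        changes = sum(1 for (_, a), (_, b) in zip(snaps, snaps[1:]) if a != b)
        freq[url] = changes / max(len(snaps) - 1, 1)  # fraction of weeks changed
    return freq
```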
Figure: DNA sequencing. The sequencer samples a subject genome into a large set of short, overlapping reads; the question is where each read belongs.
The genome of an organism encodes its genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G
Bacteria: ~5 million bp
Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)
Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson et al., 2009)
Recent studies of entire human genomes have used 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads
Figure: read alignment. Each subject read is placed at its best-matching position along the reference sequence, tolerating a small number of per-read differences.
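CloudBurst parallelizes this alignment with MapReduce, using short exact seeds shared between reads and the reference. Below is a minimal single-machine sketch of that seed-and-extend idea, not CloudBurst itself; the seed length, mismatch budget, and toy sequences are illustrative assumptions:

```python
# Single-machine sketch of seed-and-extend read mapping (the core idea that
# CloudBurst distributes with MapReduce, not CloudBurst itself).
from collections import defaultdict

SEED_LEN = 8          # illustrative seed length
MAX_MISMATCHES = 2    # illustrative per-read mismatch budget

def build_seed_index(reference):
    """Map every SEED_LEN-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - SEED_LEN + 1):
        index[reference[i:i + SEED_LEN]].append(i)
    return index

def count_mismatches(a, b, limit):
    m = 0
    for x, y in zip(a, b):
        if x != y:
            m += 1
            if m > limit:
                break
    return m

def map_read(read, reference, index):
    """Return (position, mismatches) alignments for one read."""
    hits = []
    # Pigeonhole idea: a read with few errors shares at least one exact
    # seed with the reference (given enough non-overlapping seeds).
    for offset in range(0, len(read) - SEED_LEN + 1, SEED_LEN):
        seed = read[offset:offset + SEED_LEN]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            mm = count_mismatches(read, reference[start:start + len(read)], MAX_MISMATCHES)
            if mm <= MAX_MISMATCHES:
                hits.append((start, mm))
    return sorted(set(hits))

if __name__ == "__main__":
    ref = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"   # toy reference
    read = "AGATGCTTACCTAT"                 # toy read with one mismatch
    print(map_read(read, ref, build_seed_index(ref)))
```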
Evaluated running time on a local 24-core cluster
Running time increases linearly with the number of reads
Michael Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, 2009 (in press).
Example: Data Mining
A del.icio.us crawl yields a bipartite graph covering 802,739 web pages and 1,021,107 tags.
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang, "PFP: Parallel FP-Growth for Query Recommendation," RecSys 2008, pp. 107-114.
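To make this concrete, here is a minimal sketch of the counting step that PFP runs as its first MapReduce pass (essentially a word-count over tags), plus a simple co-occurrence count as a stand-in for the later group-wise mining. The toy page-to-tags data is hypothetical, and this is sequential Python rather than the paper's distributed implementation:

```python
# Sketch of PFP's first pass (parallel tag counting) plus a stand-in
# pair count, run sequentially on toy data.
from collections import Counter
from itertools import combinations

# Hypothetical slice of the bipartite graph: page -> set of tags.
pages = {
    "p1": {"python", "mapreduce"},
    "p2": {"python", "hadoop", "mapreduce"},
    "p3": {"hadoop", "cloud"},
}

# Step 1 of PFP: count single-tag frequencies (a word-count MapReduce
# job in the real system).
tag_counts = Counter(tag for tags in pages.values() for tag in tags)

# The frequent tags are then partitioned into groups, each mined by an
# independent FP-Growth job; as a simple stand-in, count co-occurring pairs.
pair_counts = Counter()
for tags in pages.values():
    for a, b in combinations(sorted(tags), 2):
        pair_counts[(a, b)] += 1

print(tag_counts.most_common(3))
print(pair_counts.most_common(3))
```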
Large-Scale Data Processing + Cloud Computing: An Example
Try it on these collections:
In early 2006, we collected 870 million distinct web pages in China, about 2 TB in total.
Commercial search engines such as Google and Yahoo collect on the order of 100+ billion pages.
Divide and Conquer
Figure: the "work" is partitioned into pieces w1, w2, w3; each piece goes to a "worker" that produces a partial result r1, r2, r3; the partial results are combined into the final "result". (Partition, then Combine; see the sketch below.)
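A minimal sketch of the pattern in the figure, using Python's multiprocessing; the stand-in computation (summing numbers) is arbitrary, and partition/worker/combine mirror the w_i and r_i labels above:

```python
# Divide and conquer: partition the work, run workers in parallel,
# combine the partial results.
from multiprocessing import Pool

def partition(work, n):
    """Split `work` into n roughly equal chunks (w1, w2, ..., wn)."""
    k = (len(work) + n - 1) // n
    return [work[i:i + k] for i in range(0, len(work), k)]

def worker(chunk):
    """Each worker turns its chunk w_i into a partial result r_i."""
    return sum(chunk)  # stand-in computation

def combine(results):
    """Merge the partial results r_i into the final result."""
    return sum(results)

if __name__ == "__main__":
    work = list(range(1_000_000))
    with Pool(3) as pool:
        print(combine(pool.map(worker, partition(work, 3))))
```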
2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started, to support the standalone development of MapReduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
October 2006 - Research cluster reaches 600 nodes
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January 2007 - Research cluster reaches 900 nodes
April 2007 - Research clusters - 2 clusters of 1000 nodes
September 2008 - Scaling Hadoop to 4,000 nodes at Yahoo!
From Theory to Practice (you, working against a Hadoop cluster):
1. scp data to the cluster
2. Move the data into HDFS
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to Step 3
5. Move the data out of HDFS
6. scp the data back from the cluster
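A scripted version of the loop above, as a hedged sketch: the hostnames, paths, and streaming-jar location are hypothetical placeholders, while the commands themselves (scp, hadoop fs -put/-get, hadoop jar with Hadoop Streaming) are the standard ones:

```python
# Hedged end-to-end sketch of the theory-to-practice workflow.
# Hosts and paths below are hypothetical placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Copy input to the cluster gateway.
run(["scp", "input.txt", "user@gateway:/tmp/input.txt"])
# 2. Move it into HDFS.
run(["ssh", "user@gateway", "hadoop fs -put /tmp/input.txt /user/you/input"])
# 4. Submit the streaming job (mapper.py / reducer.py developed locally in step 3).
run(["ssh", "user@gateway",
     "hadoop jar hadoop-streaming.jar "
     "-input /user/you/input -output /user/you/output "
     "-mapper mapper.py -reducer reducer.py "
     "-file mapper.py -file reducer.py"])
# 5-6. Pull the results back out of HDFS and off the cluster.
run(["ssh", "user@gateway", "hadoop fs -get /user/you/output/part-* /tmp/out.txt"])
run(["scp", "user@gateway:/tmp/out.txt", "out.txt"])
```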
The possibility of using unlimited resources on demand, anytime and anywhere
The possibility of building and deploying applications that automatically scale to tens of thousands of computers
The possibility of building and running programs that deal with prodigious volumes of data
How to make it real?
Distributed File System
Distributed Computing Framework
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA: ACM Press, 2003.
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada: ACM, 2008.
Google App Engine
App Engine handles HTTP(S) requests, nothing else
Think RPC: request in, processing, response out
Works well for the web and AJAX; also for other services
App configuration is dead simple
No performance tuning needed
Everything is built to scale
"Infinite" number of apps, requests/sec, storage capacity
APIs are simple, stupid
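A minimal handler in the classic App Engine Python runtime shows the request-in, process, response-out shape; the URL route and handler name are illustrative, and app configuration amounts to a few lines of app.yaml mapping URLs to this script:

```python
# Minimal App Engine request handler (classic Python runtime, webapp framework).
# "Think RPC": a request comes in, is processed, a response goes out.
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello from App Engine!')

# Route table: illustrative, maps the root URL to the handler above.
application = webapp.WSGIApplication([('/', MainPage)], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()
```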
App Engine Architecture
Figure: a Python VM process runs the app plus stdlib on a read-only filesystem, handling request/response traffic; it calls stateful APIs (memcache, datastore) and stateless APIs (mail, images, urlfetch).