  • Why are we here? What will we get from this class? Background and our goals.
  • Teaching method: stories, video, and pictures, to draw out the (hidden) concepts and leave an impression. 1. What is Cloud Computing? Tell the story from SaaS, PaaS, and Utility Computing to Cloud Computing; the key players' stories; user experiences. Slogan: Cloud Computing = 2. Why Cloud Computing? Some key characteristics. 3. What problems does Cloud Computing face? Discussion; imagination for your own dream; people's comments, positive and negative; an objective understanding.
  • Story line: the video 高清《云计算》最浅显解谜云故事.flv (an HD video that explains the cloud story in the simplest terms). First write down your own opinion about “cloud computing”, whatever comes to mind. Questions: What? Who? Why? How? Pros and cons? The most important question is: what does it have to do with me? Watch the video with these questions in mind and look for the answers. Let’s find the key concepts in the story: SaaS, PaaS, Utility Computing, Cloud Computing.
  • Versus the traditional EULA (End User License Agreement). Level 4, Scalable, Configurable, Multi-Tenant-Efficient: adds a multitier architecture that supports a load-balanced farm of identical application instances running on a variable number of servers. The provider adds or removes servers according to user demand, without changing the software architecture.
  • LAMP is the industry standard, but management is a hassle: configuration and tuning; backup and recovery; disk space management; hardware failures and system crashes; software updates and security patches; log rotation, cron jobs, and much more. A redesign is needed once your database exceeds one box.
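  • It is like plugging into a wall socket: you get the same voltage that Microsoft gets, you just use less and pay less. The goal of utility computing is to give computing resources the same service capability: users can tap the computing resources that a Fortune 500 company has, but use less and pay less. This is an important aspect of cloud computing.
  • A Cloud Computing provider needs very large datacenters, large-scale software infrastructure, and the operational expertise to run them. For companies such as Google, Amazon, Microsoft, and eBay this comes naturally. Another important driver is that a very large datacenter is 5-7 times more cost-effective than a medium-sized datacenter. Building extremely large-scale, commodity-computer datacenters is regarded as a key enabler of cloud computing.
  • With the arrival of Web 2.0, service provision began to shift from “high-touch, high-margin, high-commitment” to “low-touch, low-margin, low-commitment” self-service. For example, accepting credit card payments used to require a contract with a payment processing service (VeriSign), which was difficult for individuals and small businesses. With PayPal, any individual can handle credit card payments with no contract and no long-term commitment, and the cost of “touch” (customer support and relationship management) is almost nil.
  • An example showing how to use App Engine.
  • Refer to [1].
  • What would you want to do with 100,000 PCs? It’s really possible now! Calculate the Amazon EC2 service price for this:
  • Clusters built from large numbers of cheap PCs and cheap network links, but not limited to that. Data Center Paradigm.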
  • Slides borrowed from “Data-Rich Computing: Where It’s At”, Hadoop Summit 2008, Phillip B. Gibbons, Intel Research Pittsburgh.
  • Inbio! Walmart transactions! Trend in science: toward data. Trend in data: more of it, more measurements. Once it’s about data, the focus starts to fall upon tools for analyzing data [sciences are becoming more data-driven; statistics can directly solve science problems].
  • Some pages from http://www.umiacs.umd.edu/~jimmylin/
  • Will the universe expand faster and faster (look at the sky at night: empty), expand forever but slow down (look at the sky at night: about the same; most likely), or eventually contract into another big bang (look at the sky at night: uh-oh)? Nt = number of time samples in the detector data. Np = number of sky pixels in the maps derived from the detector data. Nb = number of bins of multipole in the power spectra derived from the maps.
  • 3 slides from “Data Intensive Super Computing”, Randal E. Bryant, Carnegie Mellon University
  • 4 slides from Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
  • Outline stays the same, map and reduce change to fit the problem
  • http://developer.yahoo.net/blog/archives/2007/07/yahoo-hadoop.html
  • First class: answer the questions students care about: Can I do well in this course? How? Emphasize hands-on practice and Java; there are no other prerequisites. Learning tasks and requirements: assign the project and form groups; paper reading and reports. Teach students basic ways to look things up: Wikipedia, Google Scholar.
  • Intro Distributed system; Intro Parallel Programming (references: http://code.google.com/edu/parallel/dsd-tutorial.html, http://code.google.com/edu/parallel/mapreduce-tutorial.html)
  • Use AWS if you:
    • Want to use third-party open source software
    • Have existing code
    • Want to transfer the web app to your own machines/servers later on
    • Want to port code to another language
    • Want control (EC2 instances and machine images)
      • Example: to stress/load test an app, just load up 1000 instances
    • Want its features: messaging, payment services, etc.

Presentation Transcript

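  • Introduction to Cloud Computing, http://net.pku.edu.cn/~course/cs402/2009/, Peng Bo (彭波) [email_address], School of Electronics Engineering and Computer Science, Peking University, 6/30/2009
  • Outline
    • What is Cloud Computing?
    • What is large-scale data processing?
    • What are the goals and content of this course?
  • Cloud Computing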
  • What is Cloud Computing?
    • First write down your own opinion about “cloud computing”, whatever comes to mind.
    • Questions: What? Who? Why? How? Pros and cons?
    • The most important question is: What does it have to do with me?
  • Cloud Computing is…
    • No software
    • access everywhere by Internet
    • power -- Large-scale data processing
    • Appeal for startups
      • Cost efficiency
      • Extremely convenient
      • Software as platform
    • Cons
      • Security
      • Data lock-in
    SaaS PaaS Utility Computing
  • Software as a Service (SaaS)
    • a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
  • Platform as a Service (PaaS)
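    • For developing Web applications and services, PaaS provides a complete, Internet-based, integrated environment covering development, testing, deployment, operation, and maintenance. In particular, it has a multi-tenant architecture from the start: users do not have to deal with multi-user concurrency, because the platform handles it, including concurrency management, scalability, failure recovery, and security.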
  • Utility Computing
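    • “Pay-as-you-go”: it is like plugging into a wall socket. You get the same voltage that Microsoft gets, you just use less and pay less. The goal of utility computing is to give computing resources the same service capability: users can tap the computing resources that a Fortune 500 company has, but use less and pay less. This is an important aspect of cloud computing.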
  • Cloud Computing is…
  • Key Characteristics
    • illusion of infinite computing resources available on demand;
    • elimination of an up-front commitment by Cloud users (for example, the cost of launching a startup);
    • ability to pay for use of computing resources on a short-term basis as needed. Billing is in small time slices; the report notes that utility computing failed in practice on this point.
    Provider requirements: very large datacenters, large-scale software infrastructure, operational expertise
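The pay-as-you-go characteristic can be made concrete with a few lines of arithmetic. The following is a minimal sketch, assuming billing is strictly per machine-hour; the rate `HYPOTHETICAL_RATE_PER_MACHINE_HOUR` and the function name are invented for illustration and are not prices or APIs from any provider.

```python
"""Pay-as-you-go cost sketch. The hourly rate is a hypothetical placeholder."""

# Assumed value for illustration only; not a real price from any provider.
HYPOTHETICAL_RATE_PER_MACHINE_HOUR = 0.10


def pay_as_you_go_cost(machines, hours, rate=HYPOTHETICAL_RATE_PER_MACHINE_HOUR):
    """Cost when billing is per machine-hour with no up-front commitment."""
    return machines * hours * rate


if __name__ == "__main__":
    # With per-hour billing, a short burst on many machines costs the same
    # as a long run on one machine, because only machine-hours are billed.
    burst = pay_as_you_go_cost(machines=1000, hours=1)
    steady = pay_as_you_go_cost(machines=1, hours=1000)
    print(f"1000 machines x 1 hour : {burst:.2f}")
    print(f"1 machine x 1000 hours : {steady:.2f}")
```

Under this assumption the two bills are identical, which is why a user with no up-front commitment can treat a large burst of machines as affordable.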
  • Why now?
    • experience gained from operating very large-scale datacenters;
    • new technology trends and business models
      • pay-as-you-go computing
  • Key Players
    • Amazon Web Services
    • Google App Engine
    • Microsoft Windows Azure
  • Key Applications
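    • Mobile interactive applications. Tim O’Reilly believes the future belongs to services that can deliver information to users in real time. Mobile is bound to be key, and running the back end in a datacenter is a natural model, especially for mashup-style services.
    • Parallel batch processing. Using Cloud Computing for large-scale data processing is natural, and MapReduce and Hadoop play an important role here. Moving data into and out of the cloud is a major cost, and Amazon has begun to host large public datasets for free.
    • The rise of analytics. Among database applications, transaction-based applications are still growing, while analytics applications are growing rapidly, driven strongly by data mining, user behavior analysis, and similar applications.
    • Extension of compute-intensive desktop applications. For compute-intensive tasks, both MATLAB and Mathematica now have cloud computing extensions.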
  • Cloud Computing = Silver Bullet?
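    • On March 7, Google Docs suffered an incident in which a large number of users' files were exposed. A US privacy group then asked the government to take action against Google to make it strengthen the security of its cloud computing products.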
    • Problem of Data Lock-in
  • Challenges
  • Some other Voices
    • “It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign. Somebody is saying this is inevitable, and whenever you hear somebody saying that, it’s very likely to be a set of businesses campaigning to make it true.” Richard Stallman, quoted in The Guardian, September 29, 2008
    • “The interesting thing about Cloud Computing is that we’ve redefined Cloud Computing to include everything that we already do. . . . I don’t understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads.” Larry Ellison, quoted in the Wall Street Journal, September 26, 2008
  • What does it matter to ME?!
    • What would you want to do with 1,000 PCs, or even 100,000 PCs?
  • Cloud is coming…
    • Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law
    • “Data Center is a Computer”
    • Parallelism everywhere
    • Massive, Scalable, Reliable
    • Resource Management, Data Management, Programming Model & Tools
  • Large-Scale Data Processing
  • Happening everywhere!
    • Molecular biology (cancer): microarray chips
    • Particle events (LHC): particle colliders
    • Simulations (Millennium): microprocessors
    • Network traffic (spam): fiber optics
    • Data rates shown in the figure: 300M/day, 1B, 1M/sec
  • Maximilien Brice, © CERN
  • How much data?
    • Internet archive has 2 PB of data + 20 TB/month
    • Google processes 20 PB a day (2008)
    • “ all words ever spoken by human beings” ~ 5 EB
    • CERN’s LHC will generate 10-15 PB a year
    • Sanger anticipates 6 PB of data in 2009
    640K ought to be enough for anybody.
  • NERSC User George Smoot wins 2006 Nobel Prize in Physics
    • Smoot and Mather: the 1992 COBE experiment showed anisotropy of the CMB
    • Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years
  • The Current CMB Map
    • Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.
    • Extracting these μKelvin fluctuations from inherently noisy data is a serious computational challenge.
    source J. Borrill, LBNL
  • Evolution of CMB Data Sets: Cost > O(Np^3)
    • COBE (1989): Nt = 2×10^9, Np = 6×10^3, Nb = 3×10^1; limiting data: Time; notes: Satellite, Workstation
    • BOOMERanG (1998): Nt = 3×10^8, Np = 5×10^5, Nb = 3×10^1; limiting data: Pixel; notes: Balloon, 1st HPC/NERSC
    • (4yr) WMAP (2001): Nt = 7×10^10, Np = 4×10^7, Nb = 1×10^3; limiting data: ?; notes: Satellite, Analysis-bound
    • Planck (2007): Nt = 5×10^11, Np = 6×10^8, Nb = 6×10^3; limiting data: Time/Pixel; notes: Satellite, Major HPC/DA effort
    • POLARBEAR (2007): Nt = 8×10^12, Np = 6×10^6, Nb = 1×10^3; limiting data: Time; notes: Ground, NG-multiplexing
    • CMBPol (~2020): Nt = 10^14, Np = 10^9, Nb = 10^4; limiting data: Time/Pixel; notes: Satellite, Early planning/design
    Figure annotation: data compression
  • Example: Wikipedia Anthropology
    • Experiment
      • Download entire revision history of Wikipedia
      • 4.7 M pages, 58 M revisions, 800 GB
      • Analyze editing patterns & trends
    • Computation
      • Hadoop on 20-machine cluster
    • Finding: an increasing fraction of edits are for work indirectly related to articles
    Kittur, Suh, Pendleton (UCLA, PARC), “He Says, She Says: Conflict and Coordination in Wikipedia,” CHI, 2007
  • Example: Scene Completion
    • Image Database Grouped by Semantic Content
      • 30 different Flickr.com groups
      • 2.3 M images total (396 GB).
    • Select Candidate Images Most Suitable for Filling Hole
      • Classify images with gist scene detector [Torralba]
      • Color similarity
      • Local context matching
    • Computation
      • Index images offline
      • 50 min. scene matching, 20 min. local matching, 4 min. compositing
      • Reduces to 5 minutes total by using 5 machines
    • Extension
      • Flickr.com has over 500 million images …
    Hays, Efros (CMU), “Scene Completion Using Millions of Photographs” SIGGRAPH, 2007
  • Example: Web Page Analysis
    • Experiment
      • Use a web crawler to gather 151M HTML pages weekly, 11 times
        • Generated 1.2 TB log information
      • Analyze page statistics and change frequencies
    • Systems Challenge
      • “Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl.”
    Fetterly, Manasse, Najork, Wiener (Microsoft, HP), “A Large-Scale Study of the Evolution of Web Pages,” Software-Practice & Experience, 2004
  • [Figure: subject genome, sequencer, reads]
  • DNA Sequencing
    • Genome of an organism encodes genetic information in long sequence of 4 DNA nucleotides: ATCG
      • Bacteria: ~5 million bp
      • Humans: ~3 billion bp
    • Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)
      • Shorter reads, but much higher throughput
      • Per-base error rate estimated at 1-2% (Simpson, et al, 2009)
    • Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads
      • ~144 GB of compressed sequence data
    ATCTGATAAGTCCCAGGACTTCAGT GCAAGGCAAACCCGAGCCCAGTTT TCCAGTTCTAGAGTTTCACATGATC GGAGTTAGTAAAAGTCCACATTGAG
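The read counts on this slide translate into raw data volume by simple multiplication. The sketch below is only that arithmetic: the read counts, the 36 bp read length, and the ~3 billion bp human genome size come from the slide, while the function names are invented for illustration. How the slide's “~144 GB” figure accounts for compression is not stated, so the code reports bases rather than bytes.

```python
"""Back-of-the-envelope sequencing volume, using figures quoted on the slide."""

READ_LENGTH_BP = 36                 # read length quoted on the slide
HUMAN_GENOME_BP = 3_000_000_000     # "~3 billion bp" from the slide


def total_bases(num_reads, read_length_bp=READ_LENGTH_BP):
    """Total number of sequenced bases across all reads."""
    return num_reads * read_length_bp


def coverage(num_reads, genome_size_bp=HUMAN_GENOME_BP,
             read_length_bp=READ_LENGTH_BP):
    """Average number of times each genome position is sequenced."""
    return total_bases(num_reads, read_length_bp) / genome_size_bp


if __name__ == "__main__":
    for study, reads in [("3.3 billion reads", 3_300_000_000),
                         ("4.0 billion reads", 4_000_000_000)]:
        print(study, total_bases(reads), "bases,",
              round(coverage(reads), 1), "x coverage")
```

For the larger study the product is 144 billion bases, the same number that appears in the slide's size estimate.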
  • [Figure: alignment of subject reads against a reference sequence]
  • [Figure: subject reads and the reference sequence]
  • Example: Bioinformatics
    • Evaluate running time on a local 24-core cluster
      • Running time increases linearly with the number of reads
    Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
  • Example: Data Mining
    • A del.icio.us crawl yields a bipartite graph covering 802,739 web pages and 1,021,107 tags.
    Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang: PFP: Parallel FP-Growth for Query Recommendation. RecSys 2008: 107-114
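  • Large-Scale Data Processing + Cloud Computing: An Example
  • The data processing task
    • Word frequency: count how many times each word occurs in a document collection
    • Try it on these collections:
      • In early 2006 we collected 870 million distinct web pages in China, about 2 TB in total.
      • Commercial search engines such as Google and Yahoo collect 100+ billion pages
    How do we process massive data?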
  • Divide and Conquer
    • Partition: the “Work” is split into w1, w2, w3
    • Each “worker” processes one part and produces r1, r2, r3
    • Combine: the partial results are merged into the “Result”
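The partition, worker, and combine stages of divide and conquer can be sketched in a few lines. This is a minimal illustration on one machine, assuming the work is a list of numbers to sum; the function names are invented for the example, and a thread pool stands in for the separate workers of a real cluster.

```python
"""Divide and conquer sketch: partition the work, run workers, combine results."""
from concurrent.futures import ThreadPoolExecutor


def partition(work, num_parts):
    """Split a list into num_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(work), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(work[start:end])
        start = end
    return chunks


def worker(chunk):
    """Process one part of the work (here: sum the numbers in the chunk)."""
    return sum(chunk)


def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)


def divide_and_conquer(work, num_workers=3):
    chunks = partition(work, num_workers)
    # Threads stand in for independent workers; a cluster would use machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)


if __name__ == "__main__":
    print(divide_and_conquer(list(range(1, 101))))
```

The pattern only works this simply when the parts are independent and the combine step is cheap, which is the situation MapReduce is designed around.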
  • What Is MapReduce?
    • Parallel/Distributed Computing Programming Model
    Figure labels: input, split, shuffle, output
  • Typical problem solved by MapReduce
    • Read input: data as records in key/value pair format
    • Map: extract something from each record
      • map (in_key, in_value) -> list(out_key, intermediate_value)
        • Processes an input key/value pair
        • Emits intermediate key/value pairs
    • Shuffle: exchange and regroup the data
      • Bring the intermediate results with the same key together on the same node
    • Reduce: aggregate, summarize, filter, etc.
      • reduce (out_key, list(intermediate_value)) -> list(out_value)
        • Merges all values for a given key and computes on them
        • Emits the combined result (usually just one)
    • Write output
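The map, shuffle, and reduce stages listed above can be imitated in a single process to show how data flows between them. This is a minimal sketch, not Hadoop's or Google's implementation: `run_mapreduce`, the sample records, and the “bytes per site” task are all invented for illustration.

```python
"""Single-process sketch of the map -> shuffle -> reduce data flow."""
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group by key, then apply reduce_fn."""
    # Map: each (in_key, in_value) record yields a list of (out_key, value).
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))

    # Shuffle: bring all intermediate values with the same key together.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce: one call per key, returning a list of output values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_site_bytes(url, contents):
    """Hypothetical map: emit (host, size of page) for each page."""
    host = url.split("/")[2]
    return [(host, len(contents))]


def reduce_sum(key, values):
    """Hypothetical reduce: total of all values for one key."""
    return [sum(values)]


if __name__ == "__main__":
    pages = [("http://a.example/p1", "hello"),
             ("http://a.example/p2", "abc"),
             ("http://b.example/", "xyz")]
    print(run_mapreduce(pages, map_site_bytes, reduce_sum))
```

Only `map_site_bytes` and `reduce_sum` are specific to the problem; the driver stays the same for any task that fits the model.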
  • Word Frequencies in Web pages
    • Input: one document per record
    • The user implements a map function whose input is
      • key = document URL
      • value = document contents
    • map emits (potentially many) key/value pairs.
      • For each word occurrence in the document, emit a record <word, “1”>
  • Example continued:
    • The MapReduce runtime system (library) gathers all records with the same key together (shuffle/sort)
    • The user implements a reduce function that computes over the values for one key
      • Sum them
    • Reduce emits <key, sum>
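The word-frequency job described on these two slides can be sketched end to end in plain Python. This is an illustrative single-machine version, assuming whitespace tokenization; the function names and sample documents are invented, and the shuffle is imitated by sorting on the key and grouping adjacent records.

```python
"""Word count sketch: map emits <word, "1">, sort groups keys, reduce sums."""
from itertools import groupby
from operator import itemgetter


def map_word_count(url, contents):
    """key = document URL, value = document contents; emit <word, "1">."""
    for word in contents.split():
        yield (word, "1")


def reduce_word_count(word, counts):
    """Sum the values collected for one word and emit <word, sum>."""
    return (word, sum(int(c) for c in counts))


def word_count(documents):
    # Map phase: one record per document.
    pairs = [kv for url, text in documents for kv in map_word_count(url, text)]
    # Shuffle/sort phase: sorting puts records with equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct word.
    return [reduce_word_count(word, [value for _, value in group])
            for word, group in groupby(pairs, key=itemgetter(0))]


if __name__ == "__main__":
    docs = [("http://example.org/a", "the cloud is the cloud"),
            ("http://example.org/b", "the data")]
    for word, total in word_count(docs):
        print(word, total)
```

The map and reduce functions never see more than one document or one word at a time, which is what lets a runtime spread them across many machines.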
  • MapReduce Runtime System
  • History of Hadoop
    • 2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
    • December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
    • January 2006 - Doug Cutting joins Yahoo!
    • February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
    • March 2006 - Formation of the Yahoo! Hadoop team
    • April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
    • May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
    • May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
    • October 2006 - Research cluster reaches 600 nodes
    • December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
    • January 2007 - Research cluster reaches 900 nodes
    • April 2007 - Research clusters - 2 clusters of 1000 nodes
    • Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
  • From Theory to Practice (you and the Hadoop cluster)
    • 1. Scp data to cluster
    • 2. Move data into HDFS
    • 3. Develop code locally
    • 4. Submit MapReduce job
    • 4a. Go back to Step 3
    • 5. Move data out of HDFS
    • 6. Scp data from cluster
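  • Course Goals and Content
  • Course goals
    • Master the MapReduce programming model and the use of its runtime environment.
    • Master the basic methods for parallelizing algorithms under the MapReduce model.
    • Understand the implementation techniques behind MapReduce's distributed runtime environment.
    • Understand the state of the art and the key problems in large-scale data processing and algorithm parallelization in cloud computing.
    • Learn about and cultivate the habit of thinking about problems in parallel.
  • Course content (LEC#: TOPICS; ABSTRACT)
    • 1: Course introduction - cloud computing; the MapReduce environment. Introduces cloud computing technology against the background of large-scale data processing. Teaching and practice are built on MapReduce as the platform, which is the core of the course.
    • 2: MapReduce principles; the Inverted Index problem. The basic principles of MapReduce, starting from functional languages. Analysis of the Inverted Index problem and its MapReduce implementation.
    • 3: Foundations of parallel and distributed systems; the PageRank problem. Introduction to the design of large-scale parallel distributed systems. Analysis of the PageRank problem and its MapReduce implementation.
    • 4: MapReduce system design and implementation; the Clustering problem. Analysis of the system design of MapReduce and its considerations. Analysis of the Clustering problem and its MapReduce implementation.
    • 5: Higher-level MapReduce applications; the frequent itemset mining problem. Introduction to applications and developments built on top of MapReduce. Analysis of the frequent itemset mining problem and its MapReduce implementation.
    • 6: Project discussion; discussion of the course projects.
    • 7: Invited talks; talks by invited researchers and engineers from academia or industry.
    • 8: Project reports; students present their course project reports.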
  • Grading Policy
    • 30% Assignments
    • 20% Readings
    • 50% Course project
    • Homework
      • Hw1 - Read: Intro Distributed system; Intro MapReduce Programming
      • Hw2 - Read MapReduce [1]
      • Hw3 - Read GFS [2]
      • Hw4 - Read Pig Latin [3]
    • Labs
      • Lab 1 - Introduction to Hadoop, Eclipse
      • Lab 2 - A Simple Inverted Index
      • Lab 3 - PageRank over Wikipedia Corpus
      • Lab 4 - Clustering the Netflix Movie Data
  • Course Requirements
    • Proficiency in one programming language
      • Lots of Java programming practice
  • Teachers and Resources
    • Course website
      • http://net.pku.edu.cn/~course/cs402/2009/
    • Discussion group
      • http://groups.google.com/group/cs402pku
    • Hadoop home page
      • http://hadoop.apache.org/core/
    • Resources
      • http://net.pku.edu.cn/~course/cs402/2008/resource.html
    • Instructor: Yan Hongfei (闫宏飞)
    • Teaching assistant: Chen Rishan (陈日闪)
  • Homework
    • Register
      • http://net.pku.edu.cn/~course/cs402/2009/
    • Form groups
      • 3-4 people, in preparation for the course project
      • Mixing different majors is encouraged
    • Lab1
      • Lab 1 - Introduction to Hadoop, Eclipse
    • HW Reading1
      • Intro Distributed system; Intro Parallel Programming.
        • http://code.google.com/edu/parallel/dsd-tutorial.html
        • http://code.google.com/edu/parallel/mapreduce-tutorial.html
  • Summary
    • Cloud Computing brings
      • The possibility of using unlimited resources on demand, anytime and anywhere
      • The possibility of building and deploying applications that automatically scale to tens of thousands of computers
      • The possibility of building and running programs that deal with prodigious volumes of data
    • How to make it real?
      • Distributed File System
      • Distributed Computing Framework
      • …………………………………
  • Q&A
  • References
    • [1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
    • [2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA: ACM Press, 2003.
    • [3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada: ACM, 2008.
  • Google App Engine
    • App Engine handles HTTP(S) requests, nothing else
      • Think RPC: request in, processing, response out
      • Works well for the web and AJAX; also for other services
    • App configuration is dead simple
      • No performance tuning needed
    • Everything is built to scale
      • “ infinite” number of apps, requests/sec, storage capacity
      • APIs are simple, stupid
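The “request in, processing, response out” model can be illustrated with a plain WSGI callable from the Python standard library conventions. This is a generic sketch under that assumption, not App Engine's own API: `app` and the helper `call_app` are hypothetical names, and the helper just invokes the handler directly instead of running a server.

```python
"""Request in, processing, response out: a minimal stateless WSGI handler."""
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Handle one HTTP request and return the response body."""
    path = environ.get("PATH_INFO", "/")
    body = f"Hello from {path}\n".encode("utf-8")
    headers = [("Content-Type", "text/plain; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def call_app(path):
    """Hypothetical helper: call the handler once without starting a server."""
    environ = {}
    setup_testing_defaults(environ)   # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app("/hello"))
```

Because the handler keeps no state between calls, any number of copies can serve requests in parallel, which is the property that lets a platform scale it without the developer tuning anything.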
  • App Engine Architecture
    • req/resp
    • Python VM process: stdlib, app
    • R/O FS (read-only file system)
    • Stateful APIs: memcache, datastore
    • Stateless APIs: mail, images, urlfetch
  • Microsoft Windows Azure
  • Amazon Web Services
    • Amazon’s infrastructure (auto scaling, load balancing)
    • Elastic Compute Cloud (EC2) – scalable virtual private server instances
    • Simple Storage Service (S3)
    • Simple Queue Service (SQS) – messaging
    • SimpleDB - database
    • Flexible Payments Service, Mechanical Turk, CloudFront, etc.
  • Amazon Web Services
    • Very flexible, lower-level offering (closer to hardware) = more possibilities, higher performing
    • Runs platform you provide (machine images)
    • Supports all major web languages
    • Industry-standard services (move off AWS easily)
    • Require much more work, longer time-to-market
      • Deployment scripts, configuring images, etc.
    • Various libraries and GUI plug-ins make AWS easier to use
  • Price of Amazon EC2