
Cloud computing




  1. Cloud Computing
  2. Outline: Large-scale distributed systems; Introduction to cloud computing; Cloud computing paradigms and models; Introduction to MapReduce; Alternative architectures; Writing applications using Hadoop
  3. Distributed systems: A set of discrete machines that cooperate to perform computation. The system gives the notion of a single "machine" and keeps the distribution transparent. Examples: compute clusters; distributed storage systems such as Dropbox and Google Drive; the Web.
  4. Characteristics. Ordering: time is used to ensure ordering; in most cases, we only need to know that event a happened before event b, known as the happens-before relation. Distributed mutual exclusion: concurrent access to shared resources needs to be synchronized. Approaches include a central lock server (all lock requests are handled by a central server), token passing (nodes are arranged in a ring and a token is passed around), and totally-ordered multicast (clients multicast requests to each other).
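The happens-before relation mentioned above is commonly tracked with Lamport logical clocks. A minimal sketch (the class and method names are illustrative, not from the slides):

```java
// Minimal Lamport logical clock: each process increments its counter on
// local events and merges counters on message receipt, so that if event a
// happened before event b, then clock(a) < clock(b).
class LamportClock {
    private long time = 0;

    // A local event (or a message send) advances the clock.
    public long tick() {
        return ++time;
    }

    // On receiving a message stamped with the sender's clock,
    // jump past it to preserve the happens-before ordering.
    public long receive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }

    public long read() {
        return time;
    }
}
```

For example, if a sender ticks to 1 and then sends, the receiver's clock becomes max(0, 1) + 1 = 2, so the receive event is ordered strictly after the send.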
  5. Characteristics (2). Distributed transactions: span multiple transaction-processing servers, so actions need to be coordinated across multiple parties. Replication: many distributed systems involve replication. Data replication stores multiple copies of an object at different servers; computation replication provides multiple servers capable of performing an operation. Advantages: (1) load balancing: work is spread out across replicas; (2) lower latency: better performance if a replica is close to the client; (3) fault tolerance: failure of some replicas can be tolerated.
  6. 6. CAP CAP Consistency: All nodes see the same state Availability: All requests get a response Partitioning: System continues to operate even in theface of node failure Brewer‟s conjecture states that in a distributedsystem only 2 out of 3 possible In the current setup, partitioning is a given:Hardware/software fails all the time Therefore, systems need to choose betweenconsistency and availability
  7. Advantages. Scalability: the scale of the Internet (think how many queries Google's servers handle daily); scaling is only a matter of adding more machines, which is cheaper than supercomputers, and more machines mean more parallelism, hence better performance. Sharing: the same resource is shared between multiple users, just like the Internet is shared between millions of users. Communication: communication between (potentially geographically isolated) machines and users (via email, Facebook, etc.). Reliability: the service can remain active even if multiple machines go down.
  8. Challenges. Concurrency: concurrent execution requires some form of coordination. Fault tolerance: any component can fail at any instant due to a software or a hardware bug. Security: one compromised machine can compromise the entire system. Coordination: there is no global time, so coordination is non-trivial. Troubleshooting: hard to troubleshoot because it is hard to reason about the system.
  9. Introduction to cloud computing. An emerging IT development, deployment, and delivery model that enables real-time delivery of a broad range of IT products, services, and solutions over the Internet; a realization of utility computing in which computation, storage, and services are offered as a metered service. Related terms: grid computing is a form of distributed computing in which machines act in concert to perform very large tasks; utility computing is a metered service similar to a traditional public utility; autonomic computing means systems capable of self-management. Cloud computing deployments as of 2009 depend on grids, have autonomic characteristics, and bill like utilities.
  10. Characteristics. On-demand self-service: allows users to obtain, configure, and deploy cloud services themselves using cloud service catalogues, without requiring the assistance of IT. Broad network access: capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms. Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.
  11. Characteristics (2). Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time. Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts).
  12. Cloud service models. SaaS (Software as a Service): network-hosted application. PaaS (Platform as a Service): network-hosted software development platform. IaaS (Infrastructure as a Service): provider hosts customer VMs or provides network storage. DaaS (Data as a Service): customer queries against the provider's database. IPMaaS (Identity and Policy Management as a Service): provider manages identity and/or access-control policy for the customer. NaaS (Network as a Service): provider offers virtualized networks (e.g., VPNs).
  13. Deployment models. Private cloud: infrastructure is operated solely for one organization. Public cloud: infrastructure is made available to the general public on a pay-as-you-go model, e.g., Amazon Web Services, Google App Engine, and Microsoft Azure. Community cloud: infrastructure shared between several organizations from a specific community with common concerns (security, compliance, jurisdiction, etc.), whether managed internally or by a third party, and hosted internally or externally. Hybrid cloud: infrastructure is a combination of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability between environments.
  14. 14. Private Cloud
  15. 15. Private Outsourced Cloud
  16. 16. Public Cloud
  17. 17. Hybrid Cloud
  18. Advantages: cloud computing benefits both service providers and end users. Service providers: simplified software installation and maintenance; centralized control over versioning; no need to build, provision, and maintain a datacenter; on-the-fly scaling. End users: "anytime, anywhere" access; easy data sharing and collaboration; data safeguarded in the infrastructure.
  19. Obstacles. Bugs in large-scale distributed systems: hard to debug large-scale applications in full deployment. Scaling quickly: automatically scaling while conserving resources and money is an open-ended problem. Reputation fate sharing: bad behavior by one tenant can reflect badly on the rest. Software licensing: gap between the pay-as-you-go model and software licensing.
  20. Obstacles (2). Service availability: possibility of cloud outage. Data lock-in: dependence on cloud-specific APIs. Security: requires strong encrypted storage, VLANs, and network middle-boxes (firewalls, etc.). Data transfer bottlenecks: moving large amounts of data in and out is expensive. Performance unpredictability: resource sharing between applications. Scalable storage: no standard model to arbitrarily scale storage up and down on demand while ensuring data durability and high availability.
  21. Introduction to MapReduce. A simple programming model that applies to many large-scale computing problems. Messy details are hidden in the MR runtime library: automatic parallelization, load balancing, network and disk transfer optimization, handling of machine failures, and robustness. Improvements to the core library benefit all users of the library.
  22. Google MapReduce – Idea. The core idea behind MapReduce is mapping your data set into a collection of <key, value> pairs, and then reducing over all pairs with the same key. Map applies a function to all elements of a list: with square x = x * x, map square [1, 2, 3, 4, 5] yields [1, 4, 9, 16, 25]. Reduce combines all elements of a list: reduce (+) [1, 2, 3, 4, 5] yields 15.
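The map square and reduce (+) examples above translate directly to plain Java using the streams API (an illustrative sketch, not Hadoop code; the class name is invented):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapReduceIdea {
    // Map: apply the square function to every element of the list.
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce: combine all elements of the list with (+).
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }
}
```

MapReduce applies the same two-step shape at datacenter scale, with the key in each <key, value> pair deciding which reducer combines which values.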
  23. Google MapReduce – Overview
  24. MapReduce architecture. Master: in charge of all metadata, work scheduling and distribution, and job orchestration. Workers: contain slots to execute map or reduce functions. Mappers: a map worker reads the contents of the input split that it has been assigned; it parses the file, converts it to key/value pairs, and invokes the user-defined map function for each pair. The intermediate key/value pairs produced by the map logic are collected (buffered) in memory. Once the buffered key/value pairs exceed a threshold, they are written to local disk and partitioned (using a partitioning function) into R partitions; the location of each partition is passed to the master.
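The partitioning function mentioned above is typically a hash of the intermediate key modulo R, so every pair with the same key lands in the same reducer's partition (Hadoop's default partitioner works this way; the class name here is illustrative):

```java
// Assigns an intermediate key to one of R reduce partitions.
class Partitioner {
    static int partition(String key, int numReducers) {
        // Mask off the sign bit so the result is non-negative
        // even when hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the function is deterministic, a key hashed on any map worker is always routed to the same reduce partition, which is what lets independent mappers agree on the key space split without coordinating.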
  25. MapReduce architecture (2). Workers: contain slots to execute map or reduce functions. Reducers: a reduce worker gets the locations of its input partitions from the master and uses HTTP requests to retrieve them. Once it has read all its input, it sorts it by key to group together all occurrences of the same key. It then invokes the user-defined reduce for each key, passing it the key and its associated values. The key/value pairs generated by the reduce logic are written to a final output file, which is subsequently written to the distributed filesystem.
  26. Google File System (GFS). The in-house distributed file system at Google; stores all input and output files. Files are divided into 64 MB blocks, each stored on at least 3 different machines. Machines running GFS also run MapReduce.
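A quick back-of-the-envelope sketch of what 64 MB blocks with 3-way replication imply for a file's storage footprint (illustrative code; the constants come from the slide, the class name is invented):

```java
class GfsEstimate {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB blocks
    static final int REPLICAS = 3;                     // at least 3 copies

    // Number of 64 MB blocks needed to hold a file of fileBytes.
    static long numBlocks(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Total block copies stored across the cluster for that file.
    static long totalBlockCopies(long fileBytes) {
        return numBlocks(fileBytes) * REPLICAS;
    }
}
```

For example, a 200 MB file occupies ceil(200/64) = 4 blocks, so the cluster stores 12 block copies in total.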
  27. MapReduce job phases. A MapReduce job can be divided into 4 phases. Input split: the input dataset is sliced into M splits, one per map task. Map logic: the user-supplied map function is invoked; in tandem, a sort phase ensures that map output is locally sorted by key, and the key space is partitioned amongst the reducers. Shuffle: map output is relayed to all reduce tasks. Reduce logic: the user-provided reduce function is invoked; before the reduce function is applied, the input keys are merged to obtain globally sorted key/value pairs.
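The four phases can be simulated in memory for word count. This is a sketch of the dataflow only, not the Hadoop runtime (the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // Runs split -> map -> shuffle (group by key) -> reduce for word count.
    static Map<String, Integer> wordCount(List<String> inputSplits) {
        // Shuffle target: intermediate values grouped by key.
        // A TreeMap keeps keys sorted, mirroring the sort/merge phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();

        // Map phase: emit (word, 1) for each token in each input split.
        for (String split : inputSplits) {
            for (String word : split.split("\\s+")) {
                if (word.isEmpty()) continue;
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce phase: sum the values associated with each key.
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            output.put(e.getKey(), sum);
        }
        return output;
    }
}
```

In the real system each input split is processed by a different map worker and each key group by a different reduce worker; this sketch collapses all of that onto one machine to show the shape of the computation.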
  28. Google MapReduce – Example
  29. Wordcount map in Java:

      public void map(Object key, Text value, Context context) {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
  30. Wordcount reduce in Java:

      public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
  31. 31. Hadoop Open-source implementation of MapReduce,developed by Doug Cutting originally at Yahoo! in2004 Now a top-level Apache open-source project Implemented in Java (Google‟s in-houseimplementation is in C++) Jobs can be written in C++, Java, Python, etc. Comes with an associated distributed filesystem,HDFS (clone of GFS)
  32. Hadoop components. Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and the MapReduce software framework. There are many other projects based around core Hadoop, often referred to as the "Hadoop ecosystem": Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
  33. Hadoop users. Adobe: several areas, from social services to unstructured data storage and processing. eBay: a 532-node cluster storing 5.3 PB of data. Facebook: used for reporting/analytics; one cluster with 1100 nodes (12 PB) and another with 300 nodes (3 PB). LinkedIn: 3 clusters with collectively 4000 nodes. Twitter: to store and process tweets and log files. Yahoo!: multiple clusters with collectively 40000 nodes; the largest cluster has 4500 nodes!
  34. Running a Hadoop application. The first order of the day is to format the Hadoop DFS: change to the Hadoop directory and execute bin/hadoop namenode -format. To run Hadoop and HDFS, execute the start script under bin/; to terminate them, the stop script under bin/.
  35. Running a Hadoop application. Generating a dataset: create a temporary directory to hold the data (mkdir /tmp/gutenberg), change to it (cd /tmp/gutenberg), and download the text files with wget.
  36. Running a Hadoop application. Copying the dataset to HDFS: change to the Hadoop directory and execute bin/hadoop dfs -copyFromLocal /tmp/gutenberg /ccw/gutenberg. Running wordcount: bin/hadoop jar hadoop-examples-1.0.4.jar wordcount /ccw/gutenberg /ccw/gutenberg-output. Retrieving results from HDFS, copying to the local FS: bin/hadoop dfs -getmerge /ccw/gutenberg-output /tmp/gutenberg-output
  37. Running a Hadoop application. Accessing the web interface: JobTracker at http://localhost:50030, TaskTracker at http://localhost:50060. Reference: Running Hadoop on Ubuntu Linux (Single-Node Cluster).
  38. Thanks