Cloud Computing
Outline
 Large-scale Distributed Systems
 Introduction to Cloud Computing
 Cloud Computing paradigms and models
 Introduction to MapReduce
 Alternative architectures
 Writing Applications using Hadoop
Distributed Systems
 A set of discrete machines that cooperate to perform a
computation
 Present the illusion of a single "machine"
 Keep the distribution transparent
 Examples:
 Compute clusters
 Distributed storage systems, such as Dropbox, Google Drive, etc.
 The Web
Characteristics
 Ordering
 Time is used to establish the order of events
 In most cases, it suffices to know that event a happened before
event b, the happens-before relation (see the logical-clock sketch
after this list)
 Distributed Mutual Exclusion
 Concurrent access to shared resources needs to be
synchronized
 Central lock server: All lock requests are handled by a
central server
 Token passing: Arrange nodes into a ring and a token is
passed around
 Totally-ordered multicast: Clients multicast requests to
each other
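A logical (Lamport) clock is the standard way to realize the happens-before relation without synchronized physical clocks. The sketch below is illustrative and not from the slides; it assumes each node attaches its clock value to every message it sends.

// Minimal Lamport clock (Java): advance on every local event or send;
// on receipt, jump past the sender's timestamp so that a send always
// happens-before the corresponding receive.
public class LamportClock {
    private long time = 0;

    // Local event or message send: advance the clock and return the stamp.
    public synchronized long tick() {
        return ++time;
    }

    // Message receipt: merge with the sender's timestamp.
    public synchronized long onReceive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }
}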
Characteristics (2)
 Distributed transactions
 Distributed transactions span multiple transaction processing servers
 Actions need to be coordinated across multiple parties
 Replication
A number of distributed systems involve replication
 Data replication: Multiple copies of some object stored at different servers
 Computation replication: Multiple servers capable of providing an
operation
 Advantages
1. Load balancing: Work is spread out across replicas
2. Lower latency: Better performance if replica close to the client
3. Fault tolerance: Failure of some replicas can be tolerated
CAP
 CAP
 Consistency: All nodes see the same state
 Availability: All requests get a response
 Partition tolerance: System continues to operate even in the
face of network partitions (lost or delayed messages between nodes)
 Brewer's conjecture (the CAP theorem) states that a distributed
system can provide at most 2 of these 3 guarantees
 In practice, partitions are a given: hardware and
software fail all the time
 Therefore, systems need to choose between
consistency and availability
Advantages
 Scalability:
 The scale of the Internet (think how many queries Google servers handle daily)
 Only a matter of adding more machines
 Cheaper than supercomputers
 More machines mean more parallelism, and hence better performance
 Sharing:
 The same resource is shared between multiple users
 Just like the Internet is shared between millions of users
 Communication:
 Communication between (potentially geographically isolated) machines and
users (via email, Facebook, etc.)
 Reliability:
 The service can remain active even if multiple machines go down
Challenges
 Concurrency:
 Concurrent execution requires some form of coordination
 Fault-tolerance:
 Any component can fail at any instant due to a software
bug or hardware failure
 Security:
 One machine can compromise the entire system
 Coordination:
 No global time so non-trivial to coordinate
 Troubleshooting:
 Hard to troubleshoot because it is hard to reason about
the global state of the system
Introduction to Cloud
Computing
 An emerging IT development, deployment, and delivery model that enables
real-time delivery of a broad range of IT products, services, and solutions over
the Internet
 A realization of utility computing in which computation, storage, and services
are offered as a metered service
 Grid Computing: a form of distributed computing in which many
machines act in concert to perform very large tasks
 Utility Computing: metered service similar to a
traditional public utility
 Autonomic Computing: capable of self-management
 Cloud Computing: deployments as of 2009 depend on
grids, have autonomic characteristics and bill like
utilities
Characteristics
 On-demand self-service: allows users to obtain,
configure and deploy cloud services themselves using
cloud service catalogues, without requiring the
assistance of IT.
 Broad network access: capabilities are available over
the network and accessed through standard
mechanisms that promote use by heterogeneous thin
or thick client platforms
 Resource pooling: The provider's computing resources
are pooled to serve multiple consumers using a multi-
tenant model, with different physical and virtual
resources dynamically assigned and reassigned
according to consumer demand.
Characteristics (2)
 Rapid elasticity: Capabilities can be rapidly and
elastically provisioned, in some cases automatically,
to quickly scale out and rapidly released to quickly
scale in. To the consumer, the capabilities available
for provisioning often appear to be unlimited and
can be purchased in any quantity at any time.
 Measured service: Cloud systems automatically
control and optimize resource use by leveraging a
metering capability at some level of abstraction
appropriate to the type of service (e.g., storage,
processing, bandwidth, and active user accounts).
Cloud Service Models
 SaaS – Software as a Service: Network-hosted
application
 PaaS – Platform as a Service: Network-hosted software
development platform
 IaaS – Infrastructure as a Service: Provider hosts
customer VMs or provides network storage
 DaaS – Data as a Service: Customer queries against
provider’s database
 IPMaaS – Identity and Policy Management as a
Service: Provider manages identity and/or access
control policy for customer
 NaaS – Network as a Service: Provider offers virtualized
networks (e.g. VPNs)
Deployment Models
 Private Cloud: infrastructure is operated solely for an
organization.
 Public Cloud: infrastructure is made available to the general
public on a pay-as-you-go basis, e.g. Amazon Web Services,
Google AppEngine, and Microsoft Azure
 Community Cloud: infrastructure shared by several
organizations from a specific community with common
concerns (security, compliance, jurisdiction, etc.), whether
managed internally or by a third party and hosted internally or
externally.
 Hybrid Cloud: infrastructure is a combination of two or more
clouds (private, community, or public) that remain unique
entities but are bound together by standardized or proprietary
technology that enables data and application portability
between environments.
[Diagrams: Private Cloud, Private Outsourced Cloud, Public Cloud, and Hybrid Cloud deployment models]
Advantages
Advantages to both service providers and end users
 Service providers:
 Simplified software installation and maintenance
 Centralized control over versioning
 No need to build, provision, and maintain a datacenter
 On the fly scaling
 End users:
 “Anytime, anywhere” access
 Share data and collaborate easily
 Safeguard data stored in the infrastructure
Obstacles
 Bugs in large-scale distributed systems: Hard to
debug large-scale applications in full deployment
 Scaling quickly: Automatically scaling while
conserving resources and money is an open-ended
problem
 Reputation fate sharing: Bad behavior by one
tenant can reflect badly on the rest
 Software licensing: Gap between pay-as-you-go
model and software licensing
Obstacles (2)
 Service availability: Possibility of cloud outage
 Data lock-in: Dependence on cloud specific APIs
 Security: Requires strong encrypted storage, VLANs,
and network middle-boxes (firewalls, etc.)
 Data transfer bottlenecks: Moving large amounts of
data in and out is expensive
 Performance unpredictability: Resource sharing
between applications
 Scalable storage: No standard model to arbitrarily
scale storage up and down on-demand while
ensuring data durability and high availability
Introduction to MapReduce
 A simple programming model that applies to many
large-scale computing problems
 Hide messy details in the MR runtime library:
 Automatic parallelization
 Load balancing
 Network and disk transfer optimization
 Handling of machine failures
 Robustness
 Improvements to the core library benefit all its users
Google MapReduce – Idea
 The core idea behind MapReduce is mapping your
data set into a collection of <key, value> pairs, and
then reducing over all pairs with the same key.
 Map
 Apply a function to all elements of a list:
 square x = x * x
 map square [1, 2, 3, 4, 5]
 [1, 4, 9, 16, 25]
 Reduce
 Combine all elements of a list:
 reduce (+) [1, 2, 3, 4, 5]
 15
Google MapReduce – Overview
[Diagram: overall MapReduce execution flow]
MapReduce architecture
 Master: In charge of all metadata, work scheduling
and distribution, and job orchestration
 Workers: Contain slots to execute map or reduce
functions
 Mappers:
 A map worker reads the contents of the input split that it has
been assigned
 It parses the file and converts it to key/value pairs and invokes
the user-defined map function for each pair
 The intermediate key/value pairs after the application of the
map logic are collected (buffered) in memory
 Once the buffered key/value pairs exceed a threshold, they are
written to local disk, partitioned (using a partitioning
function; a sketch follows this list) into R partitions. The location
of each partition is passed to the master
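The partitioning function is typically a hash of the key modulo R, so that all intermediate pairs with the same key land in the same partition (and hence at the same reducer). A minimal sketch in the spirit of Hadoop's default HashPartitioner:

// Assign an intermediate key to one of R partitions, one per reducer.
// Masking with Integer.MAX_VALUE keeps the result non-negative even
// when hashCode() is negative.
static int partitionFor(Object key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}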
MapReduce architecture (2)
 Workers: Contain slots to execute map or reduce
functions
 Reducers:
 A reduce worker gets locations of its input partitions from the
master and uses HTTP requests to retrieve them
 Once it has read all its input, it sorts it by key to group
together all occurrences of the same key
 It then invokes the user-defined reduce for each key and
passes it the key and its associated values
 The key/value pairs generated after the application of the
reduce logic are then written to a final output file, which is
subsequently written to the distributed filesystem
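The sort-and-group step can be pictured with a toy in-memory model; this is purely illustrative, since real reducers merge-sort sorted spill files from disk rather than building a map:

import java.util.*;

public class ReduceSideSketch {
    public static void main(String[] args) {
        // Pairs fetched from the map side (hypothetical wordcount data):
        List<Map.Entry<String, Integer>> fetched = List.of(
            Map.entry("to", 1), Map.entry("be", 1), Map.entry("to", 1));

        // Sort and group all occurrences of the same key together.
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : fetched)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                  .add(e.getValue());

        // Invoke the "reduce" logic (here: summation) once per key.
        groups.forEach((key, values) -> System.out.println(
            key + " -> " + values.stream().mapToInt(Integer::intValue).sum()));
    }
}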
Google File System - GFS
 In-house distributed file system at Google
 Stores all input and output files
 Stores files…
 divided into 64 MB blocks
 on at least 3 different machines
 Machines running GFS also run MapReduce
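 Example (back-of-the-envelope arithmetic): a 200 MB file
   → ceil(200 MB / 64 MB) = 4 blocks (three full 64 MB blocks + one 8 MB block)
   → with 3-way replication: 4 × 3 = 12 block replicas across the cluster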
MapReduce job phases
A MapReduce job can be divided into 4 phases:
 Input split: The input dataset is sliced into M splits, one
per map task
 Map logic: The user-supplied map function is invoked
 In tandem, a sort phase ensures that each map task's
output is locally sorted by key
 In addition, the key space is also partitioned amongst the
reducers
 Shuffle: Map output is relayed to all reduce tasks
 Reduce logic: The user-provided reduce function is
invoked
 Before the reduce function is applied, the fetched map
outputs are merged so each reducer sees its keys in sorted order
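As a concrete (hypothetical) trace of these phases, consider wordcount on the single input line "to be or not to be":

 Input split:  "to be or not to be"            (one split, one map task)
 Map output:   (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
 Local sort:   (be,1) (be,1) (not,1) (or,1) (to,1) (to,1)
 Shuffle:      each reducer fetches its hash partition of the keys
 Reduce out:   (be,2) (not,1) (or,1) (to,2)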
Google MapReduce –
Example
Wordcount map in Java
// Assumes the surrounding Mapper class declares, as in the stock
// Hadoop wordcount example:
//   private final static IntWritable one = new IntWritable(1);
//   private Text word = new Text();
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());  // current token becomes the output key
    context.write(word, one);   // emit <word, 1>
  }
}
Wordcount reduce in Java
// Assumes the surrounding Reducer class declares:
//   private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {  // sum the counts emitted for this word
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);       // emit <word, total count>
}
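For completeness, a sketch of the driver that wires these two functions into a runnable job, in the style of the classic Hadoop 1.x wordcount example; the class names WordCount, TokenizerMapper, and IntSumReducer are the conventional wrappers assumed here:

// Inside `public class WordCount`, with the usual org.apache.hadoop
// imports (Configuration, Path, Job, Text, IntWritable,
// FileInputFormat, FileOutputFormat):
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");        // Hadoop 1.x-era API
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Reusing the reducer as a combiner pre-aggregates counts on the map side, which cuts shuffle traffic considerably for wordcount-style jobs.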
Hadoop
 Open-source implementation of MapReduce, created
by Doug Cutting, starting around 2004 as part of the
Nutch project and later developed at scale at Yahoo!
 Now a top-level Apache open-source project
 Implemented in Java (Google's in-house
implementation is in C++)
 Jobs can be written in C++, Java, Python, etc.
 Comes with an associated distributed filesystem,
HDFS (clone of GFS)
Hadoop Components
 Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce Software Framework
 There are many other projects based around
core Hadoop
– Often referred to as the 'Hadoop Ecosystem'
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
Hadoop Users
 Adobe: Several areas from social services to
unstructured data storage and processing
 eBay: 532-node cluster storing 5.3PB of data
 Facebook: Used for reporting/analytics; one cluster
with 1100 nodes (12PB) and another with 300 nodes
(3PB)
 LinkedIn: 3 clusters with a combined 4,000 nodes
 Twitter: To store and process Tweets and log files
 Yahoo!: Multiple clusters totaling 40,000 nodes;
the largest cluster has 4,500 nodes!
Running a Hadoop Application
 The first order of the day is to format the Hadoop
DFS
 Jump to the Hadoop directory and execute:
bin/hadoop namenode -format
 Running Hadoop
 To run Hadoop and HDFS:
bin/start-all.sh
 To terminate them:
bin/stop-all.sh
Running a Hadoop Application
 Generating a dataset
 Create a temporary directory to hold the data:
 mkdir /tmp/gutenberg
 Jump to it:
 cd /tmp/gutenberg
 Download text files:
 wget www.gutenberg.org/etext/20417
 wget www.gutenberg.org/etext/5000
 wget www.gutenberg.org/etext/4300
Running a Hadoop Application
 Copying the dataset to the HDFS
 Jump to the Hadoop directory and execute:
 bin/hadoop dfs -copyFromLocal /tmp/gutenberg /ccw/gutenberg
 Running Wordcount
 bin/hadoop jar hadoop-examples-1.0.4.jar wordcount
/ccw/gutenberg /ccw/gutenberg-output
 Retrieving results from the HDFS
 Copy to the local FS:
 bin/hadoop dfs -getmerge /ccw/gutenberg-output /tmp/gutenberg-output
Running a Hadoop Application
 Accessing the web interface
 JobTracker: http://localhost:50030
 TaskTracker: http://localhost:50060
 Reference: Running Hadoop on Ubuntu Linux
(Single-Node Cluster):
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Thanks
