Cloud Architectures - Jinesh Varia - GrepTheWeb
2. On Cloud Computing…
“We in academia and the government labs have not kept up with the times, Universities really need to get on board.”
- Randal E. Bryant, Dean of the Computer Science School at Carnegie Mellon University
Source: http://www.nytimes.com/2007/10/08/technology/08cloud.html
15. Scale: 50 servers to 5000 servers in 3 days
Amazon EC2 easily scaled to handle the additional traffic after the launch of a Facebook modification.
[Chart: number of EC2 instances, 4/12/2008 to 4/20/2008; steady state of ~40 instances rising to a peak of 5000 instances]
21. “TimesMachine” from NY Times
• Input: 11 million articles (4 TB of data), 1851–1922, stored as TIFF
• Conversion: TIFF -> PDF
• What did he do? 100 EC2 instances for 24 hours, all data on S3
• Output: 1.5 TB of data
• Tools: Hadoop, iText, JetS3t
29. CS290F: Scalable Internet Services
• UCSB, Fall 2006
• Prof created an app to manage team usage
• Ruby on Rails
• Complete stack: from load balancer and app server to DB
• Learn how to scale: simulated load, generated graphs
• All course contents, student assignments, and lessons learned are on the wiki
30. CS345a: Data Mining @ Stanford
Tools used:
• Shell/Linux/Java
• Hadoop on EC2
• Data set on S3
• Datasets: NetFlix, Alexa, IR datasets from TREC
Class organization:
• Stanford, Winter 2007
• 30–35 students
• Each team spawns 10–15 Hadoop slave nodes
• TA created Getting-Started AMIs (& scripts)
• TA managed the students’ usage
31. Bioinformatics @ Northwestern University
• Using Hadoop to perform sequence alignments on large genomic datasets
– Northwestern University (Flatow & Lin) presented a talk at the Next-gen Sequencing Data Analysis meeting
• “An understanding of the industrial strength map-reduce paradigm will be invaluable to those looking to cope with the next-generation datasets. Combined with the power of elastic computing clouds, many of the potential barriers to dealing with such large-scale data can be completely eliminated.”
38. Main Problems
Technical (addressed by Hadoop and Web Services):
• How do I coordinate jobs between machines (distributed processing)?
• What if a machine fails?
• How will I scale out?
Business (addressed by Cloud Computing):
• How do I get management sign-off?
• Resources to manage the infrastructure?
• How do I get rid of the idle infrastructure?
41. Examples of Patterns
• Source code: int x = 40 + i
• Anything with punctuation: “Hey!” he said, “Are you ok?”
• Case-sensitive: Function CallOrderController()
• Equations: f(x) = x^2
• Other patterns: (dis)integration of life, email addresses
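These pattern categories map naturally onto regular expressions. A minimal sketch in Python, with illustrative expressions of my own (the deck does not show GrepTheWeb's actual pattern syntax):

```python
import re

# Illustrative regexes for the pattern categories above -- examples only,
# not the expressions GrepTheWeb shipped with.
patterns = {
    "source code":   r"int\s+\w+\s*=",           # e.g. "int x = 40 + i"
    "punctuation":   r"\"[^\"]*!\"",             # quoted exclamations
    "function call": r"CallOrderController\(\)", # case-sensitive identifier
    "equation":      r"f\(x\)\s*=\s*x\^2",
}

def grep(line: str) -> list[str]:
    """Return the names of every pattern category that matches the line."""
    return [name for name, rx in patterns.items() if re.search(rx, line)]
```

Each category is just a named regex; GrepTheWeb's job is running one such expression over millions of documents instead of a handful of lines.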
42. Zoom Level 1
[Diagram] The input dataset (a list of document URLs from the Alexa crawl) and a RegEx go into the GrepTheWeb service; GetStatus polls for progress; the output is the subset of document URLs that matched the RegEx.
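The Zoom Level 1 interface can be sketched as a toy, in-process service. Everything here is illustrative (the real service is asynchronous and spread across AWS); `start_grep`, `get_status`, and `get_output` mirror the StartGrep/GetStatus calls in the diagram:

```python
import re
import uuid

class GrepTheWeb:
    """Toy, synchronous stand-in for the GrepTheWeb service interface."""

    def __init__(self, documents: dict):
        self.documents = documents          # url -> page text
        self.jobs = {}                      # job id -> list of matching urls

    def start_grep(self, regex: str) -> str:
        """Kick off a grep job over all documents; return a job id."""
        job_id = uuid.uuid4().hex
        rx = re.compile(regex)
        self.jobs[job_id] = [url for url, text in self.documents.items()
                             if rx.search(text)]
        return job_id

    def get_status(self, job_id: str) -> str:
        return "Completed" if job_id in self.jobs else "Unknown"

    def get_output(self, job_id: str) -> list:
        """Return the subset of document URLs that matched the RegEx."""
        return self.jobs[job_id]
```

The later zoom levels replace the synchronous loop inside `start_grep` with queues, controllers, and a Hadoop cluster.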
43. Zoom Level 2
[Diagram] StartGrep (with the RegEx) reaches a controller that manages the phases; input files (the Alexa crawl) live on Amazon S3; user info and job status info live in Amazon SimpleDB; server instances are spawned on Amazon EC2; GetStatus and Get Output return results from the status DB and S3.
• Amazon SQS: distributed transient buffer; never lose a message; highly available, durable, and reliable; ideal for small, short-lived messages; message locking.
• Amazon S3: infinitely scalable storage in the cloud; private and public storage; access control; pay by the GB.
• Amazon EC2: resizable computing capacity in the cloud; spawn server instances using a web service call; root-level access; pay by the hour.
• Amazon SimpleDB: database in the cloud; lightweight, query-able; distributed and partitioned; pay by the GB, pay per query.
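The SQS properties called out above (“never lose a message”, message locking) come down to a visibility timeout: a received message is hidden rather than deleted, and reappears if the consumer dies before acknowledging it. A minimal in-memory sketch of that behaviour (class and method names are illustrative, not the SQS API):

```python
import time

class TransientBuffer:
    """Toy queue with SQS-style message locking (visibility timeout)."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.timeout = visibility_timeout
        self.messages = {}      # message id -> (body, time it becomes visible)
        self.next_id = 0

    def send(self, body: str) -> int:
        self.messages[self.next_id] = (body, 0.0)   # visible immediately
        self.next_id += 1
        return self.next_id - 1

    def receive(self):
        """Return (id, body) of a visible message and lock it, or None."""
        now = time.monotonic()
        for mid, (body, visible_at) in self.messages.items():
            if visible_at <= now:                    # not locked by a reader
                self.messages[mid] = (body, now + self.timeout)
                return mid, body
        return None

    def delete(self, mid: int) -> None:
        """Acknowledge: only now is the message really gone."""
        self.messages.pop(mid, None)
```

A worker that crashes after `receive` but before `delete` simply lets the lock expire, and another worker picks the message up, which is what makes the buffer safe for coordinating the phases below.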
44. Zoom Level 3
[Diagram] Amazon SQS holds four queues: a launch queue, a monitor queue, a shutdown queue, and a billing queue. StartGrep puts a message on the launch queue, and the controller is split into a launch controller, a monitor controller, a shutdown controller, and a billing controller, each reading its own queue.
• Launch controller: launches the Hadoop cluster (master M, N slaves, HDFS) on Amazon EC2, inserts the JobID and EC2 info into the Amazon SimpleDB status DB, and gets the input files (the Alexa crawl) from Amazon S3.
• Monitor controller: pings the cluster, checks for results, and inserts status into the status DB.
• Shutdown controller: shuts the cluster down.
• Billing controller: hands usage off to the billing service.
GetStatus reads the status DB; Get Output fetches the output file the cluster put on Amazon S3.
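The four controllers form a queue-driven pipeline. A toy, single-process rendering of that flow (the function bodies are placeholders for the real EC2/Hadoop calls, and `status_db` stands in for the SimpleDB status domain):

```python
from queue import Queue

# One queue per phase, as in the diagram.
launch_q, monitor_q, shutdown_q, billing_q = Queue(), Queue(), Queue(), Queue()
status_db = {}   # stand-in for the Amazon SimpleDB status DB

def launch_controller():
    job_id = launch_q.get_nowait()
    status_db[job_id] = "Launched"     # would launch a Hadoop cluster on EC2
    monitor_q.put(job_id)

def monitor_controller():
    job_id = monitor_q.get_nowait()
    status_db[job_id] = "Completed"    # would ping the master, check results
    shutdown_q.put(job_id)

def shutdown_controller():
    job_id = shutdown_q.get_nowait()
    status_db[job_id] = "Shut down"    # would terminate the cluster
    billing_q.put(job_id)              # billing service picks it up from here

# StartGrep enqueues a job; each controller advances it one phase.
launch_q.put("job-1")
for step in (launch_controller, monitor_controller, shutdown_controller):
    step()
```

Because each controller only talks to its queues and the status DB, any of them can crash and be restarted (or scaled out) without the others noticing, which is the point of splitting the controller this way.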
45. Zoom Level 4
[Diagram] Each user request is a separate Hadoop job: User1 calls StartJob1 and later StopJob1, User2 calls StartJob2 and StopJob2, and so on. Within each Hadoop job, map tasks run in parallel and feed a combine/reduce step; the service stores status and results, and Get Result returns them.
46. SideTrack: WordCount Example
MAPPER: for each input record, extract the set of key/value pairs we care about from that record.
“Hi Hadoop, Bye Hadoop” -> (“Hi”, 1), (“Hadoop”, 1), (“Bye”, 1), (“Hadoop”, 1)
REDUCER: for each extracted key, combine all the values that share that key.
(“Hadoop”, [1, 1]) -> (“Hadoop”, 2)
Source: Doug Cutting’s slide deck on Hadoop
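The WordCount flow above can be simulated in a few lines of plain Python: the mapper emits (word, 1) pairs, a grouping step plays the role of Hadoop's shuffle, and the reducer sums each key's values (function names are illustrative):

```python
from collections import defaultdict

def mapper(record: str):
    """Emit a (word, 1) pair for each word in the record."""
    for word in record.replace(",", "").split():
        yield (word, 1)

def reducer(key: str, values: list):
    """Combine all values that share the same key."""
    return (key, sum(values))

def map_reduce(records: list) -> dict:
    grouped = defaultdict(list)          # "shuffle": group values by key
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())
```

Hadoop does exactly this, except the map and reduce calls run on different machines and the grouping happens over the network.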
47. Zoom Level 5 (Hadoop MapReduce)
MAPPER: for each input record, extract the set of key/value pairs we care about from that record.
(LineNumber, s3pointer) -> (s3pointer, [matches])
REDUCER: for each extracted key, combine all the values that share that key. Here the reducer is the identity function.
Source: Doug Cutting’s slide deck on Hadoop
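The GrepTheWeb map step can be simulated the same way: the mapper takes (line number, S3 pointer) pairs, fetches the document, and emits (pointer, [matches]); the reducer just passes results through. Here `fetch()` and the sample documents are stand-ins for an S3 GET:

```python
import re

# Stand-in for documents stored on Amazon S3 (pointers are illustrative).
documents = {
    "s3://bucket/doc1": "f(x) = x^2 and g(x) = x^3",
    "s3://bucket/doc2": "no equations here",
}

def fetch(s3_pointer: str) -> str:
    """Stand-in for an S3 GET of the document body."""
    return documents[s3_pointer]

def mapper(line_number: int, s3_pointer: str, regex: str):
    """Emit (s3pointer, [matches]) for documents that match the RegEx."""
    matches = re.findall(regex, fetch(s3_pointer))
    if matches:
        yield (s3_pointer, matches)

def identity_reducer(key, values):
    """GrepTheWeb's reducer: the identity function."""
    return (key, values)

results = [identity_reducer(*out)
           for i, ptr in enumerate(documents)
           for out in mapper(i, ptr, r"\w\(x\) = x\^\d")]
```

Because each mapper only needs its own S3 pointer, the input splits cleanly across however many Hadoop slave nodes the launch controller spun up.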