Cloud Architectures - Jinesh Varia - GrepTheWeb
2. On Cloud Computing…
“We in academia and the government labs have not kept up with the times, Universities really need to get on board.”
- Randal E. Bryant, Dean of the Computer Science School at Carnegie Mellon University
Source: http://www.nytimes.com/2007/10/08/technology/08cloud.html
15. Scale: 50 servers to 5000 servers in 3 days
Amazon EC2 easily scaled to handle the additional traffic after the launch of a Facebook modification.
[Chart: number of EC2 instances, 4/12/2008 to 4/20/2008; steady state of ~40 instances rising to a peak of 5000 instances]
21. “TimesMachine” from NY Times
• Input: 11 million articles (4 TB of data), 1851–1922, stored as TIFF
• Conversion: TIFF -> PDF
• What did he do? 100 EC2 instances for 24 hours, all data on S3
• Output: 1.5 TB of data
• Tools: Hadoop, iText, JetS3t
29. CS290F: Scalable Internet Services
• UCSB, Fall 2006
• Prof created an app to manage team usage
• Ruby on Rails
• Complete stack: from load balancer and app server to DB
• Learn how to scale: simulated load, generated graphs
• All course contents, student assignments, and lessons learned are on the wiki
30. CS345a: Data Mining @ Stanford
Tools used:
• Shell/Linux/Java
• Hadoop on EC2
• Data set on S3
• Datasets: NetFlix, Alexa, IR datasets from TREC
Class organization:
• Stanford, Winter 2007
• 30–35 students
• Each team spawns 10–15 Hadoop slave nodes
• TA created Getting-Started AMIs (& scripts)
• TA managed the students’ usage
31. Bioinformatics @ Northwestern University
• Using Hadoop to perform sequence alignments on large genomic datasets
– Northwestern University (Flatow & Lin) presented a talk at the Next-gen Sequencing Data Analysis meeting
• “An understanding of the industrial strength map-reduce paradigm will be invaluable to those looking to cope with the next-generation datasets. Combined with the power of elastic computing clouds, many of the potential barriers to dealing with such large-scale data can be completely eliminated.”
38. Main Problems
Technical (addressed by Hadoop and Web Services):
• How do I coordinate jobs between machines (distributed processing)?
• What if a machine fails?
• How will I scale out?
Business (addressed by Cloud Computing):
• How do I get management sign-off?
• Resources to manage the infrastructure?
• How do I get rid of the idle infrastructure?
41. Examples of Patterns
• Source code: int x = 40 + i
• Anything with punctuation: “Hey!” he said, “Are you ok?”
• Case-sensitive: Function CallOrderController()
• Equations: f(x) = x^2
• Other patterns: (dis)integration of life, email addresses
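These pattern categories map naturally onto regular expressions. A minimal sketch in Python, with illustrative expressions of my own (the deck does not show GrepTheWeb's actual pattern syntax):

```python
import re

# Illustrative regexes for the pattern categories above -- examples only,
# not the expressions GrepTheWeb shipped with.
patterns = {
    "source code":   r"int\s+\w+\s*=",           # e.g. "int x = 40 + i"
    "punctuation":   r"\"[^\"]*!\"",             # quoted exclamations
    "function call": r"CallOrderController\(\)", # case-sensitive identifier
    "equation":      r"f\(x\)\s*=\s*x\^2",
}

def grep(line: str) -> list[str]:
    """Return the names of every pattern category that matches the line."""
    return [name for name, rx in patterns.items() if re.search(rx, line)]
```

Each category is just a named regex; GrepTheWeb's job is running one such expression over millions of documents instead of a handful of lines.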
42. Zoom Level 1
[Diagram] The input dataset (a list of document URLs from the Alexa crawl) and a RegEx go into the GrepTheWeb service; GetStatus polls for progress; the output is the subset of document URLs that matched the RegEx.
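The Zoom Level 1 interface can be sketched as a toy, in-process service. Everything here is illustrative (the real service is asynchronous and spread across AWS); `start_grep`, `get_status`, and `get_output` mirror the StartGrep/GetStatus calls in the diagram:

```python
import re
import uuid

class GrepTheWeb:
    """Toy, synchronous stand-in for the GrepTheWeb service interface."""

    def __init__(self, documents: dict):
        self.documents = documents          # url -> page text
        self.jobs = {}                      # job id -> list of matching urls

    def start_grep(self, regex: str) -> str:
        """Kick off a grep job over all documents; return a job id."""
        job_id = uuid.uuid4().hex
        rx = re.compile(regex)
        self.jobs[job_id] = [url for url, text in self.documents.items()
                             if rx.search(text)]
        return job_id

    def get_status(self, job_id: str) -> str:
        return "Completed" if job_id in self.jobs else "Unknown"

    def get_output(self, job_id: str) -> list:
        """Return the subset of document URLs that matched the RegEx."""
        return self.jobs[job_id]
```

The later zoom levels replace the synchronous loop inside `start_grep` with queues, controllers, and a Hadoop cluster.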
43. Zoom Level 2
[Diagram] StartGrep (with the RegEx) reaches a controller that manages the phases; input files (the Alexa crawl) live on Amazon S3; user info and job status info live in Amazon SimpleDB; server instances are spawned on Amazon EC2; GetStatus and Get Output return results from the status DB and S3.
• Amazon SQS: distributed transient buffer; never lose a message; highly available, durable, and reliable; ideal for small, short-lived messages; message locking.
• Amazon S3: infinitely scalable storage in the cloud; private and public storage; access control; pay by the GB.
• Amazon EC2: resizable computing capacity in the cloud; spawn server instances using a web service call; root-level access; pay by the hour.
• Amazon SimpleDB: database in the cloud; lightweight, query-able; distributed and partitioned; pay by the GB, pay per query.
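The SQS properties called out above (“never lose a message”, message locking) come down to a visibility timeout: a received message is hidden rather than deleted, and reappears if the consumer dies before acknowledging it. A minimal in-memory sketch of that behaviour (class and method names are illustrative, not the SQS API):

```python
import time

class TransientBuffer:
    """Toy queue with SQS-style message locking (visibility timeout)."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.timeout = visibility_timeout
        self.messages = {}      # message id -> (body, time it becomes visible)
        self.next_id = 0

    def send(self, body: str) -> int:
        self.messages[self.next_id] = (body, 0.0)   # visible immediately
        self.next_id += 1
        return self.next_id - 1

    def receive(self):
        """Return (id, body) of a visible message and lock it, or None."""
        now = time.monotonic()
        for mid, (body, visible_at) in self.messages.items():
            if visible_at <= now:                    # not locked by a reader
                self.messages[mid] = (body, now + self.timeout)
                return mid, body
        return None

    def delete(self, mid: int) -> None:
        """Acknowledge: only now is the message really gone."""
        self.messages.pop(mid, None)
```

A worker that crashes after `receive` but before `delete` simply lets the lock expire, and another worker picks the message up, which is what makes the buffer safe for coordinating the phases below.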
44. Zoom Level 3
[Diagram] Amazon SQS holds four queues: a launch queue, a monitor queue, a shutdown queue, and a billing queue. StartGrep puts a message on the launch queue, and the controller is split into a launch controller, a monitor controller, a shutdown controller, and a billing controller, each reading its own queue.
• Launch controller: launches the Hadoop cluster (master M, N slaves, HDFS) on Amazon EC2, inserts the JobID and EC2 info into the Amazon SimpleDB status DB, and gets the input files (the Alexa crawl) from Amazon S3.
• Monitor controller: pings the cluster, checks for results, and inserts status into the status DB.
• Shutdown controller: shuts the cluster down.
• Billing controller: hands usage off to the billing service.
GetStatus reads the status DB; Get Output fetches the output file the cluster put on Amazon S3.
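The four controllers form a queue-driven pipeline. A toy, single-process rendering of that flow (the function bodies are placeholders for the real EC2/Hadoop calls, and `status_db` stands in for the SimpleDB status domain):

```python
from queue import Queue

# One queue per phase, as in the diagram.
launch_q, monitor_q, shutdown_q, billing_q = Queue(), Queue(), Queue(), Queue()
status_db = {}   # stand-in for the Amazon SimpleDB status DB

def launch_controller():
    job_id = launch_q.get_nowait()
    status_db[job_id] = "Launched"     # would launch a Hadoop cluster on EC2
    monitor_q.put(job_id)

def monitor_controller():
    job_id = monitor_q.get_nowait()
    status_db[job_id] = "Completed"    # would ping the master, check results
    shutdown_q.put(job_id)

def shutdown_controller():
    job_id = shutdown_q.get_nowait()
    status_db[job_id] = "Shut down"    # would terminate the cluster
    billing_q.put(job_id)              # billing service picks it up from here

# StartGrep enqueues a job; each controller advances it one phase.
launch_q.put("job-1")
for step in (launch_controller, monitor_controller, shutdown_controller):
    step()
```

Because each controller only talks to its queues and the status DB, any of them can crash and be restarted (or scaled out) without the others noticing, which is the point of splitting the controller this way.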
45. Zoom Level 4
[Diagram] Each user request is a separate Hadoop job: User1 calls StartJob1 and later StopJob1, User2 calls StartJob2 and StopJob2, and so on. Within each Hadoop job, map tasks run in parallel and feed a combine/reduce step; the service stores status and results, and Get Result returns them.
46. SideTrack: WordCount Example
MAPPER: for each input record, extract the set of key/value pairs we care about from that record.
“Hi Hadoop, Bye Hadoop” -> (“Hi”, 1), (“Hadoop”, 1), (“Bye”, 1), (“Hadoop”, 1)
REDUCER: for each extracted key, combine all the values that share that key.
(“Hadoop”, [1, 1]) -> (“Hadoop”, 2)
Source: Doug Cutting’s slide deck on Hadoop
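The WordCount flow above can be simulated in a few lines of plain Python: the mapper emits (word, 1) pairs, a grouping step plays the role of Hadoop's shuffle, and the reducer sums each key's values (function names are illustrative):

```python
from collections import defaultdict

def mapper(record: str):
    """Emit a (word, 1) pair for each word in the record."""
    for word in record.replace(",", "").split():
        yield (word, 1)

def reducer(key: str, values: list):
    """Combine all values that share the same key."""
    return (key, sum(values))

def map_reduce(records: list) -> dict:
    grouped = defaultdict(list)          # "shuffle": group values by key
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())
```

Hadoop does exactly this, except the map and reduce calls run on different machines and the grouping happens over the network.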
47. Zoom Level 5 (Hadoop MapReduce)
MAPPER: for each input record, extract the set of key/value pairs we care about from that record.
(LineNumber, s3pointer) -> (s3pointer, [matches])
REDUCER: for each extracted key, combine all the values that share that key. Here the reducer is the identity function.
Source: Doug Cutting’s slide deck on Hadoop
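The GrepTheWeb map step can be simulated the same way: the mapper takes (line number, S3 pointer) pairs, fetches the document, and emits (pointer, [matches]); the reducer just passes results through. Here `fetch()` and the sample documents are stand-ins for an S3 GET:

```python
import re

# Stand-in for documents stored on Amazon S3 (pointers are illustrative).
documents = {
    "s3://bucket/doc1": "f(x) = x^2 and g(x) = x^3",
    "s3://bucket/doc2": "no equations here",
}

def fetch(s3_pointer: str) -> str:
    """Stand-in for an S3 GET of the document body."""
    return documents[s3_pointer]

def mapper(line_number: int, s3_pointer: str, regex: str):
    """Emit (s3pointer, [matches]) for documents that match the RegEx."""
    matches = re.findall(regex, fetch(s3_pointer))
    if matches:
        yield (s3_pointer, matches)

def identity_reducer(key, values):
    """GrepTheWeb's reducer: the identity function."""
    return (key, values)

results = [identity_reducer(*out)
           for i, ptr in enumerate(documents)
           for out in mapper(i, ptr, r"\w\(x\) = x\^\d")]
```

Because each mapper only needs its own S3 pointer, the input splits cleanly across however many Hadoop slave nodes the launch controller spun up.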