The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and cloud platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
1. Big Data and The Cloud at Yahoo!
Sumeet Singh
Head of Products, Cloud Services & Hadoop
April 2, 2013
2. Cloud Mission
Yahoo! Presentation, Confidential
Build the next generation of personalized experiences with the fastest serving containers and services that integrate easily and scale on-demand.
4. Infrastructure View
[Diagram: end-to-end infrastructure in three stages — Data Collection, Asynchronous Data Processing, Synchronous Serving.]
Data Collection: sensor data (push) and harvested data (pull) from PC, TV, tablet, and phone clients; sources include web crawl, social graph, 3rd-party content, and email. Feeds (ad, search, audience), web crawl, social feeds, content feeds, and harvested mail travel over the Data Highway.
Asynchronous Data Processing: the Hadoop Data Grid is the source of truth for data, alongside MySQL, MS SQL, and Oracle stores; it feeds datamarts (dimensional on data), user data stores, and BI tools such as Tableau, MicroStrategy, and OLAP.
Synchronous Serving: ad serving, data serving, and content serving backed by low-latency NoSQL stores, edge serving, and user data/profile services.
5. Functional View
[Diagram: layered functional architecture across globally distributed data centers and edge PODs.]
IaaS — Fabric: storage virtualization, OS virtualization, network virtualization; data centers.
PaaS — Shared services: elasticity services, deployment automation, registry/name service, monitoring, replication, authentication/authorization, quota management, PubSub notifications.
Serving containers: edge containers, dynamic serving, presentation containers; ranking and indexing.
Storage: structured storage (key-value stores, user data stores), unstructured storage (blob stores), in-memory stores.
Big data containers: data processing with Hadoop, real-time/stream processing, query engine (Yahoo! Query Language).
Edge: proxy, cache, streaming, memory cloud.
6. Building and Deploying Services
[Diagram: twelve numbered capabilities surrounding Yahoo! web, social, and mobile apps.]
- Run the application software (deploy web and application servers) in minutes
- Simplify application/business logic with containers
- Automate deployment and management
- Move messages reliably among distributed application components
- Automate workflows and improve developer productivity
- Coordinate distributed applications; manage synchronization, groups, etc.
- Store and serve resources and static content, user profile/data, and application datasets
- Stream on-demand and live content
- Speed up page/content delivery to a geo-dispersed user base with edge technologies
- Serve dynamic content (personalize, rank, dynamically select)
- Collect and process Internet-scale data in acceptable time limits
- Manage data transfer, access, and lifecycle
Cross-cutting: auto scaling, fault tolerance, high availability, monitoring, security.
7. No Different Than a Public Cloud…
Source: http://aws.amazon.com/
15. So, What is Hadoop
A large-scale distributed data processing framework that is:
- Scalable
- Economical
- Efficient
- Reliable
- Easy to use, and
- Open source
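Hadoop's core programming model is MapReduce. As an illustration only (not Yahoo!'s code, and omitting the distributed machinery), a minimal pure-Python sketch of the map, shuffle, and reduce phases applied to word counting:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data and the cloud", "the cloud at Yahoo"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In real Hadoop the map and reduce functions run in parallel across many nodes, with HDFS providing the storage layer; the shape of the computation is the same.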
16. Hadoop at Yahoo!
[Chart: Hadoop growth by year, 2006–2012 — number of nodes (0–45,000) and raw HDFS storage in PB (0–400); series: Nodes, HDFS.]
Milestones along the timeline:
- Yahoo! commits to scaling Hadoop for production use
- Research workloads in Search and Advertising
- Production (modeling) with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Increased user base with partitioned namespaces
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Current team with Y! focus: behind every click
17. Yahoo! is at the Frontier of Hadoop Scale
Hadoop analyzes 100 billion advertising and page events every day from Yahoo!'s 700+ million users.
- 42,000 servers; 4,000 in a single cluster
- 165 PB of data, 350+ PB of HDFS
- 10 million slot-hours of compute time
- 500+ daily users with 360,000 jobs
19. Home Page Personalization
[Diagram: Yahoo! homepage real-time data flow delivering personalized homepages to Yahoo! users. User behavior and user feedback flow into the science Hadoop cluster, which builds categorization models (updated weekly), and into the production Hadoop cluster, which builds serving maps (updated every 5 minutes) for the serving systems that reach engaged users.]
Hadoop infers user interest; users now click more often:
o Hadoop powers content categorization for user profiles
o User profiles are derived from user activity
o Profile insights are used to serve relevant content
Hadoop abstractions enable building machine learning models faster (months to days):
o Hadoop, HBase, Storm, and Hive used for modeling and analytics
Hadoop enables frequent content refreshes, improving user engagement (hours to minutes):
o 40 models pushed at 5-minute and 30-minute intervals
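The deck does not describe how serving maps score content, so as a hypothetical sketch only (all names and data invented): once batch jobs on Hadoop have produced per-user category affinities, a serving system can rank candidate stories by overlap between a user's profile and each story's category tags:

```python
def rank_stories(user_profile, stories, top_n=3):
    """Rank candidate stories for one user.

    user_profile: {category: affinity score}, built offline (e.g. on Hadoop).
    stories: list of (story_id, set_of_categories) candidates.
    """
    scored = []
    for story_id, categories in stories:
        # Score = sum of the user's affinity for each category the story carries.
        score = sum(user_profile.get(c, 0.0) for c in categories)
        scored.append((score, story_id))
    scored.sort(reverse=True)           # highest-affinity stories first
    return [story_id for _, story_id in scored[:top_n]]

# Invented example data:
profile = {"sports": 0.9, "finance": 0.4, "tech": 0.7}
stories = [
    ("s1", {"finance"}),
    ("s2", {"sports", "tech"}),
    ("s3", {"entertainment"}),
]
ranked = rank_stories(profile, stories)  # → ["s2", "s1", "s3"]
```

The expensive part (inferring the affinities from activity logs) runs in batch on the grid; the cheap lookup-and-sum above is what can run at serving time every few minutes.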
20. Mail Anti-Spam (visualize.yahoo.com)
“SpamGuard in conjunction with Hadoop has reduced spam by 60%”
Hadoop powers data mining algorithms that adapt quickly to new spam techniques. These are used to identify spam and spammers.
o Hadoop helps Yahoo! block 20.5 billion spam emails and deliver 5.6 billion emails a day across 300 million mailboxes
o Hadoop scalability allows us to detect and respond to new spam algorithms within hours
o Hadoop also allows our scientists to detect new spam patterns in this huge sea of data
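The deck does not specify SpamGuard's algorithms. As a generic illustration of the kind of token-statistics model such filters often build from large mail corpora, here is a tiny naive Bayes classifier with Laplace smoothing in plain Python (training data invented):

```python
import math
from collections import Counter

# Toy training corpus (invented): (label, message) pairs.
train = [
    ("spam", "win money now"),
    ("spam", "claim your prize money"),
    ("ham",  "meeting moved to noon"),
    ("ham",  "lunch at noon today"),
]

def fit(examples):
    # Count tokens per class and messages per class (for priors).
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in examples:
        counts[label].update(text.split())
        totals[label] += 1
    return counts, totals

def classify(counts, totals, text, alpha=1.0):
    # Naive Bayes: pick the class with the highest log-probability,
    # smoothing unseen tokens with Laplace constant alpha.
    vocab = set(counts["spam"]) | set(counts["ham"])
    best_label, best_score = None, float("-inf")
    for label in counts:
        n = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for tok in text.split():
            score += math.log((counts[label][tok] + alpha) / (n + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

counts, totals = fit(train)
label = classify(counts, totals, "win a prize")  # → "spam"
```

At Yahoo!'s scale the counting step is the part that needs Hadoop: token statistics over billions of messages are a natural MapReduce job, and retraining frequently is what lets the filter adapt within hours.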
21. Many Other Use Cases at Yahoo!...
- AD TARGETING: 3x improvement in the accuracy of ad placements by targeting billions of impressions per day
- SEARCH ASSIST: over a billion web pages processed to create the output list of related words for an improved search experience
- MEMBERSHIP ANTI-ABUSE: filter out over 25% of 2M+ new registrations every day as abusive with 95+% confidence
- CONTENT AGILITY: single, grid-based, highly scalable CMS as 'source of truth' for reducing time to launch new sites from quarters to weeks
- DATA PIPELINES: over 100B events (35 TB a day) aggregated and processed for user engagement data, enabling downstream analytics
- AD OPTIMIZATION: actionable data and insights for better, faster decisions on ad supply forecast and serving plan against over a million contracts
Registration Success Rates Today
[Chart: average daily successful registrations (~1M/day) — Web 875,906 (88%), Mobile 11,611 (1%), Partners 107,845 (11%). Funnel: Registration → Login → Account Recovery.]
Global reg success rates:
- Web: avg total regs 2.22 M/day; avg good regs(1) 0.88 M/day; avg total sessions 2.82 M/day; avg good sessions 1.15 M/day; avg success rate 76%
- Mobile: avg total regs 13.4 K/day; avg successful regs 11.6 K/day; avg success rate 85%
- Partners: avg total regs 112.3 K/day; avg successful regs 107.8 K/day; avg success rate 96%
Source: average global statistics across all Intl, Source, and Locale from RUSS, Nov 1 ’11 – Nov 30 ’11 (LTM available, but not very different from monthly average stats)
(1) May have abuse that we do not catch today
Content Agility: Lego Modules
[Diagram: data pipeline for aggregation. Event sources (page views, link views, link clicks, ad views, ad clicks, non-web events) flow from web servers through the Data Highway and filers into the FETL pipeline on the grid (HDFS), with a UDA pipeline for aggregation spanning front-end colos (AC4, NE1, SP2, SK1, SP1, CH1, SG1) and back-end colos. Outputs feed analysis, optimization, targeting, and research; customer, internal, and external reporting tools; ad-hoc customer queries; and MSFT AdCenter.]
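The aggregation step in a pipeline like the one above reduces raw event streams to per-user engagement metrics. A toy sketch of that group-and-count step in plain Python (event types from the slide; data invented):

```python
from collections import defaultdict

# Invented event log: (user_id, event_type) tuples of the kinds the
# pipeline collects — page_view, link_click, ad_view, ad_click.
events = [
    ("u1", "page_view"), ("u1", "ad_view"), ("u1", "ad_click"),
    ("u2", "page_view"), ("u2", "link_click"),
    ("u1", "page_view"),
]

def aggregate(events):
    # Group-and-count by (user, event_type), as the reduce step of an
    # aggregation job would do at scale.
    counts = defaultdict(int)
    for user, etype in events:
        counts[(user, etype)] += 1
    return dict(counts)

agg = aggregate(events)
```

At 100B events (35 TB) a day this exact shape of computation is sharded across the grid, with the Data Highway handling reliable collection from the colos.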
23. Data Processing is Evolving
- 1968: Hierarchical database
- 1970: Relational database (IBM System R)
- 1983: Data warehouse (IBM DB2)
- 1990s: Microsoft SQL Server, Oracle Database
- 2003: IBM “System S”
- 2004: Google MapReduce
- 2006: Hadoop
- 2009: IBM Streams
- 2010: Microsoft StreamInsight
- 2011: Twitter Storm
- 2012: Berkeley Spark
Eras: OLTP and operational databases → OLAP and data warehouses → Big Data → real-time analytic processing.