Big Data and The Cloud at Yahoo!
Sumeet Singh
Head of Products, Cloud Services & Hadoop
April 2, 2013
Cloud Mission
Yahoo! Presentation, Confidential
Build the next generation of personalized experiences with the
fastest serving containers and services that integrate easily and
scale on-demand
Yahoo! Cloud
Infrastructure View
[Diagram: data moves through three stages: Data Collection, Asynchronous Data Processing, Synchronous Serving]
Devices: PC, TV, Tablet, Phone
Harvested sources: Web Crawl, Social Graph, 3rd Party Content, Email
Collection: Edge Serving, User Data/Profile Services, Sensor Data (Push), Harvest Data (Pull), Data Highway
Feeds: Feeds (Ad, Search, Audience), Web Crawl, Social Feeds, Content Feeds, Harvested Mail
Processing: Hadoop Data Grid (source of truth for data), Datamarts (Dimensional on Data), MySQL, MS SQL, Oracle
BI Tools: Tableau, MicroStrategy, OLAP
Serving: Ad Serving, Data Serving, Content Serving, Low Latency NoSQL Stores, User Data Stores
Functional View
[Diagram: functional layers of the Yahoo! cloud, labeled IaaS, PaaS, Big Data, and Containers]
Shared Services: Elasticity Services, Deployment Automation, Registry/Name Service, Monitoring, Replication, Authentication/Authorization, Quota Management, PubSub Notifications
Edge: Proxy, Cache, Streaming
Memory: In-Memory Stores
Struct Storage: Key Value Stores, User Data Stores
UnStruct Storage: Blob Stores
Query Engine: Yahoo! Query Language
Data Process.: Hadoop, Real-time/Stream Processing
Fabric: Storage Virtualization, OS Virtualization, Network Virtualization
Data Center: Globally Distributed Data Centers and Edge PODs
Other labels: Cloud, Edge, Containers, Dynamic Serving, Serving Containers, Presentation Containers, Ranking and Indexing
Building and Deploying Services
Services behind Yahoo! Web, Social, and Mobile Apps:
• Run the application software (deploy web and application servers) in minutes
• Simplify application/business logic with containers
• Automate deployment and management
• Move messages reliably among distributed application components
• Automate workflows and improve developer productivity
• Coordinate distributed applications, manage synchronization, groups, etc.
• Store and serve resources and static content, user profile/data, and application datasets
• Stream on-demand and live content
• Speed up page/content delivery to a geo-dispersed user base with edge technologies
• Serve dynamic content (personalize, rank, dynamically select)
• Collect and process Internet-scale data in acceptable time limits
• Manage data transfer, access, and lifecycle
Across all services: Auto Scaling, Fault Tolerance, High Availability, Monitoring, Security
No Different Than a Public Cloud…
Source: http://aws.amazon.com/
…Except Latency and SLA needs
“Applications are increasingly expecting real-time or near
real-time responses from cloud platforms”
© http://www.flickr.com/photos/7593077@N03/2650612822
© http://www.flickr.com/photos/51839911@N04/4810141449
• self-expression
• transient content
• unstructured data
What is Happening
Location
Social
Relationship
Science
Understanding
User Interests
[Word cloud: access, audience, blogs, communication, computer, internet, mass media, people, networking, technology]
© http://www.flickr.com/photos/camkage/3732012461/
Cutting Through The Noise
Big Data
© http://www.flickr.com/photos/gsfc
Turning Data Into Insights
Making it Relevant
© http://www.flickr.com/photos/ogimogi/
Hadoop
science + big data + insight =
personal relevance = VALUE
© http://www.flickr.com/photos/ddfic/
So, What is Hadoop?
A large-scale distributed data processing framework that is
• Scalable
• Economical
• Efficient
• Reliable
• Easy to Use, and
• Open Source
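As a rough illustration of the processing model behind this definition, the sketch below imitates the map, shuffle, and reduce phases of a MapReduce job (the deck's product stack lists MapReduce on YARN) inside a single Python process. It is not Hadoop's API: the function names and the sample click events are assumptions made up for the example, and a real job would spread each phase across many nodes.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (event_type, 1) pair per input record."""
    for _user, event_type in records:
        yield event_type, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sample input: (user, event type) pairs.
events = [
    ("u1", "page_view"),
    ("u2", "ad_click"),
    ("u1", "page_view"),
    ("u3", "link_click"),
]

counts = reduce_phase(shuffle(map_phase(events)))
print(counts)
```

Because each mapper call and each reducer key is independent of the others, the same logic can be split across machines, which is what makes the model scale.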
Hadoop at Yahoo!
[Chart: number of nodes (axis 0 to 45,000) and raw HDFS storage in PB (axis 0 to 400) by year, 2006 to 2012]
Milestones annotated on the chart:
• Yahoo! commits to scaling Hadoop for production use
• Research workloads in Search and Advertising
• Production (modeling) with machine learning & WebMap
• Revenue systems with security, multi-tenancy, and SLAs
• Increased user base with partitioned namespaces
• Open sourced with Apache
• Hortonworks spinoff for enterprise hardening
• Current team with Y! focus
Behind every click
Hadoop analyzes 100 billion advertising and page events every day from Yahoo!’s 700+ million users
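To put the headline figure in perspective, the short calculation below converts it into per-second and per-user rates. It uses only the two numbers on the slide; because the user count is given as "700+ million", the per-user result is an upper bound, and the variable names are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 100_000_000_000   # 100 billion events, from the slide
users = 700_000_000                # "700+ million" users, so a lower bound

events_per_second = events_per_day / SECONDS_PER_DAY
events_per_user_per_day = events_per_day / users  # upper bound per user

print(f"{events_per_second:,.0f} events per second on average")
print(f"at most {events_per_user_per_day:.0f} events per user per day")
```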
Yahoo! is at the Frontier of Hadoop Scale
165 PB
3 PB
• 42,000 servers; 4,000 in a single cluster
• 165 PB of data, 350+ PB of HDFS
• 10 million slot-hours of compute time
• 500+ daily users with 360,000 jobs
Personalization (visualize.yahoo.com)
Home Page Personalization
Yahoo! Homepage Real-Time Data Flow
Real-time data flow delivering personalized
homepages to Yahoo! users
[Diagram: real-time data flow]
Components: Science Hadoop Cluster, Production Hadoop Cluster, Serving Systems
Data exchanged: User Behavior, Categorization Models, Serving Maps, User Feedback, Engaged Users
Production service: updated every 5 minutes
Categorization service: updated weekly
• Hadoop infers user interests; users now click more often
  o Hadoop powers content categorization for user profiles
  o User profiles are derived from user activity
  o Profile insights are used to serve relevant content
• Hadoop abstractions enable building machine learning models faster (months to days)
  o Hadoop, HBase, Storm, and Hive are used for modeling and analytics
• Hadoop enables frequent content refreshes, improving user engagement (hours to minutes)
  o 40 models pushed at 5-minute and 30-minute intervals
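The idea that profiles are derived from activity and then used to pick relevant content can be sketched in a few lines. This is only a toy under stated assumptions, not Yahoo!'s method: the category labels, the `build_profile` and `rank_content` helpers, and the simple frequency-based weighting are all hypothetical.

```python
from collections import Counter

def build_profile(activity):
    """Turn a list of clicked content categories into normalized interest weights."""
    counts = Counter(activity)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def rank_content(profile, candidates):
    """Order candidate items by the user's weight for each item's category."""
    return sorted(
        candidates,
        key=lambda item: profile.get(item["category"], 0.0),
        reverse=True,
    )

# Hypothetical click history and candidate articles.
activity = ["sports", "finance", "sports", "sports", "news"]
profile = build_profile(activity)

candidates = [
    {"id": "a1", "category": "news"},
    {"id": "a2", "category": "sports"},
    {"id": "a3", "category": "travel"},
]
ranked_ids = [item["id"] for item in rank_content(profile, candidates)]
print(ranked_ids)
```

In the deck's description the categorization models and the serving maps are refreshed on separate schedules; the sketch collapses both into one in-memory step.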
Mail Anti-Spam (visualize.yahoo.com)
“SpamGuard in conjunction with Hadoop has reduced spam by 60%”
• Hadoop powers data mining algorithms that adapt quickly to new spam techniques
• These are used to identify spam and spammers
  o Hadoop helps Yahoo! block 20.5 billion spam emails and deliver 5.6 billion emails a day across 300 million mailboxes
  o Hadoop's scalability allows us to detect and respond to new spam algorithms within hours
  o Hadoop also allows our scientists to detect new spam patterns in this huge sea of data
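The mail volumes on this slide imply a spam share that is easy to work out. The calculation below assumes that both the blocked and the delivered figures are daily totals, which the slide's wording suggests but does not state outright.

```python
blocked = 20.5e9      # spam emails blocked (assumed per day)
delivered = 5.6e9     # emails delivered per day
mailboxes = 300e6     # mailboxes served

spam_share = blocked / (blocked + delivered)
delivered_per_mailbox = delivered / mailboxes

print(f"{spam_share:.1%} of inbound mail is blocked as spam")
print(f"{delivered_per_mailbox:.1f} emails delivered per mailbox per day")
```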
Many Other Use Cases at Yahoo!...
AD TARGETING: 3x improvement in the accuracy of ad placements by targeting billions of impressions per day
CONTENT AGILITY: Single, grid-based, highly scalable CMS as the 'source of truth', reducing the time to launch new sites from quarters to weeks
AD OPTIMIZATION: Actionable data and insights for better, faster decisions on ad supply forecast and serving plan against over a million contracts
DATA PIPELINES: Over 100B events (35 TB a day) aggregated and processed for user engagement data, enabling downstream analytics
SEARCH ASSIST: Over a billion web pages processed to create the output list of related words for an improved search experience
MEMBERSHIP ANTI-ABUSE: Filter out over 25% of 2M+ new registrations every day as abusive with 95+% confidence
Registration Success Rates Today
Source: Average global statistics across all Intl, Source, and Locale from RUSS, Nov 1 ’11 – Nov 30 ’11 (LTM available, but not very different from monthly average stats)
Average Daily Successful Regs (1M/day)
Web: 875,906 (88% of 1M)
Mobile: 11,611 (1%)
Partners: 107,845 (11%)
Global Reg Success Rates
Web: avg total regs 2.22 M/day; avg good regs (1) 0.88 M/day; avg total sessions 2.82 M/day; avg good sessions 1.15 M/day; avg success rate 76%
Mobile: avg total regs 13.4 K/day; avg successful regs 11.6 K/day; avg success rate 85%
Partners: avg total regs 112.3 K/day; avg successful regs 107.8 K/day; avg success rate 96%
(1) May have abuse that we do not catch today
Flows: Registration, Login, Account Recovery
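The channel percentages can be rechecked from the daily counts themselves. The snippet below recomputes each channel's share of successful registrations from the three figures on the slide; the dictionary layout and variable names are just a convenience for the example.

```python
# Average daily successful registrations per channel, from the slide.
successful_regs = {"Web": 875_906, "Mobile": 11_611, "Partners": 107_845}

total = sum(successful_regs.values())
shares = {
    channel: round(100 * count / total, 1)
    for channel, count in successful_regs.items()
}

print(f"total successful regs per day: {total:,}")
for channel, pct in shares.items():
    print(f"{channel}: {pct}%")
```

The total comes to just under the rounded "1M/day" quoted on the slide, and the shares round to the 88%, 1%, and 11% shown there.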
Lego Modules
Content Agility
[Diagram: data pipeline]
Events collected: Page Views, Link Views, Link Clicks, Ad Views, Ad Clicks, Non Web Events
Pipeline stages: Web Servers, Data Highway, Filers, FETL Pipeline, UDA Pipeline for aggregation
Uses: Analysis, Optimization, Targeting, Research
[Diagram: front-end and back-end colo topology]
FrontEnd Colos: AC4, NE1, SP2, SK1, SP1, CH1, SG1
Components labeled: DFE Colo, CM, DC, FrontEnd Colo Servers, Coord, Sched, Compute Node, Mnode, CFI, Lotus, Grid, Distributed File System (HDFS), Gateway, VIP, BE Colo, LOF Server, RR Filetube, BackEnd Colo, Filer, MSFT AdCenter, Web servers, http edge proxy, DAQ, HDFS Proxy, Ad-hoc Customer, Customer/Internal & External Reporting Tools
Product Stack
Layers: Hadoop Services, Hadoop Compute, Hadoop Storage, Infrastructure Services
Components: Hive, Pig, Oozie, HDFS/HTTP Proxy, GDM, YARN, MapReduce on YARN, Low Latency Processing, HDFS, HBase, HCatalog, Zookeeper, Support Shop, Monitoring, Starling, Messaging Service
Data Processing is Evolving
1968: Hierarchical Database
1970: Relational Database; IBM System R
1983: Data Warehouse; IBM DB2
1990s: Microsoft SQL Server; Oracle Database
2003: IBM “System S”
2004: Google MapReduce
2006: Hadoop
2009: IBM Streams
2010: Microsoft StreamInsight
2011: Twitter Storm
2012: Berkeley Spark
Eras: OLTP and Operational Databases; OLAP and Data Warehouses; Big Data; Real-time Analytic Processing
Yahoo! is Pioneering New Grounds for Hadoop
• 32 million jobs run on YARN as of Mar 15
• 5.5 billion tasks using 16,500 compute years so far
• 180 years of compute used every day
First in the industry to move to next-gen Hadoop for production services
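The YARN totals above can be turned into a few averages. The calculation below assumes a 365-day year and treats a "compute year" as one task slot busy for a full year; both are interpretations for the sake of the example, since the slide does not define the unit.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year

jobs = 32_000_000             # jobs run on YARN as of Mar 15
tasks = 5_500_000_000         # tasks across those jobs
compute_years_total = 16_500  # cumulative compute used so far
compute_years_per_day = 180   # compute used each day

tasks_per_job = tasks / jobs
seconds_per_task = compute_years_total * SECONDS_PER_YEAR / tasks
# 180 compute-years consumed per calendar day means this many slots
# are busy around the clock, on average.
busy_slots = compute_years_per_day * 365

print(f"{tasks_per_job:.1f} tasks per job on average")
print(f"{seconds_per_task:.1f} seconds of compute per task on average")
print(f"about {busy_slots:,} task slots busy at any moment")
```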
© http://www.flickr.com/photos/davidbygott/5835680474/
The Next Frontiers
• Converged Big Data and Low Latency processing
• Real-time queries, analytics, and reporting
• Resource Management
• Hardware
© http://www.flickr.com/photos/safaripartners/4839031518/
Thank You
Roadmap to Membership of RICS - Pathways and Routes
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 

SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!

  • 1. Big Data and The Cloud at Yahoo! Sumeet Singh Head of Products, Cloud Services & Hadoop April 2, 2013
  • 2. Cloud Mission: Build the next generation of personalized experiences with the fastest serving containers and services that integrate easily and scale on-demand
  • 4. Infrastructure View. Stages: Data Collection; Asynchronous Data Processing; Synchronous Serving. Diagram labels: PC; TV; Tablet; Phone; Web Crawl; Social Graph; 3rd Party Content; Email; Edge Serving; User Data/Profile Services; Sensor Data (Push); Harvest Data (Pull); MySQL; MS SQL; Oracle; Hadoop Data Grid; Ad Serving; Data Serving; Content Serving; Low Latency NoSQL Stores; Tableau; OLAP; User Data Stores; BI Tools; Datamarts (Dimensional on Data); Data Highway; Feeds (Ad, Search, Audience); Web Crawl; Social Feeds; Content Feeds; Harvested Mail; Source of truth for data; MicroStrategy
  • 5. Functional View. Diagram labels: Shared Services; Elasticity Services; Deployment Automation; Registry/Name Service; Monitoring; Replication; Authentication/Authorization; Quota Management; PubSub Notifications; Edge; Proxy; Cache; Streaming; Memory; Cloud; Edge Containers; Dynamic Serving; Struct Storage; Key Value Stores; User Data Stores; UnStruct Storage; Blob Stores; Fabric; Storage Virtualization; OS Virtualization; Network Virtualization; Serving Containers; Ranking and Indexing; Data Center; Globally Distributed Data Centers and Edge PODs; In-Memory Stores; Query Engine; Yahoo! Query Language; Data Process.; Hadoop; Real-time/Stream Processing; Presentation Containers; IaaS; PaaS; Big Data Containers
  • 6. Building and Deploying Services. Twelve numbered capabilities surround "Yahoo! Web, Social, and Mobile Apps": Run the application software (deploy web and application servers) in minutes; Simplify application/business logic with containers; Automate deployment and management; Move messages reliably among distributed application components; Automate workflows and improve developer productivity; Coordinate distributed applications, manage synchronization, groups etc.; Store and serve resources and static content, user profile/data and application datasets; Stream on-demand and live content; Speed up page/content delivery to geo dispersed user-base with edge technologies; Serve dynamic content (personalize, rank, dynamically select); Collect and process Internet scale data in acceptable time limits; Manage data transfer, access, and lifecycle. Additional labels: Auto Scaling; Fault Tolerance; High Availability; Monitoring; Security
  • 7. No Different Than a Public Cloud… (Source: http://aws.amazon.com/)
  • 8. …Except Latency and SLA needs: “Applications are increasingly expecting real-time or near real-time responses from cloud platforms” (© http://www.flickr.com/photos/7593077@N03/2650612822)
  • 9. What is Happening: self-expression; transient content; unstructured data (© http://www.flickr.com/photos/51839911@N04/4810141449)
  • 10. Cutting Through The Noise. Labels: Location Social Relationship Science Understanding User Interests. Word cloud: access audience blogs communication computer internet mass media people networking technology (© http://www.flickr.com/photos/camkage/3732012461/)
  • 11. Big Data (© http://www.flickr.com/photos/gsfc)
  • 12. Turning Data Into Insights
  • 13. Making it Relevant (© http://www.flickr.com/photos/ogimogi/)
  • 14. Hadoop: science + big data + insight = personal relevance = VALUE (© http://www.flickr.com/photos/ddfic/)
  • 15. So, What is Hadoop: Large scale distributed data processing framework that is Scalable, Economical, Efficient, Reliable, Easy to Use, and Open Source
  • 16. Hadoop at Yahoo! Behind every click. [Chart: Number of Nodes (axis 0 to 45,000) and Raw HDFS Storage in PB (axis 0 to 400) by Year, 2006 to 2012; series: Nodes, HDFS.] Milestone labels: Yahoo! Commits to Scaling Hadoop for Production Use; Research Workloads in Search and Advertising; Production (Modeling) with machine learning & WebMap; Revenue Systems with Security, Multi-tenancy, and SLAs; Increased User-base with partitioned namespaces; Open Sourced with Apache; Hortonworks Spinoff for Enterprise hardening; Current Team with Y! focus
  • 17. Yahoo! is at the Frontier of Hadoop Scale: Hadoop analyzes 100 billion advertising and page events every day from Yahoo!’s 700+ million users. Graphic labels: 165 PB; 3 PB. 42,000 servers, 4,000 in a single cluster; 165 PB of data, 350+ PB of HDFS; 10 million slot-hours of compute time; 500+ daily users with 360,000 jobs
  • 19. Home Page Personalization. Yahoo! Homepage Real-Time Data Flow: real-time data flow delivering personalized homepages to Yahoo! users. Diagram labels: SCIENCE HADOOP CLUSTER; SERVING SYSTEMS; PRODUCTION HADOOP CLUSTER; USER BEHAVIOR; ENGAGED USERS; CATEGORIZATION MODELS; SERVING MAPS; USER FEEDBACK. Production service: updated every 5 minutes. Categorization service: updated weekly. Hadoop infers user interest, and users now click more often: Hadoop powers content categorization for user profiles; user profiles are derived from user activity; profile insights are used to serve relevant content. Hadoop abstractions enable building machine learning models faster (months to days): Hadoop, HBase, Storm, and Hive are used for modeling and analytics. Hadoop enables frequent content refreshes, improving user engagement (hours to minutes): 40 models are pushed at 5 minute and 30 minute intervals
  • 20. Mail Anti-Spam (visualize.yahoo.com): “SpamGuard in conjunction with Hadoop has reduced spam by 60%”. Hadoop powers data mining algorithms that adapt quickly to new spam techniques. These are used to identify spam and spammers: Hadoop helps Yahoo! block 20.5 billion spam emails and deliver 5.6 billion emails a day across 300 million mailboxes; Hadoop scalability allows us to detect and respond to new spam algorithms within hours; Hadoop also allows our scientists to detect new spam patterns in this huge sea of data
  • 21. Many Other Use Cases at Yahoo!... AD TARGETING: 3x improvement in the accuracy of ad placements by targeting billions of impressions per day. CONTENT AGILITY: Single, grid-based, highly scalable CMS as 'source of truth' for reducing time to launch new sites from quarters to weeks. AD OPTIMIZATION: Actionable data and insights for better, faster decisions on ad supply forecast and serving plan against over a million contracts. DATA PIPELINES: Over 100B events (35TB a day) aggregated and processed for user engagement data enabling downstream analytics. SEARCH ASSIST: Over a billion web pages processed to create the output list of related words for improved search experience. MEMBERSHIP ANTI-ABUSE: Filter out over 25% of 2M+ new registrations every day as abusive with 95+% confidence. Inset slide, Registration Success Rates Today (6/7/2012): Average Daily Successful Regs (1M/day): Web 875,906 (88% of 1M); Mobile 11,611 (1%); Partners 107,845 (11%). Global Reg Success Rates: Web: avg total regs 2.22 M/day, avg good regs (1) 0.88 M/day, avg total sessions 2.82 M/day, avg good sessions 1.15 M/day, avg success rate 76%; Mobile: avg total regs 13.4 K/day, avg successful regs 11.6 K/day, avg success rate 85%; Partners: avg total regs 112.3 K/day, avg successful regs 107.8 K/day, avg success rate 96%. (1) May have abuse that we do not catch today. Source: Average global statistics across all Intl, Source, and Locale from RUSS, Nov 1 ’11 to Nov 30 ’11 (LTM available, but not very different from monthly average stats). Labels: Registration; Login; Account Recovery. Inset slide, Content Agility: Lego Modules. Inset slide, data pipeline labels: Page Views; Link Views; Link Clicks; Ad Views; Ad Clicks; Non Web Events; Web Servers; Data Highway; Filers; FETL Pipeline; UDA Pipeline for aggregation; Analysis; Optimization; Targeting; Research. Inset slide, diagram labels: DFEColo; CM; DC; FrontEnd Colo Servers; FrontEnd Colos: AC4, NE1, SP2, SK1, SP1, CH1, SG1; Coord; Sched; Compute Node; Mnode; CFI; Lotus Grid; Distributed File System (HDFS); Customer, Internal & External Reporting Tools; Gateway VIP; BEColo; LOF Server; RR; Filetube; BackEnd Colo; Filer; MSFT AdCenter; Web servers; http edge proxy; DAQ; HDFS Proxy; Ad-hoc Customer
  • 22. Product Stack. Layers: Hadoop Compute; Hadoop Services; Hadoop Storage; Infrastructure Services. Components: Hive; PIG; OOZIE; HDFS/HTTP Proxy; GDM; YARN; MapReduce on YARN; HDFS; HBase; Zookeeper; Support Shop; Monitoring; Starling; Messaging Service; HCatalog; Low Latency Processing
  • 23. Data Processing is Evolving. Timeline labels, in slide order: 1968 Hierarchical Database; 1970 Relational Database; IBM System R; 1983 Data Warehouse; IBM DB2; 1990s Microsoft SQL Server; Oracle Database; 2003 IBM “System S”; 2010 Microsoft Stream Insight; 2011 Twitter Storm; 2012 Berkeley Spark; 2009 IBM Streams; 2006 Hadoop; 2004 Google Map Reduce. Eras: OLTP and Operational Databases; OLAP and Data Warehouses; Big Data; Real-time Analytic Processing
  • 24. Yahoo! is Pioneering New Grounds for Hadoop: 32 million jobs on YARN as of Mar 15; 5.5 billion tasks using 16,500 compute years so far; 180 years of compute used every day. First in the industry to move to nextgen Hadoop for production services (© http://www.flickr.com/photos/davidbygott/5835680474/)
  • 25. The Next Frontiers: Converged Big Data and Low Latency processing; Real-time queries, analytics, and reporting; Resource Management; Hardware (© http://www.flickr.com/photos/safaripartners/4839031518/)