hadoop @ Ibmbigdata

YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!

Eric Baldeschwieler
VP, Hadoop Software

AGENDA

•  Brief Overview
•  Hadoop @ Yahoo!
•  Hadoop Momentum
•  The Future of Hadoop

WHAT’S HAPPENING

- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical

Flickr : sub_lime79

TURNING DATA
INTO INSIGHTS

•  machine learning
•  logistic regression
•  time series
•  content clustering
•  algorithms
•  ad inventory modeling
•  user interest prediction
•  factorization models
Flickr : NASA Goddard Photo and Video

MAKING YAHOO
RELEVANT

Flickr : ogimogi

HADOOP: POWERING YAHOO!
science + big data + insight = personal relevance = VALUE

Flickr : DDFic

WHAT IS HADOOP?
Stack (top to bottom)
•  Programming Languages: Pig, Hive
•  Computation: MapReduce
•  Storage: HDFS

Commodity
•  Computers
•  Network

Focus on
•  Simplicity
•  Redundancy
•  Scale
•  Availability

Transforms commodity equipment into a service that:
•  HDFS – Stores petabytes of data reliably
•  MapReduce – Allows huge distributed computations

Key Attributes
•  Redundant and reliable – Doesn’t stop or lose data even as hardware fails
•  Easy to program – Our rocket scientists use it directly!
•  Very powerful – Allows the development of big data algorithms & tools
•  Batch processing centric
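
To make the MapReduce model above concrete, here is a minimal single-process sketch in plain Python of the classic word-count job. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. On a real cluster, the map and reduce steps would run in parallel across many machines, and the framework would handle the shuffle and any hardware failures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework's job on a cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is here", "Big Data at Yahoo"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be split across machines without coordination, which is what suits the model to batch processing on commodity hardware.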

WHAT HADOOP ISN’T

•  A replacement for relational and data warehouse systems
•  A transactional / online / serving system
•  A low latency or streaming solution

HADOOP IN THE ENTERPRISE
[Diagram labels]
•  Business Intelligence Applications
•  Hadoop Cluster(s)
•  RDMS, EDW, Data Marts
•  Business Applications: Interactions, Transactions, Structured Data
•  Web Logs, Server Logs, Social Media, etc…: Semi-Structured or Un-Structured Data

HADOOP @ YAHOO!
“Where Science meets Data”

PRODUCTS
•  Data Analytics
•  Content Optimization
•  Content Enrichment
•  Yahoo! Mail Anti-Spam
•  Advertising Products
•  Ad Optimization
•  Ad Selection
•  Big Data Processing & ETL

HADOOP CLUSTERS
•  Tens of thousands of servers
•  10s of Petabytes

APPLIED SCIENCE
•  User Interest Prediction
•  Ad inventory prediction
•  Machine learning - search ranking
•  Machine learning - ad targeting
•  Machine learning - spam filtering

FROM PROJECT TO
CORE PLATFORM
•  40K+ Servers
•  170 PB Storage
•  5M+ Monthly Jobs

[Chart: Thousands of Servers and Petabytes by year, 2006 to 2010. Phase labels: Research, Science Impact, Daily Production, “Behind every click”]

HADOOP POWERS THE
YAHOO! NETWORK

•  advertising optimization
•  data analytics
•  machine learning
•  search ranking
•  advertising data systems
•  Yahoo! Mail anti-spam
•  audience, ad and search pipelines
•  ad selection
•  Yahoo! Homepage Content Optimization
•  ad inventory prediction
•  user interest prediction

CASE STUDY
YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement

•  Recommended links: +79% clicks vs. randomly selected
•  News Interests: +160% clicks vs. one size fits all
•  Top Searches: +43% clicks vs. editor selected

CASE STUDY
YAHOO! HOMEPAGE

•  Serving Maps: Users - Interests
•  Five Minute Production
•  Weekly Categorization models

SCIENCE HADOOP CLUSTER
»  Machine learning to build ever better categorization models

PRODUCTION HADOOP CLUSTER
»  Identify user interests using Categorization models

SERVING SYSTEMS
»  Build customized home pages with latest data (thousands / second)

[Diagram flow: USER BEHAVIOR -> SCIENCE HADOOP CLUSTER -> CATEGORIZATION MODELS (weekly) -> PRODUCTION HADOOP CLUSTER -> SERVING MAPS (every 5 minutes) -> SERVING SYSTEMS -> ENGAGED USERS]
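
The three-stage flow on this slide (weekly categorization models, five-minute serving maps, per-request serving) can be pictured with a toy sketch in plain Python. Everything here is a hypothetical stand-in: the function names, the keyword-vote "model", and the sample data are assumptions for illustration, not Yahoo!'s actual models or code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def train_categorization_model(labeled_pages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Weekly step (science cluster): map each keyword to its most frequent category."""
    votes: Dict[str, Counter] = {}
    for text, category in labeled_pages:
        for word in text.lower().split():
            votes.setdefault(word, Counter())[category] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}


def build_serving_map(model: Dict[str, str],
                      recent_events: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Five-minute step (production cluster): map each user to a top interest."""
    interests: Dict[str, Counter] = {}
    for user, text in recent_events:
        for word in text.lower().split():
            if word in model:
                interests.setdefault(user, Counter())[model[word]] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in interests.items()}


def serve_homepage(serving_map: Dict[str, str], user: str,
                   default: str = "top stories") -> str:
    """Serving step: a cheap lookup per request, with a fallback for unknown users."""
    return serving_map.get(user, default)


if __name__ == "__main__":
    model = train_categorization_model([
        ("football scores tonight", "sports"),
        ("election results tonight", "politics"),
        ("football transfer news", "sports"),
    ])
    serving_map = build_serving_map(model, [
        ("alice", "football scores"),
        ("bob", "election results"),
    ])
    print(serve_homepage(serving_map, "alice"))
    print(serve_homepage(serving_map, "carol"))
```

The point of the split is that the expensive work (model training, interest scoring) stays in batch jobs, while the serving path is only a dictionary lookup, which is how a batch system can still feed thousands of page builds per second.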

CASE STUDY
YAHOO! MAIL

Enabling quick response in the spam arms race

•  450M mail boxes
•  5B+ deliveries/day

SCIENCE
•  Antispam models retrained every few hours on Hadoop

PRODUCTION
•  40% less spam than Hotmail and 55% less spam than Gmail

YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
•  Yahoo! benefits from open source ecosystem around Hadoop
•  Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
•  We invest heavily in core Hadoop development
•  We focus on scalability, reliability, availability

We fix bugs before you see them
•  We run very large clusters
•  We have a large QA effort
•  We run a huge variety of workloads

We are good Apache Hadoop citizens
•  We contribute our work to Apache
•  We share the exact code we run

HADOOP IS GOING
MAINSTREAM

[Chart covering 2007 to 2010. Source: The Datagraph Blog]

THE PLATFORM EFFECT
BIRTH OF AN ECOSYSTEM

Apache Hadoop
•  Yahoo! and other Early Adopters: Scale and productize Hadoop
•  Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
•  Service Providers: Grow ecosystem - Training, support, enhancements
•  Mainstream / Enterprise adoption: Drive further development, enhancements

Enhance Hadoop Ecosystem

Virtuous Circle!
•  Investment -> Adoption
•  Adoption -> Investment

THE FUTURE OF
HADOOP


MAKING HADOOP ENTERPRISE-READY
WHAT’S NEXT
Hadoop is far from “done”
•  Current implementation is showing its age
•  Need to address several deficiencies in scalability, flexibility, ease of use & performance

Yahoo! is working on Next Generation of Hadoop
•  MapReduce: Rewrite to improve performance; pluggable support for new programming models
•  HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
•  Yahoo! contributes all Hadoop changes back to Apache Hadoop
•  Everyone benefits from shared neutral foundation
