Spotting Hadoop in the wild
1. Spotting Hadoop in the wild
Practical use cases from Last.fm and Massive Media
@klbostee
Thursday 12 January 12
2. • “Data scientist is a job title for an employee who analyses data,
particularly large amounts of it, to help a business gain a competitive
edge” —WhatIs.com
• “Someone who can obtain, scrub, explore, model and interpret data,
blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
4. • 2007: Started using Hadoop as PhD student
• 2009: Data & Scalability Engineer at Last.fm
• 2011: Data Scientist at Massive Media
• Created Dumbo, a Python API for Hadoop
• Contributed some code to Hadoop itself
• Organized several HUGUK meetups
6. Core principles
• Distributed
• Fault tolerant
• Sequential reads and writes
• Data locality
7. Pars pro toto
Pig Hive
HBase
ZooKeeper
MapReduce
HDFS
Hadoop itself is basically the kernel that
provides a file system and task scheduler
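The programming model that this kernel schedules can be made concrete with a toy MapReduce run. The sketch below simulates the map, shuffle (sort and group by key) and reduce phases in plain Python, using a word count as the example; it is an illustration of the model, not the Hadoop or Dumbo API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Emit (word, 1) for every word in a line of input.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)

def run_local(records):
    """Run mapper/reducer over (key, value) records on one machine,
    simulating the shuffle phase with an in-memory sort."""
    mapped = [kv for k, v in records for kv in mapper(k, v)]
    mapped.sort(key=itemgetter(0))  # the "shuffle": sort map output by key
    result = {}
    for k, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_val in reducer(k, (v for _, v in group)):
            result[out_key] = out_val
    return result
```

On a real cluster the same mapper and reducer run distributed over HDFS blocks; only the shuffle implementation differs.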
15. Hadoop task scheduler
[Diagram: the tasks of Job A, then of Jobs A and B together, scheduled
across three nodes that each run a TaskTracker alongside a DataNode, so
tasks can run on the node that stores their input data]
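The point of the scheduler diagram, that TaskTrackers sit next to DataNodes so work can be sent to the data, can be sketched as a greedy locality-aware assignment. Everything here (node names, the block-location map, slot counts) is hypothetical, and real Hadoop scheduling is considerably more involved:

```python
def schedule(tasks, block_locations, free_slots):
    # tasks: [(task_id, input_block)]; block_locations: {block: [nodes]};
    # free_slots: {node: free task slots} (mutated as slots are used up).
    assignment = {}
    for task, block in tasks:
        # Prefer a node that already stores the task's input block...
        local = [n for n in block_locations.get(block, ())
                 if free_slots.get(n, 0) > 0]
        # ...and fall back to any node with a free slot.
        candidates = local or [n for n, s in free_slots.items() if s > 0]
        if not candidates:
            break  # cluster is full; remaining tasks wait
        node = candidates[0]
        free_slots[node] -= 1
        assignment[task] = node
    return assignment
```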
16. Some practical tips
• Install a distribution
• Use compression
• Consider increasing your block size
• Watch out for small files
17. HBase
Pig Hive
HBase
ZooKeeper
MapReduce
HDFS
HBase is a database on top of HDFS that
can easily be accessed from MapReduce
20. Data model
               Column family A        Column family B
  Row keys     Column X   Column Y    Column U   Column V
  (sorted)
  ...          ...        ...         ...        ...
• Configurable number of versions per cell
• Each cell version has a timestamp
• TTL can be specified per column family
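The data model bullets (versions per cell, timestamps, per-family TTL) can be modeled in a few lines. This is a toy in-memory stand-in for illustration, not the HBase client API; the family settings dictionary is an assumption:

```python
import time
from collections import defaultdict

class Table:
    """Toy model of the HBase data model: rows -> (family, column) ->
    timestamped cell versions, newest first. max_versions and ttl
    (seconds) are configured per column family, as in HBase."""
    def __init__(self, families):
        # families: {family_name: {"max_versions": int, "ttl": seconds or None}}
        self.families = families
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, family, column, value, ts=None):
        cells = self.rows[row][(family, column)]
        cells.append((ts if ts is not None else time.time(), value))
        cells.sort(reverse=True)  # newest version first
        del cells[self.families[family]["max_versions"]:]  # trim old versions

    def get(self, row, family, column, now=None):
        now = now if now is not None else time.time()
        ttl = self.families[family].get("ttl")
        for ts, value in self.rows[row].get((family, column), []):
            if ttl is None or now - ts <= ttl:
                return value  # newest version that has not expired
        return None
```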
21. Random becomes sequential
[Diagram: each write is appended to a commit log and added to a sorted
in-memory memstore, which is flushed to HDFS as a sorted file of
KeyValues, turning random writes into sequential I/O]
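A minimal sketch of that write path: appends go to a sequential commit log plus a sorted in-memory memstore, which is flushed as an immutable sorted file, so the disk only ever sees sequential writes. Class and attribute names are invented for illustration; real HBase flushing and compaction are far more sophisticated:

```python
class Store:
    """Toy log-structured store in the spirit of an HBase region."""
    def __init__(self, flush_size=2):
        self.commit_log = []   # sequential append, for crash recovery
        self.memstore = {}     # in-memory buffer, sorted only on flush
        self.files = []        # immutable sorted files, standing in for HDFS
        self.flush_size = flush_size

    def put(self, key, value):
        self.commit_log.append((key, value))  # sequential write
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        # One sequential write of sorted KeyValues per flush.
        self.files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.commit_log.clear()  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for f in reversed(self.files):  # newest file first
            for k, v in f:
                if k == key:
                    return v
        return None
```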
32. Horizontal scaling
[Diagram: the sorted row key space is split into regions, which are
spread across multiple RegionServers]
• Each region has its own commit log and memstores
• Moving regions is easy since the data is all in HDFS
• Strong consistency as each region is served only once
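Because the row key space is sorted and split at known boundaries, routing a key to its region is just a binary search over the split keys. The sketch below assumes a static region map with hypothetical server names; in reality clients discover region locations dynamically:

```python
import bisect

class RegionMap:
    """Toy region lookup over a sorted, range-partitioned key space."""
    def __init__(self, split_keys, servers):
        # split_keys: sorted start keys of regions 1..n (region 0 starts at "").
        # servers: one RegionServer name per region, len(split_keys) + 1 entries.
        self.split_keys = split_keys
        self.servers = servers

    def server_for(self, row_key):
        # bisect finds the region whose [start, next_start) range holds the key.
        region = bisect.bisect_right(self.split_keys, row_key)
        return self.servers[region]
```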
33. Some practical tips
• Restrict the number of regions per server
• Restrict the number of column families
• Use compression
• Increase file descriptor limits on nodes
• Use a large enough buffer when scanning
35. • “Last.fm lets you effortlessly keep a record of what you listen to
from any player. Based on your taste, Last.fm recommends you more music
and concerts” —Last.fm
• Over 60 billion tracks scrobbled since 2003
• Started using Hadoop in 2006, before Yahoo
36. • “Massive Media is the social media company behind the successful
digital brands Netlog.com and Twoo.com. We enable members to meet
nearby people instantly” —MassiveMedia.eu
• Over 80 million users on web and mobile
• Using Hadoop for about a year now
37. Hadoop adoption
1. Business intelligence
2. Testing and experimentation
3. Fraud and abuse detection
4. Product features
5. PR and marketing
38. Hadoop adoption
                                 Last.fm
1. Business intelligence            ✓
2. Testing and experimentation      ✓
3. Fraud and abuse detection        ✓
4. Product features                 ✓
5. PR and marketing                 ✓
39. Hadoop adoption
                                 Last.fm   Massive Media
1. Business intelligence            ✓            ✓
2. Testing and experimentation      ✓            ✓
3. Fraud and abuse detection        ✓            ✓
4. Product features                 ✓            ✓
5. PR and marketing                 ✓
48. Goals and requirements
• Timeseries graphs of 1000 or so metrics
• Segmented over about 10 dimensions
1. Scale with very large number of events
2. History for graphs must be long enough
3. Accessing the graphs must be instantaneous
4. Possibility to analyse in detail when needed
50. Attempt #1
• Log table in MySQL
• Generate graphs from this table on-the-fly
1. Large number of events ✓
2. Long enough history ✗
3. Instantaneous access ✗
4. Analyse in detail ✓
52. Attempt #2
• Counters in MySQL table
• Update counters on every event
1. Large number of events ✗
2. Long enough history ✓
3. Instantaneous access ✓
4. Analyse in detail ✗
54. Attempt #3
• Put log files in HDFS through syslog-ng
• MapReduce on logs and write to HBase
1. Large number of events √
2. Long enough history √
3. Instantaneous access √
4. Analyse in detail √
55. Architecture
Syslog-ng → HDFS → MapReduce → HBase
58. HBase schema
• Separate table for each time granularity
• Global segmentations in row keys
• <language>||<country>||...|||<timestamp>
• * for “not specified”
• trailing *s are omitted
• Further segmentations in column keys
• e.g. payments_via_paypal, payments_via_sms
• Related metrics in same column family
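The row key convention above can be sketched as a small builder. The segment names used here (language, country) and their order are illustrative only, and the “...” in the slide stands for further global segmentations that are not spelled out:

```python
def row_key(timestamp, **segments):
    """Build a row key as on the slide: global segmentation values joined
    with '||', '|||' before the timestamp, '*' for "not specified", and
    trailing '*'s omitted."""
    order = ["language", "country"]  # illustrative global segmentations
    values = [segments.get(name) or "*" for name in order]
    while values and values[-1] == "*":
        values.pop()  # trailing *s are omitted
    return "||".join(values) + "|||" + timestamp
```

Omitting trailing wildcards keeps the most common "everything unspecified" rows short, while the sorted key space still groups all rows for one segmentation prefix together for scanning.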