Spotting Hadoop in the wild
1. Spotting Hadoop in the wild
Practical use cases from Last.fm and Massive Media
@klbostee
Thursday 12 January 12
2. • “Data scientist is a job title for an employee who analyses data,
particularly large amounts of it, to help a business gain a competitive
edge” —WhatIs.com
• “Someone who can obtain, scrub, explore, model and interpret data,
blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
4. • 2007: Started using Hadoop as PhD student
• 2009: Data & Scalability Engineer at Last.fm
• 2011: Data Scientist at Massive Media
• Created Dumbo, a Python API for Hadoop
• Contributed some code to Hadoop itself
• Organized several HUGUK meetups
6. Core principles
• Distributed
• Fault tolerant
• Sequential reads and writes
• Data locality
7. Pars pro toto
Pig Hive
HBase
ZooKeeper
MapReduce
HDFS
Hadoop itself is basically the kernel that
provides a file system and task scheduler
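The programming model that this kernel schedules can be made concrete with a toy MapReduce run. The sketch below simulates the map, shuffle (sort and group by key) and reduce phases in plain Python, using a word count as the example; it is an illustration of the model, not the Hadoop or Dumbo API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Emit (word, 1) for every word in a line of input.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)

def run_local(records):
    """Run mapper/reducer over (key, value) records on one machine,
    simulating the shuffle phase with an in-memory sort."""
    mapped = [kv for k, v in records for kv in mapper(k, v)]
    mapped.sort(key=itemgetter(0))  # the "shuffle": sort map output by key
    result = {}
    for k, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_val in reducer(k, (v for _, v in group)):
            result[out_key] = out_val
    return result
```

On a real cluster the same mapper and reducer run distributed over HDFS blocks; only the shuffle implementation differs.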
15. Hadoop task scheduler
[Diagram: the tasks of Job A, then of Jobs A and B together, scheduled
across three nodes that each run a TaskTracker alongside a DataNode, so
tasks can run on the node that stores their input data]
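The point of the scheduler diagram, that TaskTrackers sit next to DataNodes so work can be sent to the data, can be sketched as a greedy locality-aware assignment. Everything here (node names, the block-location map, slot counts) is hypothetical, and real Hadoop scheduling is considerably more involved:

```python
def schedule(tasks, block_locations, free_slots):
    # tasks: [(task_id, input_block)]; block_locations: {block: [nodes]};
    # free_slots: {node: free task slots} (mutated as slots are used up).
    assignment = {}
    for task, block in tasks:
        # Prefer a node that already stores the task's input block...
        local = [n for n in block_locations.get(block, ())
                 if free_slots.get(n, 0) > 0]
        # ...and fall back to any node with a free slot.
        candidates = local or [n for n, s in free_slots.items() if s > 0]
        if not candidates:
            break  # cluster is full; remaining tasks wait
        node = candidates[0]
        free_slots[node] -= 1
        assignment[task] = node
    return assignment
```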
16. Some practical tips
• Install a distribution
• Use compression
• Consider increasing your block size
• Watch out for small files
17. HBase
Pig Hive
HBase
ZooKeeper
MapReduce
HDFS
HBase is a database on top of HDFS that
can easily be accessed from MapReduce
20. Data model
               Column family A        Column family B
  Row keys     Column X   Column Y    Column U   Column V
  (sorted)
  ...          ...        ...         ...        ...
• Configurable number of versions per cell
• Each cell version has a timestamp
• TTL can be specified per column family
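The data model bullets (versions per cell, timestamps, per-family TTL) can be modeled in a few lines. This is a toy in-memory stand-in for illustration, not the HBase client API; the family settings dictionary is an assumption:

```python
import time
from collections import defaultdict

class Table:
    """Toy model of the HBase data model: rows -> (family, column) ->
    timestamped cell versions, newest first. max_versions and ttl
    (seconds) are configured per column family, as in HBase."""
    def __init__(self, families):
        # families: {family_name: {"max_versions": int, "ttl": seconds or None}}
        self.families = families
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, family, column, value, ts=None):
        cells = self.rows[row][(family, column)]
        cells.append((ts if ts is not None else time.time(), value))
        cells.sort(reverse=True)  # newest version first
        del cells[self.families[family]["max_versions"]:]  # trim old versions

    def get(self, row, family, column, now=None):
        now = now if now is not None else time.time()
        ttl = self.families[family].get("ttl")
        for ts, value in self.rows[row].get((family, column), []):
            if ttl is None or now - ts <= ttl:
                return value  # newest version that has not expired
        return None
```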
21. Random becomes sequential
[Diagram: each write is appended to a commit log and added to a sorted
in-memory memstore, which is flushed to HDFS as a sorted file of
KeyValues, turning random writes into sequential I/O]
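A minimal sketch of that write path: appends go to a sequential commit log plus a sorted in-memory memstore, which is flushed as an immutable sorted file, so the disk only ever sees sequential writes. Class and attribute names are invented for illustration; real HBase flushing and compaction are far more sophisticated:

```python
class Store:
    """Toy log-structured store in the spirit of an HBase region."""
    def __init__(self, flush_size=2):
        self.commit_log = []   # sequential append, for crash recovery
        self.memstore = {}     # in-memory buffer, sorted only on flush
        self.files = []        # immutable sorted files, standing in for HDFS
        self.flush_size = flush_size

    def put(self, key, value):
        self.commit_log.append((key, value))  # sequential write
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        # One sequential write of sorted KeyValues per flush.
        self.files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.commit_log.clear()  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for f in reversed(self.files):  # newest file first
            for k, v in f:
                if k == key:
                    return v
        return None
```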
32. Horizontal scaling
[Diagram: the sorted row key space is split into regions, which are
spread across multiple RegionServers]
• Each region has its own commit log and memstores
• Moving regions is easy since the data is all in HDFS
• Strong consistency as each region is served only once
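Because the row key space is sorted and split at known boundaries, routing a key to its region is just a binary search over the split keys. The sketch below assumes a static region map with hypothetical server names; in reality clients discover region locations dynamically:

```python
import bisect

class RegionMap:
    """Toy region lookup over a sorted, range-partitioned key space."""
    def __init__(self, split_keys, servers):
        # split_keys: sorted start keys of regions 1..n (region 0 starts at "").
        # servers: one RegionServer name per region, len(split_keys) + 1 entries.
        self.split_keys = split_keys
        self.servers = servers

    def server_for(self, row_key):
        # bisect finds the region whose [start, next_start) range holds the key.
        region = bisect.bisect_right(self.split_keys, row_key)
        return self.servers[region]
```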
33. Some practical tips
• Restrict the number of regions per server
• Restrict the number of column families
• Use compression
• Increase file descriptor limits on nodes
• Use a large enough buffer when scanning
35. • “Last.fm lets you effortlessly keep a record of what you listen to
from any player. Based on your taste, Last.fm recommends you more music
and concerts” —Last.fm
• Over 60 billion tracks scrobbled since 2003
• Started using Hadoop in 2006, before Yahoo
36. • “Massive Media is the social media company behind the successful
digital brands Netlog.com and Twoo.com. We enable members to meet
nearby people instantly” —MassiveMedia.eu
• Over 80 million users on web and mobile
• Using Hadoop for about a year now
37. Hadoop adoption
1. Business intelligence
2. Testing and experimentation
3. Fraud and abuse detection
4. Product features
5. PR and marketing
38. Hadoop adoption
                                 Last.fm
1. Business intelligence            ✓
2. Testing and experimentation      ✓
3. Fraud and abuse detection        ✓
4. Product features                 ✓
5. PR and marketing                 ✓
39. Hadoop adoption
                                 Last.fm   Massive Media
1. Business intelligence            ✓            ✓
2. Testing and experimentation      ✓            ✓
3. Fraud and abuse detection        ✓            ✓
4. Product features                 ✓            ✓
5. PR and marketing                 ✓
48. Goals and requirements
• Timeseries graphs of 1000 or so metrics
• Segmented over about 10 dimensions
1. Scale with very large number of events
2. History for graphs must be long enough
3. Accessing the graphs must be instantaneous
4. Possibility to analyse in detail when needed
50. Attempt #1
• Log table in MySQL
• Generate graphs from this table on-the-fly
1. Large number of events ✓
2. Long enough history ✗
3. Instantaneous access ✗
4. Analyse in detail ✓
52. Attempt #2
• Counters in MySQL table
• Update counters on every event
1. Large number of events ✗
2. Long enough history ✓
3. Instantaneous access ✓
4. Analyse in detail ✗
54. Attempt #3
• Put log files in HDFS through syslog-ng
• MapReduce on logs and write to HBase
1. Large number of events √
2. Long enough history √
3. Instantaneous access √
4. Analyse in detail √
55. Architecture
Syslog-ng → HDFS → MapReduce → HBase
58. HBase schema
• Separate table for each time granularity
• Global segmentations in row keys
• <language>||<country>||...|||<timestamp>
• * for “not specified”
• trailing *s are omitted
• Further segmentations in column keys
• e.g. payments_via_paypal, payments_via_sms
• Related metrics in same column family
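The row key convention above can be sketched as a small builder. The segment names used here (language, country) and their order are illustrative only, and the “...” in the slide stands for further global segmentations that are not spelled out:

```python
def row_key(timestamp, **segments):
    """Build a row key as on the slide: global segmentation values joined
    with '||', '|||' before the timestamp, '*' for "not specified", and
    trailing '*'s omitted."""
    order = ["language", "country"]  # illustrative global segmentations
    values = [segments.get(name) or "*" for name in order]
    while values and values[-1] == "*":
        values.pop()  # trailing *s are omitted
    return "||".join(values) + "|||" + timestamp
```

Omitting trailing wildcards keeps the most common "everything unspecified" rows short, while the sorted key space still groups all rows for one segmentation prefix together for scanning.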