Spotting Hadoop in the wild
Practical Hadoop use cases from Last.fm and Massive Media



Presentation Transcript

• Spotting Hadoop in the wild: practical use cases from Last.fm and Massive Media (@klbostee)
• “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” (WhatIs.com). “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” (Hilary Mason, bit.ly)
• 2007: started using Hadoop as a PhD student. 2009: Data & Scalability Engineer at Last.fm. 2011: Data Scientist at Massive Media. Created Dumbo, a Python API for Hadoop; contributed some code to Hadoop itself; organized several HUGUK meetups.
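Dumbo wraps Hadoop Streaming so that jobs can be written as plain Python generators. The sketch below is the classic word-count shape as recalled from Dumbo's documentation; treat the exact API (dumbo.run and the generator signatures) as an assumption to verify against the version you install.

```python
def mapper(key, value):
    # 'value' is one line of input text; emit a (word, 1) pair per word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # 'values' iterates over all counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # Wire the mapper and reducer into a single streaming job
    dumbo.run(mapper, reducer)
```

A script like this would typically be launched with Dumbo's command-line runner, either against local files for testing or against a cluster.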
• What are those yellow things?
• Core principles: distributed, fault tolerant, sequential reads and writes, data locality
• Pars pro toto: the stack includes Pig, Hive, HBase, ZooKeeper, MapReduce and HDFS, but Hadoop itself is basically the kernel that provides a file system and a task scheduler
• Hadoop file system: files (File A, File B) are split into Hadoop blocks, which are much larger than ordinary Linux blocks, and those blocks are spread across DataNodes. No random writes!
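To make the append-only nature concrete, here is a small sketch that stages a log file into HDFS and streams it back through the standard hadoop fs shell commands; the file names and paths are made up for illustration.

```python
import subprocess

# Copy a local log file into HDFS: the data is written once, sequentially.
subprocess.check_call(["hadoop", "fs", "-put", "events.log", "/logs/2012-01-12/events.log"])

# Reading is a sequential scan over large blocks; there is no call for
# overwriting a byte range in the middle of an existing HDFS file.
data = subprocess.check_output(["hadoop", "fs", "-cat", "/logs/2012-01-12/events.log"])
print(len(data), "bytes read back")
```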
• Hadoop task scheduler: jobs (Job A, Job B) are split into tasks that run on TaskTrackers, which sit alongside the DataNodes so that tasks can be scheduled on the nodes that already hold the data
• Some practical tips: install a distribution, use compression, consider increasing your block size, and watch out for small files
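As a rough illustration of the compression and block-size tips, the sketch below launches a Hadoop Streaming job with the Hadoop 1.x era property names that were current when this talk was given; the streaming jar path, input/output paths and script names are placeholders, so adjust them for your distribution.

```python
import subprocess

cmd = [
    "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    # Compress the job output and write larger blocks (Hadoop 1.x property names).
    "-D", "mapred.output.compress=true",
    "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
    "-D", "dfs.block.size=134217728",   # 128 MB blocks for the output files
    "-input", "/logs/2012-01-12",
    "-output", "/reports/2012-01-12",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-file", "mapper.py",
    "-file", "reducer.py",
]
subprocess.check_call(cmd)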
• HBase: within the stack (Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS), HBase is a database on top of HDFS that can easily be accessed from MapReduce
• Data model: row keys are kept sorted; columns are grouped into column families (e.g. family A with columns X and Y, family B with columns U and V). The number of versions per cell is configurable, each cell version has a timestamp, and a TTL can be specified per column family
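The same data model can be exercised from Python through the HBase Thrift gateway, for example with the happybase client; happybase is not mentioned in the slides, and the host, table and family names below are purely illustrative.

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")  # the cluster's Thrift server

# Column families (and their per-family settings) are fixed at table creation time.
connection.create_table(
    "metrics",
    {
        "a": dict(max_versions=3),            # keep three timestamped versions per cell
        "b": dict(time_to_live=90 * 86400),   # TTL is set per column family (90 days)
    },
)

table = connection.table("metrics")
# A cell lives at (row key, family:qualifier); rows are stored sorted by key.
table.put(b"row1", {b"a:x": b"42", b"b:u": b"hello"})
print(table.row(b"row1"))
```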
• Random becomes sequential: each incoming KeyValue is appended to a commit log (a sequential write) and added to an in-memory, sorted memstore; when the memstore fills up it is flushed to HDFS as a sorted file of KeyValues (another sequential write). High write throughput, plus efficient scans, free empty cells and no fragmentation
• Horizontal scaling: the sorted row key space is split into regions, and each region is served by a RegionServer. Each region has its own commit log and memstores, moving regions is easy since the data is all in HDFS, and you get strong consistency because each region is served only once
• Some practical tips: restrict the number of regions per server, restrict the number of column families, use compression, increase file descriptor limits on the nodes, and use a large enough buffer when scanning
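The scan-buffer tip maps to the batch size a client requests per round trip. A hedged happybase sketch (again, the client library, host and table names are assumptions, not from the slides):

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("metrics")

# batch_size controls how many rows each Thrift round trip fetches;
# too small a buffer turns one scan into thousands of tiny requests.
for row_key, data in table.scan(batch_size=1000):
    handle_row(row_key, data)   # handle_row() is a placeholder for your own processing
```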
• Look, a herd of Hadoops!
• “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” (Last.fm). Over 60 billion tracks scrobbled since 2003; started using Hadoop in 2006, before Yahoo
• “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” (MassiveMedia.eu). Over 80 million users on web and mobile; using Hadoop for about a year now
• Hadoop adoption (√ = in use):

  Use case                          Last.fm   Massive Media
  1. Business intelligence             √            √
  2. Testing and experimentation       √            √
  3. Fraud and abuse detection         √            √
  4. Product features                  √            √
  5. PR and marketing                  √
• Example slides for each use case: business intelligence, testing and experimentation, fraud and abuse detection, product features, PR and marketing
• Let’s dive into the first use case!
• Goals and requirements: timeseries graphs of 1000 or so metrics, segmented over about 10 dimensions. 1. Scale with a very large number of events. 2. History for the graphs must be long enough. 3. Accessing the graphs must be instantaneous. 4. Possibility to analyse in detail when needed
• Attempt #1: a log table in MySQL, with graphs generated from this table on the fly. 1. Large number of events √  2. Long enough history ✗  3. Instantaneous access ✗  4. Analyse in detail √
• Attempt #2: counters in a MySQL table, updated on every event. 1. Large number of events ✗  2. Long enough history √  3. Instantaneous access √  4. Analyse in detail ✗
• Attempt #3: put log files in HDFS through syslog-ng, then MapReduce over the logs and write to HBase. 1. Large number of events √  2. Long enough history √  3. Instantaneous access √  4. Analyse in detail √
• Architecture: syslog-ng delivers the log files into HDFS, and MapReduce jobs process them and write into HBase; the full diagram also includes a realtime processing component and ad-hoc results alongside the MapReduce path
• HBase schema: a separate table for each time granularity. Global segmentations go in the row keys, <language>||<country>||...|||<timestamp>, with * meaning “not specified” and trailing *s omitted. Further segmentations go in the column keys (e.g. payments_via_paypal, payments_via_sms), and related metrics share the same column family
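A small sketch of how such row keys and counter columns might be manipulated from Python, again using the happybase Thrift client as an assumed stand-in; the table name, column family and the row_key helper are hypothetical, and only the key layout itself comes from the slide above.

```python
import happybase

def row_key(timestamp, language="*", country="*"):
    """Build a key like <language>||<country>|||<timestamp>.
    '*' means "not specified"; trailing *s are omitted, as described above."""
    dims = [language, country]
    while dims and dims[-1] == "*":
        dims.pop()
    return ("||".join(dims) + "|||" + timestamp).encode()

connection = happybase.Connection("hbase-thrift-host")
daily = connection.table("metrics_daily")        # one table per time granularity

# Further segmentations live in the column qualifiers; related metrics
# (here, payment counters) share one column family.
daily.counter_inc(row_key("20120112", language="en", country="be"),
                  b"payments:payments_via_paypal")
daily.counter_inc(row_key("20120112"), b"payments:payments_via_paypal")  # global roll-up
```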
• Questions?