Spotting Hadoop in the wild

Practical Hadoop use cases from Last.fm and Massive Media

  1. Spotting Hadoop in the wild: practical use cases from Last.fm and Massive Media (@klbostee)
  2. • “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” —WhatIs.com • “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
  3. • 2007: Started using Hadoop as a PhD student • 2009: Data & Scalability Engineer at Last.fm • 2011: Data Scientist at Massive Media
  4. • 2007: Started using Hadoop as a PhD student • 2009: Data & Scalability Engineer at Last.fm • 2011: Data Scientist at Massive Media • Created Dumbo, a Python API for Hadoop • Contributed some code to Hadoop itself • Organized several HUGUK meetups
  5. What are those yellow things?
  6. Core principles • Distributed • Fault tolerant • Sequential reads and writes • Data locality
  7. Pars pro toto [stack diagram: Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS] Hadoop itself is basically the kernel that provides a file system and a task scheduler
  8. Hadoop file system [diagram: three DataNodes]
  9. Hadoop file system [diagram: File A stored as blocks spread across the DataNodes]
  10. Hadoop file system [diagram: Files A and B stored as blocks spread across the DataNodes]
  11. Hadoop file system [diagram: a Hadoop block is made up of Linux blocks on a DataNode]
  12. Hadoop file system [same diagram] No random writes!
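
HDFS files are written once, sequentially, through an output stream and cannot be modified in place afterwards, which is what the "no random writes" note refers to. As a minimal illustration (not from the slides), here is a sketch of writing a file through the Java FileSystem API; the path and payload are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster settings (fs.default.name / fs.defaultFS) from core-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; the file is streamed out sequentially and then closed
            Path path = new Path("/user/example/events/2012-01-12.log");
            FSDataOutputStream out = fs.create(path);
            out.write("one log line\n".getBytes("UTF-8"));
            out.close();

            fs.close();
        }
    }
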
  13. Hadoop task scheduler [diagram: a TaskTracker running alongside each DataNode]
  14. Hadoop task scheduler [diagram: Job A split into tasks scheduled on the TaskTrackers]
  15. Hadoop task scheduler [diagram: Jobs A and B split into tasks scheduled on the TaskTrackers]
  16. Some practical tips • Install a distribution • Use compression • Consider increasing your block size • Watch out for small files
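
To make the compression and block-size tips concrete, here is a hedged sketch of enabling compressed map and job output from the Java MapReduce API; the exact property names vary between Hadoop versions, so treat the ones below as assumptions to check against your distribution. A larger HDFS block size is typically set cluster-wide through dfs.block.size (dfs.blocksize on newer releases) in hdfs-site.xml, e.g. 134217728 for 128 MB blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionSketch {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();

            // Compress intermediate map output (older releases use mapred.compress.map.output)
            conf.setBoolean("mapreduce.map.output.compress", true);

            Job job = new Job(conf, "compressed job");

            // Compress the final job output as well
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            return job;
        }
    }
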
  17. HBase [stack diagram: Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS] HBase is a database on top of HDFS that can easily be accessed from MapReduce
  18. Data model [diagram: a table with row keys and two column families, A (columns X, Y) and B (columns U, V)]
  19. Data model [same diagram, with the row keys kept sorted]
  20. Data model [same diagram] • Configurable number of versions per cell • Each cell version has a timestamp • TTL can be specified per column family
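
A minimal sketch, against the HBase Java client of that era, of creating a table whose column families configure the number of versions and a TTL, as described above; the table and family names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTableSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor table = new HTableDescriptor("metrics");  // hypothetical table name

            HColumnDescriptor familyA = new HColumnDescriptor("a");
            familyA.setMaxVersions(3);               // keep up to three timestamped versions per cell
            familyA.setTimeToLive(90 * 24 * 3600);   // expire cells in this family after ~90 days

            HColumnDescriptor familyB = new HColumnDescriptor("b");

            table.addFamily(familyA);
            table.addFamily(familyB);
            admin.createTable(table);
        }
    }
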
  21.–26. Random becomes sequential [diagram, built up over several slides: each write is appended to a commit log and inserted into an in-memory memstore that keeps KeyValues sorted; full memstores are flushed to HDFS as sequential writes, so random writes become sequential ones and write throughput stays high]
  27. Random becomes sequential [same diagram] High write throughput! + efficient scans + free empty cells + no fragmentation
  28.–29. Horizontal scaling [diagram: the sorted row key space]
  30. Horizontal scaling [diagram: a contiguous range of row keys forms a region, served by a RegionServer]
  31. Horizontal scaling [diagram: the sorted row key space split into regions spread over several RegionServers]
  32. Horizontal scaling [same diagram] • Each region has its own commit log and memstores • Moving regions is easy since the data is all in HDFS • Strong consistency as each region is served only once
  33. Some practical tips • Restrict the number of regions per server • Restrict the number of column families • Use compression • Increase file descriptor limits on nodes • Use a large enough buffer when scanning
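
The last tip refers to scanner caching: in HBase versions of that era a scan fetched only one row per round trip by default, so asking for rows in batches makes full scans far cheaper. A hedged sketch with the HBase Java client follows; the table name and caching value are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScanBufferSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "metrics");  // hypothetical table name

            Scan scan = new Scan();
            scan.setCaching(500);        // fetch rows in batches of 500 per RPC
            scan.setCacheBlocks(false);  // avoid churning the block cache during a full scan

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    // process each row here
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }
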
  34. Look, a herd of Hadoops!
  35. • “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” —Last.fm • Over 60 billion tracks scrobbled since 2003 • Started using Hadoop in 2006, before Yahoo
  36. • “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” —MassiveMedia.eu • Over 80 million users on web and mobile • Using Hadoop for about a year now
  37. Hadoop adoption 1. Business intelligence 2. Testing and experimentation 3. Fraud and abuse detection 4. Product features 5. PR and marketing
  38. Hadoop adoption (Last.fm): 1. Business intelligence ✓ 2. Testing and experimentation ✓ 3. Fraud and abuse detection ✓ 4. Product features ✓ 5. PR and marketing ✓
  39. Hadoop adoption (Last.fm / Massive Media): 1. Business intelligence ✓ / ✓ 2. Testing and experimentation ✓ / ✓ 3. Fraud and abuse detection ✓ / ✓ 4. Product features ✓ / ✓ 5. PR and marketing ✓ / –
  40. Business intelligence
  41. Testing and experimentation
  42. Fraud and abuse detection
  43. Fraud and abuse detection
  44. Product features
  45. PR and marketing
  46. Let’s dive into the first use case!
  47. Goals and requirements • Timeseries graphs of 1000 or so metrics • Segmented over about 10 dimensions
  48. Goals and requirements • Timeseries graphs of 1000 or so metrics • Segmented over about 10 dimensions 1. Scale with very large number of events 2. History for graphs must be long enough 3. Accessing the graphs must be instantaneous 4. Possibility to analyse in detail when needed
  49. Attempt #1 • Log table in MySQL • Generate graphs from this table on-the-fly
  50. Attempt #1 • Log table in MySQL • Generate graphs from this table on-the-fly 1. Large number of events ✓ 2. Long enough history ✗ 3. Instantaneous access ✗ 4. Analyse in detail ✓
  51. Attempt #2 • Counters in MySQL table • Update counters on every event
  52. Attempt #2 • Counters in MySQL table • Update counters on every event 1. Large number of events ✗ 2. Long enough history ✓ 3. Instantaneous access ✓ 4. Analyse in detail ✗
  53. Attempt #3 • Put log files in HDFS through syslog-ng • MapReduce on logs and write to HBase
  54. Attempt #3 • Put log files in HDFS through syslog-ng • MapReduce on logs and write to HBase 1. Large number of events ✓ 2. Long enough history ✓ 3. Instantaneous access ✓ 4. Analyse in detail ✓
  55. Architecture [diagram: syslog-ng, HDFS, MapReduce, HBase]
  56. Architecture [diagram: syslog-ng, HDFS, MapReduce, HBase, plus a realtime processing path]
  57. Architecture [diagram: syslog-ng, HDFS, MapReduce, HBase, plus realtime processing and ad-hoc results]
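
The slides do not show the job code itself, so the following is only a rough sketch, under assumptions, of the "MapReduce on logs and write to HBase" step: a mapper that counts events per metric key from log lines in HDFS, and a TableReducer that writes the totals into an HBase table via TableMapReduceUtil. The log format, table name, and column names are all hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class LogsToHBaseSketch {

        // Emits (metric key, 1) for every log line; the tab-separated format is an assumption
        public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String metricKey = line.toString().split("\t")[0];
                context.write(new Text(metricKey), ONE);
            }
        }

        // Sums the counts and writes one Put per metric key into the HBase table
        public static class CountReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (IntWritable value : values) sum += value.get();
                Put put = new Put(Bytes.toBytes(key.toString()));
                put.add(Bytes.toBytes("stats"), Bytes.toBytes("count"), Bytes.toBytes(sum));
                context.write(null, put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "logs to hbase");
            job.setJarByClass(LogsToHBaseSketch.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));  // log files collected in HDFS
            job.setMapperClass(LogMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            // Routes reducer output into the "metrics" table (hypothetical name)
            TableMapReduceUtil.initTableReducerJob("metrics", CountReducer.class, job);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
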
  58. HBase schema • Separate table for each time granularity • Global segmentations in row keys • <language>||<country>||...|||<timestamp> • * for “not specified” • trailing *s are omitted • Further segmentations in column keys • e.g. payments_via_paypal, payments_via_sms • Related metrics in same column family
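
As an illustration of this row-key layout and the counter-style columns, here is a hedged sketch that builds a "<language>||<country>|||<timestamp>" key (only two of the dimensions, with * for unspecified values and trailing *s stripped) and bumps one metric with an atomic increment; the table name, column family, and helper method are assumptions for illustration, not the actual code behind the slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MetricRowKeySketch {

        // Builds "<language>||<country>|||<timestamp>" with * for "not specified"
        // and trailing *s omitted, as described on the schema slide
        static String rowKey(String language, String country, long timestamp) {
            String lang = language == null ? "*" : language;
            String ctry = country == null ? "*" : country;
            String segments = (lang + "||" + ctry).replaceAll("(\\|\\|\\*)+$", "");
            return segments + "|||" + timestamp;
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "metrics_daily");  // one table per time granularity

            byte[] row = Bytes.toBytes(rowKey("en", "US", 1326326400L));
            // Atomically bump a single metric column inside a "payments" family
            table.incrementColumnValue(row, Bytes.toBytes("payments"),
                    Bytes.toBytes("payments_via_paypal"), 1L);
            table.close();
        }
    }
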
  59. Questions?
