Hadoop at Last.fm
 



This talk is about the usage of Hadoop at Last.fm, a community-driven music discovery website. We will go through the main types of data Last.fm stores in Hadoop, explain why we need Hadoop to store and process our data, give examples of what we do with it, and mention some of the additional tools from the Hadoop ecosystem on which we rely for getting these things done.

    Hadoop at Last.fm: Presentation Transcript

    • Hadoop at Last.fm
      June 2010
    • About us
      Last.fm you say?
    • Last.fm is a
      music discovery website
      powered by scrobbling
      that provides personalized radio
    • Music discovery website
      Each month we get:
      over 40M unique visitors
      over 500M pageviews
      Each pageview leads to at least one log line
      Clicks and other interactions sometimes lead to log lines too
    • Powered by scrobbling
      scrobble: skrob·bul (ˈskrɒbəll)
      [verb] To automatically add the tracks you play to your Last.fm profile with a piece of software called a Scrobbler
      Stats:
      Up to 800 scrobbles per second
      More than 40 million scrobbles per day
      Over 40 billion scrobbles so far
      Each scrobble leads to a log line
    • Personalized radio
      Via flash player, Xbox, desktop and mobile apps
      Stats:
      Over 10 million streaming hours per month
      Over 400 thousand unique stations per day
      Each stream leads to at least one log line
    • And it’s not just logs…
      So we gather a lot of logs, but also:
      Tags
      Shouts
      Journals
      Wikis
      Friend connections
      Fingerprints

      Hadoop is the infrastructure we use for storing and processing our flood of data
    • OUR SETUP
      How many nodes?
    • Our herd of elephants
      Current specs of our production cluster:
      44 nodes
      8 cores per node
      16 GB memory per node
      4 disks of 1 TB spinning at 7200 RPM per node
      Unpatched CDH2 using:
      Fair scheduler with preemption
      Slightly patched hadoop-lzo
      RecordIO, Avro
      Hive, Dumbo, Pig
    • We often avoid Java with Dumbo
      def mapper(key, value):
          for word in value.split():
              yield word, 1

      def reducer(key, values):
          yield key, sum(values)

      if __name__ == "__main__":
          import dumbo
          dumbo.run(mapper, reducer, combiner=reducer)
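Because the two functions above are ordinary Python generators, their behaviour can be checked without a cluster. A minimal in-process sketch of the map, shuffle and reduce phases (the `simulate` helper below is ours for illustration, not part of Dumbo, and is not how Dumbo actually schedules work on Hadoop):

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # value is one line of text; the key (e.g. a byte offset) is ignored
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def simulate(lines):
    """Run map, shuffle (sort by key) and reduce in-process."""
    mapped = [pair for i, line in enumerate(lines)
              for pair in mapper(i, line)]
    mapped.sort(key=itemgetter(0))  # the framework's shuffle groups pairs by key
    return dict(pair
                for key, group in groupby(mapped, key=itemgetter(0))
                for pair in reducer(key, (v for _, v in group)))

print(simulate(["hadoop at last fm", "hadoop scales"]))
# {'at': 1, 'fm': 1, 'hadoop': 2, 'last': 1, 'scales': 1}
```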
    • Or go even more high-level with Hive
      hive> CREATE TABLE name_counts (gender STRING, name STRING, occurrences INT);
      hive> INSERT OVERWRITE TABLE name_counts
      SELECT lower(sex), lower(split(realname, ' ')[0]), count(1) FROM meta_user_info
      WHERE lower(sex) <> 'n' GROUP BY lower(sex), lower(split(realname, ' ')[0]);
      hive> CREATE TABLE gender_likelihoods (name STRING, gender STRING, likelihood FLOAT);
      hive> INSERT OVERWRITE TABLE gender_likelihoods
      SELECT b.name, b.gender, b.occurrences / a.occurrences FROM
      (SELECT name, sum(occurrences) AS occurrences FROM name_counts GROUP BY name) a JOIN name_counts b ON (a.name = b.name);
      hive> SELECT * FROM gender_likelihoods WHERE (name = 'klaas') OR (name = 'sam');
      klaas  m  0.99038464
      klaas  f  0.009615385
      sam    m  0.7578873
      sam    f  0.24211268
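The same computation can be expressed in plain Python, which makes the join logic explicit: count (name, gender) pairs, then divide each count by the per-name total. The `gender_likelihoods` helper and its sample rows below are illustrative, not from the deck:

```python
from collections import Counter

def gender_likelihoods(users):
    """users: (sex, realname) rows, mirroring the meta_user_info table."""
    name_counts = Counter()
    for sex, realname in users:
        sex = sex.lower()
        if sex == 'n':  # 'n' marks users who did not state a gender
            continue
        first = realname.lower().split(' ')[0]
        name_counts[(first, sex)] += 1
    # Per-name totals play the role of the subquery aliased as `a` in the HQL
    totals = Counter()
    for (name, _sex), n in name_counts.items():
        totals[name] += n
    return {(name, sex): n / totals[name]
            for (name, sex), n in name_counts.items()}

rows = [('M', 'Klaas Bosteels'), ('M', 'Sam Smith'),
        ('F', 'Sam Jones'), ('N', 'Anonymous User')]
print(gender_likelihoods(rows))
# {('klaas', 'm'): 1.0, ('sam', 'm'): 0.5, ('sam', 'f'): 0.5}
```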
    • Mixed usage of tools is common
      def starter(prog):
          month = prog.delopt("month")  # is expected to be YYYY/MM
          hql = "INSERT OVERWRITE DIRECTORY 'cool/stuff_hive' ...".format(month)
          if os.system('hive -e "{0}"'.format(hql)) != 0:
              raise dumbo.Error("hive query failed")
          prog.addopt("input", "cool/stuff_hive")  # will be text delimited by '\x01'
          prog.addopt("output", "cool/stuff/" + month)

      if __name__ == "__main__":
          dumbo.main(runner, starter)
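The comment in the starter hints at the glue that makes this mixing work: Hive's `INSERT OVERWRITE DIRECTORY` writes plain text with columns separated by the `\x01` control byte, so a downstream Dumbo mapper just splits on it. A minimal sketch, with made-up field names and sample data:

```python
def mapper(key, value):
    # Hive's default text output separates columns with the \x01 byte
    name, plays = value.split('\x01')
    yield name, int(plays)

# Feeding the mapper one line as Hive would have written it:
print(list(mapper(0, 'klaas\x01123')))
# [('klaas', 123)]
```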
    • Running out of DFS space is common too
      (graph annotations: data deleted; new nodes; more compression and nodes; upgrade not finalized yet)
      Possible solutions:
      Bigger and/or more disks
      HDFS RAID
    • Hitting I/O and CPU limits is less common
      Upgraded to 0.20
      Our cluster can be pretty busy at times
      But DFS space is our main worry
    • USE CASES
      What do you use it for?
    • Things we do with Hadoop
      Site stats and metrics
      Charts
      Reporting
      Metadata corrections
      Neighbours
      Recommendations
      Indexing for search
      Evaluations
      Data insights
    • And also scaring our ops…
    • Example: Website traffic stats
      Google Chrome is gaining ground
      We compute a lot of site metrics, mostly from apache logs
    • Example: Overall charts
      Charts for a single user can be shown in real time and are computed on the fly
      But computing overall charts is a pretty big job and is done on Hadoop
    • Example: World charts
      This “world chart” for the Belgian band “Hooverphonic” also required Hadoop because it’s based on data from many different users
    • Example: Overall wave graphs
      Overall visualizations also typically require Hadoop for getting the data they visualize
      The “wave graphs” in our “Best of 2009” newspaper were good examples
    • Example: Death and scrobbles graphs
      (graph legend: scrobbles, listeners)
    • Example: Radio stats
      Graphs for several metrics that can be broken down by various attributes
      Used extensively for A/B testing
      (graph annotations: overheated data centre; significant differences; DB maintenance that went bad)
    • Thanks!
      klaas@last.fm @klbostee
      marc@last.fm @lantt
      tims@last.fm @roserpens