Prototyping Data Intensive Apps:
       TrendingTopics.org
              Pete Skomoroch
              Research Scientist a...
Talk Outline
• TrendingTopics Overview
• Wikipedia Page View Dataset
• Hadoop on Amazon EC2
• Loading Data on EC2: Amazon ...
Data Intensive Web Apps
• Batch data mining or prediction with Hadoop
• Iterate quickly with high level languages & tools
...
TrendingTopics.org




                     4
Daily Pageview Timeline Charts




                                 5
Detects Rising Trends with Hadoop




                                    6
TrendingTopics is Open Source
• Built as a side project at Data Wrangling
• Core code completed over 2 weeks in June
• Cod...
Technology Stack




                   8
Application Data Flow




                        9
Hadoop on Amazon EC2
• EC2: Launch servers on demand
• S3: Simple Storage Service
• EBS: Persistent disks for EC2
• Used C...
Wikipedia Page View Log Data
• 1 TB of hourly log files from Wikipedia Squid proxy
• Filename contains timestamp
• Columns...
Wikipedia Redirect Data
• Many articles in the logs redirect to canonical titles
• Amazon Public Dataset contains a lookup...
Loading Data into Hadoop on EC2
• Hadoop can read directly from Amazon S3

• Pulling in data from EBS snapshots




      ...
Python Streaming for Daily Timelines




                                       14
Streaming: Filter & Aggregate Logs




                                     15
Hive Data Warehouse Layer on EC2
• Easy for analysts familar with SQL
• Familar syntax (SELECT, GROUP BY, SORT...)
• Hide ...
Fixing Wiki Redirects with Hive JOIN




                                       17
Trend Detection With Hadoop




                              18
Hourly Data: “Java” vs. “Hangover”




                                     19
Robust Regression with R & Hadoop
• Fit R model to hourly data for 2.3M Wiki articles
• “Trending” if article views >> mod...
Call Python Trend Mapper from Hive




                                     21
Simple Trend Calculation in Python




                                     22
Use Trend Scores to Rank Articles




                                    23
Exporting Data from our EC2 Cluster
• Output files are delimited text for bulk DB load
• Distcp outputs to S3:
 – hadoop d...
Hooking It All Together




                          25
Ruby on Rails Front End
• Run a development version of the app locally:




                                              ...
Automate & Deploy with EC2onRails
• EC2onRails: Ubuntu server images for EC2, runs
  instances of Ruby on Rails, MySQL, Ap...
Cron Triggers Hadoop Jobs on EC2
 Modify Cloudera hadoop-ec2-init-remote.sh boot script




                              ...
Bash Scripts Automate Daily Run
• run_daily_merge.sh kicks off Hadoop/Hive jobs
• Queries MySQL for last run date
• Genera...
Capistrano Tasks
• Capistrano: Ruby tool for automating tasks on one
  or more remote servers (www.capify.org)
• Deploys R...
Visualization APIs
• Google Viz API: Annotated Timeline for Rails



• Google Charts sparklines: gc4r



• Yahoo BOSS imag...
Ideas & Next Steps
• Try out the code on Github
• Show related articles using link graph data
• Extract trending phrases f...
LinkedIn Analytics Team




                     We’re Hiring




                                    33
Questions?
• Contact
 – Blog: datawrangling.com
 – Twitter: @peteskomoroch

• Code
 – http://github.com/datawrangling/tren...
Upcoming SlideShare
Loading in...5
×

Prototyping Data Intensive Apps: TrendingTopics.org

42,520

Published on

Hadoop World 2009 talk on rapid prototyping of data intensive web applications with Hadoop, Hive, Amazon EC2, Python, and Ruby on Rails. Describes the process of building the open source trend tracking site trendingtopics.org

Published in: Technology
1 Comment
19 Likes
Statistics
Notes
  • really cool~
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
42,520
On Slideshare
0
From Embeds
0
Number of Embeds
23
Actions
Shares
0
Downloads
158
Comments
1
Likes
19
Embeds 0
No embeds

No notes for slide

Prototyping Data Intensive Apps: TrendingTopics.org

  1. 1. Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1
  2. 2. Talk Outline • TrendingTopics Overview • Wikipedia Page View Dataset • Hadoop on Amazon EC2 • Loading Data on EC2: Amazon EBS & S3 • Daily Timelines with Hadoop Streaming • Hive Data Warehouse Layer • Trend Computation with Hive • Hooking It All Together • Front End & Visualizations
  3. 3. Data Intensive Web Apps • Batch data mining or prediction with Hadoop • Iterate quickly with high level languages & tools – Pig, Hive, Clojure, Cascading, Python, Ruby • EC2: Get running with limited initial capital • Use external data and APIs in novel ways • Recent real world example: FlightCaster 3
  4. 4. TrendingTopics.org 4
  5. 5. Daily Pageview Timeline Charts 5
  6. 6. Detects Rising Trends with Hadoop 6
  7. 7. TrendingTopics is Open Source • Built as a side project at Data Wrangling • Core code completed over 2 weeks in June • Code on Github • Data released on Amazon Public Datasets 7
  8. 8. Technology Stack 8
  9. 9. Application Data Flow 9
  10. 10. Hadoop on Amazon EC2 • EC2: Launch servers on demand • S3: Simple Storage Service • EBS: Persistent disks for EC2 • Used Cloudera Hadoop Distribution – Includes Pig, Hive, HBase, Sqoop, EBS integration – Allows customization of Hadoop EC2 instances – See EC2 Docs and Tutorials on the Cloudera site 10
  11. 11. Wikipedia Page View Log Data • 1 TB of hourly log files from Wikipedia Squid proxy • Filename contains timestamp • Columns: project, pagename, pageviews, & bytes 11
  12. 12. Wikipedia Redirect Data • Many articles in the logs redirect to canonical titles • Amazon Public Dataset contains a lookup table: 12
  13. 13. Loading Data into Hadoop on EC2 • Hadoop can read directly from Amazon S3 • Pulling in data from EBS snapshots 13
  14. 14. Python Streaming for Daily Timelines 14
  15. 15. Streaming: Filter & Aggregate Logs 15
  16. 16. Hive Data Warehouse Layer on EC2 • Easy for analysts familar with SQL • Familar syntax (SELECT, GROUP BY, SORT...) • Hide the MR muck, for JOIN, UNION, etc. • Supports streaming UDFs • Sampling: generate datasets for R&D • Used on Timeline and Redirect data 16
  17. 17. Fixing Wiki Redirects with Hive JOIN 17
  18. 18. Trend Detection With Hadoop 18
  19. 19. Hourly Data: “Java” vs. “Hangover” 19
  20. 20. Robust Regression with R & Hadoop • Fit R model to hourly data for 2.3M Wiki articles • “Trending” if article views >> model prediction Page View “Spikes” 20
  21. 21. Call Python Trend Mapper from Hive 21
  22. 22. Simple Trend Calculation in Python 22
  23. 23. Use Trend Scores to Rank Articles 23
  24. 24. Exporting Data from our EC2 Cluster • Output files are delimited text for bulk DB load • Distcp outputs to S3: – hadoop distcp /user/output s3n://trendingtopics/exports • Hive can create tables directly over S3 filesystem: – create external table foo ... location 's3n://trendingtopics/ hiveoutput’ 24
  25. 25. Hooking It All Together 25
  26. 26. Ruby on Rails Front End • Run a development version of the app locally: 26
  27. 27. Automate & Deploy with EC2onRails • EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache • Daily cron job on App server launches Hadoop • Hadoop EC2 boot script checks out latest trend detection code from Github • Capistrano tasks deploy new application code, maintain DB, cache, archives 27
  28. 28. Cron Triggers Hadoop Jobs on EC2 Modify Cloudera hadoop-ec2-init-remote.sh boot script 28
  29. 29. Bash Scripts Automate Daily Run • run_daily_merge.sh kicks off Hadoop/Hive jobs • Queries MySQL for last run date • Generates timelines with Hadoop streaming • Hive operates on timeline data, computes Trends with Python • Hive output copied to S3 • Calls remote script to load MySQL staging tables • Hadoop cluster self terminates • Remote script swaps staging and prod after load 29
  30. 30. Capistrano Tasks • Capistrano: Ruby tool for automating tasks on one or more remote servers (www.capify.org) • Deploys Rails code and cron jobs to App server • Cleanup tasks are triggered at end of load: – Flags featured Wikipedia articles – Indexes MySQL tables, Flushes cache 30
  31. 31. Visualization APIs • Google Viz API: Annotated Timeline for Rails • Google Charts sparklines: gc4r • Yahoo BOSS image search in Rails 31
  32. 32. Ideas & Next Steps • Try out the code on Github • Show related articles using link graph data • Extract trending phrases from article text • Boost Twitter trending topics with Wikipedia logs • Update top trends hourly • Use date partitioned tables, incremental updates • Add a real search engine (uses MySQL now) 32
  33. 33. LinkedIn Analytics Team We’re Hiring 33
  34. 34. Questions? • Contact – Blog: datawrangling.com – Twitter: @peteskomoroch • Code – http://github.com/datawrangling/trendingtopics – Hadoop blog posts & tutorials on the Cloudera site – Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data http://bit.ly/EMRsimilarity 34
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×