Hw09: Building Data Intensive Apps: A Closer Look at TrendingTopics.org
Published in: Technology
Transcript

  • 1. Prototyping Data Intensive Apps: TrendingTopics.org (09/29/09). Pete Skomoroch, Research Scientist at LinkedIn, Consultant at Data Wrangling, @peteskomoroch
  • 2. Talk Outline
      • TrendingTopics Overview
      • Wikipedia Page View Dataset
      • Hadoop on Amazon EC2
      • Loading Data on EC2: Amazon EBS & S3
      • Daily Timelines with Hadoop Streaming
      • Hive Data Warehouse Layer
      • Trend Computation with Hive
      • Hooking It All Together
      • Front End & Visualizations
  • 3. Data Intensive Web Apps
      • Batch data mining or prediction with Hadoop
      • Iterate quickly with high level languages & tools
          • Pig, Hive, Clojure, Cascading, Python, Ruby
      • EC2: Get running with limited initial capital
      • Use external data and APIs in novel ways
      • Recent real world example: FlightCaster
  • 4. TrendingTopics.org
  • 5. Daily Pageview Timeline Charts
  • 6. Detects Rising Trends with Hadoop
  • 7. TrendingTopics is Open Source
      • Built as a side project at Data Wrangling
      • Core code completed over 2 weeks in June
      • Code on Github
      • Data released on Amazon Public Datasets
  • 8. Technology Stack
  • 9. Application Data Flow
  • 10. Hadoop on Amazon EC2
      • EC2: Launch servers on demand
      • S3: Simple Storage Service
      • EBS: Persistent disks for EC2
      • Used Cloudera Hadoop Distribution
          • Includes Pig, Hive, HBase, Sqoop, EBS integration
          • Allows customization of Hadoop EC2 instances
          • See EC2 Docs and Tutorials on the Cloudera site
  • 11. Wikipedia Page View Log Data
      • 1 TB of hourly log files from Wikipedia Squid proxy
      • Filename contains timestamp
      • Columns: project, pagename, pageviews, & bytes
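The four columns and the timestamped filename on slide 11 can be illustrated with a small parser. This is a minimal sketch, not code from the talk: it assumes records are whitespace-delimited and that filenames look like `pagecounts-YYYYMMDD-HHMMSS.gz`. The names `PageView`, `parse_line`, and `timestamp_from_filename` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class PageView(NamedTuple):
    """One hourly log record: project, pagename, pageviews, bytes."""
    project: str
    pagename: str
    pageviews: int
    bytes_sent: int


def parse_line(line: str) -> Optional[PageView]:
    """Parse one record; return None if it is malformed.

    Assumes whitespace-delimited fields in the column order from the slide.
    """
    fields = line.split()
    if len(fields) != 4:
        return None
    project, pagename, views, size = fields
    try:
        return PageView(project, pagename, int(views), int(size))
    except ValueError:
        return None


def timestamp_from_filename(filename: str) -> datetime:
    """Extract the hour from a name like pagecounts-20090601-140000.gz.

    The filename pattern is an assumption; the slide only says the
    filename contains a timestamp.
    """
    stem = filename.split("/")[-1].split(".")[0]
    _, date_part, time_part = stem.split("-")
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
```

Returning `None` for malformed rows lets a later filtering step drop junk records without aborting a long batch run.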
  • 12. Wikipedia Redirect Data
      • Many articles in the logs redirect to canonical titles
      • Amazon Public Dataset contains a lookup table:
  • 13. Loading Data into Hadoop on EC2
      • Hadoop can read directly from Amazon S3
      • Pulling in data from EBS snapshots
  • 14. Python Streaming for Daily Timelines
  • 15. Streaming: Filter & Aggregate Logs
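The filter-and-aggregate step named on slides 14 and 15 might look roughly like the sketch below. The original mapper and reducer are not in this transcript, so everything here is an assumption: the English-only filter, the tab-separated intermediate format, and the function names `map_hourly` and `reduce_daily`. In a real Hadoop Streaming job these would read from stdin and write to stdout; they are written as plain generators here so the logic can be run locally.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def map_hourly(lines: Iterable[str], date: str) -> Iterator[str]:
    """Mapper: keep English-project rows and emit 'pagename<TAB>date<TAB>views'.

    `date` (e.g. '20090601') is assumed to come from the input filename.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed records
        project, pagename, views, _size = fields
        if project != "en" or not views.isdigit():
            continue  # filter: other projects and bad counts
        yield f"{pagename}\t{date}\t{views}"


def reduce_daily(sorted_lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Reducer: sum hourly views per (pagename, date).

    Expects input sorted by key, as Hadoop delivers it to a reducer.
    """
    def key(line: str) -> Tuple[str, str]:
        page, date, _views = line.split("\t")
        return page, date

    for (page, date), group in groupby(sorted_lines, key=key):
        total = sum(int(row.split("\t")[2]) for row in group)
        yield page, date, total
```

Locally, the shuffle phase can be imitated by sorting the mapper output before passing it to `reduce_daily`.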
  • 16. Hive Data Warehouse Layer on EC2
      • Easy for analysts familiar with SQL
      • Familiar syntax (SELECT, GROUP BY, SORT...)
      • Hide the MR muck, for JOIN, UNION, etc.
      • Supports streaming UDFs
      • Sampling: generate datasets for R&D
      • Used on Timeline and Redirect data
  • 17. Fixing Wiki Redirects with Hive JOIN
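The Hive query from slide 17 is not included in this transcript. As a rough illustration of what joining timeline rows against the redirect lookup table achieves, the sketch below does the same thing in memory with Python. The function name `merge_redirects`, the row layout, and the choice to keep titles that have no redirect entry are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def merge_redirects(
    daily_rows: Iterable[Tuple[str, str, int]],
    redirects: Dict[str, str],
) -> Dict[Tuple[str, str], int]:
    """Map each title to its canonical title, then sum views per (title, date).

    `redirects` maps a redirecting title to its canonical title. Titles
    missing from the table are kept unchanged (an assumption; an inner
    JOIN would drop them instead).
    """
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for title, date, views in daily_rows:
        canonical = redirects.get(title, title)
        totals[(canonical, date)] += views
    return dict(totals)
```

The effect is that views logged under a redirecting title are credited to the canonical article before trends are computed.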
  • 18. Trend Detection With Hadoop
  • 19. Hourly Data: “Java” vs. “Hangover”
  • 20. Robust Regression with R & Hadoop
      • Fit R model to hourly data for 2.3M Wiki articles
      • “Trending” if article views >> model prediction
      • Page View “Spikes”
  • 21. Call Python Trend Mapper from Hive
  • 22. Simple Trend Calculation in Python
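The code from slide 22 is not part of this transcript, so the actual trend formula is unknown. Purely as an illustration of the idea of comparing recent views against an earlier baseline, here is one possible score. The function name `trend_score`, the 7-day windows, and the smoothing constant are hypothetical and are not taken from the talk.

```python
from typing import Sequence


def trend_score(
    daily_views: Sequence[int],
    window: int = 7,
    smoothing: float = 1.0,
) -> float:
    """Relative change of the last `window` days versus the `window` before it.

    Hypothetical scoring rule, not the formula used by TrendingTopics.
    Returns 0.0 when there is not enough history to compare two windows.
    """
    if len(daily_views) < 2 * window:
        return 0.0
    recent = daily_views[-window:]
    previous = daily_views[-2 * window:-window]
    recent_mean = sum(recent) / window
    previous_mean = sum(previous) / window
    # Smoothing avoids division by zero and damps scores for tiny pages.
    return (recent_mean - previous_mean) / (previous_mean + smoothing)
```

A page that is flat scores 0, a page whose views tripled scores well above 1, and a declining page scores below 0.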
  • 23. Use Trend Scores to Rank Articles
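Ranking by trend score, as slide 23 describes, comes down to a top-N selection. A minimal sketch, assuming scores are already available as a mapping from article title to score; the function name `top_trending` and the sample titles in the usage note are illustrative only.

```python
import heapq
from typing import Dict, List, Tuple


def top_trending(scores: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """Return the n (title, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])
```

For example, `top_trending({"Java": 0.1, "Hangover": 4.2}, n=1)` returns the single pair for "Hangover". `heapq.nlargest` avoids sorting the full set when only a short list of top articles is needed.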
  • 24. Exporting Data from our EC2 Cluster
      • Output files are delimited text for bulk DB load
      • Distcp outputs to S3:
          • hadoop distcp /user/output s3n://trendingtopics/exports
      • Hive can create tables directly over S3 filesystem:
          • create external table foo ... location 's3n://trendingtopics/hiveoutput'
  • 25. Hooking It All Together
  • 26. Ruby on Rails Front End
      • Run a development version of the app locally:
  • 27. Automate & Deploy with EC2onRails
      • EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache
      • Daily cron job on App server launches Hadoop
      • Hadoop EC2 boot script checks out latest trend detection code from Github
      • Capistrano tasks deploy new application code, maintain DB, cache, archives
  • 28. Cron Triggers Hadoop Jobs on EC2
      • Modify Cloudera hadoop-ec2-init-remote.sh boot script
  • 29. Bash Scripts Automate Daily Run
      • run_daily_merge.sh kicks off Hadoop/Hive jobs
      • Queries MySQL for last run date
      • Generates timelines with Hadoop streaming
      • Hive operates on Timeline data, computes Trends with Python
      • Hive output copied to S3
      • Calls remote script to load MySQL staging tables
      • Hadoop cluster self terminates
      • Remote script swaps staging and prod after load
  • 30. Capistrano Tasks
      • Capistrano: Ruby tool for automating tasks on one or more remote servers (www.capify.org)
      • Deploys Rails code and cron jobs to App server
      • Cleanup tasks are triggered at end of load:
          • Flags featured Wikipedia articles
          • Indexes MySQL tables, flushes cache
  • 33. LinkedIn A nalytics Team We’re Hiring
  • 34. Questions? <ul><li>Contact </li></ul><ul><ul><li>Blog: datawrangling.com </li></ul></ul><ul><ul><li>Twitter: @peteskomoroch </li></ul></ul><ul><li>Code </li></ul><ul><ul><li>http://github.com/datawrangling/trendingtopics </li></ul></ul><ul><ul><li>Hadoop blog posts & tutorials on the Cloudera site </li></ul></ul><ul><ul><li>Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data http://bit.ly/EMRsimilarity </li></ul></ul>
  • 35. Backup slides
  • 36. Wikipedia Views ~ Google Searches
      • “Dwight Howard” on TrendingTopics
      • “Dwight Howard” on Google Trends
  • 37. Wikipedia Log Analysis with Pig
      • Most Read articles Pig Latin example
      • Show junk Wikipedia pages that need filtering