Your SlideShare is downloading. ×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Hw09 Building Data Intensive Apps A Closer Look At Trending Topics.Org


Published on

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Prototyping Data Intensive Apps: 09/29/09 1 Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch
  • 2. Talk Outline
    • TrendingTopics Overview
    • Wikipedia Page View Dataset
    • Hadoop on Amazon EC2
    • Loading Data on EC2: Amazon EBS & S3
    • Daily Timelines with Hadoop Streaming
    • Hive Data Warehouse Layer
    • Trend Computation with Hive
    • Hooking It All Together
    • Front End & Visualizations
  • 3. Data Intensive Web Apps
    • Batch data mining or prediction with Hadoop
    • Iterate quickly with high level languages & tools
      • Pig, Hive, Clojure, Cascading, Python, Ruby
    • EC2: Get running with limited initial capital
    • Use external data and APIs in novel ways
    • Recent real world example: FlightCaster
  • 4.
  • 5. Daily Pageview Timeline Charts
  • 6. Detects Rising Trends with Hadoop
  • 7. TrendingTopics is Open Source
    • Built as a side project at Data Wrangling
    • Core code completed over 2 weeks in June
    • Code on Github
    • Data released on Amazon Public Datasets
  • 8. Technology Stack
  • 9. Application Data Flow
  • 10. Hadoop on Amazon EC2
    • EC2: Launch servers on demand
    • S3: Simple Storage Service
    • EBS: Persistent disks for EC2
    • Used Cloudera Hadoop Distribution
      • Includes Pig, Hive, HBase, Sqoop, EBS integration
      • Allows customization of Hadoop EC2 instances
      • See EC2 Docs and Tutorials on the Cloudera site
  • 11. Wikipedia Page View Log Data
    • 1 TB of hourly log files from Wikipedia Squid proxy
    • Filename contains timestamp
    • Columns: project, pagename, pageviews, & bytes
  • 12. Wikipedia Redirect Data
    • Many articles in the logs redirect to canonical titles
    • Amazon Public Dataset contains a lookup table:
  • 13. Loading Data into Hadoop on EC2
    • Hadoop can read directly from Amazon S3
    • Pulling in data from EBS snapshots
  • 14. Python Streaming for Daily Timelines
  • 15. Streaming: Filter & Aggregate Logs
  • 16. Hive Data Warehouse Layer on EC2
    • Easy for analysts familar with SQL
    • Familar syntax (SELECT, GROUP BY, SORT...)
    • Hide the MR muck, for JOIN, UNION, etc.
    • Supports streaming UDFs
    • Sampling: generate datasets for R&D
    • Used on Timeline and Redirect data
  • 17. Fixing Wiki Redirects with Hive JOIN
  • 18. Trend Detection With Hadoop
  • 19. Hourly Data: “Java” vs. “Hangover”
  • 20. Robust Regression with R & Hadoop
    • Fit R model to hourly data for 2.3M Wiki articles
    • “ Trending” if article views >> model prediction
    Page View “Spikes”
  • 21. Call Python Trend Mapper from Hive
  • 22. Simple Trend Calculation in Python
  • 23. Use Trend Scores to Rank Articles
  • 24. Exporting Data from our EC2 Cluster
    • Output files are delimited text for bulk DB load
    • Distcp outputs to S3:
      • hadoop distcp /user/output s3n://trendingtopics/exports
    • Hive can create tables directly over S3 filesystem:
      • create external table foo ... location 's3n://trendingtopics/hiveoutput’
  • 25. Hooking It All Together
  • 26. Ruby on Rails Front End
    • Run a development version of the app locally:
  • 27. Automate & Deploy with EC2onRails
    • EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache
    • Daily cron job on App server launches Hadoop
    • Hadoop EC2 boot script checks out latest trend detection code from Github
    • Capistrano tasks deploy new application code, maintain DB, cache, archives
  • 28. Cron Triggers Hadoop Jobs on EC2
    • Modify Cloudera boot script
  • 29. Bash Scripts Automate Daily Run
    • kicks off Hadoop/Hive jobs
    • Queries MySQL for last run date
    • Generates timelines with Hadoop streaming
    • Hive operates on Timeline data, computes Trends with Python
    • Hive output copied to S3
    • Calls remote script to load MySQL staging tables
    • Hadoop cluster self terminates
    • Remote script swaps staging and prod after load
  • 30. Capistrano Tasks
    • Capistrano: Ruby tool for automating tasks on one or more remote servers ( )
    • Deploys Rails code and cron jobs to App server
    • Cleanup tasks are triggered at end of load:
      • Flags featured Wikipedia articles
      • Indexes MySQL tables, Flushes cache
  • 31. Visualization APIs
    • Google Viz API: Annotated Timeline for Rails
    • Google Charts sparklines: gc4r
    • Yahoo BOSS image search in Rails
  • 32. Ideas & Next Steps
    • Try out the code on Github
    • Show related articles using link graph data
    • Extract trending phrases from article text
    • Boost Twitter trending topics with Wikipedia logs
    • Update top trends hourly
    • Use date partitioned tables, incremental updates
    • Add a real search engine (uses MySQL now)
  • 33. LinkedIn A nalytics Team We’re Hiring
  • 34. Questions?
    • Contact
      • Blog:
      • Twitter: @peteskomoroch
    • Code
      • Hadoop blog posts & tutorials on the Cloudera site
      • Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data
  • 35. Backup slides
  • 36. Wikipedia Views ~ Google Searches “ Dwight Howard” on TrendingTopics “ Dwight Howard” on Google Trends
  • 37. Wikipedia Log Analysis with Pig
    • Most Read articles Pig Latin example
    • Show junk Wikipedia pages that need filtering