Prototyping Data Intensive Apps: TrendingTopics.org 09/29/09 1 Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch
Talk Outline TrendingTopics Overview Wikipedia Page View Dataset Hadoop on Amazon EC2 Loading Data on EC2: Amazon EBS & S3 Daily Timelines with Hadoop Streaming Hive Data Warehouse Layer Trend Computation with Hive Hooking It All Together Front End & Visualizations
Data Intensive Web Apps Batch data mining or prediction with Hadoop Iterate quickly with high level languages & tools Pig, Hive, Clojure, Cascading, Python, Ruby EC2: Get running with limited initial capital Use external data and APIs in novel ways Recent real world example: FlightCaster
TrendingTopics.org
Daily Pageview Timeline Charts
Detects Rising Trends with Hadoop
TrendingTopics is Open Source Built as a side project at Data Wrangling Core code completed over 2 weeks in June Code on Github Data released on Amazon Public Datasets
Technology Stack
Application Data Flow
Hadoop on Amazon EC2 EC2: Launch servers on demand S3: Simple Storage Service EBS: Persistent disks for EC2 Used Cloudera Hadoop Distribution Includes Pig, Hive, HBase, Sqoop, EBS integration Allows customization of Hadoop EC2 instances See EC2 Docs and Tutorials on the Cloudera site
Wikipedia Page View Log Data 1 TB of hourly log files from Wikipedia Squid proxy Filename contains timestamp Columns: project, pagename, pageviews, & bytes
Wikipedia Redirect Data Many articles in the logs redirect to canonical titles Amazon Public Dataset contains a lookup table:
Loading Data into Hadoop on EC2 Hadoop can read directly from Amazon S3 Pulling in data from EBS snapshots
Python Streaming for Daily Timelines
Streaming: Filter & Aggregate Logs
Hive Data Warehouse Layer on EC2 Easy for analysts familar with SQL Familar syntax (SELECT, GROUP BY, SORT...) Hide the MR muck, for JOIN, UNION, etc. Supports streaming UDFs Sampling: generate datasets for R&D Used on Timeline and Redirect data
Fixing Wiki Redirects with Hive JOIN
Trend Detection With Hadoop
Hourly Data: “Java” vs. “Hangover”
Robust Regression with R & Hadoop  Fit R model to hourly data for 2.3M Wiki articles “ Trending” if article views >> model prediction Page View “Spikes”
Call Python Trend Mapper from Hive
Simple Trend Calculation in Python
Use Trend Scores to Rank Articles
Exporting Data from our EC2 Cluster Output files are delimited text for bulk DB load Distcp outputs to S3: hadoop distcp /user/output s3n://trendingtopics/exports Hive can create tables directly over S3 filesystem: create external table foo ... location 's3n://trendingtopics/hiveoutput’
Hooking It All Together
Ruby on Rails Front End Run a development version of the app locally:
Automate & Deploy with EC2onRails EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache Daily cron job on App server launches Hadoop Hadoop EC2 boot script checks out latest trend detection code from Github Capistrano tasks deploy new application code, maintain DB, cache, archives
Cron Triggers Hadoop Jobs on EC2 Modify Cloudera hadoop-ec2-init-remote.sh boot script
Bash Scripts Automate Daily Run run_daily_merge.sh  kicks off Hadoop/Hive jobs Queries MySQL for last run date Generates timelines with Hadoop streaming  Hive operates on Timeline data, computes Trends with Python Hive output copied to S3 Calls remote script to load MySQL staging tables Hadoop cluster self terminates Remote script swaps staging and prod after load
Capistrano Tasks Capistrano: Ruby tool for automating tasks on one or more remote servers ( www.capify.org ) Deploys Rails code and cron jobs to App server Cleanup tasks are triggered at end of load: Flags featured Wikipedia articles Indexes MySQL tables, Flushes cache
Visualization APIs Google Viz API: Annotated Timeline for Rails Google Charts sparklines: gc4r  Yahoo BOSS image search in Rails
Ideas & Next Steps Try out the code on Github Show related articles using link graph data Extract trending phrases from article text Boost Twitter trending topics with Wikipedia logs Update top trends hourly Use date partitioned tables, incremental updates Add a real search engine (uses MySQL now)
LinkedIn  A nalytics  Team We’re Hiring
Questions? Contact Blog:  datawrangling.com Twitter: @peteskomoroch Code http://github.com/datawrangling/trendingtopics Hadoop blog posts & tutorials on the Cloudera site Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data  http://bit.ly/EMRsimilarity
Backup slides
Wikipedia Views ~ Google Searches “ Dwight Howard” on TrendingTopics “ Dwight Howard” on Google Trends
Wikipedia Log Analysis with Pig Most Read articles Pig Latin example Show junk Wikipedia pages that need filtering

Hw09 Building Data Intensive Apps A Closer Look At Trending Topics.Org

  • 1.
    Prototyping Data IntensiveApps: TrendingTopics.org 09/29/09 1 Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch
  • 2.
    Talk Outline TrendingTopicsOverview Wikipedia Page View Dataset Hadoop on Amazon EC2 Loading Data on EC2: Amazon EBS & S3 Daily Timelines with Hadoop Streaming Hive Data Warehouse Layer Trend Computation with Hive Hooking It All Together Front End & Visualizations
  • 3.
    Data Intensive WebApps Batch data mining or prediction with Hadoop Iterate quickly with high level languages & tools Pig, Hive, Clojure, Cascading, Python, Ruby EC2: Get running with limited initial capital Use external data and APIs in novel ways Recent real world example: FlightCaster
  • 4.
  • 5.
  • 6.
  • 7.
    TrendingTopics is OpenSource Built as a side project at Data Wrangling Core code completed over 2 weeks in June Code on Github Data released on Amazon Public Datasets
  • 8.
  • 9.
  • 10.
    Hadoop on AmazonEC2 EC2: Launch servers on demand S3: Simple Storage Service EBS: Persistent disks for EC2 Used Cloudera Hadoop Distribution Includes Pig, Hive, HBase, Sqoop, EBS integration Allows customization of Hadoop EC2 instances See EC2 Docs and Tutorials on the Cloudera site
  • 11.
    Wikipedia Page ViewLog Data 1 TB of hourly log files from Wikipedia Squid proxy Filename contains timestamp Columns: project, pagename, pageviews, & bytes
  • 12.
    Wikipedia Redirect DataMany articles in the logs redirect to canonical titles Amazon Public Dataset contains a lookup table:
  • 13.
    Loading Data intoHadoop on EC2 Hadoop can read directly from Amazon S3 Pulling in data from EBS snapshots
  • 14.
    Python Streaming forDaily Timelines
  • 15.
    Streaming: Filter &Aggregate Logs
  • 16.
    Hive Data WarehouseLayer on EC2 Easy for analysts familar with SQL Familar syntax (SELECT, GROUP BY, SORT...) Hide the MR muck, for JOIN, UNION, etc. Supports streaming UDFs Sampling: generate datasets for R&D Used on Timeline and Redirect data
  • 17.
    Fixing Wiki Redirectswith Hive JOIN
  • 18.
  • 19.
    Hourly Data: “Java”vs. “Hangover”
  • 20.
    Robust Regression withR & Hadoop Fit R model to hourly data for 2.3M Wiki articles “ Trending” if article views >> model prediction Page View “Spikes”
  • 21.
    Call Python TrendMapper from Hive
  • 22.
  • 23.
    Use Trend Scoresto Rank Articles
  • 24.
    Exporting Data fromour EC2 Cluster Output files are delimited text for bulk DB load Distcp outputs to S3: hadoop distcp /user/output s3n://trendingtopics/exports Hive can create tables directly over S3 filesystem: create external table foo ... location 's3n://trendingtopics/hiveoutput’
  • 25.
  • 26.
    Ruby on RailsFront End Run a development version of the app locally:
  • 27.
    Automate & Deploywith EC2onRails EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache Daily cron job on App server launches Hadoop Hadoop EC2 boot script checks out latest trend detection code from Github Capistrano tasks deploy new application code, maintain DB, cache, archives
  • 28.
    Cron Triggers HadoopJobs on EC2 Modify Cloudera hadoop-ec2-init-remote.sh boot script
  • 29.
    Bash Scripts AutomateDaily Run run_daily_merge.sh kicks off Hadoop/Hive jobs Queries MySQL for last run date Generates timelines with Hadoop streaming Hive operates on Timeline data, computes Trends with Python Hive output copied to S3 Calls remote script to load MySQL staging tables Hadoop cluster self terminates Remote script swaps staging and prod after load
  • 30.
    Capistrano Tasks Capistrano:Ruby tool for automating tasks on one or more remote servers ( www.capify.org ) Deploys Rails code and cron jobs to App server Cleanup tasks are triggered at end of load: Flags featured Wikipedia articles Indexes MySQL tables, Flushes cache
  • 31.
    Visualization APIs GoogleViz API: Annotated Timeline for Rails Google Charts sparklines: gc4r Yahoo BOSS image search in Rails
  • 32.
    Ideas & NextSteps Try out the code on Github Show related articles using link graph data Extract trending phrases from article text Boost Twitter trending topics with Wikipedia logs Update top trends hourly Use date partitioned tables, incremental updates Add a real search engine (uses MySQL now)
  • 33.
    LinkedIn Analytics Team We’re Hiring
  • 34.
    Questions? Contact Blog: datawrangling.com Twitter: @peteskomoroch Code http://github.com/datawrangling/trendingtopics Hadoop blog posts & tutorials on the Cloudera site Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data http://bit.ly/EMRsimilarity
  • 35.
  • 36.
    Wikipedia Views ~Google Searches “ Dwight Howard” on TrendingTopics “ Dwight Howard” on Google Trends
  • 37.
    Wikipedia Log Analysiswith Pig Most Read articles Pig Latin example Show junk Wikipedia pages that need filtering