
Demand, Media, and Search Analytics at AOL


Presented at Hadoop-DC on October 4, 2011.

Comment: There is a mistake on slide #15. It should read: 'Calculate a score based on the edge weights,' not 'Calculate a score based on the graph vectors.'

  1. Demand, Media, and Search Analytics
     Sean Timm
     sean.timm@teamaol.com
     Twitter: @timmsc
     October 4, 2011
  2. Introduction
     • Who am I?
     • What do we use Hadoop for?
     • Our best practices
     • Lessons learned
     • Example applications: related searches and seasonality
  3. History
     • Originated in Search Backend in 2007
     • Create data-driven products for search.aol.com from search logs
     • No Netezza experience, decided to try Hadoop
     • Took 3 weeks to write a simple aggregation
     • Apache Pig 0.3: 2 days
     • First product, related searches, launched in 2008
     • Search breaking trends product led to further demand work
     • Now Pig 0.8.1 and Hadoop 0.20.2
  4. Data
     Hourly search.aol.com logs
     • 5 M log lines of data per hour
     • Logs include searches, clicks, and other data
     • 70% of queries we only see once
     Hourly Wikipedia page view data
     • Public data set: http://dammit.lt/wikistats
     • 7 M pages viewed per hour
     • 2.7 M English pages per hour
     BeacoN logs
     • Page view and click logs for AOL Huffington Post Media, Patch, and other AOL properties
  5. We like Pig!
     • Hourly, daily, and monthly search and click aggregation
     • Related searches
     • Auto complete dictionary
     • Mining spelling correction click through
     • Temporal pattern analysis
     • Classifying adult queries and URLs
     • Categorizing queries
     • Identifying queries in the form of a question or superlative
     • Identifying breaking trends in AOL Search and Wikipedia page views
     • Identifying queries of local interest
     • Clustering queries using click graph, temporal distance, Carrot2, k-means
     • AOL HPMG stats and trends for page views, authors, tags, etc.
  6. Pig Process in General
     Script run time: < 2 minutes to > 2 hours
     Ad hoc…wild west
     Complex shell scripts
     1. Load/copy/back up data
     2. Launch multiple Pig scripts, some in parallel, some with serial dependencies
     3. Check for errors; on failure, e-mail and halt
     4. Load data into MySQL, Vertica, or Solr
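The four steps above can be sketched as a small driver. This is only an illustration in Python, not the shell scripts the slide refers to: `run_pipeline`, `StepFailed`, the `notify` hook (standing in for the e-mail step), and the script and file names in `EXAMPLE_STAGES` are all hypothetical. Commands inside one stage run in parallel; stages run serially, and a failure stops everything after it.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


class StepFailed(Exception):
    """Raised when any command in a stage exits with a non-zero status."""


def run_command(command):
    """Run one command (a list of arguments) and return its exit status."""
    return subprocess.run(command).returncode


def run_pipeline(stages, runner=run_command, notify=print):
    """Run stages in order; commands within a stage run in parallel.

    If any command in a stage fails, call notify (a stand-in for sending
    e-mail) and raise StepFailed so that later stages never start.
    """
    for stage in stages:
        with ThreadPoolExecutor(max_workers=max(1, len(stage))) as pool:
            statuses = list(pool.map(runner, stage))
        failed = [cmd for cmd, status in zip(stage, statuses) if status != 0]
        if failed:
            notify(f"pipeline halted, failed commands: {failed}")
            raise StepFailed(failed)


# Hypothetical layout: one load step, two independent Pig scripts,
# then a script that depends on both of them.
EXAMPLE_STAGES = [
    [["hadoop", "fs", "-put", "hourly.log", "/logs/"]],
    [["pig", "aggregate.pig"], ["pig", "related.pig"]],
    [["pig", "trends.pig"]],
]
```

Calling `run_pipeline(EXAMPLE_STAGES)` would need a working Hadoop and Pig installation; passing a different `runner` makes the control flow testable without one.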
  7. Getting Data out of Hadoop
     First approach: special StoreFunc to write directly to MySQL/Solr
     • Network: required the master to be on the same network as the cluster
     • Speculative execution: data would be written more than once, increasing contention as well as doing unnecessary writes
     • Replication: writing to the master happened in parallel, but serial replication was slow (MySQL)
     • Timeouts: occasionally a task failed and restarted (Solr)
  8. Getting Data out of Hadoop
     MySQL/Vertica now
     • Write data to HDFS
     • Copy from HDFS to local file system using CLI
     • Load into database: LOAD DATA LOCAL INFILE from mysql client
     Solr now
     • Custom StoreFunc writes Solr XML to HDFS
     • Starting with Pig 0.7, fields are named using the Pig schema
     • Copy from HDFS to local file system using CLI
     • Load into Solr using remote streaming
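To make the Solr path concrete, here is a rough sketch of the Solr XML update message that such output would contain. It is written in Python for illustration and is not the custom StoreFunc itself; the function name `to_solr_add_xml` and the field names used in the usage note are assumptions. Field names are taken from a schema-like list, mirroring the point about naming fields from the Pig schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(field_names, rows):
    """Build a Solr <add> update message from rows of field values.

    field_names plays the role of the schema: the i-th value of each row
    becomes a <field> element named after the i-th schema entry.
    """
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in zip(field_names, row):
            if value is None:
                continue  # skip null fields rather than indexing "None"
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # ElementTree escapes characters such as & and < in field text.
    return ET.tostring(add, encoding="unicode")
```

For example, `to_solr_add_xml(["id", "query"], [("1", "tomato soup")])` yields one `<doc>` with an `id` field and a `query` field inside a single `<add>` element.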
  9. UDFs
     • Use Piggy Bank and builtins when possible
     • 89 custom UDFs packaged in a single jar
     • Most are simple
         • Validate a URL, URL decode a string, calculate a hash value, date math, etc.
     • Some are complex
         • Spell check/correct, LOESS regression, Carrot2 clustering, FFT, Euclidean distance, etc.
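The simple UDFs named above are each only a few lines of logic. The sketch below shows that logic as plain Python functions; it is not the contents of the jar described on the slide, and the function names, the choice of SHA-256 for the hash, and the rule that a valid URL needs an http or https scheme are all assumptions made for the example.

```python
import hashlib
import math
from urllib.parse import unquote_plus, urlparse


def is_valid_url(text):
    """Return True for an http(s) URL with a host part (assumed rule)."""
    try:
        parts = urlparse(text)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def url_decode(text):
    """Decode %XX escapes and '+' (as space) in a URL-encoded string."""
    return unquote_plus(text)


def hash_value(text):
    """Return a stable hex digest of the text (SHA-256 chosen arbitrarily)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```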
  10. Lessons learned
      • Prefer one larger categorization script over many small ones
      • Set priority on large, time-sensitive jobs that fight for resources with other jobs
      • Fair scheduler
      • Tuning the cluster for maps or reduces
      • Don't write copious debug output
      • Use an appropriate number of reducers (PARALLEL)
  11. Related Searches
      Group by Query
  12. Challenges
      • Adult terms
      • Misspellings
      • Breadth of suggestions
      • Coverage
      • Timeliness of suggestions
  13. Process Flow
      • Filter and clean data
          • Block adult terms, long queries, non-alpha, second+ pages, operators, URL-like queries, search spam
          • Lower case
      • Join to get query-related query groups
      • Contextual spell correct within group
      • Cluster related queries and pick the best from each group
      • Load into Solr
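The first two steps of this flow (filter and clean, then join) can be sketched as follows. The slides do not say what key the join uses, so this example assumes log records of the form `(session_id, query)` and pairs queries that share a session; that join key, the length cutoff, the regular expressions, and the function names are all hypothetical. Spell correction, clustering, and the Solr load are not shown.

```python
import re
from collections import Counter, defaultdict
from itertools import permutations

MAX_QUERY_CHARS = 80  # hypothetical cutoff for "long queries"
URL_LIKE = re.compile(r"^(https?://|www\.)|\.(com|net|org)\b")
OPERATOR = re.compile(r"\b(site|inurl|intitle):")


def clean_query(query, blocked_terms):
    """Lower-case and normalize a query; return None if it is filtered out."""
    q = " ".join(query.lower().split())
    if not q or len(q) > MAX_QUERY_CHARS:
        return None
    if not any(ch.isalpha() for ch in q):
        return None  # non-alpha queries such as "12345"
    if URL_LIKE.search(q) or OPERATOR.search(q):
        return None
    if any(term in q.split() for term in blocked_terms):
        return None
    return q


def related_query_counts(records, blocked_terms=frozenset()):
    """Count (query, related query) pairs that occur in the same session.

    This is a self-join on session_id, which is an assumed join key.
    """
    by_session = defaultdict(set)
    for session_id, query in records:
        q = clean_query(query, blocked_terms)
        if q is not None:
            by_session[session_id].add(q)
    counts = Counter()
    for queries in by_session.values():
        for a, b in permutations(sorted(queries), 2):
            counts[(a, b)] += 1
    return counts
```

Grouping the resulting pairs by their first element gives, for each query, a list of candidate related queries with a co-occurrence count that later steps could use for ranking.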
  14. Related Searches Graph
      [Figure: graph with nodes “The Eagles”, Hotel California, The band, NFL, Tribute, Boston College]
  15. Classification
      • Supervised learning
      • Provide categorized set of queries and/or URLs
      • Calculate a score based on the edge weights
      • If the score exceeds a specified threshold, the query or URL is tagged with the category
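One way to read the scoring step is sketched below. The slides do not give the formula, so this is an assumption: each node (a query or URL) has weighted edges to its neighbors in a click graph, and its score is the share of edge weight that leads to neighbors already labeled with the category. The names `category_score` and `classify` and the example weights are hypothetical.

```python
def category_score(edges, labeled):
    """Share of a node's total edge weight that goes to labeled neighbors.

    edges maps neighbor -> edge weight (for example, click counts).
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    return sum(w for node, w in edges.items() if node in labeled) / total


def classify(graph, labeled, threshold):
    """Return the nodes whose score exceeds the threshold."""
    return {
        node
        for node, edges in graph.items()
        if category_score(edges, labeled) > threshold
    }
```

With `graph = {"q1": {"u1": 8, "u2": 2}}` and `labeled = {"u1"}`, `q1` scores 0.8 and is tagged at a threshold of 0.5. Newly tagged nodes could be added to the labeled set and the pass repeated, though the slides do not say whether that is done.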
  16. Applications Outside of Search
      • Author/citation bipartite graph
      • Social network graphs
      • User/Page view graphs
  17. Temporal traffic correlation of Wikipedia Page Views
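The slide does not state which correlation measure is used. As a minimal sketch, assuming each page is represented by a list of hourly view counts over the same hours, the Pearson correlation coefficient is one common choice; the function name `pearson` is an assumption, and other measures mentioned in the deck (such as Euclidean distance between series) would be used the same way.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series of page view counts.

    Returns 0.0 when either series is constant, since the coefficient is
    undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Pages whose hourly series score close to 1 rise and fall together, which is the kind of relationship the slide title describes.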
  18. Tomato Seasonality
      May: planting tomatoes, tomato cages, types of tomatoes
      June: pruning tomato plants
      July: tomato diseases, tomato blight, tomato worm
      August: tomato recipes, tomato soup, tomato sauce, tomato salsa
      September: sun dried tomatoes, canning and freezing tomatoes
      October: green tomato recipes
