[@IndeedEng] Large scale interactive analytics with Imhotep

Link to video: https://www.youtube.com/watch?v=IZ-kC6ut1Lg

In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. This has kept our engineering and product organizations focused on key metrics by analyzing test results. It also gives our marketing organization timely and accurate insight into our data - allowing us to identify opportunities, spot trends, and learn about our job seekers. In this talk, Zak Cocos, who leads our Marketing Sciences team, and Product Manager Tom Bergman will discuss and provide examples of the valuable insights that can be gained by using Imhotep with almost any data set.

Published in: Technology, Business

  1. 1. go.indeed.com/IndeedEngTalks
  2. 2. Large Scale Interactive Analytics with Imhotep
  3. 3. Tom Bergman Product Manager
  4. 4. Zak Cocos Manager Marketing Science
  5. 5. We help people get jobs.
  6. 6. What is Imhotep? Imhotep is a highly scalable analytics architecture for querying faceted datasets
  7. 7. Open sourcing Imhotep Imhotep will be an OPEN SOURCE highly scalable analytics architecture for querying faceted datasets
  8. 8. People Tools System Data
  9. 9. People Tools Data System
  10. 10. People Data Tools System
  11. 11. People Data Tools System
  12. 12. A Brief History of Analytics @Indeed
  13. 13. What's best for the job seeker?
  14. 14. Test & Measure EVERYTHING
  15. 15. Query
  16. 16. Query Location
  17. 17. Query Location Impression
  18. 18. Organic Impression Log Entry: Title: Front End Software Engineer; Position: 1; Clicked: 0; Country: US; Query: indeed software engineer; Location: austin; Timestamp: 2014-04-30T20:00:00
  19. 19. Analytics on Raw Logs
  20. 20. Ramses
  21. 21. Ramses ● Search logs ● Extract metrics from matches ● Graph aggregated metrics
  22. 22. Ramses ● Search logs ● Extract metrics from matches ● Graph aggregated metrics. Input -> Query and Metric. Output -> Aggregated metrics by bucket.
  23. 23. How many organic clicks did we have in Australia?
  24. 24. How many organic clicks did we have in Australia? QUERY country:au METRIC organic_clicks
  25. 25. How many organic clicks did we have in Australia?
  26. 26. Does test group A or B have more revenue?
  27. 27. Does test group A or B have more revenue? QUERY testgroup:A, testgroup:B METRIC revenue
  28. 28. Does test group A or B have more revenue?
  29. 29. How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?
  30. 30. How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan? QUERY from:yahoo AND country:(gb, de, jp) METRIC visits
  31. 31. How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?
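
The Ramses model from slides 21 to 31 (a query selects log entries, a metric is extracted from each match, and the results are aggregated by bucket) can be sketched roughly as follows. This is an illustration only, not Indeed's implementation: the `LOG` data, the hourly bucket size, and the `aggregate_by_hour` function are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log entries, loosely shaped like the entry on slide 18.
LOG = [
    {"country": "au", "clicked": 1, "timestamp": "2014-04-30T20:00:00"},
    {"country": "au", "clicked": 0, "timestamp": "2014-04-30T20:30:00"},
    {"country": "us", "clicked": 1, "timestamp": "2014-04-30T20:45:00"},
    {"country": "au", "clicked": 1, "timestamp": "2014-04-30T21:10:00"},
]

def aggregate_by_hour(entries, matches, metric):
    """Sum metric(entry) over entries that satisfy matches, per hourly bucket."""
    buckets = defaultdict(int)
    for entry in entries:
        if matches(entry):
            ts = datetime.fromisoformat(entry["timestamp"])
            # Truncate the timestamp to the start of its hour (assumed bucket size).
            bucket = ts.replace(minute=0, second=0)
            buckets[bucket.isoformat()] += metric(entry)
    return dict(buckets)

# Rough equivalent of: QUERY country:au METRIC organic_clicks
organic_clicks_au = aggregate_by_hour(
    LOG,
    matches=lambda e: e["country"] == "au",
    metric=lambda e: e["clicked"],
)
```

The result maps each bucket to one aggregated number, which is what a time-series graph needs.
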
  32. 32. Questions Ramses can’t answer ● How many unique queries in the US? ● What are the top 50 queries in the US? ● How many clicks did each of those queries receive?
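
The three questions on slide 32 each need a group-by over the values of a field rather than a single filtered total. A minimal sketch of that kind of aggregation, using made-up data and a hypothetical `query_stats` helper (nothing here is taken from Ramses or Imhotep):

```python
from collections import Counter

# Hypothetical impressions as (country, query, clicked) tuples.
IMPRESSIONS = [
    ("us", "software engineer", 1),
    ("us", "software engineer", 0),
    ("us", "nurse", 1),
    ("gb", "nurse", 1),
    ("us", "architecture", 0),
]

def query_stats(impressions, country, top_n=50):
    """Group impressions by query for one country and summarize them."""
    impressions_per_query = Counter()
    clicks_per_query = Counter()
    for imp_country, query, clicked in impressions:
        if imp_country == country:
            impressions_per_query[query] += 1
            clicks_per_query[query] += clicked
    top = [q for q, _ in impressions_per_query.most_common(top_n)]
    return {
        "unique_queries": len(impressions_per_query),   # distinct queries seen
        "top_queries": top,                             # most frequent first
        "clicks": {q: clicks_per_query[q] for q in top},
    }

stats = query_stats(IMPRESSIONS, "us", top_n=2)
```
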
  33. 33. Imhotep
  34. 34. Imhotep Origins: Began as a distributed iteration and group-by engine for building click prediction models.
  35. 35. Decision Tree Builder: We use an iterative algorithm to build decision trees level-by-level.
  36. 36. Imhotep Origins: Began as a distributed iteration and group-by engine for building click prediction models. Leveraged the ability to do massive group-bys and aggregates to make a real-time analytics engine.
  37. 37. How many Android App users with accounts older than 30 days saved at least 1 job in the past week?
  38. 38. What titles have the highest click-through rate for the query “Architecture” in the US? What about the lowest click-through rate?
  39. 39. For job seekers who click on Google jobs in Ireland, what other company’s jobs do they click on?
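
As one illustration of the questions on slides 37 to 39, the click-through-rate question on slide 38 comes down to a filter, a group-by on title, and a ratio of clicks to impressions. The sketch below uses invented data and a hypothetical `ctr_by_title` function; it shows the shape of the computation, not how Imhotep executes it.

```python
from collections import defaultdict

# Hypothetical impressions: (country, query, title, clicked).
IMPRESSIONS = [
    ("us", "architecture", "Architect", 1),
    ("us", "architecture", "Architect", 1),
    ("us", "architecture", "CAD Designer", 1),
    ("us", "architecture", "CAD Designer", 0),
    ("us", "architecture", "Software Architect", 0),
    ("us", "architecture", "Software Architect", 0),
    ("gb", "architecture", "Architect", 0),  # different country, filtered out
]

def ctr_by_title(impressions, query, country):
    """Return (title, click-through rate) pairs, highest rate first."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for imp_country, imp_query, title, was_clicked in impressions:
        if imp_country == country and imp_query == query:
            shown[title] += 1
            clicked[title] += was_clicked
    rates = {title: clicked[title] / shown[title] for title in shown}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

ranked = ctr_by_title(IMPRESSIONS, "architecture", "us")
highest, lowest = ranked[0], ranked[-1]
```
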
  40. 40. Zak Cocos Manager Marketing Science
  41. 41. I also help people get jobs.
  42. 42. Marketing Sciences: Research, analysis, and automation team supporting marketing initiatives
  43. 43. Imhotep Imhotep is a highly scalable, [soon to be] open source, analytics architecture for querying faceted datasets
  44. 44. Imhotep@Indeed Ad hoc exploration
  45. 45. Imhotep@Indeed Ad hoc exploration Specific analysis
  46. 46. Imhotep@Indeed Ad hoc exploration Specific analysis Extensible infrastructure
  47. 47. Ad hoc exploration Public Crunchbase Dataset Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  48. 48. Ad hoc exploration Public Crunchbase Dataset Document Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  49. 49. Ad hoc exploration Public Crunchbase Dataset Fields Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  50. 50. Ad hoc exploration Public Crunchbase Dataset Metric Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  51. 51. Imhotep Data Explorer: Interactive tool for exploring Imhotep data
  52. 52. Imhotep Data Explorer: Interactive tool for exploring Imhotep data. Also: a badass hyperlinked pivot table
  53. 53. Imhotep is Large Scale ● Total size of all indexes: 125TB ● Jobsearch index (largest): 30TB ● Over 48 billion documents
  54. 54. Query
  55. 55. Query Location
  56. 56. Query Location Organic Impression
  57. 57. Organic Impression A job that was displayed as the result of a search
  58. 58. Title
  59. 59. Company Information
  60. 60. Description
  61. 61. Job Age
  62. 62. abredistime acmetime addltime adsc adsdelay adsi badsc badsi boostojc boostoji bsjc bsjcwia bsji bsjindapplies bsjindappviews bsjrev bsjwia ckcnt cksz counts ctkage ctkagedays dayofweek dcpingtime domTotalTime ds-mpo dsmiss dstime featemp fj freekwac freekwarev freesjc freesjrev frmtime galatdelay iplat iplong jslatdelay jsvdelay kwac kwacdelay kwai kwarev kwcnt lacinsize lacsgsize lmstime mpotime mprtime navTotTime ndxtime ojc ojclong ojcshort ojcwia oji ojindapplies ojindappviews ojwia oocsc page prcvdlatency primfollowcnt prvwoji prvwojlat prvwojopentime prvwojreq radsc radsi recidlookupbudget rectime redirCount redirTime relfollowcnt respTime returnvisit rojc roji rqcnt rqlcnt rqqcnt rrsjc rrsji rrsjrev rsavail rsjc rsji rsused rsviable serpsize sjc sjcdelay sjclong sjcnt sjcshort sjcwia sji sjindapplies sjindappviews sjrev sjwia sllat sllong sqc sqi sugtime svj svjnostar svjstar tadsc tadsi time timeofday totcnt totfollowcnt totrev tottime tsjc tsjcwia tsji tsjindapplies tsjindappviews tsjrev tsjwia unqcnt vp wacinsize wacsgsize
  63. 63. Organic Impression Document: Title: Front End Software Engineer; Position: 1; Clicked: 0; Country: US; Query: indeed software engineer; Location: austin; Timestamp: 2014-04-30T20:00:00
  64. 64. Organic Impression Index: Title: Front End Software Engineer; Position: 1; Clicked: 0; Country: US; Query: indeed software engineer; Location: austin; Timestamp: 2014-04-30T20:00:00
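
The slides do not describe how documents are laid out inside an index. Purely as an assumption for illustration, the sketch below treats an index as a simple inverted index that maps each field and term to the ids of the documents containing it, which makes faceted filters cheap set intersections. `DOCS`, `build_index`, `lookup`, and `match_all` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical documents, modeled on the organic impression document on slide 63.
DOCS = [
    {"title": "Front End Software Engineer", "position": 1, "clicked": 0,
     "country": "US", "query": "indeed software engineer", "location": "austin"},
    {"title": "Java Developer", "position": 2, "clicked": 1,
     "country": "US", "query": "indeed software engineer", "location": "austin"},
    {"title": "Nurse", "position": 1, "clicked": 1,
     "country": "GB", "query": "nurse", "location": "london"},
]

def build_index(docs):
    """Map field -> term -> list of document ids (assumed layout)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(docs):
        for field, term in doc.items():
            index[field][term].append(doc_id)
    return index

def lookup(index, field, term):
    """Return ids of documents whose field equals term, without mutating the index."""
    return index.get(field, {}).get(term, [])

def match_all(index, conditions):
    """Return sorted ids of documents satisfying every field=term condition."""
    doc_sets = [set(lookup(index, f, t)) for f, t in conditions.items()]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

index = build_index(DOCS)
us_clicked = match_all(index, {"country": "US", "clicked": 1})
```
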
  65. 65. Imhotep Data Explorer can’t... ● Combine results from multiple datasets
  66. 66. Imhotep Data Explorer can’t... ● Combine results from multiple datasets ● Be easily automated
  67. 67. Imhotep Query Language (IQL)
  68. 68. IQL - Imhotep Query Language ● Can combine results from multiple datasets ● Allows for automation of data tools
  69. 69. IQL queries - required: Index, Date range, Metrics
  70. 70. IQL queries - optional: Filters, Group by (in addition to the required Index, Date range, Metrics)
  71. 71. IQL - Metrics: select count() from organic '2013-12-05' '2013-12-10' where country=ie and clicked=1 group by companyid
  72. 72. IQL - Indexes: select count() from organic '2013-12-05' '2013-12-10' where country=ie and clicked=1 group by companyid
  73. 73. IQL - Date Range: select count() from organic '2013-12-05' '2013-12-10' where country=ie and clicked=1 group by companyid
  74. 74. IQL - Filters: select count() from organic '2013-12-05' '2013-12-10' where country=ie and clicked=1 group by companyid
  75. 75. IQL - Groups: select count() from organic '2013-12-05' '2013-12-10' where country=ie and clicked=1 group by companyid
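
Because IQL is meant to make data tools automatable, a script might assemble queries from the parts named on slides 69 and 70. The helper below is a hypothetical sketch: `build_iql` is not part of any Imhotep client, the clause order simply copies the example on the slides, and the comma separators for multiple metrics or group-by fields are assumptions.

```python
def build_iql(metrics, index, start, end, filters=None, group_by=None):
    """Assemble an IQL-style query string in the clause order shown on the slides.

    metrics, index, start and end are required; filters and group_by are optional.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    parts = [
        "select " + ", ".join(metrics),
        f"from {index} '{start}' '{end}'",
    ]
    if filters:
        parts.append("where " + " and ".join(filters))
    if group_by:
        parts.append("group by " + ", ".join(group_by))
    return " ".join(parts)

query = build_iql(
    metrics=["count()"],
    index="organic",
    start="2013-12-05",
    end="2013-12-10",
    filters=["country=ie", "clicked=1"],
    group_by=["companyid"],
)
```

With these arguments the helper reproduces the example query from the slides; omitting `filters` and `group_by` leaves only the three required parts.
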
  76. 76. IQL Question: Do companies that have raised more than $10 million in Austin get more clicks on average than those that raised less than $10 million?
  77. 77. Methodology 1) organic index: select companies in the US which received organic clicks
  78. 78. Methodology 1) organic index: select companies in the US which received organic clicks 2) crunchbase index: select companies, and the amount of funding for companies receiving investments in Austin
  79. 79. Methodology 1) organic index: select companies in the US which received organic clicks 2) crunchbase index: select companies, and the amount of funding for companies receiving investments in Austin 3) Join, segment, and do the math!
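
The three methodology steps can be sketched as a join between two per-company result sets followed by a segmented average. Everything concrete below is invented for illustration: the company names, click counts, funding amounts, and the `average_clicks_by_funding` helper are assumptions, and real inputs would come from queries against the organic and crunchbase indexes.

```python
from statistics import mean

# Step 1 (assumed result): organic clicks per company.
organic_clicks = {"acme": 120, "globex": 40, "initech": 300, "umbrella": 10}

# Step 2 (assumed result): total funding in USD per company.
funding_usd = {
    "acme": 25_000_000,
    "globex": 2_000_000,
    "initech": 50_000_000,
    "hooli": 5_000_000,
}

def average_clicks_by_funding(clicks, funding, threshold=10_000_000):
    """Step 3: inner-join on company, segment by funding, average the clicks."""
    segments = {"above": [], "below": []}
    # Only companies present in both result sets take part in the join.
    for company in clicks.keys() & funding.keys():
        # Companies at exactly the threshold fall into "below" in this sketch.
        key = "above" if funding[company] > threshold else "below"
        segments[key].append(clicks[company])
    return {k: (mean(v) if v else None) for k, v in segments.items()}

averages = average_clicks_by_funding(organic_clicks, funding_usd)
```

Companies that appear in only one of the two inputs ("umbrella" and "hooli" here) drop out of the comparison, which is worth keeping in mind when reading the averages.
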
  80. 80. Tom Bergman Product Manager
  81. 81. I still help people get jobs.
  82. 82. Large Scale Interactive Analytics Platform ● 123 Unique Indexes ● Largest Index 30TB ● Total size ~125TB
  83. 83. Large Scale Interactive Analytics Platform IQL -> Largely Programmatic access ● approx 76k queries/day ● Avg time to execute 0.67 seconds Ramses -> Largely Human ● approx 3,400 queries/day ● Avg time to execute 4.4 seconds
  84. 84. Large Scale Interactive Analytics Platform Users ● 198 unique users in past month ● 25,622 unique queries in past month ● Avg 53 queries/user per day
  85. 85. Large Scale Interactive Analytics Platform 40+ internal clients ● 6 Analytics Webapps ● 5 dashboards ● 10 programming/scripting shells ● 6 monitoring apps ● … and more
  86. 86. Large Scale Interactive Analytics Platform One Tool-set for all data ● Website usage ● Operational Monitoring ● Financial Reporting ● Google Analytics ● Internal Webapp Usage ● External Reports
  87. 87. Solving a real problem
  88. 88. Providing the Best Results Show the jobs that are most interesting to our users
  89. 89. Providing the Best Results Clicks are a very good indicator of interest
  90. 90. Providing the Best Results Clicks are a very good indicator of interest More clicks -> More Relevant Fewer clicks -> Less Relevant
  91. 91. Architecture Very hard query to serve correctly
  92. 92. Architecture Very hard query to serve correctly Architecture terminology has been co-opted by technology
  93. 93. Terminology Common to both Software and Architecture Blueprint Design Framework Infrastructure Engineer Project manager Development Technical architect Software Modeling Computation Code reviews
  94. 94. Architecture vs Software Titles: Architect, CAD Designer, Project Manager vs Software Architect, UI Designer, Project Manager
  95. 95. Query Management Indeed uses Imhotep to improve matching
  96. 96. Query Management Indeed uses Imhotep to improve matching Automatically detect results that should be added or removed from queries
  97. 97. Query Management Indeed uses Imhotep to improve matching Automatically detect results that should be added or removed from queries 26,790 rules across all countries
  98. 98. Imhotep Open Source Imhotep Open Source ETA: August 1, 2014
  99. 99. Imhotep Open Source Follow along at our blog engineering.indeed.com Sign up for mailing list to get latest updates go.indeed.com/imhotep-announce
  100. 100. Q & A
  101. 101. Next @IndeedEng Talk Launching Indeed Around the World Davide Novelli, International Director David Tulig, Tech Lead May 28, 2014 http://engineering.indeed.com/talks
  102. 102. More Questions? Jason David James Jeff
