[@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Video available at: http://www.youtube.com/watch?v=MFilAoiV5nE

Decision trees are a widely used machine learning technique for supervised classification. Indeed's data sets consist of tens of billions of documents with millions of distinct features. Since decision trees back some of our most important features, we built a custom distributed system to efficiently train them. Every day, we now build dozens of decision trees across this data. This same system now powers our internal analytical tools that enable quick data-driven decision-making at Indeed.

This presentation provides a brief introduction to decision trees followed by a detailed overview of our approach to building them. The talk will be presented by our CTO, Andrew Hudson.

  1. 1. go.indeed.com/IndeedEngTalks
  2. 2. Machine Learning at Indeed Scaling Decision Trees
  3. 3. Andrew Hudson CTO
  4. 4. I help people get jobs.
  5. 5. Indeed is a Search Engine for Jobs
  6. 6. Which jobs to show?
  7. 7. 18,749 jobs
  8. 8. Which jobs to show? Maximize job seeker’s chance to get the job
  9. 9. Which jobs to show? Maximize job seeker’s chance to get the job ● Will job seeker click on the job? ● Is the job still available? ● Will job seeker apply to the job? ● Is job seeker qualified for the job?
  10. 10. Which jobs to show? Maximize job seeker’s chance to get the job ● Will job seeker click on the job? ● Is the job still available? ● Will job seeker apply to the job? ● Is job seeker qualified for the job?
  11. 11. How? Log job seeker behavior Analyze logs, what best explains why they clicked on some jobs and not on others? May help predict future behavior
  12. 12. How? Log job seeker behavior Analyze logs, what best explains why they clicked on some jobs and not on others? May help predict future behavior Supervised learning
  13. 13. Supervised Learning Approaches Neural networks Bayesian methods Decision trees Genetic programming Logistic model tree Nearest neighbor Support Vector Machines Random forests Boosting Bagging Regression Ensemble methods
  14. 14. Supervised Learning Approaches Neural networks Bayesian methods Decision trees Genetic programming Logistic model tree Nearest neighbor Support Vector Machines Random forests Boosting Bagging Regression Ensemble methods
  15. 15. Supervised Learning Approaches Decision trees Genetic programming Logistic model tree Random forests Bagging Boosting Ensemble methods
  16. 16. Decision Trees
  17. 17. What is a Decision Tree? A tree-like structure that presents a relevant sequence of questions which determine a path and ultimately some outcome or prediction
  18. 18. I’m Thinking About Buying a Laptop
  19. 19. I’m Thinking About Buying a Laptop Is quality important?
  20. 20. I’m Thinking About Buying a Laptop Is quality important? NO ASUS
  21. 21. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has
  22. 22. I’m Thinking About Buying a Laptop Is quality important? YES Want to run linux? NO ASUS or whatever woot has
  23. 23. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has YES Want to run linux? NO MACBOOK
  24. 24. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has YES Want to run linux? YES LENOVO NO MACBOOK
  25. 25. I’m Thinking About Buying a Laptop Is quality important? YES NO ASUS or whatever woot has IDGAF DELL Want to run linux? YES NO MACBOOK HELLYES SYSTEM76 LENOVO
  26. 26. Benefits of Decision Trees Algorithm relatively simple to understand and implement Model produced also human understandable
  27. 27. Decision Tree Learning Programmatic creation of decision trees
  28. 28. Decision Tree Learning Given a set of documents, split it into two or more subsets that optimize some criteria Repeat this process until a set can no longer be split
  29. 29. Titanic Example 1309 passengers 500 survivors 38.2% survival rate What best explains who survived?
  30. 30. What best explains who survived? class class of ticket; first, second or third fsize family size; number of family members onboard gender male or female
  31. 31. 1309 passengers 500 survivors 38.2% survival
  32. 32. class = 1 1309 passengers 500 survivors 38.2% survival
  33. 33. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival
  34. 34. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival
  35. 35. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = ?
  36. 36. Score conditional entropy
  37. 37. Conditional Entropy as Score lower conditional entropy ↓ less uncertainty about prediction based on term
  38. 38. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = 0.6267
  39. 39. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = 0.6267 Best Score: 0.6267, class = 1
  40. 40. class = 1 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  41. 41. class ≤ 2 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  42. 42. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  43. 43. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6267, class = 1
  44. 44. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Score = 0.6244 class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6267, class = 1
  45. 45. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Score = 0.6244 class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6244, class ≤ 2
  46. 46. class ≠ 3 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival class = 3 709 passengers 181 survivors 25.5% survival Score = 0.6244 Best Score: 0.6244, class ≤ 2
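The slides name conditional entropy as the score but do not spell out the formula. The sketch below assumes the usual definition: the size-weighted average of the binary entropy on each side of the split, measured with the natural logarithm. The function names are illustrative. Under those assumptions the results come out close to the 0.6267 (class = 1) and 0.6244 (class ≤ 2) shown on the slides.

```python
import math


def binary_entropy(p):
    """Entropy, in nats, of a yes/no outcome that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def conditional_entropy(count, vsum, total_count, total_value):
    """Entropy of the outcome given which side of the split a document lands on.

    count / vsum: documents and positive outcomes on the matching side.
    total_count / total_value: the same figures for the whole group.
    """
    sides = ((count, vsum), (total_count - count, total_value - vsum))
    score = 0.0
    for n, positives in sides:
        if n > 0:
            score += (n / total_count) * binary_entropy(positives / n)
    return score


# Passenger and survivor counts taken from the Titanic slides.
score_class_eq_1 = conditional_entropy(323, 200, 1309, 500)
score_class_le_2 = conditional_entropy(600, 319, 1309, 500)
print(f"class = 1:  {score_class_eq_1:.4f}")
print(f"class <= 2: {score_class_le_2:.4f}")
```

Lower is better, so class ≤ 2 replaces class = 1 as the best split at this point in the walkthrough.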
  47. 47. gender = female 1309 passengers 500 survivors 38.2% survival Best Score: 0.6244, class ≤ 2
  48. 48. gender = female 466 passengers 339 survivors 72.7% survival 1309 passengers 500 survivors 38.2% survival Best Score: 0.6244, class ≤ 2
  49. 49. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Best Score: 0.6244, class ≤ 2
  50. 50. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Score = 0.5525 Best Score: 0.6244, class ≤ 2
  51. 51. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Score = 0.5525 Best Score: 0.5525, gender=f
  52. 52. fsize ≠ 0 519 passengers 261 survivors 50.3% survival 1309 passengers 500 survivors 38.2% survival fsize = 0 790 passengers 239 survivors 30.3% survival Score = 0.6448 Best Score: 0.5525, gender=f
  53. 53. Best Score: 0.5525, gender=f
  54. 54. 19.1% survival 72.7% survival
  55. 55. gender=male 843 passengers 161 survivors 19.1% survival
  56. 56. class = 1 179 passengers 61 survivors 34.1% survival gender=male 843 passengers 161 survivors 19.1% survival
  57. 57. class = 1 179 passengers 61 survivors 34.1% survival gender=male 843 passengers 161 survivors 19.1% survival class ≠ 1 664 passengers 100 survivors 15.1% survival Score = 0.4700
  58. 58. class = 1 class ≠ 1
  59. 59. class = 1 class ≠ 1
  60. 60. 34.1% survival 15.1% survival
  61. 61. 38.2%
  62. 62. 38.2% 19.1% 72.7% MALE FEMALE
  63. 63. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% CLASS≠1 CLASS=1
  64. 64. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% CLASS≠1 CLASS=1 13.1% 33.9% FSIZE≠2 FSIZE=2
  65. 65. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% 49.1% 93.2% CLASS≠1 CLASS=1 CLASS>2 CLASS<=2 13.1% 33.9% FSIZE≠2 FSIZE=2
  66. 66. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% 49.1% 93.2% CLASS≠1 CLASS=1 CLASS>2 CLASS<=2 13.1% 33.9% 24.4% 54.9% FSIZE≠2 FSIZE=2 FSIZE>2 FSIZE<=2
  67. 67. Predicting Click Probabilities Passenger → Job Impression Survived → Clicked on Job For each candidate job, follow path through tree then take click through rate of terminal node
  68. 68. Simplified Decision Tree for query="sales" NO account sales NO NO 1.9% manager YES YES YES 3.8% NO 2.1% manager representative NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  69. 69. job title = “sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  70. 70. job title = “account executive” NO account account sales YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  71. 71. job title = “outside sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  72. 72. job title = “sales associate” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  73. 73. job title = “inside sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  74. 74. job title = “sales manager” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 1.8% 2.9% NO outside YES 5.1% NO NO service YES 2.9% inside YES 4.4% 4.6%
  75. 75. job title = “sales consultant” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  76. 76. job title = “store manager” NO NO NO account YES YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES sales 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  77. 77. job title = “service sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  78. 78. job title = “customer service representative” NO NO NO account YES YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES sales 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  79. 79. Final CTR Predictions 5.1% 4.6% 4.4% 3.8% 2.9% 2.9% 2.6% 2.1% 1.9% 1.8% outside sales representative sales representative inside sales representative account executive sales manager service sales representative sales consultant store manager customer service representative sales associate
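Prediction is a walk from the root to a terminal node, returning the rate stored there. Here is a minimal sketch that uses the Titanic survival rates from the earlier slides as stand-in data. The `Node` class and `predict` function are hypothetical names, not part of the system described in the talk. For job impressions, the stored rate would be a click-through rate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    rate: float                                        # observed rate at this node
    condition: Optional[Callable[[dict], bool]] = None  # None marks a terminal node
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def predict(node, doc):
    """Follow the path chosen by each condition and return the terminal node's rate."""
    while node.condition is not None:
        node = node.yes if node.condition(doc) else node.no
    return node.rate


# Root splits on gender; the male branch splits again on ticket class.
tree = Node(
    0.382,
    condition=lambda d: d["gender"] == "f",
    yes=Node(0.727),
    no=Node(
        0.191,
        condition=lambda d: d["class"] == 1,
        yes=Node(0.341),
        no=Node(0.151),
    ),
)

print(predict(tree, {"gender": "m", "class": 1}))
print(predict(tree, {"gender": "f", "class": 3}))
```

Ranking candidates then amounts to sorting them by the predicted rate, as in the final CTR list.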
  80. 80. Single Machine Implementation
  81. 81. Overview
  82. 82. Tree Building Strategies One node at a time - depth first - breadth first
  83. 83. 1 Depth First
  84. 84. 1 2 3 Depth First
  85. 85. 1 2 4 3 5 Depth First
  86. 86. 1 2 5 4 6 3 7 Depth First
  87. 87. 1 2 5 4 6 3 7 Depth First
  88. 88. 1 2 5 4 6 3 7 Depth First
  89. 89. 1 2 5 4 6 3 7 Depth First
  90. 90. 1 2 5 4 6 3 8 7 Depth First 9
  91. 91. 1 Breadth First
  92. 92. 1 2 3 Breadth First
  93. 93. 1 2 4 3 5 Breadth First
  94. 94. 1 2 4 3 5 6 Breadth First 7
  95. 95. 1 2 5 4 8 3 6 9 Breadth First 7
  96. 96. 1 2 5 4 8 3 6 9 Breadth First 7
  97. 97. 1 2 5 4 8 3 9 6 10 7 11 Breadth First
  98. 98. 1 2 5 4 8 3 9 6 10 7 11 Breadth First 12 13
  99. 99. Tree Building Strategies One node at a time - depth first - breadth first One layer at a time, all nodes simultaneous
  100. 100. 1
  101. 101. 1 iter #1
  102. 102. 1 iter #1 2 3
  103. 103. 1 iter #1 2 iter #2 3
  104. 104. 1 iter #1 2 3 iter #2 4 5 6 7
  105. 105. 1 iter #1 2 3 iter #2 4 iter #3 5 6 7
  106. 106. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
  107. 107. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
  108. 108. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 iter #4 9 0 10 11 12 13
  109. 109. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
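The three strategies differ only in the order in which nodes get built. The sketch below shows those orders on a small complete binary tree. It assumes heap-style node ids (children of `n` are `2n` and `2n+1`), which is a convenience for this example and not the numbering used on the slides.

```python
from collections import deque

LAST_NODE = 7  # complete binary tree with three layers: nodes 1..7


def children(node):
    return [c for c in (2 * node, 2 * node + 1) if c <= LAST_NODE]


def depth_first(root=1):
    """One node at a time, finishing a whole branch before moving on."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children(node)))
    return order


def breadth_first(root=1):
    """One node at a time, finishing a whole layer before moving on."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children(node))
    return order


def layers(root=1):
    """One layer per iteration: every node in the layer is handled in the same pass."""
    result, current = [], [root]
    while current:
        result.append(current)
        current = [c for node in current for c in children(node)]
    return result


print(depth_first())    # [1, 2, 4, 5, 3, 6, 7]
print(breadth_first())  # [1, 2, 3, 4, 5, 6, 7]
print(layers())         # [[1], [2, 3], [4, 5, 6, 7]]
```

The layer-at-a-time variant needs three passes over the data for this tree, where the node-at-a-time variants need seven.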
  110. 110. Data Format
         id  class  fsize  gender  survived
          0      1      0       f         1
          1      1      3       m         1
          2      1      3       f         0
          3      1      3       m         0
          4      1      3       f         0
          5      1      0       m         1
          6      1      1       f         1
          7      1      0       m         0
          8      1      2       f         1
          9      1      0       m         0
         10      1      1       m         0
         11      1      1       f         1
         12      1      0       f         1
         13      1      0       f         1
         14      1      0       m         1
         15      1      0       m         0
         16      1      1       m         0
         17      1      1       f         1
         18      1      0       f         1
         19      1      0       m         0
         ….
  111. 111. Data Format Create an inverted index Key to efficiently building one layer at a time
  112. 112. Inverted Index Maps terms to the list of documents that contain that term Terms and docs stored in sorted order
  113. 113. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606….
  114. 114. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Field
  115. 115. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Term
  116. 116. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  117. 117. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  118. 118. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  119. 119. Inverted Index fsize=0 → 0,5,7,9,12,13,14,15,18,19,22…. fsize=1 → 6,10,11,16,17,26,27,36,49,50…. fsize=2 → 8,20,21,42,76,77,78,79,81,82…. fsize=3 → 1,2,3,4,54,55,56,57,90,339…. fsize=4 → 249,250,251,252,253,449,806…. ….
  120. 120. Inverted Index gender=f → 0,2,4,6,8,11,12,13,17,18,21…. gender=m → 1,3,5,7,9,10,14,15,16,19,20….
  121. 121. Inverted Index survived=0 → 2,3,4,7,9,10,15,16,19,25…. survived=1 → 0,1,5,6,8,11,12,13,14,17….
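An in-memory stand-in for the inverted index can be sketched with plain dictionaries. This is only an illustration built from the first 20 rows of the Data Format slide. A real deployment would use an index implementation such as Lucene or Flamdex, and `build_inverted_index` is a hypothetical helper name.

```python
FIELDS = ("class", "fsize", "gender", "survived")

# Rows 0..19 from the Data Format slide: (class, fsize, gender, survived).
ROWS = [
    (1, 0, "f", 1), (1, 3, "m", 1), (1, 3, "f", 0), (1, 3, "m", 0), (1, 3, "f", 0),
    (1, 0, "m", 1), (1, 1, "f", 1), (1, 0, "m", 0), (1, 2, "f", 1), (1, 0, "m", 0),
    (1, 1, "m", 0), (1, 1, "f", 1), (1, 0, "f", 1), (1, 0, "f", 1), (1, 0, "m", 1),
    (1, 0, "m", 0), (1, 1, "m", 0), (1, 1, "f", 1), (1, 0, "f", 1), (1, 0, "m", 0),
]


def build_inverted_index(rows, fields=FIELDS):
    """Map field -> term -> ascending list of doc ids that contain the term."""
    index = {field: {} for field in fields}
    for doc_id, row in enumerate(rows):  # doc ids arrive in order, so lists stay sorted
        for field, term in zip(fields, row):
            index[field].setdefault(term, []).append(doc_id)
    # Keep terms in sorted order within each field as well.
    return {field: dict(sorted(terms.items())) for field, terms in index.items()}


index = build_inverted_index(ROWS)
for field, terms in index.items():
    for term, docs in terms.items():
        print(f"{field}={term} -> {docs}")
```

The doc lists for `fsize`, `gender`, and `survived` match the prefixes shown on the slides.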
  122. 122. Inverted Index Implementations Lucene Flamdex
  123. 123. Primary Lookup Tables groups[doc] Where in the tree each doc is Initialized to all ones, all docs start in root values[doc] Value to be classified, for each doc In this case it’s 1 if survived, 0 otherwise
  124. 124. Primary Lookup Tables values[doc] Constructed from an inverted index of the values Invert the field of interest (e.g. survived)
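Both lookup tables are flat arrays indexed by doc id. A minimal sketch, assuming 20 documents and the `survived` postings from the inverted index slides (the variable names are illustrative):

```python
NUM_DOCS = 20

# Postings for the field of interest, as an inverted index would return them.
survived_index = {
    0: [2, 3, 4, 7, 9, 10, 15, 16, 19],
    1: [0, 1, 5, 6, 8, 11, 12, 13, 14, 17, 18],
}

# groups[doc]: position of each doc in the tree; every doc starts in the root, group 1.
groups = [1] * NUM_DOCS

# values[doc]: "un-invert" the field by writing each term back onto its docs.
values = [0] * NUM_DOCS
for term, docs in survived_index.items():
    for doc in docs:
        values[doc] = term

print(values[:9])  # [1, 1, 0, 0, 0, 1, 1, 0, 1]
```

Array lookups by doc id are what keep the inner loop of the group-stats pass cheap.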
  125. 125. Main Loop Overview foreach field foreach term get group stats evaluate splits apply best splits repeat n times or until no more splits found
  126. 126. Main Loop - First Iteration foreach field (class, fsize, gender)
  127. 127. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...)
  128. 128. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats
  129. 129. Get Group Stats count[grp] Count of how many documents within that group contain current term, initialized to zeros vsum[grp] Summation of the value to be classified from the documents within that group that contain current term, initialized to zeros
  130. 130. Get Group Stats for current field/term
  131. 131. Get Group Stats for current field/term foreach doc
  132. 132. Get Group Stats for current field/term foreach doc grp = grps[doc]
  133. 133. Get Group Stats for current field/term foreach doc grp = grps[doc] if grp == 0 skip
  134. 134. Get Group Stats for current field/term foreach doc grp = grps[doc] if grp == 0 skip count[grp]++ vsum[grp] += vals[doc]
  135. 135. Get Group Stats for current field/term (class=1)
  136. 136. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...)
  137. 137. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
  138. 138. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip
  139. 139. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
  140. 140. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 0, vsum[1] = 0
  141. 141. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 1, vsum[1] = 1
  142. 142. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 2, vsum[1] = 2
  143. 143. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 3, vsum[1] = 2
  144. 144. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 4, vsum[1] = 2
  145. 145. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 5, vsum[1] = 2
  146. 146. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 6, vsum[1] = 3
  147. 147. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 323, vsum[1] = 200
  148. 148. class = 1 1309 passengers 500 survivors 38.2% survival
  149. 149. class = 1 Group 1 1309 passengers 500 survivors 38.2% survival
  150. 150. class = 1 Group 1 1309 passengers 500 survivors 38.2% survival
  151. 151. class = 1 323 passengers count[1] Group 1 1309 passengers 500 survivors 38.2% survival
  152. 152. class = 1 323 passengers 200 survivors count[1] vsum[1] Group 1 1309 passengers 500 survivors 38.2% survival
  153. 153. Get Group Stats for current field/term (class=2) foreach doc (323,324,325,326,327,328,329...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…) … count[1] = 277, vsum[1] = 119
  154. 154. Get Group Stats for current field/term (class=3) foreach doc (600,601,602,603,604,605,606...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…) … count[1] = 709, vsum[1] = 181
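The pseudocode for the group-stats pass translates almost line for line. The sketch below runs it on the first few `class=1` docs from the slides. The slides skip group 0 without saying why; treating it as "not part of the tree" is an assumption here.

```python
def get_group_stats(docs, grps, vals, num_groups):
    """Per group, count the docs containing the current term and sum their values."""
    count = [0] * (num_groups + 1)  # index 0 is unused: group 0 is skipped
    vsum = [0] * (num_groups + 1)
    for doc in docs:
        grp = grps[doc]
        if grp == 0:
            continue
        count[grp] += 1
        vsum[grp] += vals[doc]
    return count, vsum


vals = [1, 1, 0, 0, 0, 1, 1, 0, 1]       # survived flag for docs 0..8
class_1_docs = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# First iteration: every doc is still in the root group.
first_grps = [1] * len(vals)
count, vsum = get_group_stats(class_1_docs[:6], first_grps, vals, num_groups=1)
print(count[1], vsum[1])  # 6 3

# Second iteration: docs have been split into groups 2 and 3.
second_grps = [3, 2, 3, 2, 3, 2, 3, 2, 3]
count, vsum = get_group_stats(class_1_docs, second_grps, vals, num_groups=3)
print(count[2], vsum[2], count[3], vsum[3])  # 4 2 5 3
```

The same pass produces stats for every group at once, which is what makes building a whole layer per iteration possible.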
  155. 155. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats evaluate splits
  156. 156. Evaluate Splits Consider current field/term as a potential split for each group 1) check if split is admissible balance check, significance check 2) score the split conditional entropy or some other heuristic 3) keep best scoring split
  157. 157. Evaluate Splits totalcount[group] / totalvalue[group] Total number of documents and total values for each group, i.e. # passengers / # survivors bestsplit[group] / bestscore[group] Current best split and score for each group, initially nulls
  158. 158. foreach field/term (class=1) get group stats (count[1]=323,vsum[1]=200) foreach group if not admissible( … ) skip score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp]) if score < bestscore[grp] bestscore[grp] = score bestsplit[grp] = field/term
  159. 159. foreach field/term (class=1) get group stats (count[1]=323,vsum[1]=200) foreach group if not admissible( … ) skip score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp]) if score < bestscore[grp] bestscore[grp] = score bestsplit[grp] = field/term
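The evaluation step can be sketched as a running minimum per group. The sketch takes per-term stats for group 1 from the slides. It assumes `calcscore` is a weighted binary entropy in nats, and `admissible` uses a made-up minimum-size rule, since the slides name balance and significance checks without defining them.

```python
import math


def entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def calcscore(cnt, vsum, totcnt, totval):
    """Conditional entropy of the value given the split (assumed scoring rule)."""
    rest = totcnt - cnt
    return (cnt * entropy(vsum / cnt) + rest * entropy((totval - vsum) / rest)) / totcnt


def admissible(cnt, totcnt, min_side=10):
    """Hypothetical balance check: both sides of the split need at least min_side docs."""
    return cnt >= min_side and totcnt - cnt >= min_side


# (field, term) -> {group: (count, vsum)}, as produced by the group-stats pass.
TERM_STATS = [
    (("class", 1), {1: (323, 200)}),
    (("class", 2), {1: (277, 119)}),
    (("class", 3), {1: (709, 181)}),
    (("fsize", 0), {1: (790, 239)}),
    (("fsize", 7), {1: (8, 0)}),       # too small, rejected by admissible()
    (("gender", "f"), {1: (466, 339)}),
    (("gender", "m"), {1: (843, 161)}),
]
totcnt = {1: 1309}
totval = {1: 500}

bestscore, bestsplit = {}, {}
for field_term, group_stats in TERM_STATS:
    for grp, (cnt, vsum) in group_stats.items():
        if not admissible(cnt, totcnt[grp]):
            continue
        score = calcscore(cnt, vsum, totcnt[grp], totval[grp])
        if score < bestscore.get(grp, math.inf):
            bestscore[grp] = score
            bestsplit[grp] = field_term

print(bestsplit)  # {1: ('gender', 'f')}
```

Because the comparison is strict, `gender=m` (the mirror image of `gender=f`, with an identical score) does not displace the earlier term.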
  160. 160. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats evaluate splits apply best splits (bestsplit[1]=“gender=f”)
  161. 161. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group
  162. 162. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 1
  163. 163. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female 1
  164. 164. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female positive group: 3 1 3
  165. 165. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female positive group: 3 2 negative group: 2 1 3
  166. 166. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female positive group: 3 2 negative group: 2 1 3
  167. 167. Apply Best Splits Using inverted index, iterate over docs that match split condition If current document is in targeted group, move it to the positive group At the end, move anything left in target group to negative group
  168. 168. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  169. 169. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  170. 170. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  171. 171. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  172. 172. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  173. 173. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  174. 174. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  175. 175. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  176. 176. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  177. 177. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  178. 178. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  179. 179. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 2 group[2] = 3 group[3] = 2 group[4] = 3 group[5] = 2 group[6] = 3 group[7] = 2 group[8] = 3 group[9] = 2 group[10] = 2 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 2 group[15] = 2 group[16] = 2 group[17] = 3 group[18] = 3 group[19] = 2 group[20] = 2
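Applying a split takes two passes over the group array: one driven by the postings list of the condition, and one sweep for whatever is left in the target group. A minimal sketch with docs 0 to 20 from the slides (`apply_split` is an illustrative name):

```python
def apply_split(groups, matching_docs, target, positive, negative):
    """Move target-group docs to `positive` if they match the condition, else to `negative`."""
    # Pass 1: walk the postings list of the split condition.
    for doc in matching_docs:
        if groups[doc] == target:
            groups[doc] = positive
    # Pass 2: everything still in the target group did not match.
    for doc, grp in enumerate(groups):
        if grp == target:
            groups[doc] = negative


groups = [1] * 21
gender_f_docs = [0, 2, 4, 6, 8, 11, 12, 13, 17, 18]  # postings for gender=f, docs 0..20
apply_split(groups, gender_f_docs, target=1, positive=3, negative=2)
print(groups)
# [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 2]
```

The first pass only touches docs that contain the term, so its cost follows the postings list length and not the corpus size.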
  180. 180. Main Loop foreach field foreach term get group stats evaluate splits apply best splits repeat n times or until no more splits found
  181. 181. 1
  182. 182. 1 iter #1
  183. 183. 1 iter #1 gender = female
  184. 184. 1 iter #1 2 gender ≠ female 3 gender = female
  185. 185. 1 iter #1 2 iter #2 3
  186. 186. Main Loop - Second Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats
  187. 187. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (3,2,3,2,3,2,3,2,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
  188. 188. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (3,2,3,2,3,2,3,2,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[2] = 179, vsum[2] = 61 count[3] = 144, vsum[3] = 139
  189. 189. Get Group Stats for current field/term (class=2) foreach doc (323,324,325,326,327,328,329...) grp = grps[doc] (2,3,2,2,2,2,3,2,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…) … count[2] = 171, vsum[2] = 25 count[3] = 106, vsum[3] = 94
  190. 190. Get Group Stats for current field/term (class=3) foreach doc (600,601,602,603,604,605,606...) grp = grps[doc] (2,2,2,3,3,2,2,3,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…) … count[2] = 493, vsum[2] = 75 count[3] = 216, vsum[3] = 106
  191. 191. Get Group Stats for current field/term (gender=female) foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….) grp = grps[doc] (3,3,3,3,3,3,3,3,3,3,3,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,0,0,1,1,1,1,1,1…) … count[2] = 0, vsum[2] = 0 count[3] = 466, vsum[3] = 339
  192. 192. Get Group Stats for current field/term (gender=male) foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...) grp = grps[doc] (2,2,2,2,2,2,2,2,2,2,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,0,1,0,0,0,1,0,0…) … count[2] = 843, vsum[2] = 161 count[3] = 0, vsum[3] = 0
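Putting the pieces together, the whole main loop fits in one function. The sketch below is self-contained and runs on the 20 sample rows from the Data Format slide. It assumes conditional entropy in nats as the score and skips the admissibility checks, so it illustrates the control flow and is not the production algorithm. All names are illustrative.

```python
import math

FIELDS = ("class", "fsize", "gender")
# (class, fsize, gender, survived) for docs 0..19 from the Data Format slide.
ROWS = [
    (1, 0, "f", 1), (1, 3, "m", 1), (1, 3, "f", 0), (1, 3, "m", 0), (1, 3, "f", 0),
    (1, 0, "m", 1), (1, 1, "f", 1), (1, 0, "m", 0), (1, 2, "f", 1), (1, 0, "m", 0),
    (1, 1, "m", 0), (1, 1, "f", 1), (1, 0, "f", 1), (1, 0, "f", 1), (1, 0, "m", 1),
    (1, 0, "m", 0), (1, 1, "m", 0), (1, 1, "f", 1), (1, 0, "f", 1), (1, 0, "m", 0),
]


def entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def calcscore(cnt, vsum, totcnt, totval):
    rest = totcnt - cnt
    return (cnt * entropy(vsum / cnt) + rest * entropy((totval - vsum) / rest)) / totcnt


def build_layers(rows, max_iters=3):
    index = {f: {} for f in FIELDS}
    for doc, row in enumerate(rows):
        for f, term in zip(FIELDS, row):  # zip stops before the survived column
            index[f].setdefault(term, []).append(doc)
    vals = [row[3] for row in rows]
    grps = [1] * len(rows)  # all docs start in the root
    next_group, applied = 2, []
    for _ in range(max_iters):
        totcnt, totval = {}, {}
        for doc, g in enumerate(grps):
            totcnt[g] = totcnt.get(g, 0) + 1
            totval[g] = totval.get(g, 0) + vals[doc]
        best = {}  # group -> (score, field, term)
        for field in FIELDS:
            for term, docs in sorted(index[field].items()):
                count, vsum = {}, {}  # get group stats
                for doc in docs:
                    g = grps[doc]
                    if g == 0:
                        continue
                    count[g] = count.get(g, 0) + 1
                    vsum[g] = vsum.get(g, 0) + vals[doc]
                for g, cnt in count.items():  # evaluate splits
                    if cnt == totcnt[g]:
                        continue  # term does not divide this group
                    score = calcscore(cnt, vsum[g], totcnt[g], totval[g])
                    limit = best[g][0] if g in best else entropy(totval[g] / totcnt[g])
                    if score < limit - 1e-12:
                        best[g] = (score, field, term)
        if not best:
            break  # no more splits found
        for g, (_, field, term) in best.items():  # apply best splits
            negative, positive = next_group, next_group + 1
            next_group += 2
            for doc in index[field][term]:
                if grps[doc] == g:
                    grps[doc] = positive
            for doc in range(len(grps)):
                if grps[doc] == g:
                    grps[doc] = negative
            applied.append((g, field, term, positive, negative))
    return grps, applied


grps, applied = build_layers(ROWS, max_iters=1)
print(applied)  # [(1, 'gender', 'f', 3, 2)]
```

Even on this 20-row sample the first iteration picks `gender=f`, with the same positive group 3 and negative group 2 as in the walkthrough.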
  193. 193. What About Inequality Splits? e.g. class ≤ 2
  194. 194. Main Loop + Inequality Splits foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  195. 195. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  196. 196. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats update inequality stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  197. 197. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats update inequality stats evaluate splits evaluate inequality splits apply best splits for each group repeat n times or until no more splits found
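Because terms arrive in sorted order, an inequality split such as class ≤ 2 needs nothing more than a running total of the per-term stats, reset at the start of each field. A minimal sketch for one field and one group, using the class counts from the slides (`inequality_stats` is an illustrative name):

```python
def inequality_stats(term_stats):
    """Turn per-term (term, count, vsum) stats into stats for 'field <= term'.

    term_stats must be in ascending term order for a single field and group.
    """
    running_count = running_vsum = 0  # reset inequality stats
    result = []
    for term, count, vsum in term_stats:
        running_count += count         # update inequality stats
        running_vsum += vsum
        result.append((term, running_count, running_vsum))
    return result


class_stats = [(1, 323, 200), (2, 277, 119), (3, 709, 181)]
for term, count, vsum in inequality_stats(class_stats):
    print(f"class <= {term}: {count} passengers, {vsum} survivors")
```

The class ≤ 2 row reproduces the 600 passengers and 319 survivors from the earlier scoring slides. The last row covers the whole group, so it is not a usable split.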
  198. 198. Scalability Performs quite well on a single machine Worked well for a while, but started to hit limits Ultimately needed to distribute to multiple machines
  199. 199. Multiple Machine Implementation
  200. 200. Hadoop?
  201. 201. Hadoop Experimented with using Hadoop Each level took five sequential map reduce jobs Much slower than single machine; repeatedly writes intermediate data and lots of shuffling
  202. 202. Hadoop Experimented with using Hadoop Each level took five sequential map reduce jobs Much slower than single machine; repeatedly writes intermediate data and lots of shuffling Hadoop not great for iterative algorithms
  203. 203. Partition Data
  204. 204. Inverted Index
  205. 205. Inverted Index
  206. 206. Inverted Index
  207. 207. Inverted Index Shard 1 Shard 2
  208. 208. Machine 1 Machine 2 Shard 1 Shard 2
  209. 209. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  210. 210. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  211. 211. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  212. 212. Main Loop foreach Field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  213. 213. Main Loop foreach Field foreach Term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  214. 214. Main Loop foreach Field foreach Term get Group Stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
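The three nested steps highlighted above (foreach field, foreach term, get group stats) can be sketched as a single generator over an inverted index. This is a hedged illustration rather than Imhotep's code: the function name `ftgs`, the `doc_group` array (current tree node of each document), and the per-document `doc_stats` tuples are all hypothetical.

```python
def ftgs(index, doc_group, doc_stats):
    """Yield (field, term, group, stats) in sorted field/term/group order.

    index:     {field: {term: [doc_ids]}}  inverted index
    doc_group: doc_group[doc_id] is the group (tree node) the doc is in
    doc_stats: doc_stats[doc_id] is a tuple of per-document stats
    """
    for field in sorted(index):
        for term in sorted(index[field]):
            totals = {}
            for doc_id in index[field][term]:
                group = doc_group[doc_id]
                stats = doc_stats[doc_id]
                acc = totals.setdefault(group, [0] * len(stats))
                for i, value in enumerate(stats):
                    acc[i] += value
            for group in sorted(totals):
                yield field, term, group, tuple(totals[group])


index = {"gender": {"f": [0, 2], "m": [1, 3]}, "class": {1: [0, 1], 3: [2, 3]}}
doc_group = [1, 1, 1, 1]                      # every doc starts in group 1
doc_stats = [(1, 1), (1, 0), (1, 1), (1, 0)]  # assumed: (count, label)
for field, term, group, stats in ftgs(index, doc_group, doc_stats):
    print(f"{field}={term}|{group}|" + "|".join(map(str, stats)))
```

Emitting records in sorted order costs nothing extra on one machine, and it is what makes a streaming merge possible once the data is spread over several machines.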
  215. 215. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  216. 216. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  217. 217. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  218. 218. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  219. 219. FTGS Stream - Single Machine (Sorted) class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  220. 220. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  221. 221. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  222. 222. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  223. 223. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  224. 224. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  225. 225. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  226. 226. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  227. 227. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  228. 228. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  229. 229. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  230. 230. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  231. 231. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  232. 232. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  233. 233. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
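The stream shown on these slides can be read as `field=term|group|stat|stat` records. The sketch below parses it and checks two properties visible in the data: records are sorted by field and then term (numerically for `fsize`, so 7 comes before 10), and every field's stats add up to the same totals. The record type and function names are hypothetical, and the reading of the two stats as a document count and a positive-label count is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple, Union

STREAM = """
class=1|1|323|200 class=2|1|277|119 class=3|1|709|181
fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30
fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0
gender=f|1|466|339 gender=m|1|843|161
"""


class FtgsRecord(NamedTuple):
    field: str
    term: Union[int, str]
    group: int
    stats: Tuple[int, ...]


def parse_ftgs(item):
    """Parse one 'field=term|group|stat|...' record."""
    key, group, *stats = item.split("|")
    field, term = key.split("=", 1)
    term = int(term) if term.isdigit() else term  # int fields sort numerically
    return FtgsRecord(field, term, int(group), tuple(int(s) for s in stats))


records = [parse_ftgs(item) for item in STREAM.split()]

# Sorted by (field, term): terms are only compared within the same field.
is_sorted = records == sorted(records, key=lambda r: (r.field, r.term))

# Sum the stats per field; each field partitions the same documents.
totals = defaultdict(lambda: [0, 0])
for r in records:
    totals[r.field][0] += r.stats[0]
    totals[r.field][1] += r.stats[1]

print(is_sorted, dict(totals))
```

Every field sums to 1309 and 500, which is consistent with each document carrying exactly one term per field in this example.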
  234. 234. FTGS Stream How to distribute?
  235. 235. Machine 1 Machine 2 Shard 1 Shard 2
  236. 236. FTGS 1 Machine 2 Shard 1 Shard 2
  237. 237. FTGS 1 FTGS 2 Shard 1 Shard 2
  238. 238. FTGS 1 FTGS 2 Machine 3 Shard 1 Shard 2
  239. 239. FTGS 1 Merge FTGS 2 Machine 3 Shard 1 Shard 2
  240. 240. FTGS Stream Merge class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 1
  241. 241. FTGS Stream Merge class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Machine 2
  242. 242. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39
  243. 243. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merging: class=1|1|198|111 + class=1|1|125|89 Merged: class=1|1|323|200
  244. 244. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merged: class=1|1|323|200
  245. 245. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merged: class=1|1|323|200 class=2|1|277|119
  246. 246. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merged: class=1|1|323|200 class=2|1|277|119
  247. 247. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merging: class=3|1|511|129 + class=3|1|198|52 Merged: class=1|1|323|200 class=2|1|277|119 class=3|1|709|181
  248. 248. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merged: class=1|1|323|200 class=2|1|277|119 class=3|1|709|181
  249. 249. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merged: class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239
  250. 250. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merged: class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239
  251. 251. FTGS Stream Merge Machine 1: class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 2: class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Merging: fsize=1|1|94|53 + fsize=1|1|141|73 Merged: class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126
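The merge animated on these slides can be sketched as a two-pointer walk over both sorted streams: emit whichever record has the smaller (field, term, group) key, and when both machines have the same key, add their stats together. The function names are hypothetical, and this is a sketch of the idea rather than Imhotep's code. The input data is copied from the Machine 1 and Machine 2 slides.

```python
MACHINE_1 = """class=1|1|198|111 class=2|1|277|119 class=3|1|511|129
fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17
fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122"""

MACHINE_2 = """class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73
fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4
fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39"""


def parse(stream):
    """Parse whitespace-separated 'field=term|group|stat|stat' records."""
    records = []
    for item in stream.split():
        key, group, *stats = item.split("|")
        field, term = key.split("=", 1)
        term = int(term) if term.isdigit() else term
        records.append((field, term, int(group), tuple(map(int, stats))))
    return records


def merge_two(a, b):
    """Merge two lists sorted by (field, term, group), summing stats on equal keys."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        key_a, key_b = a[i][:3], b[j][:3]
        if key_a == key_b:
            stats = tuple(x + y for x, y in zip(a[i][3], b[j][3]))
            out.append(key_a + (stats,))
            i += 1
            j += 1
        elif key_a < key_b:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


merged = merge_two(parse(MACHINE_1), parse(MACHINE_2))
print(merged[0])  # ('class', 1, 1, (323, 200))
```

The merged result matches the single-machine stream shown earlier, for example `fsize=1` becomes 94 + 141 = 235 and 53 + 73 = 126.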
  252. 252. Shard 1 Shard 2 Shard 3 Shard 4 Shard 5 Shard 6
  253. 253. FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  254. 254. k-way merge FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  255. 255. FTGS 1-6 FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  256. 256. FTGS 1-6 FTGS 7-12 FTGS 13-18
  257. 257. FTGS 1-18 FTGS 1-6 FTGS 7-12 FTGS 13-18
  258. 258. FTGS 1-36 FTGS 1-18 FTGS 19-36
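The tree of merges above (6 shards per machine, then 18, then 36) can be sketched with a k-way merge that is applied recursively. The sketch below uses Python's `heapq.merge` and `itertools.groupby`; the function name `merge_streams` and the tiny shard data are made up for illustration, and nothing here is claimed to be how Imhotep implements it.

```python
import heapq
from itertools import groupby


def record_key(record):
    """Sort key: (field, term, group)."""
    return record[:3]


def merge_streams(streams):
    """k-way merge of sorted FTGS streams, summing stats for equal keys."""
    merged = heapq.merge(*streams, key=record_key)
    for key, records in groupby(merged, key=record_key):
        stats = tuple(sum(col) for col in zip(*(r[3] for r in records)))
        yield key + (stats,)


# Hypothetical per-shard streams of (field, term, group, stats).
shards = [
    [("gender", "f", 1, (2, 1)), ("gender", "m", 1, (1, 0))],
    [("gender", "f", 1, (3, 2))],
    [("class", 1, 1, (4, 1)), ("gender", "m", 1, (5, 2))],
    [("class", 1, 1, (1, 1))],
]

# One flat merge over all shards.
flat = list(merge_streams(shards))

# The same result built as a tree: merge pairs first, then merge the results.
tree = list(merge_streams([merge_streams(shards[:2]), merge_streams(shards[2:])]))

print(flat == tree, flat)
```

Because summing stats is associative, merging in a tree gives the same answer as one flat merge, while letting each machine combine its own shards before anything crosses the network.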
  259. 259. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  260. 260. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  261. 261. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  262. 262. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found Regroup
  263. 263. FTGS FTGS 1-6 FTGS 7-12 FTGS 13-18
  264. 264. Regroup Regroup 1-6 Regroup 7-12 Regroup 13-18
  265. 265. FTGS FTGS 1-6 FTGS 7-12 FTGS 13-18
  266. 266. Regroup Regroup 1-6 Regroup 7-12 Regroup 13-18
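The Regroup step ("apply best splits for each group") can be sketched as a per-shard operation: every machine receives the same split rules and uses its local inverted index to move its own documents into new groups. This is a loose illustration. The rule layout `(field, term, group_if_match, group_otherwise)` and the function name `regroup` are hypothetical and do not describe Imhotep's actual interface.

```python
def regroup(index, doc_group, rules):
    """Return new group assignments for the docs of one shard.

    index:     {field: {term: [doc_ids]}}  shard-local inverted index
    doc_group: current group of each local doc
    rules:     {group: (field, term, group_if_match, group_otherwise)}
    """
    new_group = list(doc_group)
    # First send every doc of a split group to the "otherwise" side.
    for doc_id, group in enumerate(doc_group):
        if group in rules:
            new_group[doc_id] = rules[group][3]
    # Then use the inverted index to move matching docs to the "match" side.
    for group, (field, term, if_match, _otherwise) in rules.items():
        for doc_id in index.get(field, {}).get(term, []):
            if doc_group[doc_id] == group:
                new_group[doc_id] = if_match
    return new_group


index = {"gender": {"f": [0, 2], "m": [1, 3]}, "class": {1: [0, 1], 3: [2, 3]}}
doc_group = [1, 1, 1, 1]

# Split group 1 on gender=f: matches go to group 2, the rest to group 3.
doc_group = regroup(index, doc_group, {1: ("gender", "f", 2, 3)})
print(doc_group)  # [2, 3, 2, 3]

# Next level: split group 2 on class=1; group 3 has no rule and is left alone.
doc_group = regroup(index, doc_group, {2: ("class", 1, 4, 5)})
print(doc_group)  # [4, 3, 5, 3]
```

Only the small rule table has to be sent to each machine; the group assignments themselves never leave the shard that owns the documents.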
  267. 267. Imhotep
  268. 268. Imhotep Distributed System that does efficient FTGS and Regroup operations on inverted indexes
  269. 269. Imhotep 32 machines, each with: 2 CPUs x 6 cores (Xeon Westmere E5649), 128GB RAM, 10 x 1TB 7200 RPM SATA disks Total: 384 cores, 4TB RAM, 320TB disk
  270. 270. Imhotep Decision tree on 13 billion documents
  271. 271. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc
  272. 272. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds First Regroup: 9.6 seconds
  273. 273. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds (36.3 million terms) First Regroup: 9.6 seconds
  274. 274. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds (36.3 million terms) First Regroup: 9.6 seconds (7 groups)
  275. 275. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds (36.3 million terms) First Regroup: 9.6 seconds (7 groups) Second FTGS: 57 seconds Second Regroup: 23 seconds (217 groups)
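The cluster totals and the bytes-per-document figure quoted on these slides follow from simple arithmetic, reproduced below as a quick check. The only assumptions are the unit conventions: 1TB of RAM is taken as 1024GB, and the 330GB index size is taken as decimal gigabytes.

```python
machines = 32

cores = machines * 2 * 6           # 2 CPUs x 6 cores per machine
ram_tb = machines * 128 / 1024     # 128GB RAM per machine, 1024GB per TB
disk_tb = machines * 10 * 1        # 10 x 1TB disks per machine

documents = 13e9                   # 13 billion documents
index_bytes = 330e9                # 330GB, assumed decimal
bytes_per_doc = index_bytes / documents

print(cores, ram_tb, disk_tb, round(bytes_per_doc, 1))
```

The result of about 25 bytes per document means the whole 330GB data set fits comfortably within the cluster's 4TB of RAM.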
  276. 276. Imhotep Distributed System that does efficient FTGS and Regroup operations Powers our internal analytical tools
  277. 277. Imhotep Distributed System that does efficient FTGS and Regroup operations Powers our internal analytical tools … and more
  278. 278. Imhotep - Next @IndeedEng Talk Sharding and shard management Session / FTGS network protocol Memory management Inverted Indexes FTGS Merge Regroup operations Fault Tolerance
  279. 279. Conclusion Now scales to larger and larger data sets by adding more machines Increased freshness and frequency of builds Decision trees have lots of tunable components, regularly get 1% wins via A/B test
  280. 280. Continuous Improvement Sponsored Job Click-through Rate (CTR)
  281. 281. Thanks.
  282. 282. Q&A
  283. 283. More Questions? Jason David James Jeff
  284. 284. Next @IndeedEng Talk Imhotep: Large Scale Analytics and Machine Learning at Indeed Jeff Plaisance, Engineering Manager March 26, 2014 http://engineering.indeed.com/talks
