[@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Video available at: http://www.youtube.com/watch?v=MFilAoiV5nE

Decision trees are a widely used machine learning technique for supervised classification. Indeed's data sets consist of tens of billions of documents with millions of distinct features. Since decision trees back some of our most important features, we built a custom distributed system to efficiently train them. Every day, we now build dozens of decision trees across this data. This same system now powers our internal analytical tools that enable quick data-driven decision-making at Indeed.

This presentation provides a brief introduction to decision trees followed by a detailed overview of our approach to building them. The talk will be presented by our CTO, Andrew Hudson.

Published in: Technology

Transcript

  • 1. go.indeed.com/IndeedEngTalks
  • 2. Machine Learning at Indeed Scaling Decision Trees
  • 3. Andrew Hudson CTO
  • 4. I help people get jobs.
  • 5. Indeed is a Search Engine for Jobs
  • 6. Which jobs to show?
  • 7. 18,749 jobs
  • 8. Which jobs to show? Maximize job seeker’s chance to get the job
  • 9. Which jobs to show? Maximize job seeker’s chance to get the job ● Will job seeker click on the job? ● Is the job still available? ● Will job seeker apply to the job? ● Is job seeker qualified for the job?
  • 10. Which jobs to show? Maximize job seeker’s chance to get the job ● Will job seeker click on the job? ● Is the job still available? ● Will job seeker apply to the job? ● Is job seeker qualified for the job?
  • 11. How? Log job seeker behavior Analyze logs, what best explains why they clicked on some jobs and not on others? May help predict future behavior
  • 12. How? Log job seeker behavior Analyze logs, what best explains why they clicked on some jobs and not on others? May help predict future behavior Supervised learning
  • 13. Supervised Learning Approaches Neural networks Bayesian methods Decision trees Genetic programming Logistic model tree Nearest neighbor Support Vector Machines Random forests Boosting Bagging Regression Ensemble methods
  • 14. Supervised Learning Approaches Neural networks Bayesian methods Decision trees Genetic programming Logistic model tree Nearest neighbor Support Vector Machines Random forests Boosting Bagging Regression Ensemble methods
  • 15. Supervised Learning Approaches Decision trees Genetic programming Logistic model tree Random forests Bagging Boosting Ensemble methods
  • 16. Decision Trees
  • 17. What is a Decision Tree? A tree-like structure that presents a relevant sequence of questions which determine a path and ultimately some outcome or prediction
  • 18. I’m Thinking About Buying a Laptop
  • 19. I’m Thinking About Buying a Laptop Is quality important?
  • 20. I’m Thinking About Buying a Laptop Is quality important? NO ASUS
  • 21. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has
  • 22. I’m Thinking About Buying a Laptop Is quality important? YES Want to run linux? NO ASUS or whatever woot has
  • 23. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has YES Want to run linux? NO MACBOOK
  • 24. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has YES Want to run linux? YES LENOVO NO MACBOOK
  • 25. I’m Thinking About Buying a Laptop Is quality important? YES NO ASUS or whatever woot has IDGAF DELL Want to run linux? YES NO MACBOOK HELLYES SYSTEM76 LENOVO
  • 26. Benefits of Decision Trees Algorithm relatively simple to understand and implement Model produced also human understandable
  • 27. Decision Tree Learning Programmatic creation of decision trees
  • 28. Decision Tree Learning Given a set of documents, split it into two or more subsets that optimize some criteria Repeat this process until a set can no longer be split
  • 29. Titanic Example 1309 passengers 500 survivors 38.2% survival rate What best explains who survived?
  • 30. What best explains who survived? class class of ticket; first, second or third fsize family size; number of family members onboard gender male or female
  • 31. 1309 passengers 500 survivors 38.2% survival
  • 32. class = 1 1309 passengers 500 survivors 38.2% survival
  • 33. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival
  • 34. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival
  • 35. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = ?
  • 36. Score conditional entropy
  • 37. Conditional Entropy as Score lower conditional entropy ↓ less uncertainty about prediction based on term
  • 38. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = 0.6267
  • 39. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = 0.6267 Best Score: 0.6267, class = 1
  • 40. class = 1 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  • 41. class ≤ 2 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  • 42. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  • 43. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6267, class = 1
  • 44. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Score = 0.6244 class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6267, class = 1
  • 45. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Score = 0.6244 class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6244, class ≤ 2
  • 46. class ≠ 3 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival class = 3 709 passengers 181 survivors 25.5% survival Score = 0.6244 Best Score: 0.6244, class ≤ 2
  • 47. gender = female 1309 passengers 500 survivors 38.2% survival Best Score: 0.6244, class ≤ 2
  • 48. gender = female 466 passengers 339 survivors 72.7% survival 1309 passengers 500 survivors 38.2% survival Best Score: 0.6244, class ≤ 2
  • 49. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Best Score: 0.6244, class ≤ 2
  • 50. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Score = 0.5525 Best Score: 0.6244, class ≤ 2
  • 51. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Score = 0.5525 Best Score: 0.5525, gender=f
  • 52. fsize ≠ 0 519 passengers 261 survivors 50.3% survival 1309 passengers 500 survivors 38.2% survival fsize = 0 790 passengers 239 survivors 30.3% survival Score = 0.6448 Best Score: 0.5525, gender=f
  • 53. Best Score: 0.5525, gender=f
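The split scores on the preceding slides can be approximated with a short sketch. It assumes the score is the conditional entropy of survival given the split, computed with natural logarithms; the slides do not state the base, but this choice gives about 0.6268 for class = 1 and 0.6245 for class ≤ 2, in line with the figures shown. Under the same assumption the gender split comes out at roughly 0.52 rather than the 0.5525 in the transcript, but it is still the lowest of the candidates, so the ranking matches. The function names are illustrative, not Indeed's.

```python
import math

def entropy(p):
    """Binary entropy in nats; 0 when the outcome is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def conditional_entropy(count, vsum, total_count, total_vsum):
    """Entropy of the label given a binary split.

    count / vsum: documents and positive labels on the matching side.
    total_count / total_vsum: the same totals for the node being split.
    """
    rest_count = total_count - count
    rest_vsum = total_vsum - vsum
    score = 0.0
    for n, v in ((count, vsum), (rest_count, rest_vsum)):
        if n > 0:
            score += (n / total_count) * entropy(v / n)
    return score

# Candidate splits of the root node, using the counts from the slides.
TOTAL, SURVIVED = 1309, 500
candidates = {
    "class = 1": (323, 200),
    "class <= 2": (600, 319),
    "gender = female": (466, 339),
    "fsize = 0": (790, 239),
}
scores = {name: conditional_entropy(n, v, TOTAL, SURVIVED)
          for name, (n, v) in candidates.items()}
best = min(scores, key=scores.get)  # lower conditional entropy is better
print(best, round(scores[best], 4))
```

Lower is better because a low conditional entropy means little uncertainty remains about the label once the split's answer is known.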
  • 54. 19.1% survival 72.7% survival
  • 55. gender=male 843 passengers 161 survivors 19.1% survival
  • 56. class = 1 179 passengers 61 survivors 34.1% survival gender=male 843 passengers 161 survivors 19.1% survival
  • 57. class = 1 179 passengers 61 survivors 34.1% survival gender=male 843 passengers 161 survivors 19.1% survival class ≠ 1 664 passengers 100 survivors 15.1% survival Score = 0.4700
  • 58. class = 1 class ≠ 1
  • 59. class = 1 class ≠ 1
  • 60. 34.1% survival 15.1% survival
  • 61. 38.2%
  • 62. 38.2% 19.1% 72.7% MALE FEMALE
  • 63. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% CLASS≠1 CLASS=1
  • 64. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% CLASS≠1 CLASS=1 13.1% 33.9% FSIZE≠2 FSIZE=2
  • 65. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% 49.1% 93.2% CLASS≠1 CLASS=1 CLASS>2 CLASS<=2 13.1% 33.9% FSIZE≠2 FSIZE=2
  • 66. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% 49.1% 93.2% CLASS≠1 CLASS=1 CLASS>2 CLASS<=2 13.1% 33.9% 24.4% 54.9% FSIZE≠2 FSIZE=2 FSIZE>2 FSIZE<=2
  • 67. Predicting Click Probabilities Passenger → Job Impression Survived → Clicked on Job For each candidate job, follow path through tree then take click through rate of terminal node
  • 68. Simplified Decision Tree for query="sales" NO account sales NO NO 1.9% manager YES YES YES 3.8% NO 2.1% manager representative NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 69. job title = “sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 70. job title = “account executive” NO account account sales YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 71. job title = “outside sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 72. job title = “sales associate” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 73. job title = “inside sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 74. job title = “sales manager” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 1.8% 2.9% NO outside YES 5.1% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 75. job title = “sales consultant” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 76. job title = “store manager” NO NO NO account YES YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES sales 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 77. job title = “service sales representative” NO sales account YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES YES NO NO 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 78. job title = “customer service representative” NO NO NO account YES YES 3.8% NO 2.1% NO NO manager representative 1.9% manager YES sales 2.6% associate YES YES YES 2.9% NO outside YES 5.1% 1.8% NO NO service YES 2.9% inside YES 4.4% 4.6%
  • 79. Final CTR Predictions 5.1% 4.6% 4.4% 3.8% 2.9% 2.9% 2.6% 2.1% 1.9% 1.8% outside sales representative sales representative inside sales representative account executive sales manager service sales representative sales consultant store manager customer service representative sales associate
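Following a path through the tree and reading the click-through rate of the terminal node can be sketched as below. The tree layout is a reconstruction inferred from the flattened slide text and the final predictions, so treat it as an assumption; the tuple encoding and `predict_ctr` are hypothetical names for illustration.

```python
# Internal nodes are (word, yes_branch, no_branch): does the job title
# contain the word? Leaves are the click-through rate (CTR) of that node.
# Layout inferred from the final predictions; it is an assumption.
TREE = ("sales",
        # YES: title contains "sales"
        ("representative",
         ("outside", 0.051,
          ("service", 0.029,
           ("inside", 0.044, 0.046))),
         ("manager", 0.029,
          ("associate", 0.018, 0.026))),
        # NO: title does not contain "sales"
        ("account", 0.038,
         ("manager", 0.021, 0.019)))

def predict_ctr(node, title):
    """Walk from the root to a leaf and return that leaf's CTR."""
    words = set(title.lower().split())
    while isinstance(node, tuple):
        word, yes_branch, no_branch = node
        node = yes_branch if word in words else no_branch
    return node

titles = ["outside sales representative", "account executive",
          "store manager", "sales associate"]
ranked = sorted(titles, key=lambda t: predict_ctr(TREE, t), reverse=True)
```

Sorting candidate jobs by the predicted rate gives the same kind of ordering as the final predictions slide.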
  • 80. Single Machine Implementation
  • 81. Overview
  • 82. Tree Building Strategies One node at a time - depth first - breadth first
  • 83. 1 Depth First
  • 84. 1 2 3 Depth First
  • 85. 1 2 4 3 5 Depth First
  • 86. 1 2 5 4 6 3 7 Depth First
  • 87. 1 2 5 4 6 3 7 Depth First
  • 88. 1 2 5 4 6 3 7 Depth First
  • 89. 1 2 5 4 6 3 7 Depth First
  • 90. 1 2 5 4 6 3 8 7 Depth First 9
  • 91. 1 Breadth First
  • 92. 1 2 3 Breadth First
  • 93. 1 2 4 3 5 Breadth First
  • 94. 1 2 4 3 5 6 Breadth First 7
  • 95. 1 2 5 4 8 3 6 9 Breadth First 7
  • 96. 1 2 5 4 8 3 6 9 Breadth First 7
  • 97. 1 2 5 4 8 3 9 6 10 7 11 Breadth First
  • 98. 1 2 5 4 8 3 9 6 10 7 11 Breadth First 12 13
  • 99. Tree Building Strategies One node at a time - depth first - breadth first One layer at a time, all nodes simultaneous
  • 100. 1
  • 101. 1 iter #1
  • 102. 1 iter #1 2 3
  • 103. 1 iter #1 2 iter #2 3
  • 104. 1 iter #1 2 3 iter #2 4 5 6 7
  • 105. 1 iter #1 2 3 iter #2 4 iter #3 5 6 7
  • 106. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
  • 107. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
  • 108. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 iter #4 9 0 10 11 12 13
  • 109. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
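The three building strategies differ only in the order in which nodes are split. A minimal sketch, assuming a complete binary tree with heap-style numbering (node n has children 2n and 2n+1, which is an illustrative convention and not taken from the slides):

```python
from collections import deque

def children(node, max_node):
    """Heap-style numbering: node n has children 2n and 2n+1."""
    return [c for c in (2 * node, 2 * node + 1) if c <= max_node]

def depth_first(max_node):
    """Split one node at a time, finishing a subtree before its sibling."""
    order, stack = [], [1]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children(node, max_node)))
    return order

def breadth_first(max_node):
    """Split one node at a time, level by level."""
    order, queue = [], deque([1])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children(node, max_node))
    return order

def layer_at_a_time(max_node):
    """Split every node of a layer in the same iteration."""
    layers, layer = [], [1]
    while layer:
        layers.append(layer)
        layer = [c for n in layer for c in children(n, max_node)]
    return layers

print(depth_first(7))      # seven separate split steps
print(breadth_first(7))    # seven separate split steps
print(layer_at_a_time(7))  # three iterations, one per layer
```

The last strategy matters for scale: if each split step needs a pass over the data, a tree with seven internal nodes costs three passes instead of seven.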
  • 110. Data Format
        id  class  fsize  gender  survived  |  id  class  fsize  gender  survived
         0      1      0       f         1  |  10      1      1       m         0
         1      1      3       m         1  |  11      1      1       f         1
         2      1      3       f         0  |  12      1      0       f         1
         3      1      3       m         0  |  13      1      0       f         1
         4      1      3       f         0  |  14      1      0       m         1
         5      1      0       m         1  |  15      1      0       m         0
         6      1      1       f         1  |  16      1      1       m         0
         7      1      0       m         0  |  17      1      1       f         1
         8      1      2       f         1  |  18      1      0       f         1
         9      1      0       m         0  |  19      1      0       m         0
        ….
  • 111. Data Format Create an inverted index Key to efficiently building one layer at a time
  • 112. Inverted Index Maps terms to the list of documents that contain that term Terms and docs stored in sorted order
  • 113. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606….
  • 114. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Field
  • 115. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Term
  • 116. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  • 117. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  • 118. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  • 119. Inverted Index fsize=0 → 0,5,7,9,12,13,14,15,18,19,22…. fsize=1 → 6,10,11,16,17,26,27,36,49,50…. fsize=2 → 8,20,21,42,76,77,78,79,81,82…. fsize=3 → 1,2,3,4,54,55,56,57,90,339…. fsize=4 → 249,250,251,252,253,449,806…. ….
  • 120. Inverted Index gender=f → 0,2,4,6,8,11,12,13,17,18,21…. gender=m → 1,3,5,7,9,10,14,15,16,19,20….
  • 121. Inverted Index survived=0 → 2,3,4,7,9,10,15,16,19,25…. survived=1 → 0,1,5,6,8,11,12,13,14,17….
  • 122. Inverted Index Implementations Lucene Flamdex
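As a rough illustration of the inverted index described above, the sketch below builds one in memory from the first ten rows of the sample data. It is a toy stand-in, not Lucene or Flamdex; `build_inverted_index` and the nested-dict layout are assumptions made for the example.

```python
from collections import defaultdict

# First ten rows of the sample data: (class, fsize, gender, survived).
ROWS = [
    (1, 0, "f", 1), (1, 3, "m", 1), (1, 3, "f", 0), (1, 3, "m", 0),
    (1, 3, "f", 0), (1, 0, "m", 1), (1, 1, "f", 1), (1, 0, "m", 0),
    (1, 2, "f", 1), (1, 0, "m", 0),
]
FIELDS = ("class", "fsize", "gender", "survived")

def build_inverted_index(rows):
    """Map field -> term -> ascending list of doc ids containing the term."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, row in enumerate(rows):       # doc ids are visited in order,
        for field, term in zip(FIELDS, row):  # so each doc list stays sorted
            index[field][term].append(doc_id)
    return index

index = build_inverted_index(ROWS)
for field in sorted(index):                   # sorted fields, then sorted terms
    for term in sorted(index[field]):
        print(f"{field}={term} -> {index[field][term]}")
```

Iterating with `sorted` reproduces the "terms and docs stored in sorted order" property that the later steps rely on.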
  • 123. Primary Lookup Tables groups[doc] Where in the tree each doc is Initialized to all ones, all docs start in root values[doc] Value to be classified, for each doc In this case it’s 1 if survived, 0 otherwise
  • 124. Primary Lookup Tables values[doc] Constructed from an inverted index of the values Invert the field of interest (e.g. survived)
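The two lookup tables can be sketched as plain arrays indexed by doc id. This is a minimal illustration, assuming the label field has a single "positive" term (here survived=1) whose doc list is read from the inverted index; `init_lookup_tables` is a hypothetical helper name.

```python
def init_lookup_tables(num_docs, positive_docs):
    """Build groups[] and values[] for a binary label.

    positive_docs: doc ids listed under the label term of interest
    (for example survived=1) in the inverted index.
    """
    groups = [1] * num_docs   # every doc starts in the root, group 1
    values = [0] * num_docs   # 0 unless the doc appears under the label term
    for doc in positive_docs:
        values[doc] = 1
    return groups, values

# survived=1 doc list, restricted to the first ten docs of the sample data
groups, values = init_lookup_tables(10, [0, 1, 5, 6, 8])
print(values)
```

Inverting the label field this way turns a term-to-docs list back into a doc-to-value array, which gives constant-time lookups inside the inner loop.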
  • 125. Main Loop Overview foreach field foreach term get group stats evaluate splits apply best splits repeat n times or until no more splits found
  • 126. Main Loop - First Iteration foreach field (class, fsize, gender)
  • 127. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...)
  • 128. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats
  • 129. Get Group Stats count[grp] Count of how many documents within that group contain current term, initialized to zeros vsum[grp] Summation of the value to be classified from the documents within that group that contain current term, initialized to zeros
  • 130. Get Group Stats for current field/term
  • 131. Get Group Stats for current field/term foreach doc
  • 132. Get Group Stats for current field/term foreach doc grp = grps[doc]
  • 133. Get Group Stats for current field/term foreach doc grp = grps[doc] if grp == 0 skip
  • 134. Get Group Stats for current field/term foreach doc grp = grps[doc] if grp == 0 skip count[grp]++ vsum[grp] += vals[doc]
  • 135. Get Group Stats for current field/term (class=1)
  • 136. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...)
  • 137. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
  • 138. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip
  • 139. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
  • 140. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 0, vsum[1] = 0
  • 141. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 1, vsum[1] = 1
  • 142. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 2, vsum[1] = 2
  • 143. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 3, vsum[1] = 2
  • 144. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 4, vsum[1] = 2
  • 145. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 5, vsum[1] = 2
  • 146. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 6, vsum[1] = 3
  • 147. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 323, vsum[1] = 200
  • 148. class = 1 1309 passengers 500 survivors 38.2% survival
  • 149. class = 1 Group 1 1309 passengers 500 survivors 38.2% survival
  • 150. class = 1 Group 1 1309 passengers 500 survivors 38.2% survival
  • 151. class = 1 323 passengers count[1] Group 1 1309 passengers 500 survivors 38.2% survival
  • 152. class = 1 323 passengers 200 survivors count[1] vsum[1] Group 1 1309 passengers 500 survivors 38.2% survival
  • 153. Get Group Stats for current field/term (class=2) foreach doc (323,324,325,326,327,328,329...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…) … count[1] = 277, vsum[1] = 119
  • 154. Get Group Stats for current field/term (class=3) foreach doc (600,601,602,603,604,605,606...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…) … count[1] = 709, vsum[1] = 181
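The per-term pass traced on the slides above translates almost line for line into code. In this sketch the arrays are sized by a `num_groups` parameter, and group 0 is read as "document excluded from further splitting"; both are assumptions, since the slides only say to skip group 0.

```python
def get_group_stats(doc_ids, groups, values, num_groups):
    """Count docs and sum label values per group for one field/term.

    doc_ids: doc ids that contain the current term (from the inverted index).
    Returns (count, vsum), both indexed by group id.
    """
    count = [0] * (num_groups + 1)
    vsum = [0] * (num_groups + 1)
    for doc in doc_ids:
        grp = groups[doc]
        if grp == 0:          # assumed meaning: doc is no longer in play
            continue
        count[grp] += 1
        vsum[grp] += values[doc]
    return count, vsum

# First six docs of class=1, all still in the root group.
groups = [1, 1, 1, 1, 1, 1]
values = [1, 1, 0, 0, 0, 1]
count, vsum = get_group_stats(range(6), groups, values, num_groups=1)
print(count[1], vsum[1])
```

With these six docs the result matches the running totals on the slides at the same point (count[1] = 6, vsum[1] = 3).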
  • 155. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats evaluate splits
  • 156. Evaluate Splits Consider current field/term as a potential split for each group 1) check if split is admissible balance check, significance check 2) score the split conditional entropy or some other heuristic 3) keep best scoring split
  • 157. Evaluate Splits totalcount[group] / totalvalue[group] Total number of documents and total values for each group, i.e. # passengers / # survivors bestsplit[group] / bestscore[group] Current best split and score for each group, initially nulls
  • 158. foreach field/term (class=1) get group stats (count[1]=323,vsum[1]=200) foreach group if not admissible( … ) skip score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp]) if score < bestscore[grp] bestscore[grp] = score bestsplit[grp] = field/term
  • 159. foreach field/term (class=1) get group stats (count[1]=323,vsum[1]=200) foreach group if not admissible( … ) skip score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp]) if score < bestscore[grp] bestscore[grp] = score bestsplit[grp] = field/term
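A runnable version of the evaluation step might look like the following. The score is assumed to be natural-log conditional entropy, and the admissibility test is a hypothetical minimum-size balance check standing in for the balance and significance checks the slides mention without detail.

```python
import math

def calc_score(cnt, vsum, tot_cnt, tot_val):
    """Conditional entropy (nats) of the label given term / not-term."""
    def h(pos, n):
        if n == 0 or pos == 0 or pos == n:
            return 0.0
        p = pos / n
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))
    rest = tot_cnt - cnt
    return (cnt * h(vsum, cnt) + rest * h(tot_val - vsum, rest)) / tot_cnt

def admissible(cnt, tot_cnt, min_side=10):
    """Hypothetical balance check: both sides need at least min_side docs."""
    return cnt >= min_side and tot_cnt - cnt >= min_side

def evaluate_split(split, count, vsum, tot_cnt, tot_val, best_score, best_split):
    """Update the per-group best split in place for one field/term."""
    for grp in range(1, len(count)):
        if not admissible(count[grp], tot_cnt[grp]):
            continue
        score = calc_score(count[grp], vsum[grp], tot_cnt[grp], tot_val[grp])
        if best_score[grp] is None or score < best_score[grp]:
            best_score[grp] = score
            best_split[grp] = split

# Root-level stats from the slides; index 0 is unused because groups start at 1.
tot_cnt, tot_val = [0, 1309], [0, 500]
best_score, best_split = [None, None], [None, None]
stats = [("class=1", 323, 200), ("class=2", 277, 119), ("class=3", 709, 181),
         ("gender=f", 466, 339), ("gender=m", 843, 161)]
for split, cnt, vs in stats:
    evaluate_split(split, [0, cnt], [0, vs], tot_cnt, tot_val,
                   best_score, best_split)
print(best_split[1])
```

Because the comparison is strict, gender=m, which describes the same partition as gender=f, does not displace the earlier entry.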
  • 160. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats evaluate splits apply best splits (bestsplit[1]=“gender=f”)
  • 161. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group
  • 162. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 1
  • 163. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female 1
  • 164. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female positive group: 3 1 3
  • 165. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female positive group: 3 2 negative group: 2 1 3
  • 166. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group target group: 1 condition: gender=female positive group: 3 2 negative group: 2 1 3
  • 167. Apply Best Splits Using inverted index, iterate over docs that match split condition If current document is in targeted group, move it to the positive group At the end, move anything left in target group to negative group
  • 168. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 169. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 170. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 171. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 172. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 173. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 174. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 175. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  • 176. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  • 177. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  • 178. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  • 179. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 2 group[2] = 3 group[3] = 2 group[4] = 3 group[5] = 2 group[6] = 3 group[7] = 2 group[8] = 3 group[9] = 2 group[10] = 2 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 2 group[15] = 2 group[16] = 2 group[17] = 3 group[18] = 3 group[19] = 2 group[20] = 2
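The two-pass update animated on the slides above can be written as a small in-place function. This is a sketch with illustrative names; the doc list is truncated to docs 0 to 20 to match the slide.

```python
def apply_split(groups, doc_ids, target, positive, negative):
    """Move every doc of `target` into `positive` or `negative`, in place."""
    # Pass 1: docs matching the condition come straight from the inverted index.
    for doc in doc_ids:
        if groups[doc] == target:
            groups[doc] = positive
    # Pass 2: anything still in the target group did not match the condition.
    for doc, grp in enumerate(groups):
        if grp == target:
            groups[doc] = negative

groups = [1] * 21
gender_f = [0, 2, 4, 6, 8, 11, 12, 13, 17, 18]  # gender=f docs among 0..20
apply_split(groups, gender_f, target=1, positive=3, negative=2)
print(groups)
```

Only the matching doc list is read from the index; the non-matching side falls out of the second pass without ever looking up gender=m.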
  • 180. Main Loop foreach field foreach term get group stats evaluate splits apply best splits repeat n times or until no more splits found
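Putting the three steps together, a compact single-machine sketch of the main loop could look like this. It is not Indeed's implementation: the score is assumed to be natural-log conditional entropy, the only admissibility rule is that a split must separate something in an impure group, and child groups are numbered 2g (negative) and 2g+1 (positive), generalising the 1 → 2, 3 numbering on the slides.

```python
import math

def score(cnt, vsum, tot_cnt, tot_val):
    """Conditional entropy (nats) of the label given term / not-term."""
    def h(pos, n):
        if n == 0 or pos == 0 or pos == n:
            return 0.0
        p = pos / n
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))
    rest = tot_cnt - cnt
    return (cnt * h(vsum, cnt) + rest * h(tot_val - vsum, rest)) / tot_cnt

def build_layers(index, values, max_iters):
    """Grow the tree one layer per iteration; returns (groups, splits)."""
    groups = [1] * len(values)
    splits = {}                                   # group -> (field, term)
    for _ in range(max_iters):
        tot_cnt, tot_val = {}, {}
        for doc, grp in enumerate(groups):
            tot_cnt[grp] = tot_cnt.get(grp, 0) + 1
            tot_val[grp] = tot_val.get(grp, 0) + values[doc]
        best = {}                                 # group -> (score, field, term)
        for field in sorted(index):
            for term in sorted(index[field]):
                count, vsum = {}, {}              # get group stats
                for doc in index[field][term]:
                    grp = groups[doc]
                    count[grp] = count.get(grp, 0) + 1
                    vsum[grp] = vsum.get(grp, 0) + values[doc]
                for grp, cnt in count.items():    # evaluate splits
                    pure = tot_val[grp] in (0, tot_cnt[grp])
                    if pure or cnt == tot_cnt[grp]:
                        continue                  # nothing to separate
                    s = score(cnt, vsum[grp], tot_cnt[grp], tot_val[grp])
                    if grp not in best or s < best[grp][0]:
                        best[grp] = (s, field, term)
        if not best:
            break                                 # no more splits found
        for grp, (_, field, term) in best.items():    # apply best splits
            splits[grp] = (field, term)
            for doc in index[field][term]:
                if groups[doc] == grp:
                    groups[doc] = 2 * grp + 1     # positive side
        for doc, grp in enumerate(groups):
            if grp in best:
                groups[doc] = 2 * grp             # negative side
    return groups, splits

# First ten rows of the sample data (label field left out of the index).
index = {
    "class": {1: list(range(10))},
    "fsize": {0: [0, 5, 7, 9], 1: [6], 2: [8], 3: [1, 2, 3, 4]},
    "gender": {"f": [0, 2, 4, 6, 8], "m": [1, 3, 5, 7, 9]},
}
values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
groups, splits = build_layers(index, values, max_iters=1)
print(splits, groups)
```

Each iteration makes one pass over the index and splits every open group at once, which is the layer-at-a-time strategy. On this ten-row sample the first split differs from the full-data result because the sample is tiny.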
  • 181. 1
  • 182. 1 iter #1
  • 183. 1 iter #1 gender = female
  • 184. 1 iter #1 2 gender ≠ female 3 gender = female
  • 185. 1 iter #1 2 iter #2 3
  • 186. Main Loop - Second Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats
  • 187. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (3,2,3,2,3,2,3,2,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
  • 188. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (3,2,3,2,3,2,3,2,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[2] = 179, vsum[2] = 61 count[3] = 144, vsum[3] = 139
  • 189. Get Group Stats for current field/term (class=2) foreach doc (323,324,325,326,327,328,329...) grp = grps[doc] (2,3,2,2,2,2,3,2,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…) … count[2] = 171, vsum[2] = 25 count[3] = 106, vsum[3] = 94
  • 190. Get Group Stats for current field/term (class=3) foreach doc (600,601,602,603,604,605,606...) grp = grps[doc] (2,2,2,3,3,2,2,3,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…) … count[2] = 493, vsum[2] = 75 count[3] = 216, vsum[3] = 106
  • 191. Get Group Stats for current field/term (gender=female) foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….) grp = grps[doc] (3,3,3,3,3,3,3,3,3,3,3,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,0,0,1,1,1,1,1,1…) … count[2] = 0, vsum[2] = 0 count[3] = 466, vsum[3] = 339
  • 192. Get Group Stats for current field/term (gender=male) foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...) grp = grps[doc] (2,2,2,2,2,2,2,2,2,2,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,0,1,0,0,0,1,0,0…) … count[2] = 843, vsum[2] = 161 count[3] = 0, vsum[3] = 0
  • 193. What About Inequality Splits? e.g. class ≤ 2
  • 194. Main Loop + Inequality Splits foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  • 195. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  • 196. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats update inequality stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  • 197. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats update inequality stats evaluate splits evaluate inequality splits apply best splits for each group repeat n times or until no more splits found
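One way to read the "inequality stats" steps is as running totals over a field's terms in ascending order, so that the stats for `field ≤ term` are available without a second pass. The sketch below shows this for a single group; the function name and tuple layout are assumptions.

```python
def inequality_stats(term_stats):
    """Yield (term, count, vsum) for each candidate split `field <= term`.

    term_stats: (term, count, vsum) tuples for one field and one group,
    in ascending term order, as the sorted index provides them.
    """
    run_count = run_vsum = 0      # "reset inequality stats" at each new field
    for term, count, vsum in term_stats:
        run_count += count        # "update inequality stats"
        run_vsum += vsum
        yield term, run_count, run_vsum

# Per-term stats for the class field in the root group, from the slides.
class_stats = [(1, 323, 200), (2, 277, 119), (3, 709, 181)]
cumulative = list(inequality_stats(class_stats))
print(cumulative)
```

The entry for class ≤ 2 reproduces the 600 passengers and 319 survivors from the earlier example. The entry for the largest term covers the whole group, so it cannot be an admissible split.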
  • 198. Scalability Performs quite well on a single machine Worked well for a while, but started to hit limits Ultimately needed to distribute to multiple machines
  • 199. Multiple Machine Implementation
  • 200. Hadoop?
  • 201. Hadoop Experimented with using Hadoop Each level took five sequential map reduce jobs Much slower than single machine; repeatedly writes intermediate data and lots of shuffling
  • 202. Hadoop Experimented with using Hadoop Each level took five sequential map reduce jobs Much slower than single machine; repeatedly writes intermediate data and lots of shuffling Hadoop not great for iterative algorithms
  • 203. Partition Data
  • 204. Inverted Index
  • 205. Inverted Index
  • 206. Inverted Index
  • 207. Inverted Index Shard 1 Shard 2
  • 208. Machine 1 Machine 2 Shard 1 Shard 2
  • 209. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  • 210. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  • 211. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  • 212. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  • 213. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  • 214. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  • 215. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  • 219. FTGS Stream - Single Machine (sorted) class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
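A sorted FTGS stream like the one above can be produced by walking an inverted index field by field and term by term, then aggregating per group. This is a minimal sketch, not Imhotep's implementation: the function name `ftgs`, the in-memory dict layout, and the reading of each entry as `field=term|group|count|positives` are assumptions for illustration.

```python
from collections import defaultdict

def ftgs(index, groups, labels):
    """Yield (field, term, group, count, positives) in sorted order.

    index:  field -> term -> list of doc ids (an inverted index)
    groups: groups[doc_id] is the tree node the document currently sits in
    labels: labels[doc_id] is 1 for a positive example, else 0
    """
    for field in sorted(index):                    # foreach field
        for term in sorted(index[field]):          # foreach term
            stats = defaultdict(lambda: [0, 0])    # get group stats
            for doc in index[field][term]:
                entry = stats[groups[doc]]
                entry[0] += 1
                entry[1] += labels[doc]
            for group in sorted(stats):
                count, positives = stats[group]
                yield (field, term, group, count, positives)

# Toy data: four documents, all still in the root group 1.
index = {
    "gender": {"f": [0, 2], "m": [1, 3]},
    "fsize": {0: [0, 1], 1: [2], 10: [3]},   # int terms, so 10 sorts after 1
}
groups = [1, 1, 1, 1]
labels = [1, 0, 1, 0]
stream = list(ftgs(index, groups, labels))
for field, term, group, count, positives in stream:
    print(f"{field}={term}|{group}|{count}|{positives}")
```

Keeping the output sorted by field, term, and group is what later allows streams from different shards to be merged in one pass.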
  • 234. FTGS Stream How to distribute?
  • 235. Machine 1 Machine 2 Shard 1 Shard 2
  • 236. FTGS 1 Machine 2 Shard 1 Shard 2
  • 237. FTGS 1 FTGS 2 Shard 1 Shard 2
  • 238. FTGS 1 FTGS 2 Machine 3 Shard 1 Shard 2
  • 239. FTGS 1 Merge FTGS 2 Machine 3 Shard 1 Shard 2
  • 240. FTGS Stream Merge class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 1
  • 241. FTGS Stream Merge class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Machine 2
  • 242. FTGS Stream Merge Machine 1 and Machine 2 streams shown side by side (contents as on slides 240 and 241)
  • 243. FTGS Stream Merge class=1|1|198|111 (Machine 1) + class=1|1|125|89 (Machine 2) → class=1|1|323|200
  • 244. FTGS Stream Merge Merged so far: class=1|1|323|200
  • 245. FTGS Stream Merge class=2|1|277|119 (Machine 1 only) → class=2|1|277|119
  • 246. FTGS Stream Merge Merged so far: class=1|1|323|200 class=2|1|277|119
  • 247. FTGS Stream Merge class=3|1|511|129 (Machine 1) + class=3|1|198|52 (Machine 2) → class=3|1|709|181
  • 248. FTGS Stream Merge Merged so far: class=1|1|323|200 class=2|1|277|119 class=3|1|709|181
  • 249. FTGS Stream Merge fsize=0|1|790|239 (Machine 1 only) → fsize=0|1|790|239
  • 250. FTGS Stream Merge Merged so far: class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239
  • 251. FTGS Stream Merge fsize=1|1|94|53 (Machine 1) + fsize=1|1|141|73 (Machine 2) → fsize=1|1|235|126
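The merge walked through on these slides can be sketched as a sorted merge that sums the stats of entries sharing the same field, term, and group. This is a minimal sketch using the two per-machine streams from slides 240 and 241; the helper names `parse` and `merge_ftgs` are hypothetical, and the column reading `field=term|group|count|positives` is an assumption.

```python
import heapq
from itertools import groupby

def parse(text):
    """Parse whitespace-separated 'field=term|group|count|positives' entries."""
    rows = []
    for item in text.split():
        field, rest = item.split("=")
        term, group, count, positives = rest.split("|")
        term = int(term) if term.isdigit() else term   # numeric terms sort numerically
        rows.append((field, term, int(group), int(count), int(positives)))
    return rows

def merge_ftgs(*streams):
    """Merge sorted FTGS streams, summing stats for equal (field, term, group)."""
    ordered = heapq.merge(*streams, key=lambda row: row[:3])
    for key, rows in groupby(ordered, key=lambda row: row[:3]):
        count = positives = 0
        for row in rows:
            count += row[3]
            positives += row[4]
        yield key + (count, positives)

MACHINE_1 = parse(
    "class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 "
    "fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 "
    "fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122")
MACHINE_2 = parse(
    "class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 "
    "fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 "
    "fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39")

merged = list(merge_ftgs(MACHINE_1, MACHINE_2))
for field, term, group, count, positives in merged:
    print(f"{field}={term}|{group}|{count}|{positives}")
```

The merged result matches the single-machine stream shown earlier, which is the property that makes the distributed version a drop-in replacement for the main loop.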
  • 252. Shard 1 Shard 2 Shard 3 Shard 4 Shard 5 Shard 6
  • 253. FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  • 254. k-way merge FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  • 255. FTGS 1-6 FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  • 256. FTGS 1-6 FTGS 7-12 FTGS 13-18
  • 257. FTGS 1-18 FTGS 1-6 FTGS 7-12 FTGS 13-18
  • 258. FTGS 1-36 FTGS 1-18 FTGS 19-36
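The merge tree on these slides (shard streams merged into FTGS 1-6, then 1-18, then 1-36) can be sketched as repeated k-way merges until one stream remains. This is a minimal sketch under stated assumptions: a uniform `fan_in` is a simplification (the slides show fan-ins of 6, 3, and 2 at successive levels), the names `merge_ftgs` and `tree_merge` are hypothetical, and real streams would cross the network rather than live in memory.

```python
import heapq
from itertools import groupby

def merge_ftgs(streams):
    """k-way merge of sorted FTGS streams, summing stats for equal keys."""
    ordered = heapq.merge(*streams, key=lambda row: row[:3])
    for key, rows in groupby(ordered, key=lambda row: row[:3]):
        count = positives = 0
        for row in rows:
            count += row[3]
            positives += row[4]
        yield key + (count, positives)

def tree_merge(streams, fan_in=6):
    """Merge streams level by level, at most `fan_in` inputs per merge node."""
    level = list(streams)
    if not level:
        return iter(())
    while len(level) > 1:
        level = [merge_ftgs(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

# 36 toy shards, each reporting the same two terms for group 1.
shards = [[("fsize", 1, 1, 1, s % 2), ("fsize", 2, 1, 2, 0)] for s in range(36)]
result = list(tree_merge(shards))
print(result)
```

Every level is a lazy generator, so rows flow up the tree one at a time and no level has to buffer a whole stream.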
  • 259. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  • 260. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  • 262. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found Regroup
  • 263. FTGS FTGS 1-6 FTGS 7-12 FTGS 13-18
  • 264. Regroup Regroup 1-6 Regroup 7-12 Regroup 13-18
  • 265. FTGS FTGS 1-6 FTGS 7-12 FTGS 13-18
  • 266. Regroup Regroup 1-6 Regroup 7-12 Regroup 13-18
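The Regroup step ("apply best splits for each group") can be sketched as rewriting each document's group id according to the split chosen for its current group. This is a minimal sketch, not Imhotep's actual operation: the function name `regroup`, the rule format `(field, term, positive_group, negative_group)`, and the restriction to equality splits are assumptions for illustration.

```python
def regroup(groups, index, rules):
    """Return new group ids after applying one split rule per group.

    groups: groups[doc_id] is the document's current group
    index:  field -> term -> list of doc ids (an inverted index)
    rules:  group -> (field, term, positive_group, negative_group)
    Documents whose group has no rule keep their group.
    """
    new_groups = list(groups)
    # Default every document in a split group to the negative side.
    for doc, group in enumerate(groups):
        if group in rules:
            new_groups[doc] = rules[group][3]
    # Move documents that contain the split term to the positive side.
    for group, (field, term, positive_group, _) in rules.items():
        for doc in index[field].get(term, []):
            if groups[doc] == group:
                new_groups[doc] = positive_group
    return new_groups

index = {
    "gender": {"f": [0, 2], "m": [1, 3]},
    "fsize": {0: [0, 1], 1: [2], 10: [3]},
}
groups = [1, 1, 1, 1]
level_1 = regroup(groups, index, {1: ("gender", "f", 2, 3)})
level_2 = regroup(level_1, index, {2: ("fsize", 0, 4, 5)})
print(level_1, level_2)
```

Since a rule only needs the local inverted index and the local group array, the same rules can be applied on every machine independently, which is consistent with the per-machine "Regroup 1-6, 7-12, 13-18" boxes on the slides.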
  • 267. Imhotep
  • 268. Imhotep Distributed System that does efficient FTGS and Regroup operations on inverted indexes
  • 269. Imhotep 32 machines, each with: 2 CPUs x 6 cores (Xeon Westmere E5649), 128GB RAM, 10x1TB 7200 RPM SATA Total: 384 cores, 4TB RAM, 320TB disk
  • 270. Imhotep Decision tree on 13 billion documents
  • 271. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc
  • 272. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds First Regroup: 9.6 seconds
  • 273. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds (36.3 million terms) First Regroup: 9.6 seconds
  • 274. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds (36.3 million terms) First Regroup: 9.6 seconds (7 groups)
  • 275. Imhotep Decision tree on 13 billion documents 330GB → ~25 bytes per doc First FTGS: 314 seconds (36.3 million terms) First Regroup: 9.6 seconds (7 groups) Second FTGS: 57 seconds Second Regroup: 23 seconds (217 groups)
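The figures on the Imhotep slides can be cross-checked with a few lines of arithmetic. This is only a sanity check of the quoted numbers: it assumes decimal units (1GB = 10^9 bytes) for the per-document size, and the terms-per-second rate is a derived figure that does not appear on the slides.

```python
machines = 32
cores = machines * 2 * 6          # 2 CPUs x 6 cores per machine
ram_gb = machines * 128           # 128GB RAM per machine
disk_tb = machines * 10 * 1       # 10 x 1TB disks per machine

docs = 13e9                       # 13 billion documents
index_bytes = 330e9               # 330GB, assuming decimal gigabytes
bytes_per_doc = index_bytes / docs

# Derived rate for the first FTGS pass: 36.3 million terms in 314 seconds.
first_ftgs_terms_per_sec = 36.3e6 / 314

print(f"{cores} cores, {ram_gb / 1024:.0f}TB RAM, {disk_tb}TB disk")
print(f"{bytes_per_doc:.1f} bytes per doc")
print(f"{first_ftgs_terms_per_sec:,.0f} terms per second in the first FTGS")
```

The totals agree with the slide (384 cores, 4TB RAM, 320TB disk), and the per-document size comes out at roughly 25 bytes.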
  • 276. Imhotep Distributed System that does efficient FTGS and Regroup operations Powers our internal analytical tools
  • 277. Imhotep Distributed System that does efficient FTGS and Regroup operations Powers our internal analytical tools … and more
  • 278. Imhotep - Next @IndeedEng Talk Sharding and shard management Session / FTGS network protocol Memory management Inverted Indexes FTGS Merge Regroup operations Fault Tolerance
  • 279. Conclusion Now scales to larger and larger data sets by adding more machines Increased freshness and frequency of builds Decision trees have lots of tunable components, regularly get 1% wins via A/B test
  • 280. Continuous Improvement Sponsored Job Click-through Rate (CTR)
  • 281. Thanks.
  • 282. Q&A
  • 283. More Questions? Jason David James Jeff
  • 284. Next @IndeedEng Talk Imhotep: Large Scale Analytics and Machine Learning at Indeed Jeff Plaisance, Engineering Manager March 26, 2014 http://engineering.indeed.com/talks
