Machine Learning at Indeed
Scaling Decision Trees
George Murray
Delivery Lead, Resume Data Team
Decision Tree Learning
Given a set of documents, split it into two or more subsets that optimize some criterion. Repeat this process until a set can no longer be split.
[Tree diagram: all passengers → survived: 500 | perished: 809]
Decision Tree Learning
In this analogy:
• passengers = impressions
• survivors = clicks
[Tree diagram: three candidate splits of all passengers, survived / perished in each subset]
class ∈ [1, 2]: 319 / 281 vs class ∉ [1, 2]: 181 / 528 (H=0.6244)
gender = f: 339 / 127 vs gender ≠ f: 161 / 682 (H=0.5525)
class = 1: 200 / 123 vs class ≠ 1: 300 / 686 (H=0.6267)
[Tree diagram: the winning gender split, survived / perished]
gender = f: 339 / 127    gender ≠ f: 161 / 682
Each branch then gets its own best split: class < 3 for women, class = 1 for men.
[Tree diagram: survival rate at each node]
all passengers: 38.2%
├─ female: 72.7% → class <= 2: 93.2% | class = 3: 49.1%
└─ male: 19.1% → class = 1: 34.1% | class ≠ 1: 15.1%
with further family-size splits at the next layer: fsize <= 2: 54.9% | fsize > 2: 24.4%, and fsize = 2: 33.9% | fsize ≠ 2: 13.1%
Depth-first vs. Breadth-first
We build breadth-first: one layer at a time, all nodes simultaneously.
Data format
Inverted Index
• Maps terms to the list of documents that contain that term
• Terms and documents are stored in sorted order
• Key structure in search engines
• Also key to building one layer at a time efficiently
• Apache Lucene, Indeed Flamdex
Inverted Index
Each entry maps a field and term to the sorted document IDs that contain it:
class=1 : 0,1,2,3,4,5,6,7,8,9…
class=2 : 323,324,325,326…
class=3 : 600,601,602,603…
fsize=0 : 0,5,7,9,12,13,14,15…
fsize=1 : 6,10,11,16,17,26,27…
fsize=2 : 8,20,21,42,76,77,78…
gender=f : 0,2,4,6,8,11,12,13…
gender=m : 1,3,5,7,9,10,14,15…
survived=0 : 2,3,4,7,9,10,15,16…
survived=1 : 0,1,5,6,8,11,12,13…
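
To make the structure concrete, here is a minimal Java sketch of an inverted index. This is a toy model used by the sketches that follow, not Flamdex's or Lucene's actual format:

import java.util.*;

// Toy inverted index: "field=term" strings map to sorted doc-ID lists.
class InvertedIndex {
    // TreeMap keeps terms in sorted order, which the FTGS merge relies on later.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    void add(String term, int docId) {           // docs must be added in ID order
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    Set<String> terms() {
        return postings.keySet();
    }
}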
Primary Lookup Tables
• groups[doc]: where in the tree each doc is. All docs start at the root, so initially all 1s.
• values[doc]: the value to be classified for each doc. For the Titanic this is 1 if the passenger survived, 0 if not. In general, invert the field of interest.
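
Assuming the toy InvertedIndex above is populated in a variable named index (a hypothetical name), the two tables could be set up like this:

// numDocs = 1309 passengers; group 1 is the root of the tree.
int[] groups = new int[numDocs];
Arrays.fill(groups, 1);                       // all docs start at the root

int[] values = new int[numDocs];              // 1 if survived, 0 if not
for (int doc : index.docsFor("survived=1")) {
    values[doc] = 1;                          // "invert the field of interest"
}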
Main Loop
foreach field (class,fsize,gender,…)
  foreach term (class=1,class=2,…)
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
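
A Java skeleton of this loop, continuing the toy sketch. LayerBuilder and its fields are hypothetical names; the three helpers are stubbed here and sketched in the sections below:

class LayerBuilder {
    InvertedIndex index;
    int[] groups, values;
    int numGroups = 1;                            // just the root group at first

    // One layer: scan every posting list once, then split all groups at once.
    void buildLayer() {
        for (String term : index.terms()) {       // foreach field+term
            long[] count = new long[numGroups + 1];
            long[] vsum  = new long[numGroups + 1];
            getGroupStats(term, count, vsum);     // get group stats
            evaluateSplits(term, count, vsum);    // evaluate splits
        }
        applyBestSplits();                        // apply best splits
    }

    void getGroupStats(String term, long[] count, long[] vsum) { /* below */ }
    void evaluateSplits(String term, long[] count, long[] vsum) { /* below */ }
    void applyBestSplits() { /* below */ }
}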
Get Group Stats
• count[grp]: count of how many documents in the group contain the current term. All 0s initially.
• vsum[grp]: sum of the value to be classified over the documents within that group that contain the current term. Also all 0s initially.
Get Group Stats
// for current field+term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8,…)
  grp = grps[doc] (1,1,1,1,1,1,1,1,…)
  if grp == 0 skip
  count[grp]++
  vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,…)

Starting from count[1] = 0, vsum[1] = 0, the first few docs step the stats forward:
count[1] = 1, vsum[1] = 1
count[1] = 2, vsum[1] = 2
count[1] = 3, vsum[1] = 2
count[1] = 4, vsum[1] = 2
…
count[1] = 323, vsum[1] = 200
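
Filling in the getGroupStats stub from the skeleton above; the loop is a direct translation of the pseudocode:

// For the current term, accumulate per group the number of matching docs
// and the sum of their values (for the Titanic: passengers and survivors).
void getGroupStats(String term, long[] count, long[] vsum) {
    for (int doc : index.docsFor(term)) {   // only docs containing the term
        int grp = groups[doc];
        if (grp == 0) continue;             // group 0: doc is out of the tree
        count[grp]++;
        vsum[grp] += values[doc];
    }
}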
(These are exactly the class = 1 numbers from the candidate-splits diagram: 323 passengers, 200 survivors.)
Get Group Stats
// for current field+term (gender=m)
foreach doc (1,3,5,7,…)
  grp = grps[doc] (1,1,1,1,…)
  if grp == 0 skip
  count[grp]++
  vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,…)
count[1] = 1, vsum[1] = 1
Main Loop
foreach field (class,fsize,gender,…)
  foreach term (class=1,class=2,…)
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
Evaluate Splits
Consider the current field/term as a potential split for each group:
1. Check whether the split is admissible: balance check, significance, etc.
2. Score the split: conditional entropy, or some other heuristic
3. Keep the best-scoring split
Evaluate Splits
More tables:
• totalcount[group], totalvalue[group]: total number of documents and total value for each group; in this example, passengers and survivors respectively
• bestsplit[group], bestscore[group]: current best split and score for each group, initially nulls
Evaluate Splits
foreach group
  if not admissible (…) skip
  score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp])
  if score < bestscore[grp]
    bestscore[grp] = score
    bestsplit[grp] = (field,term)
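
Here calcscore is conditional entropy: H(Y|X) = sum over x of p(X=x) * H(Y|X=x), with binary entropy H(p) = -p ln p - (1-p) ln(1-p); the H values on the earlier diagram are consistent with natural-log entropy. A sketch of evaluateSplits under that scoring, assuming the LayerBuilder also carries the per-group arrays below; the admissibility check is reduced to "neither side empty" (the real check also covers balance and significance):

long[] totalCount, totalValue;   // docs and value sum for each whole group
double[] bestScore;              // initialized to Double.MAX_VALUE each layer
String[] bestSplit;              // initialized to null each layer

// Score splitting each group into "has term" (count, vsum) vs "lacks term"
// (totalCount - count, totalValue - vsum); lower conditional entropy wins.
void evaluateSplits(String term, long[] count, long[] vsum) {
    for (int grp = 1; grp <= numGroups; grp++) {
        long n = totalCount[grp], in = count[grp];
        if (in == 0 || in == n) continue;          // not admissible: empty side
        long out = n - in, outV = totalValue[grp] - vsum[grp];
        double score = (double) in  / n * entropy((double) vsum[grp] / in)
                     + (double) out / n * entropy((double) outV / out);
        if (score < bestScore[grp]) {
            bestScore[grp] = score;
            bestSplit[grp] = term;
        }
    }
}

// Binary entropy in nats: H(p) = -p ln p - (1-p) ln(1-p).
static double entropy(double p) {
    return (p <= 0 || p >= 1) ? 0 : -p * Math.log(p) - (1 - p) * Math.log(1 - p);
}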
Main Loop
foreach field (class,fsize,gender,…)
  foreach term (class=1,class=2,…)
    get group stats
    evaluate splits
apply best splits (bestsplit[1]=(gender,f))
repeat n times or until no more splits found
Apply Best Splits
[Diagram: all passengers split into a positive group (gender = f) and a negative group (gender ≠ f)]
Apply Best Splits
gender=f : 0,2,4,6,8,11,12,13,17,18,21,23,…

Before the split, every doc is in group 1. Walk the posting list for the split term: each doc still in the target group (1) moves to the positive group (3). Whatever is left in group 1 then moves to the negative group (2):

DocID group[ID]   DocID group[ID]   DocID group[ID]
0     3           7     2           14    2
1     2           8     3           15    2
2     3           9     2           16    2
3     2           10    2           17    3
4     3           11    3           18    3
5     2           12    3           19    2
6     3           13    3           20    2

gender≠f : 1,3,5,7,9,10,14,15,16,19,20,…
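
The applyBestSplits stub, sketched to match the walkthrough. Each split group spawns a negative and a positive destination group; with this numbering, the root's docs land in groups 2 and 3 on the first layer:

void applyBestSplits() {
    int next = numGroups + 1;
    for (int grp = 1; grp <= numGroups; grp++) {
        if (bestSplit[grp] == null) continue;       // no admissible split: leaf
        int negative = next++, positive = next++;
        for (int doc : index.docsFor(bestSplit[grp])) {
            if (groups[doc] == grp) groups[doc] = positive;  // matches the term
        }
        for (int doc = 0; doc < groups.length; doc++) {
            if (groups[doc] == grp) groups[doc] = negative;  // everything left
        }
    }
    numGroups = next - 1;  // per-group arrays get re-sized and reset next layer
}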
Main Loop - 2nd Iteration
foreach field (class,fsize,gender,…)
  foreach term (class=1,class=2,…)
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
Get Group Stats (1st loop)
// for current field+term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8,…)
  grp = grps[doc] (1,1,1,1,1,1,1,1,…)
  if grp == 0 skip
  count[grp]++
  vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,…)

Get Group Stats (now)
Same loop, but the docs have been regrouped:
// for current field+term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8,…)
  grp = grps[doc] (3,2,3,2,3,2,3,2,…)
  if grp == 0 skip
  count[grp]++
  vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,…)
count[2] = 179, vsum[2] = 61
count[3] = 144, vsum[3] = 139
Evaluate Splits
Same loop as before, but now there are two groups to evaluate. Since bestscore and bestsplit are tracked per group, each group can end up with a different split.
Multiple-Machine Implementation
Hadoop
• Each level took five sequential MapReduce jobs
• Ended up much slower than a single machine
Inverted Index
[Diagram: the inverted index is split into shards: Shard 1 on Machine 1, Shard 2 on Machine 2]
Main Loop
foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
FTGS (field/term/group/stats) Stream - One Machine
Each line is field, term, group, then count;vsum for that group:
class 1 1 323;200
class 2 1 277;119
class 3 1 709;181
fsize 0 1 790;239
fsize 1 1 235;126
fsize 2 1 159;90
fsize 3 1 43;30
fsize 4 1 22;6
fsize 5 1 25;5
fsize 6 1 16;4
fsize 7 1 8;0
fsize 10 1 11;0
gender f 1 466;339
gender m 1 843;161
[Diagram: Shard 1 produces FTGS stream 1, Shard 2 produces FTGS stream 2; the two streams feed a merge]
FTGS Stream Merge
Shard 1 stream:
class 1 1 198;111
class 2 1 277;119
class 3 1 511;129
fsize 0 1 790;239
fsize 1 1 94;53
fsize 2 1 75;48
fsize 3 1 21;17
fsize 4 1 3;1
fsize 5 1 25;5
gender f 1 308;237
gender m 1 678;122

Shard 2 stream:
class 1 1 125;89
class 3 1 198;52
fsize 1 1 141;73
fsize 2 1 84;42
fsize 3 1 122;13
fsize 4 1 19;5
fsize 6 1 16;4
fsize 7 1 8;0
fsize 10 1 11;0
gender f 1 158;102
gender m 1 165;39

The merged stream is built term by term; entries with the same field, term, and group have their stats summed:
class 1 1 323;200 (= 198;111 + 125;89)
class 2 1 277;119 (shard 1 only)
class 3 1 709;181 (= 511;129 + 198;52)
fsize 0 1 790;239 (shard 1 only)
fsize 1 1 235;126 (= 94;53 + 141;73)
…and so on through gender m.
FTGS Stream Merge
On a single machine, the shard streams (Shard 1 … Shard 6, producing FTGS 1 … FTGS 6) feed a k-way merge: O(n k log k) for k streams of n entries.
Across machines, merges form a tree: FTGS 1-6 feed Merge 1-6, FTGS 7-12 feed Merge 7-12, and those two feed Merge 1-12.
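
A heap-based sketch of the k-way merge (per the notes, a selection tree or heap both work). Entry and FtgsMerge are hypothetical names; real term ordering may be numeric for integer fields, but string comparison keeps the sketch simple:

import java.util.*;

// One FTGS stats line: (field, term, group) -> count;vsum, ordered by key.
record Entry(String field, String term, int group, long count, long vsum)
        implements Comparable<Entry> {
    public int compareTo(Entry o) {
        int c = field.compareTo(o.field);
        if (c == 0) c = term.compareTo(o.term);
        if (c == 0) c = Integer.compare(group, o.group);
        return c;
    }
    Entry plus(Entry o) {   // same key in two streams: sum the stats
        return new Entry(field, term, group, count + o.count, vsum + o.vsum);
    }
}

class FtgsMerge {
    record Head(Entry entry, int src) {}

    // Merge k sorted streams with a heap; for k streams of n entries each,
    // that is the O(n k log k) from the slide. Equal keys are summed.
    static List<Entry> merge(List<Iterator<Entry>> streams) {
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparing(Head::entry));
        for (int i = 0; i < streams.size(); i++)
            if (streams.get(i).hasNext())
                heap.add(new Head(streams.get(i).next(), i));

        List<Entry> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            Entry last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.compareTo(h.entry()) == 0)
                out.set(out.size() - 1, last.plus(h.entry()));
            else
                out.add(h.entry());
            if (streams.get(h.src()).hasNext())
                heap.add(new Head(streams.get(h.src()).next(), h.src()));
        }
        return out;
    }
}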
Main Loop
foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
Evaluate Splits
[Diagram: split evaluation happens at the top of the merge tree; the per-shard FTGS streams flow through Merge 1-6 and Merge 7-12 into Merge / Evaluate 1-12]
Apply Best Splits
[Diagram: the best splits chosen at Merge / Evaluate 1-12 are applied across all shards]
[Diagram: regroup fans out to every shard: Regroup 1-12 → Regroup 1-6 and Regroup 7-12 → Regroup 1 … Regroup 6 on each machine]
Main Loop
foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
q = “sales”
Imhotep
A distributed system that does efficient FTGS and regroup operations on inverted indices
Imhotep
• 32 machines, each with:
  • 2x 6-core Xeon Westmere E5649
  • 128GB RAM
  • 10x 1TB 7200 RPM SATA
Total: 384 cores, 4TB RAM, 320TB disk
Imhotep
Decision tree on 13 billion documents
Inverted index size: 330GB
First FTGS: 314 seconds (36.3M terms)
First Regroup: 9.6 seconds (7 groups)
Second FTGS: 57 seconds
Second Regroup: 23 seconds (217 groups)
Imhotep
Also powers our internal analytics tools
Scaling decision trees - George Murray, July 2015

Editor's Notes

  • #2 Welcome to Indeed. This is an abridged version of a talk given by our CTO, condensed for time. As such I'll be skipping over some of the extended explanations, so if you'd like to know more about what I'm talking about today, the link here will get you there. I'll put it up again at the end.
  • #3 Fast scalable building system, using well known example, showing how they’re applied (search / analytics)
  • #6 This is what indeed.com looks like if you search for a job. We know what we are measuring because we designed the system.
  • #7 We want to maximize jobseekers getting jobs. There is a lot to decide about which jobs to show here: job age, will they apply, is the job seeker qualified. But we're only going to look at whether the job seeker will click (CTR). This is a supervised learning problem: users vs our machines. We do this by logging user behavior, what they clicked, and what they were presented with, and use that to help predict future behavior.
  • #8 (the type of tree we're talking about) For simplicity, a common way to discuss this is in the context of:
  • #9 the Titanic. 1309 passengers, 500 survivors. What’s the best predictor of whether someone survived?
  • #11 Here are a few of the ways we can divvy up the ~1300 passengers; they are not the only ways. Of the splits we're looking at, we need a way to score them.
  • #12 In this example we're using conditional entropy to score how well we've split the set. In information theory, entropy is the expected value of the information contained in an event. If we calculate the entropy of an event Y given a condition X, and average that result over every value X can take, that gives us conditional entropy. A lower conditional entropy means there's less uncertainty about our prediction. In this case, Y is survival (the event we want to predict reliably), and for each candidate split, X is the dividing factor (e.g. passenger class). Details of the scoring method aren't important here; they could be another talk on their own.
  • #13 In this case splitting by gender works out best
  • #14 Now that we know that we can split our document set and determine what the best split for each of those categories are.
  • #15 In this case the best predictor of survival for women is different than the best predictor for men.
  • #18 This is the algorithm we're going to use to scale it. Each tier is a group of nodes. Mention group 5 splitting down to 0. This will play into being able to scale across multiple machines.
  • #20 Mention we have another talk about Imhotep and why we use flamdex for it
  • #22 This set happens to be sorted by class, so the IDs are sequential.
  • #36 Remember, vals in this case is survival.
  • #44 Looks a little different for something like gender=m because the doc IDs are not sequential.
  • #45 This is where we were when we left the main loop
  • #49 calcscore in this case is conditional entropy. Group 1 is the only group so far.
  • #51 each split is a combination of a target group, a condition, a positive destination group, and an inverse negative destination group (next slides)
  • #52 in this example, all passengers end up in either the positive group (female) or negative inverse (not)
  • #53 To accomplish this, we use the inverted index to iterate over docs that match the split condition. If the document is in the targeted group, move it to the positive group; then move anything left in the target group to the negative group (1 = target, 2 = negative, 3 = positive).
  • #54 The inverted index posting list for gender=f.
  • #59 emphasize target group, only group
  • #66 last time we looked at group stats, this is what it looked like
  • #67 This time around we’ve rebucketed all the documents into groups two or three
  • #69 Now when we evaluate splits again later in the main loop, we have two groups to validate splits for. Since best score and best split are by group we could end up with different splits per group
  • #70 Let’s talk about scalability. This algorithm actually worked fairly well on a single machine but we did eventually hit limits
  • #71 Our algorithm is iterative. Hadoop writes intermediate data and does lots of shuffling of bits across machines, plus job-setup overhead.
  • #85 Selection tree or heap
  • #93 Go back a few times
  • #94 We use this algorithm to generate decision trees like this one for our SERP (example): value = click-through rate, field = job title. Terms are in rectangles; the probability of CTR is the yellow bubble; the weight is used in Lucene.
  • #96 OPEN SOURCE. An Egyptian-history-buff name: Imhotep was possibly the first physician, architect, and engineer.
  • #97 (original specs)
  • #99 Cache fresh
  • #100 Talk about Imhotep: it scales larger by adding more machines; it increases the freshness of decision trees; since we can iterate fast we can tweak fast, getting regular 1% wins via A/B tests. The online talk covers sharding concerns with hotspots and micro-optimizations. Examples: top 10 sales cities -> FTGS; bucket by hour -> regroup operation; top 10 queries in Seattle? A/B test bucket.
  • #101 More on imhotep at this link (sharding, hot spots, micro optimizations, and more)