1. 1. Batch Querying with Cascading Nathan Marz Rapleaf
2. 2. Batch Querying <ul><li>A core tool at Rapleaf - we use this everywhere </li></ul><ul><li>Will be releasing source code tomorrow </li></ul>
3. 3. Motivation <ul><li>We have a list of 1,000,000 e-mails </li></ul><ul><li>Database size: 10,000,000,000 records </li></ul><ul><li>We want all the records we have in our database related to those e-mails </li></ul>
5. 5. Interface
6. 6. Features <ul><li>Variable number of keys per record </li></ul><ul><li>Many keys (“batch”) </li></ul><ul><li>Output is a subset of the input - if a record matches multiple keys, emit it only once </li></ul><ul><li>Does not eliminate duplication from input set </li></ul>
7. 7. Algorithm #1 <ul><li>Step 1: For each record R, key K in R: emit(K,R) </li></ul><ul><li>[A, (A, AGE, 25)] </li></ul><ul><li>[A, (A, FRIENDS, B)] </li></ul><ul><li>[B, (A, FRIENDS, B)] </li></ul><ul><li>[C, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[A, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[E, (E, FRIENDS, R)] </li></ul><ul><li>[R, (E, FRIENDS, R)] </li></ul><ul><li>(A, AGE, 25) </li></ul><ul><li>(A, FRIENDS, B) </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul>
8. 8. Algorithm #1 (cont.) <ul><li>Step 2: Inner join against list of keys: </li></ul><ul><li>E </li></ul><ul><li>R </li></ul><ul><li>D </li></ul><ul><li>[D, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[E, (E, FRIENDS, R)] </li></ul><ul><li>[R, (E, FRIENDS, R)] </li></ul><ul><li>[A, (A, AGE, 25)] </li></ul><ul><li>[A, (A, FRIENDS, B)] </li></ul><ul><li>[B, (A, FRIENDS, B)] </li></ul><ul><li>[C, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[A, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[E, (E, FRIENDS, R)] </li></ul><ul><li>[R, (E, FRIENDS, R)] </li></ul>
9. 9. Algorithm #1 (cont.) <ul><li>Step 3: Strip out key from tuple </li></ul><ul><li>[D, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[D, (D, GENDER, MALE)] </li></ul><ul><li>[E, (E, FRIENDS, R)] </li></ul><ul><li>[R, (E, FRIENDS, R)] </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul><ul><li>(E, FRIENDS, R) </li></ul>
10. 10. Algorithm #1 (cont.) <ul><li>(A, AGE, 25) </li></ul><ul><li>(A, FRIENDS, B) </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul><ul><li>(E, FRIENDS, R) </li></ul><ul><li>Didn’t work - duplicated a record </li></ul><ul><li>Cannot just do a distinct, since there might be actual duplicates </li></ul><ul><li>Need a way to mark (K,R) pairs as originating from the same record </li></ul>Input Output
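The three steps of Algorithm #1 and its failure mode can be reproduced in a few lines of plain Python. This is a minimal in-memory sketch, not the Cascading implementation: `RECORDS`, `QUERY_KEYS`, `keys_of`, and `algorithm_1` are hypothetical names, and the rule for which fields contain keys is an assumption inferred from the sample data on the slides.

```python
from collections import Counter

# Sample records from the slides, written as (subject, field, value).
RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),  # a genuine duplicate in the input
    ("E", "FRIENDS", "R"),
]
QUERY_KEYS = {"E", "R", "D"}


def keys_of(record):
    """Hypothetical key extractor: the subject plus any keys held in the value."""
    subject, field, value = record
    if field == "FRIENDS":
        return [subject, value]
    if field == "OTHER_EMAILS":
        return [subject, *value]
    return [subject]


def algorithm_1(records, query_keys):
    # Step 1: emit one (K, R) pair for every key K of every record R.
    pairs = [(key, record) for record in records for key in keys_of(record)]
    # Step 2: inner join against the list of query keys.
    joined = [(key, record) for key, record in pairs if key in query_keys]
    # Step 3: strip the key from each tuple.
    return [record for _, record in joined]


output = algorithm_1(RECORDS, QUERY_KEYS)
counts = Counter(output)
```

In this sketch `(E, FRIENDS, R)` appears twice in `output` although it occurs once in the input, because it matches both `E` and `R`. The two `(D, GENDER, MALE)` entries are legitimate, which is why a plain distinct cannot repair the result.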
11. 11. Algorithm #2 <ul><li>Step 1: For each record R, choose unique string S: For each key K in R: emit(K, S, R) </li></ul><ul><li>[A, 1, (A, AGE, 25)] </li></ul><ul><li>[A, 2, (A, FRIENDS, B)] </li></ul><ul><li>[B, 2, (A, FRIENDS, B)] </li></ul><ul><li>[C, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[A, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>(A, AGE, 25) </li></ul><ul><li>(A, FRIENDS, B) </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul>
12. 12. Algorithm #2 (cont.) <ul><li>Step 2: Inner join against list of keys: </li></ul><ul><li>E </li></ul><ul><li>R </li></ul><ul><li>D </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>[A, 1, (A, AGE, 25)] </li></ul><ul><li>[A, 2, (A, FRIENDS, B)] </li></ul><ul><li>[B, 2, (A, FRIENDS, B)] </li></ul><ul><li>[C, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[A, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul>
13. 13. Algorithm #2 (cont.) <ul><li>Step 3: Unique tuples by S </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul>
14. 14. Algorithm #2 (cont.) <ul><li>Step 4: Strip out all fields besides output </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul>It works!
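Algorithm #2 can be sketched the same way. This is an in-memory illustration under stated assumptions rather than the Cascading code: `keys_of` and `algorithm_2` are hypothetical names, and the unique marker S is taken from `enumerate`, which only works on a single machine. The slides call for a unique string per record, and a distributed job would need some other way to generate one.

```python
# Sample records from the slides, written as (subject, field, value).
RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),  # a genuine duplicate in the input
    ("E", "FRIENDS", "R"),
]
QUERY_KEYS = {"E", "R", "D"}


def keys_of(record):
    """Hypothetical key extractor: the subject plus any keys held in the value."""
    subject, field, value = record
    if field == "FRIENDS":
        return [subject, value]
    if field == "OTHER_EMAILS":
        return [subject, *value]
    return [subject]


def algorithm_2(records, query_keys):
    # Step 1: tag each record with a unique marker S, then emit (K, S, R) per key.
    triples = [
        (key, s, record)
        for s, record in enumerate(records, start=1)
        for key in keys_of(record)
    ]
    # Step 2: inner join against the list of query keys.
    joined = [t for t in triples if t[0] in query_keys]
    # Step 3: unique by S, keeping the first tuple seen for each marker.
    first_by_s = {}
    for _key, s, record in joined:
        first_by_s.setdefault(s, record)
    # Step 4: strip everything except the output record.
    return list(first_by_s.values())


output = algorithm_2(RECORDS, QUERY_KEYS)
```

Because uniqueness is decided on S rather than on record contents, the two `(D, GENDER, MALE)` inputs (S = 4 and S = 5) both survive, while the two matches of `(E, FRIENDS, R)` (both S = 6) collapse to one.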
15. 15. Algorithm #2 Analysis <ul><li>Basically a glorified join </li></ul><ul><li>Slow when number of keys much smaller than number of records </li></ul><ul><li>Need to funnel all data through a reduce! </li></ul>
16. 16. Speeding it up <ul><li>Bloom filter: “Set” structure where testing for membership may result in false positives (but no false negatives) </li></ul><ul><li>Add a large bloom filter to filter out as much data as possible in first map phase </li></ul>
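A minimal Bloom filter illustrates the property this slide relies on: a key that was added always passes the membership test, while a key that was never added usually, but not always, fails it. The class below is an illustrative sketch, not the filter Rapleaf used. The name `BloomFilter`, the SHA-256 based double hashing, and the toy sizes are all assumptions.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over string keys (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive num_hashes bit positions from two 64-bit
        # halves of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True for every added key; may also be True for a key never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bloom = BloomFilter(num_bits=1024, num_hashes=4)
for key in ["E", "R", "D"]:
    bloom.add(key)
```

A `False` from `might_contain` is definitive, so any (K, S, R) tuple whose key fails the test can be dropped in the map phase without affecting the final result.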
17. 17. Algorithm #3 <ul><li>Step 1: Create a bloom filter out of all keys </li></ul><ul><li>E </li></ul><ul><li>R </li></ul><ul><li>D </li></ul>
18. 18. Algorithm #3 <ul><li>Step 2: For each record R, choose unique string S: For each key K in R: emit(K, S, R) </li></ul><ul><li>[A, 1, (A, AGE, 25)] </li></ul><ul><li>[A, 2, (A, FRIENDS, B)] </li></ul><ul><li>[B, 2, (A, FRIENDS, B)] </li></ul><ul><li>[C, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[A, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>(A, AGE, 25) </li></ul><ul><li>(A, FRIENDS, B) </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul>
19. 19. Algorithm #3 <ul><li>Step 3: For each (K, S, R), keep tuple only if K passes membership test of bloom filter </li></ul><ul><li>[A, 1, (A, AGE, 25)] </li></ul><ul><li>[A, 2, (A, FRIENDS, B)] </li></ul><ul><li>[B, 2, (A, FRIENDS, B)] </li></ul><ul><li>[C, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[A, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>[B, 2, (A, FRIENDS, B)] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul>
20. 20. Algorithm #3 (cont.) <ul><li>Step 4: Inner join against list of keys: </li></ul><ul><li>E </li></ul><ul><li>R </li></ul><ul><li>D </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>[B, 2, (A, FRIENDS, B)] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul>
21. 21. Algorithm #3 (cont.) <ul><li>Step 5: Unique tuples by S </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul>
22. 22. Algorithm #3 (cont.) <ul><li>Step 6: Strip out all fields besides output </li></ul><ul><li>[D, 3, (C, OTHER_EMAILS, [A, D])] </li></ul><ul><li>[D, 4, (D, GENDER, MALE)] </li></ul><ul><li>[D, 5, (D, GENDER, MALE)] </li></ul><ul><li>[E, 6, (E, FRIENDS, R)] </li></ul><ul><li>[R, 6, (E, FRIENDS, R)] </li></ul><ul><li>(C, OTHER_EMAILS, [A, D]) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(D, GENDER, MALE) </li></ul><ul><li>(E, FRIENDS, R) </li></ul>
23. 23. Algorithm #3 (cont.) <ul><li>Job #1: </li></ul><ul><ul><li>Mapper: Bloom filter </li></ul></ul><ul><ul><li>Reducer: Inner join </li></ul></ul><ul><li>Job #2: </li></ul><ul><ul><li>Mapper: Emit key=S, value=tuple </li></ul></ul><ul><ul><li>Reducer: Pick first tuple and strip fields </li></ul></ul>
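The two-job split can be mimicked in plain Python by treating each job as a function with a map part and a reduce part. This is a single-process sketch under stated assumptions, not Cascading or Hadoop code: `job_1`, `job_2`, `keys_of`, `bloom_positions`, and `build_bloom` are hypothetical names, the Bloom filter is represented as a set of bit positions for brevity, and S comes from `enumerate`.

```python
import hashlib
from collections import defaultdict

RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),
    ("E", "FRIENDS", "R"),
]
QUERY_KEYS = ["E", "R", "D"]
NUM_BITS, NUM_HASHES = 1024, 4  # toy sizes, chosen arbitrarily


def keys_of(record):
    """Hypothetical key extractor: the subject plus any keys held in the value."""
    subject, field, value = record
    if field == "FRIENDS":
        return [subject, value]
    if field == "OTHER_EMAILS":
        return [subject, *value]
    return [subject]


def bloom_positions(key):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_bloom(keys):
    bits = set()  # set of "on" bit positions, standing in for a bit array
    for key in keys:
        bits.update(bloom_positions(key))
    return bits


def job_1(records, query_keys, bloom):
    # Mapper: emit (K, S, R) only when K passes the Bloom filter.
    grouped = defaultdict(list)
    for s, record in enumerate(records, start=1):
        for key in keys_of(record):
            if all(pos in bloom for pos in bloom_positions(key)):
                grouped[key].append((s, record))
    # Reducer: exact inner join, which also discards Bloom false positives.
    exact = set(query_keys)
    return [(s, r) for key, vals in grouped.items() if key in exact for s, r in vals]


def job_2(tagged):
    # Mapper: key each tuple by S. Reducer: pick the first tuple per S and strip S.
    first_by_s = {}
    for s, record in tagged:
        first_by_s.setdefault(s, record)
    return [first_by_s[s] for s in sorted(first_by_s)]


output = job_2(job_1(RECORDS, QUERY_KEYS, build_bloom(QUERY_KEYS)))
```

The result does not depend on which keys happen to be false positives, since the exact join in the first reducer removes them. The filter only reduces how much data reaches that reducer.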
24. 24. Interface
25. 25. Interface <ul><li>Two M/R jobs with filter </li></ul><ul><li>One map-only job (false positives) </li></ul><ul><li>Two M/R jobs (no filter) </li></ul>
26. 26. Optimization <ul><li>For record sets with at most one key per record, don’t need to unique on “S”: </li></ul>
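The reasoning behind this optimization is that a record with a single key can match at most one query key, so the join can never emit it twice and the marker S has nothing to deduplicate. A minimal sketch, assuming a hypothetical `key_of` function that returns the record's only key (or `None` when it has none) and leaving out the Bloom filter pre-pass:

```python
# Records with exactly one key each, taken from the slides' sample data.
SINGLE_KEY_RECORDS = [
    ("A", "AGE", 25),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),  # genuine duplicate, must appear twice in the output
]
QUERY_KEYS = {"E", "R", "D"}


def key_of(record):
    """Hypothetical extractor: the subject is the record's only key."""
    return record[0]


def single_key_query(records, query_keys):
    # One join pass is enough: each record is tested against exactly one key,
    # so it is emitted at most once per occurrence in the input.
    return [record for record in records if key_of(record) in query_keys]


output = single_key_query(SINGLE_KEY_RECORDS, QUERY_KEYS)
```

Both copies of `(D, GENDER, MALE)` are kept, matching the stated behavior that duplication in the input set is not eliminated.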
27. 27. Interface <ul><li>One M/R job with filter </li></ul><ul><li>One map-only job (false positives) </li></ul><ul><li>One M/R job (no filter) </li></ul>
28. 28. Finishing Touches <ul><li>Distribute bloom filters with distributed cache </li></ul><ul><li>Load in bloom filters once per task by using JVM reuse </li></ul>
29. 29. Limitations <ul><li>Bloom filter can only be so big </li></ul><ul><ul><ul><li>We use 250 MB bloom filter </li></ul></ul></ul><ul><ul><ul><li>Works up to order of 100M keys </li></ul></ul></ul>
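The relationship between filter size and key count can be estimated with the standard Bloom filter approximation, where a filter of m bits holding n keys with k hash functions has a false positive rate of roughly (1 - e^(-kn/m))^k. The sketch below applies it to the figures on this slide. It assumes that 250 MB means 250 × 2^20 bytes and that the number of hash functions is chosen near the optimum (m/n) ln 2; the slides state neither, so the result is only a rough indication.

```python
import math


def bloom_false_positive_rate(num_bits, num_keys, num_hashes):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits, num_keys):
    """Nearest integer to (m/n) * ln 2, with a floor of one hash function."""
    return max(1, round(num_bits / num_keys * math.log(2)))


m = 250 * 1024 * 1024 * 8  # bits in a 250 MB filter (assuming MB = 2^20 bytes)

# The "order of 100M keys" case from the slide.
n = 100_000_000
k = optimal_num_hashes(m, n)
rate = bloom_false_positive_rate(m, n, k)

# Ten times as many keys in the same filter, for comparison.
n_big = 1_000_000_000
rate_big = bloom_false_positive_rate(m, n_big, optimal_num_hashes(m, n_big))
```

Under these assumptions the filter has about 21 bits per key at 100M keys and a false positive rate on the order of 10^-5. At a billion keys the same filter has about 2 bits per key and lets through more than a third of non-matching keys, which is consistent with the limit described on the slide.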
30. 30. Source Code <ul><li>Will be releasing source code tomorrow via Rapleaf dev blog </li></ul>
31. 31. Questions?