Batch Querying with Cascading
 

    • Batch Querying with Cascading Nathan Marz Rapleaf
    • Batch Querying
      • A core tool at Rapleaf - we use this everywhere
      • Will be releasing source code tomorrow
    • Motivation
      • We have a list of 1,000,000 e-mails
      • Database size: 10,000,000,000 records
      • We want all the records we have in our database related to those e-mails
    • Motivation
      • Records:
        • ([email_address], AGE, 25)
        • ([email_address], FRIENDS, [email_address])
        • ([email_address], OTHER_EMAILS, [[email_address], [email_address]])
        • ([email_address], GENDER, MALE)
        • ([email_address], GENDER, MALE)
        • ([email_address], FRIENDS, [email_address])
      • Keys:
        • [email_address]
        • [email_address]
        • [email_address]
      • Results:
        • ([email_address], OTHER_EMAILS, [[email_address], [email_address]])
        • ([email_address], GENDER, MALE)
        • ([email_address], GENDER, MALE)
        • ([email_address], FRIENDS, richard@rapleaf.com)
    • Interface
    • Features
      • Variable number of keys per record
      • Many keys (“batch”)
      • Output is a subset of the input - if a record matches multiple keys, emit it only once
      • Does not eliminate duplication from input set
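The features above can be pinned down with a small in-memory reference. This is only a sketch of the intended semantics, not the Cascading implementation: the record layout and the `keys_of` helper are assumptions made for illustration, using the lettered sample records that appear on the later slides.

```python
# Sample records, written as (owner, attribute, value) tuples.
RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),  # a genuine duplicate in the input
    ("E", "FRIENDS", "R"),
]
KEYS = ["E", "R", "D"]


def keys_of(record):
    """Hypothetical key extractor: a record can carry a variable number of keys."""
    owner, attribute, value = record
    if attribute == "FRIENDS":
        return [owner, value]
    if attribute == "OTHER_EMAILS":
        return [owner, *value]
    return [owner]


def batch_query_reference(records, keys):
    """Keep each input record once if any of its keys is in the key set."""
    wanted = set(keys)
    return [r for r in records if any(k in wanted for k in keys_of(r))]


result = batch_query_reference(RECORDS, KEYS)
for record in result:
    print(record)
```

The record (E, FRIENDS, R) matches two keys but is emitted once, while the two identical (D, GENDER, MALE) records both survive because they are separate input records.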
    • Algorithm #1
      • Step 1: For each record R, key K in R: emit(K,R)
      • Input records:
        • (A, AGE, 25)
        • (A, FRIENDS, B)
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
      • Emitted (K, R) pairs:
        • [A, (A, AGE, 25)]
        • [A, (A, FRIENDS, B)]
        • [B, (A, FRIENDS, B)]
        • [C, (C, OTHER_EMAILS, [A, D])]
        • [A, (C, OTHER_EMAILS, [A, D])]
        • [D, (C, OTHER_EMAILS, [A, D])]
        • [D, (D, GENDER, MALE)]
        • [D, (D, GENDER, MALE)]
        • [E, (E, FRIENDS, R)]
        • [R, (E, FRIENDS, R)]
    • Algorithm #1 (cont.)
      • Step 2: Inner join against list of keys:
      • Keys:
        • E
        • R
        • D
      • Input (K, R) pairs:
        • [A, (A, AGE, 25)]
        • [A, (A, FRIENDS, B)]
        • [B, (A, FRIENDS, B)]
        • [C, (C, OTHER_EMAILS, [A, D])]
        • [A, (C, OTHER_EMAILS, [A, D])]
        • [D, (C, OTHER_EMAILS, [A, D])]
        • [D, (D, GENDER, MALE)]
        • [D, (D, GENDER, MALE)]
        • [E, (E, FRIENDS, R)]
        • [R, (E, FRIENDS, R)]
      • Joined output:
        • [D, (C, OTHER_EMAILS, [A, D])]
        • [D, (D, GENDER, MALE)]
        • [D, (D, GENDER, MALE)]
        • [E, (E, FRIENDS, R)]
        • [R, (E, FRIENDS, R)]
    • Algorithm #1 (cont.)
      • Step 3: Strip out key from tuple
      • Input (K, R) pairs:
        • [D, (C, OTHER_EMAILS, [A, D])]
        • [D, (D, GENDER, MALE)]
        • [D, (D, GENDER, MALE)]
        • [E, (E, FRIENDS, R)]
        • [R, (E, FRIENDS, R)]
      • Output records:
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
        • (E, FRIENDS, R)
    • Algorithm #1 (cont.)
      • Input:
        • (A, AGE, 25)
        • (A, FRIENDS, B)
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
      • Output:
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
        • (E, FRIENDS, R)
      • Didn’t work - duplicated a record
      • Cannot just do a distinct, since there might be actual duplicates
      • Need a way to mark (K,R) pairs as originating from the same record
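The failure of Algorithm #1 can be reproduced in a few lines. This is a minimal in-memory sketch rather than a MapReduce job: the inner join is simulated with a set lookup (assuming the key list has no repeated keys), and `keys_of` is a hypothetical key extractor chosen to match the example records.

```python
RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),
    ("E", "FRIENDS", "R"),
]
KEYS = ["E", "R", "D"]


def keys_of(record):
    """Hypothetical key extractor for the sample record layout."""
    owner, attribute, value = record
    if attribute == "FRIENDS":
        return [owner, value]
    if attribute == "OTHER_EMAILS":
        return [owner, *value]
    return [owner]


def algorithm_1(records, keys):
    # Step 1: for each record R and each key K in R, emit (K, R).
    pairs = [(k, r) for r in records for k in keys_of(r)]
    # Step 2: inner join against the key list (simulated with a set).
    wanted = set(keys)
    joined = [(k, r) for k, r in pairs if k in wanted]
    # Step 3: strip the key from each tuple.
    return [r for _k, r in joined]


output = algorithm_1(RECORDS, KEYS)
friends_count = output.count(("E", "FRIENDS", "R"))
distinct_count = len(set(output))
print(len(output), friends_count, distinct_count)
```

The record (E, FRIENDS, R) comes out twice because it joined on both E and R. Taking the distinct set does not fix this: it also merges the two genuine (D, GENDER, MALE) records, leaving three records where four are expected.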
    • Algorithm #2
      • Step 1: For each record R, choose unique string S: For each key K in R: emit(K, S, R)
      • Input records:
        • (A, AGE, 25)
        • (A, FRIENDS, B)
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
      • Emitted (K, S, R) tuples:
        • [A, 1, (A, AGE, 25)]
        • [A, 2, (A, FRIENDS, B)]
        • [B, 2, (A, FRIENDS, B)]
        • [C, 3, (C, OTHER_EMAILS, [A, D])]
        • [A, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
    • Algorithm #2 (cont.)
      • Step 2: Inner join against list of keys:
      • Keys:
        • E
        • R
        • D
      • Input (K, S, R) tuples:
        • [A, 1, (A, AGE, 25)]
        • [A, 2, (A, FRIENDS, B)]
        • [B, 2, (A, FRIENDS, B)]
        • [C, 3, (C, OTHER_EMAILS, [A, D])]
        • [A, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
      • Joined output:
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
    • Algorithm #2 (cont.)
      • Step 3: Unique tuples by S
      • Input:
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
      • Output:
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
    • Algorithm #2 (cont.)
      • Step 4: Strip out all fields besides output
      • Input (the unique tuples from Step 3):
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
      • Output:
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
      • It works!
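The four steps of Algorithm #2 can be sketched in memory as follows. The assumptions are labeled in the comments: `keys_of` is a hypothetical key extractor, the join is simulated with a set, and the unique string S is simply the record's position in the input. In a distributed job S would need to be unique across all mappers; how that is done is not stated in the slides.

```python
RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),
    ("E", "FRIENDS", "R"),
]
KEYS = ["E", "R", "D"]


def keys_of(record):
    """Hypothetical key extractor for the sample record layout."""
    owner, attribute, value = record
    if attribute == "FRIENDS":
        return [owner, value]
    if attribute == "OTHER_EMAILS":
        return [owner, *value]
    return [owner]


def algorithm_2(records, keys):
    # Step 1: tag each record with a unique S (assumption: its 1-based
    # position), then emit (K, S, R) for every key K in the record.
    triples = [
        (k, s, r)
        for s, r in enumerate(records, start=1)
        for k in keys_of(r)
    ]
    # Step 2: inner join against the key list (simulated with a set).
    wanted = set(keys)
    joined = [t for t in triples if t[0] in wanted]
    # Step 3: unique by S, keeping the first tuple seen for each S.
    first_by_s = {}
    for _k, s, r in joined:
        first_by_s.setdefault(s, r)
    # Step 4: strip everything except the record itself.
    return list(first_by_s.values())


output = algorithm_2(RECORDS, KEYS)
for record in output:
    print(record)
```

Because the two (D, GENDER, MALE) records carry different S values (4 and 5), both survive the unique step, while the two join hits for record 6 collapse into one.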
    • Algorithm #2 Analysis
      • Basically a glorified join
      • Slow when the number of keys is much smaller than the number of records
      • Need to funnel all data through a reduce!
    • Speeding it up
      • Bloom filter: “Set” structure where testing for membership may result in false positives (but no false negatives)
      • Add a large bloom filter to filter out as much data as possible in first map phase
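For readers unfamiliar with the structure, here is a minimal Bloom filter sketch. It is not the filter used at Rapleaf: the class name, the bit-array size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


key_filter = BloomFilter(num_bits=1024, num_hashes=3)
for key in ["E", "R", "D"]:
    key_filter.add(key)

# Every added key is reported as present; a non-key is usually, but not
# always, reported as absent.
all_keys_present = all(key in key_filter for key in ["E", "R", "D"])
print(all_keys_present)
```

The one-sided error is what makes the filter safe as a pre-filter: it can let extra tuples through, but it never drops a tuple whose key is in the key list.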
    • Algorithm #3
      • Step 1: Create a bloom filter out of all keys
      • E
      • R
      • D
    • Algorithm #3
      • Step 2: For each record R, choose unique string S: For each key K in R: emit(K, S, R)
      • Input records:
        • (A, AGE, 25)
        • (A, FRIENDS, B)
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
      • Emitted (K, S, R) tuples:
        • [A, 1, (A, AGE, 25)]
        • [A, 2, (A, FRIENDS, B)]
        • [B, 2, (A, FRIENDS, B)]
        • [C, 3, (C, OTHER_EMAILS, [A, D])]
        • [A, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
    • Algorithm #3
      • Step 3: For each (K, S, R), keep tuple only if K passes membership test of bloom filter
      • Input (K, S, R) tuples:
        • [A, 1, (A, AGE, 25)]
        • [A, 2, (A, FRIENDS, B)]
        • [B, 2, (A, FRIENDS, B)]
        • [C, 3, (C, OTHER_EMAILS, [A, D])]
        • [A, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
      • Tuples that pass the filter (B is not a key; it passes as a false positive):
        • [B, 2, (A, FRIENDS, B)]
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
    • Algorithm #3 (cont.)
      • Step 4: Inner join against list of keys:
      • Keys:
        • E
        • R
        • D
      • Input (tuples that passed the filter):
        • [B, 2, (A, FRIENDS, B)]
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
      • Joined output:
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
    • Algorithm #3 (cont.)
      • Step 5: Unique tuples by S
      • Input:
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
        • [R, 6, (E, FRIENDS, R)]
      • Output:
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
    • Algorithm #3 (cont.)
      • Step 6: Strip out all fields besides output
      • Input (the unique tuples from Step 5):
        • [D, 3, (C, OTHER_EMAILS, [A, D])]
        • [D, 4, (D, GENDER, MALE)]
        • [D, 5, (D, GENDER, MALE)]
        • [E, 6, (E, FRIENDS, R)]
      • Output:
        • (C, OTHER_EMAILS, [A, D])
        • (D, GENDER, MALE)
        • (D, GENDER, MALE)
        • (E, FRIENDS, R)
    • Algorithm #3 (cont.)
      • Job #1:
        • Mapper: Bloom filter
        • Reducer: Inner join
      • Job #2:
        • Mapper: Emit key=S, value=tuple
        • Reducer: Pick first tuple and strip fields
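The two-job layout above can be imitated with plain Python functions. This is a sketch under stated assumptions, not the Cascading flow: `job_1`, `job_2`, `keys_of`, and the small integer-backed Bloom filter are hypothetical names and choices made for illustration, and S is again the record's input position.

```python
import hashlib

RECORDS = [
    ("A", "AGE", 25),
    ("A", "FRIENDS", "B"),
    ("C", "OTHER_EMAILS", ("A", "D")),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),
    ("E", "FRIENDS", "R"),
]
KEYS = ["E", "R", "D"]
NUM_BITS = 1 << 16   # assumed filter size
NUM_HASHES = 4       # assumed number of hash functions


def keys_of(record):
    """Hypothetical key extractor for the sample record layout."""
    owner, attribute, value = record
    if attribute == "FRIENDS":
        return [owner, value]
    if attribute == "OTHER_EMAILS":
        return [owner, *value]
    return [owner]


def bit_positions(key):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(keys):
    bits = 0  # a Python int used as a bit set
    for key in keys:
        for pos in bit_positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    return all((bits >> pos) & 1 for pos in bit_positions(key))


def job_1(records, keys):
    bloom = build_filter(keys)
    # Mapper: emit (K, S, R) only when K passes the Bloom filter.
    survivors = [
        (k, s, r)
        for s, r in enumerate(records, start=1)
        for k in keys_of(r)
        if might_contain(bloom, k)
    ]
    # Reducer: the exact inner join discards any false positives.
    wanted = set(keys)
    return [t for t in survivors if t[0] in wanted]


def job_2(joined):
    # Mapper: key each tuple by S. Reducer: pick the first tuple per S
    # and strip every field except the record.
    first_by_s = {}
    for _k, s, r in joined:
        first_by_s.setdefault(s, r)
    return list(first_by_s.values())


output = job_2(job_1(RECORDS, KEYS))
for record in output:
    print(record)
```

The final output does not depend on which non-keys happen to slip through the filter, since the join in the first job's reducer removes them; the filter only reduces how much data reaches that reducer.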
    • Interface
    • Interface
      • Two M/R jobs with filter
      • One map-only job (false positives)
      • Two M/R jobs (no filter)
    • Optimization
      • For record sets with at most one key per record, don’t need to unique on “S”:
    • Interface
      • One M/R job with filter
      • One map-only job (false positives)
      • One M/R job (no filter)
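A short sketch of why the unique step becomes unnecessary: when every record has exactly one key, each record produces exactly one (K, R) pair, so the join can match it at most once. The `key_of` helper and the choice of sample records (only the single-key ones from the example) are assumptions for illustration.

```python
from collections import Counter

# Only records whose single key is the owner field.
RECORDS = [
    ("A", "AGE", 25),
    ("D", "GENDER", "MALE"),
    ("D", "GENDER", "MALE"),
]
KEYS = ["E", "R", "D"]


def key_of(record):
    """Hypothetical extractor: the one and only key is the owner field."""
    return record[0]


def single_key_query(records, keys):
    wanted = set(keys)
    pairs = [(key_of(r), r) for r in records]    # exactly one pair per record
    return [r for k, r in pairs if k in wanted]  # join, then strip the key


output = single_key_query(RECORDS, KEYS)

# No record appears more often in the output than in the input.
input_counts = Counter(RECORDS)
no_extra_copies = all(
    count <= input_counts[r] for r, count in Counter(output).items()
)
print(output, no_extra_copies)
```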
    • Finishing Touches
      • Distribute bloom filters with distributed cache
      • Load in bloom filters once per task by using JVM reuse
    • Limitations
      • Bloom filter can only be so big
          • We use a 250 MB bloom filter
          • Works up to the order of 100M keys
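The 250 MB and 100M figures can be sanity-checked with the standard Bloom filter approximation, p ≈ (1 - e^(-kn/m))^k, for m bits, n keys, and k hash functions. The slides do not say how many hash functions are used, so the sketch below assumes the textbook optimum k ≈ (m/n) ln 2 and decimal megabytes.

```python
import math


def false_positive_rate(num_bits, num_keys, num_hashes):
    """Standard approximation for a Bloom filter's false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


num_bits = 250 * 10**6 * 8   # 250 MB, assuming decimal megabytes
num_keys = 100 * 10**6       # 100M keys

bits_per_key = num_bits / num_keys
best_hashes = round(bits_per_key * math.log(2))  # assumed: optimal k
rate = false_positive_rate(num_bits, num_keys, best_hashes)

# Ten times as many keys in the same filter, for comparison.
many_keys = 10 * num_keys
many_hashes = max(1, round(num_bits / many_keys * math.log(2)))
rate_many = false_positive_rate(num_bits, many_keys, many_hashes)

print(f"bits per key: {bits_per_key:.0f}, hash functions: {best_hashes}")
print(f"false-positive rate at 100M keys: {rate:.2e}")
print(f"false-positive rate at 1B keys:   {rate_many:.2f}")
```

Under these assumptions the filter has 20 bits per key at 100M keys, which gives a false-positive rate well below one in ten thousand. At a billion keys the same filter has only 2 bits per key and lets through roughly 4 in 10 non-matching keys, so it filters out far less data.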
    • Source Code
      • Will be releasing source code tomorrow via Rapleaf dev blog
    • Questions?