Aka examples
1. AKA
identifies unique coupons given different names in
the SnipSnap coupon database using a combination
of k-means clustering and "smoking gun" feature
based rule inference.
GitHub: https://github.com/snipsnap/aka-service/
Email: luke.otterblad@gmail.com
2. Step 1: Matches – same value,
description text and activity dates
3. Matches – pairs are shown, but groups often
contain many more than 2 items
4. More Examples…Different Barcodes –
Same Coupon
The above two were matched into a group. The
coupon below was in the same set of American
Eagle coupons but was NOT put into the same group,
even though it has some similarity.
5. How does it work?
• https://github.com/snipsnap/aka-service
• run via the command line
• $ python aka.py -db_pswd your_password store McDonald’s
id     | face_value        | offer_details                        | start_date | expiriation_date
988767 | Free              | With the purchase of an Egg McMuffin | 2013-09-03 | 2013-10-31
989829 | FREE Egg McMuffin | with the purchase of an Egg McMuffin | 2013-09-03 | 2013-10-31
997447 | Free Egg McMuffin | with the purchase of an Egg McMuffin | 2013-09-03 | 2013-10-31
6. Active Coupons for a Store as a Graph
• When the aka-service is started for a particular store, each active coupon is converted to
dictionary format. The features based on its face value and details are normalized with some
language processing and stored in the Python version of a graph (nested dictionaries).
• Item -> Features
{"CouponA": ["free", "with", "the", "purchase", "of", "an", "egg", "mcmuffin"],
 "CouponB": ["free", "with", "the", "purchase", "of", "an", "egg", "mcmuffin"]}
• Features -> Item
{"egg": ["CouponA", "CouponB"],
 "mcmuffin": ["CouponA", "CouponB"],
 "free": ["CouponA", "CouponB"],
 "with": ["CouponA", "CouponB"],
 "the": ["CouponA", "CouponB"],
 "purchase": ["CouponA", "CouponB"],
 "of": ["CouponA", "CouponB"],
 "an": ["CouponA", "CouponB"]}
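The two mappings above can be sketched in a few lines of Python. This is only an illustration, not the aka-service code: the function names `tokenize` and `build_graphs` are hypothetical, and the "language processing" is assumed here to be nothing more than lowercasing and splitting on non-alphanumeric characters.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumed normalization: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_graphs(coupons):
    """Return (item -> features, feature -> items) for {coupon_id: text}."""
    item_to_features = {}
    feature_to_items = defaultdict(list)
    for coupon_id, text in coupons.items():
        features = []
        for token in tokenize(text):
            if token not in features:  # keep each feature once per coupon
                features.append(token)
        item_to_features[coupon_id] = features
        for feature in features:
            feature_to_items[feature].append(coupon_id)
    return item_to_features, dict(feature_to_items)

# Face value and offer details joined into one string per coupon.
coupons = {
    "CouponA": "Free With the purchase of an Egg McMuffin",
    "CouponB": "FREE Egg McMuffin with the purchase of an Egg McMuffin",
}
item_to_features, feature_to_items = build_graphs(coupons)
```

Looking up a feature such as `feature_to_items["egg"]` gives every coupon that shares it, which is what makes candidate matches cheap to find.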
7. Despite different text AKA identifies all
of these as the same item
id     | face_value        | offer_details                        | start_date | expiriation_date | aka_guid
988767 | Free              | With the purchase of an Egg McMuffin | 2013-09-03 | 2013-10-31       | de5086f035bc-11e38da3005056c00008
989829 | FREE Egg McMuffin | with the purchase of an Egg McMuffin | 2013-09-03 | 2013-10-31       | de5086f035bc-11e38da3005056c00008
997447 | Free Egg McMuffin | with the purchase of an Egg McMuffin | 2013-09-03 | 2013-10-31       | de5086f035bc-11e38da3005056c00008
8. Free is treated as a value keyword
(along with % and $ descriptions)
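As a rough illustration of what a "value keyword" could look like in code, the sketch below pulls "free", dollar amounts and percentages out of a face value string. The regular expression and the name `value_keywords` are assumptions made for this example; the actual rules inside AKA are not shown in these slides.

```python
import re

# Assumed patterns: the word "free", a dollar amount, or a percentage.
VALUE_PATTERN = re.compile(
    r"\bfree\b|\$\s*\d+(?:\.\d+)?|\d+(?:\.\d+)?\s*%",
    re.IGNORECASE,
)

def value_keywords(face_value):
    """Return the value-like tokens found in a face value string."""
    return [m.group(0).lower().replace(" ", "")
            for m in VALUE_PATTERN.finditer(face_value)]

free_kw = value_keywords("FREE Egg McMuffin")
dollar_kw = value_keywords("$5.00 Off")
percent_kw = value_keywords("save 20% on your entire purchase")
```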
9. But words and value
alone don’t create the match. Expiry
date also matters
10. Coupon with No Barcode connected to
the same offer with a barcode
Same offer value (free mini candle) and same date range (September 9 to October 6, 2013)
13. Smoking Gun Features
• A smoking gun feature for a coupon is a piece of
information that identifies it as being the same
real-world item as another coupon (with near certainty).
• There are two sources of such identification in the
database. The first is a barcode_id. Multiple coupons
that have the same barcode_id are indeed the same
physical coupon. The second is a promo_code.
• Two coupons that have the same promo_code are the
same coupon 95%+ of the time. (Some stores like
Dunkin Donuts don’t use unique codes…but more on
that later)
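One way to picture smoking gun matching is as a union of coupons that share a barcode_id or a promo_code. The sketch below uses a small union-find for that. It is an assumption about how such grouping could be done, not the AKA implementation; the function name `group_by_smoking_gun` and the sample values are made up for illustration.

```python
from collections import defaultdict

SMOKING_GUN_FIELDS = ("barcode_id", "promo_code")

def group_by_smoking_gun(coupons):
    """Group coupon ids that share any non-empty smoking gun value."""
    parent = {c["id"]: c["id"] for c in coupons}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (field, value) -> first coupon id carrying it
    for c in coupons:
        for field in SMOKING_GUN_FIELDS:
            value = c.get(field)
            if not value:  # skip missing, empty or 0 values
                continue
            key = (field, value)
            if key in first_seen:
                parent[find(c["id"])] = find(first_seen[key])
            else:
                first_seen[key] = c["id"]

    groups = defaultdict(list)
    for c in coupons:
        groups[find(c["id"])].append(c["id"])
    return sorted(groups.values())

# Hypothetical sample rows.
sample = [
    {"id": 1, "barcode_id": "9152", "promo_code": None},
    {"id": 2, "barcode_id": "9152", "promo_code": "FALL20"},
    {"id": 3, "barcode_id": None, "promo_code": "FALL20"},
    {"id": 4, "barcode_id": "9992", "promo_code": None},
]
smoking_gun_groups = group_by_smoking_gun(sample)
```

Coupons 1 and 3 share nothing directly, yet end up together through coupon 2, which is why a single wrongly recorded barcode can pull unrelated coupons into one group.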
14. More Matches
The two coupons above are matched with each other, but
NOT with the coupon below, despite its
extremely similar description and
validity:
The code in the upper right-hand corner (9152 versus 9992, the smoking gun)
helps significantly in separating them into different identifications.
15. Two coupons Not matched, even
though they have the same description
and similar text
(they are valid at different
times)
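The validity check can be sketched as a simple comparison of the date columns. This assumes the dates must be exactly equal, which is a simplification; the names `same_validity` and `may_match` and the second coupon's dates are hypothetical.

```python
def same_validity(a, b):
    """True when both coupons share the same start and expiry dates."""
    return (a["start_date"] == b["start_date"]
            and a["expiriation_date"] == b["expiriation_date"])

def may_match(a, b):
    # The same description is not enough: the validity window must agree too.
    same_text = a["offer_details"].lower() == b["offer_details"].lower()
    return same_text and same_validity(a, b)

september = {"offer_details": "With the purchase of an Egg McMuffin",
             "start_date": "2013-09-03", "expiriation_date": "2013-10-31"}
# Hypothetical later run of the same offer text.
november = {"offer_details": "with the purchase of an Egg McMuffin",
            "start_date": "2013-11-01", "expiriation_date": "2013-12-31"}
different_window = may_match(september, november)
```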
16. Finding smoother images
I experimented with using the number of recorded features as an indicator of
picture quality – but that didn’t have much correlation. What did work was
using the picture with the highest number of redemptions within an aka group
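Picking the picture with the most redemptions per aka group is a one-pass maximum. In this sketch the field names `redemptions` and `image_url` are assumptions, since the slides do not name the columns involved.

```python
def best_image_per_group(coupons):
    """Map each aka_guid to the image of its most redeemed coupon."""
    best = {}
    for c in coupons:
        current = best.get(c["aka_guid"])
        if current is None or c["redemptions"] > current["redemptions"]:
            best[c["aka_guid"]] = c
    return {guid: c["image_url"] for guid, c in best.items()}

# Hypothetical rows for two groups.
rows = [
    {"aka_guid": "g1", "redemptions": 3, "image_url": "a.jpg"},
    {"aka_guid": "g1", "redemptions": 12, "image_url": "b.jpg"},
    {"aka_guid": "g2", "redemptions": 1, "image_url": "c.jpg"},
]
best_images = best_image_per_group(rows)
```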
18. The Dollar Store $1 Off coupon
problem – likely to be many of those
These four were originally matched, so I had to introduce the notion of a confidence
percentage.
The false match happens largely because AKA weights the value of an item more heavily
than the detail words describing the offer (most stores have few items at the
same price).
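To make the idea of a confidence percentage concrete, here is a minimal sketch in which a match score mixes a value match with word overlap in the details, and a threshold decides the match. The 0.6 weight, the Jaccard overlap and the function names are all assumptions chosen for illustration; the slides do not give AKA's actual formula.

```python
VALUE_WEIGHT = 0.6  # assumed: value counts more than the detail words

def jaccard(a_words, b_words):
    """Share of words the two sets have in common."""
    a, b = set(a_words), set(b_words)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def match_score(a, b):
    value_part = 1.0 if a["value"] == b["value"] else 0.0
    detail_part = jaccard(a["details"].lower().split(), b["details"].lower().split())
    return VALUE_WEIGHT * value_part + (1 - VALUE_WEIGHT) * detail_part

def is_match(a, b, confidence):
    return match_score(a, b) >= confidence

# Two hypothetical "$1 off" coupons for different products.
candy = {"value": "$1", "details": "off any candy bar"}
card = {"value": "$1", "details": "off any greeting card"}
loose = is_match(candy, card, confidence=0.70)
strict = is_match(candy, card, confidence=0.99)
```

With a low threshold the shared "$1" value is enough to merge the two coupons; raising the confidence makes the detail words decisive.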
20. Trouble Spots: AKA identifies the same offer due to
an assumed smoking gun, but while the barcode is
the same, the expiry is different.
Ignoring PLU for Dunkin Donuts (and other publishers that duplicate promocodes)
and going with 99% confidence does the trick.
21. There Are Exceptions to Every Rule
• Coupons are no different
• In the settings.yaml (pictured above) you can define
exceptions to global rules.
• What pop_smoking_gun tells AKA is that for Dunkin’
Donuts the global promo_code and barcode_id rules
do not apply: Dunkin’ Donuts does not create PLU
codes that are unique to an offer.
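A per-store exception could be applied roughly as follows. Only the key name pop_smoking_gun comes from the slide; the layout of the settings (shown here as an already-parsed dictionary) and the function name `smoking_guns_for` are hypothetical, since the real settings.yaml is not reproduced in this text.

```python
GLOBAL_SMOKING_GUNS = ["barcode_id", "promo_code"]

# Hypothetical stand-in for the parsed settings.yaml.
SETTINGS = {
    "exceptions": {
        "Dunkin' Donuts": {"pop_smoking_gun": ["promo_code", "barcode_id"]},
    }
}

def smoking_guns_for(store, settings=SETTINGS):
    """Global smoking gun fields minus those popped for this store."""
    store_rules = settings.get("exceptions", {}).get(store, {})
    popped = store_rules.get("pop_smoking_gun", [])
    return [field for field in GLOBAL_SMOKING_GUNS if field not in popped]

dunkin_fields = smoking_guns_for("Dunkin' Donuts")
macys_fields = smoking_guns_for("Macy's")
```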
22. Another example
Ignoring PLU for Dunkin Donuts (and other publishers that duplicate promocodes)
and going with 99% confidence does the trick.
23. But knowing the store “rules” also helps
correct errors (if they stick to unique codes)
Mechanical Turk expiry: 10/17/2012
Mechanical Turk expiry: 10/7/2012
http://c346897.r97.cf1.rackcdn.com/cd0faf92-f85e-11e2-9f6640406c9e1e47.jpg
http://c346897.r97.cf1.rackcdn.com/d32b578efd2a-11e2-9be6-40406c9e1e47.jpg
Since Bed Bath & Beyond IDs and promocodes
indicate the same item, AKA can reconcile the mistake
24. AKA- never misinterpret a store's
coupon rules again
ids    | sharable | descrption_text                                          | Aka_guid
987120 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
987271 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
988484 | 1        | save 20% on your entire purchase bath body works f139439 | 75926f4f-328f-11e3-a3cd005056c00008
989519 | 1        | save 20% on your entire purchase bath body works 9522    | 75926f4f-328f-11e3-a3cd005056c00008
989774 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
990040 | 0        | save 20% on your entire purchase bath body works f139492 | 75926f4f-328f-11e3-a3cd005056c00008
990943 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
992970 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
992998 | 0        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
994314 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008
Ten coupons are all identified as the same item, with some marked sharable and some not.
Suppose a publisher had submitted coupon 990040 as not sharable…
25. AKA- never misinterpret a store's coupon
rules again
ids    | sharable | descrption_text                                          | Aka_guid                            | aka_sharable
987120 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
987271 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
988484 | 1        | save 20% on your entire purchase bath body works f139439 | 75926f4f-328f-11e3-a3cd005056c00008 | 0
989519 | 1        | save 20% on your entire purchase bath body works 9522    | 75926f4f-328f-11e3-a3cd005056c00008 | 0
989774 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
990040 | 0        | save 20% on your entire purchase bath body works f139492 | 75926f4f-328f-11e3-a3cd005056c00008 | 0
990943 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
992970 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
992998 | 0        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
994314 | 1        | save 20% on your entire purchase bath body works         | 75926f4f-328f-11e3-a3cd005056c00008 | 0
An easy feature could be to treat a single "not sharable" within an aka group as a
“presidential” veto and switch all coupons in the group to not sharable. This can also
work for items tagged as manufacturer coupons. You’d basically only need one tag from
Mechanical Turk (or from a classifier).
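The veto rule described above is short enough to sketch directly. The function name `apply_sharable_veto` is hypothetical; the column names follow the table on this slide, and the guid values are shortened placeholders.

```python
from collections import defaultdict

def apply_sharable_veto(coupons):
    """Add aka_sharable: 0 for the whole group if any member has sharable == 0."""
    group_sharable = defaultdict(lambda: 1)  # groups start out sharable
    for c in coupons:
        if c["sharable"] == 0:
            group_sharable[c["aka_guid"]] = 0  # one veto flips the group
    return [dict(c, aka_sharable=group_sharable[c["aka_guid"]]) for c in coupons]

# Three ids from the table plus one hypothetical coupon in another group.
rows = [
    {"ids": 987120, "sharable": 1, "aka_guid": "group-a"},
    {"ids": 990040, "sharable": 0, "aka_guid": "group-a"},
    {"ids": 994314, "sharable": 1, "aka_guid": "group-a"},
    {"ids": 111111, "sharable": 1, "aka_guid": "group-b"},
]
vetoed = apply_sharable_veto(rows)
```

The same pattern would cover a manufacturer-coupon tag: swap the field being checked and let a single positive tag mark the whole group.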
27. Kroger’s matches
Kroger’s requires the highest confidence of any store, as many of their coupons
differ by only a single word. These will match (incorrectly) unless a
high confidence is set. Listed below is a sample false match made by AKA:
28. Same item in the database twice for
Macy’s
http://c346897.r97.cf1.rackcdn.com/59667340-1588-11e3-a8e340406c9e1e47-thumb.jpg
http://c346897.r97.cf1.rackcdn.com/ac4dc266-1588-11e3-a7d040406c9e1e47-thumb.jpg
36. Occasional data entry errors can lead
to bad reconciliation
aka_guid                           | id     | barcode_id | alt_barcode_id | face_value      | offer_details
2719bf74-40b611e3-86dd22000a91806d | 421909 | 138859     | 0              | $5.00 Off $25.0 | Save $5.00 On Your Purchase Of $25.00 Or More
2719bf74-40b611e3-86dd22000a91806d | 539197 | 46299      | 0              | Save $1.00      | On Any Aveeno Product
2719bf74-40b611e3-86dd22000a91806d | 560927 | 138859     | 0              | Save $1.00      | On any
2719bf74-40b611e3-86dd22000a91806d | 595323 | 138859     | 0              | 20% Off         | 1 Regular Priced Item
Here the 99% reliable barcode_id is identified with 3 different items (for Toys R Us)
37. These three items were matched via barcode, which I can only assume is some
type of data entry error. For every other Toys R Us coupon
the smoking gun rules are valid; these items' barcodes are simply recorded incorrectly.
39. Background for entity resolution (aka
collective reconciliation, de-duping)
• Chapter 20 of Beautiful Data “Connecting Data” by Toby
Segaran (who I think likely wrote the chapter while
working on the YouTube reconciliation).
• Indrajit Bhattacharya’s PhD dissertation, which you can find
at: http://www.lib.umd.edu/drum/handle/1903/4241
• About me: Father of 2 lovely daughters with my wife
Emma. Programmer, Statistician, Pot Limit Omaha and
Mixed Game poker semi-professional (though I don’t get
much time for poker nowadays). I'm located in historic
Northfield, MN where I share an office with my Jack Russell
Terrier, Kirby.
• Email: luke.otterblad@gmail.com.