Consumer Analytics in Real Time:
How InfoScout Tracks Purchase Behavior with Mechanical Turk
Jon Brelig, CTO, InfoScout
Sh...
Overview

– Receipt workflow
– Quality control
– Analytics
Wish I knew who that shopper was!
Helping brands answer…
• Who’s buying my product?
• Who’s the end consumer?
• Why did they buy?
• When and where?
• H...
How do we build
a better panel?
Capture receipts through mobile
Our mobile apps
Receipt Hog

Put $ in your pocket!

Shoparoo

Fundraise for a cause!
Architecture

target.com
target.com

Masterdata
MySQL

GAT G2 LMN LIME = UPC 052000209648

1. Capture Receipt

2. Convert ...
Digitizing Receipts
Task is to convert image(s) of receipts => structured data
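
As a rough illustration of what "structured data" could mean here, the sketch below models one receipt as a summary record plus a list of line items. All class and field names are assumptions made for this example, and the price is a placeholder; only the retailer label, the item text, and the UPC come from the architecture slide.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    raw_text: str               # item text as printed on the receipt
    price: Decimal              # placeholder value in the example below
    upc: Optional[str] = None   # filled in once the item is linked to masterdata


@dataclass
class Receipt:
    retailer: str               # summary-level field
    total: Decimal              # summary-level field
    items: List[LineItem] = field(default_factory=list)

    def items_total(self) -> Decimal:
        """Sum of itemized prices, usable as a sanity check against the total."""
        return sum((item.price for item in self.items), Decimal("0"))


# Hypothetical receipt; the amount is illustrative, not taken from the talk.
receipt = Receipt(
    retailer="target.com",
    total=Decimal("1.00"),
    items=[LineItem("GAT G2 LMN LIME", Decimal("1.00"), upc="052000209648")],
)
```

Splitting summary fields from line items mirrors the two extraction stages named in the workflow (Summary Extraction and Itemized Extraction).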
Amazon Mechanical Turk
Transcribing Receipts
• Isn’t OCR good enough?

Auto Extract
OpenCV, OCR, Regex

– Leverage OCR & computer vision, fill ga...
Summary Transcription

Summary Extraction
Mechanical Turk

Itemized Extraction
Mechanical Turk

Score & Audit
Staff / Mech...
Summary Transcription
[Chart: Receipts by Month; y-axis from 0 to 1,200,000]

How do we scale quality con...
Known Answers
• Publish HIT with at least one
known answer to audit Worker
accuracy
• Additional support provided by
Amazo...
Known Answers
[Chart: Net Cost per Receipt; y-axis in dollars, topping out at $0.0300; annotations: "Developed more efficient review process", "Transitioned to Known Answers"]
Itemized Extraction

Summary Extraction
Mechanical Turk

Itemized Extraction
Mechanical Turk

Score & Audit
Staff / Mechan...
Itemized Extraction
• Transcribe every item on receipt
• HITs audited by review team, priority scored by:
– Compa...
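
The full slide lists the audit-priority signals as agreement with the OCR extraction, agreement with master data, Worker approval history, and Worker tenure. A minimal sketch of combining such signals into one review-priority score follows. The weights, the tenure cap, and every name in it are hypothetical; the talk does not say how InfoScout actually computes the score.

```python
from dataclasses import dataclass


@dataclass
class HitFeatures:
    ocr_agreement: float      # 0..1, share of fields matching the OCR extraction
    masterdata_match: float   # 0..1, share of prices/UPCs consistent with master data
    approval_rate: float      # 0..1, Worker's historical approval rate
    tenure_hits: int          # number of earlier HITs this Worker completed


# Hypothetical weights, chosen only so that they sum to 1.
W_OCR, W_MASTER, W_APPROVAL, W_TENURE = 0.4, 0.3, 0.2, 0.1


def review_priority(f: HitFeatures, tenure_cap: int = 1000) -> float:
    """Return a score in [0, 1]; higher means review this HIT sooner."""
    tenure = min(f.tenure_hits, tenure_cap) / tenure_cap
    trust = (
        W_OCR * f.ocr_agreement
        + W_MASTER * f.masterdata_match
        + W_APPROVAL * f.approval_rate
        + W_TENURE * tenure
    )
    return min(1.0, max(0.0, 1.0 - trust))


risky = HitFeatures(ocr_agreement=0.2, masterdata_match=0.5, approval_rate=0.6, tenure_hits=10)
safe = HitFeatures(ocr_agreement=1.0, masterdata_match=1.0, approval_rate=0.99, tenure_hits=5000)
queue = sorted([safe, risky], key=review_priority, reverse=True)  # riskiest first
```

A review team with fixed capacity could then work such a queue from the top down.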
Plurality

Publish HIT

• HIT completed by >1 Worker
– InfoScout only sends HITs with low
confidence to multiple Workers
W...
HIT Acceptance Latency
[Chart: Minutes to Accept (y-axis, 0 to 700) by date (x-axis, starting 12/22/12); annotation: "Changed Template"]
[Chart: Total HITs Completed (axis up to 700,000) with a percentage axis (up to 100%)]
Pareto of Worker Volume
[Chart: % of all HITs completed (y-axis, 0% to 90%) by Worker percentile bucket: Top 5%, 6-10%, 10-20%, 21-50%, 51-...]
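
A Pareto figure such as "the top 5% of Workers account for more than 80% of HITs" can be derived from per-Worker HIT counts. The sketch below shows one way to compute that share. The function name and the sample counts are invented for illustration and are not InfoScout data.

```python
from typing import Dict


def top_share(hits_by_worker: Dict[str, int], top_fraction: float) -> float:
    """Share of all HITs completed by the most active `top_fraction` of Workers."""
    counts = sorted(hits_by_worker.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one Worker in the "top" group.
    k = max(1, int(len(counts) * top_fraction))
    return sum(counts[:k]) / total


# Made-up distribution: one very active Worker and nineteen occasional ones.
sample = {"w0": 900}
sample.update({f"w{i}": 5 for i in range(1, 20)})

share_top_5_percent = top_share(sample, 0.05)
```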
Analytics Demo
Please give us your feedback on this
presentation

BDT206
As a thank you, we will select prize
winners daily for completed...
Appendix
Quality Control Strategies
• Filter incoming Workers
– Qualifications
– Template validation
– Template instructions

Enhan...
HIT templates
• Clear & concise instructions
– 1st time each Worker sees detailed
instructions, has ability to hide once
t...
Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013

Understanding the factors that drive consumer purchase behavior makes brands better marketers. In this session, join the Vice President of Mechanical Turk to explore how retail businesses are marrying human judgment with large-scale data analytics without sacrificing efficiency or scalability. We’ll highlight real-world examples and introduce Jon Brelig, CTO of InfoScout, to explore how his company is leveraging a combination of automated methods and Mechanical Turk to build out a real-world analytics solution relied upon by brands such as P&G, Unilever, and General Mills. By extracting item-level purchase data from more than 40,000 consumer receipt images each day and associating it with specific products, brands, user surveys, and other digital marketing signals, InfoScout is able to rapidly gauge changes in consumer behavior and market share with remarkable granularity.

Published in: Technology, Business

  1. Consumer Analytics in Real Time: How InfoScout Tracks Purchase Behavior with Mechanical Turk
     Jon Brelig, CTO, InfoScout
     Sharon Chiarella, Vice President, Amazon Mechanical Turk
     November 13, 2013
     © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Overview
     – Receipt workflow
     – Quality control
     – Analytics
  3. Wish I knew who that shopper was!
  4. Helping brands answer…
     • Who’s buying my product?
     • Who’s the end consumer?
     • Why did they buy?
     • When and where?
     • How many?
     • At what price?
     • With what else?
     Who’s the shopper? What’s their motive?
  5. How do we build a better panel?
     Capture receipts through mobile
  6. Our mobile apps
     Receipt Hog: Put $ in your pocket!
     Shoparoo: Fundraise for a cause!
  7. Architecture
     1. Capture Receipt
     2. Convert to structured data: Computer vision + OCR + MTurk
     3. Link to masterdata: Scraping + classification models + human training
     4. Data warehouse & prematerialize: MySQL, Amazon Redshift, Hadoop (Amazon EMR)
     5. Build cool stuff on top of it! Analytics, data firehose, hacks, etc.
     Diagram labels: target.com; Masterdata (MySQL); Tlog (Redshift); GAT G2 LMN LIME = UPC 052000209648
  8. Digitizing Receipts
     Task is to convert image(s) of receipts => structured data
  9. Amazon Mechanical Turk
  10. Transcribing Receipts
     • Isn’t OCR good enough?
       – It is a solved problem… for books
       – Low recognition on wrinkled receipts from mobile
     • Hybrid of computer + human
       – Leverage OCR & computer vision, fill gaps with humans
     • Human = MTurk + small audit staff
       – We leverage a 6-person team to act as the top audit layer of the system
     Workflow diagram: Auto Extract (OpenCV, OCR, Regex); Summary Extraction (Mechanical Turk); Itemized Extraction (Mechanical Turk); Score & Audit (Staff / Mechanical Turk); Complete. Other labels: "Can we skip?"; "User marks or staff rejects HIT".
  11. Summary Transcription
     Workflow diagram: Auto Extract (OpenCV, OCR, Regex); Summary Extraction (Mechanical Turk); Itemized Extraction (Mechanical Turk); Score & Audit (Staff / Mechanical Turk); Complete. Other labels: "Can we skip?"; "User marks or staff rejects HIT".
  12. Summary Transcription
     [Chart: Receipts by Month; y-axis from 0 to 1,200,000]
     How do we scale quality control with growing volume?
  13. Known Answers
     • Publish HIT with at least one known answer to audit Worker accuracy
     • Additional support provided by Amazon API
     • Most effective when there is a concrete, expected answer
       – i.e. multiple choice answers
     Figure label: Known Answer
  14. Known Answers
     [Chart: Net Cost per Receipt; y-axis from $- to $0.0300; series: InfoScout Review Cost, MTurk Cost; annotations: "Developed more efficient review process", "Transitioned to Known Answers"]
     Known Answers lowered our net cost per receipt from 2 cents to 1 cent
  15. Itemized Extraction
     Workflow diagram: Auto Extract (OpenCV, OCR, Regex); Summary Extraction (Mechanical Turk); Itemized Extraction (Mechanical Turk); Score & Audit (Staff / Mechanical Turk); Complete. Other labels: "Can we skip?"; "User marks or staff rejects HIT".
  16. Itemized Extraction
     • Transcribe every item on receipt
     • HITs audited by review team, priority scored by:
       – Comparing output to known OCR extraction
       – Comparison to master data (i.e. did they “fat finger” a price or UPC?)
       – Worker approval history
       – Worker tenure (for InfoScout HITs)
       – Additional features
     • Not a great candidate for Known Answers…
     How do we scale quality control for itemized extraction?
  17. Plurality
     • HIT completed by >1 Worker
       – InfoScout only sends HITs with low confidence to multiple Workers
     • Higher quality, higher cost
       – Limit costs by scientifically selecting HITs to send to a second Worker
     • Multiple strategies when an answer discrepancy is found
       – Ask a third Worker
       – Leverage internal auditors
     Flow diagram labels: Publish HIT; Worker 1 Submits; Worker 2 Submits; Match?; YES; Accept
  18. HIT Acceptance Latency
     [Chart: Minutes to Accept (y-axis, 0 to 700) by date (x-axis, 12/22/12 to 6/22/13); annotation: "Changed Template"]
     • Measures HIT demand
     • Template change decreased demand temporarily, but Workers acclimated
  19. Worker Retention
     [Chart: Total HITs Completed (left axis, 0 to 700,000) and % Completed by retained Workers (right axis, 0% to 100%); series: HITs Complete (New Workers), HITs Complete (Retained Workers)]
     Within two months, 80% of HITs were completed by returning Workers
  20. Pareto of Worker Volume
     [Chart: % of all HITs completed (y-axis, 0% to 90%) by Worker Percentile (Top 5%, 6-10%, 10-20%, 21-50%, 51-100%)]
     Our top 5% (~500) active Workers account for >80% of all HITs completed
  21. Analytics Demo
  22. Please give us your feedback on this presentation (BDT206). As a thank you, we will select prize winners daily for completed surveys!
  23. Appendix
  24. Quality Control Strategies
     • Filter incoming Workers
       – Qualifications
       – Template validation
       – Template instructions
     • Increase quality during completion
     • Post submission
       – Plurality (multiple HITs per task)
       – Known Answers
       – Workers audit Workers
     Diagram labels: Enhance; HIT; Approve/Reject?
     Multiple strategies can yield high accuracy
  25. HIT templates
     • Clear & concise instructions
       – 1st time each Worker sees detailed instructions, has ability to hide once they’re comfortable
     • Keyboard shortcuts
     • Maximize Validation
       – Client-side and/or AJAX validation
     • Bonus Rewards
       – Nice option for rewarding Workers, especially when HITs are variable in length & time
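
The Known Answers idea from slide 13 can be sketched as a small grading function: a HIT carries at least one question whose answer is already known, and the Worker's submission is scored only against those questions. This is a minimal sketch under stated assumptions. The function and field names are hypothetical, answers are compared after simple whitespace and case normalization, and no Mechanical Turk API call is shown.

```python
from typing import Dict, Optional


def _normalize(answer: str) -> str:
    """Collapse whitespace and ignore case before comparing answers."""
    return " ".join(answer.split()).casefold()


def known_answer_accuracy(
    submitted: Dict[str, str], known: Dict[str, str]
) -> Optional[float]:
    """Fraction of known-answer questions the Worker got right.

    Returns None when the HIT carried no known answers.
    A known-answer question left unanswered counts as wrong.
    """
    if not known:
        return None
    correct = sum(
        1
        for question, expected in known.items()
        if question in submitted
        and _normalize(submitted[question]) == _normalize(expected)
    )
    return correct / len(known)


# Hypothetical HIT: two fields have known answers, "date" does not.
known = {"store": "target", "total": "12.34"}
submitted = {"store": " Target ", "total": "12.43", "date": "2013-11-13"}
accuracy = known_answer_accuracy(submitted, known)
```

Exact-match grading like this fits the slide's remark that Known Answers work best when there is a concrete, expected answer, such as a multiple choice field.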

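The Plurality flow from slide 17 can be sketched the same way: low-confidence HITs go to a second Worker, matching answers are accepted, and a discrepancy is settled by a third Worker or handed to internal auditors. The confidence threshold, the names, and the exact-match comparison are assumptions made for illustration, not InfoScout's actual rules.

```python
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    ACCEPT = "accept"
    ESCALATE = "escalate"   # hand over to internal auditors


def needs_second_worker(confidence: float, threshold: float = 0.8) -> bool:
    """Only low-confidence HITs are sent to a second Worker (threshold is hypothetical)."""
    return confidence < threshold


def _normalize(answer: str) -> str:
    return " ".join(answer.split()).casefold()


def resolve(
    answer_1: str, answer_2: str, answer_3: Optional[str] = None
) -> Tuple[Decision, Optional[str]]:
    """Accept when two Workers agree; otherwise let a third Worker break the tie."""
    a1, a2 = _normalize(answer_1), _normalize(answer_2)
    if a1 == a2:
        return Decision.ACCEPT, a1
    if answer_3 is not None:
        a3 = _normalize(answer_3)
        if a3 in (a1, a2):
            return Decision.ACCEPT, a3
    return Decision.ESCALATE, None
```

For example, `resolve("12.34", "12.43")` escalates, while passing a third answer of `"12.43"` accepts that value.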