2. PROBLEM STATEMENT:
eBay maintains hundreds of millions of accounts* across our properties
and partners. The account data is sometimes unstructured, arrives in
different formats and character sets, and changes over time. Identifying
which accounts belong to the same person enables us to personalize each
customer's experience, deliver great customer service, and fight fraud.
3. PROBLEM SOLVED!:
Identifying which accounts belong to the same person is hard under
normal circumstances. Doing so daily at this scale is a feat that defies
superlatives.
MapReduce gives us a robust design pattern that breaks entity resolution
down into a series of parallelized unit operations.
7. EDGES: Linking accounts pairwise
• Multiple strategies for blocking and matching accounts
• Each strategy writes to its own ‘bucket’
• Each strategy is a configuration-driven MR with Mappers that can:
– Read simultaneously from multiple file types (text, sequence, Avro, ORC) and
layouts (fixed, delimited, json, CSV)
– Extract, transform, normalize, and combine fields
– Burst records to create multiple key-value pairs
and Reducers that can:
– Embed a rules-based matching engine that is configured on load
– Embed, build, and search Lucene indexes
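
The blocking-and-matching pattern above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the production job: the field names (`id`, `email`, `phone`, `last_name`), the two blocking keys, and the last-name rule in `matches` are all assumptions made for the example, and the in-memory `run` function stands in for Hadoop's shuffle.

```python
from collections import defaultdict
from itertools import combinations


def normalize_name(value):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def mapper(account):
    """Burst one account into several (blocking_key, account) pairs."""
    email = account["email"].strip().lower()
    phone = "".join(ch for ch in account["phone"] if ch.isdigit())
    if email:
        yield ("email:" + email, account)
    if phone:
        yield ("phone:" + phone, account)


def matches(a, b):
    """Hypothetical matching rule: same normalized last name."""
    return normalize_name(a["last_name"]) == normalize_name(b["last_name"])


def reducer(key, accounts):
    """Compare every pair inside one block and emit an edge per match."""
    ordered = sorted(accounts, key=lambda r: r["id"])
    for a, b in combinations(ordered, 2):
        if matches(a, b):
            yield (a["id"], b["id"], key)


def run(accounts):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    blocks = defaultdict(list)
    for account in accounts:
        for key, value in mapper(account):
            blocks[key].append(value)
    edges = []
    for key in sorted(blocks):
        edges.extend(reducer(key, blocks[key]))
    return edges
```

Blocking keeps the pairwise comparison inside each reducer group, so only accounts that already share a key are ever compared. Each edge carries the key that produced it, which loosely mirrors the idea of every strategy writing to its own bucket.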
High-Scale Entity Resolution in Hadoop 7
8. GRAPH: Identify and validate connected components
• Iterative MR for identifying connected components
• Connected components are validated for integrity and over-grouping, and
partitioned (connected components are relatively small)
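
One common way to find connected components iteratively is minimum-label propagation, sketched below in plain Python. The slides do not say which algorithm is used, so treat this as an assumed stand-in: each call to `propagate_once` plays the role of one MR pass, and the loop stops when a pass changes nothing. The size threshold in `split_by_size` is likewise a hypothetical example of an over-grouping check.

```python
from collections import defaultdict


def propagate_once(labels, edges):
    """One pass: both ends of each edge adopt the smaller of their labels."""
    new_labels = dict(labels)
    for a, b in edges:
        smallest = min(labels[a], labels[b])
        new_labels[a] = min(new_labels[a], smallest)
        new_labels[b] = min(new_labels[b], smallest)
    return new_labels


def connected_components(nodes, edges):
    """Repeat passes until no label changes, then group nodes by label."""
    labels = {node: node for node in nodes}
    while True:
        new_labels = propagate_once(labels, edges)
        if new_labels == labels:
            break
        labels = new_labels
    components = defaultdict(set)
    for node, label in labels.items():
        components[label].add(node)
    return dict(components)


def split_by_size(components, max_size):
    """Separate components that exceed max_size as suspected over-grouping."""
    accepted = {k: v for k, v in components.items() if len(v) <= max_size}
    suspect = {k: v for k, v in components.items() if len(v) > max_size}
    return accepted, suspect
```

The number of passes grows with the longest chain of links inside a component, so relatively small components keep the iteration count low.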
9. Operational Experience
• Infrastructure Issues
– Hadoop Upgrades ..
– Shared Environment
• Source Data Issues
– Owner Changes
– ID Issues
– Knowledge?
• Too many Mappers
• Too many Versions
• Space Issues
• Large Clusters
11. Summary:
• Entity resolution at scale
• Daily processing of full data set
• Accurate results
• Reliable, stable process