High-Scale Entity Resolution in Hadoop

Published in: Technology

  1. High-Scale Entity Resolution in Hadoop. June 29, 2016. Gurpreet Singh & Tom Schweiger
  2. PROBLEM STATEMENT: eBay maintains hundreds of millions of accounts* across our properties and partners. Account data is sometimes unstructured, arrives in different formats and character sets, and changes over time. Identifying which accounts belong to the same person enables us to personalize each customer's experience, deliver great customer service, and fight fraud.
  3. PROBLEM SOLVED!: Identifying which accounts belong to the same person is hard under normal circumstances. Doing so daily at this scale is a feat that defies superlatives. MapReduce gives us a robust design pattern to simplify entity recognition as a series of parallelized unit operations.
  4. Technology Stack
  5. Modular Solution: Sources, Edges, Graph, Table
       Account   Entity
       102832    10921
       236896    10921
       786273    10921
       324324    23987
       349709    73652
       152631    73652
       543273    37726
  6. SOURCES – Overall Data Flow
  7. EDGES: Linking accounts pairwise
     • Multiple strategies for blocking and matching accounts.
     • Each strategy writes to its own ‘bucket’.
     • Each strategy is a configuration-driven MR with Mappers that can:
       – Read simultaneously from multiple file types (text, sequence, Avro, ORC) and layouts (fixed, delimited, JSON, CSV)
       – Extract, transform, normalize, and combine fields
       – Burst records to create multiple key-value pairs
       and Reducers that can:
       – Embed a rules-based matching engine that is configured on load
       – Embed, build, and search Lucene indexes
  8. GRAPH: Identify and validate connected components
     • Iterative MR for identifying connected components
     • Connected components are validated for integrity and over-grouping, and partitioned (connected components are relatively small)
  9. Operational Experience
     • Infrastructure Issues
       – Hadoop Upgrades ..
       – Shared Environment
     • Source Data Issues
       – Owner Changes
       – ID Issues
       – Knowledge?
     • Too many Mappers
     • Too many Versions
     • Space Issues
     • Large Clusters
  10. Performance Numbers
  11. Summary:
     • Entity resolution at scale
     • Daily processing of full data set
     • Accurate results
     • Reliable, stable process
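
The EDGES slide (slide 7) describes mappers that normalize fields and burst each record into several key-value pairs, and reducers that apply matching rules within each key. The following is only a rough single-process sketch of that pattern in plain Python, not the presenters' implementation. The field names (`account`, `name`, `email`, `phone`), the blocking keys, the matching rule, and the sample records are all hypothetical, and an in-memory dictionary stands in for the Hadoop shuffle.

```python
from collections import defaultdict
from itertools import combinations


def normalize(value):
    """Lowercase and drop all whitespace so formatting differences do not hide a match."""
    return "".join(value.lower().split())


def mapper(record):
    """Burst one account record into several (blocking_key, record) pairs."""
    if record.get("email"):
        yield "email:" + normalize(record["email"]), record
    digits = "".join(ch for ch in record.get("phone", "") if ch.isdigit())
    if digits:
        yield "phone:" + digits, record


def matches(a, b):
    """Hypothetical rule: accounts in the same block match if their names agree."""
    return normalize(a["name"]) == normalize(b["name"])


def reducer(records):
    """Compare every pair inside one block and emit an edge for each match."""
    for a, b in combinations(records, 2):
        if matches(a, b):
            yield tuple(sorted((a["account"], b["account"])))


def run_strategy(records):
    """Simulate map, shuffle, and reduce for one blocking strategy."""
    blocks = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in mapper(record):
            blocks[key].append(value)
    edges = set()  # the strategy's 'bucket' of pairwise links
    for block in blocks.values():
        edges.update(reducer(block))
    return sorted(edges)


# Made-up sample records for illustration only.
accounts = [
    {"account": 1, "name": "Ann Lee", "email": "Ann.Lee@example.com", "phone": "555-0100"},
    {"account": 2, "name": "ann lee", "email": "ann.lee@example.com ", "phone": ""},
    {"account": 3, "name": "Ann  Lee", "email": "", "phone": "(555) 0100"},
    {"account": 4, "name": "Bob Ray", "email": "ann.lee@example.com", "phone": ""},
]
edges = run_strategy(accounts)
print(edges)
```

Blocking keeps the pairwise comparison inside small groups, so account 4 is compared with accounts 1 and 2 (shared email block) but never with account 3.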

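The GRAPH slide (slide 8) mentions an iterative MR for connected components followed by a check for over-grouping. One common way to do this is minimum-label propagation, sketched below in plain Python; the slides do not say which algorithm the presenters used, so this is an assumption. The function names, the `max_size` threshold, and the sample edges are hypothetical. Each pass of the loop corresponds to one map/reduce round, and the result is an account-to-entity table like the one on slide 5.

```python
from collections import Counter, defaultdict


def connected_components(edges, nodes=()):
    """Return {account: entity}, where entity is the smallest account id in the component."""
    label = {n: n for n in nodes}
    for a, b in edges:
        label.setdefault(a, a)
        label.setdefault(b, b)

    changed = True
    while changed:  # one iteration stands in for one MapReduce round
        changed = False
        # "Map": every edge proposes the smaller current label to both endpoints.
        proposals = defaultdict(list)
        for a, b in edges:
            smallest = min(label[a], label[b])
            proposals[a].append(smallest)
            proposals[b].append(smallest)
        # "Reduce": every account keeps the smallest label proposed to it.
        for node, candidates in proposals.items():
            best = min(candidates)
            if best < label[node]:
                label[node] = best
                changed = True
    return label


def flag_over_grouped(label, max_size):
    """Return entity ids whose component exceeds a (hypothetical) size threshold."""
    sizes = Counter(label.values())
    return sorted(entity for entity, size in sizes.items() if size > max_size)


# Made-up edges for illustration only; account 6 has no links.
sample_edges = [(1, 2), (2, 3), (4, 5)]
table = connected_components(sample_edges, nodes=[6])
suspicious = flag_over_grouped(table, max_size=2)
print(table, suspicious)
```

Label propagation needs roughly as many rounds as the longest chain of links in a component, which is tolerable when, as the slide notes, connected components are relatively small.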