Like this presentation? Why not share!

- Real-time Data De-duplication using... by DECK36 2461 views
- Approximate methods for scalable da... by Andrew Clegg 2238 views
- Tiny Batches, in the wine: Shiny Ne... by Paco Nathan 5761 views
- ScopeSketch: probabilistic data str... by Namuk Park 201 views
- Scalable real-time processing techn... by Lars Albertsson 602 views
- Sean Kandel - Data profiling: Asses... by huguk 389 views

597

-1

-1

Published on

Andrew Clegg overviews methods and provides use cases for performing data sets operations like membership testing, distinct counts, and nearest-neighbour finding more efficiently. Filmed at qconlondon.com.

Andrew Clegg has a PhD in computational linguistics and text mining, and has worked in life sciences, healthcare, social and online media, and publishing. Nowadays he heads up the Data Analytics & Visualization team at Pearson in London, helping companies from the Pearson, FT and Penguin groups make the most of their data. Twitter: @andrew_clegg

Published in:
Technology

No Downloads

Total Views

597

On Slideshare

0

From Embeds

0

Number of Embeds

0

Shares

0

Downloads

0

Comments

0

Likes

2

No embeds

No notes for slide

- 1. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /scalability-data-mining
- 2. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
- 3. Approximate methods for scalable data mining Andrew Clegg Data Analytics & Visualization Team Pearson Technology Twitter: @andrew_clegg
- 4. Approximate methods for scalable data mining l 08/03/132 Outline 1. Intro 2. What are approximate methods and why are they cool? 3. Set membership (ﬁnding non-unique items) 4. Cardinality estimation (counting unique items) 5. Frequency estimation (counting occurrences of items) 6. Locality-sensitive hashing (ﬁnding similar items) 7. Further reading and sample code
- 5. Approximate methods for scalable data mining l 08/03/133 Intro Me and the team London Dario Villanueva Ablanedo Data Analytics Engineer Hubert Rogers Data Scientist Andrew Clegg Technical Manager Kostas Perifanos Data Analytics Engineer Andreas Galatoulas Product Manager
- 6. Approximate methods for scalable data mining l 08/03/134 Intro Motivation for getting into approximate methods Counting unique terms across ElasticSearch shards Cluster nodes Master node Distinct terms per shard Globally distinct terms Client Number of globally distinct terms Icons from Dropline Neu! http://ﬁndicons.com/pack/1714/dropline_neu
- 7. Approximate methods for scalable data mining l 08/03/135 Intro Motivation for getting into approximate methods But what if each term-set is BIG? Memory cost CPU cost to serialize Network transfer cost CPU cost to deserialize CPU & memory cost to merge & count sets … and what if they’re too big to ﬁt in memory?
- 8. Approximate methods for scalable data mining l 08/03/136 But we’ll come back to that later.
- 9. Approximate methods for scalable data mining l 08/03/137 What are approximate methods? Trading accuracy for scalability • Often use probabilistic data structures – a.k.a. “Sketches” • Mostly stream-friendly – Allow you to query data you haven’t even kept! • Generally simple to parallelize • Predictable error rate (can be tuned)
- 10. Approximate methods for scalable data mining l 08/03/138 What are approximate methods? Trading accuracy for scalability • Represent characteristics or summary of data • Use much less space than full dataset (often via hashing) – Can alleviate disk, memory, network bottlenecks • Generally incur more CPU load than exact methods – This may not be true in a distributed system, overall: [de]serialization for example – Many data-centric systems have CPU to spare anyway
- 11. Approximate methods for scalable data mining l 08/03/139 Set membership Have I seen this item before?
- 12. Approximate methods for scalable data mining l 08/03/1310 Set membership Naïve approach • Put all items in a hash table in memory – e.g. HashSet in Java, set in Python • Checking whether item exists is very cheap • Not so good when items don’t ﬁt in memory any more • Merging big sets (to increase query speed) can be expensive – Especially if they are on different cluster nodes
- 13. Approximate methods for scalable data mining l 08/03/1311 Set membership Bloom ﬁlter A probabilistic data structure for testing set membership Real-life example: BigTable and HBase use these to avoid wasted lookups for non- existent row and column IDs.
- 14. Approximate methods for scalable data mining l 08/03/1312 Set membership Bloom ﬁlter: creating and populating • Bitﬁeld of size n (can be quite large but << total data size) • k independent hash functions with integer output in [0, n-1] • For each input item: – For each hash: ○ Hash item to get an index into the bitﬁeld ○ Set that bit to 1 i.e. Each item yields a unique pattern of k bits. These are ORed onto the bitﬁeld when the item is added.
- 15. Approximate methods for scalable data mining l 08/03/1313 Set membership Bloom ﬁlter: querying • Hash the query item with all k hash functions • Are all of the corresponding bits set? – No = we have never seen this item before – Yes = we have probably seen this item before • Probability of false positive depends on: – n (bitﬁeld size) – number of items added • k has an optimum value also based on these – Must be picked in advance based on what you expect, roughly
- 16. Approximate methods for scalable data mining l 08/03/1314 Set membership Bloom ﬁlter Example (3 elements, 3 hash functions, 18 bits) Image from Wikipedia http://en.wikipedia.org/wiki/File:Bloom_ﬁlter.svg
- 17. Approximate methods for scalable data mining l 08/03/1315 Set membership Bloom ﬁlter Cool properties • Union/intersection = bitwise OR/AND • Add/query operations stay at O(k) time (and they’re fast) • Filter takes up constant space – Can be rebuilt bigger once saturated, if you still have the data Extensions • BFs supporting “remove”, scalable (growing) BFs, stable BFs, …
- 18. Approximate methods for scalable data mining l 08/03/1316 Cardinality estimation How many distinct items have I seen?
- 19. Approximate methods for scalable data mining l 08/03/1317 Cardinality calculation Naïve approach • Put all items in a hash table in memory – e.g. HashSet in Java, set in Python – Duplicates are ignored • Count the number remaining at the end – Implementations typically track this -- fast to check • Not so good when items don’t ﬁt in memory any more • Merging big sets can be expensive – Especially if they are on different cluster nodes
- 20. Approximate methods for scalable data mining l 08/03/1318 Cardinality estimation Probabilistic counting An approximate method for counting unique items Real-life example: Implementation of parallelizable distinct counts in ElasticSearch. https://github.com/ptdavteam/elasticsearch-approx-plugin
- 21. Approximate methods for scalable data mining l 08/03/1319 Cardinality estimation Probabilistic counting Intuitive explanation Long runs of trailing 0s in random bit strings are rare. But the more bit strings you look at, the more likely you are to see a long one. So “longest run of trailing 0s seen” can be used as an estimator of “number of unique bit strings seen”. 01110001 11101010 00100101 11001100 11110100 11101100 00010100 00000001 00000010 10001110 01110100 01101010 01111111 00100010 00110000 00001010 01000100 01111010 01011101 00000100
- 22. Approximate methods for scalable data mining l 08/03/1320 Cardinality estimation Probabilistic counting: basic algorithm • Let n = 0 • For each input item: – Hash item into bit string – Count trailing zeroes in bit string – If this count > n: ○ Let n = count
- 23. Approximate methods for scalable data mining l 08/03/1321 Cardinality estimation Probabilistic counting: calculating the estimate • n = longest run of trailing 0s seen • Estimated cardinality (“count distinct”) = 2^n … that’s it! This is an estimate, but not actually a great one. Improvements • Various “fudge factors”, corrections for extreme values, etc. • Multiple hashes in parallel, average over results (LogLog algorithm) • Harmonic mean instead of geometric (HyperLogLog algorithm)
- 24. Approximate methods for scalable data mining l 08/03/1322 Cardinality estimation Probabilistic counting and friends Cool properties • Error rates are predictable – And tunable, for multi-hash methods • Can be merged easily – max(longest run counters from all shards) • Add/query operations are constant time (and fast too) • Data structure is just counter[s]
- 25. Approximate methods for scalable data mining l 08/03/1323 Frequency estimation How many occurences of each item have I seen?
- 26. Approximate methods for scalable data mining l 08/03/1324 Frequency calculation Naïve approach • Maintain a key-value hash table from item -> counter – e.g. HashMap in Java, dict in Python • Not so good when items don’t ﬁt in memory any more • Merging big maps can be expensive – Especially if they are on different cluster nodes
- 27. Approximate methods for scalable data mining l 08/03/1325 Frequency estimation Count-min sketch A probabilistic data structure for counting occurences of items Real-life example: Keeping track of traffic volume by IP address in a ﬁrewall, to detect anomalies.
- 28. Approximate methods for scalable data mining l 08/03/1326 Frequency estimation Count-min sketch: creating and populating • k integer arrays, each of length n • k hash functions yielding values in [0, n-1] – These values act as indexes into the arrays • For each input item: – For each hash: ○ Hash item to get index into corresponding array ○ Increment the value at that position by 1
- 29. Approximate methods for scalable data mining l 08/03/1327 Frequency estimation Count-min sketch: creating and populating +2A3 +1+1A2 +1+1A1 “foo” h1 h2 h3 h1 h2 h3 “bar”
- 30. Approximate methods for scalable data mining l 08/03/1328 Frequency estimation Count-min sketch: querying • For each hash function: – Hash query item to get index into corresponding array – Get the count at that position • Return the lowest of these counts This minimizes the effect of hash collisions. (Collisions can only cause over-counting, not under-counting)
- 31. Approximate methods for scalable data mining l 08/03/1329 Frequency estimation Count-min sketch: querying 0200000000A3 0001001000A2 0000100010A1 “foo” h1 h2 h3 min(1, 1, 2) = 1 Caveat: You can’t iterate through the items, they’re not stored at all.
- 32. Approximate methods for scalable data mining l 08/03/1330 Frequency estimation Count-min sketch Cool properties • Fast adding and querying in O(k) time • As with Bloom ﬁlter: more hashes = lower error • Mergeable by cellwise addition • Better accuracy for higher-frequency items (“heavy hitters”) • Can also be used to ﬁnd quantiles
- 33. Approximate methods for scalable data mining l 08/03/1331 Similarity search Which items are most similar to this one?
- 34. Approximate methods for scalable data mining l 08/03/1332 Similarity search Naïve approach Nearest-neighbour search • For each stored item: – Compare to query item via appropriate distance metric* – Keep if closer than previous closest match • Distance metric calculation can be expensive – Especially if items are many-dimensional records • Can be slow even if data small enough to ﬁt in memory *Cosine distance, Hamming distance, Jaccard distance etc.
- 35. Approximate methods for scalable data mining l 08/03/1333 Similarity search Locality-sensitive hashing A probabilistic method for nearest-neighbour search Real-life example: Finding users with similar music tastes in an online radio service.
- 36. Approximate methods for scalable data mining l 08/03/1334 Similarity search Locality-sensitive hashing Intuitive explanation Typical hash functions: • Similar inputs yield very different outputs Locality-sensitive hash functions: • Similar inputs yield similar or identical outputs So: Hash each item, then just compare hashes. – Can be used to pre-ﬁlter items before exact comparisons – You can also index the hashes for quick lookup
- 37. Approximate methods for scalable data mining l 08/03/1335 Similarity search Locality-sensitive hashing: random hyperplanes Treat n-valued items as vectors in n-dimensional space. Draw k random hyperplanes in that space. For each hyperplane: Is each vector above it (1) or below it (0)? Item1 h1 h2 h3 Item2 Hash(Item1) = 011 Hash(Item2) = 001 In this example: n=2, k=3 As the cosine distance decreases, the probability of a hash match increases.
- 38. Approximate methods for scalable data mining l 08/03/1336 Similarity search Locality-sensitive hashing: random hyperplanes Cool properties • Hamming distance between hashes approximates cosine distance • More hyperplanes (higher k) -> bigger hashes -> better estimates • Can use to narrow search space before exact nearest-neighbours • Various ways to combine sketches for better separation
- 39. Approximate methods for scalable data mining l 08/03/1337 Similarity search Locality-sensitive hashing Other approaches • Bit sampling: approximates Hamming distance • MinHashing: approximates Jaccard distance • Random projection: approximates Euclidean distance
- 40. Approximate methods for scalable data mining l 08/03/1338 Wrap-up
- 41. Approximate methods for scalable data mining l 08/03/1339 http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- 42. Approximate methods for scalable data mining l 08/03/1340 Resources Ebook available free from: http://infolab.stanford.edu/~ullman/mmds.html stream-lib: https://github.com/clearspring/stream-lib Java libraries for cardinality, set membership, frequency and top-N items (not covered here) No canonical source of multiple LSH algorithms, but plenty of separate implementations Wikipedia is pretty good on these topics too
- 43. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/scalability -data-mining

No public clipboards found for this slide

Be the first to comment