Approximate methods for scalabledata miningAndrew CleggData Analytics & Visualization TeamPearson TechnologyTwitter: @andr...
Approximate methods for scalable data mining l 08/03/132Outline1. Intro2. What are approximate methods and why are they co...
Approximate methods for scalable data mining l 08/03/133IntroMe and the teamLondonDario VillanuevaAblanedoData Analytics E...
Approximate methods for scalable data mining l 08/03/134IntroMotivation for getting into approximate methodsCounting uniqu...
Approximate methods for scalable data mining l 08/03/135IntroMotivation for getting into approximate methodsBut what if ea...
Approximate methods for scalable data mining l 08/03/136But we’ll come back to that later.
Approximate methods for scalable data mining l 08/03/137What are approximate methods?Trading accuracy for scalability• Oft...
Approximate methods for scalable data mining l 08/03/138What are approximate methods?Trading accuracy for scalability• Rep...
Approximate methods for scalable data mining l 08/03/139Set membershipHave I seen this item before?
Approximate methods for scalable data mining l 08/03/1310Set membershipNaïve approach• Put all items in a hash table in me...
Approximate methods for scalable data mining l 08/03/1311Set membershipBloom filterA probabilistic data structure for testi...
Approximate methods for scalable data mining l 08/03/1312Set membershipBloom filter: creating and populating• Bitfield of si...
Approximate methods for scalable data mining l 08/03/1313Set membershipBloom filter: querying• Hash the query item with all...
Approximate methods for scalable data mining l 08/03/1314Set membershipBloom filterExample (3 elements, 3 hash functions, 1...
Approximate methods for scalable data mining l 08/03/1315Set membershipBloom filterCool properties• Union/intersection = bi...
Approximate methods for scalable data mining l 08/03/1316Cardinality estimationHow many distinct items have I seen?
Approximate methods for scalable data mining l 08/03/1317Cardinality calculationNaïve approach• Put all items in a hash ta...
Approximate methods for scalable data mining l 08/03/1318Cardinality estimationProbabilistic countingAn approximate method...
Approximate methods for scalable data mining l 08/03/1319Cardinality estimationProbabilistic countingIntuitive explanation...
Approximate methods for scalable data mining l 08/03/1320Cardinality estimationProbabilistic counting: basic algorithm• Le...
Approximate methods for scalable data mining l 08/03/1321Cardinality estimationProbabilistic counting: calculating the est...
Approximate methods for scalable data mining l 08/03/1322Cardinality estimationProbabilistic counting and friendsCool prop...
Approximate methods for scalable data mining l 08/03/1323Frequency estimationHow many occurences of each item have I seen?
Approximate methods for scalable data mining l 08/03/1324Frequency calculationNaïve approach• Maintain a key-value hash ta...
Approximate methods for scalable data mining l 08/03/1325Frequency estimationCount-min sketchA probabilistic data structur...
Approximate methods for scalable data mining l 08/03/1326Frequency estimationCount-min sketch: creating and populating• k ...
Approximate methods for scalable data mining l 08/03/1327Frequency estimationCount-min sketch: creating and populating+2A3...
Approximate methods for scalable data mining l 08/03/1328Frequency estimationCount-min sketch: querying• For each hash fun...
Approximate methods for scalable data mining l 08/03/1329Frequency estimationCount-min sketch: querying0200000000A30001001...
Approximate methods for scalable data mining l 08/03/1330Frequency estimationCount-min sketchCool properties• Fast adding ...
Approximate methods for scalable data mining l 08/03/1331Similarity searchWhich items are most similar to this one?
Approximate methods for scalable data mining l 08/03/1332Similarity searchNaïve approachNearest-neighbour search• For each...
Approximate methods for scalable data mining l 08/03/1333Similarity searchLocality-sensitive hashingA probabilistic method...
Approximate methods for scalable data mining l 08/03/1334Similarity searchLocality-sensitive hashingIntuitive explanationT...
Approximate methods for scalable data mining l 08/03/1335Similarity searchLocality-sensitive hashing: random hyperplanesTr...
Approximate methods for scalable data mining l 08/03/1336Similarity searchLocality-sensitive hashing: random hyperplanesCo...
Approximate methods for scalable data mining l 08/03/1337Similarity searchLocality-sensitive hashingOther approaches• Bit ...
Approximate methods for scalable data mining l 08/03/1338Wrap-up
Approximate methods for scalable data mining l 08/03/1339http://highlyscalable.wordpress.com/2012/05/01/probabilistic-stru...
Approximate methods for scalable data mining l 08/03/1340ResourcesEbook available free from:http://infolab.stanford.edu/~u...
Approximate methods for scalable data mining (long version)
Upcoming SlideShare
Loading in …5
×

Approximate methods for scalable data mining (long version)

846 views

Published on

My approximate methods talk, full length version, as delivered at QCon London 2013.

Published in: Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
846
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
14
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide

Approximate methods for scalable data mining (long version)

  1. 1. Approximate methods for scalabledata miningAndrew CleggData Analytics & Visualization TeamPearson TechnologyTwitter: @andrew_clegg
  2. 2. Approximate methods for scalable data mining l 08/03/132Outline1. Intro2. What are approximate methods and why are they cool?3. Set membership (finding non-unique items)4. Cardinality estimation (counting unique items)5. Frequency estimation (counting occurrences of items)6. Locality-sensitive hashing (finding similar items)7. Further reading and sample code
  3. 3. Approximate methods for scalable data mining l 08/03/133IntroMe and the teamLondonDario VillanuevaAblanedoData Analytics EngineerHubert RogersData ScientistAndrew CleggTechnical ManagerKostas PerifanosData AnalyticsEngineerAndreasGalatoulasProduct Manager
  4. 4. Approximate methods for scalable data mining l 08/03/134IntroMotivation for getting into approximate methodsCounting unique terms across ElasticSearch shardsClusternodes MasternodeDistinctterms pershardGloballydistincttermsClientNumber ofgloballydistincttermsIcons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu
  5. 5. Approximate methods for scalable data mining l 08/03/135IntroMotivation for getting into approximate methodsBut what if each term-set is BIG?Memory costCPU cost toserializeNetworktransfer costCPU cost todeserializeCPU & memory cost tomerge & count sets… and what if they’re too big to fit in memory?
  6. 6. Approximate methods for scalable data mining l 08/03/136But we’ll come back to that later.
  7. 7. Approximate methods for scalable data mining l 08/03/137What are approximate methods?Trading accuracy for scalability• Often use probabilistic data structures– a.k.a. “Sketches”• Mostly stream-friendly– Allow you to query data you haven’t even kept!• Generally simple to parallelize• Predictable error rate (can be tuned)
  8. 8. Approximate methods for scalable data mining l 08/03/138What are approximate methods?Trading accuracy for scalability• Represent characteristics or summary of data• Use much less space than full dataset (often via hashing)– Can alleviate disk, memory, network bottlenecks• Generally incur more CPU load than exact methods– This may not be true in a distributed system, overall:[de]serialization for example– Many data-centric systems have CPU to spare anyway
  9. 9. Approximate methods for scalable data mining l 08/03/139Set membershipHave I seen this item before?
  10. 10. Approximate methods for scalable data mining l 08/03/1310Set membershipNaïve approach• Put all items in a hash table in memory– e.g. HashSet in Java, set in Python• Checking whether item exists is very cheap• Not so good when items don’t fit in memory any more• Merging big sets (to increase query speed) can be expensive– Especially if they are on different cluster nodes
  11. 11. Approximate methods for scalable data mining l 08/03/1311Set membershipBloom filterA probabilistic data structure for testing set membershipReal-life example:BigTable and HBase use these to avoid wasted lookups for non-existent row and column IDs.
  12. 12. Approximate methods for scalable data mining l 08/03/1312Set membershipBloom filter: creating and populating• Bitfield of size n (can be quite large but << total data size)• k independent hash functions with integer output in [0, n-1]• For each input item:– For each hash:○ Hash item to get an index into the bitfield○ Set that bit to 1i.e. Each item yields a unique pattern of k bits.These are ORed onto the bitfield when the item is added.
  13. 13. Approximate methods for scalable data mining l 08/03/1313Set membershipBloom filter: querying• Hash the query item with all k hash functions• Are all of the corresponding bits set?– No = we have never seen this item before– Yes = we have probably seen this item before• Probability of false positive depends on:– n (bitfield size)– number of items added• k has an optimum value also based on these– Must be picked in advance based on what you expect, roughly
  14. 14. Approximate methods for scalable data mining l 08/03/1314Set membershipBloom filterExample (3 elements, 3 hash functions, 18 bits)Image from Wikipedia http://en.wikipedia.org/wiki/File:Bloom_filter.svg
  15. 15. Approximate methods for scalable data mining l 08/03/1315Set membershipBloom filterCool properties• Union/intersection = bitwise OR/AND• Add/query operations stay at O(k) time (and they’re fast)• Filter takes up constant space– Can be rebuilt bigger once saturated, if you still have the dataExtensions• BFs supporting “remove”, scalable (growing) BFs, stable BFs, …
  16. 16. Approximate methods for scalable data mining l 08/03/1316Cardinality estimationHow many distinct items have I seen?
  17. 17. Approximate methods for scalable data mining l 08/03/1317Cardinality calculationNaïve approach• Put all items in a hash table in memory– e.g. HashSet in Java, set in Python– Duplicates are ignored• Count the number remaining at the end– Implementations typically track this -- fast to check• Not so good when items don’t fit in memory any more• Merging big sets can be expensive– Especially if they are on different cluster nodes
  18. 18. Approximate methods for scalable data mining l 08/03/1318Cardinality estimationProbabilistic countingAn approximate method for counting unique itemsReal-life example:Implementation of parallelizable distinct counts in ElasticSearch.https://github.com/ptdavteam/elasticsearch-approx-plugin
  19. 19. Approximate methods for scalable data mining l 08/03/1319Cardinality estimationProbabilistic countingIntuitive explanationLong runs of trailing 0s in random bit strings are rare.But the more bit strings you look at, the more likely youare to see a long one.So “longest run of trailing 0s seen” can be used as anestimator of “number of unique bit strings seen”.0111000111101010001001011100110011110100111011000001010000000001000000101000111001110100011010100111111100100010001100000000101001000100011110100101110100000100
  20. 20. Approximate methods for scalable data mining l 08/03/1320Cardinality estimationProbabilistic counting: basic algorithm• Let n = 0• For each input item:– Hash item into bit string– Count trailing zeroes in bit string– If this count > n:○ Let n = count
  21. 21. Approximate methods for scalable data mining l 08/03/1321Cardinality estimationProbabilistic counting: calculating the estimate• n = longest run of trailing 0s seen• Estimated cardinality (“count distinct”) = 2^n … that’s it!This is an estimate, but not actually a great one.Improvements• Various “fudge factors”, corrections for extreme values, etc.• Multiple hashes in parallel, average over results (LogLog algorithm)• Harmonic mean instead of geometric (HyperLogLog algorithm)
  22. 22. Approximate methods for scalable data mining l 08/03/1322Cardinality estimationProbabilistic counting and friendsCool properties• Error rates are predictable– And tunable, for multi-hash methods• Can be merged easily– max(longest run counters from all shards)• Add/query operations are constant time (and fast too)• Data structure is just counter[s]
  23. 23. Approximate methods for scalable data mining l 08/03/1323Frequency estimationHow many occurences of each item have I seen?
  24. 24. Approximate methods for scalable data mining l 08/03/1324Frequency calculationNaïve approach• Maintain a key-value hash table from item -> counter– e.g. HashMap in Java, dict in Python• Not so good when items don’t fit in memory any more• Merging big maps can be expensive– Especially if they are on different cluster nodes
  25. 25. Approximate methods for scalable data mining l 08/03/1325Frequency estimationCount-min sketchA probabilistic data structure for counting occurences ofitemsReal-life example:Keeping track of traffic volume by IP address in a firewall, todetect anomalies.
  26. 26. Approximate methods for scalable data mining l 08/03/1326Frequency estimationCount-min sketch: creating and populating• k integer arrays, each of length n• k hash functions yielding values in [0, n-1]– These values act as indexes into the arrays• For each input item:– For each hash:○ Hash item to get index into corresponding array○ Increment the value at that position by 1
  27. 27. Approximate methods for scalable data mining l 08/03/1327Frequency estimationCount-min sketch: creating and populating+2A3+1+1A2+1+1A1“foo”h1 h2 h3h1 h2 h3“bar”
  28. 28. Approximate methods for scalable data mining l 08/03/1328Frequency estimationCount-min sketch: querying• For each hash function:– Hash query item to get index into corresponding array– Get the count at that position• Return the lowest of these countsThis minimizes the effect of hash collisions.(Collisions can only cause over-counting, not under-counting)
  29. 29. Approximate methods for scalable data mining l 08/03/1329Frequency estimationCount-min sketch: querying0200000000A30001001000A20000100010A1“foo”h1 h2 h3min(1, 1, 2) = 1Caveat: You can’t iterate through the items, they’re not stored at all.
  30. 30. Approximate methods for scalable data mining l 08/03/1330Frequency estimationCount-min sketchCool properties• Fast adding and querying in O(k) time• As with Bloom filter: more hashes = lower error• Mergeable by cellwise addition• Better accuracy for higher-frequency items (“heavy hitters”)• Can also be used to find quantiles
  31. 31. Approximate methods for scalable data mining l 08/03/1331Similarity searchWhich items are most similar to this one?
  32. 32. Approximate methods for scalable data mining l 08/03/1332Similarity searchNaïve approachNearest-neighbour search• For each stored item:– Compare to query item via appropriate distance metric*– Keep if closer than previous closest match• Distance metric calculation can be expensive– Especially if items are many-dimensional records• Can be slow even if data small enough to fit in memory*Cosine distance, Hamming distance, Jaccard distance etc.
  33. 33. Approximate methods for scalable data mining l 08/03/1333Similarity searchLocality-sensitive hashingA probabilistic method for nearest-neighbour searchReal-life example:Finding users with similar music tastes in an online radio service.
  34. 34. Approximate methods for scalable data mining l 08/03/1334Similarity searchLocality-sensitive hashingIntuitive explanationTypical hash functions:• Similar inputs yield very different outputsLocality-sensitive hash functions:• Similar inputs yield similar or identical outputsSo: Hash each item, then just compare hashes.– Can be used to pre-filter items before exact comparisons– You can also index the hashes for quick lookup
  35. 35. Approximate methods for scalable data mining l 08/03/1335Similarity searchLocality-sensitive hashing: random hyperplanesTreat n-valued items as vectors in n-dimensional space.Draw k random hyperplanes in that space.For each hyperplane:Is each vector above it (1) or below it (0)?Item1h1 h2h3Item2Hash(Item1) = 011Hash(Item2) = 001In this example: n=2, k=3As the cosine distance decreases, theprobability of a hash match increases.
  36. 36. Approximate methods for scalable data mining l 08/03/1336Similarity searchLocality-sensitive hashing: random hyperplanesCool properties• Hamming distance between hashes approximates cosine distance• More hyperplanes (higher k) -> bigger hashes -> better estimates• Can use to narrow search space before exact nearest-neighbours• Various ways to combine sketches for better separation
  37. 37. Approximate methods for scalable data mining l 08/03/1337Similarity searchLocality-sensitive hashingOther approaches• Bit sampling: approximates Hamming distance• MinHashing: approximates Jaccard distance• Random projection: approximates Euclidean distance
  38. 38. Approximate methods for scalable data mining l 08/03/1338Wrap-up
  39. 39. Approximate methods for scalable data mining l 08/03/1339http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  40. 40. Approximate methods for scalable data mining l 08/03/1340ResourcesEbook available free from:http://infolab.stanford.edu/~ullman/mmds.htmlstream-lib:https://github.com/clearspring/stream-libJava libraries for cardinality, set membership,frequency and top-N items (not covered here)No canonical source of multiple LSH algorithms,but plenty of separate implementationsWikipedia is pretty good on these topics too

×