Approximate methods for scalable data mining
Talk on probabilistic algorithms and data structures from Data Science London in April 2013 (Big Data Week).

Presentation Transcript

  • Slide 1: Approximate methods for scalable data mining (24/04/13)
    Andrew Clegg
    Data Analytics & Visualization Team, Pearson Technology
    Twitter: @andrew_clegg
  • Slide 3: What are approximate methods? Trading accuracy for scalability
      • Often use probabilistic data structures, a.k.a. sketches or signatures
      • Mostly stream-friendly: allow you to query data you haven't even kept!
      • Generally simple to parallelize
      • Predictable error rate (can be tuned)
  • Slide 4: What are approximate methods? Trading accuracy for scalability
      • Represent characteristics or a summary of the data
      • Use much less space than the full dataset (generally via hashing tricks)
          – Can alleviate disk, memory, and network bottlenecks
      • Generally incur more CPU load than exact methods
          – This may not be true in a distributed system overall ([de]serialization, for example)
          – Many data-centric systems have CPU to spare anyway
  • Slide 5: Why approximate methods? A real-life example
    Counting unique terms in time buckets across ElasticSearch shards:
    cluster nodes send the unique terms per bucket per shard to the master
    node, which merges them into the globally unique terms per bucket and
    returns the number of globally unique terms per bucket to the client.
    (Icons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu)
  • Slide 6: Why approximate methods? A real-life example
    But what if each bucket contains a LOT of terms? And what if there are too
    many to fit in memory? The costs pile up: memory cost, CPU cost to
    serialize, network transfer cost, CPU cost to deserialize, and the CPU &
    memory cost to merge & count the sets.
  • Slide 7: Cardinality estimation: approximate distinct counts
    Intuitive explanation: long runs of trailing 0s in random bit strings are
    rare. But the more bit strings you look at, the more likely you are to see
    a long one. So "longest run of trailing 0s seen" can be used as an
    estimator of "number of unique bit strings seen".
    (The slide shows a column of example random bit strings.)
  • Slide 8: Cardinality estimation: probabilistic counting, the basic algorithm
    Counting the items:
      • Let n = 0
      • For each input item:
          – Hash the item into a bit string
          – Count the trailing zeroes in that bit string
          – If this count > n:
              ○ Let n = count
    Calculating the count:
      • n = longest run of trailing 0s seen
      • Estimated cardinality ("count distinct") = 2^n ... that's it!
    This is an estimate, but not a great one. But...
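The basic algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; MD5 is used here only as a convenient stand-in for a good 64-bit hash, and the function names are mine, not from the talk.

```python
import hashlib

def trailing_zeros(n: int) -> int:
    """Length of the run of trailing 0 bits in n (capped at 64)."""
    if n == 0:
        return 64
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def estimate_cardinality(items) -> int:
    """Estimate the distinct count as 2**(longest trailing-0 run seen)."""
    longest = 0
    for item in items:
        # Hash each item to a pseudo-random 64-bit string.
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        longest = max(longest, trailing_zeros(h))
    return 2 ** longest
```

Note that the estimate is always a power of two and has high variance, which is exactly why the refinements on the next slide exist.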
  • Slide 9: HyperLogLog algorithm
    Billions of distinct values in 1.5KB of RAM with 2% relative error
    (Image: http://www.aggregateknowledge.com/science/blog/hll.html)
    Cool properties:
      • Stream-friendly: no need to keep the data
      • Error rates are predictable and tunable
      • Size and speed stay constant
      • Trivial to parallelize
          – Combine two HLL counters by taking the max of each register
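The "max of each register" trick can be seen in a simplified HLL-style sketch: route each item to one of M registers by its low hash bits, record the longest run observed per register, and merge counters by elementwise max. This is a hedged illustration of the register mechanics only (it omits HyperLogLog's actual estimator and bias corrections), with SHA-1 as a stand-in hash and names of my own choosing.

```python
import hashlib

B = 6            # bits used to choose a register
M = 1 << B       # 64 registers

def _hash64(item) -> int:
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

def update(registers, item):
    """Route the item to one register; keep the largest rank seen there."""
    h = _hash64(item)
    idx = h & (M - 1)      # low bits pick the register
    rest = h >> B          # remaining bits supply the run of 0s
    rank = 1               # 1-based position of the lowest set bit
    while rest & 1 == 0 and rank < 64:
        rest >>= 1
        rank += 1
    registers[idx] = max(registers[idx], rank)

def merge(regs_a, regs_b):
    """Combine two counters losslessly: elementwise max of their registers."""
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```

Because max is associative and commutative, merging counters built on disjoint partitions of a stream gives exactly the registers you would get from the whole stream, which is what makes the structure trivial to parallelize.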
  • Slide 10: Resources on cardinality estimation
      • HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
      • Java implementations: https://github.com/clearspring/stream-lib/
      • Algebird implements HyperLogLog (and much more!) in Scalding: https://github.com/twitter/algebird
      • Simmer wraps Algebird in a Hadoop Streaming command line: https://github.com/avibryant/simmer
      • Our ElasticSearch plugin: https://github.com/ptdavteam/elasticsearch-approx-plugin
      • MetaMarkets blog: http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
      • Aggregate Knowledge blog, including a JavaScript implementation and D3 visualization: http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
  • Slide 11: Bloom filters: set membership test with a chance of false positives
    Hash each item n times ⇒ n indices into a bit field. When testing an item
    w, at least one 0 among its indices means w definitely isn't in the set;
    all 1s mean w probably is in the set.
    (Image: http://en.wikipedia.org/wiki/Bloom_filter)
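The slide's test logic (any 0 bit means "definitely not in set", all 1s means "probably in set") fits in a very small class. A minimal sketch, assuming Python with salted SHA-1 hashes standing in for k independent hash functions; the class and parameter names are mine.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions indexing into an m-bit field."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Salt the hash with i to simulate k independent hash functions.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item) -> bool:
        # Any 0 bit -> definitely not in the set; all 1s -> probably in it.
        return all(self.bits[idx] for idx in self._indices(item))
```

False negatives are impossible by construction; the false-positive rate is tuned by choosing m and k relative to the expected number of items.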
  • Slide 12: Count-min sketch: frequency histogram estimation with a chance of over-counting
    In the slide's diagram, "foo" and "bar" are each hashed by h1, h2, h3 into
    counter arrays A1, A2, A3, incrementing one counter per array; where they
    collide, a counter holds both contributions (+2). The estimate takes the
    minimum across arrays, e.g. count("foo") = min(1, 1, 2) = 1. More hashes /
    arrays ⇒ reduced chance of overcounting.
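The diagram's increment-then-take-the-minimum scheme looks like this in code. A small sketch under my own naming, again with salted SHA-1 standing in for the d independent hash functions.

```python
import hashlib

class CountMinSketch:
    """d counter arrays of width w; the estimate is the min across arrays."""

    def __init__(self, w: int = 256, d: int = 4):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]

    def _cols(self, item):
        # Salt the hash with the row index to get d independent hashes.
        for i in range(self.d):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count: int = 1):
        for row, col in zip(self.rows, self._cols(item)):
            row[col] += count

    def estimate(self, item) -> int:
        # Collisions only ever inflate counters, so min is the best bound.
        return min(row[col] for row, col in zip(self.rows, self._cols(item)))
```

The structure never undercounts: every counter an item touches holds at least that item's true frequency, so the minimum is an upper-bounded estimate of it.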
  • Slide 13: Random hyperplanes: locality-sensitive hashing for approximate nearest neighbours
    Each bit of the hash records which side of a random hyperplane (h1, h2,
    h3) an item falls on, e.g. Hash(Item1) = 011, Hash(Item2) = 001. As the
    cosine distance between two items decreases, the probability of a hash
    match increases, so bitwise Hamming distance correlates with cosine
    distance.
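A sketch of the signing step: draw random Gaussian hyperplanes and emit one bit per plane depending on the sign of the dot product. This is an illustrative toy (dimensions, bit count, and names are my own choices), not the talk's implementation.

```python
import random

def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    return [1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
            for plane in hyperplanes]

def hamming(sig_a, sig_b) -> int:
    return sum(a != b for a, b in zip(sig_a, sig_b))

# 64 random Gaussian hyperplanes in 5 dimensions (seeded for repeatability).
rng = random.Random(42)
planes = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(64)]
```

For any two vectors, the probability that a single random hyperplane separates them is proportional to the angle between them, which is why the Hamming distance between signatures tracks cosine distance.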
  • Slide 14: Feature hashing: high-dimensional machine learning without a feature dictionary
    Instead of building a feature dictionary, hash each token straight to a
    vector index and add +1 there, e.g. h("reduce") = 9, h("the") = 3,
    h("size") = 1, ... (the slide's example tokens spell out "reduce the size
    of your feature vector with this one weird old trick"). The effect of
    collisions on overall classification accuracy is surprisingly small!
    Multiple hashes, or a 1-bit "sign hash", can reduce collision effects if
    necessary.
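The hashing trick, including the 1-bit sign hash the slide mentions, fits in one function. A minimal sketch with SHA-1 as the stand-in hash and a deliberately tiny vector so collisions are visible; the function name and dimension are mine.

```python
import hashlib

def hash_features(tokens, dim: int = 16):
    """Map tokens straight into a fixed-size vector with no dictionary."""
    vec = [0] * dim
    for tok in tokens:
        h = int.from_bytes(hashlib.sha1(tok.encode()).digest()[:8], "big")
        idx = h % dim                       # index comes straight from the hash
        sign = 1 if (h >> 63) & 1 else -1   # 1-bit "sign hash": collisions tend to cancel
        vec[idx] += sign
    return vec
```

Because the vector is additive, the result is independent of token order, and the same tokens always map to the same indices, so no dictionary ever needs to be stored or shared.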
  • Slide 15: Thanks for listening, and some further reading...
    A great ebook is available free from:
    http://infolab.stanford.edu/~ullman/mmds.html