Approximate methods for scalable data mining
2. Approximate methods for scalable data mining
Andrew Clegg
Data Analytics & Visualization Team
Pearson Technology
Twitter: @andrew_clegg
3. Approximate methods for scalable data mining | 24/04/13
What are approximate methods?
Trading accuracy for scalability
• Often use probabilistic data structures
– a.k.a. sketches or signatures
• Mostly stream-friendly
– Allow you to query data you haven’t even kept!
• Generally simple to parallelize
• Predictable error rate (can be tuned)
4.
What are approximate methods?
Trading accuracy for scalability
• Represent characteristics or summary of data
• Use much less space than full dataset (generally via hashing tricks)
– Can alleviate disk, memory, network bottlenecks
• Generally incur more CPU load than exact methods
– Though this may not be true overall in a distributed system
○ e.g. savings on [de]serialization
– Many data-centric systems have CPU to spare anyway
5.
Why approximate methods?
A real-life example
Icons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu
Counting unique terms in time buckets across ElasticSearch shards
[Diagram: cluster nodes compute unique terms per bucket per shard; the master node merges these into globally unique terms per bucket; the client receives the number of globally unique terms per bucket]
6.
Why approximate methods?
A real-life example
But what if each bucket contains a LOT of terms? And what if there are too many to fit in memory?
[Diagram: costs at each stage of the exact approach: memory cost, CPU cost to serialize, network transfer cost, CPU cost to deserialize, and CPU & memory cost to merge & count the sets]
7.
Cardinality estimation
Approximate distinct counts
Intuitive explanation
Long runs of trailing 0s in random bit strings are rare. But the more bit strings you look at, the more likely you are to see a long one. So “longest run of trailing 0s seen” can be used as an estimator of “number of unique bit strings seen”.
[Example: 20 random 8-bit strings; the longest run of trailing 0s among them is 4 (in 00110000), giving the estimate 2^4 = 16 ≈ 20]
8.
Cardinality estimation
Probabilistic counting: basic algorithm
Counting the items
• Let n = 0
• For each input item:
– Hash the item into a bit string
– Count the trailing zeroes in the bit string
– If this count > n:
○ Let n = count
Calculating the count
• n = longest run of trailing 0s seen
• Estimated cardinality (“count distinct”) = 2^n … that’s it!
This is an estimate, but not a great one. But…
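The basic algorithm above can be sketched in a few lines of Python; the hash function and 64-bit width here are illustrative assumptions, not something the slides prescribe:

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """Number of trailing 0 bits in x (0 for x == 0, by convention here)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def estimate_cardinality(items) -> int:
    """Basic probabilistic counting: 2 ** (longest trailing-0 run seen)."""
    n = 0
    for item in items:
        # hash each item into a 64-bit string
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        n = max(n, trailing_zeros(h))
    return 2 ** n
```

Note the estimate is always a power of two, which is part of why the variance is so high, motivating the refinement on the next slide.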
9.
HyperLogLog algorithm
Billions of distinct values in 1.5KB of RAM with 2% relative error
Image: http://www.aggregateknowledge.com/science/blog/hll.html
Cool properties
• Stream-friendly: no need to keep data
• Error rates are predictable and tunable
• Size and speed stay constant
• Trivial to parallelize
– Combine two HLL counters by taking the max of each register
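A minimal HLL sketch in Python, illustrating the register layout and the max-merge property. The constants and small-range (linear counting) correction follow the original paper, but this toy version skips the large-range correction and picks SHA-1 arbitrarily as the hash:

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog with 2**p registers (small-range correction only)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero count + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # parallel-friendly: combine two counters by register-wise max
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # linear counting for small n
            return int(self.m * math.log(self.m / zeros))
        return int(raw)
```

Merging by register-wise max means each shard can build its counter independently and the combined counter is exactly what a single pass over all the data would have produced.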
10.
Resources on cardinality estimation
HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Java implementations: https://github.com/clearspring/stream-lib/
Algebird implements HyperLogLog (and much more!) in Scalding: https://github.com/twitter/algebird
Simmer wraps Algebird in Hadoop Streaming command line: https://github.com/avibryant/simmer
Our ElasticSearch plugin: https://github.com/ptdavteam/elasticsearch-approx-plugin
MetaMarkets blog: http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
Aggregate Knowledge blog, including JavaScript implementation and D3 visualization: http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
11.
Bloom filters
Set membership test with chance of false positives
Image: http://en.wikipedia.org/wiki/Bloom_filter
Hash each item n times ⇒ indices into bit field.
At least one 0 means w definitely isn’t in set.
All 1s would mean w probably is in set.
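A toy Bloom filter in Python. The size and hashing scheme are arbitrary choices for illustration; here the k indices are simply carved out of one SHA-256 digest rather than using k independent hash functions:

```python
import hashlib

class BloomFilter:
    """Set membership test: no false negatives, tunable false-positive rate."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _indices(self, item):
        # derive k indices from one digest (any k independent hashes would do)
        d = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.size

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # False -> definitely not in set; True -> probably in set
        return all(self.bits[idx] for idx in self._indices(item))
```

The structure uses a fixed number of bits regardless of how many items are added, which is what makes it attractive as a cheap pre-filter in front of an expensive lookup.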
12.
Count-min sketch
Frequency histogram estimation with chance of over-counting
[Diagram: “foo” and “bar” are each hashed by h1, h2, h3 into arrays A1, A2, A3; each target cell is incremented by +1, and a collision in A3 leaves that cell at +2]
count(“foo”) = min(1, 1, 2) = 1
More hashes / arrays ⇒ reduced chance of over-counting
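The diagram above translates into a very small amount of Python. Widths, depth, and the per-row hashing trick (salting one digest with the row number) are illustrative assumptions:

```python
import hashlib

class CountMinSketch:
    """Frequency estimation: collisions can only over-count, never under-count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # one hash per row, simulated by salting the digest with the row number
        d = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(d[:4], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += count

    def query(self, item):
        # min over rows: each row can only over-count, so min is tightest
        return min(self.tables[row][self._index(item, row)]
                   for row in range(self.depth))
```

Taking the minimum across rows is exactly the min(1, 1, 2) = 1 step in the slide: any single row may have absorbed a collision, but the least-collided row gives the best estimate.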
13.
Random hyperplanes
Locality-sensitive hashing for approximate nearest neighbours
[Diagram: Item1 and Item2 separated by angle θ, with random hyperplanes h1, h2, h3]
Hash(Item1) = 011
Hash(Item2) = 001
As the cosine distance decreases, the probability of a hash match increases.
Bitwise Hamming distance correlates with cosine distance.
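A sketch of random-hyperplane hashing in Python: each Gaussian hyperplane through the origin contributes one bit, set by which side of the plane the vector lies on. The dimensionality and plane count here are arbitrary illustrative choices:

```python
import random

def random_hyperplane_hash(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash signatures."""
    return bin(a ^ b).count("1")

# random Gaussian hyperplanes through the origin
random.seed(42)
DIM, NUM_PLANES = 8, 16
planes = [[random.gauss(0, 1) for _ in range(DIM)]
          for _ in range(NUM_PLANES)]
```

Scaling a vector never changes its signature, since cosine similarity ignores magnitude, and nearby directions flip only the few bits whose hyperplanes happen to pass between them.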
14.
Feature hashing
High-dimensional machine learning without feature dictionary
[Example: the sentence “reduce the size of your feature vector with this one weird old trick” is tokenized; each token is hashed to an index — h(“reduce”) = 9, h(“the”) = 3, h(“size”) = 1, … — and +1 is added to the vector at that index]
Effect of collisions on overall classification accuracy is surprisingly small!
Multiple hashes, or a 1-bit “sign hash”, can reduce collision effects if necessary.
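Feature hashing can be sketched as follows in Python, including the 1-bit sign hash mentioned above; the hash function and tiny dimensionality are illustrative assumptions:

```python
import hashlib

def hash_features(tokens, dim=16):
    """Build a fixed-size feature vector with no feature dictionary.
    The +/-1 sign hash makes collision errors cancel in expectation."""
    vec = [0.0] * dim
    for tok in tokens:
        d = hashlib.md5(tok.encode()).digest()
        idx = int.from_bytes(d[:4], "big") % dim  # bucket index
        sign = 1.0 if d[4] & 1 else -1.0          # 1-bit sign hash
        vec[idx] += sign
    return vec
```

Because the index comes straight from the hash, no vocabulary needs to be built or shipped around, and unseen tokens at prediction time are handled for free.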
15.
Thanks for listening
And some further reading…
Great ebook, Mining of Massive Datasets, available free from:
http://infolab.stanford.edu/~ullman/mmds.html