Approximate methods for
scalable data mining
Andrew Clegg
Data Analytics & Visualization Team
Pearson Technology
Twitter: @andrew_clegg
Approximate methods for scalable data mining | 24/04/13
What are approximate methods?
Trading accuracy for scalability
• Often use probabilistic data structures
– a.k.a. sketches or signatures
• Mostly stream-friendly
– Allow you to query data you haven’t even kept!
• Generally simple to parallelize
• Predictable error rate (can be tuned)
What are approximate methods?
Trading accuracy for scalability
• Represent characteristics or summary of data
• Use much less space than full dataset (generally via hashing tricks)
– Can alleviate disk, memory, network bottlenecks
• Generally incur more CPU load than exact methods
– This may not hold across a distributed system as a whole
○ For example, smaller sketches cost less CPU to [de]serialize
– Many data-centric systems have CPU to spare anyway
Why approximate methods?
A real-life example
Icons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu
Counting unique terms in time buckets across ElasticSearch shards
• Cluster nodes → master node: unique terms per bucket per shard
• Master node: globally unique terms per bucket
• Master node → client: number of globally unique terms per bucket
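The exact approach in this diagram can be sketched roughly as follows. This is only an illustration: the shard data, the bucket labels and the function names `merge_shards` and `counts_for_client` are made up for the example and are not taken from ElasticSearch or the plugin.

```python
from collections import defaultdict

# Hypothetical per-shard results: time bucket -> set of unique terms on that shard.
shard_results = [
    {"10:00": {"error", "timeout"}, "11:00": {"login"}},
    {"10:00": {"timeout", "retry"}, "11:00": {"login", "logout"}},
]


def merge_shards(results):
    """Master node step: union the per-shard term sets for each bucket."""
    merged = defaultdict(set)
    for shard in results:
        for bucket, terms in shard.items():
            merged[bucket] |= terms
    return merged


def counts_for_client(merged):
    """Only the number of globally unique terms per bucket goes to the client."""
    return {bucket: len(terms) for bucket, terms in merged.items()}


print(counts_for_client(merge_shards(shard_results)))
# {'10:00': 3, '11:00': 2}
```

Every shard has to ship its full term sets to the master node, which is where the costs of the exact approach come from.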
Why approximate methods?
A real-life example
But what if each bucket contains a LOT of terms? … and what if there are too many to fit in memory?
• Memory cost
• CPU cost to serialize
• Network transfer cost
• CPU cost to deserialize
• CPU & memory cost to merge & count sets
Cardinality estimation
Approximate distinct counts
Intuitive explanation
Long runs of trailing 0s in random bit strings are rare. But the more bit strings you look at, the more likely you are to see a long one.
So “longest run of trailing 0s seen” can be used as an estimator of “number of unique bit strings seen”.
01110001 11101010 00100101 11001100 11110100
11101100 00010100 00000001 00000010 10001110
01110100 01101010 01111111 00100010 00110000
00001010 01000100 01111010 01011101 00000100
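As a small worked check of this intuition, the sketch below counts trailing 0s in the twenty bit strings listed above. The helper name `trailing_zeros` is an illustrative choice, not something from the slides.

```python
# The 20 bit strings shown on the slide.
BIT_STRINGS = """
01110001 11101010 00100101 11001100 11110100
11101100 00010100 00000001 00000010 10001110
01110100 01101010 01111111 00100010 00110000
00001010 01000100 01111010 01011101 00000100
""".split()


def trailing_zeros(bits: str) -> int:
    """Length of the run of '0' characters at the end of a bit string."""
    return len(bits) - len(bits.rstrip("0"))


longest_run = max(trailing_zeros(b) for b in BIT_STRINGS)
estimate = 2 ** longest_run

print(len(set(BIT_STRINGS)))  # 20 distinct strings
print(longest_run)            # 4, from 00110000
print(estimate)               # 16
```

The longest run here is 4, so the crude estimate is 2^4 = 16 against a true count of 20 distinct strings.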
Cardinality estimation
Probabilistic counting: basic algorithm
Counting the items
• Let n = 0
• For each input item:
– Hash item into bit string
– Count trailing zeroes in bit string
– If this count > n:
○ Let n = count
Calculating the count
• n = longest run of trailing 0s seen
• Estimated cardinality (“count distinct”) = 2^n … that’s it!
This is an estimate, but not a great one. But…
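A minimal sketch of the steps above, assuming SHA-1 truncated to 64 bits as the hash function (the slides do not say which hash to use). The function names are illustrative.

```python
import hashlib


def hash_bits(item: str) -> int:
    """Hash an item to a 64-bit integer (truncated SHA-1; an assumed choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def trailing_zeros(value: int, width: int = 64) -> int:
    """Count trailing zero bits; an all-zero value counts as `width` zeros."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def estimate_cardinality(items) -> int:
    n = 0  # longest run of trailing 0s seen so far
    for item in items:
        count = trailing_zeros(hash_bits(item))
        if count > n:
            n = count
    return 2 ** n


# 10,000 events but only 1,000 distinct values.
stream = [f"user-{i % 1000}" for i in range(10_000)]
print(estimate_cardinality(stream))
```

Duplicates hash to the same bit string, so they never change n. The result is always a power of two, which is one reason a single counter like this is a coarse estimator.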
HyperLogLog algorithm
Billions of distinct values in 1.5KB of RAM with 2% relative error
Image: http://www.aggregateknowledge.com/science/blog/hll.html
Cool properties
• Stream-friendly: no need to keep data
• Error rates are predictable and tunable
• Size and speed stay constant
• Trivial to parallelize
– Combine two HLL counters by taking the max of each register
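The register merge can be illustrated with a simplified HyperLogLog. This is a sketch under assumptions, not the implementation used by any of the libraries in this deck: it uses truncated SHA-1 as the hash, takes the position of the leftmost 1 bit as the rank (as in the HyperLogLog paper, rather than trailing 0s), and applies only the paper's small-range correction. The class and method names are illustrative.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 10):
        if p < 7:
            raise ValueError("the alpha formula used here assumes m >= 128")
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")      # 64-bit hash (assumed choice)
        index = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1 bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Combine two counters by taking the max of each register."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting
        return raw


a, b, both = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(20_000):
    item = f"term-{i}"
    (a if i % 2 else b).add(item)   # split the stream across two counters
    both.add(item)                  # reference counter that sees everything

merged = a.merge(b)
print(merged.registers == both.registers)  # True
print(round(merged.estimate()))
```

Because each register only ever keeps a maximum, merging two counters gives exactly the registers a single counter would have had after seeing both streams.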
Resources on cardinality estimation
HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Java implementations: https://github.com/clearspring/stream-lib/
Algebird implements HyperLogLog (and much more!) in Scalding: https://github.com/twitter/algebird
Simmer wraps Algebird in Hadoop Streaming command line: https://github.com/avibryant/simmer
Our ElasticSearch plugin: https://github.com/ptdavteam/elasticsearch-approx-plugin
MetaMarkets blog: http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
Aggregate Knowledge blog, including JavaScript implementation and D3 visualization:
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Bloom filters
Set membership test with chance of false positives
Image: http://en.wikipedia.org/wiki/Bloom_filter
Hash each item n times ⇒ indices into bit field.
At least one 0 means w definitely isn’t in set.
All 1s would mean w probably is in set.
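A minimal Bloom filter sketch along these lines. The bit-field size, the number of hashes and the trick of salting SHA-1 with a counter to get several hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indices(self, item: str):
        # Derive several indices by salting one hash function with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item: str) -> bool:
        # Any 0 bit: definitely not in the set. All 1s: probably in the set.
        return all(self.bits[i] for i in self._indices(item))


seen = BloomFilter()
for word in ["x", "y", "z"]:
    seen.add(word)

print("x" in seen)  # True
print("w" in seen)  # False unless all of w's bits happen to be set
```

Added items are always reported as present, so the only possible error is a false positive.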
Count-min sketch
Frequency histogram estimation with chance of over-counting
“foo” and “bar” are each hashed by h1, h2, h3 into arrays A1, A2, A3:
A1: +1 +1
A2: +1 +1
A3: +2 (“foo” and “bar” collide)
count(“foo”) = min(1, 1, 2) = 1
More hashes / arrays ⇒ reduced chance of overcounting
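A small count-min sketch in the same spirit, with one counter array per hash function. The depth, width and the salted SHA-1 hashing are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth: int = 3, width: int = 256):
        self.depth = depth    # number of hash functions / arrays
        self.width = width    # counters per array
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One (row, column) cell per array, from a row-salted hash.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.rows[row][col] += count

    def count(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
cms.add("foo")
cms.add("bar")
print(cms.count("foo"))  # at least 1; more only if "bar" collides in every row
```

The estimate never undercounts. It overcounts only when other items collide with the queried item in every array.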
Random hyperplanes
Locality-sensitive hashing for approximate nearest neighbours
Hash(Item1) = 011
Hash(Item2) = 001
[Diagram: Item1 and Item2 separated by angle θ, with random hyperplanes h1, h2, h3]
As the cosine distance decreases, the probability of a hash match increases.
Bitwise Hamming distance correlates with cosine distance.
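A rough sketch of random-hyperplane hashing. The hyperplanes are drawn from a Gaussian with a fixed seed, and the vectors and function names are invented for the example. The angle estimate relies on the standard result that a random hyperplane separates two vectors with probability θ/π.

```python
import math
import random


def make_hyperplanes(num_bits: int, dim: int, seed: int = 0):
    """One random normal vector per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """Bit i records which side of hyperplane i the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b) -> int:
    return sum(x != y for x, y in zip(a, b))


def estimated_angle(sig_a, sig_b) -> float:
    """Fraction of differing bits, scaled to an angle in radians."""
    return math.pi * hamming(sig_a, sig_b) / len(sig_a)


planes = make_hyperplanes(num_bits=512, dim=3)
item1 = [1.0, 0.0, 0.0]
item2 = [1.0, 1.0, 0.0]   # 45 degrees away from item1

sig1 = lsh_signature(item1, planes)
sig2 = lsh_signature(item2, planes)
print(hamming(sig1, sig2), "of", len(sig1), "bits differ")
print(estimated_angle(sig1, sig2), "vs true angle", math.pi / 4)
```

More bits give a tighter angle estimate, at the cost of longer signatures.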
Feature hashing
High-dimensional machine learning without feature dictionary
“reduce” “the” “size” “of” “your” “feature” “vector” “with” “this” “one” “weird” “old” “trick”
h(“reduce”) = 9
h(“the”) = 3
h(“size”) = 1
. . .
[Diagram: each hashed index receives +1 in the feature vector]
Effect of collisions on overall classification accuracy is surprisingly small!
Multiple hashes, or 1-bit “sign hash”, can reduce collision effects if necessary.
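A minimal feature-hashing sketch using the slide's token list. The bucket count, the salted SHA-1 hashes and the `signed` flag are assumptions for illustration, so the indices will not match the h(…) values shown on the slide.

```python
import hashlib


def _hash(token: str, salt: str) -> int:
    digest = hashlib.sha1(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, num_buckets: int = 16, signed: bool = False):
    """Map tokens straight to vector indices, with no feature dictionary."""
    vector = [0] * num_buckets
    for token in tokens:
        index = _hash(token, "index") % num_buckets
        sign = 1
        if signed and _hash(token, "sign") % 2:
            sign = -1   # 1-bit sign hash: colliding tokens may cancel out
        vector[index] += sign
    return vector


tokens = "reduce the size of your feature vector with this one weird old trick".split()
print(hash_features(tokens))
print(hash_features(tokens, signed=True))
```

The vector length is fixed up front, so unseen tokens at prediction time need no special handling; they simply hash into an existing bucket.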
Thanks for listening
And some further reading…
Great ebook available free from:
http://infolab.stanford.edu/~ullman/mmds.html

