PROBABILISTIC DATA STRUCTURES
                                Jim Duey
                             Lonocloud.com
                                @jimduey
                            http://clojure.net




Wednesday, August 1, 2012
WHAT IS A DATA STRUCTURE?



              It is a ‘structure’ that holds ‘data’, allowing you to extract
              information.

              Data gets added to the structure.

              Queries of various sorts are used to extract information.




INSPIRATION



              Ilya Katsov

              https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

WORD OF CAUTION


              Many probabilistic data structures use hashing

              Java’s hashCode is not safe across multiple processes

              “Java's hashCode is not safe for distributed systems”

              http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
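
One way to sidestep the hashCode problem is to hash the bytes of a value with an algorithm whose output is fully specified. The sketch below is only an illustration: the class `StableHash` and its methods are hypothetical names, and CRC32 is picked purely because it ships with the JDK, not because the talk recommends it. A stronger hash with better distribution would usually be preferable.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper (not from the talk): a hash that depends only on the bytes hashed.
class StableHash {
    // CRC32 of the UTF-8 bytes; the result is the same in every JVM and process.
    static long crc32(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // unsigned 32-bit value held in a long
    }

    // Map a value to a position in 0 .. m-1.
    static int position(String value, int m) {
        return (int) (crc32(value) % m);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(crc32("123456789")));
        System.out.println(position("Thing 1", 12));
    }
}
```

Because the input is an explicit byte sequence, two processes on different machines agree on the position as long as they agree on the encoding.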




PROBABILISTIC


              Query may return a wrong answer

              The answer is ‘good enough’

              Uses a fraction of the resources, e.g. memory or CPU cycles

HOW MANY ITEMS?


              If you have a large collection of ‘things’ ...

              And there are some duplicates ...

              And you want to know how many unique things there are.




LINEAR COUNTING
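     import java.util.BitSet;

     class LinearCounter {
         final int m;        // m is a design parameter: the number of bits
         final BitSet mask;

         LinearCounter(int m) {
             this.m = m;
             this.mask = new BitSet(m);
         }

         void add(Object value) {
             // get an index for value in 0 .. m-1; floorMod keeps it
             // non-negative even when hashCode() is negative
             int position = Math.floorMod(value.hashCode(), m);

             mask.set(position);
         }
     }
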
LINEAR COUNTING
              [Figure: Thing 1 through Thing 4 are each passed to add(), which
              sets one bit in a 12-bit mask: 1 0 0 0 1 0 0 0 0 0 1 0]

LINEAR COUNTING
    class LinearCounter {
        BitSet mask = new BitSet(m); // m is a design parameter
      ...
    }
      Question: How big is m?

LINEAR COUNTING
              Load factor = n / m

              n is the number of unique items expected
              m is the size of the bit mask

              If the load factor is well below 1: few collisions, so the
              number of bits set is close to the cardinality.

LINEAR COUNTING
              Load factor = n / m

              n is the number of unique items expected
              m is the size of the bit mask

              If the load factor is very high, e.g. 100: all bits are set, so
              there is no information about cardinality.

LINEAR COUNTING
              Load factor = n / m

              n is the number of unique items expected
              m is the size of the bit mask

              If the load factor is higher than 1, but not too high: many
              collisions, but some relationship might exist between the number
              of bits set and the cardinality.

LINEAR COUNTING

              Estimating the number of unique members of the collection


         n = - m * ln ((m - w) / m)
         n is the estimated cardinality
         m is the size of the bit map
         w is the number of 1s in the bit map
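
The estimate can be computed directly from the bit map. The sketch below is a minimal illustration, assuming `hashCode()` is an acceptable hash for a single process; the class name `LinearCounterSketch` and the `estimate()` method are hypothetical names, not part of the talk.

```java
import java.util.BitSet;

// Hypothetical sketch of a linear counter with the estimate n = -m * ln((m - w) / m).
class LinearCounterSketch {
    private final int m;        // size of the bit map
    private final BitSet mask;

    LinearCounterSketch(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    void add(Object value) {
        // floorMod keeps the index in 0 .. m-1 even for negative hash codes
        mask.set(Math.floorMod(value.hashCode(), m));
    }

    double estimate() {
        int w = mask.cardinality();          // number of 1s in the bit map
        if (w == m) {
            return Double.POSITIVE_INFINITY; // every bit set: no information left
        }
        return -m * Math.log((m - w) / (double) m);
    }

    public static void main(String[] args) {
        LinearCounterSketch counter = new LinearCounterSketch(1024);
        for (int i = 0; i < 500; i++) {
            counter.add("thing-" + i);
            counter.add("thing-" + i); // a duplicate sets the same bit again
        }
        System.out.println(counter.estimate());
    }
}
```

Adding a duplicate never changes the bit map, which is why the structure counts unique items rather than total items.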




LINEAR COUNTING
    class LinearCounter {
        BitSet mask = new BitSet(m); // m is a design parameter
      ...
    }
      Question: How big is m?
      m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)


      For a standard error of 0.01:
      On the order of 1M unique values: m = 154 Kbit, n/m = 6.5
      On the order of 10M unique values: m = 1.1 Mbit, n/m = 9
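
Because m appears on both sides of the inequality, it cannot be solved for m in one step. The sketch below searches for the smallest m that satisfies it. The binary search is an assumption of this example, not the method from the talk; it relies on the right-hand side shrinking as m grows. The names `LinearCounterSizing`, `bound` and `minBits` are hypothetical.

```java
// Hypothetical sizing helper for a linear counter.
class LinearCounterSizing {
    // Right-hand side of: m > max(5, 1 / (stdErr * n / m) ** 2) * (e ** (n / m) - n / m - 1)
    static double bound(double n, double m, double stdErr) {
        double t = n / m; // load factor
        double a = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return a * (Math.exp(t) - t - 1.0);
    }

    // Smallest m with m > bound(n, m, stdErr), found by doubling and then binary search.
    static long minBits(long n, double stdErr) {
        long hi = 1;
        while (hi <= bound(n, hi, stdErr)) {
            hi *= 2; // grow until the inequality holds
        }
        long lo = 1;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (mid > bound(n, mid, stdErr)) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        System.out.println(minBits(1_000_000L, 0.01));
        System.out.println(minBits(10_000_000L, 0.01));
    }
}
```

For a standard error of 0.01 the results land close to the rounded figures quoted on the slide, roughly 155 Kbit for 1M unique values and roughly 1.1 Mbit for 10M.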

LINEAR COUNTING


              “Linear-Time Probabilistic Counting Algorithm for
              Database Applications”

              Use the table in the paper to find the bit map size.

              Check out Ilya’s blog post for some nice graphs.

LINEAR COUNTING
              [Figure: Thing 1 through Thing 4 mapped onto a 12-bit mask:
              1 0 0 0 1 0 0 0 0 0 1 0]

              [Figure: Thing 1 through Thing 4 mapped onto a 12-bit mask with
              more bits set: 1 0 1 0 1 1 0 0 1 0 1 1]

BLOOM FILTER



              If you have a large collection of ‘things’ ...

              And you want to know if some thing is in the collection.




BLOOM FILTER
              [Figure: Thing 1 through Thing 4 mapped onto a 12-bit mask:
              1 0 1 0 1 1 0 0 1 0 1 1]

BLOOM FILTER
              [Figure: ‘Other thing’ looked up in the 12-bit mask:
              1 0 1 0 1 1 0 0 1 0 1 1]

BLOOM FILTER
              [Figure: ‘Missing thing’ looked up in the 12-bit mask:
              1 0 1 0 1 1 0 0 1 0 1 1]
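
The add and lookup steps pictured above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the k hash functions are simulated by double hashing (h1 + i * h2) on top of `hashCode()`, which is a swapped-in technique and not something the talk prescribes, and the names `BloomFilterSketch` and `mightContain` are hypothetical.

```java
import java.util.BitSet;

// Hypothetical Bloom filter sketch: m bits, k simulated hash functions.
class BloomFilterSketch {
    private final int m;      // number of bits in the filter
    private final int k;      // number of hash functions
    private final BitSet bits;

    BloomFilterSketch(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Index chosen by the i-th simulated hash function (double hashing, an assumption).
    private int index(Object value, int i) {
        int h1 = value.hashCode();
        int h2 = (Integer.reverse(h1) * 0x9E3779B1) | 1; // derived second hash, forced odd
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(Object value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    // false: definitely never added. true: probably added (false positives are possible).
    boolean mightContain(Object value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 16, 7);
        filter.add("Thing 1");
        System.out.println(filter.mightContain("Thing 1"));
        System.out.println(filter.mightContain("Missing thing"));
    }
}
```

An added item always tests positive, because every bit it maps to was set by `add`. An item that was never added can still test positive when other items happen to have set all of its bits.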

BLOOM FILTER
              How big to make ‘m’ and ‘k’?

                    ‘m’ is the number of bits in the filter

                    ‘k’ is the number of separate hash functions

     m = - (n * ln p) / (ln 2) ** 2


     n is the number of distinct items to be stored
     p is the probability of a false positive
     For n = 1M and p = .01:
     m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB

BLOOM FILTER
              How big to make ‘m’ and ‘k’?

                    ‘m’ is the number of bits in the filter

                    ‘k’ is the number of separate hash functions

     m = - (n * ln p) / (ln 2) ** 2
     k = m / n * ln 2


     k = 9.6M / 1M * 0.69 ≈ 6.64, rounded up to 7 hash functions
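
The two formulas translate directly into code. The sketch below reproduces the worked numbers for n = 1M and p = .01; the class and method names are hypothetical, and rounding both results up is an assumption that follows the slide's choice of 7 hash functions.

```java
// Hypothetical sizing helper for a Bloom filter.
class BloomSizing {
    // m = -(n * ln p) / (ln 2) ** 2, rounded up to a whole number of bits
    static long bits(long n, double p) {
        return (long) Math.ceil(-(n * Math.log(p)) / Math.pow(Math.log(2), 2));
    }

    // k = m / n * ln 2, rounded up to a whole number of hash functions
    static int hashFunctions(long m, long n) {
        return (int) Math.ceil((double) m / n * Math.log(2));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long m = bits(n, 0.01);
        System.out.println(m + " bits, about " + (m / 8 / 1000) + " KB");
        System.out.println(hashFunctions(m, n) + " hash functions");
    }
}
```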



BLOOM FILTER

              You can’t query a Bloom filter for cardinality

              You can’t remove an item once it’s been added

              Many variants of the Bloom filter exist, some of which address
              these issues

HASH FUNCTIONS
              How to find many hash functions?

                    “Out of one, many”

                    Make the size of your bit mask a power of 2

                    By masking off bit fields, you can get multiple hash values
                    from a single hash function.

                    A 16-bit hash will cover a 65 Kbit index (2 ** 16 = 65,536 bits)

                    A 512-bit hash will give 32 16-bit hashes
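
The bit-field idea can be sketched as follows. The choice of SHA-512 as the 512-bit hash is an assumption of this example (the talk only says "512 bit hash"), and it assumes the JDK in use provides that algorithm. The name `HashSplitter` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: one 512-bit digest split into 32 indices of 16 bits each.
class HashSplitter {
    static int[] indices(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-512")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            int[] result = new int[digest.length / 2]; // 64 bytes -> 32 fields
            for (int i = 0; i < result.length; i++) {
                int high = digest[2 * i] & 0xFF;
                int low = digest[2 * i + 1] & 0xFF;
                result[i] = (high << 8) | low; // 0 .. 65535, an index into a 65 Kbit mask
            }
            return result;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 is not available in this JDK", e);
        }
    }

    public static void main(String[] args) {
        int[] idx = indices("Thing 1");
        System.out.println(idx.length + " indices, first = " + idx[0]);
    }
}
```

With a bit mask of 2 ** 16 bits, each 16-bit field is already a valid index, so no modulo step is needed.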

COUNT-MIN SKETCH
              When you want to know how many times each item occurs in a
              collection.

COUNT-MIN SKETCH
              [Figure: a grid of counters, w columns wide and d rows deep.
              Adding ‘Thing 1’ increments (+1) one counter in each row.]

      Each box is a counter.
      Each row is indexed by a corresponding hash function.

COUNT-MIN SKETCH
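              [Figure: the same w-by-d grid of counters. Looking up
              ‘Some thing’ reads one counter in each row; the four values
              read are labelled a, b, c and d.]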

      Estimated frequency for ‘Some thing’ is min(a, b, c, d).
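
A minimal sketch of the add and estimate steps is shown below. It assumes the d row hashes can be simulated by mixing the row number into `hashCode()`; that mixing step is an ad hoc choice for illustration, not the scheme from the talk, and the class name `CountMinSketchExample` is hypothetical.

```java
// Hypothetical count-min sketch: d rows of w counters.
class CountMinSketchExample {
    private final int w;            // counters per row
    private final int d;            // rows, one per hash function
    private final long[][] counts;

    CountMinSketchExample(int w, int d) {
        this.w = w;
        this.d = d;
        this.counts = new long[d][w];
    }

    // Column for a value in a given row (ad hoc per-row mixing, an assumption).
    private int column(Object value, int row) {
        int h = value.hashCode() ^ (row * 0x9E3779B1);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, w);
    }

    // Increment one counter in every row.
    void add(Object value) {
        for (int row = 0; row < d; row++) {
            counts[row][column(value, row)]++;
        }
    }

    // Minimum over the rows; collisions can only inflate a counter, never shrink it.
    long estimate(Object value) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, counts[row][column(value, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketchExample sketch = new CountMinSketchExample(64, 4);
        for (int i = 0; i < 5; i++) {
            sketch.add("Some thing");
        }
        sketch.add("Other thing");
        System.out.println(sketch.estimate("Some thing"));
    }
}
```

Taking the minimum picks the row with the fewest collisions, so the estimate is never below the true count.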


COUNT-MIN SKETCH
              How big to make ‘w’ and ‘d’?

                    ‘w’ is the number of counters per hash function

                    limits the magnitude of the error

                    ‘d’ is the number of separate hash functions

                    controls the probability that the estimation error is
                    greater than that limit

COUNT-MIN SKETCH

                            estimation error <= 2 * n / w
    with probability >= 1 - (1 / 2) ** d


    n = total number of items counted
    w = number of counters per hash function
    d = number of separate hash functions


         Works best on skewed data.
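
As a rough worked example, the sketch below evaluates both quantities for chosen w and d. It reads the bound as follows: a single row overshoots by more than 2 * n / w with probability at most 1/2, so all d rows overshoot together with probability at most (1 / 2) ** d. The class and method names are hypothetical.

```java
// Hypothetical helper for count-min sketch error bounds.
class CountMinBounds {
    // Largest overcount that is still considered within the limit: 2 * n / w
    static double errorLimit(long n, int w) {
        return 2.0 * n / w;
    }

    // Upper bound on the probability that the overcount exceeds the limit: (1 / 2) ** d
    static double failureProbability(int d) {
        return Math.pow(0.5, d);
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // total number of items counted
        System.out.println("error limit: " + errorLimit(n, 2000));
        System.out.println("chance of exceeding it: at most " + failureProbability(4));
    }
}
```

Widening the rows (larger w) tightens the error limit, while adding rows (larger d) makes it less likely that the limit is exceeded.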



RESOURCES
              https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

              http://blog.aggregateknowledge.com/

              http://lkozma.net/blog/sketching-data-structures/

              https://sites.google.com/site/countminsketch/home

              “PyCon 2011: Handling ridiculous amounts of data with
              probabilistic data structures”


A Survey of Probabilistic Data Structures - StampedeCon 2012

  • 1. PROBABILISTIC DATA STRUCTURES Jim Duey Lonocloud.com @jimduey http://clojure.net Wednesday, August 1, 2012
  • 2. WHAT IS A DATA STRUCTURE? It is a ‘structure’ that holds ‘data’, allowing you to extract information. Data gets added to the structure. Queries of various sorts are used to extract information.
  • 3. INSPIRATION Ilya Katsov: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  • 4. WORD OF CAUTION Many probabilistic data structures use hashing. Java’s hashCode is not safe across multiple processes. See “Java's hashCode is not safe for distributed systems”: http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
  • 5. PROBABILISTIC A query may return a wrong answer. The answer is ‘good enough’. It uses a fraction of the resources, i.e. memory or CPU cycles.
  • 6. HOW MANY ITEMS? If you have a large collection of ‘things’ ... And there are some duplicates ... And you want to know how many unique things there are.
  • 7. LINEAR COUNTING
      class LinearCounter {
          BitSet mask = new BitSet(m); // m is a design parameter

          void add(Object value) {
              // get an index for value between 0 .. m - 1;
              // floorMod avoids a negative index when hashCode() is negative
              int position = Math.floorMod(value.hashCode(), m);
              mask.set(position);
          }
      }
  • 8. LINEAR COUNTING [Figure: Thing 1 through Thing 4 are each passed to add(), which sets bits in the bit mask.]
  • 9. LINEAR COUNTING
      class LinearCounter {
          BitSet mask = new BitSet(m); // m is a design parameter
          ...
      }
      Question: How big is m?
  • 10. LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is < 1: few collisions, so the number of bits set is approximately the cardinality.
  • 11. LINEAR COUNTING Load factor = n / m. If the load factor is very high (say, 100): all bits are set, so the mask carries no information about cardinality.
  • 12. LINEAR COUNTING Load factor = n / m. If the load factor is higher than 1, but not too high: many collisions, but some relationship might exist between the number of bits set and the cardinality.
  • 13. LINEAR COUNTING Estimating the number of unique members of the collection: n = - m * ln ((m - w) / m), where m is the size of the bit map and w is the number of 1s in the bit map (the bit set's cardinality).
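
The add step and the estimate formula can be put together in a minimal sketch. The constructor, the `estimate` method name, and the saturation check are assumptions added for illustration and are not from the slides. The sketch still hashes with `hashCode()`, so the earlier word of caution applies: it is only suitable inside a single process.

```java
import java.util.BitSet;

class LinearCounter {
    private final int m;        // size of the bit mask, a design parameter
    private final BitSet mask;

    LinearCounter(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    void add(Object value) {
        // floorMod keeps the index in 0 .. m - 1 even when hashCode() is negative
        mask.set(Math.floorMod(value.hashCode(), m));
    }

    // n = -m * ln((m - w) / m), written here as m * ln(m / (m - w)), which is the same value
    double estimate() {
        int w = mask.cardinality();   // number of 1s in the bit mask
        if (w == m) {
            // every bit is set: the mask is saturated and carries no information
            return Double.POSITIVE_INFINITY;
        }
        return m * Math.log(m / (double) (m - w));
    }
}
```

Adding the same value twice sets the same bit, so duplicates do not change the estimate.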
  • 14. LINEAR COUNTING
      class LinearCounter {
          BitSet mask = new BitSet(m); // m is a design parameter
          ...
      }
      Question: How big is m?
      m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)
      For a standard error of 0.01: on the order of 1M unique values, m = 154 Kbit, n/m = 6.5; on the order of 10M unique values, m = 1.1 Mbit, n/m = 9.
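
Because m appears on both sides of that bound (through n / m), one way to use it is to pick a load factor and evaluate the right-hand side. The following is a rough sketch of that arithmetic only; the class and method names are hypothetical.

```java
class LinearCounterSizing {
    // Right-hand side of the bound, with t = n / m:
    // max(5, 1 / (stdErr * t)^2) * (e^t - t - 1)
    static double minBits(double loadFactor, double stdErr) {
        double a = Math.max(5.0, 1.0 / Math.pow(stdErr * loadFactor, 2));
        return a * (Math.exp(loadFactor) - loadFactor - 1.0);
    }
}
```

Calling `LinearCounterSizing.minBits(6.5, 0.01)` and `LinearCounterSizing.minBits(9.0, 0.01)` gives values of the same order as the 154 Kbit and 1.1 Mbit figures quoted for a standard error of 0.01.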
  • 15. LINEAR COUNTING “Linear-Time Probabilistic Counting Algorithm for Database Applications”: use its table to find the bit map size. Check out Ilya’s blog post for some nice graphs.
  • 16. LINEAR COUNTING [Figure: Thing 1 through Thing 4 mapped onto the bit mask.]
  • 17. [Figure: the same four things mapped onto the bit mask, with more bits set than on the previous slide.]
  • 18. BLOOM FILTER If you have a large collection of ‘things’ ... And you want to know if some thing is in the collection.
  • 19. BLOOM FILTER [Figure: Thing 1 through Thing 4 mapped onto the filter's bit mask.]
  • 20. BLOOM FILTER [Figure: ‘Other thing’ checked against the filter's bit mask.]
  • 21. BLOOM FILTER [Figure: ‘Missing thing’ checked against the filter's bit mask.]
  • 22. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter; ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2, where n is the number of distinct items to be stored and p is the probability of a false positive. Example: m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB
  • 23. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter; ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2 and k = (m / n) * ln 2. Example: k = 9.6M / 1M * 0.69 = 6.64, rounded to 7 hash functions.
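
The two sizing formulas can be checked with a few lines of arithmetic. This is a sketch with hypothetical class and method names; rounding m up and k to the nearest integer are choices made here, not something the slides prescribe.

```java
class BloomSizing {
    // m = -(n * ln p) / (ln 2)^2, rounded up to a whole number of bits
    static long bits(long n, double p) {
        return (long) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole hash function (at least 1)
    static int hashes(long m, long n) {
        return (int) Math.max(1, Math.round((double) m / n * Math.log(2)));
    }
}
```

For n = 1,000,000 and p = 0.01, `bits` returns roughly 9.6 million and `hashes` returns 7, in line with the worked example.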
  • 24. BLOOM FILTER You can’t query a Bloom filter for cardinality. You can’t remove an item once it’s been added. There are many variants of the Bloom filter, some of which address these issues.
  • 25. HASH FUNCTIONS How to find many hash functions? “Out of one, many”. Make the size of your bit mask a power of 2. By masking off bit fields, you can get multiple hash values from a single hash function. A 16-bit hash covers a 65,536-bit (64 Kbit) index; a 512-bit hash yields 32 16-bit hashes.
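
As a rough illustration of the “out of one, many” idea, the sketch below cuts one 512-bit digest into 16-bit fields and uses them as the k indexes of a Bloom filter with a 65,536-bit mask. The choice of SHA-512 via `java.security.MessageDigest`, the class name, and the method names are assumptions for illustration; the slides do not name a hash function.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

class SplitHashBloomFilter {
    private static final int BITS = 1 << 16;   // power of 2, so a 16-bit field indexes it exactly
    private final BitSet mask = new BitSet(BITS);
    private final int k;                        // number of 16-bit fields used, at most 32

    SplitHashBloomFilter(int k) {
        if (k < 1 || k > 32) {
            throw new IllegalArgumentException("k must be between 1 and 32");
        }
        this.k = k;
    }

    // Cut one 512-bit digest into k 16-bit indexes.
    private int[] indexes(String item) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-512")
                    .digest(item.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // two consecutive bytes form one unsigned 16-bit value
            result[i] = ((digest[2 * i] & 0xFF) << 8) | (digest[2 * i + 1] & 0xFF);
        }
        return result;
    }

    void add(String item) {
        for (int index : indexes(item)) {
            mask.set(index);
        }
    }

    // false means definitely absent; true means present or a false positive
    boolean mightContain(String item) {
        for (int index : indexes(item)) {
            if (!mask.get(index)) {
                return false;
            }
        }
        return true;
    }
}
```

Hashing the item's bytes with a fixed digest, rather than calling `hashCode()`, also sidesteps the cross-process problem raised in the word of caution.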
  • 26. COUNT-MIN SKETCH When you want to know how many of each item there are in a collection.
  • 27. COUNT-MIN SKETCH [Figure: a grid of counters, w wide and d deep; adding ‘Thing 1’ adds 1 to one counter in each row.] Each box is a counter. Each row is indexed by a corresponding hash function.
  • 28. COUNT-MIN SKETCH [Figure: ‘Some thing’ maps to one counter in each row, with values a, b, c and d.] Estimated frequency for ‘Some thing’ is min(a, b, c, d).
  • 29. COUNT-MIN SKETCH How big to make ‘w’ and ‘d’? ‘w’ is the number of counters per hash function; it limits the magnitude of the error. ‘d’ is the number of separate hash functions; it controls the probability that the estimate's error exceeds that limit.
  • 30. COUNT-MIN SKETCH error-limit <= 2 * n / w. The estimate stays within this limit with probability at least 1 - (1 / 2) ** d, so the probability that the limit is exceeded is at most (1 / 2) ** d. n = total number of items counted; w = number of counters per hash function; d = number of separate hash functions. Works best on skewed data.
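
A minimal sketch of the structure described on these slides: d rows of w counters, one hash function per row, and a query that takes the minimum over the rows. The per-row mixing of `hashCode()` in `index` is an assumed stand-in for d separate hash functions and is not from the slides; the class and method names are hypothetical too.

```java
class CountMinSketch {
    private final int w;            // counters per row
    private final int d;            // rows, one hash function each
    private final long[][] counts;

    CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.counts = new long[d][w];
    }

    // Stand-in for d separate hash functions: mix the item's hashCode with the row number.
    private int index(Object item, int row) {
        int h = item.hashCode() * 0x9E3779B1 + row * 0x85EBCA6B;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, w);   // always in 0 .. w - 1
    }

    void add(Object item) {
        for (int row = 0; row < d; row++) {
            counts[row][index(item, row)]++;
        }
    }

    // Minimum over the rows; collisions can only inflate a counter,
    // so the result is never below the true count.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, counts[row][index(item, row)]);
        }
        return min;
    }
}
```

Taking the minimum works because every row's counter for an item holds its true count plus whatever collided with it, so the smallest counter is the least polluted one.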
  • 31. RESOURCES
      https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
      http://blog.aggregateknowledge.com/
      http://lkozma.net/blog/sketching-data-structures/
      https://sites.google.com/site/countminsketch/home
      “PyCon 2011: Handling ridiculous amounts of data with probabilistic data structures”