This document provides an overview of randomization techniques used in data structures, including hash tables, bloom filters, and skip lists. It discusses how each of these structures implements a dictionary abstract data type (ADT) with operations like insert, delete, and lookup. For hash tables, it describes direct addressing, chaining to resolve collisions, and analysis showing expected constant-time performance. Bloom filters are explained as a space-efficient probabilistic data structure for set membership with possible false positives. Skip lists are randomized, layered linked-list structures that behave like balanced search trees, providing expected logarithmic time for dictionary operations.
The document discusses hashing techniques for implementing dictionaries. It begins by introducing the direct addressing method, which stores key-value pairs directly in an array indexed by keys. However, this wastes space when there are fewer unique keys than array slots. Hashing addresses this by using a hash function to map keys to array slots, reducing storage needs. However, collisions can occur when different keys hash to the same slot. The document then covers various techniques for handling collisions, including chaining, linear probing, quadratic probing, and double hashing. It also discusses properties of good hash functions such as minimizing collisions between related keys and producing uniformly random mappings.
Hash tables provide an effective way to implement dictionaries by mapping keys to table slots using hash functions. Collisions occur when multiple keys hash to the same slot, but can be resolved through chaining or open addressing. With chaining, elements that hash to the same slot are stored in a linked list at that slot. Under the simple uniform hashing assumption, and provided the load factor is kept constant as the table grows (e.g., by resizing), all dictionary operations take expected O(1) time with chaining. Good hash functions satisfy the uniform hashing assumption and distribute keys evenly among slots.
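A separate-chaining table of the kind described here can be sketched in a few lines (a minimal illustration; the class and method names are our own, not from the source):

```python
# Minimal separate-chaining hash table: each slot holds a list (chain)
# of (key, value) pairs that hashed to that slot.
class ChainedHashTable:
    def __init__(self, num_slots=11):
        self.slots = [[] for _ in range(num_slots)]

    def _bucket(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

t = ChainedHashTable()
t.insert("apple", 1)
t.insert("pear", 2)
print(t.lookup("apple"))   # 1
```

All three operations walk one chain, so their expected cost is proportional to the load factor, which is where the expected O(1) bound comes from.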
The document discusses hash tables and how they can be used to implement dictionaries. Hash tables map keys to table slots using a hash function in order to store and retrieve items efficiently. Collisions may occur when multiple keys hash to the same slot. Chaining is described as a method to handle collisions by storing colliding items in linked lists attached to table slots. Analysis shows that with simple uniform hashing, dictionary operations like search, insert and delete take expected O(1) time on average.
This document discusses hashing and its applications. It begins by describing dictionary operations like search, insert, delete, minimum, maximum, and their implementations using different data structures. It then focuses on hash tables, explaining how they work using hash functions to map keys to array indices. The document discusses collisions, good and bad hash functions, and performance of hash table operations. It also describes how hashing can be used for substring pattern matching and other applications like document fingerprinting.
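The substring pattern matching application mentioned above can be illustrated with a Rabin-Karp-style rolling hash (a sketch; the base and modulus below are illustrative choices, not from the source):

```python
# Rabin-Karp substring search: slide a window over the text, updating the
# window's hash in O(1) per step, and verify characters only on a hash match.
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    p_hash = t_hash = 0
    high = pow(base, m - 1, mod)          # weight of the window's leading char
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Character check on a hash match rules out false positives.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return -1

print(rabin_karp("hashing is fun", "is"))   # 8
```

The same fingerprinting idea (hash a chunk, compare fingerprints, verify on match) underlies the document-fingerprinting application.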
This document discusses space and time tradeoffs in algorithms, specifically using hashing. Hashing is an effective technique for implementing dictionaries by distributing keys among an array using a hash function. Collisions can occur when two keys hash to the same slot; they are resolved using separate chaining (linked lists) or open addressing techniques like linear probing, quadratic probing, and double hashing. Hashing provides fast search time of O(1) on average with proper parameter tuning, but can suffer from clustering issues depending on the probing method used.
This document discusses hash tables and their implementation. It begins by explaining that dictionaries map keys to values like arrays but allow any type for the key. It then discusses how dictionaries can be implemented using balanced binary search trees or hash tables. Hash tables offer constant-time lookup by using a hash function to map keys to array indices, but collisions must be handled when two keys hash to the same index. Various collision resolution techniques are presented like separate chaining, linear probing, and quadratic probing. The document provides examples and guidelines for choosing a hash function and table size to minimize collisions.
Dictionaries and hash tables are data structures for storing and searching key-value pairs. A dictionary abstract data type (ADT) allows searching, inserting, and deleting items by key. A log file dictionary implements the ADT using an unsorted sequence, with O(1) insertion but O(n) search and removal. A hash table maps keys to array indices using a hash function for expected O(1) operations, handling collisions by chaining or probing. Common hash functions include modular arithmetic and polynomial evaluation.
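The polynomial-evaluation hash and modular compression mentioned here can be sketched as follows (base 33 and table size 97 are common illustrative choices, not values from the source):

```python
# Polynomial hash code via Horner's rule, then compression to a table index.
def poly_hash(s, base=33):
    h = 0
    for ch in s:
        h = h * base + ord(ch)   # h = ord(s[0])*base^(k-1) + ... + ord(s[k-1])
    return h

def compress(code, table_size=97):
    return code % table_size     # table_size is usually chosen prime

idx = compress(poly_hash("key"))
print(0 <= idx < 97)   # True
```

In practice the polynomial is evaluated modulo a machine word or a prime to avoid big-integer growth; the sketch above keeps it exact for clarity.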
Hashing using different methods and techniques
Here are the steps to solve this problem:
1) Construct the open hash table with separate chaining for the keys 30, 20, 56, 75, 31, 19 using h(K) = K mod 11:
Index: 0 1 2 3 4 5 6 7 8 9 10
Keys: the hash values are h(30)=8, h(20)=9, h(56)=1, h(75)=9, h(31)=9, h(19)=8, so slot 1 holds 56, slot 8 holds the chain 30 → 19, and slot 9 holds the chain 20 → 75 → 31; all other slots are empty.
2) The largest number of key comparisons for a successful search is 3 (key 31 is third in the chain at slot 9).
3) The average number of key comparisons for a successful search is (1 + 1 + 1 + 2 + 3 + 2)/6 = 10/6 ≈ 1.7.
4) Construct the closed hash table using linear probing for the same keys. Inserting in the given order and probing forward (with wraparound) on each collision gives:
Index: 0 1 2 3 4 5 6 7 8 9 10
Keys: 31 56 19 - - - - - 30 20 75
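The exercise can be reproduced programmatically; a minimal sketch:

```python
# Reproduce the exercise: separate chaining, then linear probing, h(K) = K mod 11.
keys = [30, 20, 56, 75, 31, 19]
m = 11

# Separate chaining: append each key to the list at its home slot.
chains = [[] for _ in range(m)]
for k in keys:
    chains[k % m].append(k)
print([(i, c) for i, c in enumerate(chains) if c])
# [(1, [56]), (8, [30, 19]), (9, [20, 75, 31])]

# Successful-search comparisons: position of each key within its chain.
comparisons = [chains[k % m].index(k) + 1 for k in keys]
print(max(comparisons), round(sum(comparisons) / len(keys), 2))   # 3 1.67

# Linear probing: step forward (wrapping around) until an empty slot is found.
table = [None] * m
for k in keys:
    i = k % m
    while table[i] is not None:
        i = (i + 1) % m
    table[i] = k
print(table)   # [31, 56, 19, None, None, None, None, None, 30, 20, 75]
```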
Hashing is a technique used to store and retrieve data efficiently. It involves using a hash function to map keys to integers that are used as indexes in an array. This improves searching time from O(n) to O(1) on average. However, collisions can occur when different keys map to the same index. Collision resolution techniques like chaining and open addressing are used to handle collisions. Chaining resolves collisions by linking keys together in buckets, while open addressing resolves them by probing to find the next empty index. Both approaches allow basic dictionary operations like insertion and search to be performed in O(1) average time when load factors are low.
Here are the key points comparing hash-based search and binary search on a best case basis:
- Binary search has a best-case time complexity of O(1), achieved when the middle element examined on the first iteration is the target.
- Hash-based search has a best case time complexity of O(1) if there is no collision during probing. The target element can be found directly by indexing into the hash table.
- Collisions degrade the performance of hash-based search. The fewer collisions, the closer it gets to the best case.
- The load factor α (number of elements/number of slots) impacts the number of collisions - a lower load factor results in fewer collisions on average.
The document discusses hashing techniques for mapping keys to indices in a hash table. It covers:
1) Hash functions which map keys to hash codes using techniques like polynomial accumulation and then compress the codes to indices using modulo.
2) Open addressing techniques like linear probing and double hashing which store elements directly in the hash table by probing for empty slots when collisions occur.
3) Analysis showing that, under uniform hashing, open addressing needs on average at most 1/(1-α) probes for an unsuccessful search and (1/α) ln(1/(1-α)) probes for a successful search, where α is the load factor. Chaining has average cost Θ(1+α).
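As a rough illustration, the standard uniform-hashing probe bounds (1/(1-α) for an unsuccessful open-addressing search, (1/α) ln(1/(1-α)) for a successful one, about 1 + α/2 comparisons for a successful chained search) can be tabulated:

```python
# Expected probe counts as functions of load factor α under uniform hashing
# (standard textbook bounds; open addressing requires α < 1).
import math

def open_addr_unsuccessful(a):
    return 1 / (1 - a)

def open_addr_successful(a):
    return (1 / a) * math.log(1 / (1 - a))

def chaining_successful(a):
    return 1 + a / 2

for a in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={a}: open-unsucc={open_addr_unsuccessful(a):.2f}, "
          f"open-succ={open_addr_successful(a):.2f}, "
          f"chain-succ={chaining_successful(a):.2f}")
```

The table makes the qualitative point concrete: open addressing degrades sharply as α approaches 1, while chaining degrades only linearly.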
This document summarizes techniques for hashing, including hash functions, open addressing, linear probing, and double hashing. It discusses choosing good hash functions that distribute keys uniformly. For open addressing, it describes linear probing, which probes successive table locations when collisions occur, and double hashing, which uses two hash functions to determine probe positions. Analysis shows the expected number of probes is related to the load factor for these probing techniques.
Hashing is a common technique for implementing dictionaries that provides constant-time operations by mapping keys to table positions using a hash function, though collisions require resolution strategies like separate chaining or open addressing. Popular hash functions include division and cyclic shift hashing to better distribute keys across buckets. Both open hashing using linked lists and closed hashing using linear probing can provide average constant-time performance for dictionary operations depending on load factor.
Hashing In Data Structure
1. The document discusses hashing techniques for implementing dictionaries and search data structures, including separate chaining and closed hashing (linear probing, quadratic probing, and double hashing).
2. Separate chaining uses a linked list at each index to handle collisions, while closed hashing searches for empty slots using a collision resolution function.
3. Double hashing is described as the strongest of these closed hashing techniques, as it uses a second hash function to spread probe sequences out and avoid the clustering that linear and quadratic probing suffer from.
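A double-hashing insertion sketch (the table size and the second hash function below are illustrative choices, not from the source; a prime table size ensures every probe sequence visits all slots):

```python
# Double hashing: the probe step size comes from a second hash function,
# so keys with the same home slot follow different probe sequences.
M = 11   # prime table size, so any step in 1..M-1 is coprime with M

def h1(k):
    return k % M

def h2(k):
    return 1 + (k % (M - 1))      # never 0, so the probe always advances

def insert(table, k):
    i, step = h1(k), h2(k)
    while table[i] is not None:
        i = (i + step) % M        # probe sequence: h1, h1+h2, h1+2*h2, ...
    table[i] = k

table = [None] * M
for k in (30, 20, 56, 75, 31, 19):
    insert(table, k)
print(table)   # [31, 56, None, None, 75, None, None, 19, 30, 20, 75 is at 10? no:
# [31, 56, None, None, 75, None, None, 19, 30, 20, None]
```

Compared with linear probing on the same keys, the colliding keys 75, 31, and 19 scatter instead of piling up next to slot 9.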
Unit – VIII discusses searching and hashing techniques. It describes linear and binary searching algorithms. Linear search has O(n) time complexity while binary search has O(log n) time complexity for sorted arrays. Hashing is also introduced as a technique to allow O(1) access time by mapping keys to array indices via a hash function. Separate chaining and open addressing like linear probing and quadratic probing are described as methods to handle collisions during hashing.
The document discusses different hashing techniques used to store and retrieve data in hash tables. It begins by motivating the need for hashing through the limitations of linear and binary search. It then defines hashing as a process to map keys of arbitrary size to fixed size values. Popular hash functions discussed include division, folding, and mid-square methods. The document also covers collision resolution techniques for hash tables, including open addressing methods like linear probing, quadratic probing and double hashing as well as separate chaining using linked lists.
The document summarizes several linear sorting algorithms, including bucket sort, counting sort, general bucket sort, and radix sort. Counting sort runs in O(n+k) time and O(k) space, where k is the range of integer keys, and is stable. Radix sort uses a stable sorting algorithm like counting sort to sort based on each digit of d-digit numbers, resulting in O(d(n+k)) time for sorting n numbers with d digits in the range [1,k].
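A counting sort sketch matching this description (stable, O(n + k) time for integer keys in range(k)):

```python
# Counting sort: count occurrences, take prefix sums, then place elements
# from right to left so equal keys keep their relative order (stability).
def counting_sort(a, k):
    count = [0] * k
    for x in a:
        count[x] += 1
    for v in range(1, k):         # count[v] = number of elements <= v
        count[v] += count[v - 1]
    out = [0] * len(a)
    for x in reversed(a):         # reversed pass keeps the sort stable
        count[x] -= 1
        out[count[x]] = x
    return out

print(counting_sort([4, 1, 3, 1, 0], 5))   # [0, 1, 1, 3, 4]
```

Radix sort applies this routine once per digit, least significant first, relying on the stability of each pass.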
Describes the Map data structure, its methods, and its implementation using hash tables and linked lists, along with their running times. Covers hash table components (bucket array and hash function) and collision handling strategies: separate chaining, linear probing, quadratic probing, double hashing.
Ordered Maps and corresponding binary search
Presentation on important topics: DAG, TRIE, Hashing
Directed acyclic graph (DAG) is used to represent the flow of values between basic blocks of code. A DAG is a directed graph with no cycles. It is generated during intermediate code generation. DAGs determine common subexpressions and the flow of names and computed values between blocks of code. An algorithm is described to construct a DAG by creating nodes for operands and adding edges between nodes and operator nodes. Examples show how expressions are represented by a DAG. The complexity of a DAG depends on its width and depth. Applications of DAGs include determining common subexpressions, names used in blocks, and which statements' values may be used outside blocks.
Hashing is a technique for storing data in an array-like structure that allows for fast lookup of data based on keys. It improves upon linear and binary search by avoiding the need to keep data sorted. Hashing works by using a hash function to map keys to array indices, with collisions resolved through techniques like separate chaining or open addressing. Separate chaining uses linked lists at each index while open addressing resolves collisions by probing to alternate indices like linear, quadratic, or double hashing.
The document discusses different data structures for implementing dictionaries, including arrays, linked lists, hash tables, binary trees, and B-trees. It focuses on hashing as a technique for implementing dictionaries. Hashing maps keys to table slots using a hash function to achieve fast average-case search, insertion, and deletion times of O(1). Collisions are resolved using chaining, where each slot contains a linked list. The load factor affects performance, with lower load factors resulting in faster operations.
Hashing is a technique used to map data of arbitrary size to values of fixed size. It allows for fast lookup of data in near constant time. Common applications include dictionaries, databases, and search engines. Hashing works by applying a hash function to a key that returns an index value. Collisions occur when different keys hash to the same index, and must be resolved through techniques like separate chaining or open addressing.
This document discusses different searching methods like sequential, binary, and hashing. It defines searching as finding an element within a list. Sequential search searches lists sequentially until the element is found or the end is reached, with efficiency of O(n) in worst case. Binary search works on sorted arrays by eliminating half of remaining elements at each step, with efficiency of O(log n). Hashing maps keys to table positions using a hash function, allowing searches, inserts and deletes in O(1) time on average. Good hash functions uniformly distribute keys and generate different hashes for similar keys.
The document discusses arrays and recursion. It introduces binary search and merge sort algorithms. Binary search reduces the search space in half at each step by comparing the target value to the middle element of a sorted array. Merge sort divides the array into halves, sorts each half recursively, and then merges the sorted halves. Binary search runs in O(log n) and merge sort in O(n log n), providing more efficient alternatives to linear search and selection sort respectively.
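The half-interval idea behind binary search can be sketched as:

```python
# Iterative binary search on a sorted list: halve the candidate range each step.
def binary_search(a, target):
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1          # target can only be in the right half
        else:
            hi = mid - 1          # target can only be in the left half
    return -1

print(binary_search([2, 5, 8, 12, 16, 23], 16))   # 4
```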
The document discusses hashing techniques and collision resolution methods for hash tables. It covers:
- Hashing maps keys of variable length to smaller fixed-length values using a hash function. Hash tables use hashing to efficiently store and retrieve key-value pairs.
- Collisions occur when two keys hash to the same value. Common collision resolution methods are separate chaining, where each slot points to a linked list, and open addressing techniques like linear probing and double hashing.
- Bucket hashing groups hash table slots into buckets to improve performance. Records are hashed to buckets and stored sequentially within buckets, or in an overflow bucket if a bucket is full. This reduces disk accesses when the hash table is stored on disk.
The document discusses sorting algorithms. It defines sorting as arranging a list of records in a certain order based on their keys. Some key points made:
- Sorting is important as it enables efficient searching and other tasks. Common sorting algorithms include selection sort, insertion sort, mergesort, quicksort, and heapsort.
- The complexity of sorting in general is Θ(n log n) but some special cases allow linear time sorting. Internal sorting happens in memory while external sorting handles data too large for memory.
- Applications of sorting include searching, finding closest pairs of numbers, checking for duplicates, and calculating frequency distributions. Sorting also enables efficient algorithms for computing medians, convex hulls, and other problems.
Originally, the sets of numbers to be stored in a computer system were small and had a limited range, so a simple table T[0, 1, ..., m − 1], called a direct-address table, could be used to store them. As the situation became more and more complex, a new idea emerged:
Definition
An associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of tuples {(key, value)}.
This can be seen in the example of dictionaries in any spoken language. The problem became harder when the range of possible values for the keys in the tuples became unbounded. A new type of data structure was therefore needed to avoid the sparsity problem in the data: the hash table.
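A direct-address table of the kind described above can be sketched as follows (class and method names are our own; it assumes keys are drawn from a small universe {0, ..., m − 1}):

```python
# Direct-address table: keys index the array directly, so there are no
# collisions and every operation is O(1) - at the cost of O(m) space.
class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m   # one slot per possible key

    def insert(self, key, value):
        self.slots[key] = value

    def search(self, key):
        return self.slots[key]

    def delete(self, key):
        self.slots[key] = None

t = DirectAddressTable(16)
t.insert(7, "seven")
print(t.search(7))   # seven
```

The sparsity problem is visible directly in the constructor: the table allocates a slot for every key in the universe, used or not, which is exactly what hashing was introduced to avoid.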
Hashing is a technique used to store and retrieve data efficiently. It involves using a hash function to map keys to integers that are used as indexes in an array. This improves searching time from O(n) to O(1) on average. However, collisions can occur when different keys map to the same index. Collision resolution techniques like chaining and open addressing are used to handle collisions. Chaining resolves collisions by linking keys together in buckets, while open addressing resolves them by probing to find the next empty index. Both approaches allow basic dictionary operations like insertion and search to be performed in O(1) average time when load factors are low.
Here are the key points comparing hash-based search and binary search on a best case basis:
- Binary search has a best case time complexity of O(1) as it directly locates the target element by comparing the middle element on each iteration.
- Hash-based search has a best case time complexity of O(1) if there is no collision during probing. The target element can be found directly by indexing into the hash table.
- Collisions degrade the performance of hash-based search. The fewer collisions, the closer it gets to the best case.
- The load factor α (number of elements/number of slots) impacts the number of collisions - a lower load factor results in fewer collisions on
The document discusses hashing techniques for mapping keys to indices in a hash table. It covers:
1) Hash functions which map keys to hash codes using techniques like polynomial accumulation and then compress the codes to indices using modulo.
2) Open addressing techniques like linear probing and double hashing which store elements directly in the hash table by probing for empty slots when collisions occur.
3) Analysis showing open addressing has average probe costs of O(1/α) for unsuccessful searches and O(1/α) ln(1/(1-α)) for successful searches, where α is the load factor. Chaining has constant O(1) costs.
This document summarizes techniques for hashing, including hash functions, open addressing, linear probing, and double hashing. It discusses choosing good hash functions that distribute keys uniformly. For open addressing, it describes linear probing, which probes successive table locations when collisions occur, and double hashing, which uses two hash functions to determine probe positions. Analysis shows the expected number of probes is related to the load factor for these probing techniques.
Hashing is a common technique for implementing dictionaries that provides constant-time operations by mapping keys to table positions using a hash function, though collisions require resolution strategies like separate chaining or open addressing. Popular hash functions include division and cyclic shift hashing to better distribute keys across buckets. Both open hashing using linked lists and closed hashing using linear probing can provide average constant-time performance for dictionary operations depending on load factor.
Hashing In Data Structure Download PPT icajiwol341
1. The document discusses hashing techniques for implementing dictionaries and search data structures, including separate chaining and closed hashing (linear probing, quadratic probing, and double hashing).
2. Separate chaining uses a linked list at each index to handle collisions, while closed hashing searches for empty slots using a collision resolution function.
3. Double hashing is described as the best closed hashing technique, as it uses a second hash function to spread keys out and avoid clustering completely.
Unit – VIII discusses searching and hashing techniques. It describes linear and binary searching algorithms. Linear search has O(n) time complexity while binary search has O(log n) time complexity for sorted arrays. Hashing is also introduced as a technique to allow O(1) access time by mapping keys to array indices via a hash function. Separate chaining and open addressing like linear probing and quadratic probing are described as methods to handle collisions during hashing.
The document discusses different hashing techniques used to store and retrieve data in hash tables. It begins by motivating the need for hashing through the limitations of linear and binary search. It then defines hashing as a process to map keys of arbitrary size to fixed size values. Popular hash functions discussed include division, folding, and mid-square methods. The document also covers collision resolution techniques for hash tables, including open addressing methods like linear probing, quadratic probing and double hashing as well as separate chaining using linked lists.
The document summarizes several linear sorting algorithms, including bucket sort, counting sort, general bucket sort, and radix sort. Counting sort runs in O(n+k) time and O(k) space, where k is the range of integer keys, and is stable. Radix sort uses a stable sorting algorithm like counting sort to sort based on each digit of d-digit numbers, resulting in O(d(n+k)) time for sorting n numbers with d digits in the range [1,k].
Describes Map data structure, its methods and implementation using Hash tables & linked list along with their running time. Hash table components, bucket Array and hash function. Collision handing strategies: Separate chaining, Linear probing, quadratic probing, double hashing.
Ordered Maps and corresponding binary search
presentation on important DAG,TRIE,Hashing.pptxjainaaru59
Directed acyclic graph (DAG) is used to represent the flow of values between basic blocks of code. A DAG is a directed graph with no cycles. It is generated during intermediate code generation. DAGs determine common subexpressions and the flow of names and computed values between blocks of code. An algorithm is described to construct a DAG by creating nodes for operands and adding edges between nodes and operator nodes. Examples show how expressions are represented by a DAG. The complexity of a DAG depends on its width and depth. Applications of DAGs include determining common subexpressions, names used in blocks, and which statements' values may be used outside blocks.
Hashing is a technique for storing data in an array-like structure that allows for fast lookup of data based on keys. It improves upon linear and binary search by avoiding the need to keep data sorted. Hashing works by using a hash function to map keys to array indices, with collisions resolved through techniques like separate chaining or open addressing. Separate chaining uses linked lists at each index while open addressing resolves collisions by probing to alternate indices like linear, quadratic, or double hashing.
The document discusses different data structures for implementing dictionaries, including arrays, linked lists, hash tables, binary trees, and B-trees. It focuses on hashing as a technique for implementing dictionaries. Hashing maps keys to table slots using a hash function to achieve fast average-case search, insertion, and deletion times of O(1). Collisions are resolved using chaining, where each slot contains a linked list. The load factor affects performance, with lower load factors resulting in faster operations.
Hashing is a technique used to map data of arbitrary size to values of fixed size. It allows for fast lookup of data in near constant time. Common applications include dictionaries, databases, and search engines. Hashing works by applying a hash function to a key that returns an index value. Collisions occur when different keys hash to the same index, and must be resolved through techniques like separate chaining or open addressing.
Originally, the sets of numbers stored in a computer system were neither very large nor drawn from a wide range, so a simple table T[0, 1, ..., m − 1], called a direct-address table, sufficed to store them. As applications grew more complex, a new abstraction emerged:
Definition
An associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of tuples {(key, value)}.
Dictionaries in any spoken language are a familiar example. The problem becomes harder when the range of possible key values is unbounded; a new data structure, the hash table, is then needed to avoid sparsity in the stored data.
3. Dictionary ADT
A dictionary ADT implements the following operations
• Insert(x): puts the item x into the dictionary
• Delete(x): deletes the item x from the dictionary
• IsIn(x): returns true iff the item x is in the dictionary
4. Dictionary ADT
• Frequently, we think of the items being stored in the dictionary as keys
• The keys typically have records associated with them, which are carried around with the key but not used by the ADT implementation
• Thus we can implement functions like:
– Insert(k,r): puts the item (k,r) into the dictionary if the key k is not already there, otherwise returns an error
– Delete(k): deletes the item with key k from the dictionary
– Lookup(k): returns the item (k,r) if k is in the dictionary, otherwise returns null
5. Implementing Dictionaries
• The simplest way to implement a dictionary ADT is with a linked list
• Let l be a linked list data structure; assume we have the following operations defined for l:
– head(l): returns a pointer to the head of the list
– next(p): given a pointer p into the list, returns a pointer to the next element in the list if such exists, null otherwise
– previous(p): given a pointer p into the list, returns a pointer to the previous element in the list if such exists, null otherwise
– key(p): given a pointer into the list, returns the key value of that item
– record(p): given a pointer into the list, returns the record value of that item
6. At-Home Exercise
Implement a dictionary with a linked list
• Q1: Write the operation Lookup(k) which returns a pointer
to the item with key k if it is in the dictionary or null otherwise
• Q2: Write the operation Insert(k,r)
• Q3: Write the operation Delete(k)
• Q4: For a dictionary with n elements, what is the runtime of all of these operations for the linked list data structure?
• Q5: Describe how you would use this dictionary ADT to count the number of occurrences of each word in an online book.
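One possible solution sketch for the exercise in Python; the class and method names are illustrative, not from the slides. Every operation scans the list from the head, which hints at the answer to Q4.

```python
class Node:
    def __init__(self, key, record):
        self.key = key
        self.record = record
        self.next = None

class LinkedListDict:
    def __init__(self):
        self.head = None

    def lookup(self, k):
        """Q1: return the item with key k, or None. Scans the list: O(n)."""
        p = self.head
        while p is not None:
            if p.key == k:
                return (p.key, p.record)
            p = p.next
        return None

    def insert(self, k, r):
        """Q2: insert (k, r) at the head if k is absent, else signal an error."""
        if self.lookup(k) is not None:
            raise KeyError(f"duplicate key {k}")
        node = Node(k, r)
        node.next = self.head
        self.head = node

    def delete(self, k):
        """Q3: unlink the node with key k, if present. O(n)."""
        prev, p = None, self.head
        while p is not None:
            if p.key == k:
                if prev is None:
                    self.head = p.next
                else:
                    prev.next = p.next
                return
            prev, p = p, p.next
```

For Q5, one would insert each word as a key with a counter as its record, doing a lookup per word; with this list implementation that costs O(n) per word.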
7. Dictionaries
• This linked list implementation of dictionaries is very slow
• Q: Can we do better?
• A: Yes, with hash tables, AVL trees, etc
8. Hash Tables
Hash Tables implement the Dictionary ADT, namely:
• Insert(x) - O(1) expected time, Θ(n) worst case
• Lookup(x) - O(1) expected time, Θ(n) worst case
• Delete(x) - O(1) expected time, Θ(n) worst case
9. Direct Addressing
• Suppose the universe of keys is U = {0, 1, . . . , m − 1}, where m is not too large
• Assume no two elements have the same key
• We use an array T[0..m − 1] to store the keys
• Slot k contains the element with key k
10. Direct Address Functions
DA-Search(T,k){ return T[k];}
DA-Insert(T,x){ T[key(x)] = x;}
DA-Delete(T,x){ T[key(x)] = NIL;}
Each of these operations takes O(1) time
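The three operations above can be sketched in Python, with items represented as (key, record) pairs so that key(x) is x[0] (a representation chosen here for illustration). Each operation is a single array access.

```python
NIL = None  # empty-slot marker, as in the pseudocode

def da_make(m):
    """Allocate the direct-address table T[0..m-1], all slots empty."""
    return [NIL] * m

def da_search(T, k):
    return T[k]

def da_insert(T, x):
    T[x[0]] = x          # key(x) = x[0]

def da_delete(T, x):
    T[x[0]] = NIL
```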
11. Direct Addressing Problem
• If the universe U is large, storing the array T may be impractical
• Also, much space can be wasted in T if the number of objects stored is small
• Q: Can we do better?
• A: Yes, we can trade time for space
12. Hash Tables
• “Key” Idea: An element with key k is stored in slot h(k), where h is a hash function mapping U into the set {0, . . . , m − 1}
• Main problem: Two keys can now hash to the same slot
• Q: How do we resolve this problem?
• A1: Try to prevent it by hashing keys to “random” slots and making the table large enough
• A2: Chaining
• A3: Open Addressing
13. Chained Hash
In chaining, all elements that hash to the same slot are put in a linked list.
CH-Insert(T,x){Insert x at the head of list T[h(key(x))];}
CH-Search(T,k){search for elem with key k in list T[h(k)];}
CH-Delete(T,x){delete x from the list T[h(key(x))];}
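A minimal Python sketch of these three operations. Python's built-in hash is used here as a stand-in for the hash function h, and Python lists stand in for the per-slot linked lists; the class name is illustrative.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return hash(k) % self.m               # stand-in for h(k)

    def insert(self, k, r):
        """CH-Insert: prepend (k, r) to the chain at slot h(k). O(1)."""
        self.slots[self._h(k)].insert(0, (k, r))

    def search(self, k):
        """CH-Search: scan the chain at slot h(k). Expected O(1 + alpha)."""
        for item in self.slots[self._h(k)]:
            if item[0] == k:
                return item
        return None

    def delete(self, k):
        """CH-Delete: remove the item with key k from its chain."""
        i = self._h(k)
        self.slots[i] = [it for it in self.slots[i] if it[0] != k]
```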
14. Analysis
• CH-Insert and CH-Delete take O(1) time if the list is doubly linked and there are no duplicate keys
• Q: How long does CH-Search take?
• A: It depends. In particular, it depends on the load factor α = n/m (i.e., the average number of elements in a list)
15. CH-Search Analysis
• Worst-case analysis: everyone hashes to one slot, so Θ(n)
• For the average case, make the simple uniform hashing assumption: any given element is equally likely to hash into any of the m slots, independently of the other elements
• Let ni be a random variable giving the length of the list at the i-th slot
• Then the time to do a search for key k is 1 + nh(k)
16. CH-Search Analysis
• Q: What is E(nh(k))?
• A: We know that h(k) is uniformly distributed among {0, . . . , m − 1}
• Thus, E(nh(k)) = Σ_{i=0}^{m−1} (1/m) ni = n/m = α
17. Hash Functions
• Want each key to be equally likely to hash to any of the m slots, independently of the other keys
• The key idea is to use the hash function to “break up” any patterns that might exist in the data
• We will always assume a key is a natural number (we can, e.g., easily convert strings to natural numbers)
18. Division Method
• h(k) = k mod m
• Want m to be a prime number that is not too close to a power of 2
• Why? This reduces collisions when there is periodicity in the keys inserted
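A small illustration of why a prime modulus helps with periodic keys; the prime 701 and the period 100 are illustrative choices, not from the slides.

```python
def h(k, m=701):
    """Division-method hash; 701 is a prime not close to a power of 2."""
    return k % m

# Keys with period 100 spread over distinct slots when m = 701,
# but all collide into slot 0 when m = 100.
periodic_keys = [100 * i for i in range(10)]
prime_slots = {h(k) for k in periodic_keys}
bad_slots = {k % 100 for k in periodic_keys}
```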
19. Hash Tables Wrapup
Hash Tables implement the Dictionary ADT, namely:
• Insert(x) - O(1) expected time, Θ(n) worst case
• Lookup(x) - O(1) expected time, Θ(n) worst case
• Delete(x) - O(1) expected time, Θ(n) worst case
20. Bloom Filters
• A randomized data structure for representing a set. Implements:
• Insert(x)
• IsMember(x)
• Allows false positives but requires very little space
• Used frequently in: databases, networking problems, p2p networks, packet routing
21. Bloom Filters
• Have m slots, k hash functions, n elements; assume the hash functions are all independent
• Each slot stores 1 bit; initially all bits are 0
• Insert(x): Set the bits in slots h1(x), h2(x), . . . , hk(x) to 1
• IsMember(x): Return yes iff the bits in slots h1(x), h2(x), . . . , hk(x) are all 1
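These two operations can be sketched in Python. The slides posit k independent hash functions; deriving them here from one SHA-256 digest via double hashing is an assumption made purely to get a self-contained, runnable example.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m          # m slots, each one bit, initially 0

    def _hashes(self, x):
        """Assumed scheme: derive k indices h1(x)..hk(x) from one digest."""
        d = hashlib.sha256(repr(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for i in self._hashes(x):
            self.bits[i] = 1

    def is_member(self, x):
        return all(self.bits[i] for i in self._hashes(x))
```

Note that a Bloom filter never gives a false negative: every inserted element always passes IsMember.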
22. Analysis Sketch
• m slots, k hash functions, n elements; assume the hash functions are all independent
• Then P(a fixed slot is still 0) = (1 − 1/m)^{kn}
• Useful fact from the Taylor expansion of e^{−x}: for x ≤ 1,
e^{−x}(1 − x^2) ≤ 1 − x ≤ e^{−x}
23. Analysis
• Thus we have, to a good approximation,
P(a fixed slot is still 0) = (1 − 1/m)^{kn} ≈ e^{−kn/m}
• Let p = e^{−kn/m} and let ρ be the fraction of 0 bits after the n elements are inserted; then
P(false positive) = (1 − ρ)^k ≈ (1 − p)^k
• The approximation holds because ρ is very close to p (by a martingale argument beyond the scope of this class)
24. Analysis
• Want to minimize (1 − p)^k, which is equivalent to minimizing g = k ln(1 − p)
• Trick: since p = e^{−kn/m}, we have k = −(m/n) ln p, so g = −(m/n) ln(p) ln(1 − p)
• By symmetry, this is minimized when p = 1/2, or equivalently k = (m/n) ln 2
• The false positive rate is then (1/2)^k ≈ (0.6185)^{m/n}
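A quick numeric check of this analysis; the choice of 10 bits per element (n = 1000, m = 10000) is illustrative. At the optimum k = (m/n) ln 2 ≈ 6.93, the approximate false-positive rate matches (0.6185)^{m/n}, and nearby values of k do worse.

```python
import math

def false_positive_rate(n, m, k):
    """The slides' approximation (1 - e^{-kn/m})^k = (1 - p)^k."""
    return (1 - math.exp(-k * n / m)) ** k

n, m = 1000, 10000            # 10 bits per element
k_opt = (m / n) * math.log(2) # about 6.93; round to an integer in practice
rate = false_positive_rate(n, m, 7)
```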
25. Tricks
• Can get the union of two sets by taking the bitwise-or of the bit vectors of the corresponding Bloom filters
• Can easily halve the size of a Bloom filter: assuming the size is a power of 2, just bitwise-or the first and second halves together
• Can approximate the size of the intersection of two sets: the inner product of the bit vectors associated with the Bloom filters is a good approximation to this
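The first two tricks can be sketched directly on the bit vectors (plain Python lists of 0/1 bits here, for illustration). Both filters must use the same size and the same hash functions for the union to be meaningful.

```python
def bloom_union(bits_a, bits_b):
    """Bitwise-or of two same-size filters represents the set union."""
    return [a | b for a, b in zip(bits_a, bits_b)]

def bloom_halve(bits):
    """Fold a power-of-two-size filter in half by or-ing its two halves;
    lookups in the halved filter then reduce each index mod m/2."""
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]
```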
26. Extensions
• Counting Bloom filters handle deletions: instead of storing bits, store integers in the slots. Insertion increments, deletion decrements.
• Bloomier filters also allow data to be inserted in the filter: similar functionality to hash tables but less space, with the possibility of false positives.
27. Skip List
• Enables insertions and searches for ordered keys in O(log n) expected time
• A very elegant randomized data structure: simple to code, but the analysis is subtle
• Skip lists guarantee that, with high probability, all the major operations take O(log n) time (e.g., Find-Max, find the i-th element, etc.)
28. Skip List
• A skip list is basically a collection of doubly-linked lists L1, L2, . . . , Lx, for some integer x
• Each list has special head and tail nodes; the keys of these nodes are assumed to be −MAXNUM and +MAXNUM respectively
• The keys in each list are in sorted (non-decreasing) order
29. Skip List
• Every node is stored in the bottom list
• For each node in the bottom list, we flip a coin over and over until we get tails. For each heads, we make a duplicate of the node.
• The duplicates are stacked up in levels, and the nodes on each level are strung together in sorted linked lists
• Each node v stores a search key (key(v)), a pointer to its next lower copy (down(v)), and a pointer to the next node in its level (right(v))
31. Search
• To do a search for a key x, we start at the leftmost node L in the highest level
• We then scan through each level as far as we can without passing the target value x, and then proceed down to the next level
• The search ends either when we find the key x or when we fail to find x on the lowest level
32. Search
SkipListFind(x, L){
v = L;
while (v != NULL) and (Key(v) != x){
if (Key(Right(v)) > x)
v = Down(v);
else
v = Right(v);
}
return v;
}
34. Insert
p is a constant between 0 and 1, typically p = 1/2; let rand() return a random value between 0 and 1
Insert(k){
First call Search(k), let pLeft be the leftmost elem <= k in L_1
Insert k in L_1, to the right of pLeft
i = 2;
while (rand() <= p){
insert k in the appropriate place in L_i;
i = i + 1;
}
}
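The search and insert pseudocode can be combined into a compact runnable sketch. This version keeps only right/down pointers (enough for search and insert; the doubly-linked structure used for deletion is omitted), uses ±infinity for the −MAXNUM/+MAXNUM sentinels, and all class and helper names are illustrative.

```python
import random

class SkipNode:
    def __init__(self, key):
        self.key = key
        self.right = None   # next node on this level
        self.down = None    # next lower copy of this node

class SkipList:
    def __init__(self):
        self.head = SkipNode(float("-inf"))        # -MAXNUM sentinel
        self.head.right = SkipNode(float("inf"))   # +MAXNUM sentinel

    def _descend(self, key):
        """On each level walk right while the next key is <= key, then go
        down; return the rightmost node <= key on every level, top first."""
        path, v = [], self.head
        while v is not None:
            while v.right.key <= key:
                v = v.right
            path.append(v)
            v = v.down
        return path

    def search(self, key):
        v = self._descend(key)[-1]   # rightmost elem <= key in L_1
        return v if v.key == key else None

    def insert(self, key, p=0.5):
        path = self._descend(key)
        below, level = None, 0
        while True:
            if level < len(path):
                left = path[-1 - level]            # pLeft on this level
            else:
                # promoted past the top: add a new level with fresh sentinels
                new_head = SkipNode(float("-inf"))
                new_head.right = SkipNode(float("inf"))
                new_head.down = self.head
                self.head = new_head
                left = new_head
            node = SkipNode(key)
            node.right, left.right = left.right, node
            node.down = below
            below = node
            level += 1
            if random.random() > p:                # tails: stop promoting
                break
```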
35. Deletion
• Deletion is very simple
• First do a search for the key to be deleted
• Then delete that key from all the lists it appears in, from the bottom up, making sure to “zip up” the lists after the deletion
36. Analysis
• Intuitively, each level of the skip list has about half the number of nodes of the previous level, so we expect the total number of levels to be about O(log n)
• Similarly, each time we add another level, we cut the search time in half, except for a constant overhead
• So after O(log n) levels, we would expect a search time of O(log n)
• We will now formalize these two intuitive observations
37. Height of Skip List
• For some key i, let Xi be the maximum height of i in the skip list
• Q: What is the probability that Xi ≥ 2 log n?
• A: If p = 1/2, we have:
P(Xi ≥ 2 log n) = (1/2)^{2 log n} = 1/(2^{log n})^2 = 1/n^2
• Thus the probability that a particular key i achieves height 2 log n is 1/n^2
38. Height of Skip List
• Q: What is the probability that any key achieves height 2 log n?
• A: We want
P(X1 ≥ 2 log n or X2 ≥ 2 log n or . . . or Xn ≥ 2 log n)
• By a union bound, this probability is no more than
P(X1 ≥ 2 log n) + P(X2 ≥ 2 log n) + · · · + P(Xn ≥ 2 log n)
• which equals
Σ_{i=1}^{n} 1/n^2 = n/n^2 = 1/n
39. Height of Skip List
• This probability gets small as n gets large
• In particular, the probability that the height of the skip list exceeds 2 log n is o(1)
• If an event occurs with probability 1 − o(1), we say that it occurs with high probability
• Key Point: The height of a skip list is O(log n) with high probability.
40. In-Class Exercise Trick
A trick for computing expectations of discrete positive random variables:
• Let X be a discrete r.v. that takes on values from 1 to n. Then
E(X) = Σ_{i=1}^{n} P(X ≥ i)
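The identity can be checked exactly on a concrete example. Here X is uniform on {1, ..., n} (an illustrative choice), so P(X = k) = 1/n and P(X ≥ i) = (n − i + 1)/n; exact rational arithmetic confirms the two sums agree.

```python
from fractions import Fraction

n = 6
# Direct expectation: sum of k * P(X = k).
direct = sum(Fraction(k, n) for k in range(1, n + 1))
# Tail-sum form: sum of P(X >= i) for i = 1..n.
tails = sum(Fraction(n - i + 1, n) for i in range(1, n + 1))
```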
42. In-Class Exercise
Q: How much memory do we expect a skip list to use up?
• Let Xi be the number of lists that element i is inserted in
• Q: What is P(Xi ≥ 1), P(Xi ≥ 2), P(Xi ≥ 3)?
• Q: What is P(Xi ≥ k) for general k?
• Q: What is E(Xi)?
• Q: Let X = Σ_{i=1}^{n} Xi. What is E(X)?
43. Search Time
• It's easier to analyze the search time if we imagine running the search backwards
• Imagine that we start at the found node v in the bottommost list and trace the path backwards to the top leftmost sentinel, L
• This gives us the length of the search path from L to v, which is the time required to do the search
45. Backward Search
• For every node v in the skip list, Up(v) exists with probability 1/2. So for purposes of analysis, SLFBack is the same as the following algorithm:
FlipWalk(v){
while (v != L){
if (COINFLIP == HEADS)
v = Up(v);
else
v = Left(v);
}}
46. Analysis
• For this algorithm, the expected number of heads is exactly the same as the expected number of tails
• Thus the expected run time of the algorithm is twice the expected number of upward jumps
• Since we already know that the number of upward jumps is O(log n) with high probability, we can conclude that the expected search time is O(log n)
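A small simulation consistent with this claim: if the backward walk must make h upward jumps (heads), the total number of fair-coin flips, i.e., the walk length, averages 2h. The function name, the value h = 10, and the trial count are illustrative.

```python
import random

def flip_walk_length(up_moves, rng):
    """Flip a fair coin until `up_moves` heads appear; return total flips,
    modeling the length of the backward search walk."""
    flips = heads = 0
    while heads < up_moves:
        flips += 1
        if rng.random() < 0.5:
            heads += 1
    return flips

rng = random.Random(0)
trials = 5000
avg = sum(flip_walk_length(10, rng) for _ in range(trials)) / trials
# avg should be close to 2 * 10 = 20
```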