Hashing provides a way to access records in constant time by mapping keys to addresses using a hash function. Collisions occur when different keys map to the same address. Common solutions include spreading out records, using extra memory, or storing multiple records at an address. The distribution of records can be analyzed using mathematical tools like the Poisson distribution to predict collisions and optimize performance. Various hashing methods like double hashing and chaining help resolve collisions.
Indexing is used to speed up access to desired data, e.g., an author catalog in a library.
A search key is an attribute or set of attributes used to look up records in a file; it is unrelated to the keys of the database schema.
An index file consists of records called index entries. An index entry for key k may consist of:
an actual data record (with search key value k);
a pair (k, rid), where rid is a pointer to the actual data record;
a pair (k, bid), where bid is a pointer to a bucket of record pointers.
Index files are typically much smaller than the original file if the actual data records are kept in a separate file.
If the index contains the data records themselves, there is a single file with a special organization.
• Process for heuristic optimization
1. The parser of a high-level query generates an initial internal representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations, based on the access paths available for the files involved in the query.
2. Motivation
• Sequential searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file.
• B-Trees improve on this greatly, providing O(log_k N) access, where k is a measure of the leaf size (i.e., the number of records that can be stored in a leaf).
• What we would like to achieve, however, is O(1) access, which means that no matter how big a file grows, access to a record always takes the same small number of seeks.
• Static hashing techniques can achieve such performance, provided that the file does not grow over time.
3. What is Hashing?
• A hash function is a function h(K) that transforms a key K into an address.
• Hashing is like indexing in that it involves associating a key with a relative record address.
• Hashing, however, differs from indexing in two important ways:
– With hashing, there is no obvious connection between the key and the location.
– With hashing, two different keys may be transformed to the same address.
4. Collisions
• When two different keys produce the same address, there is a collision. The keys involved are called synonyms.
• Coming up with a hash function that avoids collisions is extremely difficult. It is best to simply find ways to deal with them.
• Possible solutions:
– Spread out the records
– Use extra memory
– Put more than one record at a single address.
5. A Simple Hashing Algorithm
• Step 1: Represent the key in numerical form.
• Step 2: Fold and add.
• Step 3: Divide by a prime number and use the remainder as the address.
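The three steps above can be sketched in Python. The two-character chunking, the ASCII encoding, and the table size of 101 are illustrative choices, not prescribed by the slides:

```python
def fold_and_add_hash(key: str, table_size: int = 101) -> int:
    """Hash a string key using the three-step scheme from the slide.

    Step 1: represent the key numerically (ASCII codes, here paired into
    two-character chunks). Step 2: fold the chunks and add them together.
    Step 3: divide by a prime table size and use the remainder.
    """
    total = 0
    # Steps 1 and 2: fold the key into 2-character chunks and add them.
    for i in range(0, len(key), 2):
        value = 0
        for ch in key[i:i + 2]:
            value = value * 100 + ord(ch)  # e.g. "LO" -> 7679
        total += value
    # Step 3: divide by a prime and keep the remainder as the address.
    return total % table_size
```

For example, `fold_and_add_hash("LOWELL")` folds the chunks 7679 + 8769 + 7676 = 24124 and reduces modulo 101 to the address 86.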
6. Hashing Functions and Record Distributions
• Records can be distributed among addresses in different ways: there may be (a) no synonyms (a uniform distribution); (b) only synonyms (the worst case); (c) a few synonyms (what happens with random distributions).
• Perfectly uniform distributions are difficult to obtain and may not be worth searching for.
• Random distributions can be derived easily, but they are not perfect, since they may generate a fair number of synonyms.
• We want better hashing methods.
7. Some Other Hashing Methods
• Though there is no hash function that guarantees better-than-random distributions in all cases, certain improvements are possible by taking into consideration the keys that are being hashed.
• Here are some methods that are potentially better than random:
– Examine keys for a pattern
– Fold parts of the key
– Divide the key by a number
– Square the key and take the middle
– Radix transformation
8. Predicting the Distribution of Records
• When using a random distribution, we can use a number of mathematical tools to obtain conservative estimates of how our hashing function is likely to behave.
• Using the Poisson function p(x) = ((r/N)^x e^(-(r/N))) / x! applied to hashing, we can conclude that, in general, if there are N addresses and r records, the expected number of addresses with x records assigned to them is N p(x).
9. Predicting Collisions for a Full File
• Suppose you have a hash function that you believe will distribute records randomly, and you want to store 10,000 records in 10,000 addresses.
• How many addresses do you expect to have no records assigned to them?
• How many addresses should have one, two, and three records assigned, respectively?
• How can we reduce the number of overflow records?
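These questions can be answered with a short Python sketch of the Poisson function from the previous slide; the numbers (10,000 records in 10,000 addresses, a packing density of 1.0) come from the example above:

```python
import math

def poisson(r: int, N: int, x: int) -> float:
    """Probability that a given address receives exactly x of the
    r records hashed randomly into N addresses."""
    lam = r / N
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

r, N = 10_000, 10_000  # packing density r/N = 1.0
for x in range(4):
    print(f"addresses with {x} record(s): {N * poisson(r, N, x):.0f}")
```

With r/N = 1, about 3,679 addresses stay empty, 3,679 hold exactly one record, 1,839 hold two, and 613 hold three.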
10. Increasing Memory Space I
• Collisions can be reduced by choosing a good hash function or by using extra memory.
• The question asked here is: how much extra memory should be used to obtain a given rate of collision reduction?
• Definition: The packing density is the ratio of the number of records to be stored (r) to the number of available spaces (N).
• The packing density gives a measure of the amount of space in a file that is used.
11. Increasing Memory Space II
• The Poisson distribution allows us to predict the number of collisions that are likely to occur at a given packing density. We use the Poisson distribution to answer the following questions:
– How many addresses should have no records assigned to them?
– How many addresses should have exactly one record assigned (no synonyms)?
– How many addresses should have one record plus one or more synonyms?
– Assuming that only one record can be assigned to each home address, how many overflow records can be expected?
– What percentage of records should be overflow records?
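As a sketch of how the Poisson distribution answers the last question, the fraction of overflow records at a given packing density can be computed as follows (the truncation at 60 terms is an implementation choice; the tail beyond it is negligible):

```python
import math

def poisson(lam: float, x: int) -> float:
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

def overflow_fraction(density: float) -> float:
    """Fraction of records that overflow when each address holds at
    most one record and the packing density is r/N = density.

    An address receiving x records stores one and overflows x - 1, so
    the overflow per address is the sum over x >= 2 of (x - 1) p(x);
    dividing by the density converts 'per address' into 'per record'."""
    per_address = sum((x - 1) * poisson(density, x) for x in range(2, 60))
    return per_address / density

for d in (0.5, 0.7, 1.0):
    print(f"packing density {d:.0%}: {overflow_fraction(d):.1%} overflow")
```

At 50% packing density about 21% of the records overflow, and at 100% about 37% do, which is why leaving extra memory unused pays off.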
12. Collision Resolution by Progressive Overflow
• How do we deal with records that cannot fit into their home address? A simple approach: progressive overflow, also known as linear probing.
• If a key k1 hashes to the same address a1 as another key k2, then look for the first available address a2 following a1 and place k1 in a2. If the end of the address space is reached, wrap around to its beginning.
• When searching for a key that is not in the file, if the address space is not full, either an empty address will be reached or the search will come back to where it began.
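A minimal in-memory sketch of progressive overflow; the hash function (a simple character sum) and the fixed address space are illustrative stand-ins for a real hashed file:

```python
class ProgressiveOverflowTable:
    """Fixed address space; each slot holds at most one key."""

    def __init__(self, n_addresses: int):
        self.slots = [None] * n_addresses

    def h(self, key: str) -> int:
        return sum(ord(c) for c in key) % len(self.slots)

    def insert(self, key: str) -> int:
        """Place the key at the first available address at or after
        its home address, wrapping around; return that address."""
        home = self.h(key)
        for probe in range(len(self.slots)):
            a = (home + probe) % len(self.slots)  # wrap around the end
            if self.slots[a] is None:
                self.slots[a] = key
                return a
        raise RuntimeError("address space is full")

    def search(self, key: str):
        """Return (address, accesses) or (None, accesses) if absent;
        an empty slot proves the key is not in the file."""
        home = self.h(key)
        for probe in range(len(self.slots)):
            a = (home + probe) % len(self.slots)
            if self.slots[a] is None:
                return None, probe + 1
            if self.slots[a] == key:
                return a, probe + 1
        return None, len(self.slots)  # came back to where it began
```

The second value returned by `search` is the number of disk accesses the probe sequence would cost, i.e., the record's search length.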
13. Search Length when Using Progressive Overflow
• Progressive overflow causes extra searches and thus extra disk accesses.
• If there are many collisions, then many records will be far from "home".
• Definitions: The search length is the number of accesses required to retrieve a record from secondary memory. The average search length is the average number of times you can expect to have to access the disk to retrieve a record.
• Average search length = (total search length) / (total number of records)
14. Storing More than One Record per Address: Buckets
• Definition: A bucket is a block of records sharing the same address that is retrieved in one disk access.
• When a record is to be stored or retrieved, its home bucket address is determined by hashing. When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less often than when each address can hold only one record.
15. Effect of Buckets on Performance
• To compute how densely packed a file is, we need to consider 1) the number of addresses (buckets), N; 2) the number of records we can put at each address (the bucket size), b; and 3) the number of records, r. Then: packing density = r / (bN).
• Though the packing density does not change when we halve the number of addresses and double the bucket size, the expected number of overflows decreases dramatically.
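The effect described above can be checked numerically with the Poisson function: holding the packing density at 60% while increasing the bucket size b sharply reduces the expected fraction of overflow records. The concrete numbers (600 records, density 0.6) are illustrative:

```python
import math

def poisson(lam: float, x: int) -> float:
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

def overflow_fraction(r: int, N: int, b: int) -> float:
    """Fraction of the r records expected to overflow when they are
    hashed randomly into N buckets holding b records each.

    A bucket receiving x records overflows x - b of them when x > b;
    the sum is truncated where the Poisson tail becomes negligible."""
    lam = r / N  # expected records per bucket
    per_bucket = sum((x - b) * poisson(lam, x) for x in range(b + 1, 80))
    return N * per_bucket / r

# Same packing density r/(bN) = 0.6, different bucket sizes:
r = 600
print(f"b=1: {overflow_fraction(r, 1000, 1):.1%} overflow")
print(f"b=2: {overflow_fraction(r,  500, 2):.1%} overflow")
print(f"b=5: {overflow_fraction(r,  200, 5):.1%} overflow")
```

At the same 60% packing density, overflow drops from roughly 25% with one-record addresses to about 14% with b = 2 and well under 10% with b = 5.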
16. Making Deletions
• Deleting a record from a hashed file is more complicated than adding one, for two reasons:
– The slot freed by the deletion must not be allowed to hinder later searches.
– It should be possible to reuse the freed slot for later additions.
• To deal with deletions we use tombstones, i.e., markers indicating that a record once lived at an address but no longer does. Tombstones solve both of the problems caused by deletion.
• Insertion of records is slightly different when tombstones are used.
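A sketch of how tombstones behave in a linearly probed file; the hash function and table size are illustrative. Note how `search` treats a tombstone differently from a truly empty slot, while `insert` may reclaim one:

```python
TOMBSTONE = object()  # marker: "a record once lived here"

class HashedFile:
    def __init__(self, n: int):
        self.slots = [None] * n

    def h(self, key: str) -> int:
        return sum(ord(c) for c in key) % len(self.slots)

    def insert(self, key: str) -> None:
        # A tombstone slot may be reused for a new record ...
        for probe in range(len(self.slots)):
            a = (self.h(key) + probe) % len(self.slots)
            if self.slots[a] is None or self.slots[a] is TOMBSTONE:
                self.slots[a] = key
                return
        raise RuntimeError("address space is full")

    def search(self, key: str):
        # ... but a search must probe past tombstones, stopping only
        # at a truly empty slot, so deletions never hide records.
        for probe in range(len(self.slots)):
            a = (self.h(key) + probe) % len(self.slots)
            if self.slots[a] is None:
                return None
            if self.slots[a] == key:
                return a
        return None

    def delete(self, key: str) -> None:
        a = self.search(key)
        if a is not None:
            self.slots[a] = TOMBSTONE
```

This is the "slightly different" insertion the slide mentions: a careful implementation would finish the search past the tombstone first, to make sure the key is not already stored further along before reclaiming the slot.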
17. Effects of Deletions and Additions on Performance
• After a large number of deletions and additions have taken place, one can expect to find many tombstones occupying places that could be occupied by records whose home address precedes them but that are stored after them.
• This deteriorates average search lengths.
• There are three types of solutions for dealing with this problem: a) local reorganization during deletions; b) global reorganization when the average search length becomes too large; c) use of a different collision resolution algorithm.
18. Other Collision Resolution Techniques
• There are a few variations on random hashing that may improve performance:
– Double hashing: When an overflow occurs, use a second hash function to map the record to its overflow location.
– Chained progressive overflow: Like progressive overflow, except that synonyms are linked together with pointers.
– Chaining with a separate overflow area: Like chained progressive overflow, except that overflow addresses do not occupy home addresses.
– Scatter tables: The hash file contains no records, only pointers to records; i.e., it is an index.
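As an illustration of the first variation, a double-hashing probe sequence can be generated as follows. Both hash functions here are illustrative; what matters is that the step size produced by the second function is never zero and, with a prime table size, is coprime to it, so the sequence visits every address:

```python
def double_hash_probes(key: str, n: int, max_probes=None):
    """Yield the probe sequence for double hashing: the home address
    from h1, then repeated steps of size h2 (always in 1..n-1)."""
    h1 = sum(ord(c) for c in key) % n
    h2 = 1 + (sum(ord(c) * 31 for c in key) % (n - 1))  # never zero
    for i in range(n if max_probes is None else max_probes):
        yield (h1 + i * h2) % n

print(list(double_hash_probes("ab", 11, 4)))  # → [8, 3, 9, 4]
```

Unlike progressive overflow, synonyms of a key do not pile up in consecutive addresses, which reduces the clustering that inflates search lengths.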
19. Pattern of Record Access
• If we have some information about which records are accessed most often, we can optimize their placement so that these records have short search lengths.
• By doing this, we try to decrease the effective average search length even if the nominal average search length remains the same.
• This principle is related to the one used in Huffman encoding.