SlideShare a Scribd company logo
1 of 16
CSE 326: Data Structures
Lecture #14
Bart Niswonger
Summer Quarter 2001
Today’s Outline
• Project
– Rules of competition
– Comments?
•
– Probing continued
– Rehashing
– Extendible Hashing
– Case Study
Cost of a Database Query
I/O to CPU ratio is 300!
Extendible Hashing
• Hashing technique for huge data sets
– optimizes to reduce disk accesses
– each hash bucket fits on one disk block
– better than B-Trees if order is not important
• Table contains
– buckets, each fitting in one disk block, with the
data
– a directory that fits in one disk block used to hash
to the correct bucket
001 010 011 110 111101
Extendible Hash Table
• Directory - entries labeled by k bits & pointer
to bucket with all keys starting with its bits
• Each block contains keys & data matching on
the first j ≤ k bits
000 100
(2)
00001
00011
00100
00110
(2)
01001
01011
01100
(3)
10001
10011
(3)
10101
10110
10111
(2)
11001
11011
11100
11110
directory for k = 3
Inserting (easy case)
001 010 011 110 111101000 100
(2)
00001
00011
00100
00110
(2)
01001
01011
01100
(3)
10001
10011
(3)
10101
10110
10111
(2)
11001
11011
11100
11110
insert(11011)
001 010 011 110 111101000 100
(2)
00001
00011
00100
00110
(2)
01001
01011
01100
(3)
10001
10011
(3)
10101
10110
10111
(2)
11001
11100
11110
Splitting
001 010 011 110 111101000 100
(2)
00001
00011
00100
00110
(2)
01001
01011
01100
(3)
10001
10011
(3)
10101
10110
10111
(2)
11001
11011
11100
11110
insert(11000)
001 010 011 110 111101000 100
(2)
00001
00011
00100
00110
(2)
01001
01011
01100
(3)
10001
10011
(3)
10101
10110
10111
(3)
11000
11001
11011
(3)
11100
11110
Rehashing
01 10 1100
(2)
01101
(2)
10000
10001
10011
10111
(2)
11001
11110
insert(10010)
001 010 011 110 111101000 100
No room to
insert and no
adoption!
Expand
directory
Now, it’s just a normal split.
When Directory Gets Too Large
• Store only pointers to the items
+ (potentially) much smaller M
+ fewer items in the directory
– one extra disk access!
• Rehash
+ potentially better distribution over the buckets
+ fewer unnecessary items in the directory
– can’t solve the problem if there’s simply too much
data
• What if these don’t work?
– use a B-Tree to store the directory!
Rehash of Hashing
• Hashing is a great data structure for storing
unordered data that supports insert, delete & find
• Both separate chaining (open) and open addressing
(closed) hashing are useful
– separate chaining flexible
– closed hashing uses less storage, but performs badly with
load factors near 1
– extendible hashing for very large disk-based data
• Hashing pros and cons
+ very fast
+ simple to implement, supports insert, delete, find
- lazy deletion necessary in open addressing, can waste
storage
- does not support operations dependent on order: min, max,
range
Case Study
• Spelling dictionary
– 30,000 words
– static
– arbitrary(ish)
preprocessing time
• Goals
– fast spell checking
– minimal storage
• Practical notes
– almost all searches are
successful
– words average about 8
characters in length
– 30,000 words at 8
bytes/word is 1/4 MB
– pointers are 4 bytes
– there are many
regularities in the
structure of English
words
Why?
Solutions
• Solutions
– sorted array + binary search
– open hashing
– closed hashing + linear probing
What kind of hash
function should we
use?
Storage
• Assume words are strings & entries are
pointers to strings
Array +
binary search Open hashing
…
Closed hashing
(open addressing)
How many
pointers does
each use?
Analysis
Binary search
– storage: n pointers + words = 360KB
– time: log2n ≤ 15 probes/access, worst case
Open hashing
– storage: n + n/λ pointers + words (λ = 1 ⇒ 600KB)
– time: 1 + λ/2 probes/access, average (λ = 1 ⇒ 1.5)
Closed hashing
– storage: n/λ pointers + words (λ = 0.5 ⇒ 480KB)
– time: probes/access, average (λ = 0.5 ⇒ 1.5)( ) 





−
+
λ1
1
1
2
1
Which one should we use?
To Do
• Start Project III (only 1 week!)
• Start reading chapter 8 in the book
Coming Up
• Disjoint-set union-find ADT
• Quiz (Thursday)

More Related Content

What's hot

Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data StructuresSHAKOOR AB
 
Dynamic memory allocation and linked lists
Dynamic memory allocation and linked listsDynamic memory allocation and linked lists
Dynamic memory allocation and linked listsDeepam Aggarwal
 
Tree representation in map reduce world
Tree representation  in map reduce worldTree representation  in map reduce world
Tree representation in map reduce worldYu Liu
 
Introduction to data structure
Introduction to data structureIntroduction to data structure
Introduction to data structureZaid Shabbir
 
Array in c programming
Array in c programmingArray in c programming
Array in c programmingManojkumar C
 
Data structures (introduction)
 Data structures (introduction) Data structures (introduction)
Data structures (introduction)Arvind Devaraj
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1Kumar
 
Hash table
Hash tableHash table
Hash tableVu Tran
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jamsBol.com Techlab
 
Data structure Assignment Help
Data structure Assignment HelpData structure Assignment Help
Data structure Assignment HelpJosephErin
 
Gems in the python standard library
Gems in the python standard libraryGems in the python standard library
Gems in the python standard libraryjasonscheirer
 

What's hot (17)

Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
 
Hash tables
Hash tablesHash tables
Hash tables
 
Dynamic memory allocation and linked lists
Dynamic memory allocation and linked listsDynamic memory allocation and linked lists
Dynamic memory allocation and linked lists
 
Tree representation in map reduce world
Tree representation  in map reduce worldTree representation  in map reduce world
Tree representation in map reduce world
 
Hash table
Hash tableHash table
Hash table
 
Data Structures 6
Data Structures 6Data Structures 6
Data Structures 6
 
Hashing PPT
Hashing PPTHashing PPT
Hashing PPT
 
Introduction to data structure
Introduction to data structureIntroduction to data structure
Introduction to data structure
 
Array in c programming
Array in c programmingArray in c programming
Array in c programming
 
Hashing
HashingHashing
Hashing
 
Data structures (introduction)
 Data structures (introduction) Data structures (introduction)
Data structures (introduction)
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1
 
Moving Data to and From R
Moving Data to and From RMoving Data to and From R
Moving Data to and From R
 
Hash table
Hash tableHash table
Hash table
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jams
 
Data structure Assignment Help
Data structure Assignment HelpData structure Assignment Help
Data structure Assignment Help
 
Gems in the python standard library
Gems in the python standard libraryGems in the python standard library
Gems in the python standard library
 

Similar to CSE 326 Lecture 14: Extendible Hashing and Case Study for Dictionary

Allocation and free space management
Allocation and free space managementAllocation and free space management
Allocation and free space managementrajshreemuthiah
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
SQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedSQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedTony Rogerson
 
Main MeMory Data Base
Main MeMory Data BaseMain MeMory Data Base
Main MeMory Data BaseSiva Rushi
 
Advanced data structures vol. 1
Advanced data structures   vol. 1Advanced data structures   vol. 1
Advanced data structures vol. 1Christalin Nelson
 
introduction into IR
introduction into IRintroduction into IR
introduction into IRssusere3b1a2
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 
chapter5-file system implementation.ppt
chapter5-file system implementation.pptchapter5-file system implementation.ppt
chapter5-file system implementation.pptBUSHRASHAIKH804312
 
Ch11 - Operating Systems
Ch11 - Operating SystemsCh11 - Operating Systems
Ch11 - Operating SystemsBala Krish
 
Exploring Direct Concept Search
Exploring Direct Concept SearchExploring Direct Concept Search
Exploring Direct Concept SearchSteve Rowe
 
Dictionaries_66f8_csm-c.pptx
Dictionaries_66f8_csm-c.pptxDictionaries_66f8_csm-c.pptx
Dictionaries_66f8_csm-c.pptxvaishnavijanapala
 
Unit-4 swapping.pptx
Unit-4 swapping.pptxUnit-4 swapping.pptx
Unit-4 swapping.pptxItechAnand1
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.pptIshaXogaha
 
files,indexing,hashing,linear and non linear hashing
files,indexing,hashing,linear and non linear hashingfiles,indexing,hashing,linear and non linear hashing
files,indexing,hashing,linear and non linear hashingRohit Kumar
 
Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Burak TUNGUT
 
hasing introduction.pptx
hasing introduction.pptxhasing introduction.pptx
hasing introduction.pptxvvwaykule
 

Similar to CSE 326 Lecture 14: Extendible Hashing and Case Study for Dictionary (20)

Allocation and free space management
Allocation and free space managementAllocation and free space management
Allocation and free space management
 
02-hashing.pdf
02-hashing.pdf02-hashing.pdf
02-hashing.pdf
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
SQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedSQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - Advanced
 
Hashing
HashingHashing
Hashing
 
Main MeMory Data Base
Main MeMory Data BaseMain MeMory Data Base
Main MeMory Data Base
 
Advanced data structures vol. 1
Advanced data structures   vol. 1Advanced data structures   vol. 1
Advanced data structures vol. 1
 
introduction into IR
introduction into IRintroduction into IR
introduction into IR
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
chapter5-file system implementation.ppt
chapter5-file system implementation.pptchapter5-file system implementation.ppt
chapter5-file system implementation.ppt
 
Ch11 - Operating Systems
Ch11 - Operating SystemsCh11 - Operating Systems
Ch11 - Operating Systems
 
Exploring Direct Concept Search
Exploring Direct Concept SearchExploring Direct Concept Search
Exploring Direct Concept Search
 
Dictionaries_66f8_csm-c.pptx
Dictionaries_66f8_csm-c.pptxDictionaries_66f8_csm-c.pptx
Dictionaries_66f8_csm-c.pptx
 
Unit-4 swapping.pptx
Unit-4 swapping.pptxUnit-4 swapping.pptx
Unit-4 swapping.pptx
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
files,indexing,hashing,linear and non linear hashing
files,indexing,hashing,linear and non linear hashingfiles,indexing,hashing,linear and non linear hashing
files,indexing,hashing,linear and non linear hashing
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5
 
hasing introduction.pptx
hasing introduction.pptxhasing introduction.pptx
hasing introduction.pptx
 

More from Krish_ver2

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back trackingKrish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back trackKrish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithmKrish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programmingKrish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-iKrish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquerKrish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02Krish_ver2
 

More from Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
 
4.2 bst 03
4.2 bst 034.2 bst 03
4.2 bst 03
 
4.2 bst 02
4.2 bst 024.2 bst 02
4.2 bst 02
 

Recently uploaded

Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 

Recently uploaded (20)

Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 

CSE 326 Lecture 14: Extendible Hashing and Case Study for Dictionary

  • 1. CSE 326: Data Structures Lecture #14 Bart Niswonger Summer Quarter 2001
  • 2. Today’s Outline • Project – Rules of competition – Comments? • – Probing continued – Rehashing – Extendible Hashing – Case Study
  • 3. Cost of a Database Query I/O to CPU ratio is 300!
  • 4. Extendible Hashing • Hashing technique for huge data sets – optimizes to reduce disk accesses – each hash bucket fits on one disk block – better than B-Trees if order is not important • Table contains – buckets, each fitting in one disk block, with the data – a directory that fits in one disk block used to hash to the correct bucket
  • 5. 001 010 011 110 111101 Extendible Hash Table • Directory - entries labeled by k bits & pointer to bucket with all keys starting with its bits • Each block contains keys & data matching on the first j ≤ k bits 000 100 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11011 11100 11110 directory for k = 3
  • 6. Inserting (easy case) 001 010 011 110 111101000 100 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11011 11100 11110 insert(11011) 001 010 011 110 111101000 100 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11100 11110
  • 7. Splitting 001 010 011 110 111101000 100 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11011 11100 11110 insert(11000) 001 010 011 110 111101000 100 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (3) 11000 11001 11011 (3) 11100 11110
  • 8. Rehashing 01 10 1100 (2) 01101 (2) 10000 10001 10011 10111 (2) 11001 11110 insert(10010) 001 010 011 110 111101000 100 No room to insert and no adoption! Expand directory Now, it’s just a normal split.
  • 9. When Directory Gets Too Large • Store only pointers to the items + (potentially) much smaller M + fewer items in the directory – one extra disk access! • Rehash + potentially better distribution over the buckets + fewer unnecessary items in the directory – can’t solve the problem if there’s simply too much data • What if these don’t work? – use a B-Tree to store the directory!
  • 10. Rehash of Hashing • Hashing is a great data structure for storing unordered data that supports insert, delete & find • Both separate chaining (open) and open addressing (closed) hashing are useful – separate chaining flexible – closed hashing uses less storage, but performs badly with load factors near 1 – extendible hashing for very large disk-based data • Hashing pros and cons + very fast + simple to implement, supports insert, delete, find - lazy deletion necessary in open addressing, can waste storage - does not support operations dependent on order: min, max, range
  • 11. Case Study • Spelling dictionary – 30,000 words – static – arbitrary(ish) preprocessing time • Goals – fast spell checking – minimal storage • Practical notes – almost all searches are successful – words average about 8 characters in length – 30,000 words at 8 bytes/word is 1/4 MB – pointers are 4 bytes – there are many regularities in the structure of English words Why?
  • 12. Solutions • Solutions – sorted array + binary search – open hashing – closed hashing + linear probing What kind of hash function should we use?
  • 13. Storage • Assume words are strings & entries are pointers to strings Array + binary search Open hashing … Closed hashing (open addressing) How many pointers does each use?
  • 14. Analysis Binary search – storage: n pointers + words = 360KB – time: log2n ≤ 15 probes/access, worst case Open hashing – storage: n + n/λ pointers + words (λ = 1 ⇒ 600KB) – time: 1 + λ/2 probes/access, average (λ = 1 ⇒ 1.5) Closed hashing – storage: n/λ pointers + words (λ = 0.5 ⇒ 480KB) – time: probes/access, average (λ = 0.5 ⇒ 1.5)( )       − + λ1 1 1 2 1 Which one should we use?
  • 15. To Do • Start Project III (only 1 week!) • Start reading chapter 8 in the book
  • 16. Coming Up • Disjoint-set union-find ADT • Quiz (Thursday)

Editor's Notes

  1. A whole lotta hashing going on! We’ll get to the hashing – but first the project stuff – new assignment – communication Rules for maze runners? Comments?
  2. This is a trace of how much time a simple database operation takes: this one lists all employees along with their job information, getting the employee and job information from separate databases. The thing to notice is that disk access takes something like 100 times as much time as processing. I told you disk access was expensive! BTW: the “index” in the picture is a B-Tree. - SQLServer
  3. OK, extendible hashing takes hashing and makes it work for huge data sets. The idea is very similar to B-Trees.
  4. Notice that those are the hashed keys I’m representing in the table. Actually, the buckets would store the original key/value pair.
  5. Inserting if there’s room in the bucket is easy. Just like in B-Trees. This takes 2 disk accesses. One to grab the directory, and one for the bucket. But what if there isn’t room?
  6. Then, as in B-Trees, we need to split a node. In this case, that turns out to be easy because we have room in the directory for another node. How many disk accesses does this take? 3: one for the directory, one for the old bucket, and one for the new bucket.
  7. Now, even though there’s tons of space in the table, we have to expand the directory. Once we’ve done that, we’re back to the case on the previous page. How many disk accesses does this take? 3! Just as for a normal split. The only problem is if the directory ends up too large to fit in a block. So, what does this say about the hash function we want for extendible hashing? We want something parameterized so we can rehash if necessary to get the buckets evened out.
  8. If the directory gets too large, we’re in trouble. In that case, we have two ways to fix it. One puts more keys in each bucket. The other evens out the buckets. Neither is guaranteed to work.
  9. Alright, let’s move on to the case study. Here’s the situation. Why are most searches successful? Because most words will be spelled correctly.
  10. Remember that we get arbitrary preprocessing time. What can we do with that? We should use a universal hash function (or at least a parameterized hash function) because that will let use rehash in preprocess until we’re happy with the distribution.
  11. Notice that all of these use the same amount of memory for the strings. How many pointers does each one use? N for array. Just store a pointer to each string. n (pointer to each string) + n/lambda (null pointer ending each list). Note, I’m playing a little game here; without an optimization, this would actually be 2n + n/lambda. N/lambda.
  12. The answer, of course, is it depends! However, given that we want fast checking, either hash table is better than the BST. And, given that we want low memory usage, we should use the closed hash table. Can anyone argue for the binary search?
  13. Even if you don’t yet have a team you should absolutely start thinking about project III. The code should not be hard to write, I think, but the algorithm may take you a bit
  14. Next time, We’ll look at extendible hashing for huge data sets.