Data Structure
Programming Project #5
郭建志
Image
•You want to count elements
•You don’t need exact results
Problem
•Given:
•keys in many documents
•Goal:
•Count the key frequency
•Bounded error is allowed
•Constraint:
•Limited storage and limited computation
Simple Solution
•Construct a map from elements to counts
•Balanced binary tree:
map in C++
•Hash table:
unordered_map in C++
•You may want to use the Guava or FastUtil libraries in
Java for convenience and better performance
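The map-based approach above can be sketched in a few lines of C++. This is only an illustration; the function name count_exact is hypothetical and not part of the assignment.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Exact frequency counting: one map entry per distinct key.
// Memory grows with the number of distinct keys, which is the
// limitation the next slides address.
std::unordered_map<std::string, int> count_exact(const std::vector<std::string>& keys) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& k : keys) {
        ++counts[k];  // operator[] value-initializes a missing entry to 0
    }
    return counts;
}
```

For example, count_exact({"data", "type", "data"}) yields a map with two entries, where "data" maps to 2 and "type" maps to 1.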
Problem
•The number of distinct elements might be
very large
•You have a limited memory space
•For real-time applications you need runtime
guarantees
Solution: Count-Min Sketch
•The trick: don’t store the distinct elements,
but just the counters
•Create an integer array of length x initially
filled with 0s
•Each incoming element gets mapped to a
number between 0 and x-1
•The corresponding counter in the array gets
incremented
•To query an element’s count, simply return
the integer value at its position
You are completely right:
There will be collisions!
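A minimal sketch of the single-array idea, assuming std::hash as the mapping from element to index (the struct and member names are hypothetical). It shows why collisions inflate counts: two keys that land on the same index share one counter.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One counter array of length x; the keys themselves are never stored.
struct SingleArrayCounter {
    std::vector<int> counters;

    explicit SingleArrayCounter(std::size_t x) : counters(x, 0) {
        assert(x > 0);
    }

    // Map a key to an index in [0, x-1].
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % counters.size();
    }

    void add(const std::string& key) { ++counters[index(key)]; }

    // Colliding keys share a counter, so this can exceed the true count.
    int estimate(const std::string& key) const { return counters[index(key)]; }
};
```

With x = 1 every key collides, so after adding "data" once and "type" once, estimate("data") returns 2 instead of 1.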
Solution: Count-Min Sketch
•Use multiple arrays with different hash
functions to compute the index
•When queried, return the minimum of the
element’s counters across the arrays
You are completely right:
There will still be collisions!
… but fewer
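The multi-array version can be sketched as follows. The per-row hash used here (salting the key with the row number before calling std::hash) is an assumption chosen for brevity, not the hash family the assignment specifies, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct CountMinSketch {
    std::size_t depth;  // number of arrays (rows)
    std::size_t width;  // counters per array
    std::vector<std::vector<int>> rows;

    CountMinSketch(std::size_t d, std::size_t w)
        : depth(d), width(w), rows(d, std::vector<int>(w, 0)) {
        assert(d > 0 && w > 0);
    }

    // Illustrative per-row hash: salt the key with the row number.
    std::size_t index(std::size_t row, const std::string& key) const {
        return std::hash<std::string>{}(std::to_string(row) + ":" + key) % width;
    }

    // Increment one counter in every row.
    void add(const std::string& key) {
        for (std::size_t r = 0; r < depth; ++r) ++rows[r][index(r, key)];
    }

    // The minimum over the rows is the counter least affected by collisions.
    int estimate(const std::string& key) const {
        int best = rows[0][index(0, key)];
        for (std::size_t r = 1; r < depth; ++r) {
            best = std::min(best, rows[r][index(r, key)]);
        }
        return best;
    }
};
```

A key is over-counted only if it collides with other keys in every row, which is why taking the minimum reduces the error without eliminating it.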
Some properties
•Only over-estimates, never under-estimates
the true count
•Has constant memory and time
consumption, independent of the number of
elements
•The relative error may be high for
low-frequency elements
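The constant memory footprint follows from how the sketch is usually sized. In the standard analysis of the Count-Min sketch, a width of ceil(e/ε) and a depth of ceil(ln(1/δ)) guarantee that an estimate exceeds the true count by at most ε·N with probability at least 1-δ, where N is the total number of insertions. The helper below only illustrates that rule; the names SketchSize and size_for are hypothetical, and the assignment itself takes its dimensions from input.txt instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct SketchSize {
    std::size_t width;  // counters per row
    std::size_t depth;  // number of rows
};

// Dimensions depend only on the error parameters, not on the data size.
SketchSize size_for(double epsilon, double delta) {
    assert(epsilon > 0.0 && delta > 0.0 && delta < 1.0);
    const double e = std::exp(1.0);
    return { static_cast<std::size_t>(std::ceil(e / epsilon)),
             static_cast<std::size_t>(std::ceil(std::log(1.0 / delta))) };
}
```

For ε = 0.01 and δ = 0.01 this gives 272 columns and 5 rows. Because the additive error is proportional to N, rarely seen elements can have a large relative error.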
To be practical: Conservative Update
•Instead of incrementing every counter, only
increment the ones that currently hold the
element’s minimum value
•Experiments show a strong error reduction
using this heuristic
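A sketch of conservative update under the same illustrative assumptions as before (a salted std::hash per row, hypothetical names). Only the counters equal to the current minimum are raised; larger counters already over-estimate the element and are left alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Illustrative per-row hash (an assumption of this sketch).
std::size_t cell(const Table& t, std::size_t row, const std::string& key) {
    return std::hash<std::string>{}(std::to_string(row) + ":" + key) % t[row].size();
}

// Minimum of the key's counters over all rows.
int estimate(const Table& t, const std::string& key) {
    assert(!t.empty());
    int best = t[0][cell(t, 0, key)];
    for (std::size_t r = 1; r < t.size(); ++r) {
        best = std::min(best, t[r][cell(t, r, key)]);
    }
    return best;
}

// Conservative update: raise only the counters that equal the current minimum.
void add_conservative(Table& t, const std::string& key) {
    const int lowest = estimate(t, key);
    for (std::size_t r = 0; r < t.size(); ++r) {
        int& counter = t[r][cell(t, r, key)];
        if (counter == lowest) ++counter;
    }
}
```

The estimate still never falls below the true count, because every insertion raises the key's minimum by exactly one.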
Comparison
Usage
•General:
•Every counting problem on large data sets, where
errors are acceptable
•Counting elements in data streams (e.g., trend
detection on Twitter, Facebook, and news)
Usage
•Mine:
•Feature selection for large sets using pointwise
mutual information (PMI)
•PMI has been used for finding associations
between words
•All kinds of feature statistics like word
co-occurrences
You need to implement:
void init(int **map, int r, int c, int *a, int *b, int p)
1. Create a 2D array with r rows and c columns for pointer map
2. Create an array with r elements uniformly chosen from [1, p-1] for
pointer a
3. Create an array with r elements uniformly chosen from [1, p-1] for
pointer b (note: a[i] and b[i] should be independent)
int myhash(char *str, int count, int r, int c, int p, int *a, int *b)
Use std::hash from <string> to convert str to an integer key
// note that 0 <= count <= r-1
return (a[count] * key + b[count]) % p % c;
void insert(int **map, int r, int c, int p, char *str, int *a, int *b)
1. Find the smallest value among the following positions,
map[count][myhash(str, count, r, c, p, a, b)] for 0 ≤ count ≤ r-1
2. Increment every counter that holds this smallest value by 1
int query(int **map, int r, int c, int p, char *str, int *a, int *b)
1. Find the smallest value among the following positions,
map[count][myhash(str, count, r, c, p, a, b)] for 0 ≤ count ≤ r-1
2. Return the smallest value
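One possible reading of the four functions, offered as a sketch and not a reference solution. It makes three assumptions that the slides do not state: the caller has already allocated map (r row pointers) and the arrays a and b (r ints each); query has an int return type so that it can return the minimum; and the string hash is reduced modulo p in 64-bit arithmetic so that a[count] * key cannot overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <string>

// Assumes map points to r row pointers and a, b point to r ints each.
void init(int **map, int r, int c, int *a, int *b, int p) {
    assert(r > 0 && c > 0 && p > 1);
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(1, p - 1);
    for (int i = 0; i < r; ++i) {
        map[i] = new int[c]();  // zero-initialized row of c counters
        a[i] = dist(gen);
        b[i] = dist(gen);       // drawn separately from a[i]
    }
}

int myhash(char *str, int count, int r, int c, int p, int *a, int *b) {
    assert(0 <= count && count < r);
    // Reduce the string hash modulo p first; the product then fits in 64 bits.
    long long key = static_cast<long long>(
        std::hash<std::string>{}(str) % static_cast<std::size_t>(p));
    return static_cast<int>((a[count] * key + b[count]) % p % c);
}

// Returns the smallest of the key's r counters.
int query(int **map, int r, int c, int p, char *str, int *a, int *b) {
    int smallest = map[0][myhash(str, 0, r, c, p, a, b)];
    for (int count = 1; count < r; ++count) {
        int value = map[count][myhash(str, count, r, c, p, a, b)];
        if (value < smallest) smallest = value;
    }
    return smallest;
}

// Conservative update: increment only the counters holding the smallest value.
void insert(int **map, int r, int c, int p, char *str, int *a, int *b) {
    int smallest = query(map, r, c, p, str, a, b);
    for (int count = 0; count < r; ++count) {
        int &counter = map[count][myhash(str, count, r, c, p, a, b)];
        if (counter == smallest) ++counter;
    }
}
```

The rows allocated in init are never freed in this sketch; a complete program would release them with delete[] once the sketch is no longer needed.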
Input Sample: input.txt
First line: #rows #cols prime
Remaining lines: text
10 10 1019
Data structures
serve as the
basis
for abstract
data type. The
abstract data
type defines
the logical
form of the
data type. The
data structure
implements the
physical form
of the data
type.
Note: your code should understand that "Data" is equivalent to "data"
and ignore punctuation marks (e.g., commas)
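The normalization described in the note can be sketched as a small tokenizer. The function name tokenize is hypothetical, and treating every non-alphanumeric character as a separator is an assumption about how punctuation should be ignored.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into lowercase words; any non-alphanumeric character
// (space, newline, comma, period, ...) ends the current word.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With this rule, "Data" and "data" both become "data", and "type." becomes "type", so they are counted under the same key.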
You don’t need to output anything, but..
Command (input a key, return a value):
# data
6
# serve
1
# type
4
Note: The value is allowed to have a small error, since you are using
a Count-Min sketch instead of a binary search tree
Note
• Deadline:
1/3 Thu (can be adjusted if there are problems)
• This project accounts for 15% of the total score
• You are not allowed to use any STL "class" (e.g., map or unordered_map) to count the words
• You must implement a 2D array with the given size (i.e., #rows
and #cols in input.txt) to count the words
• Submit on E-course:
• C or C++ source code
• You can read this paper if you are interested in the research field:
• http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
