SlideShare a Scribd company logo
Duplicate detection
Kira Radinsky
Based on the Standford slides by Christopher Manning
and Prabhakar Raghavan
Duplication
• ~30% of the content on the Web is near-duplicate pages
– Pages with content that is nearly identical to that of other
pages
• Issues:
– Index duplicate content only once
– Return only one version in the search results
– How can near-duplicate pages be identified in a scalable
and reliable manner?
Duplicate/Near-Duplicate Detection
• Duplication: Exact match can be detected
with fingerprints
• Near-Duplication: Approximate match
– Compute syntactic similarity with an edit-distance
measure
– Use similarity threshold to detect near-duplicates
• E.g., Similarity > 80% => Documents are near
duplicates
• Not transitive though sometimes used transitively
Computing Similarity - Shingles
Features for Similarity (Shingles (Word N-Grams))
• Segments of a document (natural or artificial breakpoints)
• K-shingling of a document transforms the document into a
set containing all windows of k contiguous terms
• E.g., 4-shingling of
“My name is Inigo Montoya. You killed my father. Prepare to die”:
{
my name is inigo
name is inigo montoya
is inigo montoya you
inigo montoya you killed
montoya you killed my,
you killed my father
killed my father prepare
my father prepare to
father prepare to die
}
Computing Similarity – distance metric
Similarity Measurement
• Denote by Sk(d) the k-shingling of document d
• Definition: the resemblance of d1 and d2,
R(d1,d2) = |Sk(d1)  Sk(d2)| / |Sk(d1)  Sk(d2)|
• The distance measure Δ(d1,d2) = 1-R(d1,d2) is a
metric
Shingles + Set Intersection
1. Computing exact set intersection of shingles
between all pairs of documents is
expensive/intractable
– Approximate using a cleverly chosen subset of
shingles from each (a sketch)
2. Estimate (size_of_intersection /
size_of_union) based on a short sketch
Doc
A Shingle set A Sketch A
Doc B
Shingle set B Sketch B
Jaccard
Sketch of a document
Create a sketch vector (of size ~200) for each document
– Documents that share ≥ t (usually 80%) corresponding
vector elements will be considered near duplicates
– Definitions:
• Let f map all k-shingles in the universe to 0..2m (e.g., f =
fingerprinting)
• Let p be a random permutation on 0..2m
– For doc D, sketchD is as follows:
• Sketch Option 1:
Fm(d)=minm(Π(Sk(d)) be the m smallest numbers after
applying Π to the k-shingling of d
• Sketch Option 2:
Vn(d) = { t ε Π(Sk(d)) | t = 0 mod n }
From Shingles to Sketches (cont.)
• A function X is an unbiased estimator of a value Y if E(X)=Y
• Theorem: when choosing Π u.a.r., the following functions are
unbiased estimators of R(d1,d2):
• | minm(Fm(d1)  Fm(d2))  Fm(d1)  Fm(d2) | /
| minm(Fm(d1)  Fm(d2)) |
• | Vn(d1)  Vn(d2) | / | Vn(d1)  Vn(d2) |
• Either Fm(d) or Vn(d) can be chosen as d’s sketch
– Fm(d) has the advantage that it is of fixed size
– Vn(d) is easier to compute
Multiple Sketches
• Let pi be a random permutation on 0..2m
• For doc D, sketchD[ i ] is as follows:
– Let f map all shingles in the universe to 0..2m (e.g.,
f = fingerprinting)
– Let pi be a random permutation on 0..2m
– Pick MIN {pi(f(s))} over all shingles s in D
Computing Sketch[i] for Doc1
Document 1
264
264
264
264
Start with 64-bit f(shingles)
Permute on the number line
with pi
Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Document 1 Document 2
264
264
264
264
264
264
264
264
Are these equal?
Test for 200 random permutations: p1, p2,… p200
A B
However…
Document 1 Document 2
264
264
264
264
264
264
264
264
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is
common to both (i.e., lies in the intersection)
Claim: This happens with probability
Size_of_intersection / Size_of_union
Why?
BA
Set Similarity of sets Ci , Cj
• View sets as columns of a matrix A
– one row for each element in the universe.
– aij = 1 indicates presence of item i in set j
• Example
ji
ji
ji
CC
CC
)C,Jaccard(C



C1 C2
0 1
1 0
1 1 Jaccard(C1,C2) = 2/5 = 0.4
0 0
1 1
0 1
Key Observation
• For columns Ci, Cj, four types of rows
Ci Cj
A 1 1
B 1 0
C 0 1
D 0 0
• Let A = # of rows of type A
• Claim
CBA
A
)C,Jaccard(C ji


Min Hashing
• Randomly permute rows
• Hash h(Ci) = index of first row with 1 in
column Ci
• Surprising Property
• Why?
– Both are A/(A+B+C)
– Look down columns Ci, Cj until first non-Type-D
row
– h(Ci) = h(Cj)  type A row
   jiji C,CJaccard)h(C)h(CP 
Min-Hash sketches
• Pick P random row permutations
• MinHash sketch
SketchD = list of P indexes of first rows with 1 in
column C
• Similarity of signatures
– Let sim[sketch(Ci),sketch(Cj)] = fraction of
permutations where MinHash values agree
– Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
Example
C1 C2 C3
R1 1 0 1
R2 0 1 1
R3 1 0 0
R4 1 0 1
R5 0 1 0
Signatures
S1 S2 S3
Perm 1 = (12345) 1 2 1
Perm 2 = (54321) 4 5 4
Perm 3 = (34512) 3 5 4
Similarities
1-2 1-3 2-3
Col-Col 0.00 0.50 0.25
Sig-Sig 0.00 0.67 0.00
Algorithm for Clustering Near-Duplicate
Documents
1. Compute the sketch of each document
2. From each sketch, produce a list of <shingle, docID> pairs
3. Group all pairs by shingle value
4. For any shingle that is shared by more than one document, output a
triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing
that shingle
5. Sort and aggregate the list of triplets, producing final triplets of the
form <smaller-docID, larger-docID, # common shingles>
6. Join any pair of documents whose number of common shingles
exceeds a chosen threshold using a “Union-Find” algorithm
7. Each resulting connected component of the UF algorithm is a cluster
of near-duplicate documents
Implementation nicely fits the “map-reduce” programming paradigm
Implementation Trick
• Permuting universe even once is prohibitive
• Row Hashing
– Pick P hash functions hk: {1,…,n}{1,…,O(n)}
– Ordering under hk gives random permutation of
rows
• One-pass Implementation
– For each Ci and hk, keep slot for min-hash value
– Initialize all slot(Ci,hk) to infinity
– Scan rows in arbitrary order looking for 1’s
• Suppose row Rj has 1 in column Ci
• For each hk,
– if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)
Example
C1 C2
R1 1 0
R2 0 1
R3 1 1
R4 1 0
R5 0 1
h(x) = x mod 5
g(x) = 2x+1 mod 5
h(1) = 1 1 -
g(1) = 3 3 -
h(2) = 2 1 2
g(2) = 0 3 0
h(3) = 3 1 2
g(3) = 2 2 0
h(4) = 4 1 2
g(4) = 4 2 0
h(5) = 0 1 0
g(5) = 1 2 0
C1 slots C2 slots

More Related Content

What's hot

Arrays
ArraysArrays
An Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellAn Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using Haskell
Michel Rijnders
 
Functional programming with haskell
Functional programming with haskellFunctional programming with haskell
Functional programming with haskell
faradjpour
 
130717666736980000
130717666736980000130717666736980000
130717666736980000
Tanzeel Ahmad
 
Chapter 7.3
Chapter 7.3Chapter 7.3
Chapter 7.3sotlsoc
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
Edwin de Jonge
 
GLSL
GLSLGLSL
Scala collections wizardry - Scalapeño
Scala collections wizardry - ScalapeñoScala collections wizardry - Scalapeño
Scala collections wizardry - ScalapeñoSagie Davidovich
 
Introduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analyticsIntroduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analytics
Knoldus Inc.
 
mat lab introduction and basics to learn
mat lab introduction and basics to learnmat lab introduction and basics to learn
mat lab introduction and basics to learn
pavan373
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms
宇 傅
 
Java Arrays
Java ArraysJava Arrays
Java Arrays
OXUS 20
 
1.Array and linklst definition
1.Array and linklst definition1.Array and linklst definition
1.Array and linklst definition
balavigneshwari
 
IR-ranking
IR-rankingIR-ranking
IR-rankingFELIX75
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
Edwin de Jonge
 
Arrays in VbScript
Arrays in VbScriptArrays in VbScript
Arrays in VbScript
Nilanjan Saha
 
Vector3
Vector3Vector3
Vector3
Rajendran
 

What's hot (20)

Arrays
ArraysArrays
Arrays
 
Cis166 final review c#
Cis166 final review c#Cis166 final review c#
Cis166 final review c#
 
An Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellAn Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using Haskell
 
Functional programming with haskell
Functional programming with haskellFunctional programming with haskell
Functional programming with haskell
 
2-D array
2-D array2-D array
2-D array
 
130717666736980000
130717666736980000130717666736980000
130717666736980000
 
Chapter 7.3
Chapter 7.3Chapter 7.3
Chapter 7.3
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
GLSL
GLSLGLSL
GLSL
 
Scala collections wizardry - Scalapeño
Scala collections wizardry - ScalapeñoScala collections wizardry - Scalapeño
Scala collections wizardry - Scalapeño
 
Introduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analyticsIntroduction To Algebird: Abstract algebra for analytics
Introduction To Algebird: Abstract algebra for analytics
 
mat lab introduction and basics to learn
mat lab introduction and basics to learnmat lab introduction and basics to learn
mat lab introduction and basics to learn
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms
 
Java Arrays
Java ArraysJava Arrays
Java Arrays
 
1.Array and linklst definition
1.Array and linklst definition1.Array and linklst definition
1.Array and linklst definition
 
IR-ranking
IR-rankingIR-ranking
IR-ranking
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Arrays in VbScript
Arrays in VbScriptArrays in VbScript
Arrays in VbScript
 
Vector3
Vector3Vector3
Vector3
 

Viewers also liked

Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detection
ieeepondy
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detectionjonecx
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
eSAT Journals
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsLikan Patra
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingVipin Kp
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous Duplicate
Anish Raivadera
 
Production&creative
Production&creativeProduction&creative
Production&creative
Red Keds
 
TIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLANDTIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLAND
Brandon Uriel
 
Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...
IOSR Journals
 
Web components
Web componentsWeb components
Web components
Mario Mendonça
 
Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)
F. Ovies
 
От кирпича к мультиканальности, от fuckupов - к лидерству.
От кирпича к мультиканальности, от fuckupов  - к лидерству.От кирпича к мультиканальности, от fuckupов  - к лидерству.
От кирпича к мультиканальности, от fuckupов - к лидерству.
Promodo
 
Engage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsEngage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsWebtrends
 
Engage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationEngage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationWebtrends
 
2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail CollectionHeinz Marketing Inc
 
From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China
MSL
 
Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)
Janice Fraser
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu, Ph.D.
 
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
GangSeok Lee
 

Viewers also liked (20)

Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detection
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detection
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate records
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous Duplicate
 
Production&creative
Production&creativeProduction&creative
Production&creative
 
TIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLANDTIPOGRAMAS LEGOLAND
TIPOGRAMAS LEGOLAND
 
Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...Post-buckling analysis of a simply supported compound beams made of two symme...
Post-buckling analysis of a simply supported compound beams made of two symme...
 
Web components
Web componentsWeb components
Web components
 
Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)Pura Tirtha Empul (Bali)
Pura Tirtha Empul (Bali)
 
От кирпича к мультиканальности, от fuckupов - к лидерству.
От кирпича к мультиканальности, от fuckupов  - к лидерству.От кирпича к мультиканальности, от fuckupов  - к лидерству.
От кирпича к мультиканальности, от fuckupов - к лидерству.
 
Engage 2013 - Webtrends Streams
Engage 2013 - Webtrends StreamsEngage 2013 - Webtrends Streams
Engage 2013 - Webtrends Streams
 
Engage 2013 - SEM Optimization
Engage 2013 - SEM OptimizationEngage 2013 - SEM Optimization
Engage 2013 - SEM Optimization
 
2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection2009 Heinz Marketing Holiday Cocktail Collection
2009 Heinz Marketing Holiday Cocktail Collection
 
From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China From Mao to More: Catching up with the next generation of talent in China
From Mao to More: Catching up with the next generation of talent in China
 
EAGLENEST
EAGLENESTEAGLENEST
EAGLENEST
 
Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)Lean UX for Design Teams (Crushing the Boulder)
Lean UX for Design Teams (Crushing the Boulder)
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
[2014 CodeEngn Conference 11] 김기홍 - 빅데이터 기반 악성코드 자동 분석 플랫폼
 

Similar to Tutorial 4 (duplicate detection)

3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
Viet-Trung TRAN
 
image compresson
image compressonimage compresson
image compresson
Ajay Kumar
 
Image compression
Image compressionImage compression
Image compression
Ale Johnsan
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGGeorge Simov
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Mail.ru Group
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
diannepatricia
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
diannepatricia
 
Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches  Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches
AXEL FOTSO
 
Introduction to complex networks
Introduction to complex networksIntroduction to complex networks
Introduction to complex networks
Vincent Traag
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
eXascale Infolab
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
Luiz Henrique Zambom Santana
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
dudoo1
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
ssuser6d1fca
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
fridolin.wild
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
The Statistical and Applied Mathematical Sciences Institute
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).ppt
ssuserf11a32
 

Similar to Tutorial 4 (duplicate detection) (20)

3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
 
image compresson
image compressonimage compresson
image compresson
 
Image compression
Image compressionImage compression
Image compression
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 
Nn
NnNn
Nn
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
 
Bekas for cognitive_speaker_series
Bekas for cognitive_speaker_seriesBekas for cognitive_speaker_series
Bekas for cognitive_speaker_series
 
Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches  Report on Efficient Estimation for High Similarities using Odd Sketches
Report on Efficient Estimation for High Similarities using Odd Sketches
 
Introduction to complex networks
Introduction to complex networksIntroduction to complex networks
Introduction to complex networks
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Lec10 matching
Lec10 matchingLec10 matching
Lec10 matching
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
 
ImageCompression.ppt
ImageCompression.pptImageCompression.ppt
ImageCompression.ppt
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).ppt
 

More from Kira

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
Kira
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
Kira
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
Kira
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
Kira
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
Kira
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
Kira
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
Kira
 

More from Kira (13)

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
 

Recently uploaded

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Tutorial 4 (duplicate detection)

  • 1. Duplicate detection Kira Radinsky Based on the Standford slides by Christopher Manning and Prabhakar Raghavan
  • 2. Duplication • ~30% of the content on the Web is near-duplicate pages – Pages with content that is nearly identical to that of other pages • Issues: – Index duplicate content only once – Return only one version in the search results – How can near-duplicate pages be identified in a scalable and reliable manner?
  • 3. Duplicate/Near-Duplicate Detection • Duplication: Exact match can be detected with fingerprints • Near-Duplication: Approximate match – Compute syntactic similarity with an edit-distance measure – Use similarity threshold to detect near-duplicates • E.g., Similarity > 80% => Documents are near duplicates • Not transitive though sometimes used transitively
  • 4. Computing Similarity - Shingles Features for Similarity (Shingles (Word N-Grams)) • Segments of a document (natural or artificial breakpoints) • K-shingling of a document transforms the document into a set containing all windows of k contiguous terms • E.g., 4-shingling of “My name is Inigo Montoya. You killed my father. Prepare to die”: { my name is inigo name is inigo montoya is inigo montoya you inigo montoya you killed montoya you killed my, you killed my father killed my father prepare my father prepare to father prepare to die }
  • 5. Computing Similarity – distance metric Similarity Measurement • Denote by Sk(d) the k-shingling of document d • Definition: the resemblance of d1 and d2, R(d1,d2) = |Sk(d1)  Sk(d2)| / |Sk(d1)  Sk(d2)| • The distance measure Δ(d1,d2) = 1-R(d1,d2) is a metric
  • 6. Shingles + Set Intersection 1. Computing exact set intersection of shingles between all pairs of documents is expensive/intractable – Approximate using a cleverly chosen subset of shingles from each (a sketch) 2. Estimate (size_of_intersection / size_of_union) based on a short sketch Doc A Shingle set A Sketch A Doc B Shingle set B Sketch B Jaccard
  • 7. Sketch of a document Create a sketch vector (of size ~200) for each document – Documents that share ≥ t (usually 80%) corresponding vector elements will be considered near duplicates – Definitions: • Let f map all k-shingles in the universe to 0..2m (e.g., f = fingerprinting) • Let p be a random permutation on 0..2m – For doc D, sketchD is as follows: • Sketch Option 1: Fm(d)=minm(Π(Sk(d)) be the m smallest numbers after applying Π to the k-shingling of d • Sketch Option 2: Vn(d) = { t ε Π(Sk(d)) | t = 0 mod n }
  • 8. From Shingles to Sketches (cont.) • A function X is an unbiased estimator of a value Y if E(X)=Y • Theorem: when choosing Π u.a.r., the following functions are unbiased estimators of R(d1,d2): • | minm(Fm(d1)  Fm(d2))  Fm(d1)  Fm(d2) | / | minm(Fm(d1)  Fm(d2)) | • | Vn(d1)  Vn(d2) | / | Vn(d1)  Vn(d2) | • Either Fm(d) or Vn(d) can be chosen as d’s sketch – Fm(d) has the advantage that it is of fixed size – Vn(d) is easier to compute
  • 9. Multiple Sketches • Let pi be a random permutation on 0..2m • For doc D, sketchD[ i ] is as follows: – Let f map all shingles in the universe to 0..2m (e.g., f = fingerprinting) – Let pi be a random permutation on 0..2m – Pick MIN {pi(f(s))} over all shingles s in D
  • 10. Computing Sketch[i] for Doc1 Document 1 264 264 264 264 Start with 64-bit f(shingles) Permute on the number line with pi Pick the min value
  • 11. Test if Doc1.Sketch[i] = Doc2.Sketch[i] Document 1 Document 2 264 264 264 264 264 264 264 264 Are these equal? Test for 200 random permutations: p1, p2,… p200 A B
  • 12. However… Document 1 Document 2 264 264 264 264 264 264 264 264 A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection) Claim: This happens with probability Size_of_intersection / Size_of_union Why? BA
  • 13. Set Similarity of sets Ci , Cj • View sets as columns of a matrix A – one row for each element in the universe. – aij = 1 indicates presence of item i in set j • Example ji ji ji CC CC )C,Jaccard(C    C1 C2 0 1 1 0 1 1 Jaccard(C1,C2) = 2/5 = 0.4 0 0 1 1 0 1
  • 14. Key Observation • For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0 • Let A = # of rows of type A • Claim CBA A )C,Jaccard(C ji  
  • 15. Min Hashing • Randomly permute rows • Hash h(Ci) = index of first row with 1 in column Ci • Surprising Property • Why? – Both are A/(A+B+C) – Look down columns Ci, Cj until first non-Type-D row – h(Ci) = h(Cj)  type A row    jiji C,CJaccard)h(C)h(CP 
  • 16. Min-Hash sketches • Pick P random row permutations • MinHash sketch SketchD = list of P indexes of first rows with 1 in column C • Similarity of signatures – Let sim[sketch(Ci),sketch(Cj)] = fraction of permutations where MinHash values agree – Observe E[sim(sig(Ci),sig(Cj))] = Jaccard(Ci,Cj)
  • 17. Example C1 C2 C3 R1 1 0 1 R2 0 1 1 R3 1 0 0 R4 1 0 1 R5 0 1 0 Signatures S1 S2 S3 Perm 1 = (12345) 1 2 1 Perm 2 = (54321) 4 5 4 Perm 3 = (34512) 3 5 4 Similarities 1-2 1-3 2-3 Col-Col 0.00 0.50 0.25 Sig-Sig 0.00 0.67 0.00
  • 18. Algorithm for Clustering Near-Duplicate Documents 1. Compute the sketch of each document 2. From each sketch, produce a list of <shingle, docID> pairs 3. Group all pairs by shingle value 4. For any shingle that is shared by more than one document, output a triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing that shingle 5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docID, larger-docID, # common shingles> 6. Join any pair of documents whose number of common shingles exceeds a chosen threshold using a “Union-Find” algorithm 7. Each resulting connected component of the UF algorithm is a cluster of near-duplicate documents Implementation nicely fits the “map-reduce” programming paradigm
  • 19.
  • 20. Implementation Trick • Permuting universe even once is prohibitive • Row Hashing – Pick P hash functions hk: {1,…,n}{1,…,O(n)} – Ordering under hk gives random permutation of rows • One-pass Implementation – For each Ci and hk, keep slot for min-hash value – Initialize all slot(Ci,hk) to infinity – Scan rows in arbitrary order looking for 1’s • Suppose row Rj has 1 in column Ci • For each hk, – if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)
  • 21. Example C1 C2 R1 1 0 R2 0 1 R3 1 1 R4 1 0 R5 0 1 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 1 1 - g(1) = 3 3 - h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(5) = 0 1 0 g(5) = 1 2 0 C1 slots C2 slots