October 4th, 2017
Set Similarity Search using a
Distributed Prefix Tree Index
Fabian Fier
Prof. Johann-Christoph Freytag, Ph.D.
Problem Statement: Set Similarity Search
• Input
• A set of records R
• each consisting of a token set
• A search record s
• A similarity function sim
• A similarity threshold t
• Output
• All pairs (r, s) with r ∈ R and sim(r, s) ≥ t
Set Similarity Search using a Distributed Prefix Tree Index 2
Example: Jaccard Similarity Function
sim(r, s) = |r ∩ s| / |r ∪ s| = 3 / 8

(Venn diagram of the token sets r and s omitted.)
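The formula above translates directly into code. A minimal sketch (Python rather than the ECL used later in the talk; the two token sets are invented to reproduce the 3/8 from the figure):

```python
def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two token sets."""
    if not r and not s:
        return 1.0
    return len(r & s) / len(r | s)

# Made-up sets with an overlap of 3 tokens and a union of 8, as in the figure.
r = {"a", "b", "c", "d", "e"}
s = {"c", "d", "e", "f", "g", "h"}
print(jaccard(r, s))  # 0.375 (= 3/8)
```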
Approaches for Set Similarity Search
• Naive: compute similarity for each element in R
• Use Indexes (distributed):
• Inverted Index
• Optimization: filters
• New Approach: Prefix Tree (Trie)
Inverted Index (1)
Build an inverted index {[token, {recordId}]}
Records:
  r1: a b e
  r2: a d e
  r3: b c d e f g
  r4: b c d f g
  r5: b d f g

Inverted index:
  a → r1, r2
  b → r1, r3, r4, r5
  c → r3, r4
  d → r2, r3, r4, r5
  e → r1, r2, r3
  f → r3, r4, r5
  g → r3, r4, r5
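The build step above can be sketched in a few lines (Python, for illustration only; the deck's actual implementation uses ECL's NORMALIZE, ROLLUP, and BUILD, shown in the backup slides):

```python
from collections import defaultdict

records = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}

# {token: {recordId}}, as on the slide
inverted = defaultdict(set)
for rid, tokens in records.items():
    for tok in tokens:
        inverted[tok].add(rid)

print(sorted(inverted["b"]))  # ['r1', 'r3', 'r4', 'r5']
```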
Inverted Index (2)
Probe the index
• Get the inverted lists for each token of s
• Count record ID frequencies (=overlap) and calculate the similarities
Query: s = c d f g, t = 0.8

Probed inverted lists:
  c → r3, r4
  d → r2, r3, r4, r5
  f → r3, r4, r5
  g → r3, r4, r5

Candidates (overlap → similarity):
  r2: 1 → 1/6
  r3: 4 → 4/6
  r4: 4 → 4/5
  r5: 3 → 3/5

Result: (r4, s)
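The probe step maps to code just as directly. A self-contained Python sketch (`probe` is an invented helper name; it rebuilds the example index and derives the union size from |r| + |s| − overlap):

```python
from collections import Counter, defaultdict

records = {
    "r1": ["a", "b", "e"], "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"], "r5": ["b", "d", "f", "g"],
}
inverted = defaultdict(set)
for rid, tokens in records.items():
    for tok in tokens:
        inverted[tok].add(rid)

def probe(s_tokens, t):
    overlap = Counter()                  # record id -> |r ∩ s|
    for tok in s_tokens:
        overlap.update(inverted.get(tok, ()))
    results = []
    for rid, o in overlap.items():
        union = len(records[rid]) + len(s_tokens) - o
        if o / union >= t:
            results.append((rid, o / union))
    return results

print(probe(["c", "d", "f", "g"], 0.8))  # [('r4', 0.8)]
```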
Inverted Index (3)
• Optimization:
• Only documents with a similar length can be similar
• Add length to the index and use it to shrink the candidate set
Records:
  r1: a b e
  r2: a d e
  r3: b c d e f g
  r4: b c d f g
  r5: b d f g

Inverted index {[token, length, {recordId}]} (excerpt):
  a 3 → r1, r2
  b 3 → r1
  b 4 → r5
  b 5 → r4
  c 5 → r4
  d 4 → r5
  d 5 → r4
  f 4 → r5
  … … …

Query: s = c d f g, t = 0.8 → only lengths 4 or 5 qualify

Only two candidates remain:
  r4: 4 → 4/5
  r5: 3 → 3/5

Result: (r4, s)
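The admissible length range follows from the bound sim(r, s) ≤ min(|r|, |s|) / max(|r|, |s|): any qualifying record must satisfy t·|s| ≤ |r| ≤ |s|/t. A sketch (Python, illustrative; `length_bounds` is an invented name):

```python
import math

def length_bounds(query_len, t):
    """Range of record lengths that can still reach Jaccard >= t,
    from sim(r, s) <= min(|r|, |s|) / max(|r|, |s|)."""
    return math.ceil(t * query_len), math.floor(query_len / t)

# Slide example: |s| = 4, t = 0.8 -> only lengths 4 and 5 survive.
print(length_bounds(4, 0.8))  # (4, 5)
```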
Prefix Tree (1)
• Inspired by Charles Kaminski's approach (prefix trees for edit-distance similarity search)
→ Our goal: find similar records with the Jaccard similarity function
1. Build the prefix tree
Records:
  r1: a b e
  r2: a d e
  r3: b c d e f g
  r4: b c d f g
  r5: b d f g

Prefix tree (each node holds one or more tokens plus the (min, max) record length of its subtree):

  root
  ├─ a (3,3)
  │  ├─ b e (3,3) → r1
  │  └─ d e (3,3) → r2
  └─ b (4,6)
     ├─ c d (5,6)
     │  ├─ e f g (6,6) → r3
     │  └─ f g (5,5) → r4
     └─ d f g (4,4) → r5
Prefix Tree (2)
2. Probe the tree
• Start at the root of the tree and
follow all paths
• For each path:
• Discard subtrees which fail the
length filter
• Compare the query tokens
with the node tokens and
count all mismatches
• If there are too many
mismatches, discard this path
or subtree
Query: s = c d f g, t = 0.8
→ admissible lengths: 4 or 5
→ allowed mismatches: 0 (length 4), 1 (length 5)

Walking the example tree (m counts tokens in the symmetric difference so far):
  1. a (3,3): subtree too short → discarded
  2. b (4,6): m = 1 (b)
  3. c d (5,6): m = 1
  4. e f g (6,6): too long → discarded
  5. f g (5,5): m = 1 → r4 is similar
  6. d f g (4,4): m = 2 (b, c) → too many mismatches → discarded
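The build and probe steps above can be sketched together in Python (illustrative only, not the ECL LOOP/JOIN implementation described next). Two simplifications versus the slides: the sketch stores one token per node instead of the compressed multi-token nodes in the figure, and it counts only record-side mismatches during pruning, with an exact Jaccard check at record nodes, so its pruning is weaker than the talk's symmetric mismatch counting. All names here (`Node`, `build`, `probe`) are invented:

```python
import math

class Node:
    def __init__(self):
        self.children = {}     # token -> Node (tokens in the global order)
        self.record_id = None  # set if a record's token path ends here
        self.min_len = float("inf")  # shortest record length in this subtree
        self.max_len = 0             # longest record length in this subtree

def build(records):
    root = Node()
    for rid, tokens in records.items():
        node = root
        for tok in tokens:
            node.min_len = min(node.min_len, len(tokens))
            node.max_len = max(node.max_len, len(tokens))
            node = node.children.setdefault(tok, Node())
        node.min_len = min(node.min_len, len(tokens))
        node.max_len = max(node.max_len, len(tokens))
        node.record_id = rid
    return root

def probe(root, s_tokens, t):
    s, out = set(s_tokens), []
    lo, hi = math.ceil(t * len(s)), math.floor(len(s) / t)

    def dfs(node, depth, mismatches):
        # Length filter: every record in this subtree is too short or too long.
        if node.max_len < lo or node.min_len > hi:
            return
        # Mismatch filter: Jaccard o/(l+|s|-o) >= t needs o >= t(l+|s|)/(1+t),
        # so at most l - o record tokens may fall outside s.
        longest = min(hi, node.max_len)
        if mismatches > longest - math.ceil(t * (longest + len(s)) / (1 + t)):
            return
        if node.record_id is not None:
            o = depth - mismatches             # overlap of this path with s
            if o / (depth + len(s) - o) >= t:  # exact verification
                out.append(node.record_id)
        for tok, child in node.children.items():
            dfs(child, depth + 1, mismatches + (tok not in s))

    dfs(root, 0, 0)
    return out

records = {
    "r1": ["a", "b", "e"], "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"], "r5": ["b", "d", "f", "g"],
}
print(probe(build(records), ["c", "d", "f", "g"], 0.8))  # ['r4']
```

On the example this reproduces the walk on the slide: the a-subtree is pruned as too short, the r3 branch as too long, the r5 branch by the mismatch filter, and r4 verifies at 4/5.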
Implementation of the Prefix Tree (1)
1. Build the prefix tree:
• Result: INDEX which contains all prefix tree nodes
• Key: parent node id
• Payload: own node id, min. and max. path length, record id (or 0) and
is_record (boolean)
2. Probe the tree: Breadth-first search with LOOP and JOIN
Implementation of the Prefix Tree (2)
• Remarks
1. Token orders
• All records must have the same token order → which one?
• The token order influences the shape of the prefix tree
→ We experimented with different token orders
2. Level number in the prefix tree
• Each JOIN in the LOOP joins a new level of the tree with the queries
→ We add an integer "level" to all tree nodes and change the index key to
parent_id and level
→ We add "RIGHT.level = COUNTER" to the JOIN condition
Experiments and Results
• Datasets
• Flickr (253 MB), DBLP (685 MB), Enron (1.0 GB), Netflix (1.1 GB), CSX (3.5 GB)
• US Patent Data from 2005 (9.5 GB) and 2010 (16.5 GB)
• Queries
• 100 records from the original dataset
• Token orders
• Least frequent to most frequent
• Most frequent to least frequent
• Random
• Cluster configuration
• 6 Thor nodes with 3 Thor slaves per node
Result 1: Token Order has Significant Influence on Query Runtime
• Least frequent tokens at the beginning (inc)
• Tree is wide
• Most frequent tokens at the beginning (dec)
• Tree is deep
(Figures: sketch of a wide tree under the inc order vs. a deep tree under the dec order, plus charts of query runtime in seconds vs. threshold (60–100 %) on DBLP and Enron for the inc, dec, and ran token orders.)
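The three token orders can be produced by sorting each record's tokens against a global frequency table (Python sketch; `reorder` is an invented helper, and the frequencies come from the running five-record example, not from the benchmark datasets):

```python
from collections import Counter

records = {
    "r1": ["a", "b", "e"], "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"], "r5": ["b", "d", "f", "g"],
}

# Global token frequencies over the whole collection.
freq = Counter(tok for tokens in records.values() for tok in tokens)

def reorder(tokens, increasing=True):
    # Sort by global frequency; break ties by token to keep the order total.
    return sorted(tokens, key=lambda tok: (freq[tok], tok),
                  reverse=not increasing)

print(reorder(records["r3"]))                     # ['c', 'e', 'f', 'g', 'b', 'd']
print(reorder(records["r3"], increasing=False))   # ['d', 'b', 'g', 'f', 'e', 'c']
```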
Result 2: Tree Level as Additional Index Key
(Charts: query runtime in seconds vs. threshold (60–100 %) on DBLP for the inc, dec, and ran token orders, each comparing the plain index key ("normal") against the additional level key ("level").)
Result 3: Comparing (Prefix) Inverted Indexes to Prefix Trees
• Prefix inverted indexes are better for high thresholds
• Normal inverted indexes are better for low thresholds
(Charts: query runtime in seconds vs. threshold (60–100 %) on DBLP and Enron for prefixtree_best, inverted_index, and prefix_inverted_index.)
Result 4: Stop Word Removal Important for Big Datasets
• The patent datasets contain stopwords which appear in almost every record
• We removed the most frequent 0.075% of the tokens
• The average record length was reduced to 44% (2005) and 40% (2010)
(Chart: token frequency vs. rank for the most frequent 1 % of the tokens in the 2005 and 2010 datasets, with the 0.075 % cutoff marked; 99 % of the tokens appear only 400 times or less.)
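The cutoff can be sketched as follows (Python; `remove_stopwords` is an invented name, and applying the 0.075 % fraction to the number of *distinct* tokens is my reading of the slide, which does not spell out whether the fraction counts distinct tokens or occurrences):

```python
from collections import Counter

def remove_stopwords(records, fraction=0.00075):
    """Drop the most frequent `fraction` of the distinct tokens
    (0.075% in the talk) from every record."""
    freq = Counter(tok for tokens in records.values() for tok in tokens)
    k = max(1, int(len(freq) * fraction))
    stop = {tok for tok, _ in freq.most_common(k)}
    cleaned = {rid: [t for t in tokens if t not in stop]
               for rid, tokens in records.items()}
    return cleaned, stop

# Toy data: with only 7 distinct tokens, the single most frequent one is cut.
records = {
    "r1": ["a", "b", "e"], "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"], "r5": ["b", "d", "f", "g"],
}
cleaned, stop = remove_stopwords(records)
print(stop)
```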
Result 4: Stop Word Removal Important for Big Datasets
(Charts: runtime in seconds on the 2005 and 2010 patent datasets at t = 0.95, comparing four configurations: with stopwords; with stopwords and level as key; without stopwords (0.075 % removed); without stopwords (0.075 %) and level as key. The 2010 run with stopwords ended in a timeout.)
Thank you!
Questions?
Backup
Implementation of the Inverted Index (1)
Build
• Extract tokens and length for each record with NORMALIZE
• Combine the record ids to record id sets for each token and length with ROLLUP
• BUILD an INDEX with token and length as a key and the record id set as payload
Example: NORMALIZE turns record r1 (a b e) into one tuple per token, each carrying the record length:
  (a, 3, r1), (b, 3, r1), (e, 3, r1)

  NORMALIZE(inputDS,
            COUNT(LEFT.token_set),
            getTidCntRid(LEFT, COUNTER))

On the distributed and sorted tuples, ROLLUP then merges equal (token, length) pairs, e.g. (a, 3, r1) and (a, 3, r2) into (a, 3, {r1, r2}):

  ROLLUP(tupelDS,
         LEFT.token = RIGHT.token AND
         LEFT.cnt = RIGHT.cnt,
         combineRids(LEFT, RIGHT),
         LOCAL)
Implementation of the Inverted Index (2)
Probe
• Read the index and the query records
• Use PROJECT to find the similarity pairs for each query
• PROJECT(queryDS, findSimPairs(LEFT))
• TRANSFORM function findSimPairs:
• JOIN the query token and the inverted index
with the conditions LEFT.token = RIGHT.token
and the length filter to find all candidates
• Extract all candidate record ids with NORMALIZE
and count them with TABLE
• Calculate all similarities with PROJECT and SKIP
all candidates which are not similar
Dataflow for query s = c d f g:
  1. JOIN with the index (token equality plus the length filter) yields candidate tuples, e.g. (c, 5, r4), (d, 4, r5), (d, 5, r4), …
  2. NORMALIZE and TABLE count the overlap per candidate: r4 (length 5, overlap 4), r5 (length 4, overlap 3)
  3. PROJECT with SKIP keeps only the similar pairs → result: (r4, s)
Inverted Index with Prefix Filtering
• Idea:
• Two documents can only be similar if their prefixes share at least one token
• Approach:
• Create the inverted index with only the prefixes of R
→ Reduce the index size
• Search the candidates with only the prefix token from s
→ Should decrease the candidate set size
• Calculate the similarity for each candidate with the original documents
→ Need an additional access to the documents
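For Jaccard at threshold t (and a fixed global token order), the standard prefix length is |x| − ⌈t·|x|⌉ + 1: if two sets share no token within these prefixes, their similarity cannot reach t. A sketch (Python, illustrative; `prefix_len` and `prefix` are invented names):

```python
import math

def prefix_len(size, t):
    # If two sets share no token among their first size - ceil(t*size) + 1
    # tokens (in the global order), their Jaccard cannot reach t.
    return size - math.ceil(t * size) + 1

def prefix(tokens, t):
    return tokens[:prefix_len(len(tokens), t)]

# Slide example: at t = 0.8 the 4-token query s keeps a prefix of one token.
s = ["c", "d", "f", "g"]   # already in the global token order
print(prefix(s, 0.8))      # ['c']
print(prefix_len(5, 0.8))  # 2: 5-token records index their first 2 tokens
```

Probing the index with only this one-token query prefix is what shrinks the candidate set in the example on the next slide.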
Example
Records:
  r1: a b e
  r2: a d e
  r3: b c d e f g
  r4: b c d f g
  r5: b d f g

1. Build the inverted index {[token, length, {recordId}]} over the prefixes (excerpt):
  a 3 → r1, r2
  b 3 → r1
  b 4 → r5
  b 5 → r4
  c 5 → r4
  d 4 → r5
  d 5 → r4
  f 4 → r5
  … … …

2. Use the index to search: query s = c d f g, t = 0.8, lengths 4 or 5
   Probing with only the prefix token of s leaves a single candidate:
   r4: 4 → 4/5
   Result: (r4, s)
Implementation of the Prefix Inverted Index
• Similar to the first version
• Changes:
• Build an additional INDEX for the input records with the record id as a key
and the tokens as payload
• Build the inverted index only for the prefixes
• Change the NORMALIZE expression from COUNT(LEFT.token_set)
to indexLength(COUNT(LEFT.token_set))
• Again use PROJECT to find the similarity pairs for each query, and change the findSimPairs function
• JOIN only the query prefix with the index to get the candidate record ids
• Get the candidate records from the new index
• Verify the candidates with a new C++ function
Implementation of the Prefix Tree (1)
1. Build the prefix tree:
• Result: INDEX which contains all prefix tree nodes
• Key: parent node id
• Payload: own node id, min. and max. path length, record id (or 0) and
is_record (boolean)
2. Probe the tree: Breadth-first search with LOOP and JOIN
LOOP(QueryDS,
     LEFT.is_record = FALSE,
     EXISTS(ROWS(LEFT)),
     JOIN(ROWS(LEFT), pt_index,
          LEFT.node_id = RIGHT.parent_id AND
          /* length filter */ AND
          LEFT.too_much = FALSE,
          QueryPTTransform(LEFT, RIGHT),
          LIMIT(0), INNER))(too_much = FALSE);

More Related Content

What's hot

What's hot (20)

Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
 
C programming
C programmingC programming
C programming
 
SPARQL 1.1 Status
SPARQL 1.1 StatusSPARQL 1.1 Status
SPARQL 1.1 Status
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
 
Getting Started with R
Getting Started with RGetting Started with R
Getting Started with R
 
Data Structure
Data StructureData Structure
Data Structure
 
Data Structures & Algorithm design using C
Data Structures & Algorithm design using C Data Structures & Algorithm design using C
Data Structures & Algorithm design using C
 
Processing data with Python, using standard library modules you (probably) ne...
Processing data with Python, using standard library modules you (probably) ne...Processing data with Python, using standard library modules you (probably) ne...
Processing data with Python, using standard library modules you (probably) ne...
 
Certified bit coded regular expression parsing
Certified bit coded regular expression parsingCertified bit coded regular expression parsing
Certified bit coded regular expression parsing
 
Processing Regex Python
Processing Regex PythonProcessing Regex Python
Processing Regex Python
 
Pa1 session 2
Pa1 session 2 Pa1 session 2
Pa1 session 2
 
Unit 7 sorting
Unit   7 sortingUnit   7 sorting
Unit 7 sorting
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik Tambekar
 
SPARQL-DL - Theory & Practice
SPARQL-DL - Theory & PracticeSPARQL-DL - Theory & Practice
SPARQL-DL - Theory & Practice
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Cs341
Cs341Cs341
Cs341
 
Parsing (Automata)
Parsing (Automata)Parsing (Automata)
Parsing (Automata)
 
R Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB AcademyR Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB Academy
 
Data Structures 8
Data Structures 8Data Structures 8
Data Structures 8
 
Data Structures 7
Data Structures 7Data Structures 7
Data Structures 7
 

Similar to Set Similarity Search using a Distributed Prefix Tree Index

BCSE101E_Python_Module5 (4).pdf
BCSE101E_Python_Module5 (4).pdfBCSE101E_Python_Module5 (4).pdf
BCSE101E_Python_Module5 (4).pdf
mukeshb0905
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
Jing Kang
 
Python iteration
Python iterationPython iteration
Python iteration
dietbuddha
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
source{d}
 

Similar to Set Similarity Search using a Distributed Prefix Tree Index (20)

Web search engines
Web search enginesWeb search engines
Web search engines
 
LECTURE-1 (1).pptx
LECTURE-1 (1).pptxLECTURE-1 (1).pptx
LECTURE-1 (1).pptx
 
04 pig data operations
04 pig data operations04 pig data operations
04 pig data operations
 
MIPS Architecture
MIPS ArchitectureMIPS Architecture
MIPS Architecture
 
Python with data Sciences
Python with data SciencesPython with data Sciences
Python with data Sciences
 
BCSE101E_Python_Module5 (4).pdf
BCSE101E_Python_Module5 (4).pdfBCSE101E_Python_Module5 (4).pdf
BCSE101E_Python_Module5 (4).pdf
 
Basic data analysis using R.
Basic data analysis using R.Basic data analysis using R.
Basic data analysis using R.
 
Python Tutorial Part 1
Python Tutorial Part 1Python Tutorial Part 1
Python Tutorial Part 1
 
Query evaluation and optimization
Query evaluation and optimizationQuery evaluation and optimization
Query evaluation and optimization
 
Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler Design
 
search engine
search enginesearch engine
search engine
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics Researchers
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
SQL
SQLSQL
SQL
 
Rdbms
RdbmsRdbms
Rdbms
 
DynamodbDB Deep Dive
DynamodbDB Deep DiveDynamodbDB Deep Dive
DynamodbDB Deep Dive
 
Python iteration
Python iterationPython iteration
Python iteration
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
Analytics: The Final Data Frontier (or, Why Users Need Your Data and How Pino...
Analytics: The Final Data Frontier (or, Why Users Need Your Data and How Pino...Analytics: The Final Data Frontier (or, Why Users Need Your Data and How Pino...
Analytics: The Final Data Frontier (or, Why Users Need Your Data and How Pino...
 

More from HPCC Systems

Leveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsLeveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC Systems
HPCC Systems
 

More from HPCC Systems (20)

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex Systems
 
Welcome
WelcomeWelcome
Welcome
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon Cutting
 
Path to 8.0
Path to 8.0 Path to 8.0
Path to 8.0
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle Changes
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
 
Docker Support
Docker Support Docker Support
Docker Support
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network Capabilities
 
Leveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsLeveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC Systems
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis Tool
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL Neater
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
 

Recently uploaded

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
great91
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
ju0dztxtn
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
0uyfyq0q4
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
ppy8zfkfm
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Stephen266013
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 

Recently uploaded (20)

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 

Set Similarity Search using a Distributed Prefix Tree Index

  • 1. •OCT 4TH, 2017 Set Similarity Search using a Distributed Prefix Tree Index Fabian Fier Prof. Johann-Christoph Freytag, Ph.D.
  • 2. Problem Statement: Set Similarity Search • Input • A set of records R • each consisting of a token set • A search record s • A similarity function sim • A similarity threshold t • Output • All pairs of records where sim(r,s) ≥ t (r ∈ R) Set Similarity Search using a Distributed Prefix Tree Index 2
  • 3. Example: Jaccard Similarity Function 𝑠𝑖𝑚 𝑟, 𝑠 = |𝑟 ∩ 𝑠| |𝑟 ∪ 𝑠| = 3 8 Set Similarity Search using a Distributed Prefix Tree Index 3 sr
  • 4. Approaches for Set Similarity Search • Naive: compute similarity for each element in R • Use Indexes (distributed): • Inverted Index • Optimization: filters • New Approach: Prefix Tree (Trie) Set Similarity Search using a Distributed Prefix Tree Index 4
  • 5. Inverted Index (1) Build an inverted index {[token, {recordId}]} Set Similarity Search using a Distributed Prefix Tree Index 5 r1 a b e r2 a d e r3 b c d e f g r4 b c d f g r5 b d f g a r1, r2 b r1, r3, r4, r5 c r3, r4 d r2, r3, r4, r5 e r1, r2, r3 f r3, r4, r5 g r3, r4, r5
  • 6. Inverted Index (2) Probe the index • Get the inverted lists for each token of s • Count record ID frequencies (=overlap) and calculate the similarities Set Similarity Search using a Distributed Prefix Tree Index 6 s c d f g c r3, r4 d r2, r3, r4, r5 f r3, r4, r5 g r3, r4, r5 r2 1 → 1/6 r3 4 → 4/6 r4 4 → 4/5 r5 3 → 3/4 (r4, s) t = 0.8 inverted index candidates resultquery
  • 7. Inverted Index (3) • Optimization: • Only documents with a similar length can be similar • Add length to the index and use it to shrink the candidate set Set Similarity Search using a Distributed Prefix Tree Index 7 r1 a b e r2 a d e r3 b c d e f g r4 b c d f g r5 b d f g a 3 r1, r2 b 3 r1 b 4 r5 b 5 r4 … … … s c d f g r4 4 → 4/5 r5 3 → 3/4 (r4, s) query: t = 0.8, length 4 or 5 result c 5 r4 d 4 r5 d 5 r4 f 4 r5 … … … inverted index {[token, length, {recordId}]} only two candidates candidates n e w
  • 8. Prefix Tree (1) • Inspired by Charles Kaminskis approach (prefix trees for ED similarity search) → Our goal: find similar records with the Jaccard similarity function 1. Build the prefix tree Set Similarity Search using a Distributed Prefix Tree Index 8 r1 a b e r2 a d e r3 b c d e f g r4 b c d f g r5 b d f g a (3,3) b (4,6) b e (3,3) r1 d e (3,3) r2 d f g (4,4) r5 c d (5,6) e f g (6,6) r3 f g (5,5) r4
  • 9. Prefix Tree (2) 2. Probe the tree • Start at the root of the tree and follow all paths • For each path: • Discard subtrees which fail the length filter • Compare the query tokens with the node tokens and count all mismatches • If there are too many mismatches, discard this path or subtree Set Similarity Search using a Distributed Prefix Tree Index 9 s c d f g query a (3,3) b (4,6) b e (3,3) r1 d e (3,3) r2 d f g (4,4) r5 c d (5,6) e f g (6,6) r3 f g (5,5) r4 t = 0.8 → length 4 or 5 → allowed mismatches: 0 (length 4), 1 (length 5) 1. 2. too short too long too many mismatches similar m: 1 (b) 3. m: 1 4. 5. m: 1 6. m: 2 (b, c)
  • 10. Implementation of the Prefix Tree (1) 1. Build the prefix tree: • Result: INDEX which contains all prefix tree nodes • Key: parent node id • Payload: own node id, min. and max. path length, record id (or 0) and is_record (boolean) 2. Probe the tree: Breadth-first search with LOOP and JOIN Set Similarity Search using a Distributed Prefix Tree Index 10
  • 11. Implementation of the Prefix Tree (2) • Remarks 1. Token orders • All records must have the same token order → Which one? • The token order influences the shape of the prefix tree → We experimented with diffent token orders 2. Level number in the prefix tree • Each JOIN in the LOOP joins a new level from the tree with queries → We add a integer „level“ to all tree nodes and change the index key to parent_id and level → We add „RIGHT.level = COUNTER“ to the JOIN condition Set Similarity Search using a Distributed Prefix Tree Index 11
  • 12. Experiments and Results • Datasets • Flickr (253 MB), DBLP (685 MB), Enron (1.0 GB), Netflix (1.1 GB), CSX (3.5 GB) • US Patent Data from 2005 (9.5 GB) and 2010 (16.5 GB) • Queries • 100 records from the original dataset • Token orders • Least frequent to most frequent • Most frequent to least frequent • Random • Cluster configuration • 6 Thor nodes with 3 Thor slaves per node Set Similarity Search using a Distributed Prefix Tree Index 12
  • 13. Result 1: Token Order has Significant Influence on Query Runtime • Least frequent tokens at the beginning (inc) • Tree is wide • Most frequent tokens at the beginning (dec) • Tree is deep Set Similarity Search using a Distributed Prefix Tree Index 13 r1 r2 r3 r4 r5 r6 r7 r8 r4 r7 r8 r3r1 r2 r5 r6 0 100 200 300 400 500 60708090100 Runtimeins Threshold in % DBLP inc dec ran 0 20 40 60 80 100 120 140 160 180 200 60708090100 Runtimeins Threshold in % Enron inc dec ran
  • 14. Result 2: Tree Level as Additional Index Key Set Similarity Search using a Distributed Prefix Tree Index 14 0 100 200 300 400 500 6080100 Runtimeins Threshold in % DBLP inc 0 100 200 300 400 500 6080100 Runtimeins Threshold in % DBLP dec 0 100 200 300 400 500 6080100 Runtimeins Threshold in % DBLP ran normal level
  • 15. Result 3: Comparing (Prefix) Inverted Indexes to Prefix Trees • Prefix inverted indexes are better for high thresholds • Normal inverted indexes are better for low thresholds Set Similarity Search using a Distributed Prefix Tree Index 15 0 20 40 60 80 100 120 140 60708090100 Runtimeins Threshold in % DBLP 0 20 40 60 80 100 120 140 60708090100 Runtimeins Threshold in % enron prefixtree_best inverted_index prefix_inverted_index
  • 16. • The patent datasets contain stopwords which appear in almost every record • We removed the most frequent 0.075% of the tokens • Average record length has been reduced to 44% (2005) and 40% (2010) Result 4: Stop Word Removal Important for Big Datasets Set Similarity Search using a Distributed Prefix Tree Index 16 0 50000 100000 150000 200000 0 100000 200000 300000 frequency rank Token distribution of the most frequent token (1 %) 2005 2010 0.075 99% of the tokens appear only 400 times or less
  • 17. Result 4: Stop Word Removal Important for Big Datasets Set Similarity Search using a Distributed Prefix Tree Index 17 0 500 1000 1500 2000 2500 3000 3500 Runtimeins 2005, t = 0.95 0 500 1000 1500 2000 2500 3000 3500 Runtimeins 2010, t = 0.95 with stopwords with stopwords and level as key without stopwords (0.075%) without stopwords (0.075%) and level as key timeout
  • 18. Thank you! Questions? Set Similarity Search using a Distributed Prefix Tree Index 18
  • 19. Backup Set Similarity Search using a Distributed Prefix Tree Index 19
  • 20. Implementation of the Inverted Index (1) Build • Extract tokens and length for each record with NORMALIZE • Combine the record ids to record id sets for each token and length with ROLLUP • BUILD an INDEX with token and length as a key and the record id set as payload Set Similarity Search using a Distributed Prefix Tree Index 20 r1 a b e a 3 r1 b 3 r1 e 3 r1 NORMALIZE(inputDS, COUNT(LEFT.token_set), getTidCntRid(LEFT, COUNTER)) a 3 r1 a 3 r2 ROLLUP(tupelDS, LEFT.token = RIGHT.token and LEFT.cnt = RIGHT.cnt, combineRids(LEFT,RIGHT), local) a 3 r1, r2(distributed and sorted)
  • 21. Implementation of the Inverted Index (2) Probe • Read the index and the query records • Use PROJECT to find the similarity pairs for each query • PROJECT(queryDS, findSimPairs(LEFT)) • TRANSFORM function findSimPairs: • JOIN the query token and the inverted index with the conditions LEFT.token = RIGHT.token and the length filter to find all candidates • Extract all candidate record ids with NORMALIZE and count them with TABLE • Calculate all similarities with PROJECT and SKIP all candidates which are not similar Set Similarity Search using a Distributed Prefix Tree Index 21 s c d f g r4 5 4 r5 4 3 (r4, s) JOIN result c 5 r4 d 4 r5 d 5 r4 … … … index NORMALIZE, TABLE PROJECT with SKIP
  • 22. Inverted Index with Prefix Filtering
  • Idea:
  • Two documents can only be similar if their prefixes share at least one token!
  • Approach:
  • Create the inverted index with only the prefixes of R → reduces the index size
  • Search for candidates with only the prefix tokens from s → should decrease the candidate set size
  • Calculate the similarity for each candidate with the original documents → needs an additional access to the documents
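A minimal Python sketch of the prefix variant. It assumes a fixed global token order and the standard Jaccard prefix length |r| − ⌈t·|r|⌉ + 1; the helper names are made up for illustration:

```python
import math

def prefix_length(size, t):
    """For Jaccard threshold t, two sets can only be similar if their
    prefixes of this length (under a global token order) share a token."""
    return size - math.ceil(t * size) + 1

def build_prefix_index(records, t, order):
    """Index only each record's prefix tokens: {(token, length): set of ids}."""
    rank = {tok: i for i, tok in enumerate(order)}
    index = {}
    for rid, tokens in records.items():
        path = sorted(tokens, key=rank.__getitem__)
        for tok in path[:prefix_length(len(tokens), t)]:
            index.setdefault((tok, len(tokens)), set()).add(rid)
    return index
```

For t = 0.8 and |s| = 4 the query prefix is a single token, so probing the prefix index touches far fewer lists than the full inverted index; the surviving candidates must still be verified against the original records.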
  • 23. Example
  1. Build the inverted index {[token, length, {recordId}]}: r1 (a b e), r2 (a d e), r3 (b c d e f g), r4 (b c d f g), r5 (b d f g) yield entries such as (a 3: r1, r2), (b 3: r1), (b 4: r5), (b 5: r4), ..., (c 5: r4), (d 4: r5), (d 5: r4), (f 4: r5), ...
  2. Use the index to search (s = c d f g, t = 0.8, length 4 or 5): probing only the prefix of s leaves only one candidate, r4 (4 → 4/5), and the result is (r4, s)
  • 24. Implementation of the Prefix Inverted Index
  • Similar to the first version
  • Changes:
  • Build an additional INDEX for the input records with the record id as a key and the tokens as payload
  • Build the inverted index only for the prefixes
  • Change the NORMALIZE expression from COUNT(LEFT.token_set) to indexLength(COUNT(LEFT.token_set))
  • Use PROJECT again to find the similarity pairs for each query and change the findSimPairs function:
  • JOIN only the query prefix with the index to get the candidate record ids
  • Get the candidate records from the new index
  • Verify the candidates with a new C++ function
  • 25. Implementation of the Prefix Tree (1)
  1. Build the prefix tree:
  • Result: an INDEX which contains all prefix tree nodes
  • Key: parent node id
  • Payload: own node id, min. and max. path length, record id (or 0) and is_record (boolean)
  2. Probe the tree: breadth-first search with LOOP and JOIN
  LOOP(QueryDS, LEFT.is_record = false, EXISTS(ROWS(LEFT)) = true,
       JOIN(ROWS(LEFT), pt_index,
            LEFT.node_id = RIGHT.parent_id AND /* length filter */ AND LEFT.too_much = false,
            QueryPTTransform(LEFT, RIGHT), LIMIT(0), INNER))(too_much = false);
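The tree-based probe can be sketched in Python. This is an in-memory simplification (pointer-based nodes instead of the distributed INDEX, and overlap counting instead of the slide's mismatch bookkeeping), intended only to show the breadth-first search with the min/max path-length pruning; all names are assumptions:

```python
import math

class Node:
    def __init__(self, token=None):
        self.token = token
        self.children = {}       # next token -> child node
        self.record_id = None    # set where a record's token path ends
        self.record_len = 0
        self.min_len = math.inf  # shortest record path through this subtree
        self.max_len = 0         # longest record path through this subtree

def build_prefix_tree(records, order):
    """Insert each record's tokens, sorted by the global order, as a path."""
    rank = {tok: i for i, tok in enumerate(order)}
    root = Node()
    for rid, tokens in records.items():
        path = sorted(tokens, key=rank.__getitem__)
        node, visited = root, [root]
        for tok in path:
            node = node.children.setdefault(tok, Node(tok))
            visited.append(node)
        node.record_id, node.record_len = rid, len(path)
        for n in visited:  # maintain min/max path lengths for pruning
            n.min_len = min(n.min_len, len(path))
            n.max_len = max(n.max_len, len(path))
    return root

def probe_tree(root, query_tokens, t):
    """Breadth-first search (the LOOP/JOIN of the slide): prune subtrees
    failing the length filter, carry the overlap along each path, and
    check the Jaccard threshold at record nodes."""
    q = set(query_tokens)
    lo, hi = math.ceil(t * len(q)), math.floor(len(q) / t)
    results, frontier = [], [(root, 0)]
    while frontier:
        level = []
        for node, overlap in frontier:
            for child in node.children.values():
                if child.max_len < lo or child.min_len > hi:
                    continue  # length filter: no admissible record below
                o = overlap + (1 if child.token in q else 0)
                if child.record_id is not None and \
                        o / (child.record_len + len(q) - o) >= t:
                    results.append(child.record_id)
                level.append((child, o))
        frontier = level
    return results
```

On the example data, the subtree under token a (path length 3) is pruned immediately by the length filter, mirroring the walkthrough in the speaker notes.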

Editor's Notes

  1. Motivation of the problem is given on the audio track
  2. @Info: search record = query (mostly called "query" in the following)
  3. @Info: In the examples on the right, the intersections are highlighted in orange
  4. @Info: r1 is highlighted in orange as an example for the explanation
  5. @Info: For step 1, the tokens of s are highlighted in blue; for step 2, r4 is highlighted in orange as an example
  6. @Info: To shrink the candidate set, we additionally use the length filter. The equations are omitted for simplicity. Same example as before, only with length as an additional parameter. Building the index: as before, only with length as an additional value in the index (r1 highlighted as an example). During the search, only lengths 4 and 5 are allowed -> fetch only the lists for the tokens of s that have these lengths. The rest is analogous, but since fewer lists were fetched, there are only 2 candidates instead of 4
  7. @info: In the blog: edit distance and words -> here: Jaccard and sets. Explain the prefix tree with the example: words are mapped into the tree, and common prefixes are merged. Here, too, we store the lengths (min and max path lengths) -> for the length filter
  8. @info: Important: the tree and the query use the same token order. Example: (click) 1. Left path -> subtree with path length 3 -> too short -> (click) discarded. (click) 2. Right path -> node "b(4,6)" -> length okay, one mismatch. (click) Follow the left path: cd (5,6) -> everything okay, no new mismatches. (click) Follow the left path: efg (6,6) -> too long -> (click) discarded. (click) Follow the right path: fg (5,5) -> everything okay -> (click) we reach a record without too many mismatches -> similar. (click) Follow the right path: dfg (4,4) -> c (query) and b (node) are mismatches -> (click) length 4 allows only 0 mismatches -> discard
  9. @Info: For step 1, we had to adapt everything (functions etc.) accordingly. Breadth-first search: start at all children of the root (parent_id = 0) and join them with the queries (node_id = 0). The second join condition removes all subtrees that are too long or too short. In addition, nodes that already have too many mismatches (too_much = true) are skipped. The TRANSFORM function of the JOIN then compares the node content with the query (updates the node_id -> corresponds to the last visited node, computes the mismatches, ...). The result is modified query records with, among other things, new node_ids, which the LOOP joins with the tree again (next level), and so on. The loop ends when is_record is true for every record (i.e., the end of a path / a record was found) or no query records are left (i.e., all have found similar pairs). Finally, the set is filtered again to remove records that only acquired too many mismatches in the last step
  10. As a third variant, we then also tried a hash join instead of a keyed join (half-keyed): faster
  11. flickrlondon (253 MB): 1680490 records; record length min 1, max 102, avg 9.78; between 5626 (t=0.95) and 16952 (t=0.6) results found. dblp (685 MB): 1268017 records; record length min 13, max 714, avg 36.21; between 0 (t=0.95) and 6 (t=0.6) results found. enron (1.0 GB): 517431 records; record length min 1, max 3162, avg 133.57; between 308 (t=0.95) and 5761 (t=0.6) results found. netflix (1.1 GB): 429585 records; record length min 1, max 523, avg 128.80; between 0 (t=0.95) and 272 (t=0.6) results found. csx (3.5 GB): 1385532 records; record length min 35, max 3875, avg 148.89; between 2 (t=0.95) and 124 (t=0.6) results found. patentdata 2005 (9.5 GB): 157829 records; record length min 25, max 278421, avg 7248.72; record length without stopwords min 4, max 264134, avg 3181.62 (less than half of before); with stopwords 32 results found (t=0.95); without stopwords between 21 (t=0.95) and 62 (t=0.6) results found. 2010 (16.5 GB): 244597 records; record length min 32, max 581937, avg 8175.30; record length without stopwords min 2, max 565981, avg 3213.53 (less than half of before); with stopwords 10 results found (t=0.95); without stopwords 8 results found (t=0.95)
  12. @Info: influences the shape of the tree and the runtime. E.g. (prefix tree, 100 queries, without level as key): dblp: high t -> dec better, low t -> inc better; enron: inc better. Depends on t and the dataset. Random is usually between inc and dec. Number of results: dblp (rather few results): 0 at t=0.95, 0 at t=0.9, 0 at t=0.8, 3 at t=0.7, 6 at t=0.6; enron (rather many results): 308 at t=0.95, 462 at t=0.9, 1793 at t=0.8, 3387 at t=0.7, 5761 at t=0.6
  13. @Info: Runtime shown for the example of dblp with different sort orders -> for all datasets and sort orders, using the level was better; dec in particular becomes significantly faster (at low t usually more than twice as fast) -> exception: patentdata 2005 WITH stopwords at inc sort order and t = 0.95 (there it is better without level as key). Number of results: dblp: 0 at t=0.95, 0 at t=0.9, 0 at t=0.8, 3 at t=0.7, 6 at t=0.6
  14. @Info: (everything is on the slide) The same holds for the other datasets. Number of results: dblp (rather few results): 0 at t=0.95, 0 at t=0.9, 0 at t=0.8, 3 at t=0.7, 6 at t=0.6; enron (rather many results): 308 at t=0.95, 462 at t=0.9, 1793 at t=0.8, 3387 at t=0.7, 5761 at t=0.6
  15. @Info: 2005 and 2010 contain stopwords. 2005 token count: 21610107 -> the most frequent token occurs 157826 times, i.e. in almost every record (in 157826 of 157829 records) -> there are many rare tokens -> about 62% of the tokens occur exactly once -> about 99% of the tokens occur 400 times or less -> the remaining 1% are plotted. 2010 token count: 33460202 -> the most frequent token occurs 244593 times, i.e. in almost every record (in 244593 of 244597 records) -> the distribution is similar to before: there are many rare tokens; again about 62% of the tokens occur exactly once; about 99% of the tokens occur 400 times or less. In each case the most frequent 0.075% are removed: 2005: 16208 tokens removed; 2010: 25102 tokens removed (apparently a clear rounding error in ECL; it should be 25095 tokens). Record lengths are thereby shortened considerably. 2005: record length min 25, max 278421, avg 7248.72; without stopwords min 4, max 264134, avg 3181.62. 2010: record length min 32, max 581937, avg 8175.30; without stopwords min 2, max 565981, avg 3213.53
  16. @Info: Same behavior as before. Runtime depends on the sort order. Level as an additional key is useful in most cases (apart from 2 exceptions). The prefix inverted indexes are the best. With stopword removal, all variants (prefix tree inc/dec/ran with/without level, inverted index, prefix index) are faster
  17. @Info: Things like sorting the data and preprocessing the input have been omitted. First, each input record {rid, token_set} is transformed into {token, length, rid} for each token (using NORMALIZE). Then the rids with the same token and length values are merged into one set (using distribute(hash(tid)), sort(tid,cnt,local) and rollup). Finally, an index is created for the result: list_index := INDEX(list_ds, {tid, cnt}, {rid_set, RecPtr}, '~trie::list_index::'+INPUT+'_'+ORDER); BUILDINDEX(list_index, OVERWRITE); Note: recordIdSet must be declared as {blob}
  18. @Info: We have multiple queries (not just one as in the definition at the beginning). First, PROJECT is called -> this calls the function findSimPairs once for each query
  19. @Info: Same example as before; it continues on the next slide
  20. @Info: Example from before; orange: the prefixes; everything crossed out in red is dropped
  21. @info: similar to before. For the new index: tokens must (again) be a {blob}. Finally, the candidates are verified in findSimPairs using a C++ function (which works "smartly": it starts comparing from the overlap or prefix length and stops as soon as the required overlap can no longer be reached)
  22. @Info: identical to note 9 above (the backup slide repeats the prefix tree implementation)