SlideShare a Scribd company logo
1 of 3
Download to read offline
GFilter: A General Gram Filter for String Similarity Search
Abstract:
Numerous applications such as data integration, protein detection, and
article copy detection share a similar core problem: given a string as the
query, how to efficiently find all the similar answers from a large scale
string collection. Many existing methods adopt a prefix-filter-based
framework to solve this problem, and a number of recent works aim to use
advanced filters to improve the overall search performance. In this paper
we propose a gram-based framework to achieve near maximum filter
performance. The main idea is to judiciously choose the high-quality grams
as the prefix of query according to their estimated ability to filter
candidates. As this selection process is proved to be NP-hard problem, we
give a cost model to measure the filter ability of grams and develop
efficient heuristic algorithms to find high-quality grams. Extensive
experiments on real datasets demonstrate the superiority of the proposed
framework in comparison with the state-of-art approaches.
Existing System:
Existing literatures usually adopt a gram-based filter-and verification
framework for string similarity search. After the query string is
decomposed into grams, an efficient and critical type of filter can utilize the
grams and inverted index (from grams to strings) of the string collection to
generate candidates. Other advanced filters can be also used after this
elementary type of filter. Then in verification step, the similarity function is
evaluated for the surviving candidates to produce the final results.
Proposed System:
We develop a general gram filter. This generalization provides a chance to
select optimized combination of grams from q-gram set of a query.
Theoretical analysis is conducted on the complexity of the problem, which
is NP-hard.
We present a choose-and-extend framework to efficiently find the high
quality grams in the query process. Under this framework, different
strategies can be extended.
We propose a new measure of analyzing the overlap between inverted
lists. This is the key of synthetic criterion to help generate better ordering
for the grams of query. With the help of synthetic criterion, grams can be
selected efficiently and effectively.
Hardware Requirements:
• System : Pentium IV 2.4 GHz.
• Hard Disk : 40 GB.
• Floppy Drive : 1.44 Mb.
• Monitor : 15 VGA Colour.
• Mouse : Logitech.
• RAM : 256 Mb.
Software Requirements:
• Operating system : - Windows XP.
• Front End : - JSP
• Back End : - SQL Server
Software Requirements:
• Operating system : - Windows XP.
• Front End : - .Net
• Back End : - SQL Server

More Related Content

What's hot

What's hot (9)

IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
 
Metabolomic data analysis and visualization tools
Metabolomic data analysis and visualization toolsMetabolomic data analysis and visualization tools
Metabolomic data analysis and visualization tools
 
Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models
Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models
Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models
 
Query expansion
Query expansionQuery expansion
Query expansion
 
Automation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationAutomation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report Generation
 
FastA HOMOLOGY SEARCH ALGORITHM
FastA HOMOLOGY SEARCH ALGORITHMFastA HOMOLOGY SEARCH ALGORITHM
FastA HOMOLOGY SEARCH ALGORITHM
 
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGYSlicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
Slicing and dicing expert-curated protein targets in the Guide to PHARMACOLGY
 
Task Oriented-Query Reformulation
Task Oriented-Query ReformulationTask Oriented-Query Reformulation
Task Oriented-Query Reformulation
 
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
 

Similar to G filter a general gram filter for string similarity search

Distributed approach for Peptide Identification
Distributed approach for Peptide IdentificationDistributed approach for Peptide Identification
Distributed approach for Peptide Identification
abhinav vedanbhatla
 

Similar to G filter a general gram filter for string similarity search (20)

Meta Machine Learning: Hyperparameter Optimization
Meta Machine Learning: Hyperparameter OptimizationMeta Machine Learning: Hyperparameter Optimization
Meta Machine Learning: Hyperparameter Optimization
 
Ijricit 01-002 enhanced replica detection in short time for large data sets
Ijricit 01-002 enhanced replica detection in  short time for large data setsIjricit 01-002 enhanced replica detection in  short time for large data sets
Ijricit 01-002 enhanced replica detection in short time for large data sets
 
Distributed approach for Peptide Identification
Distributed approach for Peptide IdentificationDistributed approach for Peptide Identification
Distributed approach for Peptide Identification
 
Clustering using GA and Hill-climbing
Clustering using GA and Hill-climbingClustering using GA and Hill-climbing
Clustering using GA and Hill-climbing
 
Query aware determinization of uncertain objects
Query aware determinization of uncertain objectsQuery aware determinization of uncertain objects
Query aware determinization of uncertain objects
 
FINAL REVIEW
FINAL REVIEWFINAL REVIEW
FINAL REVIEW
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
weka data mining
weka data mining weka data mining
weka data mining
 
Policy Based reinforcement Learning for time series Anomaly detection
Policy Based reinforcement Learning for time series Anomaly detectionPolicy Based reinforcement Learning for time series Anomaly detection
Policy Based reinforcement Learning for time series Anomaly detection
 
research paper
research paperresearch paper
research paper
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
 
IEEE 2014 DOTNET DATA MINING PROJECTS A probabilistic approach to string tran...
IEEE 2014 DOTNET DATA MINING PROJECTS A probabilistic approach to string tran...IEEE 2014 DOTNET DATA MINING PROJECTS A probabilistic approach to string tran...
IEEE 2014 DOTNET DATA MINING PROJECTS A probabilistic approach to string tran...
 
2014 IEEE DOTNET DATA MINING PROJECT A probabilistic approach to string trans...
2014 IEEE DOTNET DATA MINING PROJECT A probabilistic approach to string trans...2014 IEEE DOTNET DATA MINING PROJECT A probabilistic approach to string trans...
2014 IEEE DOTNET DATA MINING PROJECT A probabilistic approach to string trans...
 
B017410916
B017410916B017410916
B017410916
 
B017410916
B017410916B017410916
B017410916
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic Algorithm
 
Evolving the Optimal Relevancy Ranking Model at Dice.com
Evolving the Optimal Relevancy Ranking Model at Dice.comEvolving the Optimal Relevancy Ranking Model at Dice.com
Evolving the Optimal Relevancy Ranking Model at Dice.com
 
2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...
2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...
2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...
 
IEEE 2014 JAVA DATA MINING PROJECTS A probabilistic approach to string transf...
IEEE 2014 JAVA DATA MINING PROJECTS A probabilistic approach to string transf...IEEE 2014 JAVA DATA MINING PROJECTS A probabilistic approach to string transf...
IEEE 2014 JAVA DATA MINING PROJECTS A probabilistic approach to string transf...
 
2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...
2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...
2014 IEEE JAVA DATA MINING PROJECT A probabilistic approach to string transfo...
 

More from ieeepondy

More from ieeepondy (20)

Demand aware network function placement
Demand aware network function placementDemand aware network function placement
Demand aware network function placement
 
Service description in the nfv revolution trends, challenges and a way forward
Service description in the nfv revolution trends, challenges and a way forwardService description in the nfv revolution trends, challenges and a way forward
Service description in the nfv revolution trends, challenges and a way forward
 
Secure optimization computation outsourcing in cloud computing a case study o...
Secure optimization computation outsourcing in cloud computing a case study o...Secure optimization computation outsourcing in cloud computing a case study o...
Secure optimization computation outsourcing in cloud computing a case study o...
 
Spatial related traffic sign inspection for inventory purposes using mobile l...
Spatial related traffic sign inspection for inventory purposes using mobile l...Spatial related traffic sign inspection for inventory purposes using mobile l...
Spatial related traffic sign inspection for inventory purposes using mobile l...
 
Standards for hybrid clouds
Standards for hybrid cloudsStandards for hybrid clouds
Standards for hybrid clouds
 
Rfhoc a random forest approach to auto-tuning hadoop's configuration
Rfhoc a random forest approach to auto-tuning hadoop's configurationRfhoc a random forest approach to auto-tuning hadoop's configuration
Rfhoc a random forest approach to auto-tuning hadoop's configuration
 
Resource and instance hour minimization for deadline constrained dag applicat...
Resource and instance hour minimization for deadline constrained dag applicat...Resource and instance hour minimization for deadline constrained dag applicat...
Resource and instance hour minimization for deadline constrained dag applicat...
 
Reliable and confidential cloud storage with efficient data forwarding functi...
Reliable and confidential cloud storage with efficient data forwarding functi...Reliable and confidential cloud storage with efficient data forwarding functi...
Reliable and confidential cloud storage with efficient data forwarding functi...
 
Rebuttal to “comments on ‘control cloud data access privilege and anonymity w...
Rebuttal to “comments on ‘control cloud data access privilege and anonymity w...Rebuttal to “comments on ‘control cloud data access privilege and anonymity w...
Rebuttal to “comments on ‘control cloud data access privilege and anonymity w...
 
Scalable cloud–sensor architecture for the internet of things
Scalable cloud–sensor architecture for the internet of thingsScalable cloud–sensor architecture for the internet of things
Scalable cloud–sensor architecture for the internet of things
 
Scalable algorithms for nearest neighbor joins on big trajectory data
Scalable algorithms for nearest neighbor joins on big trajectory dataScalable algorithms for nearest neighbor joins on big trajectory data
Scalable algorithms for nearest neighbor joins on big trajectory data
 
Robust workload and energy management for sustainable data centers
Robust workload and energy management for sustainable data centersRobust workload and energy management for sustainable data centers
Robust workload and energy management for sustainable data centers
 
Privacy preserving deep computation model on cloud for big data feature learning
Privacy preserving deep computation model on cloud for big data feature learningPrivacy preserving deep computation model on cloud for big data feature learning
Privacy preserving deep computation model on cloud for big data feature learning
 
Pricing the cloud ieee projects, ieee projects chennai, ieee projects 2016,ie...
Pricing the cloud ieee projects, ieee projects chennai, ieee projects 2016,ie...Pricing the cloud ieee projects, ieee projects chennai, ieee projects 2016,ie...
Pricing the cloud ieee projects, ieee projects chennai, ieee projects 2016,ie...
 
Protection of big data privacy
Protection of big data privacyProtection of big data privacy
Protection of big data privacy
 
Power optimization with bler constraint for wireless fronthauls in c ran
Power optimization with bler constraint for wireless fronthauls in c ranPower optimization with bler constraint for wireless fronthauls in c ran
Power optimization with bler constraint for wireless fronthauls in c ran
 
Performance aware cloud resource allocation via fitness-enabled auction
Performance aware cloud resource allocation via fitness-enabled auctionPerformance aware cloud resource allocation via fitness-enabled auction
Performance aware cloud resource allocation via fitness-enabled auction
 
Performance limitations of a text search application running in cloud instances
Performance limitations of a text search application running in cloud instancesPerformance limitations of a text search application running in cloud instances
Performance limitations of a text search application running in cloud instances
 
Performance analysis and optimal cooperative cluster size for randomly distri...
Performance analysis and optimal cooperative cluster size for randomly distri...Performance analysis and optimal cooperative cluster size for randomly distri...
Performance analysis and optimal cooperative cluster size for randomly distri...
 
Predictive control for energy aware consolidation in cloud datacenters
Predictive control for energy aware consolidation in cloud datacentersPredictive control for energy aware consolidation in cloud datacenters
Predictive control for energy aware consolidation in cloud datacenters
 

G filter a general gram filter for string similarity search

  • 1. GFilter: A General Gram Filter for String Similarity Search Abstract: Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a string as the query, how to efficiently find all the similar answers from a large scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works aim to use advanced filters to improve the overall search performance. In this paper we propose a gram-based framework to achieve near maximum filter performance. The main idea is to judiciously choose the high-quality grams as the prefix of query according to their estimated ability to filter candidates. As this selection process is proved to be NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with the state-of-art approaches. Existing System:
  • 2. Existing literatures usually adopt a gram-based filter-and verification framework for string similarity search. After the query string is decomposed into grams, an efficient and critical type of filter can utilize the grams and inverted index (from grams to strings) of the string collection to generate candidates. Other advanced filters can be also used after this elementary type of filter. Then in verification step, the similarity function is evaluated for the surviving candidates to produce the final results. Proposed System: We develop a general gram filter. This generalization provides a chance to select optimized combination of grams from q-gram set of a query. Theoretical analysis is conducted on the complexity of the problem, which is NP-hard. We present a choose-and-extend framework to efficiently find the high quality grams in the query process. Under this framework, different strategies can be extended. We propose a new measure of analyzing the overlap between inverted lists. This is the key of synthetic criterion to help generate better ordering for the grams of query. With the help of synthetic criterion, grams can be selected efficiently and effectively. Hardware Requirements: • System : Pentium IV 2.4 GHz. • Hard Disk : 40 GB. • Floppy Drive : 1.44 Mb. • Monitor : 15 VGA Colour.
  • 3. • Mouse : Logitech. • RAM : 256 Mb. Software Requirements: • Operating system : - Windows XP. • Front End : - JSP • Back End : - SQL Server Software Requirements: • Operating system : - Windows XP. • Front End : - .Net • Back End : - SQL Server