Introduction
Aim –
• The aim of this project is to build a prototype of a search engine that works on millions of Wikipedia pages (which are in XML format) and retrieves the top 10 relevant Wikipedia documents that match the input query.
• It takes the Wikipedia corpus in XML format, available at Wikipedia.org, as input. It then indexes millions of Wikipedia pages involving a comparable number of distinct terms.
• Given a query, it retrieves relevant ranked documents and their titles using the index.
Exact values vs. full text search
Search can be categorized in two ways:
• Exact value
• Full text
Exact value search
Example- The exact value Foo is not the same
as foo. The value 2014 is not the same as
2014-12-03.
• Exact values are easy to query. The
decision is binary: a value either
matches a query or it does not.
Ex:- where name = “soham” and user_id = 1;
Full text search
Querying full-text data is much more subtle. We are not asking “Does this document match the query?” but “How well does this document match the query?”, i.e., how relevant is this document to the query.
• Not ‘yes’ or ‘no’ or binary answers.
Analysis and Analyzers
Inverted Index
To facilitate full text search, in our project we first analyze the text and use the result to build an inverted index.
• An inverted index contains all the unique words that appear in any document and, for each word, a list of documents in which it appears.
• To create the inverted index, we first split the content field of each document into separate words (which we call tokens), create a sorted list of all the unique terms, and then for each term maintain a list of documents in which it is present (a minimal code sketch appears after the example below).
The given input will be in XML format. We need to understand where the content is present in this XML file, get into the content, and do various preprocessing on it. Once preprocessed, the entire text (one page) is converted into many tokens. We then save all these words in tabular form, and for every word we maintain a list of pages in which it occurred.
Example
Consider the following documents:
Document 1 - This is text search engine.
Document 2 - The prerequisite for doing good work is motivation.

Term           Document 1   Document 2
This           ✓
is             ✓            ✓
text           ✓
search         ✓
engine         ✓
The                         ✓
prerequisite                ✓
for                         ✓
doing                       ✓
good                        ✓
work                        ✓
motivation                  ✓
(Pipeline: Tokenize → Delete Stop Words → Stemming → Maintain Indexing)
For millions of documents, if we maintain an inverted index, we go for multi-level indexing so as to direct the search accordingly.
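As a concrete illustration, a minimal in-memory sketch in Java of the inverted index from the example above (class and method names are illustrative, not the project's actual code):

    import java.util.*;

    public class SimpleInvertedIndex {
        // term -> sorted set of document ids in which the term occurs
        private final Map<String, TreeSet<Integer>> index = new TreeMap<>();

        public void addDocument(int docId, String content) {
            // naive whitespace tokenization and case folding
            for (String token : content.toLowerCase().split("\\s+")) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }

        public Set<Integer> lookup(String term) {
            return index.getOrDefault(term.toLowerCase(), new TreeSet<>());
        }

        public static void main(String[] args) {
            SimpleInvertedIndex idx = new SimpleInvertedIndex();
            idx.addDocument(1, "This is text search engine");
            idx.addDocument(2, "The prerequisite for doing good work is motivation");
            System.out.println(idx.lookup("is"));   // [1, 2]
        }
    }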
Text Preprocessing
1) Tokenization
 • White Space Tokenizer
 • Penn Treebank Tokenizer
2) Sentence Segmentation – mark the sentence ending
3) Case Folding – ex: whether the user types ‘ravi’ in small or capital letters, he means the same thing, i.e., we need to understand and converge words having the same root meaning
4) Removing Stop Words – ex: is, a, an, the, etc. are not really important for search
5) Stemming – ex: obtaining the root word from a given word (a sketch of the full pipeline follows this list)
 • Porter Stemming Algorithm (implemented)
 • Lancaster Stemming Algorithm
 • WordNet Lemmatizer
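A minimal sketch of this preprocessing pipeline, assuming OpenNLP's PorterStemmer is available; the hard-coded stop-word set and helper names are stand-ins for illustration only:

    import java.util.*;
    import opennlp.tools.stemmer.PorterStemmer;

    public class Preprocessor {
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("is", "a", "an", "the", "for"));
        private static final PorterStemmer STEMMER = new PorterStemmer();

        public static List<String> preprocess(String text) {
            List<String> tokens = new ArrayList<>();
            for (String token : text.split("\\s+")) {            // 1) tokenize on whitespace
                String word = token.toLowerCase();                 // 3) case folding
                if (word.isEmpty() || STOP_WORDS.contains(word)) { // 4) remove stop words
                    continue;
                }
                tokens.add(STEMMER.stem(word));                    // 5) Porter stemming
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(preprocess("The prerequisite for doing good work is motivation"));
        }
    }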
Tokenization
1) White space tokenizer
Ex:- Dr. Srikant Varma still has possession of his ill-gotten goods; his brother’s assets and liabilities.
#tokens = 15
If we tokenize based on all characters other than alphabets (A-Z, a-z):
Ex:- Dr Srikant Varma still has possession of his ill gotten goods his brother s assets and liabilities
#tokens = 17
Issues in Tokenization
India’s capital → India? Indias? India’s?
We’re, I’m, aren’t → We are, I am, are not?
Ill-gotten → Ill-gotten or Ill gotten?
Srikant Varma → single token or two?
U.S.A → USA or U.S.A?
Special characters and number formats: $35.40, $50.75
The Penn Treebank Tokenizer (a standard defined by the Linguistic Data Consortium (LDC)) will answer all these questions.
Penn Treebank Tokenization
• A standard released by the Linguistic Data Consortium.
• It separates out clitics.
• Keeps hyphenated words together.
• Separates out all punctuation.
Ex:- India’s capital → India ’s capital
Ex:- “ Has anything escaped me ? ” I asked with some self-importance . “ I trust that there
is nothing of consequence which I have overlooked ? ”
#Tokens = 28
doesn’t → does n’t
can’t → ca n’t
haven’t → have n’t
Case Folding
For tasks like speech recognition and information retrieval, everything is mapped to lower case, so we might sometimes lose the meaning of acronyms.
JADAVPUR → jadavpur
US → us
Suppose a document contains the word “Fox” while the user has searched for “fox”. There is a chance that our engine will say that this document does not contain the word. To maintain consistency, we convert all words into a single case (lower case).
Java Code to Implement Whitespace Tokenization
Tokenizing using OpenNLP
The opennlp.tools.tokenize package contains the classes and interfaces that are used to perform tokenization.
To tokenize the given sentences into simpler fragments, the OpenNLP library provides this class
• WhitespaceTokenizer − This class uses whitespaces to tokenize the given text.
To tokenize a sentence using the WhitespaceTokenizer class, we need to −
• Create an object of the respective class.
• Tokenize the sentence using the tokenize() method.
• Print the tokens.
Following are the steps to be followed to write a program which tokenizes the given raw text.
Step 1 − Instantiate the respective class
There are no constructors available to instantiate the tokenizer class. Therefore, we need to create its object using the static variable INSTANCE.
Step 2 − Tokenize the sentences
The class contains a method called tokenize(). This method accepts raw text in String format. On invoking, it tokenizes the given String and returns an array of Strings (tokens).
Tokenize the sentence using the tokenize() method as shown below.
Step 3 − Print the tokens
After tokenizing the sentence, you can print the tokens using for loop, as shown below.
Example
Following is the program which tokenizes the given sentence using the Whitespace Tokenizer class.
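A minimal sketch of such a program, assuming the opennlp-tools library is on the classpath:

    import opennlp.tools.tokenize.WhitespaceTokenizer;

    public class WhitespaceTokenizerExample {
        public static void main(String[] args) {
            String sentence = "Dr. Srikant Varma still has possession of his ill-gotten goods; his brother's assets and liabilities.";

            // Step 1: get the singleton instance of the tokenizer
            WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;

            // Step 2: tokenize the raw text into an array of tokens
            String[] tokens = tokenizer.tokenize(sentence);

            // Step 3: print the tokens
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }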
Compile and execute the saved Java file from the Command prompt using the following commands −
On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output -
Java Code to Implement Case Folding on a given sentence
Lower Case : The conversion of a String to lower case is similar to upper case, except that we change each letter to its lower-case form.
Toggle Case : Small letters a-z are converted to capital letters A-Z and vice versa.
• We use the length() and charAt(int) methods of the String class. The length() method returns the number of characters that the String contains. The charAt() method takes an integer argument and returns the character at that index.
• The Character class contains a number of static methods which can be used to test whether a character is upper case or lower case and also to convert between the two forms.
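A sketch of case folding and toggle case along the lines described above, using length(), charAt() and the Character helpers:

    public class CaseFolding {
        // convert every character to lower case, character by character
        public static String toLowerCase(String s) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                sb.append(Character.toLowerCase(s.charAt(i)));
            }
            return sb.toString();
        }

        // swap the case of every letter: a-z become A-Z and vice versa
        public static String toggleCase(String s) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isUpperCase(c)) {
                    sb.append(Character.toLowerCase(c));
                } else if (Character.isLowerCase(c)) {
                    sb.append(Character.toUpperCase(c));
                } else {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(toLowerCase("While searching for Fox"));  // while searching for fox
            System.out.println(toggleCase("JADAVPUR us"));               // jadavpur US
        }
    }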
XML (Extensible Markup Language)
XML tags identify the data and are used to store and organize data, rather than specifying how to display it as HTML tags do.
Characteristics of XML
• Extensible:- XML allows us to create our own self-descriptive tags, or language, that suits our application.
• XML carries data; it does not present it.
• XML is a public standard.
Sample XML Template
(Tree diagram: the root element Play contains an Author element with text "Shakespeare", a title element with text "Macbeth", and a Chapter element whose title has text "Macbeth's castle".)
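Written out as markup, the tree above would correspond to something like the following (element names taken from the diagram labels):

    <play>
        <author>Shakespeare</author>
        <title>Macbeth</title>
        <chapter>
            <title>Macbeth's castle</title>
        </chapter>
    </play>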
DOM Parser to read an XML file in Java
The Document Object Model (DOM) API approach is memory-intensive compared to the SAX parser. If the XML content is large, it is recommended to use the SAX parser approach. In the DOM parsing approach we load the entire contents of an XML file into a tree structure and then iterate through the tree to read the content.
Typically, when we need to modify XML documents, the DOM parser is advantageous.
A sample implementation of a DOM parser is sketched below. Here we read the XML file and create a Document object in memory. Then we iterate through the tree and extract the required elements/attributes.
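A minimal sketch of such a DOM parser using the standard javax.xml.parsers API (the file name play.xml and the element name are illustrative):

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.File;

    public class DomParserExample {
        public static void main(String[] args) throws Exception {
            // load the entire XML file into an in-memory tree
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new File("play.xml"));
            doc.getDocumentElement().normalize();

            // iterate through all <title> elements and print their text content
            NodeList titles = doc.getElementsByTagName("title");
            for (int i = 0; i < titles.getLength(); i++) {
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }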
SAX Parser
• SAX (the Simple API for XML) is an event-based parser for XML documents.
• SAX is a streaming interface for XML, meaning the application using SAX receives event notifications about the XML document being processed, one element and attribute at a time, in sequential order starting at the top of the document.
 • It reads an XML document from top to bottom, recognizing the tokens that make up a well-formed XML document. Tokens are processed in the same order that they appear in the document.
 • The application program provides an “event” handler that must be registered with the parser.
• SAX will parse the document and, depending on the token, call the methods in the handler.
Content Handler Interface
• void startDocument():- Called at the beginning of a document.
• void endDocument():- Called at the end of a document.
• void startElement(String uri, String localName, String qName, Attributes atts):- Called at the beginning of an element.
• void endElement(String uri, String localName, String qName):- Called at the end of an element.
• void characters(char[] ch, int start, int length):- Called when character data is encountered.
Why SAX Parser
Other parsers try to bring the entire XML document into memory, parse it and build a complete parse tree. The SAX parser reads the XML document stream-wise (byte by byte).
SAX Parser Example
The example parses a list of employees. We extend DefaultHandler: when we extend a class we get all the attributes and methods of that class, with the option to override some methods. When an element starts, the SAX parser calls ‘startElement’. For the employee we read the id attribute and convert the string into an integer. If this is the first employee created, we create a list of employees so that we can keep adding to it; later we add each employee to that list. When a flag indicates that we are inside the age element, the character data is read, parsed into an int, and set on the employee; after setting the age, the boolean flag is turned back to false so that it can be reused the next time.
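A sketch of what such a handler might look like, assuming input of the form <employee id="1"><age>30</age></employee>; the Employee class and element names are assumptions for illustration:

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;
    import java.util.ArrayList;
    import java.util.List;

    public class EmployeeHandler extends DefaultHandler {
        private List<Employee> employeeList;
        private Employee currentEmployee;
        private boolean inAge = false;   // flag: are we inside an <age> element?

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equalsIgnoreCase("employee")) {
                // read the id attribute and convert it to an integer
                int id = Integer.parseInt(attributes.getValue("id"));
                currentEmployee = new Employee(id);
                if (employeeList == null) {
                    employeeList = new ArrayList<>();   // first employee: create the list
                }
            } else if (qName.equalsIgnoreCase("age")) {
                inAge = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inAge) {
                currentEmployee.setAge(Integer.parseInt(new String(ch, start, length).trim()));
                inAge = false;           // reset the flag for the next employee
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equalsIgnoreCase("employee")) {
                employeeList.add(currentEmployee);   // add the finished employee to the list
            }
        }

        public List<Employee> getEmployeeList() {
            return employeeList;
        }
    }

    class Employee {
        private final int id;
        private int age;
        Employee(int id) { this.id = id; }
        void setAge(int age) { this.age = age; }
    }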
High Level Design
Module 1:- From the Wikipedia dump (which contains a set of pages), create a WikiPage object out of each Wikipedia XML document.
The text attributes we are interested in:
 • Title
 • Infobox (summary)
 • External Links (links to other pages)
 • Category
 • Text Content
High Level Design
Module 2:- Consider a wikipage.
• Step 1 - Split the text into tokens
• Step 2 - Remove stop words
• Step 3 - Stem the word
• Step 4 - Maintain a count of where each word occurred (maintain a HashMap), as sketched below

Key: a particular word
Value: Integer[] {0,0,0,0,0} (counts for Title, Infobox, External Links, Category, Text)
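A minimal sketch of this per-page word count structure, assuming the five fields are indexed in the order Title, Infobox, External Links, Category, Text (class and method names are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class WordCounter {
        // field indices into the count array
        static final int TITLE = 0, INFOBOX = 1, LINKS = 2, CATEGORY = 3, TEXT = 4;

        // word -> {titleCount, infoboxCount, linksCount, categoryCount, textCount}
        private final Map<String, int[]> wordCount = new HashMap<>();

        public void count(String word, int field) {
            wordCount.computeIfAbsent(word, k -> new int[5])[field]++;
        }

        public Map<String, int[]> getWordCount() {
            return wordCount;
        }

        public static void main(String[] args) {
            WordCounter wc = new WordCounter();
            wc.count("macbeth", TITLE);
            wc.count("macbeth", TEXT);
            System.out.println(java.util.Arrays.toString(wc.getWordCount().get("macbeth")));  // [1, 0, 0, 0, 1]
        }
    }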
High Level Design
Module 3:- Build the inverted index.

Key: a particular word
Doc List: Docid1-setBit:tf(weight); Docid2-setBit:tf(weight); ...

If the inverted index grows too large to fit in RAM, we dump it to the hard disk (with the words sorted).
High Level Design
Module 3:- (Diagram: three such key/doc-list tables, one held in RAM and two already flushed to the hard disk, covering doc ids 1-2, 3-4 and 5-6 respectively. A code sketch follows.)
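A sketch of this flow, where each posting "docId-setBit:tf" is appended to the word's entry in a sorted TreeMap and the chunk is flushed to disk when done (file naming and method names are illustrative):

    import java.io.PrintWriter;
    import java.util.Map;
    import java.util.TreeMap;

    public class SubIndexBuilder {
        // word -> "docId1-setBit:tf;docId2-setBit:tf;..."
        private TreeMap<String, StringBuilder> allWords = new TreeMap<>();

        public void addPosting(String word, int docId, int setBit, double tf) {
            allWords.computeIfAbsent(word, k -> new StringBuilder())
                    .append(docId).append('-').append(setBit).append(':').append(tf).append(';');
        }

        // flush the sorted chunk to disk and start a fresh TreeMap for the next chunk
        public void flush(int chunkNumber) throws Exception {
            try (PrintWriter out = new PrintWriter(chunkNumber + "_subindex.txt")) {
                for (Map.Entry<String, StringBuilder> e : allWords.entrySet()) {
                    out.println(e.getKey() + "=" + e.getValue());
                }
            }
            allWords = new TreeMap<>();
        }
    }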
High Level Design
Term Frequency
• Term frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection (or corpus).
Ex:- Consider the query “borrow”; we wish to determine which documents are relevant to the query.
Approach 1:- Eliminate all documents that do not contain “borrow”.
This still leaves many documents.
Solution- Count the number of times each term occurs in each document.
How to calculate Term Frequency
Suppose a particular word occurs with the following counts in each field:

Field:    Title   Infobox   Links   Category   Body
Count:    1       3         0       1          5
Weight:   1000    20        10      50         1

Total weight = 1*1000 + 3*20 + 0*10 + 1*50 + 5*1 = 1115
Term Frequency = 1 + log(total weight) = 1 + log(1115) = 4.047

Each posting in the doc list stores this tf: Particular word → Docid1-setBit:tf(weight); Docid2-setBit:tf(weight)
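A sketch of that calculation, assuming the field weights above are configurable and the counts come from the Module 2 integer array (base-10 log, matching the 4.047 in the example):

    public class TermFrequency {
        // configurable field weights: Title, Infobox, Links, Category, Body
        private static final int[] WEIGHTS = {1000, 20, 10, 50, 1};

        public static double termFrequency(int[] counts) {
            long totalWeight = 0;
            for (int i = 0; i < counts.length; i++) {
                totalWeight += (long) counts[i] * WEIGHTS[i];
            }
            return 1 + Math.log10(totalWeight);   // tf = 1 + log(total weight)
        }

        public static void main(String[] args) {
            // counts from the slide: 1 in title, 3 in infobox, 0 in links, 1 in category, 5 in body
            System.out.println(termFrequency(new int[]{1, 3, 0, 1, 5}));   // ~4.047
        }
    }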
External Sorting
• It is a class of sorting algorithms that can handle massive amounts of data.
• External sorting is required when the data being sorted does not fit into main memory and must instead reside in the slower external memory.
One example is external merge sort:
 • It sorts chunks that fit in RAM, then merges the sorted chunks together.
 • First, divide the file into runs such that the size of each run is small enough to fit into main memory.
 • Sort each run using the merge sort algorithm.
 • Finally, merge the runs together into successively bigger runs, until the file is sorted.
External Sorting (An example)
Big file: 10, 5, 7, 8, 25, 30, 4, 1, 100, 26
Split into smaller runs: [10, 5, 7, 8, 25] and [30, 4, 1, 100, 26]
Sort each run: [5, 7, 8, 10, 25] and [1, 4, 26, 30, 100]
Merged: 1, 4, 5, 7, 8, 10, 25, 26, 30, 100
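A compact sketch of external merge sort over a file of integers, assuming one number per line and a run size small enough to fit in RAM (file handling details are illustrative):

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ExternalMergeSort {
        public static void sort(Path input, Path output, int runSize) throws IOException {
            // Phase 1: split the big file into sorted runs that fit in memory
            List<Path> runs = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<Integer> buffer = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(Integer.parseInt(line.trim()));
                    if (buffer.size() == runSize) runs.add(writeRun(buffer));
                }
                if (!buffer.isEmpty()) runs.add(writeRun(buffer));
            }

            // Phase 2: k-way merge of the sorted runs using a priority queue
            PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[0]));  // [value, runIndex]
            List<BufferedReader> readers = new ArrayList<>();
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader r = Files.newBufferedReader(runs.get(i));
                readers.add(r);
                String line = r.readLine();
                if (line != null) heap.add(new int[]{Integer.parseInt(line.trim()), i});
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!heap.isEmpty()) {
                    int[] smallest = heap.poll();                      // smallest value across all runs
                    out.write(Integer.toString(smallest[0]));
                    out.newLine();
                    String next = readers.get(smallest[1]).readLine(); // refill from the same run
                    if (next != null) heap.add(new int[]{Integer.parseInt(next.trim()), smallest[1]});
                }
            }
            for (BufferedReader r : readers) r.close();
        }

        private static Path writeRun(List<Integer> buffer) throws IOException {
            Collections.sort(buffer);                                  // sort this run in memory
            Path run = Files.createTempFile("run", ".txt");
            try (BufferedWriter out = Files.newBufferedWriter(run)) {
                for (int n : buffer) { out.write(Integer.toString(n)); out.newLine(); }
            }
            buffer.clear();
            return run;
        }
    }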
High Level Design
Module 4:- Merge the sub-index files (using external sort). Each sub-index file maps a particular word to its doc list:
File 1: Particular word → Docid1-setBit:tf(weight); Docid2-setBit:tf(weight);
File 2: Particular word → Docid1-setBit:tf(weight); Docid2-setBit:tf(weight);
High Level Design
Disadvantages of term frequency
 • Consider the query ‘the brown’.
 • Because the term “the” is so common, term frequency tends to incorrectly emphasize documents which happen to use “the” more frequently.
Solution-
Inverse document frequency (idf):
It diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
idf = log(total number of docs / number of docs in which the word has occurred)
• The final index contains:
Key: a particular word
Doc List: idf # docid1-setBit:tf(weight); docid2-setBit:tf(weight); docid3-setBit:tf(weight)
The idf records how frequently the word has occurred across all the documents, so it is given once per word; the tf records how frequently the word has occurred in that particular document, so it is required for every document.
How a wiki-page XML file looks…
The ID is almost like a page number. The infobox gives almost every detail (who the owner is, when it started, page rank, Alexa rank, and so on) and starts with two curly braces inside the text tag. A page may fall under various categories, and every Wikipedia page will have some links to other pages.
Implementation of Module 1
For every wiki page we need to build a “WikiPage” object, and inside the object these are the things we need to store and process.
We are using the SAX Parser (SAX - Simple API for XML Parsing).
We create an object saxParser: from the factory service we get a SAX parser. We call the method ‘parse’ on the saxParser object, passing the input (corpus) file along with a WikiSAXHandler created following the standard prototype. A hashmap stores the elements that we need to parse, such as Title, Text and id.
Whenever an element starts we check whether it is one of the required elements: the element name is stored in ‘qName’ and checked against the required elements.
• If the element is ‘TITLE’, it means a new page has started, so we create a new WikiPage.
• In that wikipage we put whatever the title is.
• Once that element has occurred we don’t want to see one more title element, so we set the flag to false.
The current element is now title. We assume that after the title only the ID and text will come, and that the title is always the starting element in all the Wikipedia pages.
Depending on the current element, if it is a TITLE tag (say), the StringBuilder ‘TITLE’ present inside the ‘WikiPage’ gets whatever is present inside the ‘TITLE’ tag. Similarly the Wikipedia page’s ID string gets its particular string.
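A sketch of what the parser setup and handler might look like; the class name WikiSAXHandler follows the description above, while the WikiPage fields, element names and file name are assumptions based on Module 1:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.File;

    public class WikiSAXHandler extends DefaultHandler {
        private WikiPage currentPage;
        private String currentElement = "";

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equalsIgnoreCase("title")) {
                currentPage = new WikiPage();      // a <title> marks the start of a new page
                currentElement = "title";
            } else if (qName.equalsIgnoreCase("id") || qName.equalsIgnoreCase("text")) {
                currentElement = qName.toLowerCase();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String data = new String(ch, start, length);
            switch (currentElement) {
                case "title": currentPage.title.append(data); break;
                case "id":    currentPage.id.append(data);    break;
                case "text":  currentPage.text.append(data);  break;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            currentElement = "";
            if (qName.equalsIgnoreCase("page")) {
                // hand the finished page to Module 2 for tokenization and counting
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            saxParser.parse(new File("wiki-dump.xml"), new WikiSAXHandler());
        }
    }

    class WikiPage {
        final StringBuilder title = new StringBuilder();
        final StringBuilder id = new StringBuilder();
        final StringBuilder text = new StringBuilder();
    }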
Implementation of Module 2
We count the total number of documents. Sometimes, if the document id is not present in the document, we count it ourselves and give that number as the document ID. Every string is mapped to an integer array. From the WikiPage object we get the title, which is a StringBuilder, convert it into a String, and pass it to the parseText method (the method that parses the string). Finally, the stop words are loaded.
Module 3 - Implementation Overview
For each wiki page (Wikipage1 XML, Wikipage2 XML, …, WikipageN XML) we build a WikiPage object and compute its WordCount hashmap. Every WordCount hashmap is written into a TreeMap (allWords). If RAM is full, we flush this TreeMap to the hard disk.
Building the Inverted Index
The doc list contains the information about the collection of all documents in which a particular word has occurred: we append the DocId, the SetBit and the term frequency, and put the entry into allWords. Pages are divided into chunks, and for each chunk we have a TreeMap; the old TreeMap is deleted if it is still present in memory.
Calculating Term Frequency
We take each number from the integer array and attach weights to it, which are configurable. A simple ‘OR’ checks whether the word has occurred in a particular category or not. We create a StringBuilder and append the weight and tf; whatever is returned is used in the other methods to form the key-value pair for the final TreeMap.
Merging external offset files
Each entry has the form: Word = docid_setBit:weight(tf);
For example, entries for the word “Good” spread over three sub-index files:
File1: Good=3_4:20;7_5:2000;
File2: Good=12_5:3000;
File3: Good=15_9:1300;
They are merged (external sorting's merge procedure, with the word as the key) into a single entry in the final index file (e.g. 7_index.txt):
Good#idf=12_5:3000;7_5:2000;15_9:1300;3_4:20;
Calculating idf:-
Math.log((double) totalNumberDocs / numberOfDocsInWhichWordIsPresent)
Internals of Merging
Readers are attached to the sub-index files (File_1, File_2, …, File_n); getNextLine feeds one word from each file into a priority queue, and writers are attached to the final index files (0_index.txt, 1_index.txt, …, 25_index.txt).
Implementation of Merging
We attach a writer for each of the 26 index files and associate a reader with every sub-index file. We take one word from each sub-index file and add it to the priority queue; whatever is deleted from the heap is stored in currentWord. All entries with the same word are merged into a single entry. Apart from creating the final index files, we also maintain a file in which we record every distinct word once, together with the number of times it has occurred. A sketch of this k-way merge follows.
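A sketch of the k-way merge, assuming each sub-index line has the form "word=postings;" and that the first letter of a word decides which of the 26 final index files it goes to (helper names are illustrative):

    import java.io.*;
    import java.util.*;

    public class IndexMerger {
        public static void merge(List<File> subIndexFiles) throws IOException {
            // one writer per final index file: 0_index.txt ... 25_index.txt
            BufferedWriter[] writers = new BufferedWriter[26];
            for (int i = 0; i < 26; i++) {
                writers[i] = new BufferedWriter(new FileWriter(i + "_index.txt"));
            }

            // one reader per sorted sub-index file; heap entries: [word, postings, readerIndex]
            List<BufferedReader> readers = new ArrayList<>();
            PriorityQueue<String[]> heap = new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
            for (int i = 0; i < subIndexFiles.size(); i++) {
                BufferedReader r = new BufferedReader(new FileReader(subIndexFiles.get(i)));
                readers.add(r);
                push(heap, r, i);
            }

            String currentWord = null;
            StringBuilder merged = new StringBuilder();
            while (!heap.isEmpty()) {
                String[] smallest = heap.poll();             // smallest word on the heap
                if (currentWord != null && !currentWord.equals(smallest[0])) {
                    write(writers, currentWord, merged);     // word changed: flush the merged entry
                    merged = new StringBuilder();
                }
                currentWord = smallest[0];
                merged.append(smallest[1]);                  // append this file's postings
                int idx = Integer.parseInt(smallest[2]);
                push(heap, readers.get(idx), idx);           // getNextLine from the same file
            }
            if (currentWord != null) write(writers, currentWord, merged);

            for (BufferedWriter w : writers) w.close();
            for (BufferedReader r : readers) r.close();
        }

        private static void push(PriorityQueue<String[]> heap, BufferedReader r, int idx) throws IOException {
            String line = r.readLine();
            if (line != null) {
                String[] parts = line.split("=", 2);         // "word=postings;"
                heap.add(new String[]{parts[0], parts[1], Integer.toString(idx)});
            }
        }

        private static void write(BufferedWriter[] writers, String word, StringBuilder postings) throws IOException {
            int bucket = Math.min(Math.max(word.charAt(0) - 'a', 0), 25);   // first letter picks the index file
            writers[bucket].write(word + "=" + postings);
            writers[bucket].newLine();
        }
    }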
Creating secondary indexes to the inverted indexes
Creating indexes to inverted indexes:
*_secondary.txt: Word1#lineStart, Word20#lineStart, … (one entry per block of words, pointing into the offset file)
*_offset.txt: Word#lineStart (one entry per word, pointing into the index file)
*_index.txt: Word#idf=[doc_id list]
Real Example:-
4_secondary.txt: earlier#0, eval#52, …
4_offset.txt: (at line 0) earlier#0, (at line 10) edit#25, …, (at line 52) eval#100, …, evaluation#201
4_index.txt: (at offset 0) earlier#03=2_1:1;3_1:1;, (at offset 25) edit#06=4_1:1;, …, (at offset 100) eval#05=10_12:13;, …, (at offset 201) evaluation#07=20_13:1;
Thank You