KEYWORD EXTRACTION
Industry Oriented Mini Project report submitted
in partial fulfillment of requirements
for the award of degree of
Bachelor of Technology
In
Information Technology
By
N.ADITYA SAI (Reg No: 12131A1276)
P.PHANI KRISHNA SAI (Reg No: 12131A1280)
P.V.N.K.RAJU (Reg No: 12131A1279)
Under the esteemed guidance of
Mr. HARIKRISHNASAIRAJ
Assistant Professor
Department of Information Technology
Department of Information Technology
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
(AUTONOMOUS)
(Affiliated to JNTU-K, Kakinada)
VISAKHAPATNAM
2015 – 2016
Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam
CERTIFICATE
This report on “Keyword Extraction” is a bona fide record
of the Industry Oriented Mini Project report submitted
By
N.ADITYA SAI ( Reg No: 12131A1276)
P.PHANI KRISHNA SAI ( Reg No: 12131A1280)
P.V.N.K.RAJU ( Reg No: 12131A1279)
in their VII semester in partial fulfillment of the requirements for the Award of Degree of
Bachelor of Technology
In
Information Technology
During the academic year 2015-2016
Mr. K. Harikrishnasairaj
Assistant Professor
Project Guide

Dr. K. B. Madhuri
Head of the Department
Department of Information Technology

External Examiner
ABSTRACT
This project proposes an unsupervised method for extracting keywords from a factoid question
or a paragraph. Keywords play a crucial role in retrieving the correct information for a
user's requirements. Thousands of books and papers are published every day, which makes it
very difficult to read all of the text material; a good information extraction method that
can find the required text document is needed instead, and effective keywords are a
necessity for such a method. Since a keyword is the smallest unit that expresses the meaning
of an entire document, many applications can take advantage of it, such as automatic indexing,
text summarization, information retrieval, classification, clustering, filtering, cataloging,
topic detection and tracking, information visualization, report generation and web search.
ACKNOWLEDGEMENT
We would like to express our deep sense of gratitude to our esteemed institute “Gayatri
Vidya Parishad College of Engineering (Autonomous)”, which has provided us an opportunity
to fulfill our cherished desire.
We express our sincere thanks to our Principal Dr. A. B. KOTESWARA RAO for his
encouragement to us during the course of this project.
We express our heartfelt thanks and acknowledge our indebtedness to Prof. Dr. K. B.
MADHURI, Head of the Department, Department of Information Technology.
We express our profound gratitude and our deep indebtedness to our guide Mr. K.
HARIKRISHNASAIRAJ, Assistant Professor whose valuable suggestions, guidance and
comprehensive assistance helped us a lot in realizing our present project “KEYWORD
EXTRACTION”.
We also thank Mr. D. NAGA TEJ, Assistant Prof. and project coordinator, for
guiding us throughout the project and leading us to complete it efficiently.
We would also like to thank all the members of teaching and non-teaching staff of the
Information Technology Department for all their support in completion of our project.
Project Members:
N ADITYA SAI (12131A1276)
P PHANI KRISHNA SAI (12131A1280)
P V N K RAJU (12131A1279)
CONTENTS
1. INTRODUCTION
1.1 FUNCTIONALITY
2. ANALYSIS AND SRS DOCUMENT
2.1 ANALYSIS AND SRS DOCUMENT
3. REQUIREMENTS
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
4. DESIGN
4.1 PROCESS
5. DEVELOPMENT
6. IMPLEMENTATION
OUTPUT:
1. KeyWord Extraction
7.1 INTRODUCTION
7.2 TYPES OF TESTING
8. CONCLUSION
9. BIBLIOGRAPHY
1. INTRODUCTION
Keyword Extraction is a part of a Question Answering (QA) system. Because it is difficult to
check every word in the question or paragraph that the user submits, we extract the
most significant words, or keywords, of the query, match them against the library, and
retrieve the best answers in order of precedence. Keyword Extraction is also a useful
starting point for machine translation.
1.1 FUNCTIONALITY
“Keyword Extraction” is a major part of Machine Translation. The user gives a query or a
paragraph to the system. The application first divides the sentence into words and compares
them with the stop word list that was created. Words that match are removed, and the
remaining words are stored in a new file. The words that remain are the keywords
according to the algorithm; they are passed to the software that applies parts of
speech tagging. This is helpful for Question Answering (QA) classification.
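The filtering step described above can be sketched as follows. This is a minimal illustration rather than the project's own code: the function name remove_stop_words and the small in-memory stop word set are assumptions made for the example, whereas the application reads its stop word list from a file.

```python
import string

# Hypothetical, tiny stop word set used only for this illustration.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in", "who", "was"}


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Split a query into words and drop every word found in stop_words."""
    keywords = []
    for raw in query.split():
        # Strip surrounding punctuation such as '?' or ',' before comparing
        word = raw.strip(string.punctuation)
        if word and word.lower() not in stop_words:
            keywords.append(word)
    return keywords


print(remove_stop_words("What is the capital of India?"))
# prints ['capital', 'India']
```

Holding the stop words in a set keeps each lookup at constant time, which matters when the stop word list grows to several hundred entries.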
1.2 EXISTING AND PROPOSED SYSTEM
Existing System:
The existing systems use stop word lists that are not suitable for implementing the
algorithm. Many words such as "but", "and", "also" and "or", and punctuation marks such as
the comma and semicolon, are not treated as stop words in the existing system. Hence the
complexity of the system increases.
Proposed System:
The project attempts to implement the algorithm by modifying the stop word list and
identifying the named entities. The system implements a solution that recognizes
named entities and groups them into clusters.
• The proposed system takes a question or paragraph as input and produces a summary
of the question or paragraph as output.
• The system provides a solution that recognizes named entities.
• The system provides configurable results, because the keyword list can be
modified by the user at runtime.
2. ANALYSIS AND SRS DOCUMENT
2.1 ANALYSIS AND SRS DOCUMENT
Python: Python is a widely used, general-purpose, high-level language. It supports multiple
programming paradigms, including object-oriented, imperative and functional programming
styles.
Java: Java is a general-purpose computer programming language that is concurrent, class-based
and object-oriented.
Study and types of keyword extraction: There are three types of keyword extraction methods. They are:
1. Statistical methods.
2. Linguistic methods.
3. Mixed methods.
• Statistical methods focus on non-linguistic features of the text such as term
frequency, inverse document frequency, and the position of a keyword. The benefits of
purely statistical methods are their ease of use, limited computation requirements, and
the fact that they generally produce good results.
• Linguistic methods pay attention to linguistic features such as part of speech,
syntactic structure and semantic qualities. They tend to add value, sometimes functioning
as filters for bad keywords.
• Mixed methods combine linguistic methods with statistical measures such
as term frequency and inverse document frequency.
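The statistical features named above, term frequency and inverse document frequency, can be made concrete with a short sketch. It is an illustration under stated assumptions, not part of the project: the function name tf_idf, the whitespace tokenization and the three sample sentences are all hypothetical, and it uses one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency).

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]

    # Count in how many documents each term appears
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample documents
docs = [
    "keyword extraction finds important words",
    "stop words are removed before extraction",
    "tagging assigns parts of speech to words",
]
for term, score in sorted(tf_idf(docs)[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this sample, "words" occurs in every document, so its inverse document frequency is log(1) = 0 and it receives a score of zero, while terms unique to the first document rank highest.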
Purpose:
• Keyword Extraction is a major part of a Question Answering system; our application
is useful for extracting those keywords.
• Parts of speech tagging assigns a part of speech to each keyword of the question or
paragraph, which helps to distinguish questions or to summarize the paragraph.
Queries Module:
• The sample query or a paragraph is given by the user.
Keyword Extraction Module:
• Based on the query given by the user the keywords are extracted.
Parts of Speech Tagging Module:
• The keywords which are extracted are tagged with their respective parts of speech.
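The hand-off between the last two modules can be sketched as a filter over tagged tokens. The sketch below assumes, purely for illustration, that the tagger emits tokens in the form word/TAG with Penn Treebank tags; the function name filter_by_tag and the sample tokens are hypothetical.

```python
# Tag prefixes kept as keywords: nouns, adjectives, adverbs and verbs
KEEP_PREFIXES = ("NN", "JJ", "RB", "VB")


def filter_by_tag(tagged_tokens, keep_prefixes=KEEP_PREFIXES):
    """Keep (word, tag) pairs whose tag starts with one of keep_prefixes."""
    kept = []
    for token in tagged_tokens:
        # Split on the last '/' so that words containing '/' stay intact
        word, sep, tag = token.rpartition("/")
        if sep and tag.startswith(keep_prefixes):
            kept.append((word, tag))
    return kept


# Hypothetical tagger output
tagged = ["capital/NN", "of/IN", "India/NNP", "is/VBZ", "the/DT", "largest/JJS"]
print(filter_by_tag(tagged))
# prints [('capital', 'NN'), ('India', 'NNP'), ('is', 'VBZ'), ('largest', 'JJS')]
```

Matching on tag prefixes covers the whole family of each tag (for example NN, NNS, NNP and NNPS) without listing every variant.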
3. REQUIREMENTS
CLIENT SIDE REQUIREMENTS
HARDWARE REQUIREMENTS
• PROCESSOR : INTEL PENTIUM II OR ABOVE
• RAM : 512 MB (MIN)
• HARD DISK : 20 GB (MIN)
SOFTWARE REQUIREMENTS
• OPERATING SYSTEM : UBUNTU or FEDORA (version 7 to version 21)
• PROGRAMMING LANGUAGE : JAVA, PYTHON
4. DESIGN
4.1 PROCESS
Software design is an iterative process through which requirements are translated into a
“blueprint” for constructing software. Initially, the blueprint depicts a holistic view of
the software; that is, the design is represented at a high level of abstraction. As design
iterations occur, subsequent refinement leads to design representations at much lower levels
of abstraction. These can still be traced to requirements, but the connection is more subtle.
Throughout the design process, the quality of the evolving design is assessed with a
series of formal technical reviews or design walkthroughs. Three characteristics serve as a
guide for the evaluation of a good design:
• The design must implement all of the explicit requirements contained in the
analysis model.
• The design must be a readable, understandable guide for those who generate code and
for those who test and subsequently support the software.
• The design should provide a complete picture of the software, addressing the data,
functional and behavioral domains from an implementation perspective.
4.2 IMPORTANCE OF UML IN SOFTWARE DEVELOPMENT
The Unified Modeling Language (UML) provides a standard notation, based on constructing
a model with the object-oriented paradigm, for describing software systems as well as
non-software systems, business processes for the enterprise's problem areas, and corporate
infrastructure.
The model abstracts the essential details of the underlying problem and provides a
simplified view of it, making it easier for the solution architect to work towards
building the solution.
In the context of software development, the importance of UML can be understood
through an analogy with construction. Builders normally use designs and maps to
construct buildings. The services of a civil architect are needed to create the designs and
maps that act as a reference point for the builder. Communication between architect and
builder becomes more critical as the complexity of the building's design increases.
Blueprints or architectural designs are the standard graphical language that both architects
and builders must understand to communicate effectively.
Software development is a similar process in many ways. UML has emerged as the
software blueprint methodology for business and systems analysts, designers,
programmers and everyone else involved in creating and deploying software systems in an
enterprise. UML gives everyone involved in the software development process a
common vocabulary for communicating about software design.
4.3 UML DIAGRAMS
Use Case Diagram: A use case diagram is a type of behavioral diagram defined in UML and
created from use-case analysis. Its main purpose is to show which system functions are
performed for which actors. The roles of the actors in the system can also be depicted.
Figure 1: USE CASE DIAGRAM
Sequence diagram: A sequence diagram is an interaction diagram, a subset of behavior
diagrams, that emphasizes the flow of control and data among the things in the system
being modeled.
Figure 2: SEQUENCE DIAGRAM
Class diagram: A class diagram in the Unified Modeling Language (UML) is a type of
static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among
the classes.
There are 3 classes
• User
• Backend
• Terminal
Figure 3: CLASS DIAGRAM
5. DEVELOPMENT
About Software Development:
Software development is the set of activities that results in software products. Software
development may include research, new development, modification, reuse, re-engineering,
maintenance, or any other activities that result in software products. Especially the first phase
in the software development process may involve many departments, including marketing,
engineering, research and development and general management.
The software development process includes the following steps:
• Requirement Analysis: The most important task in creating the software product is
extracting the requirements, or requirement analysis. Frequently demonstrating live code
may help reduce the risk that the requirements are incorrect. Once the general
requirements are gleaned from the client, the scope of the development should
be analyzed and clearly stated.
• Specification: It is the task of precisely describing the software to be written. In
practice, most successful specifications are written to understand and fine-tune
applications that are already developed. These are most important for external interfaces
that must remain stable.
• Architecture: The architecture of the system refers to an abstract representation of
the system. It is concerned with making sure the software system will meet the
requirements of the product.
• Design, implementation and testing: Implementation is the part of the process where
software engineers actually program the code for the project. Software testing is an
integral and important part of the software development process. This part of the process
ensures that bugs are recognized as early as possible.
• Deployment and maintenance: Deployment starts after the code is appropriately
tested, is approved for release and sold. Maintenance and enhancing software to cope
with newly discovered problems or new requirements can take far more time than the
initial development of software.
This application is mainly developed to reduce complexity. Examining each word of the
question or paragraph and comparing it with the words in the database may take a long
time and thus increases the complexity. To overcome this, we extract the keywords
and use them to retrieve the answers to the user's queries.
6. IMPLEMENTATION
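6.1 SAMPLE CODE:
Key1.py
import os

# Penn Treebank tags kept as keywords: nouns, adjectives, adverbs and verbs
KEEP_TAGS = ("NNS", "NN", "NNP", "NNPS", "JJ", "JJS", "JJR", "RB", "RBR", "RBS",
             "VB", "VBD", "VBG", "VBN", "VBP", "VBZ")
TAGGER_INPUT = '/home/deepak/Desktop/phy/CRFTagger/samples/input.txt'

# Load the stop words once instead of re-reading the file for every word
with open('stopwords.txt') as f1:
    stop_words = set(f1.read().split())

# Write every word that is not a stop word to the tagger input file
with open(TAGGER_INPUT, 'w') as f2, open('f1.txt') as fp:
    for line in fp:
        # remove question marks, commas and double quotes
        tempa = line.translate(str.maketrans('', '', '?,"'))
        for word in tempa.split():
            if word not in stop_words:
                print(word)
                f2.write(word + '\n')

# Run the tagger through its makefile target; it writes input.txt.pos
os.system('make test')

with open(TAGGER_INPUT + '.pos') as f3, open('fout.txt', 'w') as f4:
    for line2 in f3:
        for word2 in line2.split():
            # keep a tagged token only if it contains one of the wanted tags
            if any(tag in word2 for tag in KEEP_TAGS):
                f4.write(word2 + '\n')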
key2.py:
import operator
import re

debug = False
test = False


def is_number(s):
    """Return True if the string s can be parsed as an int or a float."""
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False


def load_stop_words(stop_word_file):
    """
    Utility function to load stop words from a file and return as a list of words
    @param stop_word_file Path and file name of a file containing stop words.
    @return list A list of stop words.
    """
    stop_words = []
    with open(stop_word_file) as f:
        for line in f:
            if line.strip()[0:1] != "#":
                for word in line.split():  # in case more than one per line
                    stop_words.append(word)
    return stop_words


def separate_words(text, min_word_return_size):
    """
    Utility function to return a list of all words that have a length greater than a
    specified number of characters.
    @param text The text that must be split in to words.
    @param min_word_return_size The minimum no of characters a word must have to be
    included.
    """
    splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
    words = []
    for single_word in splitter.split(text):
        current_word = single_word.strip().lower()
        # leave numbers in phrase, but don't count as words, since they tend to
        # invalidate scores of their phrases
        if (len(current_word) > min_word_return_size and current_word != ''
                and not is_number(current_word)):
            words.append(current_word)
    return words


def split_sentences(text):
    """
    Utility function to return a list of sentences.
    @param text The text that must be split in to sentences.
    """
    sentence_delimiters = re.compile('[\\[\\]\n.!?,;:\t\\-"()\'\u2019\u2013]')
    sentences = sentence_delimiters.split(text)
    return sentences


def build_stop_word_regex(stop_word_file_path):
    stop_word_list = load_stop_words(stop_word_file_path)
    stop_word_regex_list = []
    for word in stop_word_list:
        # \b matches a stop word only at word boundaries
        word_regex = r'\b' + word + r'\b'
        stop_word_regex_list.append(word_regex)
    stop_word_pattern = re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
    return stop_word_pattern


def generate_candidate_keywords(sentence_list, stopword_pattern, min_char_length=1,
                                max_words_length=5):
    phrase_list = []
    for s in sentence_list:
        # replace every stop word with '|' and split the sentence at those marks
        tmp = re.sub(stopword_pattern, '|', s.strip())
        phrases = tmp.split("|")
        for phrase in phrases:
            phrase = phrase.strip().lower()
            if phrase != "" and is_acceptable(phrase, min_char_length, max_words_length):
                phrase_list.append(phrase)
    return phrase_list


def is_acceptable(phrase, min_char_length, max_words_length):
    # a phrase must have a min length in characters
    if len(phrase) < min_char_length:
        return 0
    # a phrase must have a max number of words
    words = phrase.split()
    if len(words) > max_words_length:
        return 0
    digits = 0
    alpha = 0
    for i in range(0, len(phrase)):
        if phrase[i].isdigit():
            digits += 1
        elif phrase[i].isalpha():
            alpha += 1
    # a phrase must have at least one alpha character
    if alpha == 0:
        return 0
    # a phrase must have more alpha than digits characters
    if digits > alpha:
        return 0
    return 1


def calculate_word_scores(phraseList):
    word_frequency = {}
    word_degree = {}
    for phrase in phraseList:
        word_list = separate_words(phrase, 0)
        word_list_length = len(word_list)
        word_list_degree = word_list_length - 1
        # if word_list_degree > 3: word_list_degree = 3 #exp.
        for word in word_list:
            word_frequency.setdefault(word, 0)
            word_frequency[word] += 1
            word_degree.setdefault(word, 0)
            word_degree[word] += word_list_degree  # orig.
            # word_degree[word] += 1/(word_list_length*1.0) #exp.
    for item in word_frequency:
        word_degree[item] = word_degree[item] + word_frequency[item]
    # Calculate Word scores = deg(w)/freq(w)
    word_score = {}
    for item in word_frequency:
        word_score.setdefault(item, 0)
        word_score[item] = word_degree[item] / (word_frequency[item] * 1.0)  # orig.
        # word_score[item] = word_frequency[item]/(word_degree[item] * 1.0) #exp.
    return word_score


def generate_candidate_keyword_scores(phrase_list, word_score,
                                      min_keyword_frequency=1):
    keyword_candidates = {}
    for phrase in phrase_list:
        if min_keyword_frequency > 1:
            if phrase_list.count(phrase) < min_keyword_frequency:
                continue
        keyword_candidates.setdefault(phrase, 0)
        word_list = separate_words(phrase, 0)
        candidate_score = 0
        # a phrase scores the sum of the scores of its words
        for word in word_list:
            candidate_score += word_score[word]
        keyword_candidates[phrase] = candidate_score
    return keyword_candidates


class Rake(object):
    def __init__(self, stop_words_path, min_char_length=1, max_words_length=5,
                 min_keyword_frequency=1):
        self.__stop_words_path = stop_words_path
        self.__stop_words_pattern = build_stop_word_regex(stop_words_path)
        self.__min_char_length = min_char_length
        self.__max_words_length = max_words_length
        self.__min_keyword_frequency = min_keyword_frequency

    def run(self, text):
        sentence_list = split_sentences(text)
        phrase_list = generate_candidate_keywords(sentence_list, self.__stop_words_pattern,
                                                  self.__min_char_length,
                                                  self.__max_words_length)
        word_scores = calculate_word_scores(phrase_list)
        keyword_candidates = generate_candidate_keyword_scores(
            phrase_list, word_scores, self.__min_keyword_frequency)
        # highest-scoring phrases first
        sorted_keywords = sorted(keyword_candidates.items(), key=operator.itemgetter(1),
                                 reverse=True)
        return sorted_keywords


if test:
    text = ("Compatibility of systems of linear constraints over the set of natural numbers. "
            "Criteria of compatibility of a system of linear Diophantine equations, strict "
            "inequations, and nonstrict inequations are considered. Upper bounds for "
            "components of a minimal set of solutions and algorithms of construction of "
            "minimal generating sets of solutions for all types of systems are given. These "
            "criteria and the corresponding algorithms for constructing a minimal supporting "
            "set of solutions can be used in solving all the considered types of systems and "
            "systems of mixed types.")
    # Split text into sentences
    sentenceList = split_sentences(text)
    # stoppath = "FoxStoplist.txt"  # Fox stoplist contains "numbers", so it will not find
    # "natural numbers" like in Table 1.1
    # SMART stoplist misses some of the lower-scoring keywords, which means that the
    # top 1/3 cuts off one of the 4.0 score words
    stoppath = "RAKE/SmartStoplist.txt"
    stopwordpattern = build_stop_word_regex(stoppath)
    # generate candidate keywords
    phraseList = generate_candidate_keywords(sentenceList, stopwordpattern)
    # calculate individual word scores
    wordscores = calculate_word_scores(phraseList)
    # generate candidate keyword scores
    keywordcandidates = generate_candidate_keyword_scores(phraseList, wordscores)
    if debug:
        print(keywordcandidates)
    sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1),
                            reverse=True)
    if debug:
        print(sortedKeywords)
    totalKeywords = len(sortedKeywords)
    if debug:
        print(totalKeywords)
    # print the top third of the ranked keywords
    print(sortedKeywords[0:(totalKeywords // 3)])
    rake = Rake("SmartStoplist.txt")
    keywords = rake.run(text)
    print(keywords)
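The deg(w)/freq(w) scoring performed by calculate_word_scores can be traced by hand on a small input. The sketch below is a simplified restatement for illustration only: the function name word_scores and the two candidate phrases are hypothetical, and it splits phrases on whitespace rather than calling separate_words.

```python
from collections import defaultdict


def word_scores(phrases):
    """Score each word as deg(w) / freq(w) over a list of candidate phrases."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            # co-occurrences with the other words of the same phrase
            degree[word] += len(words) - 1
    # deg(w) also counts the word itself, so freq(w) is added before dividing
    return {w: (degree[w] + freq[w]) / freq[w] for w in freq}


# Hypothetical candidate phrases
phrases = ["linear constraints", "linear diophantine equations"]
scores = word_scores(phrases)
# A phrase scores the sum of the scores of its words
phrase_scores = {p: sum(scores[w] for w in p.split()) for p in phrases}
print(scores)
print(phrase_scores)
```

Here "linear" occurs twice with a total degree of 5, giving a score of 2.5; "constraints" scores 2.0, and "diophantine" and "equations" score 3.0 each. The phrase totals are therefore 4.5 and 8.5, so the longer phrase ranks first. Words that appear mostly inside long phrases are favoured by this measure.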
Stop word list:
a
a's
able
about
above
according
accordingly
across
actually
after
afterwards
again
against
ain't
all
allow
allows
almost
alone
along
already
also
although
always
am
among
amongst
an
and
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
around
as
aside
ask
asking
associated
at
available
away
awfully
6.2 SCREEN SHOTS
Input File:
This is the input file containing the queries or paragraphs from which keywords are
extracted; the keywords are used by the question answering system or to summarize the
paragraph.
OUTPUT:
1. KeyWord Extraction
This is the keyword output for the above sample data, obtained by removing the stop
words given in the standard list from Google.
2. Parts of speech tagging
Here the program is compiled and executed; parts of speech tagging is applied to the
keywords extracted by the above program, and the output is stored in a text document.
3. Output file
This is the output of the program after parts of speech tagging has been applied to
the keywords extracted from the questions and paragraphs given in the input file.
7. TESTING
7.1 INTRODUCTION
The development of software involves a series of productive activities, and testing is an
important one of them. This phase is a critical element of software quality assurance and
represents the ultimate review of specification, design and coding.
The main objectives of testing are as follows:
• Testing is a process of executing a program with the intent of finding an error.
• A good test case is one that has a high probability of finding an undiscovered error.
• A successful test is one that uncovers an undiscovered error.
Testing can be done in different ways. Some of the types of testing are mentioned
below. The main purpose of any type of test is to systematically uncover different
classes of errors and do so with a minimum amount of time and effort.
7.2 TYPES OF TESTING
• Unit testing
• Integration testing
• Regression testing
• System testing
• Alpha testing
• Beta testing
Testing can be done manually or by using testing tools. There are several testing tools
for different software.
Unit Testing: It is a method by which individual units of source code, sets of one or
more computer program modules together with associated control data, usage
procedures, and operating procedures, are tested to determine if they are fit for use.
Integration Testing: It is the phase in software testing in which individual software
modules are combined and tested as a group. Integration testing takes as its
input modules that have been unit tested, groups them in larger aggregates, applies
tests defined in an integration test plan to those aggregates, and delivers as its output
the integrated system ready for system testing.
Regression Testing: Regression testing is any type of software testing that seeks to
uncover new software bugs, or regressions, in existing functional and non-functional
areas of a system after changes, such as enhancements, patches or configuration
changes, have been made to them.
System Testing: System testing of software or hardware is testing conducted on a
complete, integrated system to evaluate the system's compliance with its
specified requirements.
Alpha Testing: Alpha testing is simulated or actual operational testing by potential
users/customers or an independent test team at the developers' site. Alpha testing is
often employed for off-the-shelf software as a form of internal acceptance testing,
before the software goes to beta testing.
Beta Testing: Beta testing comes after alpha testing and can be considered a form of
external user acceptance testing. Versions of the software, known as beta versions, are
released to a limited audience outside of the programming team. The software is
released to groups of people so that further testing can ensure the product has few
faults or bugs. Sometimes, beta versions are made available to the open public to
increase the feedback field to a maximal number of future users.
Each module can be tested using the following two strategies:
Black Box Testing: In this strategy some test cases are generated as input conditions
that fully execute all functional requirements for the program. This testing is used to
find errors in the following categories:
• Incorrect or missing functions
• Interface errors
• Errors in data structure or external database access
• Performance errors
• Initialization and termination errors
In this testing, only the output is checked for correctness. The logical flow of
the data is not checked.
White Box Testing: In this strategy, test cases are generated from the logic of each
module by drawing flow graphs of that module, and the logical decisions are tested on
all cases.
7.3 TEST CASES:
Case 1: Submit Query
The user enters the input manually as a question. The input is a factoid question.
The keywords of the query (factoid question) must be extracted.
Expected output: Keywords of the query must be extracted.
Observed output: Successful. Keywords of the query (question) are extracted.
Case 2: Submit the paragraph
The user enters the input manually as a paragraph.
The keywords of the paragraph must be extracted so that the user can easily
summarize the paragraph based on the keywords.
Expected output: Keywords of the paragraph must be extracted.
Observed output: Successful. Keywords of the paragraph are extracted.
Case 3: Parts of speech tagging
The extracted keywords are tagged with their respective parts of speech. Each keyword may
be a noun, pronoun, adjective, etc. Tagging is applied to both questions and paragraphs,
depending on the type of query given by the user.
Expected output: Keywords must be tagged with parts of speech.
Observed output: Successful. Keywords of the paragraph or the question are tagged with
their respective parts of speech.
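Cases 1 and 2 can also be expressed as automated unit tests. The sketch below uses Python's unittest module; extract_keywords is a hypothetical stand-in for the project's extraction step, and the stop word set and sample inputs are assumptions made for the example. Case 3 is left out because it depends on the external tagger.

```python
import unittest

# Hypothetical stop word set used only by this sketch
STOP_WORDS = {"who", "is", "the", "of", "a", "in", "and"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Hypothetical stand-in: drop punctuation and stop words from text."""
    words = (w.strip('?,."') for w in text.split())
    return [w for w in words if w and w.lower() not in stop_words]


class KeywordExtractionTests(unittest.TestCase):
    def test_submit_query(self):
        # Case 1: the input is a factoid question
        self.assertEqual(extract_keywords("Who is the president of India?"),
                         ["president", "India"])

    def test_submit_paragraph(self):
        # Case 2: the input is a paragraph of several sentences
        text = "The tagger labels words. The filter removes stop words."
        self.assertEqual(extract_keywords(text),
                         ["tagger", "labels", "words", "filter", "removes",
                          "stop", "words"])


# Run the suite without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(KeywordExtractionTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Recording the expected keywords in a test makes the "observed output" of each case repeatable whenever the stop word list is modified.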
8. CONCLUSION
Conclusion:
Keyword Extraction is an application of Natural Language Processing whose importance
has long been recognized. In this project, we have implemented a summarization
algorithm to extract keywords from a single document and obtain its summary. The main
advantage of our method is that it can recognize and group named entities, which makes
its results more accurate. Our project also extracts non-trivial keywords from
paragraphs, which is an advantage over the existing algorithm. As
more electronic documents become available, we believe our method will be useful in many
applications, especially for domain-independent keyword extraction.
9. BIBLIOGRAPHY
BIBLIOGRAPHY:
• www.enchantedlearning.com
• http://dx.doi.org/10.1007/978-3-540-85760-0_46
• www.Wikipedia.com
• Hinrich Schütze and Yoram Singer. Part-of-speech tagging using a variable memory
Markov model. In Proceedings of the 32nd Annual Meeting of the Association for
Computational Linguistics, ACL ’94, pages 181–187, Stroudsburg, PA, USA, 1994.
Association for Computational Linguistics.
43

  • 1. KEYWORD EXTRACTION Industry Oriented Mini Project report submitted in partial fulfillment of requirements for the award of degree of Bachelor of Technology In Information Technology By N.ADITYA SAI (Reg No: 12131A1276) P.PHANI KRISHNA SAI (Reg No: 12131A1280) P.V.N.K.RAJU (Reg No: 12131A1279) Under the esteemed guidance of Mr. HARIKRISHNASAIRAJ Assistant Professor Department of Information Technology Department of Information Technology GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING (AUTONOMOUS) (Affiliated to JNTU-K, Kakinada) VISAKHAPATNAM 2015 – 2016
  • 2. KEYWORD EXTRACTION Industry Oriented Mini Project report submitted in partial fulfillment of requirements for the award of degree of Bachelor of Technology In Information Technology By N.ADITYA SAI ( Reg No: 12131A1276) P.PHANI KRISHNASAI (Reg No: 12131A1280) P.V.N.K.RAJU ( Reg No: 11131A1279) Signature of the Guide Signature of the Coordinator Department of Information Technology GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING (AUTONOMOUS) (Affiliated to JNTU-K, Kakinada) VISAKHAPATNAM 2015 – 2016
  • 3. Gayatri Vidya Parishad College of Engineering (Autonomous) Visakhapatnam CERTIFICATE This report on “Keyword Extraction” is a bonafied record of the Industry oriented mini project report submitted By N.ADITYA SAI ( Reg No: 12131A1276) P.PHANI KRISHNA SAI ( Reg No: 12131A1280) P.V.N.K.RAJU ( Reg No: 12131A1279) in their VII semester in partial fulfillment of the requirements for the Award of Degree of Bachelor of Technology In Information Technology During the academic year 2015-2016 Mr. K. Harikrishnasairaj Dr. K. B. Madhuri Assistant Professor Head of the Department Project Guide Department of Information Technology External Examiner
  • 4.
  • 5. ABSTRACT An unsupervised method for extracting keywords from a factoid question or a paragraph is proposed in this project. Keyword plays a crucial role in extracting the correct information as per user requirements. Everyday thousands of books and papers are published which makes it very difficult to go through all the text material; instead there is a need of good information extraction method which can find the required text document. As such effective keywords are a necessity. Since keyword is the smallest unit which express meaning of the entire document, many applications can take advantage of it such as automatic indexing, text summarization, information retrieval, classification, clustering, filtering, cataloging, topic detection and tracking, information visualization , report generation , web searches etc.
  • 6. ACKNOWLEDGEMENT We would like to express our deep sense of gratitude to our esteemed institute “Gayatri Vidya Parishad College of Engineering (Autonomous)”, which has provided us an opportunity to fulfill our cherished desire. We express our sincere thanks to our Principal Dr. A. B. KOTESWARA RAO for his encouragement to us during the course of this project. We express our heartfelt thanks and acknowledge our indebtedness to Prof. Dr. K. B. MADHURI, Head of the Department, Department of Information Technology. We express our profound gratitude and our deep indebtedness to our guide Mr. K. HARIKRISHNASAIRAJ, Assistant Professor whose valuable suggestions, guidance and comprehensive assistance helped us a lot in realizing our present project “KEYWORD EXTRACTION”. We also thank Mr. D. NAGA TEJ, Assistant Prof. and project coordinator, for guiding us throughout the project and lead us in completing our project efficiently. We would also like to thank all the members of teaching and non-teaching staff of the Information Technology Department for all their support in completion of our project. Project Members: N ADITYA SAI (12131A1276) P PHANI KRISHNA SAI (12131A1280) P V N K RAJU (12131A1279)
  • 7.
  • 8. 1. INTRODUCTION.................................................................................................................................2 1 INTRODUCTION ...............................................................................................................3 1.1 FUNCTIONALITY...........................................................................................................3 ...............................................................................................................................................................5 2. ANALYSIS AND SRS DOCUMENT........................................................................................................5 2.1 ANALYSIS AND SRS DOCUMENT..............................................................................6 3. REQUIREMENTS...............................................................................................................................8 HARDWARE REQUIREMENTS...........................................................................................9 SOFTWARE REQUIREMENTS............................................................................................9 4. DESIGN..............................................................................................................................................10 4.1 PROCESS........................................................................................................................11 5. DEVELOPMENT.......................................................................................................16 6. 
IMPLEMENTATION.............................................................................................................................19 OUTPUT: .........................................................................................................30 1.KeyWord Extraction......................................................................................30 ................................................................................................................................................31 7.1 INTRODUCTION...........................................................................................................34 7.2 TYPES OF TESTING......................................................................................................35 8. CONCLUSION....................................................................................................................39 ..................................................................................................................................................41 ....................................................................................................................................................41 .............................................................................................................................................................42 9. BIBLIOGRAPHY................................................................................................................................42 1 CONTENTS
  • 10. 1 INTRODUCTION Keyword Extraction is a part of Question Answering (QA) system. As it is difficult to check each word in the question or paragraph which the user queries, we extract the highlighting words or keywords of the query and match them with the library and retrieve the best answers based on the precedence. Keyword Extraction is a very useful starting point for machine translation. 1.1 FUNCTIONALITY “Keyword Extraction” is a major part of Machine Translation. The user gives the query or a paragraph to the system. Firstly this application divides the sentence in to words and compares them with the stop word list which was created. If the words are matched they are removed and the remaining are stored in the new file. The words that are extracted are known as keywords according to the algorithm and they are passed to the software in which parts of speech tagging is applied. This will be helpful for Question Answering (QA) classification. 3
  • 11. 1.2 EXISTING AND PROPOSED SYSTEM Existing System: The existing systems are using the stop words which are not suitable for the implementation of algorithm. Many words such as but, and, also, or, comma, semi colon etc. are not considered as stop words in the existing system. Hence the complexity of the system is increasing. Proposed System: The project attempts to implement the algorithm by modifying the stop word list and identifying the named entities. The system implements a solution which would recognize named entities and group them in to clusters. • The proposed system will take a question or paragraph as input and produces summary of the question or paragraph as output. • The system provides a solution which would recognize named entities. • The system will provide configurable results. This is because, the keyword list can be modified by the user during the runtime. 4
  • 12. 2. ANALYSIS AND SRS DOCUMENT 5
  • 13. 2.1 ANALYSIS AND SRS DOCUMENT Python: Python is a widely used general purpose, high level language. It supports multilevel programming paradigms, including object oriented, imperative and functional programming and programming styles. Java: Java is a dynamic computer programming language that is concurrent, class based and object oriented. Study and types of keyword extraction: There are four type of summarizations. They are: 1. Statistical methods. 2. Linguistic methods. 3. Mixed methods. • Statistical methods tend to focus on non-linguistic features of the text such as term frequency, inverse document frequency, and position of a keyword. The benefits of purely statistical methods are their ease of use, limited computation requirements, and the fact that they do generally produce good results. • Linguistic methods which pay attention to linguistic features such as part-of-speech, syntactic structure and semantic qualities tend to add value, functioning sometimes as filters for bad keywords. • Mixed methods are both incorporating linguistic methods and statistical methods such as term frequency and inverse document frequency. 6
  • 14. Purpose: • Keyword Extraction is a major part of Question Answering system, in order to extract keywords our application is useful. • Parts of speech tagging is used to apply parts of speech to keywords of the questions or paragraph which helps to distinguish questions or to summarize the paragraph. Queries Module: • The sample query or a paragraph is given by the user. Keyword Extraction Module: • Based on the query given by the user the keywords are extracted. Parts of Speech Tagging Module: • The keywords which are extracted are tagged with their respective parts of speech. 7
  • 16. CLIENT SIDE REQUIREMENTS HARDWARE REQUIREMENTS • PROCESSOR : INTEL PENTIUM II OR ABOVE • RAM : 512GB(MIN) • HARDDISK : 20GB (MIN) SOFTWARE REQUIREMENTS • OPERATING SYSTEM : UBUNTU or FEDORA version7 to version21 • PROGRAMMING LANGUAGE : JAVA, PYTHON. 9
  • 18. 4.1 PROCESS Software design is an iterative process through which requirements are translated into a “blueprint” for constructing software. Initially, the blueprint depicts a holistic view of software. That is, the design is represented as a high level of abstraction. As design iteration occur, subsequent refinement leads to design representations at much lower levels of abstractions. These can still be traced to requirements, but connection is more subtle. Throughout the design process, the quality of the evolving design is assessed with a series of formal technical reviews or design walkthroughs. Three characteristics that serve as a guide for evaluation of good design: • The design must implement all of the explicit requirements contained in the analysis model. • Design must be readable, understandable guide for those who generate code and for those who test and subsequently support the software. • Design should provide a complete picture of the software, addressing the data, functional and behavioral domains from an implementation perspective. 4.2 IMPORTANCE OF UML IN SOFTWARE DEVELOPMENT 11
  • 19. The Unified Modeling Language (UML) provides a standard format via construction of a model and using object oriented paradigm for describing software systems as well as non- software systems, business processes for the enterprise's problem areas and corporate infrastructure. The model abstracts the essential details of the underlying problem and provides a simplified view of the problem so as to make easy for the solution architect to work towards building the solution. In context of the software development, the importance of UML can be comprehended using analogy of a construction process. Normally, Builders use the designs and maps to construct buildings. The services of a civil architect are needed to create designs and maps which act as reference point for the builder. The communication between architect and builder becomes critical according to the degree of complexity in the design of the building. Blueprints or Architectural designs are the standard graphical language that both architects and builders must understand for an effective communication. Software development is a similar process in many ways. UML has emerged as the software blueprint methodology for the business and systems analysts, designers, programmers and everyone involved in creating and deploying the software systems in an enterprise. The UML provides for everyone involved in software development process a common vocabulary to communicate about software design. 4.3 UML DIAGRAMS 12
  • 20. Use Case Diagram: A use case diagram is a type of behavioral diagram defined in UML and created from use-case analysis. The main purpose of a use case diagram is to show what the system functions are performed for which actors. Roles of actors in the system can be depicted. Figure 1: USE CASE DIAGRAM Sequence diagram: An interaction diagram, a subset of behavior diagrams, emphasizes the flow of control and data among the things in the system being modeled. 13
  • 21. Figure 2: SEQUENCE DIAGRAM Class diagram: A class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the 14
  • 22. system's classes, their attributes, operations (or methods), and the relationships among the classes. There are 3 classes • User • Backend • Terminal Figure 3: CLASS DIAGRAM 15
  • 24. About Software Development: Software development is the set of activities that results in software products. Software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. Especially the first phase in the software development process may involve many departments, including marketing, engineering, research and development and general management. Software development process include following steps- • Requirement Analysis: The most important task in creating the software product is extracting the requirements or requirement analysis. Frequently demonstrating live code may help reduce the risk that the requirements are incorrect. Once the general requirements are gleaned from the client, an analysis of scope of the development should be determined and clearly stated. • Specification: It is the task of precisely describing the software to be written. In practice, most successful specifications are written to understand and fine-tune applications that are already developed. These are most important for external interfaces that must remain stable. 17
  • 25. • Architecture: The architecture of the system refers to an abstract representation of the system. It is concerned with making sure the software system will meet the requirements of the product. • Design, implementation and testing: Implementation is the part of process where software engineers actually program the code for project. Software testing is integral and important part of the software development process. This part of the process ensures that bugs are recognized as early as possible. • Deployment and maintenance: Deployment starts after the code is appropriately tested, is approved for release and sold. Maintenance and enhancing software to cope with newly discovered problems or new requirements can take far more time than the initial development of software. Mainly this application is developed to reduce the complexity, monitoring each word from the question or paragraph and emulating it to the words in database may take much time and thus increases the complexity. In order to overcome it we extract the keywords and retrieve the answers to the queries of the user. 18
  • 27. 6.1: SAMPLE CODE: Key1.py import string import re import os flag=0 with open('/home/deepak/Desktop/phy/CRFTagger/samples/input.txt','w') as f2: with open('f1.txt') as fp: for line in fp: tempa=line.translate(None,'?,"') for word in tempa.split(): 20
  • 28. with open('stopwords.txt') as f1: for line1 in f1: for word1 in line1.split(): if(word1==word): flag=1 break if(flag==0): print(word) f2.write(word+'n') flag=0 os.system('make test') with open('/home/deepak/Desktop/phy/CRFTagger/samples/input.txt.pos') as f3: with open('fout.txt','w') as f4: for line2 in f3: for word2 in line2.split(): if "NNS"or"NN"or"NNP"or"NNPS"or"JJ"or"JJS"or"JJR"or"RB"or"RBR"or"RBS"or"VB"or"V BD"or"VBG"or"VBN"or"VBP"or"VBZ" in word2: f4.write(word2+'n') key2.py: import operator debug = False 21
  • 29. test = False def is_number(s): try: float(s) if '.' in s else int(s) return True except ValueError: return False def load_stop_words(stop_word_file): """ Utility function to load stop words from a file and return as a list of words @param stop_word_file Path and file name of a file containing stop words. @return list A list of stop words. """ stop_words = [] for line in open(stop_word_file): if line.strip()[0:1] != "#": for word in line.split(): # in case more than one per line stop_words.append(word) return stop_words def separate_words(text, min_word_return_size): """ Utility function to return a list of all words that are have a length greater than a specified number of characters. @param text The text that must be split in to words. @param min_word_return_size The minimum no of characters a word must have to be included. """ splitter = re.compile('[^a-zA-Z0-9_+-/]') words = [] for single_word in splitter.split(text): current_word = single_word.strip().lower() #leave numbers in phrase, but don't count as words, since they tend to invalidate scores of their phrases if len(current_word) > min_word_return_size and current_word != '' and not is_number(current_word): words.append(current_word) 22
  • 30. return words def split_sentences(text): """ Utility function to return a list of sentences. @param text The text that must be split in to sentences. """ sentence_delimiters = re.compile(u'[[]n.!?,;:t-"()'u2019u2013]') sentences = sentence_delimiters.split(text) return sentences def build_stop_word_regex(stop_word_file_path): stop_word_list = load_stop_words(stop_word_file_path) stop_word_regex_list = [] for word in stop_word_list: word_regex = 'b' + word + 'b' stop_word_regex_list.append(word_regex) stop_word_pattern = re.compile('|'.join(stop_word_regex_list), re.IGNORECASE) return stop_word_pattern def generate_candidate_keywords(sentence_list, stopword_pattern, min_char_length=1, max_words_length=5): phrase_list = [] for s in sentence_list: tmp = re.sub(stopword_pattern, '|', s.strip()) phrases = tmp.split("|") for phrase in phrases: phrase = phrase.strip().lower() if phrase != "" and is_acceptable(phrase, min_char_length, max_words_length): phrase_list.append(phrase) return phrase_list def is_acceptable(phrase, min_char_length, max_words_length): # a phrase must have a min length in characters if len(phrase) < min_char_length: return 0 # a phrase must have a max number of words words = phrase.split() 23
  • 31. if len(words) > max_words_length: return 0 digits = 0 alpha = 0 for i in range(0, len(phrase)): if phrase[i].isdigit(): digits += 1 elif phrase[i].isalpha(): alpha += 1 # a phrase must have at least one alpha character if alpha == 0: return 0 # a phrase must have more alpha than digits characters if digits > alpha: return 0 return 1 def calculate_word_scores(phraseList): word_frequency = {} word_degree = {} for phrase in phraseList: word_list = separate_words(phrase, 0) word_list_length = len(word_list) word_list_degree = word_list_length – 1 #if word_list_degree > 3: word_list_degree = 3 #exp. for word in word_list: word_frequency.setdefault(word, 0) word_frequency[word] += 1 word_degree.setdefault(word, 0) word_degree[word] += word_list_degree #orig. #word_degree[word] += 1/(word_list_length*1.0) #exp. for item in word_frequency: word_degree[item] = word_degree[item] + word_frequency[item] # Calculate Word scores = deg(w)/frew(w) word_score = {} for item in word_frequency: word_score.setdefault(item, 0) word_score[item] = word_degree[item] / (word_frequency[item] * 1.0) #orig. #word_score[item] = word_frequency[item]/(word_degree[item] * 1.0) #exp. return word_score 24
  • 32. def generate_candidate_keyword_scores(phrase_list, word_score, min_keyword_frequency=1): keyword_candidates = {} for phrase in phrase_list: if min_keyword_frequency > 1: if phrase_list.count(phrase) < min_keyword_frequency: Continue keyword_candidates.setdefault(phrase, 0) word_list = separate_words(phrase, 0) candidate_score = 0 for word in word_list: candidate_score += word_score[word] keyword_candidates[phrase] = candidate_score return keyword_candidates class Rake(object): def __init__(self, stop_words_path, min_char_length=1, max_words_length=5, min_keyword_frequency=1): self.__stop_words_path = stop_words_path self.__stop_words_pattern = build_stop_word_regex(stop_words_path) self.__min_char_length = min_char_length self.__max_words_length = max_words_length self.__min_keyword_frequency = min_keyword_frequency def run(self, text): sentence_list = split_sentences(text) phrase_list = generate_candidate_keywords(sentence_list, self.__stop_words_pattern, self.__min_char_length, self.__max_words_length) word_scores = calculate_word_scores(phrase_list) keyword_candidates = generate_candidate_keyword_scores(phrase_list, word_scores, self.__min_keyword_frequency) sorted_keywords = sorted(keyword_candidates.iteritems(), key=operator.itemgetter(1), 25
  • 33. reverse=True) return sorted_keywords if test: text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types." # Split text into sentences sentenceList = split_sentences(text) #stoppath = "FoxStoplist.txt" #Fox stoplist contains "numbers", so it will not find "natural numbers" like in Table 1.1 stoppath = "RAKE/SmartStoplist.txt" #SMART stoplist misses some of the lower-scoring keywords in which means that the top 1/3 cuts off one of the 4.0 score words stopwordpattern = build_stop_word_regex(stoppath) # generate candidate keywords phraseList = generate_candidate_keywords(sentenceList, stopwordpattern) # calculate individual word scores wordscores = calculate_word_scores(phraseList) # generate candidate keyword scores keywordcandidates = generate_candidate_keyword_scores(phraseList, wordscores) if debug: print keywordcandidates sortedKeywords = sorted(keywordcandidates.iteritems(), key=operator.itemgetter(1), reverse=True) if debug: print sortedKeywords totalKeywords = len(sortedKeywords) 26
  • 34. if debug: print totalKeywords print sortedKeywords[0:(totalKeywords / 3)] rake = Rake("SmartStoplist.txt") keywords = rake.run(text) print keywords Stop word list: a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost 27
  • 37. This is the input file for the queries or the paragraphs which is used to extract the keywords that are useful for the question answering system or for summarization for the paragraph. OUTPUT: 1. KeyWord Extraction 30
  • 38. Here it is the output of the keywords of the above sample data which is obtained by the removing the stop words which are given by the standards of google corporation. 2. Parts of speech tagging 31
  • 39. Here the program is compiled and executed and hence the keywords extracted from the above program are taken to which parts of speech tagging is applied and therefore output is stored in text document. 3. Output file 32
  • 40. This is the output of the program after applying the parts of speech tagging for the keywords that are extracted from the questions and paragraph that are given in the input file. 33
  • 41. 7. TESTING 7.1 INTRODUCTION The development of software involves series of productive activities and testing is an important activity of them. This phase is a critical element of software quality assurance and represents the ultimate review of specification, coding and testing. 34
  • 42. The main objectives of testing are as follows: • Testing is a process of executing a program with the intent of finding an error. • A good test case is one that has a high probability of finding an undiscovered error. • A successful test is one uncovers an undiscovered error. Testing can be done in different ways. Some of the types of testing are mentioned below. The main purpose of any type of test is to systematically uncover different classes of errors and do so with a minimum amount of time and effort. 7.2 TYPES OF TESTING • Unit testing • Integration testing • Regression testing • System testing • Alpha testing • Beta testing Testing can be done manually or by using testing tools. There are several testing tools for different software. Unit Testing: It is a method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine if they are fit for use. 35
  • 43. Integration Testing: It is the phase in software testing in which individual software modules are combined and tested as a group Integration testing takes as its input modules that have been unit tested, groups them in larger aggregates, applies tests defined in an integration test plan to those aggregates, and delivers as its output the integrated system ready for system testing. Regression Testing: Regression testing is any type of software testing that seeks to uncover new software bugs, or regressions, in existing functional and non- functional areas of a system after changes, such as enhancements, patches or configuration changes, have been made to them. System Testing: System testing of software or hardware is testing conducted on a complete, integrated system to evaluate the system's compliance with its specified requirements. Alpha Testing: Alpha testing is simulated or actual operational testing by potential users/customers or an independent test team at the developers' site. Alpha testing is often employed for off-the-shelf software as a form of internal acceptance testing, before the software goes to beta testing. Beta Testing: Beta testing comes after alpha testing and can be considered a form of external user acceptance testing. Versions of the software, known as beta versions, are released to a limited audience outside of the programming team. The software is released to groups of people so that further testing can ensure the product has few faults or bugs. Sometimes, beta versions are made available to the open public to increase the feedback field to a maximal number of future users. Each module can be tested using the following two strategies: 36
  • 44. Black Box Testing: In this strategy some test cases are generated as input conditions that fully execute all functional requirements for the program. This testing is used to find errors in the following categories: • Incorrect or missing functions • Interface errors • Errors in data structure or external database access • Performance errors • Initialization and termination errors In this testing, only the output is checked for correctness. The logical flow of the data is not checked. White Box Testing: In this test cases are generated on the logic of each module by drawing flow graphs of that module and logical decisions are tested on all the cases. 7.3 TEST CASES: Case 1: Submit Query 37
  • 45. The user enters the input manually as a question. The input may be a factoid question. The keywords of the query (factoid question) must be extracted. Expected output: Keywords of the query must be extracted. Observed output: Successful. Keywords of the query (question) are extracted. Case 2: Submit the paragraph The user enters the input manually as a paragraph. The keywords of the paragraph must be extracted so that the user can easily summarize the paragraph based on the keywords. Expected output: Keywords of the paragraph must be extracted. Observed output: Successful. Keywords of the paragraph are extracted. Case 3: Parts of speech tagging The extracted keywords are tagged with their respective parts of speech. A tag may be noun, pronoun, adjective, etc. The tagging is done for both questions and paragraphs, depending on the type of query given by the user. Expected output: Keywords must be tagged with their parts of speech. Observed output: Successful. Keywords of the paragraph or the question are tagged with their respective parts of speech.
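The three test cases above can be sketched as automated checks. This is only a minimal illustration, not the project's implementation: the stop-word list, the toy lexicon, and the functions `extract_keywords`, `tag_keywords` and `run_test_cases` are hypothetical names introduced here, and a simple stop-word filter stands in for the project's extraction algorithm, which the report does not reproduce in this section.

```python
import re

# Hypothetical stop-word list (assumption; the report does not specify one).
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and",
    "who", "what", "when", "where", "which", "how", "did", "does", "by",
    "for", "it", "its",
}

# Toy lexicon standing in for a real part-of-speech tagger (assumption).
TOY_LEXICON = {"invented": "VERB", "telephone": "NOUN", "century": "NOUN"}


def extract_keywords(text):
    """Return lower-cased non-stop-word tokens in order of first appearance."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    keywords = []
    for token in tokens:
        if token not in STOP_WORDS and token not in keywords:
            keywords.append(token)
    return keywords


def tag_keywords(keywords):
    """Pair each keyword with a tag from the toy lexicon, or UNKNOWN."""
    return [(word, TOY_LEXICON.get(word, "UNKNOWN")) for word in keywords]


def run_test_cases():
    """Run the three black-box cases; only outputs are checked."""
    results = {}

    # Case 1: submit a factoid question.
    question = "Who invented the telephone?"
    results["case1"] = extract_keywords(question) == ["invented", "telephone"]

    # Case 2: submit a paragraph.
    paragraph = ("The telephone was invented in the nineteenth century. "
                 "The telephone changed communication.")
    keywords = extract_keywords(paragraph)
    results["case2"] = "telephone" in keywords and "the" not in keywords

    # Case 3: tag the extracted keywords with parts of speech.
    tagged = tag_keywords(extract_keywords(question))
    results["case3"] = tagged == [("invented", "VERB"), ("telephone", "NOUN")]

    return results


if __name__ == "__main__":
    for name, passed in run_test_cases().items():
        print(name, "successful" if passed else "failed")
```

Each check compares only the observed output with the expected output, which matches the black box strategy described earlier; a real run would replace the toy lexicon with an actual part-of-speech tagger.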
  • 46. 8. CONCLUSION Keyword extraction is an application of Natural Language Processing whose importance has long been recognized. In this project, we have implemented a summarization
  • 47. algorithm to extract keywords from a single document and obtain its summary. The main advantage of our method is that it produces more accurate results, as it can recognize and group named entities. Our project also extracts non-trivial keywords from paragraphs, which is an advantage over the existing algorithm. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.