KEYWORD EXTRACTION
Industry Oriented Mini Project report submitted
in partial fulfillment of requirements
for the award of degree of
Bachelor of Technology
In
Information Technology
By
N.ADITYA SAI (Reg No: 12131A1276)
P.PHANI KRISHNA SAI (Reg No: 12131A1280)
P.V.N.K.RAJU (Reg No: 12131A1279)
Under the esteemed guidance of
Mr. K. HARIKRISHNASAIRAJ
Assistant Professor
Department of Information Technology
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
(AUTONOMOUS)
(Affiliated to JNTU-K, Kakinada)
VISAKHAPATNAM
2015 – 2016
Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam
CERTIFICATE
This report on “Keyword Extraction” is a bona fide record
of the Industry Oriented Mini Project report submitted
By
N.ADITYA SAI ( Reg No: 12131A1276)
P.PHANI KRISHNA SAI ( Reg No: 12131A1280)
P.V.N.K.RAJU ( Reg No: 12131A1279)
in their VII semester in partial fulfillment of the requirements for the Award of Degree of
Bachelor of Technology
In
Information Technology
During the academic year 2015-2016
Mr. K. Harikrishnasairaj, Assistant Professor, Project Guide
Dr. K. B. Madhuri, Head of the Department, Department of Information Technology
External Examiner
ABSTRACT
This project proposes an unsupervised method for extracting keywords from a factoid question
or a paragraph. Keywords play a crucial role in retrieving the correct information for a
user's requirements. Thousands of books and papers are published every day, which makes it
very difficult to read all the text material; a good information extraction method is needed
to find the required text document, and effective keywords are therefore a necessity. Since a
keyword is the smallest unit that expresses the meaning of an entire document, many
applications can take advantage of it, such as automatic indexing, text summarization,
information retrieval, classification, clustering, filtering, cataloging, topic detection and
tracking, information visualization, report generation and web search.
ACKNOWLEDGEMENT
We would like to express our deep sense of gratitude to our esteemed institute “Gayatri
Vidya Parishad College of Engineering (Autonomous)”, which has provided us an opportunity
to fulfill our cherished desire.
We express our sincere thanks to our Principal Dr. A. B. KOTESWARA RAO for his
encouragement to us during the course of this project.
We express our heartfelt thanks and acknowledge our indebtedness to Prof. Dr. K. B.
MADHURI, Head of the Department, Department of Information Technology.
We express our profound gratitude and our deep indebtedness to our guide Mr. K.
HARIKRISHNASAIRAJ, Assistant Professor whose valuable suggestions, guidance and
comprehensive assistance helped us a lot in realizing our present project “KEYWORD
EXTRACTION”.
We also thank Mr. D. NAGA TEJ, Assistant Professor and project coordinator, for
guiding us throughout the project and leading us to complete it efficiently.
We would also like to thank all the members of teaching and non-teaching staff of the
Information Technology Department for all their support in completion of our project.
Project Members:
N ADITYA SAI (12131A1276)
P PHANI KRISHNA SAI (12131A1280)
P V N K RAJU (12131A1279)
1 INTRODUCTION
Keyword Extraction is a part of a Question Answering (QA) system. Because it is difficult to
check every word in the question or paragraph that the user submits, we extract the
salient words (keywords) of the query, match them against the library, and retrieve the
best answers in order of precedence. Keyword Extraction is also a useful starting point for
machine translation.
1.1 FUNCTIONALITY
“Keyword Extraction” is a major part of Machine Translation. The user gives a query or a
paragraph to the system. The application first divides the input into words and compares
them with the stop word list that was created. Words that match are removed and the
remaining words are stored in a new file. The words that remain are the keywords according
to the algorithm; they are passed to the software that applies parts of speech tagging.
This is helpful for Question Answering (QA) classification.
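The stop word filtering step described above can be sketched as follows. This is a minimal illustration rather than the project's own code: the function name extract_keywords and the small inline stop word set are assumptions made for the example, whereas the project reads its stop word list from a file.

```python
import string

# Hypothetical, deliberately tiny stop word set for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "what", "who", "which", "to", "and"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Split text into words and drop punctuation and stop words."""
    keywords = []
    for token in text.split():
        # Strip leading/trailing punctuation such as '?' or ','.
        word = token.strip(string.punctuation)
        if word and word.lower() not in stop_words:
            keywords.append(word)
    return keywords


if __name__ == "__main__":
    print(extract_keywords("What is the capital of India?"))
```

Holding the stop words in a set keeps each membership check constant time, so the cost grows with the length of the input rather than with the length of the stop word list.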
1.2 EXISTING AND PROPOSED SYSTEM
Existing System:
The existing systems use stop word lists that are not suitable for implementing the
algorithm. Many words and symbols, such as but, and, also, or, the comma and the semicolon,
are not treated as stop words in the existing system. Hence the complexity of the system
increases.
Proposed System:
The project attempts to implement the algorithm by modifying the stop word list and
identifying the named entities. The system implements a solution which recognizes
named entities and groups them into clusters.
• The proposed system takes a question or paragraph as input and produces a summary
of the question or paragraph as output.
• The system provides a solution which recognizes named entities.
• The system provides configurable results, because the keyword list can be
modified by the user at runtime.
2.1 ANALYSIS AND SRS DOCUMENT
Python: Python is a widely used, general-purpose, high-level language. It supports multiple
programming paradigms, including object-oriented, imperative and functional programming.
Java: Java is a general-purpose programming language that is concurrent, class-based and
object-oriented.
Study and types of keyword extraction: There are three types of keyword extraction methods.
They are:
1. Statistical methods.
2. Linguistic methods.
3. Mixed methods.
• Statistical methods tend to focus on non-linguistic features of the text such as term
frequency, inverse document frequency, and position of a keyword. The benefits of
purely statistical methods are their ease of use, limited computation requirements, and
the fact that they do generally produce good results.
• Linguistic methods pay attention to linguistic features such as part of speech,
syntactic structure and semantic qualities. They tend to add value, functioning sometimes
as filters for bad keywords.
• Mixed methods combine linguistic methods with statistical measures such
as term frequency and inverse document frequency.
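As an illustration of the statistical features mentioned above, the sketch below computes a term frequency and inverse document frequency score for every term. It assumes one common variant, tf = count / document length and idf = log(N / df); other weightings exist, and the function name tf_idf is chosen only for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents is a list of token lists. Uses tf = count / doc_length and
    idf = log(N / df), one common variant among several.
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    docs = [["keyword", "extraction", "keyword"], ["text", "extraction"]]
    for doc_scores in tf_idf(docs):
        print(doc_scores)
```

A term that occurs in every document, such as "extraction" in this example, gets an idf of zero, which is why purely statistical methods tend to rank it below terms that are specific to one document.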
Purpose:
• Keyword Extraction is a major part of a Question Answering system; our application is
used to extract the keywords.
• Parts of speech tagging assigns a part of speech to each keyword of the question or
paragraph, which helps to distinguish questions or to summarize the paragraph.
Queries Module:
• The sample query or a paragraph is given by the user.
Keyword Extraction Module:
• Based on the query given by the user the keywords are extracted.
Parts of Speech Tagging Module:
• The keywords which are extracted are tagged with their respective parts of speech.
4.1 PROCESS
Software design is an iterative process through which requirements are translated into a
“blueprint” for constructing software. Initially, the blueprint depicts a holistic view of
the software. That is, the design is represented at a high level of abstraction. As design
iterations occur, subsequent refinement leads to design representations at much lower levels
of abstraction. These can still be traced to requirements, but the connection is more subtle.
Throughout the design process, the quality of the evolving design is assessed with a
series of formal technical reviews or design walkthroughs. Three characteristics serve as a
guide for the evaluation of a good design:
• The design must implement all of the explicit requirements contained in the
analysis model.
• The design must be a readable, understandable guide for those who generate code and
for those who test and subsequently support the software.
• The design should provide a complete picture of the software, addressing the data,
functional and behavioral domains from an implementation perspective.
4.2 IMPORTANCE OF UML IN SOFTWARE DEVELOPMENT
The Unified Modeling Language (UML) provides a standard format, via construction
of a model and using the object-oriented paradigm, for describing software systems as well as
non-software systems, business processes for the enterprise's problem areas, and corporate
infrastructure.
The model abstracts the essential details of the underlying problem and provides a
simplified view of the problem, making it easier for the solution architect to work towards
building the solution.
In the context of software development, the importance of UML can be understood
through the analogy of a construction process. Normally, builders use designs and maps to
construct buildings. The services of a civil architect are needed to create the designs and
maps which act as a reference point for the builder. The communication between architect and
builder becomes more critical as the complexity of the building's design increases.
Blueprints or architectural designs are the standard graphical language that both architects
and builders must understand for effective communication.
Software development is a similar process in many ways. UML has emerged as the
software blueprint methodology for the business and systems analysts, designers,
programmers and everyone involved in creating and deploying the software systems in an
enterprise. The UML provides for everyone involved in software development process a
common vocabulary to communicate about software design.
4.3 UML DIAGRAMS
Use Case Diagram: A use case diagram is a type of behavioral diagram defined in UML and
created from use-case analysis. Its main purpose is to show which system functions are
performed for which actors. The roles of the actors in the system can also be
depicted.
Figure 1: USE CASE DIAGRAM
Sequence diagram: A sequence diagram is an interaction diagram, a subset of behavior
diagrams, that emphasizes the flow of control and data among the things in the system being
modeled.
Figure 2: SEQUENCE DIAGRAM
Class diagram: A class diagram in the Unified Modeling Language (UML) is a type of
static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among
the classes.
There are 3 classes
• User
• Backend
• Terminal
Figure 3: CLASS DIAGRAM
About Software Development:
Software development is the set of activities that results in software products. Software
development may include research, new development, modification, reuse, re-engineering,
maintenance, or any other activities that result in software products. Especially the first phase
in the software development process may involve many departments, including marketing,
engineering, research and development and general management.
The software development process includes the following steps:
• Requirement Analysis: The most important task in creating the software product is
extracting the requirements, or requirement analysis. Frequently demonstrating live code
may help reduce the risk that the requirements are incorrect. Once the general
requirements are gleaned from the client, the scope of the development should
be analyzed and clearly stated.
• Specification: It is the task of precisely describing the software to be written. In
practice, most successful specifications are written to understand and fine-tune
applications that are already developed. These are most important for external interfaces
that must remain stable.
• Architecture: The architecture of the system refers to an abstract representation of
the system. It is concerned with making sure the software system will meet the
requirements of the product.
• Design, implementation and testing: Implementation is the part of the process where
software engineers actually program the code for the project. Software testing is an
integral and important part of the software development process. This part of the process
ensures that bugs are recognized as early as possible.
• Deployment and maintenance: Deployment starts after the code is appropriately
tested, is approved for release and sold. Maintenance and enhancing software to cope
with newly discovered problems or new requirements can take far more time than the
initial development of software.
This application is developed mainly to reduce complexity: checking each word of the
question or paragraph and comparing it with the words in the database may take much
time and thus increases the complexity. To overcome this, we extract the keywords
and use them to retrieve the answers to the user's queries.
6.1 SAMPLE CODE:
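Key1.py
import os

# Punctuation removed from every input line before it is split into words.
STRIP_CHARS = str.maketrans('', '', '?,"')
# Penn Treebank tags to keep: nouns, adjectives, adverbs and verbs.
KEEP_TAGS = ("NNS", "NN", "NNP", "NNPS", "JJ", "JJS", "JJR", "RB", "RBR", "RBS",
             "VB", "VBD", "VBG", "VBN", "VBP", "VBZ")

# Load the stop word list once instead of re-reading the file for every word.
with open('stopwords.txt') as f1:
    stop_words = set(f1.read().split())

# Write every non-stop word of f1.txt to the tagger's input file, one per line.
with open('/home/deepak/Desktop/phy/CRFTagger/samples/input.txt', 'w') as f2:
    with open('f1.txt') as fp:
        for line in fp:
            tempa = line.translate(STRIP_CHARS)
            for word in tempa.split():
                if word not in stop_words:
                    print(word)
                    f2.write(word + '\n')

# Invoke the tagger's make target; the tagged output is then read from input.txt.pos.
os.system('make test')

# Keep only the tagged tokens that contain one of the tags listed above.
with open('/home/deepak/Desktop/phy/CRFTagger/samples/input.txt.pos') as f3:
    with open('fout.txt', 'w') as f4:
        for line2 in f3:
            for word2 in line2.split():
                if any(tag in word2 for tag in KEEP_TAGS):
                    f4.write(word2 + '\n')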
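The tag filter at the end of Key1.py can also be written by separating each token into its word and its tag, which avoids matching tag letters that happen to occur inside a word. The sketch below assumes that the tagger writes whitespace-separated tokens of the form word/TAG; that format and the helper name keep_content_words are assumptions made for illustration.

```python
# Penn Treebank tag prefixes for nouns, adjectives, adverbs and verbs.
CONTENT_TAG_PREFIXES = ("NN", "JJ", "RB", "VB")


def keep_content_words(tagged_line):
    """Return (word, tag) pairs whose tag marks a noun, adjective, adverb or verb.

    Assumes whitespace-separated tokens of the form word/TAG.
    """
    kept = []
    for token in tagged_line.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, sep, tag = token.rpartition("/")
        if sep and tag.startswith(CONTENT_TAG_PREFIXES):
            kept.append((word, tag))
    return kept


if __name__ == "__main__":
    print(keep_content_words("capital/NN of/IN India/NNP ?/."))
```

The four prefixes cover the same sixteen tags that Key1.py lists, since every one of those tags starts with NN, JJ, RB or VB.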
key2.py:
import re
import operator

debug = False
test = False
def is_number(s):
    """Return True if the string s parses as an int or a float."""
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False


def load_stop_words(stop_word_file):
    """
    Utility function to load stop words from a file and return as a list of words
    @param stop_word_file Path and file name of a file containing stop words.
    @return list A list of stop words.
    """
    stop_words = []
    with open(stop_word_file) as stop_file:
        for line in stop_file:
            if line.strip()[0:1] != "#":
                for word in line.split():  # in case more than one per line
                    stop_words.append(word)
    return stop_words
def separate_words(text, min_word_return_size):
    """
    Utility function to return a list of all words that have a length greater than a
    specified number of characters.
    @param text The text that must be split into words.
    @param min_word_return_size The minimum number of characters a word must have to be
    included.
    """
    splitter = re.compile(r'[^a-zA-Z0-9_+\-/]')
    words = []
    for single_word in splitter.split(text):
        current_word = single_word.strip().lower()
        # leave numbers in phrase, but don't count as words, since they tend to
        # invalidate scores of their phrases
        if (len(current_word) > min_word_return_size and current_word != ''
                and not is_number(current_word)):
            words.append(current_word)
    return words
def split_sentences(text):
    """
    Utility function to return a list of sentences.
    @param text The text that must be split into sentences.
    """
    sentence_delimiters = re.compile('[\\[\\]\n.!?,;:\t\\-"()\'\u2019\u2013]')
    sentences = sentence_delimiters.split(text)
    return sentences


def build_stop_word_regex(stop_word_file_path):
    """Build one case-insensitive regex that matches any stop word as a whole word."""
    stop_word_list = load_stop_words(stop_word_file_path)
    stop_word_regex_list = []
    for word in stop_word_list:
        word_regex = r'\b' + word + r'\b'
        stop_word_regex_list.append(word_regex)
    stop_word_pattern = re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
    return stop_word_pattern
def generate_candidate_keywords(sentence_list, stopword_pattern, min_char_length=1,
                                max_words_length=5):
    """Split each sentence at stop words and keep the acceptable phrases in between."""
    phrase_list = []
    for s in sentence_list:
        tmp = re.sub(stopword_pattern, '|', s.strip())
        phrases = tmp.split("|")
        for phrase in phrases:
            phrase = phrase.strip().lower()
            if phrase != "" and is_acceptable(phrase, min_char_length, max_words_length):
                phrase_list.append(phrase)
    return phrase_list


def is_acceptable(phrase, min_char_length, max_words_length):
    # a phrase must have a min length in characters
    if len(phrase) < min_char_length:
        return 0
    # a phrase must have a max number of words
    words = phrase.split()
    if len(words) > max_words_length:
        return 0
    digits = 0
    alpha = 0
    for i in range(0, len(phrase)):
        if phrase[i].isdigit():
            digits += 1
        elif phrase[i].isalpha():
            alpha += 1
    # a phrase must have at least one alpha character
    if alpha == 0:
        return 0
    # a phrase must have more alpha than digits characters
    if digits > alpha:
        return 0
    return 1
def calculate_word_scores(phraseList):
    word_frequency = {}
    word_degree = {}
    for phrase in phraseList:
        word_list = separate_words(phrase, 0)
        word_list_length = len(word_list)
        word_list_degree = word_list_length - 1
        # if word_list_degree > 3: word_list_degree = 3 #exp.
        for word in word_list:
            word_frequency.setdefault(word, 0)
            word_frequency[word] += 1
            word_degree.setdefault(word, 0)
            word_degree[word] += word_list_degree  # orig.
            # word_degree[word] += 1/(word_list_length*1.0) #exp.
    for item in word_frequency:
        word_degree[item] = word_degree[item] + word_frequency[item]
    # Calculate Word scores = deg(w)/freq(w)
    word_score = {}
    for item in word_frequency:
        word_score.setdefault(item, 0)
        word_score[item] = word_degree[item] / (word_frequency[item] * 1.0)  # orig.
        # word_score[item] = word_frequency[item]/(word_degree[item] * 1.0) #exp.
    return word_score
                                 reverse=True)
        return sorted_keywords
if test:
    text = ("Compatibility of systems of linear constraints over the set of natural numbers. "
            "Criteria of compatibility of a system of linear Diophantine equations, strict "
            "inequations, and nonstrict inequations are considered. Upper bounds for "
            "components of a minimal set of solutions and algorithms of construction of "
            "minimal generating sets of solutions for all types of systems are given. These "
            "criteria and the corresponding algorithms for constructing a minimal supporting "
            "set of solutions can be used in solving all the considered types of systems "
            "and systems of mixed types.")
    # Split text into sentences
    sentenceList = split_sentences(text)
    # stoppath = "FoxStoplist.txt"  # Fox stoplist contains "numbers", so it will not find
    # "natural numbers" like in Table 1.1
    # SMART stoplist misses some of the lower-scoring keywords, which means that the
    # top 1/3 cuts off one of the 4.0 score words
    stoppath = "RAKE/SmartStoplist.txt"
    stopwordpattern = build_stop_word_regex(stoppath)
    # generate candidate keywords
    phraseList = generate_candidate_keywords(sentenceList, stopwordpattern)
    # calculate individual word scores
    wordscores = calculate_word_scores(phraseList)
    # generate candidate keyword scores
    keywordcandidates = generate_candidate_keyword_scores(phraseList, wordscores)
    if debug:
        print(keywordcandidates)
    sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1),
                            reverse=True)
    if debug:
        print(sortedKeywords)
    totalKeywords = len(sortedKeywords)
    if debug:
        print(totalKeywords)
    # Integer division: keep the top third of the candidates.
    print(sortedKeywords[0:(totalKeywords // 3)])
    rake = Rake("SmartStoplist.txt")
    keywords = rake.run(text)
    print(keywords)
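The test code calls generate_candidate_keyword_scores, whose definition does not appear in the listing. In RAKE, the score of a candidate phrase is the sum of the scores of its words, and each word score is deg(w)/freq(w) as computed in calculate_word_scores. The following self-contained sketch works under that assumption; the names word_scores and phrase_scores are hypothetical, and a plain whitespace split stands in for separate_words.

```python
def word_scores(phrases):
    """Score each word as deg(w) / freq(w), as in calculate_word_scores."""
    freq = {}
    degree = {}
    for phrase in phrases:
        words = phrase.split()  # simplified stand-in for separate_words
        for word in words:
            freq[word] = freq.get(word, 0) + 1
            # Co-occurrences with the other words of this phrase.
            degree[word] = degree.get(word, 0) + len(words) - 1
    # deg(w) also counts the word itself once per occurrence.
    return {w: (degree[w] + freq[w]) / freq[w] for w in freq}


def phrase_scores(phrases, scores):
    """Score each candidate phrase as the sum of its word scores."""
    return {p: sum(scores[w] for w in p.split()) for p in phrases}


if __name__ == "__main__":
    candidates = ["linear constraints", "natural numbers",
                  "linear diophantine equations"]
    ranked = sorted(phrase_scores(candidates, word_scores(candidates)).items(),
                    key=lambda item: item[1], reverse=True)
    for phrase, score in ranked:
        print(phrase, score)
```

In this example "linear" occurs twice with a total degree of 5, giving a word score of 2.5, so "linear diophantine equations" scores 2.5 + 3.0 + 3.0 = 8.5 and ranks above the two-word candidates. Longer phrases made of words that rarely occur alone are favoured by this scoring.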
Stop word list:
a
a's
able
about
above
according
accordingly
across
actually
after
afterwards
again
against
ain't
all
allow
allows
almost
This is the input file of queries or paragraphs from which the keywords are extracted.
The keywords are useful for the question answering system or for summarizing the
paragraph.
OUTPUT:
1. KeyWord Extraction
This is the keyword output for the above sample data, obtained by
removing the stop words given by Google's standards.
2. Parts of speech tagging
Here the program is executed: parts of speech tagging is applied to the keywords extracted
by the above program, and the output is stored in a text document.
3. Output file
This is the output of the program after applying parts of speech tagging to
the keywords extracted from the questions and paragraphs given in the
input file.
7. TESTING
7.1 INTRODUCTION
The development of software involves a series of productive activities, and testing is an
important one of them. This phase is a critical element of software quality assurance and
represents the ultimate review of specification, design and coding.
The main objectives of testing are as follows:
• Testing is a process of executing a program with the intent of finding an error.
• A good test case is one that has a high probability of finding an undiscovered error.
• A successful test is one that uncovers an undiscovered error.
Testing can be done in different ways. Some of the types of testing are mentioned
below. The main purpose of any type of test is to systematically uncover different
classes of errors and do so with a minimum amount of time and effort.
7.2 TYPES OF TESTING
• Unit testing
• Integration testing
• Regression testing
• System testing
• Alpha testing
• Beta testing
Testing can be done manually or by using testing tools. There are several testing tools
for different software.
Unit Testing: It is a method by which individual units of source code, sets of one or
more computer program modules together with associated control data, usage
procedures, and operating procedures, are tested to determine if they are fit for use.
Integration Testing: It is the phase in software testing in which individual software
modules are combined and tested as a group. Integration testing takes as its
input modules that have been unit tested, groups them in larger aggregates, applies
tests defined in an integration test plan to those aggregates, and delivers as its output
the integrated system ready for system testing.
Regression Testing: Regression testing is any type of software testing that seeks to
uncover new software bugs, or regressions, in existing functional and non-
functional areas of a system after changes, such as
enhancements, patches or configuration changes, have been made to them.
System Testing: System testing of software or hardware is testing conducted on a
complete, integrated system to evaluate the system's compliance with its
specified requirements.
Alpha Testing: Alpha testing is simulated or actual operational testing by potential
users/customers or an independent test team at the developers' site. Alpha testing is
often employed for off-the-shelf software as a form of internal acceptance testing,
before the software goes to beta testing.
Beta Testing: Beta testing comes after alpha testing and can be considered a form of
external user acceptance testing. Versions of the software, known as beta versions, are
released to a limited audience outside of the programming team. The software is
released to groups of people so that further testing can ensure the product has few
faults or bugs. Sometimes, beta versions are made available to the open public to
increase the feedback field to a maximal number of future users.
Each module can be tested using the following two strategies:
Black Box Testing: In this strategy, test cases are generated as input conditions
that fully exercise all functional requirements of the program. This testing is used to
find errors in the following categories:
• Incorrect or missing functions
• Interface errors
• Errors in data structure or external database access
• Performance errors
• Initialization and termination errors
In this testing, only the output is checked for correctness. The logical flow of
the data is not checked.
White Box Testing: In this strategy, test cases are generated from the logic of each module
by drawing flow graphs of that module, and the logical decisions are tested on all the cases.
7.3 TEST CASES:
Case 1: Submit Query
The user enters the input manually as a question. The input is a factoid question.
The keywords of the query (factoid question) must be extracted.
Expected output: Keywords of the query must be extracted.
Observed output: Successful. Keywords of the query (question) are extracted.
Case 2: Submit the paragraph
The user enters the input manually as a paragraph.
The keywords of the paragraph must be extracted so that the user can easily
summarize the paragraph based on the keywords.
Expected output: Keywords of the paragraph must be extracted.
Observed output: Successful. Keywords of the paragraph are extracted.
Case 3: Parts of speech tagging
The keywords which are extracted are tagged with their respective parts of speech. A tag may
be noun, pronoun, adjective, etc. Tagging is applied to both questions and paragraphs,
depending on the type of query given by the user.
Expected output: Keywords must be tagged with parts of speech.
Observed output: Successful. Keywords of the paragraph or the question are tagged with
their respective parts of speech.
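Cases 1 and 2 above can be expressed as automated unit tests. The sketch below is illustrative only: it uses a small hypothetical stop word set and a simplified extract_keywords function so that it is self-contained, and it does not cover Case 3, which needs an external parts of speech tagger.

```python
import string
import unittest

# Hypothetical stop word set, used only to make the tests self-contained.
STOP_WORDS = {"who", "what", "is", "the", "of", "a", "in", "and", "it", "was"}


def extract_keywords(text):
    """Return the words of text that are not stop words, punctuation removed."""
    words = (token.strip(string.punctuation) for token in text.split())
    return [w for w in words if w and w.lower() not in STOP_WORDS]


class KeywordExtractionTests(unittest.TestCase):
    def test_case_1_query(self):
        # Case 1: a factoid question is submitted.
        self.assertEqual(extract_keywords("Who invented the telephone?"),
                         ["invented", "telephone"])

    def test_case_2_paragraph(self):
        # Case 2: a short paragraph is submitted.
        text = "Python is a language. It was created in 1991."
        self.assertEqual(extract_keywords(text),
                         ["Python", "language", "created", "1991"])
```

Saved in a file, these tests can be run with python -m unittest followed by the file name. Each test states the expected output explicitly, so the observed output is checked by the test runner instead of by inspection.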
8. CONCLUSION
Conclusion:
Keyword Extraction is an application of Natural Language Processing whose importance
has been recognized for a long time. In this project, we have implemented a summarization
algorithm to extract keywords from a single document and obtain its summary. The main
advantage of our method is that it produces more accurate results, as it can recognize and
group named entities. Our project also extracts non-trivial keywords from
paragraphs, which proves to be an advantage over the existing algorithm. As
more electronic documents become available, we believe our method will be useful in many
applications, especially for domain-independent keyword extraction.