HINDI TEXT SUMMARIZATION
Presented by:
o Abhishek Kumar (CSE - 10114008)
o Nishant Kumar (CSE - 10114017)
Guided by:
• Dr. Manjira Sinha
 Introduction
 Text summarization extracts sentences from a text document,
determines which are most important, and returns the vital
information in a short (usually about half the original size),
readable and structured form.
 It gives the reader a filtered description of the source text and a
non-redundant presentation of the facts found in it.
 Why We Need Text Summarization
 Business leaders, analysts, students and academic researchers
go through huge numbers of documents every day to keep ahead,
and much of their time is spent just working out which
documents are relevant and which are not.
 By extracting important sentences and creating comprehensive
summaries, it is possible to quickly assess whether or not a
document is worth reading.
• News headlines
• Abstract of a technical paper
• Review of a book or preview of a movie
 Why Choose Hindi as the Language of Study?
 Native language of most people in Bihar, Jharkhand, UP, MP,
Delhi, Chhattisgarh, Himachal Pradesh, Haryana, Rajasthan, etc.
 A lot of work has been done on English, but relatively few
researchers have shown interest in Hindi.
 Official language of India.
 It is written in the Devanagari script, which has a large
alphabet set.
 Approaches to Summarization
Extraction-based
 Statistical
 Linguistic
 Hybrid
Abstraction-based
 Proposed System
1. Hindi text document
2. Preprocessing
3. Extracting sentence features
4. Sentence ranking
5. Summary
 Preprocessing
 Sentence Segmentation
 Delimiter: full stop, पूर्ण विराम (।)
 Tokenization
 Delimiters: full stop, पूर्ण विराम (।); colon, उपविराम (:); semicolon, अर्ध विराम (;)
 Stopword Removal
 Case markers, कारक (ने, को, से, के लिए, में, पर)
 Pronouns, सर्वनाम (आप, तू, यह, वह, कुछ)
 Conjunctions, समुच्चयबोधक अव्यय (और, लेकिन, पर, एवं, इसलिए, मगर)
 Compound-forming words, समास (अनुसार, पर्यन्त, वाला)
 Stemming
 Suffixes and plural endings are ignored (e.g. भारतीय is treated as भारत)
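The preprocessing steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the short stopword set (taken from the examples on the slide) and the suffix list are all assumptions, and a real system would need much larger lists and a proper Hindi stemmer.

```python
import re

# Illustrative stopwords taken from the categories on the slide; a real list
# would be much longer. "के लिए" is stored as its two words because tokens
# are single words here.
STOPWORDS = {
    "ने", "को", "से", "के", "लिए", "में", "पर",   # case markers
    "आप", "तू", "यह", "वह", "कुछ",              # pronouns
    "और", "लेकिन", "एवं", "इसलिए", "मगर",       # conjunctions
    "अनुसार", "पर्यन्त", "वाला",                 # compound-forming words
}

# Hypothetical suffix list for a naive stemmer, tried longest first.
SUFFIXES = sorted(["ियों", "ियाँ", "ीय", "ों", "ें"], key=len, reverse=True)


def split_sentences(text):
    """Split text into sentences on the danda (।); '|', '?' and '!' are
    accepted too, since typed text often uses them."""
    parts = re.split(r"[।|?!]", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Split a sentence into words on whitespace and common punctuation."""
    return [t for t in re.split(r"[\s,;:।|()]+", sentence) if t]


def stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Tokenize, drop stopwords, then stem the remaining words."""
    return [stem(t) for t in tokenize(sentence) if t not in STOPWORDS]
```

With these assumed lists, `preprocess("यह भारतीय संस्कृति का केंद्र है")` drops यह and reduces भारतीय to भारत. Words such as का and है survive only because the illustrative stopword set is so small.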
Feature Extraction
 Text Rank Feature
 Word frequency
(A sentence containing the most frequent words in the paragraph gets a high ranking.)
 Sentence Length
 Sentences that are too long or too short are eliminated.
 Sentence Position
 The position of a sentence in the text determines its importance.
 Beginning: states the theme
 End: conclusion or summary
 Title Word Feature
 Sentences containing words that match the paragraph's title words score
higher for inclusion in the summary.
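One way to turn the four features above into numbers is sketched below in Python. The slides do not give formulas, so every scoring rule here is an assumption: word frequency is read as the highest paragraph-level count among a sentence's words, the length limits `min_len` and `max_len` are hypothetical, position is scored highest at the two ends of the paragraph, and the title feature is the fraction of title words found in the sentence.

```python
from collections import Counter


def sentence_features(sentences, title_tokens, min_len=2, max_len=30):
    """Compute raw feature values for each sentence.

    sentences: list of token lists (already preprocessed).
    title_tokens: list of tokens from the paragraph title.
    Returns one dict of feature name -> value per sentence.
    """
    # Paragraph-level word counts across all sentences.
    freq = Counter(tok for sent in sentences for tok in sent)
    title = set(title_tokens)
    n = len(sentences)
    # Distance from the nearest end at which the position score reaches 0.
    half = max(1.0, (n - 1) / 2)

    features = []
    for i, sent in enumerate(sentences):
        # Word frequency: count of the sentence's most frequent word.
        word_freq = max((freq[tok] for tok in sent), default=0)
        # Length: 1.0 if within the assumed limits, else 0.0.
        length = 1.0 if min_len <= len(sent) <= max_len else 0.0
        # Position: 1.0 at the first and last sentence, falling to 0.0
        # in the middle of the paragraph.
        position = 1.0 - min(i, n - 1 - i) / half
        # Title: share of title words that appear in the sentence.
        title_score = len(title & set(sent)) / len(title) if title else 0.0
        features.append({
            "word_freq": word_freq,
            "length": length,
            "position": position,
            "title": title_score,
        })
    return features
```

The values are left unnormalized on purpose, so that a later ranking step can bring every feature onto the same scale before combining them.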
Sentence Ranking & Summary
 Calculate a ranking value for each sentence from each of the selected features.
 Normalize each feature's ranking value to the scale 0 to 1.
 Add all the feature ranking values to get the final ranking of each sentence.
 Sort the sentences by final ranking in descending order.
 Select sentences in descending order of ranking until the required
summary percentage is reached.
 Print the selected sentences in their original paragraph order.
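The ranking and selection steps can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, normalization is assumed to be min-max scaling, all features are weighted equally, and the summary percentage is passed as a `ratio` between 0 and 1.

```python
import math


def normalize(values):
    """Min-max scale a list of numbers to the range 0 to 1.

    A feature with the same value for every sentence cannot
    discriminate between them, so it contributes 0 everywhere.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def final_scores(features):
    """Sum the normalized feature values for each sentence.

    features: one dict of feature name -> raw value per sentence.
    """
    totals = [0.0] * len(features)
    for name in features[0]:
        column = normalize([f[name] for f in features])
        for i, value in enumerate(column):
            totals[i] += value
    return totals


def summarize(sentences, features, ratio=0.5):
    """Pick the top-ranked share of sentences, in original order."""
    if not sentences:
        return []
    totals = final_scores(features)
    # Number of sentences to keep, at least one.
    k = max(1, math.ceil(ratio * len(sentences)))
    # Sentence indices sorted by final score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)
    # Restore the original paragraph order for the chosen sentences.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

Sorting the chosen indices at the end is what keeps the summary readable: sentences are selected by score but printed in the order in which they appeared in the paragraph.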
Future Work
 We will add more features, such as:
 Proper nouns, numerical data, sentence similarity.
 We will optimize our algorithm using:
 Genetic Algorithm (GA)
 Artificial Neural Network (ANN)
 We will build a GUI-based application for a better user
experience.
Thank You for Your Attention
धन्यवाद!!
