Automated Abstracts
                     By - Sameer Wadkar
                     Big Data Architect / Data Scientist




December 8th, 2012                                         © 2012 Axiomine LLC
Agenda

  What are Automated Abstracts?
  Process of Automated Abstracts
  Extracting significant words
  Scoring Sentences using Luhn’s algorithm
  Domain specific abstracts
  Automated Abstracts on a Massive Data Corpus
  The Axiomine Platform
What are Automated Abstracts?
 • Abstracts comprise the key sentences in a document
 • Key challenges
     • Generate Automated Abstracts on terabyte-scale or streaming data
     • Exploit valuable domain knowledge.
     • Allow abstracts to be based on user-defined query
        • If the user declares her interest in “Risk”, the abstracts will be
           focused on the term “Risk” and its related words.




In practice, Automated Abstracts are Automated Extracts
Process of Automated Abstracts
The process has four steps (a minimal code sketch of all four follows below):

1. Define Corpus & Summary Size Criteria
   • Define the document corpus. A corpus is a collection of text documents in digital format.
   • Define the criteria for key-sentence selection, e.g. the top 20 sentences or the top 5% of sentences.

2. Extract significant words
   • Find the important words in the corpus. Word frequency is the simplest measure, but words like “and” and “the” occur frequently without being informative, and very low-frequency words like “preposterous” are not informative either.
   • Statistical and Natural Language Processing (NLP) methods offer stronger alternatives: TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical technique to evaluate word importance, and NLP techniques such as Parts of Speech Tagging and Named Entity Extraction can also be used.

3. Score sentences per document
   • Calculate an importance score for each sentence based on the frequency and co-location of significant words (Luhn’s algorithm). A sentence’s score depends on the relative importance of its significant words.

4. Generate Abstracts (Extracts)
   • Pick the top sentences based on score and the chosen criteria.
Pick sentences based on location & occurrence of important words
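
To make the four steps concrete, here is a minimal end-to-end sketch in Python. It is not from the deck: the helper functions significant_words() and sentence_score() are hypothetical and are sketched under the TF-IDF and sentence-scoring slides below, and the parameter values are illustrative.

```python
def automated_abstract(corpus, doc_sentences, top_n=20, tfidf_threshold=1000.0):
    """End-to-end sketch of the four steps above.

    corpus:        list of tokenized documents (step 1: the defined corpus)
    doc_sentences: list of tokenized sentences of the document to summarize
    top_n:         summary-size criterion (step 1), e.g. the top 20 sentences
    """
    significant = significant_words(corpus, tfidf_threshold)       # step 2: significant words
    scored = [(sentence_score(sent, significant), sent)            # step 3: score each sentence
              for sent in doc_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [" ".join(sent) for _, sent in scored[:top_n]]          # step 4: the extract
```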
Extracting significant words
• The number of times a word occurs is an inadequate measure.
    • Stop words like “and” and “the” occur frequently but are not important
    • Very rarely occurring words like “preposterous” are also not very significant
    • Pick words that occur often, but neither too often nor too rarely
• Three popular approaches
    • Statistical measures such as TF-IDF
    • Linguistic methods, i.e. Natural Language Processing (NLP)
    • A hybrid of statistical and linguistic methods




Discovery of key words algorithmically is a non-trivial problem
Extracting significant words- Statistical Technique
• TF-IDF stands for Term Frequency - Inverse Document Frequency
       •   TF-IDF = TF × IDF
       •   TF = number of times the word occurs in the corpus
       •   DF = proportion of documents containing the word
       •   IDF = log(1/DF)
• Pick words with TF-IDF above a predefined threshold
• Ex.: Consider a news corpus with 10,000 news articles

   Word in corpus    TF           DF                  1/DF                 IDF = log(1/DF)    TF-IDF
   and               10 million   10,000 (all docs)   10,000/10,000 = 1    0                  0
   football          1,000        100                 10,000/100 = 100     2                  2,000

 “and” occurs far more often, but “football” is the significant word


TF-IDF combines two conflicting measures into a “significance” score
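
A minimal Python sketch of this selection step, assuming the slide’s definitions (TF counted over the whole corpus, DF as the fraction of documents containing the word); the function name and threshold are illustrative, not from the deck.

```python
import math
from collections import Counter

def significant_words(corpus, threshold):
    """Return the words whose TF-IDF = TF * log10(1/DF) exceeds the threshold.

    corpus: list of documents, each a list of lowercased tokens.
    """
    n_docs = len(corpus)
    tf = Counter()          # corpus-wide term frequency
    docs_with = Counter()   # number of documents containing each word
    for doc in corpus:
        tf.update(doc)
        docs_with.update(set(doc))
    # DF = docs_with[w] / n_docs, so 1/DF = n_docs / docs_with[w]
    return {w for w in tf if tf[w] * math.log10(n_docs / docs_with[w]) > threshold}
```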
Extracting significant words- NLP Techniques
• Rules
    • Sentences containing a proper noun are important.
    • Sentences containing a place, a person, a medical or technology term,
      or a term from a custom domain dictionary are important
• Two main techniques
    • Parts of Speech Tagging
    • Named Entity Extraction
• Parts of Speech Tagging
    • Identifies the grammatical form of each word in a sentence: is the
      word a proper noun, a common noun, an adjective, an adverb, etc.?
• Named Entity Extraction
    • Discovers named entities such as “person”, “place”, or “medical term”
      in the text of a document. Try out the Calais Viewer
    • Examples of COTS and open source software: Open Calais,
      GATE, UIMA, Autonomy


Exploit your domain knowledge - No glory in full automation
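
As a minimal illustration of both techniques, here is a sketch using NLTK, one open-source option (the deck itself lists Open Calais, GATE, UIMA, and Autonomy). NLTK’s default models cover persons, places, and organizations; domain terms such as “liver transplant” would still need a custom dictionary or a domain-trained model.

```python
import nltk
# One-time downloads: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words

sentence = ("A 15-year-old liver transplant patient is the first person in the "
            "world to take on the immune system and blood type of her donor.")
tokens = nltk.word_tokenize(sentence)

# Parts of Speech Tagging: (word, tag) pairs such as ('patient', 'NN')
tagged = nltk.pos_tag(tokens)
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# Named Entity Extraction: subtrees labeled PERSON, GPE, ORGANIZATION, ...
tree = nltk.ne_chunk(tagged)
entities = [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != "S"]
```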
Sentence Scoring (Luhn’s Algorithm)
• Find clusters of important words in a sentence. For a cluster to be
  formed, important words have to be within a pre-specified number of
  words of each other (e.g. 3)
• Score each cluster and use cluster scores to score the sentence


  Example sentence (the bolded words on the original slide are the words “discovered” to be significant in a medical corpus):

  A 15-year-old liver transplant patient is the first person in the world to take on the
  immune system and blood type of her donor.

  • In “liver transplant patient”, all significant words are within 1 word of each other.
  • In “immune system and blood type of her donor”, all significant words are within a maximum of 3 words of each other.
  • “patient” and “immune” are 12 words apart, hence two different clusters in one sentence.

Important sentences have important words close together
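
A minimal sketch of the cluster-formation step in Python (the function and parameter names are illustrative, not from the deck): consecutive significant words separated by at most max_gap other words fall into the same cluster.

```python
def find_clusters(tokens, significant, max_gap=3):
    """Group the positions of significant words into Luhn-style clusters.

    tokens:      the words of one sentence, in order
    significant: set of significant (lowercased) words
    max_gap:     maximum number of non-significant words allowed between
                 two significant words of the same cluster
    Returns a list of clusters, each a list of significant-word positions.
    """
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in significant]
    clusters = []
    for pos in positions:
        if clusters and pos - clusters[-1][-1] <= max_gap + 1:
            clusters[-1].append(pos)   # close enough: extend the current cluster
        else:
            clusters.append([pos])     # too far away: start a new cluster
    return clusters
```

On the example sentence above, “patient” and “immune” are too far apart, so the significant-word positions split into two clusters, matching the slide.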
Scoring Sentences
• Sample Scoring Criteria
     • Cluster Score = (No. of significant words)² / (No. of words in the cluster)
     • Sentence Score = maximum of all cluster scores for the given sentence
     • Pick the top N or top N% of sentences for the abstract

  Phrase                                          No. of significant    No. of words    Cluster Score
                                                  words in cluster      in cluster
  “liver transplant patient”                      3                     3               3²/3 = 3
  “immune system and blood type of her donor”     6                     9               6²/9 = 4

  Sentence Score = max(3, 4) = 4




All words have the same weight. Limitation(?) or Opportunity(!)
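
A minimal sketch of this scoring criterion, building on the find_clusters() sketch above (helper names are illustrative): each cluster spans from its first to its last significant word, and the sentence takes the maximum cluster score.

```python
def sentence_score(tokens, significant, max_gap=3):
    """Sentence score = max over clusters of
    (number of significant words)² / (number of words spanned by the cluster)."""
    best = 0.0
    for cluster in find_clusters(tokens, significant, max_gap):
        n_significant = len(cluster)
        span = cluster[-1] - cluster[0] + 1    # words from first to last significant word
        best = max(best, n_significant ** 2 / span)
    return best
```

On the example sentence this reproduces the table above: the clusters score 3²/3 = 3 and 6²/9 = 4, so the sentence scores 4.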
Domain Specific Abstracts
 • Give each significant word a different weight during cluster scoring
 • We can get Domain/Query specific abstracts!
 • Ex.: In the previous example, if we wanted abstracts related
   to “Liver Transplants”, we would weight the words “Liver” and
   “Transplant” higher (e.g. 5 vs. 1 for the rest)
 Phrase                                          Weight of              Weight of all        Cluster Score
                                                 significant words      words in cluster
 “liver transplant patient”                      5 + 5 + 1 = 11         5 + 5 + 1 = 11       11²/11 = 11
 “immune system and blood type of her donor”     6 × 1 = 6              9 × 1 = 9            6²/9 = 4

 Sentence Score = max(11, 4) = 11

 Sentences containing the words “liver” or “transplant” will now be
 weighted higher.


Abstracting process is not a black box - The user & domain can drive it
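
A minimal sketch of the weighted variant (helper and parameter names are illustrative): query-related significant words carry a higher weight, other significant words weight 1, and non-significant words a default weight.

```python
def weighted_cluster_score(cluster_tokens, weights, default_weight=1):
    """Weighted cluster score =
    (sum of significant-word weights)² / (sum of weights of all words in the cluster).

    weights: maps every significant word to its weight (query terms higher,
             the rest 1); words not in `weights` count as default_weight.
    """
    significant_total = sum(weights[t.lower()] for t in cluster_tokens
                            if t.lower() in weights)
    all_total = sum(weights.get(t.lower(), default_weight) for t in cluster_tokens)
    return significant_total ** 2 / all_total
```

With weights {"liver": 5, "transplant": 5, "patient": 1}, the cluster “liver transplant patient” scores 11²/11 = 11, reproducing the table above.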
Examples of Domain Specific Abstracts
• Imagine a large Project Review Document
    • Find the Project Risk Summary (Give more weight to words
      related to “Risk”)
    • Find the Project Execution Summary (Give more weight to
      words related to Project Management)
• Imagine a Medical Corpus
    • Find sentences related to “Transplant” and “Grafting” procedures
    • Find sentences related to “Heart Surgery” (give more weight to
      words like “Cardiac”, “Heart”, “Cardiovascular”, etc.)




Domain dictionaries and expert knowledge improve abstracts
Automated Abstracts on Big Data Scale (Process)

   • A large Document Corpus feeds two MapReduce processes: a TF-IDF process and a Named Entity Extraction process.
   • Their outputs, combined with Weighing Rules and Domain Knowledge, produce the set of Significant Words.
   • A third MapReduce process (Automated Abstracts) uses the Significant Words and the corpus to generate the Document Abstracts.

Abstract-generation techniques work well with the MapReduce model
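
As one illustration of how such a step maps onto MapReduce, here is a sketch of the document-frequency part of the TF-IDF job, written as plain Python functions that mimic the map/shuffle/reduce pattern locally; it is not actual Hadoop code and the function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def df_mapper(doc_id, tokens):
    """Map phase: emit (word, 1) once per document for every distinct word."""
    for word in set(tokens):
        yield word, 1

def df_reducer(word, counts):
    """Reduce phase: sum the per-document emissions to get the document frequency."""
    return word, sum(counts)

def run_locally(docs):
    """Simulate the shuffle and reduce in-process; a real job would run on Hadoop.

    docs: dict mapping doc_id -> list of tokens.
    """
    pairs = sorted(kv for doc_id, tokens in docs.items()
                   for kv in df_mapper(doc_id, tokens))
    return dict(df_reducer(word, [c for _, c in group])
                for word, group in groupby(pairs, key=itemgetter(0)))
```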
What can Axiomine do?
• At Axiomine we have developed methods to
    • Generate abstracts on a massive scale.
    • Generate abstracts on new documents in real-time
    • Allow incorporation of domain knowledge in real-time
• We utilize various Big Data Technologies
    • Natural Language Processing on Hadoop
    • Real-time NLP using General-Purpose computing on GPUs (GPGPU)
      with NVIDIA graphics chips




At Axiomine we handle large-scale Text Analytics
Intuitive Insights Information Access Platform
    Integration platform for diverse data sources
  comprising Structured and Unstructured Data



 Intuitively navigate Big Data Corpus at the Speed
                      of Thought



   Methodology and Implementation to perform
    Topic Modeling on Massive Text Corpora



  A high fidelity algorithm to estimate Document
  Similarity based on results of Topic Modeling



 Develop Automated Domain Specific Abstracts in
                  Real Time



    Business Intelligence Layer that can query
      terabyte-scale corpora in Real-Time



Axiomine’s I3AP supports access to unlimited data at the speed of thought
Q.E.D

