This document provides an overview of the Natural Language Toolkit (NLTK), a Python library for natural language processing. It discusses NLTK's modules for common NLP tasks like tokenization, part-of-speech tagging, parsing, and classification. It also describes how NLTK can be used to analyze text corpora, frequency distributions, collocations and concordances. Key functions of NLTK include tokenizing text, accessing annotated corpora, analyzing word frequencies, part-of-speech tagging, and shallow parsing.
An Intuitive Natural Language Understanding System (inscit2006)
The document describes the development of a natural language understanding system with six modules, covering morphological analysis, synonym matching, syntax analysis, semantic analysis, and knowledge-base interaction, that understands commands given as English sentences and executes the corresponding shell command. It discusses the methodology used in building the modules and evaluates the system's performance on 50 test sentences, achieving 94% precision in generating the correct responses.
The document discusses text normalization, which involves segmenting and standardizing text for natural language processing. It describes tokenizing text into words and sentences, lemmatizing words into their root forms, and standardizing formats. Tokenization involves separating punctuation, normalizing word formats, and segmenting sentences. Lemmatization determines that words have the same root despite surface differences. Sentence segmentation identifies sentence boundaries, which can be ambiguous without context. Overall, text normalization prepares raw text for further natural language analysis.
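The normalization steps above map directly onto library calls. Here is a minimal sketch using NLTK (the toolkit described in the first summary); it assumes the punkt and wordnet data packages have been downloaded, and the sample sentence is illustrative:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

# Assumes the required NLTK data has been downloaded once:
# nltk.download('punkt'); nltk.download('wordnet')
text = "The cats were sitting on the mats. They looked comfortable."
sentences = sent_tokenize(text)                  # sentence segmentation
tokens = [word_tokenize(s) for s in sentences]   # word tokenization
lemmatizer = WordNetLemmatizer()
# Without POS information, WordNetLemmatizer treats tokens as nouns,
# so "cats" -> "cat" but "sitting" stays unchanged.
lemmas = [[lemmatizer.lemmatize(w.lower()) for w in sent] for sent in tokens]
print(lemmas)
```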
DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / П... (GeeksLab Odessa)
From bag of texts to bag of clusters
Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)
We review modern approaches to text clustering and their visualization, from classic K-means on TF-IDF through Deep Learning representations of texts. As a practical example, we analyze a set of social media messages and try to find the main topics of discussion.
All materials: http://datascience.in.ua/report2017
This document describes a method for automatically extracting key terms from spoken documents. It uses branching entropy to identify phrases, then extracts prosodic, lexical, and semantic features for machine learning. Three learning methods - K-means, AdaBoost, and neural networks - are evaluated. The best performance is from neural networks using all feature types. When applied to lecture transcripts, it achieves an F-measure of 67.31% for key terms, only slightly lower than human annotations.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
Using Text Embeddings for Information Retrieval (Bhaskar Mitra)
Neural text embeddings provide dense vector representations of words and documents that encode various notions of semantic relatedness. Word2vec models typical similarity by representing words based on neighboring context words, while models like latent semantic analysis encode topical similarity through co-occurrence in documents. Dual embedding spaces can separately model both typical and topical similarities. Recent work has applied text embeddings to tasks like query auto-completion, session modeling, and document ranking, demonstrating their ability to capture semantic relationships between text beyond just words.
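To make the word2vec idea concrete, here is a minimal gensim sketch; the toy corpus and parameter values are illustrative only, since useful embeddings require large corpora:

```python
from gensim.models import Word2Vec

# A toy corpus; real embeddings need millions of sentences.
sentences = [
    ["cheap", "flights", "to", "london"],
    ["cheap", "tickets", "to", "paris"],
    ["book", "flights", "and", "hotels"],
    ["book", "cheap", "hotels", "in", "london"],
]
# vector_size/window/min_count follow gensim 4.x naming.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

# Terms appearing in similar contexts get nearby vectors, which is what
# lets retrieval bridge query/document term mismatch.
print(model.wv.most_similar("flights", topn=3))
```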
Improvement in Quality of Speech associated with Braille codes - A Review (inscit2006)
Anurag, J., Nupur, P. and Agrawal, S.S.
School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India
Centre for Development of Advanced Computing, Noida, India
A detailed document with the definition of text mining, along with its challenges, modeling techniques to implement, word clouds, and much more.
Thanks for your time. If you enjoyed this short video, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan (rudolf eremyan)
The document discusses Rudolf Eremyan's work as a machine learning software engineer, including several natural language processing (NLP) projects. It provides details on a chatbot Eremyan created for TBC Bank in Georgia that attracted over 35,000 likes and facilitated over 100,000 conversations. It also mentions sentiment analysis on Facebook comments and introduces NLP, discussing its history and applications such as text classification, machine translation, and question answering. The document also outlines a theoretical NLP project of Eremyan's that involves creating a machine learning pipeline for text classification using a labeled dataset.
Explore Topic Modeling in detail via LDA (Latent Dirichlet Allocation) and its steps.
Thanks for your time. If you enjoyed this short video, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
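As a rough illustration of the LDA workflow referenced above, here is a minimal gensim sketch; the toy documents and topic count are assumptions for demonstration:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy documents, already tokenized; real runs need far more text.
docs = [
    ["dog", "cat", "pet", "vet"],
    ["stock", "market", "trade", "price"],
    ["pet", "dog", "leash", "vet"],
    ["price", "stock", "dividend", "market"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=20, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)                       # top words per topic
```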
This document summarizes an experiment comparing different character-level embedding approaches for Korean sentence classification tasks. Dense character-level embeddings using pre-trained fastText vectors outperformed sparse one-hot encodings. Character-level embeddings preserved local semantics around character boundaries better than Jamo-level encodings, which performed best with self-attention. While Jamo-level features may be useful for syntax-semantic tasks, character-level approaches had better performance and computation efficiency. These findings provide insights for character-rich languages beyond Korean.
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili... (Rommel Carvalho)
Presentation given by Saminda Abeyruwan at the 6th Uncertainty Reasoning for the Semantic Web Workshop at the 9th International Semantic Web Conference on November 7, 2010.
Paper: PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabilistic Methods
Abstract: Manually formalizing an ontology for a domain is well known to be a tedious and cumbersome process, constrained by the knowledge acquisition bottleneck. Researchers have therefore developed algorithms and systems that can help to automate the process, among them systems that draw on text corpora for the acquisition. Our idea is likewise based on vast amounts of text corpora. Here, we provide a novel unsupervised bottom-up ontology generation method, based on lexico-semantic structures and Bayesian reasoning, to expedite the ontology generation process. We provide one quantitative and two qualitative results illustrating our approach, using a high-throughput screening assay corpus and two custom text corpora. This process could also provide evidence for domain experts who build ontologies with top-down approaches.
Spatial Latent Dirichlet Allocation (SLDA) is an extension of LDA that incorporates spatial information to improve topic modeling of image data. SLDA treats each region of an image grid as a document and assigns visual words representing local image patches to the closest region. This allows it to capture co-occurrence relationships between visual words better than LDA. The paper demonstrates SLDA can outperform LDA on image classification tasks by incorporating spatial context between visual words.
This document discusses document clustering in the Amharic language for information browsing and retrieval. It introduces the challenges of searching and accessing information in Amharic due to the growing amount of digital documents. The document then describes the process of document clustering, which groups documents based on similarities to organize information. Key steps in the clustering process include document preprocessing, vector representation, and hierarchical clustering. Experimental results show that tuning the global support threshold is important for creating the desired hierarchy, and stemming affects cluster overlap. Future work could involve developing standard Amharic language resources and comparing different clustering and information retrieval methods.
This document presents the Duet model for document ranking. The Duet model uses a combination of local and distributed representations of text to perform both exact and inexact matching of queries to documents. The local model operates on a term interaction matrix to model exact matches, while the distributed model projects text into an embedding space for inexact matching. Results show the Duet model, which combines these approaches, outperforms models using only local or distributed representations. The Duet model benefits from training on large datasets and can effectively handle queries containing rare terms or needing semantic matching.
Semi-supervised approach for word sense disambiguation (kokanechandrakant)
This document proposes a semi-supervised learning approach for word sense disambiguation in natural language processing. It discusses word sense disambiguation, including its importance for better user experience. The document outlines the objectives of the proposed research, which include understanding word sense ambiguity and studying existing WSD approaches. It also summarizes literature on various WSD methods like knowledge-based, supervised, semi-supervised and unsupervised approaches. The overall aim is to improve the accuracy of existing WSD algorithms.
This document discusses a study evaluating several linked data semantic annotators for their ability to extract domain-relevant expressions from texts. The study found that no single annotator performed very well, with F-scores around 60% at best. However, the different annotators were complementary. Combining the annotators using voting methods or machine learning improved recall and F-score over the individual annotators, with decision trees and rule induction performing best. While precision remained around 80% for the best individual annotator, recall and F-score were improved to around 70% using combination and machine learning methods.
2010 PACLIC - pay attention to categories (WarNik Chow)
This document summarizes a research paper on a proposed method called Metadata Projection Matrix (MPM) for sentence modeling that allows controlling attention to certain syntactic categories. The method uses a projection matrix to incorporate syntactic category information when calculating attention weights. Experimental results on several datasets show MPM outperforms baselines on tasks where attention to specific categories is important, like detecting terms or irony, but is weaker on more context-dependent tasks. The method is best suited to applications where syntactic structure significantly informs predictions.
The document discusses a neural model called Duet for ranking documents based on their relevance to a query. Duet uses both a local model that operates on exact term matches between queries and documents, and a distributed model that learns embeddings to match queries and documents in the embedding space. The two models are combined using a linear combination and trained jointly on labeled query-document pairs. Experimental results show Duet performs significantly better at document ranking and other IR tasks compared to using the local and distributed models individually. The amount of training data is also important, with larger datasets needed to learn better representations.
The document describes language-independent methods for clustering similar contexts without using syntactic or lexical resources. It discusses representing contexts as vectors of lexical features and clustering them based on similarity. Feature selection involves identifying unigrams, bigrams, and co-occurrences based on frequency or association measures. Contexts can then be represented in first-order or second-order feature spaces and clustered. Applications include word sense discrimination, document clustering, and name discrimination.
This document provides an overview of the OpenNLP natural language processing tool. It discusses the various NLP tasks that OpenNLP can perform, including tokenization, POS tagging, named entity recognition, chunking, parsing, and co-reference resolution. It also describes how models for these tasks are trained in OpenNLP using annotated training data. The document concludes by listing some advantages and limitations of OpenNLP.
Natural language processing with python and amharic syntax parse tree by dani... (Daniel Adenew)
Natural Language Processing is an interdisciplinary field that adds the capability of communicating as human beings do to the computer world. The Amharic language has seen much improvement over time thanks to researchers at the PhD and MSc level at AAU. Here, I have tried to study the problem and come up with a limited-scope solution that does syntax parsing for the Amharic language and draws syntax parse trees using Python!
This document discusses cross-language information retrieval (CLIR). It presents the goals of allowing users to query for domain-specific information in their native language and presenting relevant search results in the target language. It describes the key components of CLIR including bilingual corpus extraction from multiple sources, corpus indexing, querying and string matching. Preliminary evaluation results of sample queries are provided, along with conclusions that machine translation based CLIR is often more useful than the proposed method and that future work could focus on automated evaluation and fuzzy matching.
Lecture 9 - Machine Learning and Support Vector Machines (SVM) (Sean Golliher)
This document discusses machine learning and support vector machines. It provides examples of using probabilities to determine the likelihood of a document being relevant given certain terms. It also discusses language models and smoothing techniques used in document ranking. Finally, it briefly outlines different types of machine learning problems and algorithms like supervised learning, classification, and reinforcement learning.
This document discusses various techniques for question answering and relation extraction in natural language processing. It provides an overview of question answering systems and approaches, including examples like START, Ask Jeeves and Siri. It also discusses using search engines for question answering, relation extraction from questions, and common evaluation metrics for question answering systems like accuracy and mean reciprocal rank.
HackYale - Natural Language Processing (All Slides) (Nick Hathaway)
Slides for a course I taught on Natural Language Processing covering corpus manipulation, word tokenization and text classification tasks using Python's popular Natural Language Toolkit. Concluded with a final project classifying articles from the Reuters corpus by category using a Naive Bayes classifier.
This document discusses various text operations and techniques for automatic indexing in information retrieval systems. It covers topics like tokenization, stop word removal, stemming, term weighting, Zipf's law, Luhn's model of word frequency, and Heaps' law on vocabulary growth. The goal of these text operations is to select meaningful index terms from documents to represent their contents and reduce noise for more effective retrieval.
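Stop-word removal and stemming, two of the text operations listed above, can be sketched in a few lines with NLTK; this assumes the stopwords data package has been downloaded:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumes nltk.download('stopwords') has been run once.
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = "the retrieved documents were indexed for effective retrieval".split()
# Drop stop words, then conflate morphological variants to a common stem,
# so "retrieved" and "retrieval" map to the same index term.
index_terms = [stemmer.stem(t) for t in tokens if t not in stop]
print(index_terms)
```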
Introduction to natural language processing (NLP) (Alia Hamwi)
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
Engineering Intelligent NLP Applications Using Deep Learning – Part 1 (Saurabh Kaushik)
This document discusses natural language processing (NLP) and language modeling. It covers the basics of NLP including what NLP is, its common applications, and basic NLP processing steps like parsing. It also discusses word and sentence modeling in NLP, including word representations using techniques like bag-of-words, word embeddings, and language modeling approaches like n-grams, statistical modeling, and neural networks. The document focuses on introducing fundamental NLP concepts.
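The n-gram language modeling mentioned here boils down to conditional counts; a toy maximum-likelihood bigram sketch (the corpus is illustrative):

```python
from collections import Counter

# Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1).
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("cat", "the"))   # 2/3: "the" is followed by "cat" in 2 of its 3 uses
```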
Natural Language Processing, Techniques, Current Trends and Applications in I... (RajkiranVeluri)
The document discusses natural language processing (NLP) techniques, current trends, and applications in industry. It covers common NLP techniques like morphology, syntax, semantics, and pragmatics. It also discusses word embeddings like Word2Vec and contextual embeddings like BERT. Finally, it discusses applications of NLP in healthcare like analyzing clinical notes and brand monitoring through sentiment analysis of user reviews.
Chapter 2 Text Operation and Term Weighting.pdf (JemalNesre1)
Zipf's law describes the frequency distribution of words in natural language corpora. It states that the frequency of any word is inversely proportional to its rank in the frequency table: most words have low frequency, while a few words are used very frequently. Heaps' law estimates how vocabulary size grows with corpus size, at a sub-linear rate. Text preprocessing techniques like stopword removal and stemming aim to reduce noise by excluding non-discriminative words from indexes.
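A quick way to eyeball Zipf's law on any corpus is to check whether rank times frequency stays roughly constant; a minimal sketch, assuming a plain-text file corpus.txt:

```python
from collections import Counter

# Any reasonably large plain-text file will do; the path is illustrative.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
ranked = Counter(words).most_common()

# Zipf's law: frequency is inversely proportional to rank,
# so rank * frequency should stay roughly constant down the list.
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(f"{rank:>4} {word:<15} {freq:>8} {rank * freq:>10}")
```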
The document discusses various natural language processing (NLP) techniques including implementing search, document level analysis, sentence level analysis, and concept extraction. It provides details on tokenization, word normalization, stop word removal, stemming, evaluating search results, parsing and part-of-speech tagging, entity extraction, word sense disambiguation, concept extraction, dependency analysis, coreference, question parsing systems, and sentiment analysis. Implementation details and useful tools are mentioned for various techniques.
The document discusses processing Boolean queries in an information retrieval system using an inverted index. It describes the steps to process a simple conjunctive query by locating terms in the dictionary, retrieving their postings lists, and intersecting the lists. More complex queries involving OR and NOT operators are also processed in a similar way. The document also discusses optimizing query processing by considering the order of accessing postings lists.
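The core of conjunctive query processing is the linear-time merge of two sorted postings lists; a minimal sketch, with illustrative terms and document IDs:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with the two-pointer walk."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Toy inverted index; terms and doc IDs are illustrative.
postings = {"brutus": [1, 2, 4, 11, 31, 45], "caesar": [1, 2, 5, 31]}
# AND query "brutus AND caesar": process the shorter list first so
# intermediate results stay small (the ordering optimization noted above).
print(intersect(postings["caesar"], postings["brutus"]))  # [1, 2, 31]
```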
This document discusses different types of query languages used for information retrieval systems. It describes keyword queries, where documents are retrieved based on the presence of query words. Phrase queries search for an exact sequence of words. Boolean queries use logical operators like AND, OR and NOT to combine search terms. Natural language queries allow users to enter searches in a free-form manner but require translation to a formal query language. The document provides examples and explanations of each query language type across its 12 sections.
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
Intro to Vectorization Concepts - GaTech cse6242 (Josh Patterson)
Vectorization is the process of converting text into numeric vectors that can be used by machine learning algorithms. There are several common techniques for vectorization, including the bag-of-words model, TF-IDF, and n-grams. The bag-of-words model represents documents as vectors counting the number of times each word appears. TF-IDF improves on this by weighting words based on their frequency in documents and inverse frequency in the corpus. N-grams consider sequences of words, such as bigrams like "Coca Cola", as single units. Kernel hashing allows vectorization in a single pass by mapping words to a fixed-sized vector using a hash function.
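All three vectorization schemes described above are available in scikit-learn; a minimal sketch (the toy documents are illustrative):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)

docs = ["coca cola is a drink", "cola sales rose", "people drink water"]

bow = CountVectorizer(ngram_range=(1, 2))     # unigrams plus bigrams ("coca cola")
tfidf = TfidfVectorizer()                     # counts reweighted by corpus rarity
hashed = HashingVectorizer(n_features=2**10)  # one-pass hashing, no fitted vocab

print(bow.fit_transform(docs).shape)
print(tfidf.fit_transform(docs).shape)
print(hashed.transform(docs).shape)           # hashing needs no fit step
```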
This document provides an overview of natural language processing (NLP). It discusses how NLP allows computers to understand human language through techniques like speech recognition, text analysis, and language generation. The document outlines the main components of NLP including natural language understanding and natural language generation. It also describes common NLP tasks like part-of-speech tagging, named entity recognition, and dependency parsing. Finally, the document explains how to build an NLP pipeline by applying these techniques in a sequential manner.
Natural Language Processing (NLP).pptx (SHIBDASDUTTA)
The document discusses natural language processing (NLP), which uses technology to help computers understand human language through tasks like audio to text conversion, text processing, and responding to humans in their own language. It describes the key components of NLP as natural language understanding to analyze language and natural language generation to convert data into language. The document also outlines how to build an NLP pipeline with steps like sentence segmentation, tokenization, stemming, and named entity recognition.
Delivered at the European Patent Office's annual Patent Information Conference (EPOPIC 2014), November 5th 2014, Warsaw, Poland.
In this talk, we give an introduction as to how machine translation works and what makes certain content types and languages more difficult than others.
Full-text search allows searching the full text of documents for exact matches or substrings of search terms. It examines all words in every stored document to match search criteria. A common full-text search technique uses an inverted index to map terms to their locations in documents, allowing fast searching in O(m) time where m is the length of the search query. Updating an inverted index is challenging as it is optimized for reads and requires rewriting segments on changes.
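A toy version of such an inverted index fits in a few lines; the documents and the helper name build_index are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["full text search", "search the full index", "an inverted index"]
index = build_index(docs)
# Lookup cost depends on the query, not on how many documents are stored.
print(index["search"])   # [0, 1]
print(index["index"])    # [1, 2]
```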
This presentation talks about Natural Language Processing using Java. At Museaic, a music intelligence platform, we spent time figuring out how to extract central themes from song lyrics. In this talk, I will cover some of the tasks involved in natural language processing such as named entity recognition, word sense disambiguation and concept/theme extraction. I will also cover libraries available in java such as stanford-nlp, dbpedia-spotlight and graph approaches using WordNet and semantic databases. This talk would help people understand text processing beyond simple keyword approaches and provide them with some of the best techniques/libraries for it in the Java world.
This lectures provides students with an introduction to natural language processing, with a specific focus on the basics of two applications: vector semantics and text classification.
(Lecture at the QUARTZ PhD Winter School, http://www.quartz-itn.eu/training/winter-school/, in Padua, Italy on February 12, 2018)
The document discusses natural language and natural language processing (NLP). It defines natural language as languages used for everyday communication like English, Japanese, and Swahili. NLP is concerned with enabling computers to understand and interpret natural languages. The summary explains that NLP involves morphological, syntactic, semantic, and pragmatic analysis of text to extract meaning and understand context. The goal of NLP is to allow humans to communicate with computers using their own language.
The document provides an overview of natural language processing (NLP) including definitions, applications, modeling techniques, and tools used. It defines NLP as making computers understand human language and discusses applications like email filters, assistants, translation, and data analysis. Techniques covered include data preprocessing, tokenization, stop words removal, stemming, lemmatization, bag of words, TF-IDF, word embeddings, and sentiment analysis. Python is highlighted as a commonly used programming language and libraries like NLTK are mentioned. Demos are provided of tokenization, stemming, lemmatization, and sentiment analysis.
This document analyzes the fidelity and readability of 13 English Bible translations using quantitative linguistic methods. It measures fidelity based on the syntactic transfer rate and consistency of word choices between the original texts and translations. It measures readability based on the rate of common vocabulary words and syntactic fluency compared to a sample of contemporary English. The analysis ranks the translations on fidelity and readability and explores whether a translation can achieve both high fidelity and readability. The results show some translations are ranked highly in both dimensions.
2. Agenda
What are Automated Abstracts?
Process of Automated Abstracts
Extracting significant words
Scoring Sentences using Luhn’s algorithm
Domain specific abstracts
Automated Abstracts on a Massive Data Corpus
The Axiomine Platform
3. What are Automated Abstracts?
• Abstracts comprise the key sentences in the document
• Key challenges
  • Generate Automated Abstracts on massive Terabyte-scale or Streaming Data
  • Exploit valuable domain knowledge
  • Allow abstracts to be based on a user-defined query
    • If a user declares her interest in "Risk", the abstracts will be focussed around the term "Risk" and its related words
In practice, Automated Abstracts are Automated Extracts
4. Process of Automated Abstracts
The process has four stages: Define Corpus & Summary Size Criteria → Extract significant words → Score Sentences per document → Generate Abstracts (Extracts)
• Define Corpus & Summary Size Criteria
  • Define the Document Corpus; a Corpus is a collection of "Text" documents in digital format
  • Define the criteria for key sentence selection. Examples include the top 20 sentences, or the top 5% of the sentences
• Extract significant words
  • Find the important words in the Corpus; word frequency is the simplest measure
  • Words like "and", "the" occur frequently but are not informative; likewise, very low frequency words like "preposterous" are not informative
  • Statistical and Natural Language Processing (NLP) methods offer stronger alternatives: TF-IDF (Term Freq. - Inverse Doc Freq.) is a statistical technique to evaluate word importance, and NLP techniques like Parts of Speech Tagging and Named Entity Extraction can be used
• Score Sentences per document
  • Calculate an importance score for a sentence based on the frequency and co-location of significant words (Luhn's Algorithm); the score of a sentence depends on the relative importance of its significant words
• Generate Abstracts (Extracts)
  • Pick the top sentences based on score and the chosen criteria
Pick sentences based on location & occurrence of important words
5. Extracting significant words
• The number of times a word occurs is an inadequate measure
  • Stop words like "and", "the" occur frequently but are not important
  • Very rarely occurring words like "preposterous" are also not very significant
  • Pick words that occur often, but not too often and also not too rarely
• Two popular methods
  • Statistical measures like TF-IDF can be used
  • Linguistic methods like Natural Language Processing can be used
  • A hybrid of Statistical and Linguistic methods is also possible
Discovery of key words algorithmically is a non-trivial problem
6. Extracting significant words - Statistical Technique
• TF-IDF stands for Term Frequency - Inverse Document Frequency
  • TF-IDF = Term Frequency * log(Inverse Document Frequency)
  • TF = Number of times a word occurs in the corpus
  • DF = Proportion of documents containing the word
  • IDF = log(1/DF)
• Pick words with TF-IDF above a predefined threshold
• Ex. Consider a News corpus with 10,000 news articles:

Word in corpus | TF | DF | 1/DF | IDF = log(1/DF) | TF-IDF
and | 10 million | 10,000 (all docs) | 10K/10K = 1 | 0 | 0
football | 1000 | 100 | 10K/100 = 100 | 2 | 2000

"and" occurs more often, but "football" is the significant word
TF-IDF combines two conflicting measures into a "significance" score
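To make the slide's arithmetic concrete, here is a minimal sketch of its formulation with base-10 logarithms; the function name tf_idf is illustrative, not part of the deck:

```python
import math

def tf_idf(tf, df_docs, n_docs):
    """The slide's formulation: TF-IDF = TF * log10(1/DF), where DF is the
    fraction of the n_docs documents that contain the word."""
    return tf * math.log10(n_docs / df_docs)

N = 10_000                              # news corpus of 10,000 articles
print(tf_idf(10_000_000, 10_000, N))    # "and": 10M * log10(1) = 0.0
print(tf_idf(1_000, 100, N))            # "football": 1000 * log10(100) = 2000.0
```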
7. Extracting significant words - NLP Techniques
• Rules
  • Sentences containing a proper noun are important
  • Sentences containing a place, person, medical/technology term, or an entry from a custom domain dictionary are important
• Two main techniques
  • Parts of Speech Tagging: identifies the grammatical form of the words in the sentence. Is the word a proper noun, noun, adjective, adverb etc.?
  • Named Entity Extraction: discovers named entities like "person", "place", "medical term" in the text of a document. Try out the Calais Viewer.
• Examples of COTS and Open Source Software: Open Calais, GATE, UIMA, Autonomy
Exploit your domain knowledge - No glory in full automation
8. Sentence Scoring (Luhn's Algorithm)
• Find a cluster of important words in a sentence. For a cluster to be formed, important words have to be within a pre-specified number of words of each other (e.g. 3)
• Score each cluster and use the cluster scores to score the sentence
• Example sentence from a medical corpus (the bolded words in the original slide were "discovered" to be significant words):
  "A 15-year-old liver transplant patient is the first person in the world to take on the immune system and blood type of her donor."
  • In "liver transplant patient", all significant words are within 1 word of each other
  • In "immune system and blood type of her donor", all significant words are within a maximum of 3 words of each other
  • "patient" and "immune" are 12 words apart, hence two different clusters in 1 sentence
Important sentences have important words close together
9. Scoring Sentences
• Sample Scoring Criteria
  • Cluster Score = (No. of Significant Words)^2 / (No. of words in the cluster)
  • Sentence Score = Max of all cluster scores for the given sentence
• Pick the top N or N% of sentences for the abstract

Phrase | No. of Significant Words in cluster | No. of words in cluster | Cluster Score
Liver transplant patient | 3 | 3 | (3)^2/3 = 3
immune system and blood type of her donor | 6 | 9 | (6)^2/9 = 4
Sentence Score = Max(3, 4) = 4

All words have the same weight. Limitation(?) or Opportunity(!)
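A minimal Python sketch of this scoring scheme; the function name, token handling, and significant-word set are assumptions, and the optional weights argument anticipates the domain-specific weighting described on the next slide:

```python
def luhn_sentence_score(tokens, significant, max_gap=3, weights=None):
    """Luhn-style scoring: find clusters of significant words separated by
    at most max_gap insignificant words, score each cluster as
    (total weight of significant words)^2 / (cluster length in words),
    and return the best cluster score for the sentence."""
    weights = weights or {}
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return 0.0
    clusters, start, prev = [], positions[0], positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:     # gap too wide: close current cluster
            clusters.append((start, prev))
            start = pos
        prev = pos
    clusters.append((start, prev))
    best = 0.0
    for lo, hi in clusters:
        sig = sum(weights.get(t, 1) for t in tokens[lo:hi + 1] if t in significant)
        best = max(best, sig ** 2 / (hi - lo + 1))
    return best

sentence = ("a 15-year-old liver transplant patient is the first person in the "
            "world to take on the immune system and blood type of her donor").split()
significant = {"liver", "transplant", "patient",
               "immune", "system", "blood", "type", "donor"}
# The exact value depends on which words are marked significant; passing
# weights={"liver": 5, "transplant": 5} reproduces the domain-specific boost.
print(luhn_sentence_score(sentence, significant))
```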
10. Domain Specific Abstracts
• Give each significant word a different weight during cluster scoring
  • We can get Domain/Query specific abstracts!
• Ex. In the previous example, if we wanted abstracts related to "Liver Transplants", we would weigh the words "Liver" and "Transplant" higher (e.g. 5 vs. 1 for the rest)

Phrase | Weight of Significant Words in cluster | Weight of all words in cluster | Cluster Score
Liver transplant patient | 5+5+1 = 11 | 5+5+1 = 11 | (11)^2/11 = 11
immune system and blood type of her donor | 6*1 = 6 | 9*1 = 9 | (6)^2/9 = 4
Sentence Score = Max(11, 4) = 11

Sentences containing the words "liver" or "transplant" will now be weighed higher.
The abstracting process is not a black box - the user & domain can drive it
11. Examples of Domain Specific Abstracts
• Imagine a large Project Review Document
  • Find the Project Risk Summary (give more weight to words related to "Risk")
  • Find the Project Execution Summary (give more weight to words related to Project Management)
• Imagine a Medical Corpus
  • Find sentences related to "Transplant" and "Grafting" procedures
  • Find sentences related to "Heart Surgery" (give more weight to words like "Cardiac", "Heart", "Cardiovascular", etc.)
Domain dictionaries and expert knowledge improve abstracts
12. Automated Abstracts on Big Data Scale (Process)
The pipeline runs as a series of MapReduce processes over a Large Document Corpus:
• A TF-IDF MapReduce process, guided by Weighing Rules, produces the Significant Words
• A Named Entity Extraction MapReduce process enriches the significant words
• An Automated Abstracts MapReduce process combines these with Domain Knowledge to produce the Document Abstracts
Abstract generation techniques work well with the MapReduce technique
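Hadoop specifics aside, the document-frequency step of this pipeline reduces to a map phase emitting (term, doc_id) pairs and a reduce phase counting distinct documents per term; a toy in-process simulation (not actual Hadoop code):

```python
from itertools import groupby
from operator import itemgetter

# Mappers emit (term, doc_id) pairs, the shuffle sorts them by term, and
# reducers count distinct documents per term (the DF part of TF-IDF).
def map_phase(doc_id, text):
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(term, doc_ids):
    return term, len(set(doc_ids))

docs = {0: "risk management plan", 1: "project risk review", 2: "status review"}
pairs = sorted(p for d, t in docs.items() for p in map_phase(d, t))  # "shuffle"
df = dict(reduce_phase(term, [d for _, d in group])
          for term, group in groupby(pairs, key=itemgetter(0)))
print(df["risk"])    # 2 of the 3 documents contain "risk"
```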
13. What can Axiomine do?
• At Axiomine we have developed methods to
  • Generate abstracts on a massive scale
  • Generate abstracts on new documents in real time
  • Allow incorporation of domain knowledge in real time
• We utilize various Big Data Technologies
  • Natural Language Processing on Hadoop
  • Real-time NLP using General-Purpose GPU programming (GPGPU) on NVIDIA graphics chips
At Axiomine we handle large-scale Text Analytics
14. Intuitive Insights Information Access Platform
• Integration platform for diverse data sources comprising Structured and Unstructured Data
• Intuitively navigate a Big Data Corpus at the Speed of Thought
• Methodology and Implementation to perform Topic Modeling on Massive Text Corpora
• A high-fidelity algorithm to estimate Document Similarity based on the results of Topic Modeling
• Develop Automated Domain-Specific Abstracts in Real Time
• Business Intelligence Layer that can query Terabyte-scale corpuses in Real Time
Axiomine's I3AP supports access to unlimited data at the speed of thought