Automated Abstracts and Big Data
Describes at a high level how Automated Abstracts work and how these algorithms can be scaled to a massive corpus.

Presentation Transcript

    • Automated Abstracts. By Sameer Wadkar, Big Data Architect / Data Scientist. December 8th, 2012. © 2012 Axiomine LLC
    • Agenda
        • What are Automated Abstracts?
        • The process of Automated Abstracts
        • Extracting significant words
        • Scoring sentences using Luhn's algorithm
        • Domain-specific abstracts
        • Automated Abstracts on a massive data corpus
        • The Axiomine Platform
    • What are Automated Abstracts?
        • Abstracts comprise the key sentences of a document.
        • Key challenges:
            • Generate Automated Abstracts on massive, terabyte-scale or streaming data.
            • Exploit valuable domain knowledge.
            • Allow abstracts to be based on a user-defined query. If a user declares her interest in "Risk", the abstracts will be focused around the term "Risk" and its related words.
        • In practice, Automated Abstracts are Automated Extracts.
    • The process of Automated Abstracts: Define Corpus & Summary Size Criteria → Extract Significant Words → Score Sentences per Document → Generate Abstracts (Extracts)
        • Define the document corpus. A corpus is a collection of "text" documents in digital format. Also define the criteria for key-sentence selection, e.g. the top 20 sentences or the top 5% of sentences.
        • Extract the important words in the corpus. Word frequency is the simplest measure, but words like "and" and "the" occur frequently without being informative, and very low-frequency words like "preposterous" are not informative either. Statistical and Natural Language Processing (NLP) methods offer stronger measures: TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical technique to evaluate word importance, and NLP techniques like Part-of-Speech Tagging and Named Entity Extraction can also be used.
        • Calculate an importance score for each sentence based on the frequency and co-location of significant words (Luhn's algorithm). A sentence's score depends on the relative importance of its significant words.
        • Pick the top sentences based on score and the chosen criteria.
        • Pick sentences based on the location and occurrence of important words.
    • Extracting significant words
        • The number of times a word occurs is an inadequate measure. Stop words like "and" and "the" occur frequently but are not important; very rarely occurring words like "preposterous" are also not very significant. Pick words that occur often, but not too often and not too rarely.
        • Popular methods:
            • Statistical measures like TF-IDF
            • Linguistic methods like Natural Language Processing
            • A hybrid of statistical and linguistic methods
        • Discovering key words algorithmically is a non-trivial problem.
    • Extracting significant words - statistical technique
        • TF-IDF stands for Term Frequency - Inverse Document Frequency.
        • TF-IDF = TF × IDF, where TF is the number of times a word occurs in the corpus, DF is the proportion of documents containing the word, and IDF = log(1/DF).
        • Pick words with a TF-IDF above a predefined threshold.
        • Example: consider a news corpus with 10,000 articles.

          Word     | TF         | Docs containing word | DF = 1/(docs/total)  | IDF = log(1/DF) | TF-IDF
          and      | 10 million | 10,000 (all docs)    | 10,000/10,000 = 1    | 0               | 0
          football | 1,000      | 100                  | 100/10,000 = 0.01    | 2               | 2,000

          "and" occurs far more often, but "football" is the significant word.
        • TF-IDF combines two conflicting measures into a single "significance" score.
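
      A minimal Python sketch of this corpus-level TF-IDF computation follows; the whitespace tokenization, the base-10 logarithm (matching the worked example above), and the threshold value are illustrative assumptions rather than part of the original deck.

          import math
          from collections import Counter

          def significant_words(docs, threshold):
              """Return words whose corpus-wide TF-IDF exceeds the threshold."""
              n_docs = len(docs)
              tf = Counter()   # total occurrences of each word in the corpus
              df = Counter()   # number of documents containing each word
              for doc in docs:
                  words = doc.lower().split()
                  tf.update(words)
                  df.update(set(words))
              # IDF = log(1/DF) with DF as a proportion, i.e. log(n_docs / doc count)
              return {w for w in tf if tf[w] * math.log10(n_docs / df[w]) > threshold}

          docs = ["the match was a great football match",
                  "the election results are in",
                  "football fans filled the stadium"]
          print(significant_words(docs, threshold=0.3))  # "the" scores 0; "football" passes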
    • Extracting significant words - NLP techniques
        • Rules: sentences containing a proper noun are important; sentences containing a place, a person, a medical or technology term, or a term from a custom domain dictionary are important.
        • Two main techniques: Part-of-Speech Tagging and Named Entity Extraction.
        • Part-of-Speech Tagging identifies the grammatical form of each word in a sentence: is the word a proper noun, noun, adjective, adverb, etc.?
        • Named Entity Extraction discovers named entities such as "person", "place", or "medical term" in the text of a document. Try out the Calais Viewer.
        • Examples of COTS and open-source software: OpenCalais, GATE, UIMA, Autonomy.
        • Exploit your domain knowledge - there is no glory in full automation.
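
      As a hedged illustration of the proper-noun rule above, here is a small sketch using NLTK's part-of-speech tagger. The deck names no specific library, so NLTK, its resource names, and the NNP/NNPS tag check are assumptions.

          import nltk

          # Tokenizer and tagger models (resource names for classic NLTK releases)
          nltk.download("punkt", quiet=True)
          nltk.download("averaged_perceptron_tagger", quiet=True)

          def has_proper_noun(sentence):
              """Apply the rule: a sentence containing a proper noun (NNP/NNPS) is important."""
              tags = nltk.pos_tag(nltk.word_tokenize(sentence))
              return any(tag in ("NNP", "NNPS") for _, tag in tags)

          print(has_proper_noun("The patient met Dr. Smith in Boston."))  # True
          print(has_proper_noun("The patient felt better afterwards."))   # False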
    • Sentence scoring (Luhn's algorithm)
        • Find clusters of important words in a sentence. For a cluster to form, important words must lie within a pre-specified number of words of each other (e.g. 3).
        • Score each cluster, and use the cluster scores to score the sentence. A sketch of this step follows below.
        • Example sentence, where "liver", "transplant", "patient", "immune", "blood", "type", and "donor" were discovered to be significant words in a medical corpus: "A 15-year-old liver transplant patient is the first person in the world to take on the immune system and blood type of her donor."
        • In "liver transplant patient", all significant words are within 1 word of each other, forming one cluster. "patient" and "immune" are 12 words apart, so they fall into different clusters. In "immune system and blood type of her donor", the significant words are within a maximum of 3 words of each other, forming a second cluster.
        • Important sentences have important words close together.
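
      A minimal sketch of the cluster-finding step, assuming the slide's gap limit of 3 words; the function and variable names are illustrative, not from the deck.

          def find_clusters(words, significant, max_gap=3):
              """Group positions of significant words lying within max_gap words of each other."""
              positions = [i for i, w in enumerate(words) if w.lower() in significant]
              clusters, current = [], []
              for pos in positions:
                  if current and pos - current[-1] > max_gap:
                      clusters.append(current)   # too far apart: close the cluster
                      current = []
                  current.append(pos)
              if current:
                  clusters.append(current)
              return clusters

          sentence = ("A 15-year-old liver transplant patient is the first person in the "
                      "world to take on the immune system and blood type of her donor")
          sig = {"liver", "transplant", "patient", "immune", "blood", "type", "donor"}
          print(find_clusters(sentence.split(), sig))
          # [[2, 3, 4], [16, 19, 20, 23]] - the slide's two clusters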
    • Scoring sentences
        • Sample scoring criteria:
            • Cluster Score = (number of significant words in the cluster)² / (number of words in the cluster)
            • Sentence Score = maximum of all cluster scores for the given sentence
            • Pick the top N or top N% of sentences for the abstract.
        • Worked example:

          Phrase                                     | Significant words in cluster | Words in cluster | Cluster score
          liver transplant patient                   | 3                            | 3                | 3²/3 = 3
          immune system and blood type of her donor  | 6                            | 9                | 6²/9 = 4

          Sentence Score = Max(3, 4) = 4
        • All words have the same weight. A limitation(?) or an opportunity(!)
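
      A sketch of this scoring rule, taking the cluster sizes as given (as in the table above); the helper names are illustrative.

          def cluster_score(n_significant, n_words):
              # (number of significant words)^2 / (number of words in the cluster)
              return n_significant ** 2 / n_words

          def sentence_score(clusters):
              """clusters: (significant word count, total word count) pairs for one sentence."""
              return max(cluster_score(s, w) for s, w in clusters)

          # The slide's worked example: Max(3, 4) = 4
          print(sentence_score([(3, 3), (6, 9)]))  # 4.0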
    • Domain-specific abstracts
        • Give each significant word a different weight during cluster scoring, and we get domain- or query-specific abstracts!
        • Example: continuing the previous example, if we wanted abstracts related to "Liver Transplants", we would weigh the words "liver" and "transplant" higher (e.g. 5 vs. 1 for the rest).

          Phrase                                     | Weight of significant words | Weight of all words in cluster | Cluster score
          liver transplant patient                   | 5 + 5 + 1 = 11              | 5 + 5 + 1 = 11                 | 11²/11 = 11
          immune system and blood type of her donor  | 6 × 1 = 6                   | 9 × 1 = 9                      | 6²/9 = 4

          Sentence Score = Max(11, 4) = 11
        • Sentences containing the words "liver" or "transplant" are now weighed higher.
        • The abstracting process is not a black box - the user and the domain can drive it.
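
      A sketch of the weighted variant, assuming the slide's formula of (sum of significant-word weights)² / (sum of all word weights in the cluster); the weight table mirrors the 5-vs.-1 example above.

          def weighted_cluster_score(cluster_words, significant, weights, default=1):
              # Words absent from the weight table fall back to the default weight of 1.
              sig_weight = sum(weights.get(w, default) for w in cluster_words if w in significant)
              all_weight = sum(weights.get(w, default) for w in cluster_words)
              return sig_weight ** 2 / all_weight

          weights = {"liver": 5, "transplant": 5}
          sig = {"liver", "transplant", "patient", "immune", "blood", "type", "donor"}
          print(weighted_cluster_score(["liver", "transplant", "patient"], sig, weights))  # 11.0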
    • Examples of domain-specific abstracts
        • Imagine a large project review document:
            • Find the project risk summary (give more weight to words related to "Risk").
            • Find the project execution summary (give more weight to words related to project management).
        • Imagine a medical corpus:
            • Find sentences related to "Transplant" and "Grafting" procedures.
            • Find sentences related to "Heart Surgery" (give more weight to words like "Cardiac", "Heart", "Cardiovascular", etc.).
        • Domain dictionaries and expert knowledge improve abstracts.
    • Automated Abstracts at Big Data scale (process)
        • A MapReduce process computes TF-IDF over the large document corpus and, together with weighing rules, produces the significant words.
        • A second MapReduce process performs Named Entity Extraction, contributing further significant words.
        • A final MapReduce process, incorporating domain knowledge, generates the document abstracts.
        • Abstract-generation techniques work well with the MapReduce model.
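
      As a hedged sketch of how one such step maps onto MapReduce, here is the document-frequency count (the DF in TF-IDF) written for Hadoop Streaming in Python. The deck names MapReduce but no specific framework, so the streaming protocol (tab-separated stdin/stdout, one document per input line) and the two file names are assumptions.

          # mapper.py - emit (word, 1) once per document that contains the word
          import sys

          for line in sys.stdin:                      # one document per input line
              for word in set(line.lower().split()):  # set(): count each document once
                  print(f"{word}\t1")

          # reducer.py (a separate file) - Hadoop sorts by key between the
          # phases, so equal words arrive together; sum the counts per word
          import sys

          current, count = None, 0
          for line in sys.stdin:
              word, n = line.rstrip("\n").split("\t")
              if word != current and current is not None:
                  print(f"{current}\t{count}")
                  count = 0
              current = word
              count += int(n)
          if current is not None:
              print(f"{current}\t{count}")

      Because each document flows through the mapper exactly once, the same pattern scales to a terabyte corpus; term frequency and the final sentence scoring can be written as analogous jobs.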
    • What can Axiomine do?
        • At Axiomine we have developed methods to:
            • Generate abstracts on a massive scale.
            • Generate abstracts on new documents in real time.
            • Allow the incorporation of domain knowledge in real time.
        • We utilize various Big Data technologies:
            • Natural Language Processing on Hadoop.
            • Real-time NLP using General-Purpose GPU programming (GPGPU) on NVIDIA graphics chips.
        • At Axiomine we handle large-scale text analytics.
    • Intuitive Insights Information Access Platform (I3AP)
        • An integration platform for diverse data sources comprising structured and unstructured data.
        • Intuitively navigate a Big Data corpus at the speed of thought.
        • A methodology and implementation for Topic Modeling on massive text corpora.
        • A high-fidelity algorithm to estimate document similarity based on the results of Topic Modeling.
        • Develop automated, domain-specific abstracts in real time.
        • A Business Intelligence layer that can query terabyte-scale corpora in real time.
        • Axiomine's I3AP supports access to unlimited data at the speed of thought.
    • Q.E.D.