LAZY MAN’S LEARNING
How to Build Your Own Text Summarizer
Sho Fola Soboyejo, Digital Architect, Kroger Co.
April 19th, 2018
@shoreason
I’VE GOT A FEVER AND THE ONLY
PRESCRIPTION IS … MORE BOOKS
NATURAL LANGUAGE
PROCESSING (NLP) DOMAINS
• Mostly Solved: Spam detection, part-of-speech
tagging, named entity recognition
• Making Progress: Sentiment analysis, coreference
resolution, word sense disambiguation, parsing,
machine translation, information extraction
• Still Really Hard: Question answering, paraphrase,
summarization, and dialogue
PROBLEMS IN NLP
• Ambiguity: Red Tape Holds Up New Bridges
• Idioms: Get Cold Feet, Dark Horse
• Neologisms: Bromance, Unfriend, Retweet
• Tricky named entities: Where is Black Panther playing?
• Non-Standard English: #challengeday, @mlmeetup
Stanford NLP: Dan Jurafsky
“HOW CAN YOU SAY THE MOST IMPORTANT THINGS
IN THE SHORTEST AMOUNT OF TIME?”
- Siraj Raval
PRACTICAL APPLICATIONS
FOR SUMMARIZATION
• Headlines (from around the world)
• Outlines (notes for students)
• Minutes (of a meeting)
• Previews (of movies)
• Synopses (soap opera listings)
• Reviews (of a book, CD, movie, etc.)
• Bulletins (weather forecasts/stock market
reports)
• Sound bites (politicians on a current issue)
- Page 1, Advances in Automatic Text
Summarization, 1999.
FORMS OF SUMMARIZATION
Single Document vs Multi Document
APPROACHES
Extractive vs Abstractive
EXTRACTIVE
• Figure out the most
important sentences in the
document. Then simply
extract and order those.
• Same words and sentences
as in the document. No abstraction.
• Ranking phrase relevance
ABSTRACTIVE
• Boil down the gist of a
document into an abstract,
likely using new words in
the summary.
• Very much what you and I
would do.
• Much harder
“IT’S FAR EASIER TO RECOGNIZE WORDS THAN IT IS
TO UNDERSTAND THE MEANING”
- Laura Klein (Design for Voice Interfaces)
SPEED READING TIPS
• 1st and last sentence
(Order in text)
• Title and other paragraphs
(Connection to other
sentences)
• Index (Word Frequency)
• Focus on Keywords
BASIC CLEAN UP EXPECTED
• Remove Stop Words
• Stemming
• Lower case
• Remove Punctuation
• Remove Numbers
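The clean-up steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical stand-in for a real stemmer (such as Porter's), so its output is rougher than what a proper library would give.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "but"}


def crude_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lower-case, drop punctuation and numbers, remove stop words, then stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    return [crude_stem(w) for w in text.split() if w not in STOP_WORDS]


print(clean("The 3 cats are running, and it rained!"))
```

Note that the crude stemmer turns "running" into "runn"; a real stemmer handles such cases better, which is why this step is usually delegated to a library.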
STAGES
CONTENT SELECTION
▸ Sentence Segmentation
▸ Sentence Extraction
▸ Sentence weight
INFORMATION ORDERING
▸ Document order
SENTENCE REALIZATION
▸ Keep original sentences
▸ Sentence simplification
SUMMARY OPTIONS
Algorithmia
Gensim (summarization)
OFF TO THE RACES
Algorithmia &
Gensim in Action
NAIVE ALGORITHM
• Determine most frequent content words in original document
(Word frequency table)
• The N most common words are stored and sorted (N = 100)
• Score each sentence based on how many high frequency words it
contains
• Build summary by compiling sentences above certain score threshold
• Select N top sentences and sort based on order in original text
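A minimal sketch of the naive frequency approach described above, using only the standard library. The function names, the small stop-word list, and the regex-based sentence splitter are assumptions made for illustration, and the sketch picks the top-scoring sentences directly rather than applying a separate score threshold.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "but"}


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, top_words=100, top_sentences=2):
    sentences = split_sentences(text)
    # Word frequency table over the whole document.
    freq = Counter(w for s in sentences for w in content_words(s))
    frequent = {w for w, _ in freq.most_common(top_words)}
    # Score = number of high-frequency words the sentence contains.
    scores = [sum(1 for w in content_words(s) if w in frequent) for s in sentences]
    # Take the best-scoring sentences (ties go to the earlier sentence) ...
    best = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:top_sentences]
    # ... and return them in their original document order.
    return [sentences[i] for i in sorted(best)]


text = "Cats sleep a lot. Cats and dogs play. Birds sing. Dogs chase cats."
print(summarize(text, top_words=2, top_sentences=2))
```

On short texts a `top_words` of 100 covers every word, so the score reduces to sentence length in content words; the example passes a smaller value to make the ranking visible.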
NAIVE 1.0 ALGORITHM IN ACTION
https://koko-summarizer.herokuapp.com/content
NAIVE EXTRACTIVE
ALGORITHM 2.0
• Compare each sentence in the document against every other sentence and determine
the intersection
• [0][2] = intersection score of comparing sentence 1 to sentence 3
• Treating each sentence as a node, the connection between two nodes is their intersection
score, i.e. the weight of the edge
• Calculate the score of each sentence/node as a key-value pair {sentence: nodeScore}
• NodeScore = sum of all intersections with other sentences, excluding itself, i.e. the sum of all
edges connected to the node
• Split the text into paragraphs and pick the best sentence in each paragraph. Essentially, treat
each paragraph as a subset of the graph and pick the best node in each subset
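The graph-style scoring above can be sketched as follows. This is an illustrative reading of the bullets, not the deck's actual code: the helper names, the regex tokenizer, and the blank-line paragraph split are all assumptions, and the intersection score uses sets of distinct words.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Set of distinct lower-cased words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def intersection_score(a, b):
    """Shared words divided by the average size of the two word sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / ((len(a) + len(b)) / 2)


def node_scores(sentences):
    """Map each sentence to the sum of its edge weights to all other sentences."""
    sets = [tokens(s) for s in sentences]
    n = len(sets)
    # matrix[i][j] is the intersection score of sentence i against sentence j.
    matrix = [[intersection_score(sets[i], sets[j]) for j in range(n)] for i in range(n)]
    return {sentences[i]: sum(matrix[i][j] for j in range(n) if j != i) for i in range(n)}


def summarize(text):
    """Pick the highest-scoring sentence from each paragraph."""
    paragraphs = [split_sentences(p) for p in re.split(r"\n\s*\n", text) if p.strip()]
    scores = node_scores([s for p in paragraphs for s in p])
    return [max(p, key=lambda s: scores[s]) for p in paragraphs if p]


text = "Cats chase mice. Mice fear cats. Birds sing.\n\nDogs chase cats. The weather is nice."
print(summarize(text))
```

Scoring every sentence against the whole document, then choosing one winner per paragraph, keeps the summary spread across the text instead of clustering around a single topic.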
SENTENCE INTERSECTIONS
• s1 = "my friend's car is nicer than mine but my wife is way more beautiful"
• s2 = "my wife is more beautiful and has brown eyes"
• Treating each sentence as a set of words: s1.intersection(s2) = {'is', 'wife', 'beautiful', 'my', 'more'}
• Intersection score = len(s1.intersection(s2)) / ((len(s1) + len(s2)) / 2) = .4762
• Lower score, less similarity; higher score, more similarity
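The worked example can be reproduced directly in Python, assuming each sentence is split on whitespace and turned into a set of distinct words. Under that assumption s1 has 12 distinct words ("my" and "is" each count once) and s2 has 9, which is what yields the .4762 figure.

```python
s1 = set("my friend's car is nicer than mine but my wife is way more beautiful".split())
s2 = set("my wife is more beautiful and has brown eyes".split())

shared = s1.intersection(s2)
# Shared words divided by the average number of distinct words per sentence.
score = len(shared) / ((len(s1) + len(s2)) / 2)

print(sorted(shared))    # ['beautiful', 'is', 'more', 'my', 'wife']
print(round(score, 4))   # 0.4762
```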
[Figure: Graph Theory Implications]
WHY THIS MIGHT WORK
• Again, a paragraph can be treated as a subatomic
piece of a text
• Sentences with strong intersection likely hold the
same or very similar information
• Sentences that intersect with many other
sentences are likely key to the text
NAIVE 2.0 ALGORITHM IN ACTION
built on code by Shlomi Babluki
https://koko-summarizer.herokuapp.com/content
GOING MUCH FURTHER
• Bi-Grams
• TF-IDF (frequent in a
document but not across
documents)
• Including Title
• Apply stemming
• RNN (Recurrent Neural
Network)
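To make the TF-IDF idea concrete, here is a minimal sketch using one common formulation: term frequency within a document multiplied by log(N / document frequency). The function name and the whitespace tokenizer are assumptions; libraries often use smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", w)
```

A word such as "the" that appears in every document gets weight 0, while a word unique to one document gets the highest weight there, which matches the "frequent in a document but not across documents" intuition.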
GOAL
Train an encoder-decoder recurrent neural network
with LSTM units and attention for generating
summaries using the texts of news articles from the
Gigaword dataset
WHAT IS A NEURAL
NETWORK?
• Modeled after the human brain
(neurons) and nervous system
• Like a neuron, it has input,
hidden and output layers
• The network initializes with
guesses, then learns and adjusts
as more data passes through it
• Deep learning is using a neural
network with more hidden
layers
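The "start with guesses, then adjust" idea can be illustrated with the simplest possible case: a single artificial neuron (a perceptron) learning the logical AND function. This is a toy sketch with assumed names and parameters, far smaller than the encoder-decoder network named in the goal above, and it has no hidden layers.

```python
import random


def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=1000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Start from random guesses for the weights and the bias.
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error:
                # Nudge the weights and bias in the direction that reduces the error.
                mistakes += 1
                weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:
            break  # every sample is classified correctly
    return weights, bias


AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_GATE)
print([predict(weights, bias, x) for x, _ in AND_GATE])
```

Deep networks apply the same loop at scale: many neurons arranged in layers, with the adjustments computed by backpropagation instead of this simple rule.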
NEURAL NETWORKS (WHITE
PAPERS)
SEQ TO SEQ LEARNING
Courtesy: Quoc V. Le & Mike Schuster, Research Scientists,
Google Brain Team
SALESFORCE PAPER
https://www.salesforce.com/products/einstein/ai-research/tl-dr-reinforced-model-abstractive-summarization/
BRINGING IT TOGETHER
Abstractive: Neural Networks
Extractive: Algorithmia, Gensim, Naive 1.0 and 2.0
GETTING STARTED
• Try out Algorithmia and
Gensim
• Fork my GitHub code and try
your hand at Naive 3.0
• Explore some NLP and
Machine Learning intro
courses
• Check out the White Papers
I referenced in this talk
ACCESS TO RICH DATASETS
• CNN/Daily Mail Stories (Kyunghyun Cho)
• https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ
• BBC Stories
• http://mlg.ucd.ie/
• Annotated English Gigaword
• https://catalog.ldc.upenn.edu/LDC2012T21
Look out for deck on Slideshare
@shoreason
www.shoreason.com
github.com/shoreason
