Probabilistic Content Models,
with Applications to Generation and Summarization
Bryan Zhang Hang
Outline:
Goal: Modeling Topic Structures of Text
We will use:
 Hidden Markov Model
 Bigrams
 Clustering
Application:
 Sentence Ordering
 Extractive Summarization
Review: Hidden Markov Model:
[Diagram: a chain of hidden STATES S1 → S2 → S3; arrows between states are TRANSITIONS, and each state emits an OBSERVATION O1, O2, O3 via EMISSIONS.]
Imagine:
 You call a friend who lives in a foreign country from time to time. Every time, you ask, “What are you up to?”
 The possible answers are:
“walk”, “ice cream”, “shopping”, “reading”, “programming”, “kayaking”
Review: Hidden Markov Model:
Possible answers over a month:
“kayaking”, “walk”, “shopping”, “kayaking”, “programming” …
sunny, sunny, probably sunny, sunny, probably rainy
The weather is the latent class (the hidden part): we never observe it directly, only the answers it generates.
Review: Hidden Markov Model:
[Diagram: the same chain S1 → S2 → S3 with observations O1, O2, O3; arrows between states carry TRANSITION probabilities and arrows from states to observations carry EMISSION probabilities.]
Review: Hidden Markov Model:
States: R → S → S
Observations: Programming, Walking, Reading
The probability of the sequence Programming, Walking, Reading given this weather sequence is:
P(R|START) * P(S|R) * P(S|S) * P(Programming|R) * P(Walking|S) * P(Reading|S)
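A minimal sketch of this computation (the probability values are illustrative assumptions, since the slide gives none):

```python
# Joint probability of a state sequence and an observation sequence in an HMM.
# The numbers below are made up for illustration.
trans = {("START", "R"): 0.6, ("R", "S"): 0.3, ("S", "S"): 0.6}
emit = {("R", "programming"): 0.2, ("S", "walking"): 0.3, ("S", "reading"): 0.25}

def sequence_probability(states, observations):
    """Product of transition probabilities times emission probabilities."""
    p, prev = 1.0, "START"
    for s, o in zip(states, observations):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p

print(sequence_probability(["R", "S", "S"], ["programming", "walking", "reading"]))
```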
Exercise:
[Diagram: two states, Rainy and Sunny, with transition probabilities between them and emission probabilities for the observations Walk, Go shopping, Clean; all the numbers are listed below.]
What state sequence (START-S1-S2) maximizes the probability of the observation sequence “clean, shopping”?
Transition P.
P(R|START) = 0.6   P(S|START) = 0.4
P(S|R) = 0.3       P(R|R) = 0.7
P(S|S) = 0.6       P(R|S) = 0.4
Emission P.
P(CLEAN|R) = 0.5       P(CLEAN|S) = 0.1
P(SHOPPING|R) = 0.4    P(SHOPPING|S) = 0.3
STATES {R, S}, EMISSIONS {CLEAN, SHOPPING}; transition and emission probabilities as listed above.
The four candidate paths START-S1-S2 for the observations CLEAN, SHOPPING:
START-S-S: P(S|START) * P(CLEAN|S) * P(S|S) * P(SHOPPING|S) = 0.4 * 0.1 * 0.6 * 0.3 = 0.0072
START-S-R: P(S|START) * P(CLEAN|S) * P(R|S) * P(SHOPPING|R) = 0.4 * 0.1 * 0.4 * 0.4 = 0.0064
START-R-S: P(R|START) * P(CLEAN|R) * P(S|R) * P(SHOPPING|S) = 0.6 * 0.5 * 0.3 * 0.3 = 0.027
START-R-R: P(R|START) * P(CLEAN|R) * P(R|R) * P(SHOPPING|R) = 0.6 * 0.5 * 0.7 * 0.4 = 0.084
THE ANSWER IS START-RAIN-RAIN.
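A brute-force sketch that enumerates every state sequence with the probabilities above and confirms the answer:

```python
from itertools import product

# Transition and emission probabilities from the exercise.
trans = {("START", "R"): 0.6, ("START", "S"): 0.4,
         ("R", "R"): 0.7, ("R", "S"): 0.3,
         ("S", "R"): 0.4, ("S", "S"): 0.6}
emit = {("R", "clean"): 0.5, ("S", "clean"): 0.1,
        ("R", "shopping"): 0.4, ("S", "shopping"): 0.3}

def path_probability(states, observations):
    p, prev = 1.0, "START"
    for s, o in zip(states, observations):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p

obs = ["clean", "shopping"]
for states in product("RS", repeat=len(obs)):
    print(states, path_probability(list(states), obs))
# ('R', 'R') scores 0.084, the highest: START-RAIN-RAIN.
```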
Probabilistic Content Model
[Diagram: the same HMM chain, but the hidden states S1 → S2 → S3 are now TOPICS and the observations O1, O2, O3 are SENTENCES; transitions model topic order, emissions model sentence generation.]
Sentences are bigram sequences.
The probability of an n-word sentence w1 … wn generated from a state s is the product of the state-specific bigram probabilities:
P(w1 … wn | s) = p_s(w1 | w0) * p_s(w2 | w1) * … * p_s(wn | wn-1), where w0 is a sentence-start symbol.
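A minimal sketch of such a state-specific bigram model, assuming simple add-delta smoothing for unseen bigrams (the slides do not specify the emission smoothing):

```python
from collections import defaultdict

class StateBigramModel:
    """Per-topic bigram emission model: P(sentence | state)."""
    def __init__(self, delta=0.1):
        self.delta = delta
        self.bigram = defaultdict(int)   # (previous word, word) -> count
        self.context = defaultdict(int)  # previous word -> count
        self.vocab = set()

    def train(self, sentences):
        for words in sentences:
            prev = "<s>"  # sentence-start symbol
            for w in words:
                self.bigram[(prev, w)] += 1
                self.context[prev] += 1
                self.vocab.add(w)
                prev = w

    def sentence_probability(self, words):
        p, prev = 1.0, "<s>"
        for w in words:
            p *= ((self.bigram[(prev, w)] + self.delta) /
                  (self.context[prev] + self.delta * len(self.vocab)))
            prev = w
        return p
```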
TOPICS: derived from the content (STEP 1)
 Partition the sentences of a domain-specific document collection into k clusters (the initial clusters).
 Use bigram vectors as features.
 Sentence similarity is the cosine of the bigram vectors, as in the sketch below.
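A sketch of this clustering step using scikit-learn as a stand-in; one reasonable choice here is complete-link agglomerative clustering over bigram cosine similarity (parameter names may vary by scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

def initial_clusters(sentences, k):
    """STEP 1: cluster sentences by cosine similarity of word-bigram vectors."""
    vectorizer = CountVectorizer(ngram_range=(2, 2))  # word-bigram features
    X = vectorizer.fit_transform(sentences).toarray()
    clustering = AgglomerativeClustering(
        n_clusters=k, metric="cosine", linkage="complete")  # older versions: affinity="cosine"
    return clustering.fit_predict(X)
```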
An example of the output: a cluster whose sentences all convey LOCATION INFORMATION.
TOPICS: derived from the content
 D(C, C′): the number of documents in which a sentence from cluster C immediately precedes one from cluster C′.
 D(C): the number of documents containing sentences from C.
 For two states C, C′, the smoothed estimate of the state-transition probability is:
P(C′ | C) = (D(C, C′) + δ) / (D(C) + δ * k)
where k is the number of clusters and δ is a small smoothing constant.
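A sketch of this estimate, assuming each document is already reduced to its per-sentence cluster labels:

```python
from collections import defaultdict

def transition_matrix(docs, k, delta=1.0):
    """docs: list of documents, each a list of cluster labels (one per sentence).
    Returns the k x k matrix of smoothed P(C' | C) from document-level counts."""
    d_pair = defaultdict(int)    # D(C, C')
    d_single = defaultdict(int)  # D(C)
    for labels in docs:
        for c in set(labels):
            d_single[c] += 1
        for pair in {(a, b) for a, b in zip(labels, labels[1:])}:
            d_pair[pair] += 1
    return [[(d_pair[(c1, c2)] + delta) / (d_single[c1] + delta * k)
             for c2 in range(k)]
            for c1 in range(k)]
```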
TOPICS: derived from the content (STEP 2)
EM-like Viterbi re-estimation:
 From the initial sentence clusters (topic clusters) we can compute the transition probabilities.
 The resulting Hidden Markov Model estimates the topic of each sentence.
 Each sentence s is reassigned to the topic cluster of its estimated topic.
 The cluster/estimate cycle is repeated until the clusters stabilize; a sketch follows.
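A high-level sketch of that cycle, tying together the illustrative helpers above; viterbi_decode (returning the most likely topic sequence for one document) is assumed, not shown:

```python
def content_model_em(docs, k, max_iters=20):
    """docs: list of documents, each a list of tokenized sentences.
    Alternates fitting emission/transition models from the current labels
    with re-labelling every sentence by Viterbi decoding, until stable."""
    all_sents = [s for doc in docs for s in doc]
    labels = list(initial_clusters([" ".join(s) for s in all_sents], k))
    for _ in range(max_iters):
        # Fit one bigram emission model per topic cluster.
        models = [StateBigramModel() for _ in range(k)]
        for sent, c in zip(all_sents, labels):
            models[c].train([sent])
        # Split the flat labels back into per-document sequences.
        doc_labels, i = [], 0
        for doc in docs:
            doc_labels.append(labels[i:i + len(doc)])
            i += len(doc)
        trans = transition_matrix(doc_labels, k)
        # Re-assign: decode each document's most likely topic sequence.
        new_labels = [c for doc in docs
                      for c in viterbi_decode(doc, models, trans)]  # assumed helper
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```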
Evaluation Task 1: Information Ordering
 The information-ordering task is essential to many text-synthesis applications, e.g. concept-to-text generation and multi-document summarization.
[Figure: test-set statistics for the ordering task, including the number of sentences per document.]
Evaluation Task 1: Information Ordering
Number of possible sentence orders:
 3 sentences: 3 * 2 * 1 = 6 different orders
 4 sentences: 4 * 3 * 2 * 1 = 24
 More than 10 sentences means over 3 million different orders (10! = 3,628,800).
Evaluation Task 1: Information Ordering
 Generate all the sentence orders (permutations).
 Compute the probability of each order under the model.
 Rank the orders by probability, as in the sketch below.
Metric:
 OSO (original sentence order): the position of the original order in the ranked list.
Baseline:
 Word bigram model.
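A sketch of the ranking step, assuming a hypothetical order_probability scorer backed by the trained content model:

```python
from itertools import permutations

def oso_rank(sentences, order_probability):
    """Rank all permutations by model probability and return the 1-based
    rank of the original sentence order (OSO).
    Only feasible for short documents: n sentences give n! permutations."""
    original = tuple(range(len(sentences)))
    ranked = sorted(permutations(original),
                    key=lambda perm: order_probability([sentences[i] for i in perm]),
                    reverse=True)
    return ranked.index(original) + 1
```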
Evaluation Task 1: Information Ordering
 Rank: the rank the model gives to the original sentence order (OSO) among all permutations.
 OSO prediction rate: the percentage of test cases in which the model gives the highest probability to the OSO among all possible permutations.
Evaluation Task 1: Information Ordering
 Kendall's τ measures how much an ordering differs from the OSO; it is an indicator of the number of swaps:
τ = 1 − 2 * (number of swaps of adjacent sentences) / (N(N − 1)/2)
 Lapata's technique is a feature-rich method (in this experiment it uses linguistic features such as noun-verb dependencies).
 It aggravates the data-sparseness problem on a smaller corpus.
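A small sketch of Kendall's τ between a predicted order and the OSO, counting pairwise inversions directly:

```python
def kendall_tau(predicted, original):
    """tau = 1 - 2 * inversions / (number of pairs); 1.0 means identical
    order, -1.0 means fully reversed. Arguments are lists of sentence ids."""
    pos = {sid: i for i, sid in enumerate(original)}
    seq = [pos[sid] for sid in predicted]
    n = len(seq)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if seq[i] > seq[j])
    return 1 - 2 * inversions / (n * (n - 1) / 2)

print(kendall_tau([1, 0, 2, 3], [0, 1, 2, 3]))  # one inverted pair of 6 -> 0.67
```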
Evaluation Task 2: Summarization
 Baseline: the “lead” baseline picks the first L sentences.
 Sentence classifier:
1. Each sentence is labelled “in” or “out” of the summary.
2. The features for each sentence are its unigrams and its location, i.e. we look at the words and where the sentence occurs in the document.
Evaluation Task 2: Summarization
Probabilistic content model:
 Every sentence in the documents is assigned a topic.
 Every sentence in the summaries is assigned a topic.
P(topic A appears in a summary) =
(number of documents whose summary contains topic A) / (number of documents in which topic A appears)
 Sentences whose topics have a high probability of appearing in summaries are extracted, as sketched below.
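A sketch of this extraction rule, assuming topic assignments for the documents and their reference summaries are already available (e.g. from the decoding step above):

```python
def summary_topic_probabilities(doc_topics, summary_topics):
    """doc_topics / summary_topics: parallel lists of per-document topic sets,
    for the document body and its reference summary respectively.
    Returns P(topic appears in summary | topic appears in document)."""
    probs = {}
    for t in set().union(*doc_topics):
        in_docs = sum(1 for d in doc_topics if t in d)
        in_summaries = sum(1 for d, s in zip(doc_topics, summary_topics)
                           if t in d and t in s)
        probs[t] = in_summaries / in_docs
    return probs

def extract_summary(sentences, topics, probs, length):
    """Pick the `length` sentences whose topics are most often summarized."""
    ranked = sorted(zip(sentences, topics),
                    key=lambda pair: probs.get(pair[1], 0.0), reverse=True)
    return [s for s, _ in ranked[:length]]
```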
Evaluation Task 2: Summarization
The content model outperforms the sentence-level, locally-focused method (the word + location classifier) and the lead baseline.
[Figure: extraction accuracy of the content model vs. the word + location classifier and the lead baseline.]
Relation Between the Two Tasks
Single domain: earthquakes
 Ordering: OSO prediction rate
 Summarization: extractive accuracy
Optimizing parameters on one task promises to yield good performance on the other, so this content model serves as an effective representation of text structure in general.
Conclusions:
This unsupervised, knowledge-lean method validates the hypothesis:
Word-distribution patterns strongly correlate with discourse patterns within a text (at least in specific domains).
Future direction:
This model is domain-dependent; a natural next step is to incorporate domain-independent relations into the transition structure of the content model.