19.6.2012




                          Text Stream Processing

    Dunja Mladenić, Marko Grobelnik, Blaž Fortuna, Delia Rusu

    Artificial Intelligence Laboratory
    Jožef Stefan Institute
    Ljubljana, Slovenia


                                                                          ailab.ijs.si




Introduction to text streams
• What are text streams
• Properties of text streams
• Motivation
• Pre-processing of text streams
• Text quality

Text stream processing
• Topic detection
• Entity, event and fact extraction and resolution
• Word sense disambiguation
• Summarization
• Sentiment analysis
• Social network analysis

Concluding remarks
• Key literature overview
• Further publicly available tools
• Conclusions
• Questions and discussion








        Introduction to Text Streams
What are text streams
Properties of text streams
Motivation
Pre-processing of text streams
Text quality








              What are data streams
Continuously arriving data, usually in real-time

Dealing with streams can often be easy, but…
   …it gets hard when the data stream is intensive and
   complex operations on the data are required!

In such situations usually…
   …the volume of data is too big to be stored
   …the data can be scanned thoroughly only once
   …the data is highly non-stationary (changes properties through
   time), therefore approximation and adaptation are key to success

Therefore, a typical solution is…
   …not to store observed data explicitly, but rather in the aggregate
   form which allows execution of required operations
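
The idea of keeping an aggregate instead of the raw data can be sketched as follows. This is a minimal illustration, assuming the required operations are count, mean and variance of a numeric stream (for example document lengths); the class name RunningStats and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Aggregate summary of a numeric stream; individual items are not stored."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's single-pass update: each item is seen exactly once.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


stats = RunningStats()
for length in (120, 80, 100):   # hypothetical document lengths, arriving one by one
    stats.update(length)
```

Memory use stays constant no matter how many items arrive, which is the point of the aggregate form.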






              Stream processing
Who works with real time data processing?

  “Stream Mining” (a subfield of “Data Mining”) deals
  with mining data streams in different scenarios, in
  relation to machine learning and databases
    http://en.wikipedia.org/wiki/Data_stream_mining


  “Complex Event Processing” is a research area
  discovering complex events from simple ones by
  inference, statistics etc.
    http://en.wikipedia.org/wiki/Complex_Event_Processing






  Motivation for stream processing
Why would one need (near) real-time information
processing?
  …because Time and Reaction Speed correlate with
  many target quantities – e.g.:
    …on stock exchange with Earnings
    …in controlling with Quality of Service
    …in fraud detection with Safety, etc.


  Generally, we can say: Reaction Speed == Value
    …if our systems react fast, we create new value!








             What are text streams
Continuous, often rapid, ordered sequence of texts
Text information arriving continuously over time in
  the form of a data stream
  News and similar regular reports
    News articles, online comments on news, online traffic
    reports, internal company reports, web searches,
    scientific papers, patents
  Social media
    discussion forums (e.g., Twitter, Facebook), short
    messages on phones or computers, chat, transcripts of
    phone conversations, blogs, e-mails
            Demo: http://newsfeed.ijs.si




                      NewsFeed








          Properties of text streams
Produced at a high rate over time
Can be read only once or a small number of times
(due to the rate and/or overall volume)
Challenging for computing and storage
capabilities – efficiency and scalability of the
approaches
Strong temporal dimension
Modularity over time and sources (topic,
sentiment,…)






      Example task: evolution of research
       topics and communities over time

Based on time-stamped research publication titles and
authors
Observe which topics/communities shrank, emerged, or
split over time, and when the turning points
occurred
TimeFall – monitoring dynamic, temporally evolving
graphs and streams based on Minimum Description
Length
  find good cut-points in time, and stitch together the communities:
  good cut-point leads to shorter description length.
  fast and efficient incremental algorithm, scales to large datasets,
  easily parallelizable






                         Example task: evolution of research
                          topics and communities over time
       Given: n time-stamped events (e.g., papers), each related to
       several of m items (e.g., title-words and/or author-names)
       Find cluster patterns and summarize their evolution in time
       [Figure: five numbered panels with axes Time, Papers, Words and Time, Word Clusters]




       TimeFall on 12 million medical publications from PubMed MEDLINE over 40 years
       scales linearly with the product of the initial time point blocks and
       the number of non-zeros in the matrix



 J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, M. Grobelnik. Monitoring Network Evolution
 using MDL. International Conference on Data Engineering (ICDE 2008).








       Pre-processing text stream
Basic text pre-processing
  including removing stop-words, applying stemming
Representing text for internal processing
  Splitting into units (e.g., sentences or words)
  Mapping to internal representation (e.g., feature
  vectors of words, vectors of ontology concepts)
Pre-processing for aligning/merging text streams
  Time-wise alignment of multiple text streams -
  coordinated text streams (appearing over the same
  time window, e.g., news)
  Content alignment possibly over different languages




                      Example

      The city hosts a great number of religious

      buildings, many of them dating back to

      medieval times.

                        Stop Words








                  Example

city hosts great number religious buildings,
many them dating back medieval times.

hosts -> host, religious -> religi, buildings -> build,
dating -> date, medieval -> mediev, times -> time




                             Stemming








                  Example

city host great number religi build, many

them date back mediev time.
            Splitting into units of words



      (city, host, great, number, religi, build,
      many, them, date, back, mediev, time)



              Feature vector of words
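
The three example slides (stop words, stemming, feature vector) can be sketched as one small pipeline. This is only an illustration: the stop-word list is a toy list inferred from the words dropped in the example sentence, and the stemmer is left as a pluggable function (identity by default) because the stems shown above come from a real stemming algorithm that is not reimplemented here. The names STOP_WORDS and preprocess are hypothetical.

```python
import re
from collections import Counter

# Toy stop-word list: just the words removed in the slide example (an assumption).
STOP_WORDS = {"the", "a", "of", "to"}


def preprocess(text, stem=lambda word: word):
    """Return a bag-of-words feature vector (word -> count) for one text."""
    tokens = re.findall(r"[a-z]+", text.lower())           # split into words
    kept = [t for t in tokens if t not in STOP_WORDS]       # remove stop words
    return Counter(stem(t) for t in kept)                   # stem, then count


sentence = ("The city hosts a great number of religious buildings, "
            "many of them dating back to medieval times.")
vector = preprocess(sentence)
```

A real stemmer can be passed in through the stem argument; the feature vector then counts stems instead of surface forms.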






                                   Text Quality
         Factors:
              Vocabulary use
              Grammatical and fluent sentences
              Structure and coherence
              Non-redundant information
              Referential clarity – e.g. proper usage of pronouns
         Models of text quality
              Global coherence - overall document organization
              Local coherence - Adjacent sentences
         Language model based approaches









                     Text Stream Processing

WEB -> Web Crawler -> Text Pre-Processing
    -> Topic Detection, Information Extraction, Word Sense Disambiguation
    -> Summarization, Sentiment Analysis, Social Network Analysis
    -> Text Stream Processing Results





                          Topic Detection




Religion

Art





                Topic Detection
Supervised techniques
  The data is labeled with predefined topics
  Machine learning algorithms are used to predict
  unseen data labels
Unsupervised techniques
  Identify patterns and structure within the dataset
  Clustering: grouping data sharing similar topics
  Statistical methods: probabilistic topic modeling








      Probabilistic Topic Modeling
Topic: a probability distribution over words in a
fixed vocabulary
Given an input corpus containing a number of
documents, each having a sequence of words,
the goal is to find useful sets of topics








                  Latent Dirichlet Allocation
    Documents
    can have
    multiple topics

       Religion

       Art




D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of
   Machine Learning Research, 3:993–1022, January 2003




                   LDA Generative Process
     A topic is a distribution over words
     A document is a mixture of topics (at the level of
     the corpus)
     Each word is drawn from one of the corpus-level
     topics

     For each document generate the words:
      1. Randomly choose a distribution over the topics
      2. For each word in the document
          a)   Randomly choose a topic from the distribution over topics in
               (step 1)
          b)   Randomly choose a word from the corresponding distribution
               over the vocabulary
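
The generative steps above can be sketched in a few lines. This is a toy simulation of the generative story only (no inference); the two topics, their word probabilities, the Dirichlet parameter alpha and the function names are all made up for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """topics: list of {word: probability} dicts (the corpus-level topics)."""
    theta = sample_dirichlet(alpha, rng)                  # step 1: topic mixture
    words = []
    for _ in range(n_words):                              # step 2: each word
        z = rng.choices(range(len(topics)), weights=theta)[0]     # 2a: pick a topic
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])        # 2b: pick a word
    return theta, words


# Hypothetical toy topics; the probabilities are illustrative only.
TOPICS = [
    {"religious": 0.5, "monastery": 0.25, "church": 0.25},
    {"art": 0.4, "painter": 0.4, "sculpture": 0.2},
]
theta, words = generate_document(TOPICS, alpha=[1.0, 1.0], n_words=8,
                                 rng=random.Random(0))
```

Fitting a topic model means running this story in reverse: given only the words, estimate the topics and the per-document mixtures.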





                  LDA Generative Process

Assume a number of topics for the document collection (Craiova guide)
Choose a distribution over the topics
For each word:
  • Choose a topic assignment
  • Choose the word from the topic

 religious 0.03
 monastery 0.01
 church 0.01

 art 0.02
 painter 0.02
 sculpture 0.01

 park 0.01
 garden 0.01




                  Topic Models - Extensions
    Hierarchical Topic Models
        D. Blei, T. Griffiths, and M. Jordan. The nested
        Chinese restaurant process and Bayesian
        nonparametric inference of topic hierarchies. Journal
        of the ACM, 57(2):1–30, 2010.
        Q. Ho, J. Eisenstein, E. P. Xing. Document
        Hierarchies from Text and Links. WWW 2012
    Dynamic Topic Models
        D. Blei and J. Lafferty. Dynamic topic models. In
        Proceedings of the 23rd International Conference on
        Machine Learning, 2006.





       Topic Detection in Streams
Unsupervised methods
Simpler approaches – e.g. Clustering
Probabilistic topic models
  Challenging because of the amount and dynamics of
  the data
  E.g. Online inference for LDA – fits a topic model to
  random Wikipedia articles








            Topic Detection Tools
Available implementations
  LDA, HLDA, …
    http://www.cs.princeton.edu/~blei/topicmodeling.html
  Mallet
    Toolkit for statistical NLP
    http://mallet.cs.umass.edu/








                  Clustering on text streams

                                          Grouping similar documents
                                            – adjusting to changes in
                                            the topics over time
                                              Clusters generated as the data
                                              arrives and stored in a tree
                                              Adding examples by adjusting
                                              the whole path from the root to
                                              the leaf node with the new
                                              example – adding, removing,
                                              splitting and merging clusters
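
As a rough illustration of clustering documents as they arrive, here is a much simpler flat variant. It is not the tree-based algorithm described above: there is no hierarchy and no removing, splitting or merging of clusters. Each document is folded into the most similar cluster aggregate, or opens a new cluster. The class name, the cosine similarity measure and the threshold value are assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class OnlineClusterer:
    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold
        self.centroids = []      # one Counter of summed term counts per cluster

    def add(self, doc: Counter) -> int:
        """Assign doc to the most similar cluster or open a new one; return its index."""
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(doc, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            self.centroids[best].update(doc)   # fold the document into the aggregate
            return best
        self.centroids.append(Counter(doc))
        return len(self.centroids) - 1


clusterer = OnlineClusterer()
labels = [clusterer.add(Counter(text.split())) for text in (
    "church monastery church",
    "painter sculpture art",
    "monastery church",
)]
```

Only the cluster aggregates are kept, so each document is read once and then discarded.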







                Clustering on Reuters V1 news
              (colors showing predefined topics)




B. Novak. Algorithm for identifying topics in text streams, 2008








          Topic Detection - DEMOS
A 100-topic browser of the dynamic topic model
fit to Science (1882-2001)
  http://topics.cs.princeton.edu/Science/
Browsing search results
  http://searchpoint.ijs.si/








                 100-topic browser
                Science (1882-2001)




   1890                   1940              2000








                  Search Point








               Entity Extraction
Subtask of information extraction
Identifying elements in text which belong to a
predefined group of things:
  Names of people, locations, organizations (most
  common)
  Time expressions, quantities, money amounts,
  percentages
  Gene and protein names
  Etc.







                  Entity Extraction








     Entity Extraction Approaches
Lists of entities (gazetteers) and grammar rules
  e.g. GATE – General Architecture for Text
  Engineering
     H. Cunningham, et al. Text Processing with GATE (Version
     6). University of Sheffield Department of Computer Science.
     15 April 2011
Statistical models
  e.g. Stanford NER - linear chain Conditional Random
  Field (CRF) sequence models
     J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-
     local Information into Information Extraction Systems by
     Gibbs Sampling. In ACL 2005, pp. 363-370.
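
The gazetteer approach can be illustrated with a plain dictionary lookup. This sketch does not use GATE or Stanford NER; it is a greedy longest-match search over token n-grams, and the gazetteer entries, labels and function name are invented for the example.

```python
import re

# Toy gazetteer (lower-cased name -> entity type); entries are illustrative.
GAZETTEER = {
    "craiova": "LOCATION",
    "ljubljana": "LOCATION",
    "jožef stefan institute": "ORGANIZATION",
}
MAX_LEN = max(len(name.split()) for name in GAZETTEER)


def tag_entities(text: str):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    tokens = re.findall(r"\w+", text)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = GAZETTEER.get(candidate.lower())
            if label:
                found.append((candidate, label))
                i += n
                break
        else:
            i += 1          # no entry starts at this token
    return found


entities = tag_entities("The Jožef Stefan Institute is in Ljubljana.")
```

Pure lookup cannot handle unseen names or ambiguous strings, which is where grammar rules and statistical sequence models come in.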






        Collective Entity Resolution
Entity resolution: discover and map entities to
corresponding references (e.g. from a database,
knowledge base, etc.).
Approaches:
  Pairwise similarity with attributes of references
  Relational clustering using both attribute and relational
  information
     I. Bhattacharya, L. Getoor. Collective entity resolution in
     relational data. ACM Transactions on Knowledge Discovery from
     Data (TKDD), 2007.
  Topic models for the context of every word in a knowledge
  base
     P. Sen. Collective Context-Aware Topic Models for Entity
     Disambiguation. WWW 2012





    Entity Resolution to Linked Data

Enhance named entity classification using Linked Data
features
  Y. Ni, L. Zhang, Z. Qiu, C. Wang. Enhancing the Open-Domain
  Classification of Named Entity using Linked Open Data. ISWC,
  2010.
Type knowledge base from LOD
  (name string, type)
  E.g. from the triple (dbpedia:Craiova, rdf:type, Place) ->
  (Craiova, Place)
Uses WordNet as an intermediate taxonomy to compute
the similarity between the LOD type and the target type
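
Building the (name string, type) knowledge base from triples can be sketched as below. This is an assumption-laden illustration: triples are plain tuples of prefixed names, and the name string is taken from the local part of the subject URI, which may differ from how the cited paper derives names. The WordNet similarity step is not shown.

```python
def build_type_kb(triples):
    """Keep (name string, type) pairs from rdf:type triples."""
    kb = set()
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            name = subject.split(":", 1)[-1]      # drop the namespace prefix
            kb.add((name, obj))
    return kb


kb = build_type_kb([
    ("dbpedia:Craiova", "rdf:type", "Place"),
    ("dbpedia:Craiova", "rdfs:label", "Craiova"),   # ignored: not a type triple
])
```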






    Entity Resolution to Linked Data
Finding all possible forms under which an entity
can occur in text
  Resource descriptions - most useful rdfs:label and
  foaf:name
  Redirect relationship

   (entity1, type1)
   (entity2, ?)

   entity1 has URI1
   entity2 has URI2
   URI1 owl:sameAs URI2

   Conclude: (entity2, type1)
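
The inference above (copying a known type across an owl:sameAs link) can be sketched as a small fixed-point loop. The data layout, a dict of known types plus a list of sameAs pairs, and the function name are assumptions made for this example.

```python
def propagate_types(types, same_as):
    """types: {uri: type}; same_as: list of (uri_a, uri_b) owl:sameAs links."""
    inferred = dict(types)
    changed = True
    while changed:                        # repeat until no new type can be copied
        changed = False
        for a, b in same_as:
            for src, dst in ((a, b), (b, a)):      # owl:sameAs is symmetric
                if src in inferred and dst not in inferred:
                    inferred[dst] = inferred[src]
                    changed = True
    return inferred


result = propagate_types({"URI1": "type1"}, [("URI1", "URI2")])
```

The loop also follows chains of links, so a type known for URI1 reaches URI3 when URI1 sameAs URI2 and URI2 sameAs URI3.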






                      Relation Extraction
Identifying relationships between entities (and more
generally phrases)
Traditional relation extraction
  The target relation is given, together with corresponding
  extraction patterns for the relation
  A specific corpus
Open relation extraction (and more general Open
information extraction)
  Diverse relations, not previously fixed
  Corpus: the Web
M. Banko, O. Etzioni. The Tradeoffs Between Open
and Traditional Relation Extraction. ACL, 2008.





          Identifying Relations for Open IE
    3-step method:
        Label – automatically label sentences with extractions
        (arg1, relation phrase, arg2)
        Learn – learn a relation phrase extractor (e.g. using a CRF)
        Extract – given a sentence, identify (arg1, arg2) and the
        relation phrase (based on the learned relation extractor)
    Examples
        TextRunner – M. Banko, M. Cafarella, S. Soderland, M.
        Broadhead, O. Etzioni. Open Information Extraction from
        the Web. IJCAI 2007.
        WOE – F. Wu, D.S. Weld. Open Information Extraction
        using Wikipedia. ACL 2010.





          Identifying Relations for Open IE

 REVERB
  Input: POS-tagged and NP-chunked sentence
  Identify relation phrases
        syntactic and lexical constraints
    Find a pair of NP arguments for each relation
    phrase – assign confidence score (logistic
    regression classifier)
    Output: (x,y,z) extraction triplets
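
The syntactic constraint can be illustrated on an already POS-tagged sentence. The sketch below is a simplified reading of the pattern given in the cited paper (V | V P | V W* P, with V a verb optionally followed by a particle and an adverb, W a noun, adjective, adverb, pronoun or determiner, and P a preposition, particle or infinitive marker). The mapping from Penn Treebank tags to one-letter classes is an assumption, matching is leftmost-greedy, and the lexical constraint, argument finding and confidence scoring are not shown.

```python
import re


def tag_class(tag: str) -> str:
    """Map a Penn Treebank POS tag to a one-letter class (assumed mapping)."""
    if tag.startswith("VB"):
        return "V"          # verb
    if tag == "RP":
        return "R"          # particle
    if tag.startswith("RB"):
        return "A"          # adverb
    if tag.startswith("NN"):
        return "N"          # noun
    if tag.startswith("JJ"):
        return "J"          # adjective
    if tag.startswith("PRP"):
        return "O"          # pronoun
    if tag == "DT":
        return "D"          # determiner
    if tag == "IN":
        return "I"          # preposition
    if tag == "TO":
        return "T"          # infinitive marker
    return "x"              # anything else ends a relation phrase


# V = verb particle? adverb?;  W = [NJAOD];  P = [IRT]
RELATION = re.compile(r"VR?A?(?:[NJAOD]*[IRT])?")


def relation_phrases(tagged):
    """tagged: list of (word, POS tag) pairs; return candidate relation phrases."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in RELATION.finditer(classes)]


phrases = relation_phrases([
    ("Faust", "NNP"), ("made", "VBD"), ("a", "DT"), ("deal", "NN"),
    ("with", "IN"), ("the", "DT"), ("devil", "NN"),
])
```

Matching the whole phrase at once is what keeps "deal" inside the relation instead of treating it as an argument.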


A. Fader, S. Soderland, O. Etzioni. Identifying Relations for Open
Information Extraction. EMNLP 2011.








   Identifying Relations for Open IE
Key points:
  Relation phrases are identified holistically as
  opposed to word-by-word
  Potential phrases are filtered based on statistics
  (lexical constraints)
  Relation first, as opposed to arguments first
     so a relation phrase is not mistaken for an argument (e.g. “made a
     deal with”)


DEMO
  http://openie.cs.washington.edu/#





                         REVERB








                     Never Ending Language
                            Learning

     NELL – Never Ending Language Learning
          Addressed tasks
              Reading task: read the Web and extract a knowledge base of
              structured facts and knowledge.
              Learning task: improved (and updated) reading – extract
              past information more accurately




A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell.
Toward an Architecture for Never-Ending Language Learning. AAAI, 2010.




                     Never Ending Language
                            Learning




  A. Carlson et al. 2010




           Never Ending Language
                  Learning
Coupled Pattern Learner (CPL)
  Extracts instances of categories and relations (using
  contextual patterns)
Coupled SEAL (CSEAL)
  Queries the Web with beliefs from each category or
  relation, mines lists and tables to extract new instances
Coupled Morphological Classifier (CMC)
  One regression model per category – classifies noun
  phrases
Rule Learner (RL)
  Infer new relation instances
DEMO
  http://rtw.ml.cmu.edu/rtw/






                           NELL








   Domain and Summary Templates
Domain templates
  Event-centric: the focus is on events described with
  verbs
  E. Filatova, V. Hatzivassiloglou, K. McKeown.
  Automatic creation of domain templates. In
  Proceedings of COLING/ACL 2006
Summary templates
  Entity-centric: the focus is on summarizing entity
  categories
  P. Li, J. Jiang, Y. Wang. Generating templates of
  entity summaries with an entity-aspect model and
  pattern mining. ACL 2010.





              Domain Templates
Domain is a set of events of a particular type
  E.g. presidential elections, football championships
Domains can be instantiated – instances of
events of a particular type
  E.g. Euro Championship 2012
Different levels of granularity
Hierarchical structure for domains
Template – a set of attribute-value pairs
  The attributes specify functional roles characteristic
  for the domain events





                       Domain Templates
   Use a corpus describing instances of events within a
   domain and learn the domain templates (general
   characteristics of the domain)
   The verbs are used as a starting point – estimate of
   the verb importance given the domain
   The sentences containing the top X verbs are parsed
   The most frequent subtrees (FREQuent Tree miner)
   are kept
   The named entities are substituted with more generic
   constructs – e.g. POS tags
   The frequent sub-trees are merged together






                       Domain Templates
E.g. terrorist attack domain

• Important verbs

killed, told, found, injured, reported,
happened, blamed, arrested, died, linked

• Frequent subtrees

(VP(ADVP(NP))(VBD killed)(NP(CD 34)(NNS people)))
(VP(ADVP)(VBD killed)(NP(CD 34)(NNS people)))

• Merging subtrees

(VBD killed)(NP(NUMBER)(NNS people))
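
The generalisation step in this example (replacing a concrete number with a generic NUMBER node) can be sketched with a regular expression over the bracketed subtree string. Only this substitution is shown; subtree mining and merging are not. The function name and the string representation of the tree are assumptions.

```python
import re


def generalise(subtree: str) -> str:
    """Replace cardinal-number leaves such as (CD 34) with a NUMBER placeholder."""
    return re.sub(r"\(CD [^()]+\)", "(NUMBER)", subtree)


merged = generalise("(VBD killed)(NP(CD 34)(NNS people))")
```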








                 Summary Templates
Starting point: a collection of entity summaries for
a given entity category
Goal: to obtain a summary template for the entity
category
E.g. The physicist category

ENTITY received his phd from ? university
ENTITY studied ? under ?
ENTITY earned his ? in physics from university of ?

ENTITY was awarded the medal in ?
ENTITY won the ? award
ENTITY received the Nobel prize in physics in ?
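
To show how such a template could be applied to a new sentence, the sketch below turns a template line into a regular expression with one capture group per slot. This illustrates template filling only, not how the templates are learned; the function name and the example sentence are hypothetical.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn 'ENTITY won the ? award' into a regex with one group per slot."""
    parts = []
    for token in template.split():
        if token == "ENTITY":
            parts.append(r"(?P<entity>.+?)")     # the subject entity
        elif token == "?":
            parts.append(r"(.+?)")               # a template slot
        else:
            parts.append(re.escape(token))       # fixed template word
    return re.compile(r"\s+".join(parts) + r"$", re.IGNORECASE)


pattern = template_to_regex("ENTITY received his phd from ? university")
match = pattern.match("Jan Novak received his PhD from Example University")
```

Here match.group("entity") holds the entity name and match.group(2) holds the filled slot.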






                 Summary Templates
Identify subtopics (aspects) of the summary
collection
  Using LDA (see Topic Detection)
  Each word:
      a stop word,
      a background word,
      a document word,
      an aspect word








            Summary Templates
Sentence patterns are generated for each aspect
  frequent subtree pattern mining
Fixed structure of a sentence pattern
  Aspect words, background words, stop words
Template slots – vary between documents
  Document words








            Summary Templates
Sentence pattern generation
  Locate subject entities (using heuristics) – e.g.
  pronouns in a biography
  Generate parse trees (using Stanford Parser) – label
  stop, background, aspect, document, entity words
  given by the topic model
  Mine frequent subtree patterns (using FREQT)
  Prune patterns without entities or aspect words
  Convert subtree patterns to sentence patterns (find
  the sentences that generated the pattern)







              Word sense disambiguation

    Identifying the meaning of words in context
    Supervised WSD
        Words labeled with their senses are required
        Classification task
    Unsupervised WSD
        Known as word sense induction
        Clustering task
    Knowledge-based WSD
        Relies on knowledge resources: WordNet, Wikipedia,
        OpenCyc, etc.
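
As one concrete instance of the knowledge-based family, the sketch below implements a simplified Lesk-style overlap: pick the sense whose gloss shares the most words with the context. This technique is brought in here only as an illustration and is not named on the slide; the two-sense inventory is a toy stand-in for a resource such as WordNet.

```python
def simplified_lesk(context_words, senses):
    """senses: {sense_id: gloss}; return the sense whose gloss overlaps the context most."""
    context = {w.lower() for w in context_words}

    def overlap(item):
        _, gloss = item
        return len(context & set(gloss.lower().split()))

    return max(senses.items(), key=overlap)[0]


# Toy sense inventory (glosses written for this example).
BANK_SENSES = {
    "bank#finance": "institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a river or lake",
}
sense = simplified_lesk("she sat on the bank of the river".split(), BANK_SENSES)
```

No sense-labeled training data is needed; the quality depends entirely on the glosses in the knowledge resource.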


R. Navigli. Word sense disambiguation: A survey.
ACM Computing Surveys, 41(2), 2009.




              Word Sense Disambiguation
    Ponzetto, S.P. and Navigli, R. Knowledge-rich
    Word Sense Disambiguation Rivaling Supervised
    Systems. ACL 2010.
        Extend WordNet with Wikipedia relations
        Apply simple knowledge-based approaches

     Performance was comparable to state-of-the-art
     supervised approaches








                  WSD Evaluation
Evaluation workshops Senseval, SemEval, …
WSD evaluation topics (SemEval 2010)
  Cross-lingual WSD
  WSD on a specific domain
  Word sense induction
  Disambiguating Sentiment Ambiguous Adjectives
Evaluation topics related to WSD (SemEval
2012)
  Semantic textual similarity – similarity between two
  sentences
  Relational similarity – between pairs of words





                   Summarization
Extractive
  Identifying relevant sentences that belong to the
  summary
Abstractive
  Identifying/paraphrasing sections of the document to
  be summarized
  E.g. Summarization as phrase extraction - K.
  Woodsend, M. Lapata. Automatic Generation of Story
  Highlights. ACL, 2010.
     joint content selection and compression model
     ILP model to determine phrases that form the highlights






        Summarization Evaluation
Several evaluation workshops
  Document Understanding Conferences (DUC), Text
  Analysis Conferences (TAC)
  Metrics: ROUGE (n-gram based)
Linguistic quality
  Grammaticality, non-redundancy, referential clarity,
  focus, structure and coherence
  E. Pitler, A. Louis, A. Nenkova. Automatic Evaluation
  of Linguistic Quality in Multi-Document
  Summarization. ACL 2010.
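
The n-gram idea behind ROUGE can be sketched as a recall computation against a single reference summary. This is a minimal illustration, not the official ROUGE toolkit: it assumes whitespace tokenisation and lower-casing, and omits stemming, stop-word handling and multiple references. The function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())      # multiset intersection clips repeats
    return overlap / total


candidate = "the city hosts many churches"
reference = "the city hosts many religious buildings"
unigram_recall = rouge_n_recall(candidate, reference, n=1)
bigram_recall = rouge_n_recall(candidate, reference, n=2)
```

Such scores measure content overlap only, which is why linguistic quality is evaluated separately.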






               Sentiment analysis
Broad sense: sentiment analysis ~ opinion mining
“computational treatment of opinion, sentiment, and
subjectivity in text” (B. Pang, L. Lee, 2008)
Surveys, book chapters:
  B. Pang, L. Lee. Opinion mining and sentiment analysis.
  Foundations and Trends in Information Retrieval 2(1-2), pp.
  1–135, 2008
  B. Liu. Sentiment Analysis and Subjectivity. Handbook of
  Natural Language Processing, Second Edition, (editors: N.
  Indurkhya and F. J. Damerau), 2010.
  B. Liu. Web Data Mining - Exploring Hyperlinks, Contents
  and Usage Data, Ch. 11: Opinion Mining, Second Edition,
  Springer, 2011.





                 Interactive Approach to
                   Sentiment Analysis
http://aidemo.ijs.si/render/index.html
Demo interface (screenshot labels):
  Model and task selection
  Query
  Model-based filters
  Examples retrieved by uncertainty or class-margin sampling
  Query result, grouped by predicted label



T. Stajner and I. Novalija. Managing Diversity through Social Media. ESWC
2012 Workshop on Common Value Management.
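The demo retrieves examples by uncertainty or class-margin sampling. The sketch below shows one common way these two selection rules can be written, assuming the model exposes class probabilities per document. It is not the demo's actual implementation; all names and the input format are hypothetical.

```python
def least_confident(probs):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes; a small gap means low certainty."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def select_for_labeling(predictions, k=1, strategy="margin"):
    """Return the ids of the k examples the model is least sure about.

    predictions: dict mapping a document id to a list of class
    probabilities (at least two classes).
    """
    if strategy == "margin":
        def key(item):
            return margin(item[1])            # smallest margin first
    else:
        def key(item):
            return -least_confident(item[1])  # highest uncertainty first
    ranked = sorted(predictions.items(), key=key)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The two rules can disagree when there are more than two classes: probabilities [0.5, 0.49, 0.01] have a smaller margin than [0.4, 0.3, 0.3], but the latter has the less confident top prediction.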




                             Architecture

Indexed documents
Active learning for modeling topic, sentiment (diversity analysis)
Interactive user interface
Example: stream of social media posts relevant to brand management








                     Social Network Analysis
Modeling social relationships
Network theory concepts
  Nodes – individuals within the network
  Edges – relationships between individuals



Mario Karlovčec, Dunja Mladenić, Marko Grobelnik, Mitja Jermol. Visualizations of
Business and Research Collaboration in Slovenia. Proc. of the Information
Technology Interfaces 2012.




            Influence and Passivity in Social
                         Media
       Majority of Twitter users are passive information
       consumers – do not forward the content to the network
       Influence and passivity based on information
       forwarding activity
       Passivity
           User retweeting rate and audience retweeting rate
           Measures how difficult it is for other users to influence a given user
       Algorithm ~ HITS
           Passivity score ~ authority score
               Most passive: robot users – follow many users, but retweet a small
               percentage
           Influence score ~ hub score
               Most influential: news services – post many links forwarded by other
               users
D.M. Romero, W. Galuba, S. Asur, and B.A. Huberman. Influence and
Passivity in Social Media. ECML PKDD, 2011.
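Since the influence-passivity algorithm is described as similar to HITS, a sketch of plain HITS may help. It uses the node and edge view of a network: each directed edge (a, b) is read as "a points to b". This is standard HITS with L2 normalization, not the influence-passivity algorithm from the cited paper. The function name and the edge-list format are assumptions.

```python
import math


def hits(edges, iterations=50):
    """Plain HITS on a list of directed (source, target) edges.

    Returns two dicts: hub scores and authority scores per node.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of the hub scores of nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score of a node: sum of the authority scores it points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

In a toy graph where a node "news" points to three users and a node "blog" points to one of them, "news" ends up with the highest hub score and the shared target with the highest authority score.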












                                 Key literature








      Further publicly available tools
Topic detection
   David Blei’s homepage:
   http://www.cs.princeton.edu/~blei/topicmodeling.html
   Mallet: http://mallet.cs.umass.edu/
Natural language toolkits
   GATE: http://gate.ac.uk/
   OpenNLP: http://opennlp.apache.org/
   NLTK: http://nltk.org/
Entity Extraction
   Stanford NER: http://nlp.stanford.edu/ner/index.shtml
Relation Extraction
   NELL: http://rtw.ml.cmu.edu/rtw/
   REVERB: http://openie.cs.washington.edu/
WSD
   WordNet::SenseRelate: http://senserelate.sourceforge.net/






                        Conclusions
Dealing with streams can often be easy, but…
   …gets hard when we have an intensive data stream and
   complex operations on data are required!
Topic detection
   Online inference (e.g. for LDA) is currently a new direction
Entity, relationship and template extraction,
sentiment analysis and social network analysis
   Are already applied to streams
Word Sense Disambiguation
   Complex knowledge bases (e.g. WordNet + Wikipedia)
   coupled with simple disambiguation algorithms work well
Summarization
   Abstractive summaries are better suited to text streams






Questions and discussion






Text Stream Processing Tutorial @WIMS 2012

  • 1. 19.6.2012 Text Stream Processing Dunja Mladenić Artificial Intelligence Laboratory Marko Grobelnik Jožef Stefan Institute Blaž Fortuna Ljubljana, Slovenia Delia Rusu ailab.ijs.si Text stream • What are text streams processing • Key literature overview • Properties of text streams • Further publicly available • Motivation • Topic detection tools • Pre-processing of text • Entity, event and fact • Conclusions streams extraction and resolution • Questions and discussion • Text quality • Word sense disambiguation • Summarization • Sentiment analysis Introduction to • Social network analysis Concluding text streams remarks ailab.ijs.si 1
  • 2. 19.6.2012 Introduction to Text Streams What are text streams Properties of text streams Motivation Pre-processing of text streams Text quality ailab.ijs.si What are data streams Continuously arriving data, usually in real-time Dealing with streams can be often easy, but… …gets hard when we have an intensive data stream and complex operations on data are required! In such situations usually… …the volume of data is too big to be stored …the data can be scanned thoroughly only once …the data is highly non-stationary (changes properties through time), therefore approximation and adaptation are key to success Therefore, a typical solution is… …not to store observed data explicitly, but rather in the aggregate form which allows execution of required operations ailab.ijs.si 2
  • 3. 19.6.2012 Stream processing Who works with real time data processing? “Stream Mining” (subfield of “Data Mining”) dealing with mining data streams in different scenarios in relation with machine learning and data bases http://en.wikipedia.org/wiki/Data_stream_mining “Complex Event Processing” is a research area discovering complex events from simple ones by inference, statistics etc. http://en.wikipedia.org/wiki/Complex_Event_Processing ailab.ijs.si Motivation for stream processing Why one would need (near) real-time information processing? …because Time and Reaction Speed correlate with many target quantities – e.g.: …on stock exchange with Earnings …in controlling with Quality of Service …in fraud detection with Safety, etc. Generally, we can say: Reaction Speed == Value …if our systems react fast, we create new value! ailab.ijs.si 3
  • 4. 19.6.2012 What are text streams Continuous, often rapid, ordered sequence of texts Text information arriving continuously over time in the form of a data stream News and similar regular report News articles, online comments on news, online traffic reports, internal company reports, web searches, scientific papers, patents Social media discussion forums (eg., Twitter, Facebook), short messages on phones or computer, chat, transcripts of phone conversations, blogs, e-mails Demo http://newsfeed.ijs.si ailab.ijs.si NewsFeed ailab.ijs.si 4
  • 5. 19.6.2012 Properties of text streams Produced with a high rate over time Can be read only once or a small number of times (due to the rate and/or overall volume) Challenging for computing and storage capabilities – efficiency and scalability of the approaches Strong temporal dimension Modularity over time and sources (topic, sentiment,…) ailab.ijs.si Example task: evolution of research topics and communities over time Based on time stamped research publication titles and authors Observe which topics/communities shrunk, which emerged, which split, over time, when in time were the turning points,… TimeFall – monitoring dynamic, temporally evolving graphs and streams based on Minimum Description Length find good cut-points in time, and stitch together the communities: good cut-point leads to shorter description length. fast and efficient incremental algorithm, scales to large datasets, easily parallelizable ailab.ijs.si 5
  • 6. 19.6.2012 Example task: evolution of research topics and communities over time Given: n time-stamped events (eg., papers), each related to several of m items (eg., title-words, and/or author-names) Find cluster patterns and summarize their evolution in time Time Papers Words Time Words Time Words 1990 1990 1990 1992 1992 1991 V 1991 Papers 1990 1 1990 2 1991 1992 1992 1991 1991 1990 1990 1992 1991 1991 3 Time Word Clusters Time Word Clusters Time Word Clusters 1990 1990 1990 5 4 1991 1991 1992 1992 1992 ailab.ijs.si TimeFall on 12 million medical publications from PubMed MEDLINE over 40 years scales linearly with the product of the initial time point blocks and the number of non- zeros in the matrix J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, M. Grobelnik. Monitoring Network Evolution using MDL. International Conference on Data Engineering (ICDE 2008). ailab.ijs.si 6
  • 7. 19.6.2012 Pre-processing text stream Basic text pre-processing including removing stop-words, applying stemming Representing text for internal processing Splitting into units (eg., sentences or words) Mapping to internal representation (eg., feature vectors of words, vectors of ontology concepts) Pre-processing for aligning/merging text streams Time wise alignment of multiple text streams - coordinated text streams (appearing over the same time window, eg. news) Content alignment possibly over different languages ailab.ijs.si Example The city hosts a great number of religious buildings, many of them dating back to medieval times. Stop Words ailab.ijs.si 7
  • 8. 19.6.2012 Example city hosts great number religious buildings, host religi build many them dating back medieval times. date mediev time Stemming ailab.ijs.si Example city host great number religi build, many them date back mediev time. Splitting into units of words (city, host, great, number, religi, build, many, them, date, back, mediev, time) Feature vector of words ailab.ijs.si 8
  • 9. 19.6.2012 Text Quality Factors: Vocabulary use Grammatical and fluent sentences Structure and coherence Non-redundant information Referential clarity – e.g. proper usage of pronouns Models of text quality Global coherence - overall document organization Local coherence - Adjacent sentences Language model based approaches ailab.ijs.si Text stream • What are text streams processing • Key literature overview • Properties of text streams • Further publicly available • Motivation • Topic detection tools • Pre-processing of text • Entity, event and fact • Conclusions streams extraction and resolution • Questions and discussion • Text quality • Word sense disambiguation • Summarization • Sentiment analysis Introduction to • Social network analysis Concluding text streams remarks ailab.ijs.si 9
  • 10. 19.6.2012 Text Stream Processing WEB Topic Summarization Detection Sentiment Web Text Pre- Analysis Crawler Processing Information Extraction Social Word Network Sense Analysis Disambiguation Text Stream Processing Results ailab.ijs.si Topic Detection Religion Art ailab.ijs.si 10
  • 11. 19.6.2012 Topic Detection Supervised techniques The data is labeled with predefined topics Machine learning algorithms are used to predict unseen data labels Unsupervised techniques Identify patterns and structure within the dataset Clustering: grouping data sharing similar topics Statistical methods: probabilistic topic modeling ailab.ijs.si Probabilistic Topic Modeling Topic: a probability distribution over words in a fixed vocabulary Given an input corpus containing a number of documents, each having a sequence of words, the goal is to find useful sets of topics ailab.ijs.si 11
  • 12. 19.6.2012 Latent Dirichlet Allocation Documents can have multiple topics Religion Art D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003 ailab.ijs.si LDA Generative Process A topic is a distribution over words A document is a mixture of topics (at the level of the corpus) Each word is drawn from one of the corpus-level topics For each document generate the words: 1. Randomly choose a distribution over the topics 2. For each word in the document a) Randomly choose a topic from the distribution over topics in (step 1) b) Randomly choose a word from the corresponding distribution over the vocabulary ailab.ijs.si 12
  • 13. 19.6.2012 LDA Generative Process Assume a number of Choose a distribution For each word: topics for the over the topics • Choose a topic document collection assignment (Craiova guide) • Choose the word from the topic religious 0.03 monastery 0.01 church 0.01 art 0.02 painter 0.02 sculpture 0.01 park 0.01 garden 0.01 ailab.ijs.si Topic Models - Extensions Hierarchical Topic Models D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57:2 1–30, 2010. Q. Ho, J. Eisenstein, E. P. Xing. Document Hierarchies from Text and Links. WWW 2012 Dynamic Topic Models D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006. ailab.ijs.si 13
  • 14. 19.6.2012 Topic Detection in Streams Unsupervised methods Simpler approaches – e.g. Clustering Probabilistic topic models Challenging because of the amount and dynamics of the data E.g. Online inference for LDA – fits a topic model to random Wikipedia articles ailab.ijs.si Topic Detection Tools Available implementations LDA, HLDA, … http://www.cs.princeton.edu/~blei/topicmodeling.html Mallet Toolkit for statistical NLP http://mallet.cs.umass.edu/ ailab.ijs.si 14
  • 15. 19.6.2012 Clustering on text streams Grouping similar documents – adjusting to changes in the topics over time Clusters generated as the data arrives and stored in a tree Adding examples by adjusting the whole path from the root to the leaf node with the new example – adding, removing, splitting and merging clusters ailab.ijs.si Clustering on Reuters V1 news (colors showing predefined topics) B. Novak, Algorithm for identifying topics in text streams, 2008 ailab.ijs.si 15
  • 16. 19.6.2012 Topic Detection - DEMOS A 100-topic browser of the dynamic topic model fit to Science (1882-2001) http://topics.cs.princeton.edu/Science/ Browsing search results http://searchpoint.ijs.si/ ailab.ijs.si 100-topic browser Science (1882-2001) 1890 1940 2000 ailab.ijs.si 16
  • 17. 19.6.2012 Search Point ailab.ijs.si Entity Extraction Subtask of information extraction Identifying elements in text which belong to a predefined group of things: Names of people, locations, organizations (most common) Time expressions, quantities, money amounts, percentages Gene and protein names Etc. ailab.ijs.si 17
  • 18. 19.6.2012 Entity Extraction ailab.ijs.si Entity Extraction Approaches Lists of entities (gazetteers) and grammar rules e.g. GATE – General Architecture for Text Engineering H. Cunningham, et al. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science. 15 April 2011 Statistical models e.g. Stanford NER - linear chain Conditional Random Field (CRF) sequence models J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non- local Information into Information Extraction Systems by Gibbs Sampling. In ACL 2005, pp. 363-370. ailab.ijs.si 18
  • 19. Collective Entity Resolution: Entity resolution means discovering and mapping entities to corresponding references (e.g. from a database, knowledge base, etc.). Approaches: Pairwise similarity with attributes of references. Relational clustering using both attribute and relational information: I. Bhattacharya, L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. Topic models for the context of every word in a knowledge base: P. Sen. Collective Context-Aware Topic Models for Entity Disambiguation. WWW 2012. Entity Resolution to Linked Data: Enhance named entity classification using Linked Data features. Y. Ni, L. Zhang, Z. Qiu, C. Wang. Enhancing the Open-Domain Classification of Named Entity using Linked Open Data. ISWC, 2010. Type knowledge base from LOD (name string, type), e.g. from the triple (dbpedia:Craiova, rdf:type, Place) -> (Craiova, Place). Uses WordNet as an intermediate taxonomy to compute the similarity between the LOD type and the target type.
  • 20. Entity Resolution to Linked Data: Finding all possible forms under which an entity can occur in text. Resource descriptions: the most useful are rdfs:label and foaf:name. Redirect relationship: given (entity1, type1) and (entity2, ?), where entity1 has URI1, entity2 has URI2, and URI1 owl:sameAs URI2, conclude (entity2, type1). Relation Extraction: Identifying relationships between entities (and, more generally, phrases). Traditional relation extraction: the target relation is given, together with corresponding extraction patterns for the relation; a specific corpus. Open relation extraction (and, more generally, open information extraction): diverse relations, not previously fixed; corpus: the Web. M. Banko, O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. ACL, 2008.
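The owl:sameAs rule on the Entity Resolution to Linked Data slide (if URI1 has a known type and URI1 owl:sameAs URI2, conclude the same type for URI2) can be sketched as type propagation over groups of equivalent URIs. This is a minimal illustration, not the cited system: the function name `propagate_types` and the `ex:` URIs in the usage note are hypothetical, and sameAs is assumed to be symmetric and transitive.

```python
def propagate_types(known_types, same_as):
    """Copy a known type to every URI linked to it through owl:sameAs.

    known_types: dict mapping URI -> type, e.g. {"ex:URI1": "Place"}
    same_as:     iterable of (uri_a, uri_b) owl:sameAs pairs
    Returns a dict mapping each URI in a typed group to that group's type.
    """
    parent = {}  # union-find forest over URIs

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    # Merge the equivalence groups implied by the sameAs links.
    for a, b in same_as:
        parent[find(a)] = find(b)

    # Record one type per group (the first one seen wins in this sketch).
    group_type = {}
    for uri, entity_type in known_types.items():
        group_type.setdefault(find(uri), entity_type)

    # Every URI in a typed group inherits that type.
    return {
        uri: group_type[find(uri)]
        for uri in list(parent)
        if find(uri) in group_type
    }
```

For example, `propagate_types({"ex:URI1": "Place"}, [("ex:URI1", "ex:URI2")])` assigns `"Place"` to both URIs. Conflicting types inside one group are not handled here; a real system would need a policy for that case.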
  • 21. Identifying Relations for Open IE: 3-step method. Label: automatically label sentences with extractions (arg1, relation phrase, arg2). Learn: learn a relation phrase extractor (e.g. using CRF). Extract: given a sentence, identify (arg1, arg2) and the relation phrase (based on the learned relation extractor). Examples: TextRunner, M. Banko, M. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open Information Extraction from the Web. IJCAI 2007. WOE, F. Wu, D.S. Weld. Open Information Extraction using Wikipedia. ACL 2010. Identifying Relations for Open IE, REVERB: Input: a POS-tagged and NP-chunked sentence. Identify relation phrases using syntactic and lexical constraints. Find a pair of NP arguments for each relation phrase and assign a confidence score (logistic regression classifier). Output: (x, y, z) extraction triples. A. Fader, S. Soderland, O. Etzioni. Identifying Relations for Open Information Extraction. EMNLP 2011.
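The REVERB slide says relation phrases are identified with a syntactic constraint but does not spell it out. The sketch below assumes a simplified part-of-speech pattern in the spirit of the cited paper: one or more verbs, optionally followed by intervening words and ending in a preposition or particle. The tag classes, the helper names and the example sentence are assumptions for illustration. The lexical constraint, the argument finding and the confidence classifier are not shown.

```python
import re


def tag_class(tag):
    """Map a Penn Treebank POS tag to a one-letter class (assumed grouping)."""
    if tag.startswith("VB"):
        return "V"                      # verbs
    if tag in ("IN", "TO", "RP"):
        return "P"                      # prepositions, "to", particles
    if tag.startswith(("NN", "JJ", "RB", "PRP")) or tag == "DT":
        return "W"                      # nouns, adjectives, adverbs, pronouns, determiners
    return "O"                          # anything else breaks a phrase


# Verb(s), optionally followed by intervening words and a closing preposition.
RELATION_PATTERN = re.compile(r"V+(W*P)?")


def relation_phrases(tagged_sentence):
    """Return candidate relation phrases from a list of (word, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged_sentence)
    phrases = []
    for match in RELATION_PATTERN.finditer(classes):
        words = [word for word, _ in tagged_sentence[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases
```

Because each token becomes one character, character offsets in the class string line up with token positions, so a regex match can be sliced straight back into words. Matching the phrase as a whole is what keeps a noun such as "deal" in "made a deal with" from being taken as an argument.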
  • 22. Identifying Relations for Open IE: Key points: Relation phrases are identified holistically, as opposed to word by word. Potential phrases are filtered based on statistics (lexical constraints). Relation first, as opposed to arguments first: the relation phrase is not confused with the arguments (e.g. “made a deal with”). Demo: http://openie.cs.washington.edu/# Figure: REVERB.
  • 23. Never Ending Language Learning (NELL): Addressed tasks. Reading task: read the Web and extract a knowledge base of structured facts and knowledge. Learning task: improved (and updated) reading, i.e. extract past information more accurately. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell. Toward an Architecture for Never-Ending Language Learning. AAAI, 2010. Figure: Never Ending Language Learning (A. Carlson et al. 2010).
  • 24. Never Ending Language Learning: Coupled Pattern Learner (CPL): extracts instances of categories and relations (using contextual patterns). Coupled SEAL (CSEAL): queries the Web with beliefs from each category or relation, and mines lists and tables to extract new instances. Coupled Morphological Classifier (CMC): one regression model per category, which classifies noun phrases. Rule Learner (RL): infers new relation instances. Demo: http://rtw.ml.cmu.edu/rtw/ Figure: NELL.
  • 25. Domain and Summary Templates: Domain templates are event-centric: the focus is on events described with verbs. E. Filatova, V. Hatzivassiloglou, K. McKeown. Automatic creation of domain templates. In Proceedings of COLING/ACL 2006. Summary templates are entity-centric: the focus is on summarizing entity categories. P. Li, J. Jiang, Y. Wang. Generating templates of entity summaries with an entity-aspect model and pattern mining. ACL 2010. Domain Templates: A domain is a set of events of a particular type, e.g. presidential elections, football championships. Domains can be instantiated as instances of events of a particular type, e.g. Euro Championship 2012. Different levels of granularity: hierarchical structure for domains. A template is a set of attribute-value pairs; the attributes specify functional roles characteristic of the domain events.
  • 26. Domain Templates: Use a corpus describing instances of events within a domain and learn the domain templates (general characteristics of the domain). The verbs are used as a starting point: estimate the importance of each verb given the domain. The sentences containing the top X verbs are parsed. The most frequent subtrees (FREQuent Tree miner) are kept. The named entities are substituted with more generic constructs, e.g. POS tags. The frequent subtrees are merged together. Domain Templates, e.g. the terrorist attack domain: Important verbs: killed, told, found, injured, reported, happened, blamed, arrested, died, linked. Frequent subtrees: (VP(ADVP(NP))(VBD killed)(NP(CD 34)(NNS people))) and (VP(ADVP)(VBD killed)(NP(CD 34)(NNS people))). Merging subtrees: (VBD killed)(NP(NUMBER)(NNS people))
  • 27. Summary Templates: Starting point: a collection of entity summaries for a given entity category. Goal: to obtain a summary template for the entity category. E.g. the physicist category: "ENTITY received his phd from ? university"; "ENTITY studied ? under ?"; "ENTITY earned his ? in physics from university of ?"; "ENTITY was awarded the medal in ?"; "ENTITY won the ? award"; "ENTITY received the Nobel prize in physics in ?". Summary Templates: Identify subtopics (aspects) of the summary collection using LDA (see Topic Detection). Each word is labeled as a stop word, a background word, a document word or an aspect word.
  • 28. Summary Templates: Sentence patterns are generated for each aspect by frequent subtree pattern mining. Fixed structure of a sentence pattern: aspect words, background words, stop words. Template slots, which vary between documents: document words. Summary Templates, sentence pattern generation: Locate subject entities (using heuristics), e.g. pronouns in a biography. Generate parse trees (using the Stanford Parser) and label stop, background, aspect, document and entity words as given by the topic model. Mine frequent subtree patterns (using FREQT). Prune patterns without entities or aspect words. Convert subtree patterns to sentence patterns (find the sentences that generated the pattern).
  • 29. Word sense disambiguation: Identifying the meaning of words in context. Supervised WSD: words labeled with their senses are required; a classification task. Unsupervised WSD: known as word sense induction; a clustering task. Knowledge-based WSD: relies on knowledge resources such as WordNet, Wikipedia, OpenCyc, etc. R. Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), 2009. Word Sense Disambiguation: Ponzetto, S.P. and Navigli, R. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. ACL 2010. Extend WordNet with Wikipedia relations. Apply simple knowledge-based approaches. Performance was similar to that of state-of-the-art supervised approaches.
  • 30. WSD Evaluation: Evaluation workshops: Senseval, SemEval, … WSD evaluation topics (SemEval 2010): cross-lingual WSD; WSD on a specific domain; word sense induction; disambiguating sentiment-ambiguous adjectives. Evaluation topics related to WSD (SemEval 2012): semantic textual similarity (similarity between two sentences); relational similarity (between pairs of words). Summarization: Extractive: identifying relevant sentences that belong to the summary. Abstractive: identifying/paraphrasing sections of the document to be summarized. E.g. summarization as phrase extraction: K. Woodsend, M. Lapata. Automatic Generation of Story Highlights. ACL, 2010. Joint content selection and compression model; an ILP model determines the phrases that form the highlights.
  • 31. Summarization Evaluation: Several evaluation workshops: Document Understanding Conferences (DUC), Text Analysis Conferences (TAC). Metrics: ROUGE (n-gram based). Linguistic quality: grammaticality, non-redundancy, referential clarity, focus, structure and coherence. E. Pitler, A. Louis, A. Nenkova. Automatic Evaluation of Linguistic Quality in Multi-Document Summarization. ACL 2010. Sentiment analysis: In a broad sense, sentiment analysis ~ opinion mining: “computational treatment of opinion, sentiment, and subjectivity in text” (B. Pang, L. Lee, 2008). Surveys, book chapters: B. Pang, L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008. B. Liu. Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, Second Edition (editors: N. Indurkhya and F. J. Damerau), 2010. B. Liu. Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, Ch. 11: Opinion Mining, Second Edition, Springer, 2011.
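For the n-gram based ROUGE metric listed under Summarization Evaluation, a minimal sketch of ROUGE-N recall against a single reference summary may help. It assumes lowercased whitespace tokenization and no stemming or stop-word removal, so its scores will not match the official toolkit exactly; the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: a repeated n-gram counts at most as often as it
    # occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

With the candidate "the cat sat" and the reference "the cat sat on the mat", three of the six reference unigrams are covered, so unigram recall is 0.5. Two of the five reference bigrams are covered, so bigram recall is 0.4.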
  • 32. Interactive Approach to Sentiment Analysis: http://aidemo.ijs.si/render/index.html Interface elements: model and task selection; query; model-based filters; examples retrieved by uncertainty or class-margin sampling; query result, grouped by predicted label. T. Stajner and I. Novalija. Managing Diversity through Social Media, ESWC 2012 Workshop on Common value management. Architecture: Indexed documents. Active learning for modeling topic, sentiment (diversity analysis). Interactive user interface. Example: a stream of social media posts relevant to brand management.
  • 33. Social Network Analysis: Modeling social relationships. Network theory concepts: nodes are individuals within the network; edges are relationships between individuals. Mario Karlovčec, Dunja Mladenić, Marko Grobelnik, Mitja Jermol. Visualizations of Business and Research Collaboration in Slovenia, Proc. of the Information Technology Interfaces 2012. Influence and Passivity in Social Media: The majority of Twitter users are passive information consumers who do not forward content to the network. Influence and passivity are based on information forwarding activity. Passivity: derived from the user retweeting rate and the audience retweeting rate; it measures how difficult it is for other users to influence the user. Algorithm ~ HITS. Passivity score ~ authority score; most passive: robot users, who follow many users but retweet a small percentage. Influence score ~ hub score; most influential: news services, which post many links forwarded by other users. D.M. Romero, W. Galuba, S. Asur, and B.A. Huberman. Influence and Passivity in Social Media. ECML PKDD, 2011.
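The slide compares the influence-passivity algorithm to HITS. The sketch below is plain HITS on a small directed graph, shown only to illustrate the mutual-reinforcement iteration the slide alludes to. It is not the algorithm from the cited paper: it uses none of the retweeting rates that define passivity, and its outputs are ordinary hub and authority scores. The edge direction (source of content to the user who forwards it) and the node names are assumptions.

```python
import math


def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm else scores


def hits(edges, iterations=50):
    """Plain HITS: returns (hub, authority) score dicts for directed edges."""
    nodes = {node for edge in edges for node in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        auth = normalize(new_auth)
        # Hub score of a node: sum of authority scores of nodes it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = normalize(new_hub)
    return hub, auth
```

On the toy graph `[("news", "a"), ("news", "b"), ("news", "c"), ("a", "b")]`, the node `news` ends up with the highest hub score and `b` with the highest authority score. The algorithm in the paper additionally draws on the retweeting rates listed on the slide, which this sketch omits.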
  • 34. Outline: Introduction to text streams: What are text streams; Properties of text streams; Motivation; Pre-processing of text streams; Text quality. Text stream processing: Topic detection; Entity, event and fact extraction and resolution; Word sense disambiguation; Summarization; Sentiment analysis; Social network analysis. Concluding remarks: Key literature overview; Further publicly available tools; Conclusions; Questions and discussion. Key literature
  • 35. Further publicly available tools: Topic detection: David Blei’s homepage: http://www.cs.princeton.edu/~blei/topicmodeling.html Mallet: http://mallet.cs.umass.edu/ Natural language toolkits: GATE: http://gate.ac.uk/ OpenNLP: http://opennlp.apache.org/ NLTK: http://nltk.org/ Entity extraction: Stanford NER: http://nlp.stanford.edu/ner/index.shtml Relation extraction: NELL: http://rtw.ml.cmu.edu/rtw/ REVERB: http://openie.cs.washington.edu/ WSD: WordNet::SenseRelate: http://senserelate.sourceforge.net/ Conclusions: Dealing with streams can often be easy, but it gets hard when we have an intensive data stream and complex operations on the data are required! Topic detection: online inference (e.g. for LDA) is currently a new direction. Entity, relationship and template extraction, sentiment analysis and social network analysis are already applied to streams. Word sense disambiguation: complex knowledge bases (e.g. WordNet + Wikipedia) coupled with simple disambiguation algorithms work well. Summarization: abstractive summaries are more suited for text streams.