Text Stream Processing Tutorial @WIMS 2012


Transcript

  • 1. 19.6.2012
    Text Stream Processing
      Dunja Mladenić, Marko Grobelnik, Blaž Fortuna, Delia Rusu
      Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
      ailab.ijs.si
    Outline
      • Introduction to text streams
          • What are text streams
          • Properties of text streams
          • Motivation
          • Pre-processing of text streams
          • Text quality
      • Text stream processing
          • Topic detection
          • Entity, event and fact extraction and resolution
          • Word sense disambiguation
          • Summarization
          • Sentiment analysis
          • Social network analysis
      • Concluding remarks
          • Key literature overview
          • Further publicly available tools
          • Conclusions
          • Questions and discussion
  • 2. Introduction to Text Streams
      • What are text streams
      • Properties of text streams
      • Motivation
      • Pre-processing of text streams
      • Text quality
    What are data streams
      • Continuously arriving data, usually in real time
      • Dealing with streams is often easy, but it gets hard when the data stream is intensive and complex operations on the data are required
      • In such situations, usually:
          • the volume of data is too big to be stored
          • the data can be scanned thoroughly only once
          • the data is highly non-stationary (its properties change over time), so approximation and adaptation are key to success
      • A typical solution is therefore not to store the observed data explicitly, but in an aggregate form that still allows the required operations to be executed
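The idea of keeping only an aggregate instead of the raw stream can be sketched as follows. This is a minimal illustration, not something from the tutorial: the class names `RunningStats` and `DecayedMean` and the `alpha` parameter are assumptions. The first class keeps an exact single-pass mean and variance (Welford's method); the second keeps an exponentially weighted mean, one simple way to adapt to a non-stationary stream.

```python
class RunningStats:
    """Single-pass count, mean and sample variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each item is seen once and then discarded; only three numbers are kept.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


class DecayedMean:
    """Exponentially weighted mean; alpha is an assumed tuning parameter."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = float(x)
        else:
            # Older observations fade out, so the estimate follows drift.
            self.value = (1 - self.alpha) * self.value + self.alpha * x


if __name__ == "__main__":
    stats = RunningStats()
    for length in [120, 95, 140, 80]:  # e.g. lengths of arriving messages
        stats.update(length)
    print(stats.n, stats.mean, stats.variance)
```

Both aggregates use constant memory regardless of how many items arrive, which matches the "scan once, do not store" constraint.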
  • 3. Stream processing
      • Who works with real-time data processing?
          • "Stream Mining" (a subfield of "Data Mining") deals with mining data streams in different scenarios, in relation to machine learning and databases: http://en.wikipedia.org/wiki/Data_stream_mining
          • "Complex Event Processing" is a research area concerned with discovering complex events from simple ones by inference, statistics, etc.: http://en.wikipedia.org/wiki/Complex_Event_Processing
    Motivation for stream processing
      • Why would one need (near) real-time information processing?
      • Because time and reaction speed correlate with many target quantities, e.g.:
          • on the stock exchange with earnings
          • in controlling with quality of service
          • in fraud detection with safety, etc.
      • Generally, we can say: Reaction Speed == Value; if our systems react fast, we create new value
  • 4. What are text streams
      • Continuous, often rapid, ordered sequence of texts
      • Text information arriving continuously over time in the form of a data stream
      • News and similar regular reports
          • News articles, online comments on news, online traffic reports, internal company reports, web searches, scientific papers, patents
      • Social media
          • Discussion forums (e.g., Twitter, Facebook), short messages on phones or computers, chat, transcripts of phone conversations, blogs, e-mails
      • Demo: http://newsfeed.ijs.si
    NewsFeed
  • 5. Properties of text streams
      • Produced at a high rate over time
      • Can be read only once or a small number of times (due to the rate and/or overall volume)
      • Challenging for computing and storage capabilities: efficiency and scalability of the approaches
      • Strong temporal dimension
      • Modularity over time and sources (topic, sentiment, …)
    Example task: evolution of research topics and communities over time
      • Based on time-stamped research publication titles and authors
      • Observe which topics/communities shrank, which emerged, which split over time, when the turning points occurred, …
      • TimeFall: monitoring dynamic, temporally evolving graphs and streams based on Minimum Description Length
          • Find good cut-points in time and stitch together the communities: a good cut-point leads to a shorter description length
          • Fast and efficient incremental algorithm, scales to large datasets, easily parallelizable
  • 6. Example task: evolution of research topics and communities over time
      • Given: n time-stamped events (e.g., papers), each related to several of m items (e.g., title words and/or author names)
      • Find cluster patterns and summarize their evolution in time
      • [Figure: numbered steps 1–5 showing Time / Papers / Words matrices for 1990–1992 reorganized into Time / Word Clusters blocks]
    TimeFall on 12 million medical publications from PubMed MEDLINE over 40 years
      • Scales linearly with the product of the initial time point blocks and the number of non-zeros in the matrix
      • J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, M. Grobelnik. Monitoring Network Evolution using MDL. International Conference on Data Engineering (ICDE 2008).
  • 7. Pre-processing text streams
      • Basic text pre-processing, including removing stop words and applying stemming
      • Representing text for internal processing
          • Splitting into units (e.g., sentences or words)
          • Mapping to an internal representation (e.g., feature vectors of words, vectors of ontology concepts)
      • Pre-processing for aligning/merging text streams
          • Time-wise alignment of multiple text streams: coordinated text streams (appearing over the same time window, e.g. news)
          • Content alignment, possibly over different languages
    Example: stop words
      • The city hosts a great number of religious buildings, many of them dating back to medieval times.
  • 8. Example: stemming
      • After stop-word removal: city hosts great number religious buildings, many them dating back medieval times.
      • Stems: hosts → host, religious → religi, buildings → build, dating → date, medieval → mediev, times → time
    Example: feature vector of words
      • After stemming: city host great number religi build, many them date back mediev time.
      • Splitting into units of words: (city, host, great, number, religi, build, many, them, date, back, mediev, time)
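The three pre-processing steps from the example (stop-word removal, stemming, mapping to a feature vector) can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked set that happens to cover the slide sentence, and `toy_stem` is a hypothetical suffix stripper, not the stemmer used in the tutorial, so some stems differ from the slide (it produces `dat` where the slide shows `date`).

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list (enough for the slide sentence).
STOP_WORDS = {"the", "a", "of", "to"}

# Hypothetical toy suffix rules, checked in order; NOT a real stemming algorithm.
SUFFIXES = ("ings", "ing", "ous", "al", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic word units."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_feature_vector(text):
    """Bag-of-words vector: stem -> number of occurrences."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(toy_stem(t) for t in tokens)


if __name__ == "__main__":
    sentence = ("The city hosts a great number of religious buildings, "
                "many of them dating back to medieval times.")
    print(to_feature_vector(sentence))
```

In a real pipeline the toy rules would be replaced by an established stemmer and a full stop-word list; the sparse `Counter` is one possible internal representation of the "feature vector of words".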
  • 9. Text Quality
      • Factors:
          • Vocabulary use
          • Grammatical and fluent sentences
          • Structure and coherence
          • Non-redundant information
          • Referential clarity, e.g. proper usage of pronouns
      • Models of text quality
          • Global coherence: overall document organization
          • Local coherence: adjacent sentences
          • Language-model-based approaches
  • 10. Text Stream Processing
      • [Diagram: Web → Web Crawler → Text Pre-Processing → Topic Detection, Summarization, Sentiment Analysis, Information Extraction, Social Network Analysis, Word Sense Disambiguation → Text Stream Processing Results]
    Topic Detection
      • [Figure: example text with the topics Religion and Art]
  • 11. Topic Detection
      • Supervised techniques
          • The data is labeled with predefined topics
          • Machine learning algorithms are used to predict the labels of unseen data
      • Unsupervised techniques
          • Identify patterns and structure within the dataset
          • Clustering: grouping data sharing similar topics
          • Statistical methods: probabilistic topic modeling
    Probabilistic Topic Modeling
      • Topic: a probability distribution over words in a fixed vocabulary
      • Given an input corpus containing a number of documents, each having a sequence of words, the goal is to find useful sets of topics
  • 12. Latent Dirichlet Allocation
      • Documents can have multiple topics (e.g. Religion, Art)
      • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003
    LDA Generative Process
      • A topic is a distribution over words
      • A document is a mixture of topics (at the level of the corpus)
      • Each word is drawn from one of the corpus-level topics
      • For each document, generate the words:
          1. Randomly choose a distribution over the topics
          2. For each word in the document
              a) Randomly choose a topic from the distribution over topics (step 1)
              b) Randomly choose a word from the corresponding distribution over the vocabulary
  • 13. LDA Generative Process
      • Assume a number of topics for the document collection (Craiova guide)
      • Choose a distribution over the topics
      • For each word:
          • Choose a topic assignment
          • Choose the word from the topic
      • Example topic word probabilities: religious 0.03, monastery 0.01, church 0.01; art 0.02, painter 0.02, sculpture 0.01; park 0.01, garden 0.01
    Topic Models: Extensions
      • Hierarchical Topic Models
          • D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.
          • Q. Ho, J. Eisenstein, E. P. Xing. Document Hierarchies from Text and Links. WWW 2012
      • Dynamic Topic Models
          • D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
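The LDA generative process (choose a topic distribution for the document, then for each word choose a topic and then a word from it) can be simulated directly. The sketch below is illustrative only: the topic names, the grouping of words into topics, and the Dirichlet parameter `alpha` are assumptions, and the slide's word probabilities are used merely as relative weights. It generates data from a known model; it does not perform inference.

```python
import random

# Assumed toy topics; weights are relative and need not sum to 1.
TOPICS = {
    "religion": {"religious": 0.03, "monastery": 0.01, "church": 0.01},
    "art": {"art": 0.02, "painter": 0.02, "sculpture": 0.01},
    "parks": {"park": 0.01, "garden": 0.01},
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    names = list(topics)
    # Step 1: randomly choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Step 2a: choose a topic from the document's topic distribution.
        topic = rng.choices(names, weights=theta)[0]
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return theta, words


if __name__ == "__main__":
    rng = random.Random(0)
    theta, words = generate_document(TOPICS, [0.5, 0.5, 0.5], 10, rng)
    print(theta)
    print(" ".join(words))
```

Small `alpha` values tend to produce documents dominated by one or two topics, which is why a single document can still "have multiple topics" while staying focused.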
  • 14. Topic Detection in Streams
      • Unsupervised methods
      • Simpler approaches, e.g. clustering
      • Probabilistic topic models
          • Challenging because of the amount and dynamics of the data
          • E.g. online inference for LDA: fits a topic model to random Wikipedia articles
    Topic Detection Tools
      • Available implementations
          • LDA, HLDA, …: http://www.cs.princeton.edu/~blei/topicmodeling.html
          • Mallet, a toolkit for statistical NLP: http://mallet.cs.umass.edu/
  • 15. Clustering on text streams
      • Grouping similar documents, adjusting to changes in the topics over time
      • Clusters are generated as the data arrives and stored in a tree
      • Examples are added by adjusting the whole path from the root to the leaf node with the new example: adding, removing, splitting and merging clusters
    Clustering on Reuters V1 news (colors showing predefined topics)
      • B. Novak, Algorithm for identifying topics in text streams, 2008
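To make the notion of "clusters generated as the data arrives" concrete, here is a deliberately simpler stand-in: a flat, single-pass ("leader") clustering with cosine similarity. It is not the tree-based algorithm described above (there is no hierarchy, splitting or merging); the class name `OnlineClusterer` and the `threshold` value are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Assign each arriving document to the nearest cluster or open a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # one summed term-count vector per cluster

    def add(self, text):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Fold the document into the cluster so the centroid tracks topic drift.
            self.centroids[best].update(vec)
            return best
        self.centroids.append(vec)
        return len(self.centroids) - 1


if __name__ == "__main__":
    clusterer = OnlineClusterer()
    for doc in ["church monastery religious",
                "painter sculpture art",
                "old church monastery"]:
        print(clusterer.add(doc), doc)
```

Each document is processed once and only the centroids are kept, so memory grows with the number of clusters rather than the number of documents.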
  • 16. Topic Detection: Demos
      • A 100-topic browser of the dynamic topic model fit to Science (1882-2001): http://topics.cs.princeton.edu/Science/
      • Browsing search results: http://searchpoint.ijs.si/
    100-topic browser, Science (1882-2001)
      • [Figure: topic evolution along a timeline marked 1890, 1940, 2000]
  • 17. Search Point
    Entity Extraction
      • Subtask of information extraction
      • Identifying elements in text which belong to a predefined group of things:
          • Names of people, locations, organizations (most common)
          • Time expressions, quantities, money amounts, percentages
          • Gene and protein names
          • Etc.
  • 18. Entity Extraction
    Entity Extraction Approaches
      • Lists of entities (gazetteers) and grammar rules
          • E.g. GATE, the General Architecture for Text Engineering
          • H. Cunningham, et al. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science. 15 April 2011
      • Statistical models
          • E.g. Stanford NER: linear-chain Conditional Random Field (CRF) sequence models
          • J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In ACL 2005, pp. 363-370.
  • 19. Collective Entity Resolution
      • Entity resolution: discover and map entities to corresponding references (e.g. from a database, knowledge base, etc.)
      • Approaches:
          • Pairwise similarity with attributes of references
          • Relational clustering using both attribute and relational information
              • I. Bhattacharya, L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007.
          • Topic models for the context of every word in a knowledge base
              • P. Sen. Collective Context-Aware Topic Models for Entity Disambiguation. WWW 2012
    Entity Resolution to Linked Data
      • Enhance named entity classification using Linked Data features
          • Y. Ni, L. Zhang, Z. Qiu, C. Wang. Enhancing the Open-Domain Classification of Named Entity using Linked Open Data. ISWC, 2010.
      • Type knowledge base from LOD: (name string, type) pairs
          • E.g. from the triple (dbpedia:Craiova, rdf:type, Place) derive (Craiova, Place)
      • Uses WordNet as an intermediate taxonomy to compute the similarity between the LOD type and the target type
  • 20. Entity Resolution to Linked Data
      • Finding all possible forms under which an entity can occur in text
          • Resource descriptions: most useful are rdfs:label and foaf:name
          • Redirect relationship
      • Given (entity1, type1) and (entity2, ?), where entity1 has URI1 and entity2 has URI2:
          • If URI1 owl:sameAs URI2, conclude (entity2, type1)
    Relation Extraction
      • Identifying relationships between entities (and, more generally, phrases)
      • Traditional relation extraction
          • The target relation is given, together with corresponding extraction patterns for the relation
          • A specific corpus
      • Open relation extraction (and, more generally, open information extraction)
          • Diverse relations, not previously fixed
          • Corpus: the Web
      • M. Banko, O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. ACL, 2008.
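The owl:sameAs rule under "Entity Resolution to Linked Data" (if URI1 is typed and URI1 owl:sameAs URI2, then entity2 gets the same type) amounts to propagating types within groups of equivalent URIs. A minimal sketch, assuming the data is already available as plain Python structures; the function name `propagate_types` and the URI `ex:Craiova` are hypothetical, and conflicting types within one group are not handled.

```python
def propagate_types(types, same_as_links):
    """Copy a known type to every URI linked to it by owl:sameAs.

    types: dict mapping URI -> type, for the URIs whose type is known
    same_as_links: iterable of (uri1, uri2) pairs standing for owl:sameAs
    """
    parent = {}

    def find(uri):
        # Union-find lookup with path halving.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    links = list(same_as_links)
    for uri1, uri2 in links:
        parent[find(uri1)] = find(uri2)  # merge the two equivalence groups

    # The first known type seen in a group becomes the type of the whole group.
    group_type = {}
    for uri, entity_type in types.items():
        group_type.setdefault(find(uri), entity_type)

    all_uris = set(types) | {uri for link in links for uri in link}
    result = {}
    for uri in all_uris:
        root = find(uri)
        if root in group_type:
            result[uri] = group_type[root]
    return result


if __name__ == "__main__":
    known = {"dbpedia:Craiova": "Place"}
    links = [("dbpedia:Craiova", "ex:Craiova")]  # ex:Craiova is a made-up URI
    print(propagate_types(known, links))
```

Because owl:sameAs is symmetric and transitive, treating it as an equivalence relation (here via union-find) also covers chains such as URI1 sameAs URI2 sameAs URI3.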
  • 21. Identifying Relations for Open IE
      • 3-step method:
          • Label: automatically label sentences with extractions (arg1, relation phrase, arg2)
          • Learn: learn a relation phrase extractor (e.g. using CRF)
          • Extract: given a sentence, identify (arg1, arg2) and the relation phrase (based on the learned relation extractor)
      • Examples
          • TextRunner: M. Banko, M. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open Information Extraction from the Web. IJCAI 2007.
          • WOE: F. Wu, D.S. Weld. Open Information Extraction using Wikipedia. ACL 2010.
    Identifying Relations for Open IE: REVERB
      • Input: POS-tagged and NP-chunked sentence
      • Identify relation phrases using syntactic and lexical constraints
      • Find a pair of NP arguments for each relation phrase and assign a confidence score (logistic regression classifier)
      • Output: (x, y, z) extraction triples
      • A. Fader, S. Soderland, O. Etzioni. Identifying Relations for Open Information Extraction. EMNLP 2011.
  • 22. Identifying Relations for Open IE
      • Key points:
          • Relation phrases are identified holistically, as opposed to word by word
          • Potential phrases are filtered based on statistics (lexical constraints)
          • Relation first, as opposed to arguments first: the relation phrase is not confused with arguments (e.g. "made a deal with")
      • Demo: http://openie.cs.washington.edu/#
    REVERB
  • 23. Never Ending Language Learning
      • NELL: Never Ending Language Learning
      • Addressed tasks
          • Reading task: read the Web and extract a knowledge base of structured facts and knowledge
          • Learning task: improved (and updated) reading, i.e. extract past information more accurately
      • A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell. Toward an Architecture for Never-Ending Language Learning. AAAI, 2010.
    Never Ending Language Learning
      • [Figure from A. Carlson et al. 2010]
  • 24. Never Ending Language Learning
      • Coupled Pattern Learner (CPL)
          • Extracts instances of categories and relations (using contextual patterns)
      • Coupled SEAL (CSEAL)
          • Queries the Web with beliefs from each category or relation, mines lists and tables to extract new instances
      • Coupled Morphological Classifier (CMC)
          • One regression model per category; classifies noun phrases
      • Rule Learner (RL)
          • Infers new relation instances
      • Demo: http://rtw.ml.cmu.edu/rtw/
    NELL
  • 25. Domain and Summary Templates
      • Domain templates
          • Event-centric: the focus is on events described with verbs
          • E. Filatova, V. Hatzivassiloglou, K. McKeown. Automatic creation of domain templates. In Proceedings of COLING/ACL 2006
      • Summary templates
          • Entity-centric: the focus is on summarizing entity categories
          • P. Li, J. Jiang, Y. Wang. Generating templates of entity summaries with an entity-aspect model and pattern mining. ACL 2010.
    Domain Templates
      • A domain is a set of events of a particular type
          • E.g. presidential elections, football championships
      • Domains can be instantiated: instances of events of a particular type
          • E.g. Euro Championship 2012
      • Different levels of granularity
      • Hierarchical structure for domains
      • Template: a set of attribute-value pairs
          • The attributes specify functional roles characteristic of the domain events
  • 26. Domain Templates
      • Use a corpus describing instances of events within a domain and learn the domain templates (general characteristics of the domain)
      • The verbs are used as a starting point: estimate the importance of each verb given the domain
      • The sentences containing the top X verbs are parsed
      • The most frequent subtrees are kept (using the FREQT frequent tree miner)
      • The named entities are substituted with more generic constructs, e.g. POS tags
      • The frequent subtrees are merged together
    Domain Templates: example for the terrorist attack domain
      • Important verbs: killed, told, found, injured, reported, happened, blamed, arrested, died, linked
      • Frequent subtrees:
          (VP(ADVP(NP))(VBD killed)(NP(CD 34)(NNS people)))
          (VP(ADVP)(VBD killed)(NP(CD 34)(NNS people)))
      • Merging subtrees:
          (VBD killed)(NP(NUMBER)(NNS people))
  • 27. Summary Templates
      • Starting point: a collection of entity summaries for a given entity category
      • Goal: to obtain a summary template for the entity category
      • E.g. the physicist category:
          ENTITY received his phd from ? university
          ENTITY studied ? under ?
          ENTITY earned his ? in physics from university of ?
          ENTITY was awarded the medal in ?
          ENTITY won the ? award
          ENTITY received the Nobel prize in physics in ?
    Summary Templates
      • Identify subtopics (aspects) of the summary collection
          • Using LDA (see Topic Detection)
          • Each word is one of: a stop word, a background word, a document word, an aspect word
  • 28. Summary Templates
      • Sentence patterns are generated for each aspect
          • Frequent subtree pattern mining
      • Fixed structure of a sentence pattern
          • Aspect words, background words, stop words
      • Template slots, which vary between documents
          • Document words
    Summary Templates: sentence pattern generation
      • Locate subject entities (using heuristics), e.g. pronouns in a biography
      • Generate parse trees (using the Stanford Parser), labeling stop, background, aspect, document and entity words as given by the topic model
      • Mine frequent subtree patterns (using FREQT)
      • Prune patterns without entities or aspect words
      • Convert subtree patterns to sentence patterns (find the sentences that generated the pattern)
  • 29. Word sense disambiguation
      • Identifying the meaning of words in context
      • Supervised WSD
          • Words labeled with their senses are required
          • Classification task
      • Unsupervised WSD
          • Known as word sense induction
          • Clustering task
      • Knowledge-based WSD
          • Relies on knowledge resources: WordNet, Wikipedia, OpenCyc, etc.
      • R. Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), 2009.
    Word Sense Disambiguation
      • S.P. Ponzetto and R. Navigli. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. ACL 2010.
          • Extend WordNet with Wikipedia relations
          • Apply simple knowledge-based approaches
          • Performance was comparable to state-of-the-art supervised approaches
  • 30. WSD Evaluation
      • Evaluation workshops: Senseval, SemEval, …
      • WSD evaluation topics (SemEval 2010)
          • Cross-lingual WSD
          • WSD on a specific domain
          • Word sense induction
          • Disambiguating sentiment-ambiguous adjectives
      • Evaluation topics related to WSD (SemEval 2012)
          • Semantic textual similarity: similarity between two sentences
          • Relational similarity: between pairs of words
    Summarization
      • Extractive
          • Identifying relevant sentences that belong to the summary
      • Abstractive
          • Identifying/paraphrasing sections of the document to be summarized
          • E.g. summarization as phrase extraction: K. Woodsend, M. Lapata. Automatic Generation of Story Highlights. ACL, 2010.
              • Joint content selection and compression model
              • ILP model to determine the phrases that form the highlights
  • 31. Summarization Evaluation
      • Several evaluation workshops
          • Document Understanding Conferences (DUC), Text Analysis Conferences (TAC)
          • Metrics: ROUGE (n-gram based)
      • Linguistic quality
          • Grammaticality, non-redundancy, referential clarity, focus, structure and coherence
          • E. Pitler, A. Louis, A. Nenkova. Automatic Evaluation of Linguistic Quality in Multi-Document Summarization. ACL 2010.
    Sentiment analysis
      • Broad sense: sentiment analysis ~ opinion mining
      • "computational treatment of opinion, sentiment, and subjectivity in text" (B. Pang, L. Lee, 2008)
      • Surveys, book chapters:
          • B. Pang, L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008
          • B. Liu. Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, Second Edition (editors: N. Indurkhya and F. J. Damerau), 2010.
          • B. Liu. Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, Ch. 11: Opinion Mining, Second Edition, Springer, 2011.
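The n-gram based ROUGE metric mentioned under Summarization Evaluation can be illustrated with its simplest recall-oriented form: the fraction of reference n-grams that also occur in the candidate summary. This is a bare-bones sketch for a single reference with whitespace tokenization; the function names are assumptions, and the official ROUGE toolkit adds options (stemming, multiple references, other variants) that are not reproduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams also found in the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipping: an n-gram is credited at most as often as the candidate contains it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    reference = "the city hosts many religious buildings"
    candidate = "the city hosts many churches"
    print(rouge_n_recall(candidate, reference, n=1))
    print(rouge_n_recall(candidate, reference, n=2))
```

In this example, four of the six reference unigrams and three of the five reference bigrams appear in the candidate, so the score drops as n grows.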
  • 32. Interactive Approach to Sentiment Analysis
      • http://aidemo.ijs.si/render/index.html
      • Interface elements:
          • Model and task selection
          • Query
          • Model-based filters
          • Examples retrieved by uncertainty or class-margin sampling
          • Query result, grouped by predicted label
      • T. Stajner and I. Novalija. Managing Diversity through Social Media, ESWC 2012 Workshop on Common value management.
    Architecture
      • Indexed documents
      • Active learning for modeling topic, sentiment (diversity analysis)
      • Interactive user interface
      • Example: stream of social media posts relevant to brand management
  • 33. Social Network Analysis
      • Modeling social relationships
      • Network theory concepts
          • Nodes: individuals within the network
          • Edges: relationships between individuals
      • Mario Karlovčec, Dunja Mladenić, Marko Grobelnik, Mitja Jermol. Visualizations of Business and Research Collaboration in Slovenia, Proc. of the Information Technology Interfaces 2012.
    Influence and Passivity in Social Media
      • The majority of Twitter users are passive information consumers: they do not forward content to the network
      • Influence and passivity are based on information forwarding activity
      • Passivity
          • Based on the user retweeting rate and the audience retweeting rate
          • Measures how difficult it is for other users to influence the user
      • Algorithm ~ HITS
          • Passivity score ~ authority score
              • Most passive: robot users, who follow many users but retweet a small percentage
          • Influence score ~ hub score
              • Most influential: news services, which post many links forwarded by other users
      • D.M. Romero, W. Galuba, S. Asur, and B.A. Huberman. Influence and Passivity in Social Media. ECML PKDD, 2011.
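Since the influence and passivity algorithm is described only as being similar to HITS, the sketch below shows plain HITS, not the algorithm of Romero et al.: the acceptance and rejection rates derived from retweeting behaviour are omitted, so the authority score computed here should not be read as a passivity score. The edge convention (author, retweeter) and the user names are assumptions for illustration.

```python
import math


def hits(edges, iterations=50):
    """Plain HITS on a directed graph given as (source, target) pairs.

    Returns (hub, authority) dictionaries, each normalized to unit length.
    """
    nodes = {node for edge in edges for node in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority of a node: sum of the hub scores of nodes pointing to it.
        auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {node: v / norm for node, v in auth.items()}

        # Hub score of a node: sum of the authority scores of nodes it points to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {node: v / norm for node, v in new_hub.items()}
    return hub, auth


if __name__ == "__main__":
    # Assumed convention: (author, retweeter) means the retweeter forwarded the author's post.
    retweets = [("news", "a"), ("news", "b"), ("news", "c"), ("blog", "a")]
    hub, auth = hits(retweets)
    print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

With this convention, an account whose posts are forwarded by many users ends up with a high hub score, which mirrors the slide's "influence score ~ hub score" for news services.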
  • 34. Key literature
  • 35. Further publicly available tools
      • Topic detection
          • David Blei's homepage: http://www.cs.princeton.edu/~blei/topicmodeling.html
          • Mallet: http://mallet.cs.umass.edu/
      • Natural language toolkits
          • GATE: http://gate.ac.uk/
          • OpenNLP: http://opennlp.apache.org/
          • NLTK: http://nltk.org/
      • Entity extraction
          • Stanford NER: http://nlp.stanford.edu/ner/index.shtml
      • Relation extraction
          • NELL: http://rtw.ml.cmu.edu/rtw/
          • REVERB: http://openie.cs.washington.edu/
      • WSD
          • WordNet::SenseRelate: http://senserelate.sourceforge.net/
    Conclusions
      • Dealing with streams is often easy, but it gets hard when the data stream is intensive and complex operations on the data are required
      • Topic detection
          • Online inference (e.g. for LDA) is currently a new direction
      • Entity, relationship and template extraction, sentiment analysis and social network analysis
          • Are already applied to streams
      • Word sense disambiguation
          • Complex knowledge bases (e.g. WordNet + Wikipedia) coupled with simple disambiguation algorithms work well
      • Summarization
          • Abstractive summaries are more suited to text streams
  • 36. Questions and discussion