
Dynamic Search Using Semantics & Statistics


This presentation shows three applications of successfully combining semantics and statistics for text mining and interactive search.

1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).

2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.

3) We discover user intent for interactive retrieval. User intent is defined as a latent state whose observations are the reformulated query sequence and the retrieved documents, together with the positive or negative feedback provided by the user. The demo shows recognition of user intent in health-care search.


Dynamic Search Using Semantics & Statistics

  1. Text Mining - Bayesian Topic Modeling for Interactive Retrieval at SAP and Cisco
     Ram Akella, University of California and Stanford
     with Karla Caballero, Maria Daltayanni, and Chunye Wang (UCSC) and Paul Hofmann (SAP Labs)
     October 6, 2011, SAP
  2. Outline
     Motivation
     Statistical Topic Modeling - SAP & Saffron
     Knowledge Extraction and Reuse at Cisco
     Interactive Retrieval
     Interactive Retrieval Demo
  3. Outline
     Motivation
     Statistical Topic Modeling - SAP & Saffron
     Knowledge Extraction and Reuse at Cisco
     Interactive Retrieval
     Interactive Retrieval Demo
  4. Motivation
     A user expects to find more relevant results each time she interacts with the system.
     Example query sequence: q1: elderly depression; q2: depression symptoms; q3: symptoms and treatment.
     The relevance of the presented documents depends on the user's context: for the same queries, a doctor may want documents on depression treatment of patients, while a social scientist may want documents on depression's influence on family relationships.
  5. Interactive Retrieval Model
     The query and user feedback flow between the user and the Interactive Retrieval System; a Metadata Generation System processes the document collection.
     The information need is updated via feedback, which is also propagated to similar documents.
  6. Interactive Retrieval Model
     The Metadata Generation System adds metadata to each document to facilitate the retrieval process. This metadata consists of:
     a statistical topic mixture, and
     knowledge extracted according to the business process (problem, cause, solution).
     As before, the information need is updated via user feedback, which is propagated to similar documents.
  7. Outline
     Motivation
     Statistical Topic Modeling - SAP & Saffron
       Motivation
       Related Work
       Proposed Approach
       Topic Modeling and Entity Association
     Knowledge Extraction and Reuse at Cisco
     Interactive Retrieval
     Interactive Retrieval Demo
  8. Topic Modeling: Motivation
     Given a set of documents, we want to identify the main areas or topics discussed, in an unsupervised manner. We take advantage of the semantic associations between words across the documents: if two words appear in the same document, they should be related.
     Each topic has its own distribution over words, and each document may contain material about a mixture of topics, e.g. a "Music" topic (notes, instrument, play) and a "Sports" topic (game, racquet, net), with common words such as "ball" appearing in both; a document's mixture might be dominated by one topic (say 80%) with small contributions from others. (A toy sketch follows.)
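
The following minimal sketch (not from the deck) illustrates this idea on a toy corpus using scikit-learn's LDA, standing in for the authors' own topic model; the documents, topic count, and variable names are all invented.

```python
# Minimal topic-modeling sketch (illustrative; not the authors' model).
# Fits LDA on a toy corpus and prints the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the band played soft music on string instruments",
    "read the notes and play the instrument in the concert",
    "the player hit the ball over the net with a racquet",
    "a great game of tennis with a fast ball and strong net play",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                 # document-term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                   # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for k, word_dist in enumerate(lda.components_):    # per-topic word weights
    top = word_dist.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])
print("mixture of doc 0:", doc_topic[0].round(2))  # e.g. roughly [0.8, 0.2]
```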
  9. Related Work
     [comparison of prior models; table not preserved in the transcript]
  10. Our Approach
     Most of the probability mass is concentrated in the upper part of the topic tree, which facilitates truncation and reduces the number of topics.
     We can determine a suitable number of topics for a particular dataset without training the model several times (once per candidate number of topics).
     [tree diagram with example node probabilities, e.g. 0.0851, 0.0660, 0.0310, 0.0146, 0.0096]
  11. Experimental Setup
     We use two types of datasets:
       scientific articles (NIPS): longer documents;
       news data (NYT, APW, XIE): shorter documents with a more diverse vocabulary.
     We compare the performance of the algorithm against three approaches from the literature: LDA, CTM, and Pachinko Allocation.
     We evaluate the model using empirical likelihood, which estimates how likely it is that a test document would be generated by the estimated model; we want this value to be high (better generalization and applicability to unseen documents). (A toy sketch of held-out evaluation follows.)
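
As a hedged illustration of the held-out evaluation idea: scikit-learn's `score()` returns a variational lower bound on the log-likelihood rather than the empirical-likelihood estimator used in the deck, but the comparison logic (higher is better on unseen documents) is the same; the corpora and candidate topic counts below are invented.

```python
# Held-out evaluation sketch (illustrative, invented corpora).
# sklearn's score() is a variational lower bound on the log-likelihood of
# unseen documents, not the deck's empirical-likelihood estimator,
# but the comparison works the same way: higher is better.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "stocks fell sharply in early trading",
    "the market rallied after the earnings report",
    "the team won the game in overtime",
    "the coach praised the players after the match",
]
test_docs = [
    "investors sold shares as the market dropped",
    "the players celebrated the win after the game",
]

vec = CountVectorizer(stop_words="english")
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)   # unseen words are simply ignored

for k in (2, 3, 5):                 # candidate numbers of topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(f"{k} topics -> held-out bound: {lda.score(X_test):.1f}")
```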
  12. Results: NYT Dataset
     We obtain the topic mixture for the NYT dataset using K = 20 topics. [table of topics with +/- quality annotations not preserved in the transcript]
  13. Results: Empirical Likelihood
     [plots comparing our model against the baselines on the APW, NIPS, XIE, and NYT datasets]
  14. Results: Running Time
     [running time in minutes of our model versus the baselines on the APW, NIPS, XIE, and NYT datasets]
  15. Illustrative Example: NYT Dataset
     NORTHRIDGE TAUGHT A LESSON
     LOS ANGELES _ School has been out at Cal State Northridge since the week before Christmas, but since you can learn something everyday, Mississippi State's women's basketball team gave a lesson. Northridge has talked about taking its game to the next level. The 21st-ranked Bulldogs _ the first nationally ranked team to play here in Northridge's Division I era _ gave a glimpse of that level in a 98-64 nonconference victory before a crowd of 165 Friday night.
  16.-18. Illustrative Example: NYT Dataset
     [the same article repeated three times; the per-slide topic highlighting is not preserved in the transcript]
  19. Topic Modeling & Entity Association
     Goal: find out which actors were involved in a particular action that led to the failure of Lehman Brothers.
     Base knowledge source: the Valukas report on why Lehman Brothers failed (6 volumes).
     The SAP Business Objects Entity Extractor extracts entities, and the UCSC Topic Mining System extracts topics, from the text data to be monitored; the Saffron Associative Memory Base then creates associations among entities and topics, which can be queried.
     This work was presented at SAPPHIRE NOW 2010.
  20. Outline
     Motivation
     Statistical Topic Modeling - SAP & Saffron
     Knowledge Extraction and Reuse at Cisco
       Knowledge Extraction System
       System Architecture
       Domain Knowledge
       Improving Productivity
       Performance of Service Request Recommender
     Interactive Retrieval
     Interactive Retrieval Demo
  21. Knowledge Extraction System at Cisco
     Pipeline: Service Request Database -> Text Mining System -> Knowledge Database -> applications such as retrieval.
     From the unstructured text of each service request we extract structured knowledge: the problem (what was the problem?), the cause (why did it occur?), and the solution (how was it solved?); irrelevant content is discarded.
     This enables finding different solutions to the same problem: two documents match when the similarity of their problem and cause fields is high, even if the similarity of their solution fields is low.
  22. System Architecture
     A service request flows through a preprocessor and a feature generator (bag-of-words features plus features derived from expertise and domain knowledge), then through a hierarchical classifier that produces labeled paragraphs; these feed the Service Request Recommender, whose output goes to the user. (A toy sketch of the classification step follows.)
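
To make the pipeline concrete, here is a minimal sketch of the bag-of-words-to-labeled-paragraphs step, with a flat logistic-regression classifier standing in for the deck's hierarchical classifier; the snippets and labels are invented.

```python
# Sketch of the "feature generator -> classifier -> labeled paragraphs" step.
# A flat three-way classifier stands in for the deck's hierarchical one;
# the training paragraphs and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

paragraphs = [
    "router crashes when the interface is brought up",      # problem
    "customer reports packet loss on the WAN link",         # problem
    "crash caused by a memory leak in the routing daemon",  # cause
    "packet loss traced to a duplex mismatch on the port",  # cause
    "upgraded the image to a fixed release",                # solution
    "set both ends of the link to full duplex",             # solution
]
labels = ["problem", "problem", "cause", "cause", "solution", "solution"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(paragraphs, labels)
print(clf.predict(["the device reboots under heavy load"]))  # likely 'problem'
```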
  23. Domain Knowledge
     Internetworking Terms and Acronyms Dictionary (ITAD).
     Benefits: (1) the expansion of acronyms and terminology; (2) the enhancement of concept dependencies.
     Example: measuring the similarity between snippets from Doc1 and Doc2, where [...] marks explanations taken from ITAD, blue marks overlapping words between the unexpanded excerpts, and red marks overlapping words introduced by ITAD. (A sketch of this expansion follows.)
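
A minimal sketch of this expansion, assuming a two-entry toy dictionary in place of the real ITAD (whose entries are not in the transcript): expansion introduces shared words, and hence non-zero similarity, between otherwise disjoint snippets.

```python
# Acronym-expansion sketch with a toy dictionary (not the real ITAD).
# Expansion introduces overlapping words between otherwise disjoint snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ITAD = {  # toy entries; the real dictionary is much larger
    "ospf": "open shortest path first routing protocol",
    "bgp": "border gateway protocol routing protocol",
}

def expand(text: str) -> str:
    # Append the dictionary explanation after each known acronym.
    return " ".join(w + " " + ITAD[w] if w in ITAD else w
                    for w in text.lower().split())

doc1, doc2 = "ospf adjacency flaps", "bgp session resets"
for a, b in [(doc1, doc2), (expand(doc1), expand(doc2))]:
    tfidf = TfidfVectorizer().fit_transform([a, b])
    print(round(cosine_similarity(tfidf[0], tfidf[1])[0, 0], 2))
# Similarity rises after expansion: both snippets gain "routing protocol".
```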
  25. Improving Productivity
     We compare the time engineers spend reading service requests before and after using our system.
     Workflow: browse a service request and assess its relevance (time to assess relevance); if relevant, read and understand it thoroughly (time to extract knowledge); once enough has been read, create a knowledge article.
  26. Performance of Service Request Recommender
     We compare our method against baseline retrieval schemes.
     Result 1: both the deterministic and the probabilistic model achieve much better results when labeled paragraphs are used, which validates our hypothesis of an inherent diagnostic business process.
     Result 2: using domain knowledge further improves retrieval results.
     Result 3: the probabilistic recommender outperforms the deterministic recommender.
  28. Outline
     Motivation
     Statistical Topic Modeling - SAP & Saffron
     Knowledge Extraction and Reuse at Cisco
     Interactive Retrieval
       Problem
       Reinforcement Learning Formulation
       How many interaction steps are needed?
       How much feedback is needed?
       Interactive Retrieval Using Topic Modeling
     Interactive Retrieval Demo
  29. Interactive Retrieval
     Model the user intent in order to retrieve relevant documents.
     Identify the trade-off between retrieval accuracy (how accurate do the results need to be?) and interaction time (how much time is the user willing to spend on interaction?).
     Applied to:
       medical document retrieval, where accuracy matters more (e.g., search for past patient cases with similar symptoms);
       resume retrieval in a labor marketplace, where accuracy matters less (e.g., search for Python developers who work in machine learning).
  30. Problem
     Over interaction steps t1, t2, t3, ..., tn, both the user intent and the set of relevant documents evolve.
     Candidate strategies: static, myopic dynamic, and fully dynamic (dynamic programming / reinforcement learning).
     What is the best path to choose?
  31. Reinforcement Learning Formulation of IIR
     Agent: the IIR system. Environment: the user.
     State: the best guess for the user intent or need (expressed in query terms).
     Action: a ranking Rk.
     Reward: the improvement v(Rk) - v(Rk-1), as observed from user feedback.
     Objective: maximize the sum of rewards. (A toy sketch follows.)
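
A sketch of this reward structure, assuming a simple value function v(.) (weighted fraction of relevant documents in the ranking) and an invented three-step episode; neither the value function nor the episode comes from the authors' system.

```python
# Sketch of the reward signal: the agent (IIR system) emits ranking R_k,
# the environment (user) gives feedback, and the reward is the improvement
# v(R_k) - v(R_{k-1}). Value function and episode are invented.
def v(ranking, relevant, weights=None):
    """Value of a ranking: weighted fraction of relevant documents."""
    weights = weights or [1.0] * len(ranking)
    hit = sum(w for doc, w in zip(ranking, weights) if doc in relevant)
    return hit / sum(weights)

relevant = {"d2", "d5", "d7"}        # the hidden user intent
episode = [                          # one ranking per interaction step
    ["d1", "d2", "d3"],
    ["d2", "d5", "d3"],
    ["d2", "d5", "d7"],
]

total, prev = 0.0, 0.0
for k, ranking in enumerate(episode, start=1):
    value = v(ranking, relevant)
    reward = value - prev            # improvement observed via feedback
    total += reward
    prev = value
    print(f"step {k}: v(R_k)={value:.2f}, reward={reward:+.2f}")
print(f"objective (sum of rewards): {total:.2f}")
```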
  32. Experiments Set-Up
     Dataset: TREC-9 OHSUMED, 348,566 medical documents with a list of relevance judgments.
     65 user queries: query title of 2-5 words, query description of 5-10 words.
     Interactive sessions of 3-5 steps; the relevance function is binary.
     The value of results is computed with appropriate weights wi; we report Precision@10, the percentage of relevant documents in the top-10 results (implemented below).
     We compare our results with pseudo-relevance feedback.
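
Precision@10 itself is unambiguous; the sketch below implements it directly (the document IDs and relevance judgments are invented).

```python
# Precision@10: fraction of relevant documents among the top-10 results.
def precision_at_k(ranked_ids, relevant_ids, k=10):
    top = ranked_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / k

# Toy example with invented IDs and judgments:
ranked = [f"doc{i}" for i in range(1, 21)]
judged_relevant = {"doc1", "doc3", "doc4", "doc9", "doc15"}
print(precision_at_k(ranked, judged_relevant))  # -> 0.4
```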
  33. How many interaction steps are needed?
     [results plot not preserved in the transcript]
  34. How much feedback is needed?
     Experiments were run on the OHSUMED medical dataset (348,566 documents), TREC 2002.
  35. Interactive Retrieval with Topic Modeling
     Topics help us reduce the search space and add context to the query, since some terms that are important for describing the user's intent may not appear in the query itself.
     Topics are calculated a priori and added to each document as metadata.
     A meta-query (a combination of user inputs) holds both term and topic relevance scores; it is updated each time the user provides feedback (clicks) or additional information (query redefinition), moving toward the topic mixture of relevant documents and away from that of non-relevant documents, as sketched below.
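
A hedged sketch of such an update: a Rocchio-style combination in topic space, which matches the slide's description but is not necessarily the authors' exact rule; the coefficients alpha/beta/gamma and the toy mixtures are assumptions.

```python
# Rocchio-style meta-query update in topic space (illustrative; the deck
# does not give the exact update rule). The meta-query moves toward the
# topic mixture of relevant documents and away from non-relevant ones.
import numpy as np

def update_meta_query(q, relevant, non_relevant,
                      alpha=1.0, beta=0.75, gamma=0.15):
    """q and every row of relevant/non_relevant are topic mixtures."""
    q_new = alpha * q
    if len(relevant):
        q_new = q_new + beta * relevant.mean(axis=0)
    if len(non_relevant):
        q_new = q_new - gamma * non_relevant.mean(axis=0)
    q_new = np.clip(q_new, 0.0, None)   # keep the mixture non-negative
    return q_new / q_new.sum()          # renormalize to a distribution

q = np.array([0.5, 0.3, 0.2])                       # current topic mixture
rel = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]])  # clicked (relevant) docs
non = np.array([[0.1, 0.1, 0.8]])                   # rejected docs
print(update_meta_query(q, rel, non).round(3))      # shifts toward topic 0
```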
  36. Proposed Dataset
     We test our approach on the HARD TREC queries, whose document collection consists of:
       851,018 news documents from the NYT, APW, and XIE agencies;
       an average document length of 305 terms;
       496,779 unique terms.
     We infer the topic information of the corpus using 75 topics.
     For testing purposes we use m = 3 interactions and 30 queries.
     We compare our algorithm with mixture relevance feedback.
  37. Preliminary Results
     [plot of precision versus number of interactions]
  38. Outline
     Motivation
     Statistical Topic Modeling - SAP & Saffron
     Knowledge Extraction and Reuse at Cisco
     Interactive Retrieval
     Interactive Retrieval Demo
  39. Example User Intent
     Young female with fevers and increased CPK (creatine phosphokinase).
     CPK is an enzyme; an increased level may indicate a heart attack or severe muscle breakdown.
     Neuroleptic malignant syndrome (a life-threatening neurological disorder) is associated with increased CPK; its symptoms include muscular cramps, fever, unstable blood pressure, and changes in cognition, including agitation, delirium, and coma.
     Differential diagnosis: list the symptoms; list the causes of the symptoms; prioritize by the most dangerous; treat.
     Treatment.
  40. Relevant Documents
     Non-relevant documents:
       Doc 1: Significance of elevated levels of CPK in febrile diseases: a prospective study. The incidence and significance of elevated serum levels of CPK in febrile diseases were studied prospectively in all patients admitted with fever to a department of medicine during 1 year.
       Doc 2: Metoclopramide-induced neuroleptic malignant syndrome... Symptoms of NMS include rigidity, hyperpyrexia, altered consciousness, and autonomic instability. This syndrome is generally associated with neuroleptic medications used to treat psychotic and major depressive illnesses...
     Relevant document:
       Doc 3: Neuroleptic malignant syndrome: guidelines for treatment and reinstitution of neuroleptics... Cardinal symptoms include fever, muscular rigidity, an elevated serum level of creatine phosphokinase, changes in mental status, and autonomic dysfunction...
  41. Interactive Demo
     InteractiveDemo_MedicalData
     Sub-queries:
       young female with fevers and increased CPK
       neuroleptic malignant syndrome
       differential diagnosis
       treatment
