Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Paper presentation at ICWE2013.

http://icwe2013.webengineering.org/accepted-full-papers


  1. 1. Motivation: Data on the Web
     Some eye-catching opener illustrating growth and/or diversity of web data
     Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents
     ICWE 2013: International Conference on Web Engineering, 8-12 July 2013, Aalborg, Denmark
     Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze (L3S Research Center, DE)
     09/07/13 ICWE 2013, Aalborg, Denmark
  2. 2. Outline
     – Introduction
     – Related Work
     – Focused Knowledge Extraction
       • Pre-Processing & Query Expansion
       • Pattern Generation
       • Contextual Structure
     – Evaluation
     – Results
     – Conclusions
  3. 3. Introduction
     • Motivation
       – Large amounts of textual Web documents
       – Efficient techniques for querying relevant information
       – Extraction of chunks of text: relations, named entities, etc.
       – Summaries as a means of highlighting the most important chunks of text
     • Issues
       – Summaries are non-structured text
       – Weak relationship between user interests and the importance of specific chunks of text in a corpus
  4. 4. Prominent Text Summarisation Approaches
     • Heuristics for relation extraction
     • Extraction of information based on predefined templates
     • Sentence inclusion based on inclusion of specific terms
     • Latent Semantic Analysis (LSA) for measuring importance of specific terms
     • Tree Kernels encoding relevant information for event detection
     • Latent Dirichlet Allocation (LDA) for topic modelling
     • Populating ontologies based on extracted information from text
     Areas labelled on the slide: IE, IR, ML, SW
  5. 5. Focused Knowledge Extraction: Overview
     • Structured Summary Generation Components:
       – Query Expansion and Reformulation
       – Named Entity Definition and Co-Reference Resolution
       – Pattern Generation
       – Contextual Structure of Summaries
  6. 6. Focused Knowledge Extraction: Pipeline
     [Figure: pipeline from user query to structured summary]
     • User query: Stem Cell
     • Query typing and expansion: Stem Cell, Anatomical structure, Biotechnology, Cloning, Cell biology, Developmental Biology
     • Corpus filtered by OR/AND of expanded query terms
     • Annotate filtered documents: NER, POS
     • Patterns
     • Structured summary (Entities, Actions), for example:
       Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500 000 children → lack → health insurance → enroll → 900 000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.
  7. 7. Focused Knowledge Extraction: Query Expansion
     • Query (“Stem Cell”) → NER → http://dbpedia.org/page/Stem_cell
     • Query Typing & Expansion
       – DBpedia SPARQL query expansion. The query “Stem Cell” is processed into:
         • Typed query: http://dbpedia.org/page/Stem_cell
         • Expanded query:
           http://dbpedia.org/page/Biotechnology
           http://dbpedia.org/page/Cloning
           http://dbpedia.org/page/Cell_biology
           http://dbpedia.org/page/Developmental_biology
       – Conjunction/disjunction of expanded query terms
     • SPARQL query shown on the slide:
       SELECT ?o ?label WHERE { <http://dbpedia.org/resource/Stem_cell> ?p ?o . ?o rdfs:label ?label }
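The expansion step on slide 7 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the endpoint URL and the `format` request parameter are assumptions about the public DBpedia service, and the explicit `PREFIX` line is added so the slide's query is self-contained.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint is used.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_expansion_query(resource_uri: str) -> str:
    """Return the slide's SPARQL query for one typed query resource."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?o ?label WHERE {\n"
        f"  <{resource_uri}> ?p ?o .\n"
        "  ?o rdfs:label ?label\n"
        "}"
    )


def parse_bindings(result: dict) -> list[tuple[str, str]]:
    """Extract (resource URI, label) pairs from a SPARQL JSON result document."""
    rows = result["results"]["bindings"]
    return [(row["o"]["value"], row["label"]["value"]) for row in rows]


def expand_query(resource_uri: str) -> list[tuple[str, str]]:
    """Run the expansion query against the endpoint (needs network access)."""
    params = urllib.parse.urlencode({
        "query": build_expansion_query(resource_uri),
        "format": "application/sparql-results+json",
    })
    url = f"{DBPEDIA_ENDPOINT}?{params}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_bindings(json.load(response))
```

Calling `expand_query("http://dbpedia.org/resource/Stem_cell")` would return the neighbouring resources with their labels; how the slides narrow these down to the four expanded terms shown is not stated, so no filtering is attempted here.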
  8. 8. Focused Knowledge Extraction: Named Entity Definitions & Co-Reference Resolution
     • Entities recognised using NER & NED tools (Stanford’s NLP toolkit)
     • Construct a co-occurrence matrix of proper nouns appearing consecutively:
       $M_{isentity}[i] = \sum_{i=1}^{k} \text{co-occurr}(term_i, term_{i+1})$
     • Sample entities: “Chicago Bears”, “playoff games”
     • Co-reference resolution crucial for accurate knowledge extraction
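Counting consecutive proper nouns, as described on slide 8, might look like the sketch below. It assumes the text is already POS-tagged with Penn Treebank tags (here assigned by hand for illustration); the function name and the sample sentences are hypothetical, and the slides do not say how the counts are thresholded into entities.

```python
from collections import Counter


def consecutive_cooccurrence(tagged_sentences):
    """Count how often two proper nouns appear directly after each other."""
    counts = Counter()
    for sentence in tagged_sentences:
        # Walk over adjacent (word, tag) pairs within one sentence.
        for (word1, tag1), (word2, tag2) in zip(sentence, sentence[1:]):
            if tag1.startswith("NNP") and tag2.startswith("NNP"):
                counts[(word1, word2)] += 1
    return counts


# Hand-tagged illustrative sentences (assumption, not data from the paper).
sentences = [
    [("The", "DT"), ("Chicago", "NNP"), ("Bears", "NNPS"), ("won", "VBD")],
    [("Chicago", "NNP"), ("Bears", "NNPS"), ("fans", "NNS"), ("cheered", "VBD")],
]
counts = consecutive_cooccurrence(sentences)
```

Pairs with a high count, such as ("Chicago", "Bears") here, are candidates for being treated as a single entity.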
  9. 9. Focused Knowledge Extraction: Pattern Generation
     • Determine topic terms (LDA) from the underlying filtered corpus
     • Annotate topic terms using POS taggers
     • Pattern items:
       – POS tags from topic terms
       – Query terms (incl. terms after expansion)
     • Example topic terms: police found women men dr death people drug medical officers man problems study killed heart hospital test sex patients evidence dead drugs officer….
     • POS-tagged topic terms: police_NN found_VBD women_NNS men_NNS dr_VBP death_NN people_NNS drug_NN medical_JJ officers_NNS man_NN problems_NNS study_NN killed_VBD heart_NN hospital_NN test_NN sex_NN patients_NNS evidence_NN dead_NN drugs_NNS officer_NN
     • POS tag items: NN → VBD → NNS → VBP → NN….
     • Query term items: Stem Cell → Anatomical structure → Biotechnology Cloning → Cell Biology → Developmental Biology
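Turning the tagged topic terms of slide 9 into a sequence of POS-tag pattern items can be sketched as below. Collapsing consecutive repeated tags is an assumption inferred from the slide's example (two adjacent NNS terms appear once in the sequence); the helper name is hypothetical.

```python
from itertools import groupby


def split_tagged_terms(tagged: str):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


# First six tagged topic terms from the slide.
tagged_topic_terms = "police_NN found_VBD women_NNS men_NNS dr_VBP death_NN"
pairs = split_tagged_terms(tagged_topic_terms)

# Assumption: runs of the same tag are collapsed into one pattern item.
tag_items = [tag for tag, _ in groupby(tag for _, tag in pairs)]
print(" → ".join(tag_items))  # NN → VBD → NNS → VBP → NN
```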
  10. 10. Focused Knowledge Extraction: Pattern Generation (I)
      • Construct a co-occurrence matrix of pattern items (POS tags, query terms)
      • Automatically generate emerging patterns reflecting the syntactical relevance of chunks of text
      • Patterns are sequences of co-occurring items, modelled as directed tree graphs
      • For each pattern item, generate a directed tree graph with that item as the root node
      • The pattern score conveys importance for a given corpus and query
  11. 11. Focused Knowledge Extraction: Pattern Generation (II)
      • Automatically generated patterns show the sequence of important syntactical items to appear in a sentence
      • Patterns are scored by the marginal probability of co-occurring pattern items, based on the filtered corpus:
        – Prior probability of a pattern item, as the head node of the directed tree graph
        – Conditional probability of two consecutive pattern items
      Generated patterns:
      Pattern                           Score (ψscore)
      NN → JJ → VB → RB                 0.28571429
      NN → VB → JJ → RB                 0.19949495
      Stem Cell → NN → VB → RB → JJ     0.17361111
      JJ → RB → VB → NN → Stem Cell     0.17347462
      RB → JJ → NN → Stem Cell          0.16466599
      NN → Stem Cell → RB → VB → JJ     0.16155811
      RB → VB → Stem Cell → NN → JJ     0.16129665
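One way to read the scoring described on slide 11 is as a chain: the prior probability of the head item multiplied by the conditional probability of each consecutive pair. The sketch below follows that reading with simple count-based estimates; the estimators, function names, and toy sequences are assumptions, and the slides do not give the exact formula behind ψscore.

```python
from collections import Counter


def estimate_counts(sequences):
    """Count single pattern items and consecutive item pairs."""
    item_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        item_counts.update(seq)
        pair_counts.update(zip(seq, seq[1:]))
    return item_counts, pair_counts


def pattern_score(pattern, item_counts, pair_counts):
    """Prior of the head item times conditionals of consecutive items."""
    total = sum(item_counts.values())
    if total == 0 or not pattern:
        return 0.0
    score = item_counts[pattern[0]] / total  # prior of the head node
    for prev, nxt in zip(pattern, pattern[1:]):
        if item_counts[prev] == 0:
            return 0.0
        # Rough estimate of P(nxt | prev) from raw counts.
        score *= pair_counts[(prev, nxt)] / item_counts[prev]
    return score


# Toy item sequences (illustrative only, not corpus data).
sequences = [
    ["NN", "VB", "JJ"],
    ["NN", "JJ", "VB"],
    ["Stem Cell", "NN", "VB"],
]
item_counts, pair_counts = estimate_counts(sequences)
score = pattern_score(["NN", "VB"], item_counts, pair_counts)
```

Under this estimate a pattern containing a pair that never occurs consecutively scores zero; smoothing would be needed to avoid that, which is left out here.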
  12. 12. Focused Knowledge Extraction: Contextual Structure of Summaries
      • Summaries generated as structured knowledge
      • Decomposition of summaries into two structures:
        – global (Entities, Actions) for the entire corpus
        – local (entity-context, action-context) for a particular document
      • Multiple summary perspectives based on generated context
      • Enrichment with additional information from reference datasets (DBpedia)
  13. 13. Focused Knowledge Extraction: Contextual Structure of Summaries
      Contextual structure of summaries with global and local structures enabling multiple summary perspectives.
      • Example sentence: “The kinds of stem cell therapies being researched for the most part do not involve the politically sensitive use of embryonic stem cells.”
      • Extracted items: Stem cell, Therapies, researched, involve
      • Context of “Stem Cell”: Embryonic, sensitive
      • Context of “researched”: Stem cell therapies ↔ most part
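The global and local structures of slides 12 and 13 could be held in a small container like the one sketched here. The class, its field names, and the split of the example items into entities and actions are one possible reading of the slide, not a structure defined by the authors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StructuredSummary:
    """Hypothetical container; field names follow the slide's vocabulary."""

    entities: list[str] = field(default_factory=list)  # global structure
    actions: list[str] = field(default_factory=list)   # global structure
    entity_context: dict[str, list[str]] = field(default_factory=dict)  # local
    action_context: dict[str, list[str]] = field(default_factory=dict)  # local

    def perspective(self, item: str) -> list[str]:
        """Return the local context recorded for an entity or an action."""
        return self.entity_context.get(item) or self.action_context.get(item, [])


# Values taken from the slide's example sentence.
summary = StructuredSummary(
    entities=["Stem cell", "Therapies"],
    actions=["researched", "involve"],
    entity_context={"Stem cell": ["Embryonic", "sensitive"]},
    action_context={"researched": ["Stem cell therapies", "most part"]},
)
```

Looking up `summary.perspective("researched")` gives the action's context, which is the kind of per-item view the slides call a summary perspective.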
  14. 14. Evaluation Setup
      • Dataset: New York Times, year 2007
      • 40,000 articles with manually generated summaries
      • Summary relevance w.r.t. the generated context (query)
      • Coverage of the manually generated NYT summaries
      • ROUGE-n metric to measure coverage of structured vs. manually generated summaries
      • ROUGE-n formula annotations on the slide: total n-grams; matching n-grams from structured and manually generated summaries
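The n-gram overlap behind ROUGE-n can be sketched as follows. The slide only labels "matching n-grams" and "total n-grams"; taking the candidate's n-grams as the denominator for precision and the reference's n-grams for recall follows the usual ROUGE convention and is an assumption here, as are the whitespace tokenisation and the sample strings.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # matching n-grams, clipped by count
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative strings, not summaries from the evaluation.
precision, recall, f1 = rouge_n(
    "stem cell research advances",
    "congress funds stem cell research",
)
```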
  15. 15. Results
      • 10 queries used for evaluation (prominent events of 2007 from Time magazine [1])
      • Human evaluation for summary relevance: 76% correctly generated
      • 17 evaluators, each evaluating 20 summaries on average
      [1] http://www.time.com/time/specials/2007/0,28757,1686204,00.html
      Generated structured summaries for the different queries:
      Query               #Q. Terms   #Doc.   #Summ.
      European Union      7           157     129
      Super Bowl          13          370     325
      US Congress         17          13      19
      Virginia Tech       28          12      11
      Stem Cell           5           105     86
      Protest             2           129     103
      Harry Potter        22          10      7
      Global Warming      5           198     170
      National Security   0           250     207
      Terrorist Attacks   0           57      52
  16. 16. Results
      • ROUGE-1 evaluation results for the 10 queries
      • Best-performing results for ROUGE-1: 25% precision and 32% recall
      [Figure: P/R/F1 measures based on the ROUGE-1 metric for the 10 queries used for evaluation]
  17. 17. Results: Sample Generated Summaries
      Query: “Stem Cell”
      • Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500,000 children → lack → health insurance → enrol → 900,000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.
      • Congress’s Shift in Power → revives → Medicare Debate House Democrats → try to rush → legislation → requiring → government → negotiate → lower drug prices for Medicare beneficiaries → overturning → President Bush’s restrictions on embryonic stem cell research.
      • The nation → welcome → ambitious agenda → being offered → today by the new Congress Democratic majority → raising → minimum wage → advancing → stem cell research → restoring → oversight of the executive branch.
      • New study → suggesting → useful stem cells → be derived → amniotic fluid without → destroying → embryos.
      • Swarns, Rachel L → announced → 9 Aug. federal government → pays → studies on stem cell colonies, lines → created before → that date, government → does not encourage → destruction of additional embryos.
      • Stem cell research → has not produced → a single medical treatment → is morally wrong → to create human life → to destroy → for research.
      • The measure → allow → scientists → receiving → federal funds → use → embryonic stem cells from surplus embryos → generated → fertility clinics, after cell lines → had been derived → by others → using → nonfederal funds.
  18. 18. Conclusions
      • Query-based generated summaries
      • Contextualised structured summaries
        – Typing and expanding of queries using reference datasets
        – Automated pattern generation
      • Incorporated user interests and syntactical relevance of chunks of text
      • Multiple summary perspectives
      • Overall good accuracy of generated summaries
      • Infer new knowledge by interlinking summaries of different/same contexts
  19. 19. Thank you! Questions?
