
Multi-step Classification Approaches to Cumulative Citation Recommendation


Presented at the Open Research Areas in Information Retrieval (OAIR 2013) conference


  1. Multi-step Classification Approaches to Cumulative Citation Recommendation. Krisztian Balog (University of Stavanger); Naimdjon Takhirov, Heri Ramampiaro, Kjetil Nørvåg (Norwegian University of Science and Technology). Open Research Areas in Information Retrieval (OAIR) 2013 | Lisbon, Portugal, May 2013
  2. Motivation - Maintain the accuracy and high quality of knowledge bases - Develop automated methods to discover (and process) new information as it becomes available
  3. TREC 2012 KBA track
  4. Central?
  5. Task: Cumulative Citation Recommendation - Filter a time-ordered corpus for documents that are highly relevant to a predefined set of entities - For each entity, provide a ranked list of documents based on their “citation-worthiness”
  6. Collection and topics - KBA stream corpus - Oct 2011 - Apr 2012 - Split into training and testing periods - Three sources: news, social, linking - Raw data: 8.7 TB - Cleansed version: 1.2 TB (270 GB compressed) - Stream documents uniquely identified by stream_id - Test topics (“target entities”) - 29 entities from Wikipedia (27 persons, 2 organizations) - Uniquely identified by urlname
  7. Annotation matrix - Two dimensions: whether the document contains a mention of the entity (yes/no), and relevance level: garbage (G), neutral (N), relevant (R), central (C) - G and N are non-relevant; R and C are relevant
  8. Scoring - Example for target entity Aharon Barak: a ranked list of (urlname, stream_id, score) triples with scores in [0, 1000] - A cutoff on the score separates positive from negative documents
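The cutoff-based scoring on this slide can be sketched as follows. This is a minimal illustration with toy document IDs and scores (not the actual stream_ids or values from the slide): documents at or above the cutoff count as positive, the rest as negative.

```python
def split_at_cutoff(ranked, cutoff):
    """Split a ranked list of (doc_id, score) pairs, sorted by descending
    score, into positives (score >= cutoff) and negatives (score < cutoff)."""
    positives = [(d, s) for d, s in ranked if s >= cutoff]
    negatives = [(d, s) for d, s in ranked if s < cutoff]
    return positives, negatives

# Toy ranked list for one target entity; scores lie in [0, 1000]
ranked = [("doc-a", 500), ("doc-b", 480), ("doc-c", 430), ("doc-d", 380)]
pos, neg = split_at_cutoff(ranked, cutoff=450)
# pos → [("doc-a", 500), ("doc-b", 480)]
```

Sweeping the cutoff over the score range is what produces the F1 curves evaluated later in the talk.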
  9. Approach
  10. Overview - “Is this document central for this entity?” - Binary classification task - Multi-step approach - Classifying every document-entity pair is not feasible - First step to decide whether the document contains the entity - Subsequent step(s) to decide centrality
  11. 2-step classification - Mention? (N → discard; Y → next step) - Central? (Y → score in [0, 1000])
  12. 3-step classification - Mention? - Relevant? - Central? - A Y decision at each step passes the document on; the final output is a score in [0, 1000]
  13. Components - Step 1 (Mention?): mention detection - Subsequent steps (Relevant?, Central?): supervised learning
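The 2-step pipeline above can be sketched as a cascade: a cheap mention filter first, then the supervised classifier only on surviving pairs. The `detect_mention` and `centrality_score` callables here are hypothetical stand-ins for the components named on the slide.

```python
def two_step(doc, entity, detect_mention, centrality_score):
    """2-step CCR pipeline. Returns a score in [0, 1000] if the pair
    survives both steps, or None if it is filtered out at step 1."""
    if not detect_mention(doc, entity):        # step 1: Mention?
        return None                            # skip costly classification
    p_central = centrality_score(doc, entity)  # step 2: Central? (confidence)
    return int(round(p_central * 1000))        # map confidence to [0, 1000]

# Toy components for illustration
detect = lambda doc, ent: ent.lower() in doc.lower()
score = lambda doc, ent: 0.83  # pretend classifier confidence

print(two_step("News about Aharon Barak ...", "Aharon Barak", detect, score))  # 830
print(two_step("An unrelated document", "Aharon Barak", detect, score))        # None
```

The 3-step variant inserts a Relevant? classifier between the two steps in the same fashion.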
  14. Identifying entity mentions - Goals - High recall - Keep false positive rate low - Efficiency - Detection based on known surface forms of the entity - urlname (i.e., Wikipedia title) - Name variants from DBpedia - DBpedia-loose: only last names for people - No disambiguation
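Surface-form matching along the lines described above might look like this. The variant list is illustrative (not the actual DBpedia data), and matching is deliberately naive, with no disambiguation, to keep recall high.

```python
def build_surface_forms(urlname, dbpedia_variants, loose=False):
    """Collect known surface forms for an entity: the Wikipedia title
    (urlname with underscores replaced) plus DBpedia name variants;
    in 'loose' mode for people, also just the last name."""
    title = urlname.replace("_", " ")
    forms = {title} | set(dbpedia_variants)
    if loose:
        forms.add(title.split()[-1])  # last name only (DBpedia-loose)
    return {f.lower() for f in forms}

def mentions(document_text, surface_forms):
    """High-recall check: does any known surface form occur verbatim?
    No disambiguation is attempted."""
    text = document_text.lower()
    return any(form in text for form in surface_forms)

forms = build_surface_forms("Aharon_Barak", {"Justice Barak"}, loose=True)
print(mentions("Former Justice Barak said ...", forms))  # True
```

Looser form sets raise recall but, as the results slide shows, also inflate the false positive rate.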
  15. Features - 1. Document (5) - Length of document fields (body, title, anchors) - Type (news/social/linking) - 2. Entity (1) - Number of related entities in KB - 3. Document-entity (28) - Occurrences of entity in document - Number of related entity mentions - Similarity between doc and the entity’s WP page
  16. Features - 4. Temporal (38) - Wikipedia pageviews - Average pageviews - Change in pageviews - Bursts - Mentions in document stream - Average volume - Change in volume - Bursts
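Putting the four feature groups together, each document-entity pair becomes one instance vector for the supervised classifier. The group sizes follow the slides (5 + 1 + 28 + 38 = 72); the individual feature values here are placeholders.

```python
def feature_vector(doc_feats, entity_feats, doc_entity_feats, temporal_feats):
    """Concatenate the four feature groups into one instance vector
    (5 document + 1 entity + 28 document-entity + 38 temporal = 72)."""
    assert len(doc_feats) == 5 and len(entity_feats) == 1
    assert len(doc_entity_feats) == 28 and len(temporal_feats) == 38
    return doc_feats + entity_feats + doc_entity_feats + temporal_feats

vec = feature_vector([0.0] * 5, [0.0] * 1, [0.0] * 28, [0.0] * 38)
print(len(vec))  # 72
```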
  17. Results
  18. Identifying entity mentions - Results on the testing period - urlname: 41.2K document-entity pairs, recall 0.842, false positive rate 0.559 - DBpedia: 70.4K pairs, recall 0.974, false positive rate 0.701 - DBpedia-loose: 12.5M pairs, recall 0.994, false positive rate 0.998
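For reference, the two metrics in the table are the standard definitions, sketched below with hypothetical counts (the slide reports only the final ratios, not the underlying counts).

```python
def recall(true_pos, false_neg):
    """Fraction of truly mentioning document-entity pairs retrieved."""
    return true_pos / (true_pos + false_neg)

def false_positive_rate(false_pos, true_neg):
    """Fraction of non-mentioning pairs wrongly retrieved."""
    return false_pos / (false_pos + true_neg)

# Toy counts chosen to reproduce the urlname row's ratios
print(round(recall(842, 158), 3))               # 0.842
print(round(false_positive_rate(559, 441), 3))  # 0.559
```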
  19. End-to-end task - F1 score with a single cutoff - Bar charts comparing the 2-step and 3-step approaches (J48 and RF classifiers) against TREC results, evaluated on Central and on Central+Relevant
  20. End-to-end task - F1 score with a per-entity cutoff - Bar charts comparing the 2-step and 3-step approaches (J48 and RF classifiers) against TREC results, evaluated on Central and on Central+Relevant
  21. Summary - Cumulative Citation Recommendation task at TREC 2012 KBA - Two multi-step classification approaches - Four groups of features - Differentiating between relevant and central is difficult
  22. Classification vs. Ranking [Balog & Ramampiaro, SIGIR’13] - Approach CCR as a ranking task - Learning-to-rank - Pointwise, pairwise, listwise - Pointwise LTR outperforms classification approaches using the same set of features
  23. Knowledge Base Acceleration Assessment & Analysis
  24. Questions? Contact | @krisztianbalog |