Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study
As presented in Text Analytics World in San Francisco (Apr 2013)

Published in: Technology, Education

Transcript

  • 1. Automatic Taxonomy Generation for a News Group: A Case Study. Anna Divoli, Pingar Research (@annadivoli). San Francisco, Apr 2013.
  • 2. Why?
      - Why automatic generation? Dynamic; fast; cheap; consistent; RDF / flexible; …
      - Why from a document collection? Focused/specific; optimal for those documents; …
  • 3. Talk Overview: The Team; The Process; News Group Case Study; Evaluation; Other Use Cases; Summary.
  • 4. Taxonomy Generation Research Team: Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten. "Constructing a Focused Taxonomy from a Document Collection". To appear in Proceedings of the Extended Semantic Web Conference 2013 (ESWC), Montpellier, France.
  • 5. How Taxonomy Generation Works
  • 6. Taxonomy Generation Overview
      - Input: documents stored somewhere.
      - Analysis: using a variety of tools* and datasets, extract concepts, entities, relations.
      - Grouping & Output: a taxonomy is created that groups the resulting taxonomy terms hierarchically.
      - Result: custom taxonomy.
  • 7. Taxonomy Generation - Detailed
  • 8. Taxonomy Generation in 5 Steps!
      - Inputs: input docs; preferred top-level terms.
      - Steps: 1. Import & convert to text; 2. Extract concepts; 3. Annotate with Linked Data; 4. Disambiguate clashing concepts; 5. Consolidate taxonomy.
      - Storage: document database (Solr); concepts & relations database (Sesame).
      - Output: focused SKOS taxonomy.
  • 9. Step 1. Document input & conversion (Input documents → 1. Convert to text → Document database)
      - Current input: directory path read recursively.
      - Other possible inputs: docs in a database or a DMS; emails + attachments (Exchange); website URL; RSS feed.
      - An external tool converts different file formats to text.
      - A database stores the document content.
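The "directory path read recursively" input in Step 1 can be sketched as follows. This is a minimal illustration, not the tool the slides describe: the function name `read_text_documents` and the `.txt`-only filter are assumptions, and the format conversion that the slide delegates to an external tool is left out.

```python
from pathlib import Path


def read_text_documents(root, suffixes=(".txt",)):
    """Yield (relative_path, text) for every matching file under root.

    Hypothetical helper: only files that are already plain text are read;
    converting other formats to text is out of scope for this sketch.
    """
    root = Path(root)
    # rglob("*") walks the directory tree recursively; sorting keeps the order stable.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            yield path.relative_to(root).as_posix(), text
```

Each yielded pair could then be handed to whatever store holds the document content, keyed by the relative path.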
  • 10. Step 2. Extracting concepts (Documents database → 2. Extract concepts → Concepts database)
      - Example document: http://localhost/solr/select?q=path:mycollectiondocument456.txt
      - Pingar API:
        - Taxonomy Terms: Climate and Weather; Leaders; Agreements
        - People: Yvo de Boer; Maite Nkoana-Mashabane
        - Organizations: Associated Press; South African Council of Churches
        - Locations: South Africa
      - Wikify (Wikipedia Terms): South Africa; Yvo de Boer; U.N.; Climate agreements; Associated Press
      - Specific terminology: green policies; climate diplomacy
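The Step 2 slide looks a document up in Solr through a `select` request on a `path` field. A hedged sketch of building such a URL with the Python standard library is shown here. The helper name `solr_select_url` is hypothetical, and the path value `mycollection/document456.txt` follows the form shown on the Step 3 slide, which is an assumption about what the Step 2 URL originally contained.

```python
from urllib.parse import urlencode


def solr_select_url(base, field, value):
    """Build a Solr select URL that looks up one field value.

    Hypothetical helper. The value is wrapped in double quotes so that it is
    sent as a single phrase; values that themselves contain a double quote
    are not handled in this sketch.
    """
    query = f'{field}:"{value}"'
    # urlencode percent-escapes the query so it is safe inside a URL.
    return f"{base}/select?" + urlencode({"q": query})


url = solr_select_url("http://localhost/solr", "path", "mycollection/document456.txt")
```

The resulting `url` can be fetched with any HTTP client; the response format depends on how the Solr instance is configured.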
  • 11. Step 3. Annotation with meaning (Concepts database → 3. Annotate with Linked Data → Annotations database)
      - Document: mycollection/document456.txt
      - Pingar API:
        - People: Yvo de Boer; Maite Nkoana-Mashabane
        - Organizations: Associated Press; South African Council of Churches
        - Locations: South Africa
      - Later, this additional info will help create e-Discovery & semantic search solutions.
  • 12. Step 4. Discarding irrelevant meanings (Concepts database → 4. Disambiguate clashing concepts → Final concepts database)
      - "Over the past three years, Apple has acquired three mapping companies"
        - Candidates: wikipedia.org/wiki/Apple_Corps; freebase.com/view/en/apple_inc
        - Two concepts were extracted that are dissimilar: discard the incorrect one.
      - "For millions of years, the oceans have been filled with sounds from natural sources."
        - Candidates: wikipedia.org/wiki/Ocean; www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
        - Two concepts were extracted that are similar: accept both as correct.
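The Step 4 rule (keep both candidates when they are similar, otherwise discard the incorrect one) can be illustrated with a toy sketch. The slide does not say how similarity is measured, so everything below is an assumption: the Jaccard overlap measure, the `threshold` value, the function names, and the hand-written keyword sets, which are made up purely for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of words (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve_clash(context_words, cand_a, cand_b, threshold=0.3):
    """Return the URIs to keep for two clashing candidates.

    Each candidate is a (uri, keywords) pair. Assumed rule: if the two
    candidates resemble each other, accept both; otherwise keep the one
    whose keywords overlap more with the sentence context.
    """
    (uri_a, kw_a), (uri_b, kw_b) = cand_a, cand_b
    if jaccard(kw_a, kw_b) >= threshold:
        return [uri_a, uri_b]
    score_a = jaccard(context_words, kw_a)
    score_b = jaccard(context_words, kw_b)
    return [uri_a] if score_a >= score_b else [uri_b]


# Toy keyword sets, invented for this example only.
apple_context = {"apple", "acquired", "mapping", "companies"}
apple_corps = ("wikipedia.org/wiki/Apple_Corps", {"beatles", "music", "record", "label"})
apple_inc = ("freebase.com/view/en/apple_inc", {"technology", "companies", "iphone", "mapping"})

ocean_context = {"oceans", "sounds", "natural", "sources"}
ocean = ("wikipedia.org/wiki/Ocean", {"ocean", "sea", "water", "marine"})
marine_areas = ("www.fao.org/aos/agrovoc#c_4607", {"marine", "areas", "sea", "water"})

apple_kept = resolve_clash(apple_context, apple_corps, apple_inc)
ocean_kept = resolve_clash(ocean_context, ocean, marine_areas)
```

With these toy inputs, the Apple clash keeps only the Freebase candidate and the ocean clash keeps both, mirroring the two cases on the slide.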
  • 13. Step 5a. Group taxonomy (5a. Add relations; Concepts & Relations database → Focused SKOS taxonomy)
      - Building the taxonomy bottom up.
      - Example terms: felines; tiger; bird; horse family; zebra; donkey; pigeon; horse; lizard; animals.
      - Wikipedia categories: Category:Carnivorous animals; Category:Animals.
      - Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals
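Building a hierarchy bottom up from "Broader" chains, as in Step 5a, can be sketched like this. The lizard chain comes from the slide; the `felines/Animals` chain for tiger, the function names, and the dictionary-of-sets representation are assumptions made for the example, not the authors' data model.

```python
from collections import defaultdict


def add_broader_chain(children, term, broader_path):
    """Link a term to each ancestor in a 'Narrower/.../Broadest' path."""
    child = term
    for parent in broader_path.split("/"):
        children[parent].add(child)
        child = parent


def render(children, node, depth=0):
    """Return the subtree under node as indented lines, children sorted by name."""
    lines = ["  " * depth + node]
    for child in sorted(children.get(node, ())):
        lines.extend(render(children, child, depth + 1))
    return lines


children = defaultdict(set)
# Chain taken from the slide.
add_broader_chain(children, "lizard", "Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals")
# Hypothetical chain, assumed for illustration.
add_broader_chain(children, "tiger", "felines/Animals")

tree_lines = render(children, "Animals")
```

Because every chain ends at a shared broader term, separate leaf terms end up grouped under the same root without any top-down design.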
  • 14. Step 5b. Consolidating taxonomy (5b. Prune relations; Concepts & Relations database → Focused SKOS taxonomy)
      - Films: Films and film making; Film stars; Mila Kunis; Daniel Radcliffe; Sally Hawkins; Julianna Margulies.
      - Football: Association football clubs; Former Football League clubs; Manchester United F.C.; Manchester United F.C.; Manchester City F.C.
      - Finance: Finance; Economics and finance; Personal finance; Commercial finance; Tax; Capital gains tax; Tax; Capital gains tax.
  • 15. Taxonomy Generation Process
      - Input: documents stored somewhere (file share, SharePoint, Exchange, etc.).
      - Analysis: using a variety of tools* and datasets, extract concepts, entities, relations.
      - Output: a taxonomy is created that groups the resulting taxonomy terms hierarchically (custom taxonomy).
      - * Pingar API for People, Organizations, Locations & Taxonomy Terms from related taxonomies; Wikification for related Wikipedia articles and category relations; Linked Data analysis for creating links to Freebase & DBpedia.
  • 16. What Does It Look Like?
  • 17. Case Study: A News Group
      - Fairfax NZ: this taxonomy was created from 2000 news articles by Fairfax New Zealand around Christmas 2011.
      - Taxonomy statistics: Concept Count: 10158; Edges Count: 12668; Intermediate Count: 1383; Leaves Count: 8748; Labels Count: 11545.
      - Nesting counts: 0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202, 6: 745, 7: 354, 8: 179, 9: 41, 10: 10.
      - Average Depth: 2.65
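The slide does not say how Average Depth is derived from the nesting counts. One plausible reading, assumed here, is a count-weighted mean of the nesting level. Reading the counts on this slide as 0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202, 6: 745, 7: 354, 8: 179, 9: 41, 10: 10, that mean lands within 0.01 of the reported 2.65. The function name `average_depth` is hypothetical.

```python
def average_depth(nesting_counts):
    """Count-weighted mean nesting level (an assumed definition of Average Depth)."""
    total = sum(nesting_counts.values())
    weighted = sum(depth * count for depth, count in nesting_counts.items())
    return weighted / total


# Nesting counts as read from the Christmas 2011 slide.
christmas_2011 = {
    0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202,
    6: 745, 7: 354, 8: 179, 9: 41, 10: 10,
}

depth_2011 = average_depth(christmas_2011)
```

The small gap to the reported figure may come from rounding or from how level 0 is treated; the slides do not settle this.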
  • 18. Case Study: A News Group
  • 19. Case Study: A News Group
  • 20. Case Study: A News Group
  • 21. Case Study: A News Group
  • 22. Case Study: A News Group
  • 23. Case Study: A News Group
  • 24. Labels & Relations
  • 25. Case Study: A News Group
  • 26. Case Study: A News Group. Fairfax - 4 Days from Sep 2001
      - Excerpt of the taxonomy generated from Fairfax articles taken from:
        - Sep 9th & 10th (1242 articles) and
        - Sep 13th & 14th (1667 articles) NZT!
      - Colors of terms: proposed to group other terms; found in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs; search match.
      - Taxonomy statistics: Concept Count: 12699; Edges Count: 13755; Intermediate Count: 709; Leaves Count: 11985; Labels Count: 12741.
  • 27. Case Study: A News Group. Legend: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs.
  • 28. Case Study: A News Group. Legend: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs.
  • 29. Case Study: A News Group. Fairfax NZ - 4 Days from Sep 2001
      - Excerpt of the taxonomy generated from Fairfax articles taken from:
        - Sep 9th & 10th (1242 articles) and
        - Sep 13th & 14th (1667 articles) NZT!
      - Colors of terms: proposed to group other terms; found in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs; search match.
      - Taxonomy statistics: Concept Count: 12699; Edges Count: 13755; Intermediate Count: 709; Leaves Count: 11985; Labels Count: 12741; Average Depth: 1.85.
        - Nesting counts: 0: 5, 1: 4082, 2: 8980, 3: 755, 4: 333, 5: 132, 6: 31, 7: 6, 8: 1.
      - Taxonomy statistics including NZPSV: Concept Count: 13970; Edges Count: 15020; Intermediate Count: 1277; Leaves Count: 12677; Labels Count: 15407; Average Depth: 3.
        - Nesting counts: 0: 16, 1: 10153, 2: 1888, 3: 1400, 4: 1203, 5: 1053, 6: 756, 7: 427, 8: 252, 9: 267, 10: 341, 11: 315, 12: 330, 13: 149, 14: 134, 15: 87, 16: 10.
  • 30. Case Study: A News Group. September 2001 and Christmas 2011. Legend: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs.
  • 31. Evaluation
      - Sources of error in concept identification:

        Type                        Number   Errors   Rate
        People                        1145       37    3.2%
        Organizations                  496       51   10.3%
        Locations                      988      114   11.5%
        Wikipedia named entities       832       71    8.5%
        Wikipedia other entities        99       16   16.4%
        Taxonomy                       868      229   26.4%
        DBPedia                        868       81    8.1%
        Freebase                       135       12    8.9%
        Overall                       3447      393   11.4%

      - Recall: 75% (comparing with a manually generated taxonomy for the same domain).
      - Precision: 89% for concepts; 90% for relations (evaluation based on 15 human judges).
  • 32. Other Use Cases
      - Questions: How to refine search by metadata? What's in these files / emails? What to include in our corporate taxonomy? How to find all docs on a given topic?
      - Uses: Content Audit; Information Architecture; Better search with facets; Better browsing.
  • 33. Other Use Cases: Discovery. Legend: proposed to group other concepts; in two or more document collections; in the bipolar document collection; in the breast cancer document collection; in the doc. collection that is neither cancer nor bipolar.
  • 34. Summary: Entity Extraction; Linked Data; Disambiguation; Consolidation; News Group Case Study; Other Use Cases.
  • 35. More? Focused SKOS Taxonomy Extraction Process (F-STEP) wiki: bit.ly/f-step; pingar.com; @PingarHQ; anna.divoli@pingar.com; @annadivoli
