Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

As presented in Text Analytics World in San Francisco (Apr 2013)


  1. Automatic Taxonomy Generation for a News Group: A Case Study. Anna Divoli, Pingar Research, @annadivoli. San Francisco, Apr 2013.
  2. Why?
     Why Automatic Generation? Dynamic, Fast, Cheap, Consistent, RDF / Flexible, …
     Why from a Document Collection? Focused/specific, Optimal for those documents, …
  3. Talk Overview: The Team; The Process; News Group Case Study; Evaluation; Other Use Cases; Summary.
  4. Taxonomy Generation Research Team
     Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten. Constructing a Focused Taxonomy from a Document Collection. To appear in Proceedings of the Extended Semantic Web Conference 2013, ESWC, Montpellier, France.
  5. How Taxonomy Generation Works
  6. Taxonomy Generation Overview
     Input: Documents stored somewhere.
     Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations.
     Grouping & Output: A taxonomy is created that groups the resulting taxonomy terms hierarchically (Custom Taxonomy).
  7. Taxonomy Generation - Detailed
  8. Taxonomy Generation in 5 Steps!
     Input: Docs; Preferred top-level terms.
     Steps: 1. Import & convert to text; 2. Extract concepts; 3. Annotate with Linked Data; 4. Disambiguate clashing concepts; 5. Consolidate taxonomy.
     Storage: Document Database (Solr); Concepts & Relations Database (Sesame).
     Output: Focused SKOS Taxonomy.
  9. Step 1. Document input & conversion
     Input Documents → 1. Convert to text → Document Database
     Current input:
       • Directory path read recursively
     Other possible inputs:
       • Docs in a database or a DMS
       • Emails + attachments (Exchange)
       • Website URL
       • RSS feed
     External tool to convert different file formats to text.
     Database to store document content.
  10. Step 2. Extracting concepts
     Documents Database → 2. Extract concepts → Concepts Database
     http://localhost/solr/select?q=path:mycollection
     document456.txt
     Pingar API:
       Taxonomy Terms: Climate and Weather; Leaders; Agreements
       People: Yvo de Boer; Maite Nkoana-Mashabane
       Organizations: Associated Press; South African Council of Churches
       Locations: South Africa
     Wikify:
       Wikipedia Terms: South Africa; Yvo de Boer; U.N.; Climate agreements; Associated Press
     Specific terminology: green policies; climate diplomacy
  11. Step 3. Annotation with meaning
     Concepts Database → 3. Annotate with Linked Data → Annotations Database
     mycollection/document456.txt
     Pingar API:
       People: Yvo de Boer; Maite Nkoana-Mashabane
       Organizations: Associated Press; South African Council of Churches
       Locations: South Africa
     Later this additional info will help create e-Discovery & semantic search solutions.
  12. Step 4. Discarding irrelevant meanings
     Concepts Database → 4. Disambiguate clashing concepts → Final Concepts Database
     "Over the past three years, Apple has acquired three mapping companies"
       wikipedia.org/wiki/Apple_Corps
       freebase.com/view/en/apple_inc
       Two concepts were extracted that are dissimilar. Discard the incorrect one.
     "For millions of years, the oceans have been filled with sounds from natural sources."
       wikipedia.org/wiki/Ocean
       www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
       Two concepts were extracted that are similar. Accept both as correct.
  13. Step 5a. Group taxonomy
     Concepts & Relations Database → 5a. Add relations → Focused SKOS Taxonomy
     Building the taxonomy bottom up.
     Concepts: felines, tiger, bird, horse family, zebra, donkey, pigeon, horse, lizard, animals
     Categories: Category:Carnivorous animals, Category:Animals
     Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals
  14. Step 5b. Consolidating taxonomy
     Concepts & Relations Database → 5b. Prune relations → Focused SKOS Taxonomy
     Films and film making; Film stars; Mila Kunis; Daniel Radcliffe; Sally Hawkins; Julianna Margulies
     Association football clubs; Former Football League clubs; Manchester United F.C.; Manchester United F.C.; Manchester City F.C.
     Finance; Economics and finance; Personal finance; Commercial finance; Tax; Capital gains tax; Tax; Capital gains tax
  15. Taxonomy Generation Process
     Input: Documents stored somewhere.
     File-share, SharePoint, Exchange, etc.
     Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations.
     Output: A taxonomy is created that groups the resulting taxonomy terms hierarchically (Custom Taxonomy).
     * Pingar API for People, Organization, Locations & Taxonomy Terms from related taxonomies; Wikification for related Wikipedia articles and category relations; Linked Data analysis for creating links to Freebase & DBpedia.
  16. What Does It Look Like?
  17. Case Study: A News Group. Fairfax NZ
     This taxonomy was created from 2000 news articles by Fairfax New Zealand around Christmas 2011.
     Taxonomy Statistics
       Concept Count: 10158
       Edges Count: 12668
       Intermediate Count: 1383
       Leaves Count: 8748
       Labels Count: 11545
       Nesting Counts: 0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202, 6: 745, 7: 354, 8: 179, 9: 41, 10: 10
       Average Depth: 2.65
  18. Case Study: A News Group
  19. Case Study: A News Group
  20. Case Study: A News Group
  21. Case Study: A News Group
  22. Case Study: A News Group
  23. Case Study: A News Group
  24. Labels & Relations
  25. Case Study: A News Group
  26. Case Study: A News Group. Fairfax - 4 Days from Sep 2001
     Excerpt of the taxonomy generated from Fairfax articles taken from:
       - Sep 9th & 10th (1242 articles) and
       - Sep 13th & 14th (1667 articles) NZT!
     Colors of terms:
       - proposed to group other terms
       - found in both document collections
       - in 9-10 Sep 2001 docs
       - in 13-14 Sep 2001 docs
       - search match
     Taxonomy Statistics:
       Concept Count: 12699
       Edges Count: 13755
       Intermediate Count: 709
       Leaves Count: 11985
       Labels Count: 12741
  27. Case Study: A News Group. Colors of terms: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs.
  28. Case Study: A News Group. Colors of terms: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs.
  29. Case Study: A News Group. Fairfax NZ - 4 Days from Sep 2001
     Excerpt of the taxonomy generated from Fairfax articles taken from:
       - Sep 9th & 10th (1242 articles) and
       - Sep 13th & 14th (1667 articles) NZT!
     Colors of terms:
       - proposed to group other terms
       - found in both document collections
       - in 9-10 Sep 2001 docs
       - in 13-14 Sep 2001 docs
       - search match
     Taxonomy Statistics:
       Concept Count: 12699
       Edges Count: 13755
       Intermediate Count: 709
       Leaves Count: 11985
       Labels Count: 12741
       Average Depth: 1.85
       (0: 5 - 1: 4082 - 2: 8980 - 3: 755 - 4: 333 - 5: 132 - 6: 31 - 7: 6 - 8: 1)
     Including NZPSV, Taxonomy Statistics:
       Concept Count: 13970
       Edges Count: 15020
       Intermediate Count: 1277
       Leaves Count: 12677
       Labels Count: 15407
       Average Depth: 3
       (0: 16 - 1: 10153 - 2: 1888 - 3: 1400 - 4: 1203 - 5: 1053 - 6: 756 - 7: 427 - 8: 252 - 9: 267 - 10: 341 - 11: 315 - 12: 330 - 13: 149 - 14: 134 - 15: 87 - 16: 10)
  30. Case Study: A News Group. September 2001 and Christmas 2011. Colors of terms: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs.
  31. Evaluation
     Sources of error in concept identification

     Type                       Number  Errors  Rate
     People                       1145      37   3.2%
     Organizations                 496      51  10.3%
     Locations                     988     114  11.5%
     Wikipedia named entities      832      71   8.5%
     Wikipedia other entities       99      16  16.4%
     Taxonomy                      868     229  26.4%
     DBPedia                       868      81   8.1%
     Freebase                      135      12   8.9%
     Overall                      3447     393  11.4%

     Recall: 75% (comparing with manually generated taxonomy for the same domain)
     Precision: 89% for concepts, 90% for relations (15 human judges based evaluation)
  32. Other Use Cases
     How to refine search by metadata? What’s in these files / emails? What to include into our corporate taxonomy? How to find all docs on a given topic?
     Content Audit; Information Architecture; Better search with facets; Better browsing
  33. Other Use Cases: Discovery. Colors of terms: proposed to group other concepts; in two or more document collections; in the bipolar document collection; in the breast cancer document collection; in the doc. collection that is neither cancer nor bipolar.
  34. Summary: Entity Extraction; Linked Data; Disambiguation; Consolidation; News Group Case Study; Other Use Cases.
  35. More? bit.ly/f-step (Focused SKOS Taxonomy Extraction Process (F-STEP) wiki); pingar.com; @PingarHQ; anna.divoli@pingar.com; @annadivoli
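
Steps 5a and 5b (slides 13 and 14) describe building the taxonomy bottom up from broader-concept chains and then pruning relations. The slides do not spell out the pruning rules, so the following is only a minimal Python sketch under stated assumptions: each concept has a single broader chain, the example chains other than the lizard one are invented for illustration, and the pruning heuristic (drop intermediate nodes that hold a single narrower concept and were not found in the documents) is hypothetical, not the authors' method. The function names are likewise illustrative.

```python
from collections import defaultdict

def build_parent_map(paths):
    """Step 5a (sketch): link each concept to the next broader one in its chain."""
    parent = {}
    for path in paths:  # each path runs from the narrowest concept up to the root
        for narrower, broader in zip(path, path[1:]):
            parent[narrower] = broader
    return parent

def prune(parent, document_concepts):
    """Step 5b (hypothetical heuristic): keep document concepts, branching
    nodes and roots; skip every other intermediate node."""
    children = defaultdict(set)
    for child, par in parent.items():
        children[par].add(child)
    keep = set(document_concepts)
    for node, kids in children.items():
        if len(kids) >= 2 or node not in parent:  # branching node or root
            keep.add(node)
    pruned = {}
    for node in keep:
        ancestor = parent.get(node)
        # Walk up until a kept node is reached (roots are always kept).
        while ancestor is not None and ancestor not in keep:
            ancestor = parent[ancestor]
        if ancestor is not None:
            pruned[node] = ancestor
    return pruned

# Illustrative chains; only the lizard chain is taken from the slide.
PATHS = [
    ["lizard", "Squamata", "Reptiles", "Tetrapods", "Vertebrates", "Chordates", "Animals"],
    ["tiger", "felines", "Animals"],
    ["pigeon", "bird", "Animals"],
    ["zebra", "horse family", "Animals"],
    ["donkey", "horse family", "Animals"],
    ["horse", "horse family", "Animals"],
]
DOCUMENT_CONCEPTS = ["lizard", "tiger", "pigeon", "zebra", "donkey", "horse"]

parent = build_parent_map(PATHS)
pruned = prune(parent, DOCUMENT_CONCEPTS)
for concept in sorted(pruned):
    print(f"{concept} -> broader: {pruned[concept]}")
```

With these inputs, "horse family" survives because it groups three document concepts, while the long reptile chain collapses so that "lizard" sits directly under "Animals".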

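The "Average Depth" figures shown with the taxonomy statistics (slides 17 and 29) look consistent with a count-weighted mean of the nesting levels. That definition is an assumption, not something the slides state, and the level-to-count pairs below are a reading of the run-together slide text. A short Python sketch of the calculation:

```python
def average_depth(nesting_counts):
    """Count-weighted mean nesting level (assumed meaning of 'Average Depth')."""
    total = sum(nesting_counts.values())
    weighted = sum(level * count for level, count in nesting_counts.items())
    return weighted / total

# Nesting counts as read from the slides (level -> count).
SEP_2001 = {0: 5, 1: 4082, 2: 8980, 3: 755, 4: 333, 5: 132, 6: 31, 7: 6, 8: 1}
CHRISTMAS_2011 = {0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202,
                  6: 745, 7: 354, 8: 179, 9: 41, 10: 10}

for name, counts in [("Sep 2001", SEP_2001), ("Christmas 2011", CHRISTMAS_2011)]:
    print(f"{name}: average depth {average_depth(counts):.2f}")
```

Under this reading the two collections come out at roughly 1.85 and 2.64, close to the 1.85 and 2.65 reported on the slides.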