Constructing a Focused Taxonomy from a Document Collection - ESWC 2013

Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten
Constructing a Focused Taxonomy from a Document Collection
ESWC 2013, Montpellier, France

Published in: Education, Technology

  1. Constructing a Focused Taxonomy from a Document Collection. Olena Medelyan, Steve Manion, Jeen Broekstra, and Anna Divoli; Anna Lan Huang and Ian Witten
  2. Why?
     • Why automatic generation? Dynamic; fast; cheap; consistent; RDF / flexible; …
     • Why from a document collection? Focused/specific; optimal for those documents; …
  3. Talk Overview: The Team; The Process; Evaluation; News Group Case Study; Other Use Cases; Summary (@annadivoli)
  4. Taxonomy Generation Research Team: Jeen Broekstra, Steve Manion, Anna Lan Huang, Ian Witten, Anna Divoli, Alyona Medelyan
  5. How Taxonomy Generation Works?
  6. Taxonomy Generation Overview
     • Input: documents stored somewhere
     • Analysis: using a variety of tools* and datasets, extract concepts, entities, relations
     • Grouping & Output: an SKOS taxonomy is created that groups the resulting taxonomy terms hierarchically
     • Result: custom taxonomy
  7. Taxonomy Generation - Detailed
  8. Taxonomy Generation in 5 Steps!
     • 1. Import & convert to text
     • 2. Extract concepts
     • 3. Annotate with Linked Data
     • 4. Disambiguate clashing concepts
     • 5. Consolidate taxonomy
     • Diagram elements: Input Docs; Document Database (Solr); Concepts & Relations Database (Sesame); Preferred top-level terms; Focused SKOS Taxonomy
  9. Step 1. Document input & conversion
     • Flow: Input Documents, then 1. Convert to text, then Document Database
     • Current input: directory path, read recursively
     • Other possible inputs: docs in a database or a DMS; emails + attachments (Exchange); website URL; RSS feed
     • External tool to convert different file formats to text
     • Database to store document content
  10. Step 2. Extracting concepts
     • Flow: Documents Database, then 2. Extract concepts, then Concepts Database
     • Query: http://localhost/solr/select?q=path:mycollectiondocument456.txt
     • Pingar API:
       Taxonomy Terms: Climate and Weather; Leaders; Agreements
       People: Yvo de Boer; Maite Nkoana-Mashabane
       Organizations: Associated Press; South African Council of Churches
       Locations: South Africa
     • Wikify:
       Wikipedia Terms: South Africa; Yvo de Boer; U.N.; Climate agreements; Associated Press
       Specific terminology: green policies; climate diplomacy
  11. Step 3. Annotation with meaning
     • Flow: Concepts Database, then 3. Annotate with Linked Data, then Annotations Database
     • Document: mycollection/document456.txt
     • Pingar API:
       People: Yvo de Boer; Maite Nkoana-Mashabane
       Organizations: Associated Press; South African Council of Churches
       Locations: South Africa
     • Later this additional info will help create e-Discovery & semantic search solutions
  12. Step 4. Discarding irrelevant meanings
     • Flow: Concepts Database, then 4. Disambiguate clashing concepts, then Final Concepts Database
     • Text: "Over the past three years, Apple has acquired three mapping companies"
       Extracted: wikipedia.org/wiki/Apple_Corps; freebase.com/view/en/apple_inc
       Two concepts were extracted that are dissimilar: discard the incorrect one
     • Text: "For millions of years, the oceans have been filled with sounds from natural sources."
       Extracted: wikipedia.org/wiki/Ocean; www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
       Two concepts were extracted that are similar: accept both as correct
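The keep-or-discard decision on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how similarity is measured, so the Jaccard overlap of "related terms", the `threshold` value, and the term sets themselves are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    uri: str
    related_terms: frozenset  # hypothetical descriptive terms for the concept


def jaccard(a, b):
    """Overlap of two term sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(first, second, context_terms, threshold=0.3):
    """Keep both candidates if they are similar, else keep the better context fit."""
    if jaccard(first.related_terms, second.related_terms) >= threshold:
        return [first, second]  # similar meanings: accept both
    # Dissimilar meanings: keep the one sharing more terms with the document.
    best = max((first, second), key=lambda c: len(c.related_terms & context_terms))
    return [best]


# Term sets below are invented for illustration only.
ocean = Candidate("wikipedia.org/wiki/Ocean", frozenset({"sea", "water", "marine"}))
marine = Candidate("www.fao.org/aos/agrovoc#c_4607", frozenset({"sea", "marine", "coast"}))
corps = Candidate("wikipedia.org/wiki/Apple_Corps", frozenset({"beatles", "music", "records"}))
inc = Candidate("freebase.com/view/en/apple_inc", frozenset({"iphone", "technology", "company"}))

kept_ocean = disambiguate(ocean, marine, {"oceans", "sounds"})
kept_apple = disambiguate(corps, inc, {"mapping", "company", "technology", "acquired"})
```

With these invented term sets, both ocean concepts are kept and only the Apple Inc. concept survives, mirroring the two cases on the slide.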
  13. Step 5a. Group taxonomy
     • Flow: 5a. Add relations, then Concepts & Relations Database, then Focused SKOS Taxonomy
     • Example concepts: felines; tiger; bird; horse family; zebra; donkey; pigeon; horse; lizard
     • Categories: Category:Carnivorous animals; Category:Animals; animals
     • Building the taxonomy bottom up
     • Broader: Squamata / Reptiles / Tetrapods / Vertebrates / Chordates / Animals
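Building the taxonomy bottom up from "Broader" chains like the one on this slide might look like the sketch below. The lizard chain comes from the slide; the tiger chain and the function names are assumptions added for illustration.

```python
from collections import defaultdict


def chain_to_edges(concept, broader_chain):
    """Turn 'A/B/C' into (narrower, broader) pairs: concept->A, A->B, B->C."""
    path = [concept] + broader_chain.split("/")
    return list(zip(path, path[1:]))


def build_taxonomy(chains):
    """Merge all chains into one mapping of broader concept -> narrower concepts."""
    children = defaultdict(set)
    for concept, chain in chains.items():
        for narrower, broader in chain_to_edges(concept, chain):
            children[broader].add(narrower)
    return children


chains = {
    # Chain taken from the slide.
    "lizard": "Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals",
    # Illustrative chain, not from the slide.
    "tiger": "felines/Carnivorous animals/Animals",
}
taxonomy = build_taxonomy(chains)
```

Because both chains end in "Animals", the two leaf concepts end up under a shared top-level term without any top-down design.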
  14. Step 5b. Consolidating taxonomy
     • Flow: 5b. Prune relations, then Concepts & Relations Database, then Focused SKOS Taxonomy
     • Example terms: Films and film making; Film stars; Mila Kunis; Daniel Radcliffe; Sally Hawkins; Julianna Margulies
     • Example terms: Association football clubs; Former Football League clubs; Manchester United F.C.; Manchester United F.C.; Manchester City F.C.
     • Example terms: Finance; Economics and finance; Personal finance; Commercial finance; Tax; Capital gains tax; Tax; Capital gains tax
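The slide does not spell out the pruning rules. One plausible rule, shown here purely as an assumption and not as the F-STEP method, is to drop a direct broader link when the same ancestor is already reachable through another parent. The hierarchy among the three finance terms below is also assumed, since the slide layout is not recoverable from the transcript.

```python
def prune_redundant(broader):
    """Drop direct broader links that are implied by a longer path.

    broader maps each concept to the set of its direct broader concepts.
    """

    def ancestors(node, seen=None):
        # Collect every concept reachable by following broader links.
        seen = set() if seen is None else seen
        for parent in broader.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    pruned = {}
    for concept, parents in broader.items():
        keep = set()
        for parent in parents:
            others = parents - {parent}
            # Keep the link only if no other parent already leads to it.
            if not any(parent in ancestors(other) for other in others):
                keep.add(parent)
        pruned[concept] = keep
    return pruned


# Assumed hierarchy for illustration only.
broader = {
    "Capital gains tax": {"Tax", "Finance"},
    "Tax": {"Finance"},
    "Finance": set(),
}
pruned = prune_redundant(broader)
```

Here the direct link from "Capital gains tax" to "Finance" is removed because "Tax" already leads there.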
  15. The RDF data model
     • Vocabulary of Ngrams, Concepts and Entities shared across various tools.
     • All intermediate processing data is captured and stored using RDF triples.
     • The data can be queried using the SPARQL query language.
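To make the triple model concrete, here is a minimal pure-Python stand-in for a triple store with pattern matching. It is a sketch only: the talk's system uses Sesame and SPARQL, and the `ex:` identifiers and the `match` helper are hypothetical. `skos:broader` and `skos:prefLabel` are standard SKOS properties.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical data in (subject, predicate, object) form.
triples = [
    ("ex:lizard", "skos:prefLabel", "lizard"),
    ("ex:lizard", "skos:broader", "ex:Squamata"),
    ("ex:Squamata", "skos:broader", "ex:Reptiles"),
]

# Roughly what SPARQL would express as:
#   SELECT ?s WHERE { ?s skos:broader ex:Squamata }
narrower = [s for s, _, _ in match(triples, p="skos:broader", o="ex:Squamata")]
```

Storing every intermediate result in this one shape is what lets different tools share the same vocabulary and query it uniformly.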
  16. Taxonomy Generation Process
     • Input: documents stored somewhere (file-share, SharePoint, Exchange, etc.)
     • Analysis: using a variety of tools* and datasets, extract concepts, entities, relations
     • Output: an SKOS taxonomy is created that groups the resulting taxonomy terms hierarchically
     • Result: custom taxonomy
     * Pingar API for People, Organization, Locations & Taxonomy Terms from related taxonomies; Wikification for related Wikipedia articles and category relations; Linked Data analysis for creating links to Freebase & DBpedia
  17. What Does It Look Like?
  18. Case Study & Evaluation: A News Group
     • Fairfax NZ: this taxonomy was created from 2000 news articles by Fairfax New Zealand around Christmas 2011 (4.3MB of uncompressed text, averaging ~300 words each) + UK Integrated Public Service Sector vocabulary (http://doc.esd.org.uk/IPSV/2.00.html)
     • Taxonomy Statistics:
       Concept Count:       10158
       Edges Count:         12668
       Intermediate Count:   1383
       Leaves Count:         8748
       Labels Count:        11545
     • Nesting Counts: 0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202, 6: 745, 7: 354, 8: 179, 9: 41, 10: 10
     • Average Depth: 2.65
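The statistics on this slide can be cross-checked with a few lines of arithmetic. Two assumptions are made here: that "Average Depth" is the count-weighted mean of the nesting levels, and that the 27 concepts at nesting level 0 are top-level terms counted separately from intermediate and leaf concepts. The fused figures in the transcript ("3: 28914: 2057" and "7: 3548: 179") are read as levels 3, 4, 7 and 8.

```python
# Nesting level -> number of concept placements, as read from the slide.
nesting_counts = {
    0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202,
    6: 745, 7: 354, 8: 179, 9: 41, 10: 10,
}

total_placements = sum(nesting_counts.values())
average_depth = sum(d * n for d, n in nesting_counts.items()) / total_placements

# Assumed decomposition: top-level + intermediate + leaves.
concepts_from_parts = nesting_counts[0] + 1383 + 8748
```

Under these assumptions the weighted mean comes out at about 2.64, close to the reported 2.65, and the three parts sum to the reported concept count of 10158. The total number of placements (16511) exceeds the concept count, which would be consistent with some concepts appearing at more than one depth.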
  19. Case Study: A News Group
  20. Evaluation
     • Coverage: 75%
       Comparing with a manually generated taxonomy by Fairfax librarians for the same domain (458 concepts, was never completed).
       Some not really missing: “Drunk” vs. “Drinking alcohol” and “Alcohol use and abuse”
       Truly missing: “Immigration”, “Laptop” and “Hospitality”
     • Precision (evaluation based on 15 human judges): 90% for relations
       100 concept pairs, yes/no decision whether the relation makes sense.
       Total of 750 relations examined, each by two different judges.
       Examples: “North Yorkshire → Leeds”, “Israel → History of Israel”
       Humans: “Infectious Disease → Polio”, “Scandinavia → Sweden” !
     • 89% for concepts…
  21. Evaluation: Sources of error in concept identification
       Type                       Number   Errors   Rate
       People                       1145       37    3.2%
       Organizations                 496       51   10.3%
       Locations                     988      114   11.5%
       Wikipedia named entities      832       71    8.5%
       Wikipedia other entities       99       16   16.4%
       Taxonomy                      868      229   26.4%
       DBpedia                       868       81    8.1%
       Freebase                      135       12    8.9%
     • … Precision (evaluation based on 15 human judges): 89% for concepts
       Given extracted concepts and original text.
       300 documents equally distributed plus 5 to all judges.
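Assuming the rate column is simply errors divided by number, the table can be recomputed as below. The tolerance of 0.05 percentage points (half a unit in the last printed digit) is a choice made for this sketch.

```python
# (type, number, errors, reported rate in percent), as read from the slide.
rows = [
    ("People", 1145, 37, 3.2),
    ("Organizations", 496, 51, 10.3),
    ("Locations", 988, 114, 11.5),
    ("Wikipedia named entities", 832, 71, 8.5),
    ("Wikipedia other entities", 99, 16, 16.4),
    ("Taxonomy", 868, 229, 26.4),
    ("DBpedia", 868, 81, 8.1),
    ("Freebase", 135, 12, 8.9),
]

recomputed = {name: 100 * errors / number for name, number, errors, _ in rows}

# Rows whose reported rate differs from errors/number by more than rounding.
mismatches = [
    name for name, _, _, reported in rows
    if abs(recomputed[name] - reported) > 0.05
]
```

Six rows reproduce to one decimal place. The "Wikipedia other entities" and "DBpedia" rows do not (about 16.2% and 9.3% from the counts as transcribed), so one of the figures in each of those rows may have been garbled in the slide or its transcript.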
  22. Case Study: A News Group
  23. Case Study: A News Group
  24. Case Study: A News Group
  25. Alternative Labels
  26. Alternative Labels
  27. Labels & Relations
  28. Case Study: A News Group
  29. Case Study: A News Group
     • Fairfax - 4 Days from Sep 2001
     • Excerpt of the taxonomy generated from Fairfax articles taken from:
       - Sep 9th & 10th (1242 articles) and
       - Sep 13th & 14th (1667 articles) NZT!
     • Colors of terms:
       - proposed to group other terms
       - found in both document collections
       - in 9-10 Sep 2001 docs
       - in 13-14 Sep 2001 docs
       - search match
     • Taxonomy Statistics:
       Concept Count:       12699
       Edges Count:         13755
       Intermediate Count:    709
       Leaves Count:        11985
       Labels Count:        12741
  30. Case Study: A News Group (legend: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs)
  31. Case Study: A News Group (legend: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs)
  32. Case Study: A News Group, September 2001 and Christmas 2011 (legend: proposed to group other terms; in both document collections; in 9-10 Sep 2001 docs; in 13-14 Sep 2001 docs)
  33. Other Use Cases
     • Questions: How to refine search by metadata? What’s in these files / emails? What to include into our corporate taxonomy? How to find all docs on a given topic?
     • Applications: Content Audit; Information Architecture; Better search with facets; Better browsing
  34. Other Use Cases: Discovery (legend: proposed to group other concepts; in two or more document collections; in the bipolar document collection; in the breast cancer document collection; in the document collection that is neither cancer nor bipolar)
  35. Summary: Entity Extraction; Linked Data; Disambiguation; Consolidation; Evaluation; News Group Case Study; Other Use Cases
  36. More? Focused SKOS Taxonomy Extraction Process (F-STEP) wiki: bit.ly/f-step; pingar.com; @PingarHQ; anna.divoli@pingar.com; @annadivoli
