Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to Create a Custom Taxonomy
Pingar presentation at ShareFEST in Philadelphia (Apr 2013).

Published in: Health & Medicine, Business

  1. Extracting and Mapping SharePoint Content to Create a Custom Taxonomy
     Anna Divoli, Pingar Research, @annadivoli
  2. Why?
     - Why automatic generation? Dynamic; fast; cheap; consistent; RDF / flexible; …
     - Why from a document collection? Focused/specific; optimal for those documents; …
     - Why taxonomies? Organize knowledge; domain representation; enable automatic tasks; …
     - Why in SharePoint? All you need is there! Can be used straight away!
  3. Talk Overview
     - The Team
     - The Process
     - Evaluation
     - Use Cases: withdrawn drug; cancer treatments; re-purposed drug
     - Summary
  4. Taxonomy Generation Research Team
     Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten.
     "Constructing a Focused Taxonomy from a Document Collection", ESWC 2013, Montpellier, France.
  5. Taxonomy Generation Process
     - Input: documents stored somewhere
     - Analysis: using a variety of tools* and datasets, extract concepts, entities, relations
     - Grouping & Output: a taxonomy is created that groups the resulting taxonomy terms hierarchically (custom taxonomy)
  6. How Taxonomy Generation works
  7. In 5 Steps!
     Diagram labels: Input Docs; Document Database (Solr); Concepts & Relations Database (Sesame); Preferred top-level terms; Focused SKOS Taxonomy
     1. Import & convert to text
     2. Extract concepts
     3. Annotate with Linked Data
     4. Disambiguate clashing concepts
     5. Consolidate taxonomy
  8. Step 1. Document input & conversion
     Input Documents -> 1. Convert to text -> Document Database
     - Current input: directory path, read recursively
     - Other possible inputs: docs in a database or a DMS; emails + attachments (Exchange); website URL; RSS feed
     - An external tool converts different file formats to text
     - A database stores the document content
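The "directory path, read recursively" input in Step 1 could be sketched roughly as follows. This is a minimal illustration and not the tool used in the talk: the function name `collect_text_documents` and the `suffixes` parameter are hypothetical, and it assumes the files have already been converted to plain text (the slide leaves the external conversion tool unnamed).

```python
from pathlib import Path


def collect_text_documents(root, suffixes=(".txt",)):
    """Walk `root` recursively and return {relative_path: text}.

    Hypothetical helper: only files whose extension is in `suffixes`
    are read; other formats would first need an external converter.
    """
    root = Path(root)
    documents = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            relative = path.relative_to(root).as_posix()
            documents[relative] = path.read_text(encoding="utf-8", errors="replace")
    return documents
```

The returned mapping keeps the path relative to the collection root, so each document can later be looked up by path when it is stored in the document database.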
  9. Step 2. Extracting concepts
     Documents Database -> 2. Extract concepts -> Concepts Database
     Query: http://localhost/solr/select?q=path:mycollectiondocument456.txt
     Pingar API:
     - Taxonomy terms: Climate and Weather; Leaders; Agreements
     - People: Yvo de Boer; Maite Nkoana-Mashabane
     - Organizations: Associated Press; South African Council of Churches
     - Locations: South Africa
     Wikify:
     - Wikipedia terms: South Africa; Yvo de Boer; U.N.; Climate agreements; Associated Press
     Specific terminology: green policies; climate diplomacy
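Step 2 fetches a stored document from Solr by its path. A minimal sketch of building such a select URL is shown here. The helper name `build_solr_select_url` is hypothetical, the `path` field name comes from the slide, and the document path `mycollection/document456.txt` is an assumption based on how the same document is written on the Step 3 slide.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, field, value):
    """Build a Solr select URL that matches `value` exactly in `field`.

    The value is wrapped in double quotes so that characters such as
    "/" and ":" are not interpreted by the query parser.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    query = f'{field}:"{escaped}"'
    return f"{base_url}/select?{urlencode({'q': query, 'wt': 'json'})}"


# Assumed path; the slide's own URL is garbled at this point.
url = build_solr_select_url(
    "http://localhost/solr", "path", "mycollection/document456.txt"
)
```

URL-encoding the whole query with `urlencode` avoids sending raw quotes and slashes in the query string.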
  10. Step 3. Annotation with meaning
      Concepts Database -> 3. Annotate with Linked Data -> Annotations Database
      Document: mycollection/document456.txt
      Pingar API:
      - People: Yvo de Boer; Maite Nkoana-Mashabane
      - Organizations: Associated Press; South African Council of Churches
      - Locations: South Africa
      Later, this additional info will help create e-Discovery & semantic search solutions.
  11. Step 4. Discarding irrelevant meanings
      Concepts Database -> 4. Disambiguate clashing concepts -> Final Concepts Database
      - "Over the past three years, Apple has acquired three mapping companies"
        Candidates: wikipedia.org/wiki/Apple_Corps; freebase.com/view/en/apple_inc
        Two concepts were extracted that are dissimilar: discard the incorrect one.
      - "For millions of years, the oceans have been filled with sounds from natural sources."
        Candidates: wikipedia.org/wiki/Ocean; www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
        Two concepts were extracted that are similar: accept both as correct.
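The Step 4 rule (keep both candidates when they are similar, otherwise discard the incorrect one) can be illustrated with a toy sketch. The slides do not say how similarity is measured, so the Jaccard word overlap, the 0.5 threshold, the function names, and the descriptive word sets below are all assumptions made for illustration only.

```python
def jaccard(first, second):
    """Jaccard overlap between two collections of words (0.0 if both are empty)."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def disambiguate(candidates, context_words, threshold=0.5):
    """Return the URIs to keep out of exactly two clashing candidates.

    `candidates` maps a candidate URI to a set of descriptive words
    (hypothetical data standing in for whatever the real system compares).
    """
    (uri_a, words_a), (uri_b, words_b) = candidates.items()
    if jaccard(words_a, words_b) >= threshold:
        return [uri_a, uri_b]  # similar meanings: accept both
    # Dissimilar meanings: keep the candidate closest to the sentence context.
    best = max(candidates, key=lambda uri: jaccard(candidates[uri], context_words))
    return [best]


# Toy word sets invented for the "Apple" sentence on the slide.
apple_candidates = {
    "wikipedia.org/wiki/Apple_Corps": {"beatles", "music", "record", "label"},
    "freebase.com/view/en/apple_inc": {"technology", "company", "acquired", "companies", "iphone"},
}
apple_context = {"apple", "acquired", "three", "mapping", "companies"}
kept = disambiguate(apple_candidates, apple_context)
```

With these invented word sets the two "Apple" candidates share no words, so only the one that overlaps with the sentence context is kept.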
  12. Step 5. Group taxonomy (a)
      5a. Add relations (Concepts & Relations Database -> Focused SKOS Taxonomy)
      Building the taxonomy bottom up.
      Example concepts: felines; tiger; bird; horse family; zebra; donkey; pigeon; horse; lizard; animals
      Categories: Category:Carnivorous animals; Category:Animals
      Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals
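Building the taxonomy bottom up, as in Step 5a, amounts to recording a chain of broader terms for each concept and letting chains merge where they share an ancestor. The sketch below assumes a simple set of (narrower, broader) pairs as the data structure. The function names are hypothetical, the lizard chain is the one on the slide, and the tiger chain is an assumption pieced together from the slide's labels.

```python
def add_broader_chain(edges, concept, broader_chain):
    """Add (narrower, broader) pairs for a concept and its chain of
    broader terms, ordered from most specific to most general."""
    child = concept
    for parent in broader_chain:
        edges.add((child, parent))
        child = parent


def top_level_terms(edges):
    """Terms that appear as a broader term but never as a narrower one."""
    narrower = {child for child, _ in edges}
    broader = {parent for _, parent in edges}
    return broader - narrower


edges = set()
add_broader_chain(
    edges,
    "lizard",
    ["Squamata", "Reptiles", "Tetrapods", "Vertebrates", "Chordates", "Animals"],
)
# Assumed chain for illustration; the slide does not spell it out.
add_broader_chain(edges, "tiger", ["felines", "Carnivorous animals", "Animals"])
roots = top_level_terms(edges)
```

Because both chains end in "Animals", the two concepts end up under one shared top-level term without any top-down design.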
  13. Step 5. Consolidating taxonomy (b)
      5b. Prune relations (Concepts & Relations Database -> Focused SKOS Taxonomy)
      Example terms:
      - Films and film making; Film stars; Mila Kunis; Daniel Radcliffe; Sally Hawkins; Julianna Margulies
      - Association football clubs; Former Football League clubs; Manchester United F.C.; Manchester United F.C.; Manchester City F.C.
      - Finance; Economics and finance; Personal finance; Commercial finance; Tax; Capital gains tax; Tax; Capital gains tax
  14. Evaluation
      - Recall: 75% (compared with a manually generated taxonomy for the same domain)
      - Precision: 89% for concepts, 90% for relations (evaluation based on 15 human judges)
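The two evaluation measures can be sketched as follows, assuming recall is the share of a manually built taxonomy's concepts that the generated taxonomy also contains, and precision is the share of generated items that judges marked correct. The function names and the toy inputs are hypothetical and are not the data behind the reported figures.

```python
def recall(generated, reference):
    """Fraction of reference concepts that also appear in the generated taxonomy."""
    generated, reference = set(generated), set(reference)
    return len(generated & reference) / len(reference) if reference else 0.0


def precision_from_judgements(judgements):
    """Fraction of generated items judged correct (True) by human judges."""
    judgements = list(judgements)
    return sum(1 for ok in judgements if ok) / len(judgements) if judgements else 0.0


# Toy inputs for illustration only.
toy_recall = recall({"a", "b", "c", "x"}, {"a", "b", "c", "d"})
toy_precision = precision_from_judgements([True] * 9 + [False])
```

Keeping the two measures separate mirrors the slide: recall needs a reference taxonomy, while precision only needs judgements on what was generated.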
  15. SharePoint Taxonomy Generation Process
      Analysis: using a variety of tools* and datasets, extract concepts, entities, relations -> custom taxonomy
  16. Triazolam
      [A benzodiazepine drug used for short-term treatment of acute insomnia. Withdrawn in 1991 in the UK because of risk of psychiatric adverse drug reactions. It continues to be available in the U.S.]
      Excerpt of the taxonomy generated from:
      - 131 PubMed abstracts of clinical trials on triazolam before 1991
      - 180 PubMed abstracts of clinical trials on triazolam since 1991
      Colors of terms:
      - proposed to group other terms
      - found in both document collections
      - in before withdrawal docs
      - in since withdrawal docs
      Taxonomy Statistics:
      - Concept Count: 305
      - Edges Count: 437
      - Intermediate Count: 97
      - Leaves Count: 183
      - Labels Count: 353
      - Nesting Counts: 0: 25; 1: 51; 2: 124; 3: 160; 4: 176; 5: 153; 6: 54; 7: 4
      - Average Depth: 3.6
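Statistics like those on the slide can be derived from a list of (narrower, broader) pairs. The sketch below is one possible reading: it treats concepts with no broader term as level 0, counts a concept once for every path that reaches it, and assumes the taxonomy has no cycles. The slide does not define its nesting counts or average depth, so this may not match how those figures were computed, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def taxonomy_statistics(edges):
    """Compute simple statistics from (narrower, broader) pairs.

    Assumes an acyclic taxonomy; a concept reachable along several
    paths is counted once per path in the nesting counts.
    """
    children = defaultdict(set)
    concepts, has_broader = set(), set()
    for narrower, broader in edges:
        children[broader].add(narrower)
        has_broader.add(narrower)
        concepts.update((narrower, broader))

    roots = concepts - has_broader
    leaves = {concept for concept in concepts if concept not in children}
    intermediate = concepts - roots - leaves

    nesting = Counter()
    stack = [(root, 0) for root in roots]
    while stack:
        node, depth = stack.pop()
        nesting[depth] += 1
        for child in children.get(node, ()):
            stack.append((child, depth + 1))

    total = sum(nesting.values())
    average = sum(depth * count for depth, count in nesting.items()) / total if total else 0.0
    return {
        "concepts": len(concepts),
        "edges": len(set(edges)),
        "top_level": len(roots),
        "intermediate": len(intermediate),
        "leaves": len(leaves),
        "nesting_counts": dict(sorted(nesting.items())),
        "average_depth": average,
    }
```

On the slide's figures, level 0, intermediate, and leaf counts add up to the concept count (25 + 97 + 183 = 305), which is consistent with the root/intermediate/leaf split used here.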
  17-19. [Figures: excerpts of the triazolam taxonomy. Color legend: proposed to group other terms; in both document collections; in before withdrawal docs; in since withdrawal docs]
  20. Cancer Treatments
      Excerpt of the taxonomy generated from:
      - 200 PubMed abstracts on breast cancer treatments
      - 149 (all) PubMed abstracts on lung cancer treatments
      - 47 (all) PubMed abstracts on gastric cancer treatments
      Colors of terms:
      - proposed to group other terms
      - found in two or more document collections
      - in the breast treatment docs
      - in the stomach treatment docs
      - in the lung treatment docs
      Taxonomy Statistics:
      - Concept Count: 308
      - Edges Count: 387
      - Intermediate Count: 90
      - Leaves Count: 195
      - Labels Count: 371
      - Nesting Counts: 0: 23; 1: 52; 2: 99; 3: 138; 4: 137; 5: 159; 6: 60; 7: 36; 8: 6
      - Average Depth: 3.88
  21-25. [Figures: excerpts of the cancer treatments taxonomy. Color legend: proposed to group other terms; in two or more document collections; in the breast treatment docs; in the stomach treatment docs; in the lung treatment docs]
  26. Tamoxifen
      Tamoxifen is a drug commonly used to treat breast cancer, but with a subsequent indication for treating bipolar disorder.
      Excerpt of the taxonomy generated from:
      - papers discussing tamoxifen and bipolar disorder: 8 PubMed abstracts and 2 PDFs of full papers (17641532, 18316672)
      - papers discussing tamoxifen and breast cancer: 50 PubMed abstracts and 2 PDFs of full papers (21635709, 12618491)
      - papers discussing tamoxifen with no mention of either breast cancer or bipolar disorder: 50 PubMed abstracts and 2 PDFs of full papers (16275887, 19458291)
      Colors of terms:
      - proposed to group other concepts
      - in two or more document collections
      - in the bipolar document collection
      - in the breast cancer document collection
      - in the document collection mentioning neither cancer nor bipolar disorder
      Taxonomy Statistics:
      - Concept Count: 587
      - Edges Count: 751
      - Intermediate Count: 188
      - Leaves Count: 365
      - Labels Count: 718
      - Nesting Counts: 0: 34; 1: 73; 2: 133; 3: 284; 4: 225; 5: 157; 6: 89; 7: 30; 8: 2
      - Average Depth: 3.66
  27-32. [Figures: excerpts of the tamoxifen taxonomy. Color legend: proposed to group other concepts; in two or more document collections; in the bipolar document collection; in the breast cancer document collection; in the document collection mentioning neither cancer nor bipolar disorder]
  33. Summary
      - Entity Extraction
      - Linked Data
      - Disambiguation
      - Consolidation
      - Case Studies
  34. More?
      - bit.ly/f-step: Focused SKOS Taxonomy Extraction Process (F-STEP) wiki
      - pingar.com, @PingarHQ
      - anna.divoli@pingar.com, @annadivoli